aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2006-07-14[PATCH] del_timer_sync(): add cpu_relax()Andrew Morton1-0/+1
Relax the CPU in the del_timer_sync() busywait loop. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] remove kernel/kthread.c:kthread_stop_sem()Adrian Bunk1-22/+2
Remove the now-unneeded kthread_stop_sem(). Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] null-terminate over-long /proc/kallsyms symbolsAndreas Gruenbacher2-9/+6
Got a customer bug report (https://bugzilla.novell.com/190296) about kernel symbols longer than 127 characters which end up in a string buffer that is not NULL terminated, leading to garbage in /proc/kallsyms. Using strlcpy prevents this from happening, even though such symbols still won't come out right. A better fix would be to not use a fixed-size buffer, but it's probably not worth the trouble. (Modversion'ed symbols even have a length limit of 60.) [bunk@stusta.de: build fix] Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-12[PATCH] The scheduled unexport of insert_resourceAdrian Bunk1-2/+0
Implement the scheduled unexport of insert_resource. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-07-12[PATCH] remove kernel/power/pm.c:pm_unregister_all()Adrian Bunk1-37/+0
Remove the deprecated and no longer used pm_unregister_all(). Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-07-12[PATCH] Fix prctl privilege escalation and suid_dumpable (CVE-2006-2451)Marcel Holtmann1-1/+1
Based on a patch from Ernie Petrides During security research, Red Hat discovered a behavioral flaw in core dump handling. A local user could create a program that would cause a core file to be dumped into a directory they would not normally have permissions to write to. This could lead to a denial of service (disk consumption), or allow the local user to gain root privileges. The prctl() system call should never allow to set "dumpable" to the value 2. Especially not for non-privileged users. This can be split into three cases: 1) running as root -- then core dumps will already be done as root, and so prctl(PR_SET_DUMPABLE, 2) is not useful 2) running as non-root w/setuid-to-root -- this is the debatable case 3) running as non-root w/setuid-to-non-root -- then you definitely do NOT want "dumpable" to get set to 2 because you have the privilege escalation vulnerability With case #2, the only potential usefulness is for a program that has designed to run with higher privilege (than the user invoking it) that wants to be able to create root-owned root-validated core dumps. This might be useful as a debugging aid, but would only be safe if the program had done a chdir() to a safe directory. There is no benefit to a production setuid-to-root utility, because it shouldn't be dumping core in the first place. If this is true, then the same debugging aid could also be accomplished with the "suid_dumpable" sysctl. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] lockdep: disable lock debugging when kernel state becomes untrustedArjan van de Ven1-0/+2
Disable lockdep debugging in two situations where the integrity of the kernel no longer is guaranteed: when oopsing and when hitting a tainting-condition. The goal is to not get weird lockdep traces that don't make sense or are otherwise undebuggable, to not waste time. Lockdep assumes that the previous state it knows about is valid to operate, which is why lockdep turns itself off after the first violation it reports, after that point it can no longer make that assumption. A kernel oops means that the integrity of the kernel compromised; in addition anything lockdep would report is of lesser importance than the oops. All the tainting conditions are of similar integrity-violating nature and also make debugging/diagnosing more difficult. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] remove the tasklist_lock exportChristoph Hellwig1-3/+1
As announced half a year ago this patch will remove the tasklist_lock export. The previous two patches got rid of the remaining modular users. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] uninline init_waitqueue_head()Ingo Molnar1-2/+6
allyesconfig vmlinux size delta: text data bss dec filename 20736884 6073834 3075176 29885894 vmlinux.before 20721009 6073966 3075176 29870151 vmlinux.after ~18 bytes per callsite, 15K of text size (~0.1%) saved. (as an added bonus this also removes a lockdep annotation.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] swsusp: fix panic when signature can't be readLinus Torvalds1-2/+4
Do not panic a machine when swsusp signature can't be read. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] swsusp warning fixAndrew Morton1-10/+10
kernel/power/swap.c: In function 'swsusp_write': kernel/power/swap.c:275: warning: 'start' may be used uninitialized in this function gcc isn't smart enough, so help it. Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] swsusp: do not use memcpy for snapshotting memoryRafael J. Wysocki1-2/+8
swsusp should not use memcpy for snapshotting memory, because on some architectures memcpy may increase preempt_count (i386 does this when CONFIG_X86_USE_3DNOW is set). Then, as a result, wrong value of preempt_count is stored in the image. Replace memcpy in copy_data_pages with an open-coded loop. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] adjust clock for lost ticksRoman Zippel1-38/+47
A large number of lost ticks can cause an overadjustment of the clock. To compensate for this we look at the current error and the larger the error already is the more careful we are at adjusting the error. As small extra fix reset the error when the clock is set. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Acked-by: john stultz <johnstul@us.ibm.com> Cc: Uwe Bugla <uwe.bugla@gmx.de> Cc: James Bottomley <James.Bottomley@SteelEye.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] pi-futex: Validate futex type instead of oopsingThomas Gleixner1-0/+6
Calling futex_lock_pi is called with a reference to a non PI futex and waiters exist already, lookup_pi_state() oopses due to pi_state == NULL. Check this condition and return -EINVAL to userspace. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] kernel/softirq.c: EXPORT_UNUSED_SYMBOLAdrian Bunk1-1/+1
This patch marks an unused export as EXPORT_UNUSED_SYMBOL. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] kernel/printk.c: EXPORT_SYMBOL_UNUSEDAdrian Bunk1-2/+2
This patch marks unused exports as EXPORT_SYMBOL_UNUSED. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] lockdep: core, reduce per-lock class-cache sizeIngo Molnar1-33/+54
lockdep_map is embedded into every lock, which blows up data structure sizes all around the kernel. Reduce the class-cache to be for the default class only - that is used in 99.9% of the cases and even if we dont have a class cached, the lookup in the class-hash is lockless. This change reduces the per-lock dep_map overhead by 56 bytes on 64-bit platforms and by 28 bytes on 32-bit platforms. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] lockdep: improve debug outputArjan van de Ven1-2/+3
Make lockdep print which lock is held, in the "kfree() of a live lock" scenario. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] Minor cleanup to lockdep.cAndi Kleen1-32/+12
- Use printk formatting for indentation - Don't leave NTFS in the default event filter Signed-off-by: Andi Kleen <ak@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] small kernel/sched.c cleanupAndreas Mohr1-10/+7
- constify and optimize stat_nam (thanks to Michael Tokarev!) - spelling and comment fixes Signed-off-by: Andreas Mohr <andi@lisas.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] sched: fix bug in __migrate_task()Peter Williams1-1/+1
Problem: In the function __migrate_task(), deactivate_task() followed by activate_task() is used to move the task from one run queue to another. This has two undesirable effects: 1. The task's priority is recalculated. (Nowhere else in the scheduler code is the priority recalculated for a change of CPU.) 2. The task's time stamp is set to the current time. At the very least, this makes the adjustment of the time stamp before the call to deactivate_task() redundant but I believe the problem is more serious as the time stamp now holds the time of the queue change instead of the time at which the task was woken. In addition, unless dest_rq is the same queue as "current" is on the time stamp could be inaccurate due to inter CPU drift. Solution: Replace the call to activate_task() with one to __activate_task(). Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-04Merge master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreqLinus Torvalds1-25/+32
* master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq: Move workqueue exports to where the functions are defined. [CPUFREQ] Misc cleanups in ondemand. [CPUFREQ] Make ondemand sampling per CPU and remove the mutex usage in sampling path. [CPUFREQ] Add queue_delayed_work_on() interface for workqueues. [CPUFREQ] Remove slowdown from ondemand sampling path.
2006-07-03[PATCH] revert "kthread: convert stop_machine into a kthread"Andrew Morton1-11/+6
Jiri reports that the stop_machin kthread conversion caused his machine to hang when suspending. Hyperthreading is apparently involved. I don't see why that would be and I can't reproduce it. Revert to the 2.6.17 code. Cc: "Serge E. Hallyn" <serue@us.ibm.com> Cc: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpcLinus Torvalds1-1/+4
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: powerpc: add defconfig for Freescale MPC8349E-mITX board powerpc: Add base support for the Freescale MPC8349E-mITX eval board Documentation: correct values in MPC8548E SEC example node [POWERPC] Actually copy over i8259.c to arch/ppc/syslib this time [POWERPC] Add new interrupt mapping core and change platforms to use it [POWERPC] Copy i8259 code back to arch/ppc [POWERPC] New device-tree interrupt parsing code [POWERPC] Use the genirq framework [PATCH] genirq: Allow fasteoi handler to retrigger disabled interrupts [POWERPC] Update the SWIM3 (powermac) floppy driver [POWERPC] Fix error handling in detecting legacy serial ports [POWERPC] Fix booting on Momentum "Apache" board (a Maple derivative) [POWERPC] Fix various offb and BootX-related issues [POWERPC] Add a default config for 32-bit CHRP machines [POWERPC] fix implicit declaration on cell. [POWERPC] change get_property to return void *
2006-07-03[PATCH] sched: cleanup, convert sched.c-internal typedefs to structIngo Molnar1-125/+125
convert: - runqueue_t to 'struct rq' - prio_array_t to 'struct prio_array' - migration_req_t to 'struct migration_req' I was the one who added these but they are both against the kernel coding style and also were used inconsistently at places. So just get rid of them at once, now that we are flushing the scheduler patch-queue anyway. Conversion was mostly scripted, the result was reviewed and all secondary whitespace and style impact (if any) was fixed up by hand. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] sched: cleanup, remove task_t, convert to struct task_structIngo Molnar12-138/+153
cleanup: remove task_t and convert all the uses to struct task_struct. I introduced it for the scheduler anno and it was a mistake. Conversion was mostly scripted, the result was reviewed and all secondary whitespace and style impact (if any) was fixed up by hand. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] sched: clean up fallout of recent changesIngo Molnar1-166/+194
Clean up some of the impact of recent (and not so recent) scheduler changes: - turning macros into nice inline functions - sanitizing and unifying variable definitions - whitespace, style consistency, 80-lines, comment correctness, spelling and curly braces police Due to the macro hell and variable placement simplifications there's even 26 bytes of .text saved: text data bss dec hex filename 25510 4153 192 29855 749f sched.o.before 25484 4153 192 29829 7485 sched.o.after [akpm@osdl.org: build fix] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: irqtrace subsystem, move account_system_vtime() calls into ↵Paul Mackerras1-0/+4
kernel/softirq.c At the moment, powerpc and s390 have their own versions of do_softirq which include local_bh_disable() and __local_bh_enable() calls. They end up calling __do_softirq (in kernel/softirq.c) which also does local_bh_disable/enable. Apparently the two levels of disable/enable trigger a warning from some validation code that Ingo is working on, and he would like to see the outer level removed. But to do that, we have to move the account_system_vtime calls that are currently in the arch do_softirq() implementations for powerpc and s390 into the generic __do_softirq() (this is a no-op for other archs because account_system_vtime is defined to be an empty inline function on all other archs). This patch does that. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate on-stack completionsIngo Molnar1-1/+1
lockdep needs to have the waitqueue lock initialized for on-stack waitqueues implicitly initialized by DECLARE_COMPLETION(). Annotate on-stack completions accordingly. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate enable_in_hardirq()Ingo Molnar1-1/+1
Make use of local_irq_enable_in_hardirq() API to annotate places that enable hardirqs in hardirq context. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate ->mmap_semIngo Molnar1-1/+4
Teach special (recursive) locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate hrtimer base locksIngo Molnar1-1/+3
Teach special (recursive) locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate scheduler runqueue locksIngo Molnar1-0/+2
Teach per-CPU runqueue locks and recursive locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate timer base locksIngo Molnar1-0/+9
Split the per-CPU timer base locks up into separate lock classes, because they are used recursively. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate waitqueuesIngo Molnar1-0/+4
Create one lock class for all waitqueue locks in the kernel. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate genirqIngo Molnar1-0/+16
Teach special (recursive) locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate futexIngo Molnar1-10/+18
Teach special (recursive) locking code to the lock validator. Introduces double_lock_hb() to unify double- hash-bucket-lock taking. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: do not recurse in printkIngo Molnar1-5/+18
Make printk()-ing from within the lock validation code safer by using the lockdep-recursion counter. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: prove mutex locking correctnessIngo Molnar3-8/+28
Use the lock validator framework to prove mutex locking correctness. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: prove spinlock rwlock locking correctnessIngo Molnar3-9/+81
Use the lock validator framework to prove spinlock and rwlock locking correctness. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: prove rwsem locking correctnessIngo Molnar2-1/+43
Use the lock validator framework to prove rwsem locking correctness. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: procfsIngo Molnar2-0/+348
Lock validator /proc/lockdep and /proc/lockdep_stats support. (FIXME: should go into debugfs) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: allow read_lock() recursion of same classIngo Molnar1-3/+2
From: Ingo Molnar <mingo@elte.hu> lockdep so far only allowed read-recursion for the same lock instance. This is enough in the overwhelming majority of cases, but a hostap case triggered and reported by Miles Lane relies on same-class different-instance recursion. So we relax the restriction on read-lock recursion. (This change does not allow rwsem read-recursion, which is still forbidden.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: coreIngo Molnar6-0/+2796
Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: irqtrace subsystem, coreIngo Molnar3-20/+140
Accurate hard-IRQ-flags and softirq-flags state tracing. This allows us to attach extra functionality to IRQ flags on/off events (such as trace-on/off). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: stacktrace subsystem, coreIngo Molnar2-0/+25
Framework to generate and save stacktraces quickly, without printing anything to the console. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: locking init debugging improvementIngo Molnar2-3/+3
Locking init improvement: - introduce and use __SPIN_LOCK_UNLOCKED for array initializations, to pass in the name string of locks, used by debugging Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: mutex section binutils workaroundIngo Molnar1-1/+1
Work around weird section nesting build bug causing smp-alternatives failures under certain circumstances. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: better lock debuggingIngo Molnar11-482/+104
Generic lock debugging: - generalized lock debugging framework. For example, a bug in one lock subsystem turns off debugging in all lock subsystems. - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from the mutex/rtmutex debugging code: it caused way too much prototype hackery, and lockdep will give the same information anyway. - ability to do silent tests - check lock freeing in vfree too. - more finegrained debugging options, to allow distributions to turn off more expensive debugging features. There's no separate 'held mutexes' list anymore - but there's a 'held locks' stack within lockdep, which unifies deadlock detection across all lock classes. (this is independent of the lockdep validation stuff - lockdep first checks whether we are holding a lock already) Here are the current debugging options: CONFIG_DEBUG_MUTEXES=y CONFIG_DEBUG_LOCK_ALLOC=y which do: config DEBUG_MUTEXES bool "Mutex debugging, basic checks" config DEBUG_LOCK_ALLOC bool "Detect incorrect freeing of live mutexes" Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: remove mutex deadlock checking codeIngo Molnar1-316/+0
With the lock validator we detect mutex deadlocks (and more), the mutex deadlock checking code is both redundant and slower. So remove it. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: remove DEBUG_BUG_ON()Ingo Molnar1-8/+0
cleanup: remove unused DEBUG_BUG_ON() defines. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: rename DEBUG_WARN_ON()Ingo Molnar4-26/+26
Rename DEBUG_WARN_ON() to the less generic DEBUG_LOCKS_WARN_ON() name, so that it's clear that this is a lock-debugging internal mechanism. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: clean up rwsemsIngo Molnar1-0/+105
Clean up rwsems. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: add is_module_address()Ingo Molnar1-0/+23
Add is_module_address() method - to be used by lockdep. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/OChristoph Lameter1-0/+11
It turns out that it is advantageous to leave a small portion of unmapped file backed pages if all of a zone's pages (or almost all pages) are allocated and so the page allocator has to go off-node. This allows recently used file I/O buffers to stay on the node and reduces the times that zone reclaim is invoked if file I/O occurs when we run out of memory in a zone. The problem is that zone reclaim runs too frequently when the page cache is used for file I/O (read write and therefore unmapped pages!) alone and we have almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped pages. File I/O will use these pages for the next read/write requests and the unmapped pages increase. After the zone has filled up again zone reclaim will remove it again after only 32 pages. This cycle is too inefficient and there are potentially too many zone reclaim cycles. With the 1% boundary we may still remove all unmapped pages for file I/O in zone reclaim pass. However. it will take a large number of read and writes to get back to 1% again where we trigger zone reclaim again. The zone reclaim 2.6.16/17 does not show this behavior because we have a 30 second timeout. [akpm@osdl.org: rename the /proc file and the variable] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] genirq: Allow fasteoi handler to retrigger disabled interruptsBenjamin Herrenschmidt1-1/+4
Make the fasteoi handler mark disabled interrupts as pending if they happen anyway. This allow implementation of a delayed disable scheme with the fasteoi handler. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-07-02[PATCH] genirq:fixup missing SA_PERCPU replacementThomas Gleixner1-2/+2
The irqflags consolidation converted SA_PERCPU_IRQ to IRQF_PERCPU but did not define the new constant. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02[PATCH] genirq: ARM dyntick cleanupThomas Gleixner1-12/+1
Linus: "The hacks in kernel/irq/handle.c are really horrid. REALLY horrid." They are indeed. Move the dyntick quirks to ARM where they belong. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02Merge branch 'genirq' of master.kernel.org:/home/rmk/linux-2.6-armLinus Torvalds2-3/+41
* 'genirq' of master.kernel.org:/home/rmk/linux-2.6-arm: (24 commits) [ARM] 3683/2: ARM: Convert at91rm9200 to generic irq handling [ARM] 3682/2: ARM: Convert ixp4xx to generic irq handling [ARM] 3702/1: ARM: Convert ixp23xx to generic irq handling [ARM] 3701/1: ARM: Convert plat-omap to generic irq handling [ARM] 3700/1: ARM: Convert lh7a40x to generic irq handling [ARM] 3699/1: ARM: Convert s3c2410 to generic irq handling [ARM] 3698/1: ARM: Convert sa1100 to generic irq handling [ARM] 3697/1: ARM: Convert shark to generic irq handling [ARM] 3696/1: ARM: Convert clps711x to generic irq handling [ARM] 3694/1: ARM: Convert ecard driver to generic irq handling [ARM] 3693/1: ARM: Convert omap1 to generic irq handling [ARM] 3691/1: ARM: Convert imx to generic irq handling [ARM] 3688/1: ARM: Convert clps7500 to generic irq handling [ARM] 3687/1: ARM: Convert integrator to generic irq handling [ARM] 3685/1: ARM: Convert pxa to generic irq handling [ARM] 3684/1: ARM: Convert l7200 to generic irq handling [ARM] 3681/1: ARM: Convert ixp2000 to generic irq handling [ARM] 3680/1: ARM: Convert footbridge to generic irq handling [ARM] 3695/1: ARM drivers/pcmcia: Fixup includes [ARM] 3689/1: ARM drivers/input/touchscreen: Fixup includes ... Manual conflict resolved in kernel/irq/handle.c (butt-ugly ARM tickless code).
2006-07-02[PATCH] irq-flags: generic irq: Use the new IRQF_ constantsThomas Gleixner3-23/+23
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01[ARM] 3690/1: genirq: Introduce and make use of dummy irq chipThomas Gleixner2-3/+28
Patch from Thomas Gleixner From: Thomas Gleixner <tglx@linutronix.de> ARM has a couple of really dumb interrupt controllers. Implement a generic one and fixup the ARM migration. ARM reused the no_irq_chip for this purpose, but this does not work out for platforms which are not converted to the new interrupt type handling model. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-07-01[ARM] 3679/1: ARM: Make ARM dyntick implementation work with genirqThomas Gleixner1-0/+13
Patch from Thomas Gleixner From: Thomas Gleixner <tglx@linutronix.de> Make the ARM dyntick implementation work with the generic irq code. This hopefully goes away once we consolidated the dyntick implementations. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-07-01Merge branch 'audit.b22' of ↵Linus Torvalds3-66/+209
git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b22' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: [PATCH] audit syscall classes [PATCH] audit: support for object context filters [PATCH] audit: rename AUDIT_SE_* constants [PATCH] add rule filterkey
2006-07-01[PATCH] IRQ: warning message cleanupBjorn Helgaas1-6/+7
Make warnings more consistent. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01[PATCH] IRQ: Use SA_PERCPU_IRQ, not IRQ_PER_CPU, for irqaction.flagsBjorn Helgaas1-1/+2
IRQ_PER_CPU is a bit in the struct irq_desc "status" field, not in the struct irqaction "flags", so the previous code checked the wrong bit. SA_PERCPU_IRQ is only used by drivers/char/mmtimer.c for SGI ia64 boxes. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01[PATCH] pi-futex: futex_wake() lockup fixIngo Molnar1-2/+4
Fix futex_wake() exit condition bug when handling the robust-list with PI futexes on them. (reported by Ulrich Drepper, debugged by the lock validator.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01[PATCH] pi-futex: fix mm_struct memory leakVernon Mauery1-1/+1
lock_queue was getting called essentially twice in a row and was continually incrementing the mm_count ref count, thus causing a memory leak. Dinakar Guniguntala provided a proper fix for the problem that simply grabs the spinlock for the hash bucket queue rather than calling lock_queue. The second time we do a queue_lock in futex_lock_pi, we really only need to take the hash bucket lock. Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Signed-off-by: Vernon Mauery <vernux@us.ibm.com> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01[PATCH] audit syscall classesAl Viro1-0/+39
Allow to tie upper bits of syscall bitmap in audit rules to kernel-defined sets of syscalls. Infrastructure, a couple of classes (with 32bit counterparts for biarch targets) and actual tie-in on i386, amd64 and ia64. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01[PATCH] audit: support for object context filtersDarrel Goeddel2-0/+65
This patch introduces object audit filters based on the elements of the SELinux context. Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> kernel/auditfilter.c | 25 +++++++++++++++++++++++++ kernel/auditsc.c | 40 ++++++++++++++++++++++++++++++++++++++++ security/selinux/ss/services.c | 18 +++++++++++++++++- 3 files changed, 82 insertions(+), 1 deletion(-) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01[PATCH] audit: rename AUDIT_SE_* constantsDarrel Goeddel2-30/+30
This patch renames some audit constant definitions and adds additional definitions used by the following patch. The renaming avoids ambiguity with respect to the new definitions. Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com> include/linux/audit.h | 15 ++++++++---- kernel/auditfilter.c | 50 ++++++++++++++++++++--------------------- kernel/auditsc.c | 10 ++++---- security/selinux/ss/services.c | 32 +++++++++++++------------- 4 files changed, 56 insertions(+), 51 deletions(-) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01[PATCH] add rule filterkeyAmy Griffis3-36/+75
Add support for a rule key, which can be used to tie audit records to audit rules. This is useful when a watched file is accessed through a link or symlink, as well as for general audit log analysis. Because this patch uses a string key instead of an integer key, there is a bit of extra overhead to do the kstrdup() when a rule fires. However, we're also allocating memory for the audit record buffer, so it's probably not that significant. I went ahead with a string key because it seems more user-friendly. Note that the user must ensure that filterkeys are unique. The kernel only checks for duplicate rules. Signed-off-by: Amy Griffis <amy.griffis@hpd.com>
2006-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds21-33/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: Remove obsolete #include <linux/config.h> remove obsolete swsusp_encrypt arch/arm26/Kconfig typos Documentation/IPMI typos Kconfig: Typos in net/sched/Kconfig v9fs: do not include linux/version.h Documentation/DocBook/mtdnand.tmpl: typo fixes typo fixes: specfic -> specific typo fixes in Documentation/networking/pktgen.txt typo fixes: occuring -> occurring typo fixes: infomation -> information typo fixes: disadvantadge -> disadvantage typo fixes: aquire -> acquire typo fixes: mecanism -> mechanism typo fixes: bandwith -> bandwidth fix a typo in the RTC_CLASS help text smb is no longer maintained Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S
2006-06-30[PATCH] cond_resched() fixAndrew Morton1-12/+13
Fix a bug identified by Zou Nan hai <nanhai.zou@intel.com>: If the system is in state SYSTEM_BOOTING, and need_resched() is true, cond_resched() returns true even though it didn't reschedule. Consequently need_resched() remains true and JBD locks up. Fix that by teaching cond_resched() to only return true if it really did call schedule(). cond_resched_lock() and cond_resched_softirq() have a problem too. If we're in SYSTEM_BOOTING state and need_resched() is true, these functions will drop the lock and will then try to call schedule(), but the SYSTEM_BOOTING state will prevent schedule() from being called. So on return, need_resched() will still be true, but cond_resched_lock() has to return 1 to tell the caller that the lock was dropped. The caller will probably lock up. Bottom line: if these functions dropped the lock, they _must_ call schedule() to clear need_resched(). Make it so. Also, uninline __cond_resched(). It's largeish, and slowpath. Acked-by: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30[PATCH] SELinux: add security hook call to kill_proc_info_as_uidDavid Quigley1-2/+5
This patch adds a call to the extended security_task_kill hook introduced by the prior patch to the kill_proc_info_as_uid function so that these signals can be properly mediated by security modules. It also updates the existing hook call in check_kill_permission. Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30[PATCH] zoned vm counters: zone_reclaim: remove ↵Christoph Lameter1-9/+0
/proc/sys/vm/zone_reclaim_interval The zone_reclaim_interval was necessary because we were not able to determine how many unmapped pages exist in a zone. Therefore we had to scan in intervals to figure out if any pages were unmapped. With the zoned counters and NR_ANON_PAGES we now know the number of pagecache pages and the number of mapped pages in a zone. So we can simply skip the reclaim if there is an insufficient number of unmapped pages. We use SWAP_CLUSTER_MAX as the boundary. Drop all support for /proc/sys/vm/zone_reclaim_interval. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30Remove obsolete #include <linux/config.h>Jörn Engel20-20/+0
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30remove obsolete swsusp_encryptPavel Machek1-12/+0
Remove SWSUSP_ENCRYPT config option; it is no longer implemented. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30typo fixes: occuring -> occurringAdrian Bunk1-1/+1
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30Move workqueue exports to where the functions are defined.Dave Jones1-11/+10
Signed-off-by: Dave Jones <davej@redhat.com>
2006-06-30[CPUFREQ] Add queue_delayed_work_on() interface for workqueues.Venkatesh Pallipadi1-15/+23
Add queue_delayed_work_on() interface for workqueues. Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Dave Jones <davej@redhat.com>
2006-06-29[NETLINK]: Encapsulate eff_cap usage within security framework.Darrel Goeddel1-4/+4
This patch encapsulates the usage of eff_cap (in netlink_skb_params) within the security framework by extending security_netlink_recv to include a required capability parameter and converting all direct usage of eff_caps outside of the lsm modules to use the interface. It also updates the SELinux implementation of the security_netlink_send and security_netlink_recv hooks to take advantage of the sid in the netlink_skb_params struct. This also enables SELinux to perform auditing of netlink capability checks. Please apply, for 2.6.18 if possible. Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpcLinus Torvalds1-4/+2
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (43 commits) [POWERPC] Use little-endian bit from firmware ibm,pa-features property [POWERPC] Make sure smp_processor_id works very early in boot [POWERPC] U4 DART improvements [POWERPC] todc: add support for Time-Of-Day-Clock [POWERPC] Make lparcfg.c work when both iseries and pseries are selected [POWERPC] Fix idr locking in init_new_context [POWERPC] mpc7448hpc2 (taiga) board config file [POWERPC] Add tsi108 pci and platform device data register function [POWERPC] Add general support for mpc7448hpc2 (Taiga) platform [POWERPC] Correct the MAX_CONTEXT definition powerpc: minor cleanups for mpc86xx [POWERPC] Make sure we select CONFIG_NEW_LEDS if ADB_PMU_LED is set [POWERPC] Simplify the code defining the 64-bit CPU features [POWERPC] powerpc: kconfig warning fix [POWERPC] Consolidate some of kernel/misc*.S [POWERPC] Remove unused function call_with_mmu_off [POWERPC] update asm-powerpc/time.h [POWERPC] Clean up it_lp_queue.h [POWERPC] Skip the "copy down" of the kernel if it is already at zero. [POWERPC] Add the use of the firmware soft-reset-nmi to kdump. ...
2006-06-29Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6Linus Torvalds1-25/+27
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6: [PATCH] i386: export memory more than 4G through /proc/iomem [PATCH] 64bit Resource: finally enable 64bit resource sizes [PATCH] 64bit Resource: convert a few remaining drivers to use resource_size_t where needed [PATCH] 64bit resource: change pnp core to use resource_size_t [PATCH] 64bit resource: change pci core and arch code to use resource_size_t [PATCH] 64bit resource: change resource core to use resource_size_t [PATCH] 64bit resource: introduce resource_size_t for the start and end of struct resource [PATCH] 64bit resource: fix up printks for resources in misc drivers [PATCH] 64bit resource: fix up printks for resources in arch and core code [PATCH] 64bit resource: fix up printks for resources in pcmcia drivers [PATCH] 64bit resource: fix up printks for resources in video drivers [PATCH] 64bit resource: fix up printks for resources in ide drivers [PATCH] 64bit resource: fix up printks for resources in mtd drivers [PATCH] 64bit resource: fix up printks for resources in pci core and hotplug drivers [PATCH] 64bit resource: fix up printks for resources in networks drivers [PATCH] 64bit resource: fix up printks for resources in sound drivers [PATCH] 64bit resource: C99 changes for struct resource declarations Fixed up trivial conflict in drivers/ide/pci/cmd64x.c (the printk that was changed by the 64-bit resources had been deleted in the meantime ;)
2006-06-29[PATCH] genirq: add chip->eoi(), fastack -> fasteoiIngo Molnar1-14/+11
Clean up the fastack concept by turning it into fasteoi and introducing the ->eoi() method for chips. This also allows the cleanup of an i386 EOI quirk - now the quirk is cleanly separated from the pure ACK implementation. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Roland Dreier <rolandd@cisco.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: fasteoi handler: handle interrupt disablingBenjamin Herrenschmidt1-1/+4
Note when a disable interrupt happened with the fasteoi handler as well so that delayed disable can be implemented with fasteoi-type controllers. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: more verbose debugging on unexpected IRQ vectorsIngo Molnar2-0/+42
One frequent sign of IRQ handling bugs is the appearance of unexpected vectors. Print out all the IRQ state in that case. We dont want this patch upstream, but it is useful during initial testing. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: no_irq_type -> no_irq_chip renameIngo Molnar3-5/+5
Rename no_irq_type to no_irq_chip. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add SA_TRIGGER supportThomas Gleixner1-4/+27
Enable drivers to request an IRQ with a given irq-flow (trigger/polarity) setting. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add irq-wake (power-management) supportThomas Gleixner1-0/+21
Enable platforms to set the irq-wake (power-management) properties of an IRQ. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add handle_bad_irq()Ingo Molnar2-0/+9
Handle bad IRQ vectors via the irqchip mechanism. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add irq-chip supportThomas Gleixner3-2/+527
Enable platforms to use the irq-chip and irq-flow abstractions: allow setting of the chip, the type and provide highlevel handlers for common irq-flows. [rostedt@goodmis.org: misroute-irq: Don't call desc->chip->end because of edge interrupts] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: coreThomas Gleixner6-3/+41
Core genirq support: add the irq-chip and irq-flow abstractions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: update copyrightsIngo Molnar2-2/+7
Update/add copyrights in the generic IRQ code. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add IRQ_NOAUTOEN supportThomas Gleixner2-7/+12
Enable platforms to disable the automatic enabling of freshly set up irqs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add IRQ_NOREQUEST supportThomas Gleixner1-1/+3
Enable platforms to disable request_irq() for certain interrupts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add IRQ_NOPROBE supportThomas Gleixner2-2/+6
Introduce IRQ_NOPROBE: enables platforms to control chip-probing. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add genirq sw IRQ-retriggerThomas Gleixner3-10/+80
Enable platforms that do not have a hardware-assisted hardirq-resend mechanism to resend them via a softirq-driven IRQ emulation mechanism. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: no_irq_type cleanupsIngo Molnar1-20/+25
Clean up no_irq_type: share the NOP functions where possible, and properly name the ack_bad() function. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: doc: handle_IRQ_event() and __do_IRQ() commentsIngo Molnar1-4/+16
Document handle_IRQ_event() and __do_IRQ(). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add ->retrigger() irq op to consolidate hw_irq_resend()Ingo Molnar1-1/+2
Add ->retrigger() irq op to consolidate hw_irq_resend() implementations. (Most architectures had it defined to NOP anyway.) NOTE: ia64 needs testing. i386 and x86_64 tested. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: debug: better debug printout in enable_irq()Thomas Gleixner1-0/+1
Make enable_irq() debug printouts user-readable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: turn ARCH_HAS_IRQ_PER_CPU into CONFIG_IRQ_PER_CPUIngo Molnar1-2/+2
Cleanup: change ARCH_HAS_IRQ_PER_CPU into a Kconfig method. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: merge pending_irq_cpumask[] into irq_desc[]Ingo Molnar2-8/+4
Consolidation: remove the pending_irq_cpumask[NR_IRQS] array and move it into the irq_desc[NR_IRQS].pending_mask field. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: merge irq_dir[], smp_affinity_entry[] into irq_desc[]Ingo Molnar1-13/+7
Consolidation: remove the irq_dir[NR_IRQS] and the smp_affinity_entry[NR_IRQS] arrays and move them into the irq_desc[] array. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: reduce irq_desc_t use, mark it obsoleteIngo Molnar5-14/+15
Cleanup: remove irq_desc_t use from the generic IRQ code, and mark it obsolete. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: misc code cleanupsIngo Molnar4-38/+34
Assorted code cleanups to the generic IRQ code. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: remove fastcallIngo Molnar1-2/+2
Now that i386 defaults to regparm, explicit uses of fastcall are not needed anymore. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: remove irq_descp()Ingo Molnar1-1/+1
Cleanup: remove irq_descp() - explicit use of irq_desc[] is shorter and more readable. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: merge irq_affinity[] into irq_desc[]Ingo Molnar3-5/+6
Consolidation: remove the irq_affinity[NR_IRQS] array and move it into the irq_desc[NR_IRQS].affinity field. [akpm@osdl.org: sparc64 build fix] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: sem2mutex probe_sem -> probing_activeIngo Molnar1-4/+5
Convert the irq auto-probing semaphore to a mutex. (This allows us to find probing API usage bugs sooner, via the mutex debugging code.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: rename desc->handler to desc->chipIngo Molnar6-33/+33
This patch-queue improves the generic IRQ layer to be truly generic, by adding various abstractions and features to it, without impacting existing functionality. While the queue can be best described as "fix and improve everything in the generic IRQ layer that we could think of", and thus it consists of many smaller features and lots of cleanups, the one feature that stands out most is the new 'irq chip' abstraction. The irq-chip abstraction is about describing and coding and IRQ controller driver by mapping its raw hardware capabilities [and quirks, if needed] in a straightforward way, without having to think about "IRQ flow" (level/edge/etc.) type of details. This stands in contrast with the current 'irq-type' model of genirq architectures, which 'mixes' raw hardware capabilities with 'flow' details. The patchset supports both types of irq controller designs at once, and converts i386 and x86_64 to the new irq-chip design. As a bonus side-effect of the irq-chip approach, chained interrupt controllers (master/slave PIC constructs, etc.) are now supported by design as well. The end result of this patchset intends to be simpler architecture-level code and more consolidation between architectures. We reused many bits of code and many concepts from Russell King's ARM IRQ layer, the merging of which was one of the motivations for this patchset. This patch: rename desc->handler to desc->chip. Originally i did not want to do this, because it's a big patch. But having both "desc->handler", "desc->handle_irq" and "action->handler" caused a large degree of confusion and made the code appear alot less clean than it truly is. I have also attempted a dual approach as well by introducing a desc->chip alias - but that just wasnt robust enough and broke frequently. So lets get over with this quickly. The conversion was done automatically via scripts and converts all the code in the kernel. This renaming patch is the first one amongst the patches, so that the remaining patches can stay flexible and can be merged and split up without having some big monolithic patch act as a merge barrier. [akpm@osdl.org: build fix] [akpm@osdl.org: another build fix] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28[PATCH] load_module() cleanupAndrew Morton1-6/+21
Undo bizarre declaration in load_module(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28[PATCH] Add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPLArjan van de Ven1-2/+76
Temporarily add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPL. These will be used as a transition measure for symbols that aren't used in the kernel and are on the way out. When a module uses such a symbol, a warning is printk'd at modprobe time. The main reason for removing unused exports is size: eacho export takes roughly between 100 and 150 bytes of kernel space in the binary. This patch gives users the option to immediately get this size gain via a config option. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28[POWERPC] Add the use of the firmware soft-reset-nmi to kdump.David Wilder1-4/+2
With this patch, kdump uses the firmware soft-reset NMI for two purposes: 1) Initiate the kdump (take a crash dump) by issuing a soft-reset. 2) Break a CPU out of a deadlock condition that is detected during kdump processing. When a soft-reset is initiated each CPU will enter system_reset_exception() and set its corresponding bit in the global bit-array cpus_in_sr then call die(). When die() finds the CPU's bit set in cpu_in_sr crash_kexec() is called to initiate a crash dump. The first CPU to enter crash_kexec() is called the "crashing CPU". All other CPUs are "secondary CPUs". The secondary CPU's pass through to crash_kexec_secondary() and sleep. The crashing CPU waits for all CPUs to enter via soft-reset then boots the kdump kernel (see crash_soft_reset_check()) When the system crashes due to a panic or exception, crash_kexec() is called by panic() or die(). The crashing CPU sends an IPI to all other CPUs to notify them of the pending shutdown. If a CPU is in a deadlock or hung state with interrupts disabled, the IPI will not be delivered. The result being, that the kdump kernel is not booted. This problem is solved with the use of a firmware generated soft-reset. After the crashing_cpu has issued the IPI, it waits for 10 sec for all CPUs to enter crash_ipi_callback(). A CPU signifies its entry to crash_ipi_callback() by setting its corresponding bit in the cpus_in_crash bit array. After 10 sec, if one or more CPUs have not set their bit in cpus_in_crash we assume that the CPU(s) is deadlocked. The operator is then prompted to generate a soft-reset to break the deadlock. Each CPU enters the soft reset handler as described above. Two conditions must be handled at this point: 1) The system crashed because the operator generated a soft-reset. See 2) The system had crashed before the soft-reset was generated ( in the case of a Panic or oops). The first CPU to enter crash_kexec() uses the state of the kexec_lock to determine this state. If kexec_lock is already held then condition 2 is true and crash_kexec_secondary() is called, else; this CPU is flagged as the crashing CPU, the kexec_lock is acquired and crash_kexec() proceeds as described above. Each additional CPUs responding to the soft-reset will pass through crash_kexec() to kexec_secondary(). All secondary CPUs call crash_ipi_callback() readying them self's for the shutdown. When ready they clear their bit in cpus_in_sr. The crashing CPU waits in kexec_secondary() until all other CPUs have cleared their bits in cpus_in_sr. The kexec kernel boot is then started. Signed-off-by: Haren Myneni <haren@us.ibm.com> Signed-off-by: David Wilder <dwilder@us.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-06-27[PATCH] Remove redundant NULL checks before [kv]free - in kernel/Jesper Juhl1-2/+1
Remove redundant kfree NULL checks from kernel/ Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] futex_requeue() optimizationSebastien Dugue1-5/+8
In futex_requeue(), when the 2 futexes keys hash to the same bucket, there is no need to move the futex_q to the end of the bucket list. Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rtmutex: Propagate priority settings into PI lock chainsThomas Gleixner2-6/+38
When the priority of a task, which is blocked on a lock, changes we must propagate this change into the PI lock chain. Therefor the chain walk code is changed to get rid of the references to current to avoid false positives in the deadlock detector, as setscheduler might be called by a task which holds the lock on which the task whose priority is changed is blocked. Also add some comments about the get/put_task_struct usage to avoid confusion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rtmutex: Modify rtmutex-tester to test the setscheduler propagationThomas Gleixner1-14/+18
Make test suite setscheduler calls asynchronously. Remove the waits in the test cases and add a new testcase to verify the correctness of the setscheduler priority propagation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] Drop tasklist lock in do_sched_setschedulerThomas Gleixner1-1/+3
There is no need to hold tasklist_lock across the setscheduler call, when we pin the task structure with get_task_struct(). Interrupts are disabled in setscheduler anyway and the permission checks do not need interrupts disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: futex_lock_pi/futex_unlock_pi supportIngo Molnar5-41/+818
This adds the actual pi-futex implementation, based on rt-mutexes. [dino@in.ibm.com: fix an oops-causing race] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: rt mutex futex apiIngo Molnar1-0/+55
Add proxy-locking rt-mutex functionality needed by pi-futexes. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: rt mutex testerThomas Gleixner4-1/+461
RT-mutex tester: scriptable tester for rt mutexes, which allows userspace scripting of mutex unit-tests (and dynamic tests as well), using the actual rt-mutex implementation of the kernel. [akpm@osdl.org: fixlet] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: rt mutex debugIngo Molnar4-0/+552
Runtime debugging functionality for rt-mutexes. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: rt mutex coreIngo Molnar6-0/+1058
Core functions for the rt-mutex subsystem. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: scheduler support for piIngo Molnar1-29/+160
Add framework to boost/unboost the priority of RT tasks. This consists of: - caching the 'normal' priority in ->normal_prio - providing a functions to set/get the priority of the task - make sched_setscheduler() aware of boosting The effective_prio() cleanups also fix a priority-calculation bug pointed out by Andrey Gelman, in set_user_nice(). has_rt_policy() fix: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Andrey Gelman <agelman@012.net.il> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: futex code cleanupsIngo Molnar2-117/+131
We are pleased to announce "lightweight userspace priority inheritance" (PI) support for futexes. The following patchset and glibc patch implements it, ontop of the robust-futexes patchset which is included in 2.6.16-mm1. We are calling it lightweight for 3 reasons: - in the user-space fastpath a PI-enabled futex involves no kernel work (or any other PI complexity) at all. No registration, no extra kernel calls - just pure fast atomic ops in userspace. - in the slowpath (in the lock-contention case), the system call and scheduling pattern is in fact better than that of normal futexes, due to the 'integrated' nature of FUTEX_LOCK_PI. [more about that further down] - the in-kernel PI implementation is streamlined around the mutex abstraction, with strict rules that keep the implementation relatively simple: only a single owner may own a lock (i.e. no read-write lock support), only the owner may unlock a lock, no recursive locking, etc. Priority Inheritance - why, oh why??? ------------------------------------- Many of you heard the horror stories about the evil PI code circling Linux for years, which makes no real sense at all and is only used by buggy applications and which has horrible overhead. Some of you have dreaded this very moment, when someone actually submits working PI code ;-) So why would we like to see PI support for futexes? We'd like to see it done purely for technological reasons. We dont think it's a buggy concept, we think it's useful functionality to offer to applications, which functionality cannot be achieved in other ways. We also think it's the right thing to do, and we think we've got the right arguments and the right numbers to prove that. We also believe that we can address all the counter-arguments as well. For these reasons (and the reasons outlined below) we are submitting this patch-set for upstream kernel inclusion. What are the benefits of PI? The short reply: ---------------- User-space PI helps achieving/improving determinism for user-space applications. In the best-case, it can help achieve determinism and well-bound latencies. Even in the worst-case, PI will improve the statistical distribution of locking related application delays. The longer reply: ----------------- Firstly, sharing locks between multiple tasks is a common programming technique that often cannot be replaced with lockless algorithms. As we can see it in the kernel [which is a quite complex program in itself], lockless structures are rather the exception than the norm - the current ratio of lockless vs. locky code for shared data structures is somewhere between 1:10 and 1:100. Lockless is hard, and the complexity of lockless algorithms often endangers to ability to do robust reviews of said code. I.e. critical RT apps often choose lock structures to protect critical data structures, instead of lockless algorithms. Furthermore, there are cases (like shared hardware, or other resource limits) where lockless access is mathematically impossible. Media players (such as Jack) are an example of reasonable application design with multiple tasks (with multiple priority levels) sharing short-held locks: for example, a highprio audio playback thread is combined with medium-prio construct-audio-data threads and low-prio display-colory-stuff threads. Add video and decoding to the mix and we've got even more priority levels. So once we accept that synchronization objects (locks) are an unavoidable fact of life, and once we accept that multi-task userspace apps have a very fair expectation of being able to use locks, we've got to think about how to offer the option of a deterministic locking implementation to user-space. Most of the technical counter-arguments against doing priority inheritance only apply to kernel-space locks. But user-space locks are different, there we cannot disable interrupts or make the task non-preemptible in a critical section, so the 'use spinlocks' argument does not apply (user-space spinlocks have the same priority inversion problems as other user-space locking constructs). Fact is, pretty much the only technique that currently enables good determinism for userspace locks (such as futex-based pthread mutexes) is priority inheritance: Currently (without PI), if a high-prio and a low-prio task shares a lock [this is a quite common scenario for most non-trivial RT applications], even if all critical sections are coded carefully to be deterministic (i.e. all critical sections are short in duration and only execute a limited number of instructions), the kernel cannot guarantee any deterministic execution of the high-prio task: any medium-priority task could preempt the low-prio task while it holds the shared lock and executes the critical section, and could delay it indefinitely. Implementation: --------------- As mentioned before, the userspace fastpath of PI-enabled pthread mutexes involves no kernel work at all - they behave quite similarly to normal futex-based locks: a 0 value means unlocked, and a value==TID means locked. (This is the same method as used by list-based robust futexes.) Userspace uses atomic ops to lock/unlock these mutexes without entering the kernel. To handle the slowpath, we have added two new futex ops: FUTEX_LOCK_PI FUTEX_UNLOCK_PI If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to TID fails], then FUTEX_LOCK_PI is called. The kernel does all the remaining work: if there is no futex-queue attached to the futex address yet then the code looks up the task that owns the futex [it has put its own TID into the futex value], and attaches a 'PI state' structure to the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware, kernel-based synchronization object. The 'other' task is made the owner of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the futex value. Then this task tries to lock the rt-mutex, on which it blocks. Once it returns, it has the mutex acquired, and it sets the futex value to its own TID and returns. Userspace has no other work to perform - it now owns the lock, and futex value contains FUTEX_WAITERS|TID. If the unlock side fastpath succeeds, [i.e. userspace manages to do a TID -> 0 atomic transition of the futex value], then no kernel work is triggered. If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the behalf of userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes up any potential waiters. Note that under this approach, contrary to other PI-futex approaches, there is no prior 'registration' of a PI-futex. [which is not quite possible anyway, due to existing ABI properties of pthread mutexes.] Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties of futexes, and all four combinations are possible: futex, robust-futex, PI-futex, robust+PI-futex. glibc support: -------------- Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes (and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX mutexes. (PTHREAD_PRIO_PROTECT support will be added later on too, no additional kernel changes are needed for that). [NOTE: The glibc patch is obviously inofficial and unsupported without matching upstream kernel functionality.] the patch-queue and the glibc patch can also be downloaded from: http://redhat.com/~mingo/PI-futex-patches/ Many thanks go to the people who helped us create this kernel feature: Steven Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan van de Ven, Oleg Nesterov and others. Credits for related prior projects goes to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others. Clean up the futex code, before adding more features to it: - use u32 as the futex field type - that's the ABI - use __user and pointers to u32 instead of unsigned long - code style / comment style cleanups - rename hash-bucket name from 'bh' to 'hb'. I checked the pre and post futex.o object files to make sure this patch has no code effects. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] BUG() if setscheduler is called from interrupt contextSteven Rostedt1-0/+2
Thomas Gleixner is adding the call to a rtmutex function in setscheduler. This call grabs a spin_lock that is not always protected by interrupts disabled. So this means that setscheduler cant be called from interrupt context. To prevent this from happening in the future, this patch adds a BUG_ON(in_interrupt()) in that function. (Thanks to akpm <aka. Andrew Morton> for this suggestion). Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: uninline task_rq_lock()Oleg Nesterov1-1/+1
Saves 543 bytes from sched.o (gcc 3.3.3). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Con Kolivas <kernel@kolivas.org> Cc: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: mc/smt power savings sched policySiddha, Suresh B1-25/+215
sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in /sys/devices/system/cpu/ control the MC/SMT power savings policy for the scheduler. Based on the values (1-enable, 0-disable) for these controls, sched groups cpu power will be determined for different domains. When power savings policy is enabled and under light load conditions, scheduler will minimize the physical packages/cpu cores carrying the load and thus conserving power(with a perf impact based on the workload characteristics... see OLS 2005 CMP kernel scheduler paper for more details..) Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Con Kolivas <kernel@kolivas.org> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domai: Allocate sched_group structures dynamicallySrivatsa Vaddagiri1-5/+43
As explained here: http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2 there is a problem with sharing sched_group structures between two separate sched_group structures for different sched_domains. The patch has been tested and found to avoid the kernel lockup problem described in above URL. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domai: Use kmalloc_nodeSrivatsa Vaddagiri1-2/+3
The sched group structures used to represent various nodes need to be allocated from respective nodes (as suggested here also: http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html) Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domai: Don't use GFP_ATOMICSrivatsa Vaddagiri1-1/+1
Replace GFP_ATOMIC allocation for sched_group_nodes with GFP_KERNEL based allocation. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domain: handle kmalloc failureSrivatsa Vaddagiri1-61/+78
Try to handle mem allocation failures in build_sched_domains by bailing out and cleaning up thus-far allocated memory. The patch has a direct consequence that we disable load balancing completely (even at sibling level) upon *any* memory allocation failure. [Lee.Schermerhorn@hp.com: bugfix] Signed-off-by: Srivatsa Vaddagir <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: Avoid unnecessarily moving highest priority task move_tasks()Peter Williams1-5/+21
Problem: To help distribute high priority tasks evenly across the available CPUs move_tasks() does not, under some circumstances, skip tasks whose load weight is bigger than the designated amount. Because the highest priority task on the busiest queue may be on the expired array it may be moved as a result of this mechanism. Apart from not being the most desirable way to redistribute the high priority tasks (we'd rather move the second highest priority task), there is a risk that this could set up a loop with this task bouncing backwards and forwards between the two queues. (This latter possibility can be demonstrated by running a nice==-20 CPU bound task on an otherwise quiet 2 CPU system.) Solution: Modify the mechanism so that it does not override skip for the highest priority task on the CPU. Of course, if there are more than one tasks at the highest priority then it will allow the override for one of them as this is a desirable redistribution of high priority tasks. Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: modify move_tasks() to improve load balancing outcomesPeter Williams1-2/+10
Problem: The move_tasks() function is designed to move UP TO the amount of load it is asked to move and in doing this it skips over tasks looking for ones whose load weights are less than or equal to the remaining load to be moved. This is (in general) a good thing but it has the unfortunate result of breaking one of the original load balancer's good points: namely, that (within the limits imposed by the active/expired array model and the fact the expired is processed first) it moves high priority tasks before low priority ones and this means there's a good chance (see active/expired problem for why it's only a chance) that the highest priority task on the queue but not actually on the CPU will be moved to the other CPU where (as a high priority task) it may preempt the current task. Solution: Modify move_tasks() so that high priority tasks are not skipped when moving them will make them the highest priority task on their new run queue. Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: implement smpnicePeter Williams1-65/+219
Problem: The introduction of separate run queues per CPU has brought with it "nice" enforcement problems that are best described by a simple example. For the sake of argument suppose that on a single CPU machine with a nice==19 hard spinner and a nice==0 hard spinner running that the nice==0 task gets 95% of the CPU and the nice==19 task gets 5% of the CPU. Now suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and 2 nice==0 hard spinners running. The user of this system would be entitled to expect that the nice==0 tasks each get 95% of a CPU and the nice==19 tasks only get 5% each. However, whether this expectation is met is pretty much down to luck as there are four equally likely distributions of the tasks to the CPUs that the load balancing code will consider to be balanced with loads of 2.0 for each CPU. Two of these distributions involve one nice==0 and one nice==19 task per CPU and in these circumstances the users expectations will be met. The other two distributions both involve both nice==0 tasks being on one CPU and both nice==19 being on the other CPU and each task will get 50% of a CPU and the user's expectations will not be met. Solution: The solution to this problem that is implemented in the attached patch is to use weighted loads when determining if the system is balanced and, when an imbalance is detected, to move an amount of weighted load between run queues (as opposed to a number of tasks) to restore the balance. Once again, the easiest way to explain why both of these measures are necessary is to use a simple example. Suppose that (in a slight variation of the above example) that we have a two CPU system with 4 nice==0 and 4 nice=19 hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and the 4 nice==19 tasks are on the other CPU. The weighted loads for the two CPUs would be 4.0 and 0.2 respectively and the load balancing code would move 2 tasks resulting in one CPU with a load of 2.0 and the other with load of 2.2. If this was considered to be a big enough imbalance to justify moving a task and that task was moved using the current move_tasks() then it would move the highest priority task that it found and this would result in one CPU with a load of 3.0 and the other with a load of 1.2 which would result in the movement of a task in the opposite direction and so on -- infinite loop. If, on the other hand, an amount of load to be moved is calculated from the imbalance (in this case 0.1) and move_tasks() skips tasks until it find ones whose contributions to the weighted load are less than this amount it would move two of the nice==19 tasks resulting in a system with 2 nice==0 and 2 nice=19 on each CPU with loads of 2.1 for each CPU. One of the advantages of this mechanism is that on a system where all tasks have nice==0 the load balancing calculations would be mathematically identical to the current load balancing code. Notes: struct task_struct: has a new field load_weight which (in a trade off of space for speed) stores the contribution that this task makes to a CPU's weighted load when it is runnable. struct runqueue: has a new field raw_weighted_load which is the sum of the load_weight values for the currently runnable tasks on this run queue. This field always needs to be updated when nr_running is updated so two new inline functions inc_nr_running() and dec_nr_running() have been created to make sure that this happens. This also offers a convenient way to optimize away this part of the smpnice mechanism when CONFIG_SMP is not defined. int try_to_wake_up(): in this function the value SCHED_LOAD_BALANCE is used to represent the load contribution of a single task in various calculations in the code that decides which CPU to put the waking task on. While this would be a valid on a system where the nice values for the runnable tasks were distributed evenly around zero it will lead to anomalous load balancing if the distribution is skewed in either direction. To overcome this problem SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task or by the average load_weight per task for the queue in question (as appropriate). int move_tasks(): The modifications to this function were complicated by the fact that active_load_balance() uses it to move exactly one task without checking whether an imbalance actually exists. This precluded the simple overloading of max_nr_move with max_load_move and necessitated the addition of the latter as an extra argument to the function. The internal implementation is then modified to move up to max_nr_move tasks and max_load_move of weighted load. This slightly complicates the code where move_tasks() is called and if ever active_load_balance() is changed to not use move_tasks() the implementation of move_tasks() should be simplified accordingly. struct sched_group *find_busiest_group(): Similar to try_to_wake_up(), there are places in this function where SCHED_LOAD_SCALE is used to represent the load contribution of a single task and the same issues are created. A similar solution is adopted except that it is now the average per task contribution to a group's load (as opposed to a run queue) that is required. As this value is not directly available from the group it is calculated on the fly as the queues in the groups are visited when determining the busiest group. A key change to this function is that it is no longer to scale down *imbalance on exit as move_tasks() uses the load in its scaled form. void set_user_nice(): has been modified to update the task's load_weight field when it's nice value and also to ensure that its run queue's raw_weighted_load field is updated if it was runnable. From: "Siddha, Suresh B" <suresh.b.siddha@intel.com> With smpnice, sched groups with highest priority tasks can mask the imbalance between the other sched groups with in the same domain. This patch fixes some of the listed down scenarios by not considering the sched groups which are lightly loaded. a) on a simple 4-way MP system, if we have one high priority and 4 normal priority tasks, with smpnice we would like to see the high priority task scheduled on one cpu, two other cpus getting one normal task each and the fourth cpu getting the remaining two normal tasks. but with current smpnice extra normal priority task keeps jumping from one cpu to another cpu having the normal priority task. This is because of the busiest_has_loaded_cpus, nr_loaded_cpus logic.. We are not including the cpu with high priority task in max_load calculations but including that in total and avg_load calcuations.. leading to max_load < avg_load and load balance between cpus running normal priority tasks(2 Vs 1) will always show imbalanace as one normal priority and the extra normal priority task will keep moving from one cpu to another cpu having normal priority task.. b) 4-way system with HT (8 logical processors). Package-P0 T0 has a highest priority task, T1 is idle. Package-P1 Both T0 and T1 have 1 normal priority task each.. P2 and P3 are idle. With this patch, one of the normal priority tasks on P1 will be moved to P2 or P3.. c) With the current weighted smp nice calculations, it doesn't always make sense to look at the highest weighted runqueue in the busy group.. Consider a load balance scenario on a DP with HT system, with Package-0 containing one high priority and one low priority, Package-1 containing one low priority(with other thread being idle).. Package-1 thinks that it need to take the low priority thread from Package-0. And find_busiest_queue() returns the cpu thread with highest priority task.. And ultimately(with help of active load balance) we move high priority task to Package-1. And same continues with Package-0 now, moving high priority task from package-1 to package-0.. Even without the presence of active load balance, load balance will fail to balance the above scenario.. Fix find_busiest_queue to use "imbalance" when it is lightly loaded. [kernel@kolivas.org: sched: store weighted load on up] [kernel@kolivas.org: sched: add discrete weighted cpu load function] [suresh.b.siddha@intel.com: sched: remove dead code] Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Con Kolivas <kernel@kolivas.org> Cc: John Hawkes <hawkes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: CPU hotplug race vs. set_cpus_allowed()Kirill Korotaev1-4/+14
There is a race between set_cpus_allowed() and move_task_off_dead_cpu(). __migrate_task() doesn't report any err code, so task can be left on its runqueue if its cpus_allowed mask changed so that dest_cpu is not longer a possible target. Also, chaning cpus_allowed mask requires rq->lock being held. Signed-off-by: Kirill Korotaev <dev@openvz.org> Acked-By: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] unnecessary long index i in schedSteven Rostedt1-1/+2
Unless we expect to have more than 2G CPUs, there's no reason to have 'i' as a long long here. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: fix interactive ceiling codeCon Kolivas1-25/+27
The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect and not explicit enough. The sleep boost is not supposed to be any larger than without this code and the comment is not clear enough about what exactly it does, just the reason it does it. Fix it. There is a ceiling to the priority beyond which tasks that only ever sleep for very long periods cannot surpass. Fix it. Prevent the on-runqueue bonus logic from defeating the idle sleep logic. Opportunity to micro-optimise. Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: simplify bitmap definitionSteven Rostedt1-3/+1
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: fix smt nice lock contention and optimizationChen, Kenneth W1-123/+59
Initial report and lock contention fix from Chris Mason: Recent benchmarks showed some performance regressions between 2.6.16 and 2.6.5. We tracked down one of the regressions to lock contention in schedule heavy workloads (~70,000 context switches per second) kernel/sched.c:dependent_sleeper() was responsible for most of the lock contention, hammering on the run queue locks. The patch below is more of a discussion point than a suggested fix (although it does reduce lock contention significantly). The dependent_sleeper code looks very expensive to me, especially for using a spinlock to bounce control between two different siblings in the same cpu. It is further optimized: * perform dependent_sleeper check after next task is determined * convert wake_sleeping_dependent to use trylock * skip smt runqueue check if trylock fails * optimize double_rq_lock now that smt nice is converted to trylock * early exit in searching first SD_SHARE_CPUPOWER domain * speedup fast path of dependent_sleeper [akpm@osdl.org: cleanup] Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Chris Mason <mason@suse.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] cpu hotplug: make cpu_notifier related notifier calls __cpuinit onlyChandra Seetharaman1-3/+4
Make notifier_calls associated with cpu_notifier as __cpuinit. __cpuinit makes sure that the function is init time only unless CONFIG_HOTPLUG_CPU is defined. [akpm@osdl.org: section fix] Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] cpu hotplug: make [un]register_cpu_notifier init time onlyChandra Seetharaman1-3/+5
CPUs come online only at init time (unless CONFIG_HOTPLUG_CPU is defined). So, cpu_notifier functionality need to be available only at init time. This patch makes register_cpu_notifier() available only at init time, unless CONFIG_HOTPLUG_CPU is defined. This patch exports register_cpu_notifier() and unregister_cpu_notifier() only if CONFIG_HOTPLUG_CPU is defined. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] cpu hotplug: revert initdata patch submitted for 2.6.17Chandra Seetharaman6-6/+6
This patch reverts notifier_block changes made in 2.6.17 Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] cpu hotplug: revert init patch submitted for 2.6.17Chandra Seetharaman7-7/+7
In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a band-aid solution to solve that problem. In the process, i undid all the changes you both were making to ensure that these notifiers were available only at init time (unless CONFIG_HOTPLUG_CPU is defined). We deferred the real fix to 2.6.18. Here is a set of patches that fixes the XFS problem cleanly and makes the cpu notifiers available only at init time (unless CONFIG_HOTPLUG_CPU is defined). If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run time. This patch reverts the notifier_call changes made in 2.6.17 Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rcutorture: add call_rcu_bh() operationsPaul E. McKenney2-2/+48
Add operations for the call_rcu_bh() variant of RCU. Also add an rcu_batches_completed_bh() function, which is needed by rcutorture. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rcutorture: add ops vector and Classic RCU opsPaul E. McKenney1-43/+120
Add an ops vector to rcutorture, and add the ops for Classic RCU. Update the rcutorture documentation to reflect slight change to the dmesg formats. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rcutorture: catchup doc fixes for idle-hz testsPaul E. McKenney1-1/+1
This just catches the RCU torture documentation up with the recent fixes that test RCU for architectures that turn of the scheduling-clock interrupt for idle CPUs and the addition of a SUCCESS/FAILURE indication, fixing up an obsolete comment as well. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] fix kernel-doc in kernel/ dirRandy Dunlap2-4/+4
Fix kernel-doc parameters in kernel/ Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1376): No description found for parameter 'u_abs_timeout' Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_msg_prio' Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_abs_timeout' Warning(/var/linsrc/linux-2617-g9//kernel/acct.c:526): No description found for parameter 'pacct' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] spin/rwlock init cleanupsIngo Molnar1-1/+1
locking init cleanups: - convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK() - convert rwlocks in a similar manner this patch was generated automatically. Motivation: - cleanliness - lockdep needs control of lock initialization, which the open-coded variants do not give - it's also useful for -rt and for lock debugging in general Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] poison: add & use more constantsRandy Dunlap1-2/+3
Add more poison values to include/linux/poison.h. It's not clear to me whether some others should be added or not, so I haven't added any of these: ./include/linux/libata.h:#define ATA_TAG_POISON 0xfafbfcfdU ./arch/ppc/8260_io/fcc_enet.c:1918: memset((char *)(&(immap->im_dprambase[(mem_addr+64)])), 0x88, 32); ./drivers/usb/mon/mon_text.c:429: memset(mem, 0xe5, sizeof(struct mon_event_text)); ./drivers/char/ftape/lowlevel/ftape-ctl.c:738: memset(ft_buffer[i]->address, 0xAA, FT_BUFF_SIZE); ./drivers/block/sx8.c:/* 0xf is just arbitrary, non-zero noise; this is sorta like poisoning */ Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] vdso: randomize the i386 vDSO by moving it into a vmaIngo Molnar1-0/+12
Move the i386 VDSO down into a vma and thus randomize it. Besides the security implications, this feature also helps debuggers, which can COW a vma-backed VDSO just like a normal DSO and can thus do single-stepping and other debugging features. It's good for hypervisors (Xen, VMWare) too, which typically live in the same high-mapped address space as the VDSO, hence whenever the VDSO is used, they get lots of guest pagefaults and have to fix such guest accesses up - which slows things down instead of speeding things up (the primary purpose of the VDSO). There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support for older glibcs that still rely on a prelinked high-mapped VDSO. Newer distributions (using glibc 2.3.3 or later) can turn this option off. Turning it off is also recommended for security reasons: attackers cannot use the predictable high-mapped VDSO page as syscall trampoline anymore. There is a new vdso=[0|1] boot option as well, and a runtime /proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned on/off. (This version of the VDSO-randomization patch also has working ELF coredumping, the previous patch crashed in the coredumping code.) This code is a combined work of the exec-shield VDSO randomization code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell started this patch and i completed it. [akpm@osdl.org: cleanups] [akpm@osdl.org: compile fix] [akpm@osdl.org: compile fix 2] [akpm@osdl.org: compile fix 3] [akpm@osdl.org: revernt MAXMEM change] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org> Cc: Gerd Hoffmann <kraxel@suse.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Zachary Amsden <zach@vmware.com> Cc: Andi Kleen <ak@muc.de> Cc: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] catch valid mem range at onlining memoryKAMEZAWA Hiroyuki1-0/+38
This patch allows hot-add memory which is not aligned to section. Now, hot-added memory has to be aligned to section size. Considering big section sized archs, this is not useful. When hot-added memory is registerd as iomem resoruce by iomem resource patch, we can make use of that information to detect valid memory range. Note: With this, not-aligned memory can be registerd. To allow hot-add memory with holes, we have to do more work around add_memory(). (It doesn't allows add memory to already existing mem section.) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pm_trace is dangerousAndrew Morton1-2/+11
CONFIG_PM_TRACES scrogs your RTC. Mark it as experimental, and defaulting to `off'. Also beef up the help message a bit. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] kernel/acct: fix function definitionRandy Dunlap1-1/+1
kernel/acct.c:579:19: warning: non-ANSI function declaration of function 'acct_process' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] 64bit Resource: finally enable 64bit resource sizesGreg Kroah-Hartman1-5/+3
Introduce the Kconfig entry and actually switch to a 64bit value, if wanted, for resource_size_t. Based on a patch series originally from Vivek Goyal <vgoyal@in.ibm.com> Cc: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-27[PATCH] 64bit resource: change resource core to use resource_size_tGreg Kroah-Hartman1-16/+18
Based on a patch series originally from Vivek Goyal <vgoyal@in.ibm.com> Cc: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-27[PATCH] 64bit resource: fix up printks for resources in arch and core codeGreg Kroah-Hartman1-4/+6
This is needed if we wish to change the size of the resource structures. Based on an original patch from Vivek Goyal <vgoyal@in.ibm.com> and Andrew Morton. (tweaked by Andy Isaacson <adi@hexapodia.org>) Cc: Vivek Goyal <vgoyal@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Andy Isaacson <adi@hexapodia.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-06-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuildLinus Torvalds1-10/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (40 commits) kbuild: trivial fixes in Makefile kbuild: adding symbols in Kconfig and defconfig to TAGS kbuild: replace abort() with exit(1) kbuild: support for %.symtypes files kbuild: fix silentoldconfig recursion kbuild: add option for stripping modules while installing them kbuild: kill some false positives from modpost kbuild: export-symbol usage report generator kbuild: fix make -rR breakage kbuild: append -dirty for updated but uncommited changes kbuild: append git revision for all untagged commits kbuild: fix module.symvers parsing in modpost kbuild: ignore make's built-in rules & variables kbuild: bugfix with initramfs kbuild: modpost build fix kbuild: check license compatibility when building modules kbuild: export-type enhancement to modpost.c kbuild: add dependency on kernel.release to the package targets kbuild: `make kernelrelease' speedup kconfig: KCONFIG_OVERWRITECONFIG ...
2006-06-26Merge branch 'x86-64'Linus Torvalds5-3/+952
* x86-64: (83 commits) [PATCH] x86_64: x86_64 stack usage debugging [PATCH] x86_64: (resend) x86_64 stack overflow debugging [PATCH] x86_64: msi_apic.c build fix [PATCH] x86_64: i386/x86-64 Add nmi watchdog support for new Intel CPUs [PATCH] x86_64: Avoid broadcasting NMI IPIs [PATCH] x86_64: fix apic error on bootup [PATCH] x86_64: enlarge window for stack growth [PATCH] x86_64: Minor string functions optimizations [PATCH] x86_64: Move export symbols to their C functions [PATCH] x86_64: Standardize i386/x86_64 handling of NMI_VECTOR [PATCH] x86_64: Fix modular pc speaker [PATCH] x86_64: remove sys32_ni_syscall() [PATCH] x86_64: Do not use -ffunction-sections for modules [PATCH] x86_64: Add cpu_relax to apic_wait_icr_idle [PATCH] x86_64: adjust kstack_depth_to_print default [PATCH] i386/x86-64: adjust /proc/interrupts column headings [PATCH] x86_64: Fix race in cpu_local_* on preemptible kernels [PATCH] x86_64: Fix fast check in safe_smp_processor_id [PATCH] x86_64: x86_64 setup.c - printing cmp related boottime information [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status ... Manual resolve of trivial conflict in arch/i386/kernel/Makefile
2006-06-26[PATCH] i386/x86-64/ia64: Move polling flag into thread_info_statusAndi Kleen1-2/+7
During some profiling I noticed that default_idle causes a lot of memory traffic. I think that is caused by the atomic operations to clear/set the polling flag in thread_info. There is actually no reason to make this atomic - only the idle thread does it to itself, other CPUs only read it. So I moved it into ti->status. Converted i386/x86-64/ia64 for now because that was the easiest way to fix ACPI which also manipulates these flags in its idle function. Cc: Nick Piggin <npiggin@novell.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Len Brown <len.brown@intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] x86_64: allow unwinder to build without module supportJan Beulich1-0/+4
Add proper conditionals to be able to build with CONFIG_MODULES=n. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] i386/x86-64: fall back to old-style call trace if no unwindingJan Beulich1-4/+3
If no unwinding is possible at all for a certain exception instance, fall back to the old style call trace instead of not showing any trace at all. Also, allow setting the stack trace mode at the command line. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] x86_64: reliable stack trace supportJan Beulich3-1/+931
These are the generic bits needed to enable reliable stack traces based on Dwarf2-like (.eh_frame) unwind information. Subsequent patches will enable x86-64 and i386 to make use of this. Thanks to Andi Kleen and Ingo Molnar, who pointed out several possibilities for improvement. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] x86_64: Add compat_printk and sysctl to turn off compat layer warningsAndi Kleen1-0/+11
Sometimes e.g. with crashme the compat layer warnings can be noisy. Add a way to turn them off by gating all output through compat_printk that checks a global sysctl. The default is not changed. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] sched: fix SCHED_FIFO bug in sys_sched_rr_get_interval()Peter Williams1-1/+1
The introduction of SCHED_BATCH scheduling class with a value of 3 means that the expression (p->policy & SCHED_FIFO) will return true if policy is SCHED_BATCH or SCHED_FIFO. Unfortunately, this expression is used in sys_sched_rr_get_interval() and in the absence of a comment to say that this is intentional I presume that it is unintentional and erroneous. The fix is to change the expression to (p->policy == SCHED_FIFO). Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] coredump: copy_process: don't check SIGNAL_GROUP_EXITOleg Nesterov1-12/+0
After the previous patch SIGNAL_GROUP_EXIT implies a pending SIGKILL, we can remove this check from copy_process() because we already checked !signal_pending(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] coredump: kill ptrace related stuffOleg Nesterov2-6/+32
With this patch zap_process() sets SIGNAL_GROUP_EXIT while sending SIGKILL to the thread group. This means that a TASK_TRACED task 1. Will be awakened by signal_wake_up(1) 2. Can't sleep again via ptrace_notify() 3. Can't go to do_signal_stop() after return from ptrace_stop() in get_signal_to_deliver() So we can remove all ptrace related stuff from coredump path. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] proc: Cleanup proc_fd_access_allowedEric W. Biederman1-3/+17
In process of getting proc_fd_access_allowed to work it has developed a few warts. In particular the special case that always allows introspection and the special case to allow inspection of kernel threads. The special case for introspection is needed for /proc/self/mem. The special case for kernel threads really should be overridable by security modules. So consolidate these checks into ptrace.c:may_attach(). The check to always allow introspection is trivial. The check to allow access to kernel threads, and zombies is a little trickier. mem_read and mem_write already verify an mm exists so it isn't needed twice. proc_fd_access_allowed only doesn't want a check to verify task->mm exits, s it prevents all access to kernel threads. So just move the task->mm check into ptrace_attach where it is needed for practical reasons. I did a quick audit and none of the security modules in the kernel seem to care if they are passed a task without an mm into security_ptrace. So the above move should be safe and it allows security modules to come up with more restrictive policy. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] proc: Use struct pid not struct task_refEric W. Biederman1-6/+5
Incrementally update my proc-dont-lock-task_structs-indefinitely patches so that they work with struct pid instead of struct task_ref. Mostly this is a straight 1-1 substitution. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] proc: don't lock task_structs indefinitelyEric W. Biederman1-7/+20
Every inode in /proc holds a reference to a struct task_struct. If a directory or file is opened and remains open after the the task exits this pinning continues. With 8K stacks on a 32bit machine the amount pinned per file descriptor is about 10K. Normally I would figure a reasonable per user process limit is about 100 processes. With 80 processes, with a 1000 file descriptors each I can trigger the 00M killer on a 32bit kernel, because I have pinned about 800MB of useless data. This patch replaces the struct task_struct pointer with a pointer to a struct task_ref which has a struct task_struct pointer. The so the pinning of dead tasks does not happen. The code now has to contend with the fact that the task may now exit at any time. Which is a little but not muh more complicated. With this change it takes about 1000 processes each opening up 1000 file descriptors before I can trigger the OOM killer. Much better. [mlp@google.com: task_mmu small fixes] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Paul Jackson <pj@sgi.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Albert Cahalan <acahalan@gmail.com> Signed-off-by: Prasanna Meda <mlp@google.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] proc: Rewrite the proc dentry flush on exit optimizationEric W. Biederman2-9/+1
To keep the dcache from filling up with dead /proc entries we flush them on process exit. However over the years that code has gotten hairy with a dentry_pointer and a lock in task_struct and misdocumented as a correctness feature. I have rewritten this code to look and see if we have a corresponding entry in the dcache and if so flush it on process exit. This removes the extra fields in the task_struct and allows me to trivially handle the case of a /proc/<tgid>/task/<pid> entry as well as the current /proc/<pid> entries. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Notify page fault call chainAnil S Keshavamurthy1-7/+23
With this patch Kprobes now registers for page fault notifications only when their is an active probe registered. Once all the active probes are unregistered their is no need to be notified of page faults and kprobes unregisters itself from the page fault notifications. Hence we will have ZERO side effects when no probes are active. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Kprobes registers for notify page faultAnil S Keshavamurthy1-0/+8
Kprobes now registers for page fault notifications. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavmurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Kprobe: multi kprobe posthandler for boostermao, bibo1-8/+24
If there are multi kprobes on the same probepoint, there will be one extra aggr_kprobe on the head of kprobe list. The aggr_kprobe has aggr_post_handler/aggr_break_handler whether the other kprobe post_hander/break_handler is NULL or not. This patch modifies this, only when there is one or more kprobe in the list whose post_handler is not NULL, post_handler of aggr_kprobe will be set as aggr_post_handler. [soshima@redhat.com: !CONFIG_PREEMPT fix] Signed-off-by: bibo, mao <bibo.mao@intel.com> Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Yumiko Sugita <sugita@sdl.hitachi.co.jp> Cc: Hideo Aoki <haoki@redhat.com> Signed-off-by: Satoshi Oshima <soshima@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] fix and optimize clock source updateRoman Zippel1-42/+109
This fixes the clock source updates in update_wall_time() to correctly track the time coming in via current_tick_length(). Optimize the fast paths to be as short as possible to keep the overhead low. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] time: rename clocksource functionsjohn stultz3-19/+19
As suggested by Roman Zippel, change clocksource functions to use clocksource_xyz rather then xyz_clocksource to avoid polluting the namespace. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Time: i386 Clocksource Driversjohn stultz1-3/+8
Implement the time sources for i386 (acpi_pm, cyclone, hpet, pit, and tsc). With this patch, the conversion of the i386 arch to the generic timekeeping code should be complete. The patch should be fairly straight forward, only adding the new clocksources. [hirofumi@mail.parknet.co.jp: acpi_pm cleanup] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Time: Introduce arch generic time accessorsjohn stultz2-0/+172
Introduces clocksource switching code and the arch generic time accessor functions that use the clocksource infrastructure. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Time: Use clocksource abstraction for NTP adjustmentsjohn stultz1-19/+28
Instead of incrementing xtime by tick_nsec + ntp adjustments, use the clocksource abstraction to increment and scale time. Using the clocksource abstraction allows other clocksources to be used consistently in the face of late or lost ticks, while preserving the existing behavior via the jiffies clocksource. This removes the need to keep time_phase adjustments as we just use the current_tick_length() function as the NTP interface and accumulate time using shifted nanoseconds. The basics of this design was by Roman Zippel, however it is my own interpretation and implementation, so the credit should go to him and the blame to me. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Time: Let user request precision from current_tick_length()john stultz1-4/+17
Change the current_tick_length() function so it takes an argument which specifies how much precision to return in shifted nanoseconds. This provides a simple way to convert between NTPs internal nanoseconds shifted by (SHIFT_SCALE - 10) to other shifted nanosecond units that are used by the clocksource abstraction. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Time: Use clocksource infrastructure for update_wall_timejohn stultz3-15/+83
Modify the update_wall_time function so it increments time using the clocksource abstraction instead of jiffies. Since the only clocksource driver currently provided is the jiffies clocksource, this should result in no functional change. Additionally, a timekeeping_init and timekeeping_resume function has been added to initialize and maintain some of the new timekeping state. [hirofumi@mail.parknet.co.jp: fixlet] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Time: Clocksource Infrastructurejohn stultz3-0/+418
This introduces the clocksource management infrastructure. A clocksource is a driver-like architecture generic abstraction of a free-running counter. This code defines the clocksource structure, and provides management code for registering, selecting, accessing and scaling clocksources. Additionally, this includes the trivial jiffies clocksource, a lowest common denominator clocksource, provided mainly for use as an example. [hirofumi@mail.parknet.co.jp: Don't enable IRQ too early] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Convert kernel/cpu.c to mutexesIngo Molnar1-5/+5
Convert kernel/cpu.c from semaphore to mutex. I've reviewed all lock_cpu_hotplug() critical sections, and they all seem to fit mutex semantics. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] work around ppc64 bootup bug by making mutex-debugging save/restore irqsIngo Molnar4-37/+27
It seems ppc64 wants to lock mutexes in early bootup code, with interrupts disabled, and they expect interrupts to stay disabled, else they crash. Work around this bug by making mutex debugging variants save/restore irq flags. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25Revert "swsusp special saveable pages support" commitsLinus Torvalds3-116/+18
This reverts commits 3e3318dee0878d42ed62a19c292a2ac284135db3 [PATCH] swsusp: x86_64 mark special saveable/unsaveable pages b6370d96e09944c6e3ae8d5743ca8a8ab1f79f6c [PATCH] swsusp: i386 mark special saveable/unsaveable pages ce4ab0012b32c1a4a1d6e934aeb73bf3151c48d9 [PATCH] swsusp: add architecture special saveable pages support because not only do they apparently cause page faults on x86, the infrastructure doesn't compile on powerpc. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25Fix PM_TRACE dependency: works only on 32-bit x86 for nowLinus Torvalds1-1/+1
Not that x86-64 and other architecture support should be difficult to add (trivial fixups to the data format and add the proper linker script entry). Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] pacct: none-delayed process accounting accumulationKaiGai Kohei1-13/+11
In current 2.6.17 implementation, signal_struct refered from task_struct is used for per-process data structure. The pacct facility also uses it as a per-process data structure to store stime, utime, minflt, majflt. But those members are saved in __exit_signal(). It's too late. For example, if some threads exits at same time, pacct facility has a possibility to drop accountings for a part of those threads. (see, the following 'The results of original 2.6.17 kernel') I think accounting information should be completely collected into the per-process data structure before writing out an accounting record. This patch fixes this matter. Accumulation of stime, utime, minflt and majflt are done before generating accounting record. [mingo@elte.hu: fix acct_collect() siglock bug found by lockdep] Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] pacct: avoidance to refer the last thread as a representation of the ↵KaiGai Kohei2-20/+26
process When pacct facility generate an 'ac_flag' field in accounting record, it refers a task_struct of the thread which died last in the process. But any other task_structs are ignored. Therefore, pacct facility drops ASU flag even if root-privilege operations are used by any other threads except the last one. In addition, AFORK flag is always set when the thread of group-leader didn't die last, although this process has called execve() after fork(). We have a same matter in ac_exitcode. The recorded ac_exitcode is an exit code of the last thread in the process. There is a possibility this exitcode is not the group leader's one.
2006-06-25[PATCH] pacct: add pacct_struct to fix some pacct bugs.KaiGai Kohei3-16/+40
The pacct facility need an i/o operation when an accounting record is generated. There is a possibility to wake OOM killer up. If OOM killer is activated, it kills some processes to make them release process memory regions. But acct_process() is called in the killed processes context before calling exit_mm(), so those processes cannot release own memory. In the results, any processes stop in this point and it finally cause a system stall.
2006-06-25[PATCH] kthread: move kernel-doc and put it into DocBookRandy Dunlap1-1/+60
Move kthread API kernel-doc from kthread.h to kthread.c & fix it. Add kthread API to kernel-api DocBook. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] ktime/hrtimer: fix kernel-doc commentsRandy Dunlap1-10/+1
Fix kernel-doc formatting in ktime.h and hrtimer.[ch] files. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] cpu hotplug: fix CPU_UP_CANCEL handlingHeiko Carstens4-0/+8
If a cpu hotplug callback fails on CPU_UP_PREPARE, all callbacks will be called with CPU_UP_CANCELED. A few of these callbacks assume that on CPU_UP_PREPARE a pointer to task has been stored in a percpu array. This assumption is not true if CPU_UP_PREPARE fails and the following calls to kthread_bind() in CPU_UP_CANCELED will cause an addressing exception because of passing a NULL pointer. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] kthread: convert stop_machine into a kthreadSerge E. Hallyn1-6/+11
- Update stop_machine.c to spawn stop_machine as kthreads rather than the deprecated kernel_threads. - Update stop_machine to use the more efficient kthread_bind() before running task in place of set_cpus_allowed() after. [akpm@osdl.org: remove now-wrong set_cpus_allowed()] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] Link error when futexes are disabled on 64bit architecturesAnton Blanchard1-1/+1
If futexes are disabled we fail to link on ppc64. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] N32 sigset and __COMPAT_ENDIAN_SWAP__akpm@osdl.org1-7/+0
I'm testing glibc on MIPS64, little-endian, N32, O32 and N64 multilibs. Among the NPTL test failures seen are some arising from sigsuspend problems for N32: it blocks the wrong signals, so SIGCANCEL (SIGRTMIN) is blocked despite glibc's carefully excluding it from sets of signals to block. Specifically, testing suggests it blocks signal N^32 instead of signal N, so (in the example tested) blocking SIGUSR1 (17) blocks signal 49 instead. glibc's sigset_t uses an array of unsigned long, as does the kernel. In both cases, signal N+1 is represented as (1UL << (N % (8 * sizeof (unsigned long)))) in word number (N / (8 * sizeof (unsigned long))). Thus the N32 glibc uses an array of 32-bit words and the N64 kernel uses an array of 64-bit words. For little-endian, the layout is the same, with signals 1-32 in the first 4 bytes, signals 33-64 in the second, etc.; for big-endian, userspace has that layout while in the kernel each 8 bytes have the two halves swapped from the userspace layout. The N32 sigsuspend syscall uses sigset_from_compat to convert the userspace sigset to kernel format. If __COMPAT_ENDIAN_SWAP__ is *not* set, this uses logic of the form set->sig[0] = compat->sig[0] | (((long)compat->sig[1]) << 32 ) to convert the userspace sigset to a kernel one. This looks correct to me for both big and little endian, given that in userspace compat->sig[1] will represent signals 33-64, and so will the high 32 bits of set->sig[0] in the kernel. If however __COMPAT_ENDIAN_SWAP__ *is* set, as it is for __MIPSEL__, it uses set->sig[0] = compat->sig[1] | (((long)compat->sig[0]) << 32 ); which seems incorrect for both big and little endian, and would explain the observed symptoms. This code is the only use of __COMPAT_ENDIAN_SWAP__, so if incorrect then that macro serves no purpose, in which case something like the following patch would seem appropriate to remove it. Signed-off-by: Joseph Myers <joseph@codesourcery.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] Get rid of /proc/sys/procStephen Hemminger1-11/+0
The table is empty, why does it still exist? Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] printk time parameterJan Engelhardt1-0/+2
Currently, enabling/disabling printk timestamps is only possible through reboot (bootparam) or recompile. I normally do not run with timestamps (since syslog handles that in a good manner), but for measuring small kernel delays (e.g. irq probing - see parport thread) I needed subsecond precision, but then again, just for some minutes rather than all kernel messages to come. The following patch adds a module_param() with which the timestamps can be en-/disabled in a live system through /sys/modules/printk/parameters/printk_time. Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] Remove unecessary NULL check in kernel/acct.cMatt Helsley1-5/+3
copy_process() appears to be the only caller of acct_clear_integrals() and does not pass in NULL task pointers. Remove the unecessary check. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] constify parts of kernel/power/Andreas Mohr2-3/+3
Signed-off-by: Andreas Mohr <andi@lisas.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>