aboutsummaryrefslogtreecommitdiffstats
path: root/kernel
AgeCommit message (Collapse)AuthorFilesLines
2006-08-14[PATCH] workqueue: remove lock_cpu_hotplug()Andrew Morton1-12/+21
Use a private lock instead. It protects all per-cpu data structures in workqueue.c, including the workqueues list. Fix a bug in schedule_on_each_cpu(): it was forgetting to lock down the per-cpu resources. Unfixed long-standing bug: if someone unplugs the CPU identified by `singlethread_cpu' the kernel will get very sick. Cc: Dave Jones <davej@codemonkey.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14[PATCH] futex_handle_fault always failsjohn stultz1-6/+10
We found this issue last week w/ the -RT kernel, but it seems the same issue is in mainline as well. Basically it is possible for futex_unlock_pi to return without actually freeing the lock. This is due to buggy logic in the use of futex_handle_fault() and its attempt argument in a failure case. Looking at futex.c the logic is as follows: 1) In futex_unlock_pi() we start w/ ret=0 and we go down to the first futex_atomic_cmpxchg_inatomic(), where we find uval==-EFAULT. We then jump to the pi_faulted label. 2) From pi_faulted: We increment attempt, unlock the sem and hit the retry label. 3) From the retry label, with ret still zero, we again hit EFAULT on the first futex_atomic_cmpxchg_inatomic(), and again goto the pi_faulted label. 4) Again from pi_faulted: we increment attempt and enter the conditional, where we call futex_handle_fault. 5) futex_handle_fault fails, and we goto the out_unlock_release_sem label. 6) From out_unlock_release_sem we return, and since ret is still zero, we return without error, while never actually unlocking the lock. Issue #1: at the first futex_atomic_cmpxchg_inatomic() we should probably be setting ret=-EFAULT before jumping to pi_faulted: However in our case this doesn't really affect anything, as the glibc we're using ignores the error value from futex_unlock_pi(). Issue #2: Look at futex_handle_fault(), its first conditional will return -EFAULT if attempt is >= 2. However, from the "if(attempt++) futex_handle_fault(attempt)" logic above, we'll *never* call futex_handle_fault when attempt is less then two. So we never get a chance to even try to fault the page in. The following patch addresses these two issues by 1) Always setting ret to -EFAULT if futex_handle_fault fails, and 2) Removing the = in futex_handle_fault's (attempt >= 2) check. I'm really not sure this is the right fix, but wanted to bring it up so folks knew the issue is alive and well in the current -git tree. From looking at the git logs the logic was first introduced (then later copied to other places) in the following commit almost a year ago: http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=4732efbeb997189d9f9b04708dc26bf8613ed721;hp=5b039e681b8c5f30aac9cc04385cc94be45d0823 Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14[PATCH] sys_getppid oopses on debug kernelKirill Korotaev1-34/+7
sys_getppid() optimization can access a freed memory. On kernels with DEBUG_SLAB turned ON, this results in Oops. As Dave Hansen noted, this optimization is also unsafe for memory hotplug. So this patch always takes the lock to be safe. [oleg@tv-sign.ru: simplifications] Signed-off-by: Kirill Korotaev <dev@openvz.org> Cc: <stable@kernel.org> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14[PATCH] panic.c build fixAndrew Morton1-0/+1
kernel/panic.c: In function 'add_taint': kernel/panic.c:176: warning: implicit declaration of function 'debug_locks_off' Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-14[PATCH] fix hrtimer percpu usage typoJan Blunck1-1/+1
The percpu variable is used incorrectly in switch_hrtimer_base(). Signed-off-by: Jan Blunck <jblunck@suse.de> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-08-06[PATCH] futex: Apply recent futex fixes to futex_compatThomas Gleixner1-3/+3
The recent fixups in futex.c need to be applied to futex_compat.c too. Fixes a hang reported by Olaf. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Olaf Hering <olh@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06[PATCH] memory hotadd fixes: find_next_system_ram catch range fixKAMEZAWA Hiroyuki1-1/+2
find_next_system_ram() is used to find available memory resource at onlining newly added memory. This patch fixes following problem. find_next_system_ram() cannot catch this case. Resource: (start)-------------(end) Section : (start)-------------(end) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Keith Mannthey <kmannth@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06[PATCH] memory hotadd fixes: change find_next_system_ram's return value mannerKAMEZAWA Hiroyuki1-2/+4
find_next_system_ram() returns valid memory range which meets requested area, only used by memory-hot-add. This function always rewrite requested resource even if returned area is not fully fit in requested one. And sometimes the returnd resource is larger than requested area. This annoyes the caller. This patch changes the returned value to fit in requested area. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Keith Mannthey <kmannth@gmail.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06[PATCH] vt: printk: Fix framebuffer console triggering might_sleep assertionAntonino A. Daplas1-1/+3
Reported by: Dave Jones Whilst printk'ing to both console and serial console, I got this... (2.6.18rc1) BUG: sleeping function called from invalid context at kernel/sched.c:4438 in_atomic():0, irqs_disabled():1 Call Trace: [<ffffffff80271db8>] show_trace+0xaa/0x23d [<ffffffff80271f60>] dump_stack+0x15/0x17 [<ffffffff8020b9f8>] __might_sleep+0xb2/0xb4 [<ffffffff8029232e>] __cond_resched+0x15/0x55 [<ffffffff80267eb8>] cond_resched+0x3b/0x42 [<ffffffff80268c64>] console_conditional_schedule+0x12/0x14 [<ffffffff80368159>] fbcon_redraw+0xf6/0x160 [<ffffffff80369c58>] fbcon_scroll+0x5d9/0xb52 [<ffffffff803a43c4>] scrup+0x6b/0xd6 [<ffffffff803a4453>] lf+0x24/0x44 [<ffffffff803a7ff8>] vt_console_print+0x166/0x23d [<ffffffff80295528>] __call_console_drivers+0x65/0x76 [<ffffffff80295597>] _call_console_drivers+0x5e/0x62 [<ffffffff80217e3f>] release_console_sem+0x14b/0x232 [<ffffffff8036acd6>] fb_flashcursor+0x279/0x2a6 [<ffffffff80251e3f>] run_workqueue+0xa8/0xfb [<ffffffff8024e5e0>] worker_thread+0xef/0x122 [<ffffffff8023660f>] kthread+0x100/0x136 [<ffffffff8026419e>] child_rip+0x8/0x12 This can occur when release_console_sem() is called but the log buffer still has contents that need to be flushed. The console drivers are called while the console_may_schedule flag is still true. The might_sleep() is triggered when fbcon calls console_conditional_schedule(). Fix by setting console_may_schedule to zero earlier, before the call to the console drivers. Signed-off-by: Antonino Daplas <adaplas@pol.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06[PATCH] ptrace: make pid of child process available for PTRACE_EVENT_VFORK_DONEChuck Ebbert1-1/+3
When delivering PTRACE_EVENT_VFORK_DONE, provide pid of the child process when tracer calls ptrace(PTRACE_GETEVENTMSG). This is already (accidentally) available when the tracer is tracing VFORK in addition to VFORK_DONE. Signed-off-by: Chuck Ebbert <76306.1226@compuserve.com> Cc: Daniel Jacobowitz <dan@debian.org> Cc: Albert Cahalan <acahalan@gmail.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06[PATCH] bug in futex unqueue_meChristian Borntraeger1-0/+1
This patch adds a barrier() in futex unqueue_me to avoid aliasing of two pointers. On my s390x system I saw the following oops: Unable to handle kernel pointer dereference at virtual kernel address 0000000000000000 Oops: 0004 [#1] CPU: 0 Not tainted Process mytool (pid: 13613, task: 000000003ecb6ac0, ksp: 00000000366bdbd8) Krnl PSW : 0704d00180000000 00000000003c9ac2 (_spin_lock+0xe/0x30) Krnl GPRS: 00000000ffffffff 000000003ecb6ac0 0000000000000000 0700000000000000 0000000000000000 0000000000000000 000001fe00002028 00000000000c091f 000001fe00002054 000001fe00002054 0000000000000000 00000000366bddc0 00000000005ef8c0 00000000003d00e8 0000000000144f91 00000000366bdcb8 Krnl Code: ba 4e 20 00 12 44 b9 16 00 3e a7 84 00 08 e3 e0 f0 88 00 04 Call Trace: ([<0000000000144f90>] unqueue_me+0x40/0xe4) [<0000000000145a0c>] do_futex+0x33c/0xc40 [<000000000014643e>] sys_futex+0x12e/0x144 [<000000000010bb00>] sysc_noemu+0x10/0x16 [<000002000003741c>] 0x2000003741c The code in question is: static int unqueue_me(struct futex_q *q) { int ret = 0; spinlock_t *lock_ptr; /* In the common case we don't take the spinlock, which is nice. */ retry: lock_ptr = q->lock_ptr; if (lock_ptr != 0) { spin_lock(lock_ptr); /* * q->lock_ptr can change between reading it and * spin_lock(), causing us to take the wrong lock. This * corrects the race condition. [...] and my compiler (gcc 4.1.0) makes the following out of it: 00000000000003c8 <unqueue_me>: 3c8: eb bf f0 70 00 24 stmg %r11,%r15,112(%r15) 3ce: c0 d0 00 00 00 00 larl %r13,3ce <unqueue_me+0x6> 3d0: R_390_PC32DBL .rodata+0x2a 3d4: a7 f1 1e 00 tml %r15,7680 3d8: a7 84 00 01 je 3da <unqueue_me+0x12> 3dc: b9 04 00 ef lgr %r14,%r15 3e0: a7 fb ff d0 aghi %r15,-48 3e4: b9 04 00 b2 lgr %r11,%r2 3e8: e3 e0 f0 98 00 24 stg %r14,152(%r15) 3ee: e3 c0 b0 28 00 04 lg %r12,40(%r11) /* write q->lock_ptr in r12 */ 3f4: b9 02 00 cc ltgr %r12,%r12 3f8: a7 84 00 4b je 48e <unqueue_me+0xc6> /* if r12 is zero then jump over the code.... */ 3fc: e3 20 b0 28 00 04 lg %r2,40(%r11) /* write q->lock_ptr in r2 */ 402: c0 e5 00 00 00 00 brasl %r14,402 <unqueue_me+0x3a> 404: R_390_PC32DBL _spin_lock+0x2 /* use r2 as parameter for spin_lock */ So the code becomes more or less: if (q->lock_ptr != 0) spin_lock(q->lock_ptr) instead of if (lock_ptr != 0) spin_lock(lock_ptr) Which caused the oops from above. After adding a barrier gcc creates code without this problem: [...] (the same) 3ee: e3 c0 b0 28 00 04 lg %r12,40(%r11) 3f4: b9 02 00 cc ltgr %r12,%r12 3f8: b9 04 00 2c lgr %r2,%r12 3fc: a7 84 00 48 je 48c <unqueue_me+0xc4> 400: c0 e5 00 00 00 00 brasl %r14,400 <unqueue_me+0x38> 402: R_390_PC32DBL _spin_lock+0x2 As a general note, this code of unqueue_me seems a bit fishy. The retry logic of unqueue_me only works if we can guarantee, that the original value of q->lock_ptr is always a spinlock (Otherwise we overwrite kernel memory). We know that q->lock_ptr can change. I dont know what happens with the original spinlock, as I am not an expert with the futex code. Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@timesys.com> Signed-off-by: Christian Borntraeger <borntrae@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-06[PATCH] Make suspend possible with a traced process at a breakpointRafael J. Wysocki1-8/+18
It should be possible to suspend, either to RAM or to disk, if there's a traced process that has just reached a breakpoint. However, this is a special case, because its parent process might have been frozen already and then we are unable to deliver the "freeze" signal to the traced process. If this happens, it's better to cancel the freezing of the traced process. Ref. http://bugzilla.kernel.org/show_bug.cgi?id=6787 Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-08-03[PATCH] take filling ->pid, etc. out of audit_get_context()Al Viro1-11/+12
move that stuff downstream and into the only branch where it'll be used. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03[PATCH] don't bother with aux entires for dummy contextAl Viro1-3/+3
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03[PATCH] mark context of syscall entered with no rules as dummyAl Viro1-2/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03[PATCH] introduce audit rules counterAl Viro2-0/+27
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03[PATCH] fix audit oops with invalid operatorAmy Griffis1-0/+2
Michael C Thompson wrote: [Tue Aug 01 2006, 02:36:36PM EDT] > The trigger for this oops is: > # auditctl -a exit,always -S pread64 -F 'inode<1' Setting the err value will fix it. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03[PATCH] fix oops with CONFIG_AUDIT and !CONFIG_AUDITSYSCALLAmy Griffis1-3/+1
Always initialize the audit_inode_hash[] so we don't oops on list rules. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03[PATCH] fix missed create event for directory auditAmy Griffis1-3/+13
When an object is created via a symlink into an audited directory, audit misses the event due to not having collected the inode data for the directory. Modify __audit_inode_child() to copy the parent inode data if a parent wasn't found in audit_names[]. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-03[PATCH] fix faulty inode data collection for open() with O_CREATAmy Griffis1-22/+41
When the specified path is an existing file or when it is a symlink, audit collects the wrong inode number, which causes it to miss the open() event. Adding a second hook to the open() path fixes this. Also add audit_copy_inode() to consolidate some code. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-08-02Fix force_sig_info() semantics after cleanupsLinus Torvalds1-8/+17
Suresh points out that commit b0423a0d9cc836b2c3d796623cd19236bfedfe63 broke the semantics of a synchronous signal like SIGSEGV occurring recursively inside its own handler handler (or, indeed, any other context when the signal was blocked). That was unintentional, and this fixes things up by reinstating the old semantics, but without reverting the cleanups. Cc: Paul E. McKenney <paulmck@us.ibm.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] timer: Fix tvec_bases initializerJosh Triplett1-1/+1
kernel/timer.c defines a (per-cpu) pointer to tvec_base_t, but initializes it using { &a_tvec_base_t }, which sparse warns about; change this to just &a_tvec_base_t. Signed-off-by: Josh Triplett <josh@freedesktop.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] reference rt-mutex-design in rtmutex.cSteven Rostedt1-0/+2
In order to prevent Doc Rot, this patch adds a reference to the design document for rtmutex.c in rtmutex.c. So when someone needs to update or change the design of that file they will know that a document actually exists that explains the design (helping them change it), and hopefully that they will update the document if they too change the design. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] Reducing local_bh_enable/disable overhead in irqtraceTim Chen1-0/+18
The recent changes from irqtrace feature has added overheads to local_bh_disable and local_bh_enable that reduces UDP performance across x86_64 and IA64, even though IA64 does not support the irqtrace feature. Patch in question is [PATCH]lockdep: irqtrace subsystem, core http://www.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=c ommit;h=de30a2b355ea85350ca2f58f3b9bf4e5bc007986 Prior to this patch, local_bh_disable was a short macro. Now it is a function which calls __local_bh_disable with added irq flags save and restore. The irq flags save and restore were also added to local_bh_enable, probably for injecting the trace irqs code. This overhead is on the generic code path across all architectures. On a IA_64 test machine (Itanium-2 1.6 GHz) running a benchmark like netperf's UDP streaming test, the added overhead results in a drop of 3% in throughput, as udp_sendmsg calls the local_bh_enable/disable several times. Other workloads that have heavy usages of local_bh_enable/disable could also be affected. The patch ideally should not have affected IA-64 performance as it does not have IRQ tracing support. A significant portion of the overhead is in the added irq flags save and restore, which I think is not needed if IRQ tracing is unused. A suggested patch is attached below that recovers the lost performance. However, the "ifdef"s in the patch are a bit ugly. Signed-off-by: Tim Chen <tim.c.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] pi-futex: missing pi_waiters plist initializationHeiko Carstens1-0/+5
Initialize init task's pi_waiters plist. Otherwise cpu hotplug of cpu 0 might crash, since rt_mutex_getprio() accesses an uninitialized list head. call chain which led to crash: take_cpu_down sched_idle_next __setscheduler rt_mutex_getprio Using PLIST_HEAD_INIT in the INIT_TASK macro doesn't work unfortunately, since the pi_waiters member is only conditionally present. Cc: Arjan van de Ven <arjan@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] Add DocBook documentation for workqueue functionsRolf Eike Beer1-4/+54
kernel/workqueue.c was omitted from generating kernel documentation. This adds a new section "Workqueues and Kevents" and adds documentation for some of the functions. Some functions in this file already had DocBook-style comments, now they finally become visible. Signed-off-by: Rolf Eike Beer <eike-kernel@sf-tec.de> Cc: "Randy.Dunlap" <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] fix cond_resched() fixJim Houston1-5/+5
In cond_resched_lock() it calls __resched_legal() before dropping the spin lock. __resched_legal() will always finds the preempt_count non-zero and will prevent the call to __cond_resched(). The attached patch adds a parameter to __resched_legal() with the expected preempt_count value. Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] fix bad macro param in timer.cSteven Rostedt1-1/+1
We have #define INDEX(N) (base->timer_jiffies >> (TVR_BITS + N * TVN_BITS)) & TVN_MASK and it's used via list = varray[i + 1]->vec + (INDEX(i + 1)); So, due to underparenthesisation, this INDEX(i+1) is now a ... (TVR_BITS + i + 1 * TVN_BITS)) ... So this bugfix changes behaviour. It worked before by sheer luck: "If i was anything but 0, it was broken. But this was only used by s390 and arm. Since it was for the next interrupt, could that next interrupt be a problem (going into the second cascade)? But it was probably seldom wrong. That is, this would fail if the next interrupt was in the second cascade, and was wrapped. Which may never of happened. Also if it did happen, it would have just missed the interrupt. If an interrupt was missed, and no one was there to miss it, was it really missed :-)" Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] cpu hotplug: replace __devinit* with __cpuinit* for cpu notificationsChandra Seetharaman5-10/+10
Few of the callback functions and notifier blocks that are associated with cpu notifications incorrectly have __devinit and __devinitdata. They should be __cpuinit and __cpuinitdata instead. It makes no functional difference but wastes text area when CONFIG_HOTPLUG is enabled and CONFIG_HOTPLUG_CPU is not. This patch fixes all those instances. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] IA64: kprobe invalidate icache of jump bufferbibo, mao1-0/+1
Kprobe inserts breakpoint instruction in probepoint and then jumps to instruction slot when breakpoint is hit, the instruction slot icache must be consistent with dcache. Here is the patch which invalidates instruction slot icache area. Without this patch, in some machines there will be fault when executing instruction slot where icache content is inconsistent with dcache. Signed-off-by: bibo,mao <bibo.mao@intel.com> Acked-by: "Luck, Tony" <tony.luck@intel.com> Acked-by: Keshavamurthy Anil S <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] delay accounting: temporarily enable by defaultShailabh Nagar1-4/+4
Enable delay accounting by default so that feature gets coverage testing without requiring special measures. Earlier, it was off by default and had to be enabled via a boot time param. This patch reverses the default behaviour to improve coverage testing. It can be removed late in the kernel development cycle if its believed users shouldn't have to incur any cost if they don't want delay accounting. Or it can be retained forever if the utility of the stats is deemed common enough to warrant keeping the feature on. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] taskstats: free skb, avoid returns in send_cpu_listenersShailabh Nagar1-13/+11
Add a missing freeing of skb in the case there are no listeners at all. Also remove the returning of error values by the function as it is unused by the sole caller. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] make taskstats sending completely independent of delay accounting ↵Shailabh Nagar1-5/+3
on/off status Complete the separation of delay accounting and taskstats by ignoring the return value of delay accounting functions that fill in parts of taskstats before it is sent out (either in response to a command or as part of a task exit). Also make delayacct_add_tsk return silently when delay accounting is turned off rather than treat it as an error. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] genirq: {en,dis}able_irq_wake() need refcounting tooDavid Brownell1-2/+26
IRQs need refcounting and a state flag to track whether the the IRQ should be enabled or disabled as a "normal IRQ" source after a series of calls to {en,dis}able_irq(). For shared IRQs, the IRQ must be enabled so long as at least one driver needs it active. Likewise, IRQs need the same support to track whether the IRQ should be enabled or disabled as a "wakeup event" source after a series of calls to {en,dis}able_irq_wake(). For shared IRQs, the IRQ must be enabled as a wakeup source during sleep so long as at least one driver needs it. But right now they _don't have_ that refcounting ... which means sharing a wakeup-capable IRQ can't work correctly in some configurations. This patch adds the refcount and flag mechanisms to set_irq_wake() -- which is what {en,dis}able_irq_wake() call -- and minimal documentation of what the irq wake mechanism does. Drivers relying on the older (broken) "toggle" semantics will trigger a warning; that'll be a handful of drivers on ARM systems. Signed-off-by: David Brownell <dbrownell@users.sourceforge.net> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-31[PATCH] sched: build_sched_domains() fixSiddha, Suresh B1-1/+6
Use the correct groups while initializing sched groups power for allnodes_domain. This fixes the crash observed while creating exclusive cpusets. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Reported-and-tested-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-28[PATCH] pi-futex: robust-futex exitIngo Molnar2-38/+87
Fix robust PI-futexes to be properly unlocked on unexpected exit. For this to work the kernel has to know whether a futex is a PI or a non-PI one, because the semantics are different. Since the space in relevant glibc data structures is extremely scarce, the best solution is to encode the 'PI' information in bit 0 of the robust list pointer. Existing (non-PI) glibc robust futexes have this bit always zero, so the ABI is kept. New glibc with PI-robust-futexes will set this bit. Further fixes from Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-28[PATCH] pi-futex: robust-futex exit crash fixIngo Molnar1-8/+24
Fix pi_state->list handling bugs: list handling mishap, locking error. Plus add more debug checks and fix a few style issues i noticed while debugging this. (reported by Ulrich Drepper and Jakub Jelinek.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-23[PATCH] Cpuset: fix ABBA deadlock with cpu hotplug lockPaul Jackson1-3/+21
Fix ABBA deadlock between lock_cpu_hotplug() and the cpuset callback_mutex lock. It only happens on cpu_exclusive cpusets, due to the dynamic sched domain code trying to take the cpu hotplug lock inside the cpuset callback_mutex lock. This bug has apparently been here for several months, but didn't get hit until the right customer load on a large system. This fix appears right from inspection, but it will take a few more days running it on that customers workload to be confident we nailed it. We don't have any other reproducible test case. The cpu_hotplug_lock() tends to cover large runs of code. The other places that hold both that lock and the cpuset callback mutex lock always nest the cpuset lock inside the hotplug lock. This place tries to do the reverse, risking an ABBA deadlock. This is in the cpuset_rmdir() code, where we: * take the callback_mutex lock * mark the cpuset CS_REMOVED * call update_cpu_domains for cpu_exclusive cpusets * in that call, take the cpu_hotplug lock if the cpuset is marked for removal. Thanks to Jack Steiner for identifying this deadlock. The fix is to tear down the dynamic sched domain before we grab the cpuset callback_mutex lock. This way, the two locks are serialized, with the hotplug lock taken and released before trying for the cpuset lock. I suspect that this bug was introduced when I changed the cpuset locking from one lock to two. The dynamic sched domain dependency on cpu_exclusive cpusets and its hotplug hooks were added to this code earlier, when cpusets had only a single lock. It may well have been fine then. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-23cpu hotplug: simplify and hopefully fix lockingLinus Torvalds1-41/+34
The CPU hotplug locking was quite messy, with a recursive lock to handle the fact that both the actual up/down sequence wanted to protect itself from being re-entered, but the callbacks that it called also tended to want to protect themselves from CPU events. This splits the lock into two (one to serialize the whole hotplug sequence, the other to protect against the CPU present bitmaps changing). The latter still allows recursive usage because some subsystems (ondemand policy for cpufreq at least) had already gotten too used to the lax locking, but the locking mistakes are hopefully now less fundamental, and we now warn about recursive lock usage when we see it, in the hope that it can be fixed. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] Remove down_write() from taskstats code invoked on the exit() pathShailabh Nagar1-5/+19
In send_cpu_listeners(), which is called on the exit path, a down_write() was protecting operations like skb_clone() and genlmsg_unicast() that do GFP_KERNEL allocations. If the oom-killer decides to kill tasks to satisfy the allocations,the exit of those tasks could block on the same semphore. The down_write() was only needed to allow removal of invalid listeners from the listener list. The patch converts the down_write to a down_read and defers the removal to a separate critical region. This ensures that even if the oom-killer is called, no other task's exit is blocked as it can still acquire another down_read. Thanks to Andrew Morton & Herbert Xu for pointing out the oom related pitfalls, and to Chandra Seetharaman for suggesting this fix instead of using something more complex like RCU. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] per-task delay accounting taskstats interface: control exit data ↵Shailabh Nagar2-13/+192
through cpumasks On systems with a large number of cpus, with even a modest rate of tasks exiting per cpu, the volume of taskstats data sent on thread exit can overflow a userspace listener's buffers. One approach to avoiding overflow is to allow listeners to get data for a limited and specific set of cpus. By scaling the number of listeners and/or the cpus they monitor, userspace can handle the statistical data overload more gracefully. In this patch, each listener registers to listen to a specific set of cpus by specifying a cpumask. The interest is recorded per-cpu. When a task exits on a cpu, its taskstats data is unicast to each listener interested in that cpu. Thanks to Andrew Morton for pointing out the various scalability and general concerns of previous attempts and for suggesting this design. [akpm@osdl.org: build fix] Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] delay accounting taskstats interface send tgid onceShailabh Nagar3-36/+74
Send per-tgid data only once during exit of a thread group instead of once with each member thread exit. Currently, when a thread exits, besides its per-tid data, the per-tgid data of its thread group is also sent out, if its thread group is non-empty. The per-tgid data sent consists of the sum of per-tid stats for all *remaining* threads of the thread group. This patch modifies this sending in two ways: - the per-tgid data is sent only when the last thread of a thread group exits. This cuts down heavily on the overhead of sending/receiving per-tgid data, especially when other exploiters of the taskstats interface aren't interested in per-tgid stats - the semantics of the per-tgid data sent are changed. Instead of being the sum of per-tid data for remaining threads, the value now sent is the true total accumalated statistics for all threads that are/were part of the thread group. The patch also addresses a minor issue where failure of one accounting subsystem to fill in the taskstats structure was causing the send of taskstats to not be sent at all. The patch has been tested for stability and run cerberus for over 4 hours on an SMP. [akpm@osdl.org: bugfixes] Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] per-task-delay-accounting: /proc export of aggregated block I/O delaysShailabh Nagar1-0/+12
Export I/O delays seen by a task through /proc/<tgid>/stats for use in top etc. Note that delays for I/O done for swapping in pages (swapin I/O) is clubbed together with all other I/O here (this is not the case in the netlink interface where the swapin I/O is kept distinct) [akpm@osdl.org: printk warning fix] Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Cc: Jes Sorensen <jes@sgi.com> Cc: Peter Chubb <peterc@gelato.unsw.edu.au> Cc: Erich Focht <efocht@ess.nec.de> Cc: Levent Serinol <lserinol@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] per-task-delay-accounting: delay accounting usage of taskstats interfaceShailabh Nagar2-6/+72
Usage of taskstats interface by delay accounting. Signed-off-by: Shailabh Nagar <nagar@us.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Cc: Jes Sorensen <jes@sgi.com> Cc: Peter Chubb <peterc@gelato.unsw.edu.au> Cc: Erich Focht <efocht@ess.nec.de> Cc: Levent Serinol <lserinol@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] per-task-delay-accounting: taskstats interfaceShailabh Nagar3-0/+344
Create a "taskstats" interface based on generic netlink (NETLINK_GENERIC family), for getting statistics of tasks and thread groups during their lifetime and when they exit. The interface is intended for use by multiple accounting packages though it is being created in the context of delay accounting. This patch creates the interface without populating the fields of the data that is sent to the user in response to a command or upon the exit of a task. Each accounting package interested in using taskstats has to provide an additional patch to add its stats to the common structure. [akpm@osdl.org: cleanups, Kconfig fix] Signed-off-by: Shailabh Nagar <nagar@us.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Cc: Jes Sorensen <jes@sgi.com> Cc: Peter Chubb <peterc@gelato.unsw.edu.au> Cc: Erich Focht <efocht@ess.nec.de> Cc: Levent Serinol <lserinol@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] per-task-delay-accounting: cpu delay collection via schedstatsChandra Seetharaman1-22/+49
Make the task-related schedstats functions callable by delay accounting even if schedstats collection isn't turned on. This removes the dependency of delay accounting on schedstats. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Cc: Jes Sorensen <jes@sgi.com> Cc: Peter Chubb <peterc@gelato.unsw.edu.au> Cc: Erich Focht <efocht@ess.nec.de> Cc: Levent Serinol <lserinol@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] per-task-delay-accounting: sync block I/O and swapin delay collectionShailabh Nagar2-0/+24
Unlike earlier iterations of the delay accounting patches, now delays are only collected for the actual I/O waits rather than try and cover the delays seen in I/O submission paths. Account separately for block I/O delays incurred as a result of swapin page faults whose frequency can be affected by the task/process' rss limit. Hence swapin delays can act as feedback for rss limit changes independent of I/O priority changes. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Cc: Jes Sorensen <jes@sgi.com> Cc: Peter Chubb <peterc@gelato.unsw.edu.au> Cc: Erich Focht <efocht@ess.nec.de> Cc: Levent Serinol <lserinol@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] per-task-delay-accounting: setupShailabh Nagar4-0/+92
Initialization code related to collection of per-task "delay" statistics which measure how long it had to wait for cpu, sync block io, swapping etc. The collection of statistics and the interface are in other patches. This patch sets up the data structures and allows the statistics collection to be disabled through a kernel boot parameter. Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com> Signed-off-by: Balbir Singh <balbir@in.ibm.com> Cc: Jes Sorensen <jes@sgi.com> Cc: Peter Chubb <peterc@gelato.unsw.edu.au> Cc: Erich Focht <efocht@ess.nec.de> Cc: Levent Serinol <lserinol@gmail.com> Cc: Jay Lan <jlan@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] lockdep: core, fix rq-lock handling on __ARCH_WANT_UNLOCKED_CTXSWIngo Molnar1-0/+8
On platforms that have __ARCH_WANT_UNLOCKED_CTXSW set and want to implement lock validator support there's a bug in rq->lock handling: in this case we dont 'carry over' the runqueue lock into another task - but still we did a spinlock_release() of it. Fix this by making the spinlock_release() in context_switch() dependent on !__ARCH_WANT_UNLOCKED_CTXSW. (Reported by Ralf Baechle on MIPS, which has __ARCH_WANT_UNLOCKED_CTXSW. This fixes a lockdep-internal BUG message on such platforms.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] Fix sighand->siglock usage in kernel/acct.cOGAWA Hirofumi1-2/+2
IRQs must be disabled before taking ->siglock. Noticed by lockdep. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] improve timekeeping resume robustnessjohn stultz1-1/+18
Resolve problems seen w/ APM suspend. Due to resume initialization ordering, its possible we could get a timer interrupt before the timekeeping resume() function is called. This patch ensures we don't do any timekeeping accounting before we're fully resumed. (akpm: fixes the machine-freezes-on-APM-resume bug) Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] unexport open_softirqAdrian Bunk1-2/+0
Christoph Hellwig: open_softirq just enables a softirq. The softirq array is statically allocated so to add a new one you would have to patch the kernel. So there's no point to keep this export at all as any user would have to patch the enum in include/linux/interrupt.h anyway. Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] Add try_to_freeze() to rt-test kthreadsLuca Tettamanti1-0/+1
When CONFIG_RT_MUTEX_TESTER is enabled kernel refuses to suspend the machine because it's unable to freeze the rt-test-* threads. Add try_to_freeze() after schedule() so that the threads will be freezed correctly; I've tested the patch and it lets the notebook suspends and resumes nicely. Signed-off-by: Luca Tettamanti <kronos.it@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] del_timer_sync(): add cpu_relax()Andrew Morton1-0/+1
Relax the CPU in the del_timer_sync() busywait loop. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] remove kernel/kthread.c:kthread_stop_sem()Adrian Bunk1-22/+2
Remove the now-unneeded kthread_stop_sem(). Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-14[PATCH] null-terminate over-long /proc/kallsyms symbolsAndreas Gruenbacher2-9/+6
Got a customer bug report (https://bugzilla.novell.com/190296) about kernel symbols longer than 127 characters which end up in a string buffer that is not NULL terminated, leading to garbage in /proc/kallsyms. Using strlcpy prevents this from happening, even though such symbols still won't come out right. A better fix would be to not use a fixed-size buffer, but it's probably not worth the trouble. (Modversion'ed symbols even have a length limit of 60.) [bunk@stusta.de: build fix] Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-12[PATCH] The scheduled unexport of insert_resourceAdrian Bunk1-2/+0
Implement the scheduled unexport of insert_resource. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-07-12[PATCH] remove kernel/power/pm.c:pm_unregister_all()Adrian Bunk1-37/+0
Remove the deprecated and no longer used pm_unregister_all(). Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-07-12[PATCH] Fix prctl privilege escalation and suid_dumpable (CVE-2006-2451)Marcel Holtmann1-1/+1
Based on a patch from Ernie Petrides During security research, Red Hat discovered a behavioral flaw in core dump handling. A local user could create a program that would cause a core file to be dumped into a directory they would not normally have permissions to write to. This could lead to a denial of service (disk consumption), or allow the local user to gain root privileges. The prctl() system call should never allow to set "dumpable" to the value 2. Especially not for non-privileged users. This can be split into three cases: 1) running as root -- then core dumps will already be done as root, and so prctl(PR_SET_DUMPABLE, 2) is not useful 2) running as non-root w/setuid-to-root -- this is the debatable case 3) running as non-root w/setuid-to-non-root -- then you definitely do NOT want "dumpable" to get set to 2 because you have the privilege escalation vulnerability With case #2, the only potential usefulness is for a program that has designed to run with higher privilege (than the user invoking it) that wants to be able to create root-owned root-validated core dumps. This might be useful as a debugging aid, but would only be safe if the program had done a chdir() to a safe directory. There is no benefit to a production setuid-to-root utility, because it shouldn't be dumping core in the first place. If this is true, then the same debugging aid could also be accomplished with the "suid_dumpable" sysctl. Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] lockdep: disable lock debugging when kernel state becomes untrustedArjan van de Ven1-0/+2
Disable lockdep debugging in two situations where the integrity of the kernel no longer is guaranteed: when oopsing and when hitting a tainting-condition. The goal is to not get weird lockdep traces that don't make sense or are otherwise undebuggable, to not waste time. Lockdep assumes that the previous state it knows about is valid to operate, which is why lockdep turns itself off after the first violation it reports, after that point it can no longer make that assumption. A kernel oops means that the integrity of the kernel compromised; in addition anything lockdep would report is of lesser importance than the oops. All the tainting conditions are of similar integrity-violating nature and also make debugging/diagnosing more difficult. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] remove the tasklist_lock exportChristoph Hellwig1-3/+1
As announced half a year ago this patch will remove the tasklist_lock export. The previous two patches got rid of the remaining modular users. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] uninline init_waitqueue_head()Ingo Molnar1-2/+6
allyesconfig vmlinux size delta: text data bss dec filename 20736884 6073834 3075176 29885894 vmlinux.before 20721009 6073966 3075176 29870151 vmlinux.after ~18 bytes per callsite, 15K of text size (~0.1%) saved. (as an added bonus this also removes a lockdep annotation.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] swsusp: fix panic when signature can't be readLinus Torvalds1-2/+4
Do not panic a machine when swsusp signature can't be read. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] swsusp warning fixAndrew Morton1-10/+10
kernel/power/swap.c: In function 'swsusp_write': kernel/power/swap.c:275: warning: 'start' may be used uninitialized in this function gcc isn't smart enough, so help it. Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] swsusp: do not use memcpy for snapshotting memoryRafael J. Wysocki1-2/+8
swsusp should not use memcpy for snapshotting memory, because on some architectures memcpy may increase preempt_count (i386 does this when CONFIG_X86_USE_3DNOW is set). Then, as a result, wrong value of preempt_count is stored in the image. Replace memcpy in copy_data_pages with an open-coded loop. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] adjust clock for lost ticksRoman Zippel1-38/+47
A large number of lost ticks can cause an overadjustment of the clock. To compensate for this we look at the current error and the larger the error already is the more careful we are at adjusting the error. As small extra fix reset the error when the clock is set. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Acked-by: john stultz <johnstul@us.ibm.com> Cc: Uwe Bugla <uwe.bugla@gmx.de> Cc: James Bottomley <James.Bottomley@SteelEye.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] pi-futex: Validate futex type instead of oopsingThomas Gleixner1-0/+6
Calling futex_lock_pi is called with a reference to a non PI futex and waiters exist already, lookup_pi_state() oopses due to pi_state == NULL. Check this condition and return -EINVAL to userspace. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] kernel/softirq.c: EXPORT_UNUSED_SYMBOLAdrian Bunk1-1/+1
This patch marks an unused export as EXPORT_UNUSED_SYMBOL. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] kernel/printk.c: EXPORT_SYMBOL_UNUSEDAdrian Bunk1-2/+2
This patch marks unused exports as EXPORT_SYMBOL_UNUSED. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] lockdep: core, reduce per-lock class-cache sizeIngo Molnar1-33/+54
lockdep_map is embedded into every lock, which blows up data structure sizes all around the kernel. Reduce the class-cache to be for the default class only - that is used in 99.9% of the cases and even if we dont have a class cached, the lookup in the class-hash is lockless. This change reduces the per-lock dep_map overhead by 56 bytes on 64-bit platforms and by 28 bytes on 32-bit platforms. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] lockdep: improve debug outputArjan van de Ven1-2/+3
Make lockdep print which lock is held, in the "kfree() of a live lock" scenario. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] Minor cleanup to lockdep.cAndi Kleen1-32/+12
- Use printk formatting for indentation - Don't leave NTFS in the default event filter Signed-off-by: Andi Kleen <ak@suse.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] small kernel/sched.c cleanupAndreas Mohr1-10/+7
- constify and optimize stat_nam (thanks to Michael Tokarev!) - spelling and comment fixes Signed-off-by: Andreas Mohr <andi@lisas.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-10[PATCH] sched: fix bug in __migrate_task()Peter Williams1-1/+1
Problem: In the function __migrate_task(), deactivate_task() followed by activate_task() is used to move the task from one run queue to another. This has two undesirable effects: 1. The task's priority is recalculated. (Nowhere else in the scheduler code is the priority recalculated for a change of CPU.) 2. The task's time stamp is set to the current time. At the very least, this makes the adjustment of the time stamp before the call to deactivate_task() redundant but I believe the problem is more serious as the time stamp now holds the time of the queue change instead of the time at which the task was woken. In addition, unless dest_rq is the same queue as "current" is on the time stamp could be inaccurate due to inter CPU drift. Solution: Replace the call to activate_task() with one to __activate_task(). Signed-off-by: Peter Williams <pwil3058@bigpond.net.au> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-04Merge master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreqLinus Torvalds1-25/+32
* master.kernel.org:/pub/scm/linux/kernel/git/davej/cpufreq: Move workqueue exports to where the functions are defined. [CPUFREQ] Misc cleanups in ondemand. [CPUFREQ] Make ondemand sampling per CPU and remove the mutex usage in sampling path. [CPUFREQ] Add queue_delayed_work_on() interface for workqueues. [CPUFREQ] Remove slowdown from ondemand sampling path.
2006-07-03[PATCH] revert "kthread: convert stop_machine into a kthread"Andrew Morton1-11/+6
Jiri reports that the stop_machin kthread conversion caused his machine to hang when suspending. Hyperthreading is apparently involved. I don't see why that would be and I can't reproduce it. Revert to the 2.6.17 code. Cc: "Serge E. Hallyn" <serue@us.ibm.com> Cc: Jiri Slaby <jirislaby@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpcLinus Torvalds1-1/+4
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: powerpc: add defconfig for Freescale MPC8349E-mITX board powerpc: Add base support for the Freescale MPC8349E-mITX eval board Documentation: correct values in MPC8548E SEC example node [POWERPC] Actually copy over i8259.c to arch/ppc/syslib this time [POWERPC] Add new interrupt mapping core and change platforms to use it [POWERPC] Copy i8259 code back to arch/ppc [POWERPC] New device-tree interrupt parsing code [POWERPC] Use the genirq framework [PATCH] genirq: Allow fasteoi handler to retrigger disabled interrupts [POWERPC] Update the SWIM3 (powermac) floppy driver [POWERPC] Fix error handling in detecting legacy serial ports [POWERPC] Fix booting on Momentum "Apache" board (a Maple derivative) [POWERPC] Fix various offb and BootX-related issues [POWERPC] Add a default config for 32-bit CHRP machines [POWERPC] fix implicit declaration on cell. [POWERPC] change get_property to return void *
2006-07-03[PATCH] sched: cleanup, convert sched.c-internal typedefs to structIngo Molnar1-125/+125
convert: - runqueue_t to 'struct rq' - prio_array_t to 'struct prio_array' - migration_req_t to 'struct migration_req' I was the one who added these but they are both against the kernel coding style and also were used inconsistently at places. So just get rid of them at once, now that we are flushing the scheduler patch-queue anyway. Conversion was mostly scripted, the result was reviewed and all secondary whitespace and style impact (if any) was fixed up by hand. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] sched: cleanup, remove task_t, convert to struct task_structIngo Molnar12-138/+153
cleanup: remove task_t and convert all the uses to struct task_struct. I introduced it for the scheduler anno and it was a mistake. Conversion was mostly scripted, the result was reviewed and all secondary whitespace and style impact (if any) was fixed up by hand. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] sched: clean up fallout of recent changesIngo Molnar1-166/+194
Clean up some of the impact of recent (and not so recent) scheduler changes: - turning macros into nice inline functions - sanitizing and unifying variable definitions - whitespace, style consistency, 80-lines, comment correctness, spelling and curly braces police Due to the macro hell and variable placement simplifications there's even 26 bytes of .text saved: text data bss dec hex filename 25510 4153 192 29855 749f sched.o.before 25484 4153 192 29829 7485 sched.o.after [akpm@osdl.org: build fix] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: irqtrace subsystem, move account_system_vtime() calls into ↵Paul Mackerras1-0/+4
kernel/softirq.c At the moment, powerpc and s390 have their own versions of do_softirq which include local_bh_disable() and __local_bh_enable() calls. They end up calling __do_softirq (in kernel/softirq.c) which also does local_bh_disable/enable. Apparently the two levels of disable/enable trigger a warning from some validation code that Ingo is working on, and he would like to see the outer level removed. But to do that, we have to move the account_system_vtime calls that are currently in the arch do_softirq() implementations for powerpc and s390 into the generic __do_softirq() (this is a no-op for other archs because account_system_vtime is defined to be an empty inline function on all other archs). This patch does that. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate on-stack completionsIngo Molnar1-1/+1
lockdep needs to have the waitqueue lock initialized for on-stack waitqueues implicitly initialized by DECLARE_COMPLETION(). Annotate on-stack completions accordingly. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate enable_in_hardirq()Ingo Molnar1-1/+1
Make use of local_irq_enable_in_hardirq() API to annotate places that enable hardirqs in hardirq context. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate ->mmap_semIngo Molnar1-1/+4
Teach special (recursive) locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate hrtimer base locksIngo Molnar1-1/+3
Teach special (recursive) locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate scheduler runqueue locksIngo Molnar1-0/+2
Teach per-CPU runqueue locks and recursive locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate timer base locksIngo Molnar1-0/+9
Split the per-CPU timer base locks up into separate lock classes, because they are used recursively. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate waitqueuesIngo Molnar1-0/+4
Create one lock class for all waitqueue locks in the kernel. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate genirqIngo Molnar1-0/+16
Teach special (recursive) locking code to the lock validator. Has no effect on non-lockdep kernels. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: annotate futexIngo Molnar1-10/+18
Teach special (recursive) locking code to the lock validator. Introduces double_lock_hb() to unify double- hash-bucket-lock taking. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: do not recurse in printkIngo Molnar1-5/+18
Make printk()-ing from within the lock validation code safer by using the lockdep-recursion counter. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: prove mutex locking correctnessIngo Molnar3-8/+28
Use the lock validator framework to prove mutex locking correctness. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: prove spinlock rwlock locking correctnessIngo Molnar3-9/+81
Use the lock validator framework to prove spinlock and rwlock locking correctness. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: prove rwsem locking correctnessIngo Molnar2-1/+43
Use the lock validator framework to prove rwsem locking correctness. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: procfsIngo Molnar2-0/+348
Lock validator /proc/lockdep and /proc/lockdep_stats support. (FIXME: should go into debugfs) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: allow read_lock() recursion of same classIngo Molnar1-3/+2
From: Ingo Molnar <mingo@elte.hu> lockdep so far only allowed read-recursion for the same lock instance. This is enough in the overwhelming majority of cases, but a hostap case triggered and reported by Miles Lane relies on same-class different-instance recursion. So we relax the restriction on read-lock recursion. (This change does not allow rwsem read-recursion, which is still forbidden.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: coreIngo Molnar6-0/+2796
Do 'make oldconfig' and accept all the defaults for new config options - reboot into the kernel and if everything goes well it should boot up fine and you should have /proc/lockdep and /proc/lockdep_stats files. Typically if the lock validator finds some problem it will print out voluminous debug output that begins with "BUG: ..." and which syslog output can be used by kernel developers to figure out the precise locking scenario. What does the lock validator do? It "observes" and maps all locking rules as they occur dynamically (as triggered by the kernel's natural use of spinlocks, rwlocks, mutexes and rwsems). Whenever the lock validator subsystem detects a new locking scenario, it validates this new rule against the existing set of rules. If this new rule is consistent with the existing set of rules then the new rule is added transparently and the kernel continues as normal. If the new rule could create a deadlock scenario then this condition is printed out. When determining validity of locking, all possible "deadlock scenarios" are considered: assuming arbitrary number of CPUs, arbitrary irq context and task context constellations, running arbitrary combinations of all the existing locking scenarios. In a typical system this means millions of separate scenarios. This is why we call it a "locking correctness" validator - for all rules that are observed the lock validator proves it with mathematical certainty that a deadlock could not occur (assuming that the lock validator implementation itself is correct and its internal data structures are not corrupted by some other kernel subsystem). [see more details and conditionals of this statement in include/linux/lockdep.h and Documentation/lockdep-design.txt] Furthermore, this "all possible scenarios" property of the validator also enables the finding of complex, highly unlikely multi-CPU multi-context races via single single-context rules, increasing the likelyhood of finding bugs drastically. In practical terms: the lock validator already found a bug in the upstream kernel that could only occur on systems with 3 or more CPUs, and which needed 3 very unlikely code sequences to occur at once on the 3 CPUs. That bug was found and reported on a single-CPU system (!). So in essence a race will be found "piecemail-wise", triggering all the necessary components for the race, without having to reproduce the race scenario itself! In its short existence the lock validator found and reported many bugs before they actually caused a real deadlock. To further increase the efficiency of the validator, the mapping is not per "lock instance", but per "lock-class". For example, all struct inode objects in the kernel have inode->inotify_mutex. If there are 10,000 inodes cached, then there are 10,000 lock objects. But ->inotify_mutex is a single "lock type", and all locking activities that occur against ->inotify_mutex are "unified" into this single lock-class. The advantage of the lock-class approach is that all historical ->inotify_mutex uses are mapped into a single (and as narrow as possible) set of locking rules - regardless of how many different tasks or inode structures it took to build this set of rules. The set of rules persist during the lifetime of the kernel. To see the rough magnitude of checking that the lock validator does, here's a portion of /proc/lockdep_stats, fresh after bootup: lock-classes: 694 [max: 2048] direct dependencies: 1598 [max: 8192] indirect dependencies: 17896 all direct dependencies: 16206 dependency chains: 1910 [max: 8192] in-hardirq chains: 17 in-softirq chains: 105 in-process chains: 1065 stack-trace entries: 38761 [max: 131072] combined max dependencies: 2033928 hardirq-safe locks: 24 hardirq-unsafe locks: 176 softirq-safe locks: 53 softirq-unsafe locks: 137 irq-safe locks: 59 irq-unsafe locks: 176 The lock validator has observed 1598 actual single-thread locking patterns, and has validated all possible 2033928 distinct locking scenarios. More details about the design of the lock validator can be found in Documentation/lockdep-design.txt, which can also found at: http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt [bunk@stusta.de: cleanups] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: irqtrace subsystem, coreIngo Molnar3-20/+140
Accurate hard-IRQ-flags and softirq-flags state tracing. This allows us to attach extra functionality to IRQ flags on/off events (such as trace-on/off). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: stacktrace subsystem, coreIngo Molnar2-0/+25
Framework to generate and save stacktraces quickly, without printing anything to the console. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: locking init debugging improvementIngo Molnar2-3/+3
Locking init improvement: - introduce and use __SPIN_LOCK_UNLOCKED for array initializations, to pass in the name string of locks, used by debugging Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: mutex section binutils workaroundIngo Molnar1-1/+1
Work around weird section nesting build bug causing smp-alternatives failures under certain circumstances. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: better lock debuggingIngo Molnar11-482/+104
Generic lock debugging: - generalized lock debugging framework. For example, a bug in one lock subsystem turns off debugging in all lock subsystems. - got rid of the caller address passing (__IP__/__IP_DECL__/etc.) from the mutex/rtmutex debugging code: it caused way too much prototype hackery, and lockdep will give the same information anyway. - ability to do silent tests - check lock freeing in vfree too. - more finegrained debugging options, to allow distributions to turn off more expensive debugging features. There's no separate 'held mutexes' list anymore - but there's a 'held locks' stack within lockdep, which unifies deadlock detection across all lock classes. (this is independent of the lockdep validation stuff - lockdep first checks whether we are holding a lock already) Here are the current debugging options: CONFIG_DEBUG_MUTEXES=y CONFIG_DEBUG_LOCK_ALLOC=y which do: config DEBUG_MUTEXES bool "Mutex debugging, basic checks" config DEBUG_LOCK_ALLOC bool "Detect incorrect freeing of live mutexes" Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: remove mutex deadlock checking codeIngo Molnar1-316/+0
With the lock validator we detect mutex deadlocks (and more), the mutex deadlock checking code is both redundant and slower. So remove it. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: remove DEBUG_BUG_ON()Ingo Molnar1-8/+0
cleanup: remove unused DEBUG_BUG_ON() defines. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: rename DEBUG_WARN_ON()Ingo Molnar4-26/+26
Rename DEBUG_WARN_ON() to the less generic DEBUG_LOCKS_WARN_ON() name, so that it's clear that this is a lock-debugging internal mechanism. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: clean up rwsemsIngo Molnar1-0/+105
Clean up rwsems. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] lockdep: add is_module_address()Ingo Molnar1-0/+23
Add is_module_address() method - to be used by lockdep. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] ZVC/zone_reclaim: Leave 1% of unmapped pagecache pages for file I/OChristoph Lameter1-0/+11
It turns out that it is advantageous to leave a small portion of unmapped file backed pages if all of a zone's pages (or almost all pages) are allocated and so the page allocator has to go off-node. This allows recently used file I/O buffers to stay on the node and reduces the times that zone reclaim is invoked if file I/O occurs when we run out of memory in a zone. The problem is that zone reclaim runs too frequently when the page cache is used for file I/O (read write and therefore unmapped pages!) alone and we have almost all pages of the zone allocated. Zone reclaim may remove 32 unmapped pages. File I/O will use these pages for the next read/write requests and the unmapped pages increase. After the zone has filled up again zone reclaim will remove it again after only 32 pages. This cycle is too inefficient and there are potentially too many zone reclaim cycles. With the 1% boundary we may still remove all unmapped pages for file I/O in zone reclaim pass. However. it will take a large number of read and writes to get back to 1% again where we trigger zone reclaim again. The zone reclaim 2.6.16/17 does not show this behavior because we have a 30 second timeout. [akpm@osdl.org: rename the /proc file and the variable] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-03[PATCH] genirq: Allow fasteoi handler to retrigger disabled interruptsBenjamin Herrenschmidt1-1/+4
Make the fasteoi handler mark disabled interrupts as pending if they happen anyway. This allow implementation of a delayed disable scheme with the fasteoi handler. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-07-02[PATCH] genirq:fixup missing SA_PERCPU replacementThomas Gleixner1-2/+2
The irqflags consolidation converted SA_PERCPU_IRQ to IRQF_PERCPU but did not define the new constant. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02[PATCH] genirq: ARM dyntick cleanupThomas Gleixner1-12/+1
Linus: "The hacks in kernel/irq/handle.c are really horrid. REALLY horrid." They are indeed. Move the dyntick quirks to ARM where they belong. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-02Merge branch 'genirq' of master.kernel.org:/home/rmk/linux-2.6-armLinus Torvalds2-3/+41
* 'genirq' of master.kernel.org:/home/rmk/linux-2.6-arm: (24 commits) [ARM] 3683/2: ARM: Convert at91rm9200 to generic irq handling [ARM] 3682/2: ARM: Convert ixp4xx to generic irq handling [ARM] 3702/1: ARM: Convert ixp23xx to generic irq handling [ARM] 3701/1: ARM: Convert plat-omap to generic irq handling [ARM] 3700/1: ARM: Convert lh7a40x to generic irq handling [ARM] 3699/1: ARM: Convert s3c2410 to generic irq handling [ARM] 3698/1: ARM: Convert sa1100 to generic irq handling [ARM] 3697/1: ARM: Convert shark to generic irq handling [ARM] 3696/1: ARM: Convert clps711x to generic irq handling [ARM] 3694/1: ARM: Convert ecard driver to generic irq handling [ARM] 3693/1: ARM: Convert omap1 to generic irq handling [ARM] 3691/1: ARM: Convert imx to generic irq handling [ARM] 3688/1: ARM: Convert clps7500 to generic irq handling [ARM] 3687/1: ARM: Convert integrator to generic irq handling [ARM] 3685/1: ARM: Convert pxa to generic irq handling [ARM] 3684/1: ARM: Convert l7200 to generic irq handling [ARM] 3681/1: ARM: Convert ixp2000 to generic irq handling [ARM] 3680/1: ARM: Convert footbridge to generic irq handling [ARM] 3695/1: ARM drivers/pcmcia: Fixup includes [ARM] 3689/1: ARM drivers/input/touchscreen: Fixup includes ... Manual conflict resolved in kernel/irq/handle.c (butt-ugly ARM tickless code).
2006-07-02[PATCH] irq-flags: generic irq: Use the new IRQF_ constantsThomas Gleixner3-23/+23
Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01[ARM] 3690/1: genirq: Introduce and make use of dummy irq chipThomas Gleixner2-3/+28
Patch from Thomas Gleixner From: Thomas Gleixner <tglx@linutronix.de> ARM has a couple of really dumb interrupt controllers. Implement a generic one and fixup the ARM migration. ARM reused the no_irq_chip for this purpose, but this does not work out for platforms which are not converted to the new interrupt type handling model. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-07-01[ARM] 3679/1: ARM: Make ARM dyntick implementation work with genirqThomas Gleixner1-0/+13
Patch from Thomas Gleixner From: Thomas Gleixner <tglx@linutronix.de> Make the ARM dyntick implementation work with the generic irq code. This hopefully goes away once we consolidated the dyntick implementations. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2006-07-01Merge branch 'audit.b22' of ↵Linus Torvalds3-66/+209
git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b22' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: [PATCH] audit syscall classes [PATCH] audit: support for object context filters [PATCH] audit: rename AUDIT_SE_* constants [PATCH] add rule filterkey
2006-07-01[PATCH] IRQ: warning message cleanupBjorn Helgaas1-6/+7
Make warnings more consistent. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01[PATCH] IRQ: Use SA_PERCPU_IRQ, not IRQ_PER_CPU, for irqaction.flagsBjorn Helgaas1-1/+2
IRQ_PER_CPU is a bit in the struct irq_desc "status" field, not in the struct irqaction "flags", so the previous code checked the wrong bit. SA_PERCPU_IRQ is only used by drivers/char/mmtimer.c for SGI ia64 boxes. Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01[PATCH] pi-futex: futex_wake() lockup fixIngo Molnar1-2/+4
Fix futex_wake() exit condition bug when handling the robust-list with PI futexes on them. (reported by Ulrich Drepper, debugged by the lock validator.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Ulrich Drepper <drepper@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01[PATCH] pi-futex: fix mm_struct memory leakVernon Mauery1-1/+1
lock_queue was getting called essentially twice in a row and was continually incrementing the mm_count ref count, thus causing a memory leak. Dinakar Guniguntala provided a proper fix for the problem that simply grabs the spinlock for the hash bucket queue rather than calling lock_queue. The second time we do a queue_lock in futex_lock_pi, we really only need to take the hash bucket lock. Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Signed-off-by: Vernon Mauery <vernux@us.ibm.com> Acked-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-07-01[PATCH] audit syscall classesAl Viro1-0/+39
Allow to tie upper bits of syscall bitmap in audit rules to kernel-defined sets of syscalls. Infrastructure, a couple of classes (with 32bit counterparts for biarch targets) and actual tie-in on i386, amd64 and ia64. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01[PATCH] audit: support for object context filtersDarrel Goeddel2-0/+65
This patch introduces object audit filters based on the elements of the SELinux context. Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> kernel/auditfilter.c | 25 +++++++++++++++++++++++++ kernel/auditsc.c | 40 ++++++++++++++++++++++++++++++++++++++++ security/selinux/ss/services.c | 18 +++++++++++++++++- 3 files changed, 82 insertions(+), 1 deletion(-) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01[PATCH] audit: rename AUDIT_SE_* constantsDarrel Goeddel2-30/+30
This patch renames some audit constant definitions and adds additional definitions used by the following patch. The renaming avoids ambiguity with respect to the new definitions. Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com> include/linux/audit.h | 15 ++++++++---- kernel/auditfilter.c | 50 ++++++++++++++++++++--------------------- kernel/auditsc.c | 10 ++++---- security/selinux/ss/services.c | 32 +++++++++++++------------- 4 files changed, 56 insertions(+), 51 deletions(-) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-07-01[PATCH] add rule filterkeyAmy Griffis3-36/+75
Add support for a rule key, which can be used to tie audit records to audit rules. This is useful when a watched file is accessed through a link or symlink, as well as for general audit log analysis. Because this patch uses a string key instead of an integer key, there is a bit of extra overhead to do the kstrdup() when a rule fires. However, we're also allocating memory for the audit record buffer, so it's probably not that significant. I went ahead with a string key because it seems more user-friendly. Note that the user must ensure that filterkeys are unique. The kernel only checks for duplicate rules. Signed-off-by: Amy Griffis <amy.griffis@hpd.com>
2006-06-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivialLinus Torvalds21-33/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/bunk/trivial: Remove obsolete #include <linux/config.h> remove obsolete swsusp_encrypt arch/arm26/Kconfig typos Documentation/IPMI typos Kconfig: Typos in net/sched/Kconfig v9fs: do not include linux/version.h Documentation/DocBook/mtdnand.tmpl: typo fixes typo fixes: specfic -> specific typo fixes in Documentation/networking/pktgen.txt typo fixes: occuring -> occurring typo fixes: infomation -> information typo fixes: disadvantadge -> disadvantage typo fixes: aquire -> acquire typo fixes: mecanism -> mechanism typo fixes: bandwith -> bandwidth fix a typo in the RTC_CLASS help text smb is no longer maintained Manually merged trivial conflict in arch/um/kernel/vmlinux.lds.S
2006-06-30[PATCH] cond_resched() fixAndrew Morton1-12/+13
Fix a bug identified by Zou Nan hai <nanhai.zou@intel.com>: If the system is in state SYSTEM_BOOTING, and need_resched() is true, cond_resched() returns true even though it didn't reschedule. Consequently need_resched() remains true and JBD locks up. Fix that by teaching cond_resched() to only return true if it really did call schedule(). cond_resched_lock() and cond_resched_softirq() have a problem too. If we're in SYSTEM_BOOTING state and need_resched() is true, these functions will drop the lock and will then try to call schedule(), but the SYSTEM_BOOTING state will prevent schedule() from being called. So on return, need_resched() will still be true, but cond_resched_lock() has to return 1 to tell the caller that the lock was dropped. The caller will probably lock up. Bottom line: if these functions dropped the lock, they _must_ call schedule() to clear need_resched(). Make it so. Also, uninline __cond_resched(). It's largeish, and slowpath. Acked-by: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30[PATCH] SELinux: add security hook call to kill_proc_info_as_uidDavid Quigley1-2/+5
This patch adds a call to the extended security_task_kill hook introduced by the prior patch to the kill_proc_info_as_uid function so that these signals can be properly mediated by security modules. It also updates the existing hook call in check_kill_permission. Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30[PATCH] zoned vm counters: zone_reclaim: remove ↵Christoph Lameter1-9/+0
/proc/sys/vm/zone_reclaim_interval The zone_reclaim_interval was necessary because we were not able to determine how many unmapped pages exist in a zone. Therefore we had to scan in intervals to figure out if any pages were unmapped. With the zoned counters and NR_ANON_PAGES we now know the number of pagecache pages and the number of mapped pages in a zone. So we can simply skip the reclaim if there is an insufficient number of unmapped pages. We use SWAP_CLUSTER_MAX as the boundary. Drop all support for /proc/sys/vm/zone_reclaim_interval. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-30Remove obsolete #include <linux/config.h>Jörn Engel20-20/+0
Signed-off-by: Jörn Engel <joern@wohnheim.fh-wedel.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30remove obsolete swsusp_encryptPavel Machek1-12/+0
Remove SWSUSP_ENCRYPT config option; it is no longer implemented. Signed-off-by: Pavel Machek <pavel@suse.cz> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30typo fixes: occuring -> occurringAdrian Bunk1-1/+1
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-06-30Move workqueue exports to where the functions are defined.Dave Jones1-11/+10
Signed-off-by: Dave Jones <davej@redhat.com>
2006-06-30[CPUFREQ] Add queue_delayed_work_on() interface for workqueues.Venkatesh Pallipadi1-15/+23
Add queue_delayed_work_on() interface for workqueues. Signed-off-by: Alexey Starikovskiy <alexey.y.starikovskiy@intel.com> Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Signed-off-by: Dave Jones <davej@redhat.com>
2006-06-29[NETLINK]: Encapsulate eff_cap usage within security framework.Darrel Goeddel1-4/+4
This patch encapsulates the usage of eff_cap (in netlink_skb_params) within the security framework by extending security_netlink_recv to include a required capability parameter and converting all direct usage of eff_caps outside of the lsm modules to use the interface. It also updates the SELinux implementation of the security_netlink_send and security_netlink_recv hooks to take advantage of the sid in the netlink_skb_params struct. This also enables SELinux to perform auditing of netlink capability checks. Please apply, for 2.6.18 if possible. Signed-off-by: Darrel Goeddel <dgoeddel@trustedcs.com> Signed-off-by: Stephen Smalley <sds@tycho.nsa.gov> Acked-by: James Morris <jmorris@namei.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpcLinus Torvalds1-4/+2
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (43 commits) [POWERPC] Use little-endian bit from firmware ibm,pa-features property [POWERPC] Make sure smp_processor_id works very early in boot [POWERPC] U4 DART improvements [POWERPC] todc: add support for Time-Of-Day-Clock [POWERPC] Make lparcfg.c work when both iseries and pseries are selected [POWERPC] Fix idr locking in init_new_context [POWERPC] mpc7448hpc2 (taiga) board config file [POWERPC] Add tsi108 pci and platform device data register function [POWERPC] Add general support for mpc7448hpc2 (Taiga) platform [POWERPC] Correct the MAX_CONTEXT definition powerpc: minor cleanups for mpc86xx [POWERPC] Make sure we select CONFIG_NEW_LEDS if ADB_PMU_LED is set [POWERPC] Simplify the code defining the 64-bit CPU features [POWERPC] powerpc: kconfig warning fix [POWERPC] Consolidate some of kernel/misc*.S [POWERPC] Remove unused function call_with_mmu_off [POWERPC] update asm-powerpc/time.h [POWERPC] Clean up it_lp_queue.h [POWERPC] Skip the "copy down" of the kernel if it is already at zero. [POWERPC] Add the use of the firmware soft-reset-nmi to kdump. ...
2006-06-29Merge master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6Linus Torvalds1-25/+27
* master.kernel.org:/pub/scm/linux/kernel/git/gregkh/pci-2.6: [PATCH] i386: export memory more than 4G through /proc/iomem [PATCH] 64bit Resource: finally enable 64bit resource sizes [PATCH] 64bit Resource: convert a few remaining drivers to use resource_size_t where needed [PATCH] 64bit resource: change pnp core to use resource_size_t [PATCH] 64bit resource: change pci core and arch code to use resource_size_t [PATCH] 64bit resource: change resource core to use resource_size_t [PATCH] 64bit resource: introduce resource_size_t for the start and end of struct resource [PATCH] 64bit resource: fix up printks for resources in misc drivers [PATCH] 64bit resource: fix up printks for resources in arch and core code [PATCH] 64bit resource: fix up printks for resources in pcmcia drivers [PATCH] 64bit resource: fix up printks for resources in video drivers [PATCH] 64bit resource: fix up printks for resources in ide drivers [PATCH] 64bit resource: fix up printks for resources in mtd drivers [PATCH] 64bit resource: fix up printks for resources in pci core and hotplug drivers [PATCH] 64bit resource: fix up printks for resources in networks drivers [PATCH] 64bit resource: fix up printks for resources in sound drivers [PATCH] 64bit resource: C99 changes for struct resource declarations Fixed up trivial conflict in drivers/ide/pci/cmd64x.c (the printk that was changed by the 64-bit resources had been deleted in the meantime ;)
2006-06-29[PATCH] genirq: add chip->eoi(), fastack -> fasteoiIngo Molnar1-14/+11
Clean up the fastack concept by turning it into fasteoi and introducing the ->eoi() method for chips. This also allows the cleanup of an i386 EOI quirk - now the quirk is cleanly separated from the pure ACK implementation. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Roland Dreier <rolandd@cisco.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: fasteoi handler: handle interrupt disablingBenjamin Herrenschmidt1-1/+4
Note when a disable interrupt happened with the fasteoi handler as well so that delayed disable can be implemented with fasteoi-type controllers. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: more verbose debugging on unexpected IRQ vectorsIngo Molnar2-0/+42
One frequent sign of IRQ handling bugs is the appearance of unexpected vectors. Print out all the IRQ state in that case. We dont want this patch upstream, but it is useful during initial testing. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: no_irq_type -> no_irq_chip renameIngo Molnar3-5/+5
Rename no_irq_type to no_irq_chip. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add SA_TRIGGER supportThomas Gleixner1-4/+27
Enable drivers to request an IRQ with a given irq-flow (trigger/polarity) setting. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add irq-wake (power-management) supportThomas Gleixner1-0/+21
Enable platforms to set the irq-wake (power-management) properties of an IRQ. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add handle_bad_irq()Ingo Molnar2-0/+9
Handle bad IRQ vectors via the irqchip mechanism. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add irq-chip supportThomas Gleixner3-2/+527
Enable platforms to use the irq-chip and irq-flow abstractions: allow setting of the chip, the type and provide highlevel handlers for common irq-flows. [rostedt@goodmis.org: misroute-irq: Don't call desc->chip->end because of edge interrupts] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: coreThomas Gleixner6-3/+41
Core genirq support: add the irq-chip and irq-flow abstractions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: update copyrightsIngo Molnar2-2/+7
Update/add copyrights in the generic IRQ code. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add IRQ_NOAUTOEN supportThomas Gleixner2-7/+12
Enable platforms to disable the automatic enabling of freshly set up irqs. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add IRQ_NOREQUEST supportThomas Gleixner1-1/+3
Enable platforms to disable request_irq() for certain interrupts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add IRQ_NOPROBE supportThomas Gleixner2-2/+6
Introduce IRQ_NOPROBE: enables platforms to control chip-probing. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add genirq sw IRQ-retriggerThomas Gleixner3-10/+80
Enable platforms that do not have a hardware-assisted hardirq-resend mechanism to resend them via a softirq-driven IRQ emulation mechanism. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: no_irq_type cleanupsIngo Molnar1-20/+25
Clean up no_irq_type: share the NOP functions where possible, and properly name the ack_bad() function. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: doc: handle_IRQ_event() and __do_IRQ() commentsIngo Molnar1-4/+16
Document handle_IRQ_event() and __do_IRQ(). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: add ->retrigger() irq op to consolidate hw_irq_resend()Ingo Molnar1-1/+2
Add ->retrigger() irq op to consolidate hw_irq_resend() implementations. (Most architectures had it defined to NOP anyway.) NOTE: ia64 needs testing. i386 and x86_64 tested. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: debug: better debug printout in enable_irq()Thomas Gleixner1-0/+1
Make enable_irq() debug printouts user-readable. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: turn ARCH_HAS_IRQ_PER_CPU into CONFIG_IRQ_PER_CPUIngo Molnar1-2/+2
Cleanup: change ARCH_HAS_IRQ_PER_CPU into a Kconfig method. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: merge pending_irq_cpumask[] into irq_desc[]Ingo Molnar2-8/+4
Consolidation: remove the pending_irq_cpumask[NR_IRQS] array and move it into the irq_desc[NR_IRQS].pending_mask field. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: merge irq_dir[], smp_affinity_entry[] into irq_desc[]Ingo Molnar1-13/+7
Consolidation: remove the irq_dir[NR_IRQS] and the smp_affinity_entry[NR_IRQS] arrays and move them into the irq_desc[] array. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: reduce irq_desc_t use, mark it obsoleteIngo Molnar5-14/+15
Cleanup: remove irq_desc_t use from the generic IRQ code, and mark it obsolete. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: misc code cleanupsIngo Molnar4-38/+34
Assorted code cleanups to the generic IRQ code. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: remove fastcallIngo Molnar1-2/+2
Now that i386 defaults to regparm, explicit uses of fastcall are not needed anymore. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: remove irq_descp()Ingo Molnar1-1/+1
Cleanup: remove irq_descp() - explicit use of irq_desc[] is shorter and more readable. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: cleanup: merge irq_affinity[] into irq_desc[]Ingo Molnar3-5/+6
Consolidation: remove the irq_affinity[NR_IRQS] array and move it into the irq_desc[NR_IRQS].affinity field. [akpm@osdl.org: sparc64 build fix] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: sem2mutex probe_sem -> probing_activeIngo Molnar1-4/+5
Convert the irq auto-probing semaphore to a mutex. (This allows us to find probing API usage bugs sooner, via the mutex debugging code.) Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-29[PATCH] genirq: rename desc->handler to desc->chipIngo Molnar6-33/+33
This patch-queue improves the generic IRQ layer to be truly generic, by adding various abstractions and features to it, without impacting existing functionality. While the queue can be best described as "fix and improve everything in the generic IRQ layer that we could think of", and thus it consists of many smaller features and lots of cleanups, the one feature that stands out most is the new 'irq chip' abstraction. The irq-chip abstraction is about describing and coding and IRQ controller driver by mapping its raw hardware capabilities [and quirks, if needed] in a straightforward way, without having to think about "IRQ flow" (level/edge/etc.) type of details. This stands in contrast with the current 'irq-type' model of genirq architectures, which 'mixes' raw hardware capabilities with 'flow' details. The patchset supports both types of irq controller designs at once, and converts i386 and x86_64 to the new irq-chip design. As a bonus side-effect of the irq-chip approach, chained interrupt controllers (master/slave PIC constructs, etc.) are now supported by design as well. The end result of this patchset intends to be simpler architecture-level code and more consolidation between architectures. We reused many bits of code and many concepts from Russell King's ARM IRQ layer, the merging of which was one of the motivations for this patchset. This patch: rename desc->handler to desc->chip. Originally i did not want to do this, because it's a big patch. But having both "desc->handler", "desc->handle_irq" and "action->handler" caused a large degree of confusion and made the code appear alot less clean than it truly is. I have also attempted a dual approach as well by introducing a desc->chip alias - but that just wasnt robust enough and broke frequently. So lets get over with this quickly. The conversion was done automatically via scripts and converts all the code in the kernel. This renaming patch is the first one amongst the patches, so that the remaining patches can stay flexible and can be merged and split up without having some big monolithic patch act as a merge barrier. [akpm@osdl.org: build fix] [akpm@osdl.org: another build fix] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28[PATCH] load_module() cleanupAndrew Morton1-6/+21
Undo bizarre declaration in load_module(). Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28[PATCH] Add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPLArjan van de Ven1-2/+76
Temporarily add EXPORT_UNUSED_SYMBOL and EXPORT_UNUSED_SYMBOL_GPL. These will be used as a transition measure for symbols that aren't used in the kernel and are on the way out. When a module uses such a symbol, a warning is printk'd at modprobe time. The main reason for removing unused exports is size: eacho export takes roughly between 100 and 150 bytes of kernel space in the binary. This patch gives users the option to immediately get this size gain via a config option. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-28[POWERPC] Add the use of the firmware soft-reset-nmi to kdump.David Wilder1-4/+2
With this patch, kdump uses the firmware soft-reset NMI for two purposes: 1) Initiate the kdump (take a crash dump) by issuing a soft-reset. 2) Break a CPU out of a deadlock condition that is detected during kdump processing. When a soft-reset is initiated each CPU will enter system_reset_exception() and set its corresponding bit in the global bit-array cpus_in_sr then call die(). When die() finds the CPU's bit set in cpu_in_sr crash_kexec() is called to initiate a crash dump. The first CPU to enter crash_kexec() is called the "crashing CPU". All other CPUs are "secondary CPUs". The secondary CPU's pass through to crash_kexec_secondary() and sleep. The crashing CPU waits for all CPUs to enter via soft-reset then boots the kdump kernel (see crash_soft_reset_check()) When the system crashes due to a panic or exception, crash_kexec() is called by panic() or die(). The crashing CPU sends an IPI to all other CPUs to notify them of the pending shutdown. If a CPU is in a deadlock or hung state with interrupts disabled, the IPI will not be delivered. The result being, that the kdump kernel is not booted. This problem is solved with the use of a firmware generated soft-reset. After the crashing_cpu has issued the IPI, it waits for 10 sec for all CPUs to enter crash_ipi_callback(). A CPU signifies its entry to crash_ipi_callback() by setting its corresponding bit in the cpus_in_crash bit array. After 10 sec, if one or more CPUs have not set their bit in cpus_in_crash we assume that the CPU(s) is deadlocked. The operator is then prompted to generate a soft-reset to break the deadlock. Each CPU enters the soft reset handler as described above. Two conditions must be handled at this point: 1) The system crashed because the operator generated a soft-reset. See 2) The system had crashed before the soft-reset was generated ( in the case of a Panic or oops). The first CPU to enter crash_kexec() uses the state of the kexec_lock to determine this state. If kexec_lock is already held then condition 2 is true and crash_kexec_secondary() is called, else; this CPU is flagged as the crashing CPU, the kexec_lock is acquired and crash_kexec() proceeds as described above. Each additional CPUs responding to the soft-reset will pass through crash_kexec() to kexec_secondary(). All secondary CPUs call crash_ipi_callback() readying them self's for the shutdown. When ready they clear their bit in cpus_in_sr. The crashing CPU waits in kexec_secondary() until all other CPUs have cleared their bits in cpus_in_sr. The kexec kernel boot is then started. Signed-off-by: Haren Myneni <haren@us.ibm.com> Signed-off-by: David Wilder <dwilder@us.ibm.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
2006-06-27[PATCH] Remove redundant NULL checks before [kv]free - in kernel/Jesper Juhl1-2/+1
Remove redundant kfree NULL checks from kernel/ Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] futex_requeue() optimizationSebastien Dugue1-5/+8
In futex_requeue(), when the 2 futexes keys hash to the same bucket, there is no need to move the futex_q to the end of the bucket list. Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rtmutex: Propagate priority settings into PI lock chainsThomas Gleixner2-6/+38
When the priority of a task, which is blocked on a lock, changes we must propagate this change into the PI lock chain. Therefor the chain walk code is changed to get rid of the references to current to avoid false positives in the deadlock detector, as setscheduler might be called by a task which holds the lock on which the task whose priority is changed is blocked. Also add some comments about the get/put_task_struct usage to avoid confusion. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rtmutex: Modify rtmutex-tester to test the setscheduler propagationThomas Gleixner1-14/+18
Make test suite setscheduler calls asynchronously. Remove the waits in the test cases and add a new testcase to verify the correctness of the setscheduler priority propagation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] Drop tasklist lock in do_sched_setschedulerThomas Gleixner1-1/+3
There is no need to hold tasklist_lock across the setscheduler call, when we pin the task structure with get_task_struct(). Interrupts are disabled in setscheduler anyway and the permission checks do not need interrupts disabled. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: futex_lock_pi/futex_unlock_pi supportIngo Molnar5-41/+818
This adds the actual pi-futex implementation, based on rt-mutexes. [dino@in.ibm.com: fix an oops-causing race] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: rt mutex futex apiIngo Molnar1-0/+55
Add proxy-locking rt-mutex functionality needed by pi-futexes. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: rt mutex testerThomas Gleixner4-1/+461
RT-mutex tester: scriptable tester for rt mutexes, which allows userspace scripting of mutex unit-tests (and dynamic tests as well), using the actual rt-mutex implementation of the kernel. [akpm@osdl.org: fixlet] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: rt mutex debugIngo Molnar4-0/+552
Runtime debugging functionality for rt-mutexes. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: rt mutex coreIngo Molnar6-0/+1058
Core functions for the rt-mutex subsystem. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: scheduler support for piIngo Molnar1-29/+160
Add framework to boost/unboost the priority of RT tasks. This consists of: - caching the 'normal' priority in ->normal_prio - providing a functions to set/get the priority of the task - make sched_setscheduler() aware of boosting The effective_prio() cleanups also fix a priority-calculation bug pointed out by Andrey Gelman, in set_user_nice(). has_rt_policy() fix: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Andrey Gelman <agelman@012.net.il> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: futex code cleanupsIngo Molnar2-117/+131
We are pleased to announce "lightweight userspace priority inheritance" (PI) support for futexes. The following patchset and glibc patch implements it, ontop of the robust-futexes patchset which is included in 2.6.16-mm1. We are calling it lightweight for 3 reasons: - in the user-space fastpath a PI-enabled futex involves no kernel work (or any other PI complexity) at all. No registration, no extra kernel calls - just pure fast atomic ops in userspace. - in the slowpath (in the lock-contention case), the system call and scheduling pattern is in fact better than that of normal futexes, due to the 'integrated' nature of FUTEX_LOCK_PI. [more about that further down] - the in-kernel PI implementation is streamlined around the mutex abstraction, with strict rules that keep the implementation relatively simple: only a single owner may own a lock (i.e. no read-write lock support), only the owner may unlock a lock, no recursive locking, etc. Priority Inheritance - why, oh why??? ------------------------------------- Many of you heard the horror stories about the evil PI code circling Linux for years, which makes no real sense at all and is only used by buggy applications and which has horrible overhead. Some of you have dreaded this very moment, when someone actually submits working PI code ;-) So why would we like to see PI support for futexes? We'd like to see it done purely for technological reasons. We dont think it's a buggy concept, we think it's useful functionality to offer to applications, which functionality cannot be achieved in other ways. We also think it's the right thing to do, and we think we've got the right arguments and the right numbers to prove that. We also believe that we can address all the counter-arguments as well. For these reasons (and the reasons outlined below) we are submitting this patch-set for upstream kernel inclusion. What are the benefits of PI? The short reply: ---------------- User-space PI helps achieving/improving determinism for user-space applications. In the best-case, it can help achieve determinism and well-bound latencies. Even in the worst-case, PI will improve the statistical distribution of locking related application delays. The longer reply: ----------------- Firstly, sharing locks between multiple tasks is a common programming technique that often cannot be replaced with lockless algorithms. As we can see it in the kernel [which is a quite complex program in itself], lockless structures are rather the exception than the norm - the current ratio of lockless vs. locky code for shared data structures is somewhere between 1:10 and 1:100. Lockless is hard, and the complexity of lockless algorithms often endangers to ability to do robust reviews of said code. I.e. critical RT apps often choose lock structures to protect critical data structures, instead of lockless algorithms. Furthermore, there are cases (like shared hardware, or other resource limits) where lockless access is mathematically impossible. Media players (such as Jack) are an example of reasonable application design with multiple tasks (with multiple priority levels) sharing short-held locks: for example, a highprio audio playback thread is combined with medium-prio construct-audio-data threads and low-prio display-colory-stuff threads. Add video and decoding to the mix and we've got even more priority levels. So once we accept that synchronization objects (locks) are an unavoidable fact of life, and once we accept that multi-task userspace apps have a very fair expectation of being able to use locks, we've got to think about how to offer the option of a deterministic locking implementation to user-space. Most of the technical counter-arguments against doing priority inheritance only apply to kernel-space locks. But user-space locks are different, there we cannot disable interrupts or make the task non-preemptible in a critical section, so the 'use spinlocks' argument does not apply (user-space spinlocks have the same priority inversion problems as other user-space locking constructs). Fact is, pretty much the only technique that currently enables good determinism for userspace locks (such as futex-based pthread mutexes) is priority inheritance: Currently (without PI), if a high-prio and a low-prio task shares a lock [this is a quite common scenario for most non-trivial RT applications], even if all critical sections are coded carefully to be deterministic (i.e. all critical sections are short in duration and only execute a limited number of instructions), the kernel cannot guarantee any deterministic execution of the high-prio task: any medium-priority task could preempt the low-prio task while it holds the shared lock and executes the critical section, and could delay it indefinitely. Implementation: --------------- As mentioned before, the userspace fastpath of PI-enabled pthread mutexes involves no kernel work at all - they behave quite similarly to normal futex-based locks: a 0 value means unlocked, and a value==TID means locked. (This is the same method as used by list-based robust futexes.) Userspace uses atomic ops to lock/unlock these mutexes without entering the kernel. To handle the slowpath, we have added two new futex ops: FUTEX_LOCK_PI FUTEX_UNLOCK_PI If the lock-acquire fastpath fails, [i.e. an atomic transition from 0 to TID fails], then FUTEX_LOCK_PI is called. The kernel does all the remaining work: if there is no futex-queue attached to the futex address yet then the code looks up the task that owns the futex [it has put its own TID into the futex value], and attaches a 'PI state' structure to the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware, kernel-based synchronization object. The 'other' task is made the owner of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the futex value. Then this task tries to lock the rt-mutex, on which it blocks. Once it returns, it has the mutex acquired, and it sets the futex value to its own TID and returns. Userspace has no other work to perform - it now owns the lock, and futex value contains FUTEX_WAITERS|TID. If the unlock side fastpath succeeds, [i.e. userspace manages to do a TID -> 0 atomic transition of the futex value], then no kernel work is triggered. If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on the behalf of userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes up any potential waiters. Note that under this approach, contrary to other PI-futex approaches, there is no prior 'registration' of a PI-futex. [which is not quite possible anyway, due to existing ABI properties of pthread mutexes.] Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties of futexes, and all four combinations are possible: futex, robust-futex, PI-futex, robust+PI-futex. glibc support: -------------- Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes (and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX mutexes. (PTHREAD_PRIO_PROTECT support will be added later on too, no additional kernel changes are needed for that). [NOTE: The glibc patch is obviously inofficial and unsupported without matching upstream kernel functionality.] the patch-queue and the glibc patch can also be downloaded from: http://redhat.com/~mingo/PI-futex-patches/ Many thanks go to the people who helped us create this kernel feature: Steven Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan van de Ven, Oleg Nesterov and others. Credits for related prior projects goes to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others. Clean up the futex code, before adding more features to it: - use u32 as the futex field type - that's the ABI - use __user and pointers to u32 instead of unsigned long - code style / comment style cleanups - rename hash-bucket name from 'bh' to 'hb'. I checked the pre and post futex.o object files to make sure this patch has no code effects. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] BUG() if setscheduler is called from interrupt contextSteven Rostedt1-0/+2
Thomas Gleixner is adding the call to a rtmutex function in setscheduler. This call grabs a spin_lock that is not always protected by interrupts disabled. So this means that setscheduler cant be called from interrupt context. To prevent this from happening in the future, this patch adds a BUG_ON(in_interrupt()) in that function. (Thanks to akpm <aka. Andrew Morton> for this suggestion). Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: uninline task_rq_lock()Oleg Nesterov1-1/+1
Saves 543 bytes from sched.o (gcc 3.3.3). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Con Kolivas <kernel@kolivas.org> Cc: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: mc/smt power savings sched policySiddha, Suresh B1-25/+215
sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in /sys/devices/system/cpu/ control the MC/SMT power savings policy for the scheduler. Based on the values (1-enable, 0-disable) for these controls, sched groups cpu power will be determined for different domains. When power savings policy is enabled and under light load conditions, scheduler will minimize the physical packages/cpu cores carrying the load and thus conserving power(with a perf impact based on the workload characteristics... see OLS 2005 CMP kernel scheduler paper for more details..) Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Con Kolivas <kernel@kolivas.org> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domai: Allocate sched_group structures dynamicallySrivatsa Vaddagiri1-5/+43
As explained here: http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2 there is a problem with sharing sched_group structures between two separate sched_group structures for different sched_domains. The patch has been tested and found to avoid the kernel lockup problem described in above URL. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domai: Use kmalloc_nodeSrivatsa Vaddagiri1-2/+3
The sched group structures used to represent various nodes need to be allocated from respective nodes (as suggested here also: http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html) Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domai: Don't use GFP_ATOMICSrivatsa Vaddagiri1-1/+1
Replace GFP_ATOMIC allocation for sched_group_nodes with GFP_KERNEL based allocation. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domain: handle kmalloc failureSrivatsa Vaddagiri1-61/+78
Try to handle mem allocation failures in build_sched_domains by bailing out and cleaning up thus-far allocated memory. The patch has a direct consequence that we disable load balancing completely (even at sibling level) upon *any* memory allocation failure. [Lee.Schermerhorn@hp.com: bugfix] Signed-off-by: Srivatsa Vaddagir <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: Avoid unnecessarily moving highest priority task move_tasks()Peter Williams1-5/+21
Problem: To help distribute high priority tasks evenly across the available CPUs move_tasks() does not, under some circumstances, skip tasks whose load weight is bigger than the designated amount. Because the highest priority task on the busiest queue may be on the expired array it may be moved as a result of this mechanism. Apart from not being the most desirable way to redistribute the high priority tasks (we'd rather move the second highest priority task), there is a risk that this could set up a loop with this task bouncing backwards and forwards between the two queues. (This latter possibility can be demonstrated by running a nice==-20 CPU bound task on an otherwise quiet 2 CPU system.) Solution: Modify the mechanism so that it does not override skip for the highest priority task on the CPU. Of course, if there are more than one tasks at the highest priority then it will allow the override for one of them as this is a desirable redistribution of high priority tasks. Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: modify move_tasks() to improve load balancing outcomesPeter Williams1-2/+10
Problem: The move_tasks() function is designed to move UP TO the amount of load it is asked to move and in doing this it skips over tasks looking for ones whose load weights are less than or equal to the remaining load to be moved. This is (in general) a good thing but it has the unfortunate result of breaking one of the original load balancer's good points: namely, that (within the limits imposed by the active/expired array model and the fact the expired is processed first) it moves high priority tasks before low priority ones and this means there's a good chance (see active/expired problem for why it's only a chance) that the highest priority task on the queue but not actually on the CPU will be moved to the other CPU where (as a high priority task) it may preempt the current task. Solution: Modify move_tasks() so that high priority tasks are not skipped when moving them will make them the highest priority task on their new run queue. Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: implement smpnicePeter Williams1-65/+219
Problem: The introduction of separate run queues per CPU has brought with it "nice" enforcement problems that are best described by a simple example. For the sake of argument suppose that on a single CPU machine with a nice==19 hard spinner and a nice==0 hard spinner running that the nice==0 task gets 95% of the CPU and the nice==19 task gets 5% of the CPU. Now suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and 2 nice==0 hard spinners running. The user of this system would be entitled to expect that the nice==0 tasks each get 95% of a CPU and the nice==19 tasks only get 5% each. However, whether this expectation is met is pretty much down to luck as there are four equally likely distributions of the tasks to the CPUs that the load balancing code will consider to be balanced with loads of 2.0 for each CPU. Two of these distributions involve one nice==0 and one nice==19 task per CPU and in these circumstances the users expectations will be met. The other two distributions both involve both nice==0 tasks being on one CPU and both nice==19 being on the other CPU and each task will get 50% of a CPU and the user's expectations will not be met. Solution: The solution to this problem that is implemented in the attached patch is to use weighted loads when determining if the system is balanced and, when an imbalance is detected, to move an amount of weighted load between run queues (as opposed to a number of tasks) to restore the balance. Once again, the easiest way to explain why both of these measures are necessary is to use a simple example. Suppose that (in a slight variation of the above example) that we have a two CPU system with 4 nice==0 and 4 nice=19 hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and the 4 nice==19 tasks are on the other CPU. The weighted loads for the two CPUs would be 4.0 and 0.2 respectively and the load balancing code would move 2 tasks resulting in one CPU with a load of 2.0 and the other with load of 2.2. If this was considered to be a big enough imbalance to justify moving a task and that task was moved using the current move_tasks() then it would move the highest priority task that it found and this would result in one CPU with a load of 3.0 and the other with a load of 1.2 which would result in the movement of a task in the opposite direction and so on -- infinite loop. If, on the other hand, an amount of load to be moved is calculated from the imbalance (in this case 0.1) and move_tasks() skips tasks until it find ones whose contributions to the weighted load are less than this amount it would move two of the nice==19 tasks resulting in a system with 2 nice==0 and 2 nice=19 on each CPU with loads of 2.1 for each CPU. One of the advantages of this mechanism is that on a system where all tasks have nice==0 the load balancing calculations would be mathematically identical to the current load balancing code. Notes: struct task_struct: has a new field load_weight which (in a trade off of space for speed) stores the contribution that this task makes to a CPU's weighted load when it is runnable. struct runqueue: has a new field raw_weighted_load which is the sum of the load_weight values for the currently runnable tasks on this run queue. This field always needs to be updated when nr_running is updated so two new inline functions inc_nr_running() and dec_nr_running() have been created to make sure that this happens. This also offers a convenient way to optimize away this part of the smpnice mechanism when CONFIG_SMP is not defined. int try_to_wake_up(): in this function the value SCHED_LOAD_BALANCE is used to represent the load contribution of a single task in various calculations in the code that decides which CPU to put the waking task on. While this would be a valid on a system where the nice values for the runnable tasks were distributed evenly around zero it will lead to anomalous load balancing if the distribution is skewed in either direction. To overcome this problem SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task or by the average load_weight per task for the queue in question (as appropriate). int move_tasks(): The modifications to this function were complicated by the fact that active_load_balance() uses it to move exactly one task without checking whether an imbalance actually exists. This precluded the simple overloading of max_nr_move with max_load_move and necessitated the addition of the latter as an extra argument to the function. The internal implementation is then modified to move up to max_nr_move tasks and max_load_move of weighted load. This slightly complicates the code where move_tasks() is called and if ever active_load_balance() is changed to not use move_tasks() the implementation of move_tasks() should be simplified accordingly. struct sched_group *find_busiest_group(): Similar to try_to_wake_up(), there are places in this function where SCHED_LOAD_SCALE is used to represent the load contribution of a single task and the same issues are created. A similar solution is adopted except that it is now the average per task contribution to a group's load (as opposed to a run queue) that is required. As this value is not directly available from the group it is calculated on the fly as the queues in the groups are visited when determining the busiest group. A key change to this function is that it is no longer to scale down *imbalance on exit as move_tasks() uses the load in its scaled form. void set_user_nice(): has been modified to update the task's load_weight field when it's nice value and also to ensure that its run queue's raw_weighted_load field is updated if it was runnable. From: "Siddha, Suresh B" <suresh.b.siddha@intel.com> With smpnice, sched groups with highest priority tasks can mask the imbalance between the other sched groups with in the same domain. This patch fixes some of the listed down scenarios by not considering the sched groups which are lightly loaded. a) on a simple 4-way MP system, if we have one high priority and 4 normal priority tasks, with smpnice we would like to see the high priority task scheduled on one cpu, two other cpus getting one normal task each and the fourth cpu getting the remaining two normal tasks. but with current smpnice extra normal priority task keeps jumping from one cpu to another cpu having the normal priority task. This is because of the busiest_has_loaded_cpus, nr_loaded_cpus logic.. We are not including the cpu with high priority task in max_load calculations but including that in total and avg_load calcuations.. leading to max_load < avg_load and load balance between cpus running normal priority tasks(2 Vs 1) will always show imbalanace as one normal priority and the extra normal priority task will keep moving from one cpu to another cpu having normal priority task.. b) 4-way system with HT (8 logical processors). Package-P0 T0 has a highest priority task, T1 is idle. Package-P1 Both T0 and T1 have 1 normal priority task each.. P2 and P3 are idle. With this patch, one of the normal priority tasks on P1 will be moved to P2 or P3.. c) With the current weighted smp nice calculations, it doesn't always make sense to look at the highest weighted runqueue in the busy group.. Consider a load balance scenario on a DP with HT system, with Package-0 containing one high priority and one low priority, Package-1 containing one low priority(with other thread being idle).. Package-1 thinks that it need to take the low priority thread from Package-0. And find_busiest_queue() returns the cpu thread with highest priority task.. And ultimately(with help of active load balance) we move high priority task to Package-1. And same continues with Package-0 now, moving high priority task from package-1 to package-0.. Even without the presence of active load balance, load balance will fail to balance the above scenario.. Fix find_busiest_queue to use "imbalance" when it is lightly loaded. [kernel@kolivas.org: sched: store weighted load on up] [kernel@kolivas.org: sched: add discrete weighted cpu load function] [suresh.b.siddha@intel.com: sched: remove dead code] Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Con Kolivas <kernel@kolivas.org> Cc: John Hawkes <hawkes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: CPU hotplug race vs. set_cpus_allowed()Kirill Korotaev1-4/+14
There is a race between set_cpus_allowed() and move_task_off_dead_cpu(). __migrate_task() doesn't report any err code, so task can be left on its runqueue if its cpus_allowed mask changed so that dest_cpu is not longer a possible target. Also, chaning cpus_allowed mask requires rq->lock being held. Signed-off-by: Kirill Korotaev <dev@openvz.org> Acked-By: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] unnecessary long index i in schedSteven Rostedt1-1/+2
Unless we expect to have more than 2G CPUs, there's no reason to have 'i' as a long long here. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: fix interactive ceiling codeCon Kolivas1-25/+27
The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect and not explicit enough. The sleep boost is not supposed to be any larger than without this code and the comment is not clear enough about what exactly it does, just the reason it does it. Fix it. There is a ceiling to the priority beyond which tasks that only ever sleep for very long periods cannot surpass. Fix it. Prevent the on-runqueue bonus logic from defeating the idle sleep logic. Opportunity to micro-optimise. Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: simplify bitmap definitionSteven Rostedt1-3/+1
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: fix smt nice lock contention and optimizationChen, Kenneth W1-123/+59
Initial report and lock contention fix from Chris Mason: Recent benchmarks showed some performance regressions between 2.6.16 and 2.6.5. We tracked down one of the regressions to lock contention in schedule heavy workloads (~70,000 context switches per second) kernel/sched.c:dependent_sleeper() was responsible for most of the lock contention, hammering on the run queue locks. The patch below is more of a discussion point than a suggested fix (although it does reduce lock contention significantly). The dependent_sleeper code looks very expensive to me, especially for using a spinlock to bounce control between two different siblings in the same cpu. It is further optimized: * perform dependent_sleeper check after next task is determined * convert wake_sleeping_dependent to use trylock * skip smt runqueue check if trylock fails * optimize double_rq_lock now that smt nice is converted to trylock * early exit in searching first SD_SHARE_CPUPOWER domain * speedup fast path of dependent_sleeper [akpm@osdl.org: cleanup] Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Chris Mason <mason@suse.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] cpu hotplug: make cpu_notifier related notifier calls __cpuinit onlyChandra Seetharaman1-3/+4
Make notifier_calls associated with cpu_notifier as __cpuinit. __cpuinit makes sure that the function is init time only unless CONFIG_HOTPLUG_CPU is defined. [akpm@osdl.org: section fix] Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] cpu hotplug: make [un]register_cpu_notifier init time onlyChandra Seetharaman1-3/+5
CPUs come online only at init time (unless CONFIG_HOTPLUG_CPU is defined). So, cpu_notifier functionality need to be available only at init time. This patch makes register_cpu_notifier() available only at init time, unless CONFIG_HOTPLUG_CPU is defined. This patch exports register_cpu_notifier() and unregister_cpu_notifier() only if CONFIG_HOTPLUG_CPU is defined. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] cpu hotplug: revert initdata patch submitted for 2.6.17Chandra Seetharaman6-6/+6
This patch reverts notifier_block changes made in 2.6.17 Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] cpu hotplug: revert init patch submitted for 2.6.17Chandra Seetharaman7-7/+7
In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a band-aid solution to solve that problem. In the process, i undid all the changes you both were making to ensure that these notifiers were available only at init time (unless CONFIG_HOTPLUG_CPU is defined). We deferred the real fix to 2.6.18. Here is a set of patches that fixes the XFS problem cleanly and makes the cpu notifiers available only at init time (unless CONFIG_HOTPLUG_CPU is defined). If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run time. This patch reverts the notifier_call changes made in 2.6.17 Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rcutorture: add call_rcu_bh() operationsPaul E. McKenney2-2/+48
Add operations for the call_rcu_bh() variant of RCU. Also add an rcu_batches_completed_bh() function, which is needed by rcutorture. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rcutorture: add ops vector and Classic RCU opsPaul E. McKenney1-43/+120
Add an ops vector to rcutorture, and add the ops for Classic RCU. Update the rcutorture documentation to reflect slight change to the dmesg formats. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>