path: root/kernel
Age  Commit message  Author  Files  Lines
2006-06-27[PATCH] pi-futex: rt mutex coreIngo Molnar6-0/+1058
Core functions for the rt-mutex subsystem. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: scheduler support for piIngo Molnar1-29/+160
Add a framework to boost/unboost the priority of RT tasks. This consists of:
- caching the 'normal' priority in ->normal_prio
- providing functions to set/get the priority of the task
- making sched_setscheduler() aware of boosting
The effective_prio() cleanups also fix a priority-calculation bug in set_user_nice() pointed out by Andrey Gelman. has_rt_policy() fix: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Andrey Gelman <agelman@012.net.il> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pi-futex: futex code cleanupsIngo Molnar2-117/+131
We are pleased to announce "lightweight userspace priority inheritance" (PI) support for futexes. The following patchset and glibc patch implement it, on top of the robust-futexes patchset which is included in 2.6.16-mm1.

We are calling it lightweight for 3 reasons:

- in the user-space fastpath a PI-enabled futex involves no kernel work (or any other PI complexity) at all. No registration, no extra kernel calls - just pure fast atomic ops in userspace.

- in the slowpath (in the lock-contention case), the system call and scheduling pattern is in fact better than that of normal futexes, due to the 'integrated' nature of FUTEX_LOCK_PI. [more about that further down]

- the in-kernel PI implementation is streamlined around the mutex abstraction, with strict rules that keep the implementation relatively simple: only a single owner may own a lock (i.e. no read-write lock support), only the owner may unlock a lock, no recursive locking, etc.

Priority Inheritance - why, oh why???
-------------------------------------

Many of you have heard the horror stories about the evil PI code circling Linux for years, which makes no real sense at all and is only used by buggy applications and which has horrible overhead. Some of you have dreaded this very moment, when someone actually submits working PI code ;-)

So why would we like to see PI support for futexes? We'd like to see it done purely for technological reasons. We don't think it's a buggy concept, we think it's useful functionality to offer to applications, which functionality cannot be achieved in other ways. We also think it's the right thing to do, and we think we've got the right arguments and the right numbers to prove that. We also believe that we can address all the counter-arguments as well. For these reasons (and the reasons outlined below) we are submitting this patch-set for upstream kernel inclusion.

What are the benefits of PI?

The short reply:
----------------

User-space PI helps achieve/improve determinism for user-space applications. In the best case, it can help achieve determinism and well-bound latencies. Even in the worst case, PI will improve the statistical distribution of locking-related application delays.

The longer reply:
-----------------

Firstly, sharing locks between multiple tasks is a common programming technique that often cannot be replaced with lockless algorithms. As we can see in the kernel [which is a quite complex program in itself], lockless structures are rather the exception than the norm - the current ratio of lockless vs. locky code for shared data structures is somewhere between 1:10 and 1:100. Lockless is hard, and the complexity of lockless algorithms often endangers the ability to do robust reviews of said code. I.e. critical RT apps often choose lock structures to protect critical data structures, instead of lockless algorithms. Furthermore, there are cases (like shared hardware, or other resource limits) where lockless access is mathematically impossible.

Media players (such as Jack) are an example of reasonable application design with multiple tasks (with multiple priority levels) sharing short-held locks: for example, a highprio audio playback thread is combined with medium-prio construct-audio-data threads and low-prio display-colory-stuff threads. Add video and decoding to the mix and we've got even more priority levels.
So once we accept that synchronization objects (locks) are an unavoidable fact of life, and once we accept that multi-task userspace apps have a very fair expectation of being able to use locks, we've got to think about how to offer the option of a deterministic locking implementation to user-space.

Most of the technical counter-arguments against doing priority inheritance only apply to kernel-space locks. But user-space locks are different: there we cannot disable interrupts or make the task non-preemptible in a critical section, so the 'use spinlocks' argument does not apply (user-space spinlocks have the same priority inversion problems as other user-space locking constructs). Fact is, pretty much the only technique that currently enables good determinism for userspace locks (such as futex-based pthread mutexes) is priority inheritance:

Currently (without PI), if a high-prio and a low-prio task share a lock [this is a quite common scenario for most non-trivial RT applications], even if all critical sections are coded carefully to be deterministic (i.e. all critical sections are short in duration and only execute a limited number of instructions), the kernel cannot guarantee any deterministic execution of the high-prio task: any medium-priority task could preempt the low-prio task while it holds the shared lock and executes the critical section, and could delay it indefinitely.

Implementation:
---------------

As mentioned before, the userspace fastpath of PI-enabled pthread mutexes involves no kernel work at all - they behave quite similarly to normal futex-based locks: a 0 value means unlocked, and a value==TID means locked. (This is the same method as used by list-based robust futexes.) Userspace uses atomic ops to lock/unlock these mutexes without entering the kernel.

To handle the slowpath, we have added two new futex ops:

  FUTEX_LOCK_PI
  FUTEX_UNLOCK_PI

If the lock-acquire fastpath fails [i.e. an atomic transition from 0 to TID fails], then FUTEX_LOCK_PI is called. The kernel does all the remaining work: if there is no futex-queue attached to the futex address yet, then the code looks up the task that owns the futex [it has put its own TID into the futex value], and attaches a 'PI state' structure to the futex-queue. The pi_state includes an rt-mutex, which is a PI-aware, kernel-based synchronization object. The 'other' task is made the owner of the rt-mutex, and the FUTEX_WAITERS bit is atomically set in the futex value. Then this task tries to lock the rt-mutex, on which it blocks. Once it returns, it has the mutex acquired, and it sets the futex value to its own TID and returns. Userspace has no other work to perform - it now owns the lock, and the futex value contains FUTEX_WAITERS|TID.

If the unlock-side fastpath succeeds [i.e. userspace manages to do a TID -> 0 atomic transition of the futex value], then no kernel work is triggered. If the unlock fastpath fails (because the FUTEX_WAITERS bit is set), then FUTEX_UNLOCK_PI is called, and the kernel unlocks the futex on behalf of userspace - and it also unlocks the attached pi_state->rt_mutex and thus wakes up any potential waiters. (A minimal userspace sketch of this fastpath/slowpath split follows at the end of this commit message.)

Note that under this approach, contrary to other PI-futex approaches, there is no prior 'registration' of a PI-futex. [which is not quite possible anyway, due to existing ABI properties of pthread mutexes.] Also, under this scheme, 'robustness' and 'PI' are two orthogonal properties of futexes, and all four combinations are possible: futex, robust-futex, PI-futex, robust+PI-futex.
glibc support:
--------------

Ulrich Drepper and Jakub Jelinek have written glibc support for PI-futexes (and robust futexes), enabling robust and PI (PTHREAD_PRIO_INHERIT) POSIX mutexes. (PTHREAD_PRIO_PROTECT support will be added later on too; no additional kernel changes are needed for that.) [NOTE: The glibc patch is obviously unofficial and unsupported without matching upstream kernel functionality.]

The patch-queue and the glibc patch can also be downloaded from: http://redhat.com/~mingo/PI-futex-patches/

Many thanks go to the people who helped us create this kernel feature: Steven Rostedt, Esben Nielsen, Benedikt Spranger, Daniel Walker, John Cooper, Arjan van de Ven, Oleg Nesterov and others. Credits for related prior projects go to Dirk Grambow, Inaky Perez-Gonzalez, Bill Huey and many others.

Clean up the futex code, before adding more features to it:

- use u32 as the futex field type - that's the ABI
- use __user and pointers to u32 instead of unsigned long
- code style / comment style cleanups
- rename hash-bucket name from 'bh' to 'hb'

I checked the pre and post futex.o object files to make sure this patch has no code effects.

Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Cc: Ulrich Drepper <drepper@redhat.com> Cc: Jakub Jelinek <jakub@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
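A minimal userspace sketch of the fastpath/slowpath split described in the implementation notes above (illustrative only, not the glibc code; the futex() wrapper, the pi_lock()/pi_unlock() names and the use of GCC __sync builtins are assumptions made for this example):

#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(uint32_t *uaddr, int op)
{
	return syscall(SYS_futex, uaddr, op, 0, NULL, NULL, 0);
}

static void pi_lock(uint32_t *val, uint32_t tid)
{
	/* Fastpath: atomic 0 -> TID transition, no kernel entry at all. */
	if (__sync_bool_compare_and_swap(val, 0, tid))
		return;
	/* Slowpath: the kernel attaches the pi_state/rt_mutex, blocks us,
	 * and hands the lock over with FUTEX_WAITERS|TID in the value. */
	futex(val, FUTEX_LOCK_PI);
}

static void pi_unlock(uint32_t *val, uint32_t tid)
{
	/* Fastpath: TID -> 0 transition; fails if FUTEX_WAITERS is set... */
	if (__sync_bool_compare_and_swap(val, tid, 0))
		return;
	/* ...in which case the kernel unlocks on our behalf and wakes waiters. */
	futex(val, FUTEX_UNLOCK_PI);
}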
2006-06-27[PATCH] BUG() if setscheduler is called from interrupt contextSteven Rostedt1-0/+2
Thomas Gleixner is adding a call to an rtmutex function in setscheduler. This call grabs a spin_lock that is not always taken with interrupts disabled, so setscheduler can't be called from interrupt context. To prevent this from happening in the future, this patch adds a BUG_ON(in_interrupt()) in that function. (Thanks to akpm <aka. Andrew Morton> for this suggestion.) Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
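A hedged sketch of where the guard lands (the rest of the function body is elided; only the placement of the BUG_ON is the point):

/* in kernel/sched.c, roughly: */
int sched_setscheduler(struct task_struct *p, int policy,
		       struct sched_param *param)
{
	/* The rt-mutex call below takes a spinlock that is not irq-safe,
	 * so this must never run in interrupt context. */
	BUG_ON(in_interrupt());

	/* ... rest of setscheduler unchanged ... */
}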
2006-06-27[PATCH] sched: uninline task_rq_lock()Oleg Nesterov1-1/+1
Saves 543 bytes from sched.o (gcc 3.3.3). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Con Kolivas <kernel@kolivas.org> Cc: Peter Williams <pwil3058@bigpond.net.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: mc/smt power savings sched policySiddha, Suresh B1-25/+215
sysfs entries 'sched_mc_power_savings' and 'sched_smt_power_savings' in /sys/devices/system/cpu/ control the MC/SMT power savings policy for the scheduler. Based on the values (1 = enable, 0 = disable) of these controls, the sched-group cpu power will be determined for the different domains. When the power savings policy is enabled and under light load conditions, the scheduler will minimize the number of physical packages/cpu cores carrying the load and thus conserve power (with a performance impact that depends on the workload characteristics; see the OLS 2005 CMP kernel scheduler paper for more details). Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Con Kolivas <kernel@kolivas.org> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domai: Allocate sched_group structures dynamicallySrivatsa Vaddagiri1-5/+43
As explained here: http://marc.theaimsgroup.com/?l=linux-kernel&m=114327539012323&w=2 there is a problem with sharing sched_group structures between two separate sched_group structures for different sched_domains. The patch has been tested and found to avoid the kernel lockup problem described in above URL. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domai: Use kmalloc_nodeSrivatsa Vaddagiri1-2/+3
The sched group structures used to represent various nodes need to be allocated from respective nodes (as suggested here also: http://uwsg.ucs.indiana.edu/hypermail/linux/kernel/0603.3/0051.html) Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domai: Don't use GFP_ATOMICSrivatsa Vaddagiri1-1/+1
Replace the GFP_ATOMIC allocation for sched_group_nodes with a GFP_KERNEL based allocation. Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched_domain: handle kmalloc failureSrivatsa Vaddagiri1-61/+78
Try to handle memory allocation failures in build_sched_domains by bailing out and cleaning up the memory allocated thus far. The patch has the direct consequence that we disable load balancing completely (even at the sibling level) upon *any* memory allocation failure. [Lee.Schermerhorn@hp.com: bugfix] Signed-off-by: Srivatsa Vaddagiri <vatsa@in.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: Avoid unnecessarily moving highest priority task move_tasks()Peter Williams1-5/+21
Problem:

To help distribute high priority tasks evenly across the available CPUs, move_tasks() does not, under some circumstances, skip tasks whose load weight is bigger than the designated amount. Because the highest priority task on the busiest queue may be on the expired array, it may be moved as a result of this mechanism. Apart from not being the most desirable way to redistribute the high priority tasks (we'd rather move the second highest priority task), there is a risk that this could set up a loop with this task bouncing backwards and forwards between the two queues. (This latter possibility can be demonstrated by running a nice==-20 CPU bound task on an otherwise quiet 2 CPU system.)

Solution:

Modify the mechanism so that it does not override the skip for the highest priority task on the CPU. Of course, if there is more than one task at the highest priority then it will allow the override for one of them, as this is a desirable redistribution of high priority tasks.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: modify move_tasks() to improve load balancing outcomesPeter Williams1-2/+10
Problem:

The move_tasks() function is designed to move UP TO the amount of load it is asked to move and in doing this it skips over tasks looking for ones whose load weights are less than or equal to the remaining load to be moved. This is (in general) a good thing but it has the unfortunate result of breaking one of the original load balancer's good points: namely, that (within the limits imposed by the active/expired array model and the fact that the expired array is processed first) it moves high priority tasks before low priority ones, and this means there's a good chance (see the active/expired problem for why it's only a chance) that the highest priority task on the queue but not actually on the CPU will be moved to the other CPU where (as a high priority task) it may preempt the current task.

Solution:

Modify move_tasks() so that high priority tasks are not skipped when moving them will make them the highest priority task on their new run queue.

Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: implement smpnicePeter Williams1-65/+219
Problem:

The introduction of separate run queues per CPU has brought with it "nice" enforcement problems that are best described by a simple example. For the sake of argument suppose that on a single CPU machine with a nice==19 hard spinner and a nice==0 hard spinner running, the nice==0 task gets 95% of the CPU and the nice==19 task gets 5% of the CPU. Now suppose that there is a system with 2 CPUs and 2 nice==19 hard spinners and 2 nice==0 hard spinners running. The user of this system would be entitled to expect that the nice==0 tasks each get 95% of a CPU and the nice==19 tasks only get 5% each. However, whether this expectation is met is pretty much down to luck as there are four equally likely distributions of the tasks to the CPUs that the load balancing code will consider to be balanced with loads of 2.0 for each CPU. Two of these distributions involve one nice==0 and one nice==19 task per CPU and in these circumstances the user's expectations will be met. The other two distributions both involve both nice==0 tasks being on one CPU and both nice==19 tasks being on the other CPU; each task will then get 50% of a CPU and the user's expectations will not be met.

Solution:

The solution to this problem that is implemented in the attached patch is to use weighted loads when determining if the system is balanced and, when an imbalance is detected, to move an amount of weighted load between run queues (as opposed to a number of tasks) to restore the balance. Once again, the easiest way to explain why both of these measures are necessary is to use a simple example. Suppose that (in a slight variation of the above example) we have a two CPU system with 4 nice==0 and 4 nice==19 hard spinning tasks running and that the 4 nice==0 tasks are on one CPU and the 4 nice==19 tasks are on the other CPU. The weighted loads for the two CPUs would be 4.0 and 0.2 respectively and the load balancing code would move 2 tasks resulting in one CPU with a load of 2.0 and the other with a load of 2.2. If this was considered to be a big enough imbalance to justify moving a task and that task was moved using the current move_tasks() then it would move the highest priority task that it found and this would result in one CPU with a load of 3.0 and the other with a load of 1.2 which would result in the movement of a task in the opposite direction and so on -- infinite loop. If, on the other hand, an amount of load to be moved is calculated from the imbalance (in this case 0.1) and move_tasks() skips tasks until it finds ones whose contributions to the weighted load are less than this amount, it would move two of the nice==19 tasks resulting in a system with 2 nice==0 and 2 nice==19 tasks on each CPU with loads of 2.1 for each CPU.

One of the advantages of this mechanism is that on a system where all tasks have nice==0 the load balancing calculations would be mathematically identical to the current load balancing code.

Notes:

struct task_struct: has a new field load_weight which (in a trade off of space for speed) stores the contribution that this task makes to a CPU's weighted load when it is runnable.

struct runqueue: has a new field raw_weighted_load which is the sum of the load_weight values for the currently runnable tasks on this run queue. This field always needs to be updated when nr_running is updated, so two new inline functions inc_nr_running() and dec_nr_running() have been created to make sure that this happens (a hedged sketch of this bookkeeping follows after this commit message).
This also offers a convenient way to optimize away this part of the smpnice mechanism when CONFIG_SMP is not defined.

int try_to_wake_up(): in this function the value SCHED_LOAD_SCALE is used to represent the load contribution of a single task in various calculations in the code that decides which CPU to put the waking task on. While this would be valid on a system where the nice values for the runnable tasks were distributed evenly around zero, it will lead to anomalous load balancing if the distribution is skewed in either direction. To overcome this problem SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task or by the average load_weight per task for the queue in question (as appropriate).

int move_tasks(): The modifications to this function were complicated by the fact that active_load_balance() uses it to move exactly one task without checking whether an imbalance actually exists. This precluded the simple overloading of max_nr_move with max_load_move and necessitated the addition of the latter as an extra argument to the function. The internal implementation is then modified to move up to max_nr_move tasks and max_load_move of weighted load. This slightly complicates the code where move_tasks() is called, and if ever active_load_balance() is changed to not use move_tasks() the implementation of move_tasks() should be simplified accordingly.

struct sched_group *find_busiest_group(): Similar to try_to_wake_up(), there are places in this function where SCHED_LOAD_SCALE is used to represent the load contribution of a single task and the same issues are created. A similar solution is adopted except that it is now the average per-task contribution to a group's load (as opposed to a run queue) that is required. As this value is not directly available from the group it is calculated on the fly as the queues in the groups are visited when determining the busiest group. A key change to this function is that it no longer scales down *imbalance on exit, as move_tasks() uses the load in its scaled form.

void set_user_nice(): has been modified to update the task's load_weight field when its nice value changes and also to ensure that its run queue's raw_weighted_load field is updated if it was runnable.

From: "Siddha, Suresh B" <suresh.b.siddha@intel.com>

With smpnice, sched groups with highest priority tasks can mask the imbalance between the other sched groups within the same domain. This patch fixes some of the listed scenarios by not considering the sched groups which are lightly loaded.

a) on a simple 4-way MP system, if we have one high priority and 4 normal priority tasks, with smpnice we would like to see the high priority task scheduled on one cpu, two other cpus getting one normal task each and the fourth cpu getting the remaining two normal tasks. But with current smpnice the extra normal priority task keeps jumping from one cpu to another cpu having a normal priority task. This is because of the busiest_has_loaded_cpus, nr_loaded_cpus logic: we are not including the cpu with the high priority task in max_load calculations but we are including it in the total and avg_load calculations, leading to max_load < avg_load, and load balance between cpus running normal priority tasks (2 vs 1) will always show an imbalance, so the extra normal priority task will keep moving from one cpu to another cpu having a normal priority task.

b) 4-way system with HT (8 logical processors). Package-P0: T0 has the highest priority task, T1 is idle.
Package-P1: Both T0 and T1 have 1 normal priority task each. P2 and P3 are idle. With this patch, one of the normal priority tasks on P1 will be moved to P2 or P3.

c) With the current weighted smp nice calculations, it doesn't always make sense to look at the highest weighted runqueue in the busy group. Consider a load balance scenario on a DP with HT system, with Package-0 containing one high priority and one low priority task, and Package-1 containing one low priority task (with the other thread being idle). Package-1 thinks that it needs to take the low priority thread from Package-0, and find_busiest_queue() returns the cpu thread with the highest priority task. Ultimately (with the help of active load balance) we move the high priority task to Package-1, and the same then continues with Package-0, moving the high priority task from Package-1 back to Package-0. Even without the presence of active load balance, load balance will fail to balance the above scenario. Fix find_busiest_queue to use "imbalance" when it is lightly loaded.

[kernel@kolivas.org: sched: store weighted load on up] [kernel@kolivas.org: sched: add discrete weighted cpu load function] [suresh.b.siddha@intel.com: sched: remove dead code] Signed-off-by: Peter Williams <pwil3058@bigpond.com.au> Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com> Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Con Kolivas <kernel@kolivas.org> Cc: John Hawkes <hawkes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
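A hedged sketch of the raw_weighted_load bookkeeping described in the notes above (field and helper names follow the commit text; the rest of the runqueue layout is assumed):

static inline void inc_nr_running(struct task_struct *p, struct runqueue *rq)
{
	rq->nr_running++;
	/* keep the weighted load in sync with nr_running */
	rq->raw_weighted_load += p->load_weight;
}

static inline void dec_nr_running(struct task_struct *p, struct runqueue *rq)
{
	rq->nr_running--;
	rq->raw_weighted_load -= p->load_weight;
}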
2006-06-27[PATCH] sched: CPU hotplug race vs. set_cpus_allowed()Kirill Korotaev1-4/+14
There is a race between set_cpus_allowed() and move_task_off_dead_cpu(). __migrate_task() doesn't report any error code, so a task can be left on its runqueue if its cpus_allowed mask changed so that dest_cpu is no longer a possible target. Also, changing the cpus_allowed mask requires rq->lock to be held. Signed-off-by: Kirill Korotaev <dev@openvz.org> Acked-By: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] unnecessary long index i in schedSteven Rostedt1-1/+2
Unless we expect to have more than 2G CPUs, there's no reason to have 'i' as a long long here. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: fix interactive ceiling codeCon Kolivas1-25/+27
The relationship between INTERACTIVE_SLEEP and the ceiling is not perfect and not explicit enough. The sleep boost is not supposed to be any larger than without this code and the comment is not clear enough about what exactly it does, just the reason it does it. Fix it. There is a ceiling to the priority beyond which tasks that only ever sleep for very long periods cannot surpass. Fix it. Prevent the on-runqueue bonus logic from defeating the idle sleep logic. Opportunity to micro-optimise. Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: simplify bitmap definitionSteven Rostedt1-3/+1
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] sched: fix smt nice lock contention and optimizationChen, Kenneth W1-123/+59
Initial report and lock contention fix from Chris Mason: Recent benchmarks showed some performance regressions between 2.6.16 and 2.6.5. We tracked down one of the regressions to lock contention in schedule-heavy workloads (~70,000 context switches per second). kernel/sched.c:dependent_sleeper() was responsible for most of the lock contention, hammering on the run queue locks. The patch below is more of a discussion point than a suggested fix (although it does reduce lock contention significantly). The dependent_sleeper code looks very expensive to me, especially for using a spinlock to bounce control between two different siblings in the same cpu.

It is further optimized:

* perform dependent_sleeper check after next task is determined
* convert wake_sleeping_dependent to use trylock
* skip smt runqueue check if trylock fails
* optimize double_rq_lock now that smt nice is converted to trylock
* early exit in searching first SD_SHARE_CPUPOWER domain
* speedup fast path of dependent_sleeper

[akpm@osdl.org: cleanup] Signed-off-by: Ken Chen <kenneth.w.chen@intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Chris Mason <mason@suse.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] cpu hotplug: make cpu_notifier related notifier calls __cpuinit onlyChandra Seetharaman1-3/+4
Mark the notifier_call functions associated with cpu_notifier as __cpuinit. __cpuinit makes sure that the function is init-time only unless CONFIG_HOTPLUG_CPU is defined. [akpm@osdl.org: section fix] Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] cpu hotplug: make [un]register_cpu_notifier init time onlyChandra Seetharaman1-3/+5
CPUs come online only at init time (unless CONFIG_HOTPLUG_CPU is defined), so cpu_notifier functionality needs to be available only at init time. This patch makes register_cpu_notifier() available only at init time, unless CONFIG_HOTPLUG_CPU is defined, and exports register_cpu_notifier() and unregister_cpu_notifier() only if CONFIG_HOTPLUG_CPU is defined. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
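For illustration, the shape of a notifier this change affects: a hypothetical example_* module whose callback is __cpuinit and whose registration happens from an __init function, so both can be discarded after boot when CONFIG_HOTPLUG_CPU is off (names are made up for the example):

static int __cpuinit example_cpu_callback(struct notifier_block *nb,
					  unsigned long action, void *hcpu)
{
	/* react to CPU_UP_PREPARE, CPU_ONLINE, ... as needed */
	return NOTIFY_OK;
}

static struct notifier_block example_cpu_nb = {
	.notifier_call = example_cpu_callback,
};

static int __init example_init(void)
{
	register_cpu_notifier(&example_cpu_nb);
	return 0;
}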
2006-06-27[PATCH] cpu hotplug: revert initdata patch submitted for 2.6.17Chandra Seetharaman6-6/+6
This patch reverts notifier_block changes made in 2.6.17 Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] cpu hotplug: revert init patch submitted for 2.6.17Chandra Seetharaman7-7/+7
In 2.6.17, there was a problem with cpu_notifiers and XFS. I provided a band-aid solution to solve that problem. In the process, I undid all the changes you both were making to ensure that these notifiers were available only at init time (unless CONFIG_HOTPLUG_CPU is defined). We deferred the real fix to 2.6.18. Here is a set of patches that fixes the XFS problem cleanly and makes the cpu notifiers available only at init time (unless CONFIG_HOTPLUG_CPU is defined). If CONFIG_HOTPLUG_CPU is defined then cpu notifiers are available at run time. This patch reverts the notifier_call changes made in 2.6.17. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rcutorture: add call_rcu_bh() operationsPaul E. McKenney2-2/+48
Add operations for the call_rcu_bh() variant of RCU. Also add an rcu_batches_completed_bh() function, which is needed by rcutorture. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rcutorture: add ops vector and Classic RCU opsPaul E. McKenney1-43/+120
Add an ops vector to rcutorture, and add the ops for Classic RCU. Update the rcutorture documentation to reflect slight change to the dmesg formats. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] rcutorture: catchup doc fixes for idle-hz testsPaul E. McKenney1-1/+1
This just catches the RCU torture documentation up with the recent fixes that test RCU for architectures that turn off the scheduling-clock interrupt for idle CPUs and the addition of a SUCCESS/FAILURE indication, fixing up an obsolete comment as well. Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] fix kernel-doc in kernel/ dirRandy Dunlap2-4/+4
Fix kernel-doc parameters in kernel/ Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1376): No description found for parameter 'u_abs_timeout' Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_msg_prio' Warning(/var/linsrc/linux-2617-g9//kernel/auditsc.c:1420): No description found for parameter 'u_abs_timeout' Warning(/var/linsrc/linux-2617-g9//kernel/acct.c:526): No description found for parameter 'pacct' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] spin/rwlock init cleanupsIngo Molnar1-1/+1
locking init cleanups:

- convert " = SPIN_LOCK_UNLOCKED" to spin_lock_init() or DEFINE_SPINLOCK()
- convert rwlocks in a similar manner

this patch was generated automatically.

Motivation:

- cleanliness
- lockdep needs control of lock initialization, which the open-coded variants do not give
- it's also useful for -rt and for lock debugging in general

Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
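The conversion pattern, roughly (before/after fragments for illustration, not a literal hunk from this patch; example_lock and obj are made-up names):

/* before: open-coded initializer, invisible to lockdep */
static spinlock_t example_lock = SPIN_LOCK_UNLOCKED;

/* after: static definition ... */
static DEFINE_SPINLOCK(example_lock);

/* ... or, for dynamically allocated structures, runtime initialization */
spin_lock_init(&obj->lock);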
2006-06-27[PATCH] poison: add & use more constantsRandy Dunlap1-2/+3
Add more poison values to include/linux/poison.h. It's not clear to me whether some others should be added or not, so I haven't added any of these:

./include/linux/libata.h:#define ATA_TAG_POISON 0xfafbfcfdU
./arch/ppc/8260_io/fcc_enet.c:1918: memset((char *)(&(immap->im_dprambase[(mem_addr+64)])), 0x88, 32);
./drivers/usb/mon/mon_text.c:429: memset(mem, 0xe5, sizeof(struct mon_event_text));
./drivers/char/ftape/lowlevel/ftape-ctl.c:738: memset(ft_buffer[i]->address, 0xAA, FT_BUFF_SIZE);
./drivers/block/sx8.c:/* 0xf is just arbitrary, non-zero noise; this is sorta like poisoning */

Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] vdso: randomize the i386 vDSO by moving it into a vmaIngo Molnar1-0/+12
Move the i386 VDSO down into a vma and thus randomize it. Besides the security implications, this feature also helps debuggers, which can COW a vma-backed VDSO just like a normal DSO and can thus do single-stepping and other debugging features. It's good for hypervisors (Xen, VMWare) too, which typically live in the same high-mapped address space as the VDSO, hence whenever the VDSO is used, they get lots of guest pagefaults and have to fix such guest accesses up - which slows things down instead of speeding things up (the primary purpose of the VDSO).

There's a new CONFIG_COMPAT_VDSO (default=y) option, which provides support for older glibcs that still rely on a prelinked high-mapped VDSO. Newer distributions (using glibc 2.3.3 or later) can turn this option off. Turning it off is also recommended for security reasons: attackers cannot use the predictable high-mapped VDSO page as a syscall trampoline anymore. There is a new vdso=[0|1] boot option as well, and a runtime /proc/sys/vm/vdso_enabled sysctl switch, that allows the VDSO to be turned on/off.

(This version of the VDSO-randomization patch also has working ELF coredumping; the previous patch crashed in the coredumping code.)

This code is a combined work of the exec-shield VDSO randomization code and Gerd Hoffmann's hypervisor-centric VDSO patch. Rusty Russell started this patch and I completed it.

[akpm@osdl.org: cleanups] [akpm@osdl.org: compile fix] [akpm@osdl.org: compile fix 2] [akpm@osdl.org: compile fix 3] [akpm@osdl.org: revert MAXMEM change] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Arjan van de Ven <arjan@infradead.org> Cc: Gerd Hoffmann <kraxel@suse.de> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Zachary Amsden <zach@vmware.com> Cc: Andi Kleen <ak@muc.de> Cc: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] catch valid mem range at onlining memoryKAMEZAWA Hiroyuki1-0/+38
This patch allows hot-added memory which is not aligned to a section. Currently, hot-added memory has to be aligned to the section size; considering archs with a big section size, this is not useful. When hot-added memory is registered as an iomem resource by the iomem resource patch, we can make use of that information to detect a valid memory range. Note: With this, non-aligned memory can be registered. To allow hot-adding memory with holes, we have to do more work around add_memory(). (It doesn't allow adding memory to an already existing mem section.) Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] pm_trace is dangerousAndrew Morton1-2/+11
CONFIG_PM_TRACE scrogs your RTC. Mark it as experimental, defaulting to 'off'. Also beef up the help message a bit. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-27[PATCH] kernel/acct: fix function definitionRandy Dunlap1-1/+1
kernel/acct.c:579:19: warning: non-ANSI function declaration of function 'acct_process' Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuildLinus Torvalds1-10/+1
* git://git.kernel.org/pub/scm/linux/kernel/git/sam/kbuild: (40 commits)
  kbuild: trivial fixes in Makefile
  kbuild: adding symbols in Kconfig and defconfig to TAGS
  kbuild: replace abort() with exit(1)
  kbuild: support for %.symtypes files
  kbuild: fix silentoldconfig recursion
  kbuild: add option for stripping modules while installing them
  kbuild: kill some false positives from modpost
  kbuild: export-symbol usage report generator
  kbuild: fix make -rR breakage
  kbuild: append -dirty for updated but uncommited changes
  kbuild: append git revision for all untagged commits
  kbuild: fix module.symvers parsing in modpost
  kbuild: ignore make's built-in rules & variables
  kbuild: bugfix with initramfs
  kbuild: modpost build fix
  kbuild: check license compatibility when building modules
  kbuild: export-type enhancement to modpost.c
  kbuild: add dependency on kernel.release to the package targets
  kbuild: `make kernelrelease' speedup
  kconfig: KCONFIG_OVERWRITECONFIG
  ...
2006-06-26Merge branch 'x86-64'Linus Torvalds5-3/+952
* x86-64: (83 commits)
  [PATCH] x86_64: x86_64 stack usage debugging
  [PATCH] x86_64: (resend) x86_64 stack overflow debugging
  [PATCH] x86_64: msi_apic.c build fix
  [PATCH] x86_64: i386/x86-64 Add nmi watchdog support for new Intel CPUs
  [PATCH] x86_64: Avoid broadcasting NMI IPIs
  [PATCH] x86_64: fix apic error on bootup
  [PATCH] x86_64: enlarge window for stack growth
  [PATCH] x86_64: Minor string functions optimizations
  [PATCH] x86_64: Move export symbols to their C functions
  [PATCH] x86_64: Standardize i386/x86_64 handling of NMI_VECTOR
  [PATCH] x86_64: Fix modular pc speaker
  [PATCH] x86_64: remove sys32_ni_syscall()
  [PATCH] x86_64: Do not use -ffunction-sections for modules
  [PATCH] x86_64: Add cpu_relax to apic_wait_icr_idle
  [PATCH] x86_64: adjust kstack_depth_to_print default
  [PATCH] i386/x86-64: adjust /proc/interrupts column headings
  [PATCH] x86_64: Fix race in cpu_local_* on preemptible kernels
  [PATCH] x86_64: Fix fast check in safe_smp_processor_id
  [PATCH] x86_64: x86_64 setup.c - printing cmp related boottime information
  [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status
  ...
Manual resolve of trivial conflict in arch/i386/kernel/Makefile
2006-06-26[PATCH] i386/x86-64/ia64: Move polling flag into thread_info_statusAndi Kleen1-2/+7
During some profiling I noticed that default_idle causes a lot of memory traffic. I think that is caused by the atomic operations to clear/set the polling flag in thread_info. There is actually no reason to make this atomic - only the idle thread does it to itself, other CPUs only read it. So I moved it into ti->status. Converted i386/x86-64/ia64 for now because that was the easiest way to fix ACPI which also manipulates these flags in its idle function. Cc: Nick Piggin <npiggin@novell.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Len Brown <len.brown@intel.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] x86_64: allow unwinder to build without module supportJan Beulich1-0/+4
Add proper conditionals to be able to build with CONFIG_MODULES=n. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] i386/x86-64: fall back to old-style call trace if no unwindingJan Beulich1-4/+3
If no unwinding is possible at all for a certain exception instance, fall back to the old style call trace instead of not showing any trace at all. Also, allow setting the stack trace mode at the command line. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] x86_64: reliable stack trace supportJan Beulich3-1/+931
These are the generic bits needed to enable reliable stack traces based on Dwarf2-like (.eh_frame) unwind information. Subsequent patches will enable x86-64 and i386 to make use of this. Thanks to Andi Kleen and Ingo Molnar, who pointed out several possibilities for improvement. Signed-off-by: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] x86_64: Add compat_printk and sysctl to turn off compat layer warningsAndi Kleen1-0/+11
Sometimes e.g. with crashme the compat layer warnings can be noisy. Add a way to turn them off by gating all output through compat_printk that checks a global sysctl. The default is not changed. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] sched: fix SCHED_FIFO bug in sys_sched_rr_get_interval()Peter Williams1-1/+1
The introduction of SCHED_BATCH scheduling class with a value of 3 means that the expression (p->policy & SCHED_FIFO) will return true if policy is SCHED_BATCH or SCHED_FIFO. Unfortunately, this expression is used in sys_sched_rr_get_interval() and in the absence of a comment to say that this is intentional I presume that it is unintentional and erroneous. The fix is to change the expression to (p->policy == SCHED_FIFO). Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
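A small standalone check of why the bitwise test misfires, using the policy values of that era's <linux/sched.h> (SCHED_NORMAL=0, SCHED_FIFO=1, SCHED_RR=2, SCHED_BATCH=3; the constants are redefined locally just to make the snippet self-contained):

#include <stdio.h>

#define SCHED_FIFO	1
#define SCHED_RR	2
#define SCHED_BATCH	3

int main(void)
{
	int policy = SCHED_BATCH;

	/* buggy test: 3 & 1 is non-zero, so SCHED_BATCH matches too */
	printf("bitwise:  %d\n", policy & SCHED_FIFO);
	/* fixed test: true only for SCHED_FIFO itself */
	printf("equality: %d\n", policy == SCHED_FIFO);
	return 0;
}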
2006-06-26[PATCH] coredump: copy_process: don't check SIGNAL_GROUP_EXITOleg Nesterov1-12/+0
After the previous patch SIGNAL_GROUP_EXIT implies a pending SIGKILL, we can remove this check from copy_process() because we already checked !signal_pending(). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] coredump: kill ptrace related stuffOleg Nesterov2-6/+32
With this patch zap_process() sets SIGNAL_GROUP_EXIT while sending SIGKILL to the thread group. This means that a TASK_TRACED task 1. Will be awakened by signal_wake_up(1) 2. Can't sleep again via ptrace_notify() 3. Can't go to do_signal_stop() after return from ptrace_stop() in get_signal_to_deliver() So we can remove all ptrace related stuff from coredump path. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] proc: Cleanup proc_fd_access_allowedEric W. Biederman1-3/+17
In the process of getting proc_fd_access_allowed to work it has developed a few warts, in particular the special case that always allows introspection and the special case to allow inspection of kernel threads. The special case for introspection is needed for /proc/self/mem. The special case for kernel threads really should be overridable by security modules. So consolidate these checks into ptrace.c:may_attach(). The check to always allow introspection is trivial. The check to allow access to kernel threads, and zombies, is a little trickier. mem_read and mem_write already verify that an mm exists, so it isn't needed twice. proc_fd_access_allowed just doesn't want a check that verifies task->mm exists, as it would prevent all access to kernel threads. So just move the task->mm check into ptrace_attach where it is needed for practical reasons. I did a quick audit and none of the security modules in the kernel seem to care if they are passed a task without an mm into security_ptrace. So the above move should be safe and it allows security modules to come up with more restrictive policy. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Stephen Smalley <sds@tycho.nsa.gov> Cc: Chris Wright <chrisw@sous-sol.org> Cc: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] proc: Use struct pid not struct task_refEric W. Biederman1-6/+5
Incrementally update my proc-dont-lock-task_structs-indefinitely patches so that they work with struct pid instead of struct task_ref. Mostly this is a straight 1-1 substitution. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] proc: don't lock task_structs indefinitelyEric W. Biederman1-7/+20
Every inode in /proc holds a reference to a struct task_struct. If a directory or file is opened and remains open after the task exits, this pinning continues. With 8K stacks on a 32bit machine the amount pinned per file descriptor is about 10K. Normally I would figure a reasonable per-user process limit is about 100 processes. With 80 processes, each with 1000 file descriptors, I can trigger the OOM killer on a 32bit kernel, because I have pinned about 800MB of useless data. This patch replaces the struct task_struct pointer with a pointer to a struct task_ref which has a struct task_struct pointer, so the pinning of dead tasks does not happen. The code now has to contend with the fact that the task may exit at any time, which is a little, but not much, more complicated. With this change it takes about 1000 processes each opening up 1000 file descriptors before I can trigger the OOM killer. Much better. [mlp@google.com: task_mmu small fixes] Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: Paul Jackson <pj@sgi.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Albert Cahalan <acahalan@gmail.com> Signed-off-by: Prasanna Meda <mlp@google.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] proc: Rewrite the proc dentry flush on exit optimizationEric W. Biederman2-9/+1
To keep the dcache from filling up with dead /proc entries we flush them on process exit. However over the years that code has gotten hairy with a dentry_pointer and a lock in task_struct and misdocumented as a correctness feature. I have rewritten this code to look and see if we have a corresponding entry in the dcache and if so flush it on process exit. This removes the extra fields in the task_struct and allows me to trivially handle the case of a /proc/<tgid>/task/<pid> entry as well as the current /proc/<pid> entries. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Notify page fault call chainAnil S Keshavamurthy1-7/+23
With this patch Kprobes now registers for page fault notifications only when there is an active probe registered. Once all the active probes are unregistered there is no need to be notified of page faults, and kprobes unregisters itself from the page fault notifications. Hence we will have ZERO side effects when no probes are active. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Kprobes registers for notify page faultAnil S Keshavamurthy1-0/+8
Kprobes now registers for page fault notifications. Signed-off-by: Anil S Keshavamurthy <anil.s.keshavmurthy@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Kprobe: multi kprobe posthandler for boostermao, bibo1-8/+24
If there are multiple kprobes on the same probe point, there will be one extra aggr_kprobe at the head of the kprobe list. The aggr_kprobe has aggr_post_handler/aggr_break_handler whether or not the other kprobes' post_handler/break_handler is NULL. This patch modifies this: only when there is at least one kprobe in the list whose post_handler is not NULL will the post_handler of the aggr_kprobe be set to aggr_post_handler. [soshima@redhat.com: !CONFIG_PREEMPT fix] Signed-off-by: bibo, mao <bibo.mao@intel.com> Cc: Masami Hiramatsu <hiramatu@sdl.hitachi.co.jp> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: "Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com> Cc: Prasanna S Panchamukhi <prasanna@in.ibm.com> Cc: Jim Keniston <jkenisto@us.ibm.com> Cc: Yumiko Sugita <sugita@sdl.hitachi.co.jp> Cc: Hideo Aoki <haoki@redhat.com> Signed-off-by: Satoshi Oshima <soshima@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
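A hedged sketch of the new rule: the aggregate handlers are only hooked up if some kprobe on the list actually has one of its own (handler names follow the commit text; the helper name and the surrounding registration path are assumptions):

static void example_add_aggr_handlers(struct kprobe *ap, struct kprobe *p)
{
	/* only install the aggregate post handler if this kprobe needs it */
	if (p->post_handler && !ap->post_handler)
		ap->post_handler = aggr_post_handler;
	/* same rule for the break handler */
	if (p->break_handler && !ap->break_handler)
		ap->break_handler = aggr_break_handler;
}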
2006-06-26[PATCH] fix and optimize clock source updateRoman Zippel1-42/+109
This fixes the clock source updates in update_wall_time() to correctly track the time coming in via current_tick_length(). Optimize the fast paths to be as short as possible to keep the overhead low. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Acked-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] time: rename clocksource functionsjohn stultz3-19/+19
As suggested by Roman Zippel, change the clocksource functions to use clocksource_xyz rather than xyz_clocksource to avoid polluting the namespace. Signed-off-by: John Stultz <johnstul@us.ibm.com> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
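The rename pattern this applies, shown against the jiffies clocksource (diff-style illustration only; the old spelling register_clocksource() is an assumption based on the xyz_clocksource naming the commit describes):

-	register_clocksource(&clocksource_jiffies);
+	clocksource_register(&clocksource_jiffies);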
2006-06-26[PATCH] Time: i386 Clocksource Driversjohn stultz1-3/+8
Implement the time sources for i386 (acpi_pm, cyclone, hpet, pit, and tsc). With this patch, the conversion of the i386 arch to the generic timekeeping code should be complete. The patch should be fairly straightforward, only adding the new clocksources. [hirofumi@mail.parknet.co.jp: acpi_pm cleanup] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Time: Introduce arch generic time accessorsjohn stultz2-0/+172
Introduces clocksource switching code and the arch generic time accessor functions that use the clocksource infrastructure. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Time: Use clocksource abstraction for NTP adjustmentsjohn stultz1-19/+28
Instead of incrementing xtime by tick_nsec + ntp adjustments, use the clocksource abstraction to increment and scale time. Using the clocksource abstraction allows other clocksources to be used consistently in the face of late or lost ticks, while preserving the existing behavior via the jiffies clocksource. This removes the need to keep time_phase adjustments, as we just use the current_tick_length() function as the NTP interface and accumulate time using shifted nanoseconds. The basics of this design were by Roman Zippel; however, this is my own interpretation and implementation, so the credit should go to him and the blame to me. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Time: Let user request precision from current_tick_length()john stultz1-4/+17
Change the current_tick_length() function so it takes an argument which specifies how much precision to return in shifted nanoseconds. This provides a simple way to convert from NTP's internal nanoseconds shifted by (SHIFT_SCALE - 10) to the other shifted nanosecond units that are used by the clocksource abstraction. Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Time: Use clocksource infrastructure for update_wall_timejohn stultz3-15/+83
Modify the update_wall_time function so it increments time using the clocksource abstraction instead of jiffies. Since the only clocksource driver currently provided is the jiffies clocksource, this should result in no functional change. Additionally, timekeeping_init and timekeeping_resume functions have been added to initialize and maintain some of the new timekeeping state. [hirofumi@mail.parknet.co.jp: fixlet] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-26[PATCH] Time: Clocksource Infrastructurejohn stultz3-0/+418
This introduces the clocksource management infrastructure. A clocksource is a driver-like architecture generic abstraction of a free-running counter. This code defines the clocksource structure, and provides management code for registering, selecting, accessing and scaling clocksources. Additionally, this includes the trivial jiffies clocksource, a lowest common denominator clocksource, provided mainly for use as an example. [hirofumi@mail.parknet.co.jp: Don't enable IRQ too early] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
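A minimal registration sketch against this infrastructure, loosely modeled on the jiffies clocksource the commit mentions (the example_* names, EXAMPLE_COUNTER_HZ and the rating/shift values are assumptions; the structure fields and helpers shown are the ones this series introduces, as far as I can tell):

static cycle_t example_read(void)
{
	/* read a hypothetical free-running 32-bit hardware counter */
	return (cycle_t)example_counter_read();
}

static struct clocksource clocksource_example = {
	.name	= "example",
	.rating	= 100,			/* lowish: better sources win selection */
	.read	= example_read,
	.mask	= CLOCKSOURCE_MASK(32),
	.mult	= 0,			/* filled in from the counter frequency */
	.shift	= 22,
};

static int __init example_clocksource_init(void)
{
	clocksource_example.mult =
		clocksource_hz2mult(EXAMPLE_COUNTER_HZ, clocksource_example.shift);
	return clocksource_register(&clocksource_example);
}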
2006-06-26[PATCH] Convert kernel/cpu.c to mutexesIngo Molnar1-5/+5
Convert kernel/cpu.c from semaphore to mutex. I've reviewed all lock_cpu_hotplug() critical sections, and they all seem to fit mutex semantics. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
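The shape of such a semaphore-to-mutex conversion (illustrative before/after fragments with a made-up lock name, not the literal kernel/cpu.c diff):

/* before: a semaphore used as a sleeping lock */
static DECLARE_MUTEX(example_lock);	/* despite the name, this is a semaphore */
down(&example_lock);
/* ... critical section ... */
up(&example_lock);

/* after: a real mutex, with stricter semantics and debugging support */
static DEFINE_MUTEX(example_lock);
mutex_lock(&example_lock);
/* ... critical section ... */
mutex_unlock(&example_lock);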
2006-06-26[PATCH] work around ppc64 bootup bug by making mutex-debugging save/restore irqsIngo Molnar4-37/+27
It seems ppc64 wants to lock mutexes in early bootup code, with interrupts disabled, and they expect interrupts to stay disabled, else they crash. Work around this bug by making mutex debugging variants save/restore irq flags. Signed-off-by: Ingo Molnar <mingo@elte.hu> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25Revert "swsusp special saveable pages support" commitsLinus Torvalds3-116/+18
This reverts commits

  3e3318dee0878d42ed62a19c292a2ac284135db3 [PATCH] swsusp: x86_64 mark special saveable/unsaveable pages
  b6370d96e09944c6e3ae8d5743ca8a8ab1f79f6c [PATCH] swsusp: i386 mark special saveable/unsaveable pages
  ce4ab0012b32c1a4a1d6e934aeb73bf3151c48d9 [PATCH] swsusp: add architecture special saveable pages support

because not only do they apparently cause page faults on x86, the infrastructure doesn't compile on powerpc. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25Fix PM_TRACE dependency: works only on 32-bit x86 for nowLinus Torvalds1-1/+1
Not that x86-64 and other architecture support should be difficult to add (trivial fixups to the data format and add the proper linker script entry). Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] pacct: none-delayed process accounting accumulationKaiGai Kohei1-13/+11
In the current 2.6.17 implementation, the signal_struct referred to from task_struct is used as the per-process data structure. The pacct facility also uses it as a per-process data structure to store stime, utime, minflt and majflt. But those members are saved in __exit_signal(), which is too late. For example, if several threads exit at the same time, the pacct facility may drop the accounting for some of those threads (see 'The results of original 2.6.17 kernel' in the original report). I think accounting information should be completely collected into the per-process data structure before writing out an accounting record. This patch fixes this matter: accumulation of stime, utime, minflt and majflt is done before the accounting record is generated. [mingo@elte.hu: fix acct_collect() siglock bug found by lockdep] Signed-off-by: KaiGai Kohei <kaigai@ak.jp.nec.com> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] pacct: avoidance to refer the last thread as a representation of the ↵KaiGai Kohei2-20/+26
process When the pacct facility generates the 'ac_flag' field of an accounting record, it refers to the task_struct of the thread which died last in the process; all other task_structs are ignored. Therefore, the pacct facility drops the ASU flag even if root-privilege operations were used by threads other than the last one. In addition, the AFORK flag is always set when the group leader's thread did not die last, even though the process called execve() after fork(). The same problem exists for ac_exitcode: the recorded ac_exitcode is the exit code of the last thread in the process, which may not be the group leader's.
2006-06-25[PATCH] pacct: add pacct_struct to fix some pacct bugs.KaiGai Kohei3-16/+40
The pacct facility needs an I/O operation when an accounting record is generated, which can wake up the OOM killer. If the OOM killer is activated, it kills some processes to make them release their memory regions. But acct_process() is called in the killed process's context before exit_mm() is called, so those processes cannot release their own memory. As a result, the processes get stuck at this point and the system eventually stalls.
2006-06-25[PATCH] kthread: move kernel-doc and put it into DocBookRandy Dunlap1-1/+60
Move kthread API kernel-doc from kthread.h to kthread.c & fix it. Add kthread API to kernel-api DocBook. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] ktime/hrtimer: fix kernel-doc commentsRandy Dunlap1-10/+1
Fix kernel-doc formatting in ktime.h and hrtimer.[ch] files. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] cpu hotplug: fix CPU_UP_CANCEL handlingHeiko Carstens4-0/+8
If a cpu hotplug callback fails on CPU_UP_PREPARE, all callbacks will be called with CPU_UP_CANCELED. A few of these callbacks assume that on CPU_UP_PREPARE a pointer to task has been stored in a percpu array. This assumption is not true if CPU_UP_PREPARE fails and the following calls to kthread_bind() in CPU_UP_CANCELED will cause an addressing exception because of passing a NULL pointer. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Ashok Raj <ashok.raj@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] kthread: convert stop_machine into a kthreadSerge E. Hallyn1-6/+11
- Update stop_machine.c to spawn stop_machine as kthreads rather than the deprecated kernel_threads. - Update stop_machine to use the more efficient kthread_bind() before running the task, instead of set_cpus_allowed() afterwards. [akpm@osdl.org: remove now-wrong set_cpus_allowed()] Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
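As an illustration of the pattern (not the actual stop_machine code; the names are made up), spawning a helper with the kthread API and binding it to a cpu before it ever runs looks roughly like this:

    #include <linux/kthread.h>
    #include <linux/sched.h>
    #include <linux/err.h>

    static int my_percpu_worker(void *data)
    {
        while (!kthread_should_stop()) {
            set_current_state(TASK_INTERRUPTIBLE);
            schedule();
        }
        return 0;
    }

    static struct task_struct *my_spawn_on(int cpu)
    {
        struct task_struct *p;

        p = kthread_create(my_percpu_worker, NULL, "my_worker/%d", cpu);
        if (IS_ERR(p))
            return p;
        /* bind before the thread runs, instead of set_cpus_allowed() afterwards */
        kthread_bind(p, cpu);
        wake_up_process(p);
        return p;
    }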
2006-06-25[PATCH] Link error when futexes are disabled on 64bit architecturesAnton Blanchard1-1/+1
If futexes are disabled we fail to link on ppc64. Signed-off-by: Anton Blanchard <anton@samba.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] N32 sigset and __COMPAT_ENDIAN_SWAP__akpm@osdl.org1-7/+0
I'm testing glibc on MIPS64, little-endian, N32, O32 and N64 multilibs. Among the NPTL test failures seen are some arising from sigsuspend problems for N32: it blocks the wrong signals, so SIGCANCEL (SIGRTMIN) is blocked despite glibc's carefully excluding it from sets of signals to block. Specifically, testing suggests it blocks signal N^32 instead of signal N, so (in the example tested) blocking SIGUSR1 (17) blocks signal 49 instead. glibc's sigset_t uses an array of unsigned long, as does the kernel. In both cases, signal N+1 is represented as (1UL << (N % (8 * sizeof (unsigned long)))) in word number (N / (8 * sizeof (unsigned long))). Thus the N32 glibc uses an array of 32-bit words and the N64 kernel uses an array of 64-bit words. For little-endian, the layout is the same, with signals 1-32 in the first 4 bytes, signals 33-64 in the second, etc.; for big-endian, userspace has that layout while in the kernel each 8 bytes have the two halves swapped from the userspace layout. The N32 sigsuspend syscall uses sigset_from_compat to convert the userspace sigset to kernel format. If __COMPAT_ENDIAN_SWAP__ is *not* set, this uses logic of the form set->sig[0] = compat->sig[0] | (((long)compat->sig[1]) << 32 ) to convert the userspace sigset to a kernel one. This looks correct to me for both big and little endian, given that in userspace compat->sig[1] will represent signals 33-64, and so will the high 32 bits of set->sig[0] in the kernel. If however __COMPAT_ENDIAN_SWAP__ *is* set, as it is for __MIPSEL__, it uses set->sig[0] = compat->sig[1] | (((long)compat->sig[0]) << 32 ); which seems incorrect for both big and little endian, and would explain the observed symptoms. This code is the only use of __COMPAT_ENDIAN_SWAP__, so if incorrect then that macro serves no purpose, in which case something like the following patch would seem appropriate to remove it. Signed-off-by: Joseph Myers <joseph@codesourcery.com> Signed-off-by: Ralf Baechle <ralf@linux-mips.org> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] Get rid of /proc/sys/procStephen Hemminger1-11/+0
The table is empty, why does it still exist? Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] printk time parameterJan Engelhardt1-0/+2
Currently, enabling/disabling printk timestamps is only possible through reboot (bootparam) or recompile. I normally do not run with timestamps (since syslog handles that in a good manner), but for measuring small kernel delays (e.g. irq probing - see parport thread) I needed subsecond precision, and only for a few minutes rather than for all kernel messages. The following patch adds a module_param() with which the timestamps can be enabled/disabled in a live system through /sys/module/printk/parameters/printk_time. Signed-off-by: Jan Engelhardt <jengelh@gmx.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
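The mechanism boils down to a module_param() on the existing flag; a minimal sketch (the variable name and permissions are assumptions, not necessarily what the patch uses):

    #include <linux/moduleparam.h>
    #include <linux/stat.h>

    static int printk_time;     /* 0 = plain messages, 1 = prefix each message with a timestamp */

    /* shows up under /sys/module/printk/parameters/, writable by root at run time */
    module_param(printk_time, int, S_IRUGO | S_IWUSR);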
2006-06-25[PATCH] Remove unnecessary NULL check in kernel/acct.cMatt Helsley1-5/+3
copy_process() appears to be the only caller of acct_clear_integrals() and does not pass in NULL task pointers. Remove the unnecessary check. Signed-off-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] constify parts of kernel/power/Andreas Mohr2-3/+3
Signed-off-by: Andreas Mohr <andi@lisas.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] schedule_on_each_cpu(): reduce kmalloc() sizeAndrew Morton1-8/+20
schedule_on_each_cpu() presently does a large kmalloc - 96 kbytes on 1024 CPU 64-bit. Rework it so that we do one 8192-byte allocation and then a pile of tiny ones, via alloc_percpu(). This has a much higher chance of success (100% in the current VM). This also has the effect of reducing the memory requirements from NR_CPUS*n to num_possible_cpus()*n. Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
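The shape of the rework, as a rough sketch (the helper name is made up and the per-cpu work initialisation is elided):

    #include <linux/percpu.h>
    #include <linux/cpumask.h>
    #include <linux/workqueue.h>

    static int my_schedule_on_each_cpu(void)
    {
        struct work_struct *works;
        int cpu;

        /* many tiny per-cpu objects instead of one NR_CPUS-sized kmalloc */
        works = alloc_percpu(struct work_struct);
        if (!works)
            return -ENOMEM;

        for_each_online_cpu(cpu) {
            struct work_struct *work = per_cpu_ptr(works, cpu);
            /* ... INIT_WORK() on 'work' and queue it on 'cpu' here ... */
        }

        free_percpu(works);
        return 0;
    }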
2006-06-25[PATCH] kernel/sys.c: cleanupsAdrian Bunk2-5/+3
- proper prototypes for the following functions: - ctrl_alt_del() (in include/linux/reboot.h) - getrusage() (in include/linux/resource.h) - make the following needlessly global functions static: - kernel_restart_prepare() - kernel_kexec() [akpm@osdl.org: compile fix] Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] Make printk work for really early debuggingMichael Ellerman1-17/+33
Currently printk is no use for early debugging because it refuses to actually print anything to the console unless cpu_online(smp_processor_id()) is true. The stated explanation is that console drivers may require per-cpu resources, or otherwise barf, because the system is not yet setup correctly. Fair enough. However some console drivers might be quite happy running early during boot, in fact we have one, and so it'd be nice if printk understood that. So I added a flag (which I would have called CON_BOOT, but that's taken) called CON_ANYTIME, which indicates that a console is happy to be called anytime, even if the cpu is not yet online. Tested on a Power 5 machine, with both a CON_ANYTIME driver and a bogus console driver that BUG()s if called while offline. No problems AFAICT. Built for i386 UP & SMP. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
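Conceptually, the console-output loop in printk gains a per-console escape hatch. A hedged sketch of the check (not the exact kernel code; 'text' and 'len' stand for the formatted message):

    struct console *con;

    for (con = console_drivers; con; con = con->next) {
        /* skip consoles that need a fully initialised cpu, unless they opted in */
        if (!cpu_online(smp_processor_id()) && !(con->flags & CON_ANYTIME))
            continue;
        if ((con->flags & CON_ENABLED) && con->write)
            con->write(con, text, len);
    }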
2006-06-25[PATCH] Allow raw_notifier callouts to unregister themselvesAlan Stern1-2/+3
Since raw_notifier chains don't benefit from any centralized locking protections, they shouldn't suffer from the associated limitations. Under some circumstances it might make sense for a raw_notifier callout routine to unregister itself from the notifier chain. This patch (as678) changes the notifier core to allow for such things. Signed-off-by: Alan Stern <stern@rowland.harvard.edu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] Define __raw_get_cpu_var and use itPaul Mackerras4-6/+6
There are several instances of per_cpu(foo, raw_smp_processor_id()), which is semantically equivalent to __get_cpu_var(foo) but without the warning that smp_processor_id() can give if CONFIG_DEBUG_PREEMPT is enabled. For those architectures with optimized per-cpu implementations, namely ia64, powerpc, s390, sparc64 and x86_64, per_cpu() turns into more and slower code than __get_cpu_var(), so it would be preferable to use __get_cpu_var on those platforms. This defines a __raw_get_cpu_var(x) macro which turns into per_cpu(x, raw_smp_processor_id()) on architectures that use the generic per-cpu implementation, and turns into __get_cpu_var(x) on the architectures that have an optimized per-cpu implementation. Signed-off-by: Paul Mackerras <paulus@samba.org> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
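A simplified sketch of the two definitions described above (the real ones live in the generic and per-architecture percpu headers):

    /* generic per-cpu implementation: index the per-cpu data explicitly,
     * using the raw cpu number so that CONFIG_DEBUG_PREEMPT stays quiet */
    #define __raw_get_cpu_var(var)  per_cpu(var, raw_smp_processor_id())

    /* architectures with an optimised per-cpu implementation reuse it instead:
     *     #define __raw_get_cpu_var(var)  __get_cpu_var(var)
     */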
2006-06-25[PATCH] ensure NULL deref can't possibly happen in is_exported()Jesper Juhl1-1/+1
If CONFIG_KALLSYMS is defined and if it should happen that is_exported() is given a NULL 'mod' and lookup_symbol(name, __start___ksymtab, __stop___ksymtab) returns 0, then we'll end up dereferencing a NULL pointer. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-24Add some basic resume trace facilitiesLinus Torvalds1-0/+9
Considering that there isn't a lot of hw we can depend on during resume, this is about as good as it gets. This is x86-only for now, although the basic concept (and most of the code) will certainly work on almost any platform. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] More BUG_ON conversionEric Sesterhenn1-3/+2
Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Bartlomiej Zolnierkiewicz <B.Zolnierkiewicz@elka.pw.edu.pl> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: James Bottomley <James.Bottomley@steeleye.com> Acked-by: "Salyzyn, Mark" <mark_salyzyn@adaptec.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] adjust handle_IRQ_event() return typeJan Beulich1-2/+3
Correct the return type of handle_IRQ_event() (inconsistency noticed during Xen development), and remove redundant declarations. The return type adjustment required breaking out the definition of irqreturn_t into a separate header, in order to satisfy current include order dependencies. Signed-off-by: Jan Beulich <jbeulich@novell.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Russell King <rmk@arm.linux.org.uk> Cc: Ian Molton <spyro@f2s.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Hirokazu Takata <takata.hirokazu@renesas.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Miles Bader <uclinux-v850@lsi.nec.co.jp> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] When CONFIG_BASE_SMALL=1, cascade() may enter an infinite loopPorpoise1-13/+9
When CONFIG_BASE_SMALL=1, cascade() may enter an infinite loop. Because of CONFIG_BASE_SMALL=1 (TVR_BITS=6 and TVN_BITS=4), the list base->tv5 may cascade into base->tv5 itself, so the kernel enters an infinite loop in the function cascade(). I created a test module to verify this bug, and a patch to fix it.

    #include <linux/kernel.h>
    #include <linux/module.h>
    #include <linux/init.h>
    #include <linux/timer.h>
    #if 0
    #include <linux/kdb.h>
    #else
    #define kdb_printf printk
    #endif

    #define TVN_BITS (CONFIG_BASE_SMALL ? 4 : 6)
    #define TVR_BITS (CONFIG_BASE_SMALL ? 6 : 8)
    #define TVN_SIZE (1 << TVN_BITS)
    #define TVR_SIZE (1 << TVR_BITS)
    #define TVN_MASK (TVN_SIZE - 1)
    #define TVR_MASK (TVR_SIZE - 1)
    #define TV_SIZE(N) (N*TVN_BITS + TVR_BITS)

    struct timer_list timer0;
    struct timer_list dummy_timer1;
    struct timer_list dummy_timer2;

    void dummy_timer_fun(unsigned long data) { }

    unsigned long j=0;

    void check_timer_base(unsigned long data)
    {
        kdb_printf("check_timer_base %08x\n",jiffies);
        mod_timer(&timer0,(jiffies & (~0xFFF)) + 0x1FFF);
    }

    int init_module(void)
    {
        init_timer(&timer0);
        timer0.data = (unsigned long)0;
        timer0.function = check_timer_base;
        mod_timer(&timer0,jiffies+1);

        init_timer(&dummy_timer1);
        dummy_timer1.data = (unsigned long)0;
        dummy_timer1.function = dummy_timer_fun;
        init_timer(&dummy_timer2);
        dummy_timer2.data = (unsigned long)0;
        dummy_timer2.function = dummy_timer_fun;

        j=jiffies;
        j&=(~((1<<TV_SIZE(3))-1));
        j+=(1<<TV_SIZE(3));
        j+=(1<<TV_SIZE(4));
        kdb_printf("mod_timer %08x\n",j);
        mod_timer(&dummy_timer1, j );
        mod_timer(&dummy_timer2, j );
        return 0;
    }

    void cleanup_module()
    {
        del_timer_sync(&timer0);
        del_timer_sync(&dummy_timer1);
        del_timer_sync(&dummy_timer2);
    }

(Cleanups from Oleg) [oleg@tv-sign.ru: use list_replace_init()] Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] list: use list_replace_init() instead of list_splice_init()Oleg Nesterov2-6/+6
list_splice_init(list, head) does unneeded work if it is known that list_empty(head) == 1. We can use list_replace_init() instead. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
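The typical pattern this enables, sketched with hypothetical names: a shared list is handed off to an empty local head in one step, rather than spliced onto it.

    #include <linux/list.h>
    #include <linux/spinlock.h>

    static LIST_HEAD(pending);          /* shared list, protected by my_lock */
    static DEFINE_SPINLOCK(my_lock);

    static void my_process_pending(void)
    {
        LIST_HEAD(local);               /* known to be empty */

        spin_lock_irq(&my_lock);
        /* was: list_splice_init(&pending, &local); */
        list_replace_init(&pending, &local);
        spin_unlock_irq(&my_lock);

        /* walk 'local' without the lock held ... */
    }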
2006-06-23[PATCH] Doc: add audit & acct to DocBookRandy Dunlap1-0/+1
Fix one audit kernel-doc description (one parameter was missing). Add audit*.c interfaces to DocBook. Add BSD accounting interfaces to DocBook. Signed-off-by: Randy Dunlap <rdunlap@xenotime.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] Make RCU API inaccessible to non-GPL Linux kernel modulesPaul E. McKenney1-11/+2
Remove synchronize_kernel() (deprecated 2-APR-2005 in http://lkml.org/lkml/2005/4/3/11) and make the RCU API inaccessible to non-GPL Linux kernel modules (as was announced more than one year ago in http://lkml.org/lkml/2005/4/3/8). Tested on x86 and ppc64. Signed-off-by: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] kernel/sys.c doesn't need init.hJes Sorensen1-1/+0
kernel/sys.c doesn't have anything in it relying on linux/init.h - remove the include. Signed-off-by: Jes Sorensen <jes@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] CONFIG_NET=n build fixAndrew Morton1-1/+1
Cc: Greg KH <greg@kroah.com> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] make noirqdebug/irqfixup __read_mostly, add (un)likely()Andreas Mohr1-6/+6
Signed-off-by: Andreas Mohr <andi@lisas.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] invert irq/migration.c branch predictionDaniel Walker1-2/+2
If you get to that point in the code it means that desc->move_irq is set, so pending_irq_cpumask[irq] and cpu_online_map should both have a value. There is still a pretty good chance that ANDing those two will yield a value as well. So these two branch predictions should be inverted. Signed-off-by: Daniel Walker <dwalker@mvista.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
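In other words, the change flips annotations of the following form (the condition shown is only illustrative, not the exact code):

    /* before: hinted that the combined mask is usually empty */
    if (unlikely(!cpus_empty(tmp)))
        do_the_move();

    /* after: the non-empty case is the common one */
    if (likely(!cpus_empty(tmp)))
        do_the_move();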
2006-06-23[PATCH] cond_resched() might_sleep() fixIngo Molnar1-0/+3
add the __might_sleep() check back to cond_resched(). Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] dup fd error fixPrasanna Meda1-1/+2
Set errorp in dup_fd, it will be used in sys_unshare also. Signed-off-by: Prasanna Meda <mlp@google.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] mmput() might sleepAndrew Morton1-0/+2
exit_aio() and exit_mmap() can sleep. But it's easy to accidentally call mmput() from inside locks. Cc: Dave Peterson <dsp@llnl.gov> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] Add a sysfs file to determine if a kexec kernel is loadedJeff Moyer2-3/+22
Create two files in /sys/kernel, kexec_loaded and kexec_crash_loaded. Each file contains a simple boolean value indicating whether the relevant kernel has been loaded into memory. The motivation for this is support: it provides a simple way to check whether a kexec or crash kernel has been loaded. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] swsusp: use less memory during resumeRafael J. Wysocki2-58/+85
Make swsusp allocate only as much memory as needed to store the image data and metadata during resume. Without this patch swsusp additionally allocates many page frames that will conflict with the "original" locations of the image data and are considered as "unsafe", treating them as "eaten" pages (ie. allocated but unusable). The patch makes swsusp allocate as many pages as it'll need to store the data read from the image in one shot, creating a list of allocated "safe" pages, and use the observation that all pages allocated by it are marked with the PG_nosave and PG_nosave_free flags set.  Namely, when it's about to load an image page, swsusp can check whether the page frame corresponding to the "original" location of this page has been allocated (ie. if the page frame has the PG_nosave and PG_nosave_free flags set) and if so, it can load the page directly into this page frame.  Otherwise it uses an allocated "safe" page from the list to store the data that will be copied to their "original" location later on. This allows us to save many page copyings and page allocations during resume and in the future it may allow us to load images greater than 50% of the normal zone. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: "Pavel Machek" <pavel@suse.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] kernel/power/snapshot.c: cleanupsAdrian Bunk1-7/+8
- make needlessly global functions static - make dummy functions static inline Signed-off-by: Adrian Bunk <bunk@stusta.de> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] swsusp: take lowmem reserves into accountRafael J. Wysocki1-1/+3
swsusp allocates memory from the normal zone, so it cannot use lowmem reserve pages from the lower zones. Therefore it should not count these pages as available to it. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] swsusp: add architecture special saveable pages supportShaohua Li3-15/+117
1. Add architecture specific pages save/restore support. Next two patches will use this to save/restore 'ACPI NVS' pages. 2. Allow reserved pages 'nosave'. This could avoid save/restore BIOS reserved pages. Signed-off-by: Shaohua Li <shaohua.li@intel.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Nigel Cunningham <nigel@suspend2.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] x86: kernel irq balance doesn't workZhang Yanmin1-0/+3
On i386, kernel irq balance doesn't work. 1) In do_irq_balance, after the kernel finds the min_loaded cpu but before calling set_pending_irq to really pin the selected_irq to the target cpu, the kernel does a cpus_and with irq_affinity[selected_irq]. Later on, when the irq is acked, the kernel calls move_native_irq=>desc->handler->set_affinity to change the irq affinity. However, every function pointed to by hw_interrupt_type->set_affinity(unsigned int irq, cpumask_t cpumask) always changes irq_affinity[irq] to cpumask. The next time do_irq_balance runs, it has to do the cpus_and again with irq_affinity[selected_irq], but irq_affinity[selected_irq] has already become the one cpu selected by the first irq balance. 2) Function balance_irq in arch/i386/kernel/io_apic.c has the same issue. [akpm@osdl.org: cleanups] Signed-off-by: Zhang Yanmin <yanmin.zhang@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] SELinux: add security hook call to mediate attach_task (kernel/cpuset.c)David Quigley1-0/+8
Add a security hook call to enable security modules to control the ability to attach a task to a cpuset. While limited control over this operation is possible via permission checks on the pseudo fs interface, those checks are not sufficient to control access to the target task, which is looked up in this function. The existing task_setscheduler hook is re-used for this operation since this falls under the same class of operations. Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] SELinux: add security hooks to {get,set}affinityDavid Quigley1-1/+8
This patch adds LSM hooks into the setaffinity and getaffinity functions to enable security modules to control these operations between tasks with task_setscheduler and task_getscheduler LSM hooks. Signed-off-by: David Quigley <dpquigl@tycho.nsa.gov> Acked-by: Stephen Smalley <sds@tycho.nsa.gov> Signed-off-by: James Morris <jmorris@namei.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] move_pages: fix 32 -> 64 bit compat functionChristoph Lameter1-2/+2
The definition of the third parameter is a pointer to an array of virtual addresses, which gives us some trouble. The existing code calculated the wrong address in the array since I used void to avoid having to specify a type. I now use the correct type "compat_uptr_t __user *" in the definition of the function in kernel/compat.c. However, I used __u32 in syscalls.h; providing the same definition there would require including compat.h, which would create an ugly include situation. On both ia64 and x86_64, compat_uptr_t is u32, so this works although the parameter declarations differ. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
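The usual way to pull one entry out of a 32-bit user array in a compat syscall, as a hedged sketch with made-up names:

    #include <linux/compat.h>
    #include <asm/uaccess.h>

    static int my_get_user_page_addr(compat_uptr_t __user *pages32, int i,
                                     void __user **out)
    {
        compat_uptr_t p;

        if (get_user(p, pages32 + i))
            return -EFAULT;
        *out = compat_ptr(p);   /* widen the 32-bit value to a native user pointer */
        return 0;
    }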
2006-06-23[PATCH] sys_move_pages: 32bit support (i386, x86_64)Christoph Lameter2-0/+24
sys_move_pages() support for 32bit (i386 plus x86_64 compat layer) Add support for move_pages() on i386 and also add the compat functions necessary to run 32 bit binaries on x86_64. Add compat_sys_move_pages to the x86_64 32bit binary layer. Note that it is not up to date so I added the missing pieces. Not sure if this is done the right way. [akpm@osdl.org: compile fix] Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@muc.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] page migration: sys_move_pages(): support moving of individual pagesChristoph Lameter1-0/+1
move_pages() is used to move individual pages of a process. The function can be used to determine the location of pages and to move them onto the desired node. move_pages() returns status information for each page.

    long move_pages(pid, number_of_pages_to_move, addresses_of_pages[], nodes[] or NULL, status[], flags);

The addresses of pages is an array of void * pointing to the pages to be moved. The nodes array contains the node numbers that the pages should be moved to. If a NULL is passed instead of an array then no pages are moved but the status array is updated. The status request may be used to determine the page state before issuing another move_pages() to move pages. The status array will contain the state of all individual page migration attempts when the function terminates. The status array is only valid if move_pages() completed successfully.

Possible page states in status[]:
0..MAX_NUMNODES  The page is now on the indicated node.
-ENOENT  Page is not present.
-EACCES  Page is mapped by multiple processes and can only be moved if MPOL_MF_MOVE_ALL is specified.
-EPERM   The page has been mlocked by a process/driver and cannot be moved.
-EBUSY   Page is busy and cannot be moved. Try again later.
-EFAULT  Invalid address (no VMA or zero page).
-ENOMEM  Unable to allocate memory on target node.
-EIO     Unable to write back page. The page must be written back in order to move it since the page is dirty and the filesystem does not provide a migration function that would allow the moving of dirty pages.
-EINVAL  A dirty page cannot be moved. The filesystem does not provide a migration function and has no ability to write back pages.

The flags parameter indicates what types of pages to move:
MPOL_MF_MOVE      Move pages that are only mapped by the process.
MPOL_MF_MOVE_ALL  Also move pages that are mapped by multiple processes. Requires sufficient capabilities.

Possible return codes from move_pages():
-ENOENT  No pages found that would require moving. All pages are either already on the target node, not present, had an invalid address or could not be moved because they were mapped by multiple processes.
-EINVAL  Flags other than MPOL_MF_MOVE(_ALL) specified or an attempt to migrate pages in a kernel thread.
-EPERM   MPOL_MF_MOVE_ALL specified without sufficient privileges, or an attempt to move a process belonging to another user.
-EACCES  One of the target nodes is not allowed by the current cpuset.
-ENODEV  One of the target nodes is not online.
-ESRCH   Process does not exist.
-E2BIG   Too many pages to move.
-ENOMEM  Not enough memory to allocate control array.
-EFAULT  Parameters could not be accessed.

A test program for move_pages() may be found with the patches on ftp.kernel.org:/pub/linux/kernel/people/christoph/pmig/patches-2.6.17-rc4-mm3

From: Christoph Lameter <clameter@sgi.com> Detailed results for sys_move_pages() Pass a pointer to an integer to get_new_page() that may be used to indicate where the completion status of a migration operation should be placed. This allows sys_move_pages() to report back exactly what happened to each page. Wish there would be a better way to do this. Looks a bit hacky. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Jes Sorensen <jes@trained-monkey.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Andi Kleen <ak@muc.de> Cc: Michael Kerrisk <mtk-manpages@gmx.net> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] swsusp: rework memory shrinkerRafael J. Wysocki1-2/+8
Rework the swsusp's memory shrinker in the following way: - Simplify balance_pgdat() by removing all of the swsusp-related code from it. - Make shrink_all_memory() use shrink_slab() and a new function shrink_all_zones() which calls shrink_active_list() and shrink_inactive_list() directly for each zone in a way that's optimized for suspend. In shrink_all_memory() we try to free exactly as many pages as the caller asks for, preferably in one shot, starting from easier targets.  If slab caches are huge, they are most likely to have enough pages to reclaim.  The inactive lists are next (the zones with more inactive pages go first) etc. Each time shrink_all_memory() attempts to shrink the active and inactive lists for each zone in 5 passes.  In the first pass, only the inactive lists are taken into consideration.  In the next two passes the active lists are also shrunk, but mapped pages are not reclaimed.  In the last two passes the active and inactive lists are shrunk and mapped pages are reclaimed as well. The aim of this is to alter the reclaim logic to choose the best pages to keep on resume and improve the responsiveness of the resumed system. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Signed-off-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] support for panic at OOMKAMEZAWA Hiroyuki1-0/+9
This patch adds a panic_on_oom sysctl under sys.vm. When sysctl vm.panic_on_oom = 1, the kernel panics instead of killing rogue processes. If vm.panic_on_oom is 0 the kernel will do oom_kill() in the same way as it does today. Of course, the default value is 0 and only root can modify it. In general, the oom_killer works well and kills rogue processes, so the whole system can survive. But there are environments where a panic is preferable to killing some processes. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] VFS: Permit filesystem to perform statfs with a known root dentryDavid Howells1-1/+1
Give the statfs superblock operation a dentry pointer rather than a superblock pointer. This complements the get_sb() patch. That reduced the significance of sb->s_root, allowing NFS to place a fake root there. However, NFS does require a dentry to use as a target for the statfs operation. This permits the root in the vfsmount to be used instead. linux/mount.h has been added where necessary to make allyesconfig build successfully. Interest has also been expressed for use with the FUSE and XFS filesystems. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Nathan Scott <nathans@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] VFS: Permit filesystem to override root dentry on mountDavid Howells2-8/+8
Extend the get_sb() filesystem operation to take an extra argument that permits the VFS to pass in the target vfsmount that defines the mountpoint. The filesystem is then required to manually set the superblock and root dentry pointers. For most filesystems, this should be done with simple_set_mnt() which will set the superblock pointer and then set the root dentry to the superblock's s_root (as per the old default behaviour). The get_sb() op now returns an integer as there's now no need to return the superblock pointer. This patch permits a superblock to be implicitly shared amongst several mount points, such as can be done with NFS to avoid potential inode aliasing. In such a case, simple_set_mnt() would not be called, and instead the mnt_root and mnt_sb would be set directly. The patch also makes the following changes: (*) the get_sb_*() convenience functions in the core kernel now take a vfsmount pointer argument and return an integer, so most filesystems have to change very little. (*) If one of the convenience function is not used, then get_sb() should normally call simple_set_mnt() to instantiate the vfsmount. This will always return 0, and so can be tail-called from get_sb(). (*) generic_shutdown_super() now calls shrink_dcache_sb() to clean up the dcache upon superblock destruction rather than shrink_dcache_anon(). This is required because the superblock may now have multiple trees that aren't actually bound to s_root, but that still need to be cleaned up. The currently called functions assume that the whole tree is rooted at s_root, and that anonymous dentries are not the roots of trees which results in dentries being left unculled. However, with the way NFS superblock sharing are currently set to be implemented, these assumptions are violated: the root of the filesystem is simply a dummy dentry and inode (the real inode for '/' may well be inaccessible), and all the vfsmounts are rooted on anonymous[*] dentries with child trees. [*] Anonymous until discovered from another tree. (*) The documentation has been adjusted, including the additional bit of changing ext2_* into foo_* in the documentation. [akpm@osdl.org: convert ipath_fs, do other stuff] Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Cc: Nathan Scott <nathans@sgi.com> Cc: Roland Dreier <rolandd@cisco.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
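For a typical single-superblock filesystem the conversion is small. A hedged sketch with a hypothetical foofs (foofs_fill_super is assumed to exist elsewhere):

    #include <linux/fs.h>

    /* before: get_sb() returned the superblock
     *     static struct super_block *foofs_get_sb(struct file_system_type *fs_type,
     *             int flags, const char *dev_name, void *data)
     *     {
     *             return get_sb_single(fs_type, flags, data, foofs_fill_super);
     *     }
     */

    /* after: get_sb() returns an error code and instantiates the vfsmount itself */
    static int foofs_get_sb(struct file_system_type *fs_type,
            int flags, const char *dev_name, void *data, struct vfsmount *mnt)
    {
        return get_sb_single(fs_type, flags, data, foofs_fill_super, mnt);
    }

Filesystems that do not use one of the get_sb_*() helpers call simple_set_mnt(mnt, sb) themselves once the superblock is set up.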
2006-06-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpcLinus Torvalds1-0/+13
* git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc: (139 commits) [POWERPC] re-enable OProfile for iSeries, using timer interrupt [POWERPC] support ibm,extended-*-frequency properties [POWERPC] Extra sanity check in EEH code [POWERPC] Dont look for class-code in pci children [POWERPC] Fix mdelay badness on shared processor partitions [POWERPC] disable floating point exceptions for init [POWERPC] Unify ppc syscall tables [POWERPC] mpic: add support for serial mode interrupts [POWERPC] pseries: Print PCI slot location code on failure [POWERPC] spufs: one more fix for 64k pages [POWERPC] spufs: fail spu_create with invalid flags [POWERPC] spufs: clear class2 interrupt status before wakeup [POWERPC] spufs: fix Makefile for "make clean" [POWERPC] spufs: remove stop_code from struct spu [POWERPC] spufs: fix spu irq affinity setting [POWERPC] spufs: further abstract priv1 register access [POWERPC] spufs: split the Cell BE support into generic and platform dependant parts [POWERPC] spufs: dont try to access SPE channel 1 count [POWERPC] spufs: use kzalloc in create_spu [POWERPC] spufs: fix initial state of wbox file ... Manually resolved conflicts in: drivers/net/phy/Makefile include/asm-powerpc/spu.h
2006-06-22[PATCH] avoid tasklist_lock at getrusage for multithreaded case tooRavikiran G Thirumalai1-34/+22
Avoid taking tasklist_lock at getrusage for the multithreaded case too. We don't need to take the tasklist lock for thread traversal of a process since Oleg's do-__unhash_process-under-siglock.patch and related work. Signed-off-by: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-22[PATCH] suspend_console() warning fixAndrew Morton1-1/+1
kernel/power/main.c: In function 'suspend_prepare': kernel/power/main.c:89: warning: implicit declaration of function 'suspend_console' kernel/power/main.c: In function 'suspend_finish': kernel/power/main.c:137: warning: implicit declaration of function 'resume_console' Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-22[PATCH] selinux: add hooks for key subsystemMichael LeMay1-1/+1
Introduce SELinux hooks to support the access key retention subsystem within the kernel. Incorporate new flask headers from a modified version of the SELinux reference policy, with support for the new security class representing retained keys. Extend the "key_alloc" security hook with a task parameter representing the intended ownership context for the key being allocated. Attach security information to root's default keyrings within the SELinux initialization routine. Has passed David's testsuite. Signed-off-by: Michael LeMay <mdlemay@epoch.ncsc.mil> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: James Morris <jmorris@namei.org> Acked-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-20Merge branch 'audit.b21' of ↵Linus Torvalds7-266/+1555
git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current * 'audit.b21' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/audit-current: (25 commits) [PATCH] make set_loginuid obey audit_enabled [PATCH] log more info for directory entry change events [PATCH] fix AUDIT_FILTER_PREPEND handling [PATCH] validate rule fields' types [PATCH] audit: path-based rules [PATCH] Audit of POSIX Message Queue Syscalls v.2 [PATCH] fix se_sen audit filter [PATCH] deprecate AUDIT_POSSBILE [PATCH] inline more audit helpers [PATCH] proc_loginuid_write() uses simple_strtoul() on non-terminated array [PATCH] update of IPC audit record cleanup [PATCH] minor audit updates [PATCH] fix audit_krule_to_{rule,data} return values [PATCH] add filtering by ppid [PATCH] log ppid [PATCH] collect sid of those who send signals to auditd [PATCH] execve argument logging [PATCH] fix deadlocks in AUDIT_LIST/AUDIT_LIST_RULES [PATCH] audit_panic() is audit-internal [PATCH] inotify (5/5): update kernel documentation ... Manual fixup of conflict in unclude/linux/inotify.h
2006-06-20Merge git://git.infradead.org/~dwmw2/rbtree-2.6Linus Torvalds1-2/+2
* git://git.infradead.org/~dwmw2/rbtree-2.6: [RBTREE] Switch rb_colour() et al to en_US spelling of 'color' for consistency Update UML kernel/physmem.c to use rb_parent() accessor macro [RBTREE] Update hrtimers to use rb_parent() accessor macro. [RBTREE] Add explicit alignment to sizeof(long) for struct rb_node. [RBTREE] Merge colour and parent fields of struct rb_node. [RBTREE] Remove dead code in rb_erase() [RBTREE] Update JFFS2 to use rb_parent() accessor macro. [RBTREE] Update eventpoll.c to use rb_parent() accessor macro. [RBTREE] Update key.c to use rb_parent() accessor macro. [RBTREE] Update ext3 to use rb_parent() accessor macro. [RBTREE] Change rbtree off-tree marking in I/O schedulers. [RBTREE] Add accessor macros for colour and parent fields of rb_node
2006-06-20Merge git://git.infradead.org/mtd-2.6Linus Torvalds2-185/+0
* git://git.infradead.org/mtd-2.6: (199 commits) [MTD] NAND: Fix breakage all over the place [PATCH] NAND: fix remaining OOB length calculation [MTD] NAND Fixup NDFC merge brokeness [MTD NAND] S3C2410 driver cleanup [MTD NAND] s3c24x0 board: Fix clock handling, ensure proper initialisation. [JFFS2] Check CRC32 on dirent and data nodes each time they're read [JFFS2] When retiring nextblock, allocate a node_ref for the wasted space [JFFS2] Mark XATTR support as experimental, for now [JFFS2] Don't trust node headers before the CRC is checked. [MTD] Restore MTD_ROM and MTD_RAM types [MTD] assume mtd->writesize is 1 for NOR flashes [MTD NAND] Fix s3c2410 NAND driver so it at least _looks_ like it compiles [MTD] Prepare physmap for 64-bit-resources [JFFS2] Fix more breakage caused by janitorial meddling. [JFFS2] Remove stray __exit from jffs2_compressors_exit() [MTD] Allow alternate JFFS2 mount variant for root filesystem. [MTD] Disconnect struct mtd_info from ABI [MTD] replace MTD_RAM with MTD_GENERIC_TYPE [MTD] replace MTD_ROM with MTD_GENERIC_TYPE [MTD] remove a forgotten MTD_XIP ...
2006-06-20[PATCH] make set_loginuid obey audit_enabledSteve Grubb1-11/+16
Hi, I was doing some testing and noticed that when the audit system was disabled, I was still getting messages about the loginuid being set. The following patch makes audit_set_loginuid look at in_syscall to determine if it should create an audit event. The loginuid will continue to be set as long as there is a context. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] log more info for directory entry change eventsAmy Griffis4-61/+127
When an audit event involves changes to a directory entry, include a PATH record for the directory itself. A few other notable changes: - fixed audit_inode_child() hooks in fsnotify_move() - removed unused flags arg from audit_inode() - added audit log routines for logging a portion of a string Here's some sample output. before patch: type=SYSCALL msg=audit(1149821605.320:26): arch=40000003 syscall=39 success=yes exit=0 a0=bf8d3c7c a1=1ff a2=804e1b8 a3=bf8d3c7c items=1 ppid=739 pid=800 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255 type=CWD msg=audit(1149821605.320:26): cwd="/root" type=PATH msg=audit(1149821605.320:26): item=0 name="foo" parent=164068 inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0 after patch: type=SYSCALL msg=audit(1149822032.332:24): arch=40000003 syscall=39 success=yes exit=0 a0=bfdd9c7c a1=1ff a2=804e1b8 a3=bfdd9c7c items=2 ppid=714 pid=777 auid=0 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=ttyS0 comm="mkdir" exe="/bin/mkdir" subj=root:system_r:unconfined_t:s0-s0:c0.c255 type=CWD msg=audit(1149822032.332:24): cwd="/root" type=PATH msg=audit(1149822032.332:24): item=0 name="/root" inode=164068 dev=03:00 mode=040750 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_dir_t:s0 type=PATH msg=audit(1149822032.332:24): item=1 name="foo" inode=164010 dev=03:00 mode=040755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:user_home_t:s0 Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] fix AUDIT_FILTER_PREPEND handlingAmy Griffis1-0/+1
Clear AUDIT_FILTER_PREPEND flag after adding rule to list. This fixes three problems when a rule is added with the -A syntax: - auditctl displays filter list as "(null)" - the rule cannot be removed using -d - a duplicate rule can be added with -a Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] validate rule fields' typesAl Viro1-9/+48
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] audit: path-based rulesAmy Griffis4-88/+900
In this implementation, audit registers inotify watches on the parent directories of paths specified in audit rules. When audit's inotify event handler is called, it updates any affected rules based on the filesystem event. If the parent directory is renamed, removed, or its filesystem is unmounted, audit removes all rules referencing that inotify watch. To keep things simple, this implementation limits location-based auditing to the directory entries in an existing directory. Given a path-based rule for /foo/bar/passwd, the following table applies:

passwd modified -- audit event logged
passwd replaced -- audit event logged, rules list updated
bar renamed     -- rule removed
foo renamed     -- untracked, meaning that the rule now applies to the new location

Audit users typically want to have many rules referencing filesystem objects, which can significantly impact filtering performance. This patch also adds an inode-number-based rule hash to mitigate this situation. The patch is relative to the audit git tree: http://kernel.org/git/?p=linux/kernel/git/viro/audit-current.git;a=summary and uses the inotify kernel API: http://lkml.org/lkml/2006/6/1/145 Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] Audit of POSIX Message Queue Syscalls v.2George C. Wilson1-1/+273
This patch adds audit support to POSIX message queues. It applies cleanly to the lspp.b15 branch of Al Viro's git tree. There are new auxiliary data structures, and collection and emission routines in kernel/auditsc.c. New hooks in ipc/mqueue.c collect arguments from the syscalls. I tested the patch by building the examples from the POSIX MQ library tarball. Build them -lrt, not against the old MQ library in the tarball. Here's the URL: http://www.geocities.com/wronski12/posix_ipc/libmqueue-4.41.tar.gz Do auditctl -a exit,always -S for mq_open, mq_timedsend, mq_timedreceive, mq_notify, mq_getsetattr. mq_unlink has no new hooks. Please see the corresponding userspace patch to get correct output from auditd for the new record types. [fixes folded] Signed-off-by: George Wilson <ltcgcw@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] deprecate AUDIT_POSSBILEAl Viro2-4/+5
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] inline more audit helpersAl Viro1-10/+4
pull checks for ->audit_context into inlined wrappers Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] update of IPC audit record cleanupLinda Knippers1-17/+5
The following patch addresses most of the issues with the IPC_SET_PERM records as described in: https://www.redhat.com/archives/linux-audit/2006-May/msg00010.html and addresses the comments I received on the record field names. To summarize, I made the following changes: 1. Changed sys_msgctl() and semctl_down() so that an IPC_SET_PERM record is emitted in the failure case as well as the success case. This matches the behavior in sys_shmctl(). I could simplify the code in sys_msgctl() and semctl_down() slightly but it would mean that in some error cases we could get an IPC_SET_PERM record without an IPC record and that seemed odd. 2. No change to the IPC record type, given no feedback on the backward compatibility question. 3. Removed the qbytes field from the IPC record. It wasn't being set and when audit_ipc_obj() is called from ipcperms(), the information isn't available. If we want the information in the IPC record, more extensive changes will be necessary. Since it only applies to message queues and it isn't really permission related, it doesn't seem worth it. 4. Removed the obj field from the IPC_SET_PERM record. This means that the kern_ipc_perm argument is no longer needed. 5. Removed the spaces and renamed the IPC_SET_PERM field names. Replaced iuid and igid fields with ouid and ogid in the IPC record. I tested this with the lspp.22 kernel on an x86_64 box. I believe it applies cleanly on the latest kernel. -- ljk Signed-off-by: Linda Knippers <linda.knippers@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] minor audit updatesSerge E. Hallyn1-9/+10
Just a few minor proposed updates. Only the last one will actually affect behavior; the rest are just misleading code. Several AUDIT_SET functions return the 'old' value, but only a return value <0 is checked for, so just return 0. Propagate audit_set_rate_limit and audit_set_backlog_limit error values. In audit_buffer_free, audit_freelist_count was being incremented even when we discard the returned buffer, so audit_freelist_count can end up wrong. This could cause the actual freelist to shrink over time, eventually threatening to degrade audit performance. Signed-off-by: Serge E. Hallyn <serue@us.ibm.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] fix audit_krule_to_{rule,data} return valuesAmy Griffis1-2/+2
Don't return -ENOMEM when callers of these functions are checking for a NULL return. Bug noticed by Serge Hallyn. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] add filtering by ppidAl Viro1-0/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] log ppidAl Viro1-2/+5
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] collect sid of those who send signals to auditdAl Viro4-23/+44
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] execve argument loggingAl Viro2-3/+56
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] fix deadlocks in AUDIT_LIST/AUDIT_LIST_RULESAl Viro3-52/+81
We should not send a pile of replies while holding audit_netlink_mutex since we hold the same mutex when we receive commands. As a result, we can get blocked while sending and sit there holding the mutex while auditctl is unable to send the next command and get around to receiving what we'd sent. Solution: create skbs and put them into a queue instead of sending; once we are done, send what we've got on the list. The former can be done synchronously while we are handling AUDIT_LIST or AUDIT_LIST_RULES; we are holding audit_netlink_mutex at that point. The latter is done asynchronously and without messing with audit_netlink_mutex. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-20[PATCH] inotify (1/5): split kernel API from userspace supportAmy Griffis2-3/+3
The following series of patches introduces a kernel API for inotify, making it possible for kernel modules to benefit from inotify's mechanism for watching inodes. With these patches, inotify will maintain for each caller a list of watches (via an embedded struct inotify_watch), where each inotify_watch is associated with a corresponding struct inode. The caller registers an event handler and specifies for which filesystem events their event handler should be called per inotify_watch. Signed-off-by: Amy Griffis <amy.griffis@hp.com> Acked-by: Robert Love <rml@novell.com> Acked-by: John McCutchan <john@johnmccutchan.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-06-19Add support for suspending and resuming the whole console subsystemLinus Torvalds2-1/+29
Trying to suspend/resume with console messages flying all around is doomed to failure, when the devices that the messages are trying to go to are being shut down. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17[PATCH] arm_timer: remove a racy and obsolete PF_EXITING checkOleg Nesterov1-3/+0
arm_timer() checks PF_EXITING to prevent BUG_ON(->exit_state) in run_posix_cpu_timers(). However, for some reason it does so only for CPUCLOCK_PERTHREAD case (which is imho wrong). Also, this check is not reliable, PF_EXITING could be set on another cpu without any locks/barriers just after the check, so it can't prevent from attaching the timer to the exiting task. The previous patch makes this check unneeded. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17[PATCH] run_posix_cpu_timers: remove a bogus BUG_ON()Oleg Nesterov2-26/+18
do_exit() clears ->it_##clock##_expires, but nothing prevents another cpu from attaching the timer to the exiting process after that. arm_timer() tries to protect against this race, but the check is racy. After exit_notify() does 'write_unlock_irq(&tasklist_lock)' and before do_exit() calls 'schedule()', a local timer interrupt can find tsk->exit_state != 0. If that state was EXIT_DEAD (or another cpu does sys_wait4) the interrupted task has ->signal == NULL. At this moment the exiting task has no pending cpu timers, they were cleaned up in __exit_signal()->posix_cpu_timers_exit{,_group}(), so we can just return from the irq. John Stultz recently confirmed this bug, see http://marc.theaimsgroup.com/?l=linux-kernel&m=115015841413687 Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-17[PATCH] check_process_timers: fix possible lockupOleg Nesterov1-5/+4
If the local timer interrupt happens just after do_exit() sets PF_EXITING (and before it clears ->it_xxx_expires), run_posix_cpu_timers() will call check_process_timers() with tasklist_lock + ->siglock held, and in check_process_timers:

    t = tsk;
    do {
        ....
        do {
            t = next_thread(t);
        } while (unlikely(t->flags & PF_EXITING));
    } while (t != tsk);

the outer loop will never stop. Actually, the window is bigger. Another process can attach the timer after ->it_xxx_expires was cleared (see the next commit) and the 'if (PF_EXITING)' check in arm_timer() is racy (see the one after that). Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-09kbuild: check license compatibility when building modulesSam Ravnborg1-10/+1
Modules that use GPL symbols can no longer be built with kbuild; the build will fail during the modpost step. When a GPL-incompatible module uses an EXPORT_SYMBOL_GPL_FUTURE symbol, warn during modpost so the author is actually notified. The actual license compatibility check is shared with the kernel to make sure it is in sync. Patch originally from: Andreas Gruenbacher <agruen@suse.de> and Ram Pai <linuxram@us.ibm.com> Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
2006-06-09[PATCH] Add a prctl to change the endianness of a process.Anton Blanchard1-0/+13
This new prctl is intended for changing the execution mode of the processor, on processors that support both a little-endian mode and a big-endian mode. It is intended for use by programs such as instruction set emulators (for example an x86 emulator on PowerPC), which may find it convenient to use the processor in an alternate endianness mode when executing translated instructions. Note that this does not imply the existence of a fully-fledged ABI for both endiannesses, or of compatibility code for converting system calls done in the non-native endianness mode. The program is expected to arrange for all of its system call arguments to be presented in the native endianness. Switching between big and little-endian mode will require some care in constructing the instruction sequence for the switch. Generally the instructions up to the instruction that invokes the prctl system call will have to be in the old endianness, and subsequent instructions will have to be in the new endianness. Signed-off-by: Anton Blanchard <anton@samba.org> Signed-off-by: Paul Mackerras <paulus@samba.org>
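From userspace the new prctl would be used roughly as below; a sketch assuming the PR_SET_ENDIAN/PR_GET_ENDIAN constants from linux/prctl.h are visible via sys/prctl.h, with minimal error handling:

    #include <stdio.h>
    #include <sys/prctl.h>

    int main(void)
    {
        int old;

        if (prctl(PR_GET_ENDIAN, &old) == 0)
            printf("current endian mode: %d\n", old);

        /* ask the kernel to run this process in little-endian mode */
        if (prctl(PR_SET_ENDIAN, PR_ENDIAN_LITTLE) != 0)
            perror("PR_SET_ENDIAN");

        return 0;
    }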
2006-05-31[PATCH] hrtimer: export symbolsStephen Hemminger1-0/+6
From: Stephen Hemminger <shemminger@osdl.org> I want to use hrtimers in the netem (Network Emulator) qdisc, but the necessary symbols aren't exported for module use. They are also needed by SystemTap. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "Stone, Joshua I" <joshua.i.stone@intel.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
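In sketch form, the kind of module-side use this enables looks roughly like the following. The callback signature and mode constants (HRTIMER_REL vs. the later HRTIMER_MODE_REL) changed between early kernel versions, so treat the details as assumptions and check the headers of the tree being built against:

    #include <linux/hrtimer.h>

    static struct hrtimer delay_timer;

    static void arm_delay(ktime_t delay)
    {
        hrtimer_init(&delay_timer, CLOCK_MONOTONIC, HRTIMER_REL);
        /* delay_timer.function = my_callback;  (callback omitted here) */
        hrtimer_start(&delay_timer, delay, HRTIMER_REL);
    }

    static void disarm_delay(void)
    {
        /* Exported by this patch, so modules can now call it. */
        hrtimer_cancel(&delay_timer);
    }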
2006-05-21Revert "[PATCH] sched: fix interactive task starvation"Linus Torvalds1-44/+18
This reverts commit 5ce74abe788a26698876e66b9c9ce7e7acc25413 (and its dependent commit 8a5bc075b8d8cf7a87b3f08fad2fba0f5d13295e), because of audio underruns. Reported by Rene Herman <rene.herman@keyaccess.nl>, who also pinpointed the exact cause of the underruns: "Audio underruns galore, with only ogg123 and firefox (browsing the GIT tree online is also a nice trigger by the way). If I back it out, everything is fine for me again." Cc: Rene Herman <rene.herman@keyaccess.nl> Cc: Mike Galbraith <efault@gmx.de> Acked-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21[PATCH] Fix a NO_IDLE_HZ timer bugZachary Amsden1-0/+16
Under certain timing conditions, a race during boot occurs where timer ticks are being processed on remote CPUs. The remote timer ticks can increment jiffies, and if this happens during a window when a timeout is very close to expiring but a local tick has not yet been delivered, you can end up with 1) No softirq pending 2) A local timer wheel which is not synced to jiffies 3) No high resolution timer active 4) A local timer which is supposed to fire before the current jiffies value. In this circumstance, the comparison in next_timer_interrupt overflows, because the base of the comparison for high resolution timers is jiffies, but for the softirq timer wheel, it is relative to the current base of the wheel (jiffies_base). Signed-off-by: Zachary Amsden <zach@vmware.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-21[PATCH] cpuset: might_sleep_if check in cpuset_zones_allowedPaul Jackson1-0/+1
It's too easy to incorrectly call cpuset_zone_allowed() in an atomic context without __GFP_HARDWALL set, and when done, it is not noticed until a tight memory situation forces allocations to be tried outside the current cpuset. Add a 'might_sleep_if()' check, to catch this earlier on, instead of waiting for a similar check in the mutex_lock() code, which is only rarely invoked. Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
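The added check is essentially of this shape; a sketch of the idea rather than the verbatim hunk, with hypothetical helper names:

    int cpuset_zone_allowed(struct zone *z, gfp_t gfp_mask)
    {
        /* Scanning ancestor cpusets may take a mutex and sleep unless the
         * caller passed __GFP_HARDWALL, so catch atomic callers early. */
        might_sleep_if(!(gfp_mask & __GFP_HARDWALL));

        if (gfp_mask & __GFP_HARDWALL)
            return current_cpuset_allows(z);          /* hypothetical */
        return nearest_exclusive_ancestor_allows(z);  /* hypothetical */
    }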
2006-05-21[PATCH] cpuset: update cpuset_zones_allowed commentPaul Jackson1-9/+15
Update the kernel/cpuset.c:cpuset_zone_allowed() comment. The rule for when mm/page_alloc.c should call cpuset_zone_allowed() was intended to be: Don't call cpuset_zone_allowed() if you can't sleep, unless you pass in the __GFP_HARDWALL flag set in gfp_flag, which disables the code that might scan up ancestor cpusets and sleep. The explanation of this rule in the comment above cpuset_zone_allowed() was stale, as a result of a restructuring of some __alloc_pages() code in November 2005. Rewrite that comment ... Signed-off-by: Paul Jackson <pj@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6David Woodhouse4-25/+65
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-05-15[PATCH] symbol_put_addr() locks kernelTrent Piepho2-7/+7
Ever since a previous patch: Fix race between CONFIG_DEBUG_SLABALLOC and modules Sun, 27 Jun 2004 17:55:19 +0000 (17:55 +0000) http://www.kernel.org/git/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commit;h=92b3db26d31cf21b70e3c1eadc56c179506d8fbe the function symbol_put_addr() will deadlock the kernel. symbol_put_addr() would acquire modlist_lock, then while holding the lock call two functions, kernel_text_address() and module_text_address(), which also try to acquire the same lock. This deadlocks the kernel, of course. This patch changes symbol_put_addr() to not acquire the modlist_lock; it doesn't need it since it never looks at the module list directly. Also, it now uses core_kernel_text() instead of kernel_text_address(). The latter has an additional check for addr inside a module, but we don't need to do that since we call module_text_address() (the same function kernel_text_address uses) ourselves. Signed-off-by: Trent Piepho <xyzzy@speakeasy.org> Cc: Zwane Mwaikambo <zwane@fsmlabs.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Cc: Johannes Stezenbach <js@linuxtv.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
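The bug is the classic self-deadlock shape; a generic sketch with hypothetical names (in module.c the lock was modlist_lock and the callees were kernel_text_address()/module_text_address()):

    static DEFINE_SPINLOCK(mod_lock);

    static int addr_in_list(unsigned long addr)
    {
        int found;

        spin_lock(&mod_lock);     /* tries to take the lock again ... */
        found = 0;                /* ... walk the list here ... */
        spin_unlock(&mod_lock);
        return found;
    }

    static void put_addr(unsigned long addr)
    {
        spin_lock(&mod_lock);     /* lock taken here ... */
        addr_in_list(addr);       /* ... so this call spins forever */
        spin_unlock(&mod_lock);
    }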
2006-05-15[PATCH] RCU: introduce rcu_needs_cpu() interfaceHeiko Carstens1-0/+19
With "Paul E. McKenney" <paulmck@us.ibm.com> Introduce rcu_needs_cpu() interface. This can be used to tell if there will be a new rcu batch on a cpu soon by looking at the curlist pointer. This can be used to avoid to enter a tickless idle state where the cpu would miss that a new batch is ready when rcu_start_batch would be called on a different cpu. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-11ptrace_attach: fix possible deadlock scenario with irqsLinus Torvalds1-1/+19
Eric Biederman points out that we can't take the task_lock while holding tasklist_lock for writing, because another CPU that holds the task lock might take an interrupt that then tries to take tasklist_lock for writing. Which would be a nasty deadlock, with one CPU spinning forever in an interrupt handler (although admittedly you need to really work at triggering it ;) Since the ptrace_attach() code is special and very unusual, just make it be extra careful, and use trylock+repeat to avoid the possible deadlock. Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
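The trylock+repeat approach looks roughly like this; a sketch of the idea with a hypothetical function name, not the verbatim ptrace.c hunk:

    static void take_both_locks(struct task_struct *task)
    {
    repeat:
        task_lock(task);
        local_irq_disable();
        if (!write_trylock(&tasklist_lock)) {
            /* Back off completely so an interrupt on another cpu that
             * holds the task lock cannot wedge us, then retry. */
            local_irq_enable();
            task_unlock(task);
            cpu_relax();
            goto repeat;
        }
        /* ... both locks held: do the attach, then unlock in reverse ... */
        write_unlock_irq(&tasklist_lock);
        task_unlock(task);
    }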
2006-05-08Finally remove the obnoxious inter_module_xxx()David Woodhouse2-185/+0
This was already a bad plan when I argued against adding it in the first place. Good riddance. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-05-07Fix ptrace_attach()/ptrace_traceme()/de_thread() raceLinus Torvalds1-18/+21
This holds the task lock (and, for ptrace_attach, the tasklist_lock) over the actual attach event, which closes a race between attaching to a thread that is either doing a PTRACE_TRACEME or getting de-threaded. Thanks to Oleg Nesterov for reminding me about this, and Chris Wright for noticing a lost return value in my first version. Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-05-01[PATCH] Audit Filter PerformanceSteve Grubb1-4/+7
While testing the watch performance, I noticed that selinux_task_ctxid() was creeping into the results more than it should. Investigation showed that the function call was being called whether it was needed or not. The below patch fixes this. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01[PATCH] Rework of IPC auditingSteve Grubb1-3/+51
1) The audit_ipc_perms() function has been split into two different functions: - audit_ipc_obj() - audit_ipc_set_perm() There's a key shift here... The audit_ipc_obj() collects the uid, gid, mode, and SElinux context label of the current ipc object. This audit_ipc_obj() hook is now found in several places. Most notably, it is hooked in ipcperms(), which is called in various places around the ipc code performing a MAC check. Additionally there are several places where *checkid() is used to validate that an operation is being performed on a valid object while not necessarily having a nearby ipcperms() call. In these locations, audit_ipc_obj() is called to ensure that the information is captured by the audit system. The audit_ipc_set_perm() function is called any time the permissions on the ipc object change. In this case, the NEW permissions are recorded (and note that an audit_ipc_obj() call exists just a few lines before each instance). 2) Support for an AUDIT_IPC_SET_PERM audit message type. This allows for separate auxiliary audit records for normal operations on an IPC object and permissions changes. Note that the same struct audit_aux_data_ipcctl is used and populated, however there are separate audit_log_format statements based on the type of the message. Finally, the AUDIT_IPC block of code in audit_free_aux() was extended to handle aux messages of this new type. No more mem leaks I hope ;-) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01[PATCH] More user space subject labelsSteve Grubb3-39/+141
Hi, The patch below builds upon the patch sent earlier and adds subject label to all audit events generated via the netlink interface. It also cleans up a few other minor things. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01[PATCH] Reworked patch for labels on user space messagesSteve Grubb1-3/+19
The below patch should be applied after the inode and ipc sid patches. This patch is a reworking of Tim's patch that has been updated to match the inode and ipc patches since it's similar. [updated: > Stephen Smalley also wanted to change a variable from isec to tsec in the > user sid patch. ] Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01[PATCH] change lspp ipc auditingSteve Grubb1-47/+21
Hi, The patch below converts IPC auditing to collect sids and convert them to context strings only if it needs to output an audit record. This patch depends on the inode audit change patch already being applied. Signed-off-by: Steve Grubb <sgrubb@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01[PATCH] audit inode patchSteve Grubb1-37/+16
Previously, we were gathering the context instead of the sid. Now in this patch, we gather just the sid and convert to context only if an audit event is being output. This patch brings the performance hit from 146% down to 23% Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01[PATCH] support for context based audit filtering, part 2Darrel Goeddel4-27/+256
This patch provides the ability to filter audit messages based on the elements of the process' SELinux context (user, role, type, mls sensitivity, and mls clearance). It uses the new interfaces from selinux to opaquely store information related to the selinux context and to filter based on that information. It also uses the callback mechanism provided by selinux to refresh the information when a new policy is loaded. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01[PATCH] no need to wank with task_lock() and pinning task down in ↵Al Viro1-9/+1
audit_syscall_exit() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01[PATCH] drop task argument of audit_syscall_{entry,exit}Al Viro1-4/+4
... it's always current, and that's a good thing - allows simpler locking. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01[PATCH] drop gfp_mask in audit_log_exit()Al Viro1-30/+32
now we can do that - all callers are process-synchronous and do not hold any locks. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01[PATCH] move call of audit_free() into do_exit()Al Viro3-10/+4
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-05-01[PATCH] deal with deadlocks in audit_free()Al Viro1-10/+10
Don't assume that audit_log_exit() et al. are called for the context of current; pass the task explicitly. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2006-04-28[PATCH] request_irq(): remove warnings from irq probingAndrew Morton1-2/+4
- Add new SA_PROBEIRQ which suppresses the new sharing-mismatch warning. Some drivers like to use request_irq() to find an unused interrupt slot. - Use it in i82365.c - Kill unused SA_PROBE. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-28[PATCH] off-by-1 in kernel/power/main.cdean gaudet1-1/+1
There's an off-by-1 in kernel/power/main.c:state_store() ... if your kernel just happens to have some non-zero data at pm_states[PM_SUSPEND_MAX] (i.e. one past the end of the array) then it'll let you write anything you want to /sys/power/state and in response the box will enter S5. Signed-off-by: dean gaudet <dean@arctic.org> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
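The bug class is the familiar `<=` vs `<` bound when scanning a fixed-size table; a generic illustration with hypothetical names, not the verbatim main.c code:

    #include <string.h>

    #define NSTATES 4
    static const char *state_names[NSTATES] = { "on", "standby", "mem", "disk" };

    static int lookup_state(const char *buf)
    {
        int i;

        /* A buggy bound of i <= NSTATES reads state_names[NSTATES], one
         * element past the end; whatever junk sits there can "match" an
         * arbitrary string and trigger the wrong suspend action. */
        for (i = 0; i < NSTATES; i++)
            if (state_names[i] && strcmp(buf, state_names[i]) == 0)
                return i;
        return -1;
    }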
2006-04-26[PATCH] Remove __devinit and __cpuinit from notifier_call definitionsChandra Seetharaman7-7/+7
A few of the notifier_chain_register() callers use __init in the definition of notifier_call. This is incorrect, as the function definition must remain available after initialization (the callers do not unregister the notifiers during initialization). This patch fixes all such usages so that the notifier_call function is _not_ placed in the __init section. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-26[PATCH] Remove __devinitdata from notifier block definitionsChandra Seetharaman6-6/+6
A few of the notifier_chain_register() callers use __devinitdata in the definition of the notifier_block data structure. This is incorrect, as the data structure must remain available after initialization (the callers do not unregister the notifiers during initialization). This was leading to an oops when notifier_chain_register() is called for those callback chains after initialization. This patch fixes all such usages so that the notifier_block data structure is _not_ placed in the init data section. Signed-off-by: Chandra Seetharaman <sekharan@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
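In sketch form, with hypothetical driver names, the broken and fixed patterns look like this:

    static int foo_cpu_callback(struct notifier_block *nb,
                                unsigned long action, void *hcpu)
    {
        return NOTIFY_OK;
    }

    /* Wrong: __devinitdata lets this structure be discarded after init,
     * but the notifier chain keeps pointing at it -> later oops. */
    static struct notifier_block foo_nb_buggy __devinitdata = {
        .notifier_call = foo_cpu_callback,
    };

    /* Right: keep the notifier_block in ordinary data for the life of
     * the registration. */
    static struct notifier_block foo_nb = {
        .notifier_call = foo_cpu_callback,
    };

    static int __init foo_init(void)
    {
        return register_cpu_notifier(&foo_nb);
    }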
2006-04-22[RBTREE] Update hrtimers to use rb_parent() accessor macro.David Woodhouse1-2/+2
Also switch it to use the same method of using off-tree nodes as everyone else now does -- set them to point to themselves. Signed-off-by: David Woodhouse <dwmw2@infradead.org>
2006-04-20Merge branch 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-blockLinus Torvalds1-0/+1
* 'for-linus' of git://brick.kernel.dk/data/git/linux-2.6-block: [PATCH] block/elevator.c: remove unused exports [PATCH] splice: fix smaller sized splice reads [PATCH] Don't inherit ->splice_pipe across forks [patch] cleanup: use blk_queue_stopped [PATCH] Document online io scheduler switching
2006-04-20[PATCH] kprobes: NULL out non-relevant fields in struct kretprobeAnanth N Mavinakayanahalli1-0/+3
In cases where a struct kretprobe's *_handler fields are non-NULL, it is possible to cause a system crash, due to the possibility of calls ending up in zombie functions. Documentation clearly states that unused *_handlers should be set to NULL, but kprobe users sometimes fail to do so. Fix it by setting the non-relevant fields of the struct kretprobe to NULL. Signed-off-by: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Acked-by: Jim Keniston <jkenisto@us.ibm.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
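A sketch of correct usage with a hypothetical probe: every handler the user does not need stays NULL, and the patch additionally makes register_kretprobe() clear the embedded kprobe's unused handlers so stack garbage can never be called:

    static int my_ret_handler(struct kretprobe_instance *ri, struct pt_regs *regs)
    {
        return 0;
    }

    static struct kretprobe my_rp = {
        .handler   = my_ret_handler,
        .maxactive = 20,
        /* all other handler fields intentionally left NULL */
    };

    /* my_rp.kp.addr must still point at the probed function before
     * register_kretprobe(&my_rp) is called at module init. */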
2006-04-20[PATCH] Don't inherit ->splice_pipe across forksJens Axboe1-0/+1
It's really task private, so clear that field on fork after copying task structure. Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-19[PATCH] Add more prevent_tail_call()OGAWA Hirofumi1-13/+46
Those also break userland regs like the following:

00000000 <sys_chown16>:
   0:	0f b7 44 24 0c	movzwl 0xc(%esp),%eax
   5:	83 ca ff	or     $0xffffffff,%edx
   8:	0f b7 4c 24 08	movzwl 0x8(%esp),%ecx
   d:	66 83 f8 ff	cmp    $0xffffffff,%ax
  11:	0f 44 c2	cmove  %edx,%eax
  14:	66 83 f9 ff	cmp    $0xffffffff,%cx
  18:	0f 45 d1	cmovne %ecx,%edx
  1b:	89 44 24 0c	mov    %eax,0xc(%esp)
  1f:	89 54 24 08	mov    %edx,0x8(%esp)
  23:	e9 fc ff ff ff	jmp    24 <sys_chown16+0x24>

where the tailcall at the end overwrites the incoming stack-frame. Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> [ I would _really_ like to have a way to tell gcc about calling conventions. The "prevent_tail_call()" macro is pretty ugly ] Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19[PATCH] swsusp: prevent possible image corruption on resumeRafael J. Wysocki1-4/+5
The function free_pagedir() used by swsusp for freeing its internal data structures clears the PG_nosave and PG_nosave_free flags for each page being freed. However, during resume PG_nosave_free set means that the page in question is "unsafe" (ie. it will be overwritten in the process of restoring the saved system state from the image), so it should not be used for the image data. Therefore free_pagedir() should not clear PG_nosave_free if it's called during resume (otherwise "unsafe" pages freed by it may be used for storing the image data and the data may get corrupted later on). Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-19[PATCH] task: Make task list manipulations RCU safeEric W. Biederman2-2/+2
While we can currently walk through thread groups, process groups, and sessions with just the rcu_read_lock, this opens the door to walking the entire task list. We already have all of the other RCU guarantees, so there is no cost in doing this; it should be enough that proc can stop taking the tasklist lock during readdir. prev_task was killed because it has no users, and using it would miss new tasks when doing an rcu traversal. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-14[PATCH] kill unused __put_task_struct_cbEric W. Biederman1-6/+0
Somehow in the midst of dotting i's and crossing t's during the merge up to rc1 we wound up keeping __put_task_struct_cb when it should have been killed as it no longer has any users. Sorry I probably should have caught this while it was still in the -mm tree. Having the old code there gets confusing when reading through the code and trying to understand what is happening. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-14[PATCH] remove kernel/power/pm.c:pm_unregister()Adrian Bunk1-20/+0
Since the last user is removed in -mm, we can now remove this long deprecated function. Signed-off-by: Adrian Bunk <bunk@stusta.de> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2006-04-14[PATCH] fix non-leader exec under ptraceRoland McGrath2-7/+4
This reverts most of commit 30e0fca6c1d7d26f3f2daa4dd2b12c51dadc778a. It broke the case of non-leader MT exec when ptraced. I think the bug it was intended to fix was already addressed by commit 788e05a67c343fa22f2ae1d3ca264e7f15c25eaf. Signed-off-by: Roland McGrath <roland@redhat.com> Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] __group_complete_signal: remove bogus BUG_ONOleg Nesterov1-1/+0
Commit e56d090310d7625ecb43a1eeebd479f04affb48b [PATCH] RCU signal handling made this BUG_ON() unsafe. This code runs under ->siglock, while switch_exec_pids() takes tasklist_lock. Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11Merge branch 'splice' of git://brick.kernel.dk/data/git/linux-2.6-blockLinus Torvalds1-0/+4
* 'splice' of git://brick.kernel.dk/data/git/linux-2.6-block: [PATCH] vfs: add splice_write and splice_read to documentation [PATCH] Remove sys_ prefix of new syscalls from __NR_sys_* [PATCH] splice: warning fix [PATCH] another round of fs/pipe.c cleanups [PATCH] splice: comment styles [PATCH] splice: add Ingo as addition copyright holder [PATCH] splice: unlikely() optimizations [PATCH] splice: speedups and optimizations [PATCH] pipe.c/fifo.c code cleanups [PATCH] get rid of the PIPE_*() macros [PATCH] splice: speedup __generic_file_splice_read [PATCH] splice: add direct fd <-> fd splicing support [PATCH] splice: add optional input and output offsets [PATCH] introduce a "kernel-internal pipe object" abstraction [PATCH] splice: be smarter about calling do_page_cache_readahead() [PATCH] splice: optimize the splice buffer mapping [PATCH] splice: cleanup __generic_file_splice_read() [PATCH] splice: only call wake_up_interruptible() when we really have to [PATCH] splice: potential !page dereference [PATCH] splice: mark the io page as accessed
2006-04-11[PATCH] add cpu_relax to hrtimer_cancelJoe Korty1-0/+1
Add a cpu_relax() to the hand-coded spinwait in hrtimer_cancel(). Signed-off-by: Joe Korty <joe.korty@ccur.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
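The resulting spin-wait has the canonical shape shown below; a sketch (function name is illustrative), where cpu_relax() tells the CPU we are busy-polling, saving power and yielding resources to a sibling thread on SMT:

    int hrtimer_cancel_sketch(struct hrtimer *timer)
    {
        for (;;) {
            int ret = hrtimer_try_to_cancel(timer);

            if (ret >= 0)
                return ret;   /* 0: was not active, 1: cancelled */
            cpu_relax();      /* the added hint */
        }
    }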
2006-04-11[PATCH] build kernel/irq/migration.c only if CONFIG_GENERIC_PENDING_IRQ is setChristoph Hellwig2-5/+3
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] the scheduled unexport of panic_timeoutAdrian Bunk1-1/+0
Implement the scheduled unexport of panic_timeout. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] timer initialisation fixAndrew Morton1-10/+19
We need the boot CPU's tvec_bases[] entry to be initialised super-early in boot, for early_serial_setup(). That runs within setup_arch(), before even per-cpu areas are initialised. The patch changes tvec_bases to use compile-time initialisation, and adds a separate array `tvec_base_done' to keep track of which CPU has had its tvec_bases[] entry initialised (because we can no longer use the zeroness of that tvec_bases[] entry to determine whether it has been initialised). Thanks to Eugene Surovegin <ebs@ebshome.net> for diagnosing this. Cc: Eugene Surovegin <ebs@ebshome.net> Cc: Jan Beulich <jbeulich@novell.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] frv: define MMU mode specific syscalls as 'cond_syscall' and clean ↵Hyok S. Choi1-0/+12
up unneeded macros For some architectures, a few syscalls are not linked in noMMU mode. In that case, the MMU-dependent syscalls need to be defined as 'cond_syscall'. For example, the ARM architecture selectively links sys_mlock depending on the mode configuration. In the case of FRV, this has been managed by an #ifdef CONFIG_MMU macro in arch/frv/kernel/entry.S. However, these conditional macros are just duplicates of what cond_syscall already provides. Compilation test is done with FRV toolchains for both MMU and noMMU mode. Signed-off-by: Hyok S. Choi <hyok.choi@samsung.com> Cc: David Howells <dhowells@redhat.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
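cond_syscall() provides a weak stub that returns -ENOSYS whenever the real implementation is not linked (e.g. noMMU builds). In sketch form, the kernel/sys_ni.c entries look like the following; the exact set added by this patch may differ:

    cond_syscall(sys_mlock);
    cond_syscall(sys_munlock);
    cond_syscall(sys_mlockall);
    cond_syscall(sys_munlockall);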
2006-04-11[PATCH] sched: don't awaken RT tasks on expired arrayMike Galbraith1-1/+1
RT tasks are being awakened on the expired array when expired_starving() is true, whereas they really should be excluded. Fix. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] sched: fix interactive task starvationMike Galbraith1-18/+44
Fix a starvation problem that occurs when a stream of highly interactive tasks delay an array switch for extended periods despite EXPIRED_STARVING(rq) being true. AFAICT, the only choice is to enqueue awakening tasks on the expired array in this case. Without this patch, it can be nearly impossible to remotely login to a busy server, and interactive shell commands can starve for minutes. Also, convert the EXPIRED_STARVING macro into an inline function which humans can understand. Signed-off-by: Mike Galbraith <efault@gmx.de> Acked-by: Ingo Molnar <mingo@elte.hu> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Con Kolivas <kernel@kolivas.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-11[PATCH] splice: add direct fd <-> fd splicing supportJens Axboe1-0/+4
It's more efficient for sendfile() emulation. Basically we cache an internal private pipe and just use that as the intermediate area for pages. Direct splicing is not available from sys_splice(), it is only meant to be used for sendfile() emulation. Additional patch from Ingo Molnar to avoid the PIPE_BUFFERS loop at exit for the normal fast path. Signed-off-by: Jens Axboe <axboe@suse.de>
2006-04-09[PATCH] x86_64: Fix drift with HPET timer enabledJordan Hargrave1-1/+1
If the HPET timer is enabled, the clock can drift by ~3 seconds a day. This is due to the HPET timer not being initialized with the correct setting (still using PIT count). If HZ changes, this drift can become even more pronounced. HPET patch initializes tick_nsec with correct tick_nsec settings for HPET timer. Vojtech comments: "It's not entirely correct (it assumes the HPET ticks totally exactly), but it's significantly better than assuming the PIT error there." Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-04-02BUG_ON() Conversion in kernel/signal.cEric Sesterhenn1-2/+1
This changes if() BUG(); constructs to BUG_ON(), which is cleaner, contains unlikely() and can be better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02BUG_ON() Conversion in kernel/signal.cEric Sesterhenn1-4/+2
This changes if() BUG(); constructs to BUG_ON(), which is cleaner, contains unlikely() and can be better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-02BUG_ON() Conversion in kernel/ptrace.cEric Sesterhenn1-2/+1
This changes if() BUG(); constructs to BUG_ON(), which is cleaner, contains unlikely() and can be better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01Fix comments: s/granuality/granularity/Kalin KOZHUHAROV1-4/+4
I was grepping through the code and some `grep ganularity -R .` didn't catch what I thought. Then looking closer I saw the term "granuality" used in only four places (in comments) and granularity in many more places describing the same idea. Some other facts: dictionary.com does not know such a word define:granuality on google is not found (and pages for granuality are mostly related to patches to the kernel) it has not been discussed as a term on LKML, AFAICS (=Can Search) To be consistent, I think granularity should be used everywhere. Signed-off-by: Kalin KOZHUHAROV <kalin@thinrope.net> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01BUG_ON() Conversion in kernel/printk.cEric Sesterhenn1-4/+2
This changes if() BUG(); constructs to BUG_ON(), which is cleaner, contains unlikely() and can be better optimized away. Signed-off-by: Eric Sesterhenn <snakebyte@gmx.de> Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-04-01help text: SOFTWARE_SUSPEND doesn't need ACPIAdrian Bunk1-1/+1
The note that SOFTWARE_SUSPEND doesn't need APM is helpful, but nowadays the information that it doesn't need ACPI, too, is even more helpful. Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-31[PATCH] wrong error path in dup_fd() leading to oopses in RCUKirill Korotaev1-1/+1
Wrong error path in dup_fd() - it should return NULL on error, not an address of already freed memory :/ Triggered by OpenVZ stress test suite. What is interesting is that it was causing different oopses in RCU like below: Call Trace: [<c013492c>] rcu_do_batch+0x2c/0x80 [<c0134bdd>] rcu_process_callbacks+0x3d/0x70 [<c0126cf3>] tasklet_action+0x73/0xe0 [<c01269aa>] __do_softirq+0x10a/0x130 [<c01058ff>] do_softirq+0x4f/0x60 ======================= [<c0113817>] smp_apic_timer_interrupt+0x77/0x110 [<c0103b54>] apic_timer_interrupt+0x1c/0x24 Code: Bad EIP value. <0>Kernel panic - not syncing: Fatal exception in interrupt Signed-Off-By: Pavel Emelianov <xemul@sw.ru> Signed-Off-By: Dmitry Mishin <dim@openvz.org> Signed-Off-By: Kirill Korotaev <dev@openvz.org> Signed-Off-By: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] pidhash: Refactor the pid hash tableEric W. Biederman2-73/+155
Simplifies the code, reduces the need for 4 pid hash tables, and makes the code more capable. In the discussions I had with Oleg it was felt that to a large extent the cleanup itself justified the work. With struct pid being dynamically allocated, we could create the hash table entry when the pid was allocated and free the hash table entry when the pid was freed, instead of playing with the hash lists whenever a process would attach or detach from a process. For myself the fact that it gave what my previous task_ref patch gave for free with simpler code was a big win. The problem is that if you hold a reference to struct task_struct you lock in 10K of low memory. If you do that in a user controllable way like /proc does, with an unprivileged but hostile user space application with typical resource limits of 1000 fds and 100 processes I can trigger the OOM killer by consuming all of low memory with task structs, on a machine with 1GB of low memory. If I instead hold a reference to struct pid which holds a pointer to my task_struct, I don't suffer from that problem because struct pid is 2 orders of magnitude smaller. In fact struct pid is small enough that most other kernel data structures dwarf it, so simply limiting the number of referring data structures is enough to prevent exhaustion of low memory. This splits the current struct pid into two structures, struct pid and struct pid_link, and reduces our number of hash tables from PIDTYPE_MAX to just one. struct pid_link is the per process linkage into the hash tables and lives in struct task_struct. struct pid is given an independent lifetime, and holds pointers to each of the pid types. The independent life of struct pid simplifies attach_pid, and detach_pid, because we are always manipulating the list of pids and not the hash table. In addition, giving struct pid an independent life makes the concept much more powerful. Kernel data structures can now embed a struct pid * instead of a pid_t and not suffer from pid wrap around problems or from keeping unnecessarily large amounts of memory allocated. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] task: RCU protect task->usageEric W. Biederman1-1/+6
A big problem with rcu protected data structures that are also reference counted is that you must jump through several hoops to increase the reference count. I think someone finally implemented atomic_inc_not_zero(&count) to automate the common case. Unfortunately this means you must special case the rcu access case. When data structures are only visible via rcu in a manner that is not determined by the reference count on the object (i.e. tasks are visible until their zombies are reaped) there is a much simpler technique we can employ: simply delay the decrement of the reference count until the rcu interval is over. What that means is that the proc code that looks up a task and later wants to sleep can now do: rcu_read_lock(); task = find_task_by_pid(some_pid); if (task) { get_task_struct(task); } rcu_read_unlock(); The effect on the rest of the kernel is that put_task_struct becomes cheaper and immediate, and in the case where the task has been reaped it frees the task immediately instead of unnecessarily waiting until the rcu interval is over. Cleanup of task_struct does not happen when its reference count drops to zero; instead cleanup happens when release_task is called. Tasks can only be looked up via rcu before release_task is called. All rcu protected members of task_struct are freed by release_task. Therefore we can move call_rcu from put_task_struct into release_task. And we can modify release_task to not immediately release the reference count but instead have it call put_task_struct from the function it gives to call_rcu. The end result: - get_task_struct is safe in an rcu context where we have just looked up the task. - put_task_struct() simplifies into its old pre rcu self. This reorganization also makes put_task_struct uncallable from modules as it is not exported, but it does not appear to be called from any modules so this should not be an issue, and is trivially fixed. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] resurrect __put_task_structAndrew Morton1-3/+7
This just got nuked in mainline. Bring it back because Eric's patches use it. Cc: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] Make setsid() more robustEric W. Biederman1-4/+15
The core problem: setsid fails if it is called by init. The effect in 2.6.16 and the earlier kernels that have this problem is that if you do a "ps -j 1 or ps -ej 1" you will see that init and several of its children have process group and session == 0, instead of process group == session == 1, despite init calling setsid. The reason it fails is that daemonize calls set_special_pids(1,1) on kernel threads that are launched before /sbin/init is called. The only remaining effect is that current->signal->leader == 0 for init instead of 1, and the setsid call fails. No one has noticed because /sbin/init does not check the return value of setsid. In 2.4, where we don't have the pidhash table and daemonize doesn't exist, setsid actually works for init. I care a lot about pid == 1 not being a special case that we leave broken, because of the container/jail work that I am doing. - Carefully allow init (pid == 1) to call setsid despite the kernel using its session. - Use find_task_by_pid instead of find_pid because find_pid taking a pidtype is going away. Signed-off-by: Eric W. Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-31[PATCH] futex: check and validate timevalsThomas Gleixner2-2/+6
The futex timeval is not checked for correctness. The change does not break existing applications as the timeval is supplied by glibc (and glibc always passes a correct value), but the glibc-internal tests for this functionality fail. Signed-off-by: Thomas Gleixner <tglx@tglx.de> Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
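The validation is the usual timespec sanity check; a sketch of the added check on the user-supplied timeout (helper name hypothetical, not the verbatim futex.c hunk):

    static int timeout_valid(const struct timespec *ts)
    {
        if (ts->tv_sec < 0)
            return 0;
        if (ts->tv_nsec < 0 || ts->tv_nsec >= NSEC_PER_SEC)
            return 0;
        return 1;   /* sane: seconds >= 0, nanoseconds in [0, 1e9) */
    }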
2006-03-31[PATCH] sched: activate SCHED BATCH expiredCon Kolivas1-3/+7
To increase the strength of SCHED_BATCH as a scheduling hint we can activate batch tasks on the expired array since by definition they are latency insensitive tasks. Signed-off-by: Con Kolivas <kernel@kolivas.org> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>