aboutsummaryrefslogtreecommitdiffstats
AgeCommit message (Collapse)AuthorFilesLines
2016-02-23swait: adapt to swait API changes and include file name changeslinux-4.4-y-rt-new-swaitClark Williams7-14/+14
Signed-off-by: Clark Williams <williams@redhat.com>
2016-02-23swork: move simple workqueues to swork and adapt to latest swait changesClark Williams5-16/+16
Signed-off-by: Clark Williams <williams@redhat.com>
2016-02-22Revert "Revert "work-simple: Simple work queue implemenation""Clark Williams3-1/+198
This reverts commit 1d766c3357da71da62039caf1509af1e09f16fb7.
2016-02-22rcu: use simple wait queues where possible in rcutreePaul Gortmaker3-30/+31
As of commit dae6e64d2bcfd4b06304ab864c7e3a4f6b5fedf4 ("rcu: Introduce proper blocking to no-CBs kthreads GP waits") the RCU subsystem started making use of wait queues. Here we convert all additions of RCU wait queues to use simple wait queues, since they don't need the extra overhead of the full wait queue features. Originally this was done for RT kernels[1], since we would get things like... BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 in_atomic(): 1, irqs_disabled(): 1, pid: 8, name: rcu_preempt Pid: 8, comm: rcu_preempt Not tainted Call Trace: [<ffffffff8106c8d0>] __might_sleep+0xd0/0xf0 [<ffffffff817d77b4>] rt_spin_lock+0x24/0x50 [<ffffffff8106fcf6>] __wake_up+0x36/0x70 [<ffffffff810c4542>] rcu_gp_kthread+0x4d2/0x680 [<ffffffff8105f910>] ? __init_waitqueue_head+0x50/0x50 [<ffffffff810c4070>] ? rcu_gp_fqs+0x80/0x80 [<ffffffff8105eabb>] kthread+0xdb/0xe0 [<ffffffff8106b912>] ? finish_task_switch+0x52/0x100 [<ffffffff817e0754>] kernel_thread_helper+0x4/0x10 [<ffffffff8105e9e0>] ? __init_kthread_worker+0x60/0x60 [<ffffffff817e0750>] ? gs_change+0xb/0xb ...and hence simple wait queues were deployed on RT out of necessity (as simple wait uses a raw lock), but mainline might as well take advantage of the more streamline support as well. [1] This is a carry forward of work from v3.10-rt; the original conversion was by Thomas on an earlier -rt version, and Sebastian extended it to additional post-3.10 added RCU waiters; here I've added a commit log and unified the RCU changes into one, and uprev'd it to match mainline RCU. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Thomas Gleixner <tglx@linutronix.de>
2016-02-22rcu: Do not call rcu_nocb_gp_cleanup() while holding rnp->lockDaniel Wagner3-5/+18
rcu_nocb_gp_cleanup() is called while holding rnp->lock. Currently, this is okay because the wake_up_all() in rcu_nocb_gp_cleanup() will not enable the IRQs. lockdep is happy. By switching over using swait this is not true anymore. swake_up_all() enables the IRQs while processing the waiters. __do_softirq() can now run and will eventually call rcu_process_callbacks() which wants to grap nrp->lock. Let's move the rcu_nocb_gp_cleanup() call outside the lock before we switch over to swait. If we would hold the rnp->lock and use swait, lockdep reports following: ================================= [ INFO: inconsistent lock state ] 4.2.0-rc5-00025-g9a73ba0 #136 Not tainted --------------------------------- inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage. rcu_preempt/8 [HC0[0]:SC0[0]:HE1:SE1] takes: (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0 {IN-SOFTIRQ-W} state was registered at: [<ffffffff81109b9f>] __lock_acquire+0xd5f/0x21e0 [<ffffffff8110be0f>] lock_acquire+0xdf/0x2b0 [<ffffffff81841cc9>] _raw_spin_lock_irqsave+0x59/0xa0 [<ffffffff81136991>] rcu_process_callbacks+0x141/0x3c0 [<ffffffff810b1a9d>] __do_softirq+0x14d/0x670 [<ffffffff810b2214>] irq_exit+0x104/0x110 [<ffffffff81844e96>] smp_apic_timer_interrupt+0x46/0x60 [<ffffffff81842e70>] apic_timer_interrupt+0x70/0x80 [<ffffffff810dba66>] rq_attach_root+0xa6/0x100 [<ffffffff810dbc2d>] cpu_attach_domain+0x16d/0x650 [<ffffffff810e4b42>] build_sched_domains+0x942/0xb00 [<ffffffff821777c2>] sched_init_smp+0x509/0x5c1 [<ffffffff821551e3>] kernel_init_freeable+0x172/0x28f [<ffffffff8182cdce>] kernel_init+0xe/0xe0 [<ffffffff8184231f>] ret_from_fork+0x3f/0x70 irq event stamp: 76 hardirqs last enabled at (75): [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60 hardirqs last disabled at (76): [<ffffffff8184116f>] _raw_spin_lock_irq+0x1f/0x90 softirqs last enabled at (0): [<ffffffff810a8df2>] copy_process.part.26+0x602/0x1cf0 softirqs last disabled at (0): [< (null)>] (null) other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(rcu_node_1); <Interrupt> lock(rcu_node_1); *** DEADLOCK *** 1 lock held by rcu_preempt/8: #0: (rcu_node_1){+.?...}, at: [<ffffffff811387c7>] rcu_gp_kthread+0xb97/0xeb0 stack backtrace: CPU: 0 PID: 8 Comm: rcu_preempt Not tainted 4.2.0-rc5-00025-g9a73ba0 #136 Hardware name: Dell Inc. PowerEdge R820/066N7P, BIOS 2.0.20 01/16/2014 0000000000000000 000000006d7e67d8 ffff881fb081fbd8 ffffffff818379e0 0000000000000000 ffff881fb0812a00 ffff881fb081fc38 ffffffff8110813b 0000000000000000 0000000000000001 ffff881f00000001 ffffffff8102fa4f Call Trace: [<ffffffff818379e0>] dump_stack+0x4f/0x7b [<ffffffff8110813b>] print_usage_bug+0x1db/0x1e0 [<ffffffff8102fa4f>] ? save_stack_trace+0x2f/0x50 [<ffffffff811087ad>] mark_lock+0x66d/0x6e0 [<ffffffff81107790>] ? check_usage_forwards+0x150/0x150 [<ffffffff81108898>] mark_held_locks+0x78/0xa0 [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60 [<ffffffff81108a28>] trace_hardirqs_on_caller+0x168/0x220 [<ffffffff81108aed>] trace_hardirqs_on+0xd/0x10 [<ffffffff81841330>] _raw_spin_unlock_irq+0x30/0x60 [<ffffffff810fd1c7>] swake_up_all+0xb7/0xe0 [<ffffffff811386e1>] rcu_gp_kthread+0xab1/0xeb0 [<ffffffff811089bf>] ? trace_hardirqs_on_caller+0xff/0x220 [<ffffffff81841341>] ? _raw_spin_unlock_irq+0x41/0x60 [<ffffffff81137c30>] ? rcu_barrier+0x20/0x20 [<ffffffff810d2014>] kthread+0x104/0x120 [<ffffffff81841330>] ? _raw_spin_unlock_irq+0x30/0x60 [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260 [<ffffffff8184231f>] ret_from_fork+0x3f/0x70 [<ffffffff810d1f10>] ? kthread_create_on_node+0x260/0x260 Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de>
2016-02-22KVM: use simple waitqueue for vcpu->wqMarcelo Tosatti11-40/+41
The problem: On -rt, an emulated LAPIC timer instances has the following path: 1) hard interrupt 2) ksoftirqd is scheduled 3) ksoftirqd wakes up vcpu thread 4) vcpu thread is scheduled This extra context switch introduces unnecessary latency in the LAPIC path for a KVM guest. The solution: Allow waking up vcpu thread from hardirq context, thus avoiding the need for ksoftirqd to be scheduled. Normal waitqueues make use of spinlocks, which on -RT are sleepable locks. Therefore, waking up a waitqueue waiter involves locking a sleeping lock, which is not allowed from hard interrupt context. cyclictest command line: This patch reduces the average latency in my tests from 14us to 11us. Daniel writes: Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency benchmark on mainline. The test was run 1000 times on tip/sched/core 4.4.0-rc8-01134-g0905f04: ./x86-run x86/tscdeadline_latency.flat -cpu host with idle=poll. The test seems not to deliver really stable numbers though most of them are smaller. Paolo write: "Anything above ~10000 cycles means that the host went to C1 or lower---the number means more or less nothing in that case. The mean shows an improvement indeed." Before: min max mean std count 1000.000000 1000.000000 1000.000000 1000.000000 mean 5162.596000 2019270.084000 5824.491541 20681.645558 std 75.431231 622607.723969 89.575700 6492.272062 min 4466.000000 23928.000000 5537.926500 585.864966 25% 5163.000000 1613252.750000 5790.132275 16683.745433 50% 5175.000000 2281919.000000 5834.654000 23151.990026 75% 5190.000000 2382865.750000 5861.412950 24148.206168 max 5228.000000 4175158.000000 6254.827300 46481.048691 After min max mean std count 1000.000000 1000.00000 1000.000000 1000.000000 mean 5143.511000 2076886.10300 5813.312474 21207.357565 std 77.668322 610413.09583 86.541500 6331.915127 min 4427.000000 25103.00000 5529.756600 559.187707 25% 5148.000000 1691272.75000 5784.889825 17473.518244 50% 5160.000000 2308328.50000 5832.025000 23464.837068 75% 5172.000000 2393037.75000 5853.177675 24223.969976 max 5222.000000 3922458.00000 6186.720500 42520.379830 [Patch was originaly based on the swait implementation found in the -rt tree. Daniel ported it to mainline's version and gathered the benchmark numbers for tscdeadline_latency test.] Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com>
2016-02-22kbuild: Add option to turn incompatible pointer check into errorDaniel Wagner1-0/+3
With the introduction of the simple wait API we have two very similar APIs in the kernel. For example wake_up() and swake_up() is only one character away. Although the compiler will warn happily the wrong usage it keeps on going an even links the kernel. Thomas and Peter would rather like to see early missuses reported as error early on. In a first attempt we tried to wrap all swait and wait calls into a macro which has an compile time type assertion. The result was pretty ugly and wasn't able to catch all wrong usages. woken_wake_function(), autoremove_wake_function() and wake_bit_function() are assigned as function pointers. Wrapping them with a macro around is not possible. Prefixing them with '_' was also not a real option because there some users in the kernel which do use them as well. All in all this attempt looked to intrusive and too ugly. An alternative is to turn the pointer type check into an error which catches wrong type uses. Obviously not only the swait/wait ones. That isn't a bad thing either. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de>
2016-02-22wait.[ch]: Introduce the simple waitqueue (swait) implementationPeter Zijlstra (Intel)3-1/+296
The existing wait queue support has support for custom wake up call backs, wake flags, wake key (passed to call back) and exclusive flags that allow wakers to be tagged as exclusive, for limiting the number of wakers. In a lot of cases, none of these features are used, and hence we can benefit from a slimmed down version that lowers memory overhead and reduces runtime overhead. The concept originated from -rt, where waitqueues are a constant source of trouble, as we can't convert the head lock to a raw spinlock due to fancy and long lasting callbacks. With the removal of custom callbacks, we can use a raw lock for queue list manipulations, hence allowing the simple wait support to be used in -rt. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Originally-by: Thomas Gleixner <tglx@linutronix.de> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> [Patch is from PeterZ which is based on Thomas version. Commit message is written by Paul G. Daniel: - And some compile issues fixed. - Added non-lazy implementation of swake_up_locked as suggested by Boqun Feng.]
2016-02-22Revert "wait-simple: Simple waitqueue implementation"Clark Williams3-323/+1
This reverts commit 41f21eadc3c0e3e7f506f4773c99eb4de3090010.
2016-02-22Revert "work-simple: Simple work queue implemenation"Clark Williams3-198/+1
This reverts commit 9596b71b02648051dcd8fe458cd37ba771871682.
2016-02-22Revert "rcu: use simple waitqueues"Clark Williams3-22/+21
This reverts commit e76cd149b42c39c5671a0b1f0b36058fa8ff5733.
2016-02-13v4.4.1-rt6Thomas Gleixner1-0/+1
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13workqueue: Prevent deadlock/stall on RTThomas Gleixner2-15/+53
Austin reported a XFS deadlock/stall on RT where scheduled work gets never exececuted and tasks are waiting for each other for ever. The underlying problem is the modification of the RT code to the handling of workers which are about to go to sleep. In mainline a worker thread which goes to sleep wakes an idle worker if there is more work to do. This happens from the guts of the schedule() function. On RT this must be outside and the accessed data structures are not protected against scheduling due to the spinlock to rtmutex conversion. So the naive solution to this was to move the code outside of the scheduler and protect the data structures by the pool lock. That approach turned out to be a little naive as we cannot call into that code when the thread blocks on a lock, as it is not allowed to block on two locks in parallel. So we dont call into the worker wakeup magic when the worker is blocked on a lock, which causes the deadlock/stall observed by Austin and Mike. Looking deeper into that worker code it turns out that the only relevant data structure which needs to be protected is the list of idle workers which can be woken up. So the solution is to protect the list manipulation operations with preempt_enable/disable pairs on RT and call unconditionally into the worker code even when the worker is blocked on a lock. The preemption protection is safe as there is nothing which can fiddle with the list outside of thread context. Reported-and_tested-by: Austin Schuh <austin@peloton-tech.com> Reported-and_tested-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://vger.kernel.org/r/alpine.DEB.2.10.1406271249510.5170@nanos Cc: Richard Weinberger <richard.weinberger@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org>
2016-02-13md: disable bcacheSebastian Andrzej Siewior1-0/+1
It uses anon semaphores |drivers/md/bcache/request.c: In function ‘cached_dev_write_complete’: |drivers/md/bcache/request.c:1007:2: error: implicit declaration of function ‘up_read_non_owner’ [-Werror=implicit-function-declaration] | up_read_non_owner(&dc->writeback_lock); | ^ |drivers/md/bcache/request.c: In function ‘request_write’: |drivers/md/bcache/request.c:1033:2: error: implicit declaration of function ‘down_read_non_owner’ [-Werror=implicit-function-declaration] | down_read_non_owner(&dc->writeback_lock); | ^ either we get rid of those or we have to introduce them… Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rt,ntp: Move call to schedule_delayed_work() to helper threadSteven Rostedt1-0/+43
The ntp code for notify_cmos_timer() is called from a hard interrupt context. schedule_delayed_work() under PREEMPT_RT_FULL calls spinlocks that have been converted to mutexes, thus calling schedule_delayed_work() from interrupt is not safe. Add a helper thread that does the call to schedule_delayed_work and wake up that thread instead of calling schedule_delayed_work() directly. This is only for CONFIG_PREEMPT_RT_FULL, otherwise the code still calls schedule_delayed_work() directly in irq context. Note: There's a few places in the kernel that do this. Perhaps the RT code should have a dedicated thread that does the checks. Just register a notifier on boot up for your check and wake up the thread when needed. This will be a todo. Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-02-13memcontrol: Prevent scheduling while atomic in cgroup codeMike Galbraith1-2/+5
mm, memcg: make refill_stock() use get_cpu_light() Nikita reported the following memcg scheduling while atomic bug: Call Trace: [e22d5a90] [c0007ea8] show_stack+0x4c/0x168 (unreliable) [e22d5ad0] [c0618c04] __schedule_bug+0x94/0xb0 [e22d5ae0] [c060b9ec] __schedule+0x530/0x550 [e22d5bf0] [c060bacc] schedule+0x30/0xbc [e22d5c00] [c060ca24] rt_spin_lock_slowlock+0x180/0x27c [e22d5c70] [c00b39dc] res_counter_uncharge_until+0x40/0xc4 [e22d5ca0] [c013ca88] drain_stock.isra.20+0x54/0x98 [e22d5cc0] [c01402ac] __mem_cgroup_try_charge+0x2e8/0xbac [e22d5d70] [c01410d4] mem_cgroup_charge_common+0x3c/0x70 [e22d5d90] [c0117284] __do_fault+0x38c/0x510 [e22d5df0] [c011a5f4] handle_pte_fault+0x98/0x858 [e22d5e50] [c060ed08] do_page_fault+0x42c/0x6fc [e22d5f40] [c000f5b4] handle_page_fault+0xc/0x80 What happens: refill_stock() get_cpu_var() drain_stock() res_counter_uncharge() res_counter_uncharge_until() spin_lock() <== boom Fix it by replacing get/put_cpu_var() with get/put_cpu_light(). Reported-by: Nikita Yushchenko <nyushchenko@dev.rtsoft.ru> Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13cgroups: use simple wait in css_release()Sebastian Andrzej Siewior2-4/+7
To avoid: |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:914 |in_atomic(): 1, irqs_disabled(): 0, pid: 92, name: rcuc/11 |2 locks held by rcuc/11/92: | #0: (rcu_callback){......}, at: [<ffffffff810e037e>] rcu_cpu_kthread+0x3de/0x940 | #1: (rcu_read_lock_sched){......}, at: [<ffffffff81328390>] percpu_ref_call_confirm_rcu+0x0/0xd0 |Preemption disabled at:[<ffffffff813284e2>] percpu_ref_switch_to_atomic_rcu+0x82/0xc0 |CPU: 11 PID: 92 Comm: rcuc/11 Not tainted 3.18.7-rt0+ #1 | ffff8802398cdf80 ffff880235f0bc28 ffffffff815b3a12 0000000000000000 | 0000000000000000 ffff880235f0bc48 ffffffff8109aa16 0000000000000000 | ffff8802398cdf80 ffff880235f0bc78 ffffffff815b8dd4 000000000000df80 |Call Trace: | [<ffffffff815b3a12>] dump_stack+0x4f/0x7c | [<ffffffff8109aa16>] __might_sleep+0x116/0x190 | [<ffffffff815b8dd4>] rt_spin_lock+0x24/0x60 | [<ffffffff8108d2cd>] queue_work_on+0x6d/0x1d0 | [<ffffffff8110c881>] css_release+0x81/0x90 | [<ffffffff8132844e>] percpu_ref_call_confirm_rcu+0xbe/0xd0 | [<ffffffff813284e2>] percpu_ref_switch_to_atomic_rcu+0x82/0xc0 | [<ffffffff810e03e5>] rcu_cpu_kthread+0x445/0x940 | [<ffffffff81098a2d>] smpboot_thread_fn+0x18d/0x2d0 | [<ffffffff810948d8>] kthread+0xe8/0x100 | [<ffffffff815b9c3c>] ret_from_fork+0x7c/0xb0 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13i915: bogus warning from i915 when running on PREEMPT_RTClark Williams1-1/+1
The i915 driver has a 'WARN_ON(!in_interrupt())' in the display handler, which whines constanly on the RT kernel (since the interrupt is actually handled in a threaded handler and not actual interrupt context). Change the WARN_ON to WARN_ON_NORT Tested-by: Joakim Hernberg <jhernberg@alchemy.lu> Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13drm/i915: drop trace_i915_gem_ring_dispatch on rtSebastian Andrzej Siewior1-0/+2
This tracepoint is responsible for: |[<814cc358>] __schedule_bug+0x4d/0x59 |[<814d24cc>] __schedule+0x88c/0x930 |[<814d3b90>] ? _raw_spin_unlock_irqrestore+0x40/0x50 |[<814d3b95>] ? _raw_spin_unlock_irqrestore+0x45/0x50 |[<810b57b5>] ? task_blocks_on_rt_mutex+0x1f5/0x250 |[<814d27d9>] schedule+0x29/0x70 |[<814d3423>] rt_spin_lock_slowlock+0x15b/0x278 |[<814d3786>] rt_spin_lock+0x26/0x30 |[<a00dced9>] gen6_gt_force_wake_get+0x29/0x60 [i915] |[<a00e183f>] gen6_ring_get_irq+0x5f/0x100 [i915] |[<a00b2a33>] ftrace_raw_event_i915_gem_ring_dispatch+0xe3/0x100 [i915] |[<a00ac1b3>] i915_gem_do_execbuffer.isra.13+0xbd3/0x1430 [i915] |[<810f8943>] ? trace_buffer_unlock_commit+0x43/0x60 |[<8113e8d2>] ? ftrace_raw_event_kmem_alloc+0xd2/0x180 |[<8101d063>] ? native_sched_clock+0x13/0x80 |[<a00acf29>] i915_gem_execbuffer2+0x99/0x280 [i915] |[<a00114a3>] drm_ioctl+0x4c3/0x570 [drm] |[<8101d0d9>] ? sched_clock+0x9/0x10 |[<a00ace90>] ? i915_gem_execbuffer+0x480/0x480 [i915] |[<810f1c18>] ? rb_commit+0x68/0xa0 |[<810f1c6c>] ? ring_buffer_unlock_commit+0x1c/0xa0 |[<81197467>] do_vfs_ioctl+0x97/0x540 |[<81021318>] ? ftrace_raw_event_sys_enter+0xd8/0x130 |[<811979a1>] sys_ioctl+0x91/0xb0 |[<814db931>] tracesys+0xe1/0xe6 Chris Wilson does not like to move i915_trace_irq_get() out of the macro |No. This enables the IRQ, as well as making a number of |very expensively serialised read, unconditionally. so it is gone now on RT. Reported-by: Joakim Hernberg <jbh@alchemy.lu> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13gpu/i915: don't open code these thingsSebastian Andrzej Siewior1-1/+1
The opencode part is gone in 1f83fee0 ("drm/i915: clear up wedged transitions") the owner check is still there. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13cpufreq: drop K8's driver from beeing selectedSebastian Andrzej Siewior1-1/+1
Ralf posted a picture of a backtrace from | powernowk8_target_fn() -> transition_frequency_fidvid() and then at the | end: | 932 policy = cpufreq_cpu_get(smp_processor_id()); | 933 cpufreq_cpu_put(policy); crashing the system on -RT. I assumed that policy was a NULL pointer but was rulled out. Since Ralf can't do any more investigations on this and I have no machine with this, I simply switch it off. Reported-by: Ralf Mardorf <ralf.mardorf@alice-dsl.net> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13mmci: Remove bogus local_irq_save()Thomas Gleixner1-5/+0
On !RT interrupt runs with interrupts disabled. On RT it's in a thread, so no need to disable interrupts at all. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13i2c/omap: drop the lock hard irq contextSebastian Andrzej Siewior1-4/+1
The lock is taken while reading two registers. On RT the first lock is taken in hard irq where it might sleep and in the threaded irq. The threaded irq runs in oneshot mode so the hard irq does not run until the thread the completes so there is no reason to grab the lock. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13leds: trigger: disable CPU trigger on -RTSebastian Andrzej Siewior1-1/+1
as it triggers: |CPU: 0 PID: 0 Comm: swapper Not tainted 3.12.8-rt10 #141 |[<c0014aa4>] (unwind_backtrace+0x0/0xf8) from [<c0012788>] (show_stack+0x1c/0x20) |[<c0012788>] (show_stack+0x1c/0x20) from [<c043c8dc>] (dump_stack+0x20/0x2c) |[<c043c8dc>] (dump_stack+0x20/0x2c) from [<c004c5e8>] (__might_sleep+0x13c/0x170) |[<c004c5e8>] (__might_sleep+0x13c/0x170) from [<c043f270>] (__rt_spin_lock+0x28/0x38) |[<c043f270>] (__rt_spin_lock+0x28/0x38) from [<c043fa00>] (rt_read_lock+0x68/0x7c) |[<c043fa00>] (rt_read_lock+0x68/0x7c) from [<c036cf74>] (led_trigger_event+0x2c/0x5c) |[<c036cf74>] (led_trigger_event+0x2c/0x5c) from [<c036e0bc>] (ledtrig_cpu+0x54/0x5c) |[<c036e0bc>] (ledtrig_cpu+0x54/0x5c) from [<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) |[<c000ffd8>] (arch_cpu_idle_exit+0x18/0x1c) from [<c00590b8>] (cpu_startup_entry+0xa8/0x234) |[<c00590b8>] (cpu_startup_entry+0xa8/0x234) from [<c043b2cc>] (rest_init+0xb8/0xe0) |[<c043b2cc>] (rest_init+0xb8/0xe0) from [<c061ebe0>] (start_kernel+0x2c4/0x380) Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13arm+arm64: lazy-preempt: add TIF_NEED_RESCHED_LAZY to _TIF_WORK_MASKSebastian Andrzej Siewior3-6/+13
_TIF_WORK_MASK is used to check for TIF_NEED_RESCHED so we need to check for TIF_NEED_RESCHED_LAZY here, too. Reported-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13arch/arm64: Add lazy preempt supportAnders Roxell4-3/+15
arm64 is missing support for PREEMPT_RT. The main feature which is lacking is support for lazy preemption. The arch-specific entry code, thread information structure definitions, and associated data tables have to be extended to provide this support. Then the Kconfig file has to be extended to indicate the support is available, and also to indicate that support for full RT preemption is now available. Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
2016-02-13powerpc: Add support for lazy preemptionThomas Gleixner5-11/+33
Implement the powerpc pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13arm: Add support for lazy preemptionThomas Gleixner5-3/+18
Implement the arm pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13x86: Support for lazy preemptionThomas Gleixner6-2/+43
Implement the x86 pieces for lazy preempt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13preempt-lazy: Add the lazy-preemption check to preempt_schedule()Sebastian Andrzej Siewior1-8/+28
Probably in the rebase onto v4.1 this check got moved into less commonly used preempt_schedule_notrace(). This patch ensures that both functions use it. Reported-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13sched: Add support for lazy preemptionThomas Gleixner13-29/+204
It has become an obsession to mitigate the determinism vs. throughput loss of RT. Looking at the mainline semantics of preemption points gives a hint why RT sucks throughput wise for ordinary SCHED_OTHER tasks. One major issue is the wakeup of tasks which are right away preempting the waking task while the waking task holds a lock on which the woken task will block right after having preempted the wakee. In mainline this is prevented due to the implicit preemption disable of spin/rw_lock held regions. On RT this is not possible due to the fully preemptible nature of sleeping spinlocks. Though for a SCHED_OTHER task preempting another SCHED_OTHER task this is really not a correctness issue. RT folks are concerned about SCHED_FIFO/RR tasks preemption and not about the purely fairness driven SCHED_OTHER preemption latencies. So I introduced a lazy preemption mechanism which only applies to SCHED_OTHER tasks preempting another SCHED_OTHER task. Aside of the existing preempt_count each tasks sports now a preempt_lazy_count which is manipulated on lock acquiry and release. This is slightly incorrect as for lazyness reasons I coupled this on migrate_disable/enable so some other mechanisms get the same treatment (e.g. get_cpu_light). Now on the scheduler side instead of setting NEED_RESCHED this sets NEED_RESCHED_LAZY in case of a SCHED_OTHER/SCHED_OTHER preemption and therefor allows to exit the waking task the lock held region before the woken task preempts. That also works better for cross CPU wakeups as the other side can stay in the adaptive spinning loop. For RT class preemption there is no change. This simply sets NEED_RESCHED and forgoes the lazy preemption counter. Initial test do not expose any observable latency increasement, but history shows that I've been proven wrong before :) The lazy preemption mode is per default on, but with CONFIG_SCHED_DEBUG enabled it can be disabled via: # echo NO_PREEMPT_LAZY >/sys/kernel/debug/sched_features and reenabled via # echo PREEMPT_LAZY >/sys/kernel/debug/sched_features The test results so far are very machine and workload dependent, but there is a clear trend that it enhances the non RT workload performance. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13rcu: make RCU_BOOST default on RTSebastian Andrzej Siewior1-2/+2
Since it is no longer invoked from the softirq people run into OOM more often if the priority of the RCU thread is too low. Making boosting default on RT should help in those case and it can be switched off if someone knows better. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rcu: Eliminate softirq processing from rcutreePaul E. McKenney3-147/+123
Running RCU out of softirq is a problem for some workloads that would like to manage RCU core processing independently of other softirq work, for example, setting kthread priority. This commit therefore moves the RCU core work from softirq to a per-CPU/per-flavor SCHED_OTHER kthread named rcuc. The SCHED_OTHER approach avoids the scalability problems that appeared with the earlier attempt to move RCU core processing to from softirq to kthreads. That said, kernels built with RCU_BOOST=y will run the rcuc kthreads at the RCU-boosting priority. Reported-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Mike Galbraith <bitbucket@online.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rcu: Disable RCU_FAST_NO_HZ on RTThomas Gleixner1-1/+1
This uses a timer_list timer from the irq disabled guts of the idle code. Disable it for now to prevent wreckage. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13kernel/perf: mark perf_cpu_context's timer as irqsafeSebastian Andrzej Siewior1-0/+1
Otherwise we get a WARN_ON() backtrace and some events are reported as "not counted". Cc: stable-rt@vger.kernel.org Reported-by: Yang Shi <yang.shi@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13perf: Make swevent hrtimer run in irq instead of softirqYong Zhang1-0/+1
Otherwise we get a deadlock like below: [ 1044.042749] BUG: scheduling while atomic: ksoftirqd/21/141/0x00010003 [ 1044.042752] INFO: lockdep is turned off. [ 1044.042754] Modules linked in: [ 1044.042757] Pid: 141, comm: ksoftirqd/21 Tainted: G W 3.4.0-rc2-rt3-23676-ga723175-dirty #29 [ 1044.042759] Call Trace: [ 1044.042761] <IRQ> [<ffffffff8107d8e5>] __schedule_bug+0x65/0x80 [ 1044.042770] [<ffffffff8168978c>] __schedule+0x83c/0xa70 [ 1044.042775] [<ffffffff8106bdd2>] ? prepare_to_wait+0x32/0xb0 [ 1044.042779] [<ffffffff81689a5e>] schedule+0x2e/0xa0 [ 1044.042782] [<ffffffff81071ebd>] hrtimer_wait_for_timer+0x6d/0xb0 [ 1044.042786] [<ffffffff8106bb30>] ? wake_up_bit+0x40/0x40 [ 1044.042790] [<ffffffff81071f20>] hrtimer_cancel+0x20/0x40 [ 1044.042794] [<ffffffff8111da0c>] perf_swevent_cancel_hrtimer+0x3c/0x50 [ 1044.042798] [<ffffffff8111da31>] task_clock_event_stop+0x11/0x40 [ 1044.042802] [<ffffffff8111da6e>] task_clock_event_del+0xe/0x10 [ 1044.042805] [<ffffffff8111c568>] event_sched_out+0x118/0x1d0 [ 1044.042809] [<ffffffff8111c649>] group_sched_out+0x29/0x90 [ 1044.042813] [<ffffffff8111ed7e>] __perf_event_disable+0x18e/0x200 [ 1044.042817] [<ffffffff8111c343>] remote_function+0x63/0x70 [ 1044.042821] [<ffffffff810b0aae>] generic_smp_call_function_single_interrupt+0xce/0x120 [ 1044.042826] [<ffffffff81022bc7>] smp_call_function_single_interrupt+0x27/0x40 [ 1044.042831] [<ffffffff8168d50c>] call_function_single_interrupt+0x6c/0x80 [ 1044.042833] <EOI> [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20 [ 1044.042840] [<ffffffff8168b970>] ? _raw_spin_unlock_irq+0x30/0x70 [ 1044.042844] [<ffffffff8168b976>] ? _raw_spin_unlock_irq+0x36/0x70 [ 1044.042848] [<ffffffff810702e2>] run_hrtimer_softirq+0xc2/0x200 [ 1044.042853] [<ffffffff811275b0>] ? perf_event_overflow+0x20/0x20 [ 1044.042857] [<ffffffff81045265>] __do_softirq_common+0xf5/0x3a0 [ 1044.042862] [<ffffffff81045c3d>] __thread_do_softirq+0x15d/0x200 [ 1044.042865] [<ffffffff81045dda>] run_ksoftirqd+0xfa/0x210 [ 1044.042869] [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200 [ 1044.042873] [<ffffffff81045ce0>] ? __thread_do_softirq+0x200/0x200 [ 1044.042877] [<ffffffff8106b596>] kthread+0xb6/0xc0 [ 1044.042881] [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70 [ 1044.042886] [<ffffffff8168d994>] kernel_thread_helper+0x4/0x10 [ 1044.042889] [<ffffffff8107d98c>] ? finish_task_switch+0x8c/0x110 [ 1044.042894] [<ffffffff8168b97b>] ? _raw_spin_unlock_irq+0x3b/0x70 [ 1044.042897] [<ffffffff8168bd5d>] ? retint_restore_args+0xe/0xe [ 1044.042900] [<ffffffff8106b4e0>] ? kthreadd+0x1e0/0x1e0 [ 1044.042902] [<ffffffff8168d990>] ? gs_change+0xb/0xb Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1341476476-5666-1-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2016-02-13lockdep: selftest: fix warnings due to missing PREEMPT_RT conditionalsJosh Cartwright1-0/+27
"lockdep: Selftest: Only do hardirq context test for raw spinlock" disabled the execution of certain tests with PREEMPT_RT_FULL, but did not prevent the tests from still being defined. This leads to warnings like: ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:574:1: warning: 'irqsafe1_hard_rlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_12' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:577:1: warning: 'irqsafe1_hard_wlock_21' defined but not used [-Wunused-function] ./linux/lib/locking-selftest.c:580:1: warning: 'irqsafe1_soft_spin_12' defined but not used [-Wunused-function] ... Fixed by wrapping the test definitions in #ifndef CONFIG_PREEMPT_RT_FULL conditionals. Signed-off-by: Josh Cartwright <josh.cartwright@ni.com> Signed-off-by: Xander Huff <xander.huff@ni.com> Acked-by: Gratian Crisan <gratian.crisan@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13lockdep: selftest: Only do hardirq context test for raw spinlockYong Zhang1-0/+23
On -rt there is no softirq context any more and rwlock is sleepable, disable softirq context test and rwlock+irq test. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Yong Zhang <yong.zhang@windriver.com> Link: http://lkml.kernel.org/r/1334559716-18447-3-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13crypto: Convert crypto notifier chain to SRCUPeter Zijlstra3-7/+7
The crypto notifier deadlocks on RT. Though this can be a real deadlock on mainline as well due to fifo fair rwsems. The involved parties here are: [ 82.172678] swapper/0 S 0000000000000001 0 1 0 0x00000000 [ 82.172682] ffff88042f18fcf0 0000000000000046 ffff88042f18fc80 ffffffff81491238 [ 82.172685] 0000000000011cc0 0000000000011cc0 ffff88042f18c040 ffff88042f18ffd8 [ 82.172688] 0000000000011cc0 0000000000011cc0 ffff88042f18ffd8 0000000000011cc0 [ 82.172689] Call Trace: [ 82.172697] [<ffffffff81491238>] ? _raw_spin_unlock_irqrestore+0x6c/0x7a [ 82.172701] [<ffffffff8148fd3f>] schedule+0x64/0x66 [ 82.172704] [<ffffffff8148ec6b>] schedule_timeout+0x27/0xd0 [ 82.172708] [<ffffffff81043c0c>] ? unpin_current_cpu+0x1a/0x6c [ 82.172713] [<ffffffff8106e491>] ? migrate_enable+0x12f/0x141 [ 82.172716] [<ffffffff8148fbbd>] wait_for_common+0xbb/0x11f [ 82.172719] [<ffffffff810709f2>] ? try_to_wake_up+0x182/0x182 [ 82.172722] [<ffffffff8148fc96>] wait_for_completion_interruptible+0x1d/0x2e [ 82.172726] [<ffffffff811debfd>] crypto_wait_for_test+0x49/0x6b [ 82.172728] [<ffffffff811ded32>] crypto_register_alg+0x53/0x5a [ 82.172730] [<ffffffff811ded6c>] crypto_register_algs+0x33/0x72 [ 82.172734] [<ffffffff81ad7686>] ? aes_init+0x12/0x12 [ 82.172737] [<ffffffff81ad76ea>] aesni_init+0x64/0x66 [ 82.172741] [<ffffffff81000318>] do_one_initcall+0x7f/0x13b [ 82.172744] [<ffffffff81ac4d34>] kernel_init+0x199/0x22c [ 82.172747] [<ffffffff81ac44ef>] ? loglevel+0x31/0x31 [ 82.172752] [<ffffffff814987c4>] kernel_thread_helper+0x4/0x10 [ 82.172755] [<ffffffff81491574>] ? retint_restore_args+0x13/0x13 [ 82.172759] [<ffffffff81ac4b9b>] ? start_kernel+0x3ca/0x3ca [ 82.172761] [<ffffffff814987c0>] ? gs_change+0x13/0x13 [ 82.174186] cryptomgr_test S 0000000000000001 0 41 2 0x00000000 [ 82.174189] ffff88042c971980 0000000000000046 ffffffff81d74830 0000000000000292 [ 82.174192] 0000000000011cc0 0000000000011cc0 ffff88042c96eb80 ffff88042c971fd8 [ 82.174195] 0000000000011cc0 0000000000011cc0 ffff88042c971fd8 0000000000011cc0 [ 82.174195] Call Trace: [ 82.174198] [<ffffffff8148fd3f>] schedule+0x64/0x66 [ 82.174201] [<ffffffff8148ec6b>] schedule_timeout+0x27/0xd0 [ 82.174204] [<ffffffff81043c0c>] ? unpin_current_cpu+0x1a/0x6c [ 82.174206] [<ffffffff8106e491>] ? migrate_enable+0x12f/0x141 [ 82.174209] [<ffffffff8148fbbd>] wait_for_common+0xbb/0x11f [ 82.174212] [<ffffffff810709f2>] ? try_to_wake_up+0x182/0x182 [ 82.174215] [<ffffffff8148fc96>] wait_for_completion_interruptible+0x1d/0x2e [ 82.174218] [<ffffffff811e4883>] cryptomgr_notify+0x280/0x385 [ 82.174221] [<ffffffff814943de>] notifier_call_chain+0x6b/0x98 [ 82.174224] [<ffffffff8108a11c>] ? rt_down_read+0x10/0x12 [ 82.174227] [<ffffffff810677cd>] __blocking_notifier_call_chain+0x70/0x8d [ 82.174230] [<ffffffff810677fe>] blocking_notifier_call_chain+0x14/0x16 [ 82.174234] [<ffffffff811dd272>] crypto_probing_notify+0x24/0x50 [ 82.174236] [<ffffffff811dd7a1>] crypto_alg_mod_lookup+0x3e/0x74 [ 82.174238] [<ffffffff811dd949>] crypto_alloc_base+0x36/0x8f [ 82.174241] [<ffffffff811e9408>] cryptd_alloc_ablkcipher+0x6e/0xb5 [ 82.174243] [<ffffffff811dd591>] ? kzalloc.clone.5+0xe/0x10 [ 82.174246] [<ffffffff8103085d>] ablk_init_common+0x1d/0x38 [ 82.174249] [<ffffffff8103852a>] ablk_ecb_init+0x15/0x17 [ 82.174251] [<ffffffff811dd8c6>] __crypto_alloc_tfm+0xc7/0x114 [ 82.174254] [<ffffffff811e0caa>] ? crypto_lookup_skcipher+0x1f/0xe4 [ 82.174256] [<ffffffff811e0dcf>] crypto_alloc_ablkcipher+0x60/0xa5 [ 82.174258] [<ffffffff811e5bde>] alg_test_skcipher+0x24/0x9b [ 82.174261] [<ffffffff8106d96d>] ? finish_task_switch+0x3f/0xfa [ 82.174263] [<ffffffff811e6b8e>] alg_test+0x16f/0x1d7 [ 82.174267] [<ffffffff811e45ac>] ? cryptomgr_probe+0xac/0xac [ 82.174269] [<ffffffff811e45d8>] cryptomgr_test+0x2c/0x47 [ 82.174272] [<ffffffff81061161>] kthread+0x7e/0x86 [ 82.174275] [<ffffffff8106d9dd>] ? finish_task_switch+0xaf/0xfa [ 82.174278] [<ffffffff814987c4>] kernel_thread_helper+0x4/0x10 [ 82.174281] [<ffffffff81491574>] ? retint_restore_args+0x13/0x13 [ 82.174284] [<ffffffff810610e3>] ? __init_kthread_worker+0x8c/0x8c [ 82.174287] [<ffffffff814987c0>] ? gs_change+0x13/0x13 [ 82.174329] cryptomgr_probe D 0000000000000002 0 47 2 0x00000000 [ 82.174332] ffff88042c991b70 0000000000000046 ffff88042c991bb0 0000000000000006 [ 82.174335] 0000000000011cc0 0000000000011cc0 ffff88042c98ed00 ffff88042c991fd8 [ 82.174338] 0000000000011cc0 0000000000011cc0 ffff88042c991fd8 0000000000011cc0 [ 82.174338] Call Trace: [ 82.174342] [<ffffffff8148fd3f>] schedule+0x64/0x66 [ 82.174344] [<ffffffff814901ad>] __rt_mutex_slowlock+0x85/0xbe [ 82.174347] [<ffffffff814902d2>] rt_mutex_slowlock+0xec/0x159 [ 82.174351] [<ffffffff81089c4d>] rt_mutex_fastlock.clone.8+0x29/0x2f [ 82.174353] [<ffffffff81490372>] rt_mutex_lock+0x33/0x37 [ 82.174356] [<ffffffff8108a0f2>] __rt_down_read+0x50/0x5a [ 82.174358] [<ffffffff8108a11c>] ? rt_down_read+0x10/0x12 [ 82.174360] [<ffffffff8108a11c>] rt_down_read+0x10/0x12 [ 82.174363] [<ffffffff810677b5>] __blocking_notifier_call_chain+0x58/0x8d [ 82.174366] [<ffffffff810677fe>] blocking_notifier_call_chain+0x14/0x16 [ 82.174369] [<ffffffff811dd272>] crypto_probing_notify+0x24/0x50 [ 82.174372] [<ffffffff811debd6>] crypto_wait_for_test+0x22/0x6b [ 82.174374] [<ffffffff811decd3>] crypto_register_instance+0xb4/0xc0 [ 82.174377] [<ffffffff811e9b76>] cryptd_create+0x378/0x3b6 [ 82.174379] [<ffffffff811de512>] ? __crypto_lookup_template+0x5b/0x63 [ 82.174382] [<ffffffff811e4545>] cryptomgr_probe+0x45/0xac [ 82.174385] [<ffffffff811e4500>] ? crypto_alloc_pcomp+0x1b/0x1b [ 82.174388] [<ffffffff81061161>] kthread+0x7e/0x86 [ 82.174391] [<ffffffff8106d9dd>] ? finish_task_switch+0xaf/0xfa [ 82.174394] [<ffffffff814987c4>] kernel_thread_helper+0x4/0x10 [ 82.174398] [<ffffffff81491574>] ? retint_restore_args+0x13/0x13 [ 82.174401] [<ffffffff810610e3>] ? __init_kthread_worker+0x8c/0x8c [ 82.174403] [<ffffffff814987c0>] ? gs_change+0x13/0x13 cryptomgr_test spawns the cryptomgr_probe thread from the notifier call. The probe thread fires the same notifier as the test thread and deadlocks on the rwsem on RT. Now this is a potential deadlock in mainline as well, because we have fifo fair rwsems. If another thread blocks with a down_write() on the notifier chain before the probe thread issues the down_read() it will block the probe thread and the whole party is dead locked. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13net: Add a mutex around devnet_rename_seqSebastian Andrzej Siewior1-14/+20
On RT write_seqcount_begin() disables preemption and device_rename() allocates memory with GFP_KERNEL and grabs later the sysfs_mutex mutex. Serialize with a mutex and add use the non preemption disabling __write_seqcount_begin(). To avoid writer starvation, let the reader grab the mutex and release it when it detects a writer in progress. This keeps the normal case (no reader on the fly) fast. [ tglx: Instead of replacing the seqcount by a mutex, add the mutex ] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13net: netfilter: Serialize xt_write_recseq sections on RTThomas Gleixner2-0/+13
The netfilter code relies only on the implicit semantics of local_bh_disable() for serializing wt_write_recseq sections. RT breaks that and needs explicit serialization here. Reported-by: Peter LaDow <petela@gocougs.wsu.edu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13net/core: protect users of napi_alloc_cache against reentranceSebastian Andrzej Siewior1-4/+14
On -RT the code running in BH can not be moved to another CPU so CPU local variable remain local. However the code can be preempted and another task may enter BH accessing the same CPU using the same napi_alloc_cache variable. This patch ensures that each user of napi_alloc_cache uses a local lock. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13net: Another local_irq_disable/kmalloc headacheThomas Gleixner1-4/+6
Replace it by a local lock. Though that's pretty inefficient :( Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13net: Remove preemption disabling in netif_rx()Priyanka Jain1-4/+4
1)enqueue_to_backlog() (called from netif_rx) should be bind to a particluar CPU. This can be achieved by disabling migration. No need to disable preemption 2)Fixes crash "BUG: scheduling while atomic: ksoftirqd" in case of RT. If preemption is disabled, enqueue_to_backog() is called in atomic context. And if backlog exceeds its count, kfree_skb() is called. But in RT, kfree_skb() might gets scheduled out, so it expects non atomic context. 3)When CONFIG_PREEMPT_RT_FULL is not defined, migrate_enable(), migrate_disable() maps to preempt_enable() and preempt_disable(), so no change in functionality in case of non-RT. -Replace preempt_enable(), preempt_disable() with migrate_enable(), migrate_disable() respectively -Replace get_cpu(), put_cpu() with get_cpu_light(), put_cpu_light() respectively Signed-off-by: Priyanka Jain <Priyanka.Jain@freescale.com> Acked-by: Rajan Srivastava <Rajan.Srivastava@freescale.com> Cc: <rostedt@goodmis.orgn> Link: http://lkml.kernel.org/r/1337227511-2271-1-git-send-email-Priyanka.Jain@freescale.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13scsi: qla2xxx: Use local_irq_save_nort() in qla2x00_pollJohn Kacur1-2/+2
RT triggers the following: [ 11.307652] [<ffffffff81077b27>] __might_sleep+0xe7/0x110 [ 11.307663] [<ffffffff8150e524>] rt_spin_lock+0x24/0x60 [ 11.307670] [<ffffffff8150da78>] ? rt_spin_lock_slowunlock+0x78/0x90 [ 11.307703] [<ffffffffa0272d83>] qla24xx_intr_handler+0x63/0x2d0 [qla2xxx] [ 11.307736] [<ffffffffa0262307>] qla2x00_poll+0x67/0x90 [qla2xxx] Function qla2x00_poll does local_irq_save() before calling qla24xx_intr_handler which has a spinlock. Since spinlocks are sleepable on rt, it is not allowed to call them with interrupts disabled. Therefore we use local_irq_save_nort() instead which saves flags without disabling interrupts. This fix needs to be applied to v3.0-rt, v3.2-rt and v3.4-rt Suggested-by: Thomas Gleixner Signed-off-by: John Kacur <jkacur@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: David Sommerseth <davids@redhat.com> Link: http://lkml.kernel.org/r/1335523726-10024-1-git-send-email-jkacur@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13rt/locking: Reenable migration accross scheduleThomas Gleixner1-12/+20
We currently disable migration across lock acquisition. That includes the part where we block on the lock and schedule out. We cannot disable migration after taking the lock as that would cause a possible lock inversion. But we can be smart and enable migration when we block and schedule out. That allows the scheduler to place the task freely at least if this is the first migrate disable level. For nested locking this does not help at all. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rtmutex: push down migrate_disable() into rt_spin_lock()Sebastian Andrzej Siewior6-29/+54
No point in having the migrate disable/enable invocations in all the macro/inlines. That's just more code for no win as we do a function call anyway. Move it to the core code and save quite some text size. text data bss dec filename 11034127 3676912 14901248 29612287 vmlinux.before 10990437 3676848 14901248 29568533 vmlinux.after ~-40KiB Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13hotplug: Use set_cpus_allowed_ptr() in sync_unplug_thread()Mike Galbraith1-1/+1
do_set_cpus_allowed() is not safe vs ->sched_class change. crash> bt PID: 11676 TASK: ffff88026f979da0 CPU: 22 COMMAND: "sync_unplug/22" #0 [ffff880274d25bc8] machine_kexec at ffffffff8103b41c #1 [ffff880274d25c18] crash_kexec at ffffffff810d881a #2 [ffff880274d25cd8] oops_end at ffffffff81525818 #3 [ffff880274d25cf8] do_invalid_op at ffffffff81003096 #4 [ffff880274d25d90] invalid_op at ffffffff8152d3de [exception RIP: set_cpus_allowed_rt+18] RIP: ffffffff8109e012 RSP: ffff880274d25e48 RFLAGS: 00010202 RAX: ffffffff8109e000 RBX: ffff88026f979da0 RCX: ffff8802770cb6e8 RDX: 0000000000000000 RSI: ffffffff81add700 RDI: ffff88026f979da0 RBP: ffff880274d25e78 R8: ffffffff816112e0 R9: 0000000000000001 R10: 0000000000000001 R11: 0000000000011940 R12: ffff88026f979da0 R13: ffff8802770cb6d0 R14: ffff880274d25fd8 R15: 0000000000000000 ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018 #5 [ffff880274d25e60] do_set_cpus_allowed at ffffffff8108e65f #6 [ffff880274d25e80] sync_unplug_thread at ffffffff81058c08 #7 [ffff880274d25ed8] kthread at ffffffff8107cad6 #8 [ffff880274d25f50] ret_from_fork at ffffffff8152bbbc crash> task_struct ffff88026f979da0 | grep class sched_class = 0xffffffff816111e0 <fair_sched_class+64>, Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13cpu_down: move migrate_enable() backTiejun Chen1-1/+1
Commit 08c1ab68, "hotplug-use-migrate-disable.patch", intends to use migrate_enable()/migrate_disable() to replace that combination of preempt_enable() and preempt_disable(), but actually in !CONFIG_PREEMPT_RT_FULL case, migrate_enable()/migrate_disable() are still equal to preempt_enable()/preempt_disable(). So that followed cpu_hotplug_begin()/cpu_unplug_begin(cpu) would go schedule() to trigger schedule_debug() like this: _cpu_down() | + migrate_disable() = preempt_disable() | + cpu_hotplug_begin() or cpu_unplug_begin() | + schedule() | + __schedule() | + preempt_disable(); | + __schedule_bug() is true! So we should move migrate_enable() as the original scheme. Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
2016-02-13kernel/hotplug: restore original cpu mask oncpu/downSebastian Andrzej Siewior1-1/+12
If a task which is allowed to run only on CPU X puts CPU Y down then it will be allowed on all CPUs but the on CPU Y after it comes back from kernel. This patch ensures that we don't lose the initial setting unless the CPU the task is running is going down. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13kernel/cpu: fix cpu down problem if kthread's cpu is going downSebastian Andrzej Siewior1-2/+13
If kthread is pinned to CPUx and CPUx is going down then we get into trouble: - first the unplug thread is created - it will set itself to hp->unplug. As a result, every task that is going to take a lock, has to leave the CPU. - the CPU_DOWN_PREPARE notifier are started. The worker thread will start a new process for the "high priority worker". Now kthread would like to take a lock but since it can't leave the CPU it will never complete its task. We could fire the unplug thread after the notifier but then the cpu is no longer marked "online" and the unplug thread will run on CPU0 which was fixed before :) So instead the unplug thread is started and kept waiting until the notfier complete their work. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13cpu hotplug: Document why PREEMPT_RT uses a spinlockSteven Rostedt1-0/+8
The patch: cpu: Make hotplug.lock a "sleeping" spinlock on RT Tasks can block on hotplug.lock in pin_current_cpu(), but their state might be != RUNNING. So the mutex wakeup will set the state unconditionally to RUNNING. That might cause spurious unexpected wakeups. We could provide a state preserving mutex_lock() function, but this is semantically backwards. So instead we convert the hotplug.lock() to a spinlock for RT, which has the state preserving semantics already. Fixed a bug where the hotplug lock on PREEMPT_RT can be called after a task set its state to TASK_UNINTERRUPTIBLE and before it called schedule. If the hotplug_lock used a mutex, and there was contention, the current task's state would be turned to TASK_RUNNABLE and the schedule call will not sleep. This caused unexpected results. Although the patch had a description of the change, the code had no comments about it. This causes confusion to those that review the code, and as PREEMPT_RT is held in a quilt queue and not git, it's not as easy to see why a change was made. Even if it was in git, the code should still have a comment for something as subtle as this. Document the rational for using a spinlock on PREEMPT_RT in the hotplug lock code. Reported-by: Nicholas Mc Guire <der.herr@hofr.at> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13cpu/rt: Rework cpu down for PREEMPT_RTSteven Rostedt3-44/+281
Bringing a CPU down is a pain with the PREEMPT_RT kernel because tasks can be preempted in many more places than in non-RT. In order to handle per_cpu variables, tasks may be pinned to a CPU for a while, and even sleep. But these tasks need to be off the CPU if that CPU is going down. Several synchronization methods have been tried, but when stressed they failed. This is a new approach. A sync_tsk thread is still created and tasks may still block on a lock when the CPU is going down, but how that works is a bit different. When cpu_down() starts, it will create the sync_tsk and wait on it to inform that current tasks that are pinned on the CPU are no longer pinned. But new tasks that are about to be pinned will still be allowed to do so at this time. Then the notifiers are called. Several notifiers will bring down tasks that will enter these locations. Some of these tasks will take locks of other tasks that are on the CPU. If we don't let those other tasks continue, but make them block until CPU down is done, the tasks that the notifiers are waiting on will never complete as they are waiting for the locks held by the tasks that are blocked. Thus we still let the task pin the CPU until the notifiers are done. After the notifiers run, we then make new tasks entering the pinned CPU sections grab a mutex and wait. This mutex is now a per CPU mutex in the hotplug_pcp descriptor. To help things along, a new function in the scheduler code is created called migrate_me(). This function will try to migrate the current task off the CPU this is going down if possible. When the sync_tsk is created, all tasks will then try to migrate off the CPU going down. There are several cases that this wont work, but it helps in most cases. After the notifiers are called and if a task can't migrate off but enters the pin CPU sections, it will be forced to wait on the hotplug_pcp mutex until the CPU down is complete. Then the scheduler will force the migration anyway. Also, I found that THREAD_BOUND need to also be accounted for in the pinned CPU, and the migrate_disable no longer treats them special. This helps fix issues with ksoftirqd and workqueue that unbind on CPU down. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13cpu: Make hotplug.lock a "sleeping" spinlock on RTSteven Rostedt1-7/+27
Tasks can block on hotplug.lock in pin_current_cpu(), but their state might be != RUNNING. So the mutex wakeup will set the state unconditionally to RUNNING. That might cause spurious unexpected wakeups. We could provide a state preserving mutex_lock() function, but this is semantically backwards. So instead we convert the hotplug.lock() to a spinlock for RT, which has the state preserving semantics already. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: Carsten Emde <C.Emde@osadl.org> Cc: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Clark Williams <clark.williams@gmail.com> Link: http://lkml.kernel.org/r/1330702617.25686.265.camel@gandalf.stny.rr.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13seqlock: Prevent rt starvationThomas Gleixner3-15/+47
If a low prio writer gets preempted while holding the seqlock write locked, a high prio reader spins forever on RT. To prevent this let the reader grab the spinlock, so it blocks and eventually boosts the writer. This way the writer can proceed and endless spinning is prevented. For seqcount writers we disable preemption over the update code path. Thanks to Al Viro for distangling some VFS code to make that possible. Nicholas Mc Guire: - spin_lock+unlock => spin_unlock_wait - __write_seqcount_begin => __raw_write_seqcount_begin Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13random: Make it work on rtThomas Gleixner5-8/+20
Delegate the random insertion to the forced threaded interrupt handler. Store the return IP of the hard interrupt handler in the irq descriptor and feed it into the random generator as a source of entropy. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13cpumask: Disable CONFIG_CPUMASK_OFFSTACK for RTThomas Gleixner2-1/+2
We can't deal with the cpumask allocations which happen in atomic context (see arch/x86/kernel/apic/io_apic.c) on RT right now. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13acpi/rt: Convert acpi_gbl_hardware lock back to a raw_spinlock_tSteven Rostedt5-7/+22
We hit the following bug with 3.6-rt: [ 5.898990] BUG: scheduling while atomic: swapper/3/0/0x00000002 [ 5.898991] no locks held by swapper/3/0. [ 5.898993] Modules linked in: [ 5.898996] Pid: 0, comm: swapper/3 Not tainted 3.6.11-rt28.19.el6rt.x86_64.debug #1 [ 5.898997] Call Trace: [ 5.899011] [<ffffffff810804e7>] __schedule_bug+0x67/0x90 [ 5.899028] [<ffffffff81577923>] __schedule+0x793/0x7a0 [ 5.899032] [<ffffffff810b4e40>] ? debug_rt_mutex_print_deadlock+0x50/0x200 [ 5.899034] [<ffffffff81577b89>] schedule+0x29/0x70 [ 5.899036] BUG: scheduling while atomic: swapper/7/0/0x00000002 [ 5.899037] no locks held by swapper/7/0. [ 5.899039] [<ffffffff81578525>] rt_spin_lock_slowlock+0xe5/0x2f0 [ 5.899040] Modules linked in: [ 5.899041] [ 5.899045] [<ffffffff81579a58>] ? _raw_spin_unlock_irqrestore+0x38/0x90 [ 5.899046] Pid: 0, comm: swapper/7 Not tainted 3.6.11-rt28.19.el6rt.x86_64.debug #1 [ 5.899047] Call Trace: [ 5.899049] [<ffffffff81578bc6>] rt_spin_lock+0x16/0x40 [ 5.899052] [<ffffffff810804e7>] __schedule_bug+0x67/0x90 [ 5.899054] [<ffffffff8157d3f0>] ? notifier_call_chain+0x80/0x80 [ 5.899056] [<ffffffff81577923>] __schedule+0x793/0x7a0 [ 5.899059] [<ffffffff812f2034>] acpi_os_acquire_lock+0x1f/0x23 [ 5.899062] [<ffffffff810b4e40>] ? debug_rt_mutex_print_deadlock+0x50/0x200 [ 5.899068] [<ffffffff8130be64>] acpi_write_bit_register+0x33/0xb0 [ 5.899071] [<ffffffff81577b89>] schedule+0x29/0x70 [ 5.899072] [<ffffffff8130be13>] ? acpi_read_bit_register+0x33/0x51 [ 5.899074] [<ffffffff81578525>] rt_spin_lock_slowlock+0xe5/0x2f0 [ 5.899077] [<ffffffff8131d1fc>] acpi_idle_enter_bm+0x8a/0x28e [ 5.899079] [<ffffffff81579a58>] ? _raw_spin_unlock_irqrestore+0x38/0x90 [ 5.899081] [<ffffffff8107e5da>] ? this_cpu_load+0x1a/0x30 [ 5.899083] [<ffffffff81578bc6>] rt_spin_lock+0x16/0x40 [ 5.899087] [<ffffffff8144c759>] cpuidle_enter+0x19/0x20 [ 5.899088] [<ffffffff8157d3f0>] ? notifier_call_chain+0x80/0x80 [ 5.899090] [<ffffffff8144c777>] cpuidle_enter_state+0x17/0x50 [ 5.899092] [<ffffffff812f2034>] acpi_os_acquire_lock+0x1f/0x23 [ 5.899094] [<ffffffff8144d1a1>] cpuidle899101] [<ffffffff8130be13>] ? As the acpi code disables interrupts in acpi_idle_enter_bm, and calls code that grabs the acpi lock, it causes issues as the lock is currently in RT a sleeping lock. The lock was converted from a raw to a sleeping lock due to some previous issues, and tests that showed it didn't seem to matter. Unfortunately, it did matter for one of our boxes. This patch converts the lock back to a raw lock. I've run this code on a few of my own machines, one being my laptop that uses the acpi quite extensively. I've been able to suspend and resume without issues. [ tglx: Made the change exclusive for acpi_gbl_hardware_lock ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Cc: John Kacur <jkacur@gmail.com> Cc: Clark Williams <clark@redhat.com> Link: http://lkml.kernel.org/r/1360765565.23152.5.camel@gandalf.local.home Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13dm: Make rt awareThomas Gleixner1-1/+1
Use the BUG_ON_NORT variant for the irq_disabled() checks. RT has interrupts legitimately enabled here as we cant deadlock against the irq thread due to the "sleeping spinlocks" conversion. Reported-by: Luis Claudio R. Goncalves <lclaudio@uudg.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13crypto: Reduce preempt disabled regions, more algosSebastian Andrzej Siewior2-28/+24
Don Estabrook reported | kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() | kernel: WARNING: CPU: 2 PID: 858 at kernel/sched/core.c:2462 migrate_enable+0x17b/0x200() | kernel: WARNING: CPU: 3 PID: 865 at kernel/sched/core.c:2428 migrate_disable+0xed/0x100() and his backtrace showed some crypto functions which looked fine. The problem is the following sequence: glue_xts_crypt_128bit() { blkcipher_walk_virt(); /* normal migrate_disable() */ glue_fpu_begin(); /* get atomic */ while (nbytes) { __glue_xts_crypt_128bit(); blkcipher_walk_done(); /* with nbytes = 0, migrate_enable() * while we are atomic */ }; glue_fpu_end() /* no longer atomic */ } and this is why the counter get out of sync and the warning is printed. The other problem is that we are non-preemptible between glue_fpu_begin() and glue_fpu_end() and the latency grows. To fix this, I shorten the FPU off region and ensure blkcipher_walk_done() is called with preemption enabled. This might hurt the performance because we now enable/disable the FPU state more often but we gain lower latency and the bug is gone. Reported-by: Don Estabrook <don.estabrook@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13x86: crypto: Reduce preempt disabled regionsPeter Zijlstra1-11/+13
Restrict the preempt disabled regions to the actual floating point operations and enable preemption for the administrative actions. This is necessary on RT to avoid that kfree and other operations are called with preemption disabled. Reported-and-tested-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sas-ata/isci: dont't disable interrupts in qc_issue handlerPaul Gortmaker1-2/+2
On 3.14-rt we see the following trace on Canoe Pass for SCSI_ISCI "Intel(R) C600 Series Chipset SAS Controller" when the sas qc_issue handler is run: BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:905 in_atomic(): 0, irqs_disabled(): 1, pid: 432, name: udevd CPU: 11 PID: 432 Comm: udevd Not tainted 3.14.28-rt22 #2 Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013 ffff880fab500000 ffff880fa9f239c0 ffffffff81a2d273 0000000000000000 ffff880fa9f239d8 ffffffff8107f023 ffff880faac23dc0 ffff880fa9f239f0 ffffffff81a33cc0 ffff880faaeb1400 ffff880fa9f23a40 ffffffff815de891 Call Trace: [<ffffffff81a2d273>] dump_stack+0x4e/0x7a [<ffffffff8107f023>] __might_sleep+0xe3/0x160 [<ffffffff81a33cc0>] rt_spin_lock+0x20/0x50 [<ffffffff815de891>] isci_task_execute_task+0x171/0x2f0 <----- [<ffffffff815cfecb>] sas_ata_qc_issue+0x25b/0x2a0 [<ffffffff81606363>] ata_qc_issue+0x1f3/0x370 [<ffffffff8160c600>] ? ata_scsi_invalid_field+0x40/0x40 [<ffffffff8160c8f5>] ata_scsi_translate+0xa5/0x1b0 [<ffffffff8160efc6>] ata_sas_queuecmd+0x86/0x280 [<ffffffff815ce446>] sas_queuecommand+0x196/0x230 [<ffffffff81081fad>] ? get_parent_ip+0xd/0x50 [<ffffffff815b05a4>] scsi_dispatch_cmd+0xb4/0x210 [<ffffffff815b7744>] scsi_request_fn+0x314/0x530 and gdb shows: (gdb) list * isci_task_execute_task+0x171 0xffffffff815ddfb1 is in isci_task_execute_task (drivers/scsi/isci/task.c:138). 133 dev_dbg(&ihost->pdev->dev, "%s: num=%d\n", __func__, num); 134 135 for_each_sas_task(num, task) { 136 enum sci_status status = SCI_FAILURE; 137 138 spin_lock_irqsave(&ihost->scic_lock, flags); <----- 139 idev = isci_lookup_device(task->dev); 140 io_ready = isci_device_io_ready(idev, task); 141 tag = isci_alloc_tag(ihost); 142 spin_unlock_irqrestore(&ihost->scic_lock, flags); (gdb) In addition to the scic_lock, the function also contains locking of the task_state_lock -- which is clearly not a candidate for raw lock conversion. As can be seen by the comment nearby, we really should be running the qc_issue code with interrupts enabled anyway. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13scsi/fcoe: Make RT aware.Thomas Gleixner3-13/+13
Do not disable preemption while taking sleeping locks. All user look safe for migrate_diable() only. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13KVM: use simple waitqueue for vcpu->wqMarcelo Tosatti10-38/+37
The problem: On -RT, an emulated LAPIC timer instances has the following path: 1) hard interrupt 2) ksoftirqd is scheduled 3) ksoftirqd wakes up vcpu thread 4) vcpu thread is scheduled This extra context switch introduces unnecessary latency in the LAPIC path for a KVM guest. The solution: Allow waking up vcpu thread from hardirq context, thus avoiding the need for ksoftirqd to be scheduled. Normal waitqueues make use of spinlocks, which on -RT are sleepable locks. Therefore, waking up a waitqueue waiter involves locking a sleeping lock, which is not allowed from hard interrupt context. cyclictest command line: # cyclictest -m -n -q -p99 -l 1000000 -h60 -D 1m This patch reduces the average latency in my tests from 14us to 11us. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13KVM: lapic: mark LAPIC timer handler as irqsafeMarcelo Tosatti1-0/+1
Since lapic timer handler only wakes up a simple waitqueue, it can be executed from hardirq context. Reduces average cyclictest latency by 3us. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13x86: kvm Require const tsc for RTThomas Gleixner1-0/+7
Non constant TSC is a nightmare on bare metal already, but with virtualization it becomes a complete disaster because the workarounds are horrible latency wise. That's also a preliminary for running RT in a guest on top of a RT host. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13ipc/sem: Rework semaphore wakeupsPeter Zijlstra1-0/+10
Current sysv sems have a weird ass wakeup scheme that involves keeping preemption disabled over a potential O(n^2) loop and busy waiting on that on other CPUs. Kill this and simply wake the task directly from under the sem_lock. This was discovered by a migrate_disable() debug feature that disallows: spin_lock(); preempt_disable(); spin_unlock() preempt_enable(); Cc: Manfred Spraul <manfred@colorfullife.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Reported-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Manfred Spraul <manfred@colorfullife.com> Link: http://lkml.kernel.org/r/1315994224.5040.1.camel@twins Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13arm: Enable highmem for rtThomas Gleixner3-8/+57
fixup highmem for ARM. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13arm/highmem: Flush tlb on unmapSebastian Andrzej Siewior1-1/+1
The tlb should be flushed on unmap and thus make the mapping entry invalid. This is only done in the non-debug case which does not look right. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13x86/highmem: Add a "already used pte" checkSebastian Andrzej Siewior1-0/+2
This is a copy from kmap_atomic_prot(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13mm: rt: Fix generic kmap_atomic for RTThomas Gleixner1-2/+2
The update to 4.1 brought in the mainline variant of the pagefault disable distangling from preempt count. That introduced a preempt_disable/enable pair in the generic kmap_atomic/kunmap_atomic implementations which got not converted to the _nort() variant. That results in massive 'scheduling while atomic/sleeping function called from invalid context' splats. Fix that up. Reported-and-tested-by: Juergen Borleis <jbe@pengutronix.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13mm, rt: kmap_atomic schedulingPeter Zijlstra7-10/+86
In fact, with migrate_disable() existing one could play games with kmap_atomic. You could save/restore the kmap_atomic slots on context switch (if there are any in use of course), this should be esp easy now that we have a kmap_atomic stack. Something like the below.. it wants replacing all the preempt_disable() stuff with pagefault_disable() && migrate_disable() of course, but then you can flip kmaps around like below. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> [dvhart@linux.intel.com: build fix] Link: http://lkml.kernel.org/r/1311842631.5890.208.camel@twins [tglx@linutronix.de: Get rid of the per cpu variable and store the idx and the pte content right away in the task struct. Shortens the context switch code. ]
2016-02-13mips: Disable highmem on RTThomas Gleixner1-1/+1
The current highmem handling on -RT is not compatible and needs fixups. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13powerpc: Disable highmem on RTThomas Gleixner1-1/+1
The current highmem handling on -RT is not compatible and needs fixups. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sysfs: Add /sys/kernel/realtime entryClark Williams1-0/+12
Add a /sys/kernel entry to indicate that the kernel is a realtime kernel. Clark says that he needs this for udev rules, udev needs to evaluate if its a PREEMPT_RT kernel a few thousand times and parsing uname output is too slow or so. Are there better solutions? Should it exist and return 0 on !-rt? Signed-off-by: Clark Williams <williams@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2016-02-13kgdb/serial: Short term workaroundJason Wessel3-5/+6
On 07/27/2011 04:37 PM, Thomas Gleixner wrote: > - KGDB (not yet disabled) is reportedly unusable on -rt right now due > to missing hacks in the console locking which I dropped on purpose. > To work around this in the short term you can use this patch, in addition to the clocksource watchdog patch that Thomas brewed up. Comments are welcome of course. Ultimately the right solution is to change separation between the console and the HW to have a polled mode + work queue so as not to introduce any kind of latency. Thanks, Jason.
2016-02-13arm64/xen: Make XEN depend on !RTThomas Gleixner1-1/+1
It's not ready and probably never will be, unless xen folks have a look at it. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13KVM: arm/arm64: downgrade preempt_disable()d region to migrate_disable()Josh Cartwright1-3/+3
kvm_arch_vcpu_ioctl_run() disables the use of preemption when updating the vgic and timer states to prevent the calling task from migrating to another CPU. It does so to prevent the task from writing to the incorrect per-CPU GIC distributor registers. On -rt kernels, it's possible to maintain the same guarantee with the use of migrate_{disable,enable}(), with the added benefit that the migrate-disabled region is preemptible. Update kvm_arch_vcpu_ioctl_run() to do so. Cc: Christoffer Dall <christoffer.dall@linaro.org> Reported-by: Manish Jaggi <Manish.Jaggi@caviumnetworks.com> Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13genirq: update irq_set_irqchip_state documentationJosh Cartwright1-1/+1
On -rt kernels, the use of migrate_disable()/migrate_enable() is sufficient to guarantee a task isn't moved to another CPU. Update the irq_set_irqchip_state() documentation to reflect this. Signed-off-by: Josh Cartwright <joshc@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13ARM: enable irq in translation/section permission fault handlersYadi.hu1-0/+6
Probably happens on all ARM, with CONFIG_PREEMPT_RT_FULL CONFIG_DEBUG_ATOMIC_SLEEP This simple program.... int main() { *((char*)0xc0001000) = 0; }; [ 512.742724] BUG: sleeping function called from invalid context at kernel/rtmutex.c:658 [ 512.743000] in_atomic(): 0, irqs_disabled(): 128, pid: 994, name: a [ 512.743217] INFO: lockdep is turned off. [ 512.743360] irq event stamp: 0 [ 512.743482] hardirqs last enabled at (0): [< (null)>] (null) [ 512.743714] hardirqs last disabled at (0): [<c0426370>] copy_process+0x3b0/0x11c0 [ 512.744013] softirqs last enabled at (0): [<c0426370>] copy_process+0x3b0/0x11c0 [ 512.744303] softirqs last disabled at (0): [< (null)>] (null) [ 512.744631] [<c041872c>] (unwind_backtrace+0x0/0x104) [ 512.745001] [<c09af0c4>] (dump_stack+0x20/0x24) [ 512.745355] [<c0462490>] (__might_sleep+0x1dc/0x1e0) [ 512.745717] [<c09b6770>] (rt_spin_lock+0x34/0x6c) [ 512.746073] [<c0441bf0>] (do_force_sig_info+0x34/0xf0) [ 512.746457] [<c0442668>] (force_sig_info+0x18/0x1c) [ 512.746829] [<c041d880>] (__do_user_fault+0x9c/0xd8) [ 512.747185] [<c041d938>] (do_bad_area+0x7c/0x94) [ 512.747536] [<c041d990>] (do_sect_fault+0x40/0x48) [ 512.747898] [<c040841c>] (do_DataAbort+0x40/0xa0) [ 512.748181] Exception stack(0xecaa1fb0 to 0xecaa1ff8) Oxc0000000 belongs to kernel address space, user task can not be allowed to access it. For above condition, correct result is that test case should receive a “segment fault” and exits but not stacks. the root cause is commit 02fe2845d6a8 ("avoid enabling interrupts in prefetch/data abort handlers"),it deletes irq enable block in Data abort assemble code and move them into page/breakpiont/alignment fault handlers instead. But author does not enable irq in translation/section permission fault handlers. ARM disables irq when it enters exception/ interrupt mode, if kernel doesn't enable irq, it would be still disabled during translation/section permission fault. We see the above splat because do_force_sig_info is still called with IRQs off, and that code eventually does a: spin_lock_irqsave(&t->sighand->siglock, flags); As this is architecture independent code, and we've not seen any other need for other arch to have the siglock converted to raw lock, we can conclude that we should enable irq for ARM translation/section permission exception. Signed-off-by: Yadi.hu <yadi.hu@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13arm/unwind: use a raw_spin_lockSebastian Andrzej Siewior1-7/+7
Mostly unwind is done with irqs enabled however SLUB may call it with irqs disabled while creating a new SLUB cache. I had system freeze while loading a module which called kmem_cache_create() on init. That means SLUB's __slab_alloc() disabled interrupts and then ->new_slab_objects() ->new_slab() ->setup_object() ->setup_object_debug() ->init_tracking() ->set_track() ->save_stack_trace() ->save_stack_trace_tsk() ->walk_stackframe() ->unwind_frame() ->unwind_find_idx() =>spin_lock_irqsave(&unwind_lock); Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13ARM: at91: tclib: Default to tclib timer for RTThomas Gleixner1-1/+2
RT is not too happy about the shared timer interrupt in AT91 devices. Default to tclib timer for RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13powerpc: ps3/device-init.c - adapt to completions using swait vs waitPaul Gortmaker1-1/+1
To fix: cc1: warnings being treated as errors arch/powerpc/platforms/ps3/device-init.c: In function 'ps3_notification_read_write': arch/powerpc/platforms/ps3/device-init.c:755:2: error: passing argument 1 of 'prepare_to_wait_event' from incompatible pointer type arch/powerpc/platforms/ps3/device-init.c:755:2: error: passing argument 1 of 'abort_exclusive_wait' from incompatible pointer type arch/powerpc/platforms/ps3/device-init.c:755:2: error: passing argument 1 of 'finish_wait' from incompatible pointer type arch/powerpc/platforms/ps3/device-init.o] Error 1 make[3]: *** Waiting for unfinished jobs.... Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13powerpc/kvm: Disable in-kernel MPIC emulation for PREEMPT_RT_FULLBogdan Purcareata1-0/+1
While converting the openpic emulation code to use a raw_spinlock_t enables guests to run on RT, there's still a performance issue. For interrupts sent in directed delivery mode with a multiple CPU mask, the emulated openpic will loop through all of the VCPUs, and for each VCPUs, it call IRQ_check, which will loop through all the pending interrupts for that VCPU. This is done while holding the raw_lock, meaning that in all this time the interrupts and preemption are disabled on the host Linux. A malicious user app can max both these number and cause a DoS. This temporary fix is sent for two reasons. First is so that users who want to use the in-kernel MPIC emulation are aware of the potential latencies, thus making sure that the hardware MPIC and their usage scenario does not involve interrupts sent in directed delivery mode, and the number of possible pending interrupts is kept small. Secondly, this should incentivize the development of a proper openpic emulation that would be better suited for RT. Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13powerpc: Use generic rwsem on RTThomas Gleixner1-1/+2
Use generic code which uses rtmutex Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13printk: Drop the logbuf_lock more oftenSebastian Andrzej Siewior1-1/+26
The lock is hold with irgs off. The latency drops 500us+ on my arm bugs with a "full" buffer after executing "dmesg" on the shell. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13printk: Make rt awareThomas Gleixner1-3/+23
Drop the lock before calling the console driver and do not disable interrupts while printing to a serial console. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13snd/pcm: fix snd_pcm_stream_lock*() irqs_disabled() splatsMike Galbraith1-4/+4
Locking functions previously using read_lock_irq()/read_lock_irqsave() were changed to local_irq_disable/save(), leading to gripes. Use nort variants. |BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 |in_atomic(): 0, irqs_disabled(): 1, pid: 5947, name: alsa-sink-ALC88 |CPU: 5 PID: 5947 Comm: alsa-sink-ALC88 Not tainted 3.18.7-rt1 #9 |Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.404 11/06/2014 | ffff880409316240 ffff88040866fa38 ffffffff815bdeb5 0000000000000002 | 0000000000000000 ffff88040866fa58 ffffffff81073c86 ffffffffa03b2640 | ffff88040239ec00 ffff88040866fa78 ffffffff815c3d34 ffffffffa03b2640 |Call Trace: | [<ffffffff815bdeb5>] dump_stack+0x4f/0x9e | [<ffffffff81073c86>] __might_sleep+0xe6/0x150 | [<ffffffff815c3d34>] __rt_spin_lock+0x24/0x50 | [<ffffffff815c4044>] rt_read_lock+0x34/0x40 | [<ffffffffa03a2979>] snd_pcm_stream_lock+0x29/0x70 [snd_pcm] | [<ffffffffa03a355d>] snd_pcm_playback_poll+0x5d/0x120 [snd_pcm] | [<ffffffff811937a2>] do_sys_poll+0x322/0x5b0 | [<ffffffff81193d48>] SyS_ppoll+0x1a8/0x1c0 | [<ffffffff815c4556>] system_call_fastpath+0x16/0x1b Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13irqwork: Move irq safe work to irq contextThomas Gleixner3-4/+17
On architectures where arch_irq_work_has_interrupt() returns false, we end up running the irq safe work from the softirq context. That results in a potential deadlock in the scheduler irq work which expects that function to be called with interrupts disabled. Split the irq_work_tick() function into a hard and soft variant. Call the hard variant from the tick interrupt and add the soft variant to the timer softirq. Reported-and-tested-by: Yanjiang Jin <yanjiang.jin@windriver.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2016-02-13irqwork: push most work into softirq contextSebastian Andrzej Siewior5-14/+47
Initially we defered all irqwork into softirq because we didn't want the latency spikes if perf or another user was busy and delayed the RT task. The NOHZ trigger (nohz_full_kick_work) was the first user that did not work as expected if it did not run in the original irqwork context so we had to bring it back somehow for it. push_irq_work_func is the second one that requires this. This patch adds the IRQ_WORK_HARD_IRQ which makes sure the callback runs in raw-irq context. Everything else is defered into softirq context. Without -RT we have the orignal behavior. This patch incorporates tglx orignal work which revoked a little bringing back the arch_irq_work_raise() if possible and a few fixes from Steven Rostedt and Mike Galbraith, Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13net: sysrq via icmpCarsten Emde4-2/+47
There are (probably rare) situations when a system crashed and the system console becomes unresponsive but the network icmp layer still is alive. Wouldn't it be wonderful, if we then could submit a sysreq command via ping? This patch provides this facility. Please consult the updated documentation Documentation/sysrq.txt for details. Signed-off-by: Carsten Emde <C.Emde@osadl.org>
2016-02-13net: Avoid livelock in net_tx_action() on RTSteven Rostedt1-1/+31
qdisc_lock is taken w/o disabling interrupts or bottom halfs. So code holding a qdisc_lock() can be interrupted and softirqs can run on the return of interrupt in !RT. The spin_trylock() in net_tx_action() makes sure, that the softirq does not deadlock. When the lock can't be acquired q is requeued and the NET_TX softirq is raised. That causes the softirq to run over and over. That works in mainline as do_softirq() has a retry loop limit and leaves the softirq processing in the interrupt return path and schedules ksoftirqd. The task which holds qdisc_lock cannot be preempted, so the lock is released and either ksoftirqd or the next softirq in the return from interrupt path can proceed. Though it's a bit strange to actually run MAX_SOFTIRQ_RESTART (10) loops before it decides to bail out even if it's clear in the first iteration :) On RT all softirq processing is done in a FIFO thread and we don't have a loop limit, so ksoftirqd preempts the lock holder forever and unqueues and requeues until the reset button is hit. Due to the forced threading of ksoftirqd on RT we actually cannot deadlock on qdisc_lock because it's a "sleeping lock". So it's safe to replace the spin_trylock() with a spin_lock(). When contended, ksoftirqd is scheduled out and the lock holder can proceed. [ tglx: Massaged changelog and code comments ] Solved-by: Thomas Gleixner <tglx@linuxtronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Carsten Emde <cbe@osadl.org> Cc: Clark Williams <williams@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Luis Claudio R. Goncalves <lclaudio@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13net: provide a way to delegate processing a softirq to ksoftirqdSebastian Andrzej Siewior3-1/+30
If the NET_RX uses up all of his budget it moves the following NAPI invocations into the `ksoftirqd`. On -RT it does not do so. Instead it rises the NET_RX softirq in its current context again. In order to get closer to mainline's behaviour this patch provides __raise_softirq_irqoff_ksoft() which raises the softirq in the ksoftird. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13net: move xmit_recursion to per-task variable on -RTSebastian Andrzej Siewior3-3/+50
A softirq on -RT can be preempted. That means one task is in __dev_queue_xmit(), gets preempted and another task may enter __dev_queue_xmit() aw well. netperf together with a bridge device will then trigger the `recursion alert` because each task increments the xmit_recursion variable which is per-CPU. A virtual device like br0 is required to trigger this warning. This patch moves the counter to per task instead per-CPU so it counts the recursion properly on -RT. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13net/core/cpuhotplug: Drain input_pkt_queue locklessGrygorii Strashko1-1/+1
I can constantly see below error report with 4.1 RT-kernel on TI ARM dra7-evm if I'm trying to unplug cpu1: [ 57.737589] CPU1: shutdown [ 57.767537] BUG: spinlock bad magic on CPU#0, sh/137 [ 57.767546] lock: 0xee994730, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0 [ 57.767552] CPU: 0 PID: 137 Comm: sh Not tainted 4.1.10-rt8-01700-g2c38702-dirty #55 [ 57.767555] Hardware name: Generic DRA74X (Flattened Device Tree) [ 57.767568] [<c001acd0>] (unwind_backtrace) from [<c001534c>] (show_stack+0x20/0x24) [ 57.767579] [<c001534c>] (show_stack) from [<c075560c>] (dump_stack+0x84/0xa0) [ 57.767593] [<c075560c>] (dump_stack) from [<c00aca48>] (spin_dump+0x84/0xac) [ 57.767603] [<c00aca48>] (spin_dump) from [<c00acaa4>] (spin_bug+0x34/0x38) [ 57.767614] [<c00acaa4>] (spin_bug) from [<c00acc10>] (do_raw_spin_lock+0x168/0x1c0) [ 57.767624] [<c00acc10>] (do_raw_spin_lock) from [<c075b4cc>] (_raw_spin_lock+0x4c/0x54) [ 57.767631] [<c075b4cc>] (_raw_spin_lock) from [<c07599fc>] (rt_spin_lock_slowlock+0x5c/0x374) [ 57.767638] [<c07599fc>] (rt_spin_lock_slowlock) from [<c075bcf4>] (rt_spin_lock+0x38/0x70) [ 57.767649] [<c075bcf4>] (rt_spin_lock) from [<c06333c0>] (skb_dequeue+0x28/0x7c) [ 57.767662] [<c06333c0>] (skb_dequeue) from [<c06476ec>] (dev_cpu_callback+0x1b8/0x240) [ 57.767673] [<c06476ec>] (dev_cpu_callback) from [<c007566c>] (notifier_call_chain+0x3c/0xb4) The reason is that skb_dequeue is taking skb->lock, but RT changed the core code to use a raw spinlock. The non-raw lock is not initialized on purpose to catch exactly this kind of problem. Fixes: 91df05da13a6 'net: Use skbufhead with raw lock' Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: stable-rt@vger.kernel.org
2016-02-13net: Use skbufhead with raw lockThomas Gleixner3-6/+21
Use the rps lock as rawlock so we can keep irq-off regions. It looks low latency. However we can't kfree() from this context therefore we defer this to the softirq and use the tofree_queue list for it (similar to process_queue). Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13net: Make synchronize_rcu_expedited() conditional on !RT_FULLJosh Cartwright1-1/+1
While the use of synchronize_rcu_expedited() might make synchronize_net() "faster", it does so at significant cost on RT systems, as expediting a grace period forcibly preempts any high-priority RT tasks (via the stop_machine() mechanism). Without this change, we can observe a latency spike up to 30us with cyclictest by rapidly unplugging/reestablishing an ethernet link. Suggested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Josh Cartwright <joshc@ni.com> Cc: bigeasy@linutronix.de Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20151027123153.GG8245@jcartwri.amer.corp.natinst.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sunrpc: Make svc_xprt_do_enqueue() use get_cpu_light()Mike Galbraith1-3/+3
|BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:915 |in_atomic(): 1, irqs_disabled(): 0, pid: 3194, name: rpc.nfsd |Preemption disabled at:[<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc] |CPU: 6 PID: 3194 Comm: rpc.nfsd Not tainted 3.18.7-rt1 #9 |Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.404 11/06/2014 | ffff880409630000 ffff8800d9a33c78 ffffffff815bdeb5 0000000000000002 | 0000000000000000 ffff8800d9a33c98 ffffffff81073c86 ffff880408dd6008 | ffff880408dd6000 ffff8800d9a33cb8 ffffffff815c3d84 ffff88040b3ac000 |Call Trace: | [<ffffffff815bdeb5>] dump_stack+0x4f/0x9e | [<ffffffff81073c86>] __might_sleep+0xe6/0x150 | [<ffffffff815c3d84>] rt_spin_lock+0x24/0x50 | [<ffffffffa06beec0>] svc_xprt_do_enqueue+0x80/0x230 [sunrpc] | [<ffffffffa06bf0bb>] svc_xprt_received+0x4b/0xc0 [sunrpc] | [<ffffffffa06c03ed>] svc_add_new_perm_xprt+0x6d/0x80 [sunrpc] | [<ffffffffa06b2693>] svc_addsock+0x143/0x200 [sunrpc] | [<ffffffffa072e69c>] write_ports+0x28c/0x340 [nfsd] | [<ffffffffa072d2ac>] nfsctl_transaction_write+0x4c/0x80 [nfsd] | [<ffffffff8117ee83>] vfs_write+0xb3/0x1d0 | [<ffffffff8117f889>] SyS_write+0x49/0xb0 | [<ffffffff815c4556>] system_call_fastpath+0x16/0x1b Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13jump-label: disable if stop_machine() is usedThomas Gleixner1-1/+1
Some architectures are using stop_machine() while switching the opcode which leads to latency spikes. The architectures which use stop_machine() atm: - ARM stop machine - s390 stop machine The architecures which use other sorcery: - MIPS - X86 - powerpc - sparc - arm64 Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: only ARM for now] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13debugobjects: Make RT awareThomas Gleixner1-1/+4
Avoid filling the pool / allocating memory with irqs off(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13percpu_ida: Use local locksSebastian Andrzej Siewior1-8/+12
the local_irq_save() + spin_lock() does not work that well on -RT Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13idr: Use local lock instead of preempt enable/disableThomas Gleixner2-6/+41
We need to protect the per cpu variable and prevent migration. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sched: Distangle worker accounting from rqlockThomas Gleixner3-99/+41
The worker accounting for cpu bound workers is plugged into the core scheduler code and the wakeup code. This is not a hard requirement and can be avoided by keeping track of the state in the workqueue code itself. Keep track of the sleeping state in the worker itself and call the notifier before entering the core scheduler. There might be false positives when the task is woken between that call and actually scheduling, but that's not really different from scheduling and being woken immediately after switching away. There is also no harm from updating nr_running when the task returns from scheduling instead of accounting it in the wakeup code. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20110622174919.135236139@linutronix.de Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13workqueue: Prevent workqueue versus ata-piix livelockThomas Gleixner1-1/+2
An Intel i7 system regularly detected rcu_preempt stalls after the kernel was upgraded from 3.6-rt to 3.8-rt. When the stall happened, disk I/O was no longer possible, unless the system was restarted. The kernel message was: INFO: rcu_preempt self-detected stall on CPU { 6} [..] NMI backtrace for cpu 6 CPU 6 Pid: 119, comm: irq/19-ata_piix Not tainted 3.8.13-rt13 #11 Shuttle Inc. SX58/SX58 RIP: 0010:[<ffffffff8124ca60>] [<ffffffff8124ca60>] ip_compute_csum+0x30/0x30 RSP: 0018:ffff880333303cb0 EFLAGS: 00000002 RAX: 0000000000000006 RBX: 00000000000003e9 RCX: 0000000000000034 RDX: 0000000000000000 RSI: ffffffff81aa16d0 RDI: 0000000000000001 RBP: ffff880333303ce8 R08: ffffffff81aa16d0 R09: ffffffff81c1b8cc R10: 0000000000000000 R11: 0000000000000000 R12: 000000000005161f R13: 0000000000000006 R14: ffffffff81aa16d0 R15: 0000000000000002 FS: 0000000000000000(0000) GS:ffff880333300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000003c1b2bb420 CR3: 0000000001a0f000 CR4: 00000000000007e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process irq/19-ata_piix (pid: 119, threadinfo ffff88032d88a000, task ffff88032df80000) Stack: ffffffff8124cb32 000000000005161e 00000000000003e9 0000000000001000 0000000000009022 ffffffff81aa16d0 0000000000000002 ffff880333303cf8 ffffffff8124caa9 ffff880333303d08 ffffffff8124cad2 ffff880333303d28 Call Trace: <IRQ> [<ffffffff8124cb32>] ? delay_tsc+0x33/0xe3 [<ffffffff8124caa9>] __delay+0xf/0x11 [<ffffffff8124cad2>] __const_udelay+0x27/0x29 [<ffffffff8102d1fa>] native_safe_apic_wait_icr_idle+0x39/0x45 [<ffffffff8102dc9b>] __default_send_IPI_dest_field.constprop.0+0x1e/0x58 [<ffffffff8102dd1e>] default_send_IPI_mask_sequence_phys+0x49/0x7d [<ffffffff81030326>] physflat_send_IPI_all+0x17/0x19 [<ffffffff8102de53>] arch_trigger_all_cpu_backtrace+0x50/0x79 [<ffffffff810b21d0>] rcu_check_callbacks+0x1cb/0x568 [<ffffffff81048c9c>] ? raise_softirq+0x2e/0x35 [<ffffffff81086be0>] ? tick_sched_do_timer+0x38/0x38 [<ffffffff8104f653>] update_process_times+0x44/0x55 [<ffffffff81086866>] tick_sched_handle+0x4a/0x59 [<ffffffff81086c1c>] tick_sched_timer+0x3c/0x5b [<ffffffff81062845>] __run_hrtimer+0x9b/0x158 [<ffffffff810631d8>] hrtimer_interrupt+0x172/0x2aa [<ffffffff8102d498>] smp_apic_timer_interrupt+0x76/0x89 [<ffffffff814d881d>] apic_timer_interrupt+0x6d/0x80 <EOI> [<ffffffff81057cd2>] ? __local_lock_irqsave+0x17/0x4a [<ffffffff81059336>] try_to_grab_pending+0x42/0x17e [<ffffffff8105a699>] mod_delayed_work_on+0x32/0x88 [<ffffffff8105a70b>] mod_delayed_work+0x1c/0x1e [<ffffffff8122ae84>] blk_run_queue_async+0x37/0x39 [<ffffffff81230985>] flush_end_io+0xf1/0x107 [<ffffffff8122e0da>] blk_finish_request+0x21e/0x264 [<ffffffff8122e162>] blk_end_bidi_request+0x42/0x60 [<ffffffff8122e1ba>] blk_end_request+0x10/0x12 [<ffffffff8132de46>] scsi_io_completion+0x1bf/0x492 [<ffffffff81335cec>] ? sd_done+0x298/0x2ef [<ffffffff81325a02>] scsi_finish_command+0xe9/0xf2 [<ffffffff8132dbcb>] scsi_softirq_done+0x106/0x10f [<ffffffff812333d3>] blk_done_softirq+0x77/0x87 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff810aa820>] ? irq_thread_fn+0x3a/0x3a [<ffffffff81048466>] local_bh_enable+0x43/0x72 [<ffffffff810aa866>] irq_forced_thread_fn+0x46/0x52 [<ffffffff810ab089>] irq_thread+0x8c/0x17c [<ffffffff810ab179>] ? irq_thread+0x17c/0x17c [<ffffffff810aaffd>] ? wake_threads_waitq+0x44/0x44 [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 [<ffffffff814d7b7c>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 The state of softirqd of this CPU at the time of the crash was: ksoftirqd/6 R running task 0 53 2 0x00000000 ffff88032fc39d18 0000000000000046 ffff88033330c4c0 ffff8803303f4710 ffff88032fc39fd8 ffff88032fc39fd8 0000000000000000 0000000000062500 ffff88032df88000 ffff8803303f4710 0000000000000000 ffff88032fc38000 Call Trace: [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff814d178c>] preempt_schedule+0x61/0x76 [<ffffffff8106cccf>] migrate_enable+0xe5/0x1df [<ffffffff8105a3ae>] ? __queue_work+0x27c/0x27c [<ffffffff8104ef52>] run_timer_softirq+0x161/0x1d6 [<ffffffff8104826f>] do_current_softirqs+0x172/0x2e1 [<ffffffff8104840b>] run_ksoftirqd+0x2d/0x45 [<ffffffff8106658a>] smpboot_thread_fn+0x2ea/0x308 [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff810662a0>] ? test_ti_thread_flag+0xc/0xc [<ffffffff8105eb18>] kthread+0x8d/0x95 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 [<ffffffff814d7afc>] ret_from_fork+0x7c/0xb0 [<ffffffff8105ea8b>] ? __kthread_parkme+0x65/0x65 Apparently, the softirq demon and the ata_piix IRQ handler were waiting for each other to finish ending up in a livelock. After the below patch was applied, the system no longer crashes. Reported-by: Carsten Emde <C.Emde@osadl.org> Proposed-by: Thomas Gleixner <tglx@linutronix.de> Tested by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Carsten Emde <C.Emde@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13workqueue: Use local irq lock instead of irq disable regionsThomas Gleixner1-14/+17
Use a local_irq_lock as a replacement for irq off regions. We keep the semantic of irq-off in regard to the pool->lock and remain preemptible. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13workqueue: Use normal rcuThomas Gleixner1-43/+53
There is no need for sched_rcu. The undocumented reason why sched_rcu is used is to avoid a few explicit rcu_read_lock()/unlock() pairs by abusing the fact that sched_rcu reader side critical sections are also protected by preempt or irq disabled regions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13net: Use cpu_chill() instead of cpu_relax()Thomas Gleixner2-3/+5
Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13fs: dcache: Use cpu_chill() in trylock loopsThomas Gleixner4-4/+7
Retry loops on RT might loop forever when the modifying side was preempted. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13block: Use cpu_chill() for retry loopsThomas Gleixner1-2/+3
Retry loops on RT might loop forever when the modifying side was preempted. Steven also observed a live lock when there was a concurrent priority boosting going on. Use cpu_chill() instead of cpu_relax() to let the system make progress. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13block/mq: drop per ctx cpu_lockSebastian Andrzej Siewior2-12/+0
While converting the get_cpu() to get_cpu_light() I added a cpu lock to ensure the same code is not invoked twice on the same CPU. And now I run into this: | kernel BUG at kernel/locking/rtmutex.c:996! | invalid opcode: 0000 [#1] PREEMPT SMP | CPU0: 13 PID: 75 Comm: kworker/u258:0 Tainted: G I 3.18.7-rt1.5+ #12 | Workqueue: writeback bdi_writeback_workfn (flush-8:0) | task: ffff88023742a620 ti: ffff88023743c000 task.ti: ffff88023743c000 | RIP: 0010:[<ffffffff81523cc0>] [<ffffffff81523cc0>] rt_spin_lock_slowlock+0x280/0x2d0 | Call Trace: | [<ffffffff815254e7>] rt_spin_lock+0x27/0x60 taking the same lock again | | [<ffffffff8127c771>] blk_mq_insert_requests+0x51/0x130 | [<ffffffff8127d4a9>] blk_mq_flush_plug_list+0x129/0x140 | [<ffffffff81272461>] blk_flush_plug_list+0xd1/0x250 | [<ffffffff81522075>] schedule+0x75/0xa0 | [<ffffffff8152474d>] do_nanosleep+0xdd/0x180 | [<ffffffff810c8312>] __hrtimer_nanosleep+0xd2/0x1c0 | [<ffffffff810c8456>] cpu_chill+0x56/0x80 | [<ffffffff8107c13d>] try_to_grab_pending+0x1bd/0x390 | [<ffffffff8107c431>] cancel_delayed_work+0x21/0x170 | [<ffffffff81279a98>] blk_mq_stop_hw_queue+0x18/0x40 | [<ffffffffa000ac6f>] scsi_queue_rq+0x7f/0x830 [scsi_mod] | [<ffffffff8127b0de>] __blk_mq_run_hw_queue+0x1ee/0x360 | [<ffffffff8127b528>] blk_mq_map_request+0x108/0x190 take the lock ^^^ | | [<ffffffff8127c8d2>] blk_sq_make_request+0x82/0x350 | [<ffffffff8126f6c0>] generic_make_request+0xd0/0x120 | [<ffffffff8126f788>] submit_bio+0x78/0x190 | [<ffffffff811bd537>] _submit_bh+0x117/0x180 | [<ffffffff811bf528>] __block_write_full_page.constprop.38+0x138/0x3f0 | [<ffffffff811bf880>] block_write_full_page+0xa0/0xe0 | [<ffffffff811c02b3>] blkdev_writepage+0x13/0x20 | [<ffffffff81127b25>] __writepage+0x15/0x40 | [<ffffffff8112873b>] write_cache_pages+0x1fb/0x440 | [<ffffffff811289be>] generic_writepages+0x3e/0x60 | [<ffffffff8112a17c>] do_writepages+0x1c/0x30 | [<ffffffff811b3603>] __writeback_single_inode+0x33/0x140 | [<ffffffff811b462d>] writeback_sb_inodes+0x2bd/0x490 | [<ffffffff811b4897>] __writeback_inodes_wb+0x97/0xd0 | [<ffffffff811b4a9b>] wb_writeback+0x1cb/0x210 | [<ffffffff811b505b>] bdi_writeback_workfn+0x25b/0x380 | [<ffffffff8107b50b>] process_one_work+0x1bb/0x490 | [<ffffffff8107c7ab>] worker_thread+0x6b/0x4f0 | [<ffffffff81081863>] kthread+0xe3/0x100 | [<ffffffff8152627c>] ret_from_fork+0x7c/0xb0 After looking at this for a while it seems that it is save if blk_mq_ctx is used multiple times, the in struct lock protects the access. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13block: blk-mq: Use swaitSebastian Andrzej Siewior3-7/+7
| BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:914 | in_atomic(): 1, irqs_disabled(): 0, pid: 255, name: kworker/u257:6 | 5 locks held by kworker/u257:6/255: | #0: ("events_unbound"){.+.+.+}, at: [<ffffffff8108edf1>] process_one_work+0x171/0x5e0 | #1: ((&entry->work)){+.+.+.}, at: [<ffffffff8108edf1>] process_one_work+0x171/0x5e0 | #2: (&shost->scan_mutex){+.+.+.}, at: [<ffffffffa000faa3>] __scsi_add_device+0xa3/0x130 [scsi_mod] | #3: (&set->tag_list_lock){+.+...}, at: [<ffffffff812f09fa>] blk_mq_init_queue+0x96a/0xa50 | #4: (rcu_read_lock_sched){......}, at: [<ffffffff8132887d>] percpu_ref_kill_and_confirm+0x1d/0x120 | Preemption disabled at:[<ffffffff812eff76>] blk_mq_freeze_queue_start+0x56/0x70 | | CPU: 2 PID: 255 Comm: kworker/u257:6 Not tainted 3.18.7-rt0+ #1 | Workqueue: events_unbound async_run_entry_fn | 0000000000000003 ffff8800bc29f998 ffffffff815b3a12 0000000000000000 | 0000000000000000 ffff8800bc29f9b8 ffffffff8109aa16 ffff8800bc29fa28 | ffff8800bc5d1bc8 ffff8800bc29f9e8 ffffffff815b8dd4 ffff880000000000 | Call Trace: | [<ffffffff815b3a12>] dump_stack+0x4f/0x7c | [<ffffffff8109aa16>] __might_sleep+0x116/0x190 | [<ffffffff815b8dd4>] rt_spin_lock+0x24/0x60 | [<ffffffff810b6089>] __wake_up+0x29/0x60 | [<ffffffff812ee06e>] blk_mq_usage_counter_release+0x1e/0x20 | [<ffffffff81328966>] percpu_ref_kill_and_confirm+0x106/0x120 | [<ffffffff812eff76>] blk_mq_freeze_queue_start+0x56/0x70 | [<ffffffff812f0000>] blk_mq_update_tag_set_depth+0x40/0xd0 | [<ffffffff812f0a1c>] blk_mq_init_queue+0x98c/0xa50 | [<ffffffffa000dcf0>] scsi_mq_alloc_queue+0x20/0x60 [scsi_mod] | [<ffffffffa000ea35>] scsi_alloc_sdev+0x2f5/0x370 [scsi_mod] | [<ffffffffa000f494>] scsi_probe_and_add_lun+0x9e4/0xdd0 [scsi_mod] | [<ffffffffa000fb26>] __scsi_add_device+0x126/0x130 [scsi_mod] | [<ffffffffa013033f>] ata_scsi_scan_host+0xaf/0x200 [libata] | [<ffffffffa012b5b6>] async_port_probe+0x46/0x60 [libata] | [<ffffffff810978fb>] async_run_entry_fn+0x3b/0xf0 | [<ffffffff8108ee81>] process_one_work+0x201/0x5e0 Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13blk-mq: revert raw locks, post pone notifier to POST_DEADSebastian Andrzej Siewior2-8/+11
The blk_mq_cpu_notify_lock should be raw because some CPU down levels are called with interrupts off. The notifier itself calls currently one function that is blk_mq_hctx_notify(). That function acquires the ctx->lock lock which is sleeping and I would prefer to keep it that way. That function only moves IO-requests from the CPU that is going offline to another CPU and it is currently the only one. Therefore I revert the list lock back to sleeping spinlocks and let the notifier run at POST_DEAD time. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13cpu_chill: Add a UNINTERRUPTIBLE hrtimer_nanosleepSteven Rostedt1-7/+18
We hit another bug that was caused by switching cpu_chill() from msleep() to hrtimer_nanosleep(). This time it is a livelock. The problem is that hrtimer_nanosleep() calls schedule with the state == TASK_INTERRUPTIBLE. But these means that if a signal is pending, the scheduler wont schedule, and will simply change the current task state back to TASK_RUNNING. This nullifies the whole point of cpu_chill() in the first place. That is, if a task is spinning on a try_lock() and it preempted the owner of the lock, if it has a signal pending, it will never give up the CPU to let the owner of the lock run. I made a static function __hrtimer_nanosleep() that takes a fifth parameter "state", which determines the task state of that the nanosleep() will be in. The normal hrtimer_nanosleep() will act the same, but cpu_chill() will call the __hrtimer_nanosleep() directly with the TASK_UNINTERRUPTIBLE state. cpu_chill() only cares that the first sleep happens, and does not care about the state of the restart schedule (in hrtimer_nanosleep_restart). Reported-by: Ulrich Obergfell <uobergfe@redhat.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rt: Introduce cpu_chill()Thomas Gleixner2-0/+25
Retry loops on RT might loop forever when the modifying side was preempted. Add cpu_chill() to replace cpu_relax(). cpu_chill() defaults to cpu_relax() for non RT. On RT it puts the looping task to sleep for a tick so the preempted task can make progress. Steven Rostedt changed it to use a hrtimer instead of msleep(): | |Ulrich Obergfell pointed out that cpu_chill() calls msleep() which is woken |up by the ksoftirqd running the TIMER softirq. But as the cpu_chill() is |called from softirq context, it may block the ksoftirqd() from running, in |which case, it may never wake up the msleep() causing the deadlock. | |I checked the vmcore, and irq/74-qla2xxx is stuck in the msleep() call, |running on CPU 8. The one ksoftirqd that is stuck, happens to be the one that |runs on CPU 8, and it is blocked on a lock held by irq/74-qla2xxx. As that |ksoftirqd is the one that will wake up irq/74-qla2xxx, and it happens to be |blocked on a lock that irq/74-qla2xxx holds, we have our deadlock. | |The solution is not to convert the cpu_chill() back to a cpu_relax() as that |will re-create a possible live lock that the cpu_chill() fixed earlier, and may |also leave this bug open on other softirqs. The fix is to remove the |dependency on ksoftirqd from cpu_chill(). That is, instead of calling |msleep() that requires ksoftirqd to wake it up, use the |hrtimer_nanosleep() code that does the wakeup from hard irq context. | ||Looks to be the lock of the block softirq. I don't have the core dump ||anymore, but from what I could tell the ksoftirqd was blocked on the ||block softirq lock, where the block softirq handler did a msleep ||(called by the qla2xxx interrupt handler). || ||Looking at trigger_softirq() in block/blk-softirq.c, it can do a ||smp_callfunction() to another cpu to run the block softirq. If that ||happens to be the cpu where the qla2xx irq handler is doing the block ||softirq and is in a middle of a msleep(), I believe the ksoftirqd will ||try to run the softirq. If it does that, then BOOM, it's deadlocked ||because the ksoftirqd will never run the timer softirq either. | ||I should have also stated that it was only one lock that was involved. ||But the lock owner was doing a msleep() that requires a wakeup by ||ksoftirqd to continue. If ksoftirqd happens to be blocked on a lock ||held by the msleep() caller, then you have your deadlock. || ||It's best not to have any softirqs going to sleep requiring another ||softirq to wake it up. Note, if we ever require a timer softirq to do a ||cpu_chill() it will most definitely hit this deadlock. + bigeasy: add PF_NOFREEZE: | [....] Waiting for /dev to be fully populated... | ===================================== | [ BUG: udevd/229 still has locks held! ] | 3.12.11-rt17 #23 Not tainted | ------------------------------------- | 1 lock held by udevd/229: | #0: (&type->i_mutex_dir_key#2){+.+.+.}, at: lookup_slow+0x28/0x98 | | stack backtrace: | CPU: 0 PID: 229 Comm: udevd Not tainted 3.12.11-rt17 #23 | (unwind_backtrace+0x0/0xf8) from (show_stack+0x10/0x14) | (show_stack+0x10/0x14) from (dump_stack+0x74/0xbc) | (dump_stack+0x74/0xbc) from (do_nanosleep+0x120/0x160) | (do_nanosleep+0x120/0x160) from (hrtimer_nanosleep+0x90/0x110) | (hrtimer_nanosleep+0x90/0x110) from (cpu_chill+0x30/0x38) | (cpu_chill+0x30/0x38) from (dentry_kill+0x158/0x1ec) | (dentry_kill+0x158/0x1ec) from (dput+0x74/0x15c) | (dput+0x74/0x15c) from (lookup_real+0x4c/0x50) | (lookup_real+0x4c/0x50) from (__lookup_hash+0x34/0x44) | (__lookup_hash+0x34/0x44) from (lookup_slow+0x38/0x98) | (lookup_slow+0x38/0x98) from (path_lookupat+0x208/0x7fc) | (path_lookupat+0x208/0x7fc) from (filename_lookup+0x20/0x60) | (filename_lookup+0x20/0x60) from (user_path_at_empty+0x50/0x7c) | (user_path_at_empty+0x50/0x7c) from (user_path_at+0x14/0x1c) | (user_path_at+0x14/0x1c) from (vfs_fstatat+0x48/0x94) | (vfs_fstatat+0x48/0x94) from (SyS_stat64+0x14/0x30) | (SyS_stat64+0x14/0x30) from (ret_fast_syscall+0x0/0x48) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13dump stack: don't disable preemption during traceSebastian Andrzej Siewior3-8/+8
I see here large latencies during a stack dump on x86. The preempt_disable() and get_cpu() should forbid moving the task to another CPU during a stack dump and avoiding two stack traces in parallel on the same CPU. However a stack trace from a second CPU may still happen in parallel. Also nesting is allowed so a stack trace happens in process-context and we may have another one from IRQ context. With migrate disable we keep this code preemptible and allow a second backtrace on the same CPU by another task. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13block/mq: don't complete requests via IPISebastian Andrzej Siewior4-0/+25
The IPI runs in hardirq context and there are sleeping locks. This patch moves the completion into a workqueue. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13block/mq: do not invoke preempt_disable()Sebastian Andrzej Siewior1-5/+5
preempt_disable() and get_cpu() don't play well together with the sleeping locks it tries to allocate later. It seems to be enough to replace it with get_cpu_light() and migrate_disable(). Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13block: mq: use cpu_light()Sebastian Andrzej Siewior2-3/+18
there is a might sleep splat because get_cpu() disables preemption and later we grab a lock. As a workaround for this we use get_cpu_light() and an additional lock to prevent taking the same ctx. There is a lock member in the ctx already but there some functions which do ++ on the member and this works with irq off but on RT we would need the extra lock. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13mm/vmalloc: Another preempt disable region which sucksThomas Gleixner1-5/+8
Avoid the preempt disable version of get_cpu_var(). The inner-lock should provide enough serialisation. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13fs/epoll: Do not disable preemption on RTThomas Gleixner1-2/+2
ep_call_nested() takes a sleeping lock so we can't disable preemption. The light version is enough since ep_call_nested() doesn't mind beeing invoked twice on the same CPU. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13thermal: Defer thermal wakups to threadsDaniel Wagner1-3/+47
On RT the spin lock in pkg_temp_thermal_platfrom_thermal_notify will call schedule while we run in irq context. [<ffffffff816850ac>] dump_stack+0x4e/0x8f [<ffffffff81680f7d>] __schedule_bug+0xa6/0xb4 [<ffffffff816896b4>] __schedule+0x5b4/0x700 [<ffffffff8168982a>] schedule+0x2a/0x90 [<ffffffff8168a8b5>] rt_spin_lock_slowlock+0xe5/0x2d0 [<ffffffff8168afd5>] rt_spin_lock+0x25/0x30 [<ffffffffa03a7b75>] pkg_temp_thermal_platform_thermal_notify+0x45/0x134 [x86_pkg_temp_thermal] [<ffffffff8103d4db>] ? therm_throt_process+0x1b/0x160 [<ffffffff8103d831>] intel_thermal_interrupt+0x211/0x250 [<ffffffff8103d8c1>] smp_thermal_interrupt+0x21/0x40 [<ffffffff8169415d>] thermal_interrupt+0x6d/0x80 Let's defer the work to a kthread. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> [bigeasy: reoder init/denit position. TODO: flush swork on exit] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13x86: UV: raw_spinlock conversionMike Galbraith5-30/+35
Shrug. Lots of hobbyists have a beast in their basement, right? Signed-off-by: Mike Galbraith <mgalbraith@suse.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13x86: Use generic rwsem_spinlocks on -rtThomas Gleixner1-1/+4
Simplifies the separation of anon_rw_semaphores and rw_semaphores for -rt. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13x86: stackprotector: Avoid random pool on rtThomas Gleixner1-1/+8
CPU bringup calls into the random pool to initialize the stack canary. During boot that works nicely even on RT as the might sleep checks are disabled. During CPU hotplug the might sleep checks trigger. Making the locks in random raw is a major PITA, so avoid the call on RT is the only sensible solution. This is basically the same randomness which we get during boot where the random pool has no entropy and we rely on the TSC randomnness. Reported-by: Carsten Emde <carsten.emde@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13x86/mce: use swait queue for mce wakeupsSteven Rostedt1-12/+56
We had a customer report a lockup on a 3.0-rt kernel that had the following backtrace: [ffff88107fca3e80] rt_spin_lock_slowlock at ffffffff81499113 [ffff88107fca3f40] rt_spin_lock at ffffffff81499a56 [ffff88107fca3f50] __wake_up at ffffffff81043379 [ffff88107fca3f80] mce_notify_irq at ffffffff81017328 [ffff88107fca3f90] intel_threshold_interrupt at ffffffff81019508 [ffff88107fca3fa0] smp_threshold_interrupt at ffffffff81019fc1 [ffff88107fca3fb0] threshold_interrupt at ffffffff814a1853 It actually bugged because the lock was taken by the same owner that already had that lock. What happened was the thread that was setting itself on a wait queue had the lock when an MCE triggered. The MCE interrupt does a wake up on its wait list and grabs the same lock. NOTE: THIS IS NOT A BUG ON MAINLINE Sorry for yelling, but as I Cc'd mainline maintainers I want them to know that this is an PREEMPT_RT bug only. I only Cc'd them for advice. On PREEMPT_RT the wait queue locks are converted from normal "spin_locks" into an rt_mutex (see the rt_spin_lock_slowlock above). These are not to be taken by hard interrupt context. This usually isn't a problem as most all interrupts in PREEMPT_RT are converted into schedulable threads. Unfortunately that's not the case with the MCE irq. As wait queue locks are notorious for long hold times, we can not convert them to raw_spin_locks without causing issues with -rt. But Thomas has created a "simple-wait" structure that uses raw spin locks which may have been a good fit. Unfortunately, wait queues are not the only issue, as the mce_notify_irq also does a schedule_work(), which grabs the workqueue spin locks that have the exact same issue. Thus, this patch I'm proposing is to move the actual work of the MCE interrupt into a helper thread that gets woken up on the MCE interrupt and does the work in a schedulable context. NOTE: THIS PATCH ONLY CHANGES THE BEHAVIOR WHEN PREEMPT_RT IS SET Oops, sorry for yelling again, but I want to stress that I keep the same behavior of mainline when PREEMPT_RT is not set. Thus, this only changes the MCE behavior when PREEMPT_RT is configured. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> [bigeasy@linutronix: make mce_notify_work() a proper prototype, use kthread_run()] Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> [wagi: use work-simple framework to defer work to a kthread] Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
2016-02-13x86: Convert mce timer to hrtimerThomas Gleixner1-32/+20
mce_timer is started in atomic contexts of cpu bringup. This results in might_sleep() warnings on RT. Convert mce_timer to a hrtimer to avoid this. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> fold in: |From: Mike Galbraith <bitbucket@online.de> |Date: Wed, 29 May 2013 13:52:13 +0200 |Subject: [PATCH] x86/mce: fix mce timer interval | |Seems mce timer fire at the wrong frequency in -rt kernels since roughly |forever due to 32 bit overflow. 3.8-rt is also missing a multiplier. | |Add missing us -> ns conversion and 32 bit overflow prevention. | |Signed-off-by: Mike Galbraith <bitbucket@online.de> |[bigeasy: use ULL instead of u64 cast] |Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13fs: jbd2: pull your plug when waiting for spaceSebastian Andrzej Siewior1-0/+2
Two cps in parallel managed to stall the the ext4 fs. It seems that journal code is either waiting for locks or sleeping waiting for something to happen. This seems similar to what Mike observed on ext3, here is his description: |With an -rt kernel, and a heavy sync IO load, tasks can jam |up on journal locks without unplugging, which can lead to |terminal IO starvation. Unplug and schedule when waiting |for space. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13fs: ntfs: disable interrupt only on !RTMike Galbraith1-2/+2
On Sat, 2007-10-27 at 11:44 +0200, Ingo Molnar wrote: > * Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > [10138.175796] [<c0105de3>] show_trace+0x12/0x14 > > > [10138.180291] [<c0105dfb>] dump_stack+0x16/0x18 > > > [10138.184769] [<c011609f>] native_smp_call_function_mask+0x138/0x13d > > > [10138.191117] [<c0117606>] smp_call_function+0x1e/0x24 > > > [10138.196210] [<c012f85c>] on_each_cpu+0x25/0x50 > > > [10138.200807] [<c0115c74>] flush_tlb_all+0x1e/0x20 > > > [10138.205553] [<c016caaf>] kmap_high+0x1b6/0x417 > > > [10138.210118] [<c011ec88>] kmap+0x4d/0x4f > > > [10138.214102] [<c026a9d8>] ntfs_end_buffer_async_read+0x228/0x2f9 > > > [10138.220163] [<c01a0e9e>] end_bio_bh_io_sync+0x26/0x3f > > > [10138.225352] [<c01a2b09>] bio_endio+0x42/0x6d > > > [10138.229769] [<c02c2a08>] __end_that_request_first+0x115/0x4ac > > > [10138.235682] [<c02c2da7>] end_that_request_chunk+0x8/0xa > > > [10138.241052] [<c0365943>] ide_end_request+0x55/0x10a > > > [10138.246058] [<c036dae3>] ide_dma_intr+0x6f/0xac > > > [10138.250727] [<c0366d83>] ide_intr+0x93/0x1e0 > > > [10138.255125] [<c015afb4>] handle_IRQ_event+0x5c/0xc9 > > > > Looks like ntfs is kmap()ing from interrupt context. Should be using > > kmap_atomic instead, I think. > > it's not atomic interrupt context but irq thread context - and -rt > remaps kmap_atomic() to kmap() internally. Hm. Looking at the change to mm/bounce.c, perhaps I should do this instead? Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13block: Turn off warning which is bogus on RTThomas Gleixner1-1/+1
On -RT the context is always with IRQs enabled. Ignore this warning on -RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13mm: Protect activate_mm() by preempt_[disable&enable]_rt()Yong Zhang2-0/+4
User preempt_*_rt instead of local_irq_*_rt or otherwise there will be warning on ARM like below: WARNING: at build/linux/kernel/smp.c:459 smp_call_function_many+0x98/0x264() Modules linked in: [<c0013bb4>] (unwind_backtrace+0x0/0xe4) from [<c001be94>] (warn_slowpath_common+0x4c/0x64) [<c001be94>] (warn_slowpath_common+0x4c/0x64) from [<c001bec4>] (warn_slowpath_null+0x18/0x1c) [<c001bec4>] (warn_slowpath_null+0x18/0x1c) from [<c0053ff8>](smp_call_function_many+0x98/0x264) [<c0053ff8>] (smp_call_function_many+0x98/0x264) from [<c0054364>] (smp_call_function+0x44/0x6c) [<c0054364>] (smp_call_function+0x44/0x6c) from [<c0017d50>] (__new_context+0xbc/0x124) [<c0017d50>] (__new_context+0xbc/0x124) from [<c009e49c>] (flush_old_exec+0x460/0x5e4) [<c009e49c>] (flush_old_exec+0x460/0x5e4) from [<c00d61ac>] (load_elf_binary+0x2e0/0x11ac) [<c00d61ac>] (load_elf_binary+0x2e0/0x11ac) from [<c009d060>] (search_binary_handler+0x94/0x2a4) [<c009d060>] (search_binary_handler+0x94/0x2a4) from [<c009e8fc>] (do_execve+0x254/0x364) [<c009e8fc>] (do_execve+0x254/0x364) from [<c0010e84>] (sys_execve+0x34/0x54) [<c0010e84>] (sys_execve+0x34/0x54) from [<c000da00>] (ret_fast_syscall+0x0/0x30) ---[ end trace 0000000000000002 ]--- The reason is that ARM need irq enabled when doing activate_mm(). According to mm-protect-activate-switch-mm.patch, actually preempt_[disable|enable]_rt() is sufficient. Inspired-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/r/1337061236-1766-1-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13fs: namespace preemption fixThomas Gleixner1-1/+4
On RT we cannot loop with preemption disabled here as mnt_make_readonly() might have been preempted. We can safely enable preemption while waiting for MNT_WRITE_HOLD to be cleared. Safe on !RT as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13fs/aio: simple simple workSebastian Andrzej Siewior1-7/+17
|BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:768 |in_atomic(): 1, irqs_disabled(): 0, pid: 26, name: rcuos/2 |2 locks held by rcuos/2/26: | #0: (rcu_callback){.+.+..}, at: [<ffffffff810b1a12>] rcu_nocb_kthread+0x1e2/0x380 | #1: (rcu_read_lock_sched){.+.+..}, at: [<ffffffff812acd26>] percpu_ref_kill_rcu+0xa6/0x1c0 |Preemption disabled at:[<ffffffff810b1a93>] rcu_nocb_kthread+0x263/0x380 |Call Trace: | [<ffffffff81582e9e>] dump_stack+0x4e/0x9c | [<ffffffff81077aeb>] __might_sleep+0xfb/0x170 | [<ffffffff81589304>] rt_spin_lock+0x24/0x70 | [<ffffffff811c5790>] free_ioctx_users+0x30/0x130 | [<ffffffff812ace34>] percpu_ref_kill_rcu+0x1b4/0x1c0 | [<ffffffff810b1a93>] rcu_nocb_kthread+0x263/0x380 | [<ffffffff8106e046>] kthread+0xd6/0xf0 | [<ffffffff81591eec>] ret_from_fork+0x7c/0xb0 replace this preempt_disable() friendly swork. Reported-By: Mike Galbraith <umgwanakikbuti@gmail.com> Suggested-by: Benjamin LaHaise <bcrl@kvack.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13completion: Use simple wait queuesThomas Gleixner7-27/+33
Completions have no long lasting callbacks and therefor do not need the complex waitqueue variant. Use simple waitqueues which reduces the contention on the waitqueue lock. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13rcu: use simple waitqueuesThomas Gleixner3-21/+22
Convert RCU's wait-queues into simple waitqueues. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Merged Steven's static void rcu_nocb_gp_cleanup(struct rcu_state *rsp, struct rcu_node *rnp) { - swait_wake(&rnp->nocb_gp_wq[rnp->completed & 0x1]); + wake_up_all(&rnp->nocb_gp_wq[rnp->completed & 0x1]); } Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13work-simple: Simple work queue implemenationDaniel Wagner3-1/+198
Provides a framework for enqueuing callbacks from irq context PREEMPT_RT_FULL safe. The callbacks are executed in kthread context. Bases on wait-simple. Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13wait-simple: Simple waitqueue implementationThomas Gleixner3-1/+323
wait_queue is a swiss army knife and in most of the cases the complexity is not needed. For RT waitqueues are a constant source of trouble as we can't convert the head lock to a raw spinlock due to fancy and long lasting callbacks. Provide a slim version, which allows RT to replace wait queues. This should go mainline as well, as it lowers memory consumption and runtime overhead. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> smp_mb() added by Steven Rostedt to fix a race condition with swait wakeups vs adding items to the list.
2016-02-13wait.h: include atomic.hSebastian Andrzej Siewior1-0/+1
| CC init/main.o |In file included from include/linux/mmzone.h:9:0, | from include/linux/gfp.h:4, | from include/linux/kmod.h:22, | from include/linux/module.h:13, | from init/main.c:15: |include/linux/wait.h: In function ‘wait_on_atomic_t’: |include/linux/wait.h:982:2: error: implicit declaration of function ‘atomic_read’ [-Werror=implicit-function-declaration] | if (atomic_read(val) == 0) | ^ This pops up on ARM. Non-RT gets its atomic.h include from spinlock.h Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rt: Improve the serial console PASS_LIMITIngo Molnar1-1/+10
Beyond the warning: drivers/tty/serial/8250/8250.c:1613:6: warning: unused variable ‘pass_counter’ [-Wunused-variable] the solution of just looping infinitely was ugly - up it to 1 million to give it a chance to continue in some really ugly situation. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13tty/serial/pl011: Make the locking work on RTThomas Gleixner1-5/+10
The lock is a sleeping lock and local_irq_save() is not the optimsation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13tty/serial/omap: Make the locking RT awareThomas Gleixner1-8/+4
The lock is a sleeping lock and local_irq_save() is not the optimsation we are looking for. Redo it to make it work on -RT and non-RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13kernel/stop_machine: partly revert "stop_machine: Use raw spinlocks"Sebastian Andrzej Siewior1-32/+8
With completion using swait and so rawlocks we don't need this anymore. Further, bisect thinks this patch is responsible for: |BUG: unable to handle kernel NULL pointer dereference at (null) |IP: [<ffffffff81082123>] sched_cpu_active+0x53/0x70 |PGD 0 |Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC |Dumping ftrace buffer: | (ftrace buffer empty) |Modules linked in: |CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.4.1+ #330 |Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Debian-1.8.2-1 04/01/2014 |task: ffff88013ae64b00 ti: ffff88013ae74000 task.ti: ffff88013ae74000 |RIP: 0010:[<ffffffff81082123>] [<ffffffff81082123>] sched_cpu_active+0x53/0x70 |RSP: 0000:ffff88013ae77eb8 EFLAGS: 00010082 |RAX: 0000000000000001 RBX: ffffffff81c2cf20 RCX: 0000001050fb52fb |RDX: 0000001050fb52fb RSI: 000000105117ca1e RDI: 00000000001c7723 |RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001 |R10: 0000000000000000 R11: 0000000000000001 R12: 00000000ffffffff |R13: ffffffff81c2cee0 R14: 0000000000000000 R15: 0000000000000001 |FS: 0000000000000000(0000) GS:ffff88013b200000(0000) knlGS:0000000000000000 |CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b |CR2: 0000000000000000 CR3: 0000000001c09000 CR4: 00000000000006e0 |Stack: | ffffffff810c446d ffff88013ae77f00 ffffffff8107d8dd 000000000000000a | 0000000000000001 0000000000000000 0000000000000000 0000000000000000 | 0000000000000000 ffff88013ae77f10 ffffffff8107d90e ffff88013ae77f20 |Call Trace: | [<ffffffff810c446d>] ? debug_lockdep_rcu_enabled+0x1d/0x20 | [<ffffffff8107d8dd>] ? notifier_call_chain+0x5d/0x80 | [<ffffffff8107d90e>] ? __raw_notifier_call_chain+0xe/0x10 | [<ffffffff810598a3>] ? cpu_notify+0x23/0x40 | [<ffffffff8105a7b8>] ? notify_cpu_starting+0x28/0x30 during hotplug. The rawlocks need to remain however. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13stomp-machine: use lg_global_trylock_relax() to dead with stop_cpus_lock lglockMike Galbraith1-10/+15
If the stop machinery is called from inactive CPU we cannot use lg_global_lock(), because some other stomp machine invocation might be in progress and the lock can be contended. We cannot schedule from this context, so use the lovely new lg_global_trylock_relax() primitive to do what we used to do via one mutex_trylock()/cpu_relax() loop. We now do that trylock()/relax() across an entire herd of locks. Joy. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13stomp-machine: create lg_global_trylock_relax() primitiveMike Galbraith4-0/+37
Create lg_global_trylock_relax() for use by stopper thread when it cannot schedule, to deal with stop_cpus_lock, which is now an lglock. Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13lglocks: Provide a RT safe variantThomas Gleixner2-22/+58
lglocks by itself will spin in order to get the lock. This will end up badly if a task with the highest priority keeps spinning while a task with the lowest priority owns the lock. Lets replace them with rt_mutex based locks so they can sleep, track owner and boost if needed. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13rcu: disable more spots of rcu_bhSebastian Andrzej Siewior2-0/+8
We don't use ru_bh on -RT but we still fork a thread for it and keep it as a flavour. No more. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rcutree/rcu_bh_qs: Disable irq while calling rcu_preempt_qs()Tiejun Chen1-0/+5
Any callers to the function rcu_preempt_qs() must disable irqs in order to protect the assignment to ->rcu_read_unlock_special. In RT case, rcu_bh_qs() as the wrapper of rcu_preempt_qs() is called in some scenarios where irq is enabled, like this path, do_single_softirq() | + local_irq_enable(); + handle_softirq() | | | + rcu_bh_qs() | | | + rcu_preempt_qs() | + local_irq_disable() So here we'd better disable irq directly inside of rcu_bh_qs() to fix this, otherwise the kernel may be freezable sometimes as observed. And especially this way is also kind and safe for the potential rcu_bh_qs() usage elsewhere in the future. Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com> Signed-off-by: Bin Jiang <bin.jiang@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rcu: Make ksoftirqd do RCU quiescent statesPaul E. McKenney3-6/+15
Implementing RCU-bh in terms of RCU-preempt makes the system vulnerable to network-based denial-of-service attacks. This patch therefore makes __do_softirq() invoke rcu_bh_qs(), but only when __do_softirq() is running in ksoftirqd context. A wrapper layer in interposed so that other calls to __do_softirq() avoid invoking rcu_bh_qs(). The underlying function __do_softirq_common() does the actual work. The reason that rcu_bh_qs() is bad in these non-ksoftirqd contexts is that there might be a local_bh_enable() inside an RCU-preempt read-side critical section. This local_bh_enable() can invoke __do_softirq() directly, so if __do_softirq() were to invoke rcu_bh_qs() (which just calls rcu_preempt_qs() in the PREEMPT_RT_FULL case), there would be an illegal RCU-preempt quiescent state in the middle of an RCU-preempt read-side critical section. Therefore, quiescent states can only happen in cases where __do_softirq() is invoked directly from ksoftirqd. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20111005184518.GA21601@linux.vnet.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13rcu: Merge RCU-bh into RCU-preemptThomas Gleixner4-2/+57
The Linux kernel has long RCU-bh read-side critical sections that intolerably increase scheduling latency under mainline's RCU-bh rules, which include RCU-bh read-side critical sections being non-preemptible. This patch therefore arranges for RCU-bh to be implemented in terms of RCU-preempt for CONFIG_PREEMPT_RT_FULL=y. This has the downside of defeating the purpose of RCU-bh, namely, handling the case where the system is subjected to a network-based denial-of-service attack that keeps at least one CPU doing full-time softirq processing. This issue will be fixed by a later commit. The current commit will need some work to make it appropriate for mainline use, for example, it needs to be extended to cover Tiny RCU. [ paulmck: Added a useful changelog ] Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/20111005185938.GA20403@linux.vnet.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13rcu: Frob softirq testPeter Zijlstra1-1/+1
With RT_FULL we get the below wreckage: [ 126.060484] ======================================================= [ 126.060486] [ INFO: possible circular locking dependency detected ] [ 126.060489] 3.0.1-rt10+ #30 [ 126.060490] ------------------------------------------------------- [ 126.060492] irq/24-eth0/1235 is trying to acquire lock: [ 126.060495] (&(lock)->wait_lock#2){+.+...}, at: [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55 [ 126.060503] [ 126.060504] but task is already holding lock: [ 126.060506] (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429 [ 126.060511] [ 126.060511] which lock already depends on the new lock. [ 126.060513] [ 126.060514] [ 126.060514] the existing dependency chain (in reverse order) is: [ 126.060516] [ 126.060516] -> #1 (&p->pi_lock){-...-.}: [ 126.060519] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a [ 126.060524] [<ffffffff8150291e>] _raw_spin_lock_irqsave+0x4b/0x85 [ 126.060527] [<ffffffff810b5aa4>] task_blocks_on_rt_mutex+0x36/0x20f [ 126.060531] [<ffffffff815019bb>] rt_mutex_slowlock+0xd1/0x15a [ 126.060534] [<ffffffff81501ae3>] rt_mutex_lock+0x2d/0x2f [ 126.060537] [<ffffffff810d9020>] rcu_boost+0xad/0xde [ 126.060541] [<ffffffff810d90ce>] rcu_boost_kthread+0x7d/0x9b [ 126.060544] [<ffffffff8109a760>] kthread+0x99/0xa1 [ 126.060547] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10 [ 126.060551] [ 126.060552] -> #0 (&(lock)->wait_lock#2){+.+...}: [ 126.060555] [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816 [ 126.060558] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a [ 126.060561] [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73 [ 126.060564] [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55 [ 126.060566] [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29 [ 126.060569] [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4 [ 126.060573] [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89 [ 126.060576] [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5 [ 126.060580] [<ffffffff8107511c>] try_to_wake_up+0x175/0x429 [ 126.060583] [<ffffffff81075425>] wake_up_process+0x15/0x17 [ 126.060585] [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26 [ 126.060590] [<ffffffff81081df9>] irq_exit+0x49/0x55 [ 126.060593] [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98 [ 126.060597] [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20 [ 126.060600] [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44 [ 126.060603] [<ffffffff810d582c>] irq_thread+0xde/0x1af [ 126.060606] [<ffffffff8109a760>] kthread+0x99/0xa1 [ 126.060608] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10 [ 126.060611] [ 126.060612] other info that might help us debug this: [ 126.060614] [ 126.060615] Possible unsafe locking scenario: [ 126.060616] [ 126.060617] CPU0 CPU1 [ 126.060619] ---- ---- [ 126.060620] lock(&p->pi_lock); [ 126.060623] lock(&(lock)->wait_lock); [ 126.060625] lock(&p->pi_lock); [ 126.060627] lock(&(lock)->wait_lock); [ 126.060629] [ 126.060629] *** DEADLOCK *** [ 126.060630] [ 126.060632] 1 lock held by irq/24-eth0/1235: [ 126.060633] #0: (&p->pi_lock){-...-.}, at: [<ffffffff81074fdc>] try_to_wake_up+0x35/0x429 [ 126.060638] [ 126.060638] stack backtrace: [ 126.060641] Pid: 1235, comm: irq/24-eth0 Not tainted 3.0.1-rt10+ #30 [ 126.060643] Call Trace: [ 126.060644] <IRQ> [<ffffffff810acbde>] print_circular_bug+0x289/0x29a [ 126.060651] [<ffffffff810af1b8>] __lock_acquire+0x1157/0x1816 [ 126.060655] [<ffffffff810ab3aa>] ? trace_hardirqs_off_caller+0x1f/0x99 [ 126.060658] [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55 [ 126.060661] [<ffffffff810afe9e>] lock_acquire+0x145/0x18a [ 126.060664] [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55 [ 126.060668] [<ffffffff8150279e>] _raw_spin_lock+0x40/0x73 [ 126.060671] [<ffffffff81501c81>] ? rt_mutex_slowunlock+0x16/0x55 [ 126.060674] [<ffffffff810d9655>] ? rcu_report_qs_rsp+0x87/0x8c [ 126.060677] [<ffffffff81501c81>] rt_mutex_slowunlock+0x16/0x55 [ 126.060680] [<ffffffff810d9ea3>] ? rcu_read_unlock_special+0x9b/0x1c4 [ 126.060683] [<ffffffff81501ce7>] rt_mutex_unlock+0x27/0x29 [ 126.060687] [<ffffffff810d9f86>] rcu_read_unlock_special+0x17e/0x1c4 [ 126.060690] [<ffffffff810da014>] __rcu_read_unlock+0x48/0x89 [ 126.060693] [<ffffffff8106847a>] select_task_rq_rt+0xc7/0xd5 [ 126.060696] [<ffffffff810683da>] ? select_task_rq_rt+0x27/0xd5 [ 126.060701] [<ffffffff810a852a>] ? clockevents_program_event+0x8e/0x90 [ 126.060704] [<ffffffff8107511c>] try_to_wake_up+0x175/0x429 [ 126.060708] [<ffffffff810a95dc>] ? tick_program_event+0x1f/0x21 [ 126.060711] [<ffffffff81075425>] wake_up_process+0x15/0x17 [ 126.060715] [<ffffffff81080a51>] wakeup_softirqd+0x24/0x26 [ 126.060718] [<ffffffff81081df9>] irq_exit+0x49/0x55 [ 126.060721] [<ffffffff8150a3bd>] smp_apic_timer_interrupt+0x8a/0x98 [ 126.060724] [<ffffffff81509793>] apic_timer_interrupt+0x13/0x20 [ 126.060726] <EOI> [<ffffffff81072855>] ? migrate_disable+0x75/0x12d [ 126.060733] [<ffffffff81080a61>] ? local_bh_disable+0xe/0x1f [ 126.060736] [<ffffffff81080a70>] ? local_bh_disable+0x1d/0x1f [ 126.060739] [<ffffffff810d5952>] irq_forced_thread_fn+0x1b/0x44 [ 126.060742] [<ffffffff81502ac0>] ? _raw_spin_unlock_irq+0x3b/0x59 [ 126.060745] [<ffffffff810d582c>] irq_thread+0xde/0x1af [ 126.060748] [<ffffffff810d5937>] ? irq_thread_fn+0x3a/0x3a [ 126.060751] [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1 [ 126.060754] [<ffffffff810d574e>] ? irq_finalize_oneshot+0xd1/0xd1 [ 126.060757] [<ffffffff8109a760>] kthread+0x99/0xa1 [ 126.060761] [<ffffffff81509b14>] kernel_thread_helper+0x4/0x10 [ 126.060764] [<ffffffff81069ed7>] ? finish_task_switch+0x87/0x10a [ 126.060768] [<ffffffff81502ec4>] ? retint_restore_args+0xe/0xe [ 126.060771] [<ffffffff8109a6c7>] ? __init_kthread_worker+0x8c/0x8c [ 126.060774] [<ffffffff81509b10>] ? gs_change+0xb/0xb Because irq_exit() does: void irq_exit(void) { account_system_vtime(current); trace_hardirq_exit(); sub_preempt_count(IRQ_EXIT_OFFSET); if (!in_interrupt() && local_softirq_pending()) invoke_softirq(); ... } Which triggers a wakeup, which uses RCU, now if the interrupted task has t->rcu_read_unlock_special set, the rcu usage from the wakeup will end up in rcu_read_unlock_special(). rcu_read_unlock_special() will test for in_irq(), which will fail as we just decremented preempt_count with IRQ_EXIT_OFFSET, and in_sering_softirq(), which for PREEMPT_RT_FULL reads: int in_serving_softirq(void) { int res; preempt_disable(); res = __get_cpu_var(local_softirq_runner) == current; preempt_enable(); return res; } Which will thus also fail, resulting in the above wreckage. The 'somewhat' ugly solution is to open-code the preempt_count() test in rcu_read_unlock_special(). Also, we're not at all sure how ->rcu_read_unlock_special gets set here... so this is very likely a bandaid and more thought is required. Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
2016-02-13ptrace: don't open IRQs in ptrace_freeze_traced() too earlySebastian Andrzej Siewior1-2/+4
In the non-RT case the spin_lock_irq() here disables interrupts as well as raw_spin_lock_irq(). So in the unlock case the interrupts are enabled too early. Reported-by: kernel test robot <ying.huang@linux.intel.com> Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rwlocks: Fix section mismatchJohn Kacur2-2/+3
This fixes the following build error for the preempt-rt kernel. make kernel/fork.o CC kernel/fork.o kernel/fork.c:90: error: section of tasklist_lock conflicts with previous declaration make[2]: *** [kernel/fork.o] Error 1 make[1]: *** [kernel/fork.o] Error 2 The rt kernel cache aligns the RWLOCK in DEFINE_RWLOCK by default. The non-rt kernels explicitly cache align only the tasklist_lock in kernel/fork.c That can create a build conflict. This fixes the build problem by making the non-rt kernels cache align RWLOCKs by default. The side effect is that the other RWLOCKs are also cache aligned for non-rt. This is a short term solution for rt only. The longer term solution would be to push the cache aligned DEFINE_RWLOCK to mainline. If there are objections, then we could create a DEFINE_RWLOCK_CACHE_ALIGNED or something of that nature. Signed-off-by: John Kacur <jkacur@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/alpine.LFD.2.00.1109191104010.23118@localhost6.localdomain6 Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13ptrace: fix ptrace vs tasklist_lock raceSebastian Andrzej Siewior3-6/+66
As explained by Alexander Fyodorov <halcy@yandex.ru>: |read_lock(&tasklist_lock) in ptrace_stop() is converted to mutex on RT kernel, |and it can remove __TASK_TRACED from task->state (by moving it to |task->saved_state). If parent does wait() on child followed by a sys_ptrace |call, the following race can happen: | |- child sets __TASK_TRACED in ptrace_stop() |- parent does wait() which eventually calls wait_task_stopped() and returns | child's pid |- child blocks on read_lock(&tasklist_lock) in ptrace_stop() and moves | __TASK_TRACED flag to saved_state |- parent calls sys_ptrace, which calls ptrace_check_attach() and wait_task_inactive() The patch is based on his initial patch where an additional check is added in case the __TASK_TRACED moved to ->saved_state. The pi_lock is taken in case the caller is interrupted between looking into ->state and ->saved_state. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rtmutex: Add RT aware ww locksSebastian Andrzej Siewior1-25/+226
lockdep says: | -------------------------------------------------------------------------- | | Wound/wait tests | | --------------------- | ww api failures: ok | ok | ok | | ww contexts mixing: ok | ok | | finishing ww context: ok | ok | ok | ok | | locking mismatches: ok | ok | ok | | EDEADLK handling: ok | ok | ok | ok | ok | ok | ok | ok | ok | ok | | spinlock nest unlocked: ok | | ----------------------------------------------------- | |block | try |context| | ----------------------------------------------------- | context: ok | ok | ok | | try: ok | ok | ok | | block: ok | ok | ok | | spinlock: ok | ok | ok | Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
2016-02-13rtmutex: Use chainwalking control enumbmouring@ni.com1-1/+1
In 8930ed80 (rtmutex: Cleanup deadlock detector debug logic), chainwalking control enums were introduced to limit the deadlock detection logic. One of the calls to task_blocks_on_rt_mutex was missed when converting to use the enums. Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Brad Mouring <brad.mouring@ni.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rt: Add the preempt-rt lock replacement APIsThomas Gleixner23-56/+1597
Map spinlocks, rwlocks, rw_semaphores and semaphores to the rt_mutex based locking functions for preempt-rt. This also introduces RT's sleeping locks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13rbtree: don't include the rcu headerSebastian Andrzej Siewior2-9/+13
The RCU header pulls in spinlock.h and fails due not yet defined types: |In file included from include/linux/spinlock.h:275:0, | from include/linux/rcupdate.h:38, | from include/linux/rbtree.h:34, | from include/linux/rtmutex.h:17, | from include/linux/spinlock_types.h:18, | from kernel/bounds.c:13: |include/linux/rwlock_rt.h:16:38: error: unknown type name ‘rwlock_t’ | extern void __lockfunc rt_write_lock(rwlock_t *rwlock); | ^ This patch moves the only RCU user from the header file into c file so the inclusion can be avoided. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13rtmutex: Avoid include hellThomas Gleixner1-1/+1
Include only the required raw types. This avoids pulling in the complete spinlock header which in turn requires rtmutex.h at some point. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13spinlock: Split the lock types headerThomas Gleixner4-72/+95
Split raw_spinlock into its own file and the remaining spinlock_t into its own non-RT header. The non-RT header will be replaced later by sleeping spinlocks. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13rtmutex: Add rtmutex_lock_killable()Thomas Gleixner2-0/+20
Add "killable" type to rtmutex. We need this since rtmutex are used as "normal" mutexes which do use this type. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13locking: locktorture: Do NOT include rwlock.h directlyWolfgang M. Reimer1-1/+0
Including rwlock.h directly will cause kernel builds to fail if CONFIG_PREEMPT_RT_FULL is defined. The correct header file (rwlock_rt.h OR rwlock.h) will be included by spinlock.h which is included by locktorture.c anyway. Cc: stable-rt@vger.kernel.org Signed-off-by: Wolfgang M. Reimer <linuxball@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13wait.h: include atomic.hGrygorii Strashko1-0/+1
This patch fixes build error: CC kernel/pid_namespace.o In file included from kernel/pid_namespace.c:11:0: include/linux/pid.h: In function 'get_pid': include/linux/pid.h:78:3: error: implicit declaration of function 'atomic_inc' [-Werror=implicit-function-declaration] atomic_inc(&pid->count); ^ which happens when CONFIG_PROVE_LOCKING=n CONFIG_DEBUG_SPINLOCK=n CONFIG_DEBUG_MUTEXES=n CONFIG_DEBUG_LOCK_ALLOC=n CONFIG_PID_NS=y Vanilla gets this via spinlock.h. Signed-off-by: Grygorii Strashko <Grygorii.Strashko@linaro.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13futex: Ensure lock/unlock symetry versus pi_lock and hash bucket lockThomas Gleixner1-0/+2
In exit_pi_state_list() we have the following locking construct: spin_lock(&hb->lock); raw_spin_lock_irq(&curr->pi_lock); ... spin_unlock(&hb->lock); In !RT this works, but on RT the migrate_enable() function which is called from spin_unlock() sees atomic context due to the held pi_lock and just decrements the migrate_disable_atomic counter of the task. Now the next call to migrate_disable() sees the counter being negative and issues a warning. That check should be in migrate_enable() already. Fix this by dropping pi_lock before unlocking hb->lock and reaquire pi_lock after that again. This is safe as the loop code reevaluates head again under the pi_lock. Reported-by: Yong Zhang <yong.zhang@windriver.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13futex: Fix bug on when a requeued RT task times outSteven Rostedt2-1/+32
Requeue with timeout causes a bug with PREEMPT_RT_FULL. The bug comes from a timed out condition. TASK 1 TASK 2 ------ ------ futex_wait_requeue_pi() futex_wait_queue_me() <timed out> double_lock_hb(); raw_spin_lock(pi_lock); if (current->pi_blocked_on) { } else { current->pi_blocked_on = PI_WAKE_INPROGRESS; run_spin_unlock(pi_lock); spin_lock(hb->lock); <-- blocked! plist_for_each_entry_safe(this) { rt_mutex_start_proxy_lock(); task_blocks_on_rt_mutex(); BUG_ON(task->pi_blocked_on)!!!! The BUG_ON() actually has a check for PI_WAKE_INPROGRESS, but the problem is that, after TASK 1 sets PI_WAKE_INPROGRESS, it then tries to grab the hb->lock, which it fails to do so. As the hb->lock is a mutex, it will block and set the "pi_blocked_on" to the hb->lock. When TASK 2 goes to requeue it, the check for PI_WAKE_INPROGESS fails because the task1's pi_blocked_on is no longer set to that, but instead, set to the hb->lock. The fix: When calling rt_mutex_start_proxy_lock() a check is made to see if the proxy tasks pi_blocked_on is set. If so, exit out early. Otherwise set it to a new flag PI_REQUEUE_INPROGRESS, which notifies the proxy task that it is being requeued, and will handle things appropriately. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13rtmutex: Handle the various new futex race conditionsThomas Gleixner3-21/+94
RT opens a few new interesting race conditions in the rtmutex/futex combo due to futex hash bucket lock being a 'sleeping' spinlock and therefor not disabling preemption. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13md: raid5: Make raid5_percpu handling RT awareThomas Gleixner2-2/+6
__raid_run_ops() disables preemption with get_cpu() around the access to the raid5_percpu variables. That causes scheduling while atomic spews on RT. Serialize the access to the percpu data with a lock and keep the code preemptible. Reported-by: Udo van den Heuvel <udovdh@xs4all.nl> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Udo van den Heuvel <udovdh@xs4all.nl>
2016-02-13rtmutex: trylock is okay on -RTSebastian Andrzej Siewior1-0/+4
non-RT kernel could deadlock on rt_mutex_trylock() in softirq context. On -RT we don't run softirqs in IRQ context but in thread context so it is not a issue here. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13softirq: split timer softirqs out of ksoftirqdSebastian Andrzej Siewior1-11/+74
The softirqd runs in -RT with SCHED_FIFO (prio 1) and deals mostly with timer wakeup which can not happen in hardirq context. The prio has been risen from the normal SCHED_OTHER so the timer wakeup does not happen too late. With enough networking load it is possible that the system never goes idle and schedules ksoftirqd and everything else with a higher priority. One of the tasks left behind is one of RCU's threads and so we see stalls and eventually run out of memory. This patch moves the TIMER and HRTIMER softirqs out of the `ksoftirqd` thread into its own `ktimersoftd`. The former can now run SCHED_OTHER (same as mainline) and the latter at SCHED_FIFO due to the wakeups. From networking point of view: The NAPI callback runs after the network interrupt thread completes. If its run time takes too long the NAPI code itself schedules the `ksoftirqd`. Here in the thread it can run at SCHED_OTHER priority and it won't defer RCU anymore. Cc: stable-rt@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13genirq: Allow disabling of softirq processing in irq thread contextThomas Gleixner5-2/+38
The processing of softirqs in irq thread context is a performance gain for the non-rt workloads of a system, but it's counterproductive for interrupts which are explicitely related to the realtime workload. Allow such interrupts to prevent softirq processing in their thread context. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13kernel: migrate_disable() do fastpath in atomic & irqs-offSebastian Andrzej Siewior1-2/+2
With interrupts off it makes no sense to do the long path since we can't leave the CPU anyway. Also we might end up in a recursion with lockdep. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13kernel: softirq: unlock with irqs onSebastian Andrzej Siewior1-1/+3
We unlock the lock while the interrupts are off. This isn't a problem now but will get because the migrate_disable() + enable are not symmetrical in regard to the status of interrupts. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13softirq: Split softirq locksThomas Gleixner8-94/+477
The 3.x RT series removed the split softirq implementation in favour of pushing softirq processing into the context of the thread which raised it. Though this prevents us from handling the various softirqs at different priorities. Now instead of reintroducing the split softirq threads we split the locks which serialize the softirq processing. If a softirq is raised in context of a thread, then the softirq is noted on a per thread field, if the thread is in a bh disabled region. If the softirq is raised from hard interrupt context, then the bit is set in the flag field of ksoftirqd and ksoftirqd is invoked. When a thread leaves a bh disabled region, then it tries to execute the softirqs which have been raised in its own context. It acquires the per softirq / per cpu lock for the softirq and then checks, whether the softirq is still pending in the per cpu local_softirq_pending() field. If yes, it runs the softirq. If no, then some other task executed it already. This allows for zero config softirq elevation in the context of user space tasks or interrupt threads. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13softirq: Disable softirq stacks for RTThomas Gleixner8-1/+15
Disable extra stacks for softirqs. We want to preempt softirqs and having them on special IRQ-stack does not make this easier. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13softirq: Check preemption after reenabling interruptsThomas Gleixner4-0/+16
raise_softirq_irqoff() disables interrupts and wakes the softirq daemon, but after reenabling interrupts there is no preemption check, so the execution of the softirq thread might be delayed arbitrarily. In principle we could add that check to local_irq_enable/restore, but that's overkill as the rasie_softirq_irqoff() sections are the only ones which show this behaviour. Reported-by: Carsten Emde <cbe@osadl.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13tasklet: Prevent tasklets from going into infinite spin in RTIngo Molnar2-72/+162
When CONFIG_PREEMPT_RT_FULL is enabled, tasklets run as threads, and spinlocks turn are mutexes. But this can cause issues with tasks disabling tasklets. A tasklet runs under ksoftirqd, and if a tasklets are disabled with tasklet_disable(), the tasklet count is increased. When a tasklet runs, it checks this counter and if it is set, it adds itself back on the softirq queue and returns. The problem arises in RT because ksoftirq will see that a softirq is ready to run (the tasklet softirq just re-armed itself), and will not sleep, but instead run the softirqs again. The tasklet softirq will still see that the count is non-zero and will not execute the tasklet and requeue itself on the softirq again, which will cause ksoftirqd to run it again and again and again. It gets worse because ksoftirqd runs as a real-time thread. If it preempted the task that disabled tasklets, and that task has migration disabled, or can't run for other reasons, the tasklet softirq will never run because the count will never be zero, and ksoftirqd will go into an infinite loop. As an RT task, it this becomes a big problem. This is a hack solution to have tasklet_disable stop tasklets, and when a tasklet runs, instead of requeueing the tasklet softirqd it delays it. When tasklet_enable() is called, and tasklets are waiting, then the tasklet_enable() will kick the tasklets to continue. This prevents the lock up from ksoftirq going into an infinite loop. [ rostedt@goodmis.org: ported to 3.0-rt ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13locking: Disable spin on owner for RTThomas Gleixner1-2/+2
Drop spin on owner for mutex / rwsem. We are most likely not using it but… Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13lockdep: Make it RT awareThomas Gleixner2-3/+9
teach lockdep that we don't really do softirqs on -RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13hotplug: Use migrate disable on unplugThomas Gleixner1-3/+3
Migration needs to be disabled accross the unplug handling to make sure that the unplug thread is off the unplugged cpu. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13trace: Add migrate-disabled counter to tracing outputThomas Gleixner4-3/+15
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13hotplug: Reread hotplug_pcp on pin_current_cpu() retryYong Zhang1-1/+3
When retry happens, it's likely that the task has been migrated to another cpu (except unplug failed), but it still derefernces the original hotplug_pcp per cpu data. Update the pointer to hotplug_pcp in the retry path, so it points to the current cpu. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20110728031600.GA338@windriver.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13hotplug: sync_unplug: No "\n" in task nameYong Zhang1-1/+1
Otherwise the output will look a little odd. Signed-off-by: Yong Zhang <yong.zhang0@gmail.com> Link: http://lkml.kernel.org/r/1318762607-2261-2-git-send-email-yong.zhang0@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13hotplug: Lightweight get online cpusThomas Gleixner2-4/+122
get_online_cpus() is a heavy weight function which involves a global mutex. migrate_disable() wants a simpler construct which prevents only a CPU from going doing while a task is in a migrate disabled section. Implement a per cpu lockless mechanism, which serializes only in the real unplug case on a global mutex. That serialization affects only tasks on the cpu which should be brought down. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13stop_machine: Use raw spinlocksThomas Gleixner1-20/+44
Use raw-locks in stomp_machine() to allow locking in irq-off regions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13stop_machine: convert stop_machine_run() to PREEMPT_RTIngo Molnar1-0/+10
Instead of playing with non-preemption, introduce explicit startup serialization. This is more robust and cleaner as well. Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> [bigeasy: XXX: stopper_lock -> stop_cpus_lock]
2016-02-13sched/workqueue: Only wake up idle workers if not blocked on sleeping spin lockSteven Rostedt1-1/+3
In -rt, most spin_locks() turn into mutexes. One of these spin_lock conversions is performed on the workqueue gcwq->lock. When the idle worker is worken, the first thing it will do is grab that same lock and it too will block, possibly jumping into the same code, but because nr_running would already be decremented it prevents an infinite loop. But this is still a waste of CPU cycles, and it doesn't follow the method of mainline, as new workers should only be woken when a worker thread is truly going to sleep, and not just blocked on a spin_lock(). Check the saved_state too before waking up new workers. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13sched: ttwu: Return success when only changing the saved_state valueThomas Gleixner1-1/+3
When a task blocks on a rt lock, it saves the current state in p->saved_state, so a lock related wake up will not destroy the original state. When a real wakeup happens, while the task is running due to a lock wakeup already, we update p->saved_state to TASK_RUNNING, but we do not return success, which might cause another wakeup in the waitqueue code and the task remains in the waitqueue list. Return success in that case as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sched: Disable CONFIG_RT_GROUP_SCHED on RTThomas Gleixner1-0/+1
Carsten reported problems when running: taskset 01 chrt -f 1 sleep 1 from within rc.local on a F15 machine. The task stays running and never gets on the run queue because some of the run queues have rt_throttled=1 which does not go away. Works nice from a ssh login shell. Disabling CONFIG_RT_GROUP_SCHED solves that as well. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sched: Disable TTWU_QUEUE on RTThomas Gleixner1-0/+5
The queued remote wakeup mechanism can introduce rather large latencies if the number of migrated tasks is high. Disable it for RT. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sched: Use the proper LOCK_OFFSET for cond_resched()Thomas Gleixner1-0/+4
RT does not increment preempt count when a 'sleeping' spinlock is locked. Update PREEMPT_LOCK_OFFSET for that case. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sched: Take RT softirq semantics into account in cond_resched()Thomas Gleixner2-0/+6
The softirq semantics work different on -RT. There is no SOFTIRQ_MASK in the preemption counter which leads to the BUG_ON() statement in __cond_resched_softirq(). As for -RT it is enough to perform a "normal" schedule. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sched: Do not account rcu_preempt_depth on RT in might_sleep()Thomas Gleixner2-1/+8
RT changes the rcu_preempt_depth semantics, so we cannot check for it in might_sleep(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sched: Add saved_state for tasks blocked on sleeping locksThomas Gleixner3-1/+33
Spinlocks are state preserving in !RT. RT changes the state when a task gets blocked on a lock. So we need to remember the state before the lock contention. If a regular wakeup (not a RTmutex related wakeup) happens, the saved_state is updated to running. When the lock sleep is done, the saved state is restored. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sched: Move mmdrop to RCU on RTThomas Gleixner4-2/+45
Takes sleeping locks and calls into the memory allocator, so nothing we want to do in task switch and oder atomic contexts. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sched: Limit the number of task migrations per batchThomas Gleixner1-0/+4
Put an upper limit on the number of tasks which are migrated per batch to avoid large latencies. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sched: Move task_struct cleanup to RCUThomas Gleixner2-1/+27
__put_task_struct() does quite some expensive work. We don't want to burden random tasks with that. Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13posix-timers: Thread posix-cpu-timers on -rtJohn Stultz4-4/+202
posix-cpu-timer code takes non -rt safe locks in hard irq context. Move it to a thread. [ 3.0 fixes from Peter Zijlstra <peterz@infradead.org> ] Signed-off-by: John Stultz <johnstul@us.ibm.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13hrtimer: Move schedule_work call to helper threadYang Shi1-0/+40
When run ltp leapsec_timer test, the following call trace is caught: BUG: sleeping function called from invalid context at kernel/rtmutex.c:659 in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper/1 Preemption disabled at:[<ffffffff810857f3>] cpu_startup_entry+0x133/0x310 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.10.10-rt3 #2 Hardware name: Intel Corporation Calpella platform/MATXM-CORE-411-B, BIOS 4.6.3 08/18/2010 ffffffff81c2f800 ffff880076843e40 ffffffff8169918d ffff880076843e58 ffffffff8106db31 ffff88007684b4a0 ffff880076843e70 ffffffff8169d9c0 ffff88007684b4a0 ffff880076843eb0 ffffffff81059da1 0000001876851200 Call Trace: <IRQ> [<ffffffff8169918d>] dump_stack+0x19/0x1b [<ffffffff8106db31>] __might_sleep+0xf1/0x170 [<ffffffff8169d9c0>] rt_spin_lock+0x20/0x50 [<ffffffff81059da1>] queue_work_on+0x61/0x100 [<ffffffff81065aa1>] clock_was_set_delayed+0x21/0x30 [<ffffffff810883be>] do_timer+0x40e/0x660 [<ffffffff8108f487>] tick_do_update_jiffies64+0xf7/0x140 [<ffffffff8108fe42>] tick_check_idle+0x92/0xc0 [<ffffffff81044327>] irq_enter+0x57/0x70 [<ffffffff816a040e>] smp_apic_timer_interrupt+0x3e/0x9b [<ffffffff8169f80a>] apic_timer_interrupt+0x6a/0x70 <EOI> [<ffffffff8155ea1c>] ? cpuidle_enter_state+0x4c/0xc0 [<ffffffff8155eb68>] cpuidle_idle_call+0xd8/0x2d0 [<ffffffff8100b59e>] arch_cpu_idle+0xe/0x30 [<ffffffff8108585e>] cpu_startup_entry+0x19e/0x310 [<ffffffff8168efa2>] start_secondary+0x1ad/0x1b0 The clock_was_set_delayed is called in hard IRQ handler (timer interrupt), which calls schedule_work. Under PREEMPT_RT_FULL, schedule_work calls spinlocks which could sleep, so it's not safe to call schedule_work in interrupt context. Reference upstream commit b68d61c705ef02384c0538b8d9374545097899ca (rt,ntp: Move call to schedule_delayed_work() to helper thread) from git://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git, which makes a similar change. add a helper thread which does the call to schedule_work and wake up that thread instead of calling schedule_work directly. Signed-off-by: Yang Shi <yang.shi@windriver.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13timer-fd: Prevent live lockThomas Gleixner1-1/+4
If hrtimer_try_to_cancel() requires a retry, then depending on the priority setting te retry loop might prevent timer callback completion on RT. Prevent that by waiting for completion on RT, no change for a non RT kernel. Reported-by: Sankara Muthukrishnan <sankara.m@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-02-13sched/deadline: dl_task_timer has to be irqsafeJuri Lelli1-0/+1
As for rt_period_timer, dl_task_timer has to be irqsafe. Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
2016-02-13hrtimer: Fixup hrtimer callback changes for preempt-rtThomas Gleixner6-9/+139
In preempt-rt we can not call the callbacks which take sleeping locks from the timer interrupt context. Bring back the softirq split for now, until we fixed the signal delivery problem for real. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2016-02-13hrtimer: enfore 64byte alignmentSebastian Andrzej Siewior1-4/+0
The patch "hrtimer: Fixup hrtimer callback changes for preempt-rt" adds a list_head expired to struct hrtimer_clock_base and with it we run into BUILD_BUG_ON(sizeof(struct hrtimer_clock_base) > HRTIMER_CLOCK_BASE_ALIGN); Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>