RCU Priority Inversion

To recap...

Preemptible RCU can suffer from a variant of priority inversion in which a low-priority task is preempted within an RCU read-side critical section by CPU-bound medium-priority tasks (at least one per CPU). Because an RCU grace period can end only after all pre-existing RCU read-side critical sections have ended, no RCU grace period can complete for as long as those medium-priority tasks keep running. All that is needed to complete the priority-inversion picture is a high-priority task attempting to allocate memory that is not yet available because it cannot be freed until a grace period ends. This high-priority task is effectively blocked by the medium-priority tasks, which is an example of priority inversion.
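The scenario above can be sketched as a toy single-CPU model. This is not kernel code: the struct, the function name, and the tick-based scheduling are all assumptions made for illustration. The high-priority task is modeled only implicitly, as blocked until the reader leaves its critical section, and the boost_reader flag stands in for RCU priority boosting by letting the preempted reader run ahead of the medium-priority work.

```c
#include <stdbool.h>

/* Toy single-CPU model; every name here is illustrative, not a kernel API. */
struct inversion_model {
    int reader_ticks_left;  /* CPU ticks the low-priority reader still needs
                               inside its RCU read-side critical section */
    int medium_ticks_left;  /* CPU ticks of medium-priority work; a large
                               value models a CPU-bound task */
};

/*
 * Returns the number of ticks the blocked high-priority task waits before
 * the grace period can end, or -1 if that does not happen within max_ticks.
 * The model is passed by value, so the caller's copy is left unchanged.
 */
int ticks_until_grace_period_ends(struct inversion_model m,
                                  bool boost_reader, int max_ticks)
{
    for (int tick = 0; tick < max_ticks; tick++) {
        if (m.reader_ticks_left == 0)
            return tick;  /* reader has left its critical section */

        if (m.medium_ticks_left > 0 && !boost_reader)
            m.medium_ticks_left--;  /* medium-priority work preempts the reader */
        else
            m.reader_ticks_left--;  /* reader runs: CPU is free, or reader is boosted */
    }
    return m.reader_ticks_left == 0 ? max_ticks : -1;
}
```

With a reader that needs 2 more ticks and 5 ticks of medium-priority work, the unboosted wait in this model is 7 ticks and the boosted wait is 2. If the medium-priority work outlasts max_ticks, the unboosted case returns -1, which is the toy equivalent of a stalled grace period.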

This condition does not occur frequently in production real-time systems, partly because many real-time developers take care to avoid saturating the CPUs, a practice that can also reduce thread-queuing scheduling latencies. On such systems, the medium-priority tasks would block sooner rather than later, which would permit the low-priority task to resume and leave its RCU read-side critical section, thus permitting RCU grace periods to come to an end. In addition, enterprise systems, whether real-time or not, tend to have large amounts of memory configured, which allows these systems to ride out many minutes of RCU priority inversion. RCU priority boosting has therefore not had a very high priority on my to-do list, at least not recently.

As has been noted elsewhere, the work one is doing can influence both one's perspectives and one's priorities, and my work with Linaro has turned my attention not only to extreme energy efficiency, but also to small-memory platforms. We can all be thankful that I will spare you what my 30-years-ago self might have said in response to hundreds of megabytes of memory being described as “small” and instead simply state that I have now started work on RCU priority boosting. My first step was to upgrade the rcutorture test suite to check for RCU priority inversion.

This test worked, correctly identifying preemptible RCU as being vulnerable to RCU priority inversion. However, it also identified non-preemptible RCU (TREE_RCU) as being vulnerable to RCU priority inversion, even though TREE_RCU readers cannot be preempted.

Why did this happen?

TREE_RCU uses the Linux kernel's softirq environment for much of the grace-period processing. Although softirq cannot be preempted, ksoftirqd can be. So, if softirq processing is handed off to ksoftirqd, which is a set of non-real-time kernel threads, and if ksoftirqd is preempted in the middle of RCU grace-period processing, grace periods will be stalled.

ksoftirqd processing is therefore another source of potential RCU priority inversion that must be addressed. My current patchset moves the softirq processing to kthreads and allows their real-time priority to be specified as a kernel build parameter. Of course, you can use the chrt command to change these kthreads' priorities at runtime, but only if you have the foresight to have a shell running at high enough priority to gain access to the CPU! ;-)
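For the runtime route, a program can make the same kind of request through the POSIX sched_setscheduler() call. The following is a minimal sketch, assuming that the target thread's PID is already known and that the caller has the privileges needed to set real-time priorities. The helper names clamp_fifo_priority() and set_fifo_priority() are hypothetical, and the choice of SCHED_FIFO is an assumption rather than something the patchset dictates.

```c
#include <sched.h>
#include <sys/types.h>

/* Hypothetical helper: clamp a requested priority into the range the
 * system reports as valid for SCHED_FIFO. */
int clamp_fifo_priority(int requested)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    if (requested < lo)
        return lo;
    if (requested > hi)
        return hi;
    return requested;
}

/* Hypothetical helper: ask the kernel to run the task identified by pid
 * under SCHED_FIFO at the (clamped) requested priority.
 * Returns 0 on success, or -1 with errno set on failure, for example
 * EPERM when the caller lacks the needed privileges. */
int set_fifo_priority(pid_t pid, int requested)
{
    struct sched_param param = {
        .sched_priority = clamp_fifo_priority(requested),
    };

    return sched_setscheduler(pid, SCHED_FIFO, &param);
}
```

Calling set_fifo_priority() with a kthread's PID and a priority of 1 has roughly the effect of "chrt -f -p 1" followed by that PID, and it is subject to the same caveat: the caller must itself be able to get onto a CPU in order to make the call.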