RCU Read-Side Ordering Constraints

Again, the question is: given that t2()'s and t3()'s lookups can be reordered, how can the following code exclude the outcome l1==1&&l2==0&&l3==1&&l4==0?

 1 void t1(void)
 2 {
 3   spin_lock(chain(A));
 4   insert(A);
 5   spin_unlock(chain(A));
 6   synchronize_rcu();
 7   spin_lock(chain(B));
 8   insert(B);
 9   spin_unlock(chain(B));
10 }
11 
12 void t2(void)
13 {
14   rcu_read_lock();
15   l1 = lookup(A);
16   rcu_read_unlock();
17   rcu_read_lock();
18   l2 = lookup(B);
19   rcu_read_unlock();
20 }
21 
22 void t3(void)
23 {
24   rcu_read_lock();
25   l3 = lookup(B);
26   rcu_read_unlock();
27   rcu_read_lock();
28   l4 = lookup(A);
29   rcu_read_unlock();
30 }

First, the fundamental property of RCU, namely that synchronize_rcu() must wait for all pre-existing RCU read-side critical sections, prevents this outcome. To see this, let's focus on t3(). If line 25 returns 1, then we know that line 8 has completed. For line 28 to return 0, it must have executed before line 4, which means that the RCU read-side critical section on lines 27-29 began before line 6, so that line 6 must wait for line 29 to complete. But then lines 24-26 also qualify as a pre-existing RCU read-side critical section, which requires line 6 to wait for them as well, which contradicts the requirement that line 8 have already completed.

Of course, a semi-formal proof is nice to have, but any time that a semi-formal proof picks a fight with real code, the proof always loses. So let's take a look at what the implementations really do.

Implementation Perspective

Different implementations approach this problem in different ways; however, all implementations that I am aware of do make this guarantee. The following sections discuss the three major in-kernel implementations.

SRCU

Since Lai Jiangshan's SRCU rewrite in 3.5, srcu_read_lock() and srcu_read_unlock() have contained memory barriers, the goal being to allow srcu_read_lock() and srcu_read_unlock() to be used by idle and even by offline CPUs. These memory barriers clearly prevent t2()'s and t3()'s lookups from being reordered, which just as clearly prevents t2() and t3() from disagreeing about the insertion order.

For those interested in historical matters, prior to 3.5, SRCU's update-side code instead forced memory barriers to execute on the read-side CPUs as needed, and those memory barriers ordered any SRCU read-side critical sections running concurrently with a given synchronize_srcu().

Classic RCU (CONFIG_PREEMPT=n)

In Classic RCU, both rcu_read_lock() and rcu_read_unlock() are no-ops, which means that, unlike SRCU, t2()'s and t3()'s lookups can be reordered by both the compiler and the CPU. However, there is no call to schedule() between the pairs of RCU read-side critical sections, so RCU treats t2()'s function body as one big RCU read-side critical section, and ditto for t3().

This means that if any part of t3() executes before the synchronize_rcu() on line 6, then all of t3() must execute before line 6. The only way for line 28 to return 0 is for it to precede line 6, which in turn means that there is no way for line 25 to return 1. Therefore, Classic RCU does provide the required read-side ordering constraint.

Preemptible RCU (CONFIG_PREEMPT=y)

Preemptible RCU's rcu_read_lock() and rcu_read_unlock() do not contain memory-barrier instructions, but they do contain the barrier() directive, which prevents the compiler from reordering code either into or out of any given RCU read-side critical section. Of course, on weakly ordered systems (such as ARM or PowerPC), the CPU can still reorder t2()'s and t3()'s lookups.

However, for performance reasons, preemptible RCU never reports the beginning of an RCU read-side critical section to RCU's core processing. Furthermore, on the rare occasions where RCU does report the end of an RCU read-side critical section to RCU's core processing, it acquires locks, which prevent the CPU from reordering subsequent code into the previous RCU read-side critical section. Reports as to whether or not a given CPU is in an RCU read-side critical section can also be carried out via interrupts and preemption, but locks are acquired in both of those cases as well, again preventing the CPU from reordering subsequent code into the previous RCU read-side critical section.

Now, if the RCU core does not receive a report from a given CPU, it must take a conservative approach, assuming that the CPU has been in an RCU read-side critical section the entire time, which reduces to the Classic RCU case. On the other hand, if t3() reports its state to the RCU core, either at line 26 or via an interrupt between lines 26 and 27, then locking ensures sufficient ordering to reduce to the SRCU case.

Conceptual Perspective

Although it is nice that the implementations do what they are supposed to, we still need a conceptual view of the low-level ordering for RCU read-side critical sections. Here you go:

If either the compiler or the CPU reorders the code making up an RCU read-side critical section, RCU must treat the resulting critical section as extending from the earliest to the latest of those statements in execution order, in addition to the critical section's location in the source code. Furthermore, if code reordering causes a pair of RCU read-side critical sections to overlap or be interchanged, RCU must treat them as one large merged RCU read-side critical section; in other words, as if there were an additional rcu_read_lock() preceding the first and an additional rcu_read_unlock() following the last.
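For example, if the CPU interchanged t3()'s two lookups, the merging rule would require RCU to act as if t3() had instead been written as follows (pseudocode; the commented-out lock and unlock are the implied additional ones):

```c
void t3(void)   /* as RCU must treat it after the lookups are interchanged */
{
	/* implied additional rcu_read_lock() */
	rcu_read_lock();
	l4 = lookup(A);         /* reordered ahead of the lookup of B */
	rcu_read_unlock();
	rcu_read_lock();
	l3 = lookup(B);
	rcu_read_unlock();
	/* implied additional rcu_read_unlock() */
}
```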
This last ordering criterion might cause some concern to coders sneaky enough to imagine a schedule() between a pair of RCU read-side critical sections. However, this situation is easily resolved, and is left as an exercise for the reader.