Why Have Empty Lock-Based Critical Sections?

The semantics of transactional lock elision do not exactly match those of locking, one difference being their handling of empty critical sections. A lock acquisition immediately followed by a release of that same lock will wait for any preceding holder of that lock to release it. In contrast, transactional lock elision converts this rare but useful idiom into a no-op, albeit a no-op that might abort and roll back. Therefore, mechanical application of transactional lock elision will break any program that depends on the empty-critical-section idiom.

So how could empty critical sections ever be useful?

One use may be found in the 2.4 Linux kernel, where brlock was used in this manner. This brlock-based scheme used a per-CPU set of reader-writer locks: Readers read-acquired the lock corresponding to the CPU they were running on, while writers write-acquired and immediately released each CPU's lock. This usage allowed the networking code to get an effect similar to that of RCU: The read acquisitions rarely missed the cache, so that the per-packet routing and netfilter lookups attained excellent performance and scalability, while the rare routing and netfilter updates incurred significant overhead. Not only was this usage pattern similar to RCU, it was in fact replaced by RCU in the 2.5 Linux kernel.

Perhaps the similar empty-lock-critical-section uses rumored to be present in some user-level applications should likewise be replaced by user-space RCU, but until that happens, some caution will be required when planning wholesale application of transactional lock elision.

The empty-lock-critical-section idiom can also be used to reduce lock contention in some situations. For example, consider a multithreaded user-space application where each thread processes units of work maintained in a per-thread list, and where threads are prohibited from touching each other's lists. There could also be updates that require that all previously scheduled units of work have completed before the update can progress. One way to handle this is to schedule a unit of work on each thread, so that when all of these units of work complete, the update may proceed.

In some applications, threads can come and go. For example, each thread might correspond to one user of the application, and thus be removed when that user logs out or otherwise disconnects. In many applications, threads cannot depart atomically: They must instead explicitly unravel themselves from various portions of the application using a specific sequence of actions. One specific action will be refusing to accept further requests from other threads, and another specific action will be disposing of any remaining units of work on its list, for example, by placing these units of work in a global work-item-disposal list to be taken by one of the remaining threads. (Why not just drain the thread's work-item list by executing each item? Because a given work item might generate more work items, so that the list could not be drained in a timely fashion.)

If the application is to perform and scale well, a good locking design is required. One common solution is to have a global lock (call it G) protecting the entire process of departing (and perhaps other things as well), with finer-grained locks protecting the individual unraveling operations.

Now, a departing thread must clearly refuse to accept further requests before disposing of the work on its list, because otherwise additional work might arrive after the disposal action, which would render that disposal action ineffective. So simplified pseudocode for a departing thread might be as follows:

  1. Acquire lock G.
  2. Acquire the lock guarding communications.
  3. Refuse further communications from other threads.
  4. Release the lock guarding communications.
  5. Acquire the lock guarding the global work-item-disposal list.
  6. Move all pending work items to the global work-item-disposal list.
  7. Release the lock guarding the global work-item-disposal list.
  8. Release lock G.
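The eight steps above might be sketched in C using POSIX mutexes as follows. All of the names (lock_G, comm_lock, disposal_lock, thread_depart(), and so on) are illustrative assumptions, and error checking is omitted:

```c
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t lock_G        = PTHREAD_MUTEX_INITIALIZER; /* global lock G */
static pthread_mutex_t comm_lock     = PTHREAD_MUTEX_INITIALIZER; /* guards communications */
static pthread_mutex_t disposal_lock = PTHREAD_MUTEX_INITIALIZER; /* guards disposal list */

struct work_item {
	struct work_item *next;
};

static struct work_item *disposal_list;          /* global work-item-disposal list */
static __thread struct work_item *my_work_list;  /* this thread's work-item list */
static __thread bool accepting_requests = true;

void thread_depart(void)
{
	pthread_mutex_lock(&lock_G);            /* 1: acquire G */
	pthread_mutex_lock(&comm_lock);         /* 2 */
	accepting_requests = false;             /* 3: refuse further requests */
	pthread_mutex_unlock(&comm_lock);       /* 4 */
	pthread_mutex_lock(&disposal_lock);     /* 5 */
	while (my_work_list) {                  /* 6: move items to the disposal list */
		struct work_item *w = my_work_list;

		my_work_list = w->next;
		w->next = disposal_list;
		disposal_list = w;
	}
	pthread_mutex_unlock(&disposal_lock);   /* 7 */
	pthread_mutex_unlock(&lock_G);          /* 8: release G */
}
```

Holding G across the whole sequence is what makes departure appear atomic to any thread that acquires G, which the waiting procedure described later relies on.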

Of course, a thread that needs to wait for all pre-existing work items will need to take departing threads into account. To see this, suppose that this thread starts waiting for all pre-existing work items just after a departing thread has refused further communications from other threads. How can this thread wait for the departing thread's work items to complete, keeping in mind that threads are not allowed to access each other's lists of work items?

One straightforward approach is for this thread to acquire G and then the lock guarding the global work-item-disposal list, then move the work items to its own list. The thread then releases both locks, places a work item on the end of its own list, and then waits for all of the work items that it placed on each thread's list (including its own) to complete.

This approach does work well in many cases, but if special processing is required for each work item as it is pulled in from the global work-item-disposal list, the result could be excessive contention on G. One way to avoid that contention is to acquire G and then immediately release it. Then the process of waiting for all prior work items looks something like the following:

  1. Set a global counter to one and initialize a condition variable to zero.
  2. Send a message to all threads to cause them to atomically increment the global counter, and then to enqueue a work item. The work item will atomically decrement the global counter, and if the result is zero, it will set a condition variable to one.
  3. Acquire G, which will wait for any currently departing thread to finish departing. Because only one thread may depart at a time, all the remaining threads will have already received the message sent in the preceding step.
  4. Release G.
  5. Acquire the lock guarding the global work-item-disposal list.
  6. Move all work items from the global work-item-disposal list to this thread's list, processing them as needed along the way.
  7. Release the lock guarding the global work-item-disposal list.
  8. Enqueue an additional work item onto this thread's list. (As before, this work item will atomically decrement the global counter, and if the result is zero, it will set a condition variable to one.)
  9. Wait for the condition variable to take on the value one.
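The counter-and-flag protocol of steps 1, 2, 8, and 9 might be sketched as follows. The names are assumptions of this sketch, as is modeling the text's "condition variable" as a flag (all_done) protected by a mutex and a POSIX condition variable:

```c
#include <pthread.h>

static pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cv    = PTHREAD_COND_INITIALIZER;
static int outstanding;  /* the global counter */
static int all_done;     /* set to one when the counter reaches zero */

/* Step 1: the initial count of one is consumed by the waiter's own
 * additional work item (step 8), preventing premature completion. */
void wait_init(void)
{
	pthread_mutex_lock(&count_lock);
	outstanding = 1;
	all_done = 0;
	pthread_mutex_unlock(&count_lock);
}

/* Step 2, first half: each messaged thread increments the counter
 * before enqueueing its work item. */
void work_scheduled(void)
{
	pthread_mutex_lock(&count_lock);
	outstanding++;
	pthread_mutex_unlock(&count_lock);
}

/* Step 2, second half (and step 8): each work item decrements the
 * counter when it runs, and the last one to do so sets the flag. */
void work_completed(void)
{
	pthread_mutex_lock(&count_lock);
	if (--outstanding == 0) {
		all_done = 1;
		pthread_cond_broadcast(&done_cv);
	}
	pthread_mutex_unlock(&count_lock);
}

/* Step 9: wait for the flag to take on the value one. */
void wait_for_all(void)
{
	pthread_mutex_lock(&count_lock);
	while (!all_done)
		pthread_cond_wait(&done_cv, &count_lock);
	pthread_mutex_unlock(&count_lock);
}
```

The empty critical section of steps 3 and 4 has no counterpart in this sketch precisely because it does no data manipulation: acquiring and releasing G serves only to wait out any in-progress departure.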

Once this procedure completes, all pre-existing work items are guaranteed to have completed. In this case, the locks are being used for messaging as well as for protection of data. Because transactional lock elision dispenses with the messaging aspects of locking, it cannot be used for programs that rely on locking's messaging semantics. It should be noted that non-empty critical sections might well rely on locking's messaging semantics, which means that special-casing empty lock-based critical sections is insufficient: Deeper semantic analysis is required to determine whether or not a given critical section can safely use transactional lock elision.

In summary, although there are many situations where transactional lock elision can be a drop-in replacement for locking, there are also situations where the differing semantics of empty critical sections prohibit drop-in replacement.