What Is The Best Use of HTM?

HTM is likely to be at its best for large in-memory data structures that are difficult to statically partition but that are dynamically partitionable (in other words, the conflict probability is reasonably low). There must be a reasonable non-TM fallback algorithm for every transaction. The workload should ideally be update-heavy with small accesses and updates, and not subject to aggressive real-time constraints. Finally, if HTM is used for transactional lock elision, locks should be placed in separate cache lines and any empty critical sections must continue to use explicit locking.

Why is this situation best for HTM?

Keeping the data structure in memory avoids non-idempotent operations such as I/O, which HTM does not handle well.

If the data structure is easy to statically partition, then existing lock-based solutions are likely to work well, though HTM's ability to avoid the cache misses associated with lock acquisition and release might improve performance and scalability. On the other hand, if the data structure is not dynamically partitionable, HTM will likely suffer from a high conflict rate, degrading HTM's performance and scalability. Similarly, if the data structure is not large, the conflict probabilities are likely to be high, again degrading HTM's performance and scalability.

If there is no reasonable non-HTM fallback, then nothing can be done in the case of persistent transaction failure. This raises the question of exactly what you would use in place of HTM when there is no reasonable fallback. The most straightforward possibility is a sequential program. Never forget that parallelism is but one possible performance-optimization technique of many, and that it is not always the best tool for the job!
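
The retry-then-fallback structure described above might be sketched as follows. This is only an illustration under stated assumptions: txn_attempt_fn, increment_with_fallback(), MAX_ATTEMPTS, always_abort(), and always_commit() are hypothetical names, and the transactional attempt is a stand-in callback rather than real hardware primitives. A real transactional path would also need to read the fallback lock inside the transaction so that it aborts while a fallback holder is active.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_ATTEMPTS 3  /* Assumed retry budget; tune for the workload. */

/*
 * Hypothetical hook: tries the update transactionally, returning true
 * on commit and false on abort.  A real build would wrap the hardware's
 * begin/commit primitives here.
 */
typedef bool (*txn_attempt_fn)(long *counter);

static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Increment *counter, returning the number of aborted attempts. */
static int increment_with_fallback(long *counter, txn_attempt_fn attempt)
{
    int aborted = 0;
    int rc;

    for (int i = 0; i < MAX_ATTEMPTS; i++) {
        if (attempt(counter))
            return aborted;  /* Committed transactionally. */
        aborted++;
    }

    /* Persistent failure: take the non-HTM fallback path. */
    rc = pthread_mutex_lock(&fallback_lock);
    assert(rc == 0);
    (*counter)++;
    rc = pthread_mutex_unlock(&fallback_lock);
    assert(rc == 0);
    return aborted;
}

/* Stand-in for a transaction that never commits. */
static bool always_abort(long *counter)
{
    (void)counter;
    return false;
}

/* Stand-in for a transaction that commits on the first try. */
static bool always_commit(long *counter)
{
    (*counter)++;
    return true;
}
```

With always_abort() the update still completes, but only because the lock-based path exists. Without that path, the loop would have nothing left to try.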

If the workload is not update-heavy, then things like hazard pointers and RCU might provide better performance and scalability, especially if the read-side accesses have large cache footprints. Furthermore, if some of the transactions interact with extra-transactional accesses, extreme care is required to avoid some degenerate situations (see Blundell et al.).

HTM can in some cases eliminate those cache misses that would otherwise be incurred by heavyweight synchronization primitives. The performance benefits of this cache-miss elimination are most pronounced for small transactions. Furthermore, all else being equal, smaller transactions are more likely to avoid conflicts than are larger transactions.

Real-time use of transactional memory remains largely unexplored, so real-time use of HTM should be undertaken with caution.

If a lock is located in the same cache line as some of the data that it protects, then any critical section that modifies that data will conflict with all other threads attempting to elide that same lock, because each eliding transaction must read the lock word and therefore holds the lock's cache line in its read set. The resulting retries and fallbacks might well severely degrade HTM scalability and performance compared to straight locking.
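
One way to keep a lock out of the cache line holding its data is to align each to a cache-line boundary, as in the following sketch. The 64-byte CACHE_LINE_SIZE is an assumption that holds on many current systems but should be checked for the target, and the structure and helper names are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64  /* Assumption: verify for the target CPU. */

/* Risky layout: the lock and its data may share a cache line. */
struct packed_counter {
    pthread_mutex_t lock;
    long count;
};

/* Safer layout: the lock and its data each start their own line. */
struct padded_counter {
    alignas(CACHE_LINE_SIZE) pthread_mutex_t lock;
    alignas(CACHE_LINE_SIZE) long count;
};

/* Compile-time check that the data starts on a fresh cache line. */
static_assert(offsetof(struct padded_counter, count) % CACHE_LINE_SIZE == 0,
              "count must start on a cache-line boundary");

/* Do two byte offsets within one aligned object share a cache line? */
static int same_cache_line(size_t a, size_t b)
{
    return a / CACHE_LINE_SIZE == b / CACHE_LINE_SIZE;
}
```

The padding costs memory, so this layout is most attractive for a modest number of heavily contended locks rather than for a lock embedded in each of millions of small objects.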

Finally, an empty lock-based critical section waits for all prior critical sections for that lock to complete, while the equivalent HTM-based critical section is for all intents and purposes a no-op. Therefore, empty lock-based critical sections cannot be safely elided. In addition, it is possible for a program with non-empty lock-based critical sections to rely on all prior critical sections having completed, and such critical sections also cannot be safely elided.
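
The following sketch shows a program that relies on an empty critical section in this way, using explicit POSIX locking. The names gate, holder(), and wait_via_empty_critical_section() are hypothetical, and the value 42 is arbitrary.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

static pthread_mutex_t gate = PTHREAD_MUTEX_INITIALIZER;
static atomic_int holder_inside;  /* Set once holder() owns the lock. */
static int result;                /* Written only while holding gate. */

/* A "prior" critical section that publishes result under the lock. */
static void *holder(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&gate);
    atomic_store(&holder_inside, 1);
    result = 42;
    pthread_mutex_unlock(&gate);
    return NULL;
}

/* Use an empty critical section to wait for holder() to finish. */
static int wait_via_empty_critical_section(void)
{
    pthread_t tid;
    int rc;
    int seen;

    atomic_store(&holder_inside, 0);
    result = 0;
    rc = pthread_create(&tid, NULL, holder, NULL);
    assert(rc == 0);

    /* Make sure that the holder's critical section really is prior. */
    while (!atomic_load(&holder_inside))
        sched_yield();

    /* Empty critical section: blocks until holder() releases gate. */
    pthread_mutex_lock(&gate);
    pthread_mutex_unlock(&gate);

    seen = result;  /* Ordered after the holder's unlock. */
    pthread_join(tid, NULL);
    return seen;
}
```

If both critical sections instead ran as elided transactions, the empty one would access nothing that the holder writes, so it could commit immediately and the subsequent read of result could see a stale value.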

Of course, it is quite possible that HTM will offer benefits in other circumstances, but the situation called out above appears to hold the most promise for HTM.