author    Paul E. McKenney <paulmck@kernel.org>    2023-09-24 07:43:44 -0700
committer Paul E. McKenney <paulmck@kernel.org>    2023-09-24 07:43:44 -0700
commit    4d868154f6cfc15e25c6e74202c436b3762f4877 (patch)
tree      a6dc562c83a719fedaf2c138deb35ed107c41bcd
parent    d85fcb1264920f2784da69e601e6d4f85019b446 (diff)
download  perfbook-4d868154f6cfc15e25c6e74202c436b3762f4877.tar.gz
datastruct: Self-review, part 1
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
-rw-r--r--  datastruct/datastruct.tex  75
1 files changed, 42 insertions, 33 deletions
diff --git a/datastruct/datastruct.tex b/datastruct/datastruct.tex
index b56492d1..dc08e0c9 100644
--- a/datastruct/datastruct.tex
+++ b/datastruct/datastruct.tex
@@ -35,11 +35,14 @@ This chapter will expose a number of complications:
\item Even \emph{read-only synchronization-free} data-structure traversal
can fail to scale on some types of systems.
\label{sec:datastruct:Cache-size-induced failure to scale redux}
-\item Data-structure traverals avoiding the aforementioned complications
+\item Data-structure traversals avoiding the aforementioned complications
can still be impeded by concurrent updates.
\label{sec:datastruct:Update-activity failure to scale}
\end{enumerate}
+This chapter will investigate these complications and demonstrate some
+ways of unraveling them.
+
\Cref{sec:datastruct:Motivating Application}
presents the motivating application for this chapter's data structures.
\Cref{chp:Partitioning and Synchronization Design} showed how
@@ -92,8 +95,8 @@ so that he suspects that his mice might be checking up on their nemesis.
Whatever their source, Schr\"odinger's application must handle high
query rates to a single data element.
-As we will see, this simple application can be a challenge to concurrent
-data structures.
+As we will see, this simple application can pose a challenge to
+traditional concurrent data structures.
\section{Partitionable Data Structures}
\label{sec:datastruct:Partitionable Data Structures}
@@ -163,6 +166,9 @@ shows a set of data structures used in a simple fixed-sized hash
table using chaining and per-hash-bucket locking, and
\cref{fig:datastruct:Hash-Table Data-Structure Diagram}
diagrams how they fit together.
+Note that the \co{cds_} functions and data structures may be found
+in the userspace RCU
+library~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCUqueuestack,PaulMcKenney2013LWNURCUqueuestackAPI,PaulMcKenney2013LWNURCUlist}.
The \co{hashtab} structure (\clnrefrange{tab:b}{tab:e} in
\cref{lst:datastruct:Hash-Table Data Structures})
contains four \co{ht_bucket} structures
@@ -182,6 +188,8 @@ the corresponding element's hash value in the \co{->hte_hash} field.
The \co{ht_elem} structure is included in a larger structure
which might contain a complex key.
\end{fcvref}
+\Cref{fig:datastruct:Hash-Table Data-Structure Diagram}
+shows bucket~0 containing two elements and bucket~2 containing one.
\begin{listing}
\input{CodeSamples/datastruct/hash/hash_bkt@struct.fcv}
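For readers who do not have the code samples to hand, a minimal sketch of how
such declarations might look is shown below.
This is not the listing referenced above:
only the \co{->hte_hash} field name is taken from the text, while the
\co{hte_next}, \co{htb_*}, and \co{ht_*} names, the \co{ht_cmp} comparison
callback, the \co{pthread_mutex_t} stand-in for the per-bucket lock, and the
four-entry bucket array (matching the diagram) are illustrative assumptions.

\begin{verbatim}
#include <pthread.h>
#include <urcu/list.h>  /* struct cds_list_head from the userspace RCU library. */

struct ht_elem {                    /* Embedded in a larger, key-carrying structure. */
    struct cds_list_head hte_next;  /* Links the element into its bucket's chain. */
    unsigned long hte_hash;         /* The element's hash value (->hte_hash). */
};

struct ht_bucket {                  /* One hash chain plus its per-bucket lock. */
    struct cds_list_head htb_head;  /* List header for this bucket's chain. */
    pthread_mutex_t htb_lock;       /* Stand-in for the per-bucket lock. */
};

struct hashtab {                    /* The table itself. */
    unsigned long ht_nbuckets;      /* Number of buckets (4 here). */
    int (*ht_cmp)(struct ht_elem *htep, void *key);  /* User-supplied key comparison. */
    struct ht_bucket ht_bkt[4];     /* Four buckets, as in the diagram. */
};
\end{verbatim}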
@@ -196,9 +204,6 @@ which might contain a complex key.
\label{fig:datastruct:Hash-Table Data-Structure Diagram}
\end{figure}
-\Cref{fig:datastruct:Hash-Table Data-Structure Diagram}
-shows bucket~0 containing two elements and bucket~2 containing one.
-
\begin{fcvref}[ln:datastruct:hash_bkt:map_lock:map]
\Cref{lst:datastruct:Hash-Table Mapping and Locking}
shows mapping and locking functions.
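A rough sketch of what such mapping and locking functions might look like,
building on the illustrative declarations above (the \co{ht_get_bucket()},
\co{hashtab_lock()}, and \co{hashtab_unlock()} names and the modulus-based
mapping are assumptions, not the referenced listing):

\begin{verbatim}
/* Map a hash value to the corresponding bucket. */
static struct ht_bucket *ht_get_bucket(struct hashtab *htp, unsigned long hash)
{
    return &htp->ht_bkt[hash % htp->ht_nbuckets];
}

/* Acquire the lock of the bucket corresponding to the specified hash value. */
static void hashtab_lock(struct hashtab *htp, unsigned long hash)
{
    pthread_mutex_lock(&ht_get_bucket(htp, hash)->htb_lock);
}

/* Release that same bucket's lock. */
static void hashtab_unlock(struct hashtab *htp, unsigned long hash)
{
    pthread_mutex_unlock(&ht_get_bucket(htp, hash)->htb_lock);
}
\end{verbatim}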
@@ -221,8 +226,8 @@ corresponding to the specified hash value.
\begin{fcvref}[ln:datastruct:hash_bkt:lookup]
\Cref{lst:datastruct:Hash-Table Lookup}
shows \co{hashtab_lookup()},
-which returns a pointer to the element with the specified hash and key if it
-exists, or \co{NULL} otherwise.
+which returns a pointer to the element with the specified hash and key,
+or \co{NULL} if that element does not exist.
This function takes both a hash value and a pointer to the key because
this allows users of this function to use arbitrary keys and
arbitrary hash functions.
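Under the same illustrative declarations, such a lookup might be sketched as
follows; the \co{ht_cmp()} callback (assumed to return non-zero on a match)
stands in for the ``arbitrary keys'' mentioned above, and this sketch assumes
the caller holds the corresponding bucket's lock:

\begin{verbatim}
/* Return the element with the given hash and key, or NULL if there is none. */
struct ht_elem *hashtab_lookup(struct hashtab *htp, unsigned long hash, void *key)
{
    struct ht_bucket *htbp = ht_get_bucket(htp, hash);
    struct ht_elem *htep;

    cds_list_for_each_entry(htep, &htbp->htb_head, hte_next)
        if (htep->hte_hash == hash && htp->ht_cmp(htep, key))
            return htep;
    return NULL;
}
\end{verbatim}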
@@ -327,11 +332,10 @@ The performance results for a single 28-core socket of a 2.1\,GHz
Intel Xeon system using a bucket-locked hash table
with 262,144 buckets are shown in
\cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo}.
-The performance does scale nearly linearly, but it falls a far short
-of the ideal performance level, even at only 28~CPUs.
-Part of this shortfall is due to the fact that the lock acquisitions and
-releases incur no cache misses on a single CPU, but do incur misses
-on two or more CPUs.
+The performance does scale nearly linearly, but falls far short of ideal,
+even at only 28~CPUs.
+Part of this shortfall is due to the fact that lock acquisitions and
+releases incur communications cache misses only when running on two or
+more CPUs.
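In terms of the earlier sketches, each query acquires and releases the
bucket's lock, so the lock word is written on every access; once a second CPU
touches the same bucket, those writes become communications cache misses
(\co{locked_query()} is a hypothetical usage wrapper, not part of the
benchmark):

\begin{verbatim}
/* Query one animal under its bucket lock; returns 1 if present, 0 otherwise. */
int locked_query(struct hashtab *htp, unsigned long hash, void *key)
{
    int present;

    hashtab_lock(htp, hash);    /* Writes the lock word: a miss if another CPU wrote it last. */
    present = hashtab_lookup(htp, hash, key) != NULL;
    hashtab_unlock(htp, hash);  /* Another write to that same cache line. */
    return present;
}
\end{verbatim}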
\begin{figure}
\centering
@@ -343,7 +347,7 @@ on two or more CPUs.
And things only get worse with more CPUs, as can be seen in
\cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; 448 CPUs}.
We do not need to show ideal performance:
-The performance for 29~CPUs and beyond is all too clearly worse than abysmal.
+The performance for 29~CPUs and beyond is abysmal.
This clearly underscores the dangers of extrapolating performance from a
modest number of CPUs.
@@ -377,9 +381,11 @@ However, as can be seen in
changing the number of buckets has almost no effect:
Scalability is still abysmal.
In particular, we still see a sharp dropoff at 29~CPUs and beyond,
-clearly demonstrating the complication put forward on
+clearly demonstrating the complication put forward by
+\cref{sec:datastruct:NUMA-induced failure to scale}
+of the list of complications on
\cpageref{sec:datastruct:NUMA-induced failure to scale}.
-And just as clearly, something else is going on.
+Clearly, something else is going on.
The problem is that this is a multi-socket system, with CPUs~0--27
and~224--251 mapped to the first socket as shown in
@@ -422,8 +428,8 @@ please read on!
on each of those systems?
}\QuickQuizEnd
-One key property of the Schr\"odinger's-zoo runs discussed thus far is that
-they are all read-only.
+One key property of these Schr\"odinger's-zoo experiments is that they
+are all read-only.
This makes the performance degradation due to lock-acquisition-induced
cache misses all the more painful.
Even though we are not updating the underlying hash table itself, we are
@@ -442,7 +448,7 @@ read-mostly cases where updates are rare, but could happen at any time.
\setlength\dashlinedash{1pt}
\setlength\dashlinegap{2pt}
-\begin{figure}
+\begin{table}
\renewcommand*{\arraystretch}{1.2}
\centering
\begin{tabular}{r||c:c}
@@ -469,7 +475,7 @@ read-mostly cases where updates are rare, but could happen at any time.
\end{tabular}
\caption{NUMA Topology of System Under Test}
\label{fig:datastruct:NUMA Topology of System Under Test}
-\end{figure}
+\end{table}
\section{Read-Mostly Data Structures}
\label{sec:datastruct:Read-Mostly Data Structures}
@@ -735,7 +741,9 @@ about a factor of five short of ideal.
adds completely unsynchronized results, which works because this
is a read-only benchmark with nothing to synchronize.
Even with no synchronization whatsoever, performance still falls far
-short of ideal, thus demonstrating two more complications on
+short of ideal, thus demonstrating
+\cref{sec:datastruct:Cache-size-induced failure to scale,sec:datastruct:Cache-size-induced failure to scale redux}
+of the list of complications on
\cpageref{sec:datastruct:Cache-size-induced failure to scale redux}.
The problem is that this system has sockets with 28 cores, which have
@@ -754,7 +762,7 @@ only 11~ways.
This means that L2 cache collisions will be the rule and also that L3
cache collisions will not be uncommon, so that the resulting cache misses
will degrade performance.
-In this case, the bottleneck is not in the CPU, but rather in the hardware
+In this case, the bottleneck is not in the CPU, but rather in the
memory system.
% This data was generated with 262,144 buckets and elements.
% RCU hash ht_elem structure is 40 bytes plus 32 more from zoo_he for 72.
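A back-of-the-envelope estimate based on the figures in the comment above
(262,144 elements at roughly 72 bytes each) suggests why the memory system is
the limiting factor:
\[
262{,}144 \times 72\,\mathrm{B} \approx 19\,\mathrm{MB},
\]
which is far larger than any single core's private caches on hardware of this
class, so lookups routinely miss in cache and must wait on the memory system.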
@@ -962,10 +970,11 @@ doing lookups and at the right-hand side of the figure all 448~CPUs are
doing updates.
Hazard pointers and RCU start off with a significant advantage because,
unlike bucket locking, readers do not exclude updaters.
-However, as the number of updating CPUs increases, update-side overhead
-starts to make its presence known, first for RCU and then for hazard
-pointers.
-Of course, all three of these implementations beat global locking.
+However, as the number of updating CPUs increases, the update-side
+deferred-execution overhead starts to make its presence known, first
+for RCU and then for hazard pointers.
+Of course, all three of these implementations beat global locking, and
+by more than an order of magnitude.
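The shape of that update-side deferred-execution overhead can be seen in a
minimal removal sketch (not the book's hash-table code): it assumes liburcu's
\co{call_rcu()} and the illustrative declarations from the earlier sketches,
and the \co{zoo_he} field names are invented here for illustration.

\begin{verbatim}
#include <stdlib.h>
#include <urcu.h>            /* call_rcu(), struct rcu_head, caa_container_of(). */
#include <urcu/rculist.h>    /* cds_list_del_rcu(). */

struct zoo_he {                /* Illustrative enclosing animal structure. */
    struct ht_elem zhe_e;      /* Hash-table linkage (see the earlier sketch). */
    struct rcu_head zhe_rh;    /* Used by call_rcu() for deferred freeing. */
    char zhe_name[32];
};

static void zoo_he_free(struct rcu_head *rhp)   /* Invoked after a grace period. */
{
    free(caa_container_of(rhp, struct zoo_he, zhe_rh));
}

void zoo_del(struct hashtab *htp, struct zoo_he *zhep, unsigned long hash)
{
    hashtab_lock(htp, hash);                  /* Serialize concurrent updaters. */
    cds_list_del_rcu(&zhep->zhe_e.hte_next);  /* Unlink; readers may still hold references. */
    hashtab_unlock(htp, hash);
    call_rcu(&zhep->zhe_rh, zoo_he_free);     /* Deferred free: the update-side cost. */
}
\end{verbatim}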
It is quite possible that the differences in lookup performance observed in
\cref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates}
@@ -1005,7 +1014,9 @@ not recommended for production use.
excellent memory bandwidth.
}\QuickQuizEnd
-And this situation exposes yet another of the complications listed on
+And this situation demonstrates
+\cref{sec:datastruct:Update-activity failure to scale}
+of the list of complications on
\cpageref{sec:datastruct:Update-activity failure to scale}.
% @@@ Testing strategy. Summarize hashtorture, add QQ for additional
@@ -1037,9 +1048,8 @@ last heartbeat before declaring death.
It is clearly ridiculous to wait only one millisecond, because then
a healthy living cat would have to be declared dead---and then
resurrected---more than once per second.
-It is equally ridiculous to wait a full month, because by that time
-the poor cat's death would have made itself very clearly known
-via olfactory means.
+It is equally ridiculous to wait a full month, because by that time the
+poor cat's death would be unmistakably apparent to olfactory sensors.
\begin{figure}
\centering
@@ -1115,12 +1125,11 @@ scalable RCU-protected hash tables, as described in the following sections.
\label{sec:datastruct:Resizable Hash Table Design}
In happy contrast to the situation in the early 2000s, there are now
-no fewer than three different types of scalable RCU-protected hash
-tables.
+several different types of scalable RCU-protected hash tables.
The first (and simplest) was developed for the Linux kernel by
\ppl{Herbert}{Xu}~\cite{HerbertXu2010RCUResizeHash}, and is described in the
following sections.
-The other two are covered briefly in
+Two others are covered briefly in
\cref{sec:datastruct:Other Resizable Hash Tables}.
The key insight behind the first hash-table implementation is that