| author | Paul E. McKenney <paulmck@kernel.org> | 2023-09-24 07:43:44 -0700 |
|---|---|---|
| committer | Paul E. McKenney <paulmck@kernel.org> | 2023-09-24 07:43:44 -0700 |
| commit | 4d868154f6cfc15e25c6e74202c436b3762f4877 (patch) | |
| tree | a6dc562c83a719fedaf2c138deb35ed107c41bcd | |
| parent | d85fcb1264920f2784da69e601e6d4f85019b446 (diff) | |
| download | perfbook-4d868154f6cfc15e25c6e74202c436b3762f4877.tar.gz | |
datastruct: Self-review, part 1
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
-rw-r--r-- | datastruct/datastruct.tex | 75 |
1 file changed, 42 insertions(+), 33 deletions(-)
diff --git a/datastruct/datastruct.tex b/datastruct/datastruct.tex
index b56492d1..dc08e0c9 100644
--- a/datastruct/datastruct.tex
+++ b/datastruct/datastruct.tex
@@ -35,11 +35,14 @@ This chapter will expose a number of complications:
 \item Even \emph{read-only synchronization-free} data-structure
 	traversal can fail to scale on some types of systems.
 	\label{sec:datastruct:Cache-size-induced failure to scale redux}
-\item Data-structure traverals avoiding the aforementioned complications
+\item Data-structure traversals avoiding the aforementioned complications
 	can still be impeded by concurrent updates.
 	\label{sec:datastruct:Update-activity failure to scale}
 \end{enumerate}
 
+This chapter will investigate these complications and demonstrate some
+ways of unraveling them.
+
 \Cref{sec:datastruct:Motivating Application}
 presents the motivating application for this chapter's data structures.
 \Cref{chp:Partitioning and Synchronization Design} showed how
@@ -92,8 +95,8 @@ so that he suspects that his mice might be checking up on their nemesis.
 Whatever their source, Schr\"odinger's application must handle high
 query rates to a single data element.
 
-As we will see, this simple application can be a challenge to concurrent
-data structures.
+As we will see, this simple application can pose a challenge to
+traditional concurrent data structures.
 
 \section{Partitionable Data Structures}
 \label{sec:datastruct:Partitionable Data Structures}
@@ -163,6 +166,9 @@ shows a set of data structures used in a simple fixed-sized hash table
 using chaining and per-hash-bucket locking, and
 \cref{fig:datastruct:Hash-Table Data-Structure Diagram}
 diagrams how they fit together.
+Note that the \co{cds_} functions and data structures may be found
+in the userspace RCU
+library~\cite{MathieuDesnoyers2009URCU,PaulMcKenney2013LWNURCUqueuestack,PaulMcKenney2013LWNURCUqueuestackAPI,PaulMcKenney2013LWNURCUlist}.
 The \co{hashtab} structure (\clnrefrange{tab:b}{tab:e} in
 \cref{lst:datastruct:Hash-Table Data Structures})
 contains four \co{ht_bucket} structures
@@ -182,6 +188,8 @@ the corresponding element's hash value in the \co{->hte_hash} field.
 The \co{ht_elem} structure is included in a larger structure
 which might contain a complex key.
 \end{fcvref}
+\Cref{fig:datastruct:Hash-Table Data-Structure Diagram}
+shows bucket~0 containing two elements and bucket~2 containing one.
 
 \begin{listing}
 \input{CodeSamples/datastruct/hash/hash_bkt@struct.fcv}
@@ -196,9 +204,6 @@ which might contain a complex key.
 \label{fig:datastruct:Hash-Table Data-Structure Diagram}
 \end{figure}
 
-\Cref{fig:datastruct:Hash-Table Data-Structure Diagram}
-shows bucket~0 containing two elements and bucket~2 containing one.
-
 \begin{fcvref}[ln:datastruct:hash_bkt:map_lock:map]
 \Cref{lst:datastruct:Hash-Table Mapping and Locking}
 shows mapping and locking functions.
@@ -221,8 +226,8 @@ corresponding to the specified hash value.
 \begin{fcvref}[ln:datastruct:hash_bkt:lookup]
 \Cref{lst:datastruct:Hash-Table Lookup}
 shows \co{hashtab_lookup()},
-which returns a pointer to the element with the specified hash and key if it
-exists, or \co{NULL} otherwise.
+which returns a pointer to the element with the specified hash and key,
+or \co{NULL} if that element does not exist.
 This function takes both a hash value and a pointer to the key because
 this allows users of this function to use arbitrary keys and arbitrary
 hash functions.
@@ -327,11 +332,10 @@ The performance results for a single 28-core socket of a 2.1\,GHz Intel
 Xeon system using a bucket-locked hash table with 262,144 buckets are
 shown in
 \cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo}.
-The performance does scale nearly linearly, but it falls a far short
-of the ideal performance level, even at only 28~CPUs.
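The hunks above describe perfbook's bucket-locked hash table: a `hashtab` containing `ht_bucket` structures, each bucket pairing a chain of `ht_elem` structures with its own lock, and a `hashtab_lookup()` that takes both hash and key. As a rough self-contained illustration of that design (not perfbook's actual code, which lives in CodeSamples/datastruct/hash and chains elements with `cds_list_head` from the userspace RCU library), a sketch might look like:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified sketch of a chained, per-bucket-locked hash table.
 * A bare singly-linked hte_next pointer stands in for the
 * cds_list_head used by perfbook's real implementation. */

struct ht_elem {
	struct ht_elem *hte_next;
	unsigned long hte_hash;
};

struct ht_bucket {
	struct ht_elem *htb_head;
	pthread_mutex_t htb_lock;
};

struct hashtab {
	unsigned long ht_nbuckets;
	int (*ht_cmp)(struct ht_elem *htep, void *key);
	struct ht_bucket ht_bkt[];	/* one chain and one lock per bucket */
};

#define HASH2BKT(htp, h) (&(htp)->ht_bkt[(h) % (htp)->ht_nbuckets])

struct hashtab *hashtab_alloc(unsigned long nbuckets,
			      int (*cmp)(struct ht_elem *, void *))
{
	struct hashtab *htp;
	unsigned long i;

	htp = malloc(sizeof(*htp) + nbuckets * sizeof(struct ht_bucket));
	if (!htp)
		return NULL;
	htp->ht_nbuckets = nbuckets;
	htp->ht_cmp = cmp;
	for (i = 0; i < nbuckets; i++) {
		htp->ht_bkt[i].htb_head = NULL;
		pthread_mutex_init(&htp->ht_bkt[i].htb_lock, NULL);
	}
	return htp;
}

/* Acquire/release only the lock covering the given hash value. */
void hashtab_lock(struct hashtab *htp, unsigned long hash)
{
	pthread_mutex_lock(&HASH2BKT(htp, hash)->htb_lock);
}

void hashtab_unlock(struct hashtab *htp, unsigned long hash)
{
	pthread_mutex_unlock(&HASH2BKT(htp, hash)->htb_lock);
}

/* Return the element with the given hash and key, or NULL.
 * Taking both hash and key lets callers use arbitrary keys and
 * arbitrary hash functions.  Caller must hold the bucket lock. */
struct ht_elem *hashtab_lookup(struct hashtab *htp, unsigned long hash,
			       void *key)
{
	struct ht_elem *htep;

	for (htep = HASH2BKT(htp, hash)->htb_head; htep;
	     htep = htep->hte_next)
		if (htep->hte_hash == hash && htp->ht_cmp(htep, key))
			return htep;
	return NULL;
}

/* Add an element; caller must hold the bucket lock for hash. */
void hashtab_add(struct hashtab *htp, unsigned long hash,
		 struct ht_elem *htep)
{
	struct ht_bucket *htb = HASH2BKT(htp, hash);

	htep->hte_hash = hash;
	htep->hte_next = htb->htb_head;
	htb->htb_head = htep;
}
```

Because each bucket has its own lock, CPUs operating on different buckets proceed in parallel, which is what makes this structure partitionable in the first place.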
-Part of this shortfall is due to the fact that the lock acquisitions and
-releases incur no cache misses on a single CPU, but do incur misses
-on two or more CPUs.
+The performance does scale nearly linearly, but falls far short of
+ideal, even at only 28~CPUs.
+Part of this shortfall is due to the fact that lock acquisitions and
+releases incur communications cache misses only on two or more CPUs.
 
 \begin{figure}
 \centering
@@ -343,7 +347,7 @@ on two or more CPUs.
 And things only get worse with more CPUs, as can be seen in
 \cref{fig:datastruct:Read-Only Hash-Table Performance For Schroedinger's Zoo; 448 CPUs}.
 We do not need to show ideal performance:
-The performance for 29~CPUs and beyond is all too clearly worse than abysmal.
+The performance for 29~CPUs and beyond is abysmal.
 This clearly underscores the dangers of extrapolating performance from
 a modest number of CPUs.
 
@@ -377,9 +381,11 @@ However, as can be seen in
 changing the number of buckets has almost no effect:
 Scalability is still abysmal.
 In particular, we still see a sharp dropoff at 29~CPUs and beyond,
-clearly demonstrating the complication put forward on
+clearly demonstrating the complication put forward by
+\cref{sec:datastruct:NUMA-induced failure to scale}
+of the list of complications on
 \cpageref{sec:datastruct:NUMA-induced failure to scale}.
-And just as clearly, something else is going on.
+Clearly, something else is going on.
 
 The problem is that this is a multi-socket system, with CPUs~0--27
 and~225--251 mapped to the first socket as shown in
@@ -422,8 +428,8 @@ please read on!
 	on each of those systems?
 }\QuickQuizEnd
 
-One key property of the Schr\"odinger's-zoo runs discussed thus far is that
-they are all read-only.
+One key property of these Schr\"odinger's-zoo experiments is that
+they are all read-only.
 This makes the performance degradation due to lock-acquisition-induced
 cache misses all the more painful.
 Even though we are not updating the underlying hash table itself, we are
@@ -442,7 +448,7 @@ read-mostly cases where updates are rare, but could happen at any time.
 \setlength\dashlinedash{1pt}
 \setlength\dashlinegap{2pt}
 
-\begin{figure}
+\begin{table}
 \renewcommand*{\arraystretch}{1.2}
 \centering
 \begin{tabular}{r||c:c}
@@ -469,7 +475,7 @@ read-mostly cases where updates are rare, but could happen at any time.
 \end{tabular}
 \caption{NUMA Topology of System Under Test}
 \label{fig:datastruct:NUMA Topology of System Under Test}
-\end{figure}
+\end{table}
 
 \section{Read-Mostly Data Structures}
 \label{sec:datastruct:Read-Mostly Data Structures}
@@ -735,7 +741,9 @@ about a factor of five short of ideal.
 adds completely unsynchronized results, which works because this
 is a read-only benchmark with nothing to synchronize.
 Even with no synchronization whatsoever, performance still falls far
-short of ideal, thus demonstrating two more complications on
+short of ideal, thus demonstrating
+\cref{sec:datastruct:Cache-size-induced failure to scale,sec:datastruct:Cache-size-induced failure to scale redux}
+of the list of complications on
 \cpageref{sec:datastruct:Cache-size-induced failure to scale redux}.
 
 The problem is that this system has sockets with 28 cores, which have
@@ -754,7 +762,7 @@ only 11~ways.
 This means that L2 cache collisions will be the rule and also that L3
 cache collisions will not be uncommon, so that the resulting cache
 misses will degrade performance.
-In this case, the bottleneck is not in the CPU, but rather in the hardware
+In this case, the bottleneck is not in the CPU, but rather in the memory
 system.
 % This data was generated wtih 262,144 buckets and elements.
 % RCU hash ht_elem structure is 40 bytes plus 32 more from zoo_he for 72.
@@ -962,10 +970,11 @@ doing lookups and at the right-hand side of the figure all 448~CPUs are
 doing updates.
 Hazard pointers and RCU start off with a significant advantage because,
 unlike bucket locking, readers do not exclude updaters.
-However, as the number of updating CPUs increases, update-side overhead
-starts to make its presence known, first for RCU and then for hazard
-pointers.
-Of course, all three of these implementations beat global locking.
+However, as the number of updating CPUs increases, the update-side
+deferred-execution overhead starts to make its presence known, first
+for RCU and then for hazard pointers.
+Of course, all three of these implementations beat global locking, and
+by more than an order of magnitude.
 
 It is quite possible that the differences in lookup performance observed in
 \cref{fig:datastruct:Read-Side RCU-Protected Hash-Table Performance For Schroedinger's Zoo in the Presence of Updates}
@@ -1005,7 +1014,9 @@ not recommended for production use.
 	excellent memory bandwidth.
 }\QuickQuizEnd
 
-And this situation exposes yet another of the complications listed on
+And this situation demonstrates
+\cref{sec:datastruct:Update-activity failure to scale}
+in the list of complications on
 \cpageref{sec:datastruct:Update-activity failure to scale}.
 
 % @@@ Testing strategy.  Summarize hashtorture, add QQ for additional
@@ -1037,9 +1048,8 @@ last heartbeat before declaring death.
 It is clearly ridiculous to wait only one millisecond, because then a
 healthy living cat would have to be declared dead---and then
 resurrected---more than once per second.
-It is equally ridiculous to wait a full month, because by that time
-the poor cat's death would have made itself very clearly known
-via olfactory means.
+It is equally ridiculous to wait a full month, because by that time the
+poor cat's death would be unmistakably apparent to olfactory sensors.
 
 \begin{figure}
 \centering
@@ -1115,12 +1125,11 @@ scalable RCU-protected hash tables, as described in the following
 sections.
 \label{sec:datastruct:Resizable Hash Table Design}
 
 In happy contrast to the situation in the early 2000s, there are now
-no fewer than three different types of scalable RCU-protected hash
-tables.
+several different types of scalable RCU-protected hash tables.
 The first (and simplest) was developed for the Linux kernel by
 \ppl{Herbert}{Xu}~\cite{HerbertXu2010RCUResizeHash}, and is described in
 the following sections.
-The other two are covered briefly in
+Two others are covered briefly in
 \cref{sec:datastruct:Other Resizable Hash Tables}.
 
 The key insight behind the first hash-table implementation is that