tag name | mm-slub-5.15-rc1 (227c950afec7aaccb6f8f65a4a2dcd8b78181021) |
tag date | 2021-09-04 12:21:51 +0200 |
tagged by | Vlastimil Babka <vbabka@suse.cz> |
tagged object | commit bd0e7491a9... |
SLUB: reduce irq disabled scope and make it RT compatible
This series was initially inspired by Mel's pcplist local_lock rewrite, and
also by an interest in better understanding SLUB's locking, the new primitives,
and their RT variants and implications. It makes SLUB compatible with PREEMPT_RT and
generally more preemption-friendly, apparently without significant regressions,
as the fast paths are not affected.
The main changes to SLUB by this series:
* irq disabling is now only done for the minimum amount of time needed to protect
the strict kmem_cache_cpu fields, and as part of spin lock, local lock and
bit lock operations to make them irq-safe
* SLUB is fully PREEMPT_RT compatible
Series is based on 5.14-rc6 and also available as a git branch:
https://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git/log/?h=slub-local-lock-v5r0
The series should now be sufficiently tested in both RT and !RT configs, mainly
thanks to Mike.
The RFC/v1 version also got basic performance screening by Mel that didn't show
major regressions. Mike's testing with hackbench of v2 on !RT reported
negligible differences [6]:
virgin(ish) tip
5.13.0.g60ab3ed-tip
7,320.67 msec task-clock # 7.792 CPUs utilized ( +- 0.31% )
221,215 context-switches # 0.030 M/sec ( +- 3.97% )
16,234 cpu-migrations # 0.002 M/sec ( +- 4.07% )
13,233 page-faults # 0.002 M/sec ( +- 0.91% )
27,592,205,252 cycles # 3.769 GHz ( +- 0.32% )
8,309,495,040 instructions # 0.30 insn per cycle ( +- 0.37% )
1,555,210,607 branches # 212.441 M/sec ( +- 0.42% )
5,484,209 branch-misses # 0.35% of all branches ( +- 2.13% )
0.93949 +- 0.00423 seconds time elapsed ( +- 0.45% )
0.94608 +- 0.00384 seconds time elapsed ( +- 0.41% ) (repeat)
0.94422 +- 0.00410 seconds time elapsed ( +- 0.43% )
5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
7,343.57 msec task-clock # 7.776 CPUs utilized ( +- 0.44% )
223,044 context-switches # 0.030 M/sec ( +- 3.02% )
16,057 cpu-migrations # 0.002 M/sec ( +- 4.03% )
13,164 page-faults # 0.002 M/sec ( +- 0.97% )
27,684,906,017 cycles # 3.770 GHz ( +- 0.45% )
8,323,273,871 instructions # 0.30 insn per cycle ( +- 0.28% )
1,556,106,680 branches # 211.901 M/sec ( +- 0.31% )
5,463,468 branch-misses # 0.35% of all branches ( +- 1.33% )
0.94440 +- 0.00352 seconds time elapsed ( +- 0.37% )
0.94830 +- 0.00228 seconds time elapsed ( +- 0.24% ) (repeat)
0.93813 +- 0.00440 seconds time elapsed ( +- 0.47% ) (repeat)
RT configs showed some throughput regressions, but that's an expected tradeoff for
the preemption improvements through the RT mutex. That didn't prevent v2 from
being incorporated into the 5.13 RT tree [7], leading to testing exposure and
bugfixes.
Before the series, SLUB is lockless in both allocation and free fast paths, but
elsewhere, it's disabling irqs for considerable periods of time - especially in
allocation slowpath and the bulk allocation, where IRQs are re-enabled only
when a new page from the page allocator is needed, and the context allows
blocking. The irq disabled sections can then include deactivate_slab() which
walks a full freelist and frees the slab back to page allocator or
unfreeze_partials() going through a list of percpu partial slabs. The RT tree
currently has some patches mitigating these, but we can do much better in
mainline too.
Patches 1-6 are straightforward improvements or cleanups that could exist
outside of this series too, but are prerequisites.
Patches 7-9 are also preparatory code changes without functional changes, but
not so useful without the rest of the series.
Patch 10 simplifies the fast paths on systems with preemption, based on
the (hopefully correct) observation that the current loops to verify tid are
unnecessary.
Patches 11-20 focus on reducing irq disabled scope in the allocation slowpath.
Patch 11 moves disabling of irqs into ___slab_alloc() from its callers, which
are the allocation slowpath and bulk allocation. Instead these callers only
disable preemption to stabilize the cpu. The following patches then gradually
reduce the scope of disabled irqs in ___slab_alloc() and the functions called
from there. As of patch 14, the re-enabling of irqs based on gfp flags before
calling the page allocator is removed from allocate_slab(). As of patch 17,
it's possible to reach the page allocator (when the existing slabs are depleted)
without disabling and re-enabling irqs a single time.
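The change in irq disabled scope described above can be modelled with a small user-space mock. This is only a sketch under stated assumptions, not kernel code: every `sketch_*` name is hypothetical, the counters stand in for real irq and preemption state, and the short irq disabled section that installs the new slab in the "after" variant is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock CPU state; every name here is a hypothetical stand-in. */
struct sketch_cpu {
	bool irqs_off;
	int preempt_depth;
	int irq_disable_calls;		/* how often irqs were switched off */
	int irq_calls_at_page_alloc;	/* value of the counter on allocator entry */
	bool irqs_off_in_page_alloc;	/* were irqs off on allocator entry? */
};

static inline void sketch_irq_disable(struct sketch_cpu *cpu)
{
	cpu->irqs_off = true;
	cpu->irq_disable_calls++;
}

static inline void sketch_irq_enable(struct sketch_cpu *cpu)
{
	cpu->irqs_off = false;
}

/* Stand-in for the page allocator: only records the irq state on entry. */
static inline void sketch_page_alloc(struct sketch_cpu *cpu)
{
	cpu->irqs_off_in_page_alloc = cpu->irqs_off;
	cpu->irq_calls_at_page_alloc = cpu->irq_disable_calls;
}

/* Before: irqs stay off for the whole slow path unless blocking is allowed. */
static inline void sketch_slowpath_before(struct sketch_cpu *cpu, bool may_block)
{
	sketch_irq_disable(cpu);
	/* ... percpu slab and partial lists are examined with irqs off ... */
	if (may_block)
		sketch_irq_enable(cpu);
	sketch_page_alloc(cpu);
	if (may_block)
		sketch_irq_disable(cpu);
	/* ... the new slab is installed with irqs off ... */
	sketch_irq_enable(cpu);
}

/* After: the caller only pins the cpu; irqs go off for short sections only. */
static inline void sketch_slowpath_after(struct sketch_cpu *cpu)
{
	cpu->preempt_depth++;		/* stand-in for disabling preemption */
	/* ... existing slabs are found depleted without touching irqs ... */
	sketch_page_alloc(cpu);
	sketch_irq_disable(cpu);	/* assumed: short section to install the slab */
	sketch_irq_enable(cpu);
	cpu->preempt_depth--;
}
```

In this model the "before" path enters the page allocator with irqs off whenever the context cannot block, while the "after" path reaches it without having disabled irqs at all.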
Patches 21-26 reduce the scope of disabled irqs in functions related to
unfreezing percpu partial slabs.
Patch 27 is preparatory. Patch 28 is adopted from the RT tree and converts the
flushing of percpu slabs on all cpus from using IPI to workqueue, so that the
processing isn't happening with irqs disabled in the IPI handler. The flushing
is not performance critical so it should be acceptable.
Patch 29 also comes from RT tree and makes object_map_lock RT compatible.
Patch 30 makes slab_lock irq-safe on RT where we cannot rely on having
irqs disabled from the list_lock spin lock usage.
Patch 31 changes kmem_cache_cpu->partial handling in put_cpu_partial() from
cmpxchg loop to a short irq disabled section, which is used by all other code
modifying the field. This addresses a theoretical race scenario pointed out by
Jann, and makes the critical section safe wrt RT local_lock semantics
after the conversion in patch 33.
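The difference between the two update styles for the percpu partial list can be illustrated in user space with C11 atomics. This is a sketch under assumptions, not the code from patch 31: the `sketch_*` types and functions are hypothetical, the `irqs_off` flag merely stands in for saving and restoring irq state, and the draining of an overfull list and any statistics are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_page {
	struct sketch_page *next;
};

struct sketch_cpu_cache {
	_Atomic(struct sketch_page *) partial;	/* head of the percpu partial list */
	int irqs_off;				/* mock "irqs disabled" flag */
};

static inline void sketch_cache_init(struct sketch_cpu_cache *c)
{
	atomic_init(&c->partial, NULL);
	c->irqs_off = 0;
}

/* Old style: lockless push using a compare-and-exchange retry loop. */
static inline void sketch_put_partial_cmpxchg(struct sketch_cpu_cache *c,
					      struct sketch_page *page)
{
	struct sketch_page *old = atomic_load(&c->partial);

	do {
		page->next = old;	/* old is refreshed when the exchange fails */
	} while (!atomic_compare_exchange_weak(&c->partial, &old, page));
}

/* New style: the same update inside a short (mock) irq disabled section. */
static inline void sketch_put_partial_irqsafe(struct sketch_cpu_cache *c,
					      struct sketch_page *page)
{
	assert(!c->irqs_off);
	c->irqs_off = 1;		/* stand-in for saving and disabling irqs */
	page->next = atomic_load(&c->partial);
	atomic_store(&c->partial, page);
	c->irqs_off = 0;		/* stand-in for restoring irqs */
}

/* Helper for inspection: number of pages currently on the list. */
static inline int sketch_partial_count(struct sketch_cpu_cache *c)
{
	int n = 0;

	for (struct sketch_page *p = atomic_load(&c->partial); p; p = p->next)
		n++;
	return n;
}
```

Both functions leave the list in the same state. The point of the second form is that it uses the same kind of protection as every other writer of the field, so a later switch of that protection to a local lock covers this path as well.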
Patch 32 changes preempt disable to migrate disable, so that the nested
list_lock spinlock is safe to take on RT. Because migrate_disable() is a
function call even on !RT, a small set of private wrappers is introduced
to keep using the cheaper preempt_disable() on !PREEMPT_RT configurations.
As of this patch, SLUB should be already compatible with RT's lock semantics.
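The wrapper idea from patch 32 might look roughly like the following user-space mock. It is a sketch, not the patch itself: the `sketch_*` names and the `SKETCH_PREEMPT_RT` switch are hypothetical stand-ins for the kernel's preempt_disable(), migrate_disable() and the PREEMPT_RT config option, and the stubs only count nesting depth.

```c
#include <assert.h>

/* Hypothetical stand-ins for kernel primitives; they only track nesting. */
static int sketch_preempt_depth;
static int sketch_migrate_depth;

static inline void sketch_preempt_disable(void) { sketch_preempt_depth++; }
static inline void sketch_preempt_enable(void)  { sketch_preempt_depth--; }
static inline void sketch_migrate_disable(void) { sketch_migrate_depth++; }
static inline void sketch_migrate_enable(void)  { sketch_migrate_depth--; }

/*
 * Private wrappers: on the mock RT config the task is only pinned to its
 * cpu, so a sleeping lock may still be taken inside the section; otherwise
 * the cheaper preempt-disable variant is used.
 */
#ifdef SKETCH_PREEMPT_RT
#define sketch_pin_cpu()	sketch_migrate_disable()
#define sketch_unpin_cpu()	sketch_migrate_enable()
#else
#define sketch_pin_cpu()	sketch_preempt_disable()
#define sketch_unpin_cpu()	sketch_preempt_enable()
#endif

/* Example section; returns the total nesting depth seen while pinned. */
static inline int sketch_pinned_section(void)
{
	int depth;

	sketch_pin_cpu();
	/* ... percpu data would be used here, possibly under list_lock ... */
	depth = sketch_preempt_depth + sketch_migrate_depth;
	sketch_unpin_cpu();
	return depth;
}
```

Callers use only the wrapper pair, so the choice between the two primitives is made once at build time and the call sites stay identical on both configurations.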
Finally, patch 33 replaces the irq disabled sections that protect kmem_cache_cpu
fields in the slow paths with a local lock. However on PREEMPT_RT it means the
lockless fast paths can now preempt slow paths which don't expect that, so the
local lock has to be taken also in the fast paths and they are no longer
lockless. RT folks seem to not mind this tradeoff. The patch also updates the
locking documentation in the file's comment.
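The fast path tradeoff described for patch 33 can be sketched in user space as follows. This is an illustration under assumptions, not SLUB's implementation: all `sketch_*` names and the `SKETCH_PREEMPT_RT` switch are hypothetical, the lock is a mock that only counts acquisitions, and the compare-and-exchange variant leaves out the tid that the real fast path pairs with the freelist pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct sketch_obj {
	struct sketch_obj *next;
};

struct sketch_cpu_slab {
	_Atomic(struct sketch_obj *) freelist;
	int lock_depth;		/* mock local lock: nesting depth */
	int lock_acquisitions;	/* how often the mock lock was taken */
};

static inline void sketch_slab_init(struct sketch_cpu_slab *c,
				    struct sketch_obj *head)
{
	atomic_init(&c->freelist, head);
	c->lock_depth = 0;
	c->lock_acquisitions = 0;
}

static inline void sketch_local_lock(struct sketch_cpu_slab *c)
{
	assert(c->lock_depth == 0);
	c->lock_depth++;
	c->lock_acquisitions++;
}

static inline void sketch_local_unlock(struct sketch_cpu_slab *c)
{
	c->lock_depth--;
}

/* Allocation fast path: pop one object from the percpu freelist. */
static inline struct sketch_obj *sketch_alloc_fast(struct sketch_cpu_slab *c)
{
	struct sketch_obj *obj;

#ifdef SKETCH_PREEMPT_RT
	/* Mock RT: slow paths can be preempted, so take the lock here too. */
	sketch_local_lock(c);
	obj = atomic_load(&c->freelist);
	if (obj)
		atomic_store(&c->freelist, obj->next);
	sketch_local_unlock(c);
#else
	/* Mock !RT: lockless pop; a failed exchange reloads obj and retries. */
	obj = atomic_load(&c->freelist);
	while (obj && !atomic_compare_exchange_weak(&c->freelist, &obj, obj->next))
		;
#endif
	return obj;
}
```

Both variants return the same objects in the same order. Only the mock RT build pays for a lock acquisition on every allocation, which is the tradeoff the text refers to.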
-----BEGIN PGP SIGNATURE-----
iQEzBAABCAAdFiEEjUuTAak14xi+SF7M4CHKc/GJqRAFAmEzSooACgkQ4CHKc/GJ
qRC3Agf+MXJB5NVCOkwgEk9wipbFETrJDsvM2Yf2CrqbK9MzKtPNrL82lZHdgtq2
HJ5gT8QZTFQ7n8nbY3P6LRClDdtqYm8b7aX02qtc2JrM29wIQw8A1gummLkQDNRm
s+vd0ndPc4V6mqJQqiTk1WB8F+SJ0u3LfjesbIlqgcWREzZaPgm+hw3UUEtz/tXu
RiEkWI30u0S0X5/HimqK8pdmwGPvzX8l1N9Sc2VeoQoFPPL/Cm2D5jZR/xHtKLfW
q4ZVVXdh/YtOWXMD0jOr9q/bxwLDWCkvWHEmAES5nT2apFmCuusZ3+XWzWf8bSX/
j3eTiiNHTaktf/mndEymEbztnqmfGQ==
=3Jty
-----END PGP SIGNATURE-----