perf(lock): shared locker latch on lock-get hot path (+ bench fixes/tooling) #27
Open
gburd wants to merge 4 commits into
The TPROC drivers opened the environment with default lock-region sizing (~1000 locks/objects/lockers) and a tiny default log buffer. A batched bulk load or a many-thread run exhausts those entries and fails mid-run with ENOMEM (BDB2055 'Lock table is out of available lock entries', BDB1501 'Logging region out of memory'), and an unchecked failure during populate could leave a partially built environment that crashes on reuse. Size the lock subsystem (200k locks/objects/lockers) and the log buffer (16MB) when the corresponding subsystems are enabled. Verified populate + run at scale 5 and 50.
Add a -D N toggle to the shared harness: 0 (default) keeps BDB's detect-on-every-conflict behavior; N>0 disables inline detection and runs a background deadlock detector every N ms instead. Lets a run A/B the cost of synchronous vs periodic deadlock detection. Measurement tooling only; no engine change. (A/B on a 12-core box found the two modes within noise on the contended debit/credit workload.)
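The mapping from the -D argument to a detection mode can be sketched as follows. This is a minimal illustration, not the harness's code: the `deadlock_mode` struct and `parse_deadlock_mode` function are hypothetical names. In Berkeley DB terms, inline detection would typically be enabled with `DB_ENV->set_lk_detect`, and the periodic mode would call `DB_ENV->lock_detect` from a background thread; whether the harness wires it exactly that way is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical harness-side representation of the -D setting. */
typedef struct {
    int  inline_detect;       /* 1: detect on every conflict (-D 0, the default) */
    long detect_interval_ms;  /* >0: period of the background detector, in ms */
} deadlock_mode;

/*
 * Parse the argument of -D. Returns 0 on success, -1 on malformed or
 * negative input. N == 0 selects inline detection; N > 0 selects a
 * background detector that runs every N milliseconds.
 */
static int parse_deadlock_mode(const char *arg, deadlock_mode *out)
{
    char *end;
    long n;

    assert(out != NULL);
    if (arg == NULL || *arg == '\0')
        return -1;
    errno = 0;
    n = strtol(arg, &end, 10);
    if (errno != 0 || *end != '\0' || n < 0)
        return -1;
    out->inline_detect = (n == 0);
    out->detect_interval_ms = n;
    return 0;
}
```

With this shape, an A/B run is two invocations that differ only in the -D value, for example `-D 0` against `-D 50`.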
A micro-benchmark that exercises the lock subsystem in isolation: each thread allocates its own locker and loops lock_get/lock_put on either per-thread (distinct, no-conflict) or shared read objects, with no access method or buffer pool in the path. This exposes lock-manager scaling that btree-bound workloads (e.g. scale_bench rrand) hide behind page cache misses. On a 24-thread box it shows the per-op global locker mutex plateauing throughput at ~8 threads.
Every DB_ENV->lock_get / lock_put resolves its locker through __lock_getlocker_int under the region-global locker mutex (mtx_lockers). On the lock-get path the lookup is create=0 -- a read-only walk of the locker hash bucket -- yet it was held *exclusive*, serializing every lock acquisition across all cores even when objects are fully partitioned and there is no lock conflict. Make mtx_lockers a DB_MUTEX_SHARED latch and take it in shared mode for the read-only locker lookup on the hot path (__lock_get_api). Locker create, free, the deadlock detector's locker-list walk, failchk, and stat continue to hold it exclusive, so they never run concurrently with a reader. Measured with lab/bench/lock_bench (distinct mode, no lock conflict, on a 24-thread box): master plateaus and then declines past 8 threads (~3.0M ops/s peak, 2.6M at 24t); the shared latch scales to 7.0M at 24t -- 2.1x at 8 threads, 2.7x at 24. It captures roughly half the upper bound of removing the mutex entirely; the remainder is the shared latch's own reference-count cache line, which would require partitioning the locker hash to recover (left for later -- this is the low-risk 80/20). A small single-thread regression (~8%) reflects the shared latch's slightly higher uncontended cost and is dwarfed by the multi-core gain. Verified: TCL lock001/002/003 (incl. multi-process), txn001/002, test001, ssi001/002 pass; concurrent shared read-lock acquisition (lock_bench shared) runs clean.
Lock-manager scaling: shared locker latch on the lock-get hot path
This branch began as benchmark fixes during a lock-manager scaling
investigation and now also carries the engine win that investigation
found.
The bottleneck (found with a dedicated probe)
Every DB_ENV->lock_get/lock_put resolves its locker through
__lock_getlocker_int under the region-global locker mutex mtx_lockers.
On the lock-get path that lookup is create=0, a read-only walk of a
locker hash bucket, yet the mutex was held exclusive, serializing every
lock acquisition across all cores even with objects fully partitioned
(240-way) and zero lock conflict.
This was invisible in btree workloads (the earlier scale_bench rrand
profile is dominated by B-tree key-comparison cache misses, which hide the
lock serialization). A dedicated lock-only probe (lock_bench, added here)
exposed it: on a 24-thread box, master plateaus at ~3M lock ops/s and
declines past 8 threads.
The fix (3 lines of engine change)
Make mtx_lockers a DB_MUTEX_SHARED latch and take it shared for the
read-only locker lookup on the hot path. Locker create/free, the deadlock
detector's locker-list walk, failchk, and stat keep it exclusive, so a
reader never runs concurrently with a writer.
Measured (lock_bench distinct, no conflict, 24-thread box)
master peaks at ~3.0M ops/s and falls to 2.6M at 24 threads; the shared
latch reaches 7.0M at 24 threads (2.1x at 8 threads, 2.7x at 24). The
upper bound is removing the mutex entirely (an unsafe diagnostic). The
shared latch captures ~half of that upper bound; closing the rest needs
partitioning the locker hash (more invasive, deferred; this is the
low-risk 80/20). Single-thread cost rises ~8% (shared latch vs plain
mutex, uncontended), dwarfed by the multi-core gain.
No regression on real workloads: rrand unchanged (btree-bound), tproc_b
flat (deadlock/disk-bound). The fix helps where the bottleneck is and
costs nothing where it isn't.
Also in this branch (benchmark tooling/fixes)
ENOMEM at scale (a real bug in the merged suite).
lock_bench: the direct lock-manager probe used to find this.
-D knob: A/B synchronous vs periodic deadlock detection.
Verified
TCL lock001/002/003 (incl. multi-process), txn001/002, test001,
ssi001/002 pass; concurrent shared read-lock acquisition runs clean; clean
build (gcc via Nix, Apple clang).