perf(lock): shared locker latch on lock-get hot path (+ bench fixes/tooling)#27

Open
gburd wants to merge 4 commits into master from perf/lock-scaling
gburd (Collaborator) commented Jun 20, 2026

Lock-manager scaling: shared locker latch on the lock-get hot path

This branch began as benchmark fixes during a lock-manager scaling
investigation and now also carries the engine win that investigation
found.

The bottleneck (found with a dedicated probe)

Every DB_ENV->lock_get/lock_put resolves its locker through
__lock_getlocker_int under the region-global locker mutex mtx_lockers.
On the lock-get path that lookup runs with create=0, a read-only walk of a
locker hash bucket, yet the mutex was held exclusive. That serialized every
lock acquisition across all cores, even with objects fully partitioned
(240-way) and zero lock conflict.

This was invisible in btree workloads (the earlier scale_bench rrand
profile is dominated by B-tree key-comparison cache misses, which hide the
lock serialization). A dedicated lock-only probe (lock_bench, added here)
exposed it: on a 24-thread box, master plateaus at ~3M lock ops/s and
declines past 8 threads.

The fix (3 lines of engine change)

Make mtx_lockers a DB_MUTEX_SHARED latch and take it shared for the
read-only locker lookup on the hot path. Locker create/free, the deadlock
detector's locker-list walk, failchk, and stat keep it exclusive, so a
reader never runs concurrently with a writer.

Measured (lock_bench distinct, no conflict, 24-thread box; lock ops/s)

| threads | master | this branch | upper bound* |
|---|---|---|---|
| 1 | 1.38M | 1.27M | 1.87M |
| 4 | 2.79M | 4.55M | 5.91M |
| 8 | 3.03M | 6.36M | 10.1M |
| 12 | 2.76M | 6.88M | 12.7M |
| 24 | 2.60M | 7.01M (2.7×) | 16.1M |

* Upper bound: the mutex removed entirely (unsafe, diagnostic only). The shared
latch captures roughly half of that bound at high thread counts (about 44% at
24 threads); closing the rest needs partitioning the locker hash (more
invasive and deferred; this change is the low-risk 80/20). Single-thread
throughput drops ~8% (shared latch vs. plain mutex, uncontended), which is
dwarfed by the multi-core gain.
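
The ratios quoted here follow directly from the measured numbers. The sketch below only redoes that arithmetic; the figures are copied from the measurements above, and the struct and function names are illustrative.

```c
#include <assert.h>

/* Throughput in millions of lock ops/s, copied from the measurements above. */
struct sample {
    int threads;
    double master;
    double branch;
    double upper_bound; /* mutex removed entirely (unsafe diagnostic) */
};

const struct sample samples[] = {
    {  1, 1.38, 1.27,  1.87 },
    {  4, 2.79, 4.55,  5.91 },
    {  8, 3.03, 6.36, 10.1  },
    { 12, 2.76, 6.88, 12.7  },
    { 24, 2.60, 7.01, 16.1  },
};

/* Speedup of this branch over master at one thread count. */
double speedup(const struct sample *s) {
    return s->branch / s->master;
}

/* Share of the no-mutex upper bound that this branch reaches. */
double bound_fraction(const struct sample *s) {
    return s->branch / s->upper_bound;
}

/* Relative throughput loss versus master, e.g. 0.08 for 8%. */
double slowdown(const struct sample *s) {
    return (s->master - s->branch) / s->master;
}
```

With these inputs, `speedup(&samples[4])` is about 2.70 (the 2.7× at 24 threads), `bound_fraction(&samples[4])` is about 0.44, and `slowdown(&samples[0])` is about 0.08.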

No regression on real workloads: rrand unchanged (btree-bound),
tproc_b flat (deadlock/disk-bound). The fix helps where the bottleneck is
and costs nothing where it isn't.

Also in this branch (benchmark tooling/fixes)

  • fix: size the lock subsystem + log buffer so the TPROC drivers don't
    ENOMEM at scale (a real bug in the merged suite).
  • lock_bench: the direct lock-manager probe used to find this.
  • -D knob: A/B synchronous vs periodic deadlock detection.
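
For the sizing fix in the list above, a hedged sketch of what the configuration could look like with the standard Berkeley DB C API. It assumes the drivers use the usual pre-open setters (`set_lk_max_locks`, `set_lk_max_objects`, `set_lk_max_lockers`, `set_lg_bsize`); the helper name `size_env_for_scale` and its two flags are hypothetical, and the actual change in this branch may be structured differently. The values are the ones quoted in the commit message (200k lock entries, 16MB log buffer). This fragment needs the library's `db.h` to compile.

```c
#include <db.h>

/*
 * Hypothetical helper: apply the sizes quoted in this PR.
 * Must be called before DB_ENV->open. Returns 0 or a Berkeley DB error code.
 */
int
size_env_for_scale(DB_ENV *dbenv, int use_locking, int use_logging)
{
    int ret;

    if (use_locking) {
        /* 200k locks, lock objects, and lockers instead of the defaults. */
        if ((ret = dbenv->set_lk_max_locks(dbenv, 200000)) != 0 ||
            (ret = dbenv->set_lk_max_objects(dbenv, 200000)) != 0 ||
            (ret = dbenv->set_lk_max_lockers(dbenv, 200000)) != 0)
            return ret;
    }
    if (use_logging) {
        /* 16MB in-memory log buffer. */
        if ((ret = dbenv->set_lg_bsize(dbenv, 16 * 1024 * 1024)) != 0)
            return ret;
    }
    return 0;
}
```

Checking each return value matters here because the original failure mode was an unchecked error during populate leaving a partially built environment.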

Verified

TCL lock001/002/003 (incl. multi-process), txn001/002, test001,
ssi001/002 pass; concurrent shared read-lock acquisition runs clean; clean
build (gcc via Nix, Apple clang).

gburd added 4 commits June 19, 2026 21:09

The TPROC drivers opened the environment with default lock-region sizing
(~1000 locks/objects/lockers) and a tiny default log buffer.  A batched
bulk load or a many-thread run exhausts those entries and fails mid-run
with ENOMEM (BDB2055 'Lock table is out of available lock entries',
BDB1501 'Logging region out of memory'), and an unchecked failure during
populate could leave a partially built environment that crashes on reuse.

Size the lock subsystem (200k locks/objects/lockers) and the log buffer
(16MB) when the corresponding subsystems are enabled.  Verified populate +
run at scale 5 and 50.

Add a -D N toggle to the shared harness: 0 (default) keeps BDB's
detect-on-every-conflict behavior; N>0 disables inline detection and runs a
background deadlock detector every N ms instead.  Lets a run A/B the cost of
synchronous vs periodic deadlock detection.

Measurement tooling only; no engine change.  (A/B on a 12-core box found the
two modes within noise on the contended debit/credit workload.)

A micro-benchmark that exercises the lock subsystem in isolation: each
thread allocates its own locker and loops lock_get/lock_put on either
per-thread (distinct, no-conflict) or shared read objects, with no access
method or buffer pool in the path.  This exposes lock-manager scaling that
btree-bound workloads (e.g. scale_bench rrand) hide behind page cache
misses.  On a 24-thread box it shows the per-op global locker mutex
plateauing throughput at ~8 threads.

Every DB_ENV->lock_get / lock_put resolves its locker through
__lock_getlocker_int under the region-global locker mutex (mtx_lockers).
On the lock-get path the lookup is create=0 -- a read-only walk of the
locker hash bucket -- yet it was held *exclusive*, serializing every lock
acquisition across all cores even when objects are fully partitioned and
there is no lock conflict.

Make mtx_lockers a DB_MUTEX_SHARED latch and take it in shared mode for the
read-only locker lookup on the hot path (__lock_get_api).  Locker create,
free, the deadlock detector's locker-list walk, failchk, and stat continue
to hold it exclusive, so they never run concurrently with a reader.

Measured with lab/bench/lock_bench (distinct mode, no lock conflict, on a
24-thread box): master plateaus and then declines past 8 threads
(~3.0M ops/s peak, 2.6M at 24t); the shared latch scales to 7.0M at 24t --
2.1x at 8 threads, 2.7x at 24.  It captures roughly half the upper bound of
removing the mutex entirely; the remainder is the shared latch's own
reference-count cache line, which would require partitioning the locker
hash to recover (left for later -- this is the low-risk 80/20).  A small
single-thread regression (~8%) reflects the shared latch's slightly higher
uncontended cost and is dwarfed by the multi-core gain.

Verified: TCL lock001/002/003 (incl. multi-process), txn001/002, test001,
ssi001/002 pass; concurrent shared read-lock acquisition (lock_bench shared)
runs clean.
gburd changed the title from "bench: lock-subsystem sizing fix + deadlock-detection knob" to "perf(lock): shared locker latch on lock-get hot path (+ bench fixes/tooling)" Jun 20, 2026