perf(lock): shared locker latch on lock-get hot path (+ bench fixes/tooling) by gburd · Pull Request #27 · berkeleydb/libdb

gburd · 2026-06-20T01:24:49Z

Lock-manager scaling: shared locker latch on the lock-get hot path

This branch began as benchmark fixes during a lock-manager scaling
investigation and now also carries the engine win that investigation
found.

The bottleneck (found with a dedicated probe)

Every DB_ENV->lock_get/lock_put resolves its locker through
__lock_getlocker_int under the region-global locker mutex mtx_lockers.
On the lock-get path that lookup is create=0, a read-only walk of a
locker hash bucket, yet the mutex was held exclusive, serializing every lock
acquisition across all cores even with objects fully partitioned (240-way)
and zero lock conflict.

This was invisible in btree workloads (the earlier scale_bench rrand
profile is dominated by B-tree key-comparison cache misses, which hide the
lock serialization). A dedicated lock-only probe (lock_bench, added here)
exposed it: on a 24-thread box, master plateaus at ~3M lock ops/s and
declines past 8 threads.

The fix (3 lines of engine change)

Make mtx_lockers a DB_MUTEX_SHARED latch and take it shared for the
read-only locker lookup on the hot path. Locker create/free, the deadlock
detector's locker-list walk, failchk, and stat keep it exclusive, so a
reader never runs concurrently with a writer.

Measured (`lock_bench` distinct, no conflict, 24-thread box)

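| threads | master (ops/s) | this branch (ops/s) | upper bound* (ops/s) |
|---|---|---|---|
| 1 | 1.38M | 1.27M | 1.87M |
| 4 | 2.79M | 4.55M | 5.91M |
| 8 | 3.03M | 6.36M | 10.1M |
| 12 | 2.76M | 6.88M | 12.7M |
| 24 | 2.60M | 7.01M (2.7×) | 16.1M |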

* Upper bound: the mutex removed entirely (unsafe, diagnostic only). At 24
threads the shared latch reaches 7.01M of the 16.1M upper bound, a little
under half; closing the rest needs partitioning the locker hash (more
invasive, deferred; this is the low-risk 80/20). Single-thread throughput
drops ~8% (1.38M to 1.27M; shared latch vs plain mutex, uncontended), which
is dwarfed by the multi-core gain.
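
The ratios quoted in this section can be rechecked from the table. The sketch below copies the table values (in millions of lock ops/s) into an array and derives the speedup over master and the fraction of the upper bound; `row_t`, `speedup`, and `fraction_of_bound` are names introduced here for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* One table row; throughput values are in millions of lock ops/s. */
typedef struct {
    int threads;
    double master;
    double branch;
    double bound;
} row_t;

static const row_t rows[] = {
    {1,  1.38, 1.27, 1.87},
    {4,  2.79, 4.55, 5.91},
    {8,  3.03, 6.36, 10.1},
    {12, 2.76, 6.88, 12.7},
    {24, 2.60, 7.01, 16.1},
};
static const size_t nrows = sizeof rows / sizeof rows[0];

/* Throughput of this branch relative to master at the same thread count. */
static double speedup(const row_t *r) {
    assert(r->master > 0.0);
    return r->branch / r->master;
}

/* Share of the no-mutex upper bound that the shared latch reaches. */
static double fraction_of_bound(const row_t *r) {
    assert(r->bound > 0.0);
    return r->branch / r->bound;
}
```

Calling `speedup` on the 8-thread and 24-thread rows gives about 2.10 and 2.70, the 1-thread row gives about 0.92 (the ~8% single-thread cost), and `fraction_of_bound` on the 24-thread row gives about 0.44.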

No regression on real workloads: rrand unchanged (btree-bound),
tproc_b flat (deadlock/disk-bound). The fix helps where the bottleneck is
and costs nothing where it isn't.

Also in this branch (benchmark tooling/fixes)

- fix: size the lock subsystem + log buffer so the TPROC drivers don't ENOMEM at scale (a real bug in the merged suite).
- lock_bench: the direct lock-manager probe used to find this.
- `-D` knob: A/B synchronous vs periodic deadlock detection.

Verified

TCL lock001/002/003 (incl. multi-process), txn001/002, test001,
ssi001/002 pass; concurrent shared read-lock acquisition runs clean; clean
build (gcc via Nix, Apple clang).

The TPROC drivers opened the environment with default lock-region sizing (~1000 locks/objects/lockers) and a tiny default log buffer. A batched bulk load or a many-thread run exhausts those entries and fails mid-run with ENOMEM (BDB2055 'Lock table is out of available lock entries', BDB1501 'Logging region out of memory'), and an unchecked failure during populate could leave a partially built environment that crashes on reuse. Size the lock subsystem (200k locks/objects/lockers) and the log buffer (16MB) when the corresponding subsystems are enabled. Verified populate + run at scale 5 and 50.

Add a -D N toggle to the shared harness: 0 (default) keeps BDB's detect-on-every-conflict behavior; N>0 disables inline detection and runs a background deadlock detector every N ms instead. Lets a run A/B the cost of synchronous vs periodic deadlock detection. Measurement tooling only; no engine change. (A/B on a 12-core box found the two modes within noise on the contended debit/credit workload.)
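
The semantics of the toggle can be sketched as a small option parser. This is an illustration under assumptions, not the harness code: `detect_cfg_t` and `parse_detect_option` are hypothetical names, and how the harness maps the result onto Berkeley DB calls (for example `DB_ENV->set_lk_detect` for inline detection and `DB_ENV->lock_detect` for a periodic pass) is also assumed rather than taken from the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical result of parsing "-D N". */
typedef struct {
    int  inline_detect;  /* 1: detect on every conflict (N == 0) */
    long period_ms;      /* > 0: background detector interval in ms */
} detect_cfg_t;

/*
 * Parse the argument of -D. Returns 0 on success, -1 if the argument
 * is not a non-negative decimal integer.
 */
static int parse_detect_option(const char *arg, detect_cfg_t *cfg) {
    char *end = NULL;
    long n;

    assert(arg != NULL && cfg != NULL);
    errno = 0;
    n = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0' || n < 0)
        return -1;

    cfg->inline_detect = (n == 0);  /* 0 keeps the default inline mode */
    cfg->period_ms = n;             /* N > 0 selects the periodic mode */
    return 0;
}
```

Keeping the two modes mutually exclusive in one struct makes an A/B run a single-flag change, which matches the stated purpose of the knob.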

A micro-benchmark that exercises the lock subsystem in isolation: each thread allocates its own locker and loops lock_get/lock_put on either per-thread (distinct, no-conflict) or shared read objects, with no access method or buffer pool in the path. This exposes lock-manager scaling that btree-bound workloads (e.g. scale_bench rrand) hide behind page cache misses. On a 24-thread box it shows the per-op global locker mutex plateauing throughput at ~8 threads.
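
The shape of such a probe can be sketched without Berkeley DB at all. In the sketch below a single `pthread_mutex_t` stands in for the per-op global locker mutex and each worker touches only its own object (the distinct, no-conflict mode); `worker_t`, `worker`, and `run_probe` are hypothetical names, and the real lock_bench calls lock_get/lock_put instead.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define MAX_THREADS 64

/* Stand-in for the region-global locker mutex taken on every op. */
static pthread_mutex_t global_latch = PTHREAD_MUTEX_INITIALIZER;

typedef struct {
    uint64_t iters;   /* operations this worker should perform */
    uint64_t ops;     /* operations completed */
    uint64_t object;  /* per-thread object: no conflict between workers */
} worker_t;

static void *worker(void *arg) {
    worker_t *w = arg;
    for (uint64_t i = 0; i < w->iters; i++) {
        pthread_mutex_lock(&global_latch);    /* stand-in for lock_get */
        w->object++;
        pthread_mutex_unlock(&global_latch);  /* stand-in for lock_put */
        w->ops++;
    }
    return NULL;
}

/* Run nthreads workers for iters ops each; return total ops completed. */
static uint64_t run_probe(int nthreads, uint64_t iters) {
    pthread_t tids[MAX_THREADS];
    worker_t ws[MAX_THREADS];
    uint64_t total = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    for (int i = 0; i < nthreads; i++) {
        ws[i] = (worker_t){ .iters = iters, .ops = 0, .object = 0 };
        int rc = pthread_create(&tids[i], NULL, worker, &ws[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tids[i], NULL);
        total += ws[i].ops;
    }
    return total;
}
```

Dividing the returned total by the wall-clock time of the call, for increasing thread counts, gives the kind of ops/s curve reported here: even though the workers never share an object, they all serialize on the one mutex.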

Every DB_ENV->lock_get / lock_put resolves its locker through __lock_getlocker_int under the region-global locker mutex (mtx_lockers). On the lock-get path the lookup is create=0 -- a read-only walk of the locker hash bucket -- yet it was held *exclusive*, serializing every lock acquisition across all cores even when objects are fully partitioned and there is no lock conflict.

Make mtx_lockers a DB_MUTEX_SHARED latch and take it in shared mode for the read-only locker lookup on the hot path (__lock_get_api). Locker create, free, the deadlock detector's locker-list walk, failchk, and stat continue to hold it exclusive, so they never run concurrently with a reader.

Measured with lab/bench/lock_bench (distinct mode, no lock conflict, on a 24-thread box): master plateaus and then declines past 8 threads (~3.0M ops/s peak, 2.6M at 24t); the shared latch scales to 7.0M at 24t -- 2.1x at 8 threads, 2.7x at 24. It captures roughly half the upper bound of removing the mutex entirely; the remainder is the shared latch's own reference-count cache line, which would require partitioning the locker hash to recover (left for later -- this is the low-risk 80/20). A small single-thread regression (~8%) reflects the shared latch's slightly higher uncontended cost and is dwarfed by the multi-core gain.

Verified: TCL lock001/002/003 (incl. multi-process), txn001/002, test001, ssi001/002 pass; concurrent shared read-lock acquisition (lock_bench shared) runs clean.

gburd added 4 commits June 19, 2026 21:09

gburd changed the title ~~bench: lock-subsystem sizing fix + deadlock-detection knob~~ perf(lock): shared locker latch on lock-get hot path (+ bench fixes/tooling) Jun 20, 2026

gburd wants to merge 4 commits into master from perf/lock-scaling
