Skip to content

perf(mpool): #2/#7 prototype — measured that false-sharing isn't the cap#19

Open
gburd wants to merge 1 commit into master from perf/mpool-pin

@gburd gburd commented Jun 17, 2026

Collaborator

Prototypes and measures the first candidate fix for the read-scaling ceiling identified in docs/design/scaling-findings.md.

Hypothesis

struct __bh packs the write-hot fields (pin ref count, LRU priority — written on every __memp_fget/__memp_fput) into the same cache line as the read-mostly identity fields (pgno/mf_offset/flags/hq) that every concurrent hash-chain walk of a hot (btree root) buffer reads. So each pin would invalidate the line all readers need just to traverse/match.
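To make the hypothesis concrete, here is a minimal sketch of the kind of layout it describes. The struct below is a hypothetical, simplified stand-in, not the real `struct __bh`: the field names follow the description above, but the types, order, and the 64-byte line size are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed line size; some CPUs use 128 */

/* Hypothetical stand-in for the packed buffer header (not the real layout). */
struct bh_packed {
    uint32_t ref;       /* write-hot: pin count, written on every fget/fput */
    uint32_t priority;  /* write-hot: LRU priority */
    uint32_t pgno;      /* read-mostly: page number matched during the walk */
    uint32_t flags;     /* read-mostly */
    uint64_t mf_offset; /* read-mostly: identifies the owning file */
    void    *hq_next;   /* read-mostly: hash-chain link */
};

/* Cache-line index of a byte offset, assuming the struct starts on a
 * line boundary. */
static inline size_t line_of(size_t offset) { return offset / CACHE_LINE; }

/* The whole header fits in one line, so hot and cold fields must share it. */
static_assert(sizeof(struct bh_packed) <= CACHE_LINE,
              "packed header spans more than one cache line");
```

In this layout a write to `ref` dirties the same line that a concurrent chain walk reads for `pgno` and `hq_next`, which is the false-sharing pattern the hypothesis proposes.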

Change

Isolate the write-hot fields on their own cache line, behind MPOOL_HOTFIELDS_ISOLATED (one-line A/B). Off by default.
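One way such a guarded isolation could look is sketched below. Only the macro name `MPOOL_HOTFIELDS_ISOLATED` comes from this PR; the struct, the `BH_HOT_ALIGN` helper, the 64-byte line size, and the use of C11 `alignas` are assumptions for illustration, and the actual patch may pad or order the fields differently.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed line size */

/* The PR ships with the guard off; it is set to 1 here only so the sketch
 * shows the isolated layout. */
#ifndef MPOOL_HOTFIELDS_ISOLATED
#define MPOOL_HOTFIELDS_ISOLATED 1
#endif

#if MPOOL_HOTFIELDS_ISOLATED
#define BH_HOT_ALIGN alignas(CACHE_LINE) /* hypothetical helper macro */
#else
#define BH_HOT_ALIGN
#endif

/* Hypothetical, simplified buffer header (not the real struct __bh). */
struct bh_sketch {
    /* Read-mostly identity/traversal fields. */
    uint32_t pgno;
    uint32_t flags;
    uint64_t mf_offset;
    void    *hq_next;

    /* Write-hot fields: start a fresh cache line when isolated. */
    BH_HOT_ALIGN uint32_t ref;
    uint32_t priority;
};
```

With the guard on, `ref` starts at offset 64 and the struct is padded to a multiple of the line size, which is the per-buffer memory cost. The separation only holds if the allocator actually places each header on a line boundary.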

Measured (controlled interleaved A/B, medians, 12-core)

| workload | threads | packed | isolated | delta |
|---|--:|--:|--:|--:|
| rrand | 8 | 486,745 | 489,564 | +0.6% |
| rrand | 12 | 390,927 | 390,416 | -0.1% |
| snap | 8 | 518,422 | 514,260 | -0.8% |
| snap | 12 | 408,213 | 409,415 | +0.3% |
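The delta column appears to be the relative change of the isolated median over the packed median. A small helper to recompute it (the function name is made up for this sketch):

```c
#include <assert.h>

/* Relative change of the isolated median versus the packed median,
 * in percent. Positive means the isolated layout was faster. */
static inline double delta_pct(double packed, double isolated)
{
    return (isolated - packed) / packed * 100.0;
}
```

For example, `delta_pct(486745, 489564)` is roughly +0.58 and `delta_pct(518422, 514260)` is roughly -0.80, matching the rounded rrand/8 and snap/8 rows.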

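No effect (all deltas within ±0.8%, with no consistent sign). The cap is true sharing of the atomic counters (bhp->ref + the shared-latch share-counts), not false sharing: every reader still performs an atomic RMW on the same words, so relocating them can't help.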

Why it's still useful

It rules out the cheap fix with data and refines #2/#7: the per-read shared-counter RMW must be removed (optimistic/versioned access — needs epoch reclamation BDB lacks — or a sharded pin count), not relocated. Kept guarded + off to re-A/B on the 24-core Linux box (currently unreachable) where the futex-dominated ceiling was characterized. Smoke-tested write + MVCC-freeze paths; full TCL regression required before any default-on change.
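As a rough illustration of the sharded-pin-count direction, the sketch below spreads the pin count over per-slot counters, each on its own cache line, so that concurrent readers mostly touch different lines. Everything here is hypothetical (names, shard count, slot selection, the 64-byte line size); none of it is existing BDB code, and it leaves out how eviction would safely observe a zero total.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed line size */
#define PIN_SHARDS 8  /* hypothetical; would be sized to the core count */

/* One counter per shard, each padded out to its own cache line. */
struct pin_shard {
    alignas(CACHE_LINE) atomic_int count;
};

struct sharded_pin {
    struct pin_shard shard[PIN_SHARDS];
};

static inline void pin_init(struct sharded_pin *p)
{
    for (size_t i = 0; i < PIN_SHARDS; i++)
        atomic_init(&p->shard[i].count, 0);
}

/* slot would typically be a per-thread or per-CPU index. */
static inline void pin_acquire(struct sharded_pin *p, unsigned slot)
{
    atomic_fetch_add(&p->shard[slot % PIN_SHARDS].count, 1);
}

/* A release may land on a different shard than its acquire (for example
 * after thread migration), so a single shard can go negative; only the
 * sum over all shards is meaningful. */
static inline void pin_release(struct sharded_pin *p, unsigned slot)
{
    atomic_fetch_sub(&p->shard[slot % PIN_SHARDS].count, 1);
}

/* Not an atomic snapshot: shards are read one at a time. */
static inline int pin_total(struct sharded_pin *p)
{
    int sum = 0;
    for (size_t i = 0; i < PIN_SHARDS; i++)
        sum += atomic_load(&p->shard[i].count);
    return sum;
}
```

The trade-off is that pinning gets cheaper under contention while the "is this buffer unpinned?" check gets more expensive and racy: it has to read every shard, and would need some extra protocol before a buffer could be evicted. It also multiplies the per-buffer memory cost by the shard count.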

Builds clean in the default (off) configuration. See docs/design/scaling-findings.md, Prototype 1.

Prototype the #2/#7 hypothesis that the buffer-header write-hot fields (pin
ref count + LRU priority) false-share a cache line with the read-mostly
identity/traversal fields every hash-chain walk reads. Isolate them on their
own line behind MPOOL_HOTFIELDS_ISOLATED (off by default).

Controlled interleaved A/B (packed vs isolated, medians, 12-core box): no
effect (all deltas within +/-0.8%). The read-path cap is TRUE sharing of the
atomic counters (bhp->ref and the shared-latch share-counts), not false
sharing -- relocating the words cannot help. Left off by default (only adds
per-buffer memory); kept guarded to re-A/B on the 24-core Linux box.

Refines the #2/#7 direction: the per-read shared-counter RMW must be removed
(optimistic/versioned access needing epoch reclamation, or a sharded pin
count), not relocated. Documented in docs/design/scaling-findings.md.