perf(mpool): Stage 1 — BH_WIRED + optimistic descent (WIP)#22
Draft
gburd wants to merge 12 commits into
Foundation for the optimistic descent: B-tree internal/root pages are wired so the frame is never reclaimed, letting a later lock-free descent read them without a use-after-free hazard (BDB has no epoch reclamation).

- struct __bh gains a dedicated 'wired' byte (not a flags bit: it is set with a plain monotonic store while the caller holds only a shared buffer latch, so it must not share the non-atomic RMW of the flags word that __memp_pgwrite uses to clear BH_DIRTY). Reset to 0 at every buffer-header (re)init site.
- __memp_alloc skips wired buffers when choosing a victim.
- __memp_wire() sets it; guarded against memory-mapped pages (whose page pointer is not a buffer frame -- caught a SIGBUS in test001 with mmap'd files).
- __bam_search wires P_IBTREE/P_IRECNO pages on descent (bounded: internal levels only, never leaves).

No measurable perf change yet (internals were already hot-resident); this only guarantees residency for step 2 (LSN-validated optimistic descent).

Validated: NOSYNC forced-eviction integrity (50k/1MB); TCL test001 btree+hash, test003.
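A minimal sketch of the wired-byte idea described above, not the actual BDB code: the struct, field names, and functions (`toy_bh`, `toy_wire`, `toy_pick_victim`) are hypothetical stand-ins, and the real struct __bh and __memp_alloc are considerably more involved. It only illustrates why the mark lives in its own byte and how a victim scan can skip it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified buffer header (NOT the real struct __bh). */
struct toy_bh {
    uint32_t ref;    /* pin count; a pinned buffer is not evictable */
    uint16_t flags;  /* word updated elsewhere by non-atomic read-modify-write */
    uint8_t  wired;  /* separate byte: only ever written with a plain store */
};

/* Mark a buffer wired. A plain store to its own byte cannot lose or
 * clobber a concurrent read-modify-write of the flags word. */
static void toy_wire(struct toy_bh *bhp)
{
    bhp->wired = 1;
}

/* Return the index of the first evictable buffer, or -1 if there is none. */
static int toy_pick_victim(const struct toy_bh *bufs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (bufs[i].wired)     /* wired frames are never reclaimed */
            continue;
        if (bufs[i].ref != 0)  /* still pinned by some caller */
            continue;
        return (int)i;
    }
    return -1;
}
```

In this toy model, wiring buffer 0 of a three-buffer pool whose buffer 1 is pinned leaves buffer 2 as the only victim candidate, and the flags word is never touched by the wire operation.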
- __memp_unwire(): clears the wired mark so a freed frame is evictable again; called from __db_free (the single page-free chokepoint for all access methods) and from __memp_bhfree for the file/env-close discard path. The wired byte gates the counter so it is decremented exactly once.
- Per-region wired-page counter (MPOOL.wired_pages, atomic) with a cap of MPOOL_WIRED_MAX_PCT (25%) of the region's buffers: over the cap __memp_wire is a no-op and the descent uses a normal pin, so wiring can never starve the cache.
- db_stat -m reports 'Wired buffers (non-evictable)'.
- mmap guard on both wire and unwire (page ptr is not a buffer frame).

Validated: NOSYNC integrity 50k/1MB; TCL test001 btree+hash, test003, test011 (cursor splits/merges -> page frees exercise __memp_unwire).
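The cap and the decrement-exactly-once rule can be sketched roughly as follows. This is an illustration under assumptions, not the patch itself: `toy_region`, `toy_buf`, `toy_wire_capped`, `toy_unwire`, and `TOY_WIRED_MAX_PCT` are hypothetical names, C11 atomics stand in for BDB's own atomic primitives, and the sketch does not handle two threads wiring the same buffer at the same instant.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_WIRED_MAX_PCT 25  /* mirrors the 25% cap described above */

/* Hypothetical per-region state (stand-in for MPOOL.wired_pages). */
struct toy_region {
    uint32_t    nbuffers;     /* total buffers in the region */
    atomic_uint wired_pages;  /* how many are currently wired */
};

struct toy_buf {
    uint8_t wired;            /* gates the counter: 0 or 1 */
};

/* Try to wire a buffer. Returns false when the cap is reached, in which
 * case the caller would fall back to a normal pin. */
static bool toy_wire_capped(struct toy_region *r, struct toy_buf *b)
{
    unsigned cap = r->nbuffers * TOY_WIRED_MAX_PCT / 100;
    unsigned cur;

    if (b->wired)             /* already counted */
        return true;
    cur = atomic_load(&r->wired_pages);
    do {
        if (cur >= cap)       /* over the cap: wiring is a no-op */
            return false;
    } while (!atomic_compare_exchange_weak(&r->wired_pages, &cur, cur + 1));
    b->wired = 1;
    return true;
}

/* Clear the mark; the wired byte ensures a single decrement even if the
 * free path and the close path both call this for the same buffer. */
static void toy_unwire(struct toy_region *r, struct toy_buf *b)
{
    if (!b->wired)
        return;
    b->wired = 0;
    atomic_fetch_sub(&r->wired_pages, 1);
}
```

With 8 buffers the cap works out to 2, so a third wire attempt is refused until one of the first two is unwired; calling the unwire function twice on the same buffer decrements the counter only once.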
Resolve mpool conflicts keeping both the cooling (CLOCK warmth, probationary COOL admission, COOL-first eviction) and the Stage 1 wiring (BH_WIRED byte, __memp_wire/unwire, wired-page cap + counter, wired-buffers stat). db_stat drops the removed LRU priority/generation lines. Stage 1 now builds on the scan-resistant cooling base.
Per review: internal/subtree-root pages should stay in the normal evictable pool; only the single main tree root (BAM_ROOT_PGNO) -- fetched by every operation -- is wired, so it stays resident without churning eviction and the root snapshot can refresh cheaply. Move the __memp_wire call from the all-internals site in __bam_search to __bam_get_root, gated on h->pgno == BAM_ROOT_PGNO(dbc). Unwiring is already handled on page free (__db_free) and file close (__memp_bhfree). Validated: NOSYNC integrity 50k/1MB; TCL test001 btree+hash, test011.
Read lookups of the main tree no longer fetch (pin/latch) the contended live root. Each handle keeps a private immutable copy of the root taken at a known root LSN; a plain read of the live root LSN (via the wired root buffer) confirms the copy is current, the copy yields the descent's first child, and __bam_search starts from that child -- never touching the live root. Correctness: after the child is fetched, the live root LSN is re-checked (seqlock); if it changed (a split added a level, or a merge freed the child) the child is released and the descent restarts from the real root. Gated to plain read finds of a logged, durable, non-multiversion btree (where the page LSN reliably advances on root modification) -- everything else uses the normal descent. Old copies are retired to a free list and released at handle close (root changes are rare; no reader/free race, no epoch reclamation). Child selection reuses __bam_cmp and the exact __bam_search binary-search rule so the chosen child is identical to a normal descent. Validated: NOSYNC integrity 50k/1MB (logged, fast path active, all verified); TCL test001 btree+hash, test003, test011 (dups), test026. The first cut mis-gated non-logged envs (LSN never advances -> stale copy -> wrong results); fixed with the LOGGING_ON + durable gate. Concurrent stress + scaling on meh next.
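The validate, use, re-validate pattern described above can be sketched as below. Everything here is a simplified assumption for illustration: the types and functions (`toy_snapshot`, `toy_live_root`, `toy_pick_child`, `toy_fast_first_child`) are hypothetical, the LSN is modelled as a single 64-bit value, integer keys replace __bam_cmp, and a linear scan replaces the binary-search rule that the real change reuses from __bam_search.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical private, immutable copy of the root held by one handle. */
struct toy_snapshot {
    uint64_t lsn;       /* root LSN at the time the copy was taken */
    uint32_t nkeys;     /* number of separator keys in use */
    int      keys[4];   /* separator keys, ascending */
    uint32_t child[5];  /* child page numbers, nkeys + 1 in use */
};

/* Hypothetical view of the live (wired) root: only its LSN is read. */
struct toy_live_root {
    _Atomic uint64_t lsn;
};

/* Choose the child covering key (linear scan for brevity). */
static uint32_t toy_pick_child(const struct toy_snapshot *s, int key)
{
    uint32_t i = 0;

    while (i < s->nkeys && key >= s->keys[i])
        i++;
    return s->child[i];
}

/* Returns true and sets *childp when the snapshot can be trusted;
 * false means the caller must restart the descent from the real root. */
static bool toy_fast_first_child(struct toy_live_root *live,
    const struct toy_snapshot *snap, int key, uint32_t *childp)
{
    uint32_t child;

    /* 1. Plain read of the live root LSN: is the copy current? */
    if (atomic_load_explicit(&live->lsn, memory_order_acquire) != snap->lsn)
        return false;

    /* 2. The copy yields the first child of the descent. */
    child = toy_pick_child(snap, key);

    /* (The real code would fetch and pin the child page here.) */

    /* 3. Re-check: if the root changed meanwhile, the child may be
     *    stale, so it is released and the descent restarts. */
    if (atomic_load_explicit(&live->lsn, memory_order_acquire) != snap->lsn)
        return false;

    *childp = child;
    return true;
}
```

The point of the second check is that a root modification between steps 1 and 3 is always observed as an LSN change, which is why the fast path is restricted to configurations where the root LSN reliably advances.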
rrand 200k/3s on meh (24t, tmpfs): snapshot beats master at every thread count (+22-29% at 4-8t) but both peak ~8t and negatively scale to 24t. The snapshot raises the read-scaling ceiling without removing it; at 24t the bottleneck has moved to the lock-manager locker region (lockers% 51-67%, lockpart% ~0.1). Real measured win worth landing; multicore scaling past ~8 cores now bounded by the lock manager (ROADMAP #4).
…neck perf on meh (24t, snapshot): 40.5% of time is futex wait under __db_pthread_mutex_lock, split between __db_cursor_int (cursor alloc) and __dbc_close (cursor free) -- the per-get transient cursor linked/unlinked on the ONE shared DB handle's active-cursor queue (dbp->mutex). Per-thread handles (sepdb) run +49% at 24t and scale near-linearly to 8t, proving it. Next bottleneck underneath: __memp_fget/fput hash-bucket latch + refcount atomics per descent page (root snapshot removed only the root fetch). Benchmark critique: aggregate metric is sound; it induces lock-manager traffic via DB_INIT_LOCK|TXN reads (should also measure READ_UNCOMMITTED); targ_t.ops false-shares (latent, not in profile); meh is 12c/24t so the 12->24 tail is HT + all-core turbo, and peak-at-8 is software contention. Conclusion: the next scaling fix is the per-get cursor-allocation mutex -- NOT Stage 1c (blocked, only partial) nor Stage 2 #5 (orthogonal: rrand is 100% cache hits, zero I/O).
Each thread opens its own handle on the SAME bench.db (removing the shared-handle cursor-queue mutex app-side) and reads under a selectable isolation level (none/read-committed/snapshot/uncommitted) to measure how far BDB scales with full transactional isolation -- not requiring uncommitted reads. Per-thread state is cache-line padded.
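For reference, cache-line padding of per-thread benchmark state usually looks something like the sketch below. The struct and its fields (`toy_thread_state`, `ops`, `errors`) and the 64-byte line size are assumptions for illustration, not the benchmark's actual layout.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_LINE 64  /* assumed line size; hardware-dependent */

/* Hypothetical per-thread counters. Aligning the first member to the
 * cache line raises the struct's alignment, and the compiler rounds its
 * size up to a multiple of that alignment, so adjacent array elements
 * never share a line. */
struct toy_thread_state {
    alignas(TOY_CACHE_LINE) uint64_t ops;
    uint64_t errors;
};

_Static_assert(sizeof(struct toy_thread_state) % TOY_CACHE_LINE == 0,
    "each element must occupy whole cache lines");
_Static_assert(alignof(struct toy_thread_state) == TOY_CACHE_LINE,
    "elements must start on a cache-line boundary");

/* Aggregate the per-thread counters after the run. */
static uint64_t toy_total_ops(const struct toy_thread_state *t, size_t n)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += t[i].ops;
    return total;
}
```

Each thread increments only its own element during the run, and the totals are summed once at the end, so the hot counters do not bounce between cores.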
scale_iso (per-thread handle on shared bench.db): full-isolation reads
('none', per-op page read locks) scale identically to uncommitted (668k
vs 656k @ 24t, 3x to 8t) -- isolation is NOT the scaling barrier, the
shared handle was. Per-op explicit txns ('rc') collapse past 8t (txn/
locker/log machinery = bottleneck #3); long-lived MVCC ('snap') avoids it.
Documents the cursor-allocation fix design (sharded queues recommended,
needs full run_std; ~+47% prize vs shared-handle path).
Scaling measurements, profiling, and design exploration are development notes, not user-facing documentation; they do not belong in ./docs (which should track the code). Moved to the agent notes area, which is never committed. (The same file remains on master from an earlier PR and should be removed there in a follow-up.)
Stage 1 of the buffer-swip plan (PR #20): lift the multi-core read ceiling by removing the per-op pin/latch RMW on the hot root/internal B-tree pages.
Step 1 (this commit): BH_WIRED foundation. Internal/root pages are pinned resident (a dedicated `wired` byte; evictor skips them; `__memp_wire` set from `__bam_search`, guarded against mmap'd pages). This guarantees the frame can't be reclaimed under a lock-free reader, which is the prerequisite for the optimistic descent, since BDB has no epoch reclamation. No perf change yet.

Step 2 (next): LSN-validated optimistic descent. Read wired internals without the pin/latch: snapshot `LSN(h)`, search the node, re-check `LSN(h)`; restart on the (rare) split. Gated on `mfp->multiversion==0`. This is the change that removes the shared-counter RMW and should let reads scale toward 24 cores. Needs concurrent-split stress testing + meh measurement before it leaves draft.

Validated so far: NOSYNC forced-eviction integrity (50k/1MB); TCL test001 btree+hash, test003. (Caught + fixed a SIGBUS: `__memp_wire` on mmap'd pages.)