docs(design): buffer swip + optimistic descent + AIO plan #20
Open
gburd wants to merge 3 commits into
Plan of record for lifting the multi-core read-scaling ceiling without boiling the ocean: a tagged roff_t swip (hot/cool/cold/in-flight) as the single substrate for replacement, scan resistance, and async I/O; BH_WIRED internals + LSN-validated optimistic descent to make hot-page reads write-free; a per-ENV os_aio abstraction (io_uring/kqueue/IOCP/POSIX) for prefetch and trickle writeback. Staged 0->1->2, each measured on the 24-core box. Code anchors verified against master and captured in docs/design/research/.
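As a rough illustration of the tagged swip described above, the sketch below packs a four-state tag into the low two bits of an offset-sized word. Everything here is an assumption for illustration: `swip_t` is a stand-in for BDB's `roff_t`, the helper names are hypothetical, and the payload is assumed to be at least 4-byte aligned so the low bits are free.

```c
#include <stdint.h>

/* Hypothetical stand-in for roff_t (a region-relative offset, not a raw
 * pointer); the real typedef lives in the BDB headers. */
typedef uintptr_t swip_t;

/* Four states fit in two low bits (assumed encoding, not from the plan). */
enum swip_tag { SWIP_HOT = 0, SWIP_COOL = 1, SWIP_COLD = 2, SWIP_IN_FLIGHT = 3 };

#define SWIP_TAG_MASK ((swip_t)0x3)

/* Combine a payload (low two bits must be zero) with a state tag. */
static inline swip_t swip_make(swip_t payload, enum swip_tag tag)
{
    return (payload & ~SWIP_TAG_MASK) | (swip_t)tag;
}

/* Extract the state tag from the low bits. */
static inline enum swip_tag swip_get_tag(swip_t s)
{
    return (enum swip_tag)(s & SWIP_TAG_MASK);
}

/* Extract the payload with the tag bits cleared. */
static inline swip_t swip_payload(swip_t s)
{
    return s & ~SWIP_TAG_MASK;
}

/* Change the state while keeping the payload, e.g. HOT -> COOL. */
static inline swip_t swip_retag(swip_t s, enum swip_tag tag)
{
    return swip_make(swip_payload(s), tag);
}
```

Because the tag and the payload share one word, a state transition is a single-word update, which is what would let one value serve replacement, scan resistance, and I/O state at once.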
Studied two sibling LeanStore/Umbra-style cooling buffer managers (libxtc examples/06_sqlxtc bufmgr.c; noxu noxu-evictor). Key refinements:
- Robust scan resistance = probationary COOL admission + COOL-first eviction (never cool a HOT buffer while a COOL victim exists), not a single warmth counter. Add as Stage 0.5 in the existing bucket-scan evictor (no swip).
- Single-word pin/evict gate (ref -1 = EVICTING) to drop the TRYLOCK+recheck race in __memp_alloc.
- Prefer-clean foreground eviction + background trickle (Stage 2).
- noxu's per-op CacheMode + two tracking sets (scan drained first) confirm the direction.
Adopt principles within BDB's multi-process/persisted-page limits.
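The single-word pin/evict gate mentioned above could look roughly like the following C11 sketch. It is an assumption-laden illustration, not the code in libxtc, noxu, or BDB: the `buf_gate` type and function names are hypothetical, and a real BDB version would need its own atomics layer and shared-region placement for multi-process use.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define REF_EVICTING (-1)   /* sentinel from the notes above: ref -1 = EVICTING */

/* Hypothetical per-buffer gate: one word holds either a pin count (>= 0)
 * or the EVICTING sentinel. */
typedef struct { _Atomic int32_t ref; } buf_gate;

/* Pin succeeds unless an evictor has claimed the buffer. */
static bool gate_try_pin(buf_gate *g)
{
    int32_t cur = atomic_load_explicit(&g->ref, memory_order_relaxed);
    while (cur != REF_EVICTING) {
        /* On failure, cur is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak_explicit(&g->ref, &cur, cur + 1,
                memory_order_acquire, memory_order_relaxed))
            return true;
    }
    return false;
}

static void gate_unpin(buf_gate *g)
{
    atomic_fetch_sub_explicit(&g->ref, 1, memory_order_release);
}

/* Evictor claims the buffer only if nobody holds a pin (0 -> EVICTING). */
static bool gate_try_evict(buf_gate *g)
{
    int32_t expected = 0;
    return atomic_compare_exchange_strong_explicit(&g->ref, &expected,
            REF_EVICTING, memory_order_acquire, memory_order_relaxed);
}

/* After the frame is reused, reopen the gate for pinning. */
static void gate_evict_done(buf_gate *g)
{
    atomic_store_explicit(&g->ref, 0, memory_order_release);
}
```

Since pinning and claiming for eviction are both compare-and-swap operations on the same word, a pin can never slip in between an evictor's check and its claim, which is the window a separate trylock plus recheck has to guard.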
Plan of record for the next scalability push (companion to scaling-findings.md).
Goal: make hot-page reads write-free so read throughput scales past 300 cores, without a full buffer-manager rewrite.
Approach (staged):
- Tagged `roff_t` child reference (HOT/COOL/COLD/IN_FLIGHT in the low bits) stored in a per-internal-frame in-memory shadow vector (not in the persisted page image; `roff_t`, not a raw pointer, for multi-process safety). One substrate for replacement, scan resistance, and async I/O state.
- Replace the `priority` write with a clock/cool state machine (scan-resistant; deletes the global counter + `__memp_reset_lru`).
- `BH_WIRED` internal pages + LSN-validated optimistic descent: wiring the small, hot, rarely-modified internal set sidesteps general epoch reclamation, so optimistic readers never touch freed memory. Removes the three per-read shared-counter RMWs on internal levels.
- Per-ENV `os_aio` abstraction (io_uring/kqueue/IOCP/POSIX) for prefetch (COLD→IN_FLIGHT) and async trickle writeback; respects the WAL rule; multi-process degrades to today's in-transit wait.

Grounded in LeanStore (optimistic lock coupling, swizzling, cooling replacement) and PostgreSQL (atomic pin, private refcount). Exact code anchors verified against master in `docs/design/research/`. Each stage is a separate branch, measured on the 24-core box before merge.
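To make the "write-free read" idea concrete, here is a minimal sketch of a version-validated optimistic read in the seqlock style. It is a generic pattern, not the plan's actual mechanism: the `version` word merely stands in for the page LSN, the odd/even write-in-progress convention is an assumption (the description does not say how in-progress writes are detected), and the `ipage` type and function names are hypothetical. A single writer at a time is assumed, for example under an exclusive latch.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wired internal page: a version word (standing in for the
 * LSN) plus one field a descending reader wants to read. */
typedef struct {
    _Atomic uint64_t version;  /* odd = write in progress (assumed convention) */
    _Atomic uint64_t child;    /* stand-in for a child reference */
} ipage;

/* Reader: no stores to shared memory. Returns false if the page changed
 * (or was changing) during the read; the caller would retry or fall back
 * to a latched read. */
static bool optimistic_read_child(ipage *p, uint64_t *out)
{
    uint64_t before = atomic_load_explicit(&p->version, memory_order_acquire);
    if (before & 1)
        return false;                       /* writer active */
    uint64_t v = atomic_load_explicit(&p->child, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    uint64_t after = atomic_load_explicit(&p->version, memory_order_relaxed);
    if (before != after)
        return false;                       /* modified underneath us */
    *out = v;
    return true;
}

/* Writer side (single writer assumed): make the version odd before
 * touching the page, then even again once the change is complete. */
static void page_write_begin(ipage *p)
{
    uint64_t v = atomic_load_explicit(&p->version, memory_order_relaxed);
    atomic_store_explicit(&p->version, v + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
}

static void page_write_end(ipage *p)
{
    uint64_t v = atomic_load_explicit(&p->version, memory_order_relaxed);
    atomic_store_explicit(&p->version, v + 1, memory_order_release);
}
```

The reader only loads, so hot internal pages stay in a shared cache state across cores. The sketch is only safe if the page memory stays valid while it is being read, which is the property that wiring the internal set is meant to provide.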