docs(design): buffer swip + optimistic descent + AIO plan#20

Open
gburd wants to merge 3 commits into
master from
docs/swip-aio-design

@gburd gburd commented Jun 17, 2026

Collaborator

Plan of record for the next scalability push (companion to scaling-findings.md).

Goal: make hot-page reads write-free so read throughput scales past 300 cores, without a full buffer-manager rewrite.

Approach (staged):

  • Swip: a tagged roff_t child reference, with the state (HOT/COOL/COLD/IN_FLIGHT) in the low bits, stored in a per-internal-frame in-memory shadow vector. It is not part of the persisted page image, and it is a roff_t rather than a raw pointer for multi-process safety. One substrate for replacement, scan resistance, and async I/O state.
  • Stage 0: replace the per-access LRU priority write with a clock/cool state machine (scan-resistant; deletes the global counter + __memp_reset_lru).
  • Stage 1: BH_WIRED internal pages + LSN-validated optimistic descent. Wiring the small, hot, rarely-modified internal set sidesteps general epoch reclamation, so optimistic readers never touch freed memory. Removes the three per-read shared-counter RMWs on internal levels.
  • Stage 2: os_aio abstraction (io_uring/kqueue/IOCP/POSIX) for prefetch (COLD→IN_FLIGHT) and async trickle writeback; respects the WAL rule; multi-process degrades to today's in-transit wait.
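
To make the swip and the Stage 0 clock/cool transitions concrete, here is a minimal sketch. It is not the branch's code: `swip_t` is a hypothetical stand-in for roff_t, the two-bit tag layout and the 4-byte alignment are assumptions, and all function names are invented for illustration. The part worth noting is `swip_touch`: a reader that finds the swip already HOT has nothing to store.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for BDB's roff_t (a region-relative offset). */
typedef uintptr_t swip_t;

/* Assumed encoding: two low bits hold the state, the rest is the payload. */
enum swip_state { SWIP_HOT = 0, SWIP_COOL = 1, SWIP_COLD = 2, SWIP_IN_FLIGHT = 3 };

#define SWIP_TAG_BITS 2u
#define SWIP_TAG_MASK ((swip_t)((1u << SWIP_TAG_BITS) - 1u))

/* Pack a payload and a state; the payload must leave the tag bits clear
 * (assumes at least 4-byte alignment). */
static swip_t swip_make(swip_t off, enum swip_state st) {
    assert((off & SWIP_TAG_MASK) == 0);
    return off | (swip_t)st;
}

static enum swip_state swip_state_of(swip_t s) {
    return (enum swip_state)(s & SWIP_TAG_MASK);
}

static swip_t swip_offset_of(swip_t s) {
    return s & ~SWIP_TAG_MASK;
}

static swip_t swip_with_state(swip_t s, enum swip_state st) {
    return swip_make(swip_offset_of(s), st);
}

/* Reader access: only a COOL swip is promoted. A HOT swip is returned
 * unchanged, so the caller has no reason to write it back. */
static swip_t swip_touch(swip_t s) {
    return swip_state_of(s) == SWIP_COOL ? swip_with_state(s, SWIP_HOT) : s;
}

/* One clock-hand visit: HOT cools, COOL becomes COLD (an eviction
 * candidate), COLD and IN_FLIGHT are left alone. */
static swip_t swip_sweep(swip_t s) {
    switch (swip_state_of(s)) {
    case SWIP_HOT:  return swip_with_state(s, SWIP_COOL);
    case SWIP_COOL: return swip_with_state(s, SWIP_COLD);
    default:        return s;
    }
}
```

In a shared shadow vector the write-back would need to be an atomic compare-and-swap; the sketch keeps the transitions pure so the state machine is easy to read.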

Grounded in LeanStore (optimistic lock coupling, swizzling, cooling replacement) and PostgreSQL (atomic pin, private refcount). Exact code anchors verified against master in docs/design/research/. Each stage is a separate branch, measured on the 24-core box before merge.
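
The optimistic-descent idea can be illustrated with a seqlock-style read, which is the general shape of optimistic lock coupling. This is a sketch under assumptions, not the design's actual validation: the `version` word stands in for the page LSN plus a "writer active" indication (the PR does not say how an in-progress write is detected), `struct node` and every function name are hypothetical, and the writer is assumed to already hold the page's exclusive latch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical internal-page header: even version = stable, odd = writer active. */
struct node {
    _Atomic uint64_t version;  /* stand-in for the page LSN used for validation */
    _Atomic uint32_t child;    /* the value a descending reader wants */
};

/* Optimistic read: returns 1 and fills *out if the snapshot validated,
 * 0 if the caller must retry or fall back to the latched path.
 * The reader performs no stores to shared memory. */
static int node_read_optimistic(struct node *n, uint32_t *out) {
    uint64_t v1 = atomic_load_explicit(&n->version, memory_order_acquire);
    if (v1 & 1u)
        return 0;                                   /* writer in progress */
    uint32_t c = atomic_load_explicit(&n->child, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);      /* order data read before recheck */
    uint64_t v2 = atomic_load_explicit(&n->version, memory_order_relaxed);
    if (v1 != v2)
        return 0;                                   /* page changed underneath us */
    *out = c;
    return 1;
}

/* Writer side, called with the exclusive latch held (latch not shown). */
static void node_write_begin(struct node *n) {
    uint64_t v = atomic_load_explicit(&n->version, memory_order_relaxed);
    atomic_store_explicit(&n->version, v + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);      /* odd version visible before data */
}

static void node_write_end(struct node *n) {
    uint64_t v = atomic_load_explicit(&n->version, memory_order_relaxed);
    atomic_store_explicit(&n->version, v + 1, memory_order_release);
}

static void node_set_child(struct node *n, uint32_t child) {
    node_write_begin(n);
    atomic_store_explicit(&n->child, child, memory_order_relaxed);
    node_write_end(n);
}
```

The sketch is only safe because the node's memory stays valid while a reader inspects it, which is the property that wiring the internal pages is meant to provide.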

gburd added 2 commits June 17, 2026 09:17
Plan of record for lifting the multi-core read-scaling ceiling without boiling
the ocean: a tagged roff_t swip (hot/cool/cold/in-flight) as the single
substrate for replacement, scan resistance, and async I/O; BH_WIRED internals +
LSN-validated optimistic descent to make hot-page reads write-free; a per-ENV
os_aio abstraction (io_uring/kqueue/IOCP/POSIX) for prefetch and trickle
writeback. Staged 0->1->2, each measured on the 24-core box. Code anchors
verified against master and captured in docs/design/research/.
Studied two sibling LeanStore/Umbra-style cooling buffer managers (libxtc
examples/06_sqlxtc bufmgr.c; noxu noxu-evictor). Key refinements:
- Robust scan resistance = probationary COOL admission + COOL-first eviction
  (never cool a HOT buffer while a COOL victim exists), not a single warmth
  counter. Add as Stage 0.5 in the existing bucket-scan evictor (no swip).
- Single-word pin/evict gate (ref -1 = EVICTING) to drop the
  TRYLOCK+recheck race in __memp_alloc.
- prefer-clean foreground eviction + background trickle (Stage 2).
- noxu's per-op CacheMode + two tracking sets (scan drained first) confirm the
  direction. Adopt principles within BDB's multi-process/persisted-page limits.
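
The single-word pin/evict gate from the list above might look roughly like the following. This is a hedged sketch using C11 atomics, not the code in __memp_alloc: the function names and the `REF_EVICTING` macro are hypothetical, and only the convention "ref -1 = EVICTING" comes from the commit message. A pin and an eviction claim race on the same word, so whichever compare-and-swap wins decides the outcome without a separate trylock and recheck.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define REF_EVICTING (-1)   /* sentinel: the evictor owns the buffer */

/* Take a pin unless the buffer is being evicted. */
static bool buf_try_pin(_Atomic int32_t *ref) {
    int32_t cur = atomic_load_explicit(ref, memory_order_relaxed);
    while (cur != REF_EVICTING) {
        /* On failure, cur is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak_explicit(ref, &cur, cur + 1,
                memory_order_acquire, memory_order_relaxed))
            return true;
    }
    return false;
}

/* Drop a pin previously taken with buf_try_pin. */
static void buf_unpin(_Atomic int32_t *ref) {
    int32_t prev = atomic_fetch_sub_explicit(ref, 1, memory_order_release);
    assert(prev > 0);
    (void)prev;
}

/* Claim the buffer for eviction; succeeds only when nobody holds a pin. */
static bool buf_try_evict(_Atomic int32_t *ref) {
    int32_t expected = 0;
    return atomic_compare_exchange_strong_explicit(ref, &expected, REF_EVICTING,
            memory_order_acquire, memory_order_relaxed);
}

/* Release the gate once the frame has been recycled. */
static void buf_evict_done(_Atomic int32_t *ref) {
    atomic_store_explicit(ref, 0, memory_order_release);
}
```

A caller that fails `buf_try_pin` would fall back to whatever the buffer manager does for a page that is no longer resident; that path is outside this sketch.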