docs(design): buffer swip + optimistic descent + AIO plan by gburd · Pull Request #20 · berkeleydb/libdb

gburd · 2026-06-17T13:17:40Z

Plan of record for the next scalability push (companion to scaling-findings.md).

Goal: make hot-page reads write-free so read throughput scales past 300 cores, without a full buffer-manager rewrite.

Approach (staged):

- Swip: a tagged roff_t child reference (HOT/COOL/COLD/IN_FLIGHT in the low bits) stored in a per-internal-frame in-memory shadow vector (not in the persisted page image; a roff_t, not a raw pointer, for multi-process safety). One substrate for replacement, scan resistance, and async I/O state.
- Stage 0: replace the per-access LRU priority write with a clock/cool state machine (scan-resistant; deletes the global counter + __memp_reset_lru).
- Stage 1: BH_WIRED internal pages + LSN-validated optimistic descent. Wiring the small, hot, rarely-modified internal set sidesteps general epoch reclamation, so optimistic readers never touch freed memory. Removes the three per-read shared-counter RMWs on internal levels.
- Stage 2: os_aio abstraction (io_uring/kqueue/IOCP/POSIX) for prefetch (COLD→IN_FLIGHT) and async trickle writeback; respects the WAL rule; multi-process degrades to today's in-transit wait.
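To make the swip encoding concrete, here is a minimal sketch of packing a state tag into the low bits of an offset. All identifiers (`swip_t`, `swip_make`, the `SWIP_*` constants) are hypothetical and not taken from libdb. `uint64_t` stands in for roff_t, and the sketch assumes frame offsets are at least 4-byte aligned so two low bits are free. What a COLD swip carries instead of a region offset (for example a page number) is left out.

```c
#include <stdint.h>

/* Hypothetical names throughout; uint64_t stands in for roff_t. */
typedef uint64_t swip_t;

/* Two low bits carry the state named in the plan. */
enum swip_tag {
    SWIP_HOT = 0,
    SWIP_COOL = 1,
    SWIP_COLD = 2,
    SWIP_IN_FLIGHT = 3
};

#define SWIP_TAG_MASK ((swip_t)3)

/* Pack an offset and a state; assumes the offset is 4-byte aligned. */
static inline swip_t swip_make(uint64_t off, enum swip_tag tag)
{
    return (off & ~SWIP_TAG_MASK) | (swip_t)tag;
}

/* Extract the state tag from the low bits. */
static inline enum swip_tag swip_get_tag(swip_t s)
{
    return (enum swip_tag)(s & SWIP_TAG_MASK);
}

/* Extract the offset with the tag bits cleared. */
static inline uint64_t swip_offset(swip_t s)
{
    return s & ~SWIP_TAG_MASK;
}

/* Same offset, new state (e.g. HOT to COOL when the clock hand passes). */
static inline swip_t swip_with_tag(swip_t s, enum swip_tag tag)
{
    return swip_make(swip_offset(s), tag);
}
```

Because the tag lives in the same word as the reference, a state transition is a single-word store (or compare-and-swap) on the shadow vector entry, not a separate field update.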

Grounded in LeanStore (optimistic lock coupling, swizzling, cooling replacement) and PostgreSQL (atomic pin, private refcount). Exact code anchors verified against master in docs/design/research/. Each stage is a separate branch, measured on the 24-core box before merge.
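For Stage 1, the optimistic read can be pictured roughly as below. This is a sketch under stated assumptions, not the planned implementation: `struct ipage` and both functions are hypothetical, a single `child` word stands in for the page contents, and a writer is assumed to hold an exclusive latch flag and bump the LSN on every modification. The reader writes nothing to shared memory; it snapshots the LSN, reads, then validates, and the caller falls back to the latched path on failure.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a wired internal page. */
struct ipage {
    _Atomic uint64_t lsn;       /* assumed to change on every modification */
    _Atomic bool     x_latched; /* assumed set while a writer is active */
    _Atomic uint64_t child;     /* the field an optimistic reader wants */
};

/*
 * Optimistic read: no stores to shared memory.
 * Returns false if a writer was active or the LSN moved; the caller
 * is then expected to retry or take the pessimistic path.
 */
static bool ipage_read_child(struct ipage *p, uint64_t *out)
{
    uint64_t before = atomic_load(&p->lsn);
    if (atomic_load(&p->x_latched))
        return false;

    uint64_t c = atomic_load(&p->child);

    /* Validate: no writer in progress and no completed modification. */
    if (atomic_load(&p->x_latched))
        return false;
    if (atomic_load(&p->lsn) != before)
        return false;

    *out = c;
    return true;
}

/* Writer side under the assumed protocol: latch, modify, bump LSN, unlatch. */
static void ipage_write_child(struct ipage *p, uint64_t child)
{
    atomic_store(&p->x_latched, true);
    atomic_store(&p->child, child);
    atomic_fetch_add(&p->lsn, 1);
    atomic_store(&p->x_latched, false);
}
```

The sketch uses sequentially consistent atomics throughout for simplicity; a tuned version would likely relax some orderings. It assumes a single writer at a time. It also relies on the page memory staying valid while the reader looks at it, which is the property wiring the internal pages is meant to provide.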

Plan of record for lifting the multi-core read-scaling ceiling without boiling the ocean: a tagged roff_t swip (hot/cool/cold/in-flight) as the single substrate for replacement, scan resistance, and async I/O; BH_WIRED internals + LSN-validated optimistic descent to make hot-page reads write-free; a per-ENV os_aio abstraction (io_uring/kqueue/IOCP/POSIX) for prefetch and trickle writeback. Staged 0->1->2, each measured on the 24-core box. Code anchors verified against master and captured in docs/design/research/.

Studied two sibling LeanStore/Umbra-style cooling buffer managers (libxtc examples/06_sqlxtc bufmgr.c; noxu noxu-evictor). Key refinements:

- Robust scan resistance = probationary COOL admission + COOL-first eviction (never cool a HOT buffer while a COOL victim exists), not a single warmth counter. Add as Stage 0.5 in the existing bucket-scan evictor (no swip).
- Single-word pin/evict gate (ref -1 = EVICTING) to drop the TRYLOCK+recheck race in __memp_alloc.
- Prefer-clean foreground eviction + background trickle (Stage 2).
- noxu's per-op CacheMode + two tracking sets (scan drained first) confirm the direction.

Adopt principles within BDB's multi-process/persisted-page limits.
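The single-word pin/evict gate mentioned above can be sketched as follows. The function names and the use of C11 atomics are assumptions for illustration; only the convention "ref -1 = EVICTING" comes from the notes. Because the pin count and the eviction claim share one word, a pinner and the evictor cannot both succeed, so no separate lock-then-recheck step is needed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define REF_EVICTING (-1)   /* sentinel: the buffer is claimed by the evictor */

/* Pin: increment the count unless the evictor already owns the buffer. */
static bool buf_try_pin(_Atomic int32_t *ref)
{
    int32_t cur = atomic_load(ref);
    while (cur != REF_EVICTING) {
        /* On failure, cur is reloaded with the current value and rechecked. */
        if (atomic_compare_exchange_weak(ref, &cur, cur + 1))
            return true;
    }
    return false;
}

/* Unpin: drop one reference previously taken with buf_try_pin. */
static void buf_unpin(_Atomic int32_t *ref)
{
    atomic_fetch_sub(ref, 1);
}

/* Evict: succeeds only if the buffer is unpinned (0 -> EVICTING). */
static bool buf_try_evict(_Atomic int32_t *ref)
{
    int32_t expected = 0;
    return atomic_compare_exchange_strong(ref, &expected, REF_EVICTING);
}
```

In a multi-process environment the word would live in the shared region, which assumes the platform's 32-bit atomics are lock-free and address-free.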

gburd added 2 commits June 17, 2026 09:17

docs(design): move research reports under docs/design/research

63a1bcf

gburd mentioned this pull request Jun 17, 2026

perf(mpool): Stage 0 — CLOCK/second-chance replacement (write-free hot reads) #21

Merged

gburd mentioned this pull request Jun 17, 2026

perf(mpool): Stage 1 — BH_WIRED + optimistic descent (WIP) #22

Draft

gburd wants to merge 3 commits into master from docs/swip-aio-design
