diff --git a/docs/design/buffer-swip-aio.md b/docs/design/buffer-swip-aio.md new file mode 100644 index 000000000..486f00343 --- /dev/null +++ b/docs/design/buffer-swip-aio.md @@ -0,0 +1,310 @@ +# Scalable buffer access: tagged swip, optimistic descent, and async I/O + +Status: design / plan of record. Companion to +[`scaling-findings.md`](scaling-findings.md). Code anchors verified against +`master` and recorded in `research/mpool-btree.md` and `research/os-aio.md`. + +## 1. Problem + +Measured (24-core Xeon, `lab/bench/scale_bench`): in-cache random reads peak at +~8 threads and then *negatively scale*. `perf` self-time is dominated by +`futex` + `__atomic_inc/dec`. The cause is not a lock we can shard: every +B-tree descent re-pins the root and upper internal pages, and each pin performs +**atomic read-modify-writes on shared words** for a page touched by every thread +on every operation: + +1. the bucket latch share-count (`hp->mtx_hash`, already a shared latch), +2. the buffer pin count (`bhp->ref`, `mp_fget.c:296`), +3. the buffer latch share-count (`bhp->mtx_buf`, `mp_fget.c:318`), + +plus a fourth write on the way out: the per-access LRU bump +(`bhp->priority = c_mp->lru_priority`, `mp_fput.c:215`). + +A controlled A/B already proved that **cache-line isolation of those fields does +nothing** (±0.6%): the cost is *true* sharing of the words themselves, not false +sharing of neighbours. + +### The hardware rule that governs >300-core NUMA + +A **read** of a shared cache line is nearly free — the line is replicated in +Shared state in every socket simultaneously. A **write** to a shared line +invalidates every replica and forces a cross-socket read-for-ownership. At +hundreds of cores across many sockets, any per-operation write to a shared line +is the scaling wall. 
Therefore: + +> **The read path must perform zero stores to shared memory.** + +Sharded/per-core pin counts only relocate one of four writes and still store on +every read; they cannot reach this bar. Only **optimistic, version-validated +reads** (readers write nothing) do. That is what LeanStore's Optimistic Lock +Coupling achieves and is the design we adopt. + +## 2. Principles + +1. **Write-free reads of hot pages.** Reads of resident root/internal pages + take no latch share-count, bump no pin count, and write no LRU field. They + validate a version instead. +2. **Exploit the structure instead of generalizing.** The hot pages are the + B-tree root and upper internal nodes: a tiny fraction of all pages, almost + never evicted, almost never modified. We make exactly those pages cheap and + leave leaves on the existing path. This sidesteps the hardest LeanStore + prerequisite (general epoch reclamation) — see §5. +3. **One substrate for everything.** A single tagged child pointer (the *swip*) + carries replacement state (hot/cool/cold), residency, and async-I/O state + (in-flight). Replacement, scan resistance, prefetch, and the descent fast + path all read the same word. Build it once. +4. **Measure each stage** on the 24-core box before/after; do not enable a + default-on change without the full TCL regression. + +## 3. The swip (tagged `roff_t`) + +A *swip* is a child reference stored by a parent internal frame. BDB buffer +frames are aligned, so the low bits of a region offset (`roff_t`) are free +(3 with 8-byte alignment, 6 with cache-line alignment). 
We tag them exactly as +LeanStore tags its pointer: + +| state | low bits | payload (high bits) | descent action | +|---|---|---|---| +| `COLD` | `00` | — (child is the on-disk `pgno` in the page image) | normal `__memp_fget` path; optionally issue prefetch | +| `HOT` | `01` | `roff_t` of the buffer frame | follow directly, read under LSN validation — **pure read** | +| `COOL` | `10` | `roff_t` of the buffer frame | follow; reheat to `HOT` (a rare write, only at the hot/cold boundary) | +| `IN_FLIGHT` | `11` | `roff_t` of the reserved frame | async read outstanding — wait on the in-transit latch | + +Two BDB-specific constraints shape where the swip lives: + +1. **Multi-process shared region.** A raw swizzled pointer is valid in only one + address space. The swip is therefore a **`roff_t`**, resolved with + `R_ADDR(region, off)` (base + offset). This works for both the >300-core + single-process-multithreaded target and BDB's classic multi-process mode. + Raw-pointer swizzling stays available later as a single-process-only fast + path (Stage 3), not now. +2. **Pages are persisted images.** A btree internal page stores child *page + numbers* on disk (`BINTERNAL.pgno`, `bt_search.c:476`), and that image lives + in the buffer. We must not overwrite a child slot with a swip or we would + persist a transient pointer. So the swip lives in an **in-memory shadow + vector** attached to the internal frame (parallel to the page's child slots), + allocated lazily at wire time near `mp_region.c:289`, not in the page image. + +Reads of a `HOT` swip are pure loads (Shared-state, replicated per socket). +Writes happen only on rare transitions (swizzle/unswizzle, hot↔cool, I/O). The +root and upper internals stay `HOT`, so they are never written on the hot path. + +## 4. 
Replacement = the hot/cool/cold state machine (Stage 0) + +Today every `__memp_fput` writes `bhp->priority` (`mp_fput.c:215`), bumps a +global counter (`mp_fput.c:269-271`), and a periodic `__memp_reset_lru` +(`mp_fput.c:279-331`) sweeps the whole cache. That is a shared write per access +and an O(cache) sweep. Replace it with a clock / second-chance scheme whose +state is the swip's 2 bits (plus a per-frame referenced bit in `flags`): + +- **Read of a HOT page**: at most set a "referenced" bit, and only with a plain + store *if currently clear* (read-first, so the steady-state hot read is a pure + read). No global counter, no per-access priority write. +- **Eviction** (`__memp_alloc`, clock hand `c_mp->last_checked` at + `mp_alloc.c:182`): the hand demotes `HOT`→`COOL` (clears referenced / moves to + a cooling FIFO) and evicts `COOL` pages whose referenced bit is still clear. + Touching a `COOL` page reheats it to `HOT` (second chance). +- **Scan resistance** (the property you want): bulk-scanned pages enter `COOL` + and cycle out via the cooling hand without displacing the `HOT` working set — + the swip state *is* the scan-resistance mechanism, as in LeanStore. The hot + working set (root/upper internals) stays `HOT` and is never churned. + +This removes write #4 from the read path and deletes the global-counter + +`__memp_reset_lru` machinery. It is independent of the descent/AIO work and is +the foundation the swip sits on, so it ships first. + +Invariant preserved: eviction still requires `BH_REFCOUNT == 0` and the buffer +latch exclusive (`mp_alloc.c:485/489/498`). We only change *which* victim is +chosen and *how* recency is tracked, never the safety handshake. + +## 5. Optimistic descent + `BH_WIRED` (Stage 1) + +`__bam_search` descends parent→child, latch-coupling: it fetches the child +(`bt_search.c:806`), then releases the parent (`bt_search.c:810`). Internal vs +leaf is `TYPE(h) == P_LBTREE` (`bt_search.c:348`; `P_IBTREE=3`, `P_LBTREE=5`). 
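For contrast with the latch-coupled step above, a version-validated read can be sketched in a few lines of C11. This is a minimal sketch under stated assumptions, not the planned implementation: `toy_node`, `toy_read_child`, and `toy_set_child` are hypothetical names, the payload is a single word rather than a page image, and a seqlock-style even/odd counter is swapped in as a stand-in for the page LSN check.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a wired internal node; not BDB's PAGE layout. */
struct toy_node {
    _Atomic uint64_t version;   /* plays the role of LSN(h); odd = write in progress */
    _Atomic uint32_t child;     /* plays the role of one child slot */
};

/*
 * Optimistic read: snapshot the version, read the slot, re-check the version.
 * Returns true and fills *out only if no writer intervened. On false the
 * caller is expected to fall back to a pinned+latched path (not shown).
 * The reader issues loads only; it stores nothing to the node.
 */
static bool toy_read_child(struct toy_node *n, uint32_t *out)
{
    uint64_t v0 = atomic_load_explicit(&n->version, memory_order_acquire);
    if (v0 & 1)
        return false;           /* a writer is mid-update */
    uint32_t c = atomic_load_explicit(&n->child, memory_order_relaxed);
    atomic_thread_fence(memory_order_acquire);
    uint64_t v1 = atomic_load_explicit(&n->version, memory_order_relaxed);
    if (v0 != v1)
        return false;           /* the node changed underneath the read */
    *out = c;
    return true;
}

/* Writer side; assumed to run while holding the node's exclusive latch. */
static void toy_set_child(struct toy_node *n, uint32_t child)
{
    uint64_t v = atomic_load_explicit(&n->version, memory_order_relaxed);
    atomic_store_explicit(&n->version, v + 1, memory_order_relaxed);  /* odd */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&n->child, child, memory_order_relaxed);
    atomic_store_explicit(&n->version, v + 2, memory_order_release);  /* even */
}
```

The reader performs no stores to the node, which is the property Section 1 asks of the read path. A `false` return corresponds to retrying through the existing pinned and latched path.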
+ +**`BH_WIRED` (`0x200`, first free flag bit, `mp.h:552`).** When the btree +fetches a `P_IBTREE` page, mark its buffer wired; the evictor skips wired +buffers (`mp_alloc.c:320` and `:485`). Because internal nodes are <1% of pages +and always hot, the memory cost is negligible and they would never be evicted +anyway. **Wiring converts "safely reclaim arbitrary pages" (general epoch/RCU) +into "these specific pages are never freed"** — so an optimistic reader of a +wired internal page can never dereference freed memory. This is the trick that +avoids boiling the ocean. + +**LSN-validated optimistic read.** At the child-pointer read (`bt_search.c:476`) +of a wired internal parent: +1. snapshot `LSN(h)` (page LSN, byte 0 of the page, updated under the page's + exclusive latch on every structural change), +2. read/binary-search the node *without* the pin/latch, +3. re-read `LSN(h)`; if unchanged, the read was consistent — follow the child + swip; if it moved (a rare split/merge), fall back to the existing + pinned+latched path and retry. + +For a wired parent we skip the child `__memp_fget` pin and the parent +`__memp_fput` (`bt_search.c:810`). The child swip (the `roff_t` of the resident +child frame) is cached in the parent's shadow vector at `:476`; a `HOT` swip is +followed with no hash lookup and no bucket latch — removing writes #1, #2, #3 on +the internal levels. Leaves keep the normal pin+latch (numerous, evictable, not +a single hot line). + +**Gating (from the survey).** +- Only on `mfp->multiversion == 0` (mirror `__bam_get_root` at `:188/226`); the + MVCC version-chain visibility walk (`mp_fget.c:264`) must not be bypassed. +- `BH_WIRED` mutually exclusive with `BH_FROZEN`/thaw. +- Binary search over the node must tolerate transiently inconsistent bytes + (it is validated by the post-read LSN re-check; never act on a result that + fails revalidation). 
+- Swips are invalidated on `__bam_split` / `__bam_pinsert`, keyed off the parent +LSN change the optimistic reader already checks. + +## 6. Async I/O, prefetch, trickle (Stage 2) + +All data-page I/O funnels through one synchronous choke point: `__os_io` +(`os/os_rw.c:20`, `pread`/`pwrite`). Read-in is `__memp_pgread` +(`mp_bh.c:298`); writeback is `__memp_pgwrite` (`mp_bh.c:469`) via +`__memp_bhwrite`, reached from eviction (`mp_alloc.c:509`), checkpoint/sync +(`mp_sync.c:562`), and trickle (`mp_trickle.c`). There is no prefetch/readahead +today (no `fadvise`/`madvise` anywhere). + +**`os_aio` interface** (new, in the `os/` layer; sketch in `research/os-aio.md` +and `src/dbinc/os_aio.h`): per-ENV context with `submit_read`, `submit_write`, +`reap`/poll, `cancel`; completion cookie `{BH*, DB_MPOOL_HASH*, MPOOLFILE*}` so +the completion handler runs the existing `pgread`/`pgwrite` tail. Backends: + +- Linux **io_uring** (prep/submit, reap CQEs; pairs with `O_DIRECT`), +- BSD/macOS **kqueue + aio** (`aiocb`, `EVFILT_AIO`), +- Windows **IOCP** (requires reopening files with `FILE_FLAG_OVERLAPPED`, which + `os_windows/os_open.c` does not do today), +- POSIX **aio** fallback; `j_aio_*` test hooks mirroring `j_pread`/`j_pwrite`. + +**Prefetch** transitions `COLD → IN_FLIGHT` and submits an async read at the +miss site (`mp_fget.c:806-865`, where the buffer is allocated and marked +`BH_TRASH`); the `IN_FLIGHT` swip lets concurrent accessors find the reserved +frame instead of re-issuing. A cursor/scan hint drives readahead. + +**Trickle writeback** is the safest first async target: background flush of +cool/dirty pages so foreground eviction rarely stalls on a write. Split +`__memp_pgwrite` into **prep** (WAL flush + `pgout`, keep `mp_bh.c:347-405` +verbatim) → **submit** (replace `__os_io` at `:469` with `os_aio_submit_write`) +→ **finish** (move the `BH_DIRTY` clear + `hash_page_dirty` decrement, +`mp_bh.c:480-507`, into the completion handler). 
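The prep, submit, finish split can be sketched as a small state flow. This is an illustration under assumptions only: every name (`toy_buf`, `toy_queue`, `toy_prep`, `toy_submit`, `toy_reap`) is hypothetical, the queue is an in-memory array standing in for an async backend, and no BDB or `os_aio` call is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; none are BDB or os_aio symbols. */
struct toy_buf {
    bool dirty;         /* page image differs from disk */
    bool log_flushed;   /* log records covering this page are durable */
    bool in_flight;     /* write submitted, completion not yet seen */
};

#define TOY_QUEUE_MAX 8
struct toy_queue {
    struct toy_buf *pending[TOY_QUEUE_MAX];
    size_t n;
};

/* prep: everything that must happen before the device write is issued. */
static void toy_prep(struct toy_buf *b)
{
    b->log_flushed = true;      /* stands in for the WAL flush + pgout step */
}

/* submit: queue the write; refuses if prep was skipped or the queue is full. */
static bool toy_submit(struct toy_queue *q, struct toy_buf *b)
{
    if (!b->log_flushed || q->n == TOY_QUEUE_MAX)
        return false;
    b->in_flight = true;
    q->pending[q->n++] = b;
    return true;
}

/* finish: runs at completion; only here is the buffer marked clean. */
static size_t toy_reap(struct toy_queue *q)
{
    size_t done = q->n;
    for (size_t i = 0; i < q->n; i++) {
        q->pending[i]->in_flight = false;
        q->pending[i]->dirty = false;
    }
    q->n = 0;
    return done;
}
```

In this ordering `dirty` is cleared only in the completion step, so a buffer whose write is still outstanding continues to count as dirty, and a submit that skipped prep is rejected outright.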
+ +**WAL rule** (must precede every async write): the per-page `__log_flush` keyed +on `mfp->lsn_off` (`mp_bh.c:347-354`) and the batch flush (`mp_sync.c:464-473`) +stay in the *prep* step, before submission. Non-negotiable. + +**Per-process constraint** (critical, from the survey): AIO contexts are +per-process but the mpool region is shared, so a page marked in-flight by one +process cannot be reaped by another. Resolution: **only the process holding +`bhp->mtx_buf` exclusive submits and reaps**, releasing the latch in its own +completion handler. Other processes use the existing in-transit wait — block on +`bhp->mtx_buf`, then re-check `BH_TRASH` (`mp_fget.c:309-329`). Crash with an op +in flight reuses the existing `mtx_buf` recovery on `DB_RECOVER`. This means the +high-core single-process target gets full async; multi-process degrades +gracefully to today's behaviour. + +## 7. Stages, exit criteria, ordering + +The stages are a dependency chain; each is a separate branch/PR, measured on the +24-core box. 
+ +| stage | branch | delivers | exit criteria | +|---|---|---|---| +| **0** | `perf/swip-stage0-cooling` | clock/cool replacement; delete per-access `priority` write + global counter + `__memp_reset_lru`; referenced bit in `flags` | TCL regression green; eviction quality not worse on a cache-pressure workload; `wrand`/`rrand` no regression; **scan-resistance** shown (a scan doesn't evict the hot set) | +| **1** | `perf/swip-stage1-descent` | `BH_WIRED`; shadow swip vector (`roff_t` + 2 bits); LSN-validated optimistic descent of wired internals | TCL regression green; **read ceiling lifts** on `rrand`/`snap` at 8–24 threads vs master; correctness under concurrent split (stress test) | +| **2** | `perf/swip-stage2-aio` | `os_aio` interface + 1 backend (io_uring first); async trickle writeback; prefetch hint | TCL green; trickle keeps dirty-eviction stalls down under write load; prefetch improves cold-scan latency; multi-process falls back correctly | +| **3** | (later, if needed) | general pointer swizzling + epoch reclamation + raw-pointer single-process fast path | only if profiling shows the wired-internals scope leaves gains on the table | + +## 8. Correctness invariants (the parts that must not be wrong) + +1. Hot pages stay `HOT`, so their swips are read-only on the descent; cooling + churns only the boundary. +2. Every structural modification bumps the page LSN under the page's exclusive + latch *before* the change is visible; the optimistic reader's LSN re-check + brackets every read of node contents and never acts on an unvalidated result. +3. Eviction's `refcount == 0` + exclusive-`mtx_buf` handshake is unchanged. +4. `BH_WIRED` excludes `BH_FROZEN`; optimistic fast path gated on + `mfp->multiversion == 0`. +5. Async writes obey the WAL rule in the prep step; only the latch-holding + process reaps its own AIO; swips are never persisted to the page image. + +## 9. 
Measurement plan + +Re-run `lab/bench/scale_bench {rrand,sepdb,snap} 200000 3 1 2 4 8 12 16 24` plus +a cache-pressure workload and a write/trickle workload, on the 24-core Linux box +(`meh`) — the laptop is too noisy (asymmetric cores) to resolve these deltas. +Each stage records before/after medians and a `perf` self-time delta in +`scaling-findings.md`. Stage 1's headline metric: does the 8-thread ceiling lift +and the `futex`/atomic self-time collapse. + +## 10. Open questions + +- Shadow swip vector lifetime vs `__bam` page reorganization — invalidate vs + rebuild on split/merge (lean toward invalidate-on-LSN-change). +- Whether to wire only levels ≥1 (internals) or also a configurable number of + upper levels under memory accounting. +- io_uring + `O_DIRECT` alignment interaction with BDB page sizes and the + existing `DB_OSO_DIRECT` path. +- Interaction of trickle/group-commit (#3 in the roadmap) — the async writer and + the WAL group-commit should share the log-flush coordination. + +## 11. Reference implementations (sqlxtc, noxu) and refinements + +Two sibling projects implement LeanStore/Umbra-style cooling buffer managers; +studying them validates this design and sharpens three points. + +**sqlxtc `bufmgr.c`** (libxtc `examples/06_sqlxtc`): frame states +`FREE/HOT/COOL/LOADED/WRITING`; the eviction state lives in the parent **swip** +and transitions are owned by whoever wins a CAS (loser retries) — exactly the +tagged-swip model here. Its `evict_one` clock sweep is the key: + +- **Probationary admission + COOL-first eviction.** New pages are admitted + `COOL`. The sweep reclaims an already-`COOL` frame and **only cools a `HOT` + frame when a full sweep finds no `COOL` victim** (`force_cool`). Because a + scan keeps supplying `COOL` pages, *the hot set is never cooled to make room + for a scan* — robust scan resistance for a scan of any length. 
A `COOL` page + carries a CLOCK `ref` bit (set on access, cleared by the sweep) so a + re-touched cool page survives one sweep and is promoted back to `HOT`. +- **Single-word pin/evict gate.** `pin >= 0` ⇒ a fixer may `CAS pin→pin+1`; + the evictor reserves an unpinned frame with `CAS 0→-1` (`-1` = EVICTING). + Acquiring a pin and reserving for eviction race on **one** atomic word, with + no separate "pin++ then re-check" window. +- **prefer-clean foreground + background trickle.** Foreground eviction + reclaims a clean victim and leaves dirty `COOL` pages for the trickler; it + flushes a dirty page inline only as a last-resort progress guarantee. + +**noxu `noxu-evictor`** (Berkeley-DB-JE lineage): a per-operation **`CacheMode`** +(`Default/Unchanged/EvictLn/EvictBin/KeepHot/MakeEvictable`) drives **two +independent tracking sets** — `primary` and `scan_resistant` — and the evictor +drains *scan → primary → dirty* with per-phase quotas. Pluggable LRU/Clock/ARC/ +CAR/LIRS; it notes LRU pollutes on scans while ARC/CAR/LIRS resist inherently. + +### What this changes here + +1. **Scan resistance is probationary admission + COOL-first, not a single + counter.** Stage 0's frequency-climb CLOCK ages *every* buffer the hand + passes, so a long scan still decays the hot set (measured: only ~19% fewer + hot-set page-ins than LRU). The fix is to **admit new/scan-read pages cool + and never cool a HOT buffer while a COOL victim exists** — then a scan of any + length leaves the hot set untouched. This is implementable in BDB's existing + bucket-scan evictor *without* the full swip (call it **Stage 0.5**): split + the warmth range into COOL/HOT bands, admit reads at the top of COOL, promote + to HOT only on re-reference, and only decrement HOT-band warmth when a full + sweep finds no COOL victim (BDB's existing `aggressive` escalation is the + hook). Robust scan resistance, no optimistic-descent risk. +2. 
**Adopt the single-word pin/evict reservation** on `bhp->ref` + (`-1` = EVICTING) to replace `__memp_alloc`'s TRYLOCK-`mtx_buf`-then-recheck- + `ref==1` dance, removing that race window. +3. **prefer-clean eviction + trickle** is the Stage 2 writeback design; confirm + foreground eviction never blocks on a device write while a clean COOL victim + exists. + +### BDB constraints (why we adopt principles, not code) + +Both references are greenfield, single-address-space. BDB's mpool is +multi-process (swips must be `roff_t`, resolved via `R_ADDR`), pages are +persisted disk images (the swip lives in an in-memory shadow vector, never in +the page), and the on-disk format + public API are fixed. So we take the +*mechanisms* — probationary COOL admission, COOL-first eviction, swip-encoded +state, single-word pin/evict, prefer-clean+trickle — within those constraints. diff --git a/docs/design/research/mpool-btree.md b/docs/design/research/mpool-btree.md new file mode 100644 index 000000000..0c584d3a1 --- /dev/null +++ b/docs/design/research/mpool-btree.md @@ -0,0 +1,504 @@ +# MPOOL / BTREE implementation-grounding report + +Read-only survey of `master` for a buffer-manager rework (Stage 0: cooling +state replacing the per-access priority write; Stage 1: `BH_WIRED` + +per-parent child swip + LSN-validated optimistic descent of wired internal +pages). Every line number below was read from the working tree at survey time. + +> **Two premises in the request do not exist in this tree — flag before relying on them:** +> 1. `MPOOL_HOTFIELDS_ISOLATED` — **no such guard, macro, or `#ifdef` anywhere** +> (`grep` across the whole repo returns nothing). `struct __bh` in +> `src/dbinc/mp.h` is a plain, unguarded, unpadded `/* SHARED */` struct. +> There is no cache-line isolation of the hot fields (`ref`, `flags`, +> `priority`) today; they share a line with `mtx_buf`, `hq`, `pgno`. +> 2. `__bam_iget` — **does not exist**. 
The btree descent entry points are +> `__bam_get_root` (root fetch) and `__bam_search` (root→leaf walk) in +> `src/btree/bt_search.c`. The report uses those. + +--- + +## 1. MPOOL pin / unpin + +### 1a. Pin — `__memp_fget`, `src/mp/mp_fget.c:103` + +The resident-page fast path: + +**Bucket latch (read lock on `hp->mtx_hash`)** is taken inside the +`MP_GET_BUCKET` macro at `src/mp/mp_fget.c:245`: + +```c +245: MP_GET_BUCKET(env, mfp, *pgnoaddr, &infop, hp, bucket, ret); +246: if (ret != 0) +247: return (ret); +248: c_mp = infop->primary; +``` + +The macro body (`src/dbinc/mp.h`, `MP_GET_BUCKET`) computes the bucket and does +`MUTEX_READLOCK(env, (hp)->mtx_hash)` for the single-region case, else calls +`__memp_get_bucket`. After this, `h_locked = 1` (set at line 257). + +**Chain walk reading `bhp->pgno` / `bhp->mf_offset`**, `src/mp/mp_fget.c:256`: + +```c +256: st_hsearch = 0; +257: h_locked = 1; +258: SH_TAILQ_FOREACH(bhp, &hp->hash_bucket, hq, __bh) { +259: ++st_hsearch; +260: if (bhp->pgno != *pgnoaddr || bhp->mf_offset != mf_offset) +261: continue; +``` + +(For snapshot reads the MVCC version chain is walked backward at lines +264–278 via `SH_CHAIN_PREV(bhp, vc, __bh)` — see hazards §6.) + +**`atomic_inc(&bhp->ref)` (the pin)**, `src/mp/mp_fget.c:288`: + +```c +288: if (BH_REFCOUNT(bhp) == UINT16_MAX) { /* overflow guard */ +... 
+296: atomic_inc(env, &bhp->ref); +297: b_incr = 1; +``` + +**Drop bucket latch, then take the buffer latch.** Hash mutex is released +*before* the buffer mutex (lock-ordering: hash → buf), `src/mp/mp_fget.c:304`: + +```c +304: MUTEX_UNLOCK(env, hp->mtx_hash); +305: h_locked = 0; +306: if (dirty || extending || makecopy || F_ISSET(bhp, BH_FROZEN)) { +307: xlatch: if (LF_ISSET(DB_MPOOL_TRY)) { +308: if ((ret = +309: MUTEX_TRYLOCK(env, bhp->mtx_buf)) != 0) +310: goto err; +311: } else +312: MUTEX_LOCK(env, bhp->mtx_buf); +313: F_SET(bhp, BH_EXCLUSIVE); +314: } else if (LF_ISSET(DB_MPOOL_TRY)) { +315: if ((ret = MUTEX_TRY_READLOCK(env, bhp->mtx_buf)) != 0) +316: goto err; +317: } else +318: MUTEX_READLOCK(env, bhp->mtx_buf); +``` + +The clean read fast path is line 318 (`MUTEX_READLOCK(env, bhp->mtx_buf)`); +`b_lock = 1` at line 333. Note: without `HAVE_SHARED_LATCHES`, +`F_SET(bhp, BH_EXCLUSIVE)` is unconditional at line 330. + +The pin is recorded in the thread's `PIN_LIST` at lines 866–905 (`lp->b_ref = +R_OFFSET(infop, bhp)`), and the page pointer returned at line 933 +(`*(void **)addrp = bhp->buf;`). + +### 1b. Unpin — `__memp_fput`, `src/mp/mp_fput.c:65` + +**Multi-reference early-out** (`atomic_dec` returning >1 keeps priority +untouched), `src/mp/mp_fput.c:183`: + +```c +183: DB_ASSERT(env, atomic_read(&bhp->ref) != 0); +184: if (atomic_dec(env, &bhp->ref) > 1 || (atomic_read(&bhp->ref) == 1 && +185: !F_ISSET(bhp, BH_DIRTY))) { +186: if (F_ISSET(bhp, BH_EXCLUSIVE)) +187: F_CLR(bhp, BH_EXCLUSIVE); +188: MUTEX_UNLOCK(env, bhp->mtx_buf); +189: return (0); +190: } +``` + +**The LRU bump (the per-access `bhp->priority` write — Stage 0's target)**, +`src/mp/mp_fput.c:206`: + +```c +206: if (priority == DB_PRIORITY_VERY_LOW || +207: mfp->priority == MPOOL_PRI_VERY_LOW) +208: bhp->priority = 0; +209: else { +... +215: bhp->priority = c_mp->lru_priority; +... /* per-priority adjust */ +249: bhp->priority += adjust; +... 
+253: } +``` + +**Global clock advance + wraparound reset**, `src/mp/mp_fput.c:269`: + +```c +269: if (++c_mp->lru_priority >= MPOOL_LRU_REDZONE && +270: (t_ret = __memp_reset_lru(env, infop)) != 0 && ret == 0) +271: ret = t_ret; +``` + +`__memp_reset_lru` (`src/mp/mp_fput.c:279`) bulk-decrements every buffer's +priority when the counter saturates (lines 300, 325). This O(cache) sweep is +exactly what a clock hand makes unnecessary. + +--- + +## 2. Replacement / eviction — `__memp_alloc`, `src/mp/mp_alloc.c:30` + +### Victim selection +`__memp_alloc` first tries `__env_alloc` from free memory (`alloc:` label, +line ~95). On failure it falls to `search:` (line 149) and scans buckets. + +**High-priority threshold** (buffers fresher than `lru_priority - +pages/10` are skipped until aggressive), `src/mp/mp_alloc.c:155`: + +```c +155: cache_reduction = c_mp->pages / 10; +156: high_priority = aggressive ? MPOOL_LRU_MAX : +157: c_mp->lru_priority - cache_reduction; +158: lru_generation = c_mp->lru_generation; +``` + +**Bucket scan with wraparound** via `c_mp->last_checked`, +`src/mp/mp_alloc.c:182`: + +```c +182: hp = &dbht[c_mp->last_checked++]; +183: if (hp >= hp_end) { +184: c_mp->last_checked = 0; +185: hp = &dbht[c_mp->last_checked++]; +186: } +``` + +After `MPOOL_ALLOC_SEARCH_LIMIT` (500) buckets, or a full pass, `aggressive` +is raised and `high_priority = MPOOL_LRU_MAX` (lines 261, 297) so every buffer +becomes eligible. + +**Per-bucket lowest-priority pick** (`bhp->priority` is the LRU key), +`src/mp/mp_alloc.c:303`: + +```c +303: retry_search: bhp = NULL; +304: bucket_priority = high_priority; +... +320: if (SH_CHAIN_SINGLETON(current_bhp, vc)) { +321: if (BH_REFCOUNT(current_bhp) != 0) +322: continue; /* skip pinned */ +323: buffers++; +324: if (bucket_priority > current_bhp->priority) { +325: bucket_priority = current_bhp->priority; +... 
+328: bhp = current_bhp; +329: atomic_inc(env, &bhp->ref); +330: } +331: continue; +332: } +``` + +Two buckets are compared and the lower-priority winner kept +(`priority = bhp->priority;` at line 409; swap logic 420–447). + +### `BH_REFCOUNT==0` check and exclusive `mtx_buf` +Eviction will not touch a pinned buffer. After dropping the hash mutex +(`src/mp/mp_alloc.c:481`): + +```c +484: /* Don't bother trying to latch a busy buffer. */ +485: if (BH_REFCOUNT(bhp) > 1) +486: goto next_hb; +... +489: if ((ret = MUTEX_TRYLOCK(env, bhp->mtx_buf)) != 0) { /* exclusive */ +... +493: F_SET(bhp, BH_EXCLUSIVE); +494: b_lock = 1; +497: /* Someone may have grabbed it while we got the lock. */ +498: if (BH_REFCOUNT(bhp) != 1) +499: goto next_hb; +``` + +So the invariant is: an evictable buffer must have `ref == 1` (the evictor's +own transient inc) and be exclusively latched via non-blocking `MUTEX_TRYLOCK`. +Dirty victims are written first (`F_ISSET(bhp, BH_DIRTY)` → +`__memp_bhwrite`, lines 507–525); MVCC mid-chain victims are frozen +(`SH_CHAIN_HASPREV` → `__memp_bh_freeze`, lines 540–566). A same-size clean +victim is reused in place (`p = bhp; goto found;`, lines 660–666). + +### How a clock / second-chance / hot–cool state machine slots in +- **State storage:** reuse the 32-bit `bhp->priority` (`src/dbinc/mp.h:555`) or + steal bits from `bhp->flags` (`u_int16_t`, `src/dbinc/mp.h:553`, only 9 bits + used through `BH_THAWED 0x100`). A 2-bit HOT/WARM/COOL/COLD field fits in + `flags`. +- **Reference bit set on access:** replace the `bhp->priority = c_mp->lru_priority` + write at `src/mp/mp_fput.c:215` (and the `DB_PRIORITY_*` adjust block 209–253) + with a single `F_SET(bhp, BH_HOT)` / set-reference-bit — cheaper, and it + removes the `++c_mp->lru_priority` global counter contention (line 269) and + the whole `__memp_reset_lru` sweep (lines 279–331). +- **Evictor consults the state bit / advances the cool hand:** the victim test + at `src/mp/mp_alloc.c:320–332`. 
Instead of "lowest priority wins," do + second-chance: if `BH_HOT` set, clear it (demote one level) and `continue`; + only a COLD, `BH_REFCOUNT==0` buffer becomes `bhp`. `c_mp->last_checked` + (line 182) already *is* a per-region clock hand — reuse it as the cooling + hand. The `aggressive`/`high_priority` escalation (155–157, 261) maps to + "force-demote on the second sweep." + +--- + +## 3. BH struct + flags — `src/dbinc/mp.h` + +`struct __bh` is `/* SHARED */` (cross-process), `src/dbinc/mp.h:539`: + +```c +539: struct __bh { /* SHARED */ +540: db_mutex_t mtx_buf; /* Shared/Exclusive mutex */ +541: db_atomic_t ref; /* Reference count. */ +542:#define BH_REFCOUNT(bhp) atomic_read(&(bhp)->ref) +544:#define BH_CALLPGIN 0x001 +545:#define BH_DIRTY 0x002 +546:#define BH_DIRTY_CREATE 0x004 +547:#define BH_DISCARD 0x008 +548:#define BH_EXCLUSIVE 0x010 +549:#define BH_FREED 0x020 +550:#define BH_FROZEN 0x040 +551:#define BH_TRASH 0x080 +552:#define BH_THAWED 0x100 +553: u_int16_t flags; +554: +555: u_int32_t priority; /* Priority. */ +556: SH_TAILQ_ENTRY hq; /* MPOOL hash bucket queue. */ +557: +558: db_pgno_t pgno; +559: roff_t mf_offset; +560: u_int32_t bucket; +561: int region; +562: +563: roff_t td_off; /* MVCC: creating TXN_DETAIL offset. */ +564: SH_CHAIN_ENTRY vc; /* MVCC: version chain. */ +... +571: DB_ALIGN8 u_int8_t buf[1]; /* Variable length data. */ +572: }; +``` + +- **No `MPOOL_HOTFIELDS_ISOLATED` guard exists** (confirmed by grep). If the + rework needs cache-line isolation of `ref`/`flags`/`priority` to avoid false + sharing of the new reference/cool bit, that padding does **not** exist yet and + must be added here. +- **`BH_WIRED` (new, non-evictable) flag:** next free bit is `0x200` + (`#define BH_WIRED 0x200`) — `flags` is `u_int16_t`, room through `0x8000`. 
+- **Where the evictor checks `BH_WIRED`:** add `if (F_ISSET(current_bhp, + BH_WIRED)) continue;` in the victim loop at `src/mp/mp_alloc.c:320` + (singleton branch) and at the MVCC-chain branch ~line 350, alongside the + existing `BH_REFCOUNT != 0` skip. Belt-and-suspenders: also short-circuit at + `src/mp/mp_alloc.c:485` before the `MUTEX_TRYLOCK`. +- **Where wired internal pages are marked on fetch:** in `__memp_fget` after + the buffer is pinned and latched, just before the return at + `src/mp/mp_fget.c:933`. But the *caller* knows page type, not mpool — so the + natural mark point is the btree layer once `TYPE(h) == P_IBTREE` is known + (see §4). A new mpool entry point (e.g. `__memp_wire(bhp)` setting `BH_WIRED` + under `bhp->mtx_buf`) called from `__bam_get_root`/`__bam_search` after the + internal page fget is the clean seam. + +--- + +## 4. BTREE descent — `src/btree/bt_search.c` + +Entry points (note: **`__bam_iget` does not exist**): +- `__bam_get_root`, `src/btree/bt_search.c:59` — fetches the root via the + `BAM_GET_ROOT` macro at line 114, asserts `TYPE(h)` is one of + `P_IBTREE/P_IRECNO/P_LBTREE/P_LRECNO/P_LDUP` at lines 123–124. +- `__bam_search`, `src/btree/bt_search.c:253` — the root→leaf descent loop. + +### The descent loop — `src/btree/bt_search.c:347` +```c +347: for (;;) { +348: if (TYPE(h) == P_LBTREE) /* leaf vs internal distinction */ +349: adjust = P_INDX; +350: else { ... adjust = O_INDX; } +... + /* binary search on h ... then for an internal page: */ +476: pg = GET_BINTERNAL(dbp, h, indx)->pgno; /* child pgno from parent */ +477: level = LEVEL(h); +... +601: parent_h = h; /* normal search: latch-couple parent→child */ +602: goto lock_next; +... +803: skip_lock: stack = set_stack; +804: } +805: /* Get the child page. */ +806: if ((ret = __memp_fget(mpf, &pg, +807: dbc->thread_info, dbc->txn, get_mode, &h)) != 0) +808: goto err; +809: /* Release the parent. 
*/ +810: if (parent_h != NULL && (ret = __memp_fput(mpf, +811: dbc->thread_info, parent_h, dbc->priority)) != 0) +812: goto err; +813: parent_h = NULL; +814: } +``` + +So per level: read child pgno from parent (`GET_BINTERNAL(...)->pgno`, line +476) → fetch+pin+latch child (`__memp_fget`, lines 806–807) → unpin parent +(`__memp_fput`, lines 810–811). Latch-coupling: the parent stays pinned +(`parent_h = h`, line 601) across the child fetch. + +### Internal vs leaf, and the page LSN +- Page type: `P_IBTREE = 3` (`src/dbinc/db_page.h:39`), `P_LBTREE = 5` + (`src/dbinc/db_page.h:41`); read via `TYPE(p)` macro (`db_page.h:317`), + `LEVEL(p)` (`db_page.h:316`), `LEAFLEVEL == 1` (`db_page.h`). +- LSN lives at the head of every `PAGE`: `DB_LSN lsn;` is bytes 00–07 + (`src/dbinc/db_page.h:260`), accessed via `LSN(p)` (`db_page.h:310`). It is + stamped under exclusive page latch + log write by the access-method redo/undo + paths (e.g. `__db_*_log` / `__memp_dirty` callers), **not** by `__memp_fget`. + For a reader, `LSN(h)` is stable as long as the page is latched. +- Child entry record: `BINTERNAL` (`src/dbinc/db_page.h:784–791`), `pgno` at + offset 04–07; fetched by `GET_BINTERNAL` (`db_page.h:795`). + +### Stage 1 insertion point — optimistic LSN-validated descent of a wired internal page +The exact seam is the top of the descent loop body, **between** reading the +child pointer and the `__memp_fget` of the child: + +- Replace the sequence at `src/btree/bt_search.c:476` (`pg = + GET_BINTERNAL(...)->pgno`) → `806` (`__memp_fget(child)`) → `810` + (`__memp_fput(parent)`) with: if the current `h` is a wired internal page + (`F_ISSET(PAGE_TO_BH(h), BH_WIRED)`), read `pg` and snapshot `lsn = LSN(h)` + **without** holding the buffer latch, do the binary search against the + unlatched (but wired, hence non-evictable, non-relocating) frame, then + re-validate `LOG_COMPARE(&lsn, &LSN(h)) == 0` before trusting `pg`. 
On + mismatch, fall back to the latched path (current lines 806–811). Because + `BH_WIRED` guarantees the frame is not reused by `__memp_alloc` (§2/§3), the + pointer stays valid; the LSN check guards against in-place content mutation + (split/merge) of the internal page. +- The parent `__memp_fput` at lines 810–811 is skipped for wired parents (no + pin was taken), which is the whole point — descent of the upper tree touches + no `mtx_hash`, no `mtx_buf`, no `atomic_inc`. + +### Where a per-parent child-swip cache attaches +`GET_BINTERNAL(dbp, h, indx)->pgno` (line 476) is a `db_pgno_t` (logical page +number) that must be re-hashed and re-looked-up each descent. A swip cache +turns it into a direct `roff_t` to the child `BH` with low state bits +(resolved/unresolved). Two candidate homes: +- **Shadow vector keyed by internal frame** (preferred, see §5): an array of + `roff_t` swips parallel to the `BINTERNAL` slots of each wired internal page, + consulted at line 476 before falling back to `__memp_fget`. Low 2–3 bits of + the `roff_t` encode state (0 = unresolved → do the fget and fill the swip; + 1 = resolved offset; 2 = invalidated by split). Alignment of `BH` + allocations (`MVCC_BHALIGN`, and `__env_alloc` granularity) leaves low bits + free. +- It must be invalidated wherever the child set changes — `__bam_split` / + `__bam_pinsert` and page-free paths — keyed off the parent page LSN bump. + +--- + +## 5. Where the region stores per-buffer metadata — `src/mp/mp_region.c` + +Per-region layout is built in `__memp_init` (`src/mp/mp_region.c:196`): + +```c +286: if ((ret = __env_alloc(infop, +287: htab_buckets * sizeof(DB_MPOOL_HASH), &htab)) != 0) +288: goto mem_err; +289: mp->htab = R_OFFSET(infop, htab); +... +339: mp->htab_buckets = htab_buckets; +``` + +The cache region (`MPOOL`, `src/dbinc/mp.h:struct __mpool`) holds offsets, not +arrays: `roff_t htab` (hash buckets), `roff_t regids`, `roff_t ftab`, plus the +`free_frozen` / `alloc_frozen` lists. 
Individual `BH`s are *not* a contiguous +array — each is `__env_alloc`'d on demand inside `__memp_alloc` +(`src/mp/mp_alloc.c`, `alloc:` path, `c_mp->pages++`), so there is no existing +dense per-buffer index to hang a parallel vector off. + +**For a per-internal-frame shadow swip vector:** allocate it in `__memp_init` +right after the htab allocation (after `src/mp/mp_region.c:289`), as another +`__env_alloc(infop, ...)` whose offset is stored in a new `roff_t` field on +`struct __mpool`. Because swips are *per wired internal frame* (not per cache +slot), the cleaner design is to allocate the swip vector lazily when a frame is +wired (alongside `BH_WIRED` marking, §3), sized to that page's `NUM_ENT`, and +store its `roff_t` either in a new `BH` field or in a small region-side hash +keyed by `bhp` offset. Both the htab and any new vector must be reachable by +`R_ADDR(infop, off)` from every attached process (cross-process — see §6). + +--- + +## Concrete insertion points + +### Stage 0 — cooling state replaces the per-access priority write +| What | File:line | +|------|-----------| +| Remove/replace the LRU priority write | `src/mp/mp_fput.c:215` (`bhp->priority = c_mp->lru_priority`) and the adjust block `209–253` → set a reference/HOT bit instead | +| Remove the global clock increment + redzone reset | `src/mp/mp_fput.c:269–271` (and delete the `__memp_reset_lru` sweep `279–331`) | +| State bits storage | `src/dbinc/mp.h:553` (`flags`, free bits ≥ `0x200`) or repurpose `priority` `mp.h:555` | +| Evictor second-chance / cooling check | `src/mp/mp_alloc.c:320–332` (singleton) and `~345–360` (MVCC chain); reuse `c_mp->last_checked` `mp_alloc.c:182` as the cool hand | +| Keep refcount==0 + TRYLOCK invariant unchanged | `src/mp/mp_alloc.c:485, 489, 498` | + +### Stage 1 — `BH_WIRED` + swip + LSN-validated descent +| What | File:line | +|------|-----------| +| Define `BH_WIRED 0x200` | `src/dbinc/mp.h:552` (after `BH_THAWED`) | +| Evictor skips wired buffers | 
`src/mp/mp_alloc.c:320` (and `485`) — `if (F_ISSET(.., BH_WIRED)) continue;` | +| Mark internal page wired after fetch | `src/btree/bt_search.c` after the child `__memp_fget` at `806–807`, gated on `TYPE(h)==P_IBTREE`; via a new `__memp_wire()` touching `src/mp/mp_fget.c:933` region | +| Optimistic read seam | `src/btree/bt_search.c:476` (read `pg`+`LSN(h)`) … revalidate before/instead of `__memp_fget` at `806`; skip parent `__memp_fput` `810–811` for wired parents | +| Swip cache read/fill | at `src/btree/bt_search.c:476` (`GET_BINTERNAL(...)->pgno`) | +| Swip vector allocation | `src/mp/mp_region.c:289` (region) + new `roff_t` on `struct __mpool` in `src/dbinc/mp.h` | +| Swip invalidation | btree split/insert (`__bam_split`, `__bam_pinsert`) keyed off parent `LSN` | + +--- + +## Correctness hazards + +1. **Cross-process shared region.** `struct __bh`, `struct __mpool`, + `DB_MPOOL_HASH` are all `/* SHARED */` and live in a region mapped by + multiple processes at *different* virtual addresses — every reference is a + `roff_t` + `R_ADDR`/`R_OFFSET`, never a pointer. A swip cache **must store + `roff_t`, not `BH *`**, and any new state bits must be written atomically + w.r.t. other processes (the existing code only relies on 32-bit reads/writes + being atomic — see the `priority`/`lru_priority` "we don't lock, garbage is + tolerable" comments at `mp_fput.c:210–214` and `mfp->priority` note in + `mp.h`). A HOT/cool bit with relaxed semantics is fine; a swip that gates + correctness is not — it needs the LSN revalidation as its correctness anchor. + +2. **Optimistic descent vs latching.** Skipping `bhp->mtx_buf` on a wired + internal page means giving up the latch that today guarantees a stable + `LSN(h)` and stable `BINTERNAL` bytes during the binary search + (`bt_search.c:347–476`). `BH_WIRED` only guarantees the *frame* is not + evicted/reused; it does **not** stop an in-place split/merge from rewriting + the page under you. 
The `LOG_COMPARE(&snapshot_lsn, &LSN(h))` recheck after + the search is mandatory, and the binary-search code must tolerate transiently + inconsistent bytes (no out-of-bounds `indx`) before the recheck rejects them. + +3. **MVCC / version-chain interaction at `read_lsnp`.** `__memp_fget` resolves + snapshot reads by walking the `vc` chain backward + (`src/mp/mp_fget.c:264–278`, `BH_VISIBLE`/`SH_CHAIN_PREV`) to the version + visible at `td->read_lsn` (set at `mp_fget.c:179`). A swip/LSN fast path that + bypasses `__memp_fget` bypasses this visibility resolution — so it is only + sound for **internal btree pages, which are not multi-versioned** under + normal operation. Guard the optimistic path with + `atomic_read(&mfp->multiversion) == 0` (as `__bam_get_root` already does at + `bt_search.c:188, 226`) **or** restrict it to `P_IBTREE` frames that are + provably singletons (`SH_CHAIN_SINGLETON(bhp, vc)`). + +4. **Freeze / thaw.** Mid-chain MVCC victims are frozen + (`__memp_bh_freeze`, `mp_alloc.c:545`) and re-materialized + (`__memp_bh_thaw`); a frozen buffer's data is gone. `BH_WIRED` must be + mutually exclusive with `BH_FROZEN` — the evictor's freeze branch + (`mp_alloc.c:540–566`) and the fget thaw branches (`mp_fget.c:307, 360+`) + must treat wired as "never freeze," else a swip could point at a frozen + header whose `buf` is invalid. Wiring should also force `SH_CHAIN_SINGLETON`. + +5. **`last_pgno` / extend races and `DB_MPOOL_NEW`.** The optimistic path reads + a child `pgno` that a concurrent allocation could be creating + (`mp_fget.c` `newpg:`/`extending`); the LSN recheck on the *parent* covers + structural change, but the swip-fill must take the normal latched fget for + any not-yet-resident child. + +6. 
**`priority` is read locklessly during stats and reset.** Repurposing + `priority` for cool-state must keep `__memp_reset_lru`'s assumptions or + remove that function entirely; leaving both the old sweep (`mp_fput.c:279`) + and a new state machine running would double-account. + +--- + +## Summary (10 lines) + +1. Pin fast path: `__memp_fget` (`mp_fget.c:103`) read-locks `hp->mtx_hash` (`:245`), walks the chain (`:258`), `atomic_inc(&bhp->ref)` (`:296`), drops hash lock (`:304`), `MUTEX_READLOCK(bhp->mtx_buf)` (`:318`). +2. Unpin: `__memp_fput` (`mp_fput.c:65`) `atomic_dec(&bhp->ref)` (`:184`) and the per-access LRU write `bhp->priority = c_mp->lru_priority` (`:215`) + global `++c_mp->lru_priority` (`:269`) → `__memp_reset_lru` sweep (`:279`). +3. Eviction: `__memp_alloc` (`mp_alloc.c:30`) clock-scans `last_checked` buckets (`:182`), picks lowest `priority` singleton (`:320–332`), requires `BH_REFCOUNT==1` (`:485,498`) + `MUTEX_TRYLOCK(bhp->mtx_buf)` exclusive (`:489`), writes dirty (`:507`), freezes MVCC mid-chain (`:545`). +4. `struct __bh` (`mp.h:539`) is plain/unguarded — **`MPOOL_HOTFIELDS_ISOLATED` does not exist**; `flags` (`:553`) has free bits, define `BH_WIRED 0x200` after `BH_THAWED` (`:552`). +5. Descent: **`__bam_iget` does not exist**; use `__bam_get_root` (`bt_search.c:59`) + `__bam_search` loop (`:347`); child pgno from `GET_BINTERNAL(...)->pgno` (`:476`), child `__memp_fget` (`:806`), parent `__memp_fput` (`:810`), latch-couple via `parent_h=h` (`:601`). +6. Page type/LSN: `P_IBTREE=3`/`P_LBTREE=5` (`db_page.h:39/41`), `TYPE()`/`LEVEL()` (`:317/316`), `LSN()` at PAGE byte 0 (`:260/310`); LSN stamped under exclusive latch by the AM, stable while latched. +7. Stage 0: replace `mp_fput.c:215` priority write with a HOT/reference bit, delete `:269` counter + `:279` sweep, make the evictor at `mp_alloc.c:320` do second-chance using `last_checked` (`:182`) as the cool hand. +8. 
Stage 1: mark `P_IBTREE` frames `BH_WIRED` after fget (seam near `bt_search.c:806`), skip them in the evictor (`mp_alloc.c:320,485`); at `bt_search.c:476` snapshot `LSN(h)`, search unlatched, revalidate `LOG_COMPARE(LSN(h))` before trusting `pg`; attach a `roff_t`-with-state swip vector allocated near `mp_region.c:289`. +9. Hazards: shared region forces `roff_t` swips not pointers; optimistic reads need the LSN recheck because `BH_WIRED` stops eviction but not in-place splits; MVCC `read_lsnp` visibility walk (`mp_fget.c:264`) must be bypassed only for non-multiversion singleton internal pages; `BH_WIRED` must exclude `BH_FROZEN`/thaw. +10. Region: `BH`s are `__env_alloc`'d on demand (no dense array); per-region htab built in `__memp_init` (`mp_region.c:286–289`) — add the swip vector there or lazily at wire time, with a new `roff_t` on `struct __mpool`. diff --git a/docs/design/research/os-aio.md b/docs/design/research/os-aio.md new file mode 100644 index 000000000..0cf37cbc0 --- /dev/null +++ b/docs/design/research/os-aio.md @@ -0,0 +1,393 @@ +# OS / mpool I/O survey — basis for an internal async-I/O abstraction + +Read-only survey of `/Users/gregburd/oss/libdb` @ `master`. No source was +modified. All line numbers are from the working tree at survey time +(2026-06-17). Paths are repo-relative. + +--- + +## 1. The OS I/O abstraction (`src/os/`, `src/os_windows/`) + +All page and byte I/O funnels through a single dispatcher, `__os_io`, which +takes an opcode (`DB_IO_READ` / `DB_IO_WRITE`). Everything below it is +**fully synchronous and blocking**. + +Opcodes and the file handle: + +- `src/dbinc/os.h:92-93` — `#define DB_IO_READ 1`, `#define DB_IO_WRITE 2`. +- `src/dbinc/os.h:112-151` — `struct __fh_t` (`DB_FH`). Holds `int fd` + (POSIX) and `HANDLE handle` (Win32, `:130`), plus `db_mutex_t mtx_fh` + (`:124`) that "protects the handle/fd across seek and read/write pairs." + Last-seek bookkeeping `pgno/pgsize/offset` (`:139-141`). 
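The note that `mtx_fh` protects seek and read/write pairs can be illustrated with a minimal POSIX sketch. This is not libdb code: `struct page_fh`, `page_fh_open`, `page_read_locked`, and `page_read_positional` are hypothetical names, and a `pthread_mutex_t` stands in for `mtx_fh`. The seek-then-read variant has to serialize because the file offset is shared state on the descriptor, whereas the `pread(2)` variant carries its own offset and needs no lock.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for DB_FH: a descriptor plus the mutex that
 * guards its shared file offset (the role mtx_fh plays in the survey). */
struct page_fh {
	int fd;
	pthread_mutex_t mtx;
};

/* Open (or create) the file and initialize the offset mutex. */
int
page_fh_open(struct page_fh *fh, const char *path)
{
	assert(fh != NULL && path != NULL);
	if ((fh->fd = open(path, O_RDWR | O_CREAT, 0600)) == -1)
		return (-1);
	return (pthread_mutex_init(&fh->mtx, NULL) == 0 ? 0 : -1);
}

/* Seek + read: two syscalls that share the descriptor's offset, so the
 * pair must be atomic with respect to other threads using this handle. */
ssize_t
page_read_locked(struct page_fh *fh, uint32_t pgno, uint32_t pgsize, void *buf)
{
	ssize_t nr = -1;

	pthread_mutex_lock(&fh->mtx);
	if (lseek(fh->fd, (off_t)pgno * pgsize, SEEK_SET) != (off_t)-1)
		nr = read(fh->fd, buf, pgsize);
	pthread_mutex_unlock(&fh->mtx);
	return (nr);
}

/* Positional read: the offset travels with the call, so no lock is
 * taken and concurrent readers do not disturb each other. */
ssize_t
page_read_positional(struct page_fh *fh, uint32_t pgno, uint32_t pgsize,
    void *buf)
{
	return (pread(fh->fd, buf, pgsize, (off_t)pgno * pgsize));
}
```

Both functions return the byte count of a single transfer. A real caller would also loop on short reads, as the survey notes `__os_read` does.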
+ +### POSIX — `src/os/os_rw.c` + +| Function | Lines | Signature | Notes | +|---|---|---|---| +| `__os_io` | 20-128 | `(ENV*, int op, DB_FH*, db_pgno_t pgno, u_int32_t pgsize, u_int32_t relative, u_int32_t io_len, u_int8_t *buf, size_t *niop)` | Entry point for page I/O. | +| `__os_read` | 134-196 | `(ENV*, DB_FH*, void *addr, size_t len, size_t *nrp)` | Loop over `read(2)`. | +| `__os_write` | 204-229 | `(ENV*, DB_FH*, void *addr, size_t len, size_t *nwp)` | Zero-fill wrapper → `__os_physwrite`. | +| `__os_physwrite` | 236-310 | `(ENV*, DB_FH*, void *addr, size_t len, size_t *nwp)` | Loop over `write(2)`. | +| `__os_seek` | `src/os/os_seek.c:19-66` | `(ENV*, DB_FH*, db_pgno_t, u_int32_t pgsize, off_t relative)` | `lseek(2)`. | + +Behavior of `__os_io` (`os_rw.c:48-101`): if `pread`/`pwrite` are available +and no `j_*` hook / no-zero-fill quirk applies, it computes +`offset = relative ? relative : (off_t)pgno * pgsize` and issues a single +**blocking** `pread`/`pwrite` (`:67-69`, `:88-90`) with no `mtx_fh` held. On +a short transfer or when a hook is installed, it falls through to the `slow:` +label (`:101`) which takes `MUTEX_LOCK(env, fhp->mtx_fh)`, then +`__os_seek` + `__os_read`/`__os_write`, then unlocks (`:101-122`). +`LAST_PANIC_CHECK_BEFORE_IO(env)` (`os.h:104`) is invoked immediately before +every syscall. + +### Open — `src/os/os_open.c` + +`__os_open` (`:18-200`), signature +`(ENV*, const char *name, u_int32_t page_size, u_int32_t flags, int mode, DB_FH **fhpp)`. +Translates `DB_OSO_*` to `open(2)` `oflags`. Relevant flags for AIO: + +- `DB_OSO_DIRECT` → `O_DIRECT` (`:78-80`) and Solaris `directio()` (`:166-170`). + Direct I/O is the natural pairing for io_uring / POSIX aio, because it bypasses the OS page cache. +- `DB_OSO_DSYNC` → `O_DSYNC` + sets `DB_FH_NOSYNC` (`:82-84`, `:148-152`). +- `DB_OSO_SEQ` is accepted in `OKFLAGS` (`:42`) but, notably, **never mapped + to any `posix_fadvise`/`madvise` hint** — there is no readahead plumbing + (see §3).
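Because `DB_OSO_SEQ` is accepted but never turned into a kernel hint, one possible shape for that plumbing on POSIX systems is sketched below. It is an illustration under stated assumptions, not libdb code: `hint_sequential` and `open_for_scan` are hypothetical helpers, and whether `__os_open` is the right place to apply the hint is left open. The sketch uses only `posix_fadvise(2)` with `POSIX_FADV_SEQUENTIAL`, which is advisory and may be ignored by the kernel. Some platforms (macOS, for example) do not provide `posix_fadvise`, so real code would need a configure-time check.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * hint_sequential (hypothetical helper, not part of libdb) --
 *	Advise the kernel that the whole file will be read sequentially.
 *	Returns 0 on success or an errno value; posix_fadvise reports
 *	errors through its return value rather than through errno.
 */
int
hint_sequential(int fd)
{
	/* offset 0 with len 0 means "from the start to the end of file". */
	return (posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL));
}

/*
 * open_for_scan (hypothetical helper) --
 *	Open a file read-only and apply the hint. A failed hint is not
 *	treated as fatal because the advice is only an optimization.
 *	Returns the descriptor, or -1 if open(2) fails.
 */
int
open_for_scan(const char *path)
{
	int fd;

	assert(path != NULL);
	if ((fd = open(path, O_RDONLY)) == -1)
		return (-1);
	(void)hint_sequential(fd);
	return (fd);
}
```

Keeping the hint separate from the open call mirrors how the existing flags are translated one by one, and it lets a caller skip the hint for random-access workloads.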
+ +There is **no async open path and no per-fd AIO context** anywhere in the +handle. `__os_openhandle` (`src/os/os_handle.c`) just wraps `open(2)`. + +### Windows — `src/os_windows/os_rw.c` + +`__os_io` (`:16-93`) is structurally identical but uses an `OVERLAPPED` +struct **with `over.hEvent = 0` and the comment "we don't want asynchronous +notifications"** (`:36`). It calls `ReadFile`/`WriteFile` with the +`OVERLAPPED` purely to pass a 64-bit offset (`:33-35`), then treats the result +synchronously (`:57-71`). `__os_read` (`:103`), `__os_write` (`:163`), +`__os_physwrite` (`:184`) loop on `ReadFile`/`WriteFile` with a NULL +`OVERLAPPED`. The handle (`fhp->handle`) is **not** opened +`FILE_FLAG_OVERLAPPED`, so it cannot today be associated with an IOCP. See +`src/os_windows/os_open.c` (`__os_open`, 7.6 KB) — `CreateFile` is issued +without `FILE_FLAG_OVERLAPPED`. + +**Takeaway:** the entire library reaches disk for data pages through exactly +one choke point, `__os_io(env, DB_IO_{READ,WRITE}, fhp, pgno, pgsize, 0, len, buf, niop)`. +An async abstraction can be introduced beside it without touching callers. + +--- + +## 2. mpool I/O call sites — where sync becomes a candidate for async + +### 2a. Read-in (prefetch candidate) + +`__memp_pgread` — `src/mp/mp_bh.c:217-310`, signature +`(DB_MPOOLFILE *dbmfp, BH *bhp, int can_create)`. + +- Pre-conditions asserted: buffer is held `BH_EXCLUSIVE` (`:233`), and marked + `BH_TRASH` on entry (`:236`). +- The actual blocking read: `mp_bh.c:298-300` + ```c + ret = __os_io(env, DB_IO_READ, dbmfp->fhp, + bhp->pgno, pagesize, 0, pagesize, bhp->buf, &nr); + ``` +- On success clears `BH_TRASH` (`:309`). + +**Where a missing page is read in** — `src/mp/mp_fget.c` (`__memp_fget`): + +- The single demand-read site is the `BH_TRASH` block at + `mp_fget.c:920-933`; the call is `mp_fget.c:927-929`: + ```c + if ((ret = __memp_pgread(dbmfp, + bhp, LF_ISSET(DB_MPOOL_CREATE) ? 
1 : 0)) != 0) + ``` +- The cache-miss buffer is allocated and set up in the `SECOND_MISS` arm + (`mp_fget.c:775-868`): `__memp_alloc` is reached via the `alloc:` label + (`:649-651`), the new `bhp` is latched exclusive + (`MUTEX_LOCK(env, bhp->mtx_buf)` at `:806`, `F_SET(bhp, BH_EXCLUSIVE)` at + `:808`), inserted into the hash bucket, and marked `F_SET(bhp, BH_TRASH)` + at `:865` for the non-extending (real disk read) case. + + **This is the prefetch insertion point.** A `__memp_fget` variant (or a + `DB_MPOOL_PREFETCH` flag) could: allocate + insert the buffer, mark it + `BH_TRASH | BH_EXCLUSIVE`, **submit** an async read via the new interface + instead of calling `__memp_pgread`, and return without blocking. The + completion handler would fill `bhp->buf`, clear `BH_TRASH`, and drop the + exclusive latch — exactly what `__memp_pgread:298-309` does inline today. + Sequential-scan callers (btree `__bam_*`) are the obvious clients. + +### 2b. Eviction / sync / trickle writes (writeback candidates) + +`__memp_pgwrite` — `src/mp/mp_bh.c:315-510` (static), reached only through +`__memp_bhwrite` (`mp_bh.c:20-209`). The blocking write is `mp_bh.c:469-471`: +```c +ret = __os_io(env, DB_IO_WRITE, dbmfp->fhp, bhp->pgno, + mfp->pagesize, 0, mfp->pagesize, buf, &nw); +``` +Three callers of `__memp_bhwrite`, each an async-writeback candidate: + +1. **Eviction (allocation pressure)** — `src/mp/mp_alloc.c:509` + `ret = __memp_bhwrite(dbmp, hp, bh_mfp, bhp, 0);` inside `__memp_alloc`, + in the `this_buffer:` arm (`:473-533`). The victim is latched exclusive via + `MUTEX_TRYLOCK(env, bhp->mtx_buf)` (`:489`) + `F_SET(bhp, BH_EXCLUSIVE)` + (`:492`); a dirty victim is written synchronously before reuse + (`dirty_eviction`, `:505-533`). Candidate for async writeback **with a + pending-completion barrier before the slot is handed out**. + +2. 
**Checkpoint / explicit sync & trickle** — `src/mp/mp_sync.c`, + `__memp_sync_int` (proto `src/dbinc_auto/mp_ext.h:99`: + `(ENV*, DB_MPOOLFILE*, u_int32_t, u_int32_t flags, u_int32_t *wrote, int *interrupted)`). + The write loop is `mp_sync.c:477-...`; the write call is `mp_sync.c:562-563` + `t_ret = __memp_bhwrite(dbmp, hp, mfp, bhp, 1)`. Buffers are gathered into + `bharray`, sorted by `__bhcmp` for file/page order (`:451-452`) — already a + batch, ideal for a single async **submit-many / reap** cycle. + +3. **Trickle (background cleaner)** — `src/mp/mp_trickle.c`, `__memp_trickle` + (static, `:50-...`) computes `need_clean` then calls + `__memp_sync_int(env, NULL, need_clean, DB_SYNC_TRICKLE | DB_SYNC_INTERRUPT_OK, &wrote, NULL)`. + This is the lowest-risk first target for async writeback: it is already + "best effort," already batched, and already interruptible + (`DB_SYNC_INTERRUPT_OK`, `src/dbinc/mp.h:48`; `DB_SYNC_TRICKLE`, + `mp.h:51`). + +Note `mp_sync.c:512-519` already **skips buffers held `BH_EXCLUSIVE`** by +another thread ("come back to it"), and `:505-510` pins each via +`atomic_inc(&bhp->ref)` + `MUTEX_READLOCK(env, bhp->mtx_buf)` — the same +reference/latch discipline an async writer must maintain until completion is +reaped. + +--- + +## 3. Existing read-ahead / prefetch facilities + +**None.** Grep across `src/` for +`prefetch|readahead|read_ahead|posix_fadvise|madvise|MP_FOR_PREFETCH|POSIX_FADV` +returns zero matches. The only related artifacts: + +- `DB_OSO_SEQ` open flag exists (`src/os/os_open.c:42`, `OKFLAGS`) but is + **dead** with respect to I/O hints — never translated to `posix_fadvise` + or `O_*`. +- `DB_MPOOL_*` `__memp_fget` flags (`src/mp/mp_fget.c:58-59`): + `DB_MPOOL_CREATE | DB_MPOOL_DIRTY | DB_MPOOL_EDIT | DB_MPOOL_LAST | + DB_MPOOL_NEW` — none express "fetch ahead / don't block." There is no + `DB_MPOOL_PREFETCH`. 
+- `DB->stat` / `DB_MPOOL_STAT` (`src/mp/mp_stat.c`) track `st_page_in`, + `st_cache_miss`, `st_page_trickle`, etc., but no readahead counters. + +So a prefetch facility is greenfield: it needs both a new public hint +(`DB_MPOOL_PREFETCH` or a `DB_MPOOLFILE->prefetch(pgno, npages)` method) and +the async submit path of §2a. + +--- + +## 4. Proposed minimal internal async-I/O abstraction (`src/os/os_aio.c`) + +### Design constraints discovered + +- Single choke point (`__os_io`) → the abstraction can be a sibling layer. +- Buffers are page-sized, page-aligned `bhp->buf` regions; offset is + `pgno * pgsize`. The op already carries everything io_uring/aio needs. +- **Multi-process shared mpool is the hard constraint** (see §4c). + +### 4a. Interface (platform-neutral) + +``` +os_aio_env_create(env) -> per-ENV aio context (ring/kqueue/iocp/aiocb pool) +os_aio_env_destroy(env) +os_aio_submit_read (ctx, fhp, pgno, pgsize, buf, cookie) -> handle | EAGAIN +os_aio_submit_write(ctx, fhp, pgno, pgsize, buf, cookie) -> handle | EAGAIN +os_aio_reap(ctx, min, max, completions[], timeout) -> n completions +os_aio_cancel(ctx, handle) -> best-effort +``` + +`cookie` carries `{ bhp, hp, mfp }` so the completion handler can run the +exact tail of `__memp_pgread` (clear `BH_TRASH`, drop latch) or +`__memp_pgwrite` (clear `BH_DIRTY`, `atomic_dec(&hp->hash_page_dirty)`, +`atomic_dec(&bhp->ref)`, drop latch). The context lives off `ENV` alongside +`env->mp_handle`, created in `__env_open` and torn down in `__env_close`. + +### 4b. Backend mapping + +| Backend | Submit | Complete / reap | Notes | +|---|---|---|---| +| **Linux io_uring** | `io_uring_get_sqe` + `io_uring_prep_read/write` (offset, `bhp->buf`, `pgsize`) + `io_uring_submit` | `io_uring_wait_cqe` / `io_uring_peek_batch_cqe`; `user_data` = cookie | One ring per ENV; pairs with `O_DIRECT` (`DB_OSO_DIRECT`). Cleanest fit. 
| +| **BSD/macOS kqueue + aio** | `aio_read`/`aio_write` on an `aiocb` whose `aio_sigevent` uses `SIGEV_KEVENT` to deliver an `EVFILT_AIO` event to the kqueue | `kevent()`; `kev.udata`/`aiocbp` = cookie | macOS aio is limited & no `O_DIRECT` (use `F_NOCACHE`); modest depth. | +| **Windows overlapped/IOCP** | `ReadFile`/`WriteFile` with `OVERLAPPED` on a `FILE_FLAG_OVERLAPPED` handle bound to an IOCP | `GetQueuedCompletionStatus(Ex)`; `lpOverlapped`→cookie | **Requires reopening the handle with `FILE_FLAG_OVERLAPPED`** — today `os_windows/os_open.c` does not set it, and `os_rw.c:36` deliberately uses `hEvent=0` synchronous overlapped. | +| **POSIX aio fallback** | `aio_read`/`aio_write`/`lio_listio` on `aiocb` array | `aio_suspend` + `aio_error`/`aio_return`, or `SIGEV_THREAD` callback | Portable floor; often thread-pool emulated. Last resort. | + +A `j_aio_*` global hook (mirroring the existing `DB_GLOBAL(j_pread)` etc. in +`os_rw.c`) lets the test harness stub the backend. + +### 4c. Per-process AIO context vs. shared mpool — the correctness pivot + +AIO contexts (io_uring ring fds, kqueues, IOCP handles, `aiocb`s) are +**per-process**. The mpool buffer-cache region is **shared across processes**. +Therefore a buffer marked "I/O in flight" by process A **cannot have its +completion reaped by process B**. Resolution: + +1. Async submit is only ever issued by the process that **holds + `bhp->mtx_buf` exclusive** (reads: `mp_fget.c:806`/`mp_bh.c:233`; writes: + the exclusive/pinned victim in `mp_alloc.c:489-492` and the pinned buffer + in `mp_sync.c:505-526`). The latch is held across submit→reap and released + **only by the submitting process in its completion handler**. + +2. Any **other** process/thread that wants that page does **not** try to reap + foreign completions. It uses the **existing in-transit wait path**: it + blocks on `bhp->mtx_buf`, then re-checks `BH_TRASH`.
That path already + exists and is unchanged: + - Waiter acquires the latch: `mp_fget.c:309-319` + (`MUTEX_LOCK`/`MUTEX_READLOCK(env, bhp->mtx_buf)`), with the + shared→exclusive upgrade-on-`BH_TRASH` dance at `mp_fget.c:322-329`. + The `BH_TRASH` re-check after acquiring is `mp_fget.c:324` and the + `revive`/`break` re-read at `mp_fget.c:485-505`. + - The flush/eviction scan simply **skips** buffers another process holds + `BH_EXCLUSIVE` (`mp_sync.c:512-519`) and pins+waits on `mtx_buf` + otherwise (`mp_sync.c:505-526`). + + So the fallback is "block on `bhp->mtx_buf` and re-read on wake" — the same + semantics used today when one thread is mid-`__memp_pgread` and another + requests the page. Async simply widens the window during which the latch is + held; it does not introduce a new cross-process reaping requirement. + +3. **Crash/abandonment:** if the submitting process dies with an op in flight, + the buffer stays `BH_TRASH`+latched. This is the same failure surface as a + process dying inside synchronous `__memp_pgread` today, handled by existing + environment-recovery / dead-process latch cleanup (`mtx_buf` recovery on + `DB_RECOVER`). No new invariant. + +--- + +## 5. WAL-rule enforcement before a dirty page write (must be preserved) + +The "log-flush-before-page-write" (WAL) rule is enforced in two places that +async writeback **must continue to honor before submitting a write**: + +1. **Per-page, inside `__memp_pgwrite`** — `src/mp/mp_bh.c:347-354`: + ```c + if (LOGGING_ON(env) && mfp->lsn_off != DB_LSN_OFF_NOTSET && + !IS_CLIENT_PGRECOVER(env)) { + memcpy(&lsn, bhp->buf + mfp->lsn_off, sizeof(DB_LSN)); + if (!IS_NOT_LOGGED_LSN(lsn) && + (ret = __log_flush(env, &lsn)) != 0) + goto err; /* do NOT write the page */ + } + ``` + It reads the page's LSN at `mfp->lsn_off` (`src/dbinc/mp.h:498`) and forces + the log up to that LSN via `__log_flush` before the `__os_io(DB_IO_WRITE)` + at `mp_bh.c:469`. 
A `DIAGNOSTIC` assertion re-verifies the durable log LSN + (`lp->s_lsn`) has passed the page LSN (`mp_bh.c:356-405`). + +2. **Batch pre-flush, inside `__memp_sync_int`** — `src/mp/mp_sync.c:464-473`: + ```c + if (LOGGING_ON(env) && (ret = __log_flush(env, NULL)) != 0) + goto err; + ``` + A whole-log flush before the write loop, as an optimization; the per-page + check in `__memp_pgwrite` still runs because pages may be re-dirtied after + this flush. + +**Implication for async:** the `__log_flush(env, &lsn)` at `mp_bh.c:352` must +complete (synchronously, or as an ordering dependency) **before** the async +write SQE for that page is submitted. The clean integration is to keep the +WAL check at the top of the write submit routine — i.e. factor +`__memp_pgwrite` into `prep (WAL + pgout) → submit → finish`, where `prep` +retains lines `347-405` verbatim and only the `__os_io` at `469` is replaced +by `os_aio_submit_write`. The dirty-state bookkeeping in the `err:`/ +`file_dead:` tail (`mp_bh.c:480-507`: clear `BH_DIRTY`, decrement +`hash_page_dirty`) moves into the completion handler. + +--- + +## Summary (10 lines) + +1. All data-page I/O passes through one synchronous choke point, + `__os_io(DB_IO_READ|DB_IO_WRITE, fhp, pgno, pgsize, …)` in + `src/os/os_rw.c:20` (and the Win32 twin in `src/os_windows/os_rw.c:16`). +2. Reads block in `__memp_pgread` (`mp_bh.c:298`); writes block in + `__memp_pgwrite` (`mp_bh.c:469`), reached only via `__memp_bhwrite`. +3. The demand read-in site is `__memp_fget`'s `BH_TRASH` block, + `mp_fget.c:927`; the buffer is allocated/latched/marked at + `mp_fget.c:806-865` — the natural **prefetch submit** point. +4. Writeback candidates: eviction (`mp_alloc.c:509`), checkpoint/sync + (`mp_sync.c:562`), and trickle (`mp_trickle.c` → `__memp_sync_int`); + trickle is the safest first target (already batched + interruptible). +5. 
There is **no** existing prefetch/readahead — zero matches for + `prefetch|readahead|fadvise|madvise`; `DB_OSO_SEQ` is accepted but unused. +6. Proposed `os_aio` interface: per-ENV context + + `submit_read/submit_write/reap/cancel`, with a `{bhp,hp,mfp}` cookie that + runs the existing pgread/pgwrite completion tail. +7. Backends map cleanly: io_uring (prep+submit/CQE), kqueue+aio + (`aiocb`/`EVFILT_AIO`/`kevent`), Windows IOCP (`OVERLAPPED`/ + `GetQueuedCompletionStatus` — needs `FILE_FLAG_OVERLAPPED`), POSIX aio + (`lio_listio`/`aio_suspend`) as the floor. +8. AIO contexts are per-process but mpool is shared, so only the process + holding `bhp->mtx_buf` exclusive submits/reaps; other processes fall back + to the **existing in-transit wait** (`MUTEX_*LOCK(bhp->mtx_buf)` at + `mp_fget.c:309-329`, skip-if-`BH_EXCLUSIVE` at `mp_sync.c:512-519`). +9. The WAL rule (`__log_flush(env,&lsn)` at `mp_bh.c:352`, plus batch flush at + `mp_sync.c:472`) must be evaluated **before** a write is submitted; refactor + `__memp_pgwrite` into prep→submit→finish keeping `mp_bh.c:347-405` intact. +10. Net: introduce `os_aio` beside `__os_io`, add a `DB_MPOOL_PREFETCH` + fetch path, and convert trickle→sync→eviction writeback incrementally, + with no change to the cross-process latch invariants. + +--- + +## Proposed `os_aio` interface — C header sketch + +```c +/*- + * src/dbinc/os_aio.h (PROPOSED — not yet implemented) + * + * Internal asynchronous page-I/O abstraction layered beside __os_io(). + * One context per ENV; submit/reap is single-process. A buffer with an + * I/O in flight is held BH_EXCLUSIVE by the submitting process; other + * processes never reap foreign completions and instead block on + * bhp->mtx_buf (the existing in-transit wait). WAL: callers MUST have + * satisfied the log-flush-before-write rule (mp_bh.c __memp_pgwrite) + * prior to calling __os_aio_submit() with a DB_AIO_WRITE op.
+ */ +#define DB_AIO_READ 1 /* mirrors DB_IO_READ */ +#define DB_AIO_WRITE 2 /* mirrors DB_IO_WRITE */ + +struct __db_aio_op { /* one submitted operation */ + DB_FH *fhp; /* target file handle */ + db_pgno_t pgno; /* page number (offset = pgno*pgsize) */ + u_int32_t pgsize; /* transfer length == page size */ + u_int8_t *buf; /* page buffer (bhp->buf) */ + void *cookie; /* opaque: {BH*, DB_MPOOL_HASH*, MPOOLFILE*} */ + int op; /* DB_AIO_READ | DB_AIO_WRITE */ +}; + +struct __db_aio_completion { + void *cookie; /* as submitted */ + size_t nbytes; /* bytes transferred */ + int ret; /* 0 or __os_posix_err()-style code */ +}; + +typedef struct __db_aio_ctx DB_AIO_CTX; /* opaque per-ENV backend state */ + +/* Lifecycle — created in __env_open, destroyed in __env_close. */ +int __os_aio_env_create __P((ENV *, DB_AIO_CTX **)); +void __os_aio_env_destroy __P((ENV *, DB_AIO_CTX *)); + +/* Submit one op. Returns 0, or EAGAIN when the queue is full (caller + * then falls back to synchronous __os_io and proceeds). */ +int __os_aio_submit __P((ENV *, DB_AIO_CTX *, struct __db_aio_op *)); + +/* Reap between min and max completions; honors DB_SYNC_INTERRUPT_OK-style + * cancellation via a timeout. Returns count in *np. */ +int __os_aio_reap __P((ENV *, DB_AIO_CTX *, u_int32_t min, u_int32_t max, + struct __db_aio_completion *, u_int32_t *np, + db_timeout_t timeout)); + +/* Best-effort cancel (used on env shutdown / interrupt). */ +int __os_aio_cancel __P((ENV *, DB_AIO_CTX *, struct __db_aio_op *)); + +/* + * Backend selection (compile/config time): + * HAVE_AIO_IO_URING -> io_uring (Linux; pair with DB_OSO_DIRECT) + * HAVE_AIO_KQUEUE -> kqueue + aio(4) (BSD/macOS; F_NOCACHE) + * HAVE_AIO_IOCP -> overlapped/IOCP (Windows; needs FILE_FLAG_OVERLAPPED) + * HAVE_AIO_POSIX -> POSIX aio (portable fallback / floor) + * Test override: DB_GLOBAL(j_aio_*) hooks, mirroring j_pread/j_pwrite. + */ +```
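To make the submit/reap contract in the header sketch concrete, here is a minimal stand-alone emulation of the kind a test stub might provide. It is a sketch under stated assumptions, not the proposed implementation: every `demo_aio_*` name is hypothetical, a raw descriptor stands in for `DB_FH`, the I/O is performed synchronously at submit time with `pread`/`pwrite`, and the `min` and `timeout` parameters of the reap call are omitted for brevity. It shows a bounded queue returning `EAGAIN` when full and the cookie travelling from the op to its completion.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_AIO_READ	1
#define DEMO_AIO_WRITE	2
#define DEMO_AIO_DEPTH	4	/* tiny queue so EAGAIN is easy to hit */

struct demo_aio_op {		/* simplified analogue of __db_aio_op */
	int fd;			/* raw descriptor instead of DB_FH */
	uint32_t pgno, pgsize;
	uint8_t *buf;
	void *cookie;
	int op;
};

struct demo_aio_completion {
	void *cookie;		/* as submitted */
	size_t nbytes;		/* bytes transferred */
	int ret;		/* 0 or an errno value */
};

struct demo_aio_ctx {
	struct demo_aio_completion done[DEMO_AIO_DEPTH];
	uint32_t ndone;		/* completions parked, not yet reaped */
};

/* Submit: performs the I/O immediately and parks the result until it is
 * reaped. Returns EAGAIN when the queue is full, so the caller can fall
 * back to a synchronous path. */
int
demo_aio_submit(struct demo_aio_ctx *ctx, const struct demo_aio_op *op)
{
	struct demo_aio_completion *c;
	off_t off;
	ssize_t n;

	assert(op->op == DEMO_AIO_READ || op->op == DEMO_AIO_WRITE);
	if (ctx->ndone == DEMO_AIO_DEPTH)
		return (EAGAIN);

	off = (off_t)op->pgno * op->pgsize;
	n = op->op == DEMO_AIO_READ ?
	    pread(op->fd, op->buf, op->pgsize, off) :
	    pwrite(op->fd, op->buf, op->pgsize, off);

	c = &ctx->done[ctx->ndone++];
	c->cookie = op->cookie;
	c->nbytes = n < 0 ? 0 : (size_t)n;
	c->ret = n < 0 ? errno : 0;
	return (0);
}

/* Reap: hand back up to max parked completions, oldest first. */
int
demo_aio_reap(struct demo_aio_ctx *ctx, uint32_t max,
    struct demo_aio_completion *out, uint32_t *np)
{
	uint32_t i, n;

	n = ctx->ndone < max ? ctx->ndone : max;
	for (i = 0; i < n; i++)
		out[i] = ctx->done[i];
	for (i = n; i < ctx->ndone; i++)	/* shift the remainder down */
		ctx->done[i - n] = ctx->done[i];
	ctx->ndone -= n;
	*np = n;
	return (0);
}
```

A caller would zero-initialize a `struct demo_aio_ctx`, submit ops until it sees `EAGAIN`, then reap and dispatch on each completion's cookie. A real backend would differ mainly in that the transfer happens between submit and reap rather than inside submit.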