diff --git a/docs/design/buffer-swip-aio.md b/docs/design/buffer-swip-aio.md
new file mode 100644
index 000000000..486f00343
--- /dev/null
+++ b/docs/design/buffer-swip-aio.md
@@ -0,0 +1,310 @@
+# Scalable buffer access: tagged swip, optimistic descent, and async I/O
+
+Status: design / plan of record. Companion to
+[`scaling-findings.md`](scaling-findings.md). Code anchors verified against
+`master` and recorded in `research/mpool-btree.md` and `research/os-aio.md`.
+
+## 1. Problem
+
+Measured (24-core Xeon, `lab/bench/scale_bench`): in-cache random reads peak at
+~8 threads and then *negatively scale*. `perf` self-time is dominated by
+`futex` + `__atomic_inc/dec`. The cause is not a lock we can shard: every
+B-tree descent re-pins the root and upper internal pages, and each pin performs
+**atomic read-modify-writes on shared words** for a page touched by every thread
+on every operation:
+
+1. the bucket latch share-count (`hp->mtx_hash`, already a shared latch),
+2. the buffer pin count (`bhp->ref`, `mp_fget.c:296`),
+3. the buffer latch share-count (`bhp->mtx_buf`, `mp_fget.c:318`),
+
+plus a fourth write on the way out: the per-access LRU bump
+(`bhp->priority = c_mp->lru_priority`, `mp_fput.c:215`).
+
+A controlled A/B already showed that **cache-line isolation of those fields
+makes no measurable difference** (±0.6%): the cost is *true* sharing of the
+words themselves, not false sharing of neighbours.
+
+### The hardware rule that governs >300-core NUMA
+
+A **read** of a shared cache line is nearly free — the line is replicated in
+Shared state in every socket simultaneously. A **write** to a shared line
+invalidates every replica and forces a cross-socket read-for-ownership. At
+hundreds of cores across many sockets, any per-operation write to a shared line
+is the scaling wall. Therefore:
+
+> **The read path must perform zero stores to shared memory.**
+
+Sharded/per-core pin counts only relocate one of four writes and still store on
+every read; they cannot reach this bar. Only **optimistic, version-validated
+reads** (readers write nothing) do. That is what LeanStore's Optimistic Lock
+Coupling achieves and is the design we adopt.
+
+## 2. Principles
+
+1. **Write-free reads of hot pages.** Reads of resident root/internal pages
+   take no latch share-count, bump no pin count, and write no LRU field. They
+   validate a version instead.
+2. **Exploit the structure instead of generalizing.** The hot pages are the
+   B-tree root and upper internal nodes: a tiny fraction of all pages, almost
+   never evicted, almost never modified. We make exactly those pages cheap and
+   leave leaves on the existing path. This sidesteps the hardest LeanStore
+   prerequisite (general epoch reclamation) — see §5.
+3. **One substrate for everything.** A single tagged child pointer (the *swip*)
+   carries replacement state (hot/cool/cold), residency, and async-I/O state
+   (in-flight). Replacement, scan resistance, prefetch, and the descent fast
+   path all read the same word. Build it once.
+4. **Measure each stage** on the 24-core box before/after; do not enable a
+   default-on change without the full TCL regression.
+
+## 3. The swip (tagged `roff_t`)
+
+A *swip* is a child reference stored by a parent internal frame. BDB buffer
+frames are aligned, so the low bits of a region offset (`roff_t`) are free
+(3 with 8-byte alignment, 6 with cache-line alignment). We tag them exactly as
+LeanStore tags its pointer:
+
+| state | low bits | payload (high bits) | descent action |
+|---|---|---|---|
+| `COLD` | `00` | — (child is the on-disk `pgno` in the page image) | normal `__memp_fget` path; optionally issue prefetch |
+| `HOT` | `01` | `roff_t` of the buffer frame | follow directly, read under LSN validation — **pure read** |
+| `COOL` | `10` | `roff_t` of the buffer frame | follow; reheat to `HOT` (a rare write, only at the hot/cool boundary) |
+| `IN_FLIGHT` | `11` | `roff_t` of the reserved frame | async read outstanding — wait on the in-transit latch |
+
+Two BDB-specific constraints shape where the swip lives:
+
+1. **Multi-process shared region.** A raw swizzled pointer is valid in only one
+   address space. The swip is therefore a **`roff_t`**, resolved with
+   `R_ADDR(region, off)` (base + offset). This works for both the >300-core
+   single-process-multithreaded target and BDB's classic multi-process mode.
+   Raw-pointer swizzling stays available later as a single-process-only fast
+   path (Stage 3), not now.
+2. **Pages are persisted images.** A btree internal page stores child *page
+   numbers* on disk (`BINTERNAL.pgno`, `bt_search.c:476`), and that image lives
+   in the buffer. We must not overwrite a child slot with a swip or we would
+   persist a transient pointer. So the swip lives in an **in-memory shadow
+   vector** attached to the internal frame (parallel to the page's child slots),
+   allocated lazily at wire time near `mp_region.c:289`, not in the page image.
+
+Reads of a `HOT` swip are pure loads (Shared-state, replicated per socket).
+Writes happen only on rare transitions (swizzle/unswizzle, hot↔cool, I/O). The
+root and upper internals stay `HOT`, so they are never written on the hot path.
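+
+A minimal sketch of that encoding, assuming a 2-bit tag and a frame offset that
+is at least 4-byte aligned. `swip_t` and the `swip_*` helpers are hypothetical
+names for illustration, not identifiers in the tree; `swip_resolve` only mimics
+the base + offset arithmetic of `R_ADDR`.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+/* Hypothetical stand-in for a tagged roff_t; not the type used in the tree. */
+typedef uintptr_t swip_t;
+
+/* Tag values follow the table above. */
+enum swip_state {
+    SWIP_COLD = 0, SWIP_HOT = 1, SWIP_COOL = 2, SWIP_IN_FLIGHT = 3
+};
+
+#define SWIP_TAG_MASK ((swip_t)0x3)
+
+/* Pack a frame offset and a state into one word. */
+static swip_t
+swip_make(swip_t frame_off, enum swip_state st)
+{
+    /* The low two bits must be free, i.e. the frame is >= 4-byte aligned. */
+    assert((frame_off & SWIP_TAG_MASK) == 0);
+    return (frame_off | (swip_t)st);
+}
+
+static enum swip_state
+swip_state_of(swip_t s)
+{
+    return ((enum swip_state)(s & SWIP_TAG_MASK));
+}
+
+static swip_t
+swip_offset_of(swip_t s)
+{
+    return (s & ~SWIP_TAG_MASK);
+}
+
+/* Resolve against a region base: base + offset, as R_ADDR does. */
+static void *
+swip_resolve(void *region_base, swip_t s)
+{
+    return ((uint8_t *)region_base + swip_offset_of(s));
+}
+```
+
+Because the tag sits in bits the alignment already guarantees are zero, reading
+a `HOT` swip and resolving it costs one load, one mask, and one add.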
+
+## 4. Replacement = the hot/cool/cold state machine (Stage 0)
+
+Today every `__memp_fput` writes `bhp->priority` (`mp_fput.c:215`), bumps a
+global counter (`mp_fput.c:269-271`), and a periodic `__memp_reset_lru`
+(`mp_fput.c:279-331`) sweeps the whole cache. That is a shared write per access
+and an O(cache) sweep. Replace it with a clock / second-chance scheme whose
+state is the swip's 2 bits (plus a per-frame referenced bit in `flags`):
+
+- **Read of a HOT page**: at most set a "referenced" bit, and only with a plain
+  store *if currently clear* (read-first, so the steady-state hot read is a pure
+  read). No global counter, no per-access priority write.
+- **Eviction** (`__memp_alloc`, clock hand `c_mp->last_checked` at
+  `mp_alloc.c:182`): the hand demotes `HOT`→`COOL` (clears referenced / moves to
+  a cooling FIFO) and evicts `COOL` pages whose referenced bit is still clear.
+  Touching a `COOL` page reheats it to `HOT` (second chance).
+- **Scan resistance** (the property we want): bulk-scanned pages enter `COOL`
+  and cycle out via the cooling hand without displacing the `HOT` working set —
+  the swip state *is* the scan-resistance mechanism, as in LeanStore. The hot
+  working set (root/upper internals) stays `HOT` and is never churned.
+
+This removes write #4 from the read path and deletes the global-counter +
+`__memp_reset_lru` machinery. It is independent of the descent/AIO work and is
+the foundation the swip sits on, so it ships first.
+
+Invariant preserved: eviction still requires that nobody else holds a pin (the
+recheck expects `BH_REFCOUNT == 1`, the evictor's own transient reference) and
+that the buffer latch is held exclusive (`mp_alloc.c:485/489/498`). We only
+change *which* victim is chosen and *how* recency is tracked, never the safety
+handshake.
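+
+One possible reading of the state machine above, as a single-threaded sketch.
+`struct frame`, `frame_touch`, and `clock_evict` are hypothetical names; the
+real evictor walks hash buckets under latches, which this toy model omits. A
+referenced frame gets a second chance, an unreferenced `HOT` frame is demoted,
+and only an unreferenced, unpinned `COOL` frame is evicted.
+
+```c
+#include <assert.h>
+#include <stdbool.h>
+#include <stddef.h>
+
+enum warmth { FRAME_HOT, FRAME_COOL };
+
+/* Hypothetical per-frame state; `pins` stands in for BH_REFCOUNT. */
+struct frame {
+    enum warmth state;
+    bool referenced;
+    int pins;
+};
+
+/* Read-side touch: load first, store only if the bit is currently clear. */
+static void
+frame_touch(struct frame *f)
+{
+    if (!f->referenced)
+        f->referenced = true;
+}
+
+/* Advance the hand until a victim is found; return its index, or -1. */
+static int
+clock_evict(struct frame *frames, size_t n, size_t *hand)
+{
+    size_t idx, steps;
+
+    if (n == 0)
+        return (-1);
+    /* Three passes suffice: clear the bit, demote, then evict. */
+    for (steps = 0; steps < 3 * n; steps++) {
+        idx = *hand;
+        *hand = (*hand + 1) % n;
+        if (frames[idx].pins != 0)
+            continue;                      /* never evict a pinned frame */
+        if (frames[idx].referenced) {
+            frames[idx].referenced = false; /* second chance */
+            frames[idx].state = FRAME_HOT;  /* keep hot, or reheat */
+            continue;
+        }
+        if (frames[idx].state == FRAME_HOT) {
+            frames[idx].state = FRAME_COOL; /* demote; evictable next visit */
+            continue;
+        }
+        return ((int)idx);                 /* COOL, unreferenced, unpinned */
+    }
+    return (-1);
+}
+```
+
+In steady state a hot frame already has its bit set, so `frame_touch` performs
+a load and no store, which is the property the read path needs.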
+
+## 5. Optimistic descent + `BH_WIRED` (Stage 1)
+
+`__bam_search` descends parent→child with latch coupling: it fetches the child
+(`bt_search.c:806`), then releases the parent (`bt_search.c:810`). A leaf is
+recognized by `TYPE(h) == P_LBTREE` (`bt_search.c:348`; `P_IBTREE=3`, `P_LBTREE=5`).
+
+**`BH_WIRED` (`0x200`, first free flag bit, `mp.h:552`).** When the btree
+fetches a `P_IBTREE` page, mark its buffer wired; the evictor skips wired
+buffers (`mp_alloc.c:320` and `:485`). Because internal nodes are <1% of pages
+and always hot, the memory cost is negligible and they would never be evicted
+anyway. **Wiring converts "safely reclaim arbitrary pages" (general epoch/RCU)
+into "these specific pages are never freed"** — so an optimistic reader of a
+wired internal page can never dereference freed memory. This is the trick that
+avoids boiling the ocean.
+
+**LSN-validated optimistic read.** At the child-pointer read (`bt_search.c:476`)
+of a wired internal parent:
+1. snapshot `LSN(h)` (the page LSN, bytes 0–7 of the page, updated under the
+   page's exclusive latch on every structural change),
+2. read/binary-search the node *without* the pin/latch,
+3. re-read `LSN(h)`; if unchanged, the read was consistent — follow the child
+   swip; if it moved (a rare split/merge), fall back to the existing
+   pinned+latched path and retry.
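+
+The three steps can be sketched with C11 atomics. Everything here is
+illustrative: `struct node` is a toy layout, the LSN is modelled as one 64-bit
+word, and the search is linear rather than binary for brevity. The sketch
+assumes the writer side upholds the LSN invariant in §8 and does not model
+detection of a writer that is mid-update between the two loads.
+
+```c
+#include <assert.h>
+#include <stdatomic.h>
+#include <stdbool.h>
+#include <stdint.h>
+
+#define NODE_SLOTS 4
+
+/* Toy internal node; `lsn` stands in for LSN(h). Hypothetical layout. */
+struct node {
+    _Atomic uint64_t lsn;
+    _Atomic uint32_t keys[NODE_SLOTS];   /* keys[0] acts as minus infinity */
+    _Atomic uint32_t child[NODE_SLOTS];
+};
+
+/* Step 1: snapshot the version before touching node contents. */
+static uint64_t
+node_snapshot(struct node *n)
+{
+    return (atomic_load_explicit(&n->lsn, memory_order_acquire));
+}
+
+/* Step 3: re-read the version; true means the read was consistent. */
+static bool
+node_validate(struct node *n, uint64_t before)
+{
+    /* Order the preceding data loads before the version re-read. */
+    atomic_thread_fence(memory_order_acquire);
+    return (atomic_load_explicit(&n->lsn, memory_order_relaxed) == before);
+}
+
+/* Steps 1-3: find the child for `key` without a pin or latch. */
+static bool
+optimistic_child(struct node *n, uint32_t key, uint32_t *child_out)
+{
+    uint64_t before = node_snapshot(n);
+    uint32_t child;
+    int i, slot = 0;
+
+    /* Step 2: search; the bytes may be stale until validated. */
+    for (i = 1; i < NODE_SLOTS; i++)
+        if (atomic_load_explicit(&n->keys[i], memory_order_relaxed) <= key)
+            slot = i;
+    child = atomic_load_explicit(&n->child[slot], memory_order_relaxed);
+
+    if (!node_validate(n, before))
+        return (false);       /* caller falls back to the latched path */
+    *child_out = child;       /* publish only a validated result */
+    return (true);
+}
+```
+
+A `false` return corresponds to the fallback in step 3: the caller discards the
+result and retries through the existing pinned and latched path.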
+
+For a wired parent we skip the child `__memp_fget` pin and the parent
+`__memp_fput` (`bt_search.c:810`). The child swip (the `roff_t` of the resident
+child frame) is cached in the parent's shadow vector at `:476`; a `HOT` swip is
+followed with no hash lookup and no bucket latch — removing writes #1, #2, #3 on
+the internal levels. Leaves keep the normal pin+latch (numerous, evictable, not
+a single hot line).
+
+**Gating (from the survey).**
+- Only on `mfp->multiversion == 0` (mirror `__bam_get_root` at `:188/226`); the
+  MVCC version-chain visibility walk (`mp_fget.c:264`) must not be bypassed.
+- `BH_WIRED` mutually exclusive with `BH_FROZEN`/thaw.
+- Binary search over the node must tolerate transiently inconsistent bytes
+  (it is validated by the post-read LSN re-check; never act on a result that
+  fails revalidation).
+- Swips are invalidated on `__bam_split` / `__bam_pinsert`, keyed off the parent
+  LSN change the optimistic reader already checks.
+
+## 6. Async I/O, prefetch, trickle (Stage 2)
+
+All data-page I/O funnels through one synchronous choke point: `__os_io`
+(`os/os_rw.c:20`, `pread`/`pwrite`). Read-in is `__memp_pgread`
+(`mp_bh.c:298`); writeback is `__memp_pgwrite` (`mp_bh.c:469`) via
+`__memp_bhwrite`, reached from eviction (`mp_alloc.c:509`), checkpoint/sync
+(`mp_sync.c:562`), and trickle (`mp_trickle.c`). There is no prefetch/readahead
+today (no `fadvise`/`madvise` anywhere).
+
+**`os_aio` interface** (new, in the `os/` layer; sketch in `research/os-aio.md`
+and `src/dbinc/os_aio.h`): per-ENV context with `submit_read`, `submit_write`,
+`reap`/poll, `cancel`; completion cookie `{BH*, DB_MPOOL_HASH*, MPOOLFILE*}` so
+the completion handler runs the existing `pgread`/`pgwrite` tail. Backends:
+
+- Linux **io_uring** (prep/submit, reap CQEs; pairs with `O_DIRECT`),
+- BSD/macOS **kqueue + aio** (`aiocb`, `EVFILT_AIO`),
+- Windows **IOCP** (requires reopening files `FILE_FLAG_OVERLAPPED` —
+  `os_windows/os_open.c` does not today),
+- POSIX **aio** fallback; `j_aio_*` test hooks mirroring `j_pread`/`j_pwrite`.
+
+**Prefetch** transitions `COLD → IN_FLIGHT` and submits an async read at the
+miss site (`mp_fget.c:806-865`, where the buffer is allocated and marked
+`BH_TRASH`); the `IN_FLIGHT` swip lets concurrent accessors find the reserved
+frame instead of re-issuing. A cursor/scan hint drives readahead.
+
+**Trickle writeback** is the safest first async target: background flush of
+cool/dirty pages so foreground eviction rarely stalls on a write. Split
+`__memp_pgwrite` into **prep** (WAL flush + `pgout`, keep `mp_bh.c:347-405`
+verbatim) → **submit** (replace `__os_io` at `:469` with `os_aio_submit_write`)
+→ **finish** (move the `BH_DIRTY` clear + `hash_page_dirty` decrement,
+`mp_bh.c:480-507`, into the completion handler).
+
+**WAL rule** (must precede every async write): the per-page `__log_flush` keyed
+on `mfp->lsn_off` (`mp_bh.c:347-354`) and the batch flush (`mp_sync.c:464-473`)
+stay in the *prep* step, before submission. Non-negotiable.
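+
+A toy model of the prep → submit → finish split with the WAL rule enforced at
+submission. All names (`struct wal`, `struct page_buf`, `write_prep`,
+`write_submit`, `write_finish`) are hypothetical and exist only for this
+sketch; it performs no real I/O and stands in for neither `__log_flush` nor the
+`os_aio` interface.
+
+```c
+#include <assert.h>
+#include <stdbool.h>
+#include <stdint.h>
+
+/* Hypothetical log state: everything up to flushed_lsn is durable. */
+struct wal {
+    uint64_t flushed_lsn;
+};
+
+/* Hypothetical buffer state relevant to writeback. */
+struct page_buf {
+    uint64_t page_lsn;
+    bool dirty;
+    bool in_flight;
+};
+
+/* prep: make the log durable up to the page's LSN before any submission. */
+static void
+write_prep(struct wal *w, const struct page_buf *p)
+{
+    if (w->flushed_lsn < p->page_lsn)
+        w->flushed_lsn = p->page_lsn;
+}
+
+/* submit: refuse to queue a page whose log records are not yet durable. */
+static int
+write_submit(const struct wal *w, struct page_buf *p)
+{
+    if (w->flushed_lsn < p->page_lsn)
+        return (-1);              /* WAL rule would be violated */
+    p->in_flight = true;
+    return (0);
+}
+
+/* finish: runs from the completion handler; only now is the page clean. */
+static void
+write_finish(struct page_buf *p)
+{
+    assert(p->in_flight);
+    p->in_flight = false;
+    p->dirty = false;
+}
+```
+
+The page stays dirty between submit and finish, so a crash or a failed write in
+that window never leaves a buffer marked clean without its data on disk.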
+
+**Per-process constraint** (critical, from the survey): AIO contexts are
+per-process but the mpool region is shared, so a page marked in-flight by one
+process cannot be reaped by another. Resolution: **only the process holding
+`bhp->mtx_buf` exclusive submits and reaps**, releasing the latch in its own
+completion handler. Other processes use the existing in-transit wait — block on
+`bhp->mtx_buf`, then re-check `BH_TRASH` (`mp_fget.c:309-329`). Crash with an op
+in flight reuses the existing `mtx_buf` recovery on `DB_RECOVER`. This means the
+high-core single-process target gets full async; multi-process degrades
+gracefully to today's behaviour.
+
+## 7. Stages, exit criteria, ordering
+
+The stages are a dependency chain; each is a separate branch/PR, measured on the
+24-core box.
+
+| stage | branch | delivers | exit criteria |
+|---|---|---|---|
+| **0** | `perf/swip-stage0-cooling` | clock/cool replacement; delete per-access `priority` write + global counter + `__memp_reset_lru`; referenced bit in `flags` | TCL regression green; eviction quality not worse on a cache-pressure workload; `wrand`/`rrand` no regression; **scan-resistance** shown (a scan doesn't evict the hot set) |
+| **1** | `perf/swip-stage1-descent` | `BH_WIRED`; shadow swip vector (`roff_t` + 2 bits); LSN-validated optimistic descent of wired internals | TCL regression green; **read ceiling lifts** on `rrand`/`snap` at 8–24 threads vs master; correctness under concurrent split (stress test) |
+| **2** | `perf/swip-stage2-aio` | `os_aio` interface + 1 backend (io_uring first); async trickle writeback; prefetch hint | TCL green; trickle keeps dirty-eviction stalls down under write load; prefetch improves cold-scan latency; multi-process falls back correctly |
+| **3** | (later, if needed) | general pointer swizzling + epoch reclamation + raw-pointer single-process fast path | only if profiling shows the wired-internals scope leaves gains on the table |
+
+## 8. Correctness invariants (the parts that must not be wrong)
+
+1. Hot pages stay `HOT`, so their swips are read-only on the descent; cooling
+   churns only the boundary.
+2. Every structural modification bumps the page LSN under the page's exclusive
+   latch *before* the change is visible; the optimistic reader's LSN re-check
+   brackets every read of node contents and never acts on an unvalidated result.
+3. Eviction's no-other-pin (`ref == 1` at the recheck) + exclusive-`mtx_buf`
+   handshake is unchanged.
+4. `BH_WIRED` excludes `BH_FROZEN`; optimistic fast path gated on
+   `mfp->multiversion == 0`.
+5. Async writes obey the WAL rule in the prep step; only the latch-holding
+   process reaps its own AIO; swips are never persisted to the page image.
+
+## 9. Measurement plan
+
+Re-run `lab/bench/scale_bench {rrand,sepdb,snap} 200000 3 1 2 4 8 12 16 24` plus
+a cache-pressure workload and a write/trickle workload, on the 24-core Linux box
+(`meh`) — the laptop is too noisy (asymmetric cores) to resolve these deltas.
+Each stage records before/after medians and a `perf` self-time delta in
+`scaling-findings.md`. Stage 1's headline metric: does the 8-thread ceiling lift
+and the `futex`/atomic self-time collapse.
+
+## 10. Open questions
+
+- Shadow swip vector lifetime vs `__bam` page reorganization — invalidate vs
+  rebuild on split/merge (lean toward invalidate-on-LSN-change).
+- Whether to wire every internal level (`LEVEL(h) > LEAFLEVEL`, i.e. ≥2) or
+  only a configurable number of upper levels, under memory accounting.
+- io_uring + `O_DIRECT` alignment interaction with BDB page sizes and the
+  existing `DB_OSO_DIRECT` path.
+- Interaction of trickle/group-commit (#3 in the roadmap) — the async writer and
+  the WAL group-commit should share the log-flush coordination.
+
+## 11. Reference implementations (sqlxtc, noxu) and refinements
+
+Two sibling projects implement LeanStore/Umbra-style cooling buffer managers;
+studying them validates this design and sharpens three points.
+
+**sqlxtc `bufmgr.c`** (libxtc `examples/06_sqlxtc`): frame states
+`FREE/HOT/COOL/LOADED/WRITING`; the eviction state lives in the parent **swip**
+and transitions are owned by whoever wins a CAS (loser retries) — exactly the
+tagged-swip model here. Its `evict_one` clock sweep is the key:
+
+- **Probationary admission + COOL-first eviction.** New pages are admitted
+  `COOL`. The sweep reclaims an already-`COOL` frame and **only cools a `HOT`
+  frame when a full sweep finds no `COOL` victim** (`force_cool`). Because a
+  scan keeps supplying `COOL` pages, *the hot set is never cooled to make room
+  for a scan* — robust scan resistance for a scan of any length. A `COOL` page
+  carries a CLOCK `ref` bit (set on access, cleared by the sweep) so a
+  re-touched cool page survives one sweep and is promoted back to `HOT`.
+- **Single-word pin/evict gate.** `pin >= 0` ⇒ a fixer may `CAS pin→pin+1`;
+  the evictor reserves an unpinned frame with `CAS 0→-1` (`-1` = EVICTING).
+  Acquiring a pin and reserving for eviction race on **one** atomic word, with
+  no separate "pin++ then re-check" window.
+- **prefer-clean foreground + background trickle.** Foreground eviction
+  reclaims a clean victim and leaves dirty `COOL` pages for the trickler; it
+  flushes a dirty page inline only as a last-resort progress guarantee.
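+
+The single-word pin/evict gate can be sketched with C11 atomics. This is an
+illustration of the idea, not sqlxtc's code: the function names and
+`PIN_EVICTING` are hypothetical, and BDB itself uses `db_atomic_t` rather than
+`<stdatomic.h>`.
+
+```c
+#include <assert.h>
+#include <stdatomic.h>
+#include <stdbool.h>
+
+#define PIN_EVICTING (-1)   /* hypothetical sentinel: frame is reserved */
+
+/* Fixer: take a pin unless the evictor has reserved the frame. */
+static bool
+frame_try_pin(_Atomic int *pin)
+{
+    int cur = atomic_load_explicit(pin, memory_order_relaxed);
+
+    while (cur >= 0) {
+        /* On failure `cur` is reloaded, so the loop re-tests the gate. */
+        if (atomic_compare_exchange_weak_explicit(pin, &cur, cur + 1,
+            memory_order_acquire, memory_order_relaxed))
+            return (true);
+    }
+    return (false);
+}
+
+/* Fixer: drop a pin taken by frame_try_pin. */
+static void
+frame_unpin(_Atomic int *pin)
+{
+    int prev = atomic_fetch_sub_explicit(pin, 1, memory_order_release);
+
+    assert(prev > 0);
+    (void)prev;
+}
+
+/* Evictor: reserve the frame only if nobody holds a pin (0 -> -1). */
+static bool
+frame_try_reserve_evict(_Atomic int *pin)
+{
+    int expected = 0;
+
+    return (atomic_compare_exchange_strong_explicit(pin, &expected,
+        PIN_EVICTING, memory_order_acquire, memory_order_relaxed));
+}
+
+/* Evictor: hand the frame back, e.g. after reuse or an aborted eviction. */
+static void
+frame_release_evict(_Atomic int *pin)
+{
+    atomic_store_explicit(pin, 0, memory_order_release);
+}
+```
+
+Both sides contend on the same word, so either the pin or the reservation wins
+and the loser observes that outcome directly, with no recheck step.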
+
+**noxu `noxu-evictor`** (Berkeley-DB-JE lineage): a per-operation **`CacheMode`**
+(`Default/Unchanged/EvictLn/EvictBin/KeepHot/MakeEvictable`) drives **two
+independent tracking sets** — `primary` and `scan_resistant` — and the evictor
+drains *scan → primary → dirty* with per-phase quotas. Pluggable LRU/Clock/ARC/
+CAR/LIRS; it notes LRU pollutes on scans while ARC/CAR/LIRS resist inherently.
+
+### What this changes here
+
+1. **Scan resistance is probationary admission + COOL-first, not a single
+   counter.** Stage 0's frequency-climb CLOCK ages *every* buffer the hand
+   passes, so a long scan still decays the hot set (measured: only ~19% fewer
+   hot-set page-ins than LRU). The fix is to **admit new/scan-read pages cool
+   and never cool a HOT buffer while a COOL victim exists** — then a scan of any
+   length leaves the hot set untouched. This is implementable in BDB's existing
+   bucket-scan evictor *without* the full swip (call it **Stage 0.5**): split
+   the warmth range into COOL/HOT bands, admit reads at the top of COOL, promote
+   to HOT only on re-reference, and only decrement HOT-band warmth when a full
+   sweep finds no COOL victim (BDB's existing `aggressive` escalation is the
+   hook). Robust scan resistance, no optimistic-descent risk.
+2. **Adopt the single-word pin/evict reservation** on `bhp->ref`
+   (`-1` = EVICTING) to replace `__memp_alloc`'s TRYLOCK-`mtx_buf`-then-recheck-
+   `ref==1` dance, removing that race window.
+3. **prefer-clean eviction + trickle** is the Stage 2 writeback design; confirm
+   foreground eviction never blocks on a device write while a clean COOL victim
+   exists.
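+
+Point 1 can be sketched as a victim picker that never cools a `HOT` buffer
+while a `COOL` victim exists. The names (`struct buf`, `buf_admit`,
+`buf_touch`, `pick_victim`) are hypothetical, and the flat array replaces
+BDB's bucket scan; force-cooling here takes the cooled buffer as the victim
+immediately, which is a simplification.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+
+enum band { BAND_COOL, BAND_HOT };
+
+/* Hypothetical buffer state; `pins` stands in for BH_REFCOUNT. */
+struct buf {
+    enum band band;
+    int pins;
+};
+
+/* Probationary admission: new and scan-read pages start COOL. */
+static void
+buf_admit(struct buf *b)
+{
+    b->band = BAND_COOL;
+    b->pins = 0;
+}
+
+/* Re-reference promotes to HOT; the store is skipped when already HOT. */
+static void
+buf_touch(struct buf *b)
+{
+    if (b->band != BAND_HOT)
+        b->band = BAND_HOT;
+}
+
+/* COOL-first: a HOT buffer is cooled only if a full pass finds no COOL one. */
+static int
+pick_victim(struct buf *bufs, size_t n)
+{
+    size_t i;
+
+    for (i = 0; i < n; i++)
+        if (bufs[i].pins == 0 && bufs[i].band == BAND_COOL)
+            return ((int)i);
+    /* Escalation, analogous to the `aggressive` hook: force-cool one. */
+    for (i = 0; i < n; i++)
+        if (bufs[i].pins == 0) {
+            bufs[i].band = BAND_COOL;
+            return ((int)i);
+        }
+    return (-1);              /* everything is pinned */
+}
+```
+
+Under this policy a scan of any length keeps recycling the `COOL` buffers it
+admitted itself, so the `HOT` band is only touched once no `COOL` victim is
+left.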
+
+### BDB constraints (why we adopt principles, not code)
+
+Both references are greenfield, single-address-space. BDB's mpool is
+multi-process (swips must be `roff_t`, resolved via `R_ADDR`), pages are
+persisted disk images (the swip lives in an in-memory shadow vector, never in
+the page), and the on-disk format + public API are fixed. So we take the
+*mechanisms* — probationary COOL admission, COOL-first eviction, swip-encoded
+state, single-word pin/evict, prefer-clean+trickle — within those constraints.
diff --git a/docs/design/research/mpool-btree.md b/docs/design/research/mpool-btree.md
new file mode 100644
index 000000000..0c584d3a1
--- /dev/null
+++ b/docs/design/research/mpool-btree.md
@@ -0,0 +1,504 @@
+# MPOOL / BTREE implementation-grounding report
+
+Read-only survey of `master` for a buffer-manager rework (Stage 0: cooling
+state replacing the per-access priority write; Stage 1: `BH_WIRED` +
+per-parent child swip + LSN-validated optimistic descent of wired internal
+pages). Every line number below was read from the working tree at survey time.
+
+> **Two premises in the request do not exist in this tree — flag before relying on them:**
+> 1. `MPOOL_HOTFIELDS_ISOLATED` — **no such guard, macro, or `#ifdef` anywhere**
+>    (`grep` across the whole repo returns nothing). `struct __bh` in
+>    `src/dbinc/mp.h` is a plain, unguarded, unpadded `/* SHARED */` struct.
+>    There is no cache-line isolation of the hot fields (`ref`, `flags`,
+>    `priority`) today; they share a line with `mtx_buf`, `hq`, `pgno`.
+> 2. `__bam_iget` — **does not exist**. The btree descent entry points are
+>    `__bam_get_root` (root fetch) and `__bam_search` (root→leaf walk) in
+>    `src/btree/bt_search.c`. The report uses those.
+
+---
+
+## 1. MPOOL pin / unpin
+
+### 1a. Pin — `__memp_fget`, `src/mp/mp_fget.c:103`
+
+The resident-page fast path:
+
+**Bucket latch (read lock on `hp->mtx_hash`)** is taken inside the
+`MP_GET_BUCKET` macro at `src/mp/mp_fget.c:245`:
+
+```c
+245:	MP_GET_BUCKET(env, mfp, *pgnoaddr, &infop, hp, bucket, ret);
+246:	if (ret != 0)
+247:		return (ret);
+248:	c_mp = infop->primary;
+```
+
+The macro body (`src/dbinc/mp.h`, `MP_GET_BUCKET`) computes the bucket and does
+`MUTEX_READLOCK(env, (hp)->mtx_hash)` for the single-region case, else calls
+`__memp_get_bucket`. After this, `h_locked = 1` (set at line 257).
+
+**Chain walk reading `bhp->pgno` / `bhp->mf_offset`**, `src/mp/mp_fget.c:256`:
+
+```c
+256:	st_hsearch = 0;
+257:	h_locked = 1;
+258:	SH_TAILQ_FOREACH(bhp, &hp->hash_bucket, hq, __bh) {
+259:		++st_hsearch;
+260:		if (bhp->pgno != *pgnoaddr || bhp->mf_offset != mf_offset)
+261:			continue;
+```
+
+(For snapshot reads the MVCC version chain is walked backward at lines
+264–278 via `SH_CHAIN_PREV(bhp, vc, __bh)` — see hazards §6.)
+
+**`atomic_inc(&bhp->ref)` (the pin)**, `src/mp/mp_fget.c:288`:
+
+```c
+288:		if (BH_REFCOUNT(bhp) == UINT16_MAX) {     /* overflow guard */
+...
+296:		atomic_inc(env, &bhp->ref);
+297:		b_incr = 1;
+```
+
+**Drop bucket latch, then take the buffer latch.** The hash mutex is released
+*before* the buffer mutex is acquired (lock-ordering: hash → buf),
+`src/mp/mp_fget.c:304`:
+
+```c
+304:		MUTEX_UNLOCK(env, hp->mtx_hash);
+305:		h_locked = 0;
+306:		if (dirty || extending || makecopy || F_ISSET(bhp, BH_FROZEN)) {
+307: xlatch:		if (LF_ISSET(DB_MPOOL_TRY)) {
+308:				if ((ret =
+309:				    MUTEX_TRYLOCK(env, bhp->mtx_buf)) != 0)
+310:					goto err;
+311:			} else
+312:				MUTEX_LOCK(env, bhp->mtx_buf);
+313:			F_SET(bhp, BH_EXCLUSIVE);
+314:		} else if (LF_ISSET(DB_MPOOL_TRY)) {
+315:			if ((ret = MUTEX_TRY_READLOCK(env, bhp->mtx_buf)) != 0)
+316:				goto err;
+317:		} else
+318:			MUTEX_READLOCK(env, bhp->mtx_buf);
+```
+
+The clean read fast path is line 318 (`MUTEX_READLOCK(env, bhp->mtx_buf)`);
+`b_lock = 1` at line 333. Note: without `HAVE_SHARED_LATCHES`,
+`F_SET(bhp, BH_EXCLUSIVE)` is unconditional at line 330.
+
+The pin is recorded in the thread's `PIN_LIST` at lines 866–905 (`lp->b_ref =
+R_OFFSET(infop, bhp)`), and the page pointer returned at line 933
+(`*(void **)addrp = bhp->buf;`).
+
+### 1b. Unpin — `__memp_fput`, `src/mp/mp_fput.c:65`
+
+**Multi-reference early-out** (priority is left untouched when `atomic_dec`
+returns >1, or when one reference remains and the buffer is clean),
+`src/mp/mp_fput.c:183`:
+
+```c
+183:	DB_ASSERT(env, atomic_read(&bhp->ref) != 0);
+184:	if (atomic_dec(env, &bhp->ref) > 1 || (atomic_read(&bhp->ref) == 1 &&
+185:	    !F_ISSET(bhp, BH_DIRTY))) {
+186:		if (F_ISSET(bhp, BH_EXCLUSIVE))
+187:			F_CLR(bhp, BH_EXCLUSIVE);
+188:		MUTEX_UNLOCK(env, bhp->mtx_buf);
+189:		return (0);
+190:	}
+```
+
+**The LRU bump (the per-access `bhp->priority` write — Stage 0's target)**,
+`src/mp/mp_fput.c:206`:
+
+```c
+206:	if (priority == DB_PRIORITY_VERY_LOW ||
+207:	    mfp->priority == MPOOL_PRI_VERY_LOW)
+208:		bhp->priority = 0;
+209:	else {
+...
+215:		bhp->priority = c_mp->lru_priority;
+...                         /* per-priority adjust */
+249:				bhp->priority += adjust;
+...
+253:	}
+```
+
+**Global clock advance + wraparound reset**, `src/mp/mp_fput.c:269`:
+
+```c
+269:	if (++c_mp->lru_priority >= MPOOL_LRU_REDZONE &&
+270:	    (t_ret = __memp_reset_lru(env, infop)) != 0 && ret == 0)
+271:		ret = t_ret;
+```
+
+`__memp_reset_lru` (`src/mp/mp_fput.c:279`) bulk-decrements every buffer's
+priority when the counter saturates (lines 300, 325). This O(cache) sweep is
+exactly what a clock hand makes unnecessary.
+
+---
+
+## 2. Replacement / eviction — `__memp_alloc`, `src/mp/mp_alloc.c:30`
+
+### Victim selection
+`__memp_alloc` first tries `__env_alloc` from free memory (`alloc:` label,
+line ~95). On failure it falls to `search:` (line 149) and scans buckets.
+
+**High-priority threshold** (buffers fresher than `lru_priority -
+pages/10` are skipped until aggressive), `src/mp/mp_alloc.c:155`:
+
+```c
+155:	cache_reduction = c_mp->pages / 10;
+156:	high_priority = aggressive ? MPOOL_LRU_MAX :
+157:	    c_mp->lru_priority - cache_reduction;
+158:	lru_generation = c_mp->lru_generation;
+```
+
+**Bucket scan with wraparound** via `c_mp->last_checked`,
+`src/mp/mp_alloc.c:182`:
+
+```c
+182:		hp = &dbht[c_mp->last_checked++];
+183:		if (hp >= hp_end) {
+184:			c_mp->last_checked = 0;
+185:			hp = &dbht[c_mp->last_checked++];
+186:		}
+```
+
+After `MPOOL_ALLOC_SEARCH_LIMIT` (500) buckets, or a full pass, `aggressive`
+is raised and `high_priority = MPOOL_LRU_MAX` (lines 261, 297) so every buffer
+becomes eligible.
+
+**Per-bucket lowest-priority pick** (`bhp->priority` is the LRU key),
+`src/mp/mp_alloc.c:303`:
+
+```c
+303: retry_search:	bhp = NULL;
+304:		bucket_priority = high_priority;
+...
+320:			if (SH_CHAIN_SINGLETON(current_bhp, vc)) {
+321:				if (BH_REFCOUNT(current_bhp) != 0)
+322:					continue;            /* skip pinned */
+323:				buffers++;
+324:				if (bucket_priority > current_bhp->priority) {
+325:					bucket_priority = current_bhp->priority;
+...
+328:					bhp = current_bhp;
+329:					atomic_inc(env, &bhp->ref);
+330:				}
+331:				continue;
+332:			}
+```
+
+Two buckets are compared and the lower-priority winner kept
+(`priority = bhp->priority;` at line 409; swap logic 420–447).
+
+### `BH_REFCOUNT==0` check and exclusive `mtx_buf`
+Eviction will not touch a pinned buffer. After dropping the hash mutex
+(`src/mp/mp_alloc.c:481`):
+
+```c
+484:		/* Don't bother trying to latch a busy buffer. */
+485:		if (BH_REFCOUNT(bhp) > 1)
+486:			goto next_hb;
+...
+489:		if ((ret = MUTEX_TRYLOCK(env, bhp->mtx_buf)) != 0) {  /* exclusive */
+...
+493:		F_SET(bhp, BH_EXCLUSIVE);
+494:		b_lock = 1;
+497:		/* Someone may have grabbed it while we got the lock. */
+498:		if (BH_REFCOUNT(bhp) != 1)
+499:			goto next_hb;
+```
+
+So the invariant is: an evictable buffer must have `ref == 1` (the evictor's
+own transient inc) and be exclusively latched via non-blocking `MUTEX_TRYLOCK`.
+Dirty victims are written first (`F_ISSET(bhp, BH_DIRTY)` →
+`__memp_bhwrite`, lines 507–525); MVCC mid-chain victims are frozen
+(`SH_CHAIN_HASPREV` → `__memp_bh_freeze`, lines 540–566). A same-size clean
+victim is reused in place (`p = bhp; goto found;`, lines 660–666).
+
+### How a clock / second-chance / hot–cool state machine slots in
+- **State storage:** reuse the 32-bit `bhp->priority` (`src/dbinc/mp.h:555`) or
+  steal bits from `bhp->flags` (`u_int16_t`, `src/dbinc/mp.h:553`, only 9 bits
+  used through `BH_THAWED 0x100`). A 2-bit HOT/WARM/COOL/COLD field fits in
+  `flags`.
+- **Reference bit set on access:** replace the `bhp->priority = c_mp->lru_priority`
+  write at `src/mp/mp_fput.c:215` (and the `DB_PRIORITY_*` adjust block 209–253)
+  with a single `F_SET(bhp, BH_HOT)` / set-reference-bit — cheaper, and it
+  removes the `++c_mp->lru_priority` global counter contention (line 269) and
+  the whole `__memp_reset_lru` sweep (lines 279–331).
+- **Evictor consults the state bit / advances the cool hand:** the victim test
+  at `src/mp/mp_alloc.c:320–332`. Instead of "lowest priority wins," do
+  second-chance: if `BH_HOT` set, clear it (demote one level) and `continue`;
+  only a COLD, `BH_REFCOUNT==0` buffer becomes `bhp`. `c_mp->last_checked`
+  (line 182) already *is* a per-region clock hand — reuse it as the cooling
+  hand. The `aggressive`/`high_priority` escalation (155–157, 261) maps to
+  "force-demote on the second sweep."
+
+---
+
+## 3. BH struct + flags — `src/dbinc/mp.h`
+
+`struct __bh` is `/* SHARED */` (cross-process), `src/dbinc/mp.h:539`:
+
+```c
+539: struct __bh { /* SHARED */
+540:	db_mutex_t	mtx_buf;	/* Shared/Exclusive mutex */
+541:	db_atomic_t	ref;		/* Reference count. */
+542:#define	BH_REFCOUNT(bhp)	atomic_read(&(bhp)->ref)
+544:#define	BH_CALLPGIN	0x001
+545:#define	BH_DIRTY	0x002
+546:#define	BH_DIRTY_CREATE	0x004
+547:#define	BH_DISCARD	0x008
+548:#define	BH_EXCLUSIVE	0x010
+549:#define	BH_FREED	0x020
+550:#define	BH_FROZEN	0x040
+551:#define	BH_TRASH	0x080
+552:#define	BH_THAWED	0x100
+553:	u_int16_t	flags;
+554:
+555:	u_int32_t	priority;	/* Priority. */
+556:	SH_TAILQ_ENTRY	hq;		/* MPOOL hash bucket queue. */
+557:
+558:	db_pgno_t	pgno;
+559:	roff_t		mf_offset;
+560:	u_int32_t	bucket;
+561:	int		region;
+562:
+563:	roff_t		td_off;		/* MVCC: creating TXN_DETAIL offset. */
+564:	SH_CHAIN_ENTRY	vc;		/* MVCC: version chain. */
+...
+571:	DB_ALIGN8	u_int8_t buf[1];	/* Variable length data. */
+572: };
+```
+
+- **No `MPOOL_HOTFIELDS_ISOLATED` guard exists** (confirmed by grep). If the
+  rework needs cache-line isolation of `ref`/`flags`/`priority` to avoid false
+  sharing of the new reference/cool bit, that padding does **not** exist yet and
+  must be added here.
+- **`BH_WIRED` (new, non-evictable) flag:** next free bit is `0x200`
+  (`#define BH_WIRED 0x200`) — `flags` is `u_int16_t`, room through `0x8000`.
+- **Where the evictor checks `BH_WIRED`:** add `if (F_ISSET(current_bhp,
+  BH_WIRED)) continue;` in the victim loop at `src/mp/mp_alloc.c:320`
+  (singleton branch) and at the MVCC-chain branch ~line 350, alongside the
+  existing `BH_REFCOUNT != 0` skip. Belt-and-suspenders: also short-circuit at
+  `src/mp/mp_alloc.c:485` before the `MUTEX_TRYLOCK`.
+- **Where wired internal pages are marked on fetch:** in `__memp_fget` after
+  the buffer is pinned and latched, just before the return at
+  `src/mp/mp_fget.c:933`. But the *caller* knows page type, not mpool — so the
+  natural mark point is the btree layer once `TYPE(h) == P_IBTREE` is known
+  (see §4). A new mpool entry point (e.g. `__memp_wire(bhp)` setting `BH_WIRED`
+  under `bhp->mtx_buf`) called from `__bam_get_root`/`__bam_search` after the
+  internal page fget is the clean seam.
+
+---
+
+## 4. BTREE descent — `src/btree/bt_search.c`
+
+Entry points (note: **`__bam_iget` does not exist**):
+- `__bam_get_root`, `src/btree/bt_search.c:59` — fetches the root via the
+  `BAM_GET_ROOT` macro at line 114, asserts `TYPE(h)` is one of
+  `P_IBTREE/P_IRECNO/P_LBTREE/P_LRECNO/P_LDUP` at lines 123–124.
+- `__bam_search`, `src/btree/bt_search.c:253` — the root→leaf descent loop.
+
+### The descent loop — `src/btree/bt_search.c:347`
+```c
+347:	for (;;) {
+348:		if (TYPE(h) == P_LBTREE)        /* leaf vs internal distinction */
+349:			adjust = P_INDX;
+350:		else { ... adjust = O_INDX; }
+...
+        /* binary search on h ... then for an internal page: */
+476:			pg = GET_BINTERNAL(dbp, h, indx)->pgno;   /* child pgno from parent */
+477:			level = LEVEL(h);
+...
+601:				parent_h = h;     /* normal search: latch-couple parent→child */
+602:				goto lock_next;
+...
+803: skip_lock:	stack = set_stack;
+804:		}
+805:		/* Get the child page. */
+806:		if ((ret = __memp_fget(mpf, &pg,
+807:		     dbc->thread_info, dbc->txn, get_mode, &h)) != 0)
+808:			goto err;
+809:		/* Release the parent. */
+810:		if (parent_h != NULL && (ret = __memp_fput(mpf,
+811:		    dbc->thread_info, parent_h, dbc->priority)) != 0)
+812:			goto err;
+813:		parent_h = NULL;
+814:	}
+```
+
+So per level: read child pgno from parent (`GET_BINTERNAL(...)->pgno`, line
+476) → fetch+pin+latch child (`__memp_fget`, lines 806–807) → unpin parent
+(`__memp_fput`, lines 810–811). Latch-coupling: the parent stays pinned
+(`parent_h = h`, line 601) across the child fetch.
+
+### Internal vs leaf, and the page LSN
+- Page type: `P_IBTREE = 3` (`src/dbinc/db_page.h:39`), `P_LBTREE = 5`
+  (`src/dbinc/db_page.h:41`); read via `TYPE(p)` macro (`db_page.h:317`),
+  `LEVEL(p)` (`db_page.h:316`), `LEAFLEVEL == 1` (`db_page.h`).
+- LSN lives at the head of every `PAGE`: `DB_LSN lsn;` is bytes 00–07
+  (`src/dbinc/db_page.h:260`), accessed via `LSN(p)` (`db_page.h:310`). It is
+  stamped under exclusive page latch + log write by the access-method redo/undo
+  paths (e.g. `__db_*_log` / `__memp_dirty` callers), **not** by `__memp_fget`.
+  For a reader, `LSN(h)` is stable as long as the page is latched.
+- Child entry record: `BINTERNAL` (`src/dbinc/db_page.h:784–791`), `pgno` at
+  offset 04–07; fetched by `GET_BINTERNAL` (`db_page.h:795`).
+
+### Stage 1 insertion point — optimistic LSN-validated descent of a wired internal page
+The exact seam is inside the descent loop body, **between** reading the
+child pointer and the `__memp_fget` of the child:
+
+- Replace the sequence at `src/btree/bt_search.c:476` (`pg =
+  GET_BINTERNAL(...)->pgno`) → `806` (`__memp_fget(child)`) → `810`
+  (`__memp_fput(parent)`) with: if the current `h` is a wired internal page
+  (`F_ISSET(PAGE_TO_BH(h), BH_WIRED)`), read `pg` and snapshot `lsn = LSN(h)`
+  **without** holding the buffer latch, do the binary search against the
+  unlatched (but wired, hence non-evictable, non-relocating) frame, then
+  re-validate `LOG_COMPARE(&lsn, &LSN(h)) == 0` before trusting `pg`. On
+  mismatch, fall back to the latched path (current lines 806–811). Because
+  `BH_WIRED` guarantees the frame is not reused by `__memp_alloc` (§2/§3), the
+  pointer stays valid; the LSN check guards against in-place content mutation
+  (split/merge) of the internal page.
+- The parent `__memp_fput` at lines 810–811 is skipped for wired parents (no
+  pin was taken), which is the whole point — descent of the upper tree touches
+  no `mtx_hash`, no `mtx_buf`, no `atomic_inc`.
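+
+The snapshot, search, revalidate sequence above can be sketched in isolation.
+This is a minimal single-threaded model, not libdb code: `toy_lsn`, `toy_page`,
+`toy_optimistic_child`, and the `interleave` test hook are hypothetical names,
+and the sketch says nothing about the memory ordering a real lock-free reader
+would need.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+#include <stdint.h>
+
+/* Hypothetical stand-ins for DB_LSN and an internal page; not libdb types. */
+typedef struct { uint32_t file, offset; } toy_lsn;
+typedef struct {
+    toy_lsn  lsn;        /* bumped by every in-place mutation */
+    uint32_t nent;       /* number of valid child slots */
+    uint32_t key[8];     /* separator keys, ascending */
+    uint32_t child[8];   /* child page numbers */
+} toy_page;
+
+static int toy_lsn_equal(toy_lsn a, toy_lsn b)
+{
+    return a.file == b.file && a.offset == b.offset;
+}
+
+/*
+ * Optimistic lookup: snapshot the LSN, search without a latch, then
+ * revalidate. Returns 0 and sets *pg when the snapshot held, or -1 when
+ * the caller must retry on the latched path. `interleave` is a test-only
+ * hook that simulates a concurrent writer; pass NULL otherwise.
+ */
+static int toy_optimistic_child(toy_page *h, uint32_t key, uint32_t *pg,
+    void (*interleave)(toy_page *))
+{
+    toy_lsn snap = h->lsn;
+    uint32_t n = h->nent, lo = 0, hi, mid, cand;
+
+    assert(pg != NULL);
+    if (n == 0 || n > 8)             /* never index out of bounds */
+        return -1;
+    hi = n;
+    while (hi - lo > 1) {            /* last slot whose key <= search key */
+        mid = lo + (hi - lo) / 2;
+        if (h->key[mid] <= key)
+            lo = mid;
+        else
+            hi = mid;
+    }
+    cand = h->child[lo];
+    if (interleave != NULL)
+        interleave(h);
+    if (!toy_lsn_equal(snap, h->lsn))
+        return -1;                   /* page changed: do not trust cand */
+    *pg = cand;
+    return 0;
+}
+
+/* Test helper: an in-place "split" that rewrites a slot and bumps the LSN. */
+static void toy_split(toy_page *h)
+{
+    h->child[1] = 999;
+    h->lsn.offset++;
+}
+```
+
+The bound check on `nent` stands in for the requirement that the search
+tolerate transiently inconsistent bytes; `*pg` is written only after the LSN
+recheck passes, so a rejected snapshot never leaks a child number.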
+
+### Where a per-parent child-swip cache attaches
+`GET_BINTERNAL(dbp, h, indx)->pgno` (line 476) is a `db_pgno_t` (logical page
+number) that must be re-hashed and re-looked-up each descent. A swip cache
+turns it into a direct `roff_t` to the child `BH` with low state bits
+(unresolved/resolved/invalidated). The preferred home, plus the invalidation
+rule any home must obey:
+- **Shadow vector keyed by internal frame** (preferred, see §5): an array of
+  `roff_t` swips parallel to the `BINTERNAL` slots of each wired internal page,
+  consulted at line 476 before falling back to `__memp_fget`. Low 2–3 bits of
+  the `roff_t` encode state (0 = unresolved → do the fget and fill the swip;
+  1 = resolved offset; 2 = invalidated by split). Alignment of `BH`
+  allocations (`MVCC_BHALIGN`, and `__env_alloc` granularity) leaves low bits
+  free.
+- It must be invalidated wherever the child set changes — `__bam_split` /
+  `__bam_pinsert` and page-free paths — keyed off the parent page LSN bump.
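+
+The low-bit state encoding can be illustrated with a few helpers. This is a
+sketch under stated assumptions: `toy_roff_t` models `roff_t` as a
+pointer-sized unsigned offset, two state bits are used, and every `swip_*`
+name is hypothetical.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+/* Hypothetical: roff_t modelled as a pointer-sized unsigned offset. */
+typedef uintptr_t toy_roff_t;
+
+#define SWIP_STATE_BITS   2u
+#define SWIP_STATE_MASK   ((toy_roff_t)((1u << SWIP_STATE_BITS) - 1))
+
+enum swip_state {
+    SWIP_UNRESOLVED  = 0,   /* do the fget, then fill the swip */
+    SWIP_RESOLVED    = 1,   /* offset bits name the child BH */
+    SWIP_INVALIDATED = 2    /* parent split; offset must not be used */
+};
+
+/* Pack an aligned region offset and a state into one word. */
+static toy_roff_t swip_make(toy_roff_t off, enum swip_state st)
+{
+    assert((off & SWIP_STATE_MASK) == 0);   /* alignment frees the low bits */
+    return off | (toy_roff_t)st;
+}
+
+static enum swip_state swip_state_of(toy_roff_t swip)
+{
+    return (enum swip_state)(swip & SWIP_STATE_MASK);
+}
+
+static toy_roff_t swip_offset_of(toy_roff_t swip)
+{
+    return swip & ~SWIP_STATE_MASK;
+}
+```
+
+Because unresolved is state 0, a freshly zeroed swip vector is already in the
+"do the fget and fill" state with no initialisation pass.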
+
+---
+
+## 5. Where the region stores per-buffer metadata — `src/mp/mp_region.c`
+
+Per-region layout is built in `__memp_init` (`src/mp/mp_region.c:196`):
+
+```c
+286:	if ((ret = __env_alloc(infop,
+287:	    htab_buckets * sizeof(DB_MPOOL_HASH), &htab)) != 0)
+288:		goto mem_err;
+289:	mp->htab = R_OFFSET(infop, htab);
+...
+339:	mp->htab_buckets = htab_buckets;
+```
+
+The cache region (`MPOOL`, `src/dbinc/mp.h:struct __mpool`) holds offsets, not
+arrays: `roff_t htab` (hash buckets), `roff_t regids`, `roff_t ftab`, plus the
+`free_frozen` / `alloc_frozen` lists. Individual `BH`s are *not* a contiguous
+array — each is `__env_alloc`'d on demand inside `__memp_alloc`
+(`src/mp/mp_alloc.c`, `alloc:` path, `c_mp->pages++`), so there is no existing
+dense per-buffer index to hang a parallel vector off.
+
+**For a per-internal-frame shadow swip vector:** allocate it in `__memp_init`
+right after the htab allocation (after `src/mp/mp_region.c:289`), as another
+`__env_alloc(infop, ...)` whose offset is stored in a new `roff_t` field on
+`struct __mpool`. Because swips are *per wired internal frame* (not per cache
+slot), the cleaner design is to allocate the swip vector lazily when a frame is
+wired (alongside `BH_WIRED` marking, §3), sized to that page's `NUM_ENT`, and
+store its `roff_t` either in a new `BH` field or in a small region-side hash
+keyed by `bhp` offset. Both the htab and any new vector must be reachable by
+`R_ADDR(infop, off)` from every attached process (cross-process — see §6).
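+
+The reachability requirement is easiest to see in a toy model of
+base-relative addressing. The sketch below only imitates the idea behind
+`R_ADDR`/`R_OFFSET`; `toy_reginfo`, `toy_r_addr`, and `toy_r_offset` are
+hypothetical and do not reproduce the libdb macros.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+#include <stdint.h>
+
+/* Hypothetical model of a shared region: each "process" sees its own base. */
+typedef size_t toy_roff_t;
+typedef struct { uint8_t *base; size_t size; } toy_reginfo;
+
+/* Translate a region offset to an address in this mapping. */
+static void *toy_r_addr(const toy_reginfo *ri, toy_roff_t off)
+{
+    assert(off < ri->size);
+    return ri->base + off;
+}
+
+/* Translate an address in this mapping back to a region offset. */
+static toy_roff_t toy_r_offset(const toy_reginfo *ri, const void *p)
+{
+    const uint8_t *q = (const uint8_t *)p;
+
+    assert(q >= ri->base && q < ri->base + ri->size);
+    return (toy_roff_t)(q - ri->base);
+}
+```
+
+An offset produced through one mapping resolves to the same bytes through
+another mapping at a different base, which is why a swip must hold the offset
+and never the translated pointer.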
+
+---
+
+## Concrete insertion points
+
+### Stage 0 — cooling state replaces the per-access priority write
+| What | File:line |
+|------|-----------|
+| Remove/replace the LRU priority write | `src/mp/mp_fput.c:215` (`bhp->priority = c_mp->lru_priority`) and the adjust block `209–253` → set a reference/HOT bit instead |
+| Remove the global clock increment + redzone reset | `src/mp/mp_fput.c:269–271` (and delete the `__memp_reset_lru` sweep `279–331`) |
+| State bits storage | `src/dbinc/mp.h:553` (`flags`, free bits ≥ `0x200`) or repurpose `priority` `mp.h:555` |
+| Evictor second-chance / cooling check | `src/mp/mp_alloc.c:320–332` (singleton) and `~345–360` (MVCC chain); reuse `c_mp->last_checked` `mp_alloc.c:182` as the cool hand |
+| Keep refcount==0 + TRYLOCK invariant unchanged | `src/mp/mp_alloc.c:485, 489, 498` |
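+
+A second-chance scan of the kind Stage 0 calls for might look like the sketch
+below. It is a toy model over a flat array, not the hash-bucket walk in
+`__memp_alloc`; `toy_bh`, `TOY_HOT`, `toy_touch`, and `toy_pick_victim` are
+hypothetical, and the test-before-set in `toy_touch` is an assumption made
+here to keep already-hot pages store-free.
+
+```c
+#include <assert.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#define TOY_HOT 0x1u    /* hypothetical reference bit, set on access */
+
+typedef struct {
+    uint32_t flags;     /* TOY_HOT or 0 */
+    uint32_t ref;       /* pin count; pinned buffers are never victims */
+} toy_bh;
+
+/* Access path: store only when the bit is clear (assumption, see lead-in). */
+static void toy_touch(toy_bh *b)
+{
+    if (!(b->flags & TOY_HOT))
+        b->flags |= TOY_HOT;
+}
+
+/*
+ * Second-chance scan: starting at *hand, cool hot buffers by clearing HOT
+ * and return the index of the first unpinned cool buffer, or -1 if two
+ * full sweeps find none. *hand persists across calls, the role the
+ * document proposes for c_mp->last_checked.
+ */
+static int toy_pick_victim(toy_bh *bh, size_t n, size_t *hand)
+{
+    size_t step, i;
+
+    assert(bh != NULL && hand != NULL);
+    for (step = 0; step < 2 * n; step++) {
+        i = *hand;
+        *hand = (*hand + 1) % n;
+        if (bh[i].ref != 0)
+            continue;                   /* pinned: skip */
+        if (bh[i].flags & TOY_HOT) {
+            bh[i].flags &= ~TOY_HOT;    /* give it a second chance */
+            continue;
+        }
+        return (int)i;
+    }
+    return -1;
+}
+```
+
+Two sweeps are enough in this model: the first cools every hot unpinned
+buffer, so the second finds a victim unless everything is pinned.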
+
+### Stage 1 — `BH_WIRED` + swip + LSN-validated descent
+| What | File:line |
+|------|-----------|
+| Define `BH_WIRED 0x200` | `src/dbinc/mp.h:552` (after `BH_THAWED`) |
+| Evictor skips wired buffers | `src/mp/mp_alloc.c:320` (and `485`) — `if (F_ISSET(.., BH_WIRED)) continue;` |
+| Mark internal page wired after fetch | `src/btree/bt_search.c` after the child `__memp_fget` at `806–807`, gated on `TYPE(h)==P_IBTREE`; via a new `__memp_wire()` touching `src/mp/mp_fget.c:933` region |
+| Optimistic read seam | `src/btree/bt_search.c:476` (read `pg`+`LSN(h)`) … revalidate before/instead of `__memp_fget` at `806`; skip parent `__memp_fput` `810–811` for wired parents |
+| Swip cache read/fill | at `src/btree/bt_search.c:476` (`GET_BINTERNAL(...)->pgno`) |
+| Swip vector allocation | `src/mp/mp_region.c:289` (region) + new `roff_t` on `struct __mpool` in `src/dbinc/mp.h` |
+| Swip invalidation | btree split/insert (`__bam_split`, `__bam_pinsert`) keyed off parent `LSN` |
+
+---
+
+## Correctness hazards
+
+1. **Cross-process shared region.** `struct __bh`, `struct __mpool`,
+   `DB_MPOOL_HASH` are all `/* SHARED */` and live in a region mapped by
+   multiple processes at *different* virtual addresses — every reference is a
+   `roff_t` + `R_ADDR`/`R_OFFSET`, never a pointer. A swip cache **must store
+   `roff_t`, not `BH *`**, and any new state bits must be written atomically
+   w.r.t. other processes (the existing code only relies on 32-bit reads/writes
+   being atomic — see the `priority`/`lru_priority` "we don't lock, garbage is
+   tolerable" comments at `mp_fput.c:210–214` and `mfp->priority` note in
+   `mp.h`). A HOT/cool bit with relaxed semantics is fine; a swip that gates
+   correctness is not — it needs the LSN revalidation as its correctness anchor.
+
+2. **Optimistic descent vs latching.** Skipping `bhp->mtx_buf` on a wired
+   internal page means giving up the latch that today guarantees a stable
+   `LSN(h)` and stable `BINTERNAL` bytes during the binary search
+   (`bt_search.c:347–476`). `BH_WIRED` only guarantees the *frame* is not
+   evicted/reused; it does **not** stop an in-place split/merge from rewriting
+   the page under you. The `LOG_COMPARE(&snapshot_lsn, &LSN(h))` recheck after
+   the search is mandatory, and the binary-search code must tolerate transiently
+   inconsistent bytes (no out-of-bounds `indx`) before the recheck rejects them.
+
+3. **MVCC / version-chain interaction at `read_lsnp`.** `__memp_fget` resolves
+   snapshot reads by walking the `vc` chain backward
+   (`src/mp/mp_fget.c:264–278`, `BH_VISIBLE`/`SH_CHAIN_PREV`) to the version
+   visible at `td->read_lsn` (set at `mp_fget.c:179`). A swip/LSN fast path that
+   bypasses `__memp_fget` bypasses this visibility resolution — so it is only
+   sound for **internal btree pages, which are not multi-versioned** under
+   normal operation. Guard the optimistic path with
+   `atomic_read(&mfp->multiversion) == 0` (as `__bam_get_root` already does at
+   `bt_search.c:188, 226`) **or** restrict it to `P_IBTREE` frames that are
+   provably singletons (`SH_CHAIN_SINGLETON(bhp, vc)`).
+
+4. **Freeze / thaw.** Mid-chain MVCC victims are frozen
+   (`__memp_bh_freeze`, `mp_alloc.c:545`) and re-materialized
+   (`__memp_bh_thaw`); a frozen buffer's data is gone. `BH_WIRED` must be
+   mutually exclusive with `BH_FROZEN` — the evictor's freeze branch
+   (`mp_alloc.c:540–566`) and the fget thaw branches (`mp_fget.c:307, 360+`)
+   must treat wired as "never freeze," else a swip could point at a frozen
+   header whose `buf` is invalid. Wiring should also force `SH_CHAIN_SINGLETON`.
+
+5. **`last_pgno` / extend races and `DB_MPOOL_NEW`.** The optimistic path reads
+   a child `pgno` that a concurrent allocation could be creating
+   (`mp_fget.c` `newpg:`/`extending`); the LSN recheck on the *parent* covers
+   structural change, but the swip-fill must take the normal latched fget for
+   any not-yet-resident child.
+
+6. **`priority` is read locklessly during stats and reset.** Repurposing
+   `priority` for cool-state must keep `__memp_reset_lru`'s assumptions or
+   remove that function entirely; leaving both the old sweep (`mp_fput.c:279`)
+   and a new state machine running would double-account.
+
+---
+
+## Summary (10 lines)
+
+1. Pin fast path: `__memp_fget` (`mp_fget.c:103`) read-locks `hp->mtx_hash` (`:245`), walks the chain (`:258`), `atomic_inc(&bhp->ref)` (`:296`), drops hash lock (`:304`), `MUTEX_READLOCK(bhp->mtx_buf)` (`:318`).
+2. Unpin: `__memp_fput` (`mp_fput.c:65`) `atomic_dec(&bhp->ref)` (`:184`) and the per-access LRU write `bhp->priority = c_mp->lru_priority` (`:215`) + global `++c_mp->lru_priority` (`:269`) → `__memp_reset_lru` sweep (`:279`).
+3. Eviction: `__memp_alloc` (`mp_alloc.c:30`) clock-scans `last_checked` buckets (`:182`), picks lowest `priority` singleton (`:320–332`), requires `BH_REFCOUNT==1` (`:485,498`) + `MUTEX_TRYLOCK(bhp->mtx_buf)` exclusive (`:489`), writes dirty (`:507`), freezes MVCC mid-chain (`:545`).
+4. `struct __bh` (`mp.h:539`) is plain/unguarded — **`MPOOL_HOTFIELDS_ISOLATED` does not exist**; `flags` (`:553`) has free bits, define `BH_WIRED 0x200` after `BH_THAWED` (`:552`).
+5. Descent: **`__bam_iget` does not exist**; use `__bam_get_root` (`bt_search.c:59`) + `__bam_search` loop (`:347`); child pgno from `GET_BINTERNAL(...)->pgno` (`:476`), child `__memp_fget` (`:806`), parent `__memp_fput` (`:810`), latch-couple via `parent_h=h` (`:601`).
+6. Page type/LSN: `P_IBTREE=3`/`P_LBTREE=5` (`db_page.h:39/41`), `TYPE()`/`LEVEL()` (`:317/316`), `LSN()` at PAGE byte 0 (`:260/310`); LSN stamped under exclusive latch by the AM, stable while latched.
+7. Stage 0: replace `mp_fput.c:215` priority write with a HOT/reference bit, delete `:269` counter + `:279` sweep, make the evictor at `mp_alloc.c:320` do second-chance using `last_checked` (`:182`) as the cool hand.
+8. Stage 1: mark `P_IBTREE` frames `BH_WIRED` after fget (seam near `bt_search.c:806`), skip them in the evictor (`mp_alloc.c:320,485`); at `bt_search.c:476` snapshot `LSN(h)`, search unlatched, revalidate `LOG_COMPARE(LSN(h))` before trusting `pg`; attach a `roff_t`-with-state swip vector allocated near `mp_region.c:289`.
+9. Hazards: shared region forces `roff_t` swips not pointers; optimistic reads need the LSN recheck because `BH_WIRED` stops eviction but not in-place splits; MVCC `read_lsnp` visibility walk (`mp_fget.c:264`) must be bypassed only for non-multiversion singleton internal pages; `BH_WIRED` must exclude `BH_FROZEN`/thaw.
+10. Region: `BH`s are `__env_alloc`'d on demand (no dense array); per-region htab built in `__memp_init` (`mp_region.c:286–289`) — add the swip vector there or lazily at wire time, with a new `roff_t` on `struct __mpool`.
diff --git a/docs/design/research/os-aio.md b/docs/design/research/os-aio.md
new file mode 100644
index 000000000..0cf37cbc0
--- /dev/null
+++ b/docs/design/research/os-aio.md
@@ -0,0 +1,393 @@
+# OS / mpool I/O survey — basis for an internal async-I/O abstraction
+
+Read-only survey of `/Users/gregburd/oss/libdb` @ `master`. No source was
+modified. All line numbers are from the working tree at survey time
+(2026-06-17). Paths are repo-relative.
+
+---
+
+## 1. The OS I/O abstraction (`src/os/`, `src/os_windows/`)
+
+All page and byte I/O funnels through a single dispatcher, `__os_io`, which
+takes an opcode (`DB_IO_READ` / `DB_IO_WRITE`). Everything below it is
+**fully synchronous and blocking**.
+
+Opcodes and the file handle:
+
+- `src/dbinc/os.h:92-93` — `#define DB_IO_READ 1`, `#define DB_IO_WRITE 2`.
+- `src/dbinc/os.h:112-151` — `struct __fh_t` (`DB_FH`). Holds `int fd`
+  (POSIX) and `HANDLE handle` (Win32, `:130`), plus `db_mutex_t mtx_fh`
+  (`:124`) that "protects the handle/fd across seek and read/write pairs."
+  Last-seek bookkeeping `pgno/pgsize/offset` (`:139-141`).
+
+### POSIX — `src/os/os_rw.c`
+
+| Function | Lines | Signature | Notes |
+|---|---|---|---|
+| `__os_io` | 20-128 | `(ENV*, int op, DB_FH*, db_pgno_t pgno, u_int32_t pgsize, u_int32_t relative, u_int32_t io_len, u_int8_t *buf, size_t *niop)` | Entry point for page I/O. |
+| `__os_read` | 134-196 | `(ENV*, DB_FH*, void *addr, size_t len, size_t *nrp)` | Loop over `read(2)`. |
+| `__os_write` | 204-229 | `(ENV*, DB_FH*, void *addr, size_t len, size_t *nwp)` | Zero-fill wrapper → `__os_physwrite`. |
+| `__os_physwrite` | 236-310 | `(ENV*, DB_FH*, void *addr, size_t len, size_t *nwp)` | Loop over `write(2)`. |
+| `__os_seek` | `src/os/os_seek.c:19-66` | `(ENV*, DB_FH*, db_pgno_t, u_int32_t pgsize, off_t relative)` | `lseek(2)`. |
+
+Behavior of `__os_io` (`os_rw.c:48-101`): if `pread`/`pwrite` are available
+and no `j_*` hook / no-zero-fill quirk applies, it computes
+`offset = relative ? relative : (off_t)pgno * pgsize` and issues a single
+**blocking** `pread`/`pwrite` (`:67-69`, `:88-90`) with no `mtx_fh` held. On
+a short transfer or when a hook is installed, it falls through to the `slow:`
+label (`:101`) which takes `MUTEX_LOCK(env, fhp->mtx_fh)`, then
+`__os_seek` + `__os_read`/`__os_write`, then unlocks (`:101-122`).
+`LAST_PANIC_CHECK_BEFORE_IO(env)` (`os.h:104`) is invoked immediately before
+every syscall.
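+
+The fast-path/slow-path split can be sketched with plain POSIX calls. This is
+a simplified model rather than the `__os_io` source: `toy_fh`, `toy_fh_open`,
+`toy_read_slow`, and `toy_page_read` are hypothetical, a pthread mutex stands
+in for `mtx_fh`, the offset is taken as `pgno * pgsize` only, and the slow
+path issues a single `read(2)` instead of looping.
+
+```c
+#define _XOPEN_SOURCE 700
+#include <assert.h>
+#include <fcntl.h>
+#include <pthread.h>
+#include <stdint.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+typedef struct {
+    int fd;
+    pthread_mutex_t mtx;   /* stands in for fhp->mtx_fh */
+} toy_fh;
+
+static int toy_fh_open(toy_fh *fh, const char *path)
+{
+    fh->fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
+    return fh->fd < 0 ? -1 : pthread_mutex_init(&fh->mtx, NULL);
+}
+
+/* Slow path: seek and read are a pair, so the handle lock covers both. */
+static int toy_read_slow(toy_fh *fh, off_t off, void *buf, size_t len,
+    size_t *nr)
+{
+    ssize_t n = -1;
+
+    *nr = 0;
+    pthread_mutex_lock(&fh->mtx);
+    if (lseek(fh->fd, off, SEEK_SET) == off)
+        n = read(fh->fd, buf, len);
+    pthread_mutex_unlock(&fh->mtx);
+    if (n < 0)
+        return -1;
+    *nr = (size_t)n;
+    return 0;
+}
+
+/* Fast path: one positioned read, no lock; anything short falls through. */
+static int toy_page_read(toy_fh *fh, uint32_t pgno, uint32_t pgsize,
+    void *buf, size_t *nr)
+{
+    off_t off = (off_t)pgno * pgsize;
+    ssize_t n;
+
+    assert(fh != NULL && buf != NULL && nr != NULL);
+    n = pread(fh->fd, buf, pgsize, off);
+    if (n == (ssize_t)pgsize) {
+        *nr = (size_t)n;
+        return 0;
+    }
+    return toy_read_slow(fh, off, buf, pgsize, nr);
+}
+```
+
+The positioned read needs no lock because it never moves the shared file
+position; only the seek-then-read pair has to be serialised.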
+
+### Open — `src/os/os_open.c`
+
+`__os_open` (`:18-200`), signature
+`(ENV*, const char *name, u_int32_t page_size, u_int32_t flags, int mode, DB_FH **fhpp)`.
+Translates `DB_OSO_*` to `open(2)` `oflags`. Relevant flags for AIO:
+
+- `DB_OSO_DIRECT` → `O_DIRECT` (`:78-80`) and Solaris `directio()` (`:166-170`).
+  Direct I/O bypasses the OS page cache, which makes it the natural pairing
+  for io_uring or POSIX aio.
+- `DB_OSO_DSYNC` → `O_DSYNC` + sets `DB_FH_NOSYNC` (`:82-84`, `:148-152`).
+- `DB_OSO_SEQ` is accepted in `OKFLAGS` (`:42`) but, notably, **never mapped
+  to any `posix_fadvise`/`madvise` hint** — there is no readahead plumbing
+  (see §3).
+
+There is **no async open path and no per-fd AIO context** anywhere in the
+handle. `__os_openhandle` (`src/os/os_handle.c`) just wraps `open(2)`.
+
+### Windows — `src/os_windows/os_rw.c`
+
+`__os_io` (`:16-93`) is structurally identical but uses an `OVERLAPPED`
+struct **with `over.hEvent = 0` and the comment "we don't want asynchronous
+notifications"** (`:36`). It calls `ReadFile`/`WriteFile` with the
+`OVERLAPPED` purely to pass a 64-bit offset (`:33-35`), then treats the result
+synchronously (`:57-71`). `__os_read` (`:103`), `__os_write` (`:163`),
+`__os_physwrite` (`:184`) loop on `ReadFile`/`WriteFile` with a NULL
+`OVERLAPPED`. The handle (`fhp->handle`) is **not** opened with
+`FILE_FLAG_OVERLAPPED`, so it cannot today be associated with an IOCP: in
+`src/os_windows/os_open.c` (`__os_open`, 7.6 KB), `CreateFile` is issued
+without `FILE_FLAG_OVERLAPPED`.
+
+**Takeaway:** the entire library reaches disk for data pages through exactly
+one choke point, `__os_io(env, DB_IO_{READ,WRITE}, fhp, pgno, pgsize, 0, len, buf, niop)`.
+An async abstraction can be introduced beside it without touching callers.
+
+---
+
+## 2. mpool I/O call sites — where sync becomes a candidate for async
+
+### 2a. Read-in (prefetch candidate)
+
+`__memp_pgread` — `src/mp/mp_bh.c:217-310`, signature
+`(DB_MPOOLFILE *dbmfp, BH *bhp, int can_create)`.
+
+- Pre-conditions asserted: buffer is held `BH_EXCLUSIVE` (`:233`), and marked
+  `BH_TRASH` on entry (`:236`).
+- The actual blocking read: `mp_bh.c:298-300`
+  ```c
+  ret = __os_io(env, DB_IO_READ, dbmfp->fhp,
+      bhp->pgno, pagesize, 0, pagesize, bhp->buf, &nr);
+  ```
+- On success clears `BH_TRASH` (`:309`).
+
+**Where a missing page is read in** — `src/mp/mp_fget.c` (`__memp_fget`):
+
+- The single demand-read site is the `BH_TRASH` block at
+  `mp_fget.c:920-933`; the call is `mp_fget.c:927-929`:
+  ```c
+  if ((ret = __memp_pgread(dbmfp,
+      bhp, LF_ISSET(DB_MPOOL_CREATE) ? 1 : 0)) != 0)
+  ```
+- The cache-miss buffer is allocated and set up in the `SECOND_MISS` arm
+  (`mp_fget.c:775-868`): `__memp_alloc` is reached via the `alloc:` label
+  (`:649-651`), the new `bhp` is latched exclusive
+  (`MUTEX_LOCK(env, bhp->mtx_buf)` at `:806`, `F_SET(bhp, BH_EXCLUSIVE)` at
+  `:808`), inserted into the hash bucket, and marked `F_SET(bhp, BH_TRASH)`
+  at `:865` for the non-extending (real disk read) case.
+
+  **This is the prefetch insertion point.** A `__memp_fget` variant (or a
+  `DB_MPOOL_PREFETCH` flag) could: allocate + insert the buffer, mark it
+  `BH_TRASH | BH_EXCLUSIVE`, **submit** an async read via the new interface
+  instead of calling `__memp_pgread`, and return without blocking. The
+  completion handler would fill `bhp->buf`, clear `BH_TRASH`, and drop the
+  exclusive latch — exactly what `__memp_pgread:298-309` does inline today.
+  Sequential-scan callers (btree `__bam_*`) are the obvious clients.
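+
+The buffer state transitions implied by that prefetch path can be modelled in
+a few lines. This is a hypothetical sketch: `toy_bh` and the `TOY_BH_*` flags
+merely imitate `BH_TRASH`/`BH_EXCLUSIVE`, and a flag stands in for actually
+holding `mtx_buf`.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+#include <string.h>
+
+#define TOY_BH_TRASH     0x1u   /* contents not yet valid */
+#define TOY_BH_EXCLUSIVE 0x2u   /* stands in for holding mtx_buf exclusively */
+#define TOY_PGSIZE       16
+
+typedef struct {
+    uint32_t flags;
+    uint32_t pgno;
+    uint8_t  buf[TOY_PGSIZE];
+} toy_bh;
+
+/* Submit side: claim the buffer and return without waiting for data. */
+static void toy_prefetch_submit(toy_bh *bh, uint32_t pgno)
+{
+    assert((bh->flags & TOY_BH_EXCLUSIVE) == 0);
+    bh->pgno = pgno;
+    bh->flags |= TOY_BH_TRASH | TOY_BH_EXCLUSIVE;
+    /* a real path would queue the async read here */
+}
+
+/* Completion side: fill, mark valid, then release the latch. */
+static void toy_prefetch_complete(toy_bh *bh, const uint8_t *data)
+{
+    assert(bh->flags & TOY_BH_EXCLUSIVE);
+    memcpy(bh->buf, data, TOY_PGSIZE);
+    bh->flags &= ~TOY_BH_TRASH;
+    bh->flags &= ~TOY_BH_EXCLUSIVE;
+}
+
+/* A reader may use the page only once it is neither in flight nor trash. */
+static int toy_bh_readable(const toy_bh *bh)
+{
+    return (bh->flags & (TOY_BH_TRASH | TOY_BH_EXCLUSIVE)) == 0;
+}
+```
+
+The completion clears `TRASH` before releasing the latch, mirroring the order
+in which the synchronous read finishes today.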
+
+### 2b. Eviction / sync / trickle writes (writeback candidates)
+
+`__memp_pgwrite` — `src/mp/mp_bh.c:315-510` (static), reached only through
+`__memp_bhwrite` (`mp_bh.c:20-209`). The blocking write is `mp_bh.c:469-471`:
+```c
+ret = __os_io(env, DB_IO_WRITE, dbmfp->fhp, bhp->pgno,
+    mfp->pagesize, 0, mfp->pagesize, buf, &nw);
+```
+Three paths reach `__memp_bhwrite`, each an async-writeback candidate
+(checkpoint and trickle share the `__memp_sync_int` call site):
+
+1. **Eviction (allocation pressure)** — `src/mp/mp_alloc.c:509`
+   `ret = __memp_bhwrite(dbmp, hp, bh_mfp, bhp, 0);` inside `__memp_alloc`,
+   in the `this_buffer:` arm (`:473-533`). The victim is latched exclusive via
+   `MUTEX_TRYLOCK(env, bhp->mtx_buf)` (`:489`) + `F_SET(bhp, BH_EXCLUSIVE)`
+   (`:492`); a dirty victim is written synchronously before reuse
+   (`dirty_eviction`, `:505-533`). Candidate for async writeback **with a
+   pending-completion barrier before the slot is handed out**.
+
+2. **Checkpoint / explicit sync & trickle** — `src/mp/mp_sync.c`,
+   `__memp_sync_int` (proto `src/dbinc_auto/mp_ext.h:99`:
+   `(ENV*, DB_MPOOLFILE*, u_int32_t, u_int32_t flags, u_int32_t *wrote, int *interrupted)`).
+   The write loop is `mp_sync.c:477-...`; the write call is `mp_sync.c:562-563`
+   `t_ret = __memp_bhwrite(dbmp, hp, mfp, bhp, 1)`. Buffers are gathered into
+   `bharray`, sorted by `__bhcmp` for file/page order (`:451-452`) — already a
+   batch, ideal for a single async **submit-many / reap** cycle.
+
+3. **Trickle (background cleaner)** — `src/mp/mp_trickle.c`, `__memp_trickle`
+   (static, `:50-...`) computes `need_clean` then calls
+   `__memp_sync_int(env, NULL, need_clean, DB_SYNC_TRICKLE | DB_SYNC_INTERRUPT_OK, &wrote, NULL)`.
+   This is the lowest-risk first target for async writeback: it is already
+   "best effort," already batched, and already interruptible
+   (`DB_SYNC_INTERRUPT_OK`, `src/dbinc/mp.h:48`; `DB_SYNC_TRICKLE`,
+   `mp.h:51`).
+
+Note `mp_sync.c:512-519` already **skips buffers held `BH_EXCLUSIVE`** by
+another thread ("come back to it"), and `:505-510` pins each via
+`atomic_inc(&bhp->ref)` + `MUTEX_READLOCK(env, bhp->mtx_buf)` — the same
+reference/latch discipline an async writer must maintain until completion is
+reaped.
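+
+The file/page ordering that makes the sync batch submit-friendly comes down
+to a two-key comparator. The sketch below is not `__bhcmp`; `toy_bhtrack`,
+`toy_bhcmp`, and `toy_sort_batch` are hypothetical names over assumed fields.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+#include <stdlib.h>
+
+/* Hypothetical batch entry: which file, which page. */
+typedef struct { uint32_t file_id; uint32_t pgno; } toy_bhtrack;
+
+/* Order by file first, then page, so each file's writes go out ascending. */
+static int toy_bhcmp(const void *a, const void *b)
+{
+    const toy_bhtrack *x = (const toy_bhtrack *)a;
+    const toy_bhtrack *y = (const toy_bhtrack *)b;
+
+    if (x->file_id != y->file_id)
+        return x->file_id < y->file_id ? -1 : 1;
+    if (x->pgno != y->pgno)
+        return x->pgno < y->pgno ? -1 : 1;
+    return 0;
+}
+
+static void toy_sort_batch(toy_bhtrack *v, size_t n)
+{
+    assert(v != NULL);
+    qsort(v, n, sizeof(*v), toy_bhcmp);
+}
+```
+
+Explicit comparisons are used instead of subtraction so that large unsigned
+page numbers cannot wrap into the wrong sign.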
+
+---
+
+## 3. Existing read-ahead / prefetch facilities
+
+**None.** Grep across `src/` for
+`prefetch|readahead|read_ahead|posix_fadvise|madvise|MP_FOR_PREFETCH|POSIX_FADV`
+returns zero matches. The only related artifacts:
+
+- `DB_OSO_SEQ` open flag exists (`src/os/os_open.c:42`, `OKFLAGS`) but is
+  **dead** with respect to I/O hints — never translated to `posix_fadvise`
+  or `O_*`.
+- `DB_MPOOL_*` `__memp_fget` flags (`src/mp/mp_fget.c:58-59`):
+  `DB_MPOOL_CREATE | DB_MPOOL_DIRTY | DB_MPOOL_EDIT | DB_MPOOL_LAST |
+  DB_MPOOL_NEW` — none express "fetch ahead / don't block." There is no
+  `DB_MPOOL_PREFETCH`.
+- `DB->stat` / `DB_MPOOL_STAT` (`src/mp/mp_stat.c`) track `st_page_in`,
+  `st_cache_miss`, `st_page_trickle`, etc., but no readahead counters.
+
+So a prefetch facility is greenfield: it needs both a new public hint
+(`DB_MPOOL_PREFETCH` or a `DB_MPOOLFILE->prefetch(pgno, npages)` method) and
+the async submit path of §2a.
+
+---
+
+## 4. Proposed minimal internal async-I/O abstraction (`src/os/os_aio.c`)
+
+### Design constraints discovered
+
+- Single choke point (`__os_io`) → the abstraction can be a sibling layer.
+- Buffers are page-sized, page-aligned `bhp->buf` regions; offset is
+  `pgno * pgsize`. The op already carries everything io_uring/aio needs.
+- **Multi-process shared mpool is the hard constraint** (see §4c).
+
+### 4a. Interface (platform-neutral)
+
+```
+os_aio_env_create(env)            -> per-ENV aio context (ring/kqueue/iocp/aiocb pool)
+os_aio_env_destroy(env)
+os_aio_submit_read (ctx, fhp, pgno, pgsize, buf, cookie) -> handle | EAGAIN
+os_aio_submit_write(ctx, fhp, pgno, pgsize, buf, cookie) -> handle | EAGAIN
+os_aio_reap(ctx, min, max, completions[], timeout)       -> n completions
+os_aio_cancel(ctx, handle)        -> best-effort
+```
+
+`cookie` carries `{ bhp, hp, mfp }` so the completion handler can run the
+exact tail of `__memp_pgread` (clear `BH_TRASH`, drop latch) or
+`__memp_pgwrite` (clear `BH_DIRTY`, `atomic_dec(&hp->hash_page_dirty)`,
+`atomic_dec(&bhp->ref)`, drop latch). The context lives off `ENV` alongside
+`env->mp_handle`, created in `__env_open` and torn down in `__env_close`.
+
+### 4b. Backend mapping
+
+| Backend | Submit | Complete / reap | Notes |
+|---|---|---|---|
+| **Linux io_uring** | `io_uring_get_sqe` + `io_uring_prep_read/write` (offset, `bhp->buf`, `pgsize`) + `io_uring_submit` | `io_uring_wait_cqe` / `io_uring_peek_batch_cqe`; `user_data` = cookie | One ring per ENV; pairs with `O_DIRECT` (`DB_OSO_DIRECT`). Cleanest fit. |
+| **BSD/macOS kqueue + aio** | `aio_read`/`aio_write` on an `aiocb` whose `aio_sigevent` uses `SIGEV_KEVENT` to post an `EVFILT_AIO` event to the kqueue | `kevent()`; `kev.udata`/`aiocbp` = cookie | macOS aio is limited and has no `O_DIRECT` (use `F_NOCACHE`); modest depth. |
+| **Windows overlapped/IOCP** | `ReadFile`/`WriteFile` with `OVERLAPPED` on a `FILE_FLAG_OVERLAPPED` handle bound to an IOCP | `GetQueuedCompletionStatus(Ex)`; `lpOverlapped`→cookie | **Requires reopening the handle with `FILE_FLAG_OVERLAPPED`** — today `os_windows/os_open.c` does not set it, and `os_rw.c:36` deliberately uses `hEvent=0` synchronous overlapped. |
+| **POSIX aio fallback** | `aio_read`/`aio_write`/`lio_listio` on `aiocb` array | `aio_suspend` + `aio_error`/`aio_return`, or `SIGEV_THREAD` callback | Portable floor; often thread-pool emulated. Last resort. |
+
+A `j_aio_*` global hook (mirroring the existing `DB_GLOBAL(j_pread)` etc. in
+`os_rw.c`) lets the test harness stub the backend.
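+
+A stubbed backend of the kind such a hook would enable can show the
+submit/reap contract and the cookie round trip. Everything here is
+hypothetical (`toy_aio_*`, `toy_io_fn`, `toy_mem_io`): the queue is
+in-process, completions are produced in submission order at reap time, and a
+byte array stands in for the file, none of which a real backend guarantees.
+
+```c
+#include <assert.h>
+#include <errno.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <string.h>
+
+#define TOY_AIO_DEPTH 2
+enum { TOY_AIO_READ = 1, TOY_AIO_WRITE = 2 };
+
+typedef struct {
+    uint32_t pgno, pgsize;
+    uint8_t *buf;
+    void    *cookie;     /* opaque to the backend, returned on completion */
+    int      op;
+} toy_aio_op;
+
+typedef struct { void *cookie; size_t nbytes; int ret; } toy_aio_completion;
+
+/* Transfer hook: the role a j_aio_* override would play in tests. */
+typedef int (*toy_io_fn)(void *dev, const toy_aio_op *op, size_t *nbytes);
+
+typedef struct {
+    toy_aio_op q[TOY_AIO_DEPTH];
+    size_t     n;
+    toy_io_fn  io;
+    void      *dev;
+} toy_aio_ctx;
+
+static int toy_aio_submit(toy_aio_ctx *ctx, const toy_aio_op *op)
+{
+    assert(op->op == TOY_AIO_READ || op->op == TOY_AIO_WRITE);
+    if (ctx->n == TOY_AIO_DEPTH)
+        return EAGAIN;               /* caller falls back to sync I/O */
+    ctx->q[ctx->n++] = *op;
+    return 0;
+}
+
+/* Complete up to max queued ops in submission order. */
+static size_t toy_aio_reap(toy_aio_ctx *ctx, toy_aio_completion *out,
+    size_t max)
+{
+    size_t done = 0, i;
+
+    for (; done < ctx->n && done < max; done++) {
+        out[done].cookie = ctx->q[done].cookie;
+        out[done].nbytes = 0;
+        out[done].ret = ctx->io(ctx->dev, &ctx->q[done], &out[done].nbytes);
+    }
+    for (i = done; i < ctx->n; i++)  /* keep unreaped ops queued */
+        ctx->q[i - done] = ctx->q[i];
+    ctx->n -= done;
+    return done;
+}
+
+/* Memory-backed stub device: `dev` is a byte array standing in for a file. */
+static int toy_mem_io(void *dev, const toy_aio_op *op, size_t *nbytes)
+{
+    uint8_t *at = (uint8_t *)dev + (size_t)op->pgno * op->pgsize;
+
+    if (op->op == TOY_AIO_READ)
+        memcpy(op->buf, at, op->pgsize);
+    else
+        memcpy(at, op->buf, op->pgsize);
+    *nbytes = op->pgsize;
+    return 0;
+}
+```
+
+The backend never interprets `cookie`; it only hands it back, which is what
+lets the mpool layer attach its own completion tail without the I/O layer
+knowing about `BH` or hash buckets.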
+
+### 4c. Per-process AIO context vs. shared mpool — the correctness pivot
+
+AIO contexts (io_uring ring fds, kqueues, IOCP handles, `aiocb`s) are
+**per-process**. The mpool buffer-cache region is **shared across processes**.
+Therefore a buffer marked "I/O in flight" by process A **cannot have its
+completion reaped by process B**. Resolution:
+
+1. Async submit is only ever issued by the process that **holds
+   `bhp->mtx_buf`**: exclusively for reads (`mp_fget.c:806`/`mp_bh.c:233`) and
+   for the eviction victim (`mp_alloc.c:489-492`), and as a shared latch plus
+   a pin for the sync path (`mp_sync.c:505-526`). The latch is held across
+   submit→reap and released **only by the submitting process in its
+   completion handler**.
+
+2. Any **other** process/thread that wants that page does **not** try to reap
+   foreign completions. It uses the **existing in-transit wait path**: it
+   blocks on `bhp->mtx_buf`, then re-checks `BH_TRASH`. That path already
+   exists and is unchanged:
+   - Waiter acquires the latch: `mp_fget.c:309-319`
+     (`MUTEX_LOCK`/`MUTEX_READLOCK(env, bhp->mtx_buf)`), with the
+     shared→exclusive upgrade-on-`BH_TRASH` dance at `mp_fget.c:322-329`.
+     The `BH_TRASH` re-check after acquiring is `mp_fget.c:324` and the
+     `revive`/`break` re-read at `mp_fget.c:485-505`.
+   - The flush/eviction scan simply **skips** buffers another process holds
+     `BH_EXCLUSIVE` (`mp_sync.c:512-519`) and pins+waits on `mtx_buf`
+     otherwise (`mp_sync.c:505-526`).
+
+   So the fallback is "block on `bhp->mtx_buf` and re-read on wake" — the same
+   semantics used today when one thread is mid-`__memp_pgread` and another
+   requests the page. Async simply widens the window during which the latch is
+   held; it does not introduce a new cross-process reaping requirement.
+
+3. **Crash/abandonment:** if the submitting process dies with an op in flight,
+   the buffer stays `BH_TRASH`+latched. This is the same failure surface as a
+   process dying inside synchronous `__memp_pgread` today, handled by existing
+   environment-recovery / dead-process latch cleanup (`mtx_buf` recovery on
+   `DB_RECOVER`). No new invariant.
+
+---
+
+## 5. WAL-rule enforcement before a dirty page write (must be preserved)
+
+The "log-flush-before-page-write" (WAL) rule is enforced in two places that
+async writeback **must continue to honor before submitting a write**:
+
+1. **Per-page, inside `__memp_pgwrite`** — `src/mp/mp_bh.c:347-354`:
+   ```c
+   if (LOGGING_ON(env) && mfp->lsn_off != DB_LSN_OFF_NOTSET &&
+       !IS_CLIENT_PGRECOVER(env)) {
+       memcpy(&lsn, bhp->buf + mfp->lsn_off, sizeof(DB_LSN));
+       if (!IS_NOT_LOGGED_LSN(lsn) &&
+           (ret = __log_flush(env, &lsn)) != 0)
+           goto err;          /* do NOT write the page */
+   }
+   ```
+   It reads the page's LSN at `mfp->lsn_off` (`src/dbinc/mp.h:498`) and forces
+   the log up to that LSN via `__log_flush` before the `__os_io(DB_IO_WRITE)`
+   at `mp_bh.c:469`. A `DIAGNOSTIC` assertion re-verifies the durable log LSN
+   (`lp->s_lsn`) has passed the page LSN (`mp_bh.c:356-405`).
+
+2. **Batch pre-flush, inside `__memp_sync_int`** — `src/mp/mp_sync.c:464-473`:
+   ```c
+   if (LOGGING_ON(env) && (ret = __log_flush(env, NULL)) != 0)
+       goto err;
+   ```
+   A whole-log flush before the write loop, as an optimization; the per-page
+   check in `__memp_pgwrite` still runs because pages may be re-dirtied after
+   this flush.
+
+**Implication for async:** the `__log_flush(env, &lsn)` at `mp_bh.c:352` must
+complete (synchronously, or as an ordering dependency) **before** the async
+write SQE for that page is submitted. The clean integration is to keep the
+WAL check at the top of the write submit routine — i.e. factor
+`__memp_pgwrite` into `prep (WAL + pgout) → submit → finish`, where `prep`
+retains lines `347-405` verbatim and only the `__os_io` at `469` is replaced
+by `os_aio_submit_write`. The dirty-state bookkeeping in the `err:`/
+`file_dead:` tail (`mp_bh.c:480-507`: clear `BH_DIRTY`, decrement
+`hash_page_dirty`) moves into the completion handler.
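+
+The ordering constraint in the prep → submit split can be modelled without
+any real I/O. The sketch below is hypothetical throughout (`toy_log`,
+`toy_page`, `toy_write_prep`, `toy_write_submit`); a flag stands in for
+queuing the write, and `fail_flush` exists only to exercise the error path.
+
+```c
+#include <assert.h>
+#include <stdint.h>
+
+typedef struct { uint32_t file, offset; } toy_lsn;
+
+typedef struct {
+    toy_lsn durable;     /* log is on stable storage up to here */
+    int     fail_flush;  /* test knob: make the next flush fail */
+} toy_log;
+
+typedef struct { toy_lsn lsn; int submitted; } toy_page;
+
+static int toy_lsn_cmp(toy_lsn a, toy_lsn b)
+{
+    if (a.file != b.file)
+        return a.file < b.file ? -1 : 1;
+    if (a.offset != b.offset)
+        return a.offset < b.offset ? -1 : 1;
+    return 0;
+}
+
+/* Force the log to at least `upto`; a no-op when already durable. */
+static int toy_log_flush(toy_log *lg, toy_lsn upto)
+{
+    if (lg->fail_flush)
+        return -1;
+    if (toy_lsn_cmp(lg->durable, upto) < 0)
+        lg->durable = upto;
+    return 0;
+}
+
+/* prep: the WAL check. It must succeed before the write may be queued. */
+static int toy_write_prep(toy_log *lg, const toy_page *pg)
+{
+    return toy_log_flush(lg, pg->lsn);
+}
+
+/* prep then submit: a failed flush means the page is NOT written. */
+static int toy_write_submit(toy_log *lg, toy_page *pg)
+{
+    int ret;
+
+    if ((ret = toy_write_prep(lg, pg)) != 0)
+        return ret;
+    assert(toy_lsn_cmp(lg->durable, pg->lsn) >= 0);  /* WAL invariant */
+    pg->submitted = 1;   /* stands in for queuing the async write */
+    return 0;
+}
+```
+
+Keeping the flush inside the submit routine, rather than trusting callers,
+means no code path can queue a page whose log records are not yet durable.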
+
+---
+
+## Summary (10 lines)
+
+1. All data-page I/O passes through one synchronous choke point,
+   `__os_io(DB_IO_READ|DB_IO_WRITE, fhp, pgno, pgsize, …)` in
+   `src/os/os_rw.c:20` (and the Win32 twin in `src/os_windows/os_rw.c:16`).
+2. Reads block in `__memp_pgread` (`mp_bh.c:298`); writes block in
+   `__memp_pgwrite` (`mp_bh.c:469`), reached only via `__memp_bhwrite`.
+3. The demand read-in site is `__memp_fget`'s `BH_TRASH` block,
+   `mp_fget.c:927`; the buffer is allocated/latched/marked at
+   `mp_fget.c:806-865` — the natural **prefetch submit** point.
+4. Writeback candidates: eviction (`mp_alloc.c:509`), checkpoint/sync
+   (`mp_sync.c:562`), and trickle (`mp_trickle.c` → `__memp_sync_int`);
+   trickle is the safest first target (already batched + interruptible).
+5. There is **no** existing prefetch/readahead — zero matches for
+   `prefetch|readahead|fadvise|madvise`; `DB_OSO_SEQ` is accepted but unused.
+6. Proposed `os_aio` interface: per-ENV context +
+   `submit_read/submit_write/reap/cancel`, with a `{bhp,hp,mfp}` cookie that
+   runs the existing pgread/pgwrite completion tail.
+7. Backends map cleanly: io_uring (prep+submit/CQE), kqueue+aio
+   (`aiocb`/`EVFILT_AIO`/`kevent`), Windows IOCP (`OVERLAPPED`/
+   `GetQueuedCompletionStatus` — needs `FILE_FLAG_OVERLAPPED`), POSIX aio
+   (`lio_listio`/`aio_suspend`) as the floor.
+8. AIO contexts are per-process but mpool is shared, so only the process
+   holding `bhp->mtx_buf` exclusive submits/reaps; other processes fall back
+   to the **existing in-transit wait** (`MUTEX_*LOCK(bhp->mtx_buf)` at
+   `mp_fget.c:309-329`, skip-if-`BH_EXCLUSIVE` at `mp_sync.c:512-519`).
+9. The WAL rule (`__log_flush(env,&lsn)` at `mp_bh.c:352`, plus batch flush at
+   `mp_sync.c:472`) must be evaluated **before** a write is submitted; refactor
+   `__memp_pgwrite` into prep→submit→finish keeping `mp_bh.c:347-405` intact.
+10. Net: introduce `os_aio` beside `__os_io`, add a `DB_MPOOL_PREFETCH`
+    fetch path, and convert trickle→sync→eviction writeback incrementally,
+    with no change to the cross-process latch invariants.
+
+---
+
+## Proposed `os_aio` interface — C header sketch
+
+```c
+/*-
+ * src/dbinc/os_aio.h  (PROPOSED — not yet implemented)
+ *
+ * Internal asynchronous page-I/O abstraction layered beside __os_io().
+ * One context per ENV; submit/reap is single-process. A buffer with an
+ * I/O in flight is held BH_EXCLUSIVE by the submitting process; other
+ * processes never reap foreign completions and instead block on
+ * bhp->mtx_buf (the existing in-transit wait).  WAL: callers MUST have
+ * satisfied the log-flush-before-write rule (mp_bh.c __memp_pgwrite)
+ * prior to submitting a DB_AIO_WRITE op through __os_aio_submit().
+ */
+#define DB_AIO_READ   1            /* mirrors DB_IO_READ  */
+#define DB_AIO_WRITE  2            /* mirrors DB_IO_WRITE */
+
+struct __db_aio_op {              /* one submitted operation */
+    DB_FH      *fhp;              /* target file handle              */
+    db_pgno_t   pgno;            /* page number (offset = pgno*pgsize) */
+    u_int32_t   pgsize;          /* transfer length == page size     */
+    u_int8_t   *buf;             /* page buffer (bhp->buf)           */
+    void       *cookie;          /* opaque: {BH*, DB_MPOOL_HASH*, MPOOLFILE*} */
+    int         op;              /* DB_AIO_READ | DB_AIO_WRITE       */
+};
+
+struct __db_aio_completion {
+    void       *cookie;          /* as submitted                     */
+    size_t      nbytes;          /* bytes transferred                */
+    int         ret;             /* 0 or __os_posix_err()-style code */
+};
+
+typedef struct __db_aio_ctx DB_AIO_CTX;   /* opaque per-ENV backend state */
+
+/* Lifecycle — created in __env_open, destroyed in __env_close. */
+int  __os_aio_env_create  __P((ENV *, DB_AIO_CTX **));
+void __os_aio_env_destroy __P((ENV *, DB_AIO_CTX *));
+
+/* Submit one op. Returns 0, or EAGAIN when the queue is full (caller
+ * then falls back to synchronous __os_io and proceeds). */
+int  __os_aio_submit __P((ENV *, DB_AIO_CTX *, struct __db_aio_op *));
+
+/* Reap between min and max completions; honors DB_SYNC_INTERRUPT_OK-style
+ * cancellation via a timeout. Returns count in *np. */
+int  __os_aio_reap   __P((ENV *, DB_AIO_CTX *, u_int32_t min, u_int32_t max,
+                          struct __db_aio_completion *, u_int32_t *np,
+                          db_timeout_t timeout));
+
+/* Best-effort cancel (used on env shutdown / interrupt). */
+int  __os_aio_cancel __P((ENV *, DB_AIO_CTX *, struct __db_aio_op *));
+
+/*
+ * Backend selection (compile/config time):
+ *   HAVE_AIO_IO_URING  -> io_uring        (Linux; pair with DB_OSO_DIRECT)
+ *   HAVE_AIO_KQUEUE    -> kqueue + aio(4)  (BSD/macOS; F_NOCACHE)
+ *   HAVE_AIO_IOCP      -> overlapped/IOCP  (Windows; needs FILE_FLAG_OVERLAPPED)
+ *   HAVE_AIO_POSIX     -> POSIX aio        (portable fallback / floor)
+ * Test override: DB_GLOBAL(j_aio_*) hooks, mirroring j_pread/j_pwrite.
+ */
+```