perf(mpool): Stage 2 — async I/O backends (io_uring/IOCP/kqueue/POSIX-aio/threadpool)#24

Draft
gburd wants to merge 11 commits into master from perf/aio-stage2

gburd (Collaborator) commented Jun 19, 2026

Stage 2: asynchronous I/O for the buffer pool

A pluggable os_aio layer so buffer-pool writeback (checkpoint/sync and
trickle) can issue page writes asynchronously instead of blocking a
foreground thread on each device write, with native completion engines per
platform and a portable fallback.

Backends (selected at runtime, in order)

  1. io_uring (Linux) — native, reused from libxtc.
  2. IOCP (Windows) — native, reused from libxtc.
  3. kqueue + aio (FreeBSD/BSD) — aiocb with SIGEV_KEVENT, reaped via
    kevent. Gated on sigev_notify_kqueue; macOS lacks it and uses POSIX aio.
  4. POSIX aio (Solaris/illumos, macOS) — aio_read/aio_write +
    aio_suspend/aio_error/aio_return.
  5. thread-pool offload — last-resort portable fallback.

With no backend available, os_aio falls back to synchronous I/O, which
preserves existing behaviour. configure.ac autodetects each non-Windows backend (HAVE_IO_URING,
HAVE_AIO_KQUEUE, HAVE_AIO_POSIX, HAVE_AIO_THREADPOOL).
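
The preference order can be pictured as a first-success cascade over the compiled-in backends. The sketch below is a minimal illustration, not the PR's `__os_aio_create`: the `aio_candidate` struct, `aio_select`, and the two stub probes are hypothetical names, and a return of -1 stands in for the synchronous fallback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate entry: a name plus a probe returning 0 on success. */
struct aio_candidate {
	const char *name;
	int (*try_create)(void);
};

/* Stand-in probes; a real probe would attempt to set up the backend. */
static int probe_unavailable(void) { return (-1); }
static int probe_available(void) { return (0); }

/*
 * Walk the candidates in preference order and return the index of the
 * first whose probe succeeds, or -1 to signal the synchronous fallback.
 */
static int
aio_select(const struct aio_candidate *c, size_t n)
{
	size_t i;

	assert(c != NULL || n == 0);
	for (i = 0; i < n; i++)
		if (c[i].try_create() == 0)
			return ((int)i);
	return (-1);
}
```

With a candidate list of { io_uring: `probe_unavailable`, threadpool: `probe_available` }, `aio_select` returns the thread pool's index; with no working candidate it returns -1 and the caller would stay synchronous.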

Integration

  • __memp_pgwrite split into prep (WAL flush + pgout — the WAL rule stays
    before submission) / submit / finish.
  • Checkpoint/sync (__memp_sync_int) submits a bounded in-flight window and
    drains it; trickle uses the same path, so trickle writeback is async too.
  • Foreground eviction stays synchronous by design (prefer-clean eviction +
    trickle keep dirty pages away from the evictor; an async evictor would hold
    latches waiting on its own completion).
  • Multi-process safe: AIO contexts are per-process; a page marked in-flight by
    one process is reaped only by that process, while other processes use the
    existing mtx_buf in-transit wait.

Thread-pool fork safety

The pool is process-wide, lazy, and fork-safe: a live-context registry lets the
pthread_atfork child handler reset every context's completion FIFO and
in-flight count (workers do not survive fork) without running completions.
Running a write completion in the child would clear BH_DIRTY and drop a pin
for I/O the child never performed, corrupting inherited copy-on-write state.
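
A minimal sketch of the child-side reset described above, assuming a fixed-size registry and a heavily simplified context. `aio_ctx`, `ctx_register`, `atfork_child`, and the field names are hypothetical, and the registry is left unlocked for brevity. The point it illustrates is that the handler discards queued completions instead of running them.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_CTX 8	/* hypothetical registry capacity */

/* Simplified stand-in for a per-environment AIO context. */
struct aio_ctx {
	int inflight;		/* ops submitted but not yet reaped */
	int queued;		/* completions waiting in the FIFO */
	int callbacks_run;	/* completion callbacks executed */
};

static struct aio_ctx *registry[MAX_CTX];
static size_t registry_len;
static int atfork_installed;

/*
 * Runs in the child after fork().  It drops queued completions and zeroes
 * the in-flight count, and deliberately never invokes a callback: the
 * child did not perform that I/O.
 */
static void
atfork_child(void)
{
	size_t i;

	for (i = 0; i < registry_len; i++) {
		registry[i]->queued = 0;
		registry[i]->inflight = 0;
	}
}

/* Add a live context to the registry, installing the handler once. */
static int
ctx_register(struct aio_ctx *ctx)
{
	assert(ctx != NULL);
	if (registry_len == MAX_CTX)
		return (-1);
	if (!atfork_installed) {
		if (pthread_atfork(NULL, NULL, atfork_child) != 0)
			return (-1);
		atfork_installed = 1;
	}
	registry[registry_len++] = ctx;
	return (0);
}
```

After the handler runs, a context that had ops in flight reports none, and its `callbacks_run` counter is unchanged.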

Build

The tree is K&R C; Autoconf >= 2.72 AC_PROG_CC adopts -std=gnu23, under
which K&R definitions are a hard error. Pinned -std=gnu99 and silenced the two
inherent K&R diagnostics behind compiler-support probes. configure regenerated
with Autoconf 2.73 (the dev-flake toolchain).

Status / not yet done

  • Draft pending at-scale measurement. rrand (the read-scaling workload) is
    100% cache hits with zero I/O, so AIO cannot move that curve; benefit must be
    shown on a write/cache-pressure workload. To be measured on the Linux box
    (io_uring) and FreeBSD (kqueue aio).
  • Prefetch / readahead is deferred to a follow-up (Stage 2b): it is the only
    read-side async feature, is correctness-heavy (buffer reservation, in-transit
    handshake, per-process reap constraint), and should be built against measured
    evidence rather than speculatively.

Maps to ROADMAP #8. Companion to the design doc (PR #20).

gburd added 11 commits June 18, 2026 16:11
Introduce a thin, pluggable async-I/O layer (DB_AIO_CONTEXT) the buffer pool
will use to prefetch pages and trickle dirty pages without blocking a
foreground thread on a device write.  This commit lands the interface and the
behaviour-preserving synchronous backend: submit performs the I/O inline via
__os_io and runs the completion immediately; reap is a no-op; available() is 0.
Platform backends (Linux io_uring first) plug in through DB_AIO_BACKEND and
override submit/reap/cancel; the synchronous path remains the universal
fallback.

Registered in dist/srcfiles.in and dist/Makefile.in.  Validated: full library
builds and links; an internal-API unit test round-trips a 512B page through
submit(WRITE)+submit(READ)+reap with correct data and two completions.

Not yet wired into mpool -- the io_uring backend and the trickle/prefetch
integration (which carry write-path data-integrity requirements) are the next
step on this branch.

Add a real asynchronous backend for the Stage 2 os_aio layer on Linux,
adapting the io_uring file-I/O mechanics from the XTC Project's libxtc
(src/io/io_uring.c, ISC, with the author's permission): SQE acquisition
with submit-and-retry on a full ring, io_uring_prep_read/write keyed by
DB_AIO_OP, and a CQE drain that maps each completion back to its op and
runs the callback.  The readiness/poll/wakeup machinery from libxtc is
dropped -- the buffer pool only issues file reads/writes.

The context struct moves to dbinc/os_aio.h so the backend can attach its
vtable + ring; __os_aio_create installs the backend under HAVE_IO_URING
and otherwise keeps the synchronous fallback; __os_aio_ctx_available
reports whether a context is truly async.  Built only with HAVE_IO_URING
(else an empty TU); registered in srcfiles.in and Makefile.in.

Validated (sync path): local build clean.  io_uring path build+round-trip
on meh next (liburing built from source).

__os_aio_reap ran COMPQUIET(max,0)/COMPQUIET(wait,0) -- which assign the
variables, not just silence warnings -- before forwarding to the backend,
so every async reap saw max=0,wait=0 and never blocked or drained.  Dispatch
to the backend first; COMPQUIET only on the synchronous fall-through.
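
The bug above is easy to reproduce in miniature. The sketch below uses a simplified stand-in for the COMPQUIET macro (assumed here to reduce to a plain assignment) and hypothetical names (`backend`, `reap_buggy`, `reap_fixed`, `stub_reap`); it is not the PR's code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: quiets "unused" warnings by assigning the variable. */
#define COMPQUIET(n, v) do { (n) = (v); } while (0)

/* Hypothetical backend vtable with a single reap entry point. */
struct backend {
	int (*reap)(int max, int wait);
};

/* Stub backend: reaps up to max completions, but only if told to wait. */
static int
stub_reap(int max, int wait)
{
	return (wait ? max : 0);
}

/* Buggy order: the arguments are zeroed before the backend sees them. */
static int
reap_buggy(const struct backend *b, int max, int wait)
{
	COMPQUIET(max, 0);
	COMPQUIET(wait, 0);
	if (b != NULL) {
		assert(b->reap != NULL);
		return (b->reap(max, wait));
	}
	return (0);
}

/* Fixed order: dispatch first, quiet only on the synchronous path. */
static int
reap_fixed(const struct backend *b, int max, int wait)
{
	if (b != NULL) {
		assert(b->reap != NULL);
		return (b->reap(max, wait));
	}
	COMPQUIET(max, 0);
	COMPQUIET(wait, 0);
	return (0);
}
```

Calling both with `max=8, wait=1` against the stub shows the difference: the buggy version forwards zeros and reaps nothing, while the fixed version forwards the caller's values.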

Add the cross-platform async path for every system without a native file
completion engine (everything but Linux io_uring / Windows IOCP): a fixed
worker pool drains a submission FIFO, runs each op via the normal __os_io,
and posts results to a completion FIFO that __os_aio_reap drains (blocking
on a condvar when asked to wait).  The pool structure is adapted from
libxtc (src/ptc/blocking.c + aio.c, ISC) with a completion queue replacing
its fiber-park wakeup; per-op I/O reuses BDB's portable __os_io so no
platform read/write branches are needed.

__os_aio_create now selects in preference order: io_uring (HAVE_IO_URING)
> thread-pool (HAVE_AIO_THREADPOOL) > synchronous fallback.  POSIX uses
pthreads; a Win32 path lands behind the same surface.  Registered in
srcfiles.in and Makefile.in.

Validated: on macOS (no io_uring) the pool is the active backend and the
submit/reap round-trip completes with correct data and callbacks.
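
The submission-FIFO / completion-FIFO shape can be sketched with plain pthreads. Everything below is illustrative: the names (`op`, `fifo`, `worker`, `pool_submit`, `pool_reap`, `pool_shutdown`) are hypothetical, the "I/O" is a stand-in computation, and a single lock guards both queues.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct op { struct op *next; int input, result; };
struct fifo { struct op *head, *tail; };

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER; /* guards both FIFOs */
static pthread_cond_t have_work = PTHREAD_COND_INITIALIZER;
static pthread_cond_t have_done = PTHREAD_COND_INITIALIZER;
static struct fifo submitted, completed;
static int shutting_down;

static void
fifo_push(struct fifo *q, struct op *o)
{
	o->next = NULL;
	if (q->tail != NULL)
		q->tail->next = o;
	else
		q->head = o;
	q->tail = o;
}

static struct op *
fifo_pop(struct fifo *q)
{
	struct op *o = q->head;

	if (o != NULL && (q->head = o->next) == NULL)
		q->tail = NULL;
	return (o);
}

/* Worker: take an op, run it without the lock, post it as completed. */
static void *
worker(void *arg)
{
	struct op *o;

	(void)arg;
	pthread_mutex_lock(&lock);
	for (;;) {
		while ((o = fifo_pop(&submitted)) == NULL && !shutting_down)
			pthread_cond_wait(&have_work, &lock);
		if (o == NULL)
			break;			/* shut down, nothing queued */
		pthread_mutex_unlock(&lock);
		o->result = o->input * 2;	/* stand-in for blocking I/O */
		pthread_mutex_lock(&lock);
		fifo_push(&completed, o);
		pthread_cond_signal(&have_done);
	}
	pthread_mutex_unlock(&lock);
	return (NULL);
}

static void
pool_submit(struct op *o)
{
	assert(o != NULL);
	pthread_mutex_lock(&lock);
	fifo_push(&submitted, o);
	pthread_cond_signal(&have_work);
	pthread_mutex_unlock(&lock);
}

/*
 * Pop one completion.  With wait set, block until one arrives; the caller
 * must only wait while it still has ops in flight.
 */
static struct op *
pool_reap(int wait)
{
	struct op *o;

	pthread_mutex_lock(&lock);
	while ((o = fifo_pop(&completed)) == NULL && wait)
		pthread_cond_wait(&have_done, &lock);
	pthread_mutex_unlock(&lock);
	return (o);
}

static void
pool_shutdown(void)
{
	pthread_mutex_lock(&lock);
	shutting_down = 1;
	pthread_cond_broadcast(&have_work);
	pthread_mutex_unlock(&lock);
}
```

A caller would start one or more threads on `worker`, submit ops, reap them with `pool_reap(1)`, then call `pool_shutdown()` and join the threads.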

Adapt libxtc's IOCP file AIO (src/io/io_iocp.c, ISC): overlapped
ReadFile/WriteFile on a file HANDLE associated with a completion port,
reaped in batch via GetQueuedCompletionStatusEx -- no worker thread.
The socket AFD-poll machinery is dropped; only file I/O is needed.
__os_aio_create selection is now io_uring > IOCP > thread-pool > sync.

Guarded by HAVE_IOCP (off in every build produced here, so it is an
empty TU on POSIX and cannot affect those builds).  NOTE: faithful
adaptation but NOT yet compiled/validated on Windows; the thread-pool
backend remains the validated Windows async path until a Windows build
confirms it.  Registered in srcfiles.in and Makefile.in.

configure.ac now probes liburing.h + io_uring_queue_init (defining
HAVE_IO_URING and adding -luring to LIBSO_LIBS so the shared library and
utilities link) and pthread.h (defining HAVE_AIO_THREADPOOL for the
portable offload).  Both optional; absent either, os_aio is synchronous.
config.hin carries the new template entries.

NOTE: the generated dist/configure must be regenerated with dist/s_config
on a host with autoconf for this to take effect; the committed configure
is unchanged here (no autoconf available in this environment).

… AIO ctx

Split the page-write into __memp_pgwrite_prep (WAL flush, write-ahead
verify, backup coordination, pgout) and __memp_pgwrite_finish (writer-count
release, stats, page-image free, BH_DIRTY/BH_TRASH clear under the hash
latch).  __memp_pgwrite is now a thin wrapper (prep + __os_io + finish) with
identical behavior, so the same logic can drive an asynchronous write.
Also adds a best-effort per-process DB_MPOOL->aio_ctx (created at mpool
open, destroyed at refresh) for upcoming async writeback.
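
The prep / I/O / finish split can be sketched on a toy buffer. The struct and function names below are hypothetical simplifications, not the `__memp_pgwrite_*` signatures; the sketch only shows why the synchronous wrapper and an asynchronous writer can share the same two halves.

```c
#include <assert.h>

/* Hypothetical, heavily simplified buffer header. */
struct buf {
	int dirty;		/* stands in for BH_DIRTY */
	int log_flushed;	/* set once the WAL covers this page */
	int writes;		/* device writes performed */
};

/* Prep: everything that must happen before the page reaches the device. */
static int
pgwrite_prep(struct buf *b)
{
	b->log_flushed = 1;	/* WAL flush precedes any page write */
	return (0);
}

/* The device write itself; an async writer would submit this instead. */
static int
pgwrite_io(struct buf *b)
{
	assert(b->log_flushed);	/* the WAL rule: log first, page second */
	b->writes++;
	return (0);
}

/* Finish: bookkeeping that is only valid once the write has completed. */
static void
pgwrite_finish(struct buf *b, int io_ret)
{
	if (io_ret == 0)
		b->dirty = 0;	/* clean only after a successful write */
}

/* Synchronous wrapper: prep + I/O + finish, back to back. */
static int
pgwrite(struct buf *b)
{
	int ret;

	if ((ret = pgwrite_prep(b)) != 0)
		return (ret);
	ret = pgwrite_io(b);
	pgwrite_finish(b, ret);
	return (ret);
}
```

An asynchronous caller would run `pgwrite_prep`, hand the write to the AIO layer, and call `pgwrite_finish` from the completion callback.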

Behavior-preserving (one harmless change: the backup writer count is now
balanced on the pgout-failure path, which previously leaked it).
Validated: test001 btree+hash, recd001 recovery.

Wire the per-process AIO context into the checkpoint/sync loop
(__memp_sync_int).  When an async backend is available, each dirty buffer
is written with __memp_bhwrite_async: WAL flush + pgout in prep, the page
write submitted via os_aio while the buffer pin (ref + shared mtx_buf) and
a file-handle reference are held, and a bounded in-flight window
(MEMP_AIO_WINDOW=16) is reaped (BH_DIRTY clear + reference release) at the
window boundary, at done, and before an interrupt.  The required fsync runs
only after the window is drained, preserving checkpoint durability; WAL
ordering is unchanged (log flushed in prep before submit).  Crash
consistency is identical to the synchronous path -- the log remains the
source of truth and a page is marked clean only after its write completes.
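
The bounded window is a small amount of control flow. In the sketch below only the constant 16 comes from the commit message; the `writeback` struct, `submit_page`, `drain_window`, and `sync_dirty_pages` are hypothetical, and completion is simulated rather than performed.

```c
#include <assert.h>

#define AIO_WINDOW 16	/* mirrors MEMP_AIO_WINDOW=16 above */

struct writeback {
	int inflight;	/* submitted, not yet reaped */
	int peak;	/* largest in-flight count observed */
	int cleaned;	/* pages marked clean after completion */
	int synced;	/* set once the final fsync stand-in has run */
};

/* Stand-in for prep + async submit of one dirty page. */
static void
submit_page(struct writeback *w)
{
	if (++w->inflight > w->peak)
		w->peak = w->inflight;
}

/* Stand-in for a blocking reap of every in-flight write. */
static void
drain_window(struct writeback *w)
{
	w->cleaned += w->inflight;	/* clean only after completion */
	w->inflight = 0;
}

static void
sync_dirty_pages(struct writeback *w, int ndirty)
{
	int i;

	for (i = 0; i < ndirty; i++) {
		submit_page(w);
		if (w->inflight == AIO_WINDOW)
			drain_window(w);	/* window boundary */
	}
	drain_window(w);			/* drain the tail */
	assert(w->inflight == 0);
	w->synced = 1;		/* fsync stand-in, only after the drain */
}
```

With 40 dirty pages the loop drains twice at the boundary and once for the remaining 8, so no more than 16 writes are ever outstanding.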

Gated on dbmp->aio_ctx availability: builds without an async backend (and
all non-fast-path files: dead/temporary/extent/unopened/read-only) use the
unchanged synchronous __memp_bhwrite.  The eviction path (__memp_bhwrite ->
__memp_pgwrite) is untouched.

Validated with the threadpool backend active (aio_ctx available=1):
recd001/002/005/015 recovery, txn001, test001 btree+hash, test003, test010.

The previous backend spawned 4 worker threads at every context creation (that
is, at every env creation), a cost that env-churning workloads (and BDB's own
env tests) pay repeatedly; worker threads in a library are also a fork hazard.

Replace it with a single process-wide pool created lazily on the first
submitted op: a context costs nothing until it actually issues an async
write, and many environments share one pool.  Each context keeps only its
own completion FIFO + condvar; one global lock guards the submission FIFO
and the completion FIFOs.  A pthread_atfork child handler resets the pool
to unstarted (workers do not survive fork) so a child re-spawns lazily and
never blocks on an inherited-locked pool mutex.  Context destroy drains its
in-flight writes before freeing its state.
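
A minimal sketch of the lazy-start and reset-to-unstarted behaviour, under stated assumptions: the names (`pool_lock`, `pool_ensure_started`, `pool_submit`, `pool_atfork_child`) are hypothetical, worker creation is elided, and re-initialising the mutex in the child is a common idiom assumed here rather than something the commit message spells out.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static int pool_started;	/* 0 until the first submitted op */
static int pool_spawns;		/* times the workers were (re)spawned */
static int atfork_registered;

/*
 * Child handler.  Only the forking thread exists in the child, so the
 * inherited mutex may be held by a thread that is gone; re-initialise it
 * instead of locking it, and mark the pool unstarted.
 */
static void
pool_atfork_child(void)
{
	(void)pthread_mutex_init(&pool_lock, NULL);
	pool_started = 0;
}

static int
pool_ensure_started(void)
{
	int ret = 0;

	pthread_mutex_lock(&pool_lock);
	if (!atfork_registered) {
		ret = pthread_atfork(NULL, NULL, pool_atfork_child);
		atfork_registered = (ret == 0);
	}
	if (ret == 0 && !pool_started) {
		/* A real pool would pthread_create() its workers here. */
		pool_started = 1;
		pool_spawns++;
	}
	pthread_mutex_unlock(&pool_lock);
	return (ret);
}

/* The first submit is what pays for the pool; later submits reuse it. */
static int
pool_submit(void)
{
	int ret;

	if ((ret = pool_ensure_started()) != 0)
		return (ret);
	assert(pool_started);
	/* Queueing the op on the submission FIFO would go here. */
	return (0);
}
```

Two submits in one process spawn the pool once; after the child handler runs, the next submit spawns it again.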

Validated: lazy-pool submit/reap round-trip (data + 2 completions),
test001 btree.

Add pkgs.liburing (Linux only) to the devShell and the autoconf build
inputs so configure autodetects liburing and HAVE_IO_URING + -luring are
enabled in the nix build/dev environment.

…uild

Stage 2 previously offered only io_uring (Linux), IOCP (Windows), and a
portable thread-pool offload, so every non-Linux/Windows platform fell back
to the thread pool.  Add the two missing native file-AIO engines and make
the thread pool a true last resort.

- os_aio_posix.c: POSIX.1b backend (aio_read/aio_write submit; aio_suspend +
  aio_error/aio_return reap), bounded in-flight table sized to ctx->depth,
  synchronous fallback when the table is full or the kernel refuses an op.
  Native path on Solaris/illumos and macOS.
- os_aio_kqueue.c: BSD backend using aiocb with SIGEV_KEVENT so completions
  post EVFILT_AIO kevents reaped via kevent().  Gated on the sigev_notify_kqueue
  member, which FreeBSD has and macOS lacks, so macOS uses the POSIX backend.
- os_aio.c: selection order io_uring > IOCP > kqueue-aio > POSIX aio >
  thread pool, each #ifdef-gated.
- configure.ac: autodetect HAVE_AIO_POSIX (aio.h + aio_read in libc/librt)
  and HAVE_AIO_KQUEUE (sigev_notify_kqueue + EVFILT_AIO).
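
For readers unfamiliar with the POSIX calls named in the first bullet, here is a minimal single-op sketch of submit followed by reap. It is not the contents of os_aio_posix.c: `aio_write_wait` and `aio_demo` are hypothetical helpers, there is no in-flight table, and the op is waited on immediately. On older glibc this may need `-lrt` at link time.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Submit one write with aio_write, then block until it has completed. */
static ssize_t
aio_write_wait(int fd, const void *buf, size_t len, off_t off)
{
	struct aiocb cb;
	const struct aiocb *list[1];
	ssize_t n;
	int err;

	assert(buf != NULL);
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf = (void *)buf;
	cb.aio_nbytes = len;
	cb.aio_offset = off;
	cb.aio_sigevent.sigev_notify = SIGEV_NONE;	/* poll, no signal */

	if (aio_write(&cb) != 0)
		return (-1);		/* the system refused the op */

	/* Reap: sleep in aio_suspend until aio_error stops saying "busy". */
	list[0] = &cb;
	while ((err = aio_error(&cb)) == EINPROGRESS)
		(void)aio_suspend(list, 1, NULL);	/* may wake on EINTR */

	n = aio_return(&cb);		/* call exactly once per op */
	if (err != 0) {
		errno = err;
		return (-1);
	}
	return (n);
}

/* Demo: write 4 bytes asynchronously, read them back; 0 on a match. */
static int
aio_demo(const char *path)
{
	char out[4];
	int fd, ok;

	if ((fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600)) < 0)
		return (-1);
	ok = aio_write_wait(fd, "page", 4, 0) == 4 &&
	    pread(fd, out, sizeof(out), 0) == 4 &&
	    memcmp(out, "page", 4) == 0;
	(void)close(fd);
	(void)unlink(path);
	return (ok ? 0 : -1);
}
```

A real backend would keep many `aiocb`s outstanding and pass the whole table to `aio_suspend`, which is where the bounded in-flight table in the bullet above comes in.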

Build fix (required for any modern toolchain): the tree is K&R C, and
Autoconf >= 2.72 AC_PROG_CC adopts -std=gnu23, under which old-style
definitions are a hard error.  Pin gnu99 (strip an injected -std=gnu23 from
CC, append -std=gnu99 to CFLAGS) so K&R stays legal, and silence the two
inherent K&R diagnostics (-Wno-deprecated-non-prototype,
-Wno-knr-promoted-parameter) behind compiler-support probes.  configure
regenerated with Autoconf 2.73 (the dev-flake toolchain).

Verified: clean build on macOS (Apple clang 21) with zero source warnings;
configure selects POSIX aio (kqueue correctly excluded for lack of
sigev_notify_kqueue); all AIO objects compile.