perf(mpool): Stage 2 — async I/O backends (io_uring/IOCP/kqueue/POSIX-aio/threadpool)#24
Draft
gburd wants to merge 11 commits into
Introduce a thin, pluggable async-I/O layer (DB_AIO_CONTEXT) the buffer pool will use to prefetch pages and trickle dirty pages without blocking a foreground thread on a device write. This commit lands the interface and the behaviour-preserving synchronous backend: submit performs the I/O inline via __os_io and runs the completion immediately; reap is a no-op; available() is 0. Platform backends (Linux io_uring first) plug in through DB_AIO_BACKEND and override submit/reap/cancel; the synchronous path remains the universal fallback. Registered in dist/srcfiles.in and dist/Makefile.in. Validated: full library builds and links; an internal-API unit test round-trips a 512B page through submit(WRITE)+submit(READ)+reap with correct data and two completions. Not yet wired into mpool -- the io_uring backend and the trickle/prefetch integration (which carry write-path data-integrity requirements) are the next step on this branch.
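The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the DB_AIO_CONTEXT code: every type and function name here (`aio_ctx`, `aio_req`, `aio_submit`, `count_done`, and so on) is hypothetical, and an in-memory byte array stands in for the file that the real code reaches through `__os_io`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; not the DB_AIO_CONTEXT definitions. */
typedef enum { AIO_READ, AIO_WRITE } aio_op;

typedef struct aio_req {
	aio_op op;
	void *buf;                /* caller's page buffer */
	size_t len;
	size_t off;               /* byte offset in the backing store */
	void (*done)(struct aio_req *, int err);  /* completion callback */
	void *cookie;
} aio_req;

typedef struct aio_ctx aio_ctx;

typedef struct aio_backend {
	int (*submit)(aio_ctx *, aio_req *);
	int (*reap)(aio_ctx *, int max, int wait);
} aio_backend;

struct aio_ctx {
	const aio_backend *be;    /* NULL selects the synchronous fallback */
	unsigned char *store;     /* stand-in for a file */
	size_t store_len;
};

/* Synchronous fallback: do the I/O inline, then run the callback. */
static int sync_submit(aio_ctx *ctx, aio_req *r)
{
	int err = 0;

	if (r->off > ctx->store_len || r->len > ctx->store_len - r->off)
		err = -1;
	else if (r->op == AIO_WRITE)
		memcpy(ctx->store + r->off, r->buf, r->len);
	else
		memcpy(r->buf, ctx->store + r->off, r->len);
	if (r->done != NULL)
		r->done(r, err);
	return err;
}

static int aio_submit(aio_ctx *ctx, aio_req *r)
{
	if (ctx->be != NULL && ctx->be->submit != NULL)
		return ctx->be->submit(ctx, r);
	return sync_submit(ctx, r);
}

static int aio_reap(aio_ctx *ctx, int max, int wait)
{
	if (ctx->be != NULL && ctx->be->reap != NULL)
		return ctx->be->reap(ctx, max, wait);
	(void)max;
	(void)wait;
	return 0;                 /* nothing is ever pending on the sync path */
}

/* Nonzero only when a real asynchronous backend is attached. */
static int aio_available(const aio_ctx *ctx)
{
	return ctx->be != NULL;
}

/* Example callback: counts completions through the request cookie. */
static void count_done(aio_req *r, int err)
{
	if (err == 0)
		++*(int *)r->cookie;
}
```

A platform backend would fill in an `aio_backend` and attach it to the context; callers keep using `aio_submit` and `aio_reap` unchanged, which is the point of keeping the synchronous path as the universal fallback.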
Add a real asynchronous backend for the Stage 2 os_aio layer on Linux, adapting the io_uring file-I/O mechanics from the XTC Project's libxtc (src/io/io_uring.c, ISC, with the author's permission): SQE acquisition with submit-and-retry on a full ring, io_uring_prep_read/write keyed by DB_AIO_OP, and a CQE drain that maps each completion back to its op and runs the callback. The readiness/poll/wakeup machinery from libxtc is dropped -- the buffer pool only issues file reads/writes. The context struct moves to dbinc/os_aio.h so the backend can attach its vtable + ring; __os_aio_create installs the backend under HAVE_IO_URING and otherwise keeps the synchronous fallback; __os_aio_ctx_available reports whether a context is truly async. Built only with HAVE_IO_URING (else an empty TU); registered in srcfiles.in and Makefile.in. Validated (sync path): local build clean. io_uring path build+round-trip on meh next (liburing built from source).
__os_aio_reap ran COMPQUIET(max,0)/COMPQUIET(wait,0) -- which assign the variables, not just silence warnings -- before forwarding to the backend, so every async reap saw max=0,wait=0 and never blocked or drained. Dispatch to the backend first; COMPQUIET only on the synchronous fall-through.
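The ordering bug is easy to reproduce in isolation. The sketch below uses a macro modelled on the COMPQUIET idiom (it assigns to its argument) and hypothetical `reap_buggy` / `reap_fixed` / `backend_reap` functions; none of these are the actual `__os_aio_reap` code.

```c
#include <assert.h>

/* Modelled on the COMPQUIET idiom: quiets "unused" warnings by assigning. */
#define COMPQUIET(n, v) do { (n) = (v); (n) = (n); } while (0)

static int seen_max, seen_wait;

/* Stand-in backend reap that records the arguments it received. */
static int backend_reap(int max, int wait)
{
	seen_max = max;
	seen_wait = wait;
	return 0;
}

/* Buggy order: the arguments are zeroed before the backend sees them. */
static int reap_buggy(int have_backend, int max, int wait)
{
	COMPQUIET(max, 0);
	COMPQUIET(wait, 0);
	if (have_backend)
		return backend_reap(max, wait);
	return 0;
}

/* Fixed order: dispatch first, quiet only on the synchronous fall-through. */
static int reap_fixed(int have_backend, int max, int wait)
{
	if (have_backend)
		return backend_reap(max, wait);
	COMPQUIET(max, 0);
	COMPQUIET(wait, 0);
	return 0;
}
```

With `reap_buggy(1, 8, 1)` the stand-in backend records `max=0, wait=0`; with `reap_fixed(1, 8, 1)` it records the caller's values.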
Add the cross-platform async path for every system without a native file completion engine (everything but Linux io_uring / Windows IOCP): a fixed worker pool drains a submission FIFO, runs each op via the normal __os_io, and posts results to a completion FIFO that __os_aio_reap drains (blocking on a condvar when asked to wait). The pool structure is adapted from libxtc (src/ptc/blocking.c + aio.c, ISC) with a completion queue replacing its fiber-park wakeup; per-op I/O reuses BDB's portable __os_io so no platform read/write branches are needed. __os_aio_create now selects in preference order: io_uring (HAVE_IO_URING) > thread-pool (HAVE_AIO_THREADPOOL) > synchronous fallback. POSIX uses pthreads; a Win32 path lands behind the same surface. Registered in srcfiles.in and Makefile.in. Validated: on macOS (no io_uring) the pool is the active backend and the submit/reap round-trip completes with correct data and callbacks.
Adapt libxtc's IOCP file AIO (src/io/io_iocp.c, ISC): overlapped ReadFile/WriteFile on a file HANDLE associated with a completion port, reaped in batch via GetQueuedCompletionStatusEx -- no worker thread. The socket AFD-poll machinery is dropped; only file I/O is needed. __os_aio_create selection is now io_uring > IOCP > thread-pool > sync. Guarded by HAVE_IOCP (off in every build produced here, so it is an empty TU on POSIX and cannot affect those builds). NOTE: faithful adaptation but NOT yet compiled/validated on Windows; the thread-pool backend remains the validated Windows async path until a Windows build confirms it. Registered in srcfiles.in and Makefile.in.
configure.ac now probes liburing.h + io_uring_queue_init (defining HAVE_IO_URING and adding -luring to LIBSO_LIBS so the shared library and utilities link) and pthread.h (defining HAVE_AIO_THREADPOOL for the portable offload). Both optional; absent either, os_aio is synchronous. config.hin carries the new template entries. NOTE: the generated dist/configure must be regenerated with dist/s_config on a host with autoconf for this to take effect; the committed configure is unchanged here (no autoconf available in this environment).
… AIO ctx Split the page-write into __memp_pgwrite_prep (WAL flush, write-ahead verify, backup coordination, pgout) and __memp_pgwrite_finish (writer-count release, stats, page-image free, BH_DIRTY/BH_TRASH clear under the hash latch). __memp_pgwrite is now a thin wrapper (prep + __os_io + finish) with identical behavior, so the same logic can drive an asynchronous write. Also adds a best-effort per-process DB_MPOOL->aio_ctx (created at mpool open, destroyed at refresh) for upcoming async writeback. Behavior-preserving (one harmless change: the backup writer count is now balanced on the pgout-failure path, which previously leaked it). Validated: test001 btree+hash, recd001 recovery.
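The shape of the split can be sketched with a toy page structure. All names below (`page`, `pgwrite_prep`, `pgwrite_finish`, `pgwrite`, `io_ok`, `io_fail`) are hypothetical stand-ins, and the prep step is reduced to the writer-count bookkeeping; the real `__memp_pgwrite_prep` also performs the WAL flush, verification, and pgout.

```c
#include <assert.h>

/* Hypothetical page state; field names are illustrative only. */
typedef struct page {
	int dirty;
	int writers;      /* stand-in for the backup writer count */
	int pgout_fails;  /* test knob: make the prep step fail */
} page;

typedef int (*io_fn)(page *);

/* Prep: everything that must happen before bytes reach the device. */
static int pgwrite_prep(page *p)
{
	p->writers++;                 /* register as a writer */
	if (p->pgout_fails) {
		p->writers--;         /* balance the count on failure */
		return -1;
	}
	return 0;
}

/* Finish: release the writer; mark the page clean only on success. */
static void pgwrite_finish(page *p, int io_err)
{
	p->writers--;
	if (io_err == 0)
		p->dirty = 0;
}

/* Synchronous wrapper: prep + I/O + finish. */
static int pgwrite(page *p, io_fn io)
{
	int ret;

	if ((ret = pgwrite_prep(p)) != 0)
		return ret;
	ret = io(p);
	pgwrite_finish(p, ret);
	return ret;
}

/* Toy I/O routines standing in for the device write. */
static int io_ok(page *p)   { (void)p; return 0; }
static int io_fail(page *p) { (void)p; return -1; }
```

An asynchronous caller would run `pgwrite_prep`, submit the write, and call `pgwrite_finish` from the completion, so both paths share the same bookkeeping.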
Wire the per-process AIO context into the checkpoint/sync loop (__memp_sync_int). When an async backend is available, each dirty buffer is written with __memp_bhwrite_async: WAL flush + pgout in prep, the page write submitted via os_aio while the buffer pin (ref + shared mtx_buf) and a file-handle reference are held, and a bounded in-flight window (MEMP_AIO_WINDOW=16) is reaped (BH_DIRTY clear + reference release) at the window boundary, at done, and before an interrupt. The required fsync runs only after the window is drained, preserving checkpoint durability; WAL ordering is unchanged (log flushed in prep before submit). Crash consistency is identical to the synchronous path -- the log remains the source of truth and a page is marked clean only after its write completes. Gated on dbmp->aio_ctx availability: builds without an async backend (and all non-fast-path files: dead/temporary/extent/unopened/read-only) use the unchanged synchronous __memp_bhwrite. The eviction path (__memp_bhwrite -> __memp_pgwrite) is untouched. Validated with the threadpool backend active (aio_ctx available=1): recd001/002/005/015 recovery, txn001, test001 btree+hash, test003, test010.
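The windowing discipline can be modelled without any real I/O. In this sketch the `writer`, `buf`, `submit_write`, `drain`, and `sync_all` names are hypothetical, writes are assumed to complete at reap time, and `WINDOW` mirrors the MEMP_AIO_WINDOW=16 value quoted above. It only demonstrates the ordering: at most 16 writes in flight, a buffer marked clean only when reaped, and the fsync issued after the window is empty.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW 16   /* mirrors MEMP_AIO_WINDOW=16 from the text */

typedef struct buf { int dirty; int pinned; } buf;

typedef struct writer {
	buf *inflight[WINDOW];
	int n_inflight;
	int max_seen;             /* high-water mark of in-flight writes */
	int fsync_calls;
	int fsync_with_inflight;  /* must stay 0: fsync only after drain */
} writer;

/* Reap everything in flight: mark clean and release the pin. */
static void drain(writer *w)
{
	while (w->n_inflight > 0) {
		buf *b = w->inflight[--w->n_inflight];
		b->dirty = 0;     /* clean only after the write completed */
		b->pinned = 0;
	}
}

/* Submit one write, draining first if the window is full. */
static void submit_write(writer *w, buf *b)
{
	if (w->n_inflight == WINDOW)
		drain(w);
	b->pinned = 1;            /* hold the buffer while in flight */
	w->inflight[w->n_inflight++] = b;
	if (w->n_inflight > w->max_seen)
		w->max_seen = w->n_inflight;
}

static void do_fsync(writer *w)
{
	w->fsync_calls++;
	if (w->n_inflight != 0)
		w->fsync_with_inflight++;
}

/* Write every dirty buffer, then drain, then fsync. */
static int sync_all(writer *w, buf *bufs, size_t n)
{
	size_t i;
	int written = 0;

	for (i = 0; i < n; i++) {
		if (!bufs[i].dirty)
			continue;
		submit_write(w, &bufs[i]);
		written++;
	}
	drain(w);
	do_fsync(w);
	return written;
}
```

A bounded window keeps the number of pinned buffers and held file-handle references predictable, at the cost of a stall each time the window fills.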
The previous backend spawned 4 worker threads at every context (every env) creation, which env-churning workloads (and BDB's own env tests) pay repeatedly, and worker threads in a library are a fork hazard. Replace it with a single process-wide pool created lazily on the first submitted op: a context costs nothing until it actually issues an async write, and many environments share one pool. Each context keeps only its own completion FIFO + condvar; one global lock guards the submission FIFO and the completion FIFOs. A pthread_atfork child handler resets the pool to unstarted (workers do not survive fork) so a child re-spawns lazily and never blocks on an inherited-locked pool mutex. Context destroy drains its in-flight writes before freeing its state. Validated: lazy-pool submit/reap round-trip (data + 2 completions), test001 btree.
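A minimal sketch of the lazy-start and fork-reset pattern follows. The names (`pool_ensure_started`, `pool_atfork_child`, `pool_starts`) are hypothetical, worker creation is reduced to a counter, and re-initialising the mutex in the child is an assumption of this sketch (a common idiom), not a statement of what the backend does.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t pool_mtx = PTHREAD_MUTEX_INITIALIZER;
static int pool_started;       /* 0 until the first submitted op */
static int atfork_registered;
static int pool_starts;        /* counts (re)starts, for illustration */

/*
 * Child-side fork handler: worker threads do not exist in the child and
 * the mutex may have been held by a parent thread at fork time, so reset
 * both and let the next submit start the pool again.
 */
static void pool_atfork_child(void)
{
	pthread_mutex_init(&pool_mtx, NULL);
	pool_started = 0;
}

/* Called on the submit path; creating a context never reaches this. */
static void pool_ensure_started(void)
{
	pthread_mutex_lock(&pool_mtx);
	if (!atfork_registered) {
		pthread_atfork(NULL, NULL, pool_atfork_child);
		atfork_registered = 1;
	}
	if (!pool_started) {
		/* A real pool would spawn its worker threads here. */
		pool_started = 1;
		pool_starts++;
	}
	pthread_mutex_unlock(&pool_mtx);
}
```

Because the start is deferred to the first submit, environments that never issue an async write pay nothing, and repeated calls after the first are a lock, two flag checks, and an unlock.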
Add pkgs.liburing (Linux only) to the devShell and the autoconf build inputs so configure autodetects liburing and HAVE_IO_URING + -luring are enabled in the nix build/dev environment.
…uild Stage 2 previously offered only io_uring (Linux), IOCP (Windows), and a portable thread-pool offload, so every non-Linux/Windows platform fell back to the thread pool. Add the two missing native file-AIO engines and make the thread pool a true last resort.
- os_aio_posix.c: POSIX.1b backend (aio_read/aio_write submit; aio_suspend + aio_error/aio_return reap), bounded in-flight table sized to ctx->depth, synchronous fallback when the table is full or the kernel refuses an op. Native path on Solaris/illumos and macOS.
- os_aio_kqueue.c: BSD backend using aiocb with SIGEV_KEVENT so completions post EVFILT_AIO kevents reaped via kevent(). Gated on the sigev_notify_kqueue member, which FreeBSD has and macOS lacks, so macOS uses the POSIX backend.
- os_aio.c: selection order io_uring > IOCP > kqueue-aio > POSIX aio > thread pool, each #ifdef-gated.
- configure.ac: autodetect HAVE_AIO_POSIX (aio.h + aio_read in libc/librt) and HAVE_AIO_KQUEUE (sigev_notify_kqueue + EVFILT_AIO).
Build fix (required for any modern toolchain): the tree is K&R C, and Autoconf >= 2.72 AC_PROG_CC adopts -std=gnu23, under which old-style definitions are a hard error. Pin gnu99 (strip an injected -std=gnu23 from CC, append -std=gnu99 to CFLAGS) so K&R stays legal, and silence the two inherent K&R diagnostics (-Wno-deprecated-non-prototype, -Wno-knr-promoted-parameter) behind compiler-support probes. configure regenerated with Autoconf 2.73 (the dev-flake toolchain).
Verified: clean build on macOS (Apple clang 21) with zero source warnings; configure selects POSIX aio (kqueue correctly excluded for lack of sigev_notify_kqueue); all AIO objects compile.
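The preference order can be sketched as a compile-time mask plus a first-match pick. The HAVE_* macro names come from the commit messages above; the `BE_*` constants, `compiled_backends`, and `pick_backend` are hypothetical, and the returned strings are labels for illustration only.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical bit flags, one per backend named in the selection order. */
enum {
	BE_IO_URING   = 1 << 0,
	BE_IOCP       = 1 << 1,
	BE_AIO_KQUEUE = 1 << 2,
	BE_AIO_POSIX  = 1 << 3,
	BE_THREADPOOL = 1 << 4
};

/* Which backends this build was configured with (#ifdef-gated). */
static unsigned compiled_backends(void)
{
	unsigned m = 0;
#ifdef HAVE_IO_URING
	m |= BE_IO_URING;
#endif
#ifdef HAVE_IOCP
	m |= BE_IOCP;
#endif
#ifdef HAVE_AIO_KQUEUE
	m |= BE_AIO_KQUEUE;
#endif
#ifdef HAVE_AIO_POSIX
	m |= BE_AIO_POSIX;
#endif
#ifdef HAVE_AIO_THREADPOOL
	m |= BE_THREADPOOL;
#endif
	return m;
}

/* First match wins: io_uring > IOCP > kqueue-aio > POSIX aio > thread pool. */
static const char *pick_backend(unsigned m)
{
	if (m & BE_IO_URING)
		return "io_uring";
	if (m & BE_IOCP)
		return "iocp";
	if (m & BE_AIO_KQUEUE)
		return "kqueue-aio";
	if (m & BE_AIO_POSIX)
		return "posix-aio";
	if (m & BE_THREADPOOL)
		return "threadpool";
	return "sync";            /* universal synchronous fallback */
}
```

Calling `pick_backend(compiled_backends())` gives the backend a given build would prefer; separating the mask from the pick makes the ordering testable without rebuilding.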
Stage 2: asynchronous I/O for the buffer pool
A pluggable os_aio layer so buffer-pool writeback (checkpoint/sync and trickle) can issue page writes asynchronously instead of blocking a foreground thread on each device write, with native completion engines per platform and a portable fallback.
Backends (selected at runtime, in order)
kqueue aio: aiocb with SIGEV_KEVENT, reaped via kevent. Gated on sigev_notify_kqueue; macOS lacks it and uses POSIX aio.
POSIX aio: aio_read/aio_write + aio_suspend/aio_error/aio_return.
Without any backend, os_aio falls back to synchronous I/O (behaviour preserving).
configure.ac autodetects each (HAVE_IO_URING, HAVE_AIO_KQUEUE, HAVE_AIO_POSIX, HAVE_AIO_THREADPOOL).
Integration
__memp_pgwrite split into prep (WAL flush + pgout; the WAL rule stays before submission) / submit / finish.
The checkpoint/sync loop (__memp_sync_int) submits a bounded in-flight window and drains it; trickle uses the same path, so trickle writeback is async too.
The eviction path stays synchronous (checkpoint and trickle keep dirty pages away from the evictor; an async evictor would hold latches waiting on its own completion).
I/O submitted by one process is reaped only by that process; others use the existing mtx_buf in-transit wait.
Thread-pool fork safety
The pool is process-wide, lazy, and fork-safe: a live-context registry lets the pthread_atfork child handler reset every context's completion FIFO and in-flight count (workers do not survive fork) without running completions. Running a write completion in the child would clear BH_DIRTY and drop a pin for I/O the child never performed, corrupting inherited copy-on-write state.
Build
The tree is K&R C; Autoconf >= 2.72 AC_PROG_CC adopts -std=gnu23, under which K&R definitions are a hard error. Pinned -std=gnu99 and silenced the two inherent K&R diagnostics behind compiler-support probes. configure regenerated with Autoconf 2.73 (the dev-flake toolchain).
Status / not yet done
rrand (the read-scaling workload) is 100% cache hits with zero I/O, so AIO cannot move that curve; benefit must be shown on a write/cache-pressure workload. To be measured on the Linux box (io_uring) and FreeBSD (kqueue aio).
Prefetch is a read-side async feature, is correctness-heavy (buffer reservation, in-transit handshake, per-process reap constraint), and should be built against measured evidence rather than speculatively.
Maps to ROADMAP #8. Companion to the design doc (PR #20).