diff --git a/docs/design/scaling-findings.md b/docs/design/scaling-findings.md
deleted file mode 100644
index 263cd414f..000000000
--- a/docs/design/scaling-findings.md
+++ /dev/null
@@ -1,143 +0,0 @@
-# Empirical multi-core scaling findings
-
-Measured on **meh** (Linux 6.12, Xeon E5-2697 v2, 12c/24t, single socket) with
-`lab/bench/scale_bench.c` against a stock Autoconf build of `master`. All data
-fit in a 512 MB cache (no device I/O on the read paths).
-
-## Throughput vs threads
-
-Read-random, in-cache point gets on a 200k-key B-tree:
-
-| threads | ops/sec | scaling vs 1t |
-|--------:|--------:|--------------:|
-| 1 | 200,010 | 1.0x |
-| 2 | 234,004 | 1.2x |
-| 4 | 399,883 | 2.0x |
-| 8 | **464,661** | 2.3x (peak) |
-| 12 | 448,775 | — |
-| 16 | 416,366 | — |
-| 24 | 408,518 | **negative** |
-
-Read throughput **peaks at ~8 threads and then declines** — on a 24-thread box
-we get ~2.3x, not ~12-24x.
-
-## Where the time goes (perf, rrand @ 24t, self-time)
-
-```
-66.70% [kernel] (futex) <- threads blocked in the kernel
-35.43% __db_pthread_mutex_lock
-31.58% __db_pthread_mutex_unlock
-30.55% __lll_lock_wait / 28% __lll_lock_wake
-26.33% __memp_fget   22.55% __memp_fput
-21.90% __atomic_inc   20.91% __atomic_dec
-48.56% __bam_search (B-tree descent; calls __memp_fget per level)
-```
-
-BDB's own wait counters (`lockpart%`, `mpoolhash%`, region waits) are **near
-zero** — the contention does not show up there because it is the **per-page
-buffer mutex** (a pthread mutex → futex) and the **page reference count** atomic,
-not a lock-region/partition mutex.
-
-## Root cause
-
-Every B-tree search descends through the **root and internal pages**. Each
-`__memp_fget` (a) takes that buffer header's mutex and (b) atomically increments
-its pin/reference count; `__memp_fput` reverses it. Because the **root page is
-fetched by every operation on every thread**, its mutex and refcount cache line
-become a single global serialization point:
-
-- under the pthread-mutex build, that mutex goes to the kernel → the **66.7%
-  futex** storm (and negative scaling past the point where futex contention
-  dominates);
-- the refcount atomic inc/dec (~43% combined self-time) ping-pongs one cache
-  line across all cores.
-
-## Workload contrasts (confirming the cause)
-
-- **rhot** (all threads read ONE key): adds lock-manager **partition-mutex**
-  contention — `lockpart%` rises 13% (4t) → 37.5% (24t) — because every read
-  takes a *page read lock* on the single hot page. Reads don't conflict
-  (`conflict%=0`) but acquiring the read lock latches the partition. Page-level
-  read locking is pure overhead for read-mostly workloads.
-- **wrand** (random writes, auto-commit): ~**733 ops/sec single-threaded** —
-  fsync-per-commit bound — and the **lock region mutex** is heavily contended
-  (`lockreg_w` in the thousands). Writes need group commit.
-
-## What this means for the ROADMAP (data-driven re-prioritization)
-
-1. **#2 latch-free / contention-free buffer-header access is the #1 read-scaling
-   fix.** The dominant cost is the per-page mutex + refcount on hot (root /
-   internal) pages. Directions: don't take a kernel-bound mutex to pin a
-   resident page (optimistic/version-validated reads, LeanStore-style), shard or
-   bias-lock the pin count, and avoid pinning hot internal pages on the read
-   path.
-2. **#7 cache-line / false-sharing** is the close-second cost (the refcount
-   atomic). Splitting/aligning the pin counter pairs directly with #2.
-3. **#3 group-commit WAL** is the clear write-path win (writes are fsync-bound),
-   plus the lock-region mutex needs attention under writes.
-4. **#4 lock manager**: page-level read locks are needless overhead for
-   read-mostly access (snapshot/SI reads already skip them — see SSI) and are
-   the hot-key bottleneck.
-5. **Lower priority than the ROADMAP assumed on these boxes:** the mpool **hash**
-   mutex and lock **partitions** are *not* contended here (~0 wait), and both
-   hosts are single-socket so NUMA placement (part of #1) can't be validated.
-   The sharded-buffer-pool hash work matters less than the per-page-pin work
-   until we test on a multi-socket NUMA box.
-
-**Next target: #2 (+#7).** Prototype a contention-free pin for resident pages
-and re-run this sweep to confirm the 8-thread ceiling lifts.
-
-## Isolation experiments (which shared structure is the cap?)
-
-Two workloads were added to separate the candidate causes. Numbers below are an
-in-cache random-read sweep on a 12-core Apple Silicon laptop (noisier and fewer
-cores than `meh`, but the *pattern* matches the 24-core `perf` profile and is
-enough to rank the causes). ops/sec:
-
-| threads | `rrand` (locked, shared db) | `sepdb` (own db/thread) | `snap` (MVCC, no page locks) |
-|--------:|----------------------------:|------------------------:|-----------------------------:|
-| 1 | ~560k | ~627k | 463k |
-| 2 | ~600k | ~693k | 572k |
-| 4 | ~480k | ~638k | 631k |
-| 8 | ~340k | ~502k | 459k |
-| 12 | ~340k | ~417k | 389k |
-
-- **`sepdb`** gives every thread its own database file, so there is no shared
-  root/internal page. It is **30–50% faster than `rrand`** at 4–12 threads,
-  which confirms the **shared hot page is a real bottleneck**. It still declines,
-  so it is not the *only* one — the threads still share one env (mpool region,
-  locker table); the `lockers%` signal is the per-operation locker allocation.
-- **`snap`** reads in a per-thread `DB_TXN_SNAPSHOT` transaction: MVCC reads take
-  **no page read locks** and reuse a single locker, removing the entire
-  lock-manager per-op cost. It scales to **4 threads (1.36×)** where `rrand` is
-  already flat — but it **plateaus at 8–12 threads at the same level as
-  `rrand`**. The only per-op work `snap` still does that `rrand` also does is
-  `__memp_fget`/`__memp_fput` (pin/unpin every page on the root→leaf path).
-
-**Conclusion / ranking (measured, not assumed):**
-
-1. **#2 buffer-header page pin is the dominant cap.** Even with *all* locking
-   removed (`snap`), throughput still ceilings at 8–12 threads, because every
-   read pins the shared root/internal pages through `__memp_fget` (per-page
-   mutex + atomic refcount). This is the change that can lift the ceiling.
-2. **#4 lock manager is a secondary cost** in the 2–8 thread range: page read
-   locks + per-op locker allocation (`lockers%` 25–62%). Snapshot/lock-free
-   reads relieve it but do not remove the #2 ceiling.
-3. **#3 write path** remains fsync-bound and independent (group commit).
-
-This is why the prototype order is **#2 first** (contention-free pin for
-resident pages), then #4 (cache/reuse lockers, or default read-mostly access to
-the lock-free path), then #3 (group commit).
-
-## Reproduce
-
-```sh
-# on a build host:
-cc -O2 -pthread lab/bench/scale_bench.c -Ibuild_unix -Lbuild_unix/.libs -ldb-5.3 -o scale_bench
-LD_LIBRARY_PATH=build_unix/.libs ./scale_bench rrand 200000 3 1 2 4 8 12 16 24
-# workloads: rrand | rhot | wrand | sepdb | snap
-#   rrand  shared db, locked reads (baseline)
-#   sepdb  one db file per thread (isolates shared-page contention)
-#   snap   per-thread MVCC snapshot txn (isolates lock-manager cost)
-```
-
diff --git a/lab/bench/scale_iso.c b/lab/bench/scale_iso.c
new file mode 100644
index 000000000..5b59fb339
--- /dev/null
+++ b/lab/bench/scale_iso.c
@@ -0,0 +1,207 @@
+/*-
+ * libdb isolation-scaling probe.
+ *
+ * The original scale_bench shares ONE DB handle across all threads, so every
+ * DB->get allocates+frees a transient cursor on that handle's active-cursor
+ * queue under dbp->mutex -- a single mutex that serializes every read and
+ * caps scaling at ~8 cores (measured).  This probe removes that app-level
+ * bottleneck the supported way -- each thread opens its OWN handle on the
+ * SAME database file -- and measures how far BDB then scales under several
+ * isolation levels, WITHOUT resorting to uncommitted reads:
+ *
+ *   none   DB->get(txn=NULL)   per-op auto page read lock
+ *   rc     per-op txn + DB_READ_COMMITTED
+ *   snap   long-lived per-thread DB_TXN_SNAPSHOT (MVCC, no read locks)
+ *   uncom  DB_READ_UNCOMMITTED (baseline; isolation sacrificed)
+ *
+ * All threads read the same shared dataset (full isolation, not separate
+ * DBs), so this isolates handle/cursor scaling from the data.
+ *
+ *   cc -O2 -pthread scale_iso.c -I -L/.libs -ldb-5.3 -o scale_iso
+ *   ./scale_iso [t2 ...]
+ */
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "db.h"
+
+enum { ISO_NONE = 0, ISO_RC = 1, ISO_SNAP = 2, ISO_UNCOM = 3 };
+
+static DB_ENV *env;
+static int g_iso;
+static uint32_t g_nkeys;
+static volatile int stop, go;
+
+/* Cache-line-padded per-thread state: no false sharing of the op counter, and
+ * each thread carries its own handle + (for snap) its long-lived read txn. */
+typedef struct {
+	uint64_t ops;
+	unsigned seed;
+	int tid;
+	DB *db;		/* this thread's own handle on the shared file */
+	DB_TXN *rtxn;	/* long-lived read txn (snap), else NULL */
+	char pad[32];	/* pad to one 64-byte cache line */
+} targ_t __attribute__((aligned(64)));
+
+static double
+now_sec(void)
+{
+	struct timespec ts;
+	clock_gettime(CLOCK_MONOTONIC, &ts);
+	return (ts.tv_sec + ts.tv_nsec / 1e9);
+}
+
+static void *
+worker(void *a)
+{
+	targ_t *t = a;
+	DBT key, data;
+	uint32_t kb, vbuf[32];
+	DB_TXN *txn;
+	int ret;
+
+	while (!go) { }
+	while (!stop) {
+		uint32_t k = (uint32_t)(rand_r(&t->seed) % g_nkeys);
+		memset(&key, 0, sizeof(key)); key.data = &kb; key.size = sizeof(kb); kb = k;
+		memset(&data, 0, sizeof(data));
+		data.data = vbuf; data.ulen = sizeof(vbuf); data.flags = DB_DBT_USERMEM;
+
+		txn = NULL;
+		ret = 0;
+		switch (g_iso) {
+		case ISO_SNAP:	/* long-lived snapshot txn, no read locks */
+			ret = t->db->get(t->db, t->rtxn, &key, &data, 0);
+			break;
+		case ISO_RC:	/* per-op txn, read-committed (locks held to commit) */
+			if ((ret = env->txn_begin(env, NULL, &txn, DB_READ_COMMITTED)) != 0)
+				break;
+			ret = t->db->get(t->db, txn, &key, &data, 0);
+			if (ret == DB_BUFFER_SMALL) ret = 0;
+			if (ret == 0 || ret == DB_NOTFOUND)
+				(void)txn->commit(txn, DB_TXN_NOSYNC);
+			else
+				(void)txn->abort(txn);
+			break;
+		case ISO_UNCOM:	/* uncommitted: no read locks (isolation sacrificed) */
+			ret = t->db->get(t->db, NULL, &key, &data, DB_READ_UNCOMMITTED);
+			break;
+		default:	/* none: auto per-op page read lock */
+			ret = t->db->get(t->db, NULL, &key, &data, 0);
+			break;
+		}
+		if (ret == DB_BUFFER_SMALL) ret = 0;
+		if (ret != 0 && ret != DB_NOTFOUND) {
+			env->err(env, ret, "get k=%u", k);
+			return (NULL);
+		}
+		t->ops++;
+	}
+	return (NULL);
+}
+
+static int
+open_handle(const char *file, uint32_t flags, DB **dbp)
+{
+	int ret;
+	if ((ret = db_create(dbp, env, 0)) != 0) return (ret);
+	return ((*dbp)->open(*dbp, NULL, file, NULL, DB_BTREE, flags, 0));
+}
+
+static void
+run(int nthreads, double secs)
+{
+	pthread_t th[256];
+	static targ_t ta[256];
+	double t0, dur;
+	uint64_t total = 0;
+	uint32_t rdflag;
+	int i;
+
+	stop = go = 0;
+	rdflag = g_iso == ISO_UNCOM ? DB_READ_UNCOMMITTED :
+	    g_iso == ISO_SNAP ? DB_MULTIVERSION : 0;
+	for (i = 0; i < nthreads; i++) {
+		ta[i].ops = 0; ta[i].seed = (unsigned)(i * 2654435761u + 1); ta[i].tid = i;
+		ta[i].rtxn = NULL;
+		/* Per-thread handle on the SHARED file. */
+		if (open_handle("bench.db", DB_AUTO_COMMIT | DB_THREAD | rdflag, &ta[i].db) != 0) {
+			fprintf(stderr, "open_handle %d failed\n", i); exit(1);
+		}
+		if (g_iso == ISO_SNAP &&
+		    env->txn_begin(env, NULL, &ta[i].rtxn, DB_TXN_SNAPSHOT) != 0) {
+			fprintf(stderr, "txn_begin snap %d failed\n", i); exit(1);
+		}
+		pthread_create(&th[i], NULL, worker, &ta[i]);
+	}
+	go = 1;
+	t0 = now_sec();
+	struct timespec sl = { (time_t)secs, (long)((secs - (long)secs) * 1e9) };
+	nanosleep(&sl, NULL);
+	stop = 1;
+	for (i = 0; i < nthreads; i++) {
+		pthread_join(th[i], NULL);
+		total += ta[i].ops;
+		if (ta[i].rtxn != NULL) (void)ta[i].rtxn->commit(ta[i].rtxn, 0);
+		(void)ta[i].db->close(ta[i].db, 0);
+	}
+	dur = now_sec() - t0;
+	printf("%-6s %3d %12.0f ops/sec\n",
+	    g_iso == ISO_SNAP ? "snap" : g_iso == ISO_RC ? "rc" :
+	    g_iso == ISO_UNCOM ? "uncom" : "none", nthreads, total / dur);
+	fflush(stdout);
+}
+
+int
+main(int argc, char **argv)
+{
+	DB *db;
+	DBT key, data;
+	uint32_t kb, i;
+	char vbuf[100];
+	int ret, ai;
+
+	if (argc < 5) {
+		fprintf(stderr, "usage: %s \n", argv[0]);
+		return (1);
+	}
+	g_iso = strcmp(argv[1], "snap") == 0 ? ISO_SNAP :
+	    strcmp(argv[1], "rc") == 0 ? ISO_RC :
+	    strcmp(argv[1], "uncom") == 0 ? ISO_UNCOM : ISO_NONE;
+	g_nkeys = (uint32_t)atoi(argv[2]);
+	double secs = atof(argv[3]);
+
+	system("rm -rf ./ISODB && mkdir ./ISODB");
+	if ((ret = db_env_create(&env, 0)) != 0) { fprintf(stderr, "env_create %d\n", ret); return 1; }
+	env->set_errfile(env, stderr);
+	env->set_cachesize(env, 0, 512 * 1024 * 1024, 1);
+	/* MVCC needs to be enabled on the env for snapshot reads. */
+	if ((ret = env->open(env, "./ISODB", DB_CREATE | DB_INIT_MPOOL |
+	    DB_INIT_LOCK | DB_INIT_TXN | DB_INIT_LOG | DB_THREAD | DB_MULTIVERSION, 0)) != 0) {
+		env->err(env, ret, "env open"); return 1;
+	}
+
+	/* Load the shared dataset once with a plain handle. */
+	if ((ret = open_handle("bench.db", DB_CREATE | DB_AUTO_COMMIT | DB_THREAD, &db)) != 0) {
+		env->err(env, ret, "load open"); return 1;
+	}
+	memset(vbuf, 'v', sizeof(vbuf));
+	for (i = 0; i < g_nkeys; i++) {
+		memset(&key, 0, sizeof(key)); key.data = &kb; key.size = sizeof(kb); kb = i;
+		memset(&data, 0, sizeof(data)); data.data = vbuf; data.size = sizeof(vbuf);
+		if ((ret = db->put(db, NULL, &key, &data, 0)) != 0) { env->err(env, ret, "load"); return 1; }
+	}
+	(void)db->close(db, 0);
+	printf("# iso=%s nkeys=%u secs=%.1f (per-thread handle on shared bench.db)\n",
+	    argv[1], g_nkeys, secs);
+
+	for (ai = 4; ai < argc; ai++)
+		run(atoi(argv[ai]), secs);
+
+	env->close(env, 0);
+	return (0);
+}
diff --git a/src/btree/bt_compact.c b/src/btree/bt_compact.c
index ac91e6d0c..1b01d032a 100644
--- a/src/btree/bt_compact.c
+++ b/src/btree/bt_compact.c
@@ -2635,6 +2635,17 @@ again: if (F_ISSET(dbp, DB_AM_SUBDB) &&
 				LSN_NOT_LOGGED(LSN(meta));
 			bt->bt_root = meta->root = PGNO(root);
 			bt->revision = dbp->mpf->mfp->revision;
+			/*
+			 * The tree root moved to a new page.  Any cached
+			 * root-snapshot for this handle now points at the old
+			 * (soon-to-be-freed) root frame, so disarm it under the
+			 * handle mutex; the next read descent rebuilds it.
+			 * Clearing bt_rootpage is sufficient: __bam_rsnap_child
+			 * bails when it is NULL, before touching bt_rsnap.
+			 */
+			MUTEX_LOCK(dbp->env, dbp->mutex);
+			bt->bt_rootpage = NULL;
+			MUTEX_UNLOCK(dbp->env, dbp->mutex);
 			if ((ret = __memp_fput(dbp->mpf,
 			    ip, root, dbp->priority)) != 0)
 				goto err;
diff --git a/src/btree/bt_method.c b/src/btree/bt_method.c
index bac2cc82b..72788650e 100644
--- a/src/btree/bt_method.c
+++ b/src/btree/bt_method.c
@@ -113,6 +113,20 @@ __bam_db_close(dbp)
 	if (t->re_source != NULL)
 		__os_free(dbp->env, t->re_source);
 
+	/* Free root snapshots: the current copy and all retired copies. */
+	{
+		BAM_RSNAP *s, *snext;
+
+		for (s = t->bt_rsnap; s != NULL; s = snext) {
+			snext = s->next;
+			__os_free(dbp->env, s);
+		}
+		for (s = t->bt_rsnap_free; s != NULL; s = snext) {
+			snext = s->next;
+			__os_free(dbp->env, s);
+		}
+	}
+
 	__os_free(dbp->env, t);
 	dbp->bt_internal = NULL;
 
diff --git a/src/btree/bt_search.c b/src/btree/bt_search.c
index ca9997c12..499d5bf2d 100644
--- a/src/btree/bt_search.c
+++ b/src/btree/bt_search.c
@@ -49,6 +49,146 @@
 #include "dbinc/lock.h"
 #include "dbinc/mp.h"
 
+static int __bam_rsnap_refresh __P((DBC *));
+static int __bam_rsnap_child __P((DBC *, const DBT *, db_pgno_t *, DB_LSN *));
+
+/*
+ * __bam_rsnap_refresh --
+ *	Refresh this handle's private copy of the B-tree root.  Fetches the
+ *	(wired) root once under its shared latch, takes a consistent copy and
+ *	its LSN, and publishes it.  The previously-current copy is retired to
+ *	a free list (freed at handle close) so a concurrent reader still
+ *	holding it is never freed underneath -- root changes are rare, so few
+ *	copies accumulate.  Serialized by the handle mutex.
+ */
+static int
+__bam_rsnap_refresh(dbc)
+	DBC *dbc;
+{
+	BTREE *t;
+	BAM_RSNAP *snap;
+	DB *dbp;
+	DB_MPOOLFILE *mpf;
+	DB_LSN lsn;
+	ENV *env;
+	PAGE *h;
+	db_pgno_t root_pgno;
+	u_int32_t psize;
+	int ret, t_ret, wired;
+
+	dbp = dbc->dbp;
+	env = dbp->env;
+	mpf = dbp->mpf;
+	t = dbp->bt_internal;
+	root_pgno = t->bt_root;
+	if (root_pgno == PGNO_INVALID)
+		return (0);
+
+	if ((ret = __memp_fget(mpf, &root_pgno,
+	    dbc->thread_info, dbc->txn, 0, &h)) != 0)
+		return (ret);
+	/*
+	 * Wire the root so the cached buffer address stays resident: only then
+	 * may we keep a pointer to the frame for later lock-free LSN reads.  If
+	 * wiring did not take (mmap'd page, or the per-region wired cap was
+	 * reached), we must not cache the frame -- it is evictable and the
+	 * pointer could dangle -- so we disarm the fast path for this handle
+	 * (bt_rootpage/bt_rsnap NULL) and fall back to the normal descent.
+	 */
+	wired = 0;
+	(void)__memp_wire(mpf, h, &wired);
+	lsn = LSN(h);
+	psize = dbp->pgsize;
+	snap = NULL;
+	if (wired && TYPE(h) == P_IBTREE && psize != 0 &&
+	    (ret = __os_malloc(env, sizeof(BAM_RSNAP) + psize, &snap)) == 0) {
+		snap->next = NULL;
+		snap->lsn = lsn;
+		snap->size = psize;
+		memcpy(BAM_RSNAP_PAGE(snap), h, psize);
+	}
+
+	MUTEX_LOCK(env, dbp->mutex);
+	/* Cache the wired live-root buffer; NULL if it could not be wired. */
+	t->bt_rootpage = wired ? h : NULL;
+	/* Retire the previously-current copy to the free list. */
+	if (t->bt_rsnap != NULL) {
+		((BAM_RSNAP *)t->bt_rsnap)->next = t->bt_rsnap_free;
+		t->bt_rsnap_free = t->bt_rsnap;
+	}
+	t->bt_rsnap = snap;	/* NULL if not wired or the root is a leaf */
+	t->bt_rsnap_lsn = lsn;
+	MUTEX_UNLOCK(env, dbp->mutex);
+
+	if ((t_ret = __memp_fput(mpf,
+	    dbc->thread_info, h, dbc->priority)) != 0 && ret == 0)
+		ret = t_ret;
+	return (ret);
+}
+
+/*
+ * __bam_rsnap_child --
+ *	If this handle holds a current snapshot of the root (its LSN still
+ *	matches the live root), search the snapshot copy for the child that
+ *	the descent for "key" would take, returning that child page number and
+ *	the snapshot LSN.  Returns DB_NOTFOUND if there is no current snapshot
+ *	(the caller falls back to the normal descent and refreshes).
+ */
+static int
+__bam_rsnap_child(dbc, key, childp, snap_lsnp)
+	DBC *dbc;
+	const DBT *key;
+	db_pgno_t *childp;
+	DB_LSN *snap_lsnp;
+{
+	BTREE *t;
+	BAM_RSNAP *snap;
+	DB *dbp;
+	DB_LSN live;
+	PAGE *cp;
+	db_indx_t base, indx, lim;
+	int (*func) __P((DB *, const DBT *, const DBT *));
+	int cmp, ret;
+
+	dbp = dbc->dbp;
+	t = dbp->bt_internal;
+	snap = t->bt_rsnap;
+	if (snap == NULL || t->bt_rootpage == NULL)
+		return (DB_NOTFOUND);
+
+	/* Racy read of the live root LSN; a torn read just forces a refresh. */
+	live = LSN((PAGE *)t->bt_rootpage);
+	if (live.file != snap->lsn.file || live.offset != snap->lsn.offset)
+		return (DB_NOTFOUND);
+
+	cp = BAM_RSNAP_PAGE(snap);
+	if (TYPE(cp) != P_IBTREE)
+		return (DB_NOTFOUND);
+
+	/*
+	 * Mirror the internal-page child selection in __bam_search exactly
+	 * (same binary search, same __bam_cmp, same base->index rule), so the
+	 * child chosen is identical to a normal descent.
+	 */
+	func = t->bt_compare;
+	indx = 0;
+	cmp = 1;
+	DB_BINARY_SEARCH_FOR(base, lim, NUM_ENT(cp), O_INDX) {
+		DB_BINARY_SEARCH_INCR(indx, base, lim, O_INDX);
+		if ((ret = __bam_cmp(dbc, key, cp, indx, func, &cmp)) != 0)
+			return (DB_NOTFOUND);
+		if (cmp == 0)
+			break;
+		if (cmp > 0)
+			DB_BINARY_SEARCH_SHIFT_BASE(indx, base, lim, O_INDX);
+	}
+	if (cmp != 0)
+		indx = base > 0 ? base - O_INDX : base;
+	*childp = GET_BINTERNAL(dbp, cp, indx)->pgno;
+	*snap_lsnp = snap->lsn;
+	return (0);
+}
+
 /*
  * __bam_get_root --
  *	Fetch the root of a tree and see if we want to keep
@@ -128,6 +268,17 @@ retry: if (lock_mode == DB_LOCK_WRITE)
 	DB_ASSERT(dbp->env, TYPE(h) == P_IBTREE || TYPE(h) == P_IRECNO ||
 	    TYPE(h) == P_LBTREE || TYPE(h) == P_LRECNO || TYPE(h) == P_LDUP);
 
+	/*
+	 * Wire the one common tree root so it stays resident: it is fetched by
+	 * every operation, so keeping it non-evictable removes the read-in/
+	 * eviction churn on the hottest page and lets the root snapshot refresh
+	 * cheaply.  Only the main tree root (BAM_ROOT_PGNO) is wired -- subtree
+	 * (off-page duplicate) roots and all internal pages stay evictable.
+	 * Unwired when the page is freed (__db_free) or the file closes.
+	 */
+	if (h->pgno == BAM_ROOT_PGNO(dbc))
+		(void)__memp_wire(mpf, h, NULL);
+
 	/*
 	 * Decide if we need to dirty and/or lock this page.
 	 * We must not hold the latch while we get the lock.
@@ -275,6 +426,9 @@ __bam_search(dbc, root_pgno, key, flags, slevel, recnop, exactp)
 	int (*func) __P((DB *, const DBT *, const DBT *));
 	u_int32_t get_mode, wait;
 	u_int8_t level, saved_level;
+	int from_snap;
+	db_pgno_t snap_child;
+	DB_LSN snap_lsn;
 
 	if (F_ISSET(dbc, DBC_OPD))
 		LOCK_CHECK_OFF(dbc->thread_info);
@@ -305,6 +459,31 @@ __bam_search(dbc, root_pgno, key, flags, slevel, recnop, exactp)
 	 */
 
 	start_pgno = saved_pg = root_pgno;
+
+	/*
+	 * Root-snapshot fast path (option B): for a plain read lookup of the
+	 * main tree (not write/stack/parent/next/del/min/max, not OPD, not
+	 * recno/recnum, not multiversion), take the first child from this
+	 * handle's private root copy and begin the descent there, never
+	 * fetching (pinning/latching) the contended live root.  The copy's
+	 * validity is confirmed against the live root LSN here, and re-checked
+	 * after the child is fetched (below) to close the window where a
+	 * concurrent root change could make the child stale.
+	 */
+	from_snap = 0;
+	if (root_pgno == PGNO_INVALID && key != NULL && slevel == LEAFLEVEL &&
+	    LF_ISSET(SR_READ) && !LF_ISSET(SR_WRITE | SR_PARENT | SR_STACK |
+	    SR_NEXT | SR_DEL | SR_START | SR_BOTH | SR_MIN | SR_MAX |
+	    SR_STK_ONLY) && !F_ISSET(dbc, DBC_OPD) &&
+	    dbc->dbtype == DB_BTREE && !F_ISSET(cp, C_RECNUM) &&
+	    atomic_read(&mpf->mfp->multiversion) == 0 &&
+	    LOGGING_ON(env) && !F_ISSET(dbp, DB_AM_NOT_DURABLE)) {
+		if (__bam_rsnap_child(dbc, key, &snap_child, &snap_lsn) == 0) {
+			start_pgno = snap_child;
+			from_snap = 1;
+		} else
+			(void)__bam_rsnap_refresh(dbc);
+	}
 	saved_level = MAXBTREELEVEL;
 retry:	if ((ret = __bam_get_root(dbc, start_pgno, slevel, flags, &stack)) != 0)
 		goto err;
@@ -339,6 +518,31 @@ retry: if ((ret = __bam_get_root(dbc, start_pgno, slevel, flags, &stack)) != 0)
 
 	BT_STK_CLR(cp);
 
+	/*
+	 * Root-snapshot re-check: we began the descent at a child taken from
+	 * the root copy.  If the live root LSN no longer matches the snapshot,
+	 * the root changed (e.g. a split added a level, or a merge freed the
+	 * child) while we were fetching the child, so the child may be stale or
+	 * reused.  Release it and restart the descent from the real root.
+	 */
+	if (from_snap) {
+		DB_LSN now;
+
+		now = LSN((PAGE *)t->bt_rootpage);
+		if (now.file != snap_lsn.file || now.offset != snap_lsn.offset) {
+			if ((ret = __memp_fput(mpf,
+			    dbc->thread_info, h, dbc->priority)) != 0)
+				goto err;
+			h = NULL;
+			(void)__LPUT(dbc, lock);
+			LOCK_INIT(lock);
+			from_snap = 0;
+			start_pgno = PGNO_INVALID;
+			(void)__bam_rsnap_refresh(dbc);
+			goto retry;
+		}
+	}
+
 	/* Choose a comparison function. */
 	func = F_ISSET(dbc, DBC_OPD) ?
 	    (dbp->dup_compare == NULL ? __bam_defcmp : dbp->dup_compare) :
diff --git a/src/db/db_meta.c b/src/db/db_meta.c
index f89c10665..f72310aca 100644
--- a/src/db/db_meta.c
+++ b/src/db/db_meta.c
@@ -312,6 +312,12 @@ __db_free(dbc, h, flags)
 	meta = NULL;
 	prev = NULL;
 	LOCK_INIT(metalock);
+
+	/*
+	 * The page is being freed back to the file; if it was wired
+	 * (a B-tree internal page) clear that so the frame can be reused.
+	 */
+	(void)__memp_unwire(mpf, h);
 #ifdef HAVE_FTRUNCATE
 	lp = NULL;
 	nelem = 0;
diff --git a/src/dbinc/btree.h b/src/dbinc/btree.h
index 1f4c5d5a9..7d5132c3d 100644
--- a/src/dbinc/btree.h
+++ b/src/dbinc/btree.h
@@ -49,6 +49,21 @@ extern "C" {
 
 /* Forward structure declarations. */
 struct __btree;		typedef struct __btree BTREE;
+
+/*
+ * BAM_RSNAP --
+ *	An immutable private copy of the B-tree root page taken at a known
+ *	root LSN, used by the lock-free root-snapshot descent.  The copied
+ *	page image follows the header (size bytes).  Superseded copies are
+ *	chained via "next" and freed when the handle closes.
+ */
+typedef struct __bam_rsnap {
+	struct __bam_rsnap *next;	/* Chain of superseded copies. */
+	DB_LSN lsn;			/* Root LSN this copy was taken at. */
+	u_int32_t size;			/* Page size (bytes of copy). */
+	/* The copied root page image follows immediately. */
+} BAM_RSNAP;
+#define	BAM_RSNAP_PAGE(s)	((PAGE *)((u_int8_t *)(s) + sizeof(BAM_RSNAP)))
 struct __cursor;	typedef struct __cursor BTREE_CURSOR;
 struct __epg;		typedef struct __epg EPG;
 
@@ -502,6 +517,23 @@ struct __btree { /* Btree access method. */
 	db_pgno_t bt_lpgno;		/* Last insert location. */
 	DB_LSN bt_llsn;			/* Last insert LSN. */
 
+	/*
+	 * Root snapshot (option B): a private, immutable copy of the B-tree
+	 * root used to find the descent's first child without fetching
+	 * (pinning/latching) the contended live root.  bt_rootpage caches the
+	 * wired live-root buffer so a reader can read the current root LSN with
+	 * a plain load; bt_rsnap is the current copy (NULL when the root is a
+	 * leaf); bt_rsnap_lsn is the LSN at the last refresh; bt_rsnap_free
+	 * chains superseded copies, freed when the handle closes (root changes
+	 * are rare, so few accumulate -- this avoids a reader/free race without
+	 * epoch reclamation).  Process-local; cross-process correctness comes
+	 * from validating against the shared live-root LSN.
+	 */
+	void *bt_rootpage;		/* Cached wired live-root buffer. */
+	void *bt_rsnap;			/* Current root copy (BAM_RSNAP *). */
+	DB_LSN bt_rsnap_lsn;		/* Root LSN at last snapshot refresh. */
+	void *bt_rsnap_free;		/* Superseded copies, freed at close. */
+
 	/*
 	 * !!!
 	 * The re_modified field is NOT protected by any mutex, and for this
diff --git a/src/dbinc/mp.h b/src/dbinc/mp.h
index bef9e824b..aaa08f903 100644
--- a/src/dbinc/mp.h
+++ b/src/dbinc/mp.h
@@ -194,6 +194,15 @@ struct __mpool { /* SHARED */
 	 */
 	u_int32_t pages;		/* Number of pages in the cache. */
 
+	/*
+	 * Count of buffers currently wired (non-evictable; B-tree internal
+	 * pages, Stage 1).  Atomic so __memp_wire/__memp_unwire need no region
+	 * lock.  Capped at MPOOL_WIRED_MAX_PCT of `pages` so wiring can never
+	 * starve the cache; over the cap, wiring is simply skipped (the descent
+	 * falls back to a normal pin).
+	 */
+	db_atomic_t wired_pages;
+
 	/*
 	 * The stat fields are not thread protected, and cannot be trusted.
 	 */
@@ -415,6 +424,14 @@ struct __db_mpool_fstat_int { /* SHARED */
 #define	MPOOL_CLOCK_HOT		MPOOL_CLOCK_DEFAULT	/* >= this is protected */
 #define	MPOOL_CLOCK_ADMIT	MPOOL_CLOCK_VERY_LOW	/* probationary warmth */
 
+/*
+ * Wiring cap (Stage 1): at most this percent of a cache region's buffers may
+ * be wired (held non-evictable for B-tree internal pages), so wiring can never
+ * starve the cache.  Over the cap, __memp_wire is a no-op and the descent uses
+ * a normal pin.
+ */
+#define	MPOOL_WIRED_MAX_PCT	25
+
 /*
  * MPOOLFILE --
  *	Shared DB_MPOOLFILE information.
@@ -579,6 +596,17 @@ struct __bh { /* SHARED */
 #define	BH_THAWED	0x100		/* Page was thawed. */
 	u_int16_t flags;
 
+	/*
+	 * Stage 1: "wired" buffers are exempt from eviction (set for B-tree
+	 * internal/root pages so the optimistic descent can read them without
+	 * risk of the frame being reclaimed under it).  A dedicated byte, not a
+	 * flags bit: it is set with a plain monotonic store while the caller
+	 * holds only a shared buffer latch, so it must not share the non-atomic
+	 * RMW of the flags word (which __memp_pgwrite clears BH_DIRTY in under a
+	 * shared latch).  Reset to 0 wherever a buffer header is (re)initialized.
+	 */
+	u_int8_t wired;
+
 	u_int32_t priority;		/* Priority. */
 	SH_TAILQ_ENTRY hq;		/* MPOOL hash bucket queue. */
 
diff --git a/src/dbinc_auto/mp_ext.h b/src/dbinc_auto/mp_ext.h
index d142b5846..8df869c35 100644
--- a/src/dbinc_auto/mp_ext.h
+++ b/src/dbinc_auto/mp_ext.h
@@ -42,6 +42,8 @@ int __memp_mf_discard __P((DB_MPOOL *, MPOOLFILE *, int));
 int __memp_inmemlist __P((ENV *, char ***, int *));
 int __memp_fput_pp __P((DB_MPOOLFILE *, void *, DB_CACHE_PRIORITY, u_int32_t));
 int __memp_fput __P((DB_MPOOLFILE *, DB_THREAD_INFO *, void *, DB_CACHE_PRIORITY));
+int __memp_wire __P((DB_MPOOLFILE *, void *, int *));
+int __memp_unwire __P((DB_MPOOLFILE *, void *));
 int __memp_unpin_buffers __P((ENV *, DB_THREAD_INFO *));
 int __memp_dirty __P((DB_MPOOLFILE *, void *, DB_THREAD_INFO *, DB_TXN *, DB_CACHE_PRIORITY, u_int32_t));
 int __memp_shared __P((DB_MPOOLFILE *, void *));
diff --git a/src/mp/mp_alloc.c b/src/mp/mp_alloc.c
index 4766b2828..2cee7b53b 100644
--- a/src/mp/mp_alloc.c
+++ b/src/mp/mp_alloc.c
@@ -313,7 +313,8 @@ retry_search: bhp = NULL;
 			if (SH_CHAIN_SINGLETON(current_bhp, vc)) {
 				u_int32_t warmth;
 
-				if (BH_REFCOUNT(current_bhp) != 0)
+				if (BH_REFCOUNT(current_bhp) != 0 ||
+				    current_bhp->wired)
 					continue;
 				buffers++;
 				warmth = current_bhp->priority;
diff --git a/src/mp/mp_bh.c b/src/mp/mp_bh.c
index 93746fec4..de057742a 100644
--- a/src/mp/mp_bh.c
+++ b/src/mp/mp_bh.c
@@ -591,6 +591,16 @@ __memp_bhfree(dbmp, infop, mfp, hp, bhp, flags)
 	pagesize = mfp->pagesize;
 #endif
 
+	/*
+	 * If this buffer was wired and is being freed without going through
+	 * __db_free (e.g. file/env close discard), drop it from the region's
+	 * wired count.  The wired byte gates this so it is decremented once.
+	 */
+	if (bhp->wired != 0) {
+		bhp->wired = 0;
+		(void)atomic_dec(env, &((MPOOL *)infop->primary)->wired_pages);
+	}
+
 	DB_ASSERT(env, LF_ISSET(BH_FREE_UNLOCKED) ||
 	    (hp != NULL && MUTEX_IS_OWNED(env, hp->mtx_hash)));
 	DB_ASSERT(env, BH_REFCOUNT(bhp) == 1 &&
diff --git a/src/mp/mp_fget.c b/src/mp/mp_fget.c
index 63a1791c7..9697afa46 100644
--- a/src/mp/mp_fget.c
+++ b/src/mp/mp_fget.c
@@ -649,6 +649,7 @@ reuse: if ((makecopy || F_ISSET(bhp, BH_FROZEN)) &&
 
 		/* Initialize enough so we can call __memp_bhfree. */
 		alloc_bhp->flags = 0;
+		alloc_bhp->wired = 0;
 		atomic_init(&alloc_bhp->ref, 1);
 #ifdef DIAGNOSTIC
 		if ((uintptr_t)alloc_bhp->buf & (sizeof(size_t) - 1)) {
@@ -797,6 +798,7 @@ reuse: if ((makecopy || F_ISSET(bhp, BH_FROZEN)) &&
 		bhp->td_off = INVALID_ROFF;
 		SH_CHAIN_INIT(bhp, vc);
 		bhp->flags = 0;
+		bhp->wired = 0;
 
 		/*
 		 * Reference the buffer and lock exclusive.  We either
@@ -1001,6 +1003,7 @@ reuse: if ((makecopy || F_ISSET(bhp, BH_FROZEN)) &&
 		alloc_bhp->flags = BH_EXCLUSIVE |
 		    ((flags == DB_MPOOL_FREE) ? BH_FREED :
 		    F_ISSET(bhp, BH_DIRTY | BH_DIRTY_CREATE));
+		alloc_bhp->wired = 0;
 		DB_ASSERT(env, flags != DB_MPOOL_FREE ||
 		    !F_ISSET(bhp, BH_DIRTY));
 		F_CLR(bhp, BH_DIRTY | BH_DIRTY_CREATE);
diff --git a/src/mp/mp_fput.c b/src/mp/mp_fput.c
index e75152d67..966ec7494 100644
--- a/src/mp/mp_fput.c
+++ b/src/mp/mp_fput.c
@@ -285,3 +285,109 @@ __memp_unpin_buffers(env, ip)
 	}
 	return (0);
 }
+
+/*
+ * __memp_wire --
+ *	Mark a resident buffer as non-evictable ("wired").  Used for B-tree
+ *	internal/root pages so the Stage 1 optimistic descent can read them
+ *	without the frame being reclaimed under it.  The set is a plain
+ *	monotonic store to a dedicated byte (not the flags word), safe to do
+ *	while the caller holds only a shared buffer latch; the byte is reset to
+ *	0 wherever a buffer header is (re)initialized.
+ *
+ *	If wiredp is non-NULL it is set to 1 iff the frame is wired (and so
+ *	guaranteed resident) on return -- either newly wired here or already
+ *	wired -- and 0 otherwise (an mmap'd page, or the per-region cap was
+ *	reached).  Callers that cache the frame address for later lock-free
+ *	reads MUST only do so when *wiredp is 1; a non-wired frame is evictable
+ *	and its address may dangle.
+ *
+ * PUBLIC: int __memp_wire __P((DB_MPOOLFILE *, void *, int *));
+ */
+int
+__memp_wire(dbmfp, pgaddr, wiredp)
+	DB_MPOOLFILE *dbmfp;
+	void *pgaddr;
+	int *wiredp;
+{
+	BH *bhp;
+	DB_MPOOL *dbmp;
+	ENV *env;
+	MPOOL *c_mp;
+
+	if (wiredp != NULL)
+		*wiredp = 0;
+
+	/*
+	 * A memory-mapped (read-only) file hands back a pointer into the mmap
+	 * region, not a buffer frame, so the BH back-computation below would be
+	 * a wild pointer.  Such pages are never in the buffer pool and never
+	 * evicted, so there is nothing to wire (and nothing that can dangle).
+	 */
+	if (dbmfp->addr != NULL && pgaddr >= dbmfp->addr &&
+	    (u_int8_t *)pgaddr <=
+	    (u_int8_t *)dbmfp->addr + dbmfp->len)
+		return (0);
+
+	bhp = (BH *)((u_int8_t *)pgaddr - SSZA(BH, buf));
+	if (bhp->wired != 0) {
+		if (wiredp != NULL)
+			*wiredp = 1;
+		return (0);
+	}
+
+	/*
+	 * Cap wiring at MPOOL_WIRED_MAX_PCT of the region's buffers so wiring
+	 * can never starve the cache.  Over the cap this is a no-op (the frame
+	 * stays evictable and the descent uses a normal pin).  The count is
+	 * approximate under races, which is fine for a cap.  Compute the limit
+	 * as (pages * PCT) / 100 so it does not round down to zero for caches
+	 * smaller than 100 buffers.
+	 */
+	env = dbmfp->env;
+	dbmp = env->mp_handle;
+	c_mp = dbmp->reginfo[bhp->region].primary;
+	if (atomic_read(&c_mp->wired_pages) >=
+	    c_mp->pages * MPOOL_WIRED_MAX_PCT / 100)
+		return (0);
+
+	bhp->wired = 1;
+	(void)atomic_inc(env, &c_mp->wired_pages);
+	if (wiredp != NULL)
+		*wiredp = 1;
+	return (0);
+}
+
+/*
+ * __memp_unwire --
+ *	Clear the wired mark on a buffer (e.g. when its page is freed) so the
+ *	frame becomes evictable again.  Safe on mmap'd pages (no-op).
+ *
+ * PUBLIC: int __memp_unwire __P((DB_MPOOLFILE *, void *));
+ */
+int
+__memp_unwire(dbmfp, pgaddr)
+	DB_MPOOLFILE *dbmfp;
+	void *pgaddr;
+{
+	BH *bhp;
+	DB_MPOOL *dbmp;
+	ENV *env;
+	MPOOL *c_mp;
+
+	if (dbmfp->addr != NULL && pgaddr >= dbmfp->addr &&
+	    (u_int8_t *)pgaddr <=
+	    (u_int8_t *)dbmfp->addr + dbmfp->len)
+		return (0);
+
+	bhp = (BH *)((u_int8_t *)pgaddr - SSZA(BH, buf));
+	if (bhp->wired == 0)
+		return (0);
+
+	bhp->wired = 0;
+	env = dbmfp->env;
+	dbmp = env->mp_handle;
+	c_mp = dbmp->reginfo[bhp->region].primary;
+	(void)atomic_dec(env, &c_mp->wired_pages);
+	return (0);
+}
diff --git a/src/mp/mp_region.c b/src/mp/mp_region.c
index 495203054..d5d6efce8 100644
--- a/src/mp/mp_region.c
+++ b/src/mp/mp_region.c
@@ -312,6 +312,7 @@ __memp_init(env, dbmp, reginfo_off, htab_buckets, max_nreg)
 	}
 	mp->htab_buckets = htab_buckets;
 	mp->htab_mutexes = dbenv->mp_mtxcount;
+	atomic_init(&mp->wired_pages, 0);
 	mp->pagesize = dbenv->mp_pagesize == 0 ?
 	    MPOOL_DEFAULT_PAGESIZE : dbenv->mp_pagesize;
 
diff --git a/src/mp/mp_stat.c b/src/mp/mp_stat.c
index a019fa891..fd14934ef 100644
--- a/src/mp/mp_stat.c
+++ b/src/mp/mp_stat.c
@@ -709,6 +709,8 @@ __memp_print_hash(env, dbmp, reginfo, fmap, flags)
 	c_mp = reginfo->primary;
 	DB_MSGBUF_INIT(&mb);
 	STAT_ULONG("Hash table last-checked", c_mp->last_checked);
+	STAT_ULONG("Wired buffers (non-evictable)",
+	    atomic_read(&c_mp->wired_pages));
 	STAT_ULONG("Put counter", c_mp->put_counter);
 	/* Display the hash table list of BH's. */