perf(mpool): Stage 0 — CLOCK/second-chance replacement (write-free hot reads)#21

Merged
gburd merged 4 commits into master from perf/swip-stage0-cooling on Jun 18, 2026
gburd commented Jun 17, 2026

Collaborator

First stage of the buffer-swip plan (PR #20). Makes buffer replacement write-free on the read path and scan-resistant.

Change

bhp->priority is repurposed as a small saturating warmth counter [0, MPOOL_CLOCK_MAX]:

  • A put refills warmth read-first (stores only when colder than the hint-derived target) → an already-warm (hot) buffer's put writes nothing.
  • The __memp_alloc hand decrements warmth as it sweeps and frees a buffer once its warmth reaches 0 (second chance); in aggressive mode it frees the coldest buffer available.
  • Deletes the per-put global c_mp->lru_priority increment and the O(cache) __memp_reset_lru sweep.

Removes one of the four per-access shared writes the scaling analysis identified, and gives scan resistance (scanned pages refill low and age out before the hot set). The eviction safety handshake (refcount==0 + exclusive latch, dirty write, MVCC freeze) is unchanged.
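
A minimal sketch of the read-first refill described above, not the code in mp_fput.c. The struct, the function name, and the return value are hypothetical; MPOOL_CLOCK_MAX is given the value 4 that a later comment in this PR mentions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, taken from the "MPOOL_CLOCK_MAX=4" remark later in this PR. */
#define MPOOL_CLOCK_MAX 4u

/* Hypothetical stand-in for the buffer header: only the field used here. */
struct bh_sketch {
    uint32_t priority;      /* warmth in [0, MPOOL_CLOCK_MAX] */
};

/*
 * Read-first refill: load the warmth and store only when the buffer is
 * colder than the hint-derived target.
 * Returns 1 if a store was performed, 0 if the put wrote nothing.
 */
static int
warmth_refill(struct bh_sketch *bhp, uint32_t target)
{
    assert(target <= MPOOL_CLOCK_MAX);
    if (bhp->priority >= target)
        return 0;           /* already warm: no store to shared memory */
    bhp->priority = target;
    return 1;
}
```

Calling `warmth_refill` twice in a row on the same buffer with the same target stores on the first call only, which is the "write-free hot reads" property the title refers to.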

Diff

5 files, +78/-120 (net simpler). Anchors: mp_fput.c (refill replaces timestamp write + global counter + reset_lru), mp_alloc.c (CLOCK aging in the victim scan), mp_fget.c/mp_mvcc.c (revive/thaw warmth), mp.h (MPOOL_CLOCK_* macros).

Validation so far

  • Clean build, no warnings.
  • Forced-eviction integrity test: 50,000 records through a 1 MB cache (thousands of CLOCK evictions) — all read back byte-correct, plus 200,000 random gets; no wedge, no ENOMEM. (A bug found and fixed in review: the deferred-dirty-write deprioritize stored MPOOL_LRU_REDZONE ~4.29e9, which is unreachable on the warmth scale and made such buffers permanently unevictable → cache exhaustion; remapped to MPOOL_CLOCK_MAX, and the selection ceiling raised to MAX+1.)

DRAFT — gated before merge on

  1. Full TCL regression (needs tcl-tk@8; the prior harness install was removed from this box).
  2. Measurement on the 24-core box (meh, currently down): eviction quality on a cache-pressure workload, scan-resistance (scan must not evict the hot set), and no read-throughput regression on rrand/sepdb/snap.

…t reads)

Replace the timestamp LRU with a CLOCK / second-chance policy so the read path
no longer writes shared memory on hot-page access.

Before: every __memp_fput wrote bhp->priority = c_mp->lru_priority and advanced
the shared c_mp->lru_priority counter; on wraparound __memp_reset_lru swept the
whole cache.  Two shared writes per put + an O(cache) sweep.

After: bhp->priority is a small saturating warmth counter [0, MPOOL_CLOCK_MAX].
A put refills warmth read-first (store only when colder than the target chosen
from the access priority hint), so an already-warm buffer's put performs no
store.  The __memp_alloc hand decrements warmth as it sweeps and frees a buffer
at warmth 0 (second chance); when aggressive it frees the coldest it can find.
This is also scan-resistant: bulk-scanned pages refill to low warmth and age out
before the hot working set.  Deleted the global lru_priority increment and
__memp_reset_lru; mapped the two MPOOL_LRU_REDZONE deprioritize stores and the
revive/thaw priority sets onto the warmth scale.

Eviction safety handshake (refcount==0 + exclusive mtx_buf, dirty write, MVCC
freeze) is unchanged; only victim recency selection changed.
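
The second-chance sweep described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the victim scan in mp_alloc.c: the struct, the function, and the flat array are hypothetical, the aggressive path is not modeled, and the eviction safety handshake is reduced to a single `pinned` flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer model for the sketch. */
struct buf_sketch {
    uint32_t warmth;        /* saturating counter; 0 means evictable */
    int      pinned;        /* nonzero while referenced: never a victim */
};

/*
 * Advance the hand over bufs[0..n), starting at *hand.  A buffer found at
 * warmth 0 is the victim; any other unpinned buffer loses one unit of
 * warmth and gets a second chance.
 * Returns the victim index, or -1 if every buffer is pinned.
 */
static int
clock_pick_victim(struct buf_sketch *bufs, size_t n, size_t *hand,
    uint32_t max_warmth)
{
    /* max_warmth + 1 visits drive any unpinned buffer to selection. */
    size_t steps = n * ((size_t)max_warmth + 1);

    assert(n == 0 || *hand < n);
    for (size_t i = 0; i < steps; i++) {
        size_t idx = *hand;
        struct buf_sketch *b = &bufs[idx];

        *hand = (idx + 1) % n;
        if (b->pinned)
            continue;
        if (b->warmth == 0)
            return (int)idx;
        b->warmth--;        /* second chance: age, do not evict yet */
    }
    return -1;
}
```

With a ceiling of 4, a buffer at full warmth is passed over four times before it becomes a victim, so the bound on hand cycles follows directly from the counter width.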

Validated: clean build; forced-eviction integrity test (50k records through a
1MB cache -> thousands of CLOCK evictions, all read back byte-correct, + 200k
random gets, no wedge/ENOMEM).  GATED before merge on the full TCL regression
and on meh measurement (eviction quality, scan resistance, no read regression).

gburd commented Jun 17, 2026

Collaborator Author

TCL regression smoke — green (tcl-tk@8 8.6.18 reinstalled):

  • test001 btree + hash (put/get/dump/verify)
  • test003 btree (large/variable data → heavy paging/eviction)
  • test010 btree (duplicates)
  • test037 btree (RMW + multiple environments + transactions)
  • recd001 btree (crash recovery — exercises dirty-page flush/checkpoint, the path the REDZONE bug lived in)
  • txn001 (begin/commit/abort/prepare)

All passed, 0 errors/panics. Combined with the C forced-eviction integrity test (50k records / 1 MB cache + 200k random gets), correctness is well covered.

Still gated for merge on: measurement on the 24-core box (meh, currently down) — eviction quality on a cache-pressure workload, scan-resistance (a sequential scan must not evict the hot set), and no read-throughput regression on rrand/sepdb/snap. Will run the full run_std suite + meh measurement before un-drafting.

gburd added 2 commits June 17, 2026 10:49
Follow-up cleanup so the code matches the CLOCK replacement:
- Remove the now-unused MPOOL fields lru_priority/lru_generation, the
  MPOOL_LRU_MAX/REDZONE/BASE/DECREMENT macros, the dead MPOOL_ALLOC_SEARCH_DYN
  blocks, and the obsolete lru_generation victim-retry (reset_lru is gone).
- db_stat -m: drop the 'Hash table LRU priority/generation' dump lines (those
  reported timestamp-LRU scalars that have no CLOCK equivalent -- there is no
  global counter or reset sweep by design). The CLOCK hand is still reported as
  'Hash table last-checked', per-buffer warmth is still dumped per-BH (column
  relabeled priority->warmth), and eviction effectiveness remains in the
  unchanged aggregate stats (cache hit/miss, clean/dirty evictions).
- Update stale comments referencing LRU priority / MPOOL_LRU_MAX.

Rebuilt clean; forced-eviction integrity test + TCL test001 green.

The fixed-target refill warmed hot and scan-touched buffers to the same level,
so a large scan evicted the hot set as readily as plain LRU (measured: equal
hot-set page-ins after a scan). Make warmth climb one step per access toward a
hint-derived cap (default cap = MPOOL_CLOCK_MAX), so a frequently-accessed
buffer reaches the ceiling while a scan-once buffer only reaches warmth 1 and
ages out first. The climb is read-first, so a saturated hot buffer's put still
performs no store (write-free hot reads preserved). HIGH/VERY_HIGH hints pin at
the ceiling; VERY_LOW evicts ASAP.
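
The frequency climb can be sketched as below. The enum, the function, and the exact hint mapping are assumptions made for illustration: in particular, "pin at the ceiling" is modeled as a jump straight to MPOOL_CLOCK_MAX and "evicts ASAP" as a drop to 0, which the commit message does not spell out.

```c
#include <assert.h>
#include <stdint.h>

#define MPOOL_CLOCK_MAX 4u  /* value quoted elsewhere in this PR */

/* Hypothetical hint names; the real code derives these from the access priority. */
enum hint_sketch { HINT_VERY_LOW, HINT_DEFAULT, HINT_HIGH };

/*
 * One put: compute the wanted warmth, then store only if it differs.
 * Returns 1 if a store was performed, 0 if the put was write-free.
 */
static int
warmth_on_put(uint32_t *warmth, enum hint_sketch hint)
{
    uint32_t want;

    assert(*warmth <= MPOOL_CLOCK_MAX);
    switch (hint) {
    case HINT_VERY_LOW:
        want = 0;                       /* assumed: evict as soon as possible */
        break;
    case HINT_HIGH:
        want = MPOOL_CLOCK_MAX;         /* assumed: jump to the ceiling */
        break;
    default:
        /* Climb one step per access, saturating at the cap. */
        want = *warmth < MPOOL_CLOCK_MAX ? *warmth + 1 : MPOOL_CLOCK_MAX;
        break;
    }
    if (*warmth == want)
        return 0;                       /* saturated: no store */
    *warmth = want;
    return 1;
}
```

Starting from warmth 0, a page touched once by a scan ends at 1, while a page touched four or more times reaches the ceiling and stops generating stores.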

gburd commented Jun 17, 2026

Collaborator Author

Measured on meh (24-core Xeon):

No read regression (3-sample medians, in-cache, 200k keys):

  • rrand t=8: master 413k / stage0 416k; t=16: 384k / 382k — equal within noise.
  • snap sweep 1–24t: stage0 ≈ master.

Scan resistance (16 MB cache, hot set re-accessed after a full ~2× sequential scan; lower = better):

  • master (timestamp LRU): 621 hot-set page-ins after the scan.
  • stage0 (CLOCK, frequency-climb warmth): 505 (~19% fewer) — the hot set partially survives the scan, where LRU evicts essentially all of it.

Note: the first implementation refilled warmth to a fixed level, which gave no scan-resistance gain (303 vs 311). The fix, pushed as a follow-up commit, makes warmth climb with access frequency toward a cap (write-free once saturated). The improvement is real but modest at MPOOL_CLOCK_MAX=4 (a hot page survives ~4 eviction-hand cycles); robust, LeanStore-grade scan resistance is the Stage 1 deliverable (the swip COOL/cold-FIFO, so that scanned pages never age the hot set). The dead-LRU-machinery removal and the db_stat fix also went in as follow-up commits.

Still gated on the full run_std before un-drafting.

Make scan resistance robust (LeanStore/Umbra cooling model, validated against
the sibling sqlxtc bufmgr.c and noxu evictor).

Warmth is split into a COOL band [0, MPOOL_CLOCK_HOT) and a HOT band. A freshly
read/created buffer is admitted COOL (MPOOL_CLOCK_ADMIT=0; the pin, not warmth,
protects it in transit), so a page touched once by a scan stays COOL while a
re-referenced page climbs into the HOT band. The eviction hand ages and reclaims
COOL-band buffers and leaves HOT-band buffers untouched -- so a scan of any
length, which keeps supplying COOL victims, never ages the hot working set. HOT
buffers are cooled only when a full sweep finds no COOL victim (existing
aggressive path). COOL aging selects a buffer as the victim in the same pass in
which its warmth reaches 0 (no wasted sweep).
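
A sketch of the COOL-first sweep described above, under several assumptions: MPOOL_CLOCK_ADMIT=0 comes from this commit message, but the value 2 for MPOOL_CLOCK_HOT, the struct, the function names, and the way the aggressive pass demotes a HOT buffer are all hypothetical. It is not the code in mp_alloc.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MPOOL_CLOCK_ADMIT 0u   /* from the commit message */
#define MPOOL_CLOCK_HOT   2u   /* assumed band boundary; real value not stated */

struct buf_sketch {
    uint32_t warmth;
    int      pinned;           /* the pin, not warmth, protects a buffer in transit */
};

/* Admit a freshly read page: it starts in the COOL band. */
static void
admit(struct buf_sketch *b)
{
    b->warmth = MPOOL_CLOCK_ADMIT;
    b->pinned = 0;
}

/*
 * Pass 0 ages and reclaims COOL buffers only and skips the HOT band.
 * Pass 1 (aggressive) runs only if pass 0 found nothing; it demotes HOT
 * buffers into the COOL band before aging them (assumed behavior).
 * A buffer is selected in the same visit in which its warmth reaches 0.
 */
static int
cooling_pick_victim(struct buf_sketch *bufs, size_t n, size_t *hand)
{
    assert(n == 0 || *hand < n);
    for (int aggressive = 0; aggressive <= 1; aggressive++) {
        for (size_t i = 0; i < n * MPOOL_CLOCK_HOT; i++) {
            size_t idx = *hand;
            struct buf_sketch *b = &bufs[idx];

            *hand = (idx + 1) % n;
            if (b->pinned)
                continue;
            if (b->warmth >= MPOOL_CLOCK_HOT) {
                if (!aggressive)
                    continue;              /* hot band left untouched */
                b->warmth = MPOOL_CLOCK_HOT - 1;
            }
            if (b->warmth > 0)
                b->warmth--;
            if (b->warmth == 0)
                return (int)idx;           /* no wasted extra sweep */
        }
    }
    return -1;
}
```

As long as a scan keeps admitting pages at warmth 0, the first pass always finds a COOL victim and the HOT buffers keep their warmth, which is the property the measurements are meant to confirm.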

Validated: NOSYNC forced-eviction integrity (50k/1MB, all verified, 1s); TCL
test001 btree+hash, test003, recd001 (recovery verified); fsync write timing
equal to master (no write regression -- both fsync-bound). Single-CLOCK gave
only ~19% scan-resistance gain; this targets robust resistance (meh measurement
next).

gburd commented Jun 17, 2026

Collaborator Author

Stage 0.5 (probationary admission + COOL-first eviction) measured on meh:

Scan resistance (16 MB cache, hot set re-accessed after a ~2× sequential scan; lower hot-reaccess page-ins = better):

| Build | Scan page-ins | Hot-reaccess page-ins after scan |
| --- | --- | --- |
| master (LRU) | 8615 | 621 |
| Stage 0 (single CLOCK) | 7839 | 505 (−19%) |
| Stage 0.5 (cooling) | 5676 | 58 (−91%, ~10.7× better) |

The hot working set now survives a scan of any length — scanned pages are admitted COOL and reclaimed first, so the hand never ages the hot band. This is the LeanStore/Umbra behavior validated against sqlxtc bufmgr.c and noxu.
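
For reference, the quoted reductions follow from the hot-reaccess page-in counts alone, with master's 621 as the baseline:

```math
\begin{aligned}
\text{Stage 0:}\quad & \frac{621 - 505}{621} = \frac{116}{621} \approx 18.7\% \\
\text{Stage 0.5:}\quad & \frac{621 - 58}{621} = \frac{563}{621} \approx 90.7\%, \qquad \frac{621}{58} \approx 10.7
\end{aligned}
```

By the same arithmetic, the page-ins during the scan itself drop by about 9% for Stage 0 (8615 to 7839) and about 34% for Stage 0.5 (8615 to 5676).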

No read regression (3-sample medians, in-cache): rrand t=8 master 402k / s0.5 418k; t=16 376k / 388k (slightly better). Write path equals master (both fsync-bound; no eviction-driven write regression).

Validated: NOSYNC forced-eviction integrity (50k/1MB); TCL test001 btree+hash, test003, recd001 (recovery verified). Full run_std re-running on the final state.

gburd marked this pull request as ready for review June 18, 2026 01:53

gburd commented Jun 18, 2026

Collaborator Author

Landing. Validation summary:

  • Measured on meh: robust scan resistance (621→58 hot-set page-ins after a 2× scan, ~10.7×), no read regression (rrand/snap medians ≈ master), no write regression (fsync-bound, equal to master).
  • Standalone TCL on this branch: test001 btree+hash, test003, test010, test037, recd001, txn001, fop004 — all pass.
  • CI: all Linux/macOS build+test configs green incl. 'tcl tests (targeted)'. The 3 red checks are known-advisory: ocr-review (AWS IAM trust, infra), windows msbuild (pre-existing MSVC macro collision, continue-on-error), pg-history (advisory).
  • Full local run_std on macOS shows multi-process region-panic cascades (fop/recd). They reproduce neither in standalone runs (master and this branch both pass) nor on Linux CI, which points to macOS run_std harness flakiness under F_FULLFSYNC with many processes, not a regression.

gburd merged commit 6d7d77b into master Jun 18, 2026
36 of 41 checks passed
gburd deleted the perf/swip-stage0-cooling branch June 18, 2026 01:53