perf(mpool): Stage 0 — CLOCK/second-chance replacement (write-free hot reads)#21
…t reads)
Replace the timestamp LRU with a CLOCK / second-chance policy so the read path no longer writes shared memory on hot-page access.
Before: every __memp_fput wrote bhp->priority = c_mp->lru_priority and advanced the shared c_mp->lru_priority counter; on wraparound __memp_reset_lru swept the whole cache. Two shared writes per put + an O(cache) sweep.
After: bhp->priority is a small saturating warmth counter [0, MPOOL_CLOCK_MAX]. A put refills warmth read-first (store only when colder than the target chosen from the access priority hint), so an already-warm buffer's put performs no store. The __memp_alloc hand decrements warmth as it sweeps and frees a buffer at warmth 0 (second chance); when aggressive it frees the coldest it can find. This is also scan-resistant: bulk-scanned pages refill to low warmth and age out before the hot working set.
Deleted the global lru_priority increment and __memp_reset_lru; mapped the two MPOOL_LRU_REDZONE deprioritize stores and the revive/thaw priority sets onto the warmth scale.
Eviction safety handshake (refcount==0 + exclusive mtx_buf, dirty write, MVCC freeze) is unchanged; only victim recency selection changed.
Validated: clean build; forced-eviction integrity test (50k records through a 1MB cache -> thousands of CLOCK evictions, all read back byte-correct, + 200k random gets, no wedge/ENOMEM).
GATED before merge on the full TCL regression and on meh measurement (eviction quality, scan resistance, no read regression).
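The read-first refill and the second-chance hand described in this commit message can be sketched as below. This is a minimal illustration under stated assumptions, not the patch itself: struct buf, warmth_refill, and clock_sweep are hypothetical names, the array stands in for the real hash-bucket walk in __memp_alloc, and the latch, dirty-write, and MVCC steps of the eviction handshake are omitted. MPOOL_CLOCK_MAX is set to 4 because a later comment in this PR quotes that value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MPOOL_CLOCK_MAX 4u          /* value quoted later in this PR */

/* Hypothetical stand-in for a buffer header; only what the sketch needs. */
struct buf {
    uint32_t warmth;                /* plays the role of bhp->priority */
    uint32_t ref;                   /* pin count; nonzero means not evictable */
};

/*
 * Read-first refill: load the warmth, and store only when the buffer is
 * colder than the target. Returns 1 if a store happened, 0 otherwise.
 */
static int warmth_refill(struct buf *b, uint32_t target)
{
    assert(target <= MPOOL_CLOCK_MAX);
    if (b->warmth >= target)
        return 0;                   /* already warm: no shared write */
    b->warmth = target;
    return 1;
}

/*
 * Second-chance sweep: advance the hand, skip pinned buffers, return the
 * index of the first unpinned buffer found at warmth 0, and decrement the
 * warmth of every other unpinned buffer it passes. Returns -1 if nothing
 * is evictable within a bounded number of steps.
 */
static long clock_sweep(struct buf *bufs, size_t n, size_t *hand)
{
    size_t steps = n * (MPOOL_CLOCK_MAX + 1);

    for (size_t i = 0; i < steps; i++) {
        size_t idx = *hand;
        struct buf *b = &bufs[idx];

        *hand = (*hand + 1) % n;
        if (b->ref != 0)
            continue;               /* pinned: never a victim */
        if (b->warmth == 0)
            return (long)idx;       /* cold: evict */
        b->warmth--;                /* warm: age it, give a second chance */
    }
    return -1;
}
```

The point of the read-first test is that a put on an already-warm buffer only reads the warmth field, so the hot read path leaves that field clean.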
TCL regression smoke — green (tcl-tk@8 8.6.18 reinstalled):
All passed, 0 errors/panics. Combined with the C forced-eviction integrity test (50k records / 1 MB cache + 200k random gets), correctness is well covered. Still gated for merge on: measurement on the 24-core box ( |
Follow-up cleanup so the code matches the CLOCK replacement:
- Remove the now-unused MPOOL fields lru_priority/lru_generation, the MPOOL_LRU_MAX/REDZONE/BASE/DECREMENT macros, the dead MPOOL_ALLOC_SEARCH_DYN blocks, and the obsolete lru_generation victim-retry (reset_lru is gone).
- db_stat -m: drop the 'Hash table LRU priority/generation' dump lines (those reported timestamp-LRU scalars that have no CLOCK equivalent -- there is no global counter or reset sweep by design). The CLOCK hand is still reported as 'Hash table last-checked', per-buffer warmth is still dumped per-BH (column relabeled priority->warmth), and eviction effectiveness remains in the unchanged aggregate stats (cache hit/miss, clean/dirty evictions).
- Update stale comments referencing LRU priority / MPOOL_LRU_MAX.
Rebuilt clean; forced-eviction integrity test + TCL test001 green.
The fixed-target refill warmed hot and scan-touched buffers to the same level, so a large scan evicted the hot set as readily as plain LRU (measured: equal hot-set page-ins after a scan). Make warmth climb one step per access toward a hint-derived cap (default cap = MPOOL_CLOCK_MAX), so a frequently-accessed buffer reaches the ceiling while a scan-once buffer only reaches warmth 1 and ages out first. The climb is read-first, so a saturated hot buffer's put still performs no store (write-free hot reads preserved). HIGH/VERY_HIGH hints pin at the ceiling; VERY_LOW evicts ASAP.
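A minimal sketch of the climbing refill, assuming a simplified three-way hint: enum hint, struct buf, and warmth_climb are hypothetical names used only for illustration, and the real mapping from the access priority hint to a cap may differ. It shows why a scan-once buffer stops at warmth 1 while a frequently accessed buffer reaches the ceiling and then stops storing.

```c
#include <assert.h>
#include <stdint.h>

#define MPOOL_CLOCK_MAX 4u          /* value quoted elsewhere in this PR */

/* Hypothetical, simplified stand-in for the access priority hint. */
enum hint { HINT_VERY_LOW, HINT_DEFAULT, HINT_HIGH };

struct buf {
    uint32_t warmth;                /* plays the role of bhp->priority */
};

/*
 * One put: move warmth according to the hint. Every branch tests before
 * it stores, so a buffer already at its destination performs no write.
 * Returns 1 if a store happened, 0 otherwise.
 */
static int warmth_climb(struct buf *b, enum hint h)
{
    assert(b->warmth <= MPOOL_CLOCK_MAX);

    switch (h) {
    case HINT_VERY_LOW:             /* evict ASAP: drop to the floor */
        if (b->warmth == 0)
            return 0;
        b->warmth = 0;
        return 1;
    case HINT_HIGH:                 /* pin at the ceiling */
        if (b->warmth == MPOOL_CLOCK_MAX)
            return 0;
        b->warmth = MPOOL_CLOCK_MAX;
        return 1;
    default:                        /* climb one step toward the cap */
        if (b->warmth >= MPOOL_CLOCK_MAX)
            return 0;               /* saturated: write-free */
        b->warmth++;
        return 1;
    }
}
```

With these definitions a buffer touched once from warmth 0 ends at warmth 1, and a buffer touched at least MPOOL_CLOCK_MAX times sits at the ceiling, where further default-hint puts return 0 without storing.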
Measured on meh (24-core Xeon): No read regression (3-sample medians, in-cache, 200k keys):
Scan resistance (16 MB cache, hot set re-accessed after a full ~2× sequential scan; lower = better):
Note: the first implementation refilled warmth to a fixed level, giving no scan-resistance gain (303 vs 311). Fixed by making warmth climb with access frequency toward a cap (write-free once saturated), pushed as a follow-up commit. The improvement is real but modest at MPOOL_CLOCK_MAX=4 (a hot page survives ~4 eviction-hand cycles); robust, LeanStore-grade scan resistance is the Stage 1 deliverable (the swip COOL/cold-FIFO so scanned pages never age the hot set). Also did the dead-LRU-machinery removal + db_stat fix in follow-up commits. Still gated on the full |
Make scan resistance robust (LeanStore/Umbra cooling model, validated against the sibling sqlxtc bufmgr.c and noxu evictor). Warmth is split into a COOL band [0, MPOOL_CLOCK_HOT) and a HOT band. A freshly read/created buffer is admitted COOL (MPOOL_CLOCK_ADMIT=0; the pin, not warmth, protects it in transit), so a page touched once by a scan stays COOL while a re-referenced page climbs into the HOT band. The eviction hand ages and reclaims COOL-band buffers and leaves HOT-band buffers untouched -- so a scan of any length, which keeps supplying COOL victims, never ages the hot working set. HOT buffers are cooled only when a full sweep finds no COOL victim (existing aggressive path). The COOL aging selects in the same pass it reaches 0 (no wasted sweep). Validated: NOSYNC forced-eviction integrity (50k/1MB, all verified, 1s); TCL test001 btree+hash, test003, recd001 (recovery verified); fsync write timing equal to master (no write regression -- both fsync-bound). Single-CLOCK gave only ~19% scan-resistance gain; this targets robust resistance (meh measurement next).
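The COOL/HOT split can be sketched as follows. This is a hedged illustration, not the committed code: MPOOL_CLOCK_HOT = 2 is an assumed value (the message does not state it), struct buf, cool_touch, and pick_victim are hypothetical names, and the sketch bounds the COOL phase by step count rather than reproducing the real sweep accounting in __memp_alloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MPOOL_CLOCK_MAX   4u
#define MPOOL_CLOCK_HOT   2u        /* ASSUMED band boundary, not from the PR */
#define MPOOL_CLOCK_ADMIT 0u        /* admission level stated in the PR */

struct buf {
    uint32_t warmth;                /* plays the role of bhp->priority */
    uint32_t ref;                   /* pin count */
};

/* Re-reference: climb one step, read-first, saturating at the ceiling. */
static void cool_touch(struct buf *b)
{
    if (b->warmth < MPOOL_CLOCK_MAX)
        b->warmth++;
}

/*
 * Victim selection in two phases. Phase 0 ages and reclaims only COOL-band
 * buffers and never writes to HOT-band ones. Phase 1 (aggressive) runs only
 * if phase 0 found nothing, and cools HOT buffers too. In both phases a
 * buffer is selected in the same visit in which its warmth reaches 0.
 */
static long pick_victim(struct buf *bufs, size_t n, size_t *hand)
{
    for (int aggressive = 0; aggressive <= 1; aggressive++) {
        size_t passes = aggressive ? (size_t)MPOOL_CLOCK_MAX + 1
                                   : (size_t)MPOOL_CLOCK_HOT;

        for (size_t i = 0; i < n * passes; i++) {
            size_t idx = *hand;
            struct buf *b = &bufs[idx];

            *hand = (*hand + 1) % n;
            if (b->ref != 0)
                continue;           /* pinned: protected in transit */
            if (!aggressive && b->warmth >= MPOOL_CLOCK_HOT)
                continue;           /* HOT band: left untouched */
            if (b->warmth > 0)
                b->warmth--;        /* age */
            if (b->warmth == 0)
                return (long)idx;   /* reached 0: select now */
        }
    }
    return -1;                      /* everything is pinned */
}
```

In this model a scan of any length keeps recycling the same COOL slots: each newly loaded page is admitted at MPOOL_CLOCK_ADMIT, touched once, and reclaimed before the hand ever decrements a HOT buffer.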
Stage 0.5 (probationary admission + COOL-first eviction) measured on meh: Scan resistance (16 MB cache, hot set re-accessed after a ~2× sequential scan; lower hot-reaccess page-ins = better):
The hot working set now survives a scan of any length: scanned pages are admitted COOL and reclaimed first, so the hand never ages the hot band. This is the LeanStore/Umbra behavior validated against sqlxtc
No read regression (3-sample medians, in-cache): rrand t=8 master 402k / s0.5 418k; t=16 376k / 388k (slightly better).
Write path equals master (both fsync-bound; no eviction-driven write regression).
Validated: NOSYNC forced-eviction integrity (50k/1MB); TCL test001 btree+hash, test003, recd001 (recovery verified).
Full |
Landing. Validation summary:
First stage of the buffer-swip plan (PR #20). Makes buffer replacement write-free on the read path and scan-resistant.
Change
bhp->priority is repurposed as a small saturating warmth counter [0, MPOOL_CLOCK_MAX]:
- The __memp_alloc hand decrements warmth as it sweeps and frees at warmth 0 (second chance); aggressive frees the coldest available.
- Deleted: the c_mp->lru_priority increment and the O(cache) __memp_reset_lru sweep.
Removes one of the four per-access shared writes the scaling analysis identified, and gives scan resistance (scanned pages refill low and age out before the hot set). The eviction safety handshake (refcount==0 + exclusive latch, dirty write, MVCC freeze) is unchanged.
Diff
5 files, +78/-120 (net simpler). Anchors:
mp_fput.c (refill replaces timestamp write + global counter + reset_lru), mp_alloc.c (CLOCK aging in the victim scan), mp_fget.c / mp_mvcc.c (revive/thaw warmth), mp.h (MPOOL_CLOCK_* macros).
Validation so far
MPOOL_LRU_REDZONE ~4.29e9, which is unreachable on the warmth scale and made such buffers permanently unevictable → cache exhaustion; remapped to MPOOL_CLOCK_MAX, and the selection ceiling raised to MAX+1.)
DRAFT: gated before merge on
tcl-tk@8; the prior harness install was removed from this box).