fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter by hamersaw · Pull Request #6899 · lance-format/lance

hamersaw · 2026-05-21T16:55:48Z

What

Fixes a stale read in LSM MemWAL vector search: when a primary key is updated and its fresh row falls out of its own source's top-k, the superseded copy from an older generation could win the cross-source dedup and be returned.

Background

LsmGlobalPkDedupExec (introduced in #6881) is exact only over the candidates each source surfaces. If a PK's fresh version is pushed out of its source's top-k by closer rows, the dedup never sees it and cannot suppress the stale copy from an older generation — so the stale row is served. Repro: test_vector_search_stale_read_when_fresh_falls_out_of_top_k.

Approach

Make staleness a per-source PK-hash post-filter applied to each source's KNN before the cross-source union, so a stale row never reaches the merge.

Membership. Each generation's membership is an Arc<HashSet<u64>> of PK hashes (compute_pk_hash — the same hash the dedup nodes use). Built once per generation; flushed generations' sets are cached on FlushedMemTableCache and scanned streaming (one batch resident at a time, no full PK-column buffer).
Per-source block set. compute_source_block_lists gives each source the membership sets of the generations newer than it — NEWER(G) — as a Vec<Arc<HashSet<u64>>>, referenced, never merged into a per-query union. Generations are per-shard, so the map is keyed (shard_id, generation) and a source is only superseded by strictly-newer generations of its own shard (the base table, shardless and oldest, is blocked by every generation). The base table is not scanned — it's filtered by hashing only its KNN candidates.
Execution. PkHashFilterExec drops any candidate whose PK hash is in any of its source's blocked sets. This handles only cross-generation supersession: a PK in a newer generation makes every copy of it stale, so dropping by hash needs no row address. Within-generation duplicates (same PK twice in one generation) share a hash and are left to the existing global dedup's (generation, freshness) tiebreaker.

Configuration

LsmVectorSearchPlanner::plan_search exposes two knobs (Rust + Python + Java bindings):

overfetch_factor: f64 — a single knob controlling both stale filtering and over-fetch:
- < 1.0 (e.g. 0.0): stale filtering off (no block-list / PkHashFilterExec; the global dedup still runs).
- == 1.0 (default): filtering on, no over-fetch — a source with superseded rows fetches exactly k and may return fewer than k live rows.
- > 1.0: filtering on, over-fetch ceil(k * factor) so dropping the stale rows still leaves k.
There is intentionally no separate on/off flag — over-fetch is only meaningful while filtering, so the factor encodes both.
refine_base_table: bool (replaces the old refine_factor: Option<u32>) — re-rank the base arm's approximate index distances to exact (factor 1). Auto-enabled whenever stale filtering runs (over-fetch widens the base's candidate pool, which must be exact before the merge).

External API

LsmScanner::contains_pks(&RecordBatch) -> Vec<bool> — test which primary keys have been (re)written in the WAL fresh tier (active + frozen memtables + flushed generations), built like any query (construct the scanner, then call). Hashing is internal, so callers never reproduce compute_pk_hash.

Scope / caveats

Post-filter, not a true prefilter. It relies on over-fetch to backfill dropped rows and does not guarantee k live results in the adversarial case (more superseded rows near the query than the over-fetch covers); PkHashFilterExec logs a per-source warning when this happens. Promoting the same membership into the KNN as a true prefilter (the index traverses until k rows pass, removing the over-fetch) is the headline follow-up.
Within-generation top-k eviction (a generation holds both a stale and a fresh copy of a PK, and the fresh one is evicted) is not pre-filtered — it shares a hash, so it can't be disambiguated by membership. It's the same bug class as the cross-source case, currently relying on the global dedup; closing it would need flush-time dedup (so flushed generations are internally deduped like the base table). The base table is assumed internally deduped.

Tests

test_vector_search_stale_read_when_fresh_falls_out_of_top_k passes (with positive + overfetch_factor=0.0 toggle-off assertions); plus over-fetch backfill, cross-flushed and composite-PK stale reads, same-L0 newest-wins, per-shard block-list isolation, and PkHashFilterExec / membership unit tests (incl. null and composite PKs). cargo fmt, cargo clippy -p lance --tests -- -D warnings, and the lance / lance-jni / pylance crate checks are clean.

🤖 Generated with Claude Code

…e-read fix Foundation for fixing the LSM vector-search stale read where an updated PK's fresh row falls out of its own source's top-k, letting the superseded copy from an older generation win (the per-source top-k -> global-dedup gap that replaced the bloom-based FilterStaleExec in lance-format#6881). Adds the mem_wal-side machinery that *produces* the per-source block-lists, keyed on the same compute_pk_hash the dedup nodes use: - GenPkIndex: per-generation pk-hash -> rowaddrs, with membership, within-generation supersession (write-polarity aware), and cross-generation supersession (superseded_by NEWER(G)). - compute_block_lists: folds generations newest-first into per-gen block trees plus the membership union base uses. - compute_source_block_lists: assembles a per-source RowAddrMask bitmap from the real LSM sources (active/frozen via BatchStore positions, flushed via disk scan of pk + _rowaddr, base blocked cross-generation only). Scope: this PR only builds the bitmaps; how a mask drives the actual KNN search (a Scanner pre-filter API) is a deliberate follow-up, so the new functions are gated dead_code until wired and there is no behavior change yet. The failing spec test lands with that follow-up. Unit-tested: membership, within/cross-gen supersession, in-memory assembly, and the base disk-scan path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

codecov · 2026-05-21T17:30:37Z

Codecov Report

❌ Patch coverage is 92.36043% with 78 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...st/lance/src/dataset/mem_wal/scanner/block_list.rs	87.71%	22 Missing and 13 partials ⚠️
...src/dataset/mem_wal/scanner/exec/pk_hash_filter.rs	84.54%	29 Missing and 3 partials ⚠️
...lance/src/dataset/mem_wal/scanner/flushed_cache.rs	87.17%	2 Missing and 3 partials ⚠️
rust/lance/src/dataset/mem_wal/scanner/builder.rs	94.00%	0 Missing and 3 partials ⚠️
...lance/src/dataset/mem_wal/scanner/vector_search.rs	99.31%	1 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

…ost-filter A primary-key update whose fresh row falls out of its own source's top-k let the superseded copy from an older generation win the global dedup, returning a stale row (the gap LsmGlobalPkDedupExec cannot close, exposed by test_vector_search_stale_read_when_fresh_falls_out_of_top_k). Each per-source KNN now over-fetches and drops rows superseded by a newer generation (matched by _rowid) via BlockListFilterExec, before the cross-source union, so a stale row never reaches the merge. The existing global dedup still resolves within-source duplicates. Block-list bitmaps come from the block-list machinery, re-keyed from _rowaddr to _rowid (what fast_search emits; equal to _rowaddr unless stable row ids are enabled). This is a post-filter that relies on over-fetch (STALE_OVERFETCH_FACTOR) to backfill dropped rows; it does not guarantee k live results in the adversarial case. A TODO at the filter site tracks promoting the same mask to a true KNN prefilter (which traverses until k rows pass, removing the over-fetch). Turns the failing spec test green; existing vector-search tests unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tch default 2.5 Default STALE_OVERFETCH_FACTOR to 2.5 (was 4) and round up at the call site (ceil(k * factor)), so a filtered source still over-fetches at least k+1. Add UnderfillFilterWarnExec: a pass-through node placed at the top of the plan only when a per-source block-list was applied. It logs a tracing warning if the query returns fewer than k rows — the signature of an under-fetch, where the over-fetch did not leave enough live candidates after dropping superseded rows. Gated on filtering having happened so a genuinely small unfiltered result does not warn; the real fix remains a true KNN prefilter (tracked at the filter site). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…block-list compute_source_block_lists rebuilt each flushed generation's GenPkIndex on every vector query via a full PK-column scan. Cache it on FlushedMemTableCache (a parallel moka cache keyed by the same immutable flushed path) so the index is built once per generation, single-flight, and reused as an Arc::clone on warm queries; pruned alongside the dataset cache by retain_paths at compaction. Cold miss still builds lazily by scanning. Build-at-flush is deliberately not wired: the flush write path has no handle to the reader-side cache, so seeding it would mean threading a read cache through the writer for marginal gain (it only saves the first cold scan). MVCC watermark pinning is likewise deferred: membership is built at plan time (a subset of execution-time candidates), so it cannot over-suppress a not-yet-visible row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…K stale reads Adds three vector-search dedup tests: - over-fetch backfill: a source whose entire top-k is stale still returns k live results from its next-nearest rows (no under-fill). - cross-flushed suppression: an older flushed generation's stale row is blocked by a newer flushed generation (no base/active involved). - composite primary key: the block-list keys on the full (id1,id2) PK, so an updated base row is suppressed end-to-end. Also Box::pin the now-larger plan_search sub-futures (block-list build and the per-source KNN build) and the flushed PK-index build future to stay under clippy::large_futures at every await site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… (dangerous direction) Our existing within-active dedup test keeps the newer copy only because it is also the closer one, so it passes even with broken newest-wins. This ports the lance-format#6844 spec: the re-inserted copy is FARTHER than the stale one (which sits on the query), so a correct result requires real newest-wins. Passes on the block-list post-filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…che method get_or_build_pk_index is public (on FlushedMemTableCache) but GenPkIndex lives in a private module, so the [`GenPkIndex`] intra-doc link failed the rustdoc CI job under -D warnings (rustdoc::private-intra-doc-links). Keep it a plain code span. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the per-generation row-address block-list (RowAddrMask + BlockListFilterExec) with PK-hash membership filtering (PkHashFilterExec) for every source, base included. - GenPkIndex is now a generation's set of PK hashes; compute_blocked_sets yields each generation's NEWER(G) set plus the base table's full-union set. No row addresses are tracked. - Removes the per-query full base PK scan: base is filtered by hashing only its KNN candidates against the membership set, like every other source, instead of scanning the whole base table. - Within-generation duplicates are left to the global dedup's (generation, freshness) tiebreaker. The address-based within-gen block is dropped (no test exercised it). Trade-off: within-gen top-k eviction is no longer pre-filtered -- the same bug class we fix cross-source, to be closed later by a true KNN prefilter or flush-time dedup. - Fold the under-fetch warning into PkHashFilterExec as a per-source check (the over-fetch is per-source) and delete UnderfillFilterWarnExec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… newtype Now that a generation's membership is just a set of PK hashes, the separate module and newtype no longer earn their keep. - Merge gen_pk_index.rs into block_list.rs. GenPkIndex (a thin wrapper over HashSet<u64>) is replaced by free functions over HashSet<u64>: pk_hashes_from_batches / pk_hashes_from_batch_store and the private compute_blocked_sets. - FlushedMemTableCache caches Arc<HashSet<u64>> directly (get_or_build_pk_hashes), dropping the newtype from the cache type. - Tighten doc and inline comments across the vector-search block-list path (block_list, pk_hash_filter, vector_search, flushed_cache). No behavior change. 109 scanner tests green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ry union `compute_source_block_lists` previously merged every newer generation's hashes into a fresh `NEWER(G)` set per source, per query — copying a large flushed gen's PKs on every search. Instead pass `PkHashFilterExec` a `Vec<Arc<HashSet<u64>>>` of the newer generations' membership sets and have it block a candidate whose PK hash is in any of them. - Each generation's membership is a shared `Arc<HashSet<u64>>` referenced, never merged; flushed gens reuse their cached set instead of copying it. - Removes `compute_blocked_sets`; the per-source list is the sets accumulated newest-first. No behavior change. 109 scanner tests green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add a writer-independent way to ask whether primary keys have been (re)written in the WAL fresh tier — the active + frozen memtables and flushed generations a scanner spans — i.e. are shadowed above the base table. Built like any query: construct the `LsmScanner`, then call `contains_pks`. - `LsmScanner::contains_pks(&RecordBatch) -> Vec<bool>`: hashes each input PK row with the same `compute_pk_hash` the dedup uses and tests it against the per-generation membership sets, so callers never hash PKs themselves; composite PKs work by including all PK columns. - Internal `block_list::fresh_tier_block_list` returns those membership sets as `Vec<Arc<HashSet<u64>>>` (the shape the vector-search filter already consumes); flushed sets are reused from the FlushedMemTableCache. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses review findings. 1. (correctness) Generations are per-shard, but compute_source_block_lists keyed the block-list by generation alone and accumulated membership across all shards. With two shards each at the same generation, the adjacent same-gen sources cross-blocked: a source ran through PkHashFilterExec against its own (or another shard's) membership, so an entire shard's memtable/flushed generation was silently filtered out of a multi-shard vector search. Now keyed by (shard_id, generation) and accumulated per shard, so a source is blocked only by strictly-newer generations of its own shard and never by itself. Planner looks up by (shard_id, generation). New regression test block_lists_are_keyed_per_shard. 2. (memory) The flushed PK scan folds the projected stream into the hash set one batch at a time (scan_pk_hashes) instead of try_collect into a Vec<RecordBatch>, so only one PK batch is resident at a time. 3. (test) The stale-read regression test now also asserts the nearest live neighbor (pk=2) is the returned top-1, not just that the stale read is absent. 112 scanner tests green; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the hardcoded STALE_OVERFETCH_FACTOR with a configurable plan_search argument, and adopt PR lance-format#6908's refine simplification. - `overfetch_factor: f64` is a single knob that controls both stale-read filtering and over-fetch: < 1.0 (e.g. 0.0) → stale filter off (global PK dedup still runs) == 1.0 → filter on, no over-fetch (may return < k live rows) > 1.0 → filter on, fetch ceil(k * factor) to backfill drops There is deliberately no separate on/off flag: over-fetch is only meaningful while filtering, so the factor encodes both. Default 1.0 (filter on, no over-fetch). NOTE: 0.0 now disables filtering — it no longer means "filter on, no over-fetch". - `refine_factor: Option<u32>` → `refine_base_table: bool`, auto-enabled whenever stale filtering runs (over-fetch widens the base's approximate candidate pool, so it must be re-ranked to exact before the merge). - Plumbed through Python (PyO3 + wrapper) and Java (JNI + LsmVectorSearchPlanner overloads), each with explicit docs. Adapted to this branch's per-shard block-list: over-fetch applies only to a source that a newer same-shard generation supersedes. 112 scanner tests green; clippy clean; lance / lance-jni / pylance compile. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resolve modify/delete conflicts: upstream lance-format#6882 replaced the flat mem_wal_read/mem_wal_vector benches with a benches/mem_wal/ tree, so accept those deletions; port the new mem_wal_vector_bench plan_search call to the configurable (refine_base_table, overfetch_factor) signature. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude Bot reviewed May 21, 2026

View reviewed changes

github-actions Bot added the enhancement New feature or request label May 21, 2026

hamersaw changed the title ~~feat(mem_wal): block-list membership + bitmaps for vector-search stale-read fix~~ fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter May 21, 2026

github-actions Bot added the bug Something isn't working label May 21, 2026

hamersaw and others added 11 commits May 21, 2026 14:33

github-actions Bot added python java labels May 22, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter#6899

fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter#6899
hamersaw wants to merge 14 commits into
lance-format:mainfrom
hamersaw:bug/wal-stale-row-on-update

hamersaw commented May 21, 2026 •

edited

Loading

Uh oh!

claude Bot left a comment

Uh oh!

codecov Bot commented May 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hamersaw commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Background

Approach

Configuration

External API

Scope / caveats

Tests

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

codecov Bot commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

hamersaw commented May 21, 2026 •

edited

Loading

codecov Bot commented May 21, 2026 •

edited

Loading