fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter#6899
Open
hamersaw wants to merge 14 commits into
Open
fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter#6899hamersaw wants to merge 14 commits into
hamersaw wants to merge 14 commits into
Conversation
…e-read fix Foundation for fixing the LSM vector-search stale read where an updated PK's fresh row falls out of its own source's top-k, letting the superseded copy from an older generation win (the per-source top-k -> global-dedup gap that replaced the bloom-based FilterStaleExec in lance-format#6881). Adds the mem_wal-side machinery that *produces* the per-source block-lists, keyed on the same compute_pk_hash the dedup nodes use: - GenPkIndex: per-generation pk-hash -> rowaddrs, with membership, within-generation supersession (write-polarity aware), and cross-generation supersession (superseded_by NEWER(G)). - compute_block_lists: folds generations newest-first into per-gen block trees plus the membership union base uses. - compute_source_block_lists: assembles a per-source RowAddrMask bitmap from the real LSM sources (active/frozen via BatchStore positions, flushed via disk scan of pk + _rowaddr, base blocked cross-generation only). Scope: this PR only builds the bitmaps; how a mask drives the actual KNN search (a Scanner pre-filter API) is a deliberate follow-up, so the new functions are gated dead_code until wired and there is no behavior change yet. The failing spec test lands with that follow-up. Unit-tested: membership, within/cross-gen supersession, in-memory assembly, and the base disk-scan path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
…ost-filter A primary-key update whose fresh row falls out of its own source's top-k let the superseded copy from an older generation win the global dedup, returning a stale row (the gap LsmGlobalPkDedupExec cannot close, exposed by test_vector_search_stale_read_when_fresh_falls_out_of_top_k). Each per-source KNN now over-fetches and drops rows superseded by a newer generation (matched by _rowid) via BlockListFilterExec, before the cross-source union, so a stale row never reaches the merge. The existing global dedup still resolves within-source duplicates. Block-list bitmaps come from the block-list machinery, re-keyed from _rowaddr to _rowid (what fast_search emits; equal to _rowaddr unless stable row ids are enabled). This is a post-filter that relies on over-fetch (STALE_OVERFETCH_FACTOR) to backfill dropped rows; it does not guarantee k live results in the adversarial case. A TODO at the filter site tracks promoting the same mask to a true KNN prefilter (which traverses until k rows pass, removing the over-fetch). Turns the failing spec test green; existing vector-search tests unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…tch default 2.5 Default STALE_OVERFETCH_FACTOR to 2.5 (was 4) and round up at the call site (ceil(k * factor)), so a filtered source still over-fetches at least k+1. Add UnderfillFilterWarnExec: a pass-through node placed at the top of the plan only when a per-source block-list was applied. It logs a tracing warning if the query returns fewer than k rows — the signature of an under-fetch, where the over-fetch did not leave enough live candidates after dropping superseded rows. Gated on filtering having happened so a genuinely small unfiltered result does not warn; the real fix remains a true KNN prefilter (tracked at the filter site). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…block-list compute_source_block_lists rebuilt each flushed generation's GenPkIndex on every vector query via a full PK-column scan. Cache it on FlushedMemTableCache (a parallel moka cache keyed by the same immutable flushed path) so the index is built once per generation, single-flight, and reused as an Arc::clone on warm queries; pruned alongside the dataset cache by retain_paths at compaction. Cold miss still builds lazily by scanning. Build-at-flush is deliberately not wired: the flush write path has no handle to the reader-side cache, so seeding it would mean threading a read cache through the writer for marginal gain (it only saves the first cold scan). MVCC watermark pinning is likewise deferred: membership is built at plan time (a subset of execution-time candidates), so it cannot over-suppress a not-yet-visible row. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…K stale reads Adds three vector-search dedup tests: - over-fetch backfill: a source whose entire top-k is stale still returns k live results from its next-nearest rows (no under-fill). - cross-flushed suppression: an older flushed generation's stale row is blocked by a newer flushed generation (no base/active involved). - composite primary key: the block-list keys on the full (id1,id2) PK, so an updated base row is suppressed end-to-end. Also Box::pin the now-larger plan_search sub-futures (block-list build and the per-source KNN build) and the flushed PK-index build future to stay under clippy::large_futures at every await site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (dangerous direction) Our existing within-active dedup test keeps the newer copy only because it is also the closer one, so it passes even with broken newest-wins. This ports the lance-format#6844 spec: the re-inserted copy is FARTHER than the stale one (which sits on the query), so a correct result requires real newest-wins. Passes on the block-list post-filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che method get_or_build_pk_index is public (on FlushedMemTableCache) but GenPkIndex lives in a private module, so the [`GenPkIndex`] intra-doc link failed the rustdoc CI job under -D warnings (rustdoc::private-intra-doc-links). Keep it a plain code span. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the per-generation row-address block-list (RowAddrMask + BlockListFilterExec) with PK-hash membership filtering (PkHashFilterExec) for every source, base included. - GenPkIndex is now a generation's set of PK hashes; compute_blocked_sets yields each generation's NEWER(G) set plus the base table's full-union set. No row addresses are tracked. - Removes the per-query full base PK scan: base is filtered by hashing only its KNN candidates against the membership set, like every other source, instead of scanning the whole base table. - Within-generation duplicates are left to the global dedup's (generation, freshness) tiebreaker. The address-based within-gen block is dropped (no test exercised it). Trade-off: within-gen top-k eviction is no longer pre-filtered -- the same bug class we fix cross-source, to be closed later by a true KNN prefilter or flush-time dedup. - Fold the under-fetch warning into PkHashFilterExec as a per-source check (the over-fetch is per-source) and delete UnderfillFilterWarnExec. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… newtype Now that a generation's membership is just a set of PK hashes, the separate module and newtype no longer earn their keep. - Merge gen_pk_index.rs into block_list.rs. GenPkIndex (a thin wrapper over HashSet<u64>) is replaced by free functions over HashSet<u64>: pk_hashes_from_batches / pk_hashes_from_batch_store and the private compute_blocked_sets. - FlushedMemTableCache caches Arc<HashSet<u64>> directly (get_or_build_pk_hashes), dropping the newtype from the cache type. - Tighten doc and inline comments across the vector-search block-list path (block_list, pk_hash_filter, vector_search, flushed_cache). No behavior change. 109 scanner tests green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ry union `compute_source_block_lists` previously merged every newer generation's hashes into a fresh `NEWER(G)` set per source, per query — copying a large flushed gen's PKs on every search. Instead pass `PkHashFilterExec` a `Vec<Arc<HashSet<u64>>>` of the newer generations' membership sets and have it block a candidate whose PK hash is in any of them. - Each generation's membership is a shared `Arc<HashSet<u64>>` referenced, never merged; flushed gens reuse their cached set instead of copying it. - Removes `compute_blocked_sets`; the per-source list is the sets accumulated newest-first. No behavior change. 109 scanner tests green. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a writer-independent way to ask whether primary keys have been (re)written in the WAL fresh tier — the active + frozen memtables and flushed generations a scanner spans — i.e. are shadowed above the base table. Built like any query: construct the `LsmScanner`, then call `contains_pks`. - `LsmScanner::contains_pks(&RecordBatch) -> Vec<bool>`: hashes each input PK row with the same `compute_pk_hash` the dedup uses and tests it against the per-generation membership sets, so callers never hash PKs themselves; composite PKs work by including all PK columns. - Internal `block_list::fresh_tier_block_list` returns those membership sets as `Vec<Arc<HashSet<u64>>>` (the shape the vector-search filter already consumes); flushed sets are reused from the FlushedMemTableCache. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review findings. 1. (correctness) Generations are per-shard, but compute_source_block_lists keyed the block-list by generation alone and accumulated membership across all shards. With two shards each at the same generation, the adjacent same-gen sources cross-blocked: a source ran through PkHashFilterExec against its own (or another shard's) membership, so an entire shard's memtable/flushed generation was silently filtered out of a multi-shard vector search. Now keyed by (shard_id, generation) and accumulated per shard, so a source is blocked only by strictly-newer generations of its own shard and never by itself. Planner looks up by (shard_id, generation). New regression test block_lists_are_keyed_per_shard. 2. (memory) The flushed PK scan folds the projected stream into the hash set one batch at a time (scan_pk_hashes) instead of try_collect into a Vec<RecordBatch>, so only one PK batch is resident at a time. 3. (test) The stale-read regression test now also asserts the nearest live neighbor (pk=2) is the returned top-1, not just that the stale read is absent. 112 scanner tests green; clippy clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hardcoded STALE_OVERFETCH_FACTOR with a configurable plan_search argument, and adopt PR lance-format#6908's refine simplification. - `overfetch_factor: f64` is a single knob that controls both stale-read filtering and over-fetch: < 1.0 (e.g. 0.0) → stale filter off (global PK dedup still runs) == 1.0 → filter on, no over-fetch (may return < k live rows) > 1.0 → filter on, fetch ceil(k * factor) to backfill drops There is deliberately no separate on/off flag: over-fetch is only meaningful while filtering, so the factor encodes both. Default 1.0 (filter on, no over-fetch). NOTE: 0.0 now disables filtering — it no longer means "filter on, no over-fetch". - `refine_factor: Option<u32>` → `refine_base_table: bool`, auto-enabled whenever stale filtering runs (over-fetch widens the base's approximate candidate pool, so it must be re-ranked to exact before the merge). - Plumbed through Python (PyO3 + wrapper) and Java (JNI + LsmVectorSearchPlanner overloads), each with explicit docs. Adapted to this branch's per-shard block-list: over-fetch applies only to a source that a newer same-shard generation supersedes. 112 scanner tests green; clippy clean; lance / lance-jni / pylance compile. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve modify/delete conflicts: upstream lance-format#6882 replaced the flat mem_wal_read/mem_wal_vector benches with a benches/mem_wal/ tree, so accept those deletions; port the new mem_wal_vector_bench plan_search call to the configurable (refine_base_table, overfetch_factor) signature. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Fixes a stale read in LSM MemWAL vector search: when a primary key is updated and its fresh row falls out of its own source's top-k, the superseded copy from an older generation could win the cross-source dedup and be returned.
Background
LsmGlobalPkDedupExec(introduced in #6881) is exact only over the candidates each source surfaces. If a PK's fresh version is pushed out of its source's top-k by closer rows, the dedup never sees it and cannot suppress the stale copy from an older generation — so the stale row is served. Repro:test_vector_search_stale_read_when_fresh_falls_out_of_top_k.Approach
Make staleness a per-source PK-hash post-filter applied to each source's KNN before the cross-source union, so a stale row never reaches the merge.
Arc<HashSet<u64>>of PK hashes (compute_pk_hash— the same hash the dedup nodes use). Built once per generation; flushed generations' sets are cached onFlushedMemTableCacheand scanned streaming (one batch resident at a time, no full PK-column buffer).compute_source_block_listsgives each source the membership sets of the generations newer than it —NEWER(G)— as aVec<Arc<HashSet<u64>>>, referenced, never merged into a per-query union. Generations are per-shard, so the map is keyed(shard_id, generation)and a source is only superseded by strictly-newer generations of its own shard (the base table, shardless and oldest, is blocked by every generation). The base table is not scanned — it's filtered by hashing only its KNN candidates.PkHashFilterExecdrops any candidate whose PK hash is in any of its source's blocked sets. This handles only cross-generation supersession: a PK in a newer generation makes every copy of it stale, so dropping by hash needs no row address. Within-generation duplicates (same PK twice in one generation) share a hash and are left to the existing global dedup's(generation, freshness)tiebreaker.Configuration
LsmVectorSearchPlanner::plan_searchexposes two knobs (Rust + Python + Java bindings):overfetch_factor: f64— a single knob controlling both stale filtering and over-fetch:< 1.0(e.g.0.0): stale filtering off (no block-list /PkHashFilterExec; the global dedup still runs).== 1.0(default): filtering on, no over-fetch — a source with superseded rows fetches exactlykand may return fewer thanklive rows.> 1.0: filtering on, over-fetchceil(k * factor)so dropping the stale rows still leavesk.There is intentionally no separate on/off flag — over-fetch is only meaningful while filtering, so the factor encodes both.
refine_base_table: bool(replaces the oldrefine_factor: Option<u32>) — re-rank the base arm's approximate index distances to exact (factor 1). Auto-enabled whenever stale filtering runs (over-fetch widens the base's candidate pool, which must be exact before the merge).External API
LsmScanner::contains_pks(&RecordBatch) -> Vec<bool>— test which primary keys have been (re)written in the WAL fresh tier (active + frozen memtables + flushed generations), built like any query (construct the scanner, then call). Hashing is internal, so callers never reproducecompute_pk_hash.Scope / caveats
klive results in the adversarial case (more superseded rows near the query than the over-fetch covers);PkHashFilterExeclogs a per-source warning when this happens. Promoting the same membership into the KNN as a true prefilter (the index traverses untilkrows pass, removing the over-fetch) is the headline follow-up.Tests
test_vector_search_stale_read_when_fresh_falls_out_of_top_kpasses (with positive +overfetch_factor=0.0toggle-off assertions); plus over-fetch backfill, cross-flushed and composite-PK stale reads, same-L0 newest-wins, per-shard block-list isolation, andPkHashFilterExec/ membership unit tests (incl. null and composite PKs).cargo fmt,cargo clippy -p lance --tests -- -D warnings, and thelance/lance-jni/pylancecrate checks are clean.🤖 Generated with Claude Code