Skip to content

fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter#6899

Open
hamersaw wants to merge 14 commits into
lance-format:mainfrom
hamersaw:bug/wal-stale-row-on-update
Open

fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter#6899
hamersaw wants to merge 14 commits into
lance-format:mainfrom
hamersaw:bug/wal-stale-row-on-update

Conversation

@hamersaw
Copy link
Copy Markdown
Contributor

@hamersaw hamersaw commented May 21, 2026

What

Fixes a stale read in LSM MemWAL vector search: when a primary key is updated and its fresh row falls out of its own source's top-k, the superseded copy from an older generation could win the cross-source dedup and be returned.

Background

LsmGlobalPkDedupExec (introduced in #6881) is exact only over the candidates each source surfaces. If a PK's fresh version is pushed out of its source's top-k by closer rows, the dedup never sees it and cannot suppress the stale copy from an older generation — so the stale row is served. Repro: test_vector_search_stale_read_when_fresh_falls_out_of_top_k.

Approach

Make staleness a per-source PK-hash post-filter applied to each source's KNN before the cross-source union, so a stale row never reaches the merge.

  • Membership. Each generation's membership is an Arc<HashSet<u64>> of PK hashes (compute_pk_hash — the same hash the dedup nodes use). Built once per generation; flushed generations' sets are cached on FlushedMemTableCache and scanned streaming (one batch resident at a time, no full PK-column buffer).
  • Per-source block set. compute_source_block_lists gives each source the membership sets of the generations newer than it — NEWER(G) — as a Vec<Arc<HashSet<u64>>>, referenced, never merged into a per-query union. Generations are per-shard, so the map is keyed (shard_id, generation) and a source is only superseded by strictly-newer generations of its own shard (the base table, shardless and oldest, is blocked by every generation). The base table is not scanned — it's filtered by hashing only its KNN candidates.
  • Execution. PkHashFilterExec drops any candidate whose PK hash is in any of its source's blocked sets. This handles only cross-generation supersession: a PK in a newer generation makes every copy of it stale, so dropping by hash needs no row address. Within-generation duplicates (same PK twice in one generation) share a hash and are left to the existing global dedup's (generation, freshness) tiebreaker.

Configuration

LsmVectorSearchPlanner::plan_search exposes two knobs (Rust + Python + Java bindings):

  • overfetch_factor: f64 — a single knob controlling both stale filtering and over-fetch:

    • < 1.0 (e.g. 0.0): stale filtering off (no block-list / PkHashFilterExec; the global dedup still runs).
    • == 1.0 (default): filtering on, no over-fetch — a source with superseded rows fetches exactly k and may return fewer than k live rows.
    • > 1.0: filtering on, over-fetch ceil(k * factor) so dropping the stale rows still leaves k.

    There is intentionally no separate on/off flag — over-fetch is only meaningful while filtering, so the factor encodes both.

  • refine_base_table: bool (replaces the old refine_factor: Option<u32>) — re-rank the base arm's approximate index distances to exact (factor 1). Auto-enabled whenever stale filtering runs (over-fetch widens the base's candidate pool, which must be exact before the merge).

External API

LsmScanner::contains_pks(&RecordBatch) -> Vec<bool> — test which primary keys have been (re)written in the WAL fresh tier (active + frozen memtables + flushed generations), built like any query (construct the scanner, then call). Hashing is internal, so callers never reproduce compute_pk_hash.

Scope / caveats

  • Post-filter, not a true prefilter. It relies on over-fetch to backfill dropped rows and does not guarantee k live results in the adversarial case (more superseded rows near the query than the over-fetch covers); PkHashFilterExec logs a per-source warning when this happens. Promoting the same membership into the KNN as a true prefilter (the index traverses until k rows pass, removing the over-fetch) is the headline follow-up.
  • Within-generation top-k eviction (a generation holds both a stale and a fresh copy of a PK, and the fresh one is evicted) is not pre-filtered — it shares a hash, so it can't be disambiguated by membership. It's the same bug class as the cross-source case, currently relying on the global dedup; closing it would need flush-time dedup (so flushed generations are internally deduped like the base table). The base table is assumed internally deduped.

Tests

test_vector_search_stale_read_when_fresh_falls_out_of_top_k passes (with positive + overfetch_factor=0.0 toggle-off assertions); plus over-fetch backfill, cross-flushed and composite-PK stale reads, same-L0 newest-wins, per-shard block-list isolation, and PkHashFilterExec / membership unit tests (incl. null and composite PKs). cargo fmt, cargo clippy -p lance --tests -- -D warnings, and the lance / lance-jni / pylance crate checks are clean.

🤖 Generated with Claude Code

…e-read fix

Foundation for fixing the LSM vector-search stale read where an updated PK's
fresh row falls out of its own source's top-k, letting the superseded copy
from an older generation win (the per-source top-k -> global-dedup gap that
replaced the bloom-based FilterStaleExec in lance-format#6881).

Adds the mem_wal-side machinery that *produces* the per-source block-lists,
keyed on the same compute_pk_hash the dedup nodes use:

- GenPkIndex: per-generation pk-hash -> rowaddrs, with membership,
  within-generation supersession (write-polarity aware), and cross-generation
  supersession (superseded_by NEWER(G)).
- compute_block_lists: folds generations newest-first into per-gen block trees
  plus the membership union base uses.
- compute_source_block_lists: assembles a per-source RowAddrMask bitmap from
  the real LSM sources (active/frozen via BatchStore positions, flushed via
  disk scan of pk + _rowaddr, base blocked cross-generation only).

Scope: this PR only builds the bitmaps; how a mask drives the actual KNN
search (a Scanner pre-filter API) is a deliberate follow-up, so the new
functions are gated dead_code until wired and there is no behavior change yet.
The failing spec test lands with that follow-up.

Unit-tested: membership, within/cross-gen supersession, in-memory assembly,
and the base disk-scan path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the enhancement New feature or request label May 21, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 21, 2026

…ost-filter

A primary-key update whose fresh row falls out of its own source's top-k let
the superseded copy from an older generation win the global dedup, returning a
stale row (the gap LsmGlobalPkDedupExec cannot close, exposed by
test_vector_search_stale_read_when_fresh_falls_out_of_top_k).

Each per-source KNN now over-fetches and drops rows superseded by a newer
generation (matched by _rowid) via BlockListFilterExec, before the cross-source
union, so a stale row never reaches the merge. The existing global dedup still
resolves within-source duplicates. Block-list bitmaps come from the block-list
machinery, re-keyed from _rowaddr to _rowid (what fast_search emits; equal to
_rowaddr unless stable row ids are enabled).

This is a post-filter that relies on over-fetch (STALE_OVERFETCH_FACTOR) to
backfill dropped rows; it does not guarantee k live results in the adversarial
case. A TODO at the filter site tracks promoting the same mask to a true KNN
prefilter (which traverses until k rows pass, removing the over-fetch).

Turns the failing spec test green; existing vector-search tests unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@hamersaw hamersaw changed the title feat(mem_wal): block-list membership + bitmaps for vector-search stale-read fix fix(mem_wal): suppress stale LSM vector-search reads via block-list post-filter May 21, 2026
@github-actions github-actions Bot added the bug Something isn't working label May 21, 2026
hamersaw and others added 11 commits May 21, 2026 14:33
…tch default 2.5

Default STALE_OVERFETCH_FACTOR to 2.5 (was 4) and round up at the call site
(ceil(k * factor)), so a filtered source still over-fetches at least k+1.

Add UnderfillFilterWarnExec: a pass-through node placed at the top of the plan
only when a per-source block-list was applied. It logs a tracing warning if the
query returns fewer than k rows — the signature of an under-fetch, where the
over-fetch did not leave enough live candidates after dropping superseded rows.
Gated on filtering having happened so a genuinely small unfiltered result does
not warn; the real fix remains a true KNN prefilter (tracked at the filter site).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…block-list

compute_source_block_lists rebuilt each flushed generation's GenPkIndex on every
vector query via a full PK-column scan. Cache it on FlushedMemTableCache (a
parallel moka cache keyed by the same immutable flushed path) so the index is
built once per generation, single-flight, and reused as an Arc::clone on warm
queries; pruned alongside the dataset cache by retain_paths at compaction.

Cold miss still builds lazily by scanning. Build-at-flush is deliberately not
wired: the flush write path has no handle to the reader-side cache, so seeding
it would mean threading a read cache through the writer for marginal gain (it
only saves the first cold scan). MVCC watermark pinning is likewise deferred:
membership is built at plan time (a subset of execution-time candidates), so it
cannot over-suppress a not-yet-visible row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…K stale reads

Adds three vector-search dedup tests:
- over-fetch backfill: a source whose entire top-k is stale still returns k live
  results from its next-nearest rows (no under-fill).
- cross-flushed suppression: an older flushed generation's stale row is blocked
  by a newer flushed generation (no base/active involved).
- composite primary key: the block-list keys on the full (id1,id2) PK, so an
  updated base row is suppressed end-to-end.

Also Box::pin the now-larger plan_search sub-futures (block-list build and the
per-source KNN build) and the flushed PK-index build future to stay under
clippy::large_futures at every await site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (dangerous direction)

Our existing within-active dedup test keeps the newer copy only because it is
also the closer one, so it passes even with broken newest-wins. This ports the
lance-format#6844 spec: the re-inserted copy is FARTHER than the stale one (which sits on
the query), so a correct result requires real newest-wins. Passes on the
block-list post-filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…che method

get_or_build_pk_index is public (on FlushedMemTableCache) but GenPkIndex lives in
a private module, so the [`GenPkIndex`] intra-doc link failed the rustdoc CI job
under -D warnings (rustdoc::private-intra-doc-links). Keep it a plain code span.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the per-generation row-address block-list (RowAddrMask +
BlockListFilterExec) with PK-hash membership filtering (PkHashFilterExec)
for every source, base included.

- GenPkIndex is now a generation's set of PK hashes; compute_blocked_sets
  yields each generation's NEWER(G) set plus the base table's full-union
  set. No row addresses are tracked.
- Removes the per-query full base PK scan: base is filtered by hashing
  only its KNN candidates against the membership set, like every other
  source, instead of scanning the whole base table.
- Within-generation duplicates are left to the global dedup's
  (generation, freshness) tiebreaker. The address-based within-gen block
  is dropped (no test exercised it). Trade-off: within-gen top-k eviction
  is no longer pre-filtered -- the same bug class we fix cross-source,
  to be closed later by a true KNN prefilter or flush-time dedup.
- Fold the under-fetch warning into PkHashFilterExec as a per-source
  check (the over-fetch is per-source) and delete UnderfillFilterWarnExec.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… newtype

Now that a generation's membership is just a set of PK hashes, the
separate module and newtype no longer earn their keep.

- Merge gen_pk_index.rs into block_list.rs. GenPkIndex (a thin wrapper
  over HashSet<u64>) is replaced by free functions over HashSet<u64>:
  pk_hashes_from_batches / pk_hashes_from_batch_store and the private
  compute_blocked_sets.
- FlushedMemTableCache caches Arc<HashSet<u64>> directly
  (get_or_build_pk_hashes), dropping the newtype from the cache type.
- Tighten doc and inline comments across the vector-search block-list
  path (block_list, pk_hash_filter, vector_search, flushed_cache).

No behavior change. 109 scanner tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ry union

`compute_source_block_lists` previously merged every newer generation's
hashes into a fresh `NEWER(G)` set per source, per query — copying a large
flushed gen's PKs on every search. Instead pass `PkHashFilterExec` a
`Vec<Arc<HashSet<u64>>>` of the newer generations' membership sets and have
it block a candidate whose PK hash is in any of them.

- Each generation's membership is a shared `Arc<HashSet<u64>>` referenced,
  never merged; flushed gens reuse their cached set instead of copying it.
- Removes `compute_blocked_sets`; the per-source list is the sets
  accumulated newest-first.

No behavior change. 109 scanner tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a writer-independent way to ask whether primary keys have been
(re)written in the WAL fresh tier — the active + frozen memtables and
flushed generations a scanner spans — i.e. are shadowed above the base
table. Built like any query: construct the `LsmScanner`, then call
`contains_pks`.

- `LsmScanner::contains_pks(&RecordBatch) -> Vec<bool>`: hashes each input
  PK row with the same `compute_pk_hash` the dedup uses and tests it
  against the per-generation membership sets, so callers never hash PKs
  themselves; composite PKs work by including all PK columns.
- Internal `block_list::fresh_tier_block_list` returns those membership
  sets as `Vec<Arc<HashSet<u64>>>` (the shape the vector-search filter
  already consumes); flushed sets are reused from the FlushedMemTableCache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses review findings.

1. (correctness) Generations are per-shard, but compute_source_block_lists
   keyed the block-list by generation alone and accumulated membership
   across all shards. With two shards each at the same generation, the
   adjacent same-gen sources cross-blocked: a source ran through
   PkHashFilterExec against its own (or another shard's) membership, so an
   entire shard's memtable/flushed generation was silently filtered out of
   a multi-shard vector search. Now keyed by (shard_id, generation) and
   accumulated per shard, so a source is blocked only by strictly-newer
   generations of its own shard and never by itself. Planner looks up by
   (shard_id, generation). New regression test block_lists_are_keyed_per_shard.

2. (memory) The flushed PK scan folds the projected stream into the hash
   set one batch at a time (scan_pk_hashes) instead of try_collect into a
   Vec<RecordBatch>, so only one PK batch is resident at a time.

3. (test) The stale-read regression test now also asserts the nearest live
   neighbor (pk=2) is the returned top-1, not just that the stale read is
   absent.

112 scanner tests green; clippy clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the hardcoded STALE_OVERFETCH_FACTOR with a configurable
plan_search argument, and adopt PR lance-format#6908's refine simplification.

- `overfetch_factor: f64` is a single knob that controls both stale-read
  filtering and over-fetch:
    < 1.0 (e.g. 0.0) → stale filter off (global PK dedup still runs)
    == 1.0           → filter on, no over-fetch (may return < k live rows)
    > 1.0            → filter on, fetch ceil(k * factor) to backfill drops
  There is deliberately no separate on/off flag: over-fetch is only
  meaningful while filtering, so the factor encodes both. Default 1.0
  (filter on, no over-fetch). NOTE: 0.0 now disables filtering — it no
  longer means "filter on, no over-fetch".
- `refine_factor: Option<u32>` → `refine_base_table: bool`, auto-enabled
  whenever stale filtering runs (over-fetch widens the base's approximate
  candidate pool, so it must be re-ranked to exact before the merge).
- Plumbed through Python (PyO3 + wrapper) and Java (JNI +
  LsmVectorSearchPlanner overloads), each with explicit docs.

Adapted to this branch's per-shard block-list: over-fetch applies only to
a source that a newer same-shard generation supersedes.

112 scanner tests green; clippy clean; lance / lance-jni / pylance compile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolve modify/delete conflicts: upstream lance-format#6882 replaced the flat
mem_wal_read/mem_wal_vector benches with a benches/mem_wal/ tree, so
accept those deletions; port the new mem_wal_vector_bench plan_search
call to the configurable (refine_base_table, overfetch_factor) signature.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working enhancement New feature or request java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant