fix(mem_wal): exact PK dedup for LSM vector search and point lookup#6856

Open
jackye1995 wants to merge 5 commits into
lance-format:mainfrom
jackye1995:jack/fix-lsm-pk-dedupe

@jackye1995 jackye1995 commented May 20, 2026

Duplicate primary keys written into one memtable or across generations leaked through as distinct rows in vector search and point lookup.

Exact PK dedup — replaces the bloom-filter-based FilterStaleExec with an exact pipeline:

  • LsmSourceTagExec tags each row with (_memtable_gen, _freshness)
  • LsmGlobalPkDedupExec does single-pass cross-source PK dedup keeping the freshest row per PK
  • Point lookup adds WithinSourceDedupExec(KeepMaxFreshness) on the active arm

Removes FilterStaleExec, GenerationBloomFilter, and the bloom-filter building from transaction commit.

Post-rerank TakeExec — the planner now accepts a base Dataset reference. After global PK dedup + sort + top-k, a TakeExec materializes any user-projected columns not in the per-source KNN output by fetching from the base dataset via _rowid.

Refine factor plumbing: plan_search() now accepts refine_factor so callers can enable base-table refine (over-fetch k * factor candidates, re-rank with exact distances). Memtable arms use exact HNSW and are unaffected. Exposed in both Python and Java bindings.

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the bug Something isn't working label May 20, 2026
…point lookup

Same primary key written multiple times into one memtable, or into both a
memtable and an older generation, used to leak through to the user as
distinct rows. Two paths were affected:

- Vector search: HNSW indexes every insert as its own graph node, so KNN
  could return both V1 and V2 of the same PK from a single source.
- Point lookup (active arm): `FilterExec + LIMIT 1` over an insert-ordered
  scan returned the oldest match among duplicates.

Vector search now runs each per-source KNN through `LsmSourceTagExec`,
which appends `(_memtable_gen, _freshness)`. A single `LsmGlobalPkDedupExec`
over the union picks the row with the largest tuple per PK — newer
generations win, ties fall to the normalized within-source order (larger
`_rowid` for the active arm; flipped via `u64::MAX - _rowid` for flushed
arms to compensate for the reverse-write convention). This replaces the
older two-step `WithinSourceDedupExec` + bloom-based `FilterStaleExec`
design and is exact (no false-positive recall loss, no top-k under-fill,
no missing-bloom footgun).
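The freshness rule above can be sketched in a few lines of Python. This is only an illustration of the ordering, not the `LsmGlobalPkDedupExec` implementation: the `Row` type, the `is_active` flag, and the assumption that a larger `gen` means a newer generation are all hypothetical names introduced here.

```python
from dataclasses import dataclass

U64_MAX = 2**64 - 1


@dataclass(frozen=True)
class Row:
    pk: int
    rowid: int
    distance: float
    gen: int          # hypothetical: larger value = newer generation
    is_active: bool   # hypothetical: True for the active memtable arm


def freshness(row: Row) -> int:
    # Active arm: later inserts get a larger _rowid, so larger is fresher.
    # Flushed arm: reverse-write convention, so flip to keep "larger = fresher".
    return row.rowid if row.is_active else U64_MAX - row.rowid


def global_pk_dedup(rows):
    """Single pass over the union; keep the largest (gen, freshness) per PK."""
    best = {}
    for row in rows:
        key = (row.gen, freshness(row))
        current = best.get(row.pk)
        if current is None or key > current[0]:
            best[row.pk] = (key, row)
    return [row for _, row in best.values()]


rows = [
    Row(pk=1, rowid=0, distance=0.9, gen=2, is_active=True),   # V1, active
    Row(pk=1, rowid=5, distance=0.4, gen=2, is_active=True),   # V2, active
    Row(pk=1, rowid=3, distance=0.1, gen=1, is_active=False),  # older generation
    Row(pk=2, rowid=0, distance=0.2, gen=1, is_active=False),  # flushed, newest
    Row(pk=2, rowid=7, distance=0.3, gen=1, is_active=False),  # flushed, older
]
survivors = sorted(global_pk_dedup(rows), key=lambda r: r.distance)
print([(r.pk, r.rowid) for r in survivors])
```

Note that the older-generation copy of PK 1 has the smallest distance but still loses, because generation is compared before anything else.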

Point lookup keeps a `WithinSourceDedupExec(KeepMaxFreshness)` on the
active arm only; `CoalesceFirstExec` already short-circuits cross-source
selection so global dedup would conflict. Flushed and base arms still
rely on `LIMIT 1` under the reverse-write / forward-write conventions
respectively.
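A small Python sketch of the point-lookup behaviour described above, under the assumption that rows are plain dicts and each arm is an in-memory list. The function names are hypothetical stand-ins for `FilterExec + LIMIT 1`, `WithinSourceDedupExec(KeepMaxFreshness)`, and `CoalesceFirstExec`; they only mirror the selection logic.

```python
def lookup_limit_1(rows, pk):
    # Scan in storage order and stop at the first match (FilterExec + LIMIT 1).
    for row in rows:
        if row["pk"] == pk:
            return row
    return None


def lookup_keep_max_freshness(rows, pk):
    # Active arm: larger _rowid means a later insert, so keep the maximum.
    matches = [r for r in rows if r["pk"] == pk]
    return max(matches, key=lambda r: r["rowid"], default=None)


def coalesce_first(arms, pk):
    # Arms are ordered freshest first; the first arm with a hit wins.
    for lookup, rows in arms:
        hit = lookup(rows, pk)
        if hit is not None:
            return hit
    return None


# Active memtable is insert-ordered: the oldest duplicate comes first.
active = [
    {"pk": 7, "rowid": 0, "value": "V1"},
    {"pk": 8, "rowid": 1, "value": "other"},
    {"pk": 7, "rowid": 2, "value": "V2"},
]
# Flushed generation is reverse-written: the newest duplicate comes first.
flushed = [
    {"pk": 9, "rowid": 0, "value": "new"},
    {"pk": 9, "rowid": 1, "value": "old"},
]

arms = [(lookup_keep_max_freshness, active), (lookup_limit_1, flushed)]
print(lookup_limit_1(active, 7)["value"])   # the old behaviour on the active arm
print(coalesce_first(arms, 7)["value"])
print(coalesce_first(arms, 9)["value"])
```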

Removed: `FilterStaleExec`, `GenerationBloomFilter`, and the
`LsmVectorSearchPlanner::with_bloom_filter[s]` API — no longer needed
now that dedup is exact.

New tests pin: duplicate PK within one active memtable (both planners),
duplicate PK across generations (vector search), and the partition
coalesce ahead of `LsmGlobalPkDedupExec` that keeps active-memtable
rows from being silently dropped.
@jackye1995 jackye1995 force-pushed the jack/fix-lsm-pk-dedupe branch from 33cf21c to a6e7d82 Compare May 20, 2026 05:35
…gh LSM vector search

The LSM vector search planner now accepts a base dataset reference and
appends a TakeExec after the global PK dedup + sort. This allows the
final top-k rows to materialize any user-projected columns that were
not part of the per-source KNN output, fetching from the base dataset
by _rowid.
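As a rough illustration of the post-rerank take step, the sketch below fills in projected columns that the KNN output lacks. It assumes the base dataset can be modelled as a dict keyed by `_rowid` and that every surviving row's `_rowid` resolves in the base dataset; `take` and `materialize` are hypothetical helpers, not the `TakeExec` API.

```python
def take(base, rowids, columns):
    """Hypothetical stand-in for a take-by-_rowid against the base dataset."""
    return [{c: base[rid][c] for c in columns} for rid in rowids]


def materialize(top_k, base, projection):
    # Columns already produced by the per-source KNN output.
    available = set(top_k[0]) if top_k else set()
    missing = [c for c in projection if c not in available]
    # One batched take for the final top-k rows only, after dedup and sort.
    fetched = take(base, [row["_rowid"] for row in top_k], missing)
    return [
        {c: (row[c] if c in row else extra[c]) for c in projection}
        for row, extra in zip(top_k, fetched)
    ]


base = {
    10: {"title": "a", "body": "first"},
    11: {"title": "b", "body": "second"},
}
top_k = [
    {"_rowid": 11, "_distance": 0.1},
    {"_rowid": 10, "_distance": 0.3},
]
result = materialize(top_k, base, ["_distance", "title"])
print(result)
```

Running the take after top-k keeps the fetch bounded by k rows instead of every candidate from every source.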

Also plumbs refine_factor as a parameter on plan_search() so callers
can enable base-table refine (over-fetch k*factor candidates, re-rank
with exact distances). Memtable arms use exact HNSW and are unaffected.
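The over-fetch and re-rank idea can be shown with a toy example. This is a sketch under stated assumptions: `quantize` is a crude stand-in for a lossy index (it is not IVF-RQ), and `search` is a hypothetical function, not `plan_search()`.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def quantize(v):
    # Crude lossy encoding: round each component to the nearest integer.
    return [round(x) for x in v]


def search(vectors, query, k, refine_factor=None):
    # Over-fetch k * refine_factor candidates using approximate distances.
    fetch = k * refine_factor if refine_factor else k
    candidates = sorted(
        vectors, key=lambda rid: l2(quantize(vectors[rid]), query)
    )[:fetch]
    if not refine_factor:
        return candidates
    # Re-rank the candidates with exact distances and keep the best k.
    return sorted(candidates, key=lambda rid: l2(vectors[rid], query))[:k]


vectors = {0: [0.6, 0.0], 1: [-0.45, 0.0], 2: [5.0, 0.0]}
query = [0.2, 0.0]

print(search(vectors, query, k=1))                    # approximate ranking only
print(search(vectors, query, k=1, refine_factor=2))   # over-fetch, then re-rank
```

Here row 0 is the true nearest neighbour (exact distance 0.4 versus 0.65), but quantization makes row 1 look closer, so only the refined search returns row 0.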

Both Python and Java bindings are updated with the new parameter.
@jackye1995 jackye1995 changed the title fix(mem_wal): dedupe duplicate primary keys in LSM vector search and point lookup fix(mem_wal): exact PK dedup for LSM vector search and point lookup May 20, 2026
@codecov

codecov Bot commented May 20, 2026

…point-lookup benchmarks

Remove the duplicated criterion-based mem_wal_read.rs and mem_wal_vector.rs
benchmarks. Replace with two standalone CLI benchmarks that produce JSON
output for panel-style trend analysis:

- mem_wal_vector_bench: KNN search across LSM levels using real 384-dim
  embeddings from lance-format/fineweb-edu, IVF-RQ base table index,
  recall verification against brute-force ground truth.

- mem_wal_point_lookup_bench: PK-based point lookups across base table,
  flushed generations, and active memtable.

Both accept --flushed-generations (0/1/2) and --max-memtable-rows
(100k/500k/1M) for sweeping the full matrix. Results are written as
individual JSON files for aggregation.

The hf:// URI scheme for accessing lance-format/fineweb-edu requires
network access that may not be available or reliable on all environments.
Switch to deterministic synthetic 384-dim embeddings using the same
cluster+noise scheme as mem_wal_hnsw_bench.rs. This makes the bench
self-contained with no external dependencies.
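For readers unfamiliar with cluster+noise data, a minimal Python sketch follows. The exact scheme in mem_wal_hnsw_bench.rs is not shown in this thread, so the cluster count, noise level, seed, and function name below are all assumptions chosen for illustration.

```python
import random


def synthetic_embeddings(n, dim=384, n_clusters=16, noise=0.05, seed=42):
    """Deterministic vectors: a cluster center plus small Gaussian noise."""
    rng = random.Random(seed)  # fixed seed makes every run identical
    centers = [
        [rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_clusters)
    ]
    vectors = []
    for i in range(n):
        center = centers[i % n_clusters]  # round-robin cluster assignment
        vectors.append([x + rng.gauss(0.0, noise) for x in center])
    return vectors


first = synthetic_embeddings(32)
second = synthetic_embeddings(32)
print(len(first), len(first[0]), first == second)
```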

@hamersaw hamersaw left a comment

IIUC this algorithm basically parallelizes top-K across sources and then dedups based on freshness. When exploring this elsewhere we hit a bug where a high-ranking row is bumped out of a source's top-K by a stale row, which produces incorrect results. For a concrete example:

active memtable
    PK=0, _distance=10
    PK=1, _distance=11
    PK=2, _distance=12
flushed memtable (l0 cache) gen 0
    PK=0, _distance=1
    PK=3, _distance=2
    PK=4, _distance=3

If we take a top-K of 2 and dedup on this, the active memtable returns PKs 0, 1 and the flushed memtable returns PKs 0, 3. After dedup and the final top-K, keys 0, 3 are returned, when really it should be 3, 4, both from the flushed memtable.
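The example can be reproduced with a short Python sketch. It models each source as a list of `(pk, distance)` pairs ordered freshest source first; the function names are hypothetical and only contrast the two orderings of dedup and top-K.

```python
K = 2
active = [(0, 10), (1, 11), (2, 12)]   # (pk, distance), freshest source
flushed = [(0, 1), (3, 2), (4, 3)]     # gen 0, older source


def top_k(rows, k):
    return sorted(rows, key=lambda r: r[1])[:k]


def per_source_top_k_then_dedup(sources, k):
    # Each source contributes only its own top-k before dedup runs.
    seen = {}
    for rows in sources:                 # freshest source first
        for pk, dist in top_k(rows, k):
            seen.setdefault(pk, dist)    # first (freshest) copy wins
    return [pk for pk, _ in top_k(list(seen.items()), k)]


def dedup_then_top_k(sources, k):
    # Ground truth: resolve every PK to its freshest copy, then rank.
    live = {}
    for rows in sources:
        for pk, dist in rows:
            live.setdefault(pk, dist)
    return [pk for pk, _ in top_k(list(live.items()), k)]


print(per_source_top_k_then_dedup([active, flushed], K))
print(dedup_then_top_k([active, flushed], K))
```

The stale copy of PK 0 occupies a slot in the flushed source's top-2, so PK 4 never reaches the dedup step.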

The various mitigations we discussed were:
(1) overfetching each source: we can have some bound that overfetches - don't love it.
(2) incremental refill: if a lower-tier source has keys that have been updated, we re-query it to ensure top-K non-updated keys - don't love it.
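For reference, mitigation (2) might look roughly like the Python sketch below. It assumes the set of PKs superseded by fresher sources (`stale_pks`) is known exactly, and it models a re-query as taking a longer prefix of the ranked rows; both are simplifications, and `live_top_k` is a hypothetical helper.

```python
def live_top_k(rows, k, stale_pks):
    """Widen the per-source fetch until k non-stale rows survive."""
    ranked = sorted(rows, key=lambda r: r[1])
    fetch = k
    while True:
        live = [r for r in ranked[:fetch] if r[0] not in stale_pks]
        if len(live) >= k or fetch >= len(ranked):
            return live[:k]
        fetch += k - len(live)   # re-query for the shortfall only


active = [(0, 10), (1, 11), (2, 12)]   # (pk, distance), freshest source
flushed = [(0, 1), (3, 2), (4, 3)]     # older source

k = 2
stale_in_flushed = {pk for pk, _ in active}   # keys updated in a fresher source
candidates = live_top_k(active, k, set()) + live_top_k(flushed, k, stale_in_flushed)
result = [pk for pk, _ in sorted(candidates, key=lambda r: r[1])[:k]]
print(result)
```

Mitigation (1) corresponds to starting with a larger fixed `fetch` and skipping the loop, which is cheaper per query but gives no guarantee when many fetched rows turn out to be stale.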

TBH I thought the bloom filter approach was reasonable for stopping duplicates between sources. My thought for duplicates within a source: can we simply add a deletion vector to the flushed memtable (l0 cache) on write, so that the read tooling automatically removes duplicates without having to rebuild indexes on write?

Labels

bug Something isn't working java python
