feat(mem_wal): cache opened L0 flushed-generation datasets#6816

Open
hamersaw wants to merge 3 commits into lance-format:main from hamersaw:feature/cache-l0-reads

@hamersaw
Contributor

Problem

In the LSM scanner, every query against an L0 (frozen/flushed) generation re-opens that generation's Lance dataset from object storage. There are three identical cold-open sites — scan (planner.rs), point lookup (point_lookup.rs), and vector search (vector_search.rs) — each doing DatasetBuilder::from_uri(path).load() with no session. Per query, per flushed generation, this pays: manifest version discovery + manifest read + decode, file-metadata decode, and scalar/vector index load. For an LSM tree, frozen generations are the single best caching target, yet they were the only data source paying full cold-open cost on every query.

Key invariant

Flush writes each generation once to a globally-unique, content-addressed path (memtable/flush.rs). Same path ⟹ same bytes, forever — a cache hit can never be stale. This is the rare cache that needs no TTL and no invalidation for correctness; pruning is desirable only to reclaim memory.

Changes (OSS lance)

Two complementary, independently-useful, opt-in pieces — defaults preserve existing behavior exactly:

  1. with_session plumbing — thread an existing Arc<Session> into the scanner/planners so the first open of each generation populates and reuses the shared index + file-metadata caches. LsmScanner::new defaults this to the base table's session; without_base_table defaults to None.

  2. FlushedDatasetCache — a moka-backed, single-flight cache of Arc<Dataset> keyed by resolved flushed path, owned and sized by the consumer and injected per-request. After the first open, every subsequent query for that generation is a pure Arc::clone with zero object-store I/O. retain_paths(live_paths) prunes retired generations at compaction (memory-only; correctness never depends on it).
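
As a rough illustration of the single-flight behavior described in item 2, here is a minimal sketch that uses only the standard library. It is not the PR's implementation: the PR is moka-backed and async, while this stand-in uses `Mutex` plus `OnceLock` and ignores open failures. `PathCache`, `OpenedDataset`, and `contains` are hypothetical names introduced for the example; only `get_or_open` and `retain_paths` echo names from the description.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for an opened dataset (not the Lance `Dataset` type).
#[derive(Debug)]
pub struct OpenedDataset {
    pub path: String,
}

/// Hypothetical path-keyed cache. Each path owns one slot that is filled at most once.
#[derive(Default)]
pub struct PathCache {
    slots: Mutex<HashMap<String, Arc<OnceLock<Arc<OpenedDataset>>>>>,
}

impl PathCache {
    /// Returns the cached dataset for `path`, running `open` only if no caller
    /// has filled the slot yet. Concurrent callers for the same path block
    /// until the first open finishes, then share its result.
    pub fn get_or_open<F>(&self, path: &str, open: F) -> Arc<OpenedDataset>
    where
        F: FnOnce() -> OpenedDataset,
    {
        // Hold the map lock only long enough to find or create the slot,
        // so a slow open for one path does not block other paths.
        let slot = {
            let mut slots = self.slots.lock().unwrap();
            slots.entry(path.to_string()).or_default().clone()
        };
        Arc::clone(slot.get_or_init(|| Arc::new(open())))
    }

    /// Drops every entry whose path is not in `live`. This only frees memory:
    /// a dropped path is simply re-opened on its next use.
    pub fn retain_paths(&self, live: &[&str]) {
        self.slots
            .lock()
            .unwrap()
            .retain(|path, _| live.contains(&path.as_str()));
    }

    /// Reports whether a slot exists for `path`.
    pub fn contains(&self, path: &str) -> bool {
        self.slots.lock().unwrap().contains_key(path)
    }
}
```

Because the flushed path is immutable, the sketch never needs to compare versions or expire entries. `retain_paths` exists only to bound memory once compaction retires a generation.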

A single shared open_flushed_dataset(path, session, cache) helper replaces all three cold-open sites (repo rule: dedupe logic in 2+ places). None/None reproduces the original behavior precisely, so no existing test changes.
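
The dispatch that such a shared helper performs could look roughly like the sketch below. All types here (`Session`, `Dataset`, `DatasetCache`) are hypothetical stand-ins rather than the Lance types, the function is synchronous where the real one is presumably async, and the cache branch holds a lock across the open for brevity instead of doing a true single-flight.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-ins for the Lance `Session` and `Dataset` types.
pub struct Session;
pub struct Dataset {
    pub path: String,
    pub shares_session: bool,
}

/// Hypothetical consumer-owned cache of opened datasets, keyed by path.
#[derive(Default)]
pub struct DatasetCache {
    entries: Mutex<HashMap<String, Arc<Dataset>>>,
}

/// Simulated cold open. In the real code this is where object-store I/O happens.
fn cold_open(path: &str, session: Option<&Arc<Session>>) -> Dataset {
    Dataset {
        path: path.to_string(),
        shares_session: session.is_some(),
    }
}

/// One entry point for all call sites.
/// With `cache == None`, every call cold-opens, matching the old behavior.
/// With a cache, only the first call per path opens; later calls clone the `Arc`.
pub fn open_flushed(
    path: &str,
    session: Option<&Arc<Session>>,
    cache: Option<&DatasetCache>,
) -> Arc<Dataset> {
    match cache {
        None => Arc::new(cold_open(path, session)),
        Some(cache) => {
            let mut entries = cache.entries.lock().unwrap();
            entries
                .entry(path.to_string())
                .or_insert_with(|| Arc::new(cold_open(path, session)))
                .clone()
        }
    }
}
```

Keeping both arguments optional is what lets existing callers pass `None`/`None` and see no behavioral change.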

data_source.rs / collector.rs are untouched — opening stays lazy inside the planner, preserving bloom-filter pruning on point lookups. Planner wiring uses chainable with_session/with_flushed_cache builder methods rather than constructor changes, keeping new() signatures (and every existing test/bench) untouched.
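
The chainable wiring can be pictured with the following sketch. `Planner`, its fields, and the accessor methods are hypothetical; only the method names `with_session` and `with_flushed_cache` come from the description above.

```rust
use std::sync::Arc;

/// Hypothetical stand-ins for the session and cache types.
pub struct Session;
pub struct FlushedCache;

/// Hypothetical planner whose constructor signature never changes.
pub struct Planner {
    base_uri: String,
    session: Option<Arc<Session>>,
    flushed_cache: Option<Arc<FlushedCache>>,
}

impl Planner {
    /// Existing constructor: both optional pieces default to `None`.
    pub fn new(base_uri: impl Into<String>) -> Self {
        Self {
            base_uri: base_uri.into(),
            session: None,
            flushed_cache: None,
        }
    }

    /// Opt in to sharing index and file-metadata caches through a session.
    pub fn with_session(mut self, session: Arc<Session>) -> Self {
        self.session = Some(session);
        self
    }

    /// Opt in to reusing already-opened flushed datasets.
    pub fn with_flushed_cache(mut self, cache: Arc<FlushedCache>) -> Self {
        self.flushed_cache = Some(cache);
        self
    }

    pub fn base_uri(&self) -> &str {
        &self.base_uri
    }

    pub fn has_session(&self) -> bool {
        self.session.is_some()
    }

    pub fn has_flushed_cache(&self) -> bool {
        self.flushed_cache.is_some()
    }
}
```

A caller that wants both would write `Planner::new(uri).with_session(session).with_flushed_cache(cache)`, while every existing `Planner::new(uri)` call keeps compiling and keeps the cold-open behavior.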

Testing

  • New unit tests for FlushedDatasetCache: miss opens once; hit returns the same Arc (pointer eq); 16-way concurrent get_or_open opens exactly once (single-flight); retain_paths drops the right keys; no-cache path cold-opens each call.
  • Regression: full mem_wal::scanner suite (78 tests) passes untouched.
  • cargo clippy -p lance --tests --benches clean; cargo fmt clean.

Notes

The sophon consumer side (process-bootstrap cache ownership, scanner wiring, compaction retain_paths) is out of scope for this PR. Phase 1 (with_session) is independently shippable ahead of the cache.

🤖 Generated with Claude Code

In the LSM scanner, every query against an L0 flushed generation
re-opened that generation's Lance dataset from object storage at three
identical sites (scan, point lookup, vector search), paying manifest
read + metadata decode + index load each time.

Add two opt-in, non-breaking pieces:

- `with_session` plumbing on the scanner/planners so the first open of
  each generation populates and reuses the shared index/metadata
  caches (defaults to the base table's session).
- `FlushedDatasetCache`: a moka-backed, single-flight cache of
  `Arc<Dataset>` keyed by resolved flushed path, injected by the
  consumer. After the first open, subsequent queries are a pure
  `Arc::clone` with zero object-store I/O.

Flushed generations are written once to a globally-unique immutable
path, so cached entries are never stale and need no TTL; `retain_paths`
pruning at compaction is memory-only and correctness never depends on
it. A single shared `open_flushed_dataset` helper covers all three
sites; `None`/`None` reproduces the original cold-open exactly, so all
existing tests pass untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the enhancement label May 17, 2026

The cache-l0-reads change added `moka` to rust/lance and updated the
root Cargo.lock, but python/ is a separate cargo workspace with its
own lock. CI's "Lint Rust" step runs `cargo clippy --locked` from
python/ and failed at lock resolution before clippy could run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

/// The key is the resolved absolute flushed path
/// (`{base}/_mem_wal/{shard}/{folder}`), which is globally unique, so a single
/// cache can safely span multiple tables.
pub struct FlushedDatasetCache {
Contributor

@jackye1995 jackye1995 May 18, 2026

Thanks for working on this. I think the Session plumbing is absolutely right, but I’m less convinced we should add FlushedDatasetCache to the Lance SDK.

My mental model is that each flushed memtable is already naturally a standalone Lance dataset. When that dataset is opened with the same Lance Session, the SDK-level caches should already cover the Lance-internal reusable state: object store registry/store reuse, file metadata cache, index cache, and index extensions. That part feels like the right SDK responsibility.

Caching the opened Dataset object itself feels like an application-level concern. The right owner of that cache is the calling service/application that knows the lifecycle of the L0 generations, compaction timing, memory budget, tenant/table boundaries, and whether a cache should be per-process, per-table, per-session, or scoped in some other way. I’d prefer not to make Lance SDK own that policy.

What do you think?

Labels

enhancement, python
