feat(mem_wal): cache opened L0 flushed-generation datasets by hamersaw · Pull Request #6816 · lance-format/lance

hamersaw · 2026-05-17T19:11:27Z

Problem

In the LSM scanner, every query against an L0 (frozen/flushed) generation re-opens that generation's Lance dataset from object storage. There are three identical cold-open sites — scan (planner.rs), point lookup (point_lookup.rs), and vector search (vector_search.rs) — each doing DatasetBuilder::from_uri(path).load() with no session. Per query, per flushed generation, this pays: manifest version discovery + manifest read + decode, file-metadata decode, and scalar/vector index load. For an LSM tree, frozen generations are the single best caching target, yet they were the only data source paying full cold-open cost on every query.

Key invariant

Flush writes each generation once to a globally-unique, content-addressed path (memtable/flush.rs). Same path ⟹ same bytes, forever — a cache hit can never be stale. This is the rare cache that needs no TTL and no invalidation for correctness; pruning is desirable only to reclaim memory.

Changes (OSS `lance`)

Two complementary, independently-useful, opt-in pieces — defaults preserve existing behavior exactly:

- with_session plumbing — thread an existing Arc<Session> into the scanner/planners so the first open of each generation populates and reuses the shared index + file-metadata caches. LsmScanner::new defaults this to the base table's session; without_base_table defaults to None.
- FlushedDatasetCache — a moka-backed, single-flight cache of Arc<Dataset> keyed by resolved flushed path, owned and sized by the consumer and injected per-request. After the first open, every subsequent query for that generation is a pure Arc::clone with zero object-store I/O. retain_paths(live_paths) prunes retired generations at compaction (memory-only; correctness never depends on it).
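The single-flight and `retain_paths` behavior described above can be illustrated with a minimal, std-only sketch. This is not the PR's implementation: the PR uses moka and opens a real `Dataset` asynchronously and fallibly, whereas `OpenedDataset`, `FlushedCacheSketch`, `len`, and `open_count` here are hypothetical names, and the open is synchronous and infallible to keep the example short.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

/// Hypothetical stand-in for an opened Lance dataset.
#[derive(Debug)]
pub struct OpenedDataset {
    pub path: String,
}

/// Std-only sketch of a single-flight cache keyed by flushed path.
#[derive(Default)]
pub struct FlushedCacheSketch {
    // One slot per path. The `OnceLock` makes concurrent callers for the
    // same path wait on a single open instead of each opening the dataset.
    slots: Mutex<HashMap<String, Arc<OnceLock<Arc<OpenedDataset>>>>>,
    // Counts real opens so the single-flight property can be observed.
    opens: AtomicUsize,
}

impl FlushedCacheSketch {
    pub fn get_or_open(&self, path: &str) -> Arc<OpenedDataset> {
        // Hold the map lock only long enough to find or create the slot.
        let slot = {
            let mut slots = self.slots.lock().unwrap();
            slots.entry(path.to_string()).or_default().clone()
        };
        // Exactly one caller runs the closure; the others block, then clone.
        Arc::clone(slot.get_or_init(|| {
            self.opens.fetch_add(1, Ordering::SeqCst);
            Arc::new(OpenedDataset { path: path.to_string() })
        }))
    }

    /// Drop entries whose path is no longer live (memory reclamation only).
    pub fn retain_paths(&self, live: &HashSet<String>) {
        self.slots
            .lock()
            .unwrap()
            .retain(|path, _| live.contains(path));
    }

    pub fn len(&self) -> usize {
        self.slots.lock().unwrap().len()
    }

    pub fn open_count(&self) -> usize {
        self.opens.load(Ordering::SeqCst)
    }
}
```

Because a flushed path is written once and never rewritten, the sketch needs no TTL or invalidation hook; `retain_paths` exists only to bound memory.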

A single shared open_flushed_dataset(path, session, cache) helper replaces all three cold-open sites (repo rule: dedupe logic in 2+ places). None/None reproduces the original behavior precisely, so no existing test changes.
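A rough sketch of how one helper with two optional arguments can serve all call sites while keeping the `None`/`None` path identical to a plain cold open. The types `Session`, `Dataset`, and `DatasetCache` below are simplified placeholders, `cold_open` is a hypothetical stand-in for `DatasetBuilder::from_uri(path).load()`, and the real helper's signature, async behavior, and error handling may differ.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Placeholder for a shared session (not the real `lance` type).
pub struct Session;

/// Placeholder for an opened dataset; records whether a session was attached.
pub struct Dataset {
    pub path: String,
    pub has_session: bool,
}

/// Placeholder cache: path -> opened dataset.
#[derive(Default)]
pub struct DatasetCache {
    entries: Mutex<HashMap<String, Arc<Dataset>>>,
}

/// Hypothetical cold open: builds a fresh dataset handle on every call.
fn cold_open(path: &str, session: Option<&Arc<Session>>) -> Arc<Dataset> {
    Arc::new(Dataset {
        path: path.to_string(),
        has_session: session.is_some(),
    })
}

/// Shared helper sketch: with no cache it is exactly a cold open; with a
/// cache it opens at most once per path and hands out clones afterwards.
pub fn open_flushed_dataset(
    path: &str,
    session: Option<&Arc<Session>>,
    cache: Option<&DatasetCache>,
) -> Arc<Dataset> {
    match cache {
        None => cold_open(path, session),
        Some(cache) => {
            // Simplification: the map lock is held across the open, which
            // serializes opens for all paths, not only for the same path.
            let mut entries = cache.entries.lock().unwrap();
            entries
                .entry(path.to_string())
                .or_insert_with(|| cold_open(path, session))
                .clone()
        }
    }
}
```

Keeping both arguments optional is what lets existing callers and tests stay unchanged: they pass nothing and get the previous behavior.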

data_source.rs / collector.rs are untouched — opening stays lazy inside the planner, preserving bloom-filter pruning on point lookups. Planner wiring uses chainable with_session/with_flushed_cache builder methods rather than constructor changes, keeping new() signatures (and every existing test/bench) untouched.
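The chainable-builder wiring mentioned above might look roughly like the following. `PlannerSketch`, `SessionPlaceholder`, `CachePlaceholder`, and the `has_*` accessors are hypothetical; only the method names `new`, `with_session`, and `with_flushed_cache` come from the PR description, and their real signatures are not shown in this conversation.

```rust
use std::sync::Arc;

/// Placeholder types standing in for the real session and cache.
pub struct SessionPlaceholder;
pub struct CachePlaceholder;

/// Hypothetical planner: both optional fields default to `None`, so
/// `new()` keeps its signature and its original behavior.
pub struct PlannerSketch {
    session: Option<Arc<SessionPlaceholder>>,
    flushed_cache: Option<Arc<CachePlaceholder>>,
}

impl PlannerSketch {
    pub fn new() -> Self {
        Self {
            session: None,
            flushed_cache: None,
        }
    }

    /// Opt in to a shared session; consumes and returns `self` for chaining.
    pub fn with_session(mut self, session: Arc<SessionPlaceholder>) -> Self {
        self.session = Some(session);
        self
    }

    /// Opt in to a consumer-owned dataset cache.
    pub fn with_flushed_cache(mut self, cache: Arc<CachePlaceholder>) -> Self {
        self.flushed_cache = Some(cache);
        self
    }

    pub fn has_session(&self) -> bool {
        self.session.is_some()
    }

    pub fn has_flushed_cache(&self) -> bool {
        self.flushed_cache.is_some()
    }
}
```

A caller that wants both features would write `PlannerSketch::new().with_session(s).with_flushed_cache(c)`, while existing call sites that only use `new()` compile and behave as before.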

Testing

- New unit tests for FlushedDatasetCache: miss opens once; hit returns the same Arc (pointer eq); 16-way concurrent get_or_open opens exactly once (single-flight); retain_paths drops the right keys; no-cache path cold-opens each call.
- Regression: full mem_wal::scanner suite (78 tests) passes untouched.
- cargo clippy -p lance --tests --benches clean; cargo fmt clean.

Notes

The sophon consumer side (process-bootstrap cache ownership, scanner wiring, compaction retain_paths) is out of scope for this PR. Phase 1 (with_session) is independently shippable ahead of the cache.

🤖 Generated with Claude Code

In the LSM scanner, every query against an L0 flushed generation re-opened that generation's Lance dataset from object storage at three identical sites (scan, point lookup, vector search), paying manifest read + metadata decode + index load each time.

Add two opt-in, non-breaking pieces:

- `with_session` plumbing on the scanner/planners so the first open of each generation populates and reuses the shared index/metadata caches (defaults to the base table's session).
- `FlushedDatasetCache`: a moka-backed, single-flight cache of `Arc<Dataset>` keyed by resolved flushed path, injected by the consumer. After the first open, subsequent queries are a pure `Arc::clone` with zero object-store I/O.

Flushed generations are written once to a globally-unique immutable path, so cached entries are never stale and need no TTL; `retain_paths` pruning at compaction is memory-only and correctness never depends on it.

A single shared `open_flushed_dataset` helper covers all three sites; `None`/`None` reproduces the original cold-open exactly, so all existing tests pass untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

codecov · 2026-05-17T19:42:42Z

Codecov Report

❌ Patch coverage is 77.55102% with 44 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/dataset/mem_wal/scanner/builder.rs | 45.00% | 9 Missing and 2 partials ⚠️ |
| ...lance/src/dataset/mem_wal/scanner/flushed_cache.rs | 92.70% | 6 Missing and 4 partials ⚠️ |
| .../lance/src/dataset/mem_wal/scanner/point_lookup.rs | 30.76% | 8 Missing and 1 partial ⚠️ |
| ...lance/src/dataset/mem_wal/scanner/vector_search.rs | 30.76% | 8 Missing and 1 partial ⚠️ |
| rust/lance/src/dataset/mem_wal/scanner/planner.rs | 61.53% | 4 Missing and 1 partial ⚠️ |


The cache-l0-reads change added `moka` to rust/lance and updated the root Cargo.lock, but python/ is a separate cargo workspace with its own lock. CI's "Lint Rust" step runs `cargo clippy --locked` from python/ and failed at lock resolution before clippy could run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

jackye1995 · 2026-05-18T05:10:40Z

+/// The key is the resolved absolute flushed path
+/// (`{base}/_mem_wal/{shard}/{folder}`), which is globally unique, so a single
+/// cache can safely span multiple tables.
+pub struct FlushedDatasetCache {


Thanks for working on this. I think the Session plumbing is absolutely right, but I’m less convinced we should add FlushedDatasetCache to the Lance SDK.

My mental model is that each flushed memtable is already naturally a standalone Lance dataset. When that dataset is opened with the same Lance Session, the SDK-level caches should already cover the Lance-internal reusable state: object store registry/store reuse, file metadata cache, index cache, and index extensions. That part feels like the right SDK responsibility.

Caching the opened Dataset object itself feels like an application-level concern. The right owner of that cache is the calling service/application that knows the lifecycle of the L0 generations, compaction timing, memory budget, tenant/table boundaries, and whether a cache should be per-process, per-table, per-session, or scoped in some other way. I’d prefer not to make Lance SDK own that policy.

What do you think?

claude Bot reviewed May 17, 2026


github-actions Bot added the enhancement New feature or request label May 17, 2026

github-actions Bot added the python label May 17, 2026

jackye1995 reviewed May 18, 2026


Merge remote-tracking branch 'upstream/main' into feature/cache-l0-reads

f88c672

hamersaw wants to merge 3 commits into lance-format:main from hamersaw:feature/cache-l0-reads


2 participants
