refactor: remove Java-side dataset cache, rely on Rust-side Session #353
Conversation
hamersaw
left a comment
Do you think it would make sense to just remove the LanceDatasetCache and house fragment caching in the LanceRuntime? It feels like there are multiple layers of cache here.
625df7c followed this approach.

I think overall this looks reasonable, but we need some relatively comprehensive benchmarks to support any performance implications.
…ches-335

# Conflicts:
#	lance-spark-base_2.12/src/main/java/org/lance/spark/internal/LanceDatasetCache.java
#	lance-spark-base_2.12/src/main/java/org/lance/spark/read/LanceCountStarPartitionReader.java
Replace non-existent LanceNamespaceStorageOptionsProvider with LanceNamespace from getOrCreateNamespace(), matching the namespace API used by OpenDatasetBuilder. Add missing java.util.List import.
Adds FragmentLoadingBenchmarkTest to quantify the performance difference between getFragments() (old eager approach, O(N)) and getFragment(id) (new lazy approach, O(1)). Results on datasets with 10-1000 fragments show 10x-609x speedup for the lazy approach, confirming the motivation for PR lance-format#353. Tagged with @Tag("benchmark") to exclude from normal test runs.
Add test.excludedGroups property (default: benchmark) to surefire config so @Tag("benchmark") tests are excluded from normal mvn test runs. Override with -Dtest.excludedGroups= to include them. Run benchmarks with:

mvn test -Dtest=FragmentLoadingBenchmarkTest \
  -Dtest.excludedGroups= -Dgroups=benchmark
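A rough sketch of how that surefire wiring could look in the pom. The property name test.excludedGroups comes from the commit message above and excludedGroups is standard maven-surefire-plugin configuration, but the exact pom layout here is an assumption, not the actual diff:

```xml
<!-- in <properties>: benchmark-tagged tests are skipped by default;
     override on the command line with -Dtest.excludedGroups= -->
<properties>
  <test.excludedGroups>benchmark</test.excludedGroups>
</properties>

<!-- in <build><plugins>: pass the property to surefire's JUnit 5 group filter -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <excludedGroups>${test.excludedGroups}</excludedGroups>
  </configuration>
</plugin>
```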
 * <p>Tagged with "benchmark" so it is excluded from normal test runs.
 */
@Tag("benchmark")
public class FragmentLoadingBenchmarkTest {
Is this microbenchmark sufficient to demonstrate the effect? We can run
mvn test -pl lance-spark-base_2.12 \
-Dtest=FragmentLoadingBenchmarkTest \
-Dtest.excludedGroups= \
-Dgroups=benchmark
to verify:
=== Fragment Loading Benchmark ===
Fragments | getFragments() (ms) | getFragment(id) (ms) | Speedup
----------------------------------------------------------------------
10 | 0.082 ms | 0.007 ms | 11.0x
50 | 0.329 ms | 0.008 ms | 43.2x
100 | 0.630 ms | 0.010 ms | 65.9x
500 | 2.929 ms | 0.008 ms | 380.9x
1000 | 6.064 ms | 0.008 ms | 760.4x
Notes:
- getFragments(): loads ALL fragment metadata (old eager approach)
- getFragment(id): loads ONE fragment by ID (new lazy approach)
- Each worker partition only needs one fragment, so the lazy approach avoids
loading metadata for all other fragments in the dataset.
The project does not integrate JMH yet, so this benchmark is relatively simple.
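Since JMH isn't wired in, the shape of the comparison can also be illustrated without Lance at all: a hypothetical in-memory loader that counts how many fragment-metadata loads each access pattern performs. The names below (EagerVsLazy, FragmentStore, loadMetadata) are invented for this sketch and are not the Lance API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the two access patterns: the eager path materializes
// metadata for every fragment in the dataset, the lazy path touches only the
// one fragment a partition actually reads.
public class EagerVsLazy {
    static final class FragmentStore {
        final int fragmentCount;
        final AtomicInteger loads = new AtomicInteger(); // counts metadata loads

        FragmentStore(int fragmentCount) { this.fragmentCount = fragmentCount; }

        String loadMetadata(int id) {          // stands in for one manifest decode
            loads.incrementAndGet();
            return "fragment-" + id;
        }

        // old approach: getFragments() — O(N) loads per cache miss
        Map<Integer, String> getFragments() {
            Map<Integer, String> all = new HashMap<>();
            for (int id = 0; id < fragmentCount; id++) {
                all.put(id, loadMetadata(id));
            }
            return all;
        }

        // new approach: getFragment(id) — O(1) loads
        String getFragment(int id) { return loadMetadata(id); }
    }

    public static void main(String[] args) {
        FragmentStore eager = new FragmentStore(1000);
        eager.getFragments().get(42);                 // partition needs fragment 42 only
        System.out.println("eager loads: " + eager.loads.get()); // 1000

        FragmentStore lazy = new FragmentStore(1000);
        lazy.getFragment(42);
        System.out.println("lazy loads: " + lazy.loads.get());   // 1
    }
}
```

The load counts, not wall-clock time, are what make the benchmark's O(N)-vs-O(1) claim deterministic.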
If end-to-end impact is needed, we are running tests on a 1TB TPC-DS dataset, but this improvement likely won’t show a noticeable difference there.
@summaryzb Do you have any test results about this?
BatchScanExec.equals() compares batch objects via equals(), which delegates to LanceScan since it implements Batch. Without overriding equals/hashCode, Object identity is used, so two scans of the same table are never equal and Spark cannot reuse exchanges. Compare schema, readOptions, filters, limit, offset, topN, and aggregation. Exclude scanId (per-instance UUID for tracing only).
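A minimal sketch of that equals/hashCode pattern. The Scan class below is a hypothetical stand-in mirroring the fields listed in the commit message, not the actual LanceScan source; the point is that equality covers the query shape and excludes the per-instance scanId:

```java
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.UUID;

// Hypothetical stand-in for LanceScan: equals/hashCode cover the fields that
// define the scan's result set, and deliberately exclude scanId (a per-instance
// tracing UUID), so two scans of the same table compare equal and Spark's
// BatchScanExec can reuse exchanges.
public class ScanEquality {
    static final class Scan {
        final String schema;                   // simplified stand-ins for real types
        final Map<String, String> readOptions;
        final List<String> filters;
        final Integer limit;
        final Integer offset;
        final String scanId = UUID.randomUUID().toString(); // tracing only, excluded

        Scan(String schema, Map<String, String> readOptions, List<String> filters,
             Integer limit, Integer offset) {
            this.schema = schema;
            this.readOptions = readOptions;
            this.filters = filters;
            this.limit = limit;
            this.offset = offset;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Scan)) return false;
            Scan s = (Scan) o;
            return schema.equals(s.schema)
                && readOptions.equals(s.readOptions)
                && filters.equals(s.filters)
                && Objects.equals(limit, s.limit)
                && Objects.equals(offset, s.offset);
            // scanId intentionally not compared
        }

        @Override
        public int hashCode() {
            return Objects.hash(schema, readOptions, filters, limit, offset);
        }
    }

    public static void main(String[] args) {
        Scan a = new Scan("s", Map.of("k", "v"), List.of("x > 1"), 10, 0);
        Scan b = new Scan("s", Map.of("k", "v"), List.of("x > 1"), 10, 0);
        System.out.println(a.equals(b) && a.hashCode() == b.hashCode()); // true
        System.out.println(a.scanId.equals(b.scanId));                   // false
    }
}
```

Without the override, Object identity makes a and b unequal even though they describe the same scan, which is exactly the exchange-reuse failure described above.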
This reverts commit a34190c.
Ok, so diving a little deeper. The rust-side implementation has its own caches for metadata (LANCE_METADATA_CACHE_SIZE) and indexes (LANCE_INDEX_CACHE_SIZE). By caching the actual dataset handle, we're really only saving time in deserializing the dataset manifest bytes, which should be measurable in microseconds (or milliseconds at worst). I'm wondering if we should just remove the notion of Spark-side caches for everything (dataset + fragment) and rely on the rust-side Lance cache. This is basically what the …
That makes sense. Since the Session already caches metadata and index blocks on the Rust side, the Java-side …

I'll also enhance … Do you think this is reasonable?
I'm not following exactly where these are going to be inserted. This gets a little bit tricky because these are global session configuration options and may not be reasonably applied to a single dataset. Maybe it's worth hacking together a proposal and we can iterate?
@hamersaw I have refactored the code, removed …
readOptions,
fragmentId,
dataset =
    LanceRuntime.openDataset(
Let me check if we can use the utility methods from Utils here.
inputPartition.getInitialStorageOptions(),
inputPartition.getNamespaceImpl(),
inputPartition.getNamespaceProperties());
dataset =
@hamersaw On your earlier concern about blockSize/indexCacheSize/metadataCacheSize: this PR doesn't add new API for them now, but switching to Utils.openDatasetBuilder does mean the fragment-scan path now passes these through (previously they were dropped by LanceDatasetCache). This matches what LanceCountStarPartitionReader and ~27 other call sites already do. WDYT?
The refactor to remove LanceDatasetCache dropped namespace reconstruction on executors. Vended credentials (STS tokens) need the namespace client to refresh during long-running scans.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
hamersaw
left a comment
Thanks, I think this looks great! I added the namespace build back in so that we can ensure credential refresh on long-running operations.
Thank you @hamersaw
Summary
- Remove LanceDatasetCache; the Rust-side Session already caches metadata and index blocks, so a Java-side Dataset cache only saves manifest deserialization (microseconds).
- Load dataset.getFragment(id) per partition, instead of eagerly building Map<Integer, Fragment> from dataset.getFragments() on every cache miss — O(N) in total fragment count, even though each partition reads one fragment.
- LanceFragmentScanner opens, owns, and closes its own Dataset through a new LanceRuntime.openDataset() helper.

Changes
- LanceDatasetCache.java: removed.
- LanceRuntime.java: new openDataset() — wires the catalog Session, reconstructs the namespace from the (namespaceImpl, namespaceProperties) strings carried in LanceInputPartition, merges storage options, and pins the version.
- LanceFragmentScanner.java: create() opens its own Dataset and loads only the target fragment; close() closes both scanner and dataset, using Throwable.addSuppressed so the first failure isn't masked by cleanup.

Before / after
The per-catalog Rust Session (configured via LANCE_INDEX_CACHE_SIZE / LANCE_METADATA_CACHE_SIZE) and the global Arrow BufferAllocator are unchanged.
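The close() cleanup described under Changes — close the scanner, still close the dataset if that fails, and keep the first failure primary via Throwable.addSuppressed — can be sketched generically. The Resource interface and closeBoth helper below are invented stand-ins for this sketch, not the actual LanceFragmentScanner code:

```java
// Hypothetical stand-ins showing the Throwable.addSuppressed cleanup pattern:
// if closing the scanner fails, the dataset is still closed, and any failure in
// that cleanup is attached as a suppressed exception instead of masking the
// original one.
public class CloseBoth {
    interface Resource { void close() throws Exception; }

    static void closeBoth(Resource scanner, Resource dataset) throws Exception {
        Exception primary = null;
        try {
            scanner.close();
        } catch (Exception e) {
            primary = e;
        }
        try {
            dataset.close();
        } catch (Exception e) {
            if (primary != null) {
                primary.addSuppressed(e); // don't mask the first failure
            } else {
                primary = e;
            }
        }
        if (primary != null) throw primary;
    }

    public static void main(String[] args) {
        Resource failingScanner = () -> { throw new Exception("scanner failed"); };
        Resource failingDataset = () -> { throw new Exception("dataset failed"); };
        try {
            closeBoth(failingScanner, failingDataset);
        } catch (Exception e) {
            System.out.println(e.getMessage());                    // scanner failed
            System.out.println(e.getSuppressed()[0].getMessage()); // dataset failed
        }
    }
}
```

A try-with-resources statement over both resources produces the same suppressed-exception behavior automatically, which is the usual alternative when both resources are AutoCloseable.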