feat(btree): support distributed range-partitioned BTree index build#6859
Draft
Jay-ju wants to merge 10 commits into
Draft
feat(btree): support distributed range-partitioned BTree index build#6859Jay-ju wants to merge 10 commits into
Jay-ju wants to merge 10 commits into
Conversation
This change completes the end-to-end distributed BTree index construction
(range-partitioned mode), fixes a correctness bug on the fragment-partitioned
path that was exposed once K-way sort-merge was removed, and tightens
transaction safety by deferring shard cleanup until after commit succeeds.
What's new
----------
- DistributedMode { Single, Fragment, Range } selects the build strategy.
- range_partitions(N) on CreateIndexBuilder triggers range-partitioned
construction:
1. Quantile sampling of column values (with dedup) to compute N-1 boundaries
2. Per-partition independent training that emits a BTreeShardManifest
3. Merging of all shard manifests into the final index files
- New BTreeShardManifest carries the per-shard build output. Marked
#[non_exhaustive] so future fields can be added without breaking callers.
- Sample size cap raised from 100k to 1M and values are deduplicated to
reduce skew from high-frequency keys.
Bug fix: pages_between with overlapping pages
---------------------------------------------
On the fragment-partitioned path each fragment is trained independently, so
page key ranges may overlap across fragments. The previous largest_node_less
peek-one-step search could miss pages whose min is below the lower bound but
max is in range. Fixed by adding a may_overlap flag on BTreeLookup; when set,
the search starts from Bound::Unbounded. Range-partitioned indices keep the
existing tight search.
Transaction safety: defer cleanup until after commit
----------------------------------------------------
cleanup_shard_files was previously invoked inside build_btree_index_partitioned
and merge_index_metadata, before the caller's transaction was committed. If
the commit fails, the shard files are already deleted and the operation
cannot be retried.
Both paths now return the MergeResult to the caller and cleanup is deferred:
- CreateIndexBuilder::execute stores a PendingShardCleanup and runs it after
apply_commit returns Ok.
- merge_index_metadata returns Option<MergeResult>; the Python binding exposes
a BTreeMergeResult class with .cleanup() that callers must invoke after
their own commit succeeds. The Java binding keeps the legacy immediate-
cleanup behavior with a TODO for follow-up design work.
API hardening
-------------
- range_partitions() asserts num_partitions >= 2 with a descriptive message.
- Non-BTree index types reject range_partitions with a clear error.
- train=false combined with range_partitions returns an error instead of
silently building an empty partitioned index.
- transaction_properties() and range_partitions() now have rustdoc.
- Python create_scalar_index/create_index docstrings document the new
range_partitions parameter.
Tests
-----
- test_btree_index_with_range_partitions asserts the post-merge file layout:
shard part_*_page_lookup.lance files are cleaned up, multiple
part_*_page_data.lance remain as final artifacts, and exactly one merged
page_lookup.lance is present.
- test_partitioned_batch_crosses_boundary, test_train_btree_index_partitioned,
test_partitioned_multiple_boundaries, test_fragment_btree_index_consistency
and test_fragment_btree_index_boundary_queries cover the distributed build
and may_overlap correctness fix.
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
1. Fix biased sampling in sample_partition_boundaries: replace ORDER BY + LIMIT (which only samples smallest values) with stride-based sampling over sorted data to cover full value range. 2. Preserve partition structure in derive_index_params: add num_partitions field to BTreeParameters so that index initialization and updates maintain range partition structure. 3. Fix shard file leak in execute_uncommitted path: expose PendingShardCleanup as public with a cleanup() method and add take_pending_cleanup() API for callers who commit manually. 4. Add boundary monotonicity validation in train_btree_index_partitioned: reject non-ascending or duplicate boundaries at function entry.
…anifests APIs - Add MergeResult struct with part_lookup_files and part_page_files - Add cleanup_shard_files async function for deferred shard cleanup - Add merge_from_manifests for merging partition manifests - Change train_btree_index_partitioned to return Vec<String> - Change merge_index_files to return MergeResult - Fix all callers of train_btree_index for DistributedMode enum - Fix BTreeParameters missing num_partitions field - Fix range_id reference in train_btree_index - Fix Python binding compilation errors - Fix Rust format and clippy warnings
- cleanup_shard_files no longer deletes part_*_page_data.lance files which are still needed for range-partitioned queries - Fix test_merge_index_metadata_btree_reports_progress to accept merge_pages events (fragment-partitioned path) instead of only merge_lookups events (range-partitioned path)
Xuanwo
reviewed
May 20, 2026
| batch_readhead: Optional[int] = None, | ||
| progress_callback: Optional[Callable[[IndexProgress], None]] = None, | ||
| ): ... | ||
| ) -> Optional[BTreeMergeResult]: ... |
Collaborator
There was a problem hiding this comment.
merge_index_metadata is considered deprecated. Please using merge_existing_index_segments and commit_existing_index_segments instead to build index segments.
#[non_exhaustive] structs cannot be constructed with struct literal syntax from outside their defining crate. Add a public new() constructor and update the Python binding to use it.
Same issue as Python binding: from_dataset_for_new is a trait method on LanceIndexStoreExt, not an inherent method on LanceIndexStore.
The second RT.block_on() for cleanup_shard_files was called outside the tokio runtime context, causing 'there is no reactor running' panic that crashed the JVM (SIGABRT). Merging both async operations into a single RT.block_on() block ensures they run within the same tokio runtime.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This change completes the end-to-end distributed BTree index construction (range-partitioned mode), fixes a correctness bug on the fragment-partitioned path that was exposed once K-way sort-merge was removed, and tightens transaction safety by deferring shard cleanup until after commit succeeds.
What's new
Bug fix: pages_between with overlapping pages
On the fragment-partitioned path each fragment is trained independently, so page key ranges may overlap across fragments. The previous largest_node_less peek-one-step search could miss pages whose min is below the lower bound but max is in range. Fixed by adding a may_overlap flag on BTreeLookup; when set, the search starts from Bound::Unbounded. Range-partitioned indices keep the existing tight search.
Transaction safety: defer cleanup until after commit ---------------------------------------------------- cleanup_shard_files was previously invoked inside build_btree_index_partitioned and merge_index_metadata, before the caller's transaction was committed. If the commit fails, the shard files are already deleted and the operation cannot be retried.
Both paths now return the MergeResult to the caller and cleanup is deferred:
API hardening
Tests