Skip to content

feat(btree): support distributed range-partitioned BTree index build#6859

Draft
Jay-ju wants to merge 10 commits into
lance-format:mainfrom
Jay-ju:feat/btree-distributed-build
Draft

feat(btree): support distributed range-partitioned BTree index build#6859
Jay-ju wants to merge 10 commits into
lance-format:mainfrom
Jay-ju:feat/btree-distributed-build

Conversation

@Jay-ju
Copy link
Copy Markdown
Contributor

@Jay-ju Jay-ju commented May 20, 2026

This change completes the end-to-end distributed BTree index construction (range-partitioned mode), fixes a correctness bug on the fragment-partitioned path that was exposed once K-way sort-merge was removed, and tightens transaction safety by deferring shard cleanup until after commit succeeds.

What's new

  • DistributedMode { Single, Fragment, Range } selects the build strategy.
  • range_partitions(N) on CreateIndexBuilder triggers range-partitioned construction:
    1. Quantile sampling of column values (with dedup) to compute N-1 boundaries
    2. Per-partition independent training that emits a BTreeShardManifest
    3. Merging of all shard manifests into the final index files
  • New BTreeShardManifest carries the per-shard build output. Marked #[non_exhaustive] so future fields can be added without breaking callers.
  • Sample size cap raised from 100k to 1M and values are deduplicated to reduce skew from high-frequency keys.

Bug fix: pages_between with overlapping pages

On the fragment-partitioned path each fragment is trained independently, so page key ranges may overlap across fragments. The previous largest_node_less peek-one-step search could miss pages whose min is below the lower bound but max is in range. Fixed by adding a may_overlap flag on BTreeLookup; when set, the search starts from Bound::Unbounded. Range-partitioned indices keep the existing tight search.

Transaction safety: defer cleanup until after commit ---------------------------------------------------- cleanup_shard_files was previously invoked inside build_btree_index_partitioned and merge_index_metadata, before the caller's transaction was committed. If the commit fails, the shard files are already deleted and the operation cannot be retried.

Both paths now return the MergeResult to the caller and cleanup is deferred:

  • CreateIndexBuilder::execute stores a PendingShardCleanup and runs it after apply_commit returns Ok.
  • merge_index_metadata returns Option; the Python binding exposes a BTreeMergeResult class with .cleanup() that callers must invoke after their own commit succeeds. The Java binding keeps the legacy immediate- cleanup behavior with a TODO for follow-up design work.

API hardening

  • range_partitions() asserts num_partitions >= 2 with a descriptive message.
  • Non-BTree index types reject range_partitions with a clear error.
  • train=false combined with range_partitions returns an error instead of silently building an empty partitioned index.
  • transaction_properties() and range_partitions() now have rustdoc.
  • Python create_scalar_index/create_index docstrings document the new range_partitions parameter.

Tests

  • test_btree_index_with_range_partitions asserts the post-merge file layout: shard part_page_lookup.lance files are cleaned up, multiple part_page_data.lance remain as final artifacts, and exactly one merged page_lookup.lance is present.
  • test_partitioned_batch_crosses_boundary, test_train_btree_index_partitioned, test_partitioned_multiple_boundaries, test_fragment_btree_index_consistency and test_fragment_btree_index_boundary_queries cover the distributed build and may_overlap correctness fix.

This change completes the end-to-end distributed BTree index construction
(range-partitioned mode), fixes a correctness bug on the fragment-partitioned
path that was exposed once K-way sort-merge was removed, and tightens
transaction safety by deferring shard cleanup until after commit succeeds.

What's new
----------
- DistributedMode { Single, Fragment, Range } selects the build strategy.
- range_partitions(N) on CreateIndexBuilder triggers range-partitioned
  construction:
  1. Quantile sampling of column values (with dedup) to compute N-1 boundaries
  2. Per-partition independent training that emits a BTreeShardManifest
  3. Merging of all shard manifests into the final index files
- New BTreeShardManifest carries the per-shard build output. Marked
  #[non_exhaustive] so future fields can be added without breaking callers.
- Sample size cap raised from 100k to 1M and values are deduplicated to
  reduce skew from high-frequency keys.

Bug fix: pages_between with overlapping pages
---------------------------------------------
On the fragment-partitioned path each fragment is trained independently, so
page key ranges may overlap across fragments. The previous largest_node_less
peek-one-step search could miss pages whose min is below the lower bound but
max is in range. Fixed by adding a may_overlap flag on BTreeLookup; when set,
the search starts from Bound::Unbounded. Range-partitioned indices keep the
existing tight search.

Transaction safety: defer cleanup until after commit
----------------------------------------------------
cleanup_shard_files was previously invoked inside build_btree_index_partitioned
and merge_index_metadata, before the caller's transaction was committed. If
the commit fails, the shard files are already deleted and the operation
cannot be retried.

Both paths now return the MergeResult to the caller and cleanup is deferred:
- CreateIndexBuilder::execute stores a PendingShardCleanup and runs it after
  apply_commit returns Ok.
- merge_index_metadata returns Option<MergeResult>; the Python binding exposes
  a BTreeMergeResult class with .cleanup() that callers must invoke after
  their own commit succeeds. The Java binding keeps the legacy immediate-
  cleanup behavior with a TODO for follow-up design work.

API hardening
-------------
- range_partitions() asserts num_partitions >= 2 with a descriptive message.
- Non-BTree index types reject range_partitions with a clear error.
- train=false combined with range_partitions returns an error instead of
  silently building an empty partitioned index.
- transaction_properties() and range_partitions() now have rustdoc.
- Python create_scalar_index/create_index docstrings document the new
  range_partitions parameter.

Tests
-----
- test_btree_index_with_range_partitions asserts the post-merge file layout:
  shard part_*_page_lookup.lance files are cleaned up, multiple
  part_*_page_data.lance remain as final artifacts, and exactly one merged
  page_lookup.lance is present.
- test_partitioned_batch_crosses_boundary, test_train_btree_index_partitioned,
  test_partitioned_multiple_boundaries, test_fragment_btree_index_consistency
  and test_fragment_btree_index_boundary_queries cover the distributed build
  and may_overlap correctness fix.
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@Jay-ju Jay-ju marked this pull request as draft May 20, 2026 06:10
@github-actions github-actions Bot added enhancement New feature or request python java labels May 20, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Jay-ju added 3 commits May 20, 2026 15:55
1. Fix biased sampling in sample_partition_boundaries: replace
   ORDER BY + LIMIT (which only samples smallest values) with
   stride-based sampling over sorted data to cover full value range.

2. Preserve partition structure in derive_index_params: add
   num_partitions field to BTreeParameters so that index
   initialization and updates maintain range partition structure.

3. Fix shard file leak in execute_uncommitted path: expose
   PendingShardCleanup as public with a cleanup() method and add
   take_pending_cleanup() API for callers who commit manually.

4. Add boundary monotonicity validation in train_btree_index_partitioned:
   reject non-ascending or duplicate boundaries at function entry.
…anifests APIs

- Add MergeResult struct with part_lookup_files and part_page_files
- Add cleanup_shard_files async function for deferred shard cleanup
- Add merge_from_manifests for merging partition manifests
- Change train_btree_index_partitioned to return Vec<String>
- Change merge_index_files to return MergeResult
- Fix all callers of train_btree_index for DistributedMode enum
- Fix BTreeParameters missing num_partitions field
- Fix range_id reference in train_btree_index
- Fix Python binding compilation errors
- Fix Rust format and clippy warnings
- cleanup_shard_files no longer deletes part_*_page_data.lance files
  which are still needed for range-partitioned queries
- Fix test_merge_index_metadata_btree_reports_progress to accept
  merge_pages events (fragment-partitioned path) instead of only
  merge_lookups events (range-partitioned path)
batch_readhead: Optional[int] = None,
progress_callback: Optional[Callable[[IndexProgress], None]] = None,
): ...
) -> Optional[BTreeMergeResult]: ...
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge_index_metadata is considered deprecated. Please using merge_existing_index_segments and commit_existing_index_segments instead to build index segments.

Jay-ju added 6 commits May 20, 2026 19:40
#[non_exhaustive] structs cannot be constructed with struct literal
syntax from outside their defining crate. Add a public new() constructor
and update the Python binding to use it.
Same issue as Python binding: from_dataset_for_new is a trait method
on LanceIndexStoreExt, not an inherent method on LanceIndexStore.
The second RT.block_on() for cleanup_shard_files was called outside the
tokio runtime context, causing 'there is no reactor running' panic that
crashed the JVM (SIGABRT). Merging both async operations into a single
RT.block_on() block ensures they run within the same tokio runtime.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants