feat: add early termination for compaction plan with max_compaction_bytes option #6890

Open
Jay-ju wants to merge 4 commits into
lance-format:main from
Jay-ju:feat/compaction-plan-early-termination

@Jay-ju Jay-ju commented May 21, 2026

Summary

Add budget-based early termination to DefaultCompactionPlanner to prevent OOM when planning compaction on datasets with many fragments (e.g., hundreds of thousands).

Closes: #6039

Problem

When a dataset has hundreds of thousands of fragments, plan_compaction collects metrics for all fragments before producing the plan. This leads to:

  1. OOM risk: All fragment metadata + metrics are held in memory simultaneously
  2. Excessive I/O: Each fragment requires a read of its deletion file
  3. Large serialized plans: 10K fragments → ~2.3MB JSON; 300K fragments → ~70MB JSON
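
The two plan sizes quoted above are consistent with a roughly constant per-fragment cost. The arithmetic below is only a back-of-the-envelope check that assumes the serialized plan grows linearly with fragment count; it is not a measurement.

```python
# Figure quoted above: 10K fragments serialize to roughly 2.3 MB of JSON.
plan_bytes_10k = 2.3e6
fragments_10k = 10_000

# Implied per-fragment cost in the serialized plan.
bytes_per_fragment = plan_bytes_10k / fragments_10k

# Linear extrapolation to 300K fragments (assumption: size scales linearly).
estimated_300k_mb = bytes_per_fragment * 300_000 / 1e6

print(f"{bytes_per_fragment:.0f} B/fragment, ~{estimated_300k_mb:.0f} MB at 300K fragments")
```

That lands close to the ~70MB figure reported for 300K fragments.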

The existing max_source_fragments option was a post-hoc truncation: it collected metrics for every fragment first and only then truncated the output, so it reduced neither planning time nor memory.

Benchmark data (10K fragments, no deletions):

| max_source_fragments | plan_time_ms | plan_json_size |
|----------------------|--------------|----------------|
| None (unlimited)     | 51           | 2.3MB          |
| 100                  | 56           | 408B           |
| 500                  | 56           | 408B           |

Plan time barely changed because all metrics were still collected.
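
The difference between the two strategies can be illustrated with a toy model that counts how many times metrics are collected. This is a simplified Python sketch, not the planner's code: `plan_post_hoc`, `plan_early_stop`, and `counting` are hypothetical names, and the real planner also bins fragments and filters candidates.

```python
def plan_post_hoc(fragments, collect_metrics, max_source_fragments):
    # Old behavior (simplified): collect everything, then truncate.
    metrics = [collect_metrics(f) for f in fragments]
    return metrics[:max_source_fragments]


def plan_early_stop(fragments, collect_metrics, max_source_fragments):
    # New behavior (simplified): stop collecting once the budget is reached.
    selected = []
    for f in fragments:
        if len(selected) >= max_source_fragments:
            break
        selected.append(collect_metrics(f))
    return selected


calls = {"post_hoc": 0, "early": 0}


def counting(kind):
    # Returns a fake metrics collector that records how often it is called.
    def collect(fragment_id):
        calls[kind] += 1
        return {"id": fragment_id}
    return collect


post_hoc_plan = plan_post_hoc(range(10_000), counting("post_hoc"), 100)
early_plan = plan_early_stop(range(10_000), counting("early"), 100)
print(calls)
```

Both functions return the same 100 entries, but only the early-stop variant avoids touching the remaining fragments.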

Solution

Refactor max_source_fragments from post-hoc truncation to in-loop early termination, and add a new max_compaction_bytes option. The planner now tracks total_candidate_fragments and total_candidate_bytes during the metrics collection loop and breaks out as soon as either budget is exceeded.
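
A minimal Python sketch of the budget check described above. The class name `Budget` and the function `select` are hypothetical, and treating the limits as inclusive (strictly greater than means exceeded) is an assumption; the Rust implementation may differ in both respects.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Budget:
    max_source_fragments: Optional[int] = None
    max_compaction_bytes: Optional[int] = None

    def exceeds_budget(self, total_fragments: int, total_bytes: int) -> bool:
        # Assumption: a limit of None means "unlimited".
        if (self.max_source_fragments is not None
                and total_fragments > self.max_source_fragments):
            return True
        if (self.max_compaction_bytes is not None
                and total_bytes > self.max_compaction_bytes):
            return True
        return False


def select(fragment_sizes: List[int], budget: Budget) -> List[int]:
    # Check the budget *before* accepting a fragment, then update the totals.
    chosen: List[int] = []
    total_bytes = 0
    for size in fragment_sizes:
        if budget.exceeds_budget(len(chosen) + 1, total_bytes + size):
            break
        chosen.append(size)
        total_bytes += size
    return chosen


print(select([4, 4, 4, 4], Budget(max_compaction_bytes=10)))
print(select([4, 4, 4, 4], Budget(max_source_fragments=3)))
```

Whichever limit trips first ends the loop, so the two options compose without extra logic.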

Key changes:

  • max_source_fragments: Now terminates metrics collection early (was post-hoc truncation)
  • max_compaction_bytes: New option to limit by cumulative fragment byte size
  • exceeds_budget(): Helper method checking both limits during the planning loop
  • Preserves existing parallel I/O (.buffered(io_parallelism())), unlike PR #6095 (feat: support bounded compaction planner), which used serial I/O
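
Combining parallel reads with early termination means a few reads may already be in flight when the budget trips, but never more than the parallelism window. The Python sketch below uses a thread pool as a stand-in for the Rust stream's `.buffered(...)`; `collect_until_budget` and `read_size` are hypothetical names, and the inclusive byte limit is an assumption.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def collect_until_budget(fragment_ids, read_size, parallelism, max_compaction_bytes):
    """Read fragment sizes with bounded parallelism, in order, until the byte budget trips."""
    selected, total_bytes, reads_started = [], 0, 0
    it = iter(fragment_ids)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        pending = deque()

        def top_up():
            # Keep at most `parallelism` reads in flight.
            nonlocal reads_started
            while len(pending) < parallelism:
                frag = next(it, None)
                if frag is None:
                    return
                pending.append((frag, pool.submit(read_size, frag)))
                reads_started += 1

        top_up()
        while pending:
            frag, future = pending.popleft()
            size = future.result()
            if total_bytes + size > max_compaction_bytes:
                break  # budget exceeded: stop scheduling further reads
            selected.append(frag)
            total_bytes += size
            top_up()
        for _, future in pending:
            future.cancel()  # best effort; reads already running will finish
    return selected, reads_started


selected, reads_started = collect_until_budget(
    range(1000), lambda frag: 10, parallelism=4, max_compaction_bytes=55
)
print(selected, reads_started)
```

In this toy run the wasted work is bounded by the window size rather than by the total number of fragments.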

Design Rationale

This approach follows hamersaw's review feedback on PR #6095: extending CompactionOptions rather than adding a new BoundedCompactionPlanner type. Users configure limits directly without needing to choose a planner implementation.

Changes

Rust

  • CompactionOptions: Add max_compaction_bytes: Option<usize> field
  • DefaultCompactionPlanner::plan(): Replace post-hoc truncation with in-loop early termination
  • DefaultCompactionPlanner::exceeds_budget(): New helper method
  • CompactionOptions::apply_dataset_config(): Support lance.compaction.max_compaction_bytes
  • Tests: 3 functional tests + 3 benchmark tests
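
For illustration, resolving the byte limit from dataset configuration might look like the Python sketch below. Only the key name `lance.compaction.max_compaction_bytes` comes from this PR; the function name, the rule that an explicitly passed option wins over the dataset config, and the rejection of negative values are all assumptions.

```python
from typing import Dict, Optional

CONFIG_KEY = "lance.compaction.max_compaction_bytes"  # key name from this PR


def max_compaction_bytes_from_config(
    config: Dict[str, str], explicit: Optional[int] = None
) -> Optional[int]:
    # Assumption: an explicitly passed option takes precedence over dataset config.
    if explicit is not None:
        return explicit
    raw = config.get(CONFIG_KEY)
    if raw is None:
        return None  # no limit configured
    value = int(raw)  # dataset config values are modeled here as strings
    if value < 0:
        raise ValueError(f"{CONFIG_KEY} must be non-negative, got {value}")
    return value


print(max_compaction_bytes_from_config({CONFIG_KEY: "1073741824"}))
print(max_compaction_bytes_from_config({CONFIG_KEY: "1073741824"}, explicit=512))
print(max_compaction_bytes_from_config({}))
```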

Python

  • CompactionOptions TypedDict: Add max_compaction_bytes field with docs
  • PyO3 binding: Handle max_compaction_bytes key
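
As a rough picture of the Python side, a TypedDict carrying the two budget fields could look like this. The class name `CompactionOptionsSketch` is hypothetical and the field types are assumptions; the actual TypedDict in the bindings has more fields and may type these differently.

```python
from typing import Optional, TypedDict


class CompactionOptionsSketch(TypedDict, total=False):
    # Stop planning once this many source fragments have been selected.
    max_source_fragments: Optional[int]
    # Stop planning once the selected fragments reach this many bytes.
    max_compaction_bytes: Optional[int]


# total=False makes every key optional, so callers set only what they need.
opts: CompactionOptionsSketch = {"max_compaction_bytes": 10 * 1024**3}
print(opts)
```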

Usage

Python

# Limit by fragment count
dataset.optimize.compact_files(max_source_fragments=1000)

# Limit by total bytes
dataset.optimize.compact_files(max_compaction_bytes=10 * 1024**3)  # 10 GiB

# Both limits combined
dataset.optimize.compact_files(
    max_source_fragments=1000,
    max_compaction_bytes=10 * 1024**3,
)

Rust

let options = CompactionOptions {
    max_source_fragments: Some(1000),
    max_compaction_bytes: Some(10 * 1024 * 1024 * 1024), // 10 GiB
    ..Default::default()
};

Comparison with PR #6095

| Dimension           | PR #6095                         | This PR                                  |
|---------------------|----------------------------------|------------------------------------------|
| Architecture        | New BoundedCompactionPlanner type | Extend existing DefaultCompactionPlanner |
| I/O pattern         | Serial (one-at-a-time)           | Parallel (preserved)                     |
| User API            | planner="bounded" + limits       | Direct max_* options                     |
| Maintainer feedback | Design not accepted              | Follows maintainer preference            |

…ytes option

Add budget-based early termination to DefaultCompactionPlanner to
prevent OOM when planning compaction on datasets with many fragments
(e.g., hundreds of thousands).

Changes:
- Add max_compaction_bytes option to CompactionOptions
- Refactor max_source_fragments from post-hoc truncation to in-loop
  early termination, stopping fragment metrics collection once budget
  is exceeded
- Add exceeds_budget() helper checking both fragment count and byte
  limits during the planning loop
- Update Python bindings and TypedDict docs
- Add functional tests for early termination behavior
- Add benchmark tests for plan performance at scale

Closes: lance-format#6039

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions (bot) added the enhancement and python labels on May 21, 2026
Jay-ju added 2 commits May 21, 2026 21:02
…imits

- Add apply_budget_limits() for strict post-hoc truncation on task list
- Move early termination check before fragment is added to bin
- Guarantee at least 1 task is always included
- Fix test_max_source_fragments CI failure
- Fix Issue 1: Remove first-task exemption in apply_budget_limits,
  budget is now a strict hard limit (0 tasks if first task exceeds it)
- Fix Issue 2: Early termination now tracks effective (non-noop)
  candidate fragments only, preventing budget waste on bins that
  will be filtered by is_noop()
- Fix Issue 3: Mark benchmark tests as #[ignore] to reduce CI cost
- Update docs to clarify hard-limit semantics
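
The hard-limit semantics described in these commit notes can be sketched as follows. This is an illustrative Python model, not the Rust code: the `Task` fields are hypothetical, and only the function name `apply_budget_limits` and the "0 tasks if the first task exceeds the budget" rule come from the notes above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    fragment_count: int
    total_bytes: int


def apply_budget_limits(
    tasks: List[Task],
    max_source_fragments: Optional[int] = None,
    max_compaction_bytes: Optional[int] = None,
) -> List[Task]:
    # Keep tasks in order until adding the next one would break either limit.
    kept: List[Task] = []
    fragments, size = 0, 0
    for task in tasks:
        next_fragments = fragments + task.fragment_count
        next_size = size + task.total_bytes
        if max_source_fragments is not None and next_fragments > max_source_fragments:
            break
        if max_compaction_bytes is not None and next_size > max_compaction_bytes:
            break
        kept.append(task)
        fragments, size = next_fragments, next_size
    return kept


tasks = [Task(3, 30), Task(2, 20), Task(4, 40)]
print(len(apply_budget_limits(tasks, max_source_fragments=5)))
# No first-task exemption: a too-small budget yields an empty plan.
print(len(apply_budget_limits(tasks, max_compaction_bytes=10)))
```

An empty result under a strict limit signals that the budget is smaller than any single task, which callers may want to surface rather than silently ignore.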

Jay-ju commented May 21, 2026

@claude review

codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 44.48161% with 166 lines in your changes missing coverage. Please review.

| Files with missing lines           | Patch % | Lines                     |
|------------------------------------|---------|---------------------------|
| rust/lance/src/dataset/optimize.rs | 44.48%  | 165 Missing and 1 partial |

The test uses IVF with 2 partitions but default nprobes=1, which only
probes 1 partition per segment. With delta indices (2 segments), the
search may miss the partition containing ID 0 in the first segment,
causing the assertion to fail non-deterministically (e.g., returning
[889, 1000] instead of [0, 1000]).

Setting nprobes=2 ensures all partitions are probed, making the search
exhaustive and the test deterministic.
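
The effect can be reproduced with a one-dimensional toy model of a partitioned index. This Python sketch is not Lance's IVF implementation: the data, centroids, and `search` function are invented for illustration, and the toy is deterministic whereas the real failure depended on how the partitions happened to be trained.

```python
def search(segments, query, k, nprobes):
    """Toy IVF search: probe the `nprobes` nearest partitions in each segment."""
    candidates = []
    for partitions in segments:
        # Each partition is (centroid, [(row_id, value), ...]).
        nearest = sorted(partitions, key=lambda p: abs(p[0] - query))[:nprobes]
        for _, rows in nearest:
            candidates.extend(rows)
    candidates.sort(key=lambda row: abs(row[1] - query))
    return sorted(row_id for row_id, _ in candidates[:k])


# Row 0 is the true near neighbor of the query but sits just across a partition boundary.
segment_1 = [(0.0, [(889, 1.0)]), (10.0, [(0, 5.1)])]
segment_2 = [(0.0, [(1000, 4.8)]), (10.0, [(7, 9.0)])]
query = 4.9

print(search([segment_1, segment_2], query, k=2, nprobes=1))
print(search([segment_1, segment_2], query, k=2, nprobes=2))
```

With one probe per segment the query only visits the partition whose centroid is nearest, so row 0 is never scored; probing both partitions makes the search exhaustive.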

Jay-ju commented May 22, 2026

Hi @hamersaw. Fragment planning takes a long time on large datasets. After discussing with @zhangyue19921010 and following the discussion in #6039, we changed the original logic (plan everything, then trim) to on-demand planning: planning now stops as soon as the threshold is reached.

Could you take a look when you have time?

Development

Successfully merging this pull request may close these issues.

Bounded Compaction Planner To Limit the amount of data processed during a single compaction
