feat(mask): add range-aware `Runs` variant + `insert_run` / `iter_runs` APIs by westonpace · Pull Request #6830 · lance-format/lance

westonpace · 2026-05-18T16:22:49Z

Summary

- Adds a third RowAddrSelection variant, Runs(Vec<RangeInclusive<u32>>),
  for storing range-shaped per-fragment selections without inflating to a
  per-row roaring bitmap.
- Adds two new methods on RowAddrTreeMap: insert_run(fragment_id, run)
  for range-shaped producers and iter_runs() for range-shaped consumers,
  plus a canonicalize_to_partial(fragment_id) helper that converts a Runs
  entry back to Partial.
- Adds a criterion bench suite (row_addr_mask) that quantifies how the
  existing API's cost scales with row cardinality, and how much of that
  cost the new API saves.

This is an additive, backwards-compatible change — existing code paths
keep working unchanged, the on-disk format does not move, and all 97
pre-existing utils::mask tests pass alongside 11 new ones.

Motivation

Producers like lance-index's search_zones and consumers like
mask_to_offset_ranges operate naturally on row-address ranges, but
the only public RowAddrSelection representations today are Full and
Partial(RoaringBitmap). Every range-shaped result therefore round-trips
through a per-row bitmap, so the cost of using a RowAddrMask is set by
the row cardinality of the result, not the number of distinct ranges.

The baseline benchmark suite (row_addr_mask) introduced in the first
commit of this PR makes this concrete:

Op (10M-row contiguous selection)	Existing	Achievable (via roaring `Iter::next_range`)
Producer (`insert_range`)	6.5 µs	—
Consumer (`into_addr_iter`)	19.9 ms	1.7 µs
End-to-end (`mask_to_offset_ranges_inner_loop`)	19.3 ms	—

The consumer-side gap (≈11,000×) is the largest, and it matches what we
observed in production: a Chrome trace of an IS NULL query against a
zonemap-indexed 10M-row dataset spent ≈495 ms of 889 ms inside
mask_to_offset_ranges, all of it converting between the per-row mask
and a Vec<Range<u64>>.
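The sketch below shows why that conversion scales with rows rather than ranges. It uses only std and follows the per-row path: walk every selected row address in order and coalesce contiguous ones into half-open ranges. The function name `addrs_to_ranges` is hypothetical. It shows the shape of the work and is not the actual `mask_to_offset_ranges` code.

```rust
use std::ops::Range;

/// Coalesce a sorted, deduplicated stream of row addresses into half-open
/// ranges. The loop body runs once per row, so the cost tracks row
/// cardinality even when the output is a single range.
pub fn addrs_to_ranges<I: IntoIterator<Item = u64>>(addrs: I) -> Vec<Range<u64>> {
    let mut out: Vec<Range<u64>> = Vec::new();
    for addr in addrs {
        if let Some(last) = out.last_mut() {
            if last.end == addr {
                // Contiguous with the current range: extend it by one row.
                last.end = addr + 1;
                continue;
            }
        }
        // Gap (or first row): start a new one-row range.
        out.push(addr..addr + 1);
    }
    out
}
```

Feeding it `0..10_000_000` runs ten million iterations to produce a single range. Range-shaped APIs like `iter_runs` are meant to remove that asymmetry.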

What this PR does not do

It does not migrate any callers to the new APIs, and does not change
on-disk semantics. The point of this PR is to land the data structure +
API surface + benchmarks so follow-up PRs can cut over search_zones,
mask_to_offset_ranges, and friends one at a time with a measurable
delta each.

API surface

use std::ops::RangeInclusive;

use roaring::RoaringBitmap;

// New variant on the existing enum.
pub enum RowAddrSelection {
    Full,
    Partial(RoaringBitmap),
    /// A sorted, non-overlapping, non-adjacent list of inclusive ranges.
    Runs(Vec<RangeInclusive<u32>>),
}

// New methods (signatures only; bodies elided).
impl RowAddrTreeMap {
    /// Range-shaped producer. O(1) amortized when runs arrive in order
    /// (the common pattern for scalar-index zone walks).
    pub fn insert_run(&mut self, fragment_id: u32, run: RangeInclusive<u32>);

    /// Range-shaped consumer. Yields `(fragment_id, RangeInclusive<u32>)`,
    /// one item per run regardless of how many rows the run covers.
    ///
    /// # Safety
    ///
    /// Callers must ensure no fragment's selection is `Full`; otherwise
    /// this panics. Same contract, and same `unsafe` justification, as
    /// `into_addr_iter`.
    pub unsafe fn iter_runs(&self) -> impl Iterator<Item = (u32, RangeInclusive<u32>)> + '_;

    /// Force a Runs entry into its equivalent Partial form. Useful for
    /// callers that need direct bitmap access via `get_fragment_bitmap`.
    pub fn canonicalize_to_partial(&mut self, fragment_id: u32);
}
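Consumers such as the `filtered_read.rs` match site want `Range<u64>` row addresses, but `iter_runs` yields per-fragment `RangeInclusive<u32>` items. The sketch below shows that conversion. It relies on an assumption that the PR does not state: the usual Lance row-address layout, with the fragment id in the upper 32 bits and the row offset in the lower 32. `run_to_addr_range` is a hypothetical helper name.

```rust
use std::ops::{Range, RangeInclusive};

/// Map one `(fragment_id, run)` item, as yielded by `iter_runs`, to a
/// half-open range of u64 row addresses.
///
/// Assumption: row address = (fragment_id << 32) | row_offset.
pub fn run_to_addr_range(fragment_id: u32, run: RangeInclusive<u32>) -> Range<u64> {
    let base = u64::from(fragment_id) << 32;
    let start = base + u64::from(*run.start());
    // The run's end is inclusive; add 1 for the exclusive bound. This
    // overflows only if fragment_id and the run end are both u32::MAX.
    let end = base + u64::from(*run.end()) + 1;
    start..end
}
```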

Both insert_run and iter_runs are full citizens of the mask
machinery:

- insert_run preserves the sorted / non-overlapping / non-adjacent
  invariants even on unsorted input. Merging is O(num_runs) in the
  pathological case and O(1) amortized in the common in-order case.
- iter_runs handles all three variants: it yields the stored ranges
  for Runs, surfaces roaring's container run-encoding for Partial
  via Iter::next_range, and panics on Full (matching the existing
  into_addr_iter contract, with the same unsafe justification).
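The merge rule for insert_run can be sketched over a bare `Vec<RangeInclusive<u32>>`. This is a hypothetical helper, not the PR's implementation. When the run lands at or past the tail, it takes an O(1) fast path. Otherwise it uses two binary searches to find every run that overlaps or touches the new one, then replaces that block with a single merged run. Bounds are widened to u64 so that `end + 1` cannot overflow at `u32::MAX`.

```rust
use std::ops::RangeInclusive;

/// Insert `run` into `runs`, keeping the list sorted, non-overlapping and
/// non-adjacent (touching runs are merged). Hypothetical helper.
pub fn insert_run(runs: &mut Vec<RangeInclusive<u32>>, run: RangeInclusive<u32>) {
    if run.is_empty() {
        return; // e.g. 5..=4 selects no rows
    }
    let (mut start, mut end) = (*run.start(), *run.end());

    // Fast path, O(1): the run lands at or after the tail, which is what
    // in-order producers do. Copy the tail's bounds out to avoid a borrow.
    match runs.last().map(|r| (*r.start(), *r.end())) {
        None => {
            runs.push(start..=end);
            return;
        }
        Some((_, last_end)) if u64::from(start) > u64::from(last_end) + 1 => {
            // Strictly after the tail with a gap: append.
            runs.push(start..=end);
            return;
        }
        Some((last_start, last_end)) if start >= last_start => {
            // Overlaps or touches only the tail: widen it in place.
            let tail = runs.len() - 1;
            runs[tail] = last_start..=last_end.max(end);
            return;
        }
        _ => {}
    }

    // Slow path, O(num_runs): find the block [i, j) of runs that overlap
    // or are adjacent to [start, end] and replace it with one merged run.
    let i = runs.partition_point(|r| u64::from(*r.end()) + 1 < u64::from(start));
    let j = runs.partition_point(|r| u64::from(*r.start()) <= u64::from(end) + 1);
    if i < j {
        start = start.min(*runs[i].start());
        end = end.max(*runs[j - 1].end());
    }
    runs.drain(i..j);
    runs.insert(i, start..=end);
}
```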

Backwards compatibility

Concern	Status
Existing `utils::mask` tests	97/97 pass unchanged
On-disk wire format	Unchanged: `serialize_into` inflates `Runs` to its equivalent bitmap before writing, so old readers continue to load it. `deserialize_from` always returns `Partial`.
`into_addr_iter` cost	Unchanged: 19.4 ms before, 19.9 ms after (within noise). Using `itertools::Either` rather than `Box<dyn Iterator>` means the new arm adds no dynamic dispatch to the existing path.
Existing set operations (`&`, `|`, `-`, `Extend`)	Updated to accept `Runs` inputs by transparently inflating them to `Partial` and then applying the existing roaring-bitmap logic (a semantics-preserving fallback). Native run-shaped set ops are deferred to a follow-up, since the `intersect_two_runs` benchmark already takes only 12 µs at 10M rows (not a bottleneck).
Existing `RowAddrSelection` consumers outside `mask.rs`	One match site in `filtered_read.rs` (in `FilteredReadExec::with_plan`) grew a `Runs` arm that emits the stored runs as `Range<u64>` directly. Compilation guarantees no other consumer was missed.
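The `itertools::Either` point is easier to see in code. Below is a minimal sketch that uses only std. The `Either` enum is hand-rolled to mirror `itertools::Either` (a re-export of the `either` crate's type). `Selection` and `row_offsets` are hypothetical stand-ins for a fragment's selection and a per-row consumer like `into_addr_iter`.

```rust
use std::ops::RangeInclusive;

/// Hand-rolled two-armed iterator, mirroring `itertools::Either`.
pub enum Either<L, R> {
    Left(L),
    Right(R),
}

impl<L, R> Iterator for Either<L, R>
where
    L: Iterator,
    R: Iterator<Item = L::Item>,
{
    type Item = L::Item;

    // A match on the tag, then a direct (inlinable) call on a concrete
    // type: no vtable lookup as with `Box<dyn Iterator>`.
    fn next(&mut self) -> Option<Self::Item> {
        match self {
            Either::Left(l) => l.next(),
            Either::Right(r) => r.next(),
        }
    }
}

/// Hypothetical stand-in for one fragment's selection.
pub enum Selection {
    /// Per-row offsets (plays the role of `Partial`).
    Rows(Vec<u32>),
    /// Inclusive runs (plays the role of `Runs`).
    Runs(Vec<RangeInclusive<u32>>),
}

/// Yield every selected row offset. Both arms produce the same concrete
/// `Either<..>` type, so no boxing is needed to unify them.
pub fn row_offsets(sel: &Selection) -> impl Iterator<Item = u32> + '_ {
    match sel {
        Selection::Rows(rows) => Either::Left(rows.iter().copied()),
        Selection::Runs(runs) => Either::Right(runs.iter().flat_map(|r| r.clone())),
    }
}
```

The enum's discriminant check is the only overhead added to the existing arm. A boxed trait object would instead force a heap allocation and an indirect call per `next`.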

Benchmark results

Run with cargo bench -p lance-core --bench row_addr_mask.

Producer: insert one run covering N rows

N	`insert_range` (existing)	`insert_run` (new)	speedup
10K	54 ns	31 ns	1.8×
100K	67 ns	31 ns	2.2×
1M	543 ns	31 ns	17.7×
10M	6.5 µs	31 ns	210×

Consumer: iterate selection of N rows

N	`into_addr_iter` (existing)	`iter_runs` (new)	speedup
10K	19.4 µs	6.3 ns	3,078×
100K	193 µs	6.4 ns	30,209×
1M	1.94 ms	6.3 ns	306,879×
10M	19.9 ms	6.3 ns	3,154,000×

Producer: K runs summing to 1M rows

K	`insert_range` (existing)	`insert_run` (new)	speedup
1	608 ns	32 ns	19.2×
10	827 ns	199 ns	4.2×
100	3.1 µs	769 ns	4.1×
1,000	28.4 µs	5.7 µs	5.0×
10,000	273 µs	49 µs	5.6×

Test plan

- All 97 pre-existing utils::mask tests still pass.
- 11 new unit tests cover:
  - insert_run invariant preservation on in-order, out-of-order, and
    overlapping inputs.
  - insert_run degradation rules: Full → no-op, Partial → stays
    Partial, empty/Runs → stays Runs.
  - iter_runs against pure-Runs, pure-Partial, and mixed-variant maps.
  - canonicalize_to_partial converts Runs to Partial in place.
  - Serialization round-trip: a Runs-built map and the equivalent
    Partial-built map produce byte-identical on-disk output.
  - Set operations (&, |, -) yield identical cardinalities for
    Runs-built and Partial-built equivalent inputs.
- cargo build --workspace --tests builds cleanly.
- cargo bench -p lance-core --bench row_addr_mask runs end-to-end,
  and criterion's `change:` output confirms into_addr_iter did not
  regress with the new variant.
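The invariant-preservation tests need an oracle for a well-formed Runs list. A sketch of such a check follows. `is_canonical` is a hypothetical name. Rejecting empty elements (start > end) is an assumption of this sketch; the PR does not state it.

```rust
use std::ops::RangeInclusive;

/// True when `runs` is sorted, non-overlapping and non-adjacent, and
/// (an assumption here) contains no empty element.
pub fn is_canonical(runs: &[RangeInclusive<u32>]) -> bool {
    runs.iter().all(|r| r.start() <= r.end())
        && runs.windows(2).all(|w| {
            // Require at least one unselected row between neighbours;
            // widen to u64 so `end + 1` cannot overflow at u32::MAX.
            u64::from(*w[0].end()) + 1 < u64::from(*w[1].start())
        })
}
```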

Follow-ups (not in this PR)

1. Migrate lance_index::scalar::zoned::search_zones from
   RowAddrTreeMap::insert_range to insert_run. This is the producer
   half of the zonemap IS NULL hot path.
2. Migrate lance_table::rowids::RowIdSequence::mask_to_offset_ranges's
   U64Segment::Range arm to consume iter_runs instead of
   materializing the source range and intersecting. This closes the
   consumer half.
3. Add native run-shaped intersection (Runs ∩ Runs → Runs) once a
   call site needs the result to keep the run-shaped representation.
4. Optional: a serialization-format minor-version bump so Runs can be
   written on the wire too, avoiding the inflate-on-write step. Not
   needed for any of (1)–(3), since current call sites build masks
   in memory per query.
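In its simplest form, the native Runs ∩ Runs intersection follow-up is a two-pointer sweep. Its cost depends on the number of runs, not the number of rows. Below is a sketch over plain slices under the hypothetical name `intersect_runs`; it is not code from this PR.

```rust
use std::ops::RangeInclusive;

/// Intersect two canonical run lists in O(n + m) with a two-pointer sweep.
pub fn intersect_runs(
    a: &[RangeInclusive<u32>],
    b: &[RangeInclusive<u32>],
) -> Vec<RangeInclusive<u32>> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        // Overlap of the current pair, if any.
        let lo = (*a[i].start()).max(*b[j].start());
        let hi = (*a[i].end()).min(*b[j].end());
        if lo <= hi {
            out.push(lo..=hi);
        }
        // The run that ends first cannot overlap anything later in the
        // other list, so advance past it.
        if a[i].end() < b[j].end() {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}
```

Each output piece ends where one input run ends, and a gap follows that input run. So the output stays sorted, non-overlapping and non-adjacent without a separate normalization pass.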

🤖 Generated with Claude Code

Add a criterion benchmark suite targeting RowAddrMask / RowAddrTreeMap that quantifies the cost of operations whose work is fundamentally range-shaped but currently goes through the per-row Partial(RoaringBitmap) representation. Six groups:

* insert_range_single_run: producer cost, inserting one range
* into_addr_iter_single_run: consumer cost, walking every row addr
* next_range_iter_single_run: achievable cost via Iter::next_range
* intersect_two_runs: set op on two range-shaped masks
* mask_to_offset_ranges_inner_loop: end-to-end slow path observed in the IS NULL trace (495 ms / 889 ms)
* insert_runs_constant_cardinality: many small runs vs one big run

The first five groups vary dataset size while holding the number of ranges fixed at 1, so linear scaling in N reveals where row count dominates the cost. insert_runs_constant_cardinality instead holds total rows fixed and varies the number of runs.

Headline finding (10M-row inputs):

* into_addr_iter: 19.4 ms (per-bit walk)
* next_range iter: 1.72 µs (per-run walk, ~11,000x faster)

The into_addr_iter / next_range delta is the speedup a range-aware iterator could surface to callers. The roaring crate already stores the data as run-encoded containers; the RowAddrMask public API just does not expose them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds a third RowAddrSelection variant, `Runs(Vec<RangeInclusive<u32>>)`, that stores a per-fragment selection as a sorted, non-overlapping, non-adjacent list of run-length-encoded ranges. This is the backwards-compatible step toward a range-aware row-address mask: existing producers and consumers keep working unchanged, while new range-shaped callers can sidestep the per-row roaring bitmap that today dominates mask construction and iteration cost.

New APIs on RowAddrTreeMap:

* insert_run(fragment_id, run): range-shaped producer counterpart to insert(value) / insert_range. O(1) amortized when the run extends or is adjacent to the last entry (the common case for in-order producers like scalar-index zone searches). Merges into existing Runs while preserving invariants. Falls back to Partial-bitmap inserts when the existing entry is already Partial (so scalar inserts never silently re-shape data).
* iter_runs() -> Iterator<(u32, RangeInclusive<u32>)>: range-shaped consumer counterpart to into_addr_iter. Yields one item per contiguous run, not per row. For `Runs` entries the runs are emitted directly; for `Partial` entries roaring's Iter::next_range surfaces the bitmap's internal run encoding. Panics on `Full` (same contract as into_addr_iter).
* canonicalize_to_partial(fragment_id): forces a Runs entry into its equivalent Partial form. Useful for callers that need raw bitmap access via get_fragment_bitmap.

Compatibility:

* Every existing match site on RowAddrSelection grew a Runs arm that either handles the variant natively (len, contains, row_addrs, iter_runs, into_addr_iter, serialize_into, etc.) or inflates to Partial via the private into_partial_bitmap helper for ops not yet range-aware (insert, remove, BitOr/BitAnd/Sub, FromIterator, Extend). All 97 existing mask tests pass unchanged.
* On-disk format is unchanged: serialize_into inflates Runs to its equivalent bitmap before writing, so readers built against older versions continue to load it. deserialize_from always yields Partial.
* Hot paths use itertools::Either rather than Box<dyn Iterator>, so the new variant adds no dyn-dispatch cost to the existing Partial iteration path. Verified by criterion: into_addr_iter at 10M rows is 19.9 ms before and after.

Benchmark deltas (single contiguous run, vs the pre-existing APIs documented in commit 1b9d7c0):

Producer (insert one run of N rows):

N	insert_range	insert_run	speedup
10K	54 ns	31 ns	1.8x
100K	67 ns	31 ns	2.2x
1M	543 ns	31 ns	17.7x
10M	6,499 ns	31 ns	210x

Consumer (iterate selection of N rows):

N	into_addr_iter	iter_runs	speedup
10K	19,396 ns	6.3 ns	3,078x
100K	193,111 ns	6.4 ns	30,209x
1M	1,943,641 ns	6.3 ns	306,879x
10M	19,871,915 ns	6.3 ns	3,154,000x

Many runs (1M total cardinality, K runs):

K	insert_range	insert_run	speedup
1	608 ns	32 ns	19.2x
10	827 ns	199 ns	4.2x
100	3,123 ns	769 ns	4.1x
1,000	28,416 ns	5,680 ns	5.0x
10,000	272,891 ns	49,155 ns	5.6x

11 new unit tests cover invariant preservation, mixed-variant set ops, serialization round-trip, and degradation rules (insert into Partial collapses to Partial, insert into Full is a no-op).

filtered_read.rs gains a Runs arm in the existing FilteredReadPlan consumer at line 1606, so callers wiring the new producer through that path are not blocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
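The into_partial_bitmap fallback trades speed for reuse: inflate the runs into a per-row structure, then run the existing set logic. The sketch below illustrates that strategy. To stay dependency-free, `std::collections::BTreeSet<u32>` stands in for `RoaringBitmap`, which is an assumption of the sketch. `inflate` and `intersect_via_inflation` are hypothetical names. With a `BTreeSet` the inflation cost is strictly per row. Roaring's run containers make it cheaper, but as the insert_range numbers above show, the cost still grows with row count.

```rust
use std::collections::BTreeSet;
use std::ops::RangeInclusive;

/// Inflate runs into a per-row set (stand-in for `RoaringBitmap`).
/// Work here is one insert per covered row.
pub fn inflate(runs: &[RangeInclusive<u32>]) -> BTreeSet<u32> {
    runs.iter().flat_map(|r| r.clone()).collect()
}

/// Fallback set operation: inflate both operands, then reuse the
/// existing per-row intersection logic.
pub fn intersect_via_inflation(
    a: &[RangeInclusive<u32>],
    b: &[RangeInclusive<u32>],
) -> BTreeSet<u32> {
    let (a, b) = (inflate(a), inflate(b));
    a.intersection(&b).copied().collect()
}
```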

Apply cargo fmt to the new Runs-variant code and address two clippy findings:

* manual_let_else in BitAndAssign: convert the `Some(set) => set, None => continue` match into a `let ... else` (the retain pass above already guarantees the entry exists; the else arm is just a defensive skip).
* identity_op in test_iter_runs_mixed_variants: drop the stray `+ 0` in the second insert_range bound.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
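Here is the `manual_let_else` rewrite in isolation. The map, keys, and `sum_present` function are hypothetical; only the `let ... else { continue }` shape mirrors the BitAndAssign change.

```rust
use std::collections::HashMap;

/// Sum the values stored under each key in `keys`, skipping missing keys.
pub fn sum_present(map: &HashMap<u32, Vec<u32>>, keys: &[u32]) -> u32 {
    let mut total = 0;
    for key in keys {
        // Before: let set = match map.get(key) { Some(set) => set, None => continue };
        let Some(set) = map.get(key) else {
            continue; // defensive skip for a missing entry
        };
        total += set.iter().sum::<u32>();
    }
    total
}
```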

codecov · 2026-05-18T17:04:30Z

Codecov Report

❌ Patch coverage is 79.15567% with 79 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-core/src/utils/mask.rs	80.00%	67 Missing and 8 partials ⚠️
rust/lance/src/io/exec/filtered_read.rs	0.00%	4 Missing ⚠️


westonpace and others added 2 commits May 16, 2026 18:51

github-actions Bot added the enhancement New feature or request label May 18, 2026

westonpace wants to merge 3 commits into lance-format:main from westonpace:perf-mask-bench-baseline
