feat(mask): add range-aware Runs variant + insert_run / iter_runs APIs #6830

Draft: westonpace wants to merge 3 commits into lance-format:main from westonpace:perf-mask-bench-baseline

@westonpace
Member

Summary

  • Adds a third RowAddrSelection variant, Runs(Vec<RangeInclusive<u32>>),
    for storing range-shaped per-fragment selections without inflating to a
    per-row roaring bitmap.
  • Adds two new methods on RowAddrTreeMap: insert_run(fragment_id, run)
    for range-shaped producers and iter_runs() for range-shaped consumers.
  • Adds a criterion bench suite (row_addr_mask) that quantifies how the
    existing API's cost scales with row cardinality and how much of that
    cost the new API removes.

This is an additive, backwards-compatible change — existing code paths
keep working unchanged, the on-disk format does not move, and all 97
pre-existing utils::mask tests pass alongside 11 new ones.

Motivation

Producers like lance-index's search_zones and consumers like
mask_to_offset_ranges operate naturally on row-address ranges, but
the only public RowAddrSelection representations today are Full and
Partial(RoaringBitmap). Every range-shaped result therefore round-trips
through a per-row bitmap, so the cost of using a RowAddrMask is set by
the row cardinality of the result, not the number of distinct ranges.
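
To make the round trip concrete, here is a minimal sketch of what a range-shaped caller effectively pays for today: a range is expanded to individual rows, and the consumer then has to walk every row to recover the range. A plain `Vec<u32>` stands in for the bitmap, and `expand` and `coalesce` are hypothetical names for illustration, not functions from this PR. A real roaring bitmap is compressed, but a per-row iterator over it still does work proportional to the number of rows.

```rust
use std::ops::RangeInclusive;

/// Hypothetical stand-in for inserting a range into a per-row structure:
/// one entry per selected row.
fn expand(run: RangeInclusive<u32>) -> Vec<u32> {
    run.collect()
}

/// Recover ranges from sorted, de-duplicated row offsets. The loop runs
/// once per row, so the cost follows row cardinality, not range count.
fn coalesce(rows: &[u32]) -> Vec<RangeInclusive<u32>> {
    let mut out: Vec<RangeInclusive<u32>> = Vec::new();
    for &row in rows {
        if let Some(last) = out.last_mut() {
            // Extend the current run when this row is its direct successor.
            if *last.end() != u32::MAX && *last.end() + 1 == row {
                *last = *last.start()..=row;
                continue;
            }
        }
        out.push(row..=row);
    }
    out
}
```

For a single 10M-row range, `coalesce` takes 10M loop iterations to hand back one range that the producer already had.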

The baseline benchmark suite (row_addr_mask) introduced in the first
commit of this PR makes this concrete:

| Op (10M-row contiguous selection) | Existing | Achievable |
| --- | --- | --- |
| Producer (insert_range) | 6.5 µs | |
| Consumer (into_addr_iter) | 19.9 ms | 1.7 µs |
| End-to-end (mask_to_offset_ranges_inner_loop) | 19.3 ms | |

The consumer-side gap (≈11,000×) is the largest, and matches what we
observed in production: a chrome trace of IS NULL against a zonemap-
indexed 10M-row dataset spent ≈495 ms of 889 ms inside
mask_to_offset_ranges, all of it converting between the per-row mask
and a Vec<Range<u64>>.

What this PR does not do

It does not migrate any callers to the new APIs, and does not change
on-disk semantics. The point of this PR is to land the data structure +
API surface + benchmarks so follow-up PRs can cut over search_zones,
mask_to_offset_ranges, and friends one at a time with a measurable
delta each.

API surface

// New variant on the existing enum.
pub enum RowAddrSelection {
    Full,
    Partial(RoaringBitmap),
    /// A sorted, non-overlapping, non-adjacent list of inclusive ranges.
    Runs(Vec<RangeInclusive<u32>>),
}

impl RowAddrTreeMap {
    /// Range-shaped producer. O(1) amortized when runs arrive in order
    /// (the common pattern for scalar-index zone walks).
    pub fn insert_run(&mut self, fragment_id: u32, run: RangeInclusive<u32>);

    /// Range-shaped consumer. Yields `(fragment_id, RangeInclusive<u32>)`,
    /// one item per run regardless of how many rows the run covers.
    pub unsafe fn iter_runs(&self) -> impl Iterator<Item = (u32, RangeInclusive<u32>)> + '_;

    /// Force a Runs entry into its equivalent Partial form. Useful for
    /// callers that need direct bitmap access via `get_fragment_bitmap`.
    pub fn canonicalize_to_partial(&mut self, fragment_id: u32);
}
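
As an illustration of how a consumer might use the `(fragment_id, RangeInclusive<u32>)` items, the sketch below turns each run into a half-open `Range<u64>` of row addresses. It assumes a row address packs the fragment id into the upper 32 bits and the row offset into the lower 32 bits. `run_to_addr_range` and `to_addr_ranges` are hypothetical helpers, not part of this PR, and the input is any iterator of pairs rather than the real `iter_runs`.

```rust
use std::ops::{Range, RangeInclusive};

/// Hypothetical helper: map one run to a half-open range of row addresses,
/// assuming address = (fragment_id << 32) | row_offset.
fn run_to_addr_range(fragment_id: u32, run: &RangeInclusive<u32>) -> Option<Range<u64>> {
    if run.is_empty() {
        return None;
    }
    let base = (fragment_id as u64) << 32;
    let start = base | *run.start() as u64;
    // The exclusive end overflows only for the very last address of the
    // very last fragment; report that case as None instead of wrapping.
    let end = (base | *run.end() as u64).checked_add(1)?;
    Some(start..end)
}

/// One output range per run, regardless of how many rows each run covers.
fn to_addr_ranges(
    runs: impl Iterator<Item = (u32, RangeInclusive<u32>)>,
) -> Vec<Range<u64>> {
    runs.filter_map(|(frag, run)| run_to_addr_range(frag, &run))
        .collect()
}
```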

Both insert_run and iter_runs are full citizens of the mask
machinery:

  • insert_run preserves the sorted / non-overlapping / non-adjacent
    invariants even on unsorted input. Merging is O(num_runs) in the
    pathological case, O(1) amortized in the common in-order case.
  • iter_runs works on all three variants: yields stored ranges
    for Runs, surfaces roaring's container run-encoding for Partial
    via Iter::next_range, and panics on Full (matching the existing
    into_addr_iter contract — same unsafe justification).
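
The invariant-preserving insert described above can be sketched as follows. This is a simplified model under stated assumptions, not the code in mask.rs: `RunList` is a hypothetical wrapper around the `Runs` payload, and the split into a fast path and a binary-search slow path is one way to get O(1) amortized in-order appends with an O(num_runs) worst case.

```rust
use std::ops::RangeInclusive;

/// Hypothetical stand-in for the payload of a `Runs` entry: sorted,
/// non-overlapping, non-adjacent inclusive ranges.
#[derive(Debug, Default, PartialEq)]
struct RunList(Vec<RangeInclusive<u32>>);

impl RunList {
    fn insert_run(&mut self, run: RangeInclusive<u32>) {
        if run.is_empty() {
            return;
        }
        let (mut start, mut end) = (*run.start(), *run.end());

        // Fast path: the run starts at or after the start of the last run.
        if let Some(last) = self.0.last().cloned() {
            if u64::from(start) > u64::from(*last.end()) + 1 {
                // Strictly after the last run, with a gap: plain append.
                self.0.push(start..=end);
                return;
            }
            if start >= *last.start() {
                // Overlaps or touches only the last run: extend it in place.
                let n = self.0.len();
                self.0[n - 1] = *last.start()..=end.max(*last.end());
                return;
            }
        }

        // Slow path: find every stored run that overlaps or touches the new
        // one (indices lo..hi) and replace them with a single merged run.
        let lo = self
            .0
            .partition_point(|r| u64::from(*r.end()) + 1 < u64::from(start));
        let hi = self
            .0
            .partition_point(|r| u64::from(*r.start()) <= u64::from(end) + 1);
        if lo < hi {
            start = start.min(*self.0[lo].start());
            end = end.max(*self.0[hi - 1].end());
        }
        // Dropping the Splice iterator performs the replacement.
        drop(self.0.splice(lo..hi, std::iter::once(start..=end)));
    }
}
```

Comparisons are done in `u64` so that a run ending at `u32::MAX` does not overflow when checking adjacency.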

Backwards compatibility

| Concern | Status |
| --- | --- |
| Existing utils::mask tests | 97/97 pass unchanged |
| On-disk wire format | Unchanged: serialize_into inflates Runs to its equivalent bitmap before writing; old readers continue to load. deserialize_from always returns Partial. |
| into_addr_iter cost | Unchanged: 19.4 ms before, 19.9 ms after (within noise). The trick is itertools::Either rather than Box<dyn Iterator> so the new arm adds no dynamic dispatch to the existing path. |
| Existing set operations (&, \|, -, Extend) | Updated to accept Runs inputs by transparently inflating to Partial before applying the existing roaring-bitmap logic (a semantics-preserving fallback). Native run-shaped set ops are deferred to a follow-up since intersect_two_runs at 10M rows is already 12 µs (not a bottleneck). |
| Existing RowAddrSelection consumers outside mask.rs | One match site in filtered_read.rs (in FilteredReadExec::with_plan) grew a Runs arm that emits the stored runs as Range<u64> directly. Compilation guarantees no other consumer was missed. |
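
The static-dispatch point in the into_addr_iter row can be illustrated with a small std-only sketch. `EitherIter` below is a hand-rolled stand-in for `itertools::Either`, a `BTreeSet<u32>` stands in for the roaring bitmap, and `Selection` and `row_offsets` are hypothetical names. The idea is that each match arm returns a different concrete iterator type, and the enum unifies them without boxing.

```rust
use std::collections::BTreeSet;
use std::ops::RangeInclusive;

/// Minimal stand-in for `itertools::Either`: one enum, two iterator types.
enum EitherIter<L, R> {
    Left(L),
    Right(R),
}

impl<L, R> Iterator for EitherIter<L, R>
where
    L: Iterator,
    R: Iterator<Item = L::Item>,
{
    type Item = L::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // A plain match on the enum tag; no vtable lookup is involved.
        match self {
            EitherIter::Left(l) => l.next(),
            EitherIter::Right(r) => r.next(),
        }
    }
}

/// Hypothetical two-variant selection (the `Full` variant is omitted).
enum Selection {
    Partial(BTreeSet<u32>),
    Runs(Vec<RangeInclusive<u32>>),
}

/// Per-row iteration over either variant without `Box<dyn Iterator>`.
fn row_offsets(sel: &Selection) -> impl Iterator<Item = u32> + '_ {
    match sel {
        Selection::Partial(rows) => EitherIter::Left(rows.iter().copied()),
        Selection::Runs(runs) => EitherIter::Right(runs.iter().flat_map(|r| r.clone())),
    }
}
```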

Benchmark results

Run with cargo bench -p lance-core --bench row_addr_mask.

Producer: insert one run covering N rows

| N | insert_range (existing) | insert_run (new) | speedup |
| --- | --- | --- | --- |
| 10K | 54 ns | 31 ns | 1.8× |
| 100K | 67 ns | 31 ns | 2.2× |
| 1M | 543 ns | 31 ns | 17.7× |
| 10M | 6.5 µs | 31 ns | 210× |

Consumer: iterate selection of N rows

| N | into_addr_iter (existing) | iter_runs (new) | speedup |
| --- | --- | --- | --- |
| 10K | 19.4 µs | 6.3 ns | 3,078× |
| 100K | 193 µs | 6.4 ns | 30,209× |
| 1M | 1.94 ms | 6.3 ns | 306,879× |
| 10M | 19.9 ms | 6.3 ns | 3,154,000× |

Producer: K runs summing to 1M rows

| K | insert_range (existing) | insert_run (new) | speedup |
| --- | --- | --- | --- |
| 1 | 608 ns | 32 ns | 19.2× |
| 10 | 827 ns | 199 ns | 4.2× |
| 100 | 3.1 µs | 769 ns | 4.1× |
| 1,000 | 28.4 µs | 5.7 µs | 5.0× |
| 10,000 | 273 µs | 49 µs | 5.6× |

Test plan

  • All 97 pre-existing utils::mask tests still pass.
  • 11 new unit tests cover:
    • insert_run invariant preservation on in-order, out-of-order, and
      overlapping inputs.
    • insert_run degradation rules: Full → no-op, Partial → stays
      Partial, empty/Runs → stays Runs.
    • iter_runs against pure-Runs, pure-Partial, and mixed-variant maps.
    • canonicalize_to_partial converts Runs to Partial in place.
    • Serialization round-trip: a Runs-built map and the equivalent
      Partial-built map produce byte-identical on-disk output.
    • Set operations (&, |, -) yield identical cardinalities for
      Runs-built and Partial-built equivalent inputs.
  • cargo build --workspace --tests clean.
  • cargo bench -p lance-core --bench row_addr_mask runs end-to-end
    and the criterion change: output confirms into_addr_iter was not
    regressed by the new variant.
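
The degradation rules exercised by the tests above can be modeled in a few lines. This is a simplified sketch under stated assumptions, not the PR's implementation: `Selection` here is a hypothetical per-fragment enum with a `BTreeSet<u32>` in place of the roaring bitmap, and `normalize` uses a sort-and-merge pass rather than the in-place merge the PR describes.

```rust
use std::collections::BTreeSet;
use std::ops::RangeInclusive;

/// Hypothetical per-fragment selection mirroring the three variants.
#[derive(Debug, PartialEq)]
enum Selection {
    Full,
    Partial(BTreeSet<u32>),
    Runs(Vec<RangeInclusive<u32>>),
}

/// Restore the sorted / non-overlapping / non-adjacent invariants.
fn normalize(runs: &mut Vec<RangeInclusive<u32>>) {
    runs.retain(|r| !r.is_empty());
    runs.sort_by_key(|r| *r.start());
    let mut merged: Vec<RangeInclusive<u32>> = Vec::with_capacity(runs.len());
    for r in runs.drain(..) {
        if let Some(last) = merged.last_mut() {
            // Merge when the next run overlaps or touches the previous one.
            if u64::from(*r.start()) <= u64::from(*last.end()) + 1 {
                let end = (*last.end()).max(*r.end());
                *last = *last.start()..=end;
                continue;
            }
        }
        merged.push(r);
    }
    *runs = merged;
}

impl Selection {
    fn insert_run(&mut self, run: RangeInclusive<u32>) {
        match self {
            // Full already selects every row: nothing to do.
            Selection::Full => {}
            // Partial stays Partial: rows are added one by one.
            Selection::Partial(rows) => rows.extend(run),
            // Runs stays Runs: store the range and re-establish invariants.
            Selection::Runs(runs) => {
                runs.push(run);
                normalize(runs);
            }
        }
    }

    /// Convert a Runs entry to its equivalent Partial form in place.
    fn canonicalize_to_partial(&mut self) {
        if let Selection::Runs(runs) = self {
            let rows: BTreeSet<u32> = runs.iter().flat_map(|r| r.clone()).collect();
            *self = Selection::Partial(rows);
        }
    }
}
```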

Follow-ups (not in this PR)

  1. Migrate lance_index::scalar::zoned::search_zones from
    RowAddrTreeMap::insert_range to insert_run. This is the producer
    half of the zonemap IS NULL hot path.
  2. Migrate lance_table::rowids::RowIdSequence::mask_to_offset_ranges's
    U64Segment::Range arm to consume iter_runs instead of
    materializing the source range and intersecting. Closes the consumer
    half.
  3. Add native run-shaped intersection (Runs ∩ Runs → Runs) once a
    call site emerges that needs the result kept in run-shaped form.
  4. Optional: a serialization-format minor-version bump so Runs can be
    written on the wire too, avoiding the inflate-on-write step. Not
    needed for any of (1)–(3) since current call sites build masks
    in-memory per query.
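
For follow-up (3), one plausible shape for a native run-shaped intersection is a two-pointer merge over the two sorted run lists. The sketch below is an assumption about how it could look, not code from this PR. `intersect_runs` is a hypothetical name, and both inputs are assumed to satisfy the sorted, non-overlapping, non-adjacent invariants.

```rust
use std::ops::RangeInclusive;

/// Intersect two run lists in O(a.len() + b.len()), independent of how
/// many rows the runs cover. The output keeps the same invariants.
fn intersect_runs(
    a: &[RangeInclusive<u32>],
    b: &[RangeInclusive<u32>],
) -> Vec<RangeInclusive<u32>> {
    let (mut i, mut j) = (0, 0);
    let mut out = Vec::new();
    while i < a.len() && j < b.len() {
        // The overlap of the two current runs, if any.
        let start = (*a[i].start()).max(*b[j].start());
        let end = (*a[i].end()).min(*b[j].end());
        if start <= end {
            out.push(start..=end);
        }
        // Advance whichever run ends first; it cannot overlap anything later.
        if a[i].end() < b[j].end() {
            i += 1;
        } else {
            j += 1;
        }
    }
    out
}
```

Because every output piece is a sub-range of one run from each input, two pieces can never touch, so no extra merge pass is needed.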

🤖 Generated with Claude Code

westonpace and others added 2 commits May 16, 2026 18:51
Add a criterion benchmark suite targeting RowAddrMask / RowAddrTreeMap
that quantifies the cost of operations whose work is fundamentally
range-shaped but currently goes through per-row Partial(RoaringBitmap)
representation. Six groups:

  insert_range_single_run        - producer cost: insert one range
  into_addr_iter_single_run      - consumer cost: walk every row addr
  next_range_iter_single_run     - achievable cost via Iter::next_range
  intersect_two_runs             - set op on two range-shaped masks
  mask_to_offset_ranges_inner_loop - end-to-end slow path observed in
                                     IS NULL trace (495 ms / 889 ms)
  insert_runs_constant_cardinality - many small runs vs one big run

All but the last vary dataset size while holding the number of ranges per
mask fixed at 1, so linear scaling in N reveals where row count dominates
the cost; the last holds cardinality fixed and varies the number of runs.

Headline finding (10M-row inputs):
  into_addr_iter:      19.4 ms   per-bit walk
  next_range iter:     1.72 us   per-run walk (~11000x faster)

The next_range/iter delta represents the speedup an alternate
range-aware iterator could surface to callers. The roaring crate
already represents the data as run-encoded containers; the
RowAddrMask public API does not expose them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a third RowAddrSelection variant, `Runs(Vec<RangeInclusive<u32>>)`,
that stores a per-fragment selection as a sorted, non-overlapping,
non-adjacent list of run-length-encoded ranges. This is the backwards-
compatible step toward a range-aware row-address mask: existing
producers and consumers keep working unchanged, while new range-shaped
callers can sidestep the per-row roaring bitmap that today dominates
mask construction and iteration cost.

New APIs on RowAddrTreeMap:

  insert_run(fragment_id, run)
    Range-shaped producer counterpart to insert(value) / insert_range.
    O(1) amortized when the run extends or is adjacent to the last
    entry (the common case for in-order producers like scalar-index
    zone searches). Merges into existing Runs preserving invariants.
    Falls back to Partial-bitmap inserts when the existing entry is
    already Partial (so scalar inserts never silently re-shape data).

  iter_runs() -> Iterator<(u32, RangeInclusive<u32>)>
    Range-shaped consumer counterpart to into_addr_iter. Yields one
    item per contiguous run, not per row. For `Runs` entries the runs
    are emitted directly; for `Partial` entries roaring's
    Iter::next_range surfaces the bitmap's internal run encoding.
    Panics on `Full` (same contract as into_addr_iter).

  canonicalize_to_partial(fragment_id)
    Force a Runs entry into its equivalent Partial form. Useful for
    callers that need raw bitmap access via get_fragment_bitmap.

Compatibility:

  * Every existing match site on RowAddrSelection grew a Runs arm that
    either handles the variant natively (len, contains, row_addrs,
    iter_runs, into_addr_iter, serialize_into, etc.) or inflates to
    Partial via the private into_partial_bitmap helper for ops not
    yet range-aware (insert, remove, BitOr/BitAnd/Sub, FromIterator,
    Extend). All 97 existing mask tests pass unchanged.

  * On-disk format is unchanged: serialize_into inflates Runs to its
    equivalent bitmap before writing, so readers built against older
    versions continue to load. deserialize_from always yields Partial.

  * Hot paths use itertools::Either rather than Box<dyn Iterator> so
    the new variant adds no dyn-dispatch cost to the existing Partial
    iteration path. Verified by criterion: into_addr_iter at 10M rows
    is 19.4 ms before and 19.9 ms after (within noise).

Benchmark deltas (single contiguous run, vs the pre-existing APIs
documented in commit 1b9d7c0):

  Producer (insert one run of N rows):
                          insert_range  insert_run    speedup
       N = 10K                  54 ns      31 ns      1.8x
       N = 100K                 67 ns      31 ns      2.2x
       N = 1M                  543 ns      31 ns     17.7x
       N = 10M               6,499 ns      31 ns      210x

  Consumer (iterate selection of N rows):
                       into_addr_iter   iter_runs    speedup
       N = 10K              19,396 ns     6.3 ns     3,078x
       N = 100K            193,111 ns     6.4 ns    30,209x
       N = 1M            1,943,641 ns     6.3 ns   306,879x
       N = 10M          19,871,915 ns     6.3 ns 3,154,000x

  Many runs (1M total cardinality, K runs):
                          insert_range  insert_run   speedup
       K = 1                  608 ns       32 ns     19.2x
       K = 10                 827 ns      199 ns      4.2x
       K = 100              3,123 ns      769 ns      4.1x
       K = 1,000           28,416 ns    5,680 ns      5.0x
       K = 10,000         272,891 ns   49,155 ns      5.6x

11 new unit tests cover invariant preservation, mixed-variant set ops,
serialization round-trip, and degradation rules (insert into Partial
collapses to Partial, insert into Full is no-op).

filtered_read.rs gains a Runs arm in the existing FilteredReadPlan
consumer at line 1606 so callers wiring the new producer through that
path are not blocked.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions bot added the enhancement label May 18, 2026
Apply cargo fmt to the new Runs-variant code and address two clippy
findings:

* manual_let_else in BitAndAssign: convert the `Some(set) => set, None
  => continue` match into a `let ... else` (the retain pass above
  already guarantees the entry exists; the else arm is just a defensive
  skip).
* identity_op in test_iter_runs_mixed_variants: drop the stray `+ 0`
  in the second insert_range bound.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented May 18, 2026

Codecov Report

❌ Patch coverage is 79.15567% with 79 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance-core/src/utils/mask.rs | 80.00% | 67 Missing and 8 partials ⚠️ |
| rust/lance/src/io/exec/filtered_read.rs | 0.00% | 4 Missing ⚠️ |
