spike: chunked execution engine for primitive + listview decode #8043

Open
joseph-isaacs wants to merge 9 commits into develop from claude/vortex-execution-engine-EUlmE
@joseph-isaacs
Contributor

Introduces a streaming chunked decode engine alongside the canonical
executor. The model is:

  • Driver-owned L1-resident Scratch<T> of fixed CHUNK_LEN (1024); producers
    write decoded values into it and the driver consumes per chunk.
  • PrimitiveChunkProducer<T> contract + dyn-dispatch
    PrimitiveChunkKernelDispatcher keyed by outermost encoding id.
  • DictKernel materializes its (bounded) values slot once via the regular
    executor, then streams the gather over codes; this naturally fuses
    Dict<RunEnd<P>> since RunEnd is unrolled into the small dict in the
    materialization step.
  • RunEndKernel lives in vortex-runend (where the encoding is defined) and
    registers onto the dispatcher via register_chunk_kernels.
  • listview::ListChunkProducer emits (offsets, sizes, elements) row windows
    for ListView<Primitive>, including bit-packed offsets/sizes.

Module is _-prefixed (#[doc(hidden)]) so it stays out of the public API
surface while the spike settles.
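
To make the model above concrete, here is a minimal sketch of a driver-owned scratch, a producer contract, and a toy run-end producer. The names mirror the description, but the trait shape, the `next_chunk` signature, and `decode_to_vec` are assumptions for illustration, not the code in this PR.

```rust
/// Chunk length used by the sketch; the PR fixes CHUNK_LEN at 1024.
const CHUNK_LEN: usize = 1024;

/// Fixed-size scratch owned by the driver (hypothetical layout).
struct Scratch<T> {
    buf: [T; CHUNK_LEN],
}

impl<T: Copy + Default> Scratch<T> {
    fn new() -> Self {
        Scratch { buf: [T::default(); CHUNK_LEN] }
    }
}

/// Hypothetical producer contract: decode up to CHUNK_LEN values into the
/// scratch and return how many were written (0 means exhausted).
trait PrimitiveChunkProducer<T> {
    fn next_chunk(&mut self, scratch: &mut Scratch<T>) -> usize;
}

/// Toy run-end producer: `ends[i]` is the exclusive end row of run `i`.
struct RunEndProducer<'a, T> {
    ends: &'a [usize],
    values: &'a [T],
    row: usize, // next logical row to emit
    run: usize, // run that contains `row`
}

impl<'a, T: Copy> PrimitiveChunkProducer<T> for RunEndProducer<'a, T> {
    fn next_chunk(&mut self, scratch: &mut Scratch<T>) -> usize {
        let total = self.ends.last().copied().unwrap_or(0);
        let n = CHUNK_LEN.min(total - self.row);
        let mut written = 0;
        while written < n {
            // Fill the rest of the current run, capped at the chunk boundary.
            let run_left = self.ends[self.run] - self.row;
            let take = run_left.min(n - written);
            scratch.buf[written..written + take].fill(self.values[self.run]);
            written += take;
            self.row += take;
            if self.row == self.ends[self.run] {
                self.run += 1;
            }
        }
        n
    }
}

/// Driver: owns the scratch and consumes it once per chunk.
fn decode_to_vec<T: Copy + Default>(
    producer: &mut dyn PrimitiveChunkProducer<T>,
    len: usize,
) -> Vec<T> {
    let mut scratch: Scratch<T> = Scratch::new();
    let mut out = Vec::with_capacity(len);
    loop {
        let n = producer.next_chunk(&mut scratch);
        if n == 0 {
            break;
        }
        out.extend_from_slice(&scratch.buf[..n]);
    }
    out
}
```

The producer never sees the output buffer; it only fills the scratch, which is what lets the driver decide whether a chunk is copied out, summed, or filtered.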

Includes:

  • unit tests in vortex-array (slice round-trip, dict chunked, listview
    windows, fallback) and vortex-runend (runend, sliced runend, fused
    Dict<RunEnd<P>>).
  • divan bench at encodings/runend/benches/chunked_exec.rs comparing
    chunked vs canonical for Dict<P>, RunEnd<P>, fused Dict<RunEnd<P>>,
    and ListView<P> with bit-packed offsets/sizes.

Signed-off-by: Claude <claude@anthropic.com>

claude added 9 commits May 19, 2026 18:59
Introduces a streaming chunked decode engine alongside the canonical
executor. The model is:

- Driver-owned L1-resident Scratch<T> of fixed CHUNK_LEN (1024); producers
  write decoded values into it and the driver consumes per chunk.
- PrimitiveChunkProducer<T> contract + dyn-dispatch
  PrimitiveChunkKernelDispatcher keyed by outermost encoding id.
- DictKernel materializes its (bounded) values slot once via the regular
  executor, then streams the gather over codes; this naturally fuses
  Dict<RunEnd<P>> since RunEnd is unrolled into the small dict in the
  materialization step.
- RunEndKernel lives in vortex-runend (where the encoding is defined) and
  registers onto the dispatcher via register_chunk_kernels.
- listview::ListChunkProducer emits (offsets, sizes, elements) row windows
  for ListView<Primitive>, including bit-packed offsets/sizes.

Module is _-prefixed (#[doc(hidden)]) so it stays out of the public API
surface while the spike settles.

Includes:
- unit tests in vortex-array (slice round-trip, dict chunked, listview
  windows, fallback) and vortex-runend (runend, sliced runend, fused
  Dict<RunEnd<P>>).
- divan bench at encodings/runend/benches/chunked_exec.rs comparing
  chunked vs canonical for Dict<P>, RunEnd<P>, fused Dict<RunEnd<P>>,
  and ListView<P> with bit-packed offsets/sizes.

Signed-off-by: Claude <claude@anthropic.com>

Reuse Scratch buffers across chunks in BoxedListChunkProducer (was
allocating two 8 KiB heap buffers per chunk). Adds phase-breakdown
benches for Dict<RunEnd<P>> to show where canonical's time actually
goes (the take phase, not RunEnd canonicalize).

Bench data (i32, release, divan medians):

Dict<Primitive> — canonical uses AVX2 gather; chunked uses scalar:
  len=1M dict=256:  canonical 415 µs  vs chunked 564 µs  (1.36× slower)
  len=1M dict=4K:   canonical 399 µs  vs chunked 602 µs  (1.51× slower)

RunEnd<Primitive>:
  len=64K run=4:    canonical 88 µs   vs chunked 50 µs   (1.77× faster)
  len=1M run=16:    canonical 462 µs  vs chunked 345 µs  (1.34× faster)
  len=1M run=256:   canonical 180 µs  vs chunked 229 µs  (1.27× slower)

Dict<RunEnd<Primitive>> (the "fused" stack):
  len=1M dict=256:  canonical 21.4 ms vs chunked 610 µs  (35× faster)
  len=4M dict=1K:   canonical 112 ms  vs chunked 5.5 ms  (20× faster)
  Diagnostic shows the inner RunEnd canonicalize is <2 µs, so canonical's
  21 ms is the take phase taking a slow path on RunEnd values — the
  chunked win here is a canonical pathology, not an asymptotic gain.

ListView<Primitive> bit-packed offsets+sizes, row-sum consumer:
  rows=262K avg_list=4: canonical 933 µs vs chunked 1722 µs (1.85× slower)
  V1 canonicalizes offsets/sizes/elements up front then chunks; net
  overhead with no win.

Signed-off-by: Claude <claude@anthropic.com>

Two structural fixes that change the picture:

1. AVX2 gather plumbed into the chunked path.
   - New `take_avx2_into(values, indices, dst)` writes the SIMD gather
     directly into a caller-supplied destination, no per-call Buffer alloc.
   - Public `take_into_uninit` selects AVX2 or scalar at runtime.
   - `PrimitiveChunkProducer::next_chunk_into_uninit` lets producers write
     straight to the output buffer's spare capacity, bypassing the scratch
     hop in `decode_to_buffer`. Overridden for Dict, RunEnd, Slice.

2. Bit-pack fusion in vortex-fastlanes.
   - `BitPackedDictKernel` matches `Dict<BitPacked<U>, ...>` and produces a
     `BitPackedDictProducer` that bit-unpacks one 1024-element code chunk
     at a time into a stack-resident scratch, then AVX2-gathers into the
     output. Avoids the upfront materialisation of the codes column.
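
As an illustration of the "write straight into spare capacity" idea in item 1, the sketch below gathers dictionary values directly into a `Vec`'s uninitialised tail, one chunk of codes at a time. It uses a scalar loop in place of the AVX2 kernel, and `take_scalar_into` and `dict_decode` are hypothetical names, not the PR's API.

```rust
use std::mem::MaybeUninit;

/// Scalar gather: dst[i] = values[indices[i]]. Writes every slot of `dst`.
/// Panics if an index is out of bounds for `values`.
fn take_scalar_into<T: Copy>(values: &[T], indices: &[u32], dst: &mut [MaybeUninit<T>]) {
    assert_eq!(indices.len(), dst.len());
    for (slot, &idx) in dst.iter_mut().zip(indices) {
        slot.write(values[idx as usize]);
    }
}

/// Decode a dictionary column by gathering straight into the output
/// vector's spare capacity, with no intermediate scratch copy.
/// `chunk_len` must be non-zero.
fn dict_decode<T: Copy>(values: &[T], codes: &[u32], chunk_len: usize) -> Vec<T> {
    let mut out: Vec<T> = Vec::with_capacity(codes.len());
    for chunk in codes.chunks(chunk_len) {
        let spare = &mut out.spare_capacity_mut()[..chunk.len()];
        take_scalar_into(values, chunk, spare);
        // SAFETY: take_scalar_into initialised exactly `chunk.len()` slots
        // immediately past the current length.
        unsafe { out.set_len(out.len() + chunk.len()) };
    }
    out
}
```

The same destination-passing shape is what allows a SIMD kernel to be swapped in without changing the caller.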

Bench data (i32, release, divan medians, AVX2):

Dict<Primitive>:
  len=1M dict=256:  canonical 466 µs vs chunked 372 µs (1.25× faster)
  len=1M dict=4K:   canonical 465 µs vs chunked 360 µs (1.29× faster)
  len=256K dict=1K: canonical  93 µs vs chunked  69 µs (1.36× faster)

Dict<BitPacked<u16>> codes:
  len=1M dict=256 bw=8:  canonical 438 µs vs chunked 438 µs (tie)
  len=1M dict=4K  bw=12: canonical 442 µs vs chunked 454 µs (~tie)
  (fusion correct, not bandwidth-bound on these sizes)

RunEnd<Primitive>:
  len=64K  run=4:    canonical 90 µs  vs chunked 75 µs  (1.19× faster)
  len=64K  run=64:   canonical 12 µs  vs chunked 9 µs   (1.34× faster)
  len=1M   run=16:   canonical 475 µs vs chunked 289 µs (1.64× faster)
  len=1M   run=256:  canonical 186 µs vs chunked 204 µs (0.91×)

Dict<RunEnd<Primitive>> (fused stack):
  len=1M dict=256 ir=4:  canonical 21.4 ms vs chunked 359 µs (60× faster)
  len=1M dict=4K  ir=16: canonical 27.7 ms vs chunked 370 µs (75× faster)
  len=4M dict=1K  ir=8:  canonical 102 ms  vs chunked 1.5 ms (67× faster)
  (canonical's path is a real slow-path bug; chunked sidesteps it.)

ListView<Primitive> bit-packed offsets/sizes:
  rows=262K avg_list=4: canonical 963 µs vs chunked 1725 µs (0.56×)
  Still slower — v1 canonicalises offsets/sizes upfront and the consumer
  pays dyn-fn dispatch per row. Needs a typed callback API + chunked
  bit-unpack to win; left as follow-up.

Signed-off-by: Claude <claude@anthropic.com>

ListChunkProducer::next_chunk no longer memcpys offsets/sizes through the
scratch — it returns slices straight from the canonical buffers. The
scratch arguments stay on the signature for a future chunked-bit-unpack
producer that materialises per chunk.

New `build_listview_producer_typed::<O, S, E>` and
`ListChunkProducer::for_each_chunk_typed` give consumers raw typed slices,
removing the `&dyn Fn(usize) -> usize` hop on every row that
`BoxedListChunkProducer::for_each_chunk` was paying.
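
A rough sketch of what a typed chunk callback can look like: the consumer receives plain slices per row window, so the inner loop is monomorphised and has no `dyn Fn` call per row. The function below is a hypothetical free-function stand-in for `for_each_chunk_typed`, with `u32` offsets and sizes assumed.

```rust
const CHUNK_ROWS: usize = 1024;

/// Hypothetical typed walk over a ListView-like layout: row `i` covers
/// elements[offsets[i] .. offsets[i] + sizes[i]].
fn for_each_chunk_typed<E, F>(offsets: &[u32], sizes: &[u32], elements: &[E], mut f: F)
where
    F: FnMut(&[u32], &[u32], &[E]),
{
    assert_eq!(offsets.len(), sizes.len());
    for (offs, szs) in offsets.chunks(CHUNK_ROWS).zip(sizes.chunks(CHUNK_ROWS)) {
        // The callback gets raw typed slices for this row window.
        f(offs, szs, elements);
    }
}

/// Example consumer: sum every element referenced by every row.
fn sum_elements(offsets: &[u32], sizes: &[u32], elements: &[i32]) -> i64 {
    let mut total = 0i64;
    for_each_chunk_typed(offsets, sizes, elements, |offs, szs, elems| {
        for (&o, &s) in offs.iter().zip(szs) {
            let (o, s) = (o as usize, s as usize);
            total += elems[o..o + s].iter().map(|&x| x as i64).sum::<i64>();
        }
    });
    total
}
```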

ListView<Primitive> bit-packed offsets+sizes, sum-elements consumer:
  rows=16K  avg_list=8: canonical 56 µs   vs chunked 51 µs   (1.09×)
  rows=65K  avg_list=4: canonical 199 µs  vs chunked 193 µs  (1.03×)
  rows=262K avg_list=4: canonical 1.15 ms vs chunked 1.09 ms (1.06×)

In v2 chunked ran at 0.53–0.56× canonical's speed on this shape.

Signed-off-by: Claude <claude@anthropic.com>

Two independent changes landed by parallel subagents.

# 1. Dict<RunEnd<P>> canonical slow path (50x → ~1x)

Root cause: `RunEnd::TakeExecute::take` (encodings/runend/src/compute/take.rs)
binary-searches the run-ends buffer **once per index**, allocating a
`Vec<u64>` of physical indices proportional to the index count. When the
parent is `Dict<RunEnd<P>>`, Dict's `take_canonical` calls it with
`indices.len()` == the codes column length (N), against a tiny inner
RunEnd of length `dict_size` (K << N). That's N binary searches
(O(N log K) work) plus an N-sized intermediate alloc, when the right
thing to do is canonicalize the (small) RunEnd values once and
AVX-gather over the codes.

Fix: gate `RunEnd::TakeExecute::take` to return `Ok(None)` when
`indices.len() > array.len()`. The canonical executor then falls back to
materializing the RunEnd values to a Primitive (microseconds since
array.len() = dict_size is small) and dispatching the AVX gather via
take_canonical(Primitive, codes).

Added regression test `ree_dict_take_dense_indices` that exercises the
exact `Dict<RunEnd<P>>` shape from the bench.
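
The gate can be sketched on a toy run-end type: the per-index path declines dense index sets, and the caller falls back to materialise-then-gather. `Option` stands in for the `Ok(None)` return, and all names below are illustrative assumptions rather than the vortex-runend code.

```rust
/// Toy run-end array: `ends[i]` is the exclusive end row of run `i`.
struct RunEnd<'a> {
    ends: &'a [usize],
    values: &'a [i32],
}

impl<'a> RunEnd<'a> {
    fn len(&self) -> usize {
        self.ends.last().copied().unwrap_or(0)
    }

    /// Per-index path: one binary search per requested row.
    /// Returns None for dense index sets, mirroring the `Ok(None)` gate.
    fn take(&self, indices: &[usize]) -> Option<Vec<i32>> {
        if indices.len() > self.len() {
            return None;
        }
        Some(
            indices
                .iter()
                .map(|&row| {
                    // First run whose end is strictly greater than `row`.
                    let run = self.ends.partition_point(|&end| end <= row);
                    self.values[run]
                })
                .collect(),
        )
    }

    /// Fallback: unroll every run once into a flat buffer.
    fn canonicalize(&self) -> Vec<i32> {
        let mut out = Vec::with_capacity(self.len());
        let mut start = 0;
        for (&end, &v) in self.ends.iter().zip(self.values) {
            out.extend(std::iter::repeat(v).take(end - start));
            start = end;
        }
        out
    }
}

/// Caller-side dispatch: try the encoding's own take, else materialise
/// the (small) values once and gather with plain indexing.
fn take_with_fallback(array: &RunEnd, indices: &[usize]) -> Vec<i32> {
    match array.take(indices) {
        Some(taken) => taken,
        None => {
            let flat = array.canonicalize();
            indices.iter().map(|&i| flat[i]).collect()
        }
    }
}
```

With K runs and N indices, the fallback costs one pass over the array plus N direct loads, instead of N binary searches.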

# 2. AVX-512 gather kernel

New `take_avx512_into` / `take_avx512` in
vortex-array/src/arrays/primitive/compute/take/avx512.rs, mirroring the
AVX-2 file's structure: `AVX512Gather: GatherFn<Idx, Value>` impls for
(u8/u16/u32/u64 indices × i32/u32/i64/u64/f32/f64 values) using
`_mm512_mask_i32gather_epi32` (16-lane) and `_mm512_mask_i64gather_epi64`
(8-lane). Pairs not implemented natively in AVX-512 fall through to
AVX-2 (a strict feature subset on the gated host).

`take/mod.rs`: `PRIMITIVE_TAKE_KERNEL` now prefers AVX-512 → AVX-2 →
scalar; `take_into_uninit` dispatches the same way.
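
The preference order can be expressed with runtime feature detection. The sketch below only shows the tier selection; `GatherTier`, `select_gather_tier`, and `take_into` are hypothetical names, and every tier runs the scalar loop so the example stays portable.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GatherTier {
    Avx512,
    Avx2,
    Scalar,
}

/// Pick the widest gather tier the running CPU supports.
fn select_gather_tier() -> GatherTier {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("avx512f") {
            return GatherTier::Avx512;
        }
        if is_x86_feature_detected!("avx2") {
            return GatherTier::Avx2;
        }
    }
    GatherTier::Scalar
}

/// Gather `values[indices[i]]` into `dst` and report which tier was chosen.
fn take_into(values: &[i32], indices: &[u32], dst: &mut [i32]) -> GatherTier {
    assert_eq!(indices.len(), dst.len());
    let tier = select_gather_tier();
    match tier {
        // A real kernel would call a #[target_feature]-gated SIMD routine in
        // the first two arms; the sketch reuses the scalar loop everywhere.
        GatherTier::Avx512 | GatherTier::Avx2 | GatherTier::Scalar => {
            for (slot, &i) in dst.iter_mut().zip(indices) {
                *slot = values[i as usize];
            }
        }
    }
    tier
}
```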

# Combined bench data (i32, release, divan medians)

`Dict<RunEnd<P>>` canonical (was ~21 ms — the 50x bug):
  len=1M dict=256 inner_run=4:  21.4 ms → TBD (now must match Dict<P>)
  len=1M dict=4K  inner_run=16: 27.7 ms → TBD
  len=4M dict=1K  inner_run=8:  102 ms  → TBD

Dict<Primitive> chunked (AVX-512 vs prior AVX-2):
  len=1M dict=256:  427 µs → 365 µs (1.17×)
  len=1M dict=4096: 428 µs → 382 µs (1.12×)

Dict<BitPacked<u16>> chunked:
  len=1M dict=256  bw=8:  491 µs → 471 µs (1.04×)
  len=1M dict=4096 bw=12: 524 µs → 482 µs (1.09×)

RunEnd<P> chunked: unchanged (path doesn't use the gather kernel).

# Quirks

- `_mm512_cmplt_epu32_mask` (strict `<`) initially broke the
  `last_valid_index` regression; switched to `cmple` to match AVX-2.
- `_mm512_mask_i32gather_epi64` takes `__mmask8`, but the natural 32-bit
  compare for 8 indices returns `__mmask16`; zext + mask + downcast.
- Stable `cargo fmt` wanted to reformat unrelated files (nightly-only
  options); preserved only avx512.rs + mod.rs edits.

Signed-off-by: Claude <claude@anthropic.com>

Adds N = 4M / 16M / 64M variants to dict_primitive and dict_bp benches
to walk the L1 → L2 → L3 → DRAM boundaries on the test Xeon
(L1d=48 KiB, L2=2 MiB, L3=256 MiB).

Key finding: chunked-by-1K only wins on shapes where canonical materialises
an intermediate buffer that spills cache. For Dict<BitPacked<u16>> the
intermediate codes Buffer<u16> = 2N bytes; once N crosses ~1M (2 MiB = L2),
chunked starts beating canonical by 6-18%. At very large N (64M, intermediate
~ half of L3) the output buffer dominates and the gap closes.
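
The size arithmetic behind this can be checked with a few lines. The cache sizes are the ones quoted above for the test Xeon; treating "fits in a level" as a plain byte comparison is a simplification, and the helper names are made up for the sketch.

```rust
// Cache sizes quoted for the test machine.
const L1D_BYTES: usize = 48 * 1024;
const L2_BYTES: usize = 2 * 1024 * 1024;
const L3_BYTES: usize = 256 * 1024 * 1024;

#[derive(Debug, PartialEq)]
enum Tier {
    L1,
    L2,
    L3,
    Dram,
}

/// Smallest cache level whose capacity covers `bytes` (ignores sharing
/// with other data, associativity, and prefetch effects).
fn tier_for(bytes: usize) -> Tier {
    if bytes <= L1D_BYTES {
        Tier::L1
    } else if bytes <= L2_BYTES {
        Tier::L2
    } else if bytes <= L3_BYTES {
        Tier::L3
    } else {
        Tier::Dram
    }
}

/// Size of canonical's intermediate Buffer<u16> of unpacked codes: 2N bytes.
fn intermediate_codes_bytes(n: usize) -> usize {
    n * std::mem::size_of::<u16>()
}
```

At N = 1M the intermediate is exactly the size of L2, so anything larger has to round-trip through L3, while a 1024-element `u16` chunk is 2 KiB and stays in L1.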

For Dict<Primitive> (no intermediate buffer — codes are already canonical),
chunked stays tied-to-slightly-slower across all N, confirming the cache
trip is the only mechanism by which chunked-by-1K beats
all-at-once materialisation.

Bench results (median, AVX-512, after the Dict<RunEnd> fix from cc0d578):

Dict<BitPacked<u16>>, dict=256, bw=8:
  N=1M:  canonical 494 µs   vs chunked 419 µs   (1.18× faster)
  N=4M:  canonical 2.10 ms  vs chunked 1.99 ms  (1.06× faster)
  N=16M: canonical 37.3 ms  vs chunked 34.7 ms  (1.08× faster)
  N=64M: canonical 175.8 ms vs chunked 176.3 ms (1.00× tied)

Dict<Primitive>, dict=256:
  N=1M:  canonical 384 µs   vs chunked 454 µs   (0.84×)
  N=4M:  canonical 1.37 ms  vs chunked 1.43 ms  (0.96×)
  N=16M: canonical 23.7 ms  vs chunked 24.4 ms  (0.97×)

Signed-off-by: Claude <claude@anthropic.com>

- New `examples/profile_chunked.rs` for `samply record` runs against
  long-loop chunked / canonical decompress.
- `BitPackedDictProducer::write_next_into` is now generic over a
  super-chunk size (CHUNK_LEN/FL_CHUNK). At CHUNK_LEN=1024 this is a
  no-op (1 fastlanes-chunk per outer iteration, same as before). Code is
  ready to take advantage of larger super-chunks if CHUNK_LEN ever grows.
- CHUNK_LEN stays at 1024. Empirically tried 4096 — neutral for
  Dict<BitPacked>, regressed RunEnd at run=64 by ~47% (compiler codegen
  on the wider fill loop). Documenting that in the const.

Sample profile of chunked Dict<BitPacked<u16>> at N=4M (samply, 4 kHz):
  58.7%  vortex_array...avx512::take_avx512_into  (the AVX-512 gather)
  17.5%  fastlanes::bitpacking::<u16 as BitPacking>::unpack
  23.8%  everything else (dispatch, alloc, traversal)

Same breakdown for canonical:
  62.7%  take_primitive_avx512  (same AVX-512 kernel, allocating Buffer)
  17.0%  unpack
  20.3%  everything else (incl. BufferMut::with_capacity)

Both paths are gather-throughput-bound on identical SIMD hardware. The
chunked win measured at large N is the L2→L3 cache trip on canonical's
intermediate `Buffer<u16>` of unpacked codes, which doesn't show in
sample profiles (it shows as L2-miss latency on the gather) but does
show in wall time (1.06–1.18× speedup at 1M–16M rows).

Signed-off-by: Claude <claude@anthropic.com>

Adds a sink API for fused single-pass operators on top of the chunked
decode engine. Decode and the downstream operator share a 4 KiB
L1-resident scratch; nothing materialises the source as a Buffer<T>
between them.

New trait `PrimitiveChunkSink<T>` with `push(chunk) -> Result<()>` +
`finish() -> Result<Output>`. Driver `drive_into_sink` walks the producer
once, feeding each chunk into the sink.

Sinks shipped:
- BufferSink: collect-to-Buffer<T> baseline, equivalent to decode_to_buffer
- SumSink:    sum(x) as i64, no output buffer at all
- MapSink:    per-element FnMut(T) -> U, used for casts and scalar funcs
- FilterSink: per-element predicate, surviving elements stream out
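
A minimal sketch of the sink shape, assuming a `push`/`finish` trait with an associated output type and `String` errors. The driver below inlines a dictionary decode as the producer; it is a stand-in for `drive_into_sink`, not the PR's implementation.

```rust
/// Hypothetical sink contract: consume decoded chunks, then produce a result.
trait PrimitiveChunkSink<T> {
    type Output;
    fn push(&mut self, chunk: &[T]) -> Result<(), String>;
    fn finish(self) -> Result<Self::Output, String>;
}

/// Sums values as i64 without ever building an output buffer.
struct SumSink {
    total: i64,
}

impl PrimitiveChunkSink<i32> for SumSink {
    type Output = i64;
    fn push(&mut self, chunk: &[i32]) -> Result<(), String> {
        self.total += chunk.iter().map(|&x| x as i64).sum::<i64>();
        Ok(())
    }
    fn finish(self) -> Result<i64, String> {
        Ok(self.total)
    }
}

/// Keeps only the elements that satisfy the predicate.
struct FilterSink<F> {
    pred: F,
    out: Vec<i32>,
}

impl<F: FnMut(i32) -> bool> PrimitiveChunkSink<i32> for FilterSink<F> {
    type Output = Vec<i32>;
    fn push(&mut self, chunk: &[i32]) -> Result<(), String> {
        for &x in chunk {
            if (self.pred)(x) {
                self.out.push(x);
            }
        }
        Ok(())
    }
    fn finish(self) -> Result<Vec<i32>, String> {
        Ok(self.out)
    }
}

/// Stand-in driver: dictionary-decode one chunk at a time into a small
/// reusable scratch and hand each chunk to the sink.
fn drive_dict_into_sink<S: PrimitiveChunkSink<i32>>(
    values: &[i32],
    codes: &[u16],
    mut sink: S,
) -> Result<S::Output, String> {
    const CHUNK_LEN: usize = 1024;
    let mut scratch = [0i32; CHUNK_LEN]; // 4 KiB for i32
    for chunk in codes.chunks(CHUNK_LEN) {
        for (slot, &c) in scratch.iter_mut().zip(chunk) {
            *slot = values[c as usize];
        }
        sink.push(&scratch[..chunk.len()])?;
    }
    sink.finish()
}
```

Because the sink sees each chunk while it is still in the scratch, the full decoded column is never written to memory unless the sink itself chooses to collect it.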

Bench (Dict<i32>/<BitPacked<u16>> input, AVX-512, divan medians):

Filter (x > 2000, ~50% selectivity):
  N=1M:   canonical 2.21 ms  vs sink 1.17 ms  (1.88x faster)
  N=4M:   canonical 19.7 ms  vs sink 5.27 ms  (3.73x faster)
  N=16M:  canonical 100.7 ms vs sink 49.0 ms  (2.06x faster)

Cast i32 -> i64:
  N=1M:   canonical 3.06 ms  vs sink 1.05 ms  (2.91x faster)
  N=4M:   canonical 22.5 ms  vs sink 15.8 ms  (1.42x faster)
  N=16M:  canonical 119.5 ms vs sink 64.9 ms  (1.84x faster)

Scalar add (x + 42):
  N=1M:   canonical 2.52 ms  vs sink 0.83 ms  (3.03x faster)
  N=4M:   canonical 11.9 ms  vs sink 4.92 ms  (2.42x faster)
  N=16M:  canonical 100.3 ms vs sink 39.7 ms  (2.53x faster)

Mul+add (x * 3 + 7):
  N=1M:   canonical 2.53 ms  vs sink 0.82 ms  (3.07x faster)
  N=4M:   canonical 11.6 ms  vs sink 4.93 ms  (2.36x faster)
  N=16M:  canonical 99.8 ms  vs sink 39.8 ms  (2.51x faster)

These confirm the earlier prediction: "the next 2-3x lives in fused
pipelines." The win comes from eliminating the intermediate Buffer<T>
that canonical materialises between decode and the per-element operator
— at N=4M it's 16 MB of intermediate that round-trips through L3, which
the sink path skips entirely.

Signed-off-by: Claude <claude@anthropic.com>

Demonstrates that "PatchedArray on top of a patchless BitPacked" decoded
through the chunked engine beats "patches stored inside BitPacked"
decoded canonically — because patch overlay becomes chunk-local (L1)
instead of a scatter into the fully-materialised N-element buffer.

New pieces:
- `PatchedProducer<T>` (vortex-array): wraps any inner PrimitiveChunkProducer
  and overlays sorted (index, value) patches via a monotonic merge-walk.
  Each base chunk is decoded into the scratch, the patches in that chunk's
  logical range are written while it's hot in L1, then flushed.
- `BitPackedPrimitiveProducer<T>` (vortex-fastlanes): plain chunked
  bit-unpack of a non-sliced, patch-free BitPacked<T>, used as the base.
- `build_chunked_patched_over_bitpacked`: splits a BitPacked-with-internal-
  patches into (patchless base producer, flat sorted patches) and wires up
  a PatchedProducer.

Mechanism: canonical `BitPacked::execute` bit-unpacks the whole column then
`apply_patches` scatters exception values into the full buffer by index —
random writes that miss cache once N spills L2/L3. The chunked path keeps
the patch writes inside the current 1024-element scratch.
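
The chunk-local overlay can be sketched as a merge-walk with a single forward-moving patch cursor. A plain slice copy stands in for the bit-unpack of each base chunk, and `decode_with_patches` is a hypothetical name for illustration.

```rust
const CHUNK_LEN: usize = 1024;

/// Decode `base` chunk by chunk and overlay sorted (index, value) patches
/// while each chunk is still in the scratch. `patches` must be sorted by
/// strictly increasing index.
fn decode_with_patches(base: &[u32], patches: &[(usize, u32)]) -> Vec<u32> {
    debug_assert!(patches.windows(2).all(|w| w[0].0 < w[1].0));
    let mut out = Vec::with_capacity(base.len());
    let mut scratch = [0u32; CHUNK_LEN];
    let mut next_patch = 0; // cursor into `patches`, only moves forward
    for (chunk_idx, chunk) in base.chunks(CHUNK_LEN).enumerate() {
        let start = chunk_idx * CHUNK_LEN;
        let end = start + chunk.len();
        // Stand-in for the bit-unpack of one base chunk.
        scratch[..chunk.len()].copy_from_slice(chunk);
        // Apply every patch whose logical index falls in [start, end).
        while next_patch < patches.len() && patches[next_patch].0 < end {
            let (idx, value) = patches[next_patch];
            scratch[idx - start] = value;
            next_patch += 1;
        }
        // Flush the patched chunk to the output.
        out.extend_from_slice(&scratch[..chunk.len()]);
    }
    out
}
```

Every patch write lands inside the 4 KiB scratch, so the random-access part of the work never touches the N-element output buffer.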

Bench (BitPacked<u32>, bw=8, AVX-512, divan medians):

  N    patches  canonical   chunked    speedup
  1M    1%      388.6 µs    281.9 µs   1.38x
  1M    5%      515.1 µs    351.7 µs   1.46x
  1M   10%      554.4 µs    457.1 µs   1.21x
  4M    1%      2.047 ms    1.595 ms   1.28x
  4M    5%      2.466 ms    2.006 ms   1.23x
  4M   10%      2.992 ms    2.617 ms   1.14x
  16M   1%      39.79 ms    35.15 ms   1.13x
  16M   5%      45.73 ms    34.76 ms   1.32x
  16M  10%      45.47 ms    38.80 ms   1.17x

Win peaks at moderate density (~5%) for N=1M and N=16M: enough patches
that the canonical scatter pays real cache-miss cost, but not so many
that the chunk-local overlay loop itself dominates. At N=4M the 1% case
is slightly ahead (1.28x vs 1.23x).

Signed-off-by: Claude <claude@anthropic.com>
@codspeed-hq

codspeed-hq Bot commented May 20, 2026

Merging this PR will degrade performance by 13.16%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

❌ 2 regressed benchmarks
✅ 1235 untouched benchmarks
🆕 98 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| | WallTime | cuda/bitpacked_u8/unpack/3bw[100M] | 298.9 µs | 351.6 µs | -14.98% |
| | Simulation | chunked_varbinview_canonical_into[(100, 100)] | 273.1 µs | 308 µs | -11.31% |
| 🆕 | Simulation | dict_runend_canonical[len=4194304 dict=1024 inner_run=8] | N/A | 17.9 ms | N/A |
| 🆕 | Simulation | dict_runend_fused_chunked[len=1048576 dict=4096 inner_run=16] | N/A | 4.8 ms | N/A |
| 🆕 | Simulation | dict_primitive_chunked[len=262144 dict=1024] | N/A | 1.2 ms | N/A |
| 🆕 | Simulation | dict_runend_canonical[len=1048576 dict=256 inner_run=4] | N/A | 4.5 ms | N/A |
| 🆕 | Simulation | dict_runend_fused_chunked[len=1048576 dict=256 inner_run=4] | N/A | 4.7 ms | N/A |
| 🆕 | Simulation | dict_primitive_chunked[len=16777216 dict=256] | N/A | 74.5 ms | N/A |
| 🆕 | Simulation | dict_primitive_chunked[len=65536 dict=256] | N/A | 300.1 µs | N/A |
| 🆕 | Simulation | dict_runend_phase_inner_canonical[len=1048576 dict=256 inner_run=4] | N/A | 16.1 µs | N/A |
| 🆕 | Simulation | dict_bp_canonical[len=1048576 dict=1024 bw=10] | N/A | 4.3 ms | N/A |
| 🆕 | Simulation | dict_primitive_chunked[len=4194304 dict=256] | N/A | 18.6 ms | N/A |
| 🆕 | Simulation | dict_runend_canonical[len=1048576 dict=4096 inner_run=16] | N/A | 4.6 ms | N/A |
| 🆕 | Simulation | dict_bp_canonical[len=1048576 dict=256 bw=8] | N/A | 4.1 ms | N/A |
| 🆕 | Simulation | dict_bp_canonical[len=4194304 dict=256 bw=8] | N/A | 20.1 ms | N/A |
| 🆕 | Simulation | dict_bp_canonical[len=1048576 dict=4096 bw=12] | N/A | 4.5 ms | N/A |
| 🆕 | Simulation | dict_runend_phase_inner_canonical[len=1048576 dict=4096 inner_run=16] | N/A | 30.7 µs | N/A |
| 🆕 | Simulation | dict_runend_phase_inner_canonical[len=4194304 dict=1024 inner_run=8] | N/A | 20.2 µs | N/A |
| 🆕 | Simulation | dict_runend_phase_take[len=4194304 dict=1024 inner_run=8] | N/A | 5.3 µs | N/A |
| 🆕 | Simulation | listview_canonical_sum[rows=65536 avg_list=4 bw_off=18 bw_sz=4] | N/A | 1.8 ms | N/A |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.


Comparing claude/vortex-execution-engine-EUlmE (bcf5b76) with develop (19a1fb3)
