spike: chunked execution engine for primitive + listview decode by joseph-isaacs · Pull Request #8043 · vortex-data/vortex

joseph-isaacs · 2026-05-20T18:33:27Z

Introduces a streaming chunked decode engine alongside the canonical
executor. The model is:

Driver-owned L1-resident Scratch of fixed CHUNK_LEN (1024); producers
write decoded values into it and the driver consumes per chunk.
PrimitiveChunkProducer contract + dyn-dispatch
PrimitiveChunkKernelDispatcher keyed by outermost encoding id.
DictKernel materializes its (bounded) values slot once via the regular
executor, then streams the gather over codes; this naturally fuses
Dict<RunEnd
> since RunEnd is unrolled into the small dict in the
materialization step.
RunEndKernel lives in vortex-runend (where the encoding is defined) and
registers onto the dispatcher via register_chunk_kernels.
listview::ListChunkProducer emits (offsets, sizes, elements) row windows
for ListView, including bit-packed offsets/sizes.

Module is _-prefixed (#[doc(hidden)]) so it stays out of the public API
surface while the spike settles.

Includes:

unit tests in vortex-array (slice round-trip, dict chunked, listview
windows, fallback) and vortex-runend (runend, sliced runend, fused
Dict<RunEnd
>).
divan bench at encodings/runend/benches/chunked_exec.rs comparing
chunked vs canonical for Dict
, RunEnd
, fused Dict<RunEnd
>,
and ListView
with bit-packed offsets/sizes.

Signed-off-by: Claude claude@anthropic.com

Introduces a streaming chunked decode engine alongside the canonical executor. The model is: - Driver-owned L1-resident Scratch<T> of fixed CHUNK_LEN (1024); producers write decoded values into it and the driver consumes per chunk. - PrimitiveChunkProducer<T> contract + dyn-dispatch PrimitiveChunkKernelDispatcher keyed by outermost encoding id. - DictKernel materializes its (bounded) values slot once via the regular executor, then streams the gather over codes; this naturally fuses Dict<RunEnd> since RunEnd is unrolled into the small dict in the materialization step. - RunEndKernel lives in vortex-runend (where the encoding is defined) and registers onto the dispatcher via register_chunk_kernels. - listview::ListChunkProducer emits (offsets, sizes, elements) row windows for ListView<Primitive>, including bit-packed offsets/sizes. Module is _-prefixed (#[doc(hidden)]) so it stays out of the public API surface while the spike settles. Includes: - unit tests in vortex-array (slice round-trip, dict chunked, listview windows, fallback) and vortex-runend (runend, sliced runend, fused Dict<RunEnd>). - divan bench at encodings/runend/benches/chunked_exec.rs comparing chunked vs canonical for Dict, RunEnd, fused Dict<RunEnd>, and ListView with bit-packed offsets/sizes. Signed-off-by: Claude <claude@anthropic.com>

Reuse Scratch buffers across chunks in BoxedListChunkProducer (was allocating two 8 KiB heap buffers per chunk). Adds phase-breakdown benches for Dict<RunEnd> to show where canonical's time actually goes (the take phase, not RunEnd canonicalize). Bench data (i32, release, divan medians): Dict<Primitive> — canonical uses AVX2 gather; chunked uses scalar: len=1M dict=256: canonical 415 µs vs chunked 564 µs (1.36× slower) len=1M dict=4K: canonical 399 µs vs chunked 602 µs (1.51× slower) RunEnd<Primitive>: len=64K run=4: canonical 88 µs vs chunked 50 µs (1.77× faster) len=1M run=16: canonical 462 µs vs chunked 345 µs (1.34× faster) len=1M run=256: canonical 180 µs vs chunked 229 µs (0.79× slower) Dict<RunEnd<Primitive>> (the "fused" stack): len=1M dict=256: canonical 21.4 ms vs chunked 610 µs (35× faster) len=4M dict=1K: canonical 112 ms vs chunked 5.5 ms (20× faster) Diagnostic shows the inner RunEnd canonicalize is <2 µs, so canonical's 21 ms is the take phase taking a slow path on RunEnd values — the chunked win here is a canonical pathology, not an asymptotic gain. ListView<Primitive> bit-packed offsets+sizes, row-sum consumer: rows=262K avg_list=4: canonical 933 µs vs chunked 1722 µs (1.85× slower) V1 canonicalizes offsets/sizes/elements up front then chunks; net overhead with no win. Signed-off-by: Claude <claude@anthropic.com>

Two structural fixes that change the picture: 1. AVX2 gather plumbed into the chunked path. - New `take_avx2_into(values, indices, dst)` writes the SIMD gather directly into a caller-supplied destination, no per-call Buffer alloc. - Public `take_into_uninit` selects AVX2 or scalar at runtime. - `PrimitiveChunkProducer::next_chunk_into_uninit` lets producers write straight to the output buffer's spare capacity, bypassing the scratch hop in `decode_to_buffer`. Overridden for Dict, RunEnd, Slice. 2. Bit-pack fusion in vortex-fastlanes. - `BitPackedDictKernel` matches `Dict<BitPacked, ...>` and produces a `BitPackedDictProducer` that bit-unpacks one 1024-element code chunk at a time into a stack-resident scratch, then AVX2-gathers into the output. Avoids the upfront materialisation of the codes column. Bench data (i32, release, divan medians, AVX2): Dict<Primitive>: len=1M dict=256: canonical 466 µs vs chunked 372 µs (1.25× faster) len=1M dict=4K: canonical 465 µs vs chunked 360 µs (1.29× faster) len=256K dict=1K: canonical 93 µs vs chunked 69 µs (1.36× faster) Dict<BitPacked<u16>> codes: len=1M dict=256 bw=8: canonical 438 µs vs chunked 438 µs (tie) len=1M dict=4K bw=12: canonical 442 µs vs chunked 454 µs (~tie) (fusion correct, not bandwidth-bound on these sizes) RunEnd<Primitive>: len=64K run=4: canonical 90 µs vs chunked 75 µs (1.19× faster) len=64K run=64: canonical 12 µs vs chunked 9 µs (1.34× faster) len=1M run=16: canonical 475 µs vs chunked 289 µs (1.64× faster) len=1M run=256: canonical 186 µs vs chunked 204 µs (0.91×) Dict<RunEnd<Primitive>> (fused stack): len=1M dict=256 ir=4: canonical 21.4 ms vs chunked 359 µs (60× faster) len=1M dict=4K ir=16: canonical 27.7 ms vs chunked 370 µs (75× faster) len=4M dict=1K ir=8: canonical 102 ms vs chunked 1.5 ms (67× faster) (canonical's path is a real slow-path bug; chunked sidesteps it.) ListView<Primitive> bit-packed offsets/sizes: rows=262K avg_list=4: canonical 963 µs vs chunked 1725 µs (0.56×) Still slower — v1 canonicalises offsets/sizes upfront and the consumer pays dyn-fn dispatch per row. Needs a typed callback API + chunked bit-unpack to win; left as follow-up. Signed-off-by: Claude <claude@anthropic.com>

ListChunkProducer::next_chunk no longer memcpys offsets/sizes through the scratch — it returns slices straight from the canonical buffers. The scratch arguments stay on the signature for a future chunked-bit-unpack producer that materialises per chunk. New `build_listview_producer_typed::<O, S, E>` and `ListChunkProducer::for_each_chunk_typed` give consumers raw typed slices, removing the `&dyn Fn(usize) -> usize` hop on every row that `BoxedListChunkProducer::for_each_chunk` was paying. ListView<Primitive> bit-packed offsets+sizes, sum-elements consumer: rows=16K avg_list=8: canonical 56 µs vs chunked 51 µs (1.09×) rows=65K avg_list=4: canonical 199 µs vs chunked 193 µs (1.03×) rows=262K avg_list=4: canonical 1.15 ms vs chunked 1.09 ms (1.06×) was 0.53–0.56× slower in v2. Signed-off-by: Claude <claude@anthropic.com>

Two independent changes landed by parallel subagents. # 1. Dict<RunEnd> canonical slow path (50x → ~1x) Root cause: `RunEnd::TakeExecute::take` (encodings/runend/src/compute/take.rs) binary-searches the run-ends buffer **once per index**, allocating a `Vec<u64>` of physical indices proportional to the index count. When the parent is `Dict<RunEnd>`, Dict's `take_canonical` calls it with `indices.len()` == the codes column length (N), against a tiny inner RunEnd of length `dict_size` (K << N). That's N log K binary searches + an N-sized intermediate alloc, when the right thing to do is canonicalize the (small) RunEnd values once and AVX-gather over the codes. Fix: gate `RunEnd::TakeExecute::take` to return `Ok(None)` when `indices.len() > array.len()`. The canonical executor then falls back to materializing the RunEnd values to a Primitive (microseconds since array.len() = dict_size is small) and dispatching the AVX gather via take_canonical(Primitive, codes). Added regression test `ree_dict_take_dense_indices` that exercises the exact `Dict<RunEnd>` shape from the bench. # 2. AVX-512 gather kernel New `take_avx512_into` / `take_avx512` in vortex-array/src/arrays/primitive/compute/take/avx512.rs, mirroring the AVX-2 file's structure: `AVX512Gather: GatherFn<Idx, Value>` impls for (u8/u16/u32/u64 indices × i32/u32/i64/u64/f32/f64 values) using `_mm512_mask_i32gather_epi32` (16-lane) and `_mm512_mask_i64gather_epi64` (8-lane). Pairs not implemented natively in AVX-512 fall through to AVX-2 (a strict feature subset on the gated host). `take/mod.rs`: `PRIMITIVE_TAKE_KERNEL` now prefers AVX-512 → AVX-2 → scalar; `take_into_uninit` dispatches the same way. # Combined bench data (i32, release, divan medians) `Dict<RunEnd>` canonical (was ~21 ms — the 50x bug): len=1M dict=256 inner_run=4: 21.4 ms → TBD (now must match Dict) len=1M dict=4K inner_run=16: 27.7 ms → TBD len=4M dict=1K inner_run=8: 102 ms → TBD Dict<Primitive> chunked (AVX-512 vs prior AVX-2): len=1M dict=256: 427 µs → 365 µs (1.17×) len=1M dict=4096: 428 µs → 382 µs (1.12×) Dict<BitPacked<u16>> chunked: len=1M dict=256 bw=8: 491 µs → 471 µs (1.04×) len=1M dict=4096 bw=12: 524 µs → 482 µs (1.09×) RunEnd chunked: unchanged (path doesn't use the gather kernel). # Quirks - `_mm512_cmplt_epu32_mask` (strict `<`) initially broke the `last_valid_index` regression; switched to `cmple` to match AVX-2. - `_mm512_mask_i32gather_epi64` takes `__mmask8`, but the natural 32-bit compare for 8 indices returns `__mmask16`; zext + mask + downcast. - Stable `cargo fmt` wanted to reformat unrelated files (nightly-only options); preserved only avx512.rs + mod.rs edits. Signed-off-by: Claude <claude@anthropic.com>

Adds N = 4M / 16M / 64M variants to dict_primitive and dict_bp benches to walk the L1 → L2 → L3 → DRAM boundaries on the test Xeon (L1d=48 KiB, L2=2 MiB, L3=256 MiB). Key finding: chunked-by-1K only wins on shapes where canonical materialises an intermediate buffer that spills cache. For Dict<BitPacked<u16>> the intermediate codes Buffer<u16> = 2N bytes; once N crosses ~1M (2 MiB = L2), chunked starts beating canonical by 6-18%. At very large N (64M, intermediate ~ half of L3) the output buffer dominates and the gap closes. For Dict<Primitive> (no intermediate buffer — codes are already canonical), chunked stays tied-to-slightly-slower across all N, confirming the cache trip is the only mechanism by which chunked-by-1K wins on all-at-once-materialise. Bench results (median, AVX-512, after the Dict<RunEnd> fix from cc0d578): Dict<BitPacked<u16>>, dict=256, bw=8: N=1M: canonical 494 µs vs chunked 419 µs (1.18× faster) N=4M: canonical 2.10 ms vs chunked 1.99 ms (1.06× faster) N=16M: canonical 37.3 ms vs chunked 34.7 ms (1.08× faster) N=64M: canonical 175.8 ms vs chunked 176.3 ms (1.00× tied) Dict<Primitive>, dict=256: N=1M: canonical 384 µs vs chunked 454 µs (0.84×) N=4M: canonical 1.37 ms vs chunked 1.43 ms (0.96×) N=16M: canonical 23.7 ms vs chunked 24.4 ms (0.97×) Signed-off-by: Claude <claude@anthropic.com>

- New `examples/profile_chunked.rs` for `samply record` runs against long-loop chunked / canonical decompress. - `BitPackedDictProducer::write_next_into` is now generic over a super-chunk size (CHUNK_LEN/FL_CHUNK). At CHUNK_LEN=1024 this is a no-op (1 fastlanes-chunk per outer iteration, same as before). Code is ready to take advantage of larger super-chunks if CHUNK_LEN ever grows. - CHUNK_LEN stays at 1024. Empirically tried 4096 — neutral for Dict<BitPacked>, regressed RunEnd at run=64 by ~47% (compiler codegen on the wider fill loop). Documenting that in the const. Sample profile of chunked Dict<BitPacked<u16>> at N=4M (samply, 4 kHz): 58.7% vortex_array...avx512::take_avx512_into (the AVX-512 gather) 17.5% fastlanes::bitpacking::<u16 as BitPacking>::unpack 23.8% everything else (dispatch, alloc, traversal) Same breakdown for canonical: 62.7% take_primitive_avx512 (same AVX-512 kernel, allocating Buffer) 17.0% unpack 20.3% everything else (incl. BufferMut::with_capacity) Both paths are gather-throughput-bound on identical SIMD hardware. The chunked win measured at large N is the L2→L3 cache trip on canonical's intermediate `Buffer<u16>` of unpacked codes, which doesn't show in sample profiles (it shows as L2-miss latency on the gather) but does show in wall time (1.06–1.18× speedup at 1M–16M rows). Signed-off-by: Claude <claude@anthropic.com>

Adds a sink API for fused single-pass operators on top of the chunked decode engine. Decode and the downstream operator share a 4 KiB L1-resident scratch; nothing materialises the source as a Buffer<T> between them. New trait `PrimitiveChunkSink<T>` with `push(chunk) -> Result<()>` + `finish() -> Result<Output>`. Driver `drive_into_sink` walks the producer once, feeding each chunk into the sink. Sinks shipped: - BufferSink: collect-to-Buffer<T> baseline, equivalent to decode_to_buffer - SumSink: sum(x) as i64, no output buffer at all - MapSink: per-element FnMut(T) -> U, used for casts and scalar funcs - FilterSink: per-element predicate, surviving elements stream out Bench (Dict<i32>/<BitPacked<u16>> input, AVX-512, divan medians): Filter (x > 2000, ~50% selectivity): N=1M: canonical 2.21 ms vs sink 1.17 ms (1.88x faster) N=4M: canonical 19.7 ms vs sink 5.27 ms (3.73x faster) N=16M: canonical 100.7 ms vs sink 49.0 ms (2.06x faster) Cast i32 -> i64: N=1M: canonical 3.06 ms vs sink 1.05 ms (2.91x faster) N=4M: canonical 22.5 ms vs sink 15.8 ms (1.42x faster) N=16M: canonical 119.5 ms vs sink 64.9 ms (1.84x faster) Scalar add (x + 42): N=1M: canonical 2.52 ms vs sink 0.83 ms (3.03x faster) N=4M: canonical 11.9 ms vs sink 4.92 ms (2.42x faster) N=16M: canonical 100.3 ms vs sink 39.7 ms (2.53x faster) Mul+add (x * 3 + 7): N=1M: canonical 2.53 ms vs sink 0.82 ms (3.07x faster) N=4M: canonical 11.6 ms vs sink 4.93 ms (2.36x faster) N=16M: canonical 99.8 ms vs sink 39.8 ms (2.51x faster) These confirm the earlier prediction: "the next 2-3x lives in fused pipelines." The win comes from eliminating the intermediate Buffer<T> that canonical materialises between decode and the per-element operator — at N=4M it's 16 MB of intermediate that round-trips through L3, which the sink path skips entirely. Signed-off-by: Claude <claude@anthropic.com>

Demonstrates that "PatchedArray on top of a patchless BitPacked" decoded through the chunked engine beats "patches stored inside BitPacked" decoded canonically — because patch overlay becomes chunk-local (L1) instead of a scatter into the fully-materialised N-element buffer. New pieces: - `PatchedProducer<T>` (vortex-array): wraps any inner PrimitiveChunkProducer and overlays sorted (index, value) patches via a monotonic merge-walk. Each base chunk is decoded into the scratch, the patches in that chunk's logical range are written while it's hot in L1, then flushed. - `BitPackedPrimitiveProducer<T>` (vortex-fastlanes): plain chunked bit-unpack of a non-sliced, patch-free BitPacked<T>, used as the base. - `build_chunked_patched_over_bitpacked`: splits a BitPacked-with-internal- patches into (patchless base producer, flat sorted patches) and wires up a PatchedProducer. Mechanism: canonical `BitPacked::execute` bit-unpacks the whole column then `apply_patches` scatters exception values into the full buffer by index — random writes that miss cache once N spills L2/L3. The chunked path keeps the patch writes inside the current 1024-element scratch. Bench (BitPacked<u32>, bw=8, AVX-512, divan medians): N patches canonical chunked speedup 1M 1% 388.6 µs 281.9 µs 1.38x 1M 5% 515.1 µs 351.7 µs 1.46x 1M 10% 554.4 µs 457.1 µs 1.21x 4M 1% 2.047 ms 1.595 ms 1.28x 4M 5% 2.466 ms 2.006 ms 1.23x 4M 10% 2.992 ms 2.617 ms 1.14x 16M 1% 39.79 ms 35.15 ms 1.13x 16M 5% 45.73 ms 34.76 ms 1.32x 16M 10% 45.47 ms 38.80 ms 1.17x Win peaks at moderate density (~5%): enough patches that the canonical scatter pays real cache-miss cost, but not so many that the chunk-local overlay loop itself dominates. Signed-off-by: Claude <claude@anthropic.com>

codspeed-hq · 2026-05-20T18:45:20Z

Merging this PR will degrade performance by 13.16%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

❌ 2 regressed benchmarks
✅ 1235 untouched benchmarks
🆕 98 new benchmarks

Warning

Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
❌	WallTime	`cuda/bitpacked_u8/unpack/3bw[100M]`	298.9 µs	351.6 µs	-14.98%
❌	Simulation	`chunked_varbinview_canonical_into[(100, 100)]`	273.1 µs	308 µs	-11.31%
🆕	Simulation	`dict_runend_canonical[len=4194304 dict=1024 inner_run=8]`	N/A	17.9 ms	N/A
🆕	Simulation	`dict_runend_fused_chunked[len=1048576 dict=4096 inner_run=16]`	N/A	4.8 ms	N/A
🆕	Simulation	`dict_primitive_chunked[len=262144 dict=1024]`	N/A	1.2 ms	N/A
🆕	Simulation	`dict_runend_canonical[len=1048576 dict=256 inner_run=4]`	N/A	4.5 ms	N/A
🆕	Simulation	`dict_runend_fused_chunked[len=1048576 dict=256 inner_run=4]`	N/A	4.7 ms	N/A
🆕	Simulation	`dict_primitive_chunked[len=16777216 dict=256]`	N/A	74.5 ms	N/A
🆕	Simulation	`dict_primitive_chunked[len=65536 dict=256]`	N/A	300.1 µs	N/A
🆕	Simulation	`dict_runend_phase_inner_canonical[len=1048576 dict=256 inner_run=4]`	N/A	16.1 µs	N/A
🆕	Simulation	`dict_bp_canonical[len=1048576 dict=1024 bw=10]`	N/A	4.3 ms	N/A
🆕	Simulation	`dict_primitive_chunked[len=4194304 dict=256]`	N/A	18.6 ms	N/A
🆕	Simulation	`dict_runend_canonical[len=1048576 dict=4096 inner_run=16]`	N/A	4.6 ms	N/A
🆕	Simulation	`dict_bp_canonical[len=1048576 dict=256 bw=8]`	N/A	4.1 ms	N/A
🆕	Simulation	`dict_bp_canonical[len=4194304 dict=256 bw=8]`	N/A	20.1 ms	N/A
🆕	Simulation	`dict_bp_canonical[len=1048576 dict=4096 bw=12]`	N/A	4.5 ms	N/A
🆕	Simulation	`dict_runend_phase_inner_canonical[len=1048576 dict=4096 inner_run=16]`	N/A	30.7 µs	N/A
🆕	Simulation	`dict_runend_phase_inner_canonical[len=4194304 dict=1024 inner_run=8]`	N/A	20.2 µs	N/A
🆕	Simulation	`dict_runend_phase_take[len=4194304 dict=1024 inner_run=8]`	N/A	5.3 µs	N/A
🆕	Simulation	`listview_canonical_sum[rows=65536 avg_list=4 bw_off=18 bw_sz=4]`	N/A	1.8 ms	N/A
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

Tip

Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing claude/vortex-execution-engine-EUlmE (bcf5b76) with develop (19a1fb3)}

claude added 9 commits May 19, 2026 18:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spike: chunked execution engine for primitive + listview decode#8043

spike: chunked execution engine for primitive + listview decode#8043
joseph-isaacs wants to merge 9 commits into
developfrom
claude/vortex-execution-engine-EUlmE

joseph-isaacs commented May 20, 2026

Uh oh!

codspeed-hq Bot commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joseph-isaacs commented May 20, 2026

Uh oh!

codspeed-hq Bot commented May 20, 2026

Merging this PR will degrade performance by 13.16%

Performance Changes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants