spike: chunked execution engine for primitive + listview decode #8043
Open
joseph-isaacs wants to merge 9 commits into
Introduces a streaming chunked decode engine alongside the canonical
executor. The model is:

- Driver-owned L1-resident Scratch<T> of fixed CHUNK_LEN (1024); producers
  write decoded values into it and the driver consumes per chunk.
- PrimitiveChunkProducer<T> contract + dyn-dispatch
  PrimitiveChunkKernelDispatcher keyed by outermost encoding id.
- DictKernel materializes its (bounded) values slot once via the regular
  executor, then streams the gather over codes; this naturally fuses
  Dict<RunEnd<P>> since RunEnd is unrolled into the small dict in the
  materialization step.
- RunEndKernel lives in vortex-runend (where the encoding is defined) and
  registers onto the dispatcher via register_chunk_kernels.
- listview::ListChunkProducer emits (offsets, sizes, elements) row windows
  for ListView<Primitive>, including bit-packed offsets/sizes.

Module is _-prefixed (#[doc(hidden)]) so it stays out of the public API
surface while the spike settles.

Includes:

- unit tests in vortex-array (slice round-trip, dict chunked, listview
  windows, fallback) and vortex-runend (runend, sliced runend, fused
  Dict<RunEnd<P>>).
- divan bench at encodings/runend/benches/chunked_exec.rs comparing
  chunked vs canonical for Dict<P>, RunEnd<P>, fused Dict<RunEnd<P>>,
  and ListView<P> with bit-packed offsets/sizes.

Signed-off-by: Claude <claude@anthropic.com>
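To make the driver/producer split described in this commit concrete, here is a minimal sketch. It is not the PR's code: `ChunkProducer`, `DictProducer`, and `decode_to_vec` are hypothetical, simplified stand-ins for `PrimitiveChunkProducer<T>`, the dict kernel's producer, and `decode_to_buffer`, and the scratch is a plain stack array rather than the PR's `Scratch<T>`.

```rust
/// Fixed chunk length, matching the CHUNK_LEN (1024) named in the description.
const CHUNK_LEN: usize = 1024;

/// Hypothetical, simplified stand-in for the PR's `PrimitiveChunkProducer<T>`.
trait ChunkProducer<T> {
    /// Decode up to CHUNK_LEN values into `scratch`.
    /// Returns how many values were written; 0 means the producer is drained.
    fn next_chunk(&mut self, scratch: &mut [T; CHUNK_LEN]) -> usize;
}

/// Dictionary producer: `values` is the small, already materialized
/// dictionary and `codes` index into it.
struct DictProducer<'a, T> {
    values: &'a [T],
    codes: &'a [u32],
    pos: usize,
}

impl<'a, T: Copy> ChunkProducer<T> for DictProducer<'a, T> {
    fn next_chunk(&mut self, scratch: &mut [T; CHUNK_LEN]) -> usize {
        let n = (self.codes.len() - self.pos).min(CHUNK_LEN);
        let codes = &self.codes[self.pos..self.pos + n];
        // Scalar gather over this chunk's codes only.
        for (dst, &code) in scratch.iter_mut().zip(codes) {
            *dst = self.values[code as usize];
        }
        self.pos += n;
        n
    }
}

/// Driver: owns the scratch and drains the producer chunk by chunk.
fn decode_to_vec<T: Copy + Default>(mut producer: impl ChunkProducer<T>, len: usize) -> Vec<T> {
    let mut scratch = [T::default(); CHUNK_LEN];
    let mut out = Vec::with_capacity(len);
    loop {
        let n = producer.next_chunk(&mut scratch);
        if n == 0 {
            break;
        }
        out.extend_from_slice(&scratch[..n]);
    }
    out
}
```

The driver never sees the encoding; it only sees filled scratch prefixes, which is what lets a dispatcher swap in a different producer per outermost encoding.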
Reuse Scratch buffers across chunks in BoxedListChunkProducer (was
allocating two 8 KiB heap buffers per chunk). Adds phase-breakdown benches
for Dict<RunEnd<P>> to show where canonical's time actually goes (the take
phase, not RunEnd canonicalize).

Bench data (i32, release, divan medians):

Dict<Primitive> (canonical uses AVX2 gather; chunked uses scalar):
  len=1M dict=256: canonical 415 µs vs chunked 564 µs (1.36× slower)
  len=1M dict=4K:  canonical 399 µs vs chunked 602 µs (1.51× slower)

RunEnd<Primitive>:
  len=64K run=4:   canonical 88 µs  vs chunked 50 µs  (1.77× faster)
  len=1M run=16:   canonical 462 µs vs chunked 345 µs (1.34× faster)
  len=1M run=256:  canonical 180 µs vs chunked 229 µs (1.27× slower)

Dict<RunEnd<Primitive>> (the "fused" stack):
  len=1M dict=256: canonical 21.4 ms vs chunked 610 µs (35× faster)
  len=4M dict=1K:  canonical 112 ms  vs chunked 5.5 ms (20× faster)
  Diagnostic shows the inner RunEnd canonicalize is <2 µs, so canonical's
  21 ms is the take phase taking a slow path on RunEnd values; the chunked
  win here is a canonical pathology, not an asymptotic gain.

ListView<Primitive> bit-packed offsets+sizes, row-sum consumer:
  rows=262K avg_list=4: canonical 933 µs vs chunked 1722 µs (1.85× slower)
  V1 canonicalizes offsets/sizes/elements up front, then chunks; net
  overhead with no win.

Signed-off-by: Claude <claude@anthropic.com>
Two structural fixes that change the picture:
1. AVX2 gather plumbed into the chunked path.
- New `take_avx2_into(values, indices, dst)` writes the SIMD gather
directly into a caller-supplied destination, no per-call Buffer alloc.
- Public `take_into_uninit` selects AVX2 or scalar at runtime.
- `PrimitiveChunkProducer::next_chunk_into_uninit` lets producers write
straight to the output buffer's spare capacity, bypassing the scratch
hop in `decode_to_buffer`. Overridden for Dict, RunEnd, Slice.
2. Bit-pack fusion in vortex-fastlanes.
- `BitPackedDictKernel` matches `Dict<BitPacked<U>, ...>` and produces a
`BitPackedDictProducer` that bit-unpacks one 1024-element code chunk
at a time into a stack-resident scratch, then AVX2-gathers into the
output. Avoids the upfront materialisation of the codes column.
Bench data (i32, release, divan medians, AVX2):
Dict<Primitive>:
len=1M dict=256: canonical 466 µs vs chunked 372 µs (1.25× faster)
len=1M dict=4K: canonical 465 µs vs chunked 360 µs (1.29× faster)
len=256K dict=1K: canonical 93 µs vs chunked 69 µs (1.36× faster)
Dict<BitPacked<u16>> codes:
len=1M dict=256 bw=8: canonical 438 µs vs chunked 438 µs (tie)
len=1M dict=4K bw=12: canonical 442 µs vs chunked 454 µs (~tie)
(fusion correct, not bandwidth-bound on these sizes)
RunEnd<Primitive>:
len=64K run=4: canonical 90 µs vs chunked 75 µs (1.19× faster)
len=64K run=64: canonical 12 µs vs chunked 9 µs (1.34× faster)
len=1M run=16: canonical 475 µs vs chunked 289 µs (1.64× faster)
len=1M run=256: canonical 186 µs vs chunked 204 µs (0.91×)
Dict<RunEnd<Primitive>> (fused stack):
len=1M dict=256 ir=4: canonical 21.4 ms vs chunked 359 µs (60× faster)
len=1M dict=4K ir=16: canonical 27.7 ms vs chunked 370 µs (75× faster)
len=4M dict=1K ir=8: canonical 102 ms vs chunked 1.5 ms (67× faster)
(canonical's path is a real slow-path bug; chunked sidesteps it.)
ListView<Primitive> bit-packed offsets/sizes:
rows=262K avg_list=4: canonical 963 µs vs chunked 1725 µs (0.56×)
Still slower — v1 canonicalises offsets/sizes upfront and the consumer
pays dyn-fn dispatch per row. Needs a typed callback API + chunked
bit-unpack to win; left as follow-up.
Signed-off-by: Claude <claude@anthropic.com>
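The "write straight to the output buffer's spare capacity" idea from this commit can be sketched with std types only. This is an illustration under assumptions, not the PR's implementation: `take_scalar_into_uninit` and `gather_append` are hypothetical names, a `Vec<T>` stands in for the PR's buffer type, and the runtime AVX2 selection done by `take_into_uninit` is omitted in favour of the scalar path.

```rust
use std::mem::MaybeUninit;

/// Scalar gather: writes `values[indices[i]]` into `dst[i]`.
/// Hypothetical stand-in for the PR's `take_into_uninit`, scalar path only.
fn take_scalar_into_uninit<T: Copy>(values: &[T], indices: &[u32], dst: &mut [MaybeUninit<T>]) {
    assert!(dst.len() >= indices.len());
    for (slot, &idx) in dst.iter_mut().zip(indices) {
        // Indexing into `values` is bounds-checked, so a bad code panics
        // instead of reading out of bounds.
        slot.write(values[idx as usize]);
    }
}

/// Appends the gathered values to `out` without an intermediate buffer:
/// the gather writes directly into the Vec's uninitialized tail.
fn gather_append<T: Copy>(out: &mut Vec<T>, values: &[T], indices: &[u32]) {
    out.reserve(indices.len());
    let spare = out.spare_capacity_mut();
    take_scalar_into_uninit(values, indices, &mut spare[..indices.len()]);
    // SAFETY: the first `indices.len()` spare slots were initialized above,
    // and `reserve` guaranteed that many slots exist.
    unsafe { out.set_len(out.len() + indices.len()) };
}
```

Compared with decoding into a scratch and then copying, this removes one pass over each chunk, which is the "scratch hop" the commit message says is bypassed.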
ListChunkProducer::next_chunk no longer memcpys offsets/sizes through the
scratch; it returns slices straight from the canonical buffers. The
scratch arguments stay on the signature for a future chunked-bit-unpack
producer that materialises per chunk.

New `build_listview_producer_typed::<O, S, E>` and
`ListChunkProducer::for_each_chunk_typed` give consumers raw typed slices,
removing the `&dyn Fn(usize) -> usize` hop on every row that
`BoxedListChunkProducer::for_each_chunk` was paying.

ListView<Primitive> bit-packed offsets+sizes, sum-elements consumer:
  rows=16K  avg_list=8: canonical 56 µs   vs chunked 51 µs   (1.09×)
  rows=65K  avg_list=4: canonical 199 µs  vs chunked 193 µs  (1.03×)
  rows=262K avg_list=4: canonical 1.15 ms vs chunked 1.09 ms (1.06×)
  (was 0.53–0.56× in v2, i.e. slower than canonical)

Signed-off-by: Claude <claude@anthropic.com>
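A rough sketch of what a typed chunk callback buys: the consumer receives plain slices and the closure is monomorphized, so there is no dynamic call per row. The function names and the fixed `u32`/`i32` types below are assumptions for illustration; the PR's `for_each_chunk_typed` is generic over offset, size, and element types and its exact signature is not shown in this conversation.

```rust
/// Visits a ListView-style column in row windows of `chunk_rows`, handing the
/// consumer typed slices of offsets and sizes (hypothetical simplification).
fn for_each_chunk_typed<F>(offsets: &[u32], sizes: &[u32], chunk_rows: usize, mut f: F)
where
    F: FnMut(&[u32], &[u32]),
{
    assert_eq!(offsets.len(), sizes.len());
    assert!(chunk_rows > 0);
    for (o, s) in offsets.chunks(chunk_rows).zip(sizes.chunks(chunk_rows)) {
        f(o, s);
    }
}

/// Sum-elements consumer, mirroring the shape of the benchmark's consumer.
fn sum_elements(offsets: &[u32], sizes: &[u32], elements: &[i32]) -> i64 {
    let mut total = 0i64;
    for_each_chunk_typed(offsets, sizes, 1024, |o, s| {
        for (&off, &size) in o.iter().zip(s) {
            let (start, len) = (off as usize, size as usize);
            // Each row is the window elements[start..start + len].
            total += elements[start..start + len].iter().map(|&x| x as i64).sum::<i64>();
        }
    });
    total
}
```

Because rows are described by (offset, size) pairs rather than consecutive offsets, rows may be empty or may overlap; the consumer above handles both without special cases.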
Two independent changes landed by parallel subagents.

# 1. Dict<RunEnd<P>> canonical slow path (50x → ~1x)

Root cause: `RunEnd::TakeExecute::take`
(encodings/runend/src/compute/take.rs) binary-searches the run-ends buffer
**once per index**, allocating a `Vec<u64>` of physical indices
proportional to the index count. When the parent is `Dict<RunEnd<P>>`,
Dict's `take_canonical` calls it with `indices.len()` == the codes column
length (N), against a tiny inner RunEnd of length `dict_size` (K << N).
That's N log K binary searches + an N-sized intermediate alloc, when the
right thing to do is canonicalize the (small) RunEnd values once and
AVX-gather over the codes.

Fix: gate `RunEnd::TakeExecute::take` to return `Ok(None)` when
`indices.len() > array.len()`. The canonical executor then falls back to
materializing the RunEnd values to a Primitive (microseconds since
array.len() = dict_size is small) and dispatching the AVX gather via
take_canonical(Primitive, codes).

Added regression test `ree_dict_take_dense_indices` that exercises the
exact `Dict<RunEnd<P>>` shape from the bench.

# 2. AVX-512 gather kernel

New `take_avx512_into` / `take_avx512` in
vortex-array/src/arrays/primitive/compute/take/avx512.rs, mirroring the
AVX-2 file's structure: `AVX512Gather: GatherFn<Idx, Value>` impls for
(u8/u16/u32/u64 indices × i32/u32/i64/u64/f32/f64 values) using
`_mm512_mask_i32gather_epi32` (16-lane) and `_mm512_mask_i64gather_epi64`
(8-lane). Pairs not implemented natively in AVX-512 fall through to AVX-2
(a strict feature subset on the gated host).

`take/mod.rs`: `PRIMITIVE_TAKE_KERNEL` now prefers AVX-512 → AVX-2 →
scalar; `take_into_uninit` dispatches the same way.

# Combined bench data (i32, release, divan medians)

`Dict<RunEnd<P>>` canonical (was ~21 ms, the 50x bug):
  len=1M dict=256 inner_run=4:  21.4 ms → TBD (now must match Dict<P>)
  len=1M dict=4K  inner_run=16: 27.7 ms → TBD
  len=4M dict=1K  inner_run=8:  102 ms  → TBD

Dict<Primitive> chunked (AVX-512 vs prior AVX-2):
  len=1M dict=256:  427 µs → 365 µs (1.17×)
  len=1M dict=4096: 428 µs → 382 µs (1.12×)

Dict<BitPacked<u16>> chunked:
  len=1M dict=256  bw=8:  491 µs → 471 µs (1.04×)
  len=1M dict=4096 bw=12: 524 µs → 482 µs (1.09×)

RunEnd<P> chunked: unchanged (path doesn't use the gather kernel).

# Quirks

- `_mm512_cmplt_epu32_mask` (strict `<`) initially broke the
  `last_valid_index` regression; switched to `cmple` to match AVX-2.
- `_mm512_mask_i32gather_epi64` takes `__mmask8`, but the natural 32-bit
  compare for 8 indices returns `__mmask16`; zext + mask + downcast.
- Stable `cargo fmt` wanted to reformat unrelated files (nightly-only
  options); preserved only avx512.rs + mod.rs edits.

Signed-off-by: Claude <claude@anthropic.com>
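The slow path and the gate from section 1 can be illustrated on plain slices. This is a sketch under assumptions: run ends are taken to be exclusive and strictly increasing, all function names are hypothetical, and the gate is collapsed into a single `if`, whereas in the PR the RunEnd kernel returns `Ok(None)` and the canonical executor performs the fallback.

```rust
/// Per-index take on a run-end encoded array: one binary search per index,
/// so N indices over K runs cost roughly N log K.
fn take_runend_search(ends: &[u32], values: &[i32], indices: &[u32]) -> Vec<i32> {
    indices
        .iter()
        .map(|&i| {
            // Index of the first run whose exclusive end is greater than `i`.
            let run = ends.partition_point(|&end| end <= i);
            values[run]
        })
        .collect()
}

/// Alternative: expand the runs once, then do a plain gather.
fn take_runend_canonicalize(ends: &[u32], values: &[i32], indices: &[u32]) -> Vec<i32> {
    let mut flat = Vec::with_capacity(ends.last().copied().unwrap_or(0) as usize);
    let mut start = 0u32;
    for (&end, &v) in ends.iter().zip(values) {
        flat.extend(std::iter::repeat(v).take((end - start) as usize));
        start = end;
    }
    indices.iter().map(|&i| flat[i as usize]).collect()
}

/// The gate: when there are more indices than logical rows, expanding the
/// runs once is cheaper than searching per index.
fn take_runend(ends: &[u32], values: &[i32], indices: &[u32]) -> Vec<i32> {
    let logical_len = ends.last().copied().unwrap_or(0) as usize;
    if indices.len() > logical_len {
        take_runend_canonicalize(ends, values, indices)
    } else {
        take_runend_search(ends, values, indices)
    }
}
```

In the `Dict<RunEnd<P>>` shape, `indices` is the codes column (length N) and the run-end array is the dictionary (length K << N), so the gate always picks the expand-then-gather branch.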
Adds N = 4M / 16M / 64M variants to dict_primitive and dict_bp benches to
walk the L1 → L2 → L3 → DRAM boundaries on the test Xeon (L1d=48 KiB,
L2=2 MiB, L3=256 MiB).

Key finding: chunked-by-1K only wins on shapes where canonical materialises
an intermediate buffer that spills cache. For Dict<BitPacked<u16>> the
intermediate codes Buffer<u16> = 2N bytes; once N crosses ~1M (2 MiB = L2),
chunked starts beating canonical by 6-18%. At very large N (64M,
intermediate ~ half of L3) the output buffer dominates and the gap closes.

For Dict<Primitive> (no intermediate buffer, since codes are already
canonical), chunked stays tied-to-slightly-slower across all N, confirming
that the cache trip is the only mechanism by which chunked-by-1K beats
all-at-once materialisation.

Bench results (median, AVX-512, after the Dict<RunEnd> fix from cc0d578):

Dict<BitPacked<u16>>, dict=256, bw=8:
  N=1M:  canonical 494 µs   vs chunked 419 µs   (1.18× faster)
  N=4M:  canonical 2.10 ms  vs chunked 1.99 ms  (1.06× faster)
  N=16M: canonical 37.3 ms  vs chunked 34.7 ms  (1.08× faster)
  N=64M: canonical 175.8 ms vs chunked 176.3 ms (1.00× tied)

Dict<Primitive>, dict=256:
  N=1M:  canonical 384 µs  vs chunked 454 µs  (0.84×)
  N=4M:  canonical 1.37 ms vs chunked 1.43 ms (0.96×)
  N=16M: canonical 23.7 ms vs chunked 24.4 ms (0.97×)

Signed-off-by: Claude <claude@anthropic.com>
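The cache arithmetic behind the key finding can be checked with a few lines. The cache sizes are the ones quoted for the test Xeon; the helper names are hypothetical, and the sketch assumes "1M" means 2^20 rows, which the bench may or may not use.

```rust
// Cache sizes quoted in the commit message for the test Xeon.
const L1D: usize = 48 * 1024;
const L2: usize = 2 * 1024 * 1024;
const L3: usize = 256 * 1024 * 1024;

/// Size of the intermediate unpacked-codes buffer for Dict<BitPacked<u16>>:
/// one u16 per row, i.e. 2N bytes.
fn intermediate_codes_bytes(n: usize) -> usize {
    n * std::mem::size_of::<u16>()
}

/// Smallest cache level that could hold `bytes` on its own
/// (ignoring everything else competing for that level).
fn smallest_fitting_level(bytes: usize) -> &'static str {
    if bytes <= L1D {
        "L1d"
    } else if bytes <= L2 {
        "L2"
    } else if bytes <= L3 {
        "L3"
    } else {
        "DRAM"
    }
}
```

At N = 2^20 the intermediate is exactly 2 MiB, so it fills L2 with no room for the output buffer; at N = 4 × 2^20 it is 8 MiB and must live in L3; at N = 64 × 2^20 it is 128 MiB, half of L3. By contrast, a 1024-element u16 chunk is 2 KiB and stays well inside L1d.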
- New `examples/profile_chunked.rs` for `samply record` runs against
  long-loop chunked / canonical decompress.
- `BitPackedDictProducer::write_next_into` is now generic over a
  super-chunk size (CHUNK_LEN/FL_CHUNK). At CHUNK_LEN=1024 this is a no-op
  (1 fastlanes-chunk per outer iteration, same as before). Code is ready
  to take advantage of larger super-chunks if CHUNK_LEN ever grows.
- CHUNK_LEN stays at 1024. Empirically tried 4096: neutral for
  Dict<BitPacked>, regressed RunEnd at run=64 by ~47% (compiler codegen on
  the wider fill loop). Documenting that in the const.

Sample profile of chunked Dict<BitPacked<u16>> at N=4M (samply, 4 kHz):
  58.7% vortex_array...avx512::take_avx512_into (the AVX-512 gather)
  17.5% fastlanes::bitpacking::<u16 as BitPacking>::unpack
  23.8% everything else (dispatch, alloc, traversal)

Same breakdown for canonical:
  62.7% take_primitive_avx512 (same AVX-512 kernel, allocating Buffer)
  17.0% unpack
  20.3% everything else (incl. BufferMut::with_capacity)

Both paths are gather-throughput-bound on identical SIMD hardware. The
chunked win measured at large N is the L2→L3 cache trip on canonical's
intermediate `Buffer<u16>` of unpacked codes, which doesn't show in sample
profiles (it shows as L2-miss latency on the gather) but does show in wall
time (1.06–1.18× speedup at 1M–16M rows).

Signed-off-by: Claude <claude@anthropic.com>
Adds a sink API for fused single-pass operators on top of the chunked
decode engine. Decode and the downstream operator share a 4 KiB
L1-resident scratch; nothing materialises the source as a Buffer<T>
between them.

New trait `PrimitiveChunkSink<T>` with `push(chunk) -> Result<()>` +
`finish() -> Result<Output>`. Driver `drive_into_sink` walks the producer
once, feeding each chunk into the sink.

Sinks shipped:

- BufferSink: collect-to-Buffer<T> baseline, equivalent to decode_to_buffer
- SumSink: sum(x) as i64, no output buffer at all
- MapSink: per-element FnMut(T) -> U, used for casts and scalar funcs
- FilterSink: per-element predicate, surviving elements stream out

Bench (Dict<i32>/<BitPacked<u16>> input, AVX-512, divan medians):

Filter (x > 2000, ~50% selectivity):
  N=1M:  canonical 2.21 ms  vs sink 1.17 ms (1.88x faster)
  N=4M:  canonical 19.7 ms  vs sink 5.27 ms (3.73x faster)
  N=16M: canonical 100.7 ms vs sink 49.0 ms (2.06x faster)

Cast i32 -> i64:
  N=1M:  canonical 3.06 ms  vs sink 1.05 ms (2.91x faster)
  N=4M:  canonical 22.5 ms  vs sink 15.8 ms (1.42x faster)
  N=16M: canonical 119.5 ms vs sink 64.9 ms (1.84x faster)

Scalar add (x + 42):
  N=1M:  canonical 2.52 ms  vs sink 0.83 ms (3.03x faster)
  N=4M:  canonical 11.9 ms  vs sink 4.92 ms (2.42x faster)
  N=16M: canonical 100.3 ms vs sink 39.7 ms (2.53x faster)

Mul+add (x * 3 + 7):
  N=1M:  canonical 2.53 ms vs sink 0.82 ms (3.07x faster)
  N=4M:  canonical 11.6 ms vs sink 4.93 ms (2.36x faster)
  N=16M: canonical 99.8 ms vs sink 39.8 ms (2.51x faster)

These confirm the earlier prediction: "the next 2-3x lives in fused
pipelines." The win comes from eliminating the intermediate Buffer<T> that
canonical materialises between decode and the per-element operator. At
N=4M that is 16 MB of intermediate that round-trips through L3, which the
sink path skips entirely.

Signed-off-by: Claude <claude@anthropic.com>
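A minimal sketch of the sink shape described in this commit, assuming a `String` error type and i32 input for brevity. `ChunkSink`, the two sinks, and this `drive_into_sink` are hypothetical simplifications of `PrimitiveChunkSink<T>` and the PR's driver; the producer is inlined here as a dict gather so that decode and operator visibly share one scratch.

```rust
const CHUNK_LEN: usize = 1024;

/// Hypothetical, simplified version of the PR's `PrimitiveChunkSink<T>`.
trait ChunkSink<T> {
    type Output;
    fn push(&mut self, chunk: &[T]) -> Result<(), String>;
    fn finish(self) -> Result<Self::Output, String>;
}

/// Sums values as i64; produces no output buffer at all.
struct SumSink {
    acc: i64,
}

impl ChunkSink<i32> for SumSink {
    type Output = i64;
    fn push(&mut self, chunk: &[i32]) -> Result<(), String> {
        self.acc += chunk.iter().map(|&x| x as i64).sum::<i64>();
        Ok(())
    }
    fn finish(self) -> Result<i64, String> {
        Ok(self.acc)
    }
}

/// Keeps only the elements that satisfy the predicate.
struct FilterSink<P> {
    pred: P,
    out: Vec<i32>,
}

impl<P: FnMut(i32) -> bool> ChunkSink<i32> for FilterSink<P> {
    type Output = Vec<i32>;
    fn push(&mut self, chunk: &[i32]) -> Result<(), String> {
        for &x in chunk {
            if (self.pred)(x) {
                self.out.push(x);
            }
        }
        Ok(())
    }
    fn finish(self) -> Result<Vec<i32>, String> {
        Ok(self.out)
    }
}

/// Driver: decodes one chunk of a dict-encoded column into a 4 KiB stack
/// scratch, hands it to the sink, and reuses the scratch for the next chunk.
fn drive_into_sink<S: ChunkSink<i32>>(
    values: &[i32],
    codes: &[u32],
    mut sink: S,
) -> Result<S::Output, String> {
    let mut scratch = [0i32; CHUNK_LEN];
    for code_chunk in codes.chunks(CHUNK_LEN) {
        for (dst, &c) in scratch.iter_mut().zip(code_chunk) {
            *dst = values[c as usize];
        }
        sink.push(&scratch[..code_chunk.len()])?;
    }
    sink.finish()
}
```

The full decoded column never exists in memory: each 1024-value chunk is produced, consumed, and overwritten, which is the intermediate `Buffer<T>` the bench numbers attribute the win to.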
Demonstrates that "PatchedArray on top of a patchless BitPacked" decoded
through the chunked engine beats "patches stored inside BitPacked" decoded
canonically, because patch overlay becomes chunk-local (L1) instead of a
scatter into the fully-materialised N-element buffer.

New pieces:

- `PatchedProducer<T>` (vortex-array): wraps any inner
  PrimitiveChunkProducer and overlays sorted (index, value) patches via a
  monotonic merge-walk. Each base chunk is decoded into the scratch, the
  patches in that chunk's logical range are written while it's hot in L1,
  then flushed.
- `BitPackedPrimitiveProducer<T>` (vortex-fastlanes): plain chunked
  bit-unpack of a non-sliced, patch-free BitPacked<T>, used as the base.
- `build_chunked_patched_over_bitpacked`: splits a
  BitPacked-with-internal-patches into (patchless base producer, flat
  sorted patches) and wires up a PatchedProducer.

Mechanism: canonical `BitPacked::execute` bit-unpacks the whole column,
then `apply_patches` scatters exception values into the full buffer by
index: random writes that miss cache once N spills L2/L3. The chunked path
keeps the patch writes inside the current 1024-element scratch.

Bench (BitPacked<u32>, bw=8, AVX-512, divan medians):

  N    patches  canonical  chunked   speedup
  1M   1%       388.6 µs   281.9 µs  1.38x
  1M   5%       515.1 µs   351.7 µs  1.46x
  1M   10%      554.4 µs   457.1 µs  1.21x
  4M   1%       2.047 ms   1.595 ms  1.28x
  4M   5%       2.466 ms   2.006 ms  1.23x
  4M   10%      2.992 ms   2.617 ms  1.14x
  16M  1%       39.79 ms   35.15 ms  1.13x
  16M  5%       45.73 ms   34.76 ms  1.32x
  16M  10%      45.47 ms   38.80 ms  1.17x

Win peaks at moderate density (~5%): enough patches that the canonical
scatter pays real cache-miss cost, but not so many that the chunk-local
overlay loop itself dominates.

Signed-off-by: Claude <claude@anthropic.com>
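The monotonic merge-walk can be sketched as follows. This is an illustration under assumptions rather than the PR's `PatchedProducer<T>`: `decode_with_patches` is a hypothetical name, the "base decode" is a plain copy standing in for the chunked bit-unpack, and patches are assumed to be sorted by strictly increasing index.

```rust
const CHUNK_LEN: usize = 1024;

/// Decodes `base` chunk by chunk and overlays sorted `(index, value)` patches
/// while each chunk is still in the scratch.
fn decode_with_patches(base: &[u32], patches: &[(usize, u32)]) -> Vec<u32> {
    debug_assert!(patches.windows(2).all(|w| w[0].0 < w[1].0));
    let mut out = Vec::with_capacity(base.len());
    let mut scratch = [0u32; CHUNK_LEN];
    // Cursor into `patches`; it only ever moves forward (the merge-walk).
    let mut next_patch = 0;
    for (chunk_idx, chunk) in base.chunks(CHUNK_LEN).enumerate() {
        let start = chunk_idx * CHUNK_LEN;
        let end = start + chunk.len();
        // Stand-in for the real base decode (chunked bit-unpack in the PR).
        scratch[..chunk.len()].copy_from_slice(chunk);
        // Apply every patch whose logical index falls inside this chunk.
        // Earlier chunks consumed all patches below `start`, so
        // `idx - start` cannot underflow.
        while next_patch < patches.len() && patches[next_patch].0 < end {
            let (idx, value) = patches[next_patch];
            scratch[idx - start] = value;
            next_patch += 1;
        }
        out.extend_from_slice(&scratch[..chunk.len()]);
    }
    out
}
```

Every patch write lands inside the 4 KiB scratch that was just written, instead of at an arbitrary offset in an N-element output buffer; that locality is the mechanism the commit message credits for the speedup.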
Merging this PR will degrade performance by 13.16%