Skip to content

[VL] Subdivide SplitRV CPU/wall timer into 4 sub-stages#12108

Draft
luis4a0 wants to merge 2 commits into
apache:mainfrom
luis4a0:lpenaranda/pr-splitrv-substages
Draft

[VL] Subdivide SplitRV CPU/wall timer into 4 sub-stages#12108
luis4a0 wants to merge 2 commits into
apache:mainfrom
luis4a0:lpenaranda/pr-splitrv-substages

Conversation

@luis4a0
Copy link
Copy Markdown
Contributor

@luis4a0 luis4a0 commented May 18, 2026

What changes are proposed in this pull request?

Companion follow-up to #12083 (which made cpuWallTimingList_ actually record real numbers) and #12107 (which added a per-batch input-encoding counter). The existing CpuWallTimingSplitRV bucket wraps the entire splitRowVector() call, so when it dominates wall time in a VELOX_SHUFFLE_WRITER_LOG_FLAG=1 profile, there's no way to tell which of the four sub-paths is responsible: fixed-width scatter, validity copy, binary split, or complex (Presto-serialized) split.

This PR subdivides SplitRV into four new buckets, one per helper inside splitRowVector():

New enum Helper
CpuWallTimingSplitFixedWidth splitFixedWidthValueBuffer()
CpuWallTimingSplitValidity splitValidityBuffer()
CpuWallTimingSplitBinary splitBinaryArray()
CpuWallTimingSplitComplex splitComplexType()

The outer SplitRV bucket is unchanged: it continues to count once per call, and its wall/cpu measurement now happens to overlap the four inner buckets (the outer timer is still ticking while each inner one is). The outer minus the sum-of-inners is therefore the time spent in splitRowVector() outside the four sub-paths (mostly the post-split partitionBufferBase_ update loop). This is the same parent/child pattern most profiling tools use for nested timers, and it's documented in a comment next to the enum.

The CpuWallTimingType enum, the CpuWallTimingName() helper, and a new single-value accessor cpuWallTiming(CpuWallTimingType) are hoisted from protected to public so the data is accessible from tests. The underlying cpuWallTimingList_ storage stays protected.

How was this patch tested?

This PR adds cpp/velox/tests/VeloxHashShuffleWriterSplitRvSubStagesTest.cc (second commit) with five gtest cases:

  • enumNames: validates the strings emitted under VELOX_SHUFFLE_WRITER_LOG_FLAG for the 4 new enum entries (these strings are stable observable surface).
  • fixedWidthBumpsItsBucket: single fixed-width batch; asserts each of the 4 inner buckets has count == 1 (every helper inside splitRowVector() is called once per batch regardless of column-type mix, since the new sub-stage SCOPED_TIMER calls live at function entry).
  • binaryBumpsItsBucket: VARCHAR batch; asserts the binary sub-stage's wallNanos > 0, confirming real work is being measured (not just function-entry overhead).
  • complexBumpsItsBucket: ARRAY batch; same wallNanos > 0 assertion against the complex sub-stage (Presto-serialized round-trip per partition).
  • outerSplitRvStillCounted: sanity check, two batches in and outer SplitRV bucket has count == 2. Confirms the new inner timers refine, not replace, the existing outer timer.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)

@luis4a0 luis4a0 marked this pull request as ready for review May 18, 2026 14:08
@github-actions github-actions Bot added the VELOX label May 18, 2026
@luis4a0 luis4a0 force-pushed the lpenaranda/pr-splitrv-substages branch 2 times, most recently from fb4ac15 to 1e87d3b Compare May 19, 2026 05:51
@luis4a0 luis4a0 marked this pull request as draft May 19, 2026 09:43
luis4a0 and others added 2 commits May 19, 2026 09:49
The existing `CpuWallTimingSplitRV` bucket wraps the entire
`splitRowVector()` call, so when it dominates wall time in a
`VELOX_SHUFFLE_WRITER_LOG_FLAG=1` profile, there's no way to tell
which of the four sub-paths is responsible: fixed-width scatter,
validity copy, binary split, or complex (Presto-serialized) split.

This change subdivides `SplitRV` into four new buckets, one per
helper inside `splitRowVector()`:

  - `CpuWallTimingSplitFixedWidth` — `splitFixedWidthValueBuffer`
  - `CpuWallTimingSplitValidity`   — `splitValidityBuffer`
  - `CpuWallTimingSplitBinary`     — `splitBinaryArray`
  - `CpuWallTimingSplitComplex`    — `splitComplexType`

The outer `SplitRV` bucket is unchanged: it continues to count
once per call, and its wall/cpu measurement now happens to overlap
the four inner buckets (the outer timer is still ticking while
each inner is). The outer minus the sum-of-inners is therefore the
time spent in `splitRowVector()` outside the four sub-paths
(mostly the post-split `partitionBufferBase_` update loop).

The `CpuWallTimingType` enum, the `CpuWallTimingName()` helper,
and a new single-value accessor `cpuWallTiming(CpuWallTimingType)`
are hoisted from `protected` to `public` so the data is accessible
from tests. The underlying `cpuWallTimingList_` storage stays
`protected`.

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Five gtest cases exercise the new sub-stage timers introduced in
the previous commit:

  - `enumNames` — validates the strings emitted under
    `VELOX_SHUFFLE_WRITER_LOG_FLAG` for the 4 new enum entries
    (these strings are stable observable surface, since they're
    what shows up in the per-stage log line).
  - `fixedWidthBumpsItsBucket` — single fixed-width batch;
    asserts each of the 4 inner buckets has `count == 1` (every
    helper inside `splitRowVector()` is called regardless of
    column-type mix).
  - `binaryBumpsItsBucket` — VARCHAR batch; asserts the binary
    sub-stage's wallNanos > 0, confirming real work is being
    measured (not just function-entry overhead).
  - `complexBumpsItsBucket` — ARRAY batch; same wallNanos > 0
    assertion against the complex sub-stage (Presto serializer
    round-trip per partition).
  - `outerSplitRvStillCounted` — sanity: two batches in, outer
    `SplitRV` bucket has `count == 2`. Confirms the new inner
    timers refine, not replace, the existing outer timer.

The new test is a standalone `add_velox_test(...)` entry so each
sub-stage's behavior is asserted independently of the parameterised
round-trip matrix in `VeloxShuffleWriterTest.cc`.

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@luis4a0 luis4a0 force-pushed the lpenaranda/pr-splitrv-substages branch from 1e87d3b to 5a7c742 Compare May 19, 2026 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant