Skip to content

[CORE] Add customMetrics extension point to ShuffleWriterMetrics for backend-specific shuffle stats#12114

Draft
luis4a0 wants to merge 2 commits into
apache:mainfrom
luis4a0:lpenaranda/pr-custom-metrics
Draft

[CORE] Add customMetrics extension point to ShuffleWriterMetrics for backend-specific shuffle stats#12114
luis4a0 wants to merge 2 commits into
apache:mainfrom
luis4a0:lpenaranda/pr-custom-metrics

Conversation

@luis4a0
Copy link
Copy Markdown
Contributor

@luis4a0 luis4a0 commented May 19, 2026

What changes are proposed in this pull request?

ShuffleWriterMetrics (in cpp/core/shuffle/Options.h) currently has a hand-rolled list of 9 scalar fields, two of which (avgDictionaryFields, dictionarySize) are Velox-specific. Every time someone wants to expose another backend-specific shuffle stat, the pattern is: new scalar field on the struct → new constructor parameter on GlutenSplitResult → new getter → new line in ColumnarShuffleWriter.scala. That's four files per metric, and it forces cross-backend coordination (the ClickHouse / GPU / RSS backends would each need similar plumbing for their own stats).

This PR adds a generic std::unordered_map<std::string, int64_t> customMetrics field to ShuffleWriterMetrics that any shuffle writer can populate with backend-specific stats, plumbed through the existing JNI stop() path as two parallel arrays (keys + values) into GlutenSplitResult. The JVM side reassembles them lazily into an unmodifiable Map<String, Long> on first getCustomMetrics() access, with a defensive length-match check that fails loudly on producer-side bugs.

Convention for keys: <Backend>.<Family>.<Stat>. Spark-side surfacing (registration as SQLMetrics) happens per-key in the backend's MetricsApi.

Files touched

  • cpp/core/shuffle/Options.h — new customMetrics field with key-naming convention doc
  • cpp/core/shuffle/ShuffleWriter.{h,cc}customMetrics() const accessor
  • cpp/core/jni/JniWrapper.cc — marshal map as two parallel arrays at the existing stop() site
  • gluten-arrow/src/main/java/org/apache/gluten/vectorized/GlutenSplitResult.java — new constructor params, lazy getCustomMetrics() accessor with caching + defensive length-match check
  • gluten-arrow/src/test/scala/org/apache/gluten/vectorized/GlutenSplitResultSuite.scala — JVM-side unit tests for the lazy reassembly logic

How was this patch tested?

  • ninja gluten velox: clean
  • mvn -Pbackends-velox,spark-3.5 -pl gluten-arrow compile: BUILD SUCCESS
  • mvn -Pbackends-velox,spark-3.5 -pl gluten-arrow scalatest:test -DwildcardSuites='org.apache.gluten.vectorized.GlutenSplitResultSuite': 8/8 passed
  • dev/git-clang-format --binary clang-format-15 --style=file: clean
  • dev/format-scala-code.sh: clean
  • dev/check.py format main: Ok on all C++ files touched

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)

@github-actions github-actions Bot added the VELOX label May 19, 2026
(key, value) =>
val m = dep.metrics.get(key)
if (m.isDefined) {
m.get.add(value)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splitResult.getCustomMetrics() is consumed here but not in VeloxCelebornColumnarShuffleWriter.scala:101+ or VeloxUniffleColumnarShuffleWriter.java:232. RSS users won't see new Velox.InputEncoding.* metrics. Suggest a shared applyCustomMetrics(splitResult, dep) helper called from all three writers.

luis4a0 added a commit to luis4a0/gluten that referenced this pull request May 19, 2026
…ay lengths

apache#12114 (comment)

If a future native-side producer ever ships mismatched key/value arrays,
the AIOOBE inside the synchronized block in `getCustomMetrics()` would
leave the volatile cache field unassigned — every subsequent call would
re-enter the synchronized block and re-throw, an unhelpful failure mode.

Adds an explicit length-match check (before the cache field is assigned
to a partial map) that throws `IllegalStateException` mentioning both
array lengths so the producer-side bug is unambiguous.

Adds two new `GlutenSplitResultSuite` cases covering:
- mismatched lengths (3-element keys + 1-element values)
- null values array when keys is non-empty

The second call after a failure must still throw (the cache field stays
null on the failure path); both new tests assert this.

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@luis4a0 luis4a0 force-pushed the lpenaranda/pr-custom-metrics branch from 6d0f4f8 to 6e6d98d Compare May 19, 2026 09:15
luis4a0 added a commit to luis4a0/gluten that referenced this pull request May 19, 2026
… RSS-helper TODO

Two review comments from @yaooqinn on apache#12114:

(1) apache#12114 (comment) —
If a future native-side producer ever ships mismatched key/value arrays,
the AIOOBE inside the synchronized block in `getCustomMetrics()` would
leave the volatile cache field unassigned. Every subsequent call would
re-enter the synchronized block and re-throw, an unhelpful failure mode.

Adds an explicit length-match check (before the cache field is assigned
to a partial map) that throws `IllegalStateException` mentioning both
array lengths so the producer-side bug is unambiguous. Two new
`GlutenSplitResultSuite` cases:
- mismatched lengths (2-element keys + 1-element values)
- null values array when keys is non-empty
Both assert the second call still throws (cache field stays null on
failure paths). Full suite is now 8/8 locally.

(2) apache#12114 (comment) —
There are 3 shuffle-writer entry points that call
`shuffleWriterJniWrapper.stop(...)`: `ColumnarShuffleWriter`,
`VeloxCelebornColumnarShuffleWriter`, and
`VeloxUniffleColumnarShuffleWriter`. Any future PR that drains
`getCustomMetrics()` into Spark SQLMetrics must factor a shared helper
(e.g. `applyCustomMetrics(splitResult, dep)`) and call it from all 3
writers, otherwise RSS users won't see metrics that local-writer users
do. This PR no longer wires the Scala consumer side (dropped per
offline reviewer feedback — see PR description), so the parity gap
is moot today, but the helper-shape requirement is captured in a
javadoc note on `getCustomMetrics()` so the first per-metric
follow-up gets it right from day one.

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@luis4a0 luis4a0 force-pushed the lpenaranda/pr-custom-metrics branch from 6e6d98d to d68ae7c Compare May 19, 2026 09:17
luis4a0 and others added 2 commits May 19, 2026 09:40
ShuffleWriterMetrics currently has a hand-rolled list of 9 scalar fields,
two of which (`avgDictionaryFields`, `dictionarySize`) are Velox-specific.
Adding more backend-specific scalars every time a backend needs another
counter doesn't scale — other backends (ClickHouse, GPU, RSS) have the same
need and the cross-backend coordination cost grows linearly per metric.

This change adds a generic `std::unordered_map<std::string, int64_t> customMetrics`
to ShuffleWriterMetrics that any shuffle writer can populate with
backend-specific stats. It is plumbed through the existing JNI `stop()`
serialization as two parallel arrays (keys + values) into
`GlutenSplitResult`, where the JVM side reassembles them lazily into an
unmodifiable `Map<String, Long>` on first access.

Convention for keys: `<Backend>.<Family>.<Stat>`. Spark-side surfacing
(registration as SQLMetrics) happens per-key in the backend's MetricsApi;
unknown keys are silently dropped on the Scala side so the producer can
ship without coordinating with every downstream surface.

Includes `GlutenSplitResultSuite` covering the JVM-side reassembly
(empty / null / populated arrays, caching, immutability) so the JNI boundary
is fenced by a unit test that doesn't need a full Spark / native round-trip.

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If a producer ever ships mismatched key/value arrays through the JNI
boundary, the AIOOBE inside the synchronized block in
`getCustomMetrics()` would leave the volatile cache field unassigned —
every subsequent call would re-enter the synchronized block and
re-throw, an unhelpful failure mode.

Adds an explicit length-match check (before the cache field is assigned
to a partial map) that throws `IllegalStateException` mentioning both
array lengths so the producer-side bug is unambiguous. Two new
`GlutenSplitResultSuite` cases:
- mismatched lengths
- null values array when keys is non-empty
Both assert the second call still throws (cache stays null on failure
paths). Full suite is now 8/8 locally.

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@luis4a0 luis4a0 force-pushed the lpenaranda/pr-custom-metrics branch from d68ae7c to 722fa45 Compare May 19, 2026 09:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants