[CORE] Add customMetrics extension point to ShuffleWriterMetrics for backend-specific shuffle stats#12114
Draft
luis4a0 wants to merge 2 commits into
Draft
[CORE] Add customMetrics extension point to ShuffleWriterMetrics for backend-specific shuffle stats#12114luis4a0 wants to merge 2 commits into
luis4a0 wants to merge 2 commits into
Conversation
yaooqinn
reviewed
May 19, 2026
| (key, value) => | ||
| val m = dep.metrics.get(key) | ||
| if (m.isDefined) { | ||
| m.get.add(value) |
Member
There was a problem hiding this comment.
splitResult.getCustomMetrics() is consumed here but not in VeloxCelebornColumnarShuffleWriter.scala:101+ or VeloxUniffleColumnarShuffleWriter.java:232. RSS users won't see new Velox.InputEncoding.* metrics. Suggest a shared applyCustomMetrics(splitResult, dep) helper called from all three writers.
yaooqinn
reviewed
May 19, 2026
luis4a0
added a commit
to luis4a0/gluten
that referenced
this pull request
May 19, 2026
…ay lengths apache#12114 (comment) If a future native-side producer ever ships mismatched key/value arrays, the AIOOBE inside the synchronized block in `getCustomMetrics()` would leave the volatile cache field unassigned — every subsequent call would re-enter the synchronized block and re-throw, an unhelpful failure mode. Adds an explicit length-match check (before the cache field is assigned to a partial map) that throws `IllegalStateException` mentioning both array lengths so the producer-side bug is unambiguous. Adds two new `GlutenSplitResultSuite` cases covering: - mismatched lengths (3-element keys + 1-element values) - null values array when keys is non-empty The second call after a failure must still throw (the cache field stays null on the failure path); both new tests assert this. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6d0f4f8 to
6e6d98d
Compare
luis4a0
added a commit
to luis4a0/gluten
that referenced
this pull request
May 19, 2026
… RSS-helper TODO Two review comments from @yaooqinn on apache#12114: (1) apache#12114 (comment) — If a future native-side producer ever ships mismatched key/value arrays, the AIOOBE inside the synchronized block in `getCustomMetrics()` would leave the volatile cache field unassigned. Every subsequent call would re-enter the synchronized block and re-throw, an unhelpful failure mode. Adds an explicit length-match check (before the cache field is assigned to a partial map) that throws `IllegalStateException` mentioning both array lengths so the producer-side bug is unambiguous. Two new `GlutenSplitResultSuite` cases: - mismatched lengths (2-element keys + 1-element values) - null values array when keys is non-empty Both assert the second call still throws (cache field stays null on failure paths). Full suite is now 8/8 locally. (2) apache#12114 (comment) — There are 3 shuffle-writer entry points that call `shuffleWriterJniWrapper.stop(...)`: `ColumnarShuffleWriter`, `VeloxCelebornColumnarShuffleWriter`, and `VeloxUniffleColumnarShuffleWriter`. Any future PR that drains `getCustomMetrics()` into Spark SQLMetrics must factor a shared helper (e.g. `applyCustomMetrics(splitResult, dep)`) and call it from all 3 writers, otherwise RSS users won't see metrics that local-writer users do. This PR no longer wires the Scala consumer side (dropped per offline reviewer feedback — see PR description), so the parity gap is moot today, but the helper-shape requirement is captured in a javadoc note on `getCustomMetrics()` so the first per-metric follow-up gets it right from day one. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6e6d98d to
d68ae7c
Compare
ShuffleWriterMetrics currently has a hand-rolled list of 9 scalar fields, two of which (`avgDictionaryFields`, `dictionarySize`) are Velox-specific. Adding more backend-specific scalars every time a backend needs another counter doesn't scale — other backends (ClickHouse, GPU, RSS) have the same need and the cross-backend coordination cost grows linearly per metric. This change adds a generic `std::unordered_map<std::string, int64_t> customMetrics` to ShuffleWriterMetrics that any shuffle writer can populate with backend-specific stats. It is plumbed through the existing JNI `stop()` serialization as two parallel arrays (keys + values) into `GlutenSplitResult`, where the JVM side reassembles them lazily into an unmodifiable `Map<String, Long>` on first access. Convention for keys: `<Backend>.<Family>.<Stat>`. Spark-side surfacing (registration as SQLMetrics) happens per-key in the backend's MetricsApi; unknown keys are silently dropped on the Scala side so the producer can ship without coordinating with every downstream surface. Includes `GlutenSplitResultSuite` covering the JVM-side reassembly (empty / null / populated arrays, caching, immutability) so the JNI boundary is fenced by a unit test that doesn't need a full Spark / native round-trip. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
If a producer ever ships mismatched key/value arrays through the JNI boundary, the AIOOBE inside the synchronized block in `getCustomMetrics()` would leave the volatile cache field unassigned — every subsequent call would re-enter the synchronized block and re-throw, an unhelpful failure mode. Adds an explicit length-match check (before the cache field is assigned to a partial map) that throws `IllegalStateException` mentioning both array lengths so the producer-side bug is unambiguous. Two new `GlutenSplitResultSuite` cases: - mismatched lengths - null values array when keys is non-empty Both assert the second call still throws (cache stays null on failure paths). Full suite is now 8/8 locally. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
d68ae7c to
722fa45
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes are proposed in this pull request?
ShuffleWriterMetrics(incpp/core/shuffle/Options.h) currently has a hand-rolled list of 9 scalar fields, two of which (avgDictionaryFields,dictionarySize) are Velox-specific. Every time someone wants to expose another backend-specific shuffle stat, the pattern is: new scalar field on the struct → new constructor parameter onGlutenSplitResult→ new getter → new line inColumnarShuffleWriter.scala. That's four files per metric, and it forces cross-backend coordination (the ClickHouse / GPU / RSS backends would each need similar plumbing for their own stats).This PR adds a generic
std::unordered_map<std::string, int64_t> customMetricsfield toShuffleWriterMetricsthat any shuffle writer can populate with backend-specific stats, plumbed through the existing JNIstop()path as two parallel arrays (keys + values) intoGlutenSplitResult. The JVM side reassembles them lazily into an unmodifiableMap<String, Long>on firstgetCustomMetrics()access, with a defensive length-match check that fails loudly on producer-side bugs.Convention for keys:
<Backend>.<Family>.<Stat>. Spark-side surfacing (registration as SQLMetrics) happens per-key in the backend's MetricsApi.Files touched
cpp/core/shuffle/Options.h— newcustomMetricsfield with key-naming convention doccpp/core/shuffle/ShuffleWriter.{h,cc}—customMetrics()const accessorcpp/core/jni/JniWrapper.cc— marshal map as two parallel arrays at the existingstop()sitegluten-arrow/src/main/java/org/apache/gluten/vectorized/GlutenSplitResult.java— new constructor params, lazygetCustomMetrics()accessor with caching + defensive length-match checkgluten-arrow/src/test/scala/org/apache/gluten/vectorized/GlutenSplitResultSuite.scala— JVM-side unit tests for the lazy reassembly logicHow was this patch tested?
ninja gluten velox: cleanmvn -Pbackends-velox,spark-3.5 -pl gluten-arrow compile: BUILD SUCCESSmvn -Pbackends-velox,spark-3.5 -pl gluten-arrow scalatest:test -DwildcardSuites='org.apache.gluten.vectorized.GlutenSplitResultSuite': 8/8 passeddev/git-clang-format --binary clang-format-15 --style=file: cleandev/format-scala-code.sh: cleandev/check.py format main: Ok on all C++ files touchedWas this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)