[CORE] Add customMetrics extension point to ShuffleWriterMetrics for backend-specific shuffle stats by luis4a0 · Pull Request #12114 · apache/gluten

luis4a0 · 2026-05-19T06:21:10Z

What changes are proposed in this pull request?

ShuffleWriterMetrics (in cpp/core/shuffle/Options.h) currently has a hand-rolled list of 9 scalar fields, two of which (avgDictionaryFields, dictionarySize) are Velox-specific. Every time someone wants to expose another backend-specific shuffle stat, the pattern is: new scalar field on the struct → new constructor parameter on GlutenSplitResult → new getter → new line in ColumnarShuffleWriter.scala. That's four files per metric, and it forces cross-backend coordination (the ClickHouse / GPU / RSS backends would each need similar plumbing for their own stats).

This PR adds a generic std::unordered_map<std::string, int64_t> customMetrics field to ShuffleWriterMetrics that any shuffle writer can populate with backend-specific stats, plumbed through the existing JNI stop() path as two parallel arrays (keys + values) into GlutenSplitResult. The JVM side reassembles them lazily into an unmodifiable Map<String, Long> on first getCustomMetrics() access, with a defensive length-match check that fails loudly on producer-side bugs.

Convention for keys: <Backend>.<Family>.<Stat>. Spark-side surfacing (registration as SQLMetrics) happens per-key in the backend's MetricsApi.

Files touched

cpp/core/shuffle/Options.h — new customMetrics field with key-naming convention doc
cpp/core/shuffle/ShuffleWriter.{h,cc} — customMetrics() const accessor
cpp/core/jni/JniWrapper.cc — marshal map as two parallel arrays at the existing stop() site
gluten-arrow/src/main/java/org/apache/gluten/vectorized/GlutenSplitResult.java — new constructor params, lazy getCustomMetrics() accessor with caching + defensive length-match check
gluten-arrow/src/test/scala/org/apache/gluten/vectorized/GlutenSplitResultSuite.scala — JVM-side unit tests for the lazy reassembly logic

How was this patch tested?

ninja gluten velox: clean
mvn -Pbackends-velox,spark-3.5 -pl gluten-arrow compile: BUILD SUCCESS
mvn -Pbackends-velox,spark-3.5 -pl gluten-arrow scalatest:test -DwildcardSuites='org.apache.gluten.vectorized.GlutenSplitResultSuite': 8/8 passed
dev/git-clang-format --binary clang-format-15 --style=file: clean
dev/format-scala-code.sh: clean
dev/check.py format main: Ok on all C++ files touched

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context)

yaooqinn · 2026-05-19T09:07:31Z

+      (key, value) =>
+        val m = dep.metrics.get(key)
+        if (m.isDefined) {
+          m.get.add(value)


splitResult.getCustomMetrics() is consumed here but not in VeloxCelebornColumnarShuffleWriter.scala:101+ or VeloxUniffleColumnarShuffleWriter.java:232. RSS users won't see new Velox.InputEncoding.* metrics. Suggest a shared applyCustomMetrics(splitResult, dep) helper called from all three writers.

…ay lengths apache#12114 (comment) If a future native-side producer ever ships mismatched key/value arrays, the AIOOBE inside the synchronized block in `getCustomMetrics()` would leave the volatile cache field unassigned — every subsequent call would re-enter the synchronized block and re-throw, an unhelpful failure mode. Adds an explicit length-match check (before the cache field is assigned to a partial map) that throws `IllegalStateException` mentioning both array lengths so the producer-side bug is unambiguous. Adds two new `GlutenSplitResultSuite` cases covering: - mismatched lengths (3-element keys + 1-element values) - null values array when keys is non-empty The second call after a failure must still throw (the cache field stays null on the failure path); both new tests assert this. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@yaooqinn

… RSS-helper TODO Two review comments from @yaooqinn on apache#12114: (1) apache#12114 (comment) — If a future native-side producer ever ships mismatched key/value arrays, the AIOOBE inside the synchronized block in `getCustomMetrics()` would leave the volatile cache field unassigned. Every subsequent call would re-enter the synchronized block and re-throw, an unhelpful failure mode. Adds an explicit length-match check (before the cache field is assigned to a partial map) that throws `IllegalStateException` mentioning both array lengths so the producer-side bug is unambiguous. Two new `GlutenSplitResultSuite` cases: - mismatched lengths (2-element keys + 1-element values) - null values array when keys is non-empty Both assert the second call still throws (cache field stays null on failure paths). Full suite is now 8/8 locally. (2) apache#12114 (comment) — There are 3 shuffle-writer entry points that call `shuffleWriterJniWrapper.stop(...)`: `ColumnarShuffleWriter`, `VeloxCelebornColumnarShuffleWriter`, and `VeloxUniffleColumnarShuffleWriter`. Any future PR that drains `getCustomMetrics()` into Spark SQLMetrics must factor a shared helper (e.g. `applyCustomMetrics(splitResult, dep)`) and call it from all 3 writers, otherwise RSS users won't see metrics that local-writer users do. This PR no longer wires the Scala consumer side (dropped per offline reviewer feedback — see PR description), so the parity gap is moot today, but the helper-shape requirement is captured in a javadoc note on `getCustomMetrics()` so the first per-metric follow-up gets it right from day one. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

ShuffleWriterMetrics currently has a hand-rolled list of 9 scalar fields, two of which (`avgDictionaryFields`, `dictionarySize`) are Velox-specific. Adding more backend-specific scalars every time a backend needs another counter doesn't scale — other backends (ClickHouse, GPU, RSS) have the same need and the cross-backend coordination cost grows linearly per metric. This change adds a generic `std::unordered_map<std::string, int64_t> customMetrics` to ShuffleWriterMetrics that any shuffle writer can populate with backend-specific stats. It is plumbed through the existing JNI `stop()` serialization as two parallel arrays (keys + values) into `GlutenSplitResult`, where the JVM side reassembles them lazily into an unmodifiable `Map<String, Long>` on first access. Convention for keys: `<Backend>.<Family>.<Stat>`. Spark-side surfacing (registration as SQLMetrics) happens per-key in the backend's MetricsApi; unknown keys are silently dropped on the Scala side so the producer can ship without coordinating with every downstream surface. Includes `GlutenSplitResultSuite` covering the JVM-side reassembly (empty / null / populated arrays, caching, immutability) so the JNI boundary is fenced by a unit test that doesn't need a full Spark / native round-trip. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

If a producer ever ships mismatched key/value arrays through the JNI boundary, the AIOOBE inside the synchronized block in `getCustomMetrics()` would leave the volatile cache field unassigned — every subsequent call would re-enter the synchronized block and re-throw, an unhelpful failure mode. Adds an explicit length-match check (before the cache field is assigned to a partial map) that throws `IllegalStateException` mentioning both array lengths so the producer-side bug is unambiguous. Two new `GlutenSplitResultSuite` cases: - mismatched lengths - null values array when keys is non-empty Both assert the second call still throws (cache stays null on failure paths). Full suite is now 8/8 locally. Generated-by: GitHub Copilot CLI (Claude Opus 4.7 1M context) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot added the VELOX label May 19, 2026

yaooqinn reviewed May 19, 2026

View reviewed changes

Comment thread gluten-arrow/src/main/java/org/apache/gluten/vectorized/GlutenSplitResult.java

luis4a0 force-pushed the lpenaranda/pr-custom-metrics branch from 6d0f4f8 to 6e6d98d Compare May 19, 2026 09:15

luis4a0 force-pushed the lpenaranda/pr-custom-metrics branch from 6e6d98d to d68ae7c Compare May 19, 2026 09:17

luis4a0 and others added 2 commits May 19, 2026 09:40

luis4a0 force-pushed the lpenaranda/pr-custom-metrics branch from d68ae7c to 722fa45 Compare May 19, 2026 09:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CORE] Add customMetrics extension point to ShuffleWriterMetrics for backend-specific shuffle stats#12114

[CORE] Add customMetrics extension point to ShuffleWriterMetrics for backend-specific shuffle stats#12114
luis4a0 wants to merge 2 commits into
apache:mainfrom
luis4a0:lpenaranda/pr-custom-metrics

luis4a0 commented May 19, 2026 •

edited

Loading

Uh oh!

yaooqinn May 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

luis4a0 commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes are proposed in this pull request?

Files touched

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

yaooqinn May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

luis4a0 commented May 19, 2026 •

edited

Loading