
[VL] Reduce Velox scan SQL metrics by default to mitigate driver OOM#12127

Draft
lifulong wants to merge 1 commit into
apache:mainfrom
lifulong:gluten_driver_oom_while_spark_ok_use_same_driver_memory


@lifulong lifulong commented May 22, 2026

What changes are proposed in this pull request?

Gluten jobs on the Velox backend are more prone to driver memory pressure than vanilla Spark in some production workloads. Investigation points to scan operators registering too many SQL metrics (accumulators).

Each BatchScanExecTransformer / FileSourceScanExecTransformer / HiveTableScanExecTransformer previously registered 30+ executor-side metrics per scan node.

Vanilla Spark is much leaner—for example, BatchScanExec only exposes numOutputRows (+ connector customMetrics), and FileSourceScanExec adds a small set of driver metrics (numFiles, metadataTime, etc.).

This gap increases driver heap usage and can contribute to driver OOM, especially on scan-heavy queries.
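
A rough way to see the scale of the gap is to count the metric values the driver may have to track for a single scan stage. The sketch below assumes that this count grows as scan nodes × metrics per scan × tasks. The scan-node and task counts are hypothetical; only the per-scan figures (30 for the previous set, 9 for the minimal BatchScan set) come from this description.

```java
// Back-of-envelope sketch, not Gluten or Spark code.
// Assumption: driver-side bookkeeping grows with one value per (scan node, metric, task).
class ScanMetricEstimate {
    static long metricValues(long scanNodes, long metricsPerScan, long tasks) {
        return scanNodes * metricsPerScan * tasks;
    }

    public static void main(String[] args) {
        long scanNodes = 10;   // hypothetical number of scan nodes in the query
        long tasks = 100_000;  // hypothetical number of tasks in the scan stage

        long before = metricValues(scanNodes, 30, tasks); // "30+" metrics per scan
        long after = metricValues(scanNodes, 9, tasks);   // minimal BatchScan set

        System.out.println("before=" + before + " after=" + after);
    }
}
```

Under these assumed inputs the count drops from 30,000,000 to 9,000,000 values. The actual heap savings depend on how the driver stores each value, which this sketch does not model.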

(Driver heap dump taken at OOM: the largest memory-consuming object is LiveStageMetrics.)
[Image: WeCom screenshot 7f05f208-9f83-472b-b638-0aa70650abfc]

(Gluten failed in the first scan stage, while vanilla Spark finished successfully with the same 12g of driver memory.)
[Image: WeCom screenshot 0f06b928-eff5-4ba8-a1ae-6f87aca571be]

This PR introduces a Velox-only minimal scan metrics set that is used by default, plus an opt-in switch that restores full metrics collection for debugging and advanced troubleshooting:
spark.gluten.sql.scan.detailedMetrics.enabled

ClickHouse backend is unchanged—this config does not affect CH scan metrics.

Default minimal metrics (Velox)
BatchScan (9 executor metrics):
rawInputRows, rawInputBytes, numOutputRows, outputBytes, scanTime, wallNanos, peakMemoryBytes, ioWaitTime, storageReadBytes

FileSourceScan / HiveTableScan (the 9 executor metrics above plus Spark-aligned driver metrics):
numFiles, metadataTime, filesSize, numPartitions, pruningTime
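
The two default sets listed above can be written out as a small lookup. This is an illustrative sketch only: the class and method names are hypothetical and do not correspond to Gluten's actual code, while the metric names are copied from the lists above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Gluten: models the default (minimal) metric sets.
class MinimalScanMetrics {
    // The 9 executor metrics kept for every Velox scan by default.
    static final List<String> EXECUTOR_METRICS = List.of(
        "rawInputRows", "rawInputBytes", "numOutputRows", "outputBytes", "scanTime",
        "wallNanos", "peakMemoryBytes", "ioWaitTime", "storageReadBytes");

    // Spark-aligned driver metrics added for file-based and Hive scans.
    static final List<String> DRIVER_METRICS = List.of(
        "numFiles", "metadataTime", "filesSize", "numPartitions", "pruningTime");

    static List<String> minimalMetrics(String scanType) {
        List<String> metrics = new ArrayList<>(EXECUTOR_METRICS);
        switch (scanType) {
            case "BatchScan":
                break; // executor metrics only
            case "FileSourceScan":
            case "HiveTableScan":
                metrics.addAll(DRIVER_METRICS);
                break;
            default:
                throw new IllegalArgumentException("unknown scan type: " + scanType);
        }
        return metrics;
    }

    public static void main(String[] args) {
        System.out.println(minimalMetrics("BatchScan").size());      // 9
        System.out.println(minimalMetrics("FileSourceScan").size()); // 14
    }
}
```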

Moved to full collection only (when detailed metrics enabled)
Examples include: numInputRows, inputVectors, inputBytes, outputVectors, cpuCount, numMemoryAllocations, skippedSplits, processedSplits, numDynamicFiltersAccepted, loadLazyVectorTime, skippedStrides, processedStrides, connector timing (preloadSplits, pageLoadTime, dataSourceAddSplitTime, dataSourceReadTime), storage cache details (storageReads, localReadBytes, ramReadBytes), etc.
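
The gating behavior described above could look roughly like the following. This is a sketch under stated assumptions, not the patch itself: a plain Map stands in for the Spark configuration, the class and method names are hypothetical, and only a handful of the detailed-only metric names from the list above are included.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not Gluten code: decides whether a scan metric is registered.
class DetailedMetricsSwitch {
    static final String KEY = "spark.gluten.sql.scan.detailedMetrics.enabled";

    // A few of the metrics that move to full collection only (subset for illustration).
    static final Set<String> DETAILED_ONLY = Set.of(
        "numInputRows", "inputVectors", "cpuCount", "skippedSplits",
        "processedSplits", "loadLazyVectorTime", "storageReads");

    // The switch is opt-in, so a missing key is treated as "false".
    static boolean detailedEnabled(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(KEY, "false"));
    }

    // For simplicity, any name outside DETAILED_ONLY is treated as always registered.
    static boolean isRegistered(String metric, Map<String, String> conf) {
        return !DETAILED_ONLY.contains(metric) || detailedEnabled(conf);
    }

    public static void main(String[] args) {
        Map<String, String> defaults = Map.of();
        Map<String, String> debugging = Map.of(KEY, "true");
        System.out.println(isRegistered("skippedSplits", defaults));  // false
        System.out.println(isRegistered("skippedSplits", debugging)); // true
        System.out.println(isRegistered("numOutputRows", defaults));  // true
    }
}
```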

How was this patch tested?

WIP: validation is in progress in our production environment.

Was this patch authored or co-authored using generative AI tooling?

Co-authored using Cursor.

@github-actions github-actions Bot added the CORE (works for Gluten Core) and VELOX labels May 22, 2026
@lifulong lifulong marked this pull request as draft May 22, 2026 07:41
@github-actions

Run Gluten Clickhouse CI on x86

@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 86f7772 to 09c0f07 Compare May 22, 2026 07:52
@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch 2 times, most recently from a8e8cab to 67c52c7 Compare May 22, 2026 10:04
@github-actions github-actions Bot added the DOCS label May 22, 2026
@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 67c52c7 to 6bbb6e8 Compare May 22, 2026 10:16
@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 6bbb6e8 to 4bb6c9a Compare May 22, 2026 11:09
@lifulong lifulong force-pushed the gluten_driver_oom_while_spark_ok_use_same_driver_memory branch from 4bb6c9a to c621483 Compare May 22, 2026 11:19