[MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code paths by yaooqinn · Pull Request #12130 · apache/gluten

yaooqinn · 2026-05-22T19:12:02Z

What changes were proposed in this pull request?

Removes dead JVM-side code paths related to Arrow CSV scanning and Arrow Dataset readers in gluten-arrow. Two commits:

[MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code path (-1441 lines)
- Deletes 12 files: ArrowCSVFileFormat, ArrowCSVOptionConverter, ArrowCSVPartitionReaderFactory, ArrowCSVScan, ArrowCSVScanBuilder, ArrowCSVTable, ArrowBatchScanExec, ArrowConvertorRule, ArrowScanReplaceRule, ArrowFileSourceScanExec, BaseArrowScanExec, ArrowCsvScanSuite.
- Truncates the ArrowBatchScanExecShim segment from 5 shim files (spark33/34/35/40/41).
[MINOR][VL] Remove dead Arrow dataset reader paths from ArrowUtil (-167 lines)
- Removes 7 methods (8 counting both readSchema overloads) from ArrowUtil.scala: makeArrowDiscovery, readArrowSchema, readArrowFileColumnNames, readSchema (×2 overloads), loadMissingColumns, loadPartitionColumns, loadBatch. These all instantiated FileSystemDatasetFactory / CsvFragmentScanOptions and were only used by the now-deleted classes above.

Total: 18 files changed, -1608 lines.

Why are the changes needed?

This code is dead:

- No service registration: no META-INF/services entry routes Spark to ArrowCSVFileFormat or ArrowCSVTable.
- No rule injection: VeloxRuleApi does not inject ArrowConvertorRule or ArrowScanReplaceRule into the optimizer pipeline.
- No active tests: ArrowCsvScanSuite is fully @Ignored.
- No callers: grepping the whole repo for the removed ArrowUtil reader methods returns 0 call sites outside the deleted files.
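
The service-registration and caller checks above are plain text searches. A minimal shell sketch of both follows, assuming a grep that supports -r and --include (GNU and BSD grep do). The function names callers_outside and has_service_entry, and the ignore patterns in the usage comments, are hypothetical helpers for illustration; they are not scripts that exist in the Gluten repo.

```shell
# Hypothetical helpers, not part of the Gluten repo.

# Print the number of .scala/.java files under $1 that mention symbol $2 as a
# whole word, skipping files whose path matches the extended regex $3.
# $3 is required: the file that defines the symbol matches too, so it has to
# be listed there along with the files being deleted.
callers_outside() {
  grep -rlwF --include='*.scala' --include='*.java' -- "$2" "$1" 2>/dev/null \
    | grep -Ev -- "$3" \
    | wc -l | tr -d ' '
}

# Succeed (exit status 0) if any file inside a META-INF/services directory
# below $1 mentions the class name $2.
has_service_entry() {
  find "$1" -type f -path '*/META-INF/services/*' \
    -exec grep -lF -- "$2" {} + 2>/dev/null | grep -q .
}

# Example usage from the repo root (patterns are illustrative):
#   callers_outside . loadBatch 'ArrowUtil\.scala$|/ArrowCSV[A-Za-z]*\.scala$'
#   has_service_entry . ArrowCSVFileFormat || echo "no service registration"
```

A whole-word, fixed-string match (-w -F) keeps a name such as loadBatch from also counting longer identifiers that merely start with it.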

These classes appear to be unreachable code introduced by a previous squash-merge (apache#11776) and never wired into the actual execution path. They also keep gluten-arrow tied to the patched arrow-dataset JVM API (CsvFragmentScanOptions.from(Map), 5-arg FileSystemDatasetFactory) shipped via dev/build-arrow.sh. Removing them unblocks future work to drop the patched-arrow build entirely.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Compiled clean on both Spark profiles:

./build/mvn compile      -Pbackends-velox -Pspark-4.0 -Pscala-2.13 -DskipTests  # ✅
./build/mvn test-compile -Pbackends-velox -Pspark-3.5 -Pscala-2.12 -DskipTests  # ✅
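
The two checks above can be wrapped in a small loop so that the same flags are applied to each profile pair. This is only a sketch: compile_matrix and the DRY_RUN variable are hypothetical names, and the goal and profile pairs are copied from the two commands above.

```shell
# Hypothetical wrapper, not part of the Gluten repo.
# Runs one Maven goal per Spark/Scala profile pair; with DRY_RUN=1 it only
# prints the command lines instead of invoking Maven.
compile_matrix() {
  # Each entry is "goal spark-profile scala-profile".
  for entry in \
    "compile spark-4.0 scala-2.13" \
    "test-compile spark-3.5 scala-2.12"
  do
    # Split the entry into $1 (goal), $2 (Spark profile), $3 (Scala profile).
    set -- $entry
    cmd="./build/mvn $1 -Pbackends-velox -P$2 -P$3 -DskipTests"
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "$cmd"
    else
      # Stop at the first failing profile.
      $cmd || return 1
    fi
  done
}
```

Running it as DRY_RUN=1 compile_matrix prints the two command lines without starting a build, which is a cheap way to confirm the flags before a long compile.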

All 6 affected modules pass scalastyle with 0 errors.

Generated-by: claude-opus-4.7


The ArrowCSV file format and ArrowBatchScanExec chain are unreachable: no injection in VeloxRuleApi, no META-INF/services entry, and all ArrowCsvScanSuite cases are @ignore'd. They were introduced as a squash-merge byproduct in apache#11776 and never wired up.

Verified by compiling:
- spark-3.5 + scala-2.12 + arrow 15.0.0-gluten (install)
- spark-4.0 + scala-2.13 + arrow 18.1.0 (compile)

Generated-by: claude-opus-4.7

makeArrowDiscovery / readArrowSchema / readArrowFileColumnNames / readSchema(FragmentScanOptions) overloads / loadMissingColumns / loadPartitionColumns / loadBatch in ArrowUtil have zero callers across the repo after the previous removal of the ArrowCSV chain. Drop them together with the now-unused imports (arrow.dataset.*, FileStatus, URI/URLDecoder, ArrowRecordBatch, Optional, Logging, etc.).

Verified by compiling:
- spark-3.5 + scala-2.12 (test-compile, patched arrow 15.0.0-gluten)
- spark-4.0 + scala-2.13 (compile, pure arrow 18.1.0)

Generated-by: claude-opus-4.7

github-actions · 2026-05-22T19:14:00Z

Run Gluten Clickhouse CI on x86

yaooqinn added 2 commits May 22, 2026 13:06

github-actions bot added the CORE (works for Gluten Core) and VELOX labels on May 22, 2026

yaooqinn wants to merge 2 commits into apache:main from yaooqinn:users/kentyao/spike-drop-arrow-csv-dataset
