[MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code paths#12130

Open
yaooqinn wants to merge 2 commits into apache:main from yaooqinn:users/kentyao/spike-drop-arrow-csv-dataset

@yaooqinn
Member

What changes were proposed in this pull request?

Removes dead JVM-side code paths related to Arrow CSV scanning and Arrow Dataset readers in gluten-arrow. Two commits:

  1. [MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code path (-1441 lines)

    • Deletes 12 files: ArrowCSVFileFormat, ArrowCSVOptionConverter, ArrowCSVPartitionReaderFactory, ArrowCSVScan, ArrowCSVScanBuilder, ArrowCSVTable, ArrowBatchScanExec, ArrowConvertorRule, ArrowScanReplaceRule, ArrowFileSourceScanExec, BaseArrowScanExec, ArrowCsvScanSuite.
    • Removes the ArrowBatchScanExecShim segment from 5 shim files (spark33/34/35/40/41).
  2. [MINOR][VL] Remove dead Arrow dataset reader paths from ArrowUtil (-167 lines)

    • Removes the following methods from ArrowUtil.scala: makeArrowDiscovery, readArrowSchema, readArrowFileColumnNames, readSchema (×2 overloads), loadMissingColumns, loadPartitionColumns, loadBatch. These all instantiated FileSystemDatasetFactory / CsvFragmentScanOptions and were only used by the now-deleted classes above.

Total: 18 files changed, -1608 lines.

Why are the changes needed?

This code is dead:

  • No service registration: no META-INF/services entry routes Spark to ArrowCSVFileFormat or ArrowCSVTable.
  • No rule injection: VeloxRuleApi does not inject ArrowConvertorRule or ArrowScanReplaceRule into the optimizer pipeline.
  • No active tests: ArrowCsvScanSuite is fully @Ignored.
  • No callers: grepping the whole repo for the removed ArrowUtil reader methods returns 0 call sites outside the deleted files.
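
The "no callers" and "no service registration" checks above can be reproduced with a short script. This is a minimal sketch, not the exact commands used for this PR: `count_call_sites` and `count_service_entries` are hypothetical helper names, and the exclude pattern is an assumption that has to list the files being deleted plus the file that defines the method.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical and not part of the Gluten repo.

# Print how many *.scala / *.java files under ROOT mention NAME as a whole
# word, ignoring paths that match EXCLUDE_REGEX (must be non-empty), e.g. the
# files being deleted plus the file that defines NAME.
count_call_sites() {
  local root="$1" name="$2" exclude_regex="$3"
  grep -rlw --include='*.scala' --include='*.java' -- "$name" "$root" 2>/dev/null \
    | grep -Ev -- "$exclude_regex" \
    | wc -l | tr -d ' '
}

# Print how many files under any META-INF/services directory mention NAME.
count_service_entries() {
  local root="$1" name="$2"
  find "$root" -type f -path '*/META-INF/services/*' \
    -exec grep -l -- "$name" {} + 2>/dev/null \
    | wc -l | tr -d ' '
}
```

Run from the repo root, `count_call_sites . loadBatch 'ArrowCSV|ArrowUtil\.scala'` and `count_service_entries . ArrowCSVFileFormat` should both print 0 if the claims above hold. A whole-word grep can still miss reflective or string-built references, so a clean compile remains the stronger check.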

These classes appear to be unreachable code introduced by a previous squash-merge and never wired into the actual execution path. They also keep gluten-arrow glued to the patched arrow-dataset JVM API (CsvFragmentScanOptions.from(Map), 5-arg FileSystemDatasetFactory) shipped via dev/build-arrow.sh — removing them unblocks future work to drop the patched-arrow build entirely.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Compiled clean on both Spark profiles:

./build/mvn compile      -Pbackends-velox -Pspark-4.0 -Pscala-2.13 -DskipTests
./build/mvn test-compile -Pbackends-velox -Pspark-3.5 -Pscala-2.12 -DskipTests

All 6 affected modules pass scalastyle with 0 errors.

Generated-by: claude-opus-4.7

yaooqinn added 2 commits May 22, 2026 13:06
[MINOR][VL] Remove dead Arrow-CSV / Arrow-Dataset JVM code path

The ArrowCSV file format and ArrowBatchScanExec chain are unreachable:
no injection in VeloxRuleApi, no META-INF/services entry, and all
ArrowCsvScanSuite cases are @ignore'd. They were introduced as a
squash-merge byproduct in apache#11776 and never wired up.

Verified by compiling:
  * spark-3.5 + scala-2.12 + arrow 15.0.0-gluten (install)
  * spark-4.0 + scala-2.13 + arrow 18.1.0 (compile)

Generated-by: claude-opus-4.7

[MINOR][VL] Remove dead Arrow dataset reader paths from ArrowUtil

makeArrowDiscovery / readArrowSchema / readArrowFileColumnNames /
readSchema(FragmentScanOptions) overloads / loadMissingColumns /
loadPartitionColumns / loadBatch in ArrowUtil have zero callers across
the repo after the previous removal of the ArrowCSV chain. Drop them
together with the now-unused imports (arrow.dataset.*, FileStatus,
URI/URLDecoder, ArrowRecordBatch, Optional, Logging, etc.).

Verified by compiling:
  * spark-3.5 + scala-2.12 (test-compile, patched arrow 15.0.0-gluten)
  * spark-4.0 + scala-2.13 (compile, pure arrow 18.1.0)

Generated-by: claude-opus-4.7
github-actions bot added the CORE (works for Gluten Core) and VELOX labels May 22, 2026
@github-actions

Run Gluten Clickhouse CI on x86
