[VL] Add RDDScanExec support to Velox backend #12113

Open
minni31 wants to merge 3 commits into apache:main from minni31:oss/rddscan-velox-support


minni31 commented on May 19, 2026

Implements VeloxRDDScanTransformer to offload RDDScanExec to Velox's native row-to-columnar conversion path. This enables columnar execution for DataFrames backed by LogicalRDD (e.g., from df.checkpoint(), df.localCheckpoint(), or programmatic RDD creation).
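
To illustrate the kind of plan this targets, here is a minimal sketch in plain Spark (no Gluten classes involved) showing that a locally checkpointed DataFrame is planned as an `RDDScanExec`. The object and method names (`CheckpointScanSketch`, `rddScans`) are hypothetical and not part of this PR; the sketch inspects `sparkPlan`, i.e. the physical plan before any columnar rules run.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.RDDScanExec

// Hypothetical helper for illustration only; not code from this PR.
object CheckpointScanSketch {
  def rddScans(spark: SparkSession): Seq[RDDScanExec] = {
    import spark.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
    // localCheckpoint() materializes the data and returns a DataFrame
    // whose logical plan is a LogicalRDD over the checkpointed rows.
    val checkpointed = df.localCheckpoint()
    // Collect the RDDScanExec nodes from the physical plan as produced
    // by the planner, before preparation and columnar rules are applied.
    checkpointed.queryExecution.sparkPlan.collect {
      case scan: RDDScanExec => scan
    }
  }
}
```

With the transformer from this PR enabled, nodes like these are the candidates for offloading; whether a given node is actually offloaded depends on the schema validation described below.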

Key design:

  • Validates schema via VeloxValidatorApi.validateSchema (recursive) plus Arrow compatibility checks (rejects MapType, interval types)
  • Handles BatchCarrierRow (from checkpoint) by unwrapping directly
  • Standard InternalRow path delegates to RowToVeloxColumnarExec.toColumnarBatchIterator
  • Preserves original partitioning and ordering from the source RDDScanExec
  • Unsupported schemas fall back gracefully to vanilla Spark
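
The Arrow compatibility part of the validation above can be sketched as a recursive walk over the schema. This is an illustration under assumptions, not the PR's implementation: `RddScanSchemaCheck` and `isOffloadable` are hypothetical names, the sketch assumes Spark 3.2+ (where `YearMonthIntervalType` and `DayTimeIntervalType` exist), and it covers only the MapType and interval rejections, not the checks done by `VeloxValidatorApi.validateSchema`.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper for illustration only; not code from this PR.
object RddScanSchemaCheck {
  // Returns true when neither MapType nor any interval type appears
  // anywhere in the type tree, including nested struct fields and
  // array elements.
  def isOffloadable(dt: DataType): Boolean = dt match {
    case _: MapType => false
    case CalendarIntervalType => false
    case _: YearMonthIntervalType | _: DayTimeIntervalType => false
    case s: StructType => s.fields.forall(f => isOffloadable(f.dataType))
    case a: ArrayType => isOffloadable(a.elementType)
    case _ => true
  }
}
```

A caller would keep the original `RDDScanExec` whenever the check returns false, which corresponds to the fallback to vanilla Spark mentioned in the last bullet.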

Test coverage: 13 tests covering basic types, complex types, aggregation, empty RDD, nulls, fallback scenarios, and BatchCarrierRow from checkpoint.

What changes are proposed in this pull request?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Related issue: #8629

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
minni31 changed the title from "[GLUTEN-8629][VL] Add RDDScanExec support to Velox backend" to "[VL] Add RDDScanExec support to Velox backend" on May 19, 2026
github-actions bot added the VELOX label on May 19, 2026
Minni Mittal and others added 2 commits May 19, 2026 08:35
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The logical plan's output contains unresolved attributes which cause
AnalysisException when creating a new DataFrame from LogicalRDD.
Use analyzed.output to get fully resolved attributes with proper
types and expression IDs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
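
As a rough sketch of the fix described in the commit message above, the attributes can be read from the analyzed plan rather than the raw logical plan. The names here (`ResolvedOutputSketch`, `resolvedOutput`) are hypothetical, and the sketch only shows how to obtain the resolved attributes; how the PR's tests then build a DataFrame from `LogicalRDD` is not reproduced here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.expressions.Attribute

// Hypothetical helper for illustration only; not code from this PR.
object ResolvedOutputSketch {
  def resolvedOutput(spark: SparkSession): Seq[Attribute] = {
    import spark.implicits._
    val df = Seq((1, "a"), (2, "b")).toDF("id", "name")
    // The analyzed plan has passed through the analyzer, so its output
    // attributes carry concrete data types and expression IDs.
    df.queryExecution.analyzed.output
  }
}
```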