[ARUON #1994] Support HiveTableScanExec to native #1995
base: master
Conversation
```scala
  override def isEnabled: Boolean =
    getBooleanConf("spark.auron.enable.hiveTable", defaultValue = true)

  def enableHiveTableScanExec: Boolean =
    getBooleanConf("spark.auron.enable.hiveTableScanExec", defaultValue = false)
```
Should we be adding these to @SparkAuronConfiguration.java?
```scala
  override def isSupported(exec: SparkPlan): Boolean =
    exec match {
      case e: HiveTableScanExec if enableHiveTableScanExec &&
          e.relation.tableMeta.provider.isDefined &&
          e.relation.tableMeta.provider.get.equals("hive") =>
        true
      case _ => false
    }
```
It's slightly hard to distinguish when to use isSupported vs. the enableHiveTableScanExec flag. Do you mind adding documentation here?
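A rough sketch of the kind of doc comment that could spell this out; the wording is illustrative, not taken from the PR:

```scala
  /**
   * Gating works in two layers: `isEnabled` (spark.auron.enable.hiveTable) switches the whole
   * provider on or off, while `enableHiveTableScanExec` (spark.auron.enable.hiveTableScanExec,
   * off by default) additionally gates the HiveTableScanExec conversion. `isSupported` then
   * checks a concrete plan node and only accepts a HiveTableScanExec whose catalog table
   * provider is "hive".
   */
  override def isSupported(exec: SparkPlan): Boolean =
    exec match {
      case e: HiveTableScanExec if enableHiveTableScanExec &&
          e.relation.tableMeta.provider.isDefined &&
          e.relation.tableMeta.provider.get.equals("hive") =>
        true
      case _ => false
    }
```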
```scala
      .newBuilder()
      .setBaseConf(nativeFileScanConf)
      .setFsResourceId(resourceId)
      .addAllPruningPredicates(new java.util.ArrayList()) // not support this filter
```
So we will do a full table scan here? If that's the case, do you mind creating an issue and linking it here?
Are there tests we can add to verify conversion behavior here?
Pull request overview
This PR aims to add native execution support for Spark HiveTableScanExec by introducing a Hive-specific conversion provider and a new native scan exec implementation, plus updating build dependencies to compile against Spark Hive classes.
Changes:
- Add `HiveConvertProvider` and a new `NativeHiveTableScanExec` implementation intended to convert `HiveTableScanExec` to native ORC/Parquet scans.
- Update `spark-extension-shims-spark` Maven dependencies to include the Spark Hive/Catalyst APIs needed by the new shims code.
- Refactor imports in `AuronConverters`, and add a (currently unimplemented) `convertHiveTableScanExec` stub.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| spark-extension/src/main/scala/org/apache/spark/sql/auron/AuronConverters.scala | Import refactor; adds an unimplemented convertHiveTableScanExec method; currently introduces compile issues (missing import + stub). |
| spark-extension-shims-spark/src/main/scala/org/apache/spark/sql/hive/execution/auron/plan/NativeHIveTableScanExec.scala | New native Hive table scan exec; currently contains multiple compile/runtime issues (API mismatches, pattern matches, session usage, unsupported formats). |
| spark-extension-shims-spark/src/main/scala/org/apache/spark/sql/hive/execution/auron/plan/HiveConvertProvider.scala | New ServiceLoader-based conversion provider for Hive scans; currently missing ServiceLoader registration and insufficient support checks. |
| spark-extension-shims-spark/pom.xml | Adds Spark Hive/Catalyst dependencies, but introduces duplicate spark-hive dependency entries with conflicting scopes. |
```scala
  override def isSupported(exec: SparkPlan): Boolean =
    exec match {
      case e: HiveTableScanExec if enableHiveTableScanExec &&
          e.relation.tableMeta.provider.isDefined &&
          e.relation.tableMeta.provider.get.equals("hive") =>
        true
```
Copilot AI · Feb 11, 2026:
isSupported accepts all Hive tables with provider == "hive", but NativeHiveTableScanExec only builds native nodes for ORC/Parquet and otherwise will throw (e.g., MatchError on file format). Add an explicit format check here (or make the native exec gracefully fall back) to avoid runtime failures on non-ORC/Parquet Hive tables.
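A minimal sketch of such a check, assuming the input format recorded in the catalog table's storage descriptor is a reliable signal; the `nativeReadableInputFormats` set below is illustrative, not from the PR:

```scala
  // Illustrative: only accept Hive tables whose input format the native exec can actually read.
  private val nativeReadableInputFormats = Set(
    "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat",
    "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat")

  override def isSupported(exec: SparkPlan): Boolean =
    exec match {
      case e: HiveTableScanExec if enableHiveTableScanExec &&
          e.relation.tableMeta.provider.contains("hive") &&
          e.relation.tableMeta.storage.inputFormat.exists(nativeReadableInputFormats.contains) =>
        true
      case _ => false
    }
```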
```scala
      nativeTable.getMetadata)

  @transient private lazy val nativeHadoopConf = {
    val hiveConf = SparkSession.getActiveSession.get.sessionState.newHadoopConf()
```
Copilot AI · Feb 11, 2026:
nativeHadoopConf uses SparkSession.getActiveSession.get, which can throw if there is no active session (e.g., execution triggered outside a SQL context). Consider using the same session derivation as NativeHiveTableScanBase.broadcastedHadoopConf (Shims.get.getSqlContext(basedHiveScan).sparkSession) to avoid runtime failures.
Suggested change:

```diff
-    val hiveConf = SparkSession.getActiveSession.get.sessionState.newHadoopConf()
+    val sparkSession = Shims.get.getSqlContext(basedHiveScan).sparkSession
+    val hiveConf = sparkSession.sessionState.newHadoopConf()
```
```scala
  private val minPartitions = if (SparkSession.getActiveSession.get.sparkContext.isLocal) {
    0 // will splitted based on block by default.
  } else {
    math.max(nativeHadoopConf.getInt("mapreduce.job.maps", 1),
      SparkSession.getActiveSession.get.sparkContext.defaultMinPartitions)
```
Copilot AI · Feb 11, 2026:
minPartitions reads SparkSession.getActiveSession.get.sparkContext multiple times. Besides the .get risk, it’s also inconsistent with other native scan implementations in this repo that pass an explicit sparkSession around. Prefer using a single sparkSession resolved from basedHiveScan and derive sparkContext from it.
```scala
  private val minPartitions = if (SparkSession.getActiveSession.get.sparkContext.isLocal) {
    0 // will splitted based on block by default.
  } else {
    math.max(nativeHadoopConf.getInt("mapreduce.job.maps", 1),
      SparkSession.getActiveSession.get.sparkContext.defaultMinPartitions)
  }

  private val ignoreEmptySplits =
    SparkSession.getActiveSession.get.sparkContext.conf.get(HADOOP_RDD_IGNORE_EMPTY_SPLITS)
```
Copilot AI · Feb 11, 2026:
ignoreEmptySplits also depends on SparkSession.getActiveSession.get. This should use the same non-optional session/context resolution as the rest of the execution code to avoid NoSuchElementException when there is no active session.
Suggested change:

```diff
-  private val minPartitions = if (SparkSession.getActiveSession.get.sparkContext.isLocal) {
-    0 // will splitted based on block by default.
-  } else {
-    math.max(nativeHadoopConf.getInt("mapreduce.job.maps", 1),
-      SparkSession.getActiveSession.get.sparkContext.defaultMinPartitions)
-  }
-  private val ignoreEmptySplits =
-    SparkSession.getActiveSession.get.sparkContext.conf.get(HADOOP_RDD_IGNORE_EMPTY_SPLITS)
+  private val minPartitions = if (sparkContext.isLocal) {
+    0 // will splitted based on block by default.
+  } else {
+    math.max(nativeHadoopConf.getInt("mapreduce.job.maps", 1),
+      sparkContext.defaultMinPartitions)
+  }
+  private val ignoreEmptySplits =
+    sparkContext.conf.get(HADOOP_RDD_IGNORE_EMPTY_SPLITS)
```
```scala
  private def getInputFormat(conf: JobConf, inputFormatClass: Class[newInputClass[Writable, Writable]]):
      InputFormat[Writable, Writable] = {
    val newInputFormat = ReflectionUtils.newInstance(inputFormatClass.asInstanceOf[Class[_]], conf)
      .asInstanceOf[InputFormat[Writable, Writable]]
    newInputFormat match {
```
Copilot AI · Feb 11, 2026:
getInputFormat takes a mapreduce.InputFormat class (newInputClass) but returns/instantiates org.apache.hadoop.mapred.InputFormat. This signature mismatch makes the unchecked cast even riskier. Align the parameter type with the returned InputFormat type (or vice versa) so the compiler can help enforce correctness.
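One possible way to align the types, sketched under the assumption that callers can supply the `org.apache.hadoop.mapred.InputFormat` class directly:

```scala
import org.apache.hadoop.io.Writable
import org.apache.hadoop.mapred.{InputFormat, JobConf}
import org.apache.hadoop.util.ReflectionUtils

// Sketch: parameter and return type both use mapred.InputFormat, so the unchecked
// cast goes away and the compiler verifies the class being instantiated.
private def getInputFormat(
    conf: JobConf,
    inputFormatClass: Class[_ <: InputFormat[Writable, Writable]]): InputFormat[Writable, Writable] =
  ReflectionUtils.newInstance(inputFormatClass, conf)
```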
```scala
        getMaxSplitBytes(SparkSession.getActiveSession.get)).toArray
    arrayFilePartition.toArray
  }

  private def getMaxSplitBytes(sparkSession: SparkSession): Long = {
    val defaultMaxSplitBytes = sparkSession.sessionState.conf.filesMaxPartitionBytes
    val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
    Math.min(defaultMaxSplitBytes, openCostInBytes)
```
Copilot AI · Feb 11, 2026:
getMaxSplitBytes currently returns min(filesMaxPartitionBytes, filesOpenCostInBytes), which can drastically shrink splits and create excessive partitions. Elsewhere in this repo (NativePaimonTableScanExec) you fork Spark’s FilePartition#maxSplitBytes logic using min(defaultMaxSplitBytes, max(openCostInBytes, bytesPerCore)). Align this implementation to that logic (or call the shared helper) to avoid performance regressions.
Suggested change:

```diff
-        getMaxSplitBytes(SparkSession.getActiveSession.get)).toArray
-    arrayFilePartition.toArray
-  }
-  private def getMaxSplitBytes(sparkSession: SparkSession): Long = {
-    val defaultMaxSplitBytes = sparkSession.sessionState.conf.filesMaxPartitionBytes
-    val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
-    Math.min(defaultMaxSplitBytes, openCostInBytes)
+        getMaxSplitBytes(SparkSession.getActiveSession.get, partitionedFiles)).toArray
+    arrayFilePartition.toArray
+  }
+  private def getMaxSplitBytes(
+      sparkSession: SparkSession,
+      partitionedFiles: Seq[PartitionedFile]): Long = {
+    val defaultMaxSplitBytes = sparkSession.sessionState.conf.filesMaxPartitionBytes
+    val openCostInBytes = sparkSession.sessionState.conf.filesOpenCostInBytes
+    val totalBytes = partitionedFiles.map(_.length).sum
+    val parallelism = math.max(1, sparkSession.sparkContext.defaultParallelism)
+    val bytesPerCore = if (totalBytes <= 0L) openCostInBytes else totalBytes / parallelism
+    Math.min(defaultMaxSplitBytes, Math.max(openCostInBytes, bytesPerCore))
```
```scala
class HiveConvertProvider extends AuronConvertProvider with Logging {
  override def isEnabled: Boolean =
    getBooleanConf("spark.auron.enable.hiveTable", defaultValue = true)

  def enableHiveTableScanExec: Boolean =
    getBooleanConf("spark.auron.enable.hiveTableScanExec", defaultValue = false)
```
Copilot AI · Feb 11, 2026:
HiveConvertProvider is discovered via ServiceLoader (AuronConverters loads AuronConvertProvider implementations). This module currently doesn’t include a META-INF/services/org.apache.spark.sql.auron.AuronConvertProvider entry, so the provider won’t be loaded at runtime (see thirdparty/auron-paimon for the existing pattern).
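For reference, the ServiceLoader registration is just a one-line resource file in this module naming the implementation class; something along these lines, assuming the standard src/main/resources layout (the path follows the interface name quoted in the comment above, and the package comes from the file listed in this PR):

```
# spark-extension-shims-spark/src/main/resources/META-INF/services/org.apache.spark.sql.auron.AuronConvertProvider
org.apache.spark.sql.hive.execution.auron.plan.HiveConvertProvider
```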
```scala
  def convertHiveTableScanExec(hiveExec: HiveTableScanExec): SparkPlan = {
    AuronConverters.addRenameColumnsExec(NativeHiveTableScanExec(hiveExec))
  }
```
Copilot AI · Feb 11, 2026:
There are existing query/operator validation test utilities in this module, but this new Hive scan conversion path doesn’t appear to have coverage. Add at least one suite that enables spark.auron.enable.hiveTableScanExec and verifies HiveTableScanExec is converted (and that unsupported formats don’t break execution).
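A rough sketch of such a test, assuming a Hive-enabled session; the suite name, table name, and session setup are illustrative, and the sketch omits whatever Auron extension wiring the module's existing test harness already provides:

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class NativeHiveTableScanSuite extends AnyFunSuite {

  test("HiveTableScanExec on a Hive ORC table converts to the native scan") {
    // Hypothetical setup: real suites would also register the Auron extension/columnar rules.
    val spark = SparkSession.builder()
      .master("local[1]")
      .enableHiveSupport()
      .config("spark.auron.enable.hiveTableScanExec", "true")
      .getOrCreate()
    try {
      spark.sql("CREATE TABLE IF NOT EXISTS t_hive_orc (id INT) STORED AS ORC")
      spark.sql("INSERT OVERWRITE TABLE t_hive_orc SELECT 1")

      val df = spark.sql("SELECT id FROM t_hive_orc")
      // Expect the executed plan to contain the native scan rather than HiveTableScanExec.
      val nativeScan = df.queryExecution.executedPlan.find(
        _.getClass.getSimpleName.startsWith("NativeHiveTableScan"))
      assert(nativeScan.isDefined)
      assert(df.collect().map(_.getInt(0)).toSeq == Seq(1))
    } finally {
      spark.sql("DROP TABLE IF EXISTS t_hive_orc")
      spark.stop()
    }
  }
}
```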
```scala
        arrayFilePartition += FilePartition.getFilePartitions(SparkSession.getActiveSession.get,
          partitionedFiles,
          getMaxSplitBytes(SparkSession.getActiveSession.get)).toArray
```
Copilot AI · Feb 11, 2026:
FilePartition.getFilePartitions(SparkSession.getActiveSession.get, ...) again relies on getActiveSession.get. Use the sparkSession derived from basedHiveScan (as in NativeHiveTableScanBase) so partition planning doesn’t fail when there’s no active session.
Suggested change:

```diff
-        arrayFilePartition += FilePartition.getFilePartitions(SparkSession.getActiveSession.get,
-          partitionedFiles,
-          getMaxSplitBytes(SparkSession.getActiveSession.get)).toArray
+        val sparkSession = basedHiveScan.sparkSession
+        arrayFilePartition += FilePartition.getFilePartitions(
+          sparkSession,
+          partitionedFiles,
+          getMaxSplitBytes(sparkSession)
+        ).toArray
```
```scala
    inputFormatClass match {
      case OrcInputFormat =>
      case MapredParquetInputFormat =>
      case _ =>
    }
```
Copilot AI · Feb 11, 2026:
inputFormatClass match { case OrcInputFormat => ... } is matching a Class[_] value against a class name, and the cases are empty. If you need special handling by input format, compare against classOf[OrcInputFormat] / classOf[MapredParquetInputFormat] and implement the intended behavior; otherwise remove this dead code block.
Suggested change (remove the dead block):

```diff
-    inputFormatClass match {
-      case OrcInputFormat =>
-      case MapredParquetInputFormat =>
-      case _ =>
-    }
```
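If per-format handling really is intended, a sketch of the classOf comparison described in the comment above could look like this; the method name and branch bodies are placeholders:

```scala
import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
import org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat

// Illustrative: compare the Class[_] value against the format classes themselves.
private def handleInputFormat(inputFormatClass: Class[_]): Unit =
  inputFormatClass match {
    case c if c == classOf[OrcInputFormat] =>
      // ORC-specific handling would go here.
    case c if c == classOf[MapredParquetInputFormat] =>
      // Parquet-specific handling would go here.
    case other =>
      throw new UnsupportedOperationException(s"Unsupported Hive input format: $other")
  }
```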
Which issue does this PR close?
Closes #1994
Rationale for this change
What changes are included in this PR?
Are there any user-facing changes?
How was this patch tested?