Skip to content

Comet should gracefully handle OnHeapColumnVector instead of failing #3215

@andygrove

Description

@andygrove

Description

When a native Comet operator receives data from a Spark scan that produces OnHeapColumnVector instead of Arrow arrays, Comet fails with:

org.apache.spark.SparkException: Comet execution only takes Arrow Arrays, but got class org.apache.spark.sql.execution.vectorized.OnHeapColumnVector

This can happen when:

  1. The native scan (e.g., native_comet) doesn't support certain data types (like complex types)
  2. The scan falls back to Spark's Parquet reader
  3. A downstream native operator (like the native Parquet writer) receives the non-Arrow data

Reproduction

// With native Parquet write enabled but without COMET_SCAN_ALLOW_INCOMPATIBLE
withSQLConf(
  "spark.comet.parquet.write.enabled" -> "true",
  "spark.comet.exec.enabled" -> "true") {
  
  // Create data with complex types
  val df = Seq((1, Seq(1, 2, 3))).toDF("id", "values")
  
  // Write to parquet (without Comet)
  df.write.parquet("/tmp/input")
  
  // Read and write - this fails because native_comet scan doesn't support 
  // complex types, falls back to Spark reader, but downstream native writer
  // expects Arrow arrays
  spark.read.parquet("/tmp/input").write.parquet("/tmp/output")
}

Expected Behavior

Comet should either:

  1. Fall back the entire query to Spark when native operators would receive non-Arrow data
  2. Automatically insert conversion from OnHeapColumnVector to Arrow (using the existing spark.comet.convert.parquet.enabled mechanism)
  3. Provide a clearer error message explaining why this happened and how to fix it (e.g., "Enable spark.comet.scan.allowIncompatible to use native_iceberg_compat scan which supports complex types")

Current Workarounds

  1. Enable spark.comet.scan.allowIncompatible=true so that native_iceberg_compat scan is used (which supports complex types)
  2. Enable spark.comet.convert.parquet.enabled=true to convert Spark columnar data to Arrow

Context

This was discovered while adding complex type support to the native Parquet writer (#3214). The fix there uses COMET_SCAN_ALLOW_INCOMPATIBLE, but the underlying issue of ungraceful failure should be addressed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions