Conversation

@andygrove commented Jan 19, 2026

Summary

This PR adds an experimental native (Rust-based) implementation of ColumnarToRowExec that converts Arrow columnar data to Spark UnsafeRow format.

Benefits over the current Scala implementation:

  • Zero-copy for variable-length types: String and Binary data is written directly to the output buffer without intermediate Java object allocation
  • Vectorized processing: The native implementation processes data in a columnar fashion, improving CPU cache utilization
  • Reduced GC pressure: All conversion happens in native memory, avoiding the creation of temporary Java objects that would need garbage collection
  • Buffer reuse: The output buffer is allocated once and reused across batches, minimizing memory allocation overhead (a minimal sketch follows below)
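
As a rough illustration of the buffer-reuse point, here is a minimal Rust sketch; the name RowConverter and the method shape are hypothetical, not the types added in this PR. The idea is simply that the output Vec<u8> keeps its capacity across batches, so steady-state conversion does not allocate.

```rust
// Hypothetical sketch only; the actual converter in this PR differs.
struct RowConverter {
    // Allocated once, reused for every batch.
    buffer: Vec<u8>,
}

impl RowConverter {
    fn new(initial_capacity: usize) -> Self {
        Self {
            buffer: Vec::with_capacity(initial_capacity),
        }
    }

    // Convert one batch into UnsafeRow bytes, reusing the existing allocation.
    fn convert_batch(&mut self, num_rows: usize, fixed_row_size: usize) -> &[u8] {
        // clear() keeps the capacity, so no reallocation happens unless this
        // batch is larger than any previous one.
        self.buffer.clear();
        self.buffer.resize(num_rows * fixed_row_size, 0);
        // ... per-row field writes would go here ...
        &self.buffer
    }
}
```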

Configuration:

The feature is disabled by default and can be enabled by setting:

spark.comet.exec.columnarToRow.native.enabled=true

Supported data types:

  • Primitive types: Boolean, Byte, Short, Int, Long, Float, Double
  • Date and Timestamp (microseconds)
  • Decimal (stored inline for precision <= 18, in the variable-length region for precision > 18; see the sketch after this list)
  • String and Binary
  • Complex types: Struct, Array, Map with arbitrary nesting depth
    • e.g., Array<Array<Int>>, Map<String, Array<Int>>, Struct<Array<Map<String, Int>>>
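
For the decimal case above, the following is a minimal Rust sketch of the layout decision, following Spark's UnsafeRow convention. It is illustrative only: the helper name, the offset encoding relative to the row start, and the omission of 8-byte alignment/padding are simplifying assumptions, not the code in this PR.

```rust
use arrow::array::Decimal128Array;

// Illustrative sketch, not the Comet implementation.
fn encode_decimal(
    array: &Decimal128Array,
    row: usize,
    precision: u8,
    fixed_slot: &mut [u8; 8], // the field's 8-byte fixed-width slot
    var_region: &mut Vec<u8>, // the row's variable-length region
    var_base_offset: usize,   // offset of var_region from the row start
) {
    let unscaled: i128 = array.value(row);
    if precision <= 18 {
        // Fits in a long: store the unscaled value inline, little-endian.
        fixed_slot.copy_from_slice(&(unscaled as i64).to_le_bytes());
    } else {
        // Spill the minimal big-endian two's-complement bytes into the
        // variable-length region and pack (offset << 32) | length into the
        // fixed-width slot (alignment/padding details omitted).
        let sign_fill = if unscaled < 0 { 0xFFu8 } else { 0x00u8 };
        let bytes = unscaled.to_be_bytes();
        let mut start = 0;
        while start < bytes.len() - 1
            && bytes[start] == sign_fill
            && (bytes[start + 1] & 0x80) == (sign_fill & 0x80)
        {
            start += 1;
        }
        let minimal = &bytes[start..];
        let offset = (var_base_offset + var_region.len()) as u64;
        var_region.extend_from_slice(minimal);
        let offset_and_len = (offset << 32) | minimal.len() as u64;
        fixed_slot.copy_from_slice(&offset_and_len.to_le_bytes());
    }
}
```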

Test plan

  • Rust unit tests for conversion logic (11 tests)
  • Comprehensive Scala test suite (CometNativeColumnarToRowSuite) with 25 tests covering:
    • All primitive types with and without nulls
    • Variable-length types (String, Binary, large Decimal)
    • Complex types (Struct, Array, Map)
    • Deeply nested types (Array of Arrays, Map with Array values, Struct containing Array of Maps)
    • Fuzz testing with randomly generated nested schemas
    • Edge cases (empty batches, all nulls, large batches)
  • Performance benchmarks comparing native vs JVM implementation

⚠️ This is an experimental feature for evaluation and benchmarking purposes.

🤖 Generated with Claude Code

@andygrove changed the title from "[EXPERIMENTAL] Native columnar to row conversion" to "feat: [EXPERIMENTAL] Native columnar to row conversion" on Jan 19, 2026
andygrove and others added 13 commits January 19, 2026 13:31
Spark's UnsafeArrayData uses the actual primitive size for elements (e.g.,
4 bytes for INT32), not always 8 bytes like UnsafeRow fields. This fix:

- Added get_element_size() to determine correct sizes for each type
- Added write_array_element() to write values with type-specific widths
- Updated write_list_data() and write_map_data() to use correct sizes
- Added LargeUtf8/LargeBinary support for struct fields
- Added comprehensive test suite (CometNativeColumnarToRowSuite)
- Updated compatibility documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
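
For context, a minimal Rust sketch of the kind of per-type width mapping this commit describes. The function name and the fall-through case are illustrative, not the PR's get_element_size(); the widths follow Spark's UnsafeArrayData, where variable-length and nested elements occupy an 8-byte offset-and-length slot.

```rust
use arrow::datatypes::DataType;

// Sketch of element widths for array layout, mirroring Spark's
// UnsafeArrayData (the actual function in this PR may differ in detail).
fn element_size(dt: &DataType) -> usize {
    match dt {
        DataType::Boolean | DataType::Int8 => 1,
        DataType::Int16 => 2,
        DataType::Int32 | DataType::Float32 | DataType::Date32 => 4,
        DataType::Int64
        | DataType::Float64
        | DataType::Timestamp(_, _)
        | DataType::Decimal128(_, _) => 8,
        // Variable-length and nested types store an 8-byte (offset, length)
        // reference into the variable-length region.
        _ => 8,
    }
}
```
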
Add a fuzz test using FuzzDataGenerator to test the native columnar to row
conversion with randomly generated schemas containing arrays, structs,
and maps.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests verifying that native columnar to row conversion correctly
handles complex nested types:
- Array<Array<Int>>
- Map<String, Array<Int>>
- Struct<Array<Map<String, Int>>, String>

These tests confirm the recursive conversion logic works for arbitrary
nesting depth.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add a fuzz test using FuzzDataGenerator.generateNestedSchema to test
native columnar to row conversion with deeply nested random schemas
(depth 1-3, with arrays, structs, and maps).

The test uses only primitive types supported by native C2R (excludes
TimestampNTZType which is not yet supported).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use actual array type for dispatching instead of schema type to
  handle type mismatches between serialized schema and FFI arrays
- Add support for LargeList (64-bit offsets) arrays
- Replace .unwrap() with proper error handling to provide clear
  error messages instead of panics
- Add tests for LargeList handling

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
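
A hedged sketch of the pattern this commit describes: dispatch on the array's actual Arrow type rather than the declared schema type, handle LargeList (64-bit offsets) alongside List, and return errors instead of panicking. The helper name and error variants are illustrative, not the PR's code.

```rust
use arrow::array::{Array, ArrayRef, LargeListArray, ListArray};
use arrow::datatypes::DataType;
use arrow::error::ArrowError;

// Hypothetical helper: length of the list value at `row`, dispatching on the
// array's actual type and reporting failures as errors rather than panics.
fn list_value_length(array: &ArrayRef, row: usize) -> Result<usize, ArrowError> {
    match array.data_type() {
        DataType::List(_) => {
            let list = array
                .as_any()
                .downcast_ref::<ListArray>()
                .ok_or_else(|| ArrowError::CastError("expected ListArray".into()))?;
            Ok(list.value_length(row) as usize)
        }
        // LargeList uses 64-bit offsets but is otherwise handled the same way.
        DataType::LargeList(_) => {
            let list = array
                .as_any()
                .downcast_ref::<LargeListArray>()
                .ok_or_else(|| ArrowError::CastError("expected LargeListArray".into()))?;
            Ok(list.value_length(row) as usize)
        }
        other => Err(ArrowError::NotYetImplemented(format!(
            "columnar-to-row does not support {other:?} here"
        ))),
    }
}
```
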
When Parquet data is read, string columns may be dictionary-encoded
for efficiency. The schema says Utf8 but the actual Arrow array is
Dictionary(Int32, Utf8). This caused a type mismatch error.

- Add support for Dictionary-encoded arrays in get_variable_length_data
- Handle all common key types (Int8, Int16, Int32, Int64, UInt8-64)
- Support Utf8, LargeUtf8, Binary, and LargeBinary value types
- Add tests for dictionary-encoded string arrays

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
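
For illustration, a minimal Rust sketch of reading a value out of a Dictionary(Int32, Utf8) column, the Parquet case this commit describes. The helper is hypothetical; the PR handles the full range of key and value types listed above.

```rust
use arrow::array::{Array, ArrayRef, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;
use arrow::error::ArrowError;

// Hypothetical helper, not the PR's code: resolve one row of a
// dictionary-encoded string column to its value.
fn dict_string_at(array: &ArrayRef, row: usize) -> Result<Option<String>, ArrowError> {
    let dict = array
        .as_any()
        .downcast_ref::<DictionaryArray<Int32Type>>()
        .ok_or_else(|| ArrowError::CastError("expected Dictionary(Int32, Utf8)".into()))?;
    if dict.is_null(row) {
        return Ok(None);
    }
    // Look up the key for this row, then the value it points to.
    let key = dict.keys().value(row) as usize;
    let values = dict
        .values()
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or_else(|| ArrowError::CastError("expected Utf8 dictionary values".into()))?;
    Ok(Some(values.value(key).to_string()))
}
```
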
codecov-commenter commented Jan 20, 2026

Codecov Report

❌ Patch coverage is 7.40741% with 100 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.68%. Comparing base (f09f8af) to head (b8ed2e7).
⚠️ Report is 856 commits behind head on main.

Files with missing lines Patch % Lines
...spark/sql/comet/CometNativeColumnarToRowExec.scala 0.00% 43 Missing ⚠️
...rg/apache/comet/NativeColumnarToRowConverter.scala 0.00% 36 Missing ⚠️
...he/comet/rules/EliminateRedundantTransitions.scala 27.27% 4 Missing and 4 partials ⚠️
...ain/scala/org/apache/comet/vector/NativeUtil.scala 0.00% 6 Missing ⚠️
...java/org/apache/comet/NativeColumnarToRowInfo.java 0.00% 6 Missing ⚠️
...n/scala/org/apache/comet/ExtendedExplainInfo.scala 0.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3221      +/-   ##
============================================
+ Coverage     56.12%   59.68%   +3.56%     
- Complexity      976     1430     +454     
============================================
  Files           119      173      +54     
  Lines         11743    15866    +4123     
  Branches       2251     2625     +374     
============================================
+ Hits           6591     9470    +2879     
- Misses         4012     5072    +1060     
- Partials       1140     1324     +184     
