feat: [EXPERIMENTAL] Native columnar to row conversion #3221
Draft: andygrove wants to merge 14 commits into apache:main from andygrove:native-c2r
Conversation
This PR adds an experimental native (Rust-based) implementation of ColumnarToRowExec that converts Arrow columnar data to Spark UnsafeRow format.

Benefits over the current Scala implementation:
- Zero-copy for variable-length types: String and Binary data is written directly to the output buffer without intermediate Java object allocation
- Vectorized processing: the native implementation processes data in a columnar fashion, improving CPU cache utilization
- Reduced GC pressure: all conversion happens in native memory, avoiding the creation of temporary Java objects that would need garbage collection
- Buffer reuse: the output buffer is allocated once and reused across batches, minimizing memory allocation overhead

The feature is disabled by default and can be enabled by setting:
spark.comet.exec.columnarToRow.native.enabled=true

Supported data types:
- Primitive types: Boolean, Byte, Short, Int, Long, Float, Double
- Date and Timestamp (microseconds)
- Decimal (both inline precision <= 18 and variable-length precision > 18)
- String and Binary
- Complex types: Struct, Array, Map (nested)

This is an experimental feature for evaluation and benchmarking purposes.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
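For context on the target format: a Spark UnsafeRow begins with a null bitset of 8-byte words, followed by one 8-byte fixed slot per field, followed by an 8-byte-aligned variable-length region whose values are referenced from the fixed slot as (offset << 32) | length. The Rust sketch below illustrates how a string field can be copied straight from an Arrow buffer into that layout; the RowWriter and write_string_field names are illustrative assumptions, not the PR's actual API.

```rust
// Hypothetical sketch, not the PR's actual code: write a UTF-8 field into
// a UnsafeRow-formatted buffer. Layout facts follow Spark's UnsafeRow:
// null bitset of 8-byte words, one 8-byte fixed slot per field, then
// variable-length bytes appended at an 8-byte-aligned cursor.
struct RowWriter {
    buf: Vec<u8>,      // output buffer, reused across batches
    base: usize,       // start of the current row within `buf`
    num_fields: usize,
    cursor: usize,     // append position for variable-length data
}

impl RowWriter {
    fn fixed_slot(&self, ordinal: usize) -> usize {
        let bitset_bytes = (self.num_fields + 63) / 64 * 8;
        self.base + bitset_bytes + ordinal * 8
    }

    // Zero-copy in spirit: bytes move from the Arrow buffer into the row
    // buffer directly, with no intermediate Java object allocation.
    fn write_string_field(&mut self, ordinal: usize, bytes: &[u8]) {
        let rel_offset = (self.cursor - self.base) as u64;
        self.buf[self.cursor..self.cursor + bytes.len()].copy_from_slice(bytes);
        self.cursor += (bytes.len() + 7) / 8 * 8; // pad to 8-byte boundary
        // The fixed slot stores (offset within row << 32) | byte length.
        let word = (rel_offset << 32) | bytes.len() as u64;
        let slot = self.fixed_slot(ordinal);
        self.buf[slot..slot + 8].copy_from_slice(&word.to_le_bytes());
    }
}
```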
Spark's UnsafeArrayData uses the actual primitive size for elements (e.g., 4 bytes for INT32), not always 8 bytes like UnsafeRow fields. This fix:
- Added get_element_size() to determine the correct size for each type
- Added write_array_element() to write values with type-specific widths
- Updated write_list_data() and write_map_data() to use the correct sizes
- Added LargeUtf8/LargeBinary support for struct fields
- Added a comprehensive test suite (CometNativeColumnarToRowSuite)
- Updated the compatibility documentation

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
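A minimal sketch of what such an element-size mapping might look like; the exact match arms are an assumption (the real function is in the PR's Rust sources), but the widths match Spark's UnsafeArrayData, which packs array elements at their primitive width rather than UnsafeRow's uniform 8-byte slots.

```rust
// Illustrative reconstruction, not the PR's exact code: element widths
// used by Spark's UnsafeArrayData for packed array storage.
use arrow::datatypes::DataType;

fn get_element_size(dt: &DataType) -> usize {
    match dt {
        DataType::Boolean | DataType::Int8 => 1,
        DataType::Int16 => 2,
        DataType::Int32 | DataType::Float32 | DataType::Date32 => 4,
        DataType::Int64 | DataType::Float64 | DataType::Timestamp(_, _) => 8,
        // Decimals with precision <= 18 are stored inline as a long;
        // larger decimals and variable-length/nested types are stored in
        // the variable-length region behind an 8-byte reference.
        _ => 8,
    }
}
```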
Add a fuzz test using FuzzDataGenerator to test the native columnar to row conversion with randomly generated schemas containing arrays, structs, and maps.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests verifying that native columnar to row conversion correctly handles complex nested types:
- Array<Array<Int>>
- Map<String, Array<Int>>
- Struct<Array<Map<String, Int>>, String>

These tests confirm that the recursive conversion logic works for arbitrary nesting depth.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add a fuzz test using FuzzDataGenerator.generateNestedSchema to test native columnar to row conversion with deeply nested random schemas (depth 1-3, with arrays, structs, and maps). The test uses only primitive types supported by native C2R (it excludes TimestampNTZType, which is not yet supported).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use the actual array type for dispatching instead of the schema type, to handle type mismatches between the serialized schema and the FFI arrays
- Add support for LargeList (64-bit offsets) arrays
- Replace .unwrap() with proper error handling to provide clear error messages instead of panics
- Add tests for LargeList handling

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
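A hedged sketch of the dispatch pattern this commit describes, assuming arrow-rs and a hypothetical write_list entry point: the match is driven by the array's actual data_type(), so a LargeList arriving where the schema said List is still handled, and failed downcasts surface as errors rather than panics.

```rust
// Illustrative sketch, not the PR's actual code.
use arrow::array::{Array, ArrayRef, LargeListArray, ListArray};
use arrow::datatypes::DataType;
use arrow::error::{ArrowError, Result};

fn write_list(array: &ArrayRef, row: usize) -> Result<()> {
    // Dispatch on the array's real type, not the serialized schema type.
    match array.data_type() {
        DataType::List(_) => {
            // 32-bit offsets.
            let list = array
                .as_any()
                .downcast_ref::<ListArray>()
                .ok_or_else(|| ArrowError::CastError("expected ListArray".into()))?;
            let values = list.value(row); // child slice for this row
            // ... write `values` element by element ...
            let _ = values;
            Ok(())
        }
        DataType::LargeList(_) => {
            // Same logic, but with 64-bit offsets.
            let list = array
                .as_any()
                .downcast_ref::<LargeListArray>()
                .ok_or_else(|| ArrowError::CastError("expected LargeListArray".into()))?;
            let values = list.value(row);
            let _ = values;
            Ok(())
        }
        other => Err(ArrowError::NotYetImplemented(format!(
            "unsupported list type: {other:?}"
        ))),
    }
}
```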
When Parquet data is read, string columns may be dictionary-encoded for efficiency: the schema says Utf8 but the actual Arrow array is Dictionary(Int32, Utf8). This caused a type mismatch error.
- Add support for dictionary-encoded arrays in get_variable_length_data
- Handle all common key types (Int8, Int16, Int32, Int64, UInt8-UInt64)
- Support Utf8, LargeUtf8, Binary, and LargeBinary value types
- Add tests for dictionary-encoded string arrays

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
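One way the unwrapping could look, as a sketch under assumed arrow-rs APIs: follow each row's dictionary key into the values array. The string_at helper is illustrative only, and the PR additionally covers the other key types (Int8 through UInt64) and the Large* value types.

```rust
// Illustrative sketch: resolve a possibly dictionary-encoded string
// column. The schema may say Utf8 while the FFI array is really
// Dictionary(Int32, Utf8).
use arrow::array::{Array, ArrayRef, DictionaryArray, StringArray};
use arrow::datatypes::Int32Type;

fn string_at(array: &ArrayRef, row: usize) -> Option<String> {
    if let Some(dict) = array.as_any().downcast_ref::<DictionaryArray<Int32Type>>() {
        if dict.is_null(row) {
            return None;
        }
        // Look up the row's key, then fetch the string from the values.
        let key = dict.keys().value(row) as usize;
        let values = dict.values().as_any().downcast_ref::<StringArray>()?;
        return Some(values.value(key).to_string());
    }
    // Plain (non-dictionary) Utf8 path.
    let strings = array.as_any().downcast_ref::<StringArray>()?;
    if strings.is_null(row) {
        None
    } else {
        Some(strings.value(row).to_string())
    }
}
```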
Codecov Report

❌ Patch coverage is

```
@@            Coverage Diff             @@
##             main    #3221      +/-   ##
============================================
+ Coverage     56.12%   59.68%    +3.56%
- Complexity      976     1430      +454
============================================
  Files           119      173       +54
  Lines         11743    15866     +4123
  Branches       2251     2625      +374
============================================
+ Hits           6591     9470     +2879
- Misses         4012     5072     +1060
- Partials       1140     1324      +184
```
Summary

This PR adds an experimental native (Rust-based) implementation of ColumnarToRowExec that converts Arrow columnar data to Spark UnsafeRow format.

Benefits over the current Scala implementation:
- Zero-copy for variable-length types
- Vectorized processing
- Reduced GC pressure
- Buffer reuse across batches

Configuration:

The feature is disabled by default and can be enabled by setting:
spark.comet.exec.columnarToRow.native.enabled=true

Supported data types:
- Primitive types: Boolean, Byte, Short, Int, Long, Float, Double
- Date and Timestamp (microseconds)
- Decimal
- String and Binary
- Complex types: Struct, Array, Map, including nested combinations such as Array<Array<Int>>, Map<String, Array<Int>>, and Struct<Array<Map<String, Int>>>

Test plan

Comprehensive test suite (CometNativeColumnarToRowSuite) with 25 tests covering:

🤖 Generated with Claude Code