
Conversation

@andygrove andygrove commented Jan 18, 2026

Summary

  • Enables support for complex types (arrays, maps, structs) in Comet's native Parquet writer
  • Removes the blocking check that previously prevented complex types
  • Adds comprehensive test coverage for complex types

Changes

  • Remove complex type blocking check in CometDataWritingCommand.scala
  • Add 12 new tests for complex types in CometParquetWriterSuite.scala:
    • Basic complex types (array, struct, map)
    • Nested complex types (array of structs, struct containing array, map with struct values, deeply nested)
    • Nullable complex types with nulls at various nesting levels
    • Complex types containing decimal and temporal types
    • Empty arrays and maps
    • Fuzz testing with randomly generated complex type schemas
  • Update documentation to reflect complex type support

Test plan

  • Tests verify round-trip compatibility (write with Comet, read with Spark/Comet)
  • Fuzz testing with randomly generated schemas

🤖 Generated with Claude Code


codecov-commenter commented Jan 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 60.05%. Comparing base (f09f8af) to head (369ddab).
⚠️ Report is 856 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3214      +/-   ##
============================================
+ Coverage     56.12%   60.05%   +3.92%     
- Complexity      976     1429     +453     
============================================
  Files           119      170      +51     
  Lines         11743    15758    +4015     
  Branches       2251     2602     +351     
============================================
+ Hits           6591     9463    +2872     
- Misses         4012     4976     +964     
- Partials       1140     1319     +179     


@andygrove andygrove marked this pull request as draft January 18, 2026 20:17
Enable spark.comet.scan.allowIncompatible in complex type tests so that
native_iceberg_compat scan is used (which supports complex types) instead
of falling back to native_comet (which doesn't support complex types).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove

andygrove commented Jan 18, 2026

With these changes, we can run the PySpark repartition benchmark fully native, and it shows an almost 2x speedup compared to Spark (and also ~2x compared to Comet when writes are disabled and Comet does the columnar-to-row transition).

[attached image: query plan]

andygrove and others added 3 commits January 18, 2026 14:18
The CI sets COMET_PARQUET_SCAN_IMPL=native_comet for some test profiles,
which overrides the default auto mode. Since native_comet doesn't support
complex types, the scan falls back to Spark's reader which produces
OnHeapColumnVector instead of CometVector, causing the native writer to fail.

This fix explicitly sets COMET_NATIVE_SCAN_IMPL to "auto" in the test
configuration, allowing native_iceberg_compat to be used for complex types.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
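The fallback behavior described in this commit can be sketched as a toy decision function. All names and return values here are illustrative only (the real selection logic lives in Comet's Scala code, not in a function like this):

```rust
// Toy sketch of the scan-impl selection described in the commit message.
// Names and return values are hypothetical, not Comet's actual API.
fn select_scan_impl(env_override: Option<&str>, has_complex_types: bool) -> &'static str {
    match env_override {
        // CI forces native_comet, which cannot read complex types, so the
        // scan falls back to Spark's reader (producing OnHeapColumnVector
        // instead of CometVector, which the native writer cannot consume).
        Some("native_comet") if has_complex_types => "spark_fallback",
        Some("native_comet") => "native_comet",
        // "auto" (the default, restored by the fix) is free to pick
        // native_iceberg_compat, which does support complex types.
        _ => "native_iceberg_compat",
    }
}

fn main() {
    assert_eq!(select_scan_impl(Some("native_comet"), true), "spark_fallback");
    assert_eq!(select_scan_impl(Some("native_comet"), false), "native_comet");
    assert_eq!(select_scan_impl(None, true), "native_iceberg_compat");
    println!("ok");
}
```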
@andygrove andygrove marked this pull request as ready for review January 18, 2026 22:42
@andygrove andygrove requested a review from comphead January 18, 2026 22:47
@andygrove

With these changes, we can run the PySpark repartition benchmark fully native, and it shows an almost 2x speedup compared to Spark (and also ~2x compared to Comet when writes are disabled and Comet does the columnar-to-row transition).

@comphead ☝️

@andygrove andygrove requested a review from wForget January 19, 2026 01:55
@andygrove

Thanks for the thorough review @wForget. I refactored the test framework and added assertions that the plans are running as intended (Spark vs Comet).

.strip_prefix("file://")
.or_else(|| part_file.strip_prefix("file:"))
.unwrap_or(&part_file);
let file_size = std::fs::metadata(local_path)

A question unrelated to this PR: #2929 supports write to hdfs. So, is the logic here correct when local_path is an hdfs path?
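To illustrate the concern: the snippet only strips a local `file:`/`file://` scheme, so any other URI scheme passes through unchanged. A standalone sketch of that logic (function name hypothetical, extracted from the diff excerpt above):

```rust
// Hypothetical standalone version of the prefix-stripping logic above.
// Only a "file:" scheme is removed; any other scheme (e.g. hdfs://)
// is returned unchanged, which is the crux of the reviewer's question:
// std::fs::metadata would then be called on the raw HDFS URI and fail.
fn strip_file_scheme(part_file: &str) -> &str {
    part_file
        .strip_prefix("file://")
        .or_else(|| part_file.strip_prefix("file:"))
        .unwrap_or(part_file)
}

fn main() {
    assert_eq!(strip_file_scheme("file:///tmp/part-0.parquet"), "/tmp/part-0.parquet");
    assert_eq!(strip_file_scheme("file:/tmp/part-0.parquet"), "/tmp/part-0.parquet");
    // An HDFS URI is passed through untouched.
    assert_eq!(
        strip_file_scheme("hdfs://nn:8020/tmp/part-0.parquet"),
        "hdfs://nn:8020/tmp/part-0.parquet"
    );
    println!("ok");
}
```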


@wForget wForget left a comment


Thanks @andygrove, LGTM
