feat: add complex type support to native Parquet writer #3214
Conversation
Enables support for complex types (arrays, maps, structs) in Comet's native Parquet writer by removing the blocking check that previously prevented them.

Changes:
- Remove the complex type blocking check in CometDataWritingCommand.scala
- Add comprehensive test coverage for complex types, including:
  - Basic complex types (array, struct, map)
  - Nested complex types (array of structs, struct containing array, etc.)
  - Nullable complex types with nulls at various nesting levels
  - Complex types containing decimal and temporal types
  - Empty arrays and maps
  - Fuzz testing with randomly generated complex type schemas
- Update documentation to reflect complex type support

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
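For illustration, here is a minimal sketch (not code from this PR) of the kind of write the removed check previously blocked. The session config, plugin class, and output path are assumptions; it only presupposes a Spark session with the Comet plugin loaded.

```scala
import org.apache.spark.sql.SparkSession

// Assumed session setup; only the Comet plugin line is essential here.
val spark = SparkSession
  .builder()
  .appName("comet-complex-write")
  .master("local[*]")
  .config("spark.plugins", "org.apache.spark.CometPlugin")
  .getOrCreate()

import spark.implicits._

// One row per case: array, map, and struct columns, plus empty collections.
val df = Seq(
  (1, Seq(1, 2, 3), Map("a" -> 1), (1, "x")),
  (2, Seq.empty[Int], Map.empty[String, Int], (2, "y"))
).toDF("id", "arr", "m", "s")

// With the blocking check removed, this write can go through the native writer.
df.write.mode("overwrite").parquet("/tmp/comet_complex_types")
```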
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@ Coverage Diff @@
## main #3214 +/- ##
============================================
+ Coverage 56.12% 60.05% +3.92%
- Complexity 976 1429 +453
============================================
Files 119 170 +51
Lines 11743 15758 +4015
Branches 2251 2602 +351
============================================
+ Hits 6591 9463 +2872
- Misses 4012 4976 +964
- Partials 1140 1319 +179

View full report in Codecov by Sentry.
Enable spark.comet.scan.allowIncompatible in complex type tests so that native_iceberg_compat scan is used (which supports complex types) instead of falling back to native_comet (which doesn't support complex types). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
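A hedged sketch of what that test-side opt-in might look like. The config key spark.comet.scan.allowIncompatible is quoted verbatim from the commit message; `test`, `withSQLConf`, and the `spark` session are assumed to come from the suite's Spark test base class, and the output path is illustrative.

```scala
test("write complex types through the native writer") {
  withSQLConf("spark.comet.scan.allowIncompatible" -> "true") {
    // Complex type columns: array, struct, and map.
    val df = spark
      .range(10)
      .selectExpr(
        "id",
        "array(id, id + 1) AS arr",
        "named_struct('a', id) AS s",
        "map('k', id) AS m")
    df.write.mode("overwrite").parquet("/tmp/comet_complex_types_test")
  }
}
```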
Force-pushed from c6bc58a to 93d1f82
The CI sets COMET_PARQUET_SCAN_IMPL=native_comet for some test profiles, which overrides the default auto mode. Since native_comet doesn't support complex types, the scan falls back to Spark's reader which produces OnHeapColumnVector instead of CometVector, causing the native writer to fail. This fix explicitly sets COMET_NATIVE_SCAN_IMPL to "auto" in the test configuration, allowing native_iceberg_compat to be used for complex types. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
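A sketch of the shape of that fix, under stated assumptions: the `sparkConf` override hook is the one Spark's shared-session test harness provides, and the key "spark.comet.scan.impl" is a guess at what COMET_NATIVE_SCAN_IMPL maps to — the commit message above only names the env var.

```scala
import org.apache.spark.SparkConf

// Hypothetical suite override; not the PR's exact test configuration.
override protected def sparkConf: SparkConf = {
  val conf = super.sparkConf
  // Pin the scan implementation so CI's COMET_PARQUET_SCAN_IMPL=native_comet
  // override does not force a fallback to Spark's reader for complex types.
  conf.set("spark.comet.scan.impl", "auto")
  conf
}
```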
@comphead ☝️
Thanks for the thorough review @wForget. I refactored the test framework and added assertions that the plans run as intended (Spark vs. Comet); a sketch of that kind of check follows.
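For readers following along, a hedged sketch of what such a plan assertion can look like. This is not the PR's actual test helper: the names collectOperators and assertRunsOnComet are hypothetical, and it simply checks executed-plan operator class names for a Comet prefix.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.SparkPlan

// Gather the class names of every operator in a physical plan.
def collectOperators(plan: SparkPlan): Seq[String] =
  plan.collect { case p => p.getClass.getSimpleName }

// Require that at least one Comet operator appears in the executed plan.
def assertRunsOnComet(df: DataFrame): Unit = {
  df.collect() // force execution so the final physical plan is materialized
  val ops = collectOperators(df.queryExecution.executedPlan)
  assert(ops.exists(_.startsWith("Comet")), s"expected a Comet operator, got $ops")
}
```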
| .strip_prefix("file://") | ||
| .or_else(|| part_file.strip_prefix("file:")) | ||
| .unwrap_or(&part_file); | ||
| let file_size = std::fs::metadata(local_path) |
A question unrelated to this PR: #2929 adds support for writing to HDFS. Is the logic here still correct when local_path is an HDFS path?
wForget left a comment:
Thanks @andygrove, LGTM

Summary
Enables support for complex types (arrays, maps, structs) in Comet's native Parquet writer by removing the blocking check that previously prevented them.

Changes
- CometDataWritingCommand.scala: remove the complex type blocking check
- CometParquetWriterSuite.scala: add test coverage for complex types, including nested, nullable, and empty cases plus fuzz testing

Test plan
New and existing tests in CometParquetWriterSuite.scala.

🤖 Generated with Claude Code