HIVE-29649: Iceberg: Support Parquet DECIMAL_64 vectorization#6527
Open
deniskuzZ wants to merge 1 commit into
Open
HIVE-29649: Iceberg: Support Parquet DECIMAL_64 vectorization#6527deniskuzZ wants to merge 1 commit into
deniskuzZ wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR aims to enable and validate Parquet DECIMAL_64 vectorization (including for Iceberg tables) by advertising DECIMAL_64 feature support from Parquet/Iceberg input formats, adding a Decimal64 read path to the Parquet vectorized reader, and extending test coverage and golden outputs to assert the new vectorization behavior.
Changes:
- Advertise
DECIMAL_64as a supported vectorization feature for Parquet (and Iceberg when enabled via table properties). - Add a
Decimal64ColumnVectorread path to the Parquet vectorized primitive column reader (including dictionary decode support). - Add/adjust LLAP query tests and Java unit tests to exercise
DECIMAL_64vectorization and update expected outputs.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| ql/src/test/results/clientpositive/llap/parquet_decimal64.q.out | New golden output validating Parquet DECIMAL_64 vectorization plan + results. |
| ql/src/test/queries/clientpositive/parquet_decimal64.q | New query to assert Parquet vectorized reader engages DECIMAL_64. |
| ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java | Adds wiring for physical variations + a Decimal64 read verification helper. |
| ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedDictionaryEncodingColumnReader.java | Adds a unit test entry point for Decimal64 reads (dictionary). |
| ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedColumnReader.java | Adds a unit test entry point for Decimal64 reads (non-dictionary). |
| ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java | Implements Parquet Decimal64 read + dictionary decode logic. |
| ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java | Advertises DECIMAL_64 support for Parquet vectorization. |
| iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_parquet.q.out | Updates plan/output expectations to show DECIMAL_64 in use for Iceberg Parquet. |
| iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_multitable.q.out | Updates expected output (removes now-unneeded ORC-only toggling sections). |
| iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_mixed.q.out | Updates plan/output expectations for mixed ORC/Parquet Iceberg vectorization. |
| iceberg/iceberg-handler/src/test/queries/positive/vectorized_iceberg_read_multitable.q | Removes ORC-only toggling queries that are no longer required. |
| iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java | Removes ORC-only parameter propagation used for prior vectorization gating. |
| iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java | Enables advertising DECIMAL_64 for Iceberg when table property is enabled (regardless of ORC/Parquet). |
| iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/BaseHiveIcebergMetaHook.java | Removes ORC-only parameter constant + helper methods used for vectorization gating. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+710
to
+733
| // INT32/INT64-backed decimals already give the unscaled value via the Parquet reader; for the | ||
| // (rare) byte-array-backed case reuse HiveDecimalWritable.serialize64 -- the same encoding | ||
| // Decimal64ColumnVector.set uses -- rather than decoding the bytes by hand. | ||
| private long readUnscaledLong(PrimitiveTypeName physical, short scale) { | ||
| return switch (physical) { | ||
| case INT32 -> dataColumn.readInteger(); | ||
| case INT64 -> dataColumn.readLong(); | ||
| default -> { | ||
| scratchDecimal.set(dataColumn.readDecimal(), scale); | ||
| yield scratchDecimal.serialize64(scale); | ||
| } | ||
| }; | ||
| } | ||
|
|
||
| private long readUnscaledLongFromDict(PrimitiveTypeName physical, int dictId, short scale) { | ||
| return switch (physical) { | ||
| case INT32 -> dictionary.readInteger(dictId); | ||
| case INT64 -> dictionary.readLong(dictId); | ||
| default -> { | ||
| scratchDecimal.set(dictionary.readDecimal(dictId), scale); | ||
| yield scratchDecimal.serialize64(scale); | ||
| } | ||
| }; | ||
| } |
Comment on lines
+696
to
707
| while (left > 0) { | ||
| readRepetitionAndDefinitionLevels(); | ||
| if (definitionLevel >= maxDefLevel) { | ||
| c.vector[rowId] = readUnscaledLong(physical, c.scale); | ||
| c.isNull[rowId] = false; | ||
| c.isRepeating = c.isRepeating && (c.vector[0] == c.vector[rowId]); | ||
| } else { | ||
| setNullValue(c, rowId); | ||
| } | ||
| rowId++; | ||
| left--; | ||
| } |
Comment on lines
+617
to
+623
| if (column instanceof Decimal64ColumnVector dec64) { | ||
| fillDecimal64PrecisionScale(dec64); | ||
| PrimitiveTypeName dictPhysical = type.asPrimitiveType().getPrimitiveTypeName(); | ||
| for (int i = rowId; i < rowId + num; ++i) { | ||
| if (!column.isNull[i]) { | ||
| dec64.vector[i] = readUnscaledLongFromDict(dictPhysical, (int) dictionaryIds.vector[i], dec64.scale); | ||
| } |
| POSTHOOK: Input: default@tbl_ice_parquet_all_types | ||
| #### A masked pattern was here #### | ||
| 1.1 1.2 false 4 567890123456789 6 col7 2012-10-03 19:58:08 1234-09-02 10.01 | ||
| 1.1 1.2 false 4 567890123456789 6 col7 2012-10-03 19:58:08 1234-09-02 0.00 |
| #### A masked pattern was here #### | ||
| 1.1 1.2 false 4 567890123456789 6 col7 2012-10-03 19:58:08 1234-09-02 10.01 | ||
| 5.1 6.2 true 40 567890123456780 8 col07 2012-10-03 19:58:09 1234-09-03 10.02 | ||
| 5.1 6.2 true 40 567890123456780 8 col07 2012-10-03 19:58:09 1234-09-03 0.00 |
Comment on lines
+687
to
+694
| /** | ||
| * Decimal64 fast path: read the unscaled value straight into the long-backed vector instead of | ||
| * materializing a HiveDecimal per row. Only reached for decimal columns the vectorizer tagged | ||
| * DECIMAL_64 (precision <= 18); higher precision still uses {@link #readDecimal}. | ||
| */ | ||
| private void readDecimal64(int total, Decimal64ColumnVector c, int rowId) { | ||
| fillDecimal64PrecisionScale(c); | ||
| PrimitiveTypeName physical = type.asPrimitiveType().getPrimitiveTypeName(); |
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



What changes were proposed in this pull request?
Adds support for Parquet DECIMAL_64 vectorization
Why are the changes needed?
Performace
Does this PR introduce any user-facing change?
No
How was this patch tested?
parquet_decimal64.q, vectorized_iceberg_read_multitable.q