Skip to content

HIVE-29649: Iceberg: Support Parquet DECIMAL_64 vectorization#6527

Open
deniskuzZ wants to merge 1 commit into
apache:masterfrom
deniskuzZ:HIVE-29649
Open

HIVE-29649: Iceberg: Support Parquet DECIMAL_64 vectorization#6527
deniskuzZ wants to merge 1 commit into
apache:masterfrom
deniskuzZ:HIVE-29649

Conversation

@deniskuzZ
Copy link
Copy Markdown
Member

@deniskuzZ deniskuzZ commented Jun 5, 2026

What changes were proposed in this pull request?

Adds support for Parquet DECIMAL_64 vectorization

Why are the changes needed?

Performace

Does this PR introduce any user-facing change?

No

How was this patch tested?

parquet_decimal64.q, vectorized_iceberg_read_multitable.q

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to enable and validate Parquet DECIMAL_64 vectorization (including for Iceberg tables) by advertising DECIMAL_64 feature support from Parquet/Iceberg input formats, adding a Decimal64 read path to the Parquet vectorized reader, and extending test coverage and golden outputs to assert the new vectorization behavior.

Changes:

  • Advertise DECIMAL_64 as a supported vectorization feature for Parquet (and Iceberg when enabled via table properties).
  • Add a Decimal64ColumnVector read path to the Parquet vectorized primitive column reader (including dictionary decode support).
  • Add/adjust LLAP query tests and Java unit tests to exercise DECIMAL_64 vectorization and update expected outputs.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
ql/src/test/results/clientpositive/llap/parquet_decimal64.q.out New golden output validating Parquet DECIMAL_64 vectorization plan + results.
ql/src/test/queries/clientpositive/parquet_decimal64.q New query to assert Parquet vectorized reader engages DECIMAL_64.
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/VectorizedColumnReaderTestBase.java Adds wiring for physical variations + a Decimal64 read verification helper.
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedDictionaryEncodingColumnReader.java Adds a unit test entry point for Decimal64 reads (dictionary).
ql/src/test/org/apache/hadoop/hive/ql/io/parquet/TestVectorizedColumnReader.java Adds a unit test entry point for Decimal64 reads (non-dictionary).
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/vector/VectorizedPrimitiveColumnReader.java Implements Parquet Decimal64 read + dictionary decode logic.
ql/src/java/org/apache/hadoop/hive/ql/io/parquet/MapredParquetInputFormat.java Advertises DECIMAL_64 support for Parquet vectorization.
iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_parquet.q.out Updates plan/output expectations to show DECIMAL_64 in use for Iceberg Parquet.
iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_multitable.q.out Updates expected output (removes now-unneeded ORC-only toggling sections).
iceberg/iceberg-handler/src/test/results/positive/llap/vectorized_iceberg_read_mixed.q.out Updates plan/output expectations for mixed ORC/Parquet Iceberg vectorization.
iceberg/iceberg-handler/src/test/queries/positive/vectorized_iceberg_read_multitable.q Removes ORC-only toggling queries that are no longer required.
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergMetaHook.java Removes ORC-only parameter propagation used for prior vectorization gating.
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergInputFormat.java Enables advertising DECIMAL_64 for Iceberg when table property is enabled (regardless of ORC/Parquet).
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/BaseHiveIcebergMetaHook.java Removes ORC-only parameter constant + helper methods used for vectorization gating.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +710 to +733
// INT32/INT64-backed decimals already give the unscaled value via the Parquet reader; for the
// (rare) byte-array-backed case reuse HiveDecimalWritable.serialize64 -- the same encoding
// Decimal64ColumnVector.set uses -- rather than decoding the bytes by hand.
private long readUnscaledLong(PrimitiveTypeName physical, short scale) {
return switch (physical) {
case INT32 -> dataColumn.readInteger();
case INT64 -> dataColumn.readLong();
default -> {
scratchDecimal.set(dataColumn.readDecimal(), scale);
yield scratchDecimal.serialize64(scale);
}
};
}

private long readUnscaledLongFromDict(PrimitiveTypeName physical, int dictId, short scale) {
return switch (physical) {
case INT32 -> dictionary.readInteger(dictId);
case INT64 -> dictionary.readLong(dictId);
default -> {
scratchDecimal.set(dictionary.readDecimal(dictId), scale);
yield scratchDecimal.serialize64(scale);
}
};
}
Comment on lines +696 to 707
while (left > 0) {
readRepetitionAndDefinitionLevels();
if (definitionLevel >= maxDefLevel) {
c.vector[rowId] = readUnscaledLong(physical, c.scale);
c.isNull[rowId] = false;
c.isRepeating = c.isRepeating && (c.vector[0] == c.vector[rowId]);
} else {
setNullValue(c, rowId);
}
rowId++;
left--;
}
Comment on lines +617 to +623
if (column instanceof Decimal64ColumnVector dec64) {
fillDecimal64PrecisionScale(dec64);
PrimitiveTypeName dictPhysical = type.asPrimitiveType().getPrimitiveTypeName();
for (int i = rowId; i < rowId + num; ++i) {
if (!column.isNull[i]) {
dec64.vector[i] = readUnscaledLongFromDict(dictPhysical, (int) dictionaryIds.vector[i], dec64.scale);
}
POSTHOOK: Input: default@tbl_ice_parquet_all_types
#### A masked pattern was here ####
1.1 1.2 false 4 567890123456789 6 col7 2012-10-03 19:58:08 1234-09-02 10.01
1.1 1.2 false 4 567890123456789 6 col7 2012-10-03 19:58:08 1234-09-02 0.00
#### A masked pattern was here ####
1.1 1.2 false 4 567890123456789 6 col7 2012-10-03 19:58:08 1234-09-02 10.01
5.1 6.2 true 40 567890123456780 8 col07 2012-10-03 19:58:09 1234-09-03 10.02
5.1 6.2 true 40 567890123456780 8 col07 2012-10-03 19:58:09 1234-09-03 0.00
Comment on lines +687 to +694
/**
* Decimal64 fast path: read the unscaled value straight into the long-backed vector instead of
* materializing a HiveDecimal per row. Only reached for decimal columns the vectorizer tagged
* DECIMAL_64 (precision <= 18); higher precision still uses {@link #readDecimal}.
*/
private void readDecimal64(int total, Decimal64ColumnVector c, int rowId) {
fillDecimal64PrecisionScale(c);
PrimitiveTypeName physical = type.asPrimitiveType().getPrimitiveTypeName();
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 5, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants