HIVE-29637: Incorrect Results for IS_NULL Predicate on Partition Column#6515
HIVE-29637: Incorrect Results for IS_NULL Predicate on Partition Column#6515deniskuzZ wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes incorrect results for WHERE <partition_col> IS NULL on Iceberg tables by ensuring null partition values are represented as actual Java null during partition predicate evaluation, and adds a regression qtest to cover the scenario.
Changes:
- Add
IcebergTableUtil.makeSpecFromName(...)to build a partition spec map from an Iceberg partition path while translating Iceberg null partition values to Javanull. - Update
HiveIcebergStorageHandler#getPartitionsByExprto use the new helper when constructingDummyPartitionspecs for expression evaluation. - Add a new positive qtest (
iceberg_isnull_partition_pruning.q+ expected output) validatingIS NULLpartition pruning behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/IcebergTableUtil.java |
Adds a helper for parsing Iceberg partition paths into a spec map with correct null handling. |
iceberg/iceberg-handler/src/main/java/org/apache/iceberg/mr/hive/HiveIcebergStorageHandler.java |
Switches partition spec construction in getPartitionsByExpr to the new helper to fix IS NULL results. |
iceberg/iceberg-handler/src/test/queries/positive/iceberg_isnull_partition_pruning.q |
Adds a regression query file covering WHERE str_col IS NULL on an Iceberg-partitioned table. |
iceberg/iceberg-handler/src/test/results/positive/iceberg_isnull_partition_pruning.q.out |
Adds the expected output for the new regression test. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
zabetak
left a comment
There was a problem hiding this comment.
Left some high level comments/questions cause I am not very familiar with the modified APIs.
|
zabetak
left a comment
There was a problem hiding this comment.
Tentatively approve assuming that we all agree that the new behavior is the one desired by all users and the old one is merely a bug.
|
|
||
| String partitionValue = entry.getValue(); | ||
| if (partitionValue.equals(defaultPartitionName)) { | ||
| partValues.add(null); // Null for default partition. | ||
| } else { | ||
| partValues.add(ObjectInspectorConverters.getConverter( | ||
| PrimitiveObjectInspectorFactory.javaStringObjectInspector, oi) | ||
| .convert(partitionValue)); | ||
| } |
There was a problem hiding this comment.
The description of hive.exec.default.partition.name property says the following:
"The default partition name in case the dynamic partition column value is null/empty string or any other values that cannot be escaped."
This code assumes that the value is null but according to the description it may be other things as well.
The change here will fix the IS [NOT] NULL predicate but may potentially change the behavior of some queries:
SELECT key, value, ds FROM pcr_t1 WHERE ds > 'A';
SELECT key, value, ds FROM pcr_t1 WHERE ds < 'A';Do the queries return the same results before/after the changes in the PR for Iceberg and non-Iceberg tables ?
Which behavior should prevail?
There was a problem hiding this comment.
Non-Iceberg pcr_t1
(NULL and '' both land in ds=__HIVE_DEFAULT_PARTITION__)
┌────────────────┬─────────────────────────────────────────┬─────────────────────────────────────────┬─────────────────┐
│ Query │ WITHOUT fix │ WITH fix │ Changed by fix? │
├────────────────┼─────────────────────────────────────────┼─────────────────────────────────────────┼─────────────────┤
│ ds is null │ (none) ❌ │ C, D → ds=NULL │ ✅ fixed │
├────────────────┼─────────────────────────────────────────┼─────────────────────────────────────────┼─────────────────┤
│ ds is not null │ A, B │ A, B │ no │
├────────────────┼─────────────────────────────────────────┼─────────────────────────────────────────┼─────────────────┤
│ ds = 'null' │ B │ B │ no │
├────────────────┼─────────────────────────────────────────┼─────────────────────────────────────────┼─────────────────┤
│ ds > 'A' │ C, D, B → ds=__HIVE_DEFAULT_PARTITION__ │ C, D, B → ds=__HIVE_DEFAULT_PARTITION__ │ no │
├────────────────┼─────────────────────────────────────────┼─────────────────────────────────────────┼─────────────────┤
│ ds < 'A' │ A │ A │ no │
└────────────────┴─────────────────────────────────────────┴─────────────────────────────────────────┴─────────────────┘
Iceberg ice_01
(NULL → null partition; '' → real ds='')
┌────────────────┬─────────────┬──────────┬─────────────────┐
│ Query │ WITHOUT fix │ WITH fix │ Changed by fix? │
├────────────────┼─────────────┼──────────┼─────────────────┤
│ ds is null │ (none) ❌ │ C │ ✅ fixed │
├────────────────┼─────────────┼──────────┼─────────────────┤
│ ds is not null │ A, B, D │ A, B, D │ no │
├────────────────┼─────────────┼──────────┼─────────────────┤
│ ds = 'null' │ B │ B │ no │
├────────────────┼─────────────┼──────────┼─────────────────┤
│ ds > 'A' │ B │ B │ no │
├────────────────┼─────────────┼──────────┼─────────────────┤
│ ds < 'A' │ D, A │ A, D │ no (same set) │
└────────────────┴─────────────┴──────────┴─────────────────┘
- The fix changes exactly one query on each table type: ds is null (previously returned nothing — the bug)
- Range queries (>, <) are unchanged before/after on both table types → no regression.
- Iceberg is fully SQL-correct throughout because it never conflates NULL with ''.
PR scope is NULL semantics only.
| insert into pcr_t1 partition (ds) select 'B', 'V2', 'null'; | ||
| insert into pcr_t1 partition (ds) select 'C', 'V3', null; | ||
|
|
||
| explain select key, value, ds from pcr_t1 where ds is null; |
There was a problem hiding this comment.
Is there something useful in the EXPLAIN output? If not we can drop those.
There was a problem hiding this comment.
yes, filterExpr: ds is not null (type: boolean)
|
|
||
| insert into pcr_t1 partition (ds) select 'A', 'V1', '2000-04-08'; | ||
| insert into pcr_t1 partition (ds) select 'B', 'V2', 'null'; | ||
| insert into pcr_t1 partition (ds) select 'C', 'V3', null; |
There was a problem hiding this comment.
The empty string value is another edge case that may create problems
insert into pcr_t1 partition (ds) select 'D', 'V4', '';| explain select key, value, ds from pcr_t1 where ds is not null; | ||
| select key, value, ds from pcr_t1 where ds is not null order by key; | ||
|
|
||
| select key, value, ds from pcr_t1 where ds = 'null'; |
There was a problem hiding this comment.
Consider enriching the tests with:
SELECT key, value, ds FROM pcr_t1 WHERE ds > 'A';
SELECT key, value, ds FROM pcr_t1 WHERE ds < 'A';


…n Column
What changes were proposed in this pull request?
IcebergTableUtil.makeSpecFromName() — a new helper that builds a Hive-compatible partition spec map from Iceberg's PartitionData. When a partition field value is null, it maps it to an actual NULL instead of the literal string "null".
Why are the changes needed?
Queries with WHERE <partition_col> IS NULL on Iceberg tables returned incorrect results (empty result set).
Does this PR introduce any user-facing change?
No
How was this patch tested?
mvn test -Dtest=TestIcebergCliDriver -Dqfile=iceberg_isnull_partition_pruning.q