HIVE-28911: Improve SEARCH expansion to exploit <> operator by rubenada · Pull Request #6503 · apache/hive

rubenada · 2026-05-21T07:06:24Z

What changes were proposed in this pull request?

Improve SEARCH expansion to exploit <> operator.
The SEARCH operator can represent many kinds of range predicates, including the inequality operator (<>).
For example, d_dom <> 10 and d_dom <> 20 can be represented as SEARCH($9, Sarg[(-∞..10), (10..20), (20..+∞)]).
Currently, SEARCH expansion generates the expression OR(<($9, 10), >($9, 20), AND(>($9, 10), <($9, 20))). With the proposed change, we get the original (and simpler) AND(<>($9, 10), <>($9, 20)).
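To make the difference between the two expansions concrete, here is a minimal sketch. It is not Hive or Calcite code: plain strings stand in for RexNode trees, the class and method names (SearchExpansionSketch, expandAsRanges, expandAsNotEquals) are hypothetical, and the operand order of the OR is not meant to match what Hive actually emits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration only: plain strings stand in for RexNode trees.
final class SearchExpansionSketch {

  // Range-style expansion of "ref equals none of the points".
  // Assumes points is non-empty, sorted ascending, and free of duplicates.
  static String expandAsRanges(String ref, List<Integer> points) {
    List<String> disjuncts = new ArrayList<>();
    // Everything below the smallest excluded point.
    disjuncts.add("<(" + ref + ", " + points.get(0) + ")");
    // Every open interval between two consecutive excluded points.
    for (int i = 0; i + 1 < points.size(); i++) {
      disjuncts.add("AND(>(" + ref + ", " + points.get(i) + "), <("
          + ref + ", " + points.get(i + 1) + "))");
    }
    // Everything above the largest excluded point.
    disjuncts.add(">(" + ref + ", " + points.get(points.size() - 1) + ")");
    return "OR(" + String.join(", ", disjuncts) + ")";
  }

  // Inequality-style expansion: one <> per excluded point.
  static String expandAsNotEquals(String ref, List<Integer> points) {
    List<String> conjuncts = points.stream()
        .map(p -> "<>(" + ref + ", " + p + ")")
        .collect(Collectors.toList());
    return conjuncts.size() == 1
        ? conjuncts.get(0)
        : "AND(" + String.join(", ", conjuncts) + ")";
  }

  public static void main(String[] args) {
    List<Integer> points = List.of(10, 20);
    System.out.println(expandAsRanges("$9", points));
    System.out.println(expandAsNotEquals("$9", points));
  }
}
```

In this sketch the range form has N + 1 disjuncts for N excluded points, and the inequality form has N conjuncts with no nested calls.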

Why are the changes needed?

Exploit the inequality operator when expanding ranges to generate simpler and slightly more efficient expressions.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test added. A few test plans adjusted to reflect this change.

…d of 'ref <> value1 AND ref <> value2' since the latter can break statistic propagation on partitioned tables (such as Iceberg). During Conjunctive Normal Form (CNF) expansion, nested inequalities inside 'OR' clauses flatten into structures that Hive's SearchArgument (Sarg) builder and Iceberg's partition-pruning layer cannot natively translate. This may cause the compiler to abandon filter pushdown at the TableScan phase, resetting column statistics from PARTIAL to NONE.

… partition pruning to maximize cache hits

sonarqubecloud · 2026-06-02T17:13:21Z

Quality Gate passed

Issues
2 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

zabetak

The changes LGTM. One point worth clarifying is what to do in FilterSelectivityEstimator, and whether it's worth applying changes there or not.

zabetak · 2026-06-04T13:28:32Z

+      if (sarg.isComplementedPoints()) {
+        // Generate 'ref <> value1 AND ... AND ref <> valueN'
+        List<RexNode> notEq = sarg.rangeSet.complement().asRanges().stream()
+            .map(range -> rexBuilder.makeCall(SqlStdOperatorTable.NOT_EQUALS, ref, makeLiteral(range.lowerEndpoint())))
+            .toList();
+        searchSelectivities.add(RexUtil.composeConjunction(rexBuilder, notEq).accept(FilterSelectivityEstimator.this));
+      } else {


Does this form lead to better selectivity estimates than what we had before? If not, then we don't necessarily need to change it. The purpose of this code is to compute a good selectivity for the SEARCH predicate, not to transform it into something else.

Uhmmm... you're right, the original version (which uses histograms) would give a more accurate estimate than the proposed one with NOT_EQUALS (which is estimated simply as (ndv-1)/ndv, which can be quite off). However, histograms might not be available in general (in which case the original version would default to a hardcoded selectivity), whereas the sub-optimal version with NOT_EQUALS uses the more generally available ndv estimate (and this estimate, although not perfect, would be better than the hardcoded value of the original version).

Having said that, I guess we should try to aim for the better solution, and trust that statistics would be available, so I lean towards reverting the change in this file.
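To illustrate why the (ndv-1)/ndv estimate can be quite off, here is a minimal sketch comparing it with the actual selectivity on a skewed column. It is not the FilterSelectivityEstimator code: the class and method names are hypothetical, the row counts are invented for the example, and multiplying the per-conjunct estimates (treating them as independent) is an assumption of this sketch.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only, not Hive's FilterSelectivityEstimator.
final class NotEqualsSelectivitySketch {

  // NDV-based estimate for 'ref <> v1 AND ... AND ref <> vN'.
  // Assumption: each conjunct contributes (ndv - 1) / ndv and the
  // conjuncts are multiplied as if they were independent.
  static double ndvEstimate(long ndv, int excludedCount) {
    return Math.pow((ndv - 1) / (double) ndv, excludedCount);
  }

  // Actual fraction of rows whose value is not in the excluded set,
  // computed from exact per-value row counts.
  static double actualSelectivity(Map<Integer, Long> rowCounts, Set<Integer> excluded) {
    long total = 0;
    long kept = 0;
    for (Map.Entry<Integer, Long> e : rowCounts.entrySet()) {
      total += e.getValue();
      if (!excluded.contains(e.getKey())) {
        kept += e.getValue();
      }
    }
    return total == 0 ? 0.0 : kept / (double) total;
  }

  public static void main(String[] args) {
    // Invented, heavily skewed column: value 10 covers 900 of 1000 rows.
    Map<Integer, Long> rowCounts = Map.of(10, 900L, 20, 50L, 30, 50L);
    System.out.println(ndvEstimate(rowCounts.size(), 1));
    System.out.println(actualSelectivity(rowCounts, Set.of(10)));
  }
}
```

With these invented counts the ndv-based estimate for x <> 10 is 2/3, while only 10% of the rows actually qualify. Per-value frequencies of this kind are the information the ndv formula ignores.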

zabetak · 2026-06-04T13:57:49Z

-      operands.addAll(consumer.inLiterals);
-      orList.add(rexBuilder.makeCall(HiveIn.INSTANCE, operands));
+
+    if (sarg.isComplementedPoints()) {


When the sarg is not strictly a complement then I guess we cannot handle it. Examples:

... WHERE NOT (x = 10 OR x = 20 OR (x > 30 AND x < 50))
... WHERE (x <> 10 AND x <> 20 AND (x <= 30 OR x >= 50))

One way to cover those would be to apply the existing RangeConverter on sarg.rangeSet.complement() and invert the EQUALS and HiveIn operators in the switch. The challenge here is when to use the sarg.rangeSet and when its complement. A naive choice could be to base the decision on sarg.rangeSet.asRanges().size() or something along these lines.

Anyway, the change here is fine as it is, so we can log another ticket and follow up there on whether it's worth doing or not.

In fact, we could negate the entire sarg (not only the rangeset) if that leads to a simpler, more efficient expansion. This would involve negating the operators (IS NULL, =, IN, etc.) and changing the orList into an andList.

What would that look like? Systematically try both the "normal" sarg expansion and the negated one, and pick the "simpler" one?
I guess we could try... but could we consider that a separate follow-up task, out of the scope of the current PR?
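The naive size-based choice discussed in this thread could be sketched roughly as follows. This is not Hive or Calcite code: the Range class below is a hypothetical stand-in for the ranges of a sarg's range set, null endpoints stand for infinity, and the input is assumed to be sorted, disjoint and non-adjacent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: a tiny integer range model, not Guava's RangeSet.
final class SargComplementSketch {

  static final class Range {
    final Integer lo;       // null means -infinity
    final boolean loClosed;
    final Integer hi;       // null means +infinity
    final boolean hiClosed;

    Range(Integer lo, boolean loClosed, Integer hi, boolean hiClosed) {
      this.lo = lo;
      this.loClosed = loClosed;
      this.hi = hi;
      this.hiClosed = hiClosed;
    }

    @Override
    public String toString() {
      return (loClosed ? "[" : "(") + (lo == null ? "-inf" : String.valueOf(lo)) + ".."
          + (hi == null ? "+inf" : String.valueOf(hi)) + (hiClosed ? "]" : ")");
    }
  }

  // Complement of a sorted list of disjoint, non-adjacent ranges.
  static List<Range> complement(List<Range> ranges) {
    List<Range> out = new ArrayList<>();
    Integer gapLo = null;         // the current gap starts at -infinity
    boolean gapLoClosed = false;
    boolean unboundedAbove = false;
    for (Range r : ranges) {
      if (r.lo != null) {
        // The gap ends where this range begins, with the bound type flipped.
        out.add(new Range(gapLo, gapLoClosed, r.lo, !r.loClosed));
      }
      if (r.hi == null) {
        unboundedAbove = true;    // nothing is left above this range
        break;
      }
      gapLo = r.hi;
      gapLoClosed = !r.hiClosed;
    }
    if (!unboundedAbove) {
      out.add(new Range(gapLo, gapLoClosed, null, false));
    }
    return out;
  }

  // Naive heuristic: expand the negated form when the complement has fewer ranges.
  static boolean preferComplement(List<Range> ranges) {
    return complement(ranges).size() < ranges.size();
  }

  public static void main(String[] args) {
    // x <> 10 AND x <> 20 AND (x <= 30 OR x >= 50)
    List<Range> sarg = Arrays.asList(
        new Range(null, false, 10, false),
        new Range(10, false, 20, false),
        new Range(20, false, 30, true),
        new Range(50, true, null, false));
    System.out.println(complement(sarg));
    System.out.println(preferComplement(sarg));
  }
}
```

For the second example predicate above, the sarg has four ranges while its complement has three (the points 10 and 20 plus the open interval between 30 and 50), so this heuristic would pick the negated expansion. Whether range count is a good enough measure of "simpler" is left open in the discussion.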

HIVE-28911: Improve SEARCH expansion to exploit <> operator

827b879

asf-ci-hive added tests pending tests unstable and removed tests pending labels May 21, 2026

rubenada added 2 commits May 21, 2026 11:02

minor

2142444

Adjust tests

7a73f3b

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels May 21, 2026

Adjust tests

952a039

asf-ci-hive added tests pending tests passed and removed tests unstable tests pending labels May 23, 2026

asf-ci-hive added tests pending tests unstable tests passed and removed tests passed tests pending tests unstable labels May 26, 2026

Put back AND of NOT_EQUALS (instead of NOT IN), flatten ANDs / ORs in…

cbe2b37

… partition pruning to maximize cache hits

asf-ci-hive added tests pending tests unstable and removed tests passed tests pending labels Jun 1, 2026

Update tests files

55bb8cc

asf-ci-hive added tests pending tests unstable tests passed and removed tests unstable tests pending labels Jun 1, 2026

Simplify

da58715

asf-ci-hive added tests pending and removed tests passed labels Jun 2, 2026

Remove unused import

cdec829

asf-ci-hive added tests passed and removed tests pending labels Jun 2, 2026

Apply fix (AND flattening) in ExprNodeDescUtils

2635139

rubenada closed this Jun 2, 2026

rubenada reopened this Jun 2, 2026

asf-ci-hive added tests pending tests failed and removed tests passed tests pending labels Jun 2, 2026

rubenada added 2 commits June 2, 2026 12:23

empty

8a9a283

Apply OR flattening in ExprNodeConverter for consistency

0578182

asf-ci-hive added tests pending and removed tests failed labels Jun 2, 2026

asf-ci-hive added tests passed and removed tests pending labels Jun 2, 2026

zabetak reviewed Jun 5, 2026

rubenada wants to merge 12 commits into apache:master from rubenada:HIVE-28911

3 participants
