
HIVE-28911: Improve SEARCH expansion to exploit <> operator #6503

Open
rubenada wants to merge 12 commits into apache:master from rubenada:HIVE-28911


@rubenada
Contributor

@rubenada rubenada commented May 21, 2026

What changes were proposed in this pull request?

Improve SEARCH expansion to exploit <> operator.
SEARCH operator can be used to represent many types of range predicates including the inequality operator (<>).
For example d_dom <> 10 and d_dom <> 20 can be represented as SEARCH($9, Sarg[(-∞..10), (10..20), (20..+∞)]).
Currently, SEARCH expansion generates the following expression: OR(<($9, 10), >($9, 20), AND(>($9, 10), <($9, 20))). With the proposed change we get the original (and simpler) AND(<>($9, 10), <>($9, 20)).
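
To make the difference between the two expansions concrete, here is a minimal sketch that builds both forms as plain strings for a set of excluded points. It does not use Calcite's RexNode API; the class and method names (`SearchExpansionSketch`, `expandAsRanges`, `expandAsNotEquals`) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: mimics the textual shape of the two expansions,
// not the actual Calcite/Hive implementation.
public class SearchExpansionSketch {

  // Range-style expansion: one disjunct per gap left around the excluded points.
  // Assumes 'excluded' is non-empty and sorted in ascending order.
  static String expandAsRanges(String ref, List<Integer> excluded) {
    List<String> disjuncts = new ArrayList<>();
    disjuncts.add("<(" + ref + ", " + excluded.get(0) + ")");
    disjuncts.add(">(" + ref + ", " + excluded.get(excluded.size() - 1) + ")");
    for (int i = 0; i + 1 < excluded.size(); i++) {
      disjuncts.add("AND(>(" + ref + ", " + excluded.get(i) + "), <("
          + ref + ", " + excluded.get(i + 1) + "))");
    }
    return "OR(" + String.join(", ", disjuncts) + ")";
  }

  // Inequality-style expansion: one '<>' conjunct per excluded point.
  static String expandAsNotEquals(String ref, List<Integer> excluded) {
    List<String> conjuncts = new ArrayList<>();
    for (int value : excluded) {
      conjuncts.add("<>(" + ref + ", " + value + ")");
    }
    // A single conjunct needs no AND wrapper.
    return conjuncts.size() == 1
        ? conjuncts.get(0)
        : "AND(" + String.join(", ", conjuncts) + ")";
  }

  public static void main(String[] args) {
    List<Integer> excluded = List.of(10, 20);
    System.out.println(expandAsRanges("$9", excluded));
    System.out.println(expandAsNotEquals("$9", excluded));
  }
}
```

The range-style form grows by one AND(...) disjunct per additional excluded point, while the inequality-style form grows by one simple comparison.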

Why are the changes needed?

Exploit the inequality operator when expanding ranges to generate simpler and slightly more efficient expressions.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Unit test added. A few test plans adjusted reflecting this change.

…d of 'ref <> value1 AND ref <> value2' since the latter can break statistic propagation on partitioned tables (such as Iceberg).

During Conjunctive Normal Form (CNF) expansion, nested inequalities inside 'OR' clauses flatten into structures that Hive's SearchArgument (Sarg) builder and Iceberg's partition-pruning layer cannot natively translate. This may cause the compiler to abandon filter pushdown at the TableScan phase, resetting column statistics from PARTIAL to NONE.
@sonarqubecloud

sonarqubecloud Bot commented Jun 2, 2026

Member

@zabetak zabetak left a comment

The changes LGTM. One point worth clarifying is what to do in FilterSelectivityEstimator, and whether it's worth applying changes there or not.

Comment on lines +633 to +639
if (sarg.isComplementedPoints()) {
  // Generate 'ref <> value1 AND ... AND ref <> valueN'
  List<RexNode> notEq = sarg.rangeSet.complement().asRanges().stream()
      .map(range -> rexBuilder.makeCall(SqlStdOperatorTable.NOT_EQUALS, ref, makeLiteral(range.lowerEndpoint())))
      .toList();
  searchSelectivities.add(RexUtil.composeConjunction(rexBuilder, notEq).accept(FilterSelectivityEstimator.this));
} else {
Member

Does this form lead to better selectivity estimates than what we had before? If not, then we don't necessarily need to change it. The purpose of this code is to compute a good selectivity for the SEARCH predicate, not to transform it into something else.

Contributor Author

Uhmmm... you're right, the original version (which uses histograms) would give a more accurate estimate than the proposed one with NOT_EQUALS (which is estimated simply as (ndv-1)/ndv, which can be quite off). However, histograms may not be available in general (in which case the original version would default to a hardcoded selectivity), whereas the sub-optimal estimation with NOT_EQUALS uses the more generally available ndv value (and this estimate, although not perfect, would be better than the hardcoded value of the original version).

Having said that, I guess we should try to aim for the better solution, and trust that statistics would be available, so I lean towards reverting the change in this file.
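
For reference, the ndv-based estimate discussed in this thread can be sketched as follows. This is an illustration under stated assumptions, not the code in FilterSelectivityEstimator: the class and method names are hypothetical, the fallback of 1.0 when ndv is at most 1 is an assumption, and so is combining the conjuncts by multiplying their selectivities as if they were independent.

```java
// Illustrative only: a stand-alone version of the (ndv-1)/ndv estimate.
public class NotEqualsSelectivitySketch {

  // Selectivity of 'ref <> value' given the number of distinct values.
  // Assumption: with ndv <= 1 the estimate falls back to 1.0.
  static double notEqualsSelectivity(double ndv) {
    return ndv > 1 ? (ndv - 1) / ndv : 1.0;
  }

  // Selectivity of 'ref <> v1 AND ... AND ref <> vN'.
  // Assumption: conjunct selectivities are multiplied (independence).
  static double conjunctionSelectivity(double ndv, int excludedPoints) {
    return Math.pow(notEqualsSelectivity(ndv), excludedPoints);
  }

  public static void main(String[] args) {
    System.out.println(notEqualsSelectivity(10));
    System.out.println(conjunctionSelectivity(10, 2));
  }
}
```

The estimate treats every distinct value as equally frequent, so it ignores skew: if one of the excluded values accounts for a large share of the rows, the real selectivity is much lower than this formula suggests, which is where a histogram-based estimate has the advantage.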

operands.addAll(consumer.inLiterals);
orList.add(rexBuilder.makeCall(HiveIn.INSTANCE, operands));

if (sarg.isComplementedPoints()) {
Member

When the sarg is not strictly a complement then I guess we cannot handle it. Examples:

... WHERE NOT ( x = 10 OR x = 20 OR (x > 30 AND x < 50))
... WHERE (x <> 10 AND x <> 20 AND (x <= 30 OR x >= 50))

One way to cover those would be to apply the existing RangeConverter on sarg.rangeSet.complement() and invert the EQUALS and HiveIn operators in the switch. The challenge here is deciding when to use sarg.rangeSet and when to use its complement. A naive choice could be to base the decision on sarg.rangeSet.asRanges().size() or something along these lines.

Anyway, the change here is fine as it is, so we can log another ticket and follow up there on whether it is worth doing or not.
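
The naive size-based choice mentioned above could look roughly like the sketch below. It is self-contained and does not use Guava's RangeSet: `Interval`, `complement`, and `preferComplement` are hypothetical stand-ins, and the rule "use the complement when it has fewer ranges" is only one possible reading of the suggestion.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a tiny stand-in for a range set over integers.
public class ComplementChoiceSketch {

  // Hypothetical interval type; a null endpoint means -infinity (lo) or +infinity (hi).
  static final class Interval {
    final Integer lo;
    final boolean loClosed;
    final Integer hi;
    final boolean hiClosed;

    Interval(Integer lo, boolean loClosed, Integer hi, boolean hiClosed) {
      this.lo = lo;
      this.loClosed = loClosed;
      this.hi = hi;
      this.hiClosed = hiClosed;
    }
  }

  // Returns the gaps between the given ranges.
  // Assumes 'ranges' is sorted, disjoint, and has no two ranges that touch.
  static List<Interval> complement(List<Interval> ranges) {
    List<Interval> gaps = new ArrayList<>();
    if (ranges.isEmpty()) {
      gaps.add(new Interval(null, false, null, false));
      return gaps;
    }
    Interval first = ranges.get(0);
    if (first.lo != null) {
      gaps.add(new Interval(null, false, first.lo, !first.loClosed));
    }
    for (int i = 0; i + 1 < ranges.size(); i++) {
      Interval left = ranges.get(i);
      Interval right = ranges.get(i + 1);
      // A gap endpoint is closed exactly when the neighbouring range endpoint is open.
      gaps.add(new Interval(left.hi, !left.hiClosed, right.lo, !right.loClosed));
    }
    Interval last = ranges.get(ranges.size() - 1);
    if (last.hi != null) {
      gaps.add(new Interval(last.hi, !last.hiClosed, null, false));
    }
    return gaps;
  }

  // Naive heuristic: expand the complement (and negate) when it has fewer ranges.
  static boolean preferComplement(List<Interval> ranges) {
    return complement(ranges).size() < ranges.size();
  }

  public static void main(String[] args) {
    // x <> 10 AND x <> 20 AND (x <= 30 OR x >= 50)
    List<Interval> sarg = List.of(
        new Interval(null, false, 10, false),
        new Interval(10, false, 20, false),
        new Interval(20, false, 30, true),
        new Interval(50, true, null, false));
    System.out.println(preferComplement(sarg));
  }
}
```

For the second example in this comment, the range set has four ranges while its complement has three ([10..10], [20..20], (30..50)), so this heuristic would pick the complement. Counting ranges says nothing about how cheap each resulting expression is to evaluate, so a real implementation would likely need a better cost measure.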

Member

In fact, we could negate the entire sarg (not only the rangeset) if that leads to a simpler, more efficient expansion. This would involve negating the operators (IS NULL, =, IN, etc.) and changing the orList into an andList.

Contributor Author

What would that look like? Systematically try both the "normal" sarg expansion and the negated one, and pick the "simpler" one?
I guess we could try... but could we consider that a follow-up, separate task, out of the scope of the current PR?


3 participants