HIVE-28911: Improve SEARCH expansion to exploit <> operator #6503
rubenada wants to merge 12 commits into
…d of 'ref <> value1 AND ref <> value2' since the latter can break statistic propagation on partitioned tables (such as Iceberg). During Conjunctive Normal Form (CNF) expansion, nested inequalities inside 'OR' clauses flatten into structures that Hive's SearchArgument (Sarg) builder and Iceberg's partition-pruning layer cannot natively translate. This may cause the compiler to abandon filter pushdown at the TableScan phase, resetting column statistics from PARTIAL to NONE.
… partition pruning to maximize cache hits
zabetak left a comment
The changes LGTM. One point worth clarifying is what to do in FilterSelectivityEstimator and whether it's worth applying changes there or not.
    if (sarg.isComplementedPoints()) {
      // Generate 'ref <> value1 AND ... AND ref <> valueN'
      List<RexNode> notEq = sarg.rangeSet.complement().asRanges().stream()
          .map(range -> rexBuilder.makeCall(SqlStdOperatorTable.NOT_EQUALS, ref, makeLiteral(range.lowerEndpoint())))
          .toList();
      searchSelectivities.add(RexUtil.composeConjunction(rexBuilder, notEq).accept(FilterSelectivityEstimator.this));
    } else {
Does this form lead to better selectivity estimates than what we had before? If not then we don't necessarily need to change it. The purpose of this code is to compute a good selectivity for the SEARCH predicate not to transform it to something else.
Uhmmm... you're right, the original version (which uses histograms) would give a more accurate estimate than the proposed one with NOT_EQUALS (which is estimated simply as (ndv-1)/ndv, which can be quite off). However, histograms may not be available in general (in which case the original version would fall back to a hardcoded selectivity), whereas the sub-optimal approach with NOT_EQUALS relies on the more generally available ndv statistic (and this estimate, although not perfect, would be better than the hardcoded value of the original version).
Having said that, I guess we should try to aim for the better solution, and trust that statistics would be available, so I lean towards reverting the change in this file.
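To make the NDV-based estimate mentioned above concrete, here is a minimal sketch. It assumes each 'ref <> value' conjunct gets a selectivity of (ndv-1)/ndv and, as a further assumption not stated in this thread, that the conjuncts are combined by simple multiplication. The class and method names are hypothetical and do not reflect the actual FilterSelectivityEstimator code.

```java
public final class NotEqualsSelectivitySketch {

    // Selectivity of a single 'ref <> value', assuming values are spread
    // uniformly over ndv distinct values.
    public static double notEquals(long ndv) {
        if (ndv <= 0) {
            throw new IllegalArgumentException("ndv must be positive");
        }
        return (ndv - 1) / (double) ndv;
    }

    // Selectivity of 'ref <> v1 AND ... AND ref <> vN', assuming (hypothetically)
    // that the conjuncts are treated as independent and multiplied together.
    public static double conjunction(long ndv, int excludedPoints) {
        return Math.pow(notEquals(ndv), excludedPoints);
    }

    public static void main(String[] args) {
        // Two excluded points on a column with 31 distinct values.
        System.out.println(conjunction(31, 2));
    }
}
```

With 31 distinct values and two excluded points this gives (30/31)^2, roughly 0.9365, regardless of how frequent the excluded values really are, which is why a histogram-based estimate can be more accurate when histograms exist.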
    operands.addAll(consumer.inLiterals);
    orList.add(rexBuilder.makeCall(HiveIn.INSTANCE, operands));

    if (sarg.isComplementedPoints()) {
When the sarg is not strictly a complement then I guess we cannot handle it. Examples:
... WHERE NOT ( x = 10 OR x = 20 OR (x > 30 AND x < 50))
... WHERE (x <> 10 AND x <> 20 AND (x <= 30 OR x >= 50))
One way to cover those would be to apply the existing RangeConverter on sarg.rangeSet.complement() and invert the EQUALS and HiveIn operators in the switch. The challenge here is when to use the sarg.rangeSet and when its complement. A naive choice could be to base the decision on sarg.rangeSet.asRanges().size() or something along these lines.
Anyway, the change here is fine as it is, so we can log another ticket and follow up there on whether it's worth doing or not.
In fact we could negate the entire sarg (not only the rangeset) if that leads to a simpler, more efficient expansion. This would involve negating the operators (IS NULL, =, IN, etc.), and changing the orList into an andList.
What would that look like? Systematically try the "normal" sarg expansion and the negated one, and pick the "simpler" one?
I guess we could try... but could we consider that a follow-up, separate task, out of the scope of the current PR?
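As a rough illustration of the "naive choice" floated in this thread (compare the number of ranges in the range set with the number in its complement), the following sketch counts both without any Calcite or Guava types. Everything here is hypothetical: the Range class, the method names, and the decision rule are illustrative only and are not part of this PR. It assumes the input ranges are sorted, disjoint, and already coalesced, so every pair of neighbouring ranges leaves a gap between them.

```java
import java.util.List;

public final class ExpansionChoiceSketch {

    // Hypothetical stand-in for a range; a null bound means unbounded on that side.
    // Open vs. closed endpoints do not affect the counts, so they are not modelled.
    public static final class Range {
        final Integer lower;
        final Integer upper;

        public Range(Integer lower, Integer upper) {
            this.lower = lower;
            this.upper = upper;
        }
    }

    // Number of ranges in the complement of a sorted, disjoint, coalesced list.
    public static int complementSize(List<Range> ranges) {
        if (ranges.isEmpty()) {
            return 1; // the complement of the empty set is the whole domain
        }
        int size = ranges.size() - 1; // one gap between each pair of neighbours
        if (ranges.get(0).lower != null) {
            size++; // everything below the first range
        }
        if (ranges.get(ranges.size() - 1).upper != null) {
            size++; // everything above the last range
        }
        return size;
    }

    // Naive rule: expand the negated form only when its range count is strictly smaller.
    public static boolean preferNegated(List<Range> ranges) {
        return complementSize(ranges) < ranges.size();
    }

    public static void main(String[] args) {
        // x <> 10 AND x <> 20 AND (x <= 30 OR x >= 50):
        // (-inf..10), (10..20), (20..30], [50..+inf)
        List<Range> ranges = List.of(
            new Range(null, 10), new Range(10, 20), new Range(20, 30), new Range(50, null));
        System.out.println(complementSize(ranges) + " vs " + ranges.size());
    }
}
```

For the second example in this thread the range set has four ranges while its complement has three (the points 10 and 20 plus the interval between 30 and 50), so this rule would pick the negated form. Whether a raw range count is a good proxy for "simpler" is exactly the open question raised here.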



What changes were proposed in this pull request?
Improve SEARCH expansion to exploit <> operator.
SEARCH operator can be used to represent many types of range predicates including the inequality operator (<>).
For example
d_dom <> 10 and d_dom <> 20
can be represented as
SEARCH($9, Sarg[(-∞..10), (10..20), (20..+∞)]).
Currently, after SEARCH expansion the following expression will be generated
OR(<($9, 10), >($9, 20), AND(>($9, 10), <($9, 20))).
With the proposed change we shall get the original (and simpler)
AND(<>($9, 10), <>($9, 20)).
Why are the changes needed?
Exploit the inequality operator when expanding ranges to generate simpler and slightly more efficient expressions.
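The difference between the two expansions can be sketched with plain strings. This is a self-contained illustration only: it does not use Calcite's RexBuilder or Sarg, the class and method names are hypothetical, and the operand order in the OR form may differ from what Hive actually prints. The input is the sorted list of excluded points, i.e. the complement of the Sarg's range set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public final class SearchExpansionSketch {

    // Expansion using <>: AND(<>(ref, p1), ..., <>(ref, pN)).
    public static String expandAsNotEquals(String ref, List<Integer> excludedPoints) {
        requireNonEmpty(excludedPoints);
        if (excludedPoints.size() == 1) {
            return "<>(" + ref + ", " + excludedPoints.get(0) + ")";
        }
        return excludedPoints.stream()
            .map(p -> "<>(" + ref + ", " + p + ")")
            .collect(Collectors.joining(", ", "AND(", ")"));
    }

    // Range-by-range expansion of the same set: one OR operand per range.
    // Expects excludedPoints in ascending order.
    public static String expandAsRanges(String ref, List<Integer> excludedPoints) {
        requireNonEmpty(excludedPoints);
        List<String> terms = new ArrayList<>();
        terms.add("<(" + ref + ", " + excludedPoints.get(0) + ")");
        for (int i = 1; i < excludedPoints.size(); i++) {
            terms.add("AND(>(" + ref + ", " + excludedPoints.get(i - 1) + "), <("
                + ref + ", " + excludedPoints.get(i) + "))");
        }
        terms.add(">(" + ref + ", " + excludedPoints.get(excludedPoints.size() - 1) + ")");
        return "OR(" + String.join(", ", terms) + ")";
    }

    private static void requireNonEmpty(List<Integer> points) {
        if (points.isEmpty()) {
            throw new IllegalArgumentException("at least one excluded point is required");
        }
    }

    public static void main(String[] args) {
        List<Integer> excluded = List.of(10, 20);
        System.out.println(expandAsRanges("$9", excluded));
        System.out.println(expandAsNotEquals("$9", excluded));
    }
}
```

The range form grows by one AND operand (two comparisons) per additional excluded point, while the <> form grows by a single comparison, which is the simplification this change is after.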
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Unit test added. A few test plans adjusted reflecting this change.