-
Notifications
You must be signed in to change notification settings - Fork 4.8k
HIVE-29556: do not adjust "unknown" NDVs in extractNDVGroupingColumns #6418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
1e903bb
d873d6f
297cab5
11800bf
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -823,9 +823,11 @@ public static ColStatistics getColStatistics(ColumnStatisticsObj cso, String col | |||||||||||||
| if (numTrues == 0 && numFalses == 0) { | ||||||||||||||
| // All NULL column - no non-null distinct values | ||||||||||||||
| cs.setCountDistint(0); | ||||||||||||||
| cs.setConst(true); | ||||||||||||||
| } else if (numTrues == 0 || numFalses == 0) { | ||||||||||||||
| // One value type confirmed absent (=0), other is present (>0) or unknown (<0) | ||||||||||||||
| cs.setCountDistint(1); | ||||||||||||||
|
Comment on lines
828
to
829
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's assume that |
||||||||||||||
| cs.setConst(csd.getBooleanStats().getNumNulls() == 0 && (numTrues > 0 || numFalses > 0)); | ||||||||||||||
| } else { | ||||||||||||||
| // Both != 0: either both present (>0), both unknown (<0), or one present + one unknown | ||||||||||||||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If both are unknown then why countDistinct is set to 2? Possibly out-of-scope for the current PR but looks like a bug although it may never happen in practice. I don't see any prod code setting trues/false to negative values. The comment in conjunction with the code is a bit confusing. |
||||||||||||||
| cs.setCountDistint(2); | ||||||||||||||
|
|
@@ -1646,6 +1648,7 @@ private static ColStatistics buildColStatForConstant(HiveConf conf, long numRows | |||||||||||||
| colStats.setAvgColLen(avgColSize); | ||||||||||||||
| colStats.setCountDistint(countDistincts); | ||||||||||||||
| colStats.setNumNulls(numNulls); | ||||||||||||||
| colStats.setConst(true); | ||||||||||||||
|
|
||||||||||||||
| Optional<Number> value = getConstValue(encd); | ||||||||||||||
| value.ifPresent(number -> colStats.setRange(number, number)); | ||||||||||||||
|
|
@@ -2112,7 +2115,9 @@ private static List<Long> extractNDVGroupingColumns(List<ColStatistics> colStats | |||||||||||||
| for (ColStatistics cs : colStats) { | ||||||||||||||
| if (cs != null) { | ||||||||||||||
| long ndv = cs.getCountDistint(); | ||||||||||||||
| if (cs.getNumNulls() > 0) { | ||||||||||||||
| if (cs.isConst()) { | ||||||||||||||
| ndv = 1; | ||||||||||||||
| } else if (ndv > 0 && cs.getNumNulls() > 0) { | ||||||||||||||
| ndv = StatsUtils.safeAdd(ndv, 1); | ||||||||||||||
| } | ||||||||||||||
|
Comment on lines
2117
to
2122
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If we know that a column is a constant then shouldn't
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Good point. The But you're right that I'll switch to a much more readable variant: if (cs.isConst()) {
ndv = 1;
} else if (ndv > 0 && cs.getNumNulls() > 0) {
ndv = StatsUtils.safeAdd(ndv, 1);
}
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Part ASetting NDV to 1 when we have constants seems like a reasonable and safe improvement in the accuracy of estimations. I would be OK to approve the PR with just this change. if (cs.isConst()) {
ndv = 1;
}Part BThe benefit from from guarding the +1 increment with if (ndv > 0 && cs.getNumNulls() > 0) {
ndv = StatsUtils.safeAdd(ndv, 1);
}First of all, if we have stats and if numNulls is greater than zero then very likely NDV would also be greater than zero.
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. So for the Part B, I think we can discuss a bit more about the "Sources of the (NDV=0, numNulls>0)" under HIVE-29556 and depending on the frequency and gravity of the underestimation we can decide how to proceed. |
||||||||||||||
| ndvValues.add(ndv); | ||||||||||||||
|
|
||||||||||||||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| -- HIVE-29556: GROUP BY of a column with unknown NDV (NDV=0 sentinel + numNulls>0) | ||
| -- must not collapse to "1 row" estimate. Both probes feed a join sized so that | ||
| -- master's bogus 1-row estimate triggers Map Join while the heuristic fallback | ||
| -- forces Merge Join. | ||
|
|
||
| SET hive.auto.convert.join=true; | ||
|
|
||
| CREATE TABLE big (k BIGINT); | ||
| CREATE TABLE small (id BIGINT, name STRING); | ||
|
|
||
| ALTER TABLE big UPDATE STATISTICS SET('numRows'='100000000'); | ||
| ALTER TABLE big UPDATE STATISTICS for column k SET ('numDVs'='0','numNulls'='100000000'); | ||
|
|
||
| ALTER TABLE small UPDATE STATISTICS SET('numRows'='1000000'); | ||
| ALTER TABLE small UPDATE STATISTICS for column id SET ('numDVs'='1000000','numNulls'='0'); | ||
| ALTER TABLE small UPDATE STATISTICS for column name SET ('numDVs'='1000000','numNulls'='0','avgColLen'='100','maxColLen'='100'); | ||
|
|
||
| -- U1: direct column with unknown NDV. | ||
| EXPLAIN | ||
| SELECT s.name, g.cnt | ||
| FROM (SELECT k, COUNT(*) AS cnt FROM big GROUP BY k) g | ||
| JOIN small s ON g.cnt = s.id; | ||
|
|
||
| -- U2: PessimisticStatCombiner output for CASE WHEN with one NULL branch. | ||
| EXPLAIN | ||
| SELECT s.name, g.cnt | ||
| FROM (SELECT x, COUNT(*) AS cnt | ||
| FROM (SELECT CASE WHEN k > 0 THEN k ELSE cast(NULL AS BIGINT) END AS x FROM big) t | ||
| GROUP BY x) g | ||
| JOIN small s ON g.cnt = s.id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if this is correct. If the table is empty then naturally we will go into this if statement and mark the column as constant. I am a bit skeptical if we we should use the
setConstmethod inside this method.