[CALCITE-5101] LISTAGG function with DISTINCT and ORDER BY fails#4983
[CALCITE-5101] LISTAGG function with DISTINCT and ORDER BY fails#4983xuzifu666 wants to merge 4 commits into
Conversation
| if (collations.size() == 1) { | ||
| RelFieldCollation collation = collations.get(0); | ||
| final int index = collation.getFieldIndex(); | ||
| if (index < 0 || index >= rowType.getFieldList().size()) { |
There was a problem hiding this comment.
I guess these should never be reachable if the code generator is correct
There was a problem hiding this comment.
Yes, this is just a defensive check, but it seems unnecessary now, so I deleted it.
| } | ||
| } | ||
|
|
||
| if (!validCollation) { |
There was a problem hiding this comment.
I am a bit confused: there is a sort order specified, yet it is ignored? That cannot be right.
There was a problem hiding this comment.
There are two options:
- Complete Fix: add
ORDER BYcolumns to the re-grouped input, requires architectural changes to aggregate planning,changes query semantics, potential impact on other parts.This is a large-scale refactoring; - Graceful Degradation(current way): returns correct results without crashes, loses sorting information, but doesn't crash.This is a rare scenario:only occurs with
LISTAGG(DISTINCT ...) WITHIN GROUP (ORDER BY non-group-column).
According the conditions I choose the second way. So do you think my current plan is reasonable?
There was a problem hiding this comment.
Why is the result correct if you ignore the order?
There was a problem hiding this comment.
This is a compromise solution to avoid crashes by downgrading the handling of such special statements (removing the order by clause). The main issue is that a thorough fix involves many aspects, with less consideration for ROI. If it becomes clear that a completely accurate order by result is required, I will attempt a fix with the lowest possible cost in the future.
There was a problem hiding this comment.
I don't understand the difference between "less acurate" and "wrong"
Either the result is correct, or it's not. I don't think we should take a fix which avoids crashes and produces wrong results.
There was a problem hiding this comment.
Okay, I will try to fix this issue completely later.
There was a problem hiding this comment.
FYI, if you're going to change AggregateExpandDistinctAggregatesRule to fix this bug, you'll also should to update rewriteUsingGroupingSets, since the same issue can be triggered by a query like:
SELECT deptno,
LISTAGG(DISTINCT ename) WITHIN GROUP (ORDER BY sal),
LISTAGG(DISTINCT deptno) WITHIN GROUP (ORDER BY ename)
FROM emp
GROUP BY deptno;```
There was a problem hiding this comment.
At the very least, a test checking this should be added
|
mihaibudiu
left a comment
There was a problem hiding this comment.
Please reply to the other comment you have received as well.
| if (aggCall.isDistinct()) { | ||
| bottomGroups.addAll(aggCall.getArgList()); | ||
| // Also include ORDER BY columns from WITHIN GROUP | ||
| for (RelFieldCollation fc : aggCall.collation.getFieldCollations()) { |
There was a problem hiding this comment.
I think this is a better approach.
Can this add the same column twice? Is that a problem?
There was a problem hiding this comment.
Wouldn't this cause DISTINCT to be ignored in certain scenarios?
Consider the following example:
SELECT deptno, SUM(DISTINCT sal) WITHIN GROUP (ORDER BY bonus)
FROM EMP
GROUP BY deptno
With this modification, the rule would rewrite the query as:
SELECT deptno, SUM(sal) WITHIN GROUP (ORDER BY bonus)
FROM (
SELECT deptno, sal, bonus FROM EMP GROUP BY deptno, sal, bonus
)
GROUP BY deptno
This does not correctly enforce DISTINCT on sal: if two rows share the same sal value but have different bonus values, both survive the inner GROUP BY and sal ends up counted twice in the outer SUM, which violates the DISTINCT semantics.
PS: Please feel free to correct me or let me know if I am intervening inappropriately.



jira: https://issues.apache.org/jira/browse/CALCITE-5101