Skip to content

[CALCITE-5101] LISTAGG function with DISTINCT and ORDER BY fails#4983

Open
xuzifu666 wants to merge 4 commits into
apache:mainfrom
xuzifu666:calcite-5101
Open

[CALCITE-5101] LISTAGG function with DISTINCT and ORDER BY fails#4983
xuzifu666 wants to merge 4 commits into
apache:mainfrom
xuzifu666:calcite-5101

Conversation

@xuzifu666
Copy link
Copy Markdown
Member

if (collations.size() == 1) {
RelFieldCollation collation = collations.get(0);
final int index = collation.getFieldIndex();
if (index < 0 || index >= rowType.getFieldList().size()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these should never be reachable if the code generator is correct

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is just a defensive check, but it seems unnecessary now, so I deleted it.

}
}

if (!validCollation) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused: there is a sort order specified, yet it is ignored? That cannot be right.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two options:

  1. Complete Fix: add ORDER BY columns to the re-grouped input, requires architectural changes to aggregate planning,changes query semantics, potential impact on other parts.This is a large-scale refactoring;
  2. Graceful Degradation(current way): returns correct results without crashes, loses sorting information, but doesn't crash.This is a rare scenario:only occurs with LISTAGG(DISTINCT ...) WITHIN GROUP (ORDER BY non-group-column).

According the conditions I choose the second way. So do you think my current plan is reasonable?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the result correct if you ignore the order?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a compromise solution to avoid crashes by downgrading the handling of such special statements (removing the order by clause). The main issue is that a thorough fix involves many aspects, with less consideration for ROI. If it becomes clear that a completely accurate order by result is required, I will attempt a fix with the lowest possible cost in the future.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the difference between "less acurate" and "wrong"
Either the result is correct, or it's not. I don't think we should take a fix which avoids crashes and produces wrong results.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I will try to fix this issue completely later.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mihaibudiu I had fixed the issue, PTAL.

Copy link
Copy Markdown

@GoncaloCoutoDosSantos GoncaloCoutoDosSantos Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, if you're going to change AggregateExpandDistinctAggregatesRule to fix this bug, you'll also should to update rewriteUsingGroupingSets, since the same issue can be triggered by a query like:

SELECT deptno,
       LISTAGG(DISTINCT ename) WITHIN GROUP (ORDER BY sal),
       LISTAGG(DISTINCT deptno) WITHIN GROUP (ORDER BY ename)
FROM emp
GROUP BY deptno;```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least, a test checking this should be added

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 2, 2026

@xuzifu666 xuzifu666 requested a review from mihaibudiu June 2, 2026 15:27
Copy link
Copy Markdown
Contributor

@mihaibudiu mihaibudiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reply to the other comment you have received as well.

if (aggCall.isDistinct()) {
bottomGroups.addAll(aggCall.getArgList());
// Also include ORDER BY columns from WITHIN GROUP
for (RelFieldCollation fc : aggCall.collation.getFieldCollations()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a better approach.
Can this add the same column twice? Is that a problem?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this cause DISTINCT to be ignored in certain scenarios?

Consider the following example:

SELECT deptno, SUM(DISTINCT sal) WITHIN GROUP (ORDER BY bonus)
FROM EMP
GROUP BY deptno

With this modification, the rule would rewrite the query as:

SELECT deptno, SUM(sal) WITHIN GROUP (ORDER BY bonus)
FROM (
  SELECT deptno, sal, bonus FROM EMP GROUP BY deptno, sal, bonus
)
GROUP BY deptno

This does not correctly enforce DISTINCT on sal: if two rows share the same sal value but have different bonus values, both survive the inner GROUP BY and sal ends up counted twice in the outer SUM, which violates the DISTINCT semantics.

PS: Please feel free to correct me or let me know if I am intervening inappropriately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants