Skip to content

[CALCITE-6451] Improve Nullability Derivation for Intersect and Minus#4897

Open
xiedeyantu wants to merge 11 commits into
apache:mainfrom
xiedeyantu:CALCITE-6451
Open

[CALCITE-6451] Improve Nullability Derivation for Intersect and Minus#4897
xiedeyantu wants to merge 11 commits into
apache:mainfrom
xiedeyantu:CALCITE-6451

Conversation

@xiedeyantu
Copy link
Copy Markdown
Member

Jira Link

CALCITE-6451

Changes Proposed

SetOp overrides deriveRowType() and computes the output row type to be the least restrictive across all inputs here.

So for example given

Input 1: (I64, I64, I64?, I64?)
Input 2: (I64, I64?, I64, I64?)
where ? denotes nullable, the least restrictive output computes:

Output: (I64, I64?, I64?, I64?)
For UNION operations, these nullabilities are accurate.

However for MINUS and INTERSECT there is room for improvement.

MINUS only returns rows from the first input, as such its output nullability should always match that of its first input:

Output: (I64, I64, I64?, I64?)
INTERSECT only returns rows that match across all inputs. If a column is not nullable in any of the inputs, then it is not nullable in the output because no rows can be emitted in which that column is null:

Output: (I64, I64, I64, I64?)

Co-authored-by: Victor Barua <victor.barua@datadoghq.com>
@xiedeyantu
Copy link
Copy Markdown
Member Author

Related PR #3845.

@mihaibudiu
Copy link
Copy Markdown
Contributor

you have some checker failures

@mihaibudiu
Copy link
Copy Markdown
Contributor

Does this work around the problems in the other PR?

@xiedeyantu
Copy link
Copy Markdown
Member Author

Does this work around the problems in the other PR?

Are you referring to #3845? I noticed that you had approved this PR before, but there were some conflicts. Since it's been a long time, the CI status is no longer visible, and it's unclear if there were other issues back then. I think it's a good PR, so I’m trying to finish it up.

@mihaibudiu
Copy link
Copy Markdown
Contributor

Yes, the discussion in JIRA was about causing problems with other rules.

@xiedeyantu
Copy link
Copy Markdown
Member Author

xiedeyantu commented Apr 21, 2026

Yes, the discussion in JIRA was about causing problems with other rules.

I didn’t see any discussion in the Jira. Are you referring to the discussion in the original PR? I have resolved the rule conflicts.

@mihaibudiu
Copy link
Copy Markdown
Contributor

yes, the original PR

@xiedeyantu
Copy link
Copy Markdown
Member Author

According to this disscusion #3845 (comment) .
I think we don't need to dwell on this issue too much. If we transform INTERSECT into UNION, then we can simply use UNION's type inference. It's like LEFT JOIN (though this example might not be entirely appropriate), which can also change the nullability of columns from the right table. I'm not sure if my understanding is correct.

xiedeyantu and others added 2 commits April 21, 2026 09:22
Co-authored-by: Victor Barua <victor.barua@datadoghq.com>
Co-authored-by: Victor Barua <victor.barua@datadoghq.com>
@xiedeyantu
Copy link
Copy Markdown
Member Author

@mihaibudiu I'm not sure if you agree with the current simplified processing logic. If you have time, please review this PR to see if there are any other concerns.

@mihaibudiu
Copy link
Copy Markdown
Contributor

@vbarua what do you think of this approach?

!ok

EnumerableCalc(expr#0..1=[{inputs}], expr#2=[CAST($t0):DECIMAL(11, 1)], A=[$t2])
EnumerableAggregate(group=[{0}])
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why this plan changed?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't been able to pinpoint how the old plan included AGG in the join, but the new plan seems fine. I might need to investigate this difference in more detail later when I have more time.

}

if (!nullFilters.isEmpty()) {
relBuilder.filter(nullFilters);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a relBuilder.convert(intersect.getRowType(), false) call at the end, is this logic necessary?
Calcite doesn't have the ability to refine type inference from predicates, this rewrite is semantically equivalent, but it is unrelated to the issue-6451.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

relBuilder.project(Util.skipLast(relBuilder.fields()));

// ensure the nullabilities of columns in the new relation match those of the input relation
relBuilder.convert(intersect.getRowType(), false);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure that whether onMatchAggregateOnUnion need this.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it would be safer that way.

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented May 1, 2026

@mihaibudiu
Copy link
Copy Markdown
Contributor

What is the status of this PR?

@xiedeyantu
Copy link
Copy Markdown
Member Author

What is the status of this PR?

There may only be one problem left now, which is that I haven't found the root cause of the changes in the plan mentioned by silun above.

@xiedeyantu
Copy link
Copy Markdown
Member Author

xiedeyantu commented Jun 3, 2026

Current PR Change Management Process:

  1. MinusMergeRule
    Merges the nested binary Minus operators into a single multi-input LogicalMinus:
LogicalMinus(all=[false])
  t1
  t2
  t3
  1. MinusToAntiJoinRule
    Rewrites the multi-input Minus into a chain of anti joins with a top-level distinct aggregate:
LogicalAggregate(group=[{0}])
  LogicalJoin(joinType=[anti])
    LogicalJoin(joinType=[anti])
      t1
      t2
    t3
  1. JoinPushExpressionsRule
    Extracts the CAST expressions from the join condition into projection columns (A0), allowing the join to use direct column references:
LogicalProject(A=[$0])
  LogicalJoin(condition=[=($1, $3)], joinType=[anti])
    LogicalProject(A=[$0], A0=[CAST($0)])
    LogicalProject(A=[$0], A0=[CAST($0)])
  1. EnumerableJoinRule
    Converts the logical anti joins into physical join implementations. The inner anti join becomes a nested-loop join, while the outer anti join is implemented as a hash join:
EnumerableHashJoin(joinType=[anti])
  EnumerableProject(A=[$0], A0=[CAST($0)])
    EnumerableNestedLoopJoin(joinType=[anti])
      ...
  1. EnumerableAggregateRule
    Converts the top-level logical aggregate into a physical enumerable aggregate:
EnumerableAggregate(group=[{0}])
  EnumerableHashJoin(joinType=[anti])
    ...

The final selected plan is:

EnumerableAggregate
  EnumerableHashJoin (anti)
    EnumerableProject(A0=CAST(...))
      EnumerableNestedLoopJoin (anti)
        ...

The original plan took this specific form because, within Volcano, every rule generates a corresponding candidate RelNode. A particular scenario arises with nested Minus operations: for each level of nesting, an equivalent Aggregation-plus-Join structure is generated. Subsequently, during the final cost-based selection phase, this specific structure was deemed to have the lowest cost and was therefore selected. Please refer to the plan diagram below.

graphviz

This is my current understanding; please take another look to see if it is clearly explained. @silundong @mihaibudiu

@mihaibudiu
Copy link
Copy Markdown
Contributor

If the plan is semantically equivalent, it's not a problem from my pov.
If @silundong cannot complete the review I can try to take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants