Support full outer join#10778
Support full outer join#10778windtalker wants to merge 12 commits intopingcap:feature/release-8.5-materialized-viewfrom
Conversation
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
📝 WalkthroughWalkthroughImplements Full Outer Join support: protocol mapping, nullable-schema handling, execution-path changes (row-flagged hash map for full+other-condition), probe/post-probe handling for unmatched rows, submodule update, extensive tests, and documentation updates. Changes
Sequence Diagram(s)sequenceDiagram
participant Probe as ProbeBlock
participant Hash as RowFlaggedHashMap
participant Filter as OtherCondition
participant Scan as PostProbeScanner
participant Output as OutputWriter
Probe->>Hash: probe key(s)
alt match found
Hash->>Filter: provide matched build rows + probe row
Filter->>Filter: evaluate other-condition
alt filter passes
Filter->>Output: emit matched row (both sides may be nullable)
Filter->>Hash: mark build row as used
else filter fails
Filter->>Output: (do not emit this matched pair)
Note over Hash,Scan: build row remains unused for later scan
end
else no match
Hash->>Output: emit probe row with NULL build side
end
Note over Hash,Scan: after probing
Hash->>Scan: scan unused build rows
Scan->>Output: emit unmatched build rows with NULL probe side
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
@pantheon-ai review |
|
Review completed with partial results (1 of 3 review agents failed after retries). Findings: 2 issues Note: review_agent_v1.1 failed after 3 attempts due to infrastructure issues. The other 2 agents completed successfully and found 1 P0 and 1 P1 issue. ℹ️ Learn more details on Pantheon AI. |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
dbms/src/DataStreams/ScanHashMapAfterProbeBlockInputStream.cpp (1)
264-276: Consider consolidating duplicate case handling.The
Fullcase (lines 264-269) andRightAnti/RightOutercases (lines 270-276) have identical logic. You could combine them:case ASTTableJoin::Kind::Full: case ASTTableJoin::Kind::RightAnti: case ASTTableJoin::Kind::RightOuter: if (parent.has_other_condition) fillColumnsUsingCurrentPartition<true, false>(...); else fillColumnsUsingCurrentPartition<false, false>(...); break;This is optional since the current code is correct and explicit handling may be preferred for clarity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/DataStreams/ScanHashMapAfterProbeBlockInputStream.cpp` around lines 264 - 276, The three switch cases ASTTableJoin::Kind::Full, ASTTableJoin::Kind::RightAnti, and ASTTableJoin::Kind::RightOuter all perform identical logic: they check parent.has_other_condition and call fillColumnsUsingCurrentPartition with either <true,false> or <false,false>; consolidate these by merging the cases into a single grouped case label (Full, RightAnti, RightOuter) and keep the existing conditional that calls fillColumnsUsingCurrentPartition<true,false>(columns_left, columns_right, row_counter_column) when parent.has_other_condition is true and fillColumnsUsingCurrentPartition<false,false>(...) otherwise, followed by a single break.dbms/src/Flash/tests/gtest_spill_join.cpp (1)
588-619: Exercise theinner_idx = 0spill path too.
TypeFullOuterJoinis the one equal-join family that keepsinner_idxas the build side. This test only coversinner_idx = 1, so a regression when the left side builds the hash table would still pass here.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@dbms/src/Flash/tests/gtest_spill_join.cpp` around lines 588 - 619, The test only exercises the build-side = 1 path; add a second pair of requests that set inner_idx = 0 so the left side becomes the build side for TypeFullOuterJoin. Duplicate the existing request and request_column_prune blocks (which use context.scan(...).join(..., tipb::JoinType::TypeFullOuterJoin, ..., 0, false, 1) / aggregation) but change the join call's inner_idx argument to 0 (e.g., replace the trailing 1 with 0) so both the projection test (request) and the aggregation prune test (request_column_prune) also run with inner_idx = 0; keep the same scans, join keys, condition (lt(...)), project columns and aggregation.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@AGENTS.md`:
- Around line 1-8: Update AGENTS.md to clarify the build/test guidance: replace
the "Codex-owned" phrase with a general ccache recommendation and show how to
configure CCACHE_DIR and CCACHE_TEMPDIR for any user (reference
CCACHE_DIR/CCACHE_TEMPDIR), explain how the ./cmake-build-debug/dbms/gtests_dbms
path is produced (i.e., it’s the default CMake build output for a debug build or
created after running cmake --build) and suggest a discovery alternative (e.g.,
using find or cmake --build output) instead of assuming the path, and add a
short sentence describing what gtests_dbms is (the DBMS unit-test binary) and
when to run it (targeted gtest runs during development).
In `@dbms/src/Debug/MockExecutor/JoinBinder.cpp`:
- Around line 121-124: Rebuild output_schema using the same left/right helpers
used by compileJoin() instead of always appending both children: detect the join
type (TypeSemiJoin, TypeAntiSemiJoin, TypeLeftOuterSemiJoin,
TypeAntiLeftOuterSemiJoin) and for semi/anti-semi use the left-side schema plus
the right-side schema produced by buildRightSideJoinSchema() (preserving the
synthetic match-helper shape), and for other joins continue to use
appendJoinSchema with JoinInterpreterHelper::makeLeftJoinSideNullable(tp) and
makeRightJoinSideNullable(tp); update the logic that currently clears
output_schema and calls appendJoinSchema(children[0]...) and
appendJoinSchema(children[1]...) to call the appropriate helper functions so
output_schema matches compileJoin() behavior.
In `@dbms/src/Interpreters/JoinPartition.cpp`:
- Around line 1640-1654: Probe rows with NULL keys take the has_null_map branch
and currently call RowFlaggedHashMapAdder::addNotFound(), which emits nothing,
causing unmatched NULL probe keys to be dropped for FULL joins; update that
branch so when KIND == ASTTableJoin::Kind::Full it calls
RowFlaggedHashMapAdder::addNotFoundForFull(...) with the same arguments used in
the hash-miss path (num_columns_to_add, added_columns, i, current_offset,
offsets_to_replicate.get(), probe_process_info), otherwise call addNotFound(...)
as before so non-FULL behavior is unchanged.
---
Nitpick comments:
In `@dbms/src/DataStreams/ScanHashMapAfterProbeBlockInputStream.cpp`:
- Around line 264-276: The three switch cases ASTTableJoin::Kind::Full,
ASTTableJoin::Kind::RightAnti, and ASTTableJoin::Kind::RightOuter all perform
identical logic: they check parent.has_other_condition and call
fillColumnsUsingCurrentPartition with either <true,false> or <false,false>;
consolidate these by merging the cases into a single grouped case label (Full,
RightAnti, RightOuter) and keep the existing conditional that calls
fillColumnsUsingCurrentPartition<true,false>(columns_left, columns_right,
row_counter_column) when parent.has_other_condition is true and
fillColumnsUsingCurrentPartition<false,false>(...) otherwise, followed by a
single break.
In `@dbms/src/Flash/tests/gtest_spill_join.cpp`:
- Around line 588-619: The test only exercises the build-side = 1 path; add a
second pair of requests that set inner_idx = 0 so the left side becomes the
build side for TypeFullOuterJoin. Duplicate the existing request and
request_column_prune blocks (which use context.scan(...).join(...,
tipb::JoinType::TypeFullOuterJoin, ..., 0, false, 1) / aggregation) but change
the join call's inner_idx argument to 0 (e.g., replace the trailing 1 with 0) so
both the projection test (request) and the aggregation prune test
(request_column_prune) also run with inner_idx = 0; keep the same scans, join
keys, condition (lt(...)), project columns and aggregation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 264452d6-dc3f-4650-a126-8b040e1b9e4d
📒 Files selected for processing (16)
AGENTS.mdcontrib/tipbdbms/src/DataStreams/ScanHashMapAfterProbeBlockInputStream.cppdbms/src/Debug/MockExecutor/JoinBinder.cppdbms/src/Flash/Coprocessor/DAGUtils.cppdbms/src/Flash/Coprocessor/JoinInterpreterHelper.cppdbms/src/Flash/Coprocessor/JoinInterpreterHelper.hdbms/src/Flash/Coprocessor/collectOutputFieldTypes.cppdbms/src/Flash/Coprocessor/tests/gtest_join_get_kind_and_build_index.cppdbms/src/Flash/tests/gtest_join_executor.cppdbms/src/Flash/tests/gtest_spill_join.cppdbms/src/Interpreters/Join.cppdbms/src/Interpreters/JoinPartition.cppdbms/src/Interpreters/JoinUtils.hdbms/src/TestUtils/tests/gtest_mock_executors.cppdocs/note/fullouter_join.md
contrib/tipb
Outdated
| @@ -1 +1 @@ | |||
| Subproject commit fe7badb76b660abaedd8706accd809809c475280 | |||
| Subproject commit 3f0090f4d459d6afefdf0862e582d3432c99c96c | |||
There was a problem hiding this comment.
[P0] contrib/tipb submodule points to non-fetchable commit
Why: Breaks clean checkout/build/CI because git submodule update --init --recursive contrib/tipb fails to fetch the referenced commit from https://github.com/pingcap/tipb.git. The commit 3f0090f4d459d6afefdf0862e582d3432c99c96c is not available on the remote, and CMake unconditionally requires this submodule for builds.
Evidence: contrib/tipb:1 - Subproject commit 3f0090f4d459d6afefdf0862e582d3432c99c96c (fetch error: "remote: not our ref 3f0090f4d459d6afefdf0862e582d3432c99c96c")
There was a problem hiding this comment.
Thanks for the fix! The submodule should now be fetchable for clean builds. ✅
ℹ️ Learn more details on Pantheon AI.
| if constexpr (row_flagged_map) | ||
| { | ||
| block_full = RowFlaggedHashMapAdder<Map>::addNotFound(i, current_offset, offsets_to_replicate.get()); | ||
| if constexpr (KIND == ASTTableJoin::Kind::Full) |
There was a problem hiding this comment.
[P1] Full join with other_condition silently drops probe rows with NULL join key or failed left filter
Why: Violates full outer join semantics. Probe rows flagged in null_map (due to NULL join key or failed left_filter_column) should appear in the result with NULL-filled right columns, but are silently dropped causing wrong results (missing rows).
Evidence: dbms/src/Interpreters/JoinPartition.cpp:1539 - When Kind == Full and row_flagged_map is true (other_condition path), the code calls RowFlaggedHashMapAdder::addNotFound(...) for rows with (*null_map)[i] set. However, addNotFound (lines 1450-1454) does not increment current_offset, causing 0-replication. The correct behavior requires addNotFoundForFull(...) (lines 1456-1475) which increments offset and fills right-side NULLs.
There was a problem hiding this comment.
Thanks for addressing this! The fix looks good - now properly using addNotFoundForFull for KIND::Full to ensure probe rows with NULL join keys or failed left filters are output with right-side NULLs instead of being dropped. ✅
ℹ️ Learn more details on Pantheon AI.
Signed-off-by: xufei <xufeixw@mail.ustc.edu.cn>
What problem does this PR solve?
Issue Number: close #10777
Problem Summary:
Support full outer join in TiFlash, including protocol / join kind plumbing, nullable schema handling, execution support for non-equal other conditions, and targeted test coverage.
What is changed and how it works?
Check List
Tests
Unit tests:
./cmake-build-debug/dbms/gtests_dbms --gtest_filter='JoinExecutorTestRunner.FullOuterJoinWithOtherCondition'./cmake-build-debug/dbms/gtests_dbms --gtest_filter='MockDAGRequestTest.SemiJoinColumnPruneKeepsJoinOutputSchema'Side effects
Documentation
Release note