Skip probe-side consumption when hash join build side is empty #21068
Open
kosiew wants to merge 9 commits into apache:main from
Implement a staged mini-plan for HashJoinStream to immediately exit when the build side is empty and the join type's result is fully determined. This change avoids unnecessary entry into FetchProbeBatch for Inner, Left, LeftSemi, LeftAnti, LeftMark, and RightSemi joins without filters.

Add tests to verify join behavior with an empty build side:
- join_does_not_consume_probe_when_empty_build_fixes_output
- join_still_consumes_probe_when_empty_build_needs_probe_rows

These use MockExec to distinguish between short-circuiting and necessary probe row consumption.
Extract duplicate post-build transition logic into next_state_after_build_ready in stream.rs. This centralizes the decision between Completed and FetchProbeBatch in one location and streamlines both collect_build_side and wait_for_partition_bounds_report to use the new helper function.
Move the pure JoinType semantic rule to utils.rs, placing it alongside the existing join behavior helpers. Update HashJoinStream in stream.rs to focus solely on its stream-specific execution concern by removing unnecessary logic related to filtering.
Extract shared empty-build/probe-error test setup into a new function, empty_build_with_probe_error_inputs(), in exec.rs. Both regression tests now reuse this setup, allowing each test to focus more on the join-type behavior it asserts rather than rebuilding the same fixture.
Implement test_hash_join_skips_probe_on_empty_build_after_partition_bounds_report in exec.rs. Ensure that dynamic filtering is enabled by keeping a consumer reference alive. Verify that an Inner join with an empty build side correctly skips probe consumption, even when passing through the WaitPartitionBoundsReport path.
Refactor exec.rs by consolidating empty-build probe-behavior tests into `assert_empty_build_probe_behavior(...)` and repeated dynamic filter join setup into `hash_join_with_dynamic_filter(...)`. Maintain existing runtime logic while reducing duplicate test boilerplate and redundant local setup for improved clarity and maintainability.
Remove duplication by introducing a shared helper, empty_build_side_produces_empty_result, in utils.rs. Update build_batch_empty_build_side to use this helper directly, ensuring alignment in the short-circuit and batch-construction logic within the hash join state transition in stream.rs.
Refactor stream.rs and utils.rs to streamline the hash-join state machine. Compute the post-build state directly from inputs, eliminating unnecessary indirection. Update the empty-build-side batch construction to utilize early returns and iterator-based collection for columns, replacing manual Vec setup and push logic.
Which issue does this PR close?
Rationale for this change
HashJoinExec currently continues polling and consuming the probe side even after the build side has completed with zero rows. For join types whose output is guaranteed to be empty when the build side is empty, this work is unnecessary. In practice, it can trigger large avoidable scans and extra compute despite producing no output. The join types in this category are INNER, LEFT, LEFT SEMI, LEFT ANTI, LEFT MARK, and RIGHT SEMI.
This change makes the stream state machine aware of that condition so execution can terminate as soon as the build side is known to be empty and no probe rows are needed to determine the final result.
The change also preserves the existing behavior for join types that still require probe-side rows even when the build side is empty, such as RIGHT, FULL, RIGHT ANTI, and RIGHT MARK joins.
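The split between the two groups of join types can be pictured as a single predicate. The following is a minimal sketch, not the PR's actual code: the `JoinType` enum is a local stand-in that lists only the variants named in this description, and the by-value signature and function body are assumptions inferred from the helper name `empty_build_side_produces_empty_result`.

```rust
/// Local stand-in for the engine's join type enum (assumption: only the
/// variants mentioned in this PR description are listed here).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinType {
    Inner,
    Left,
    Right,
    Full,
    LeftSemi,
    RightSemi,
    LeftAnti,
    RightAnti,
    LeftMark,
    RightMark,
}

/// Returns true when an empty build (left) side fixes the join output to
/// zero rows, so the probe side never needs to be read.
pub fn empty_build_side_produces_empty_result(join_type: JoinType) -> bool {
    matches!(
        join_type,
        JoinType::Inner
            | JoinType::Left
            | JoinType::LeftSemi
            | JoinType::LeftAnti
            | JoinType::LeftMark
            | JoinType::RightSemi
    )
}
```

RIGHT, FULL, RIGHT ANTI, and RIGHT MARK fall through to `false` because each of them emits probe-side rows even when nothing on the build side can match.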
What changes are included in this PR?
This PR introduces a small refactor and an execution-path optimization for hash join empty-build handling:
- adds `empty_build_side_produces_empty_result(join_type)` in `joins/utils.rs` as a shared source of truth for join types that can short-circuit when the build side is empty
- updates `HashJoinStream` to use a new `next_state_after_build_ready(...)` helper so that, once build-side collection and any required coordination finish, the stream can transition directly to `Completed` instead of always entering `FetchProbeBatch`
- applies the same shared helper inside `build_batch_empty_build_side(...)` to keep output semantics aligned with the stream short-circuit logic
- factors dynamic-filter join setup in tests into a reusable helper, reducing duplicated test code
- adds regression tests covering:
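As a rough illustration of the `next_state_after_build_ready(...)` change listed above, the post-build decision might look like the sketch below. Everything here is simplified and partly hypothetical: the two-variant state enum, the `build_side_num_rows` parameter, and the reduced `JoinType` stand-in are assumptions for illustration. Only the state names `Completed` and `FetchProbeBatch` and the helper names come from this PR.

```rust
/// Reduced stand-in for the join type enum (assumption: two variants are
/// enough to show both branches).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinType {
    Inner,
    Right,
}

/// Simplified stand-in for the stream's state enum; the real state machine
/// has more states than these two.
#[derive(Debug, PartialEq, Eq)]
enum HashJoinStreamState {
    FetchProbeBatch,
    Completed,
}

/// Shared rule: does an empty build side force an empty result?
fn empty_build_side_produces_empty_result(join_type: JoinType) -> bool {
    matches!(join_type, JoinType::Inner)
}

/// Hypothetical signature: choose the state that follows build-side
/// collection (and any partition-bounds coordination).
fn next_state_after_build_ready(
    join_type: JoinType,
    build_side_num_rows: usize,
) -> HashJoinStreamState {
    if build_side_num_rows == 0 && empty_build_side_produces_empty_result(join_type) {
        // The output is already known to be empty, so skip the probe side.
        HashJoinStreamState::Completed
    } else {
        // Either there are build rows to match, or the join type still
        // needs probe rows to produce its output.
        HashJoinStreamState::FetchProbeBatch
    }
}
```

Keeping the decision in one function means both paths that finish the build phase reach the same answer, which is the stated motivation for extracting the helper.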
Are these changes tested?
Yes.
This PR adds targeted async tests covering both the optimized and non-optimized cases:
- `join_does_not_consume_probe_when_empty_build_fixes_output`
- `join_still_consumes_probe_when_empty_build_needs_probe_rows`
- `test_hash_join_skips_probe_on_empty_build_after_partition_bounds_report`

The tests also verify that errors from the probe side are not observed when the join can correctly short-circuit, and that they are still surfaced for join types that must continue consuming probe input.
Existing dynamic-filter tests were also cleaned up to use a shared helper while preserving the completion checks.
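The testing idea, a probe input that fails the moment it is polled, can be shown without any engine code. The sketch below is a standalone model of that technique, not the tests in this PR: `FailingProbe`, `join_with_empty_build`, and the two-variant `JoinType` are hypothetical names, and plain `Vec<i32>` batches stand in for record batches.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JoinType {
    Inner,
    Right,
}

fn empty_build_side_produces_empty_result(join_type: JoinType) -> bool {
    matches!(join_type, JoinType::Inner)
}

/// Probe input that records being polled and always yields an error,
/// playing the role of a mock input configured to fail.
struct FailingProbe {
    polled: bool,
}

impl Iterator for FailingProbe {
    type Item = Result<Vec<i32>, String>;

    fn next(&mut self) -> Option<Self::Item> {
        self.polled = true;
        Some(Err("probe side was polled".to_string()))
    }
}

/// Models a join whose build side is already known to be empty.
fn join_with_empty_build(
    join_type: JoinType,
    probe: &mut FailingProbe,
) -> Result<Vec<i32>, String> {
    if empty_build_side_produces_empty_result(join_type) {
        // Short-circuit: the probe input is never touched.
        return Ok(Vec::new());
    }
    let mut out = Vec::new();
    for batch in probe.by_ref() {
        // Any probe-side error is propagated to the caller.
        out.extend(batch?);
    }
    Ok(out)
}
```

With this setup, a short-circuiting join type returns an empty result and leaves `polled` false, while a join type that needs probe rows surfaces the probe error. That is the distinction the two regression tests are described as checking.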
Are there any user-facing changes?
Yes, in execution behavior and performance.
For affected hash join types, queries can now stop earlier when the build side is empty, avoiding unnecessary probe-side scans and reducing wasted I/O and compute. There are no intended API changes.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.