Skip probe-side consumption when hash join build side is empty #21068

Open
kosiew wants to merge 9 commits into apache:main from kosiew:hashjoinexec-probeconsumption-20492

kosiew commented Mar 20, 2026

Which issue does this PR close?

The linked issue: HashJoinExec consumes probe side when build side is empty.

Rationale for this change

HashJoinExec currently continues polling and consuming the probe side even after the build side has completed with zero rows.

For join types whose output is guaranteed to be empty when the build side is empty, this work is unnecessary: it can trigger large, avoidable probe-side scans and extra compute while producing no output. The affected join types are INNER, LEFT, LEFT SEMI, LEFT ANTI, LEFT MARK, and RIGHT SEMI.

This change makes the stream state machine aware of that condition so execution can terminate as soon as the build side is known to be empty and no probe rows are needed to determine the final result.

The change also preserves the existing behavior for join types that still require probe-side rows even when the build side is empty, such as RIGHT, FULL, RIGHT ANTI, and RIGHT MARK joins.
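The split between the two groups of join types can be written as a single predicate. The following is a minimal sketch, not the code from this PR: the `JoinType` enum is a local stand-in declared so the example compiles on its own, and taking the join type by value is an assumption. It relies on the build side being the left input, as in HashJoinExec.

```rust
/// Local stand-in for DataFusion's `JoinType`, declared here only so the
/// sketch compiles on its own.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum JoinType {
    Inner,
    Left,
    Right,
    Full,
    LeftSemi,
    RightSemi,
    LeftAnti,
    RightAnti,
    LeftMark,
    RightMark,
}

/// Returns true when an empty build (left) side alone fixes the join output
/// as empty, so no probe rows are needed to produce the result.
pub fn empty_build_side_produces_empty_result(join_type: JoinType) -> bool {
    matches!(
        join_type,
        JoinType::Inner
            | JoinType::Left
            | JoinType::LeftSemi
            | JoinType::LeftAnti
            | JoinType::LeftMark
            | JoinType::RightSemi
    )
}
```

RIGHT, FULL, RIGHT ANTI, and RIGHT MARK return false because every probe row still contributes an output row when nothing on the build side can match it.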

What changes are included in this PR?

This PR introduces a small refactor and an execution-path optimization for hash join empty-build handling:

  • adds empty_build_side_produces_empty_result(join_type) in joins/utils.rs as a shared source of truth for join types that can short-circuit when the build side is empty

  • updates HashJoinStream to use a new next_state_after_build_ready(...) helper so that, once build-side collection and any required coordination finish, the stream can transition directly to Completed instead of always entering FetchProbeBatch

  • applies the same shared helper inside build_batch_empty_build_side(...) to keep output semantics aligned with the stream short-circuit logic

  • factors dynamic-filter join setup in tests into a reusable helper, reducing duplicated test code

  • adds regression tests covering:

    • join types that should not consume the probe side when the build side is empty
    • join types that must still consume the probe side because probe rows are needed for correct output
    • dynamic-filter completion behavior, including the path that waits for partition-bounds reporting before deciding whether probe polling is necessary
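The state transition described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: the parameter lists are hypothetical (the real helper's arguments are not shown here), `StreamState` models only the states named in this description, and `after_collect_build_side` is an invented name for the call site at the end of build-side collection.

```rust
/// Simplified stand-in for the stream's state enum; only the states named in
/// this PR description are modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StreamState {
    WaitPartitionBoundsReport,
    FetchProbeBatch,
    Completed,
}

/// Hypothetical signature. `empty_build_fixes_output` is the answer from the
/// shared join-type predicate for the join being executed.
pub fn next_state_after_build_ready(
    build_row_count: usize,
    empty_build_fixes_output: bool,
) -> StreamState {
    if build_row_count == 0 && empty_build_fixes_output {
        // The result is already known to be empty: never poll the probe side.
        StreamState::Completed
    } else {
        StreamState::FetchProbeBatch
    }
}

/// Hypothetical call site: once the build side is collected, either wait for
/// dynamic-filter coordination first or decide the next state immediately.
pub fn after_collect_build_side(
    build_row_count: usize,
    empty_build_fixes_output: bool,
    must_wait_for_bounds_report: bool,
) -> StreamState {
    if must_wait_for_bounds_report {
        StreamState::WaitPartitionBoundsReport
    } else {
        next_state_after_build_ready(build_row_count, empty_build_fixes_output)
    }
}
```

In this sketch the waiting path would call `next_state_after_build_ready` again once the report arrives, so both paths reach the same decision.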

Are these changes tested?

Yes.

This PR adds targeted async tests covering both the optimized and non-optimized cases:

  • join_does_not_consume_probe_when_empty_build_fixes_output
  • join_still_consumes_probe_when_empty_build_needs_probe_rows
  • test_hash_join_skips_probe_on_empty_build_after_partition_bounds_report

The tests also verify that errors from the probe side are not observed when the join can correctly short-circuit, and that they are still surfaced for join types that must continue consuming probe input.
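The idea behind the probe-error check can be shown with a toy model. The actual tests use MockExec and async streams; the sketch below replaces them with a plain iterator, and `FailingProbe` and `run_join` are hypothetical names introduced only for illustration. Output is reduced to a row count, assuming each probe row yields one output row when the build side is empty and the join type still needs the probe side.

```rust
/// Hypothetical probe source that fails whenever it is polled and counts
/// how many times that happens.
pub struct FailingProbe {
    pub polls: usize,
}

impl Iterator for FailingProbe {
    type Item = Result<Vec<i64>, String>;

    fn next(&mut self) -> Option<Self::Item> {
        self.polls += 1;
        Some(Err("probe side was polled".to_string()))
    }
}

/// Toy join driver that returns the number of output rows.
pub fn run_join<P>(
    build_row_count: usize,
    empty_build_fixes_output: bool,
    probe: &mut P,
) -> Result<usize, String>
where
    P: Iterator<Item = Result<Vec<i64>, String>>,
{
    // Short-circuit: the output is known to be empty, so skip the probe side.
    if build_row_count == 0 && empty_build_fixes_output {
        return Ok(0);
    }
    let mut output_rows = 0;
    for batch in probe {
        // A probe-side error surfaces here for join types that need probe rows.
        output_rows += batch?.len();
    }
    Ok(output_rows)
}
```

With a `FailingProbe`, the short-circuit case finishes with zero rows and zero polls, while the other case returns the probe error after a single poll.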

Existing dynamic-filter tests were also cleaned up to use a shared helper while preserving the completion checks.

Are there any user-facing changes?

Yes, in execution behavior and performance.

For affected hash join types, queries can now stop earlier when the build side is empty, avoiding unnecessary probe-side scans and reducing wasted I/O and compute. There are no intended API changes.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

kosiew added 7 commits March 19, 2026 22:14
Implement a staged mini-plan for HashJoinStream to
immediately exit when the build side is empty and
the join type's result is fully determined. This change
avoids unnecessary entry into FetchProbeBatch for
Inner, Left, LeftSemi, LeftAnti, LeftMark, and RightSemi
joins without filters.

Add tests to verify join behavior with empty build:
- join_does_not_consume_probe_when_empty_build_fixes_output
- join_still_consumes_probe_when_empty_build_needs_probe_rows
These use MockExec to distinguish between short-circuiting
and necessary probe row consumption.

Extract duplicate post-build transition logic into
next_state_after_build_ready in stream.rs. This
centralizes the decision between Completed and
FetchProbeBatch in one location and streamlines
both collect_build_side and wait_for_partition_bounds_report
to use the new helper function.

Move the pure JoinType semantic rule to utils.rs, placing
it alongside the existing join behavior helpers. Update
HashJoinStream in stream.rs to focus solely on its
stream-specific execution concern by removing unnecessary
logic related to filtering.

Extract shared empty-build/probe-error test setup into a new
function, empty_build_with_probe_error_inputs(), in exec.rs.
Both regression tests now reuse this setup, allowing each test to
focus more on the join-type behavior it asserts rather than
rebuilding the same fixture.

Implement test_hash_join_skips_probe_on_empty_build_after_partition_bounds_report
in exec.rs. Ensure that dynamic filtering is enabled by keeping a
consumer reference alive. Verify that an Inner join with an empty
build side correctly skips probe consumption, even when passing
through the WaitPartitionBoundsReport path.

Refactor exec.rs by consolidating empty-build probe-behavior tests
into `assert_empty_build_probe_behavior(...)` and repeated dynamic
filter join setup into `hash_join_with_dynamic_filter(...)`.
Maintain existing runtime logic while reducing duplicate test
boilerplate and redundant local setup for improved clarity
and maintainability.

Remove duplication by introducing a shared helper,
empty_build_side_produces_empty_result, in utils.rs.
Update build_batch_empty_build_side to use this helper
directly, ensuring alignment in the short-circuit and
batch-construction logic within the hash join state
transition in stream.rs.

github-actions bot added the physical-plan label (Changes to the physical-plan crate) Mar 20, 2026
kosiew added 2 commits March 20, 2026 11:19
Refactor stream.rs and utils.rs to streamline the hash-join
state machine. Compute the post-build state directly from
inputs, eliminating unnecessary indirection. Update the
empty-build-side batch construction to utilize early returns
and iterator-based collection for columns, replacing manual
Vec setup and push logic.
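The early-return and iterator-based style mentioned in this commit message might look roughly like the sketch below. It is a toy model rather than the actual function: `Column` stands in for an Arrow array, the signature is hypothetical, and only the null-padding case (RIGHT and FULL style output) is modeled, not the different output shapes of RIGHT ANTI or RIGHT MARK.

```rust
/// Toy column type standing in for an Arrow array; `None` models SQL NULL.
pub type Column = Vec<Option<i64>>;

/// Hypothetical signature: builds the output columns for one probe batch
/// when the build side is empty.
pub fn build_batch_empty_build_side(
    empty_build_fixes_output: bool,
    build_column_count: usize,
    probe_columns: &[Column],
) -> Vec<Column> {
    // Early return: the output is empty, so emit zero-row columns.
    if empty_build_fixes_output {
        return vec![Column::new(); build_column_count + probe_columns.len()];
    }
    let num_rows = probe_columns.first().map_or(0, |column| column.len());
    // Build-side columns are all NULL; probe columns pass through unchanged.
    // One iterator chain replaces manual `Vec::new()` plus `push` calls.
    (0..build_column_count)
        .map(|_| vec![None::<i64>; num_rows])
        .chain(probe_columns.iter().cloned())
        .collect()
}
```

Collecting from a chained iterator keeps the column order (build columns first, then probe columns) visible in one expression.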
kosiew marked this pull request as ready for review March 20, 2026 06:01

Development

Successfully merging this pull request may close these issues.

HashJoinExec consumes probe side when build side is empty