Skip to content

[python][ray] Ray merge into support condition#8076

Merged
JingsongLi merged 7 commits into
apache:masterfrom
XiaoHongbo-Hope:ray_merge_into_support_condition
Jun 3, 2026
Merged

[python][ray] Ray merge into support condition#8076
JingsongLi merged 7 commits into
apache:masterfrom
XiaoHongbo-Hope:ray_merge_into_support_condition

Conversation

@XiaoHongbo-Hope
Copy link
Copy Markdown
Contributor

Purpose

Tests

@XiaoHongbo-Hope XiaoHongbo-Hope changed the title Ray merge into support condition [python][ray] Ray merge into support condition Jun 2, 2026
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review June 2, 2026 08:29
rewritten: str, on_map: Mapping[str, str],
) -> str:
for s_col, t_col in on_map.items():
rewritten = rewritten.replace(f'"s.{s_col}"', f'"t.{t_col}"')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid doing this raw replace across the whole rewritten SQL? rewrite_condition() keeps string literals intact, but this second pass can still mutate literals that happen to contain the quoted ON-key token. For example, a matched condition like s.note = '"s.id"' AND s.id = 1 is rewritten to "s.note" = '"s.id"' AND "s.id" = 1, then this remap turns it into "s.note" = '"t.id"' AND "t.id" = 1, so DataFusion filters against the wrong literal and can skip/update the wrong rows. It would be safer to apply the ON-key remap while still respecting SQL string-literal spans (similar to rewrite_condition()), and add a regression test for literals containing "s.<on-key>".

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid doing this raw replace across the whole rewritten SQL? rewrite_condition() keeps string literals intact, but this second pass can still mutate literals that happen to contain the quoted ON-key token. For example, a matched condition like s.note = '"s.id"' AND s.id = 1 is rewritten to "s.note" = '"s.id"' AND "s.id" = 1, then this remap turns it into "s.note" = '"t.id"' AND "t.id" = 1, so DataFusion filters against the wrong literal and can skip/update the wrong rows. It would be safer to apply the ON-key remap while still respecting SQL string-literal spans (similar to rewrite_condition()), and add a regression test for literals containing "s.<on-key>".

Fixed

@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the ray_merge_into_support_condition branch from 2f54041 to 8d8ef6a Compare June 3, 2026 07:25
Add condition support for WhenMatched and WhenNotMatched clauses
using DataFusion SQL engine for expression evaluation.

- Condition filtering in both matched (update) and not-matched (insert) paths
- Rewrite and remap respect SQL string literal spans
- Validate: WhenNotMatched rejects t.* refs, blob column refs rejected
- Fail-fast datafusion availability check
- Source ON key remapped to target ON key in matched conditions
- Add datafusion>=52 to CI dependencies
- SessionContext cached per worker, empty batch handled safely
@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the ray_merge_into_support_condition branch 3 times, most recently from 6bc2edc to 88abb4b Compare June 3, 2026 07:54
- Create fresh SessionContext per filter_batch call (no global state)
- Guard merge_condition import behind condition check
- Check WhenNotMatched target-ref before blob-ref for clearer errors
- Clarify num_matched semantics in comment
@XiaoHongbo-Hope XiaoHongbo-Hope force-pushed the ray_merge_into_support_condition branch from 88abb4b to 23ef745 Compare June 3, 2026 07:58
Local dev installs via requirements-dev.txt were missing datafusion,
causing condition integration tests to fail outside CI.
@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as draft June 3, 2026 09:04
@JingsongLi
Copy link
Copy Markdown
Contributor

LGTM. I went through the condition rewrite/evaluation path, matched and not-matched filtering, blob column validation, empty-batch handling, and the new test coverage. The current implementation looks reasonable to me.

Two minor non-blocking suggestions:

  1. It may be worth adding one test for duplicate source rows matching the same target row when the matched condition filters the duplicates down to zero or one output row. This would make the intended cardinality semantics explicit for conditional merge.

  2. Column validation is currently mostly delegated to DataFusion at execution time. That is fine, but a small driver-side validation/parse step could make error reporting more predictable, especially for empty input batches.

@XiaoHongbo-Hope XiaoHongbo-Hope marked this pull request as ready for review June 3, 2026 09:25
Two source rows match the same target row (id=1). Without condition
this would raise "multiple source rows". With condition s.age > t.age,
only the row with age=20 passes (age=5 is filtered), so the update
succeeds with exactly one matching row.
Check that s.* and t.* references in condition expressions exist in
the source and target schemas at merge_into call time, instead of
deferring to DataFusion runtime errors.
Add @unittest.skipIf decorator to all condition E2E tests so they
gracefully skip on Python < 3.10 or environments without datafusion.
test_not_matched_condition_rejects_target_refs also requires
datafusion (_prepare calls _require_datafusion before the
ValueError check).
@XiaoHongbo-Hope
Copy link
Copy Markdown
Contributor Author

LGTM. I went through the condition rewrite/evaluation path, matched and not-matched filtering, blob column validation, empty-batch handling, and the new test coverage. The current implementation looks reasonable to me.

Two minor non-blocking suggestions:

  1. It may be worth adding one test for duplicate source rows matching the same target row when the matched condition filters the duplicates down to zero or one output row. This would make the intended cardinality semantics explicit for conditional merge.
  2. Column validation is currently mostly delegated to DataFusion at execution time. That is fine, but a small driver-side validation/parse step could make error reporting more predictable, especially for empty input batches.

Thanks, updated.

Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@JingsongLi JingsongLi merged commit e4d0573 into apache:master Jun 3, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants