[python][ray] Ray merge into support condition by XiaoHongbo-Hope · Pull Request #8076 · apache/paimon

XiaoHongbo-Hope · 2026-06-02T07:23:42Z

Purpose

Tests

JingsongLi · 2026-06-02T15:00:05Z

+    rewritten: str, on_map: Mapping[str, str],
+) -> str:
+    for s_col, t_col in on_map.items():
+        rewritten = rewritten.replace(f'"s.{s_col}"', f'"t.{t_col}"')


Could we avoid doing this raw replace across the whole rewritten SQL? rewrite_condition() keeps string literals intact, but this second pass can still mutate literals that happen to contain the quoted ON-key token. For example, a matched condition like s.note = '"s.id"' AND s.id = 1 is rewritten to "s.note" = '"s.id"' AND "s.id" = 1, then this remap turns it into "s.note" = '"t.id"' AND "t.id" = 1, so DataFusion filters against the wrong literal and can skip/update the wrong rows. It would be safer to apply the ON-key remap while still respecting SQL string-literal spans (similar to rewrite_condition()), and add a regression test for literals containing "s.<on-key>".
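A minimal sketch of the literal-aware remap suggested above, not the code that was merged. The helper name `remap_on_keys` and the regex are hypothetical, and it assumes quoted identifiers never contain a single quote:

```python
import re
from typing import Mapping

# Matches one single-quoted SQL string literal; '' inside it is an escaped quote.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")


def remap_on_keys(rewritten: str, on_map: Mapping[str, str]) -> str:
    """Replace "s.<source key>" with "t.<target key>" outside string literals."""

    def remap_code(segment: str) -> str:
        # Plain replace is safe here because the segment holds no literal text.
        for s_col, t_col in on_map.items():
            segment = segment.replace(f'"s.{s_col}"', f'"t.{t_col}"')
        return segment

    parts = []
    pos = 0
    for match in _STRING_LITERAL.finditer(rewritten):
        parts.append(remap_code(rewritten[pos:match.start()]))  # code before the literal
        parts.append(match.group(0))  # the literal itself, copied verbatim
        pos = match.end()
    parts.append(remap_code(rewritten[pos:]))  # trailing code after the last literal
    return "".join(parts)
```

With the example from this comment, `remap_on_keys('"s.note" = \'"s.id"\' AND "s.id" = 1', {"id": "id"})` leaves the literal `'"s.id"'` untouched and only rewrites the trailing `"s.id"` reference.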

Fixed

Add condition support for WhenMatched and WhenNotMatched clauses using DataFusion SQL engine for expression evaluation.

- Condition filtering in both matched (update) and not-matched (insert) paths
- Rewrite and remap respect SQL string literal spans
- Validate: WhenNotMatched rejects t.* refs, blob column refs rejected
- Fail-fast datafusion availability check
- Source ON key remapped to target ON key in matched conditions
- Add datafusion>=52 to CI dependencies
- SessionContext cached per worker, empty batch handled safely

- Create fresh SessionContext per filter_batch call (no global state)
- Guard merge_condition import behind condition check
- Check WhenNotMatched target-ref before blob-ref for clearer errors
- Clarify num_matched semantics in comment

Local dev installs via requirements-dev.txt were missing datafusion, causing condition integration tests to fail outside CI.

JingsongLi · 2026-06-03T09:19:55Z

LGTM. I went through the condition rewrite/evaluation path, matched and not-matched filtering, blob column validation, empty-batch handling, and the new test coverage. The current implementation looks reasonable to me.

Two minor non-blocking suggestions:

1. It may be worth adding one test for duplicate source rows matching the same target row when the matched condition filters the duplicates down to zero or one output row. This would make the intended cardinality semantics explicit for conditional merge.
2. Column validation is currently mostly delegated to DataFusion at execution time. That is fine, but a small driver-side validation/parse step could make error reporting more predictable, especially for empty input batches.

Two source rows match the same target row (id=1). Without condition this would raise "multiple source rows". With condition s.age > t.age, only the row with age=20 passes (age=5 is filtered), so the update succeeds with exactly one matching row.
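The cardinality rule this test pins down can be sketched in plain Python. This is an illustration under assumptions, not the Ray/DataFusion implementation: `matched_updates` is a hypothetical helper, rows are dicts, and the target age of 10 used in the usage note is assumed because the description does not state it.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, int]


def matched_updates(
    source: List[Row],
    target: List[Row],
    key: str,
    condition: Callable[[Row, Row], bool],
) -> Dict[int, Row]:
    """Return one surviving source row per matched target key."""
    target_by_key = {t[key]: t for t in target}
    survivors = defaultdict(list)
    for s in source:
        t = target_by_key.get(s[key])
        # The condition is applied before the duplicate check.
        if t is not None and condition(s, t):
            survivors[s[key]].append(s)
    updates = {}
    for k, rows in survivors.items():
        if len(rows) > 1:
            raise ValueError(f"multiple source rows match target key {k!r}")
        updates[k] = rows[0]
    return updates
```

With target `[{"id": 1, "age": 10}]` and source rows of age 20 and age 5 for `id=1`, the condition `s["age"] > t["age"]` keeps only the age-20 row, while an always-true condition raises the "multiple source rows" error.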

Check that s.* and t.* references in condition expressions exist in the source and target schemas at merge_into call time, instead of deferring to DataFusion runtime errors.
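A rough sketch of what such a driver-side check could look like. The function name `validate_condition_refs` and the regex-based scan are assumptions for illustration; the merged code may parse conditions differently, and this version does not handle quoted identifiers.

```python
import re
from typing import Iterable

# Single-quoted SQL string literal, with '' as an escaped quote.
_STRING_LITERAL = re.compile(r"'(?:[^']|'')*'")
# An s.<column> or t.<column> reference with a plain identifier.
_COLUMN_REF = re.compile(r"\b([st])\.([A-Za-z_][A-Za-z0-9_]*)")


def validate_condition_refs(
    condition: str,
    source_cols: Iterable[str],
    target_cols: Iterable[str],
) -> None:
    """Raise ValueError if the condition names a column missing from a schema."""
    # Blank out literals first so text such as 't.missing' is not seen as a ref.
    stripped = _STRING_LITERAL.sub("''", condition)
    schemas = {"s": set(source_cols), "t": set(target_cols)}
    for alias, column in _COLUMN_REF.findall(stripped):
        if column not in schemas[alias]:
            side = "source" if alias == "s" else "target"
            raise ValueError(
                f"condition references unknown {side} column {alias}.{column}"
            )
```

Because the check needs only the schemas, it can run at call time even when the input batches are empty.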

Add @unittest.skipIf decorator to all condition E2E tests so they gracefully skip on Python < 3.10 or environments without datafusion.
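For illustration, a skip guard of this kind might look like the following. The class and test names are hypothetical; only `unittest.skipIf` and `importlib.util.find_spec` come from the standard library.

```python
import importlib.util
import sys
import unittest

# True when the datafusion package can be located, without importing it.
_DATAFUSION_AVAILABLE = importlib.util.find_spec("datafusion") is not None
_SKIP_CONDITION_TESTS = sys.version_info < (3, 10) or not _DATAFUSION_AVAILABLE


class ConditionMergeTest(unittest.TestCase):  # hypothetical test class
    @unittest.skipIf(
        _SKIP_CONDITION_TESTS, "requires Python >= 3.10 and datafusion"
    )
    def test_matched_condition(self):
        # Imported lazily so the module still loads when datafusion is absent.
        import datafusion

        self.assertIsNotNone(datafusion)
```

A skipped test is reported as skipped rather than failed, so the suite still passes in environments without datafusion.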

test_not_matched_condition_rejects_target_refs also requires datafusion (_prepare calls _require_datafusion before the ValueError check).

XiaoHongbo-Hope · 2026-06-03T10:23:39Z

Thanks, updated.

JingsongLi

+1

XiaoHongbo-Hope changed the title ~~Ray merge into support condition~~ [python][ray] Ray merge into support condition Jun 2, 2026

XiaoHongbo-Hope marked this pull request as ready for review June 2, 2026 08:29

JingsongLi reviewed Jun 2, 2026

XiaoHongbo-Hope force-pushed the ray_merge_into_support_condition branch from 2f54041 to 8d8ef6a June 3, 2026 07:25

XiaoHongbo-Hope force-pushed the ray_merge_into_support_condition branch 3 times, most recently from 6bc2edc to 88abb4b June 3, 2026 07:54

XiaoHongbo-Hope force-pushed the ray_merge_into_support_condition branch from 88abb4b to 23ef745 June 3, 2026 07:58

[ray] Add datafusion to dev requirements

19381cf

XiaoHongbo-Hope marked this pull request as draft June 3, 2026 09:04

XiaoHongbo-Hope marked this pull request as ready for review June 3, 2026 09:25

XiaoHongbo-Hope added 4 commits June 3, 2026 17:30

[ray] Validate condition column refs against source/target schema

927511a

[ray] Skip condition tests when datafusion is not installed

dab57ba

[ray] Add missing skip decorator to target-ref rejection test

4849d57

JingsongLi approved these changes Jun 3, 2026

JingsongLi merged commit e4d0573 into apache:master Jun 3, 2026
8 checks passed

JingsongLi merged 7 commits into apache:master from XiaoHongbo-Hope:ray_merge_into_support_condition

2 participants