fix(upsert): early rejection of unsupported join column types by abnobdoss · Pull Request #3384 · apache/iceberg-python

abnobdoss · 2026-05-19T20:07:36Z

Rationale for this change

This is the first in a planned series of PRs to improve the stability and speed of the Table.upsert operation. This PR focuses on improving the correctness foundations by implementing "Fail Fast" validation for join key types.

By rejecting unsupported types upfront, we prevent two major classes of issues:

Silent Data Loss: Floating-point join keys are rejected because PyArrow joins treat -0.0 and 0.0 as distinct values while Iceberg filters treat them as equal, leading to missed updates.
Engine Crashes: Nested types (structs, lists, maps), dictionary-encoded columns, and extension types (e.g., UUID) are rejected early to avoid cryptic C++ crashes in the underlying PyArrow join kernels.
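The signed-zero hazard can be illustrated without PyArrow or Iceberg. The sketch below uses only the standard library and does not reproduce either engine's behavior. It only shows that the two zeros compare equal by value while their IEEE 754 bit patterns differ, which is the kind of disagreement that arises when one component compares by value and another by representation. The helper name `float_bits` is hypothetical.

```python
import struct


def float_bits(x: float) -> str:
    """Return the big-endian IEEE 754 double encoding of x as a hex string."""
    return struct.pack(">d", x).hex()


pos, neg = 0.0, -0.0

# Value comparison: the two zeros are equal.
print(pos == neg)        # True

# Representation comparison: only the sign bit differs.
print(float_bits(pos))   # 0000000000000000
print(float_bits(neg))   # 8000000000000000
```

A join key that can be matched under one rule and missed under the other is the reason for rejecting floating-point keys outright rather than trying to normalize them.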

This establishes a safe contract for the subsequent performance-focused PRs (Vectorization and Anti-Join de-duplication).

Are these changes tested?

Yes. I have added a comprehensive suite of parameterized tests in tests/table/test_upsert.py.

Validation Matrix: Verified that ValueError or NotImplementedError are correctly raised for Floating Point, Nested, Dictionary, Null, and Extension types.
Correctness: Confirmed that standard primitive types (String, Int, Long, Decimal, etc.) continue to function as expected.
Schema Authority: Added tests ensuring that validation happens against both the Table Schema (architectural integrity) and the Dataframe Schema (memory format implementation).

Are there any user-facing changes?

Yes. The Table.upsert method now includes strict type validation for the join columns a user provides.

Users attempting to upsert on floating-point or nested columns will now receive a descriptive error message explaining the risk and suggesting a cast to Decimal or Integer.
This is a protective change that prevents users from accidentally writing corrupt data or encountering low-level engine crashes.

For full disclosure - this PR was developed with the assistance of an AI coding assistant (Antigravity) to help refine the type-safety checks and edge-case validation.

…in key

The error message renders the type from the Iceberg table's PyArrow schema, and schema_to_pyarrow converts pa.list_ into pa.large_list (see pyiceberg/io/pyarrow.py). The test regex must match the rendered large_list<element: int32>, not the source list<item: int32>.
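The mismatch described here can be shown with the standard library alone. In the sketch below the two type strings are the ones quoted in the commit message, while the error text and the `matches` helper are hypothetical. `re.search` is used because that is how `pytest.raises(match=...)` applies its pattern.

```python
import re

# Type strings quoted in the commit message.
SOURCE_TYPE = "list<item: int32>"             # as declared on the source dataframe
RENDERED_TYPE = "large_list<element: int32>"  # as rendered from the table's schema

# Hypothetical error text; the real wording in the PR may differ.
message = f"Column 'tags' of type {RENDERED_TYPE} cannot be used as a join key"


def matches(type_string: str) -> bool:
    """Return True if the literal type string appears in the error message."""
    return re.search(re.escape(type_string), message) is not None


print(matches(RENDERED_TYPE))  # True
print(matches(SOURCE_TYPE))    # False
```

`re.escape` keeps the angle brackets and colon literal, so the pattern tests only for the exact rendered type.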

A pa.null() source column was being rejected by _check_pyarrow_schema_compatible (format-version=2 forbids null) before the join-column validation could surface the intended "Null-type column ... cannot be used as a join key" error. Reordering the checks lets the upsert-specific rejection fire first, giving users the actionable message. Dataframe-level checks now skip columns that are absent from the source so the pre-existing _check_pyarrow_schema_compatible path still owns the "PyArrow table contains more columns" error in test_key_cols_misaligned.
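The effect of the reordering can be sketched with toy stand-ins. Everything below is hypothetical: schemas are plain dicts mapping column names to type names, and both check functions are simplified substitutes for the real join-column validation and _check_pyarrow_schema_compatible, kept only detailed enough to show which error surfaces first.

```python
def validate_join_columns(join_cols, source_schema):
    """Toy join-key check: reject null-typed keys, skip keys absent from the source."""
    for col in join_cols:
        if col not in source_schema:
            continue  # leave misaligned columns to the compatibility check
        if source_schema[col] == "null":
            raise ValueError(f"Null-type column '{col}' cannot be used as a join key")


def check_schema_compatible(table_schema, source_schema):
    """Toy compatibility check: reject extra columns and null-typed columns."""
    extra = sorted(set(source_schema) - set(table_schema))
    if extra:
        raise ValueError(f"PyArrow table contains more columns: {', '.join(extra)}")
    for col, type_name in source_schema.items():
        if type_name == "null":
            raise ValueError(f"Column '{col}' has unsupported type null")


def upsert_prechecks(table_schema, source_schema, join_cols):
    """Run the upsert-specific rejection first, then the general compatibility check."""
    validate_join_columns(join_cols, source_schema)
    check_schema_compatible(table_schema, source_schema)


def first_error(table_schema, source_schema, join_cols):
    """Return the message of the first error raised, or None if all checks pass."""
    try:
        upsert_prechecks(table_schema, source_schema, join_cols)
    except ValueError as exc:
        return str(exc)
    return None
```

With `table = {"id": "long"}`, a source of `{"id": "null"}` joined on `["id"]` yields the join-key message, while a source of `{"other": "long"}` joined on `["id"]` falls through to the "more columns" message because the absent key is skipped.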

Abanoub Doss and others added 6 commits May 19, 2026 14:51

refactor(upsert): early rejection of unsupported join column types

2abd44f

test(upsert): align test schemas with early compatibility check and u…

12d978e

…pdate error expectations

docs(upsert): add rationale for duplicate check ordering

336dcf8

style: fix linting issues (line length and whitespace)

c97f724

abnobdoss wants to merge 6 commits into apache:main from abnobdoss:fix/upsert-pr1-type-safety
