Skip to content

fix(upsert): early rejection of unsupported join column types#3384

Open
abnobdoss wants to merge 6 commits into
apache:mainfrom
abnobdoss:fix/upsert-pr1-type-safety
Open

fix(upsert): early rejection of unsupported join column types#3384
abnobdoss wants to merge 6 commits into
apache:mainfrom
abnobdoss:fix/upsert-pr1-type-safety

Conversation

@abnobdoss
Copy link
Copy Markdown

Rationale for this change

This is the first in a planned series of PRs to improve the stability and speed of the Table.upsert operation. This PR focuses on improving the correctness foundations by implementing "Fail Fast" validation for join key types.

By rejecting unsupported types upfront, we prevent two major classes of issues:

  1. Silent Data Loss: Floating-point join keys are rejected because PyArrow joins treat -0.0 and 0.0 as distinct values while Iceberg filters treat them as equal, leading to missed updates.
  2. Engine Crashes: Nested types (structs, lists, maps), dictionary-encoded columns, and extension types (e.g., UUID) are rejected early to avoid cryptic C++ crashes in the underlying PyArrow join kernels.

This establishes a safe contract for the subsequent performance-focused PRs (Vectorization and Anti-Join de-duplication).

Are these changes tested?

Yes. I have added a comprehensive suite of parameterized tests in tests/table/test_upsert.py.

  • Validation Matrix: Verified that ValueError or NotImplementedError are correctly raised for Floating Point, Nested, Dictionary, Null, and Extension types.
  • Correctness: Confirmed that standard primitive types (String, Int, Long, Decimal, etc.) continue to function as expected.
  • Schema Authority: Added tests ensuring that validation happens against both the Table Schema (architectural integrity) and the Dataframe Schema (memory format implementation).

Are there any user-facing changes?

Yes. The Table.upsert method now includes strict type validation for the join columns a user provides.

  • Users attempting to upsert on floating-point or nested columns will now receive a descriptive error message explaining the risk and suggesting a cast to Decimal or Integer.
  • This is a protective change that prevents users from accidentally writing corrupt data or encountering low-level engine crashes.

For full disclosure - this PR was developed with the assistance of an AI coding assistant (Antigravity) to help refine the type-safety checks and edge-case validation.

Abanoub Doss and others added 6 commits May 19, 2026 14:51
…in key

The error message renders the type from the Iceberg table's pyarrow schema,
and schema_to_pyarrow converts pa.list_ into pa.large_list (see
pyiceberg/io/pyarrow.py). The test regex must match the rendered
large_list<element: int32>, not the source list<item: int32>.
A pa.null() source column was being rejected by
_check_pyarrow_schema_compatible (format-version=2 forbids null) before the
join-column validation could surface the intended "Null-type column ...
cannot be used as a join key" error. Reordering the checks lets the
upsert-specific rejection fire first, giving users the actionable message.

Dataframe-level checks now skip columns that are absent from the source so
the pre-existing _check_pyarrow_schema_compatible path still owns the
"PyArrow table contains more columns" error in test_key_cols_misaligned.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants