fix: dynamic_partition_overwrite builds per-spec delete predicates after partition spec evolution#3149
Open
tusharchou wants to merge 3 commits intoapache:mainfrom
Open
Conversation
…s after partition spec evolution Fixes apache#3148 When a table has undergone partition spec evolution, its snapshot may contain manifests written under different partition_spec_ids. Previously, dynamic_partition_overwrite built the delete predicate using only the current spec, causing the manifest evaluator to incorrectly skip manifests from older specs — leaving stale data files silently behind. The fix builds the delete predicate per historical spec present in the snapshot, projecting the new data files' partition values into each spec's coordinate space before evaluating. Regression tests added covering: - Mixed-spec snapshot (manifests from both spec-0 and spec-1) - Overwrite of a partition that only exists in spec-0 manifests (silent data duplication case)
Contributor
Author
|
AI Disclosure |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Rationale
While reviewing PR #3011 (manifest pruning optimization), I identified a correctness
gap when tables have undergone partition spec evolution.
When
dynamic_partition_overwriteis called on a table with mixedpartition_spec_idsin its snapshot, the delete predicate was built using only the current partition spec.
This caused
inclusive_projectionto fail silently when evaluating older manifests —the predicate contained field references (e.g.
region) that have no correspondingpartition field in the old spec, causing the manifest evaluator to skip those manifests
entirely. The result is silent data duplication: stale rows from old spec manifests are
never deleted.
Changes
pyiceberg/table/__init__.py:dynamic_partition_overwritenow iterates over allpartition_spec_ids present in the current snapshot and builds a per-spec deletepredicate, projecting the new data files' partition values into each historical spec's
coordinate space before evaluating.
tests/integration/test_manifest_pruning_spec_evolution.py: two regression tests added:duplication case — no exception raised, wrong rows survive)
Are these changes tested?
Yes — two new integration tests using the SQLite in-memory catalog, no external
services required.
Are there any user-facing changes?
Yes —
dynamic_partition_overwritenow correctly deletes all matching rows acrossall historical partition specs, fixing silent data duplication on evolved tables.
Related