Core: Fix silent DV conflict when conflictDetectionFilter doesn't cover all touched partitions #15966
Draft
bvolpato wants to merge 3 commits into apache:main from
Conversation
Adds a failing test in `TestRowDelta` that demonstrates a correctness bug introduced by the partition-pruning optimization in apache#15653. `MergingSnapshotProducer.validateAddedDVs` prunes concurrent delete manifests by projecting the user-supplied `conflictDetectionFilter` onto each manifest's partition spec. When the current commit writes DVs across multiple partitions but the caller supplies a filter that only names a subset of them, a manifest holding a concurrent DV for one of the un-filtered partitions can be pruned out of the scan. The loop never sees the conflicting DV, the commit succeeds, and the table is left with two DVs referencing the same data file. Subsequent scans then fail with "Can't index multiple DVs" in `DeleteFileIndex.Builder.add`.

The test constructs exactly this case: Writer A adds DVs for files in bucket 0 and bucket 2 with a conflict detection filter scoped to bucket 0 only; Writer B concurrently commits a DV for the bucket 2 file; Writer A's commit is expected to throw "Found concurrently added DV". On current main the commit returns normally, so the test fails.

This commit intentionally adds only the reproducer. A fix for the pruning logic is being discussed separately.
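To make the failure mode concrete, here is a minimal, self-contained sketch of the conflict check (all names here, `DvConflictModel`, `Dv`, `findConflict`, `prune`, are hypothetical simplifications, not the real Iceberg API): a partition predicate derived from the filter can hide a concurrent DV that references a data file the current commit is also DV-ing.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical, simplified model of the DV conflict check in
// MergingSnapshotProducer.validateAddedDVs. Not the Iceberg API.
final class DvConflictModel {
    // A concurrent DV entry: the data file it references and the
    // partition bucket that file lives in.
    record Dv(String referencedDataFile, int partition) {}

    // Returns the first conflicting data file, or null if none is seen.
    // `prune` models the partition-level filtering derived from the
    // conflictDetectionFilter: entries it rejects are never compared
    // against the current commit's dvsByReferencedFile.
    static String findConflict(
            Map<String, Dv> dvsByReferencedFile,
            List<Dv> concurrentDvs,
            Predicate<Dv> prune) {
        for (Dv dv : concurrentDvs) {
            if (!prune.test(dv)) {
                continue; // pruned: this entry is silently skipped
            }
            if (dvsByReferencedFile.containsKey(dv.referencedDataFile())) {
                return dv.referencedDataFile(); // conflict detected
            }
        }
        return null; // no conflict seen; commit proceeds
    }
}
```

With a filter scoped to bucket 0, a concurrent DV in bucket 2 for a file the commit also touches is pruned and `findConflict` returns null, which is exactly the silent-conflict outcome the test reproduces; with no pruning, the same input reports the conflict.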
`validateAddedDVs` was dropping concurrent delete manifests (and, even before that, dropping concurrent delete-manifest entries) using a filter derived from the caller-supplied `conflictDetectionFilter`. That filter comes from the query's WHERE clause and describes the user's scope of interest, not the set of partitions the current commit is actually writing DVs into. A commit that writes DVs across multiple partitions with a filter scoped to only some of them would skip concurrent DVs in the un-filtered partitions, even when those DVs target a data file the current commit is also DV-ing. The resulting state has two DVs referencing the same data file, which breaks every subsequent scan with "Can't index multiple DVs" in `DeleteFileIndex.Builder.add`. That is committed, irrecoverable table state without manual intervention.

Fix: remove the partition-level pruning added in apache#15653 and the entry-level `filterRows` that has been there since DVs were introduced in apache#11495. DV conflicts are detected by `referencedDataFile` identity, not by any data-level predicate, so this filter is semantically wrong in this path. All concurrent delete manifests with added files are now scanned in full, and every live DV entry is checked against `dvsByReferencedFile`. The protected `validateAddedDVs` signature is unchanged so subclasses and existing callers keep compiling.

The companion test `testConcurrentDVsMultiPartitionNarrowFilter` (added in the previous commit) goes from failing to passing under this fix. The three DV validation tests added alongside the buggy optimization in apache#15653 (`testConcurrentDVsInDifferentPartitionsWithFilter`, `testConcurrentDVsInSamePartitionWithFilter`, `testDVValidationPartitionPruningManifestCount`) also still pass, since they were testing scenarios where the filter happens to be aligned with the commit's DV set.
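The fixed validation described above can be sketched as follows (a hypothetical, simplified model; `FixedDvValidationModel`, `DvEntry`, and `validate` are illustrative names, not the Iceberg API): every live DV entry of every concurrent delete manifest is compared against `dvsByReferencedFile`, with no partition-based pruning at either layer.

```java
import java.util.List;
import java.util.Set;

// Hypothetical model of validateAddedDVs after this fix: conflicts are
// detected purely by referencedDataFile identity, so no filter derived
// from the query predicate is applied anywhere in the scan.
final class FixedDvValidationModel {
    record DvEntry(String referencedDataFile, boolean live) {}

    static void validate(Set<String> dvsByReferencedFile,
                         List<List<DvEntry>> concurrentDeleteManifests) {
        for (List<DvEntry> manifest : concurrentDeleteManifests) { // no manifest pruning
            for (DvEntry entry : manifest) {                       // no filterRows
                if (entry.live()
                        && dvsByReferencedFile.contains(entry.referencedDataFile())) {
                    throw new IllegalStateException(
                        "Found concurrently added DV for " + entry.referencedDataFile());
                }
            }
        }
    }
}
```

The cost is a full scan of concurrent delete manifests with added files; the benefit is that a conflict can no longer escape detection because of where its partition happens to fall relative to the user's filter.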
`LocalVariableName` in checkstyle requires `^[a-z][a-zA-Z0-9]++$`, so the underscore-separated `writerA_dvBucket0` / `writerA_dvBucket2` / `writerB_dvBucket2` were rejected by `:iceberg-core:checkstyleTest` in CI. Renamed to `dvBucket0ForA` / `dvBucket2ForA` / `dvBucket2ForB`. No behavioral change.
Summary
Fixes a correctness bug in `MergingSnapshotProducer.validateAddedDVs` that lets two Deletion Vectors land on the same data file when concurrent row-level writers use a `conflictDetectionFilter` narrower than the set of partitions the current commit is writing DVs into. The committed table state ends up with duplicate DVs per data file, which poisons every subsequent scan with `Can't index multiple DVs` in `DeleteFileIndex.Builder.add`.

The bug has two layers:
1. Manifest-level pruning (introduced in Core: Add manifest partition pruning to DV validation in MergingSnapshotProducer #15653, commit `79d4fc82c`). `filterManifestsByPartition` projects the user-supplied `conflictDetectionFilter` onto each concurrent delete manifest's partition spec and drops manifests whose partition range doesn't match. That filter comes from the query's WHERE clause (for Spark: `SparkPositionDeltaWrite#conflictDetectionFilter` → `queryScan.filterExpressions()`) and describes the user's scope of interest, not the union of partitions the commit is actually writing DVs into. A MERGE whose matched rows span partitions beyond what the WHERE clause names will drop concurrent manifests that hold a conflicting DV.

2. Entry-level `filterRows(conflictDetectionFilter)` (in place since the original DV support in Core: Support commits with DVs #11495). Even if the manifest is scanned, `ManifestReader#filterRows` builds an `Evaluator` from `Projections.inclusive(spec, caseSensitive).project(rowFilter)` and drops entries whose partition doesn't satisfy the projected filter, so a concurrent DV for a file whose partition is outside the user's filter is still silently ignored. This layer alone is enough to reproduce the bug; removing only the manifest pruning is insufficient.

Both layers misuse the same filter. DV conflicts are detected by `referencedDataFile` identity, not by any data-level predicate, so `conflictDetectionFilter` has no business being applied in this path at all.

This PR:

- Removes the `filterManifestsByPartition` helper and its call site in `validateAddedDVs`.
- Removes the now-unused `ManifestEvaluator` / `Projections` imports.
- Removes `filterRows(conflictDetectionFilter)` from the per-manifest scan.
- Keeps the protected `validateAddedDVs(TableMetadata, Long, Expression, Snapshot)` signature unchanged so existing callers and any subclasses keep compiling.
conflictDetectionFilterwith a projection derived fromdvsByReferencedFile) is possible as a future optimization, but it's not what this PR does. This PR restores correctness by fully scanning concurrent delete manifests and their live entries, which is what the code did before #11495 + #15653 made both filtering layers wrong.Reproducer
The first commit adds `testConcurrentDVsMultiPartitionNarrowFilter` to `TestRowDelta`:

- Table partitioned by `bucket[16](data)`, seeded with `F_P0` in bucket 0 and `F_P2` in bucket 2.
- Writer A opens a `RowDelta` that adds DVs for both `F_P0` and `F_P2`, with `conflictDetectionFilter = data = 'u'` (which projects to `bucket(data) = 0`).
- Writer B concurrently commits a DV for `F_P2`.
- Writer A calls `commit()` and must throw `Found concurrently added DV for .../F_P2`.

On an un-fixed tree, `assertThatThrownBy` fails with `Expecting code to raise a throwable`: the commit returns normally and the table is left with two DVs for `F_P2`. On this branch, the assertion passes.
Test results on this branch (JDK 21, rebased on `56092bc8c`)

324 tests, 0 failures. The three DV validation tests added alongside the buggy optimization in #15653 still pass:

- `testConcurrentDVsInDifferentPartitionsWithFilter`: Writer A's DVs live entirely in a partition that overlaps the filter, and the concurrent DV is elsewhere; correct outcome: A succeeds.
- `testConcurrentDVsInSamePartitionWithFilter`: the DVs and the concurrent DV share a partition; correct outcome: A fails with a conflict.
- `testDVValidationPartitionPruningManifestCount`: a micro-test of the pruning arithmetic; it doesn't call `validateAddedDVs` and is unaffected.

Those tests were testing cases where the filter happens to be aligned with the commit's DV set, so they kept passing despite the bug.
Test gap in #15653
None of the three DV validation tests in #15653 cross the current commit's `dvsByReferencedFile` partition set against the pruning filter. That's the axis where the bug lives. The reproducer added here is the missing test.

Follow-on severity
The corruption is committed at row-delta commit time but becomes visible only when a reader tries to build the `DeleteFileIndex`: every read, DML (UPDATE/DELETE/MERGE), and maintenance task touching the affected data file blows up. The table is effectively bricked until the duplicate DVs are manually resolved (rewrite, `expire_snapshots` on one of the duplicates, etc.).
Commits
- `dfcd371db`: Core: Reproducer for concurrent DV conflict missed by partition pruning
- `36266b4c7`: Core: Don't filter DV validation by conflictDetectionFilter

Test plan
- `testConcurrentDVsMultiPartitionNarrowFilter` fails on main (`56092bc8c`) and passes with the fix.
- `TestRowDelta` passes on this branch (324 tests, 0 failures).

Open question: is `conflictDetectionFilter` in `RowDelta` documented anywhere that would make removing it from `validateAddedDVs` controversial? Happy to restore a parameter-narrowing optimization in a follow-up if maintainers want to keep the perf win, but correctness first.