Core: Fix silent DV conflict when conflictDetectionFilter doesn't cover all touched partitions#15966

Draft
bvolpato wants to merge 3 commits into apache:main from bvolpato:repro/dv-validation-pruning-multi-partition

Conversation

bvolpato (Contributor) commented Apr 13, 2026

Summary

Fixes a correctness bug in MergingSnapshotProducer.validateAddedDVs that lets two deletion vectors (DVs) land on the same data file when concurrent row-level writers use a conflictDetectionFilter narrower than the set of partitions the current commit is writing DVs into. The committed table state ends up with duplicate DVs per data file, which poisons every subsequent scan with "Can't index multiple DVs" in DeleteFileIndex.Builder.add.

The bug has two layers:

  1. Manifest-level pruning (introduced in Core: Add manifest partition pruning to DV validation in MergingSnapshotProducer #15653, commit 79d4fc82c). filterManifestsByPartition projects the user-supplied conflictDetectionFilter onto each concurrent delete manifest's partition spec and drops manifests whose partition range doesn't match. That filter comes from the query's WHERE clause (for Spark, SparkPositionDeltaWrite#conflictDetectionFilter is populated from queryScan.filterExpressions()) and describes the user's scope of interest, not the union of partitions the commit is actually writing DVs into. A MERGE whose matched rows span partitions beyond what the WHERE clause names will drop concurrent manifests that hold a conflicting DV.

  2. Entry-level filterRows(conflictDetectionFilter) (in place since the original DV support in Core: Support commits with DVs #11495). Even if the manifest is scanned, ManifestReader#filterRows builds an Evaluator from Projections.inclusive(spec, caseSensitive).project(rowFilter) and drops entries whose partition doesn't satisfy the projected filter, so a concurrent DV for a file whose partition is outside the user's filter is still silently ignored. This layer alone is enough to reproduce the bug — removing only the manifest pruning is insufficient.

Both layers misuse the same filter. DV conflicts are detected by referencedDataFile identity, not by any data-level predicate, so conflictDetectionFilter has no business being applied in this path at all.
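To make the failure mode concrete, here is a simplified, self-contained sketch of the buggy shape: a partition predicate derived from the user's filter is applied before the identity check, so a conflicting concurrent DV in an uncovered partition is skipped. The names (DvEntry, findConflict) are illustrative only, not the Iceberg API.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model of the filtering bug described above; not real Iceberg code.
public class DvFilterBugSketch {
  public record DvEntry(int partitionBucket, String referencedDataFile) {}

  // Buggy shape: concurrent DV entries are dropped by a predicate derived from
  // the caller's conflictDetectionFilter before the identity check ever runs.
  public static String findConflict(
      List<DvEntry> concurrentDvs,
      Map<String, DvEntry> dvsByReferencedFile,
      Predicate<DvEntry> conflictDetectionFilter) {
    for (DvEntry dv : concurrentDvs) {
      if (!conflictDetectionFilter.test(dv)) {
        continue; // the bug: a conflicting DV outside the filter is skipped
      }
      if (dvsByReferencedFile.containsKey(dv.referencedDataFile())) {
        return dv.referencedDataFile(); // conflict found
      }
    }
    return null; // no conflict detected
  }

  public static void main(String[] args) {
    // Writer A is adding DVs for files in bucket 0 and bucket 2 ...
    Map<String, DvEntry> current =
        Map.of(
            "F_P0", new DvEntry(0, "F_P0"),
            "F_P2", new DvEntry(2, "F_P2"));
    // ... but its WHERE-clause filter only covers bucket 0.
    Predicate<DvEntry> filter = dv -> dv.partitionBucket() == 0;
    // Writer B concurrently committed a DV for F_P2 (bucket 2).
    List<DvEntry> concurrent = List.of(new DvEntry(2, "F_P2"));

    // The real conflict on F_P2 is silently missed:
    System.out.println(findConflict(concurrent, current, filter)); // prints "null"
  }
}
```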

This PR:

  • Removes the filterManifestsByPartition helper and its call site in validateAddedDVs.
  • Drops the now-unused ManifestEvaluator / Projections imports.
  • Removes filterRows(conflictDetectionFilter) from the per-manifest scan.
  • Simplifies the private overload to take only the manifest and new-snapshot-ids set.
  • Keeps the protected validateAddedDVs(TableMetadata, Long, Expression, Snapshot) signature unchanged so existing callers and any subclasses keep compiling.

A fix that adds the missing partition coverage (combining conflictDetectionFilter with a projection derived from dvsByReferencedFile) is possible as a future optimization, but it's not what this PR does. This PR restores correctness by fully scanning concurrent delete manifests and their live entries, which is what the code did before #11495 + #15653 made both filtering layers wrong.
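The restored behavior can be sketched the same way: no predicate at all, every live concurrent DV entry is checked purely by referencedDataFile identity. Again, names are illustrative, not the actual validateAddedDVs implementation.

```java
import java.util.List;
import java.util.Map;

// Toy model of the fixed shape: full scan, identity-only conflict check.
public class DvFullScanSketch {
  public record DvEntry(int partitionBucket, String referencedDataFile) {}

  public static String findConflict(
      List<DvEntry> concurrentDvs, Map<String, DvEntry> dvsByReferencedFile) {
    for (DvEntry dv : concurrentDvs) {
      // A conflict exists iff both commits target the same data file,
      // regardless of partition or any data-level predicate.
      if (dvsByReferencedFile.containsKey(dv.referencedDataFile())) {
        return dv.referencedDataFile();
      }
    }
    return null;
  }

  public static void main(String[] args) {
    Map<String, DvEntry> current =
        Map.of(
            "F_P0", new DvEntry(0, "F_P0"),
            "F_P2", new DvEntry(2, "F_P2"));
    List<DvEntry> concurrent = List.of(new DvEntry(2, "F_P2"));

    // With no filtering, the conflict on F_P2 is caught:
    System.out.println(findConflict(concurrent, current)); // prints "F_P2"
  }
}
```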

Reproducer

The first commit adds testConcurrentDVsMultiPartitionNarrowFilter to TestRowDelta:

  1. Table partitioned by bucket[16](data), seeded with F_P0 in bucket 0 and F_P2 in bucket 2.
  2. Writer A builds a RowDelta that adds DVs for both F_P0 and F_P2, with conflictDetectionFilter = data = 'u' (which projects to bucket(data) = 0).
  3. Writer B concurrently commits a DV for F_P2.
  4. Writer A calls commit() and must throw Found concurrently added DV for .../F_P2.

On an un-fixed tree, assertThatThrownBy fails with Expecting code to raise a throwable — the commit returns normally and the table is left with two DVs for F_P2.

On this branch, the assertion passes.

Test results on this branch (JDK 21, rebased on 56092bc8c)

```
./gradlew :iceberg-core:test --tests 'org.apache.iceberg.TestRowDelta'
BUILD SUCCESSFUL

testsuite name="org.apache.iceberg.TestRowDelta" tests="324" skipped="49" failures="0" errors="0"
```

324 tests, 0 failures. The three DV validation tests added alongside the buggy optimization in #15653 still pass:

  • testConcurrentDVsInDifferentPartitionsWithFilter — Writer A's DVs live entirely in a partition that overlaps the filter; concurrent DV is elsewhere; correct outcome: A succeeds.
  • testConcurrentDVsInSamePartitionWithFilter — DVs and concurrent DV share a partition; correct outcome: A fails with conflict.
  • testDVValidationPartitionPruningManifestCount — micro-test of pruning arithmetic; doesn't call validateAddedDVs and is unaffected.

Those tests were testing cases where the filter happens to be aligned with the commit's DV set, so they kept passing despite the bug.

Test gap in #15653

None of the three DV validation tests in #15653 cross the current commit's dvsByReferencedFile partition set against the pruning filter. That's the axis where the bug lives. The reproducer added here is the missing test.

Follow-on severity

The corruption is committed at row-delta commit time but becomes visible only when a reader tries to build the DeleteFileIndex:

```java
// core/src/main/java/org/apache/iceberg/DeleteFileIndex.java:528-536
private void add(Map<String, DeleteFile> dvByPath, DeleteFile dv) {
  String path = dv.referencedDataFile();
  DeleteFile existingDV = dvByPath.putIfAbsent(path, dv);
  if (existingDV != null) {
    throw new ValidationException(
        "Can't index multiple DVs for %s: %s and %s",
        path, ContentFileUtil.dvDesc(dv), ContentFileUtil.dvDesc(existingDV));
  }
}
```

Every read, DML (UPDATE/DELETE/MERGE), and maintenance task touching the affected data file blows up. The table is effectively bricked until the duplicate DVs are manually resolved (rewrite, expire_snapshots on one of the duplicates, etc.).
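For readers unfamiliar with this pattern, a minimal stand-alone demonstration of the putIfAbsent check above shows why the second DV registered for the same referenced data file is fatal. DeleteFile and ValidationException are replaced with plain strings and IllegalStateException purely to keep the sketch self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone demo of the duplicate-DV check in DeleteFileIndex.Builder.add,
// with Iceberg types swapped for plain strings.
public class DuplicateDvDemo {
  public static void addDv(Map<String, String> dvByPath, String path, String dvDesc) {
    // putIfAbsent returns the previous mapping, or null if none existed.
    String existing = dvByPath.putIfAbsent(path, dvDesc);
    if (existing != null) {
      throw new IllegalStateException(
          String.format("Can't index multiple DVs for %s: %s and %s", path, dvDesc, existing));
    }
  }

  public static void main(String[] args) {
    Map<String, String> dvByPath = new HashMap<>();
    addDv(dvByPath, "F_P2", "dv-from-writer-B"); // first DV for F_P2: fine
    try {
      addDv(dvByPath, "F_P2", "dv-from-writer-A"); // second DV for F_P2: throws
    } catch (IllegalStateException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

Because the check lives on the read path, the duplicate is only discovered after the bad state is already committed, which is what makes the bug silent at write time.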

Commits

  1. dfcd371db Core: Reproducer for concurrent DV conflict missed by partition pruning
  2. 36266b4c7 Core: Don't filter DV validation by conflictDetectionFilter

Test plan

  • New reproducer test testConcurrentDVsMultiPartitionNarrowFilter fails on main (56092bc8c) and passes with the fix.
  • Full TestRowDelta passes on this branch (324 tests, 0 failures).
  • Reviewer to confirm: is the semantic contract of conflictDetectionFilter in RowDelta documented anywhere that would make removing it from validateAddedDVs controversial? Happy to restore a parameter-narrowing optimization in a follow-up if maintainers want to keep the perf win — but correctness first.

Adds a failing test in TestRowDelta that demonstrates a correctness bug
introduced by the partition-pruning optimization in apache#15653.

MergingSnapshotProducer.validateAddedDVs prunes concurrent delete manifests
by projecting the user-supplied conflictDetectionFilter onto each manifest's
partition spec. When the current commit writes DVs across multiple partitions
but the caller supplies a filter that only names a subset of them, a manifest
holding a concurrent DV for one of the un-filtered partitions can be pruned
out of the scan. The loop never sees the conflicting DV and the commit
succeeds, leaving the table with two DVs referencing the same data file.
Subsequent scans then fail with "Can't index multiple DVs" in
DeleteFileIndex.Builder.add.

The test constructs exactly this case: Writer A adds DVs for files in
bucket 0 and bucket 2 with a conflict detection filter scoped to bucket 0
only; Writer B concurrently commits a DV for the bucket 2 file; Writer A's
commit is expected to throw "Found concurrently added DV". On current main
the commit returns normally, so the test fails.

This commit intentionally adds only the reproducer. A fix for the pruning
logic is being discussed separately.
validateAddedDVs was dropping concurrent delete manifests (and, even before
that, dropping concurrent delete-manifest entries) using a filter derived
from the caller-supplied conflictDetectionFilter. That filter comes from the
query's WHERE clause and describes the user's scope of interest, not the set
of partitions the current commit is actually writing DVs into. A commit that
writes DVs across multiple partitions with a filter scoped to only some of
them would skip concurrent DVs in the un-filtered partitions, even when those
DVs target a data file the current commit is also DV-ing.

The resulting state has two DVs referencing the same data file, which breaks
every subsequent scan with "Can't index multiple DVs" in
DeleteFileIndex.Builder.add. That is committed, irrecoverable table state
without manual intervention.

Fix: remove the partition-level pruning added in apache#15653 and the entry-level
filterRows that has been there since DVs were introduced in apache#11495. DV
conflicts are detected by referencedDataFile identity, not by any
data-level predicate, so this filter is semantically wrong in this path.
All concurrent delete manifests with added files are now scanned in full,
and every live DV entry is checked against dvsByReferencedFile. The
protected validateAddedDVs signature is unchanged so subclasses and
existing callers keep compiling.

The companion test testConcurrentDVsMultiPartitionNarrowFilter (added in
the previous commit) goes from failing to passing under this fix. The three
DV validation tests added alongside the buggy optimization in apache#15653
(testConcurrentDVsInDifferentPartitionsWithFilter,
testConcurrentDVsInSamePartitionWithFilter,
testDVValidationPartitionPruningManifestCount) also still pass, since
they were testing scenarios where the filter happens to be aligned with
the commit's DV set.
bvolpato changed the title from "Core: Reproducer for concurrent DV conflict missed by partition pruning (#15653)" to "Core: Fix silent DV conflict when conflictDetectionFilter doesn't cover all touched partitions" on Apr 13, 2026
LocalVariableName in checkstyle requires ^[a-z][a-zA-Z0-9]++$, so the
underscore-separated writerA_dvBucket0 / writerA_dvBucket2 / writerB_dvBucket2
were rejected by :iceberg-core:checkstyleTest in CI. Rename to
dvBucket0ForA / dvBucket2ForA / dvBucket2ForB. No behavioral change.