API, Spark: Add branch support to RemoveDanglingDeleteFiles #15957

Open
kiyeonjeon21 wants to merge 1 commit into apache:main from kiyeonjeon21:add-branch-support-remove-dangling-deletes

Summary

RemoveDanglingDeleteFiles always operated on the main branch. There was no way to target a specific branch, and RewriteDataFilesSparkAction did not forward its branch when invoking the action internally.

This PR:

  • Adds a toBranch(String) default method to the RemoveDanglingDeleteFiles API
  • Implements branch-aware metadata reads and commits in RemoveDanglingDeletesSparkAction
  • Forwards the branch from RewriteDataFilesSparkAction to the dangling delete removal step

Closes #15369

Changes

  • API: Added toBranch(String) as a default method that throws UnsupportedOperationException, so existing implementations are not broken (revapi passes)
  • Spark (v3.4, v3.5, v4.0): Metadata table reads are scoped to the branch's snapshot via the snapshot-id option. Commits are directed to the branch via RewriteFiles.toBranch(branch)
  • Spark (v4.1): Uses SparkTable.create(metadataTable, TimeTravel) instead of the snapshot-id option, since time travel options were reworked in Spark 4.1
  • RewriteDataFilesSparkAction (all versions): Now passes its branch field to RemoveDanglingDeletesSparkAction
  • Tests: Added testBranchSupport and testBranchWithDanglingDeletes for v3.5, v4.0, v4.1
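
The default-method approach described in the API bullet above can be illustrated with a minimal, self-contained sketch. It does not use the real Iceberg interfaces: `DanglingDeleteAction`, `LegacyAction`, and `BranchAwareAction` are hypothetical stand-ins, and the returned strings are illustrative only. The point is that an implementation written before the method existed still compiles, and only fails if a caller asks it for a branch.

```java
import java.util.ArrayList;
import java.util.List;

public class ToBranchDefaultSketch {

  // Hypothetical stand-in for the action API; not the real Iceberg interface.
  interface DanglingDeleteAction {
    List<String> execute();

    // Default body: implementations that predate this method keep compiling
    // and reject branch targeting at call time instead.
    default DanglingDeleteAction toBranch(String branch) {
      throw new UnsupportedOperationException(
          getClass().getName() + " does not support toBranch");
    }
  }

  // An implementation written before toBranch existed; it needs no changes.
  static class LegacyAction implements DanglingDeleteAction {
    @Override
    public List<String> execute() {
      return new ArrayList<>();
    }
  }

  // An implementation that opts in by overriding the default.
  static class BranchAwareAction implements DanglingDeleteAction {
    private String branch = "main";

    @Override
    public DanglingDeleteAction toBranch(String targetBranch) {
      this.branch = targetBranch;
      return this;
    }

    @Override
    public List<String> execute() {
      List<String> result = new ArrayList<>();
      result.add("committed to " + branch);
      return result;
    }
  }

  public static void main(String[] args) {
    System.out.println(new BranchAwareAction().toBranch("audit").execute());
    try {
      new LegacyAction().toBranch("audit");
    } catch (UnsupportedOperationException e) {
      System.out.println("legacy action rejected the branch: " + e.getMessage());
    }
  }
}
```

The trade-off of this pattern is that lack of support surfaces at run time rather than compile time, which is the cost of keeping the interface change source- and binary-compatible.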

Notes

  • The early return for unpartitioned tables is kept as-is. The findDanglingDeletes SQL relies on data_file.partition, which does not exist for unpartitioned tables. Addressing unpartitioned tables would require a different query strategy and is better handled separately.
  • AI tools were used to assist with code exploration and drafting. I reviewed and tested all changes locally.

Test plan

  • ./gradlew :iceberg-api:revapi passes
  • TestRemoveDanglingDeleteAction passes (18 tests, 0 failures on Spark 4.1)
  • ./gradlew spotlessCheck passes

RemoveDanglingDeleteFiles always operated on the main branch and did
not accept a branch parameter. This meant that when
RewriteDataFilesSparkAction invoked it with the remove-dangling-deletes
option, the branch context was lost.

This change adds a toBranch(String) method to the
RemoveDanglingDeleteFiles API and implements it in
RemoveDanglingDeletesSparkAction. Metadata table reads are now scoped
to the target branch's snapshot, and the resulting RewriteFiles commit
is directed to that branch.

RewriteDataFilesSparkAction now forwards its branch to the dangling
delete removal step.
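
As a rough illustration of scoping a metadata read to a branch, the sketch below resolves a branch name to the snapshot at its head and places that ID in the `snapshot-id` read option. It is a simplified model, not the code in this PR: `TableRefs`, `headOf`, and `metadataReadOptions` are hypothetical names, and treating a null branch as main is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class BranchScopedReadSketch {

  static final String MAIN = "main";

  // Hypothetical table model: branch name -> ID of the snapshot at its head.
  static class TableRefs {
    private final Map<String, Long> heads = new HashMap<>();

    TableRefs withBranch(String name, long snapshotId) {
      heads.put(name, snapshotId);
      return this;
    }

    // Resolves a branch to its head snapshot; null is treated as main (assumption).
    long headOf(String branch) {
      String name = branch == null ? MAIN : branch;
      Long snapshotId = heads.get(name);
      if (snapshotId == null) {
        throw new IllegalArgumentException("Unknown branch: " + name);
      }
      return snapshotId;
    }
  }

  // Builds read options that pin a metadata table scan to the branch's snapshot.
  static Map<String, String> metadataReadOptions(TableRefs refs, String branch) {
    Map<String, String> options = new HashMap<>();
    options.put("snapshot-id", Long.toString(refs.headOf(branch)));
    return options;
  }

  public static void main(String[] args) {
    TableRefs refs = new TableRefs().withBranch(MAIN, 10L).withBranch("audit", 42L);
    System.out.println(metadataReadOptions(refs, "audit"));
    System.out.println(metadataReadOptions(refs, null));
  }
}
```

Pinning the read to the branch head keeps the scan and the later commit consistent: the dangling-delete candidates are computed from the same branch state that the commit targets.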

Closes apache#15369

kiyeonjeon21 force-pushed the add-branch-support-remove-dangling-deletes branch from f05dc2c to fda4872 on April 12, 2026 19:01

Development

Successfully merging this pull request may close these issues.

RemoveDanglingDeleteFiles does not support branches and skips unpartitioned tables