Skip to content

feat: Adding support to block archival on last known ECTR for v6 tables#18380

Open
nsivabalan wants to merge 3 commits intoapache:masterfrom
nsivabalan:archivalBlockECTRV6
Open

feat: Adding support to block archival on last known ECTR for v6 tables#18380
nsivabalan wants to merge 3 commits intoapache:masterfrom
nsivabalan:archivalBlockECTRV6

Conversation

@nsivabalan
Copy link
Copy Markdown
Contributor

Describe the issue this Pull Request addresses

This PR adds support to block archival based on the Earliest Commit To Retain (ECTR) from the last completed clean operation, preventing potential data leaks when cleaning configurations change between clean and archival runs.

Problem: Currently, archival recomputes ECTR independently based on cleaning configs at archival time, rather than reading it from the last clean plan. When cleaning configs change between clean and archival operations, archival may archive commits whose data files haven't been cleaned yet, leading to timeline metadata loss for existing data files.

Example scenario:

  1. Clean runs with retainCommits=5, computes ECTR=commit_100, cleans files older than commit_100
  2. Config changes to retainCommits=2 before next clean
  3. Archival runs with new config, recomputes ECTR=commit_103
  4. Archival archives commits 100-102, but their data files still exist (weren't cleaned yet)
  5. Result: Timeline metadata is lost for existing data files → data leak

Summary and Changelog

User-facing summary: Users can now optionally enable archival blocking based on ECTR from the last clean to prevent archiving commits whose data files haven't been cleaned. This is useful when cleaning configurations may change over time or when strict data retention guarantees are needed.

Detailed changelog:

Configuration Changes:

  • Added new advanced config hoodie.archive.block.on.latest.clean.ectr (default: false)
    • When enabled, archival reads ECTR from last completed clean metadata
    • Blocks archival of commits with timestamp >= ECTR
    • Marked as advanced config for power users
    • Available since version 1.2.0

Implementation Changes:

  • TimelineArchiverV1.java: Added ECTR blocking logic in getCommitInstantsToArchive() method
    • Reads ECTR from last completed clean's metadata (lines 274-294)
    • Filters commit timeline to exclude commits >= ECTR (lines 322-326)
    • Follows same pattern as existing compaction/clustering retention checks
    • Includes error handling with graceful degradation (logs warning if metadata read fails)
  • HoodieArchivalConfig.java: Added config property BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR
    • Builder method: withBlockArchivalOnCleanECTR(boolean)
  • HoodieWriteConfig.java: Added access method shouldBlockArchivalOnCleanECTR()

Test Changes:

  • Added 7 comprehensive tests in TestHoodieTimelineArchiver.java (633 lines):
    a. testArchivalBlocksOnCleanECTRWhenEnabled - Core blocking functionality
    b. testArchivalProceedsNormallyWhenECTRBlockingDisabled - Backward compatibility
    c. testArchivalMakesProgressWhenECTRIsLaterThanArchivalWindow - Progress validation
    d. testArchivalContinuesWhenCleanMetadataIsMissing - Missing metadata handling
    e. testArchivalHandlesEmptyECTRInCleanMetadata - Empty ECTR handling
    f. testArchivalProceedsWhenCleanHasFileVersionsPolicyWithNullECTR - FILE_VERSIONS policy compatibility
    g. testArchivalBlocksOnCleanECTRWithTimelineArchiverV2AndVersion9 - Version 9 / LSM timeline compatibility

Impact

Public API Changes:

  • New config property: hoodie.archive.block.on.latest.clean.ectr (opt-in, default: false)
  • New builder method: HoodieArchivalConfig.Builder.withBlockArchivalOnCleanECTR(boolean)
  • New accessor: HoodieWriteConfig.shouldBlockArchivalOnCleanECTR()

User-facing changes:

  • When enabled, archival may retain more commits in active timeline if they haven't been cleaned
  • Timeline growth bounded by ECTR from last clean operation
  • No behavior change when config is disabled (default)

Performance impact:

  • Minimal: One additional metadata read per archival operation when enabled
  • Read operation is fast (single clean metadata file)
  • No impact when config is disabled (default)

Breaking changes: None - opt-in feature with no default behavior changes

Risk Level

low

Documentation Update

Config documentation:
The new config hoodie.archive.block.on.latest.clean.ectr is documented inline:
.withDocumentation("If enabled, archival will block on latest ECTR from last known clean")

Website documentation needed:

  • Add entry to config reference page for HoodieArchivalConfig.BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR
  • Update archival section in Hudi docs to explain ECTR blocking feature
  • Add usage example showing when to enable this config
  • Document interaction with different cleaning policies (KEEP_LATEST_COMMITS vs KEEP_LATEST_FILE_VERSIONS)

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Mar 25, 2026
Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The motivation is solid — preventing data leaks when cleaning configs change between clean and archival is an important safety improvement. However, the current implementation has a fail-open design in both its error handling and timeline scope that could undermine the safety guarantee. Please see the inline comments for details.

@nsivabalan nsivabalan force-pushed the archivalBlockECTRV6 branch from cbd2f32 to d4dd819 Compare April 9, 2026 06:11
@nsivabalan nsivabalan force-pushed the archivalBlockECTRV6 branch from d4dd819 to 7b2aa41 Compare April 9, 2026 06:13
Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — A couple of minor cleanups: remove a commented-out throw statement and dead code that creates an unused Map variable.

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The approach is solid — reading ECTR from the last clean to gate archival is a clean solution to the config-drift problem. One concern: the swallowed IOException in the critical path could silently defeat the safety check (details in inline comment).

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Greptile Summary: This PR introduces an opt-in archival guard for Hudi v6 tables (using TimelineArchiverV1) that reads the Earliest Commit To Retain (ECTR) from the last completed clean operation and prevents archiving any commits at or after that boundary. The feature is gated behind a new config hoodie.archive.block.on.latest.clean.ectr (default false) to maintain full backward compatibility. Five integration tests cover the primary scenarios.

Key changes:

  • TimelineArchiverV1: reads ECTR from last clean's HoodieCleanMetadata and adds a filter to the archival stream; the feature is skipped when no clean exists or ECTR is empty/null.
  • HoodieArchivalConfig: new BLOCK_ARCHIVAL_ON_LATEST_CLEAN_ECTR config property with builder support.
  • HoodieWriteConfig: new shouldBlockArchivalOnCleanECTR() accessor, plus three unrelated changes (ORC codec import/method, removal of getClusteringEarliestCommitToCluster) that should be in a separate PR.
  • Tests: five tests in TestHoodieTimelineArchiver; the test named testArchivalContinuesWhenCleanMetadataIsMissing only covers the "no clean commit" path, not the IOException path its name implies.

Main concern: In the IOException catch block (lines 291–294 of TimelineArchiverV1), the log.warn message says \"skipping ECTR check\" — implying graceful degradation — but the code immediately throws HoodieIOException, aborting the archival. The intent (fail-fast vs. graceful skip) needs to be clarified and the log message corrected to match.

Greptile Confidence Score: 3/5
Not safe to merge as-is — the IOException handling is self-contradictory and could silently fail archival in production when clean metadata is temporarily unreadable.

The core logic (ECTR filter, config, builder, tests) is sound, but the IOException catch block has a direct contradiction: the log says skipping while the code throws, making the failure mode ambiguous and untested. The unrelated changes to HoodieWriteConfig add diff noise and risk. Both issues need resolution before merge.

TimelineArchiverV1.java (IOException handling) and HoodieWriteConfig.java (unrelated changes).

Vulnerabilities

No security concerns identified. The new config is disabled by default, the feature only reads existing clean metadata from the timeline (no external input processed as code), and no credentials, tokens, or sensitive data are involved.

CodeRabbit Walkthrough: This PR introduces an ECTR (Earliest Commit To Retain) based archival blocking mechanism. When enabled, the timeline archiver reads the latest CLEAN metadata to extract the ECTR value and conditionally blocks archival of commit instants predating this threshold, preventing premature removal of essential commits.

Sequence Diagram (CodeRabbit):

sequenceDiagram
    participant TimelineArchiver
    participant Config
    participant Timeline
    participant CleanMetadata
    participant CommitFilter

    TimelineArchiver->>Config: Check shouldBlockArchivalOnCleanECTR()
    alt ECTR Blocking Enabled
        TimelineArchiver->>Timeline: Read latest CLEAN instant
        TimelineArchiver->>CleanMetadata: Parse HoodieCleanMetadata
        CleanMetadata-->>TimelineArchiver: getEarliestCommitToRetain()
        alt ECTR Value Present
            TimelineArchiver->>CommitFilter: Add ECTR-based filter condition
            CommitFilter->>CommitFilter: Exclude commits with requestedTime < ECTR
            CommitFilter-->>TimelineArchiver: Filtered commit instants
        else ECTR Missing/Blank
            TimelineArchiver->>CommitFilter: Skip ECTR filter
            CommitFilter-->>TimelineArchiver: Unfiltered commit instants
        end
    else ECTR Blocking Disabled
        TimelineArchiver->>CommitFilter: Apply standard archival filters
        CommitFilter-->>TimelineArchiver: Filtered commit instants
    end
    TimelineArchiver->>Timeline: Archive filtered instants
Loading

CodeRabbit: yihua#28 (review)
Greptile: yihua#28 (review)

} catch (IOException e) {
log.warn("Failed to read clean metadata for {}, skipping ECTR check", lastCleanInstant.get(), e);
throw new HoodieIOException("Failed to read clean metadata for " + lastCleanInstant.get() + ", skipping ECTR check", e);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Failure message contradicts runtime behavior.

Line 292 and Line 293 say “skipping ECTR check,” but the code throws immediately, which aborts archival. Please align the message with actual behavior to avoid operator confusion.

Suggested text fix
-            log.warn("Failed to read clean metadata for {}, skipping ECTR check", lastCleanInstant.get(), e);
-            throw new HoodieIOException("Failed to read clean metadata for " + lastCleanInstant.get() + ", skipping ECTR check", e);
+            log.warn("Failed to read clean metadata for {}, aborting archival due to ECTR guard", lastCleanInstant.get(), e);
+            throw new HoodieIOException("Failed to read clean metadata for " + lastCleanInstant.get()
+                + ", aborting archival due to ECTR guard", e);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
}
} catch (IOException e) {
log.warn("Failed to read clean metadata for {}, aborting archival due to ECTR guard", lastCleanInstant.get(), e);
throw new HoodieIOException("Failed to read clean metadata for " + lastCleanInstant.get()
", aborting archival due to ECTR guard", e);
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/timeline/versioning/v1/TimelineArchiverV1.java`
around lines 291 - 294, The catch block in TimelineArchiverV1 currently logs and
throws with a message that says "skipping ECTR check" which is inaccurate
because the code immediately throws and aborts archival; update both the
log.warn call and the HoodieIOException message in the catch(IOException e)
handler so the text reflects that the archival is being aborted (e.g., "aborting
archival due to failure reading clean metadata for <instant>") instead of saying
it will skip the ECTR check, leaving the exception chaining intact.

CodeRabbit (original) (source:comment#3055869924)

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Style & Readability Review — a few readability and code quality suggestions: consolidate the null/empty check, clarify a comment, avoid duplicating filter logic, and extract repeated test setup.

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for contributing! The approach to block archival based on ECTR from the last clean is sound and addresses a real data-leak scenario. A couple of items to double-check in the inline comments.

@nsivabalan
Copy link
Copy Markdown
Contributor Author

@yihua : can you help land this once you are ok. if there are minor cosmetic suggestions, free free to address them and land them.

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for addressing the feedback! The contradictory "skipping ECTR check" wording has been removed from both the log message and the exception, which resolves the prior finding cleanly. The fail-closed behavior (throwing on read failure) is now correctly reflected in the messaging.

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.66667% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.85%. Comparing base (2f07364) to head (c33f510).
⚠️ Report is 62 commits behind head on master.

Files with missing lines Patch % Lines
...ent/timeline/versioning/v1/TimelineArchiverV1.java 80.95% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18380      +/-   ##
============================================
+ Coverage     68.36%   68.85%   +0.48%     
- Complexity    27566    28223     +657     
============================================
  Files          2432     2460      +28     
  Lines        133175   135290    +2115     
  Branches      16023    16399     +376     
============================================
+ Hits          91047    93148    +2101     
+ Misses        35068    34764     -304     
- Partials       7060     7378     +318     
Flag Coverage Δ
common-and-other-modules 44.59% <40.00%> (+0.27%) ⬆️
hadoop-mr-java-client 44.86% <23.33%> (-0.29%) ⬇️
spark-client-hadoop-common 48.61% <86.66%> (+0.04%) ⬆️
spark-java-tests 48.92% <40.00%> (+0.22%) ⬆️
spark-scala-tests 45.50% <23.33%> (+0.11%) ⬆️
utilities 38.23% <40.00%> (-0.30%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...a/org/apache/hudi/config/HoodieArchivalConfig.java 89.88% <100.00%> (+0.99%) ⬆️
...java/org/apache/hudi/config/HoodieWriteConfig.java 89.94% <100.00%> (+0.09%) ⬆️
...ent/timeline/versioning/v1/TimelineArchiverV1.java 79.91% <80.95%> (+0.10%) ⬆️

... and 206 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@hudi-bot
Copy link
Copy Markdown
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Copy link
Copy Markdown
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Nice updates — the test assertions are now much more precise and exhaustive, replacing vague "some commits were archived" checks with exact verification of which commits remain and which were archived. All prior findings from my previous review were already addressed (the contradictory log/exception messages were fixed in the earlier round), and no new issues are introduced by these test-only changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L PR with lines of changes in (300, 1000]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants