fix: prevent heartbeat timer from being permanently killed by slow or delayed heartbeats by prashantwason · Pull Request #18904 · apache/hudi

prashantwason · 2026-06-02T21:15:39Z

Describe the issue this Pull Request addresses

HoodieHeartbeatClient can permanently stop generating heartbeats for an in-flight instant, causing a later commit to abort with HoodieException: Heartbeat for instant <t> has expired even though the writer is still alive. Two independent causes, both in updateHeartbeat():

The heartbeat file is written synchronously on the Timer thread. Since the timer uses scheduleAtFixedRate, a slow or hung storage write blocks the thread and freezes all subsequent heartbeats for the instant.
When a refresh is delayed past the tolerable interval, updateHeartbeat() calls Thread.currentThread().interrupt(), which permanently kills the timer thread — turning a transient delay (GC pause, driver stall, single slow write) into a permanent blackout.

Summary and Changelog

Perform the heartbeat file write on a bounded daemon executor and wait with a timeout (Future.get(heartbeatWriteTimeoutMs)), so a slow or hung storage call can no longer block the timer thread. The write timeout is one heartbeat interval; a timed-out write does not advance the last-heartbeat time and is retried on the next tick. A cached thread pool is used so that if one write hangs, subsequent ticks proceed on a fresh thread.
Remove the self-interrupt in updateHeartbeat(). Instead of Thread.currentThread().interrupt(), log a warning and continue refreshing. The commit-time check HeartbeatUtils.abortIfHeartbeatExpired() remains the sole enforcement point for staleness.
Shut the executor down in close().
Add TestHoodieHeartbeatClient.testTimerSurvivesHungHeartbeatWrite, which blocks the first heartbeat write and asserts the timer keeps generating heartbeats (covering both fixes).
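
The write path described in the changelog above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the class `BoundedHeartbeatWriteSketch`, the `tick` method, and its boolean return value are hypothetical, and the storage call is passed in as a `Callable` so the sketch has no Hudi dependency.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch; names are illustrative and not taken from the PR diff. */
final class BoundedHeartbeatWriteSketch {
  // Cached pool of daemon threads: a hung write ties up one thread,
  // while later ticks are served by a fresh one.
  private final ExecutorService writeExecutor = Executors.newCachedThreadPool(r -> {
    Thread t = new Thread(r, "heartbeat-writer");
    t.setDaemon(true);
    return t;
  });
  private final long writeTimeoutMs;
  private volatile Long lastHeartbeatTime; // stays null until a write succeeds

  BoundedHeartbeatWriteSketch(long writeTimeoutMs) {
    this.writeTimeoutMs = writeTimeoutMs;
  }

  /** Runs one tick; returns true only if the write finished within the timeout. */
  boolean tick(Callable<Void> storageWrite) throws InterruptedException {
    long newHeartbeatTime = System.currentTimeMillis();
    Future<Void> pending = writeExecutor.submit(storageWrite);
    try {
      pending.get(writeTimeoutMs, TimeUnit.MILLISECONDS);
      lastHeartbeatTime = newHeartbeatTime; // advance only after a completed write
      return true;
    } catch (TimeoutException te) {
      pending.cancel(true); // best effort; the next tick retries
      return false;
    } catch (ExecutionException ee) {
      return false; // write failed; last-heartbeat time is left untouched
    }
  }

  Long getLastHeartbeatTime() {
    return lastHeartbeatTime;
  }

  void close() {
    writeExecutor.shutdownNow();
  }

  public static void main(String[] args) throws Exception {
    BoundedHeartbeatWriteSketch client = new BoundedHeartbeatWriteSketch(200);
    CountDownLatch never = new CountDownLatch(1);
    // First write hangs until cancelled; second write returns immediately.
    boolean first = client.tick(() -> { never.await(); return null; });
    boolean second = client.tick(() -> null);
    System.out.println("first=" + first + ", second=" + second);
    client.close();
  }
}
```

The point of the sketch is that the caller of `tick` (the timer thread in the real client) waits at most `writeTimeoutMs` per tick, so one hung write costs a single missed heartbeat rather than all subsequent ones.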

Impact

No public API or config change. Heartbeat refresh becomes resilient to transient storage latency and driver pauses: a transient stall no longer permanently disables heartbeats for an instant. Staleness is still enforced at commit time, so correctness of the concurrency guard is unchanged.

Risk Level

low

Behavior change is confined to HoodieHeartbeatClient. Existing TestHoodieHeartbeatClient tests pass and a new regression test was added.

Documentation Update

none

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

… delayed heartbeats

HoodieHeartbeatClient could permanently stop generating heartbeats for an instant, causing later commits to abort with "Heartbeat for instant ... has expired" even though the writer was still alive:

- The heartbeat file is written synchronously on the Timer thread. Because the timer uses scheduleAtFixedRate, a slow or hung storage write blocks the thread and freezes all subsequent heartbeats for that instant.
- When a heartbeat refresh is delayed past the tolerable interval, updateHeartbeat() called Thread.currentThread().interrupt(), permanently killing the timer thread and turning a transient delay into a permanent blackout.

Fix:

- Perform the heartbeat file write on a bounded daemon executor (Future.get with a per-interval timeout) so a slow or hung storage call cannot block the timer thread; a timed-out write is retried on the next tick.
- Remove the self-interrupt; log a warning and continue. The commit-time check HeartbeatUtils.abortIfHeartbeatExpired() remains the sole enforcement point.

Add TestHoodieHeartbeatClient.testTimerSurvivesHungHeartbeatWrite.

codecov-commenter · 2026-06-02T22:50:36Z

Codecov Report

❌ Patch coverage is 71.42857% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.81%. Comparing base (b7adecc) to head (73d0fb9).

Files with missing lines	Patch %	Lines
...e/hudi/client/heartbeat/HoodieHeartbeatClient.java	71.42%	9 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##             master   #18904   +/-   ##
=========================================
  Coverage     68.81%   68.81%           
- Complexity    29160    29178   +18     
=========================================
  Files          2520     2520           
  Lines        140056   140086   +30     
  Branches      17209    17209           
=========================================
+ Hits          96373    96397   +24     
- Misses        35909    35912    +3     
- Partials       7774     7777    +3

Flag	Coverage Δ
common-and-other-modules	`44.32% <71.42%> (-0.01%)`	⬇️
hadoop-mr-java-client	`44.85% <51.42%> (-0.04%)`	⬇️
spark-client-hadoop-common	`48.17% <51.42%> (+0.01%)`	⬆️
spark-java-tests	`49.35% <51.42%> (-0.01%)`	⬇️
spark-scala-tests	`45.25% <51.42%> (+<0.01%)`	⬆️
utilities	`37.38% <51.42%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines	Coverage Δ
...e/hudi/client/heartbeat/HoodieHeartbeatClient.java	`81.81% <71.42%> (-5.69%)`	⬇️

... and 16 files with indirect coverage changes

hudi-bot · 2026-06-02T22:53:38Z

CI report:

73d0fb9 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

hudi-agent

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR addresses two real issues with the heartbeat client: (a) synchronous storage writes blocking the timer thread, and (b) the self-interrupt permanently killing the timer. The new design (an executor with a bounded timeout, plus removal of the self-interrupt) is reasonable. One edge case, raised in the inline comments, is worth double-checking: what happens when the very first heartbeat write times out during start(). There is also a minor type inconsistency in the new field declaration. Please take a look; after that, this should be ready for a Hudi committer or PMC member to take it from here.

hudi-agent · 2026-06-03T00:51:12Z

      heartbeat.setInstantTime(instantTime);
      heartbeat.setLastHeartbeatTime(newHeartbeatTime);
      heartbeat.setNumHeartbeats(heartbeat.getNumHeartbeats() + 1);
+    } catch (TimeoutException te) {


🤖 If the very first writeHeartbeatFile() call (invoked synchronously from start()) hits this timeout path, heartbeat.setLastHeartbeatTime(...) is never reached, so the Heartbeat's lastHeartbeatTime stays null. A subsequent call to isHeartbeatExpired() (e.g., from HeartbeatUtils.abortIfHeartbeatExpired() during a fast write-then-commit) would NPE at line 189 (currentTime - getLastHeartbeatTime() unboxes a null Long). Could lastHeartbeatTime be initialized conservatively here (or in start() before scheduling), or isHeartbeatExpired guarded against the null case, to close this corner?

AI-generated; verify before applying. React 👍/👎 to flag quality.
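
One way to close the corner described in this comment is a null-safe expiry check that falls back to a start timestamp when no write has succeeded yet. The sketch below is illustrative only: `HeartbeatStateSketch`, `startedAtMs`, and `recordSuccessfulWrite` are hypothetical names and do not mirror the real `Heartbeat` class in Hudi.

```java
/** Hypothetical stand-in for per-instant heartbeat state; not the real Hudi class. */
final class HeartbeatStateSketch {
  private final long startedAtMs;          // recorded in start(), before the first write is attempted
  private volatile Long lastHeartbeatTime; // null until the first write succeeds

  HeartbeatStateSketch(long startedAtMs) {
    this.startedAtMs = startedAtMs;
  }

  void recordSuccessfulWrite(long timeMs) {
    this.lastHeartbeatTime = timeMs;
  }

  /** Null-safe expiry check: uses the start time when no write has succeeded yet. */
  boolean isExpired(long nowMs, long maxAllowableIntervalMs) {
    Long last = lastHeartbeatTime; // read once so the null check and the use see the same value
    long reference = (last != null) ? last : startedAtMs;
    return nowMs - reference > maxAllowableIntervalMs;
  }
}
```

With this shape, a first write that times out leaves the heartbeat "not yet expired" until the allowable interval has elapsed since start, instead of unboxing a null `Long`. Whether that is the right policy for the real client is a design decision for the PR author.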

hudi-agent · 2026-06-03T00:51:12Z

+  // Maximum time the timer thread will wait for a single heartbeat file write to complete before
+  // abandoning it and letting the next tick retry. Bounded to one interval so that a slow/hung
+  // storage write cannot block the timer thread (and thus freeze all subsequent heartbeats).
+  private final long heartbeatWriteTimeoutMs;


🤖 nit: the two sibling duration fields (heartbeatIntervalInMs, maxAllowableHeartbeatIntervalInMs) are boxed Long — could you align heartbeatWriteTimeoutMs to Long as well to keep the field group consistent?

AI-generated; verify before applying. React 👍/👎 to flag quality.

danny0405 · 2026-06-03T04:16:58Z

  public void close() {
    this.stopHeartbeatTimers();
    this.instantToHeartbeatMap.clear();
+    synchronized (this) {


Do we have concurrent close() calls on this client? Should we add an isClosed flag to avoid repeated closes?
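
A guard of the kind suggested here could look roughly like the following. It is a sketch under assumptions: `IdempotentCloseSketch` and the `shutdownCalls` counter are hypothetical and exist only for demonstration; the real close() would also stop the heartbeat timers and clear the instant map.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of an idempotent, thread-safe close(). */
final class IdempotentCloseSketch implements AutoCloseable {
  private final AtomicBoolean closed = new AtomicBoolean(false);
  private final ExecutorService writeExecutor = Executors.newCachedThreadPool();
  final AtomicInteger shutdownCalls = new AtomicInteger(); // demonstration only

  @Override
  public void close() {
    // Only the first caller flips the flag; later or concurrent callers return immediately.
    if (!closed.compareAndSet(false, true)) {
      return;
    }
    // The real client would stop timers and clear its heartbeat map here.
    writeExecutor.shutdownNow();
    shutdownCalls.incrementAndGet();
  }

  boolean isClosed() {
    return closed.get();
  }
}
```

Using `compareAndSet` rather than a plain boolean avoids the need for a `synchronized` block around the flag check itself.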

danny0405 · 2026-06-03T04:19:32Z

+        // future heartbeats for this instant, turning a transient delay into a permanent blackout.
+        // Enforcement is done at commit time in HeartbeatUtils.abortIfHeartbeatExpired(), which is the
+        // correct and sole enforcement point.
+        log.warn("Missed generating heartbeat for instant {} within allowable interval {} ms; continuing to refresh",


This makes sense to some extent, but I do see a correctness risk: when the failed-writes rollback strategy is configured as LAZY, the async cleaner could roll back the current instant by removing some data files (without yet removing the metadata files on the timeline). If the writer then goes on to commit, the commit suffers data loss.

Should we also increase the tolerable miss count? The current default is 2; should we change it to 10 or 20?
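
To make the proposal concrete, the arithmetic below compares the resulting expiry windows. It rests on two assumptions that are not stated in this thread: that the expiry window is the heartbeat interval multiplied by the tolerable miss count, and an illustrative interval of 60 seconds. `HeartbeatWindowSketch` and `expiryWindowMs` are hypothetical names.

```java
/** Hypothetical arithmetic sketch; the window formula is an assumption, not confirmed by the PR. */
final class HeartbeatWindowSketch {
  /** Assumed relation: window = interval * tolerable misses. */
  static long expiryWindowMs(long intervalMs, int tolerableMisses) {
    return Math.multiplyExact(intervalMs, (long) tolerableMisses);
  }

  public static void main(String[] args) {
    long intervalMs = 60_000L; // illustrative value only
    for (int misses : new int[] {2, 10, 20}) {
      System.out.println(misses + " misses -> " + expiryWindowMs(intervalMs, misses) / 1000 + " s");
    }
  }
}
```

Under these assumptions, moving from 2 to 10 or 20 misses widens the window from 2 minutes to 10 or 20 minutes, which delays lazy rollback of genuinely dead writers by the same amount.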

github-actions Bot added the size:M PR with lines of changes in (100, 300] label Jun 2, 2026

hudi-agent reviewed Jun 3, 2026

danny0405 reviewed Jun 3, 2026

prashantwason wants to merge 1 commit into apache:master from prashantwason:heartbeat-timer-resilience

5 participants
