Skip to content

Worker Heartbeating#2818

Open
yuandrew wants to merge 7 commits intotemporalio:masterfrom
yuandrew:worker-heartbeat-final
Open

Worker Heartbeating#2818
yuandrew wants to merge 7 commits intotemporalio:masterfrom
yuandrew:worker-heartbeat-final

Conversation

@yuandrew
Copy link
Copy Markdown
Contributor

@yuandrew yuandrew commented Mar 27, 2026

What was changed

Implements periodic worker heartbeat RPC that reports worker status, slot usage, poller info, host metrics, and sticky cache counters to the Temporal server. Includes HeartbeatManager, PollerTracker, and integration tests covering all heartbeat fields.

  • HeartbeatManager — Per-namespace heartbeat scheduler. Workers register/unregister; scheduler fires at the configured interval. Gracefully shuts down if server returns UNIMPLEMENTED.
  • PollerTracker — Tracks in-flight poll count and last successful poll time per worker type. Only records success when a poll returns actual work.
  • WorkflowClientOptions.workerHeartbeatInterval — New option to configure heartbeat interval. Defaults to 60s. Can be set between 1-60s, or a negative duration to disable.
  • Worker.getActiveTaskQueueTypes() — Reports WORKFLOW, ACTIVITY, and NEXUS (only when Nexus services are registered, matching Go SDK).
  • Worker.buildHeartbeat() — Assembles the full WorkerHeartbeat proto with slot info, poller info, host metrics, sticky cache counters, and timestamps.
  • Shutdown: Each worker sends ShutdownWorkerRequest with a final SHUTTING_DOWN heartbeat and active task queue types.
  • NexusWorker task counters — Fixed totalProcessedTasks/totalFailedTasks to be properly incremented.
  • TrackingSlotSupplier.getSlotSupplierKind() — Reports FixedSize vs ResourceBased in heartbeats.

Why?

New feature!

Checklist

  1. Closes Worker Heartbeating #2716

  2. How was this tested:

  • WorkerHeartbeatIntegrationTest (12 tests): End-to-end against test server — basic fields, slot info, task counters, poller info, failure metrics, interval counter reset, in-flight slot tracking, sticky cache counters/misses, multiple workers, resource-based tuner, shutdown status, disabled heartbeats
  • HeartbeatManagerTest: Unit tests for scheduler lifecycle, UNIMPLEMENTED handling, exception resilience
  • PollerTrackerTest: Unit tests for poll tracking and snapshot generation
  • TrackingSlotSupplierKindTest: Slot supplier kind detection
  • WorkerHeartbeatDeploymentVersionTest: Deployment version in heartbeats
  • All existing tests pass (./gradlew :temporal-sdk:test + spotlessCheck)
  1. Any docs updates needed?

Note

Medium Risk
Touches core worker polling/shutdown paths and introduces periodic RPCs and new configuration; issues could impact worker performance or shutdown semantics, though behavior is capability-gated and can be disabled.

Overview
Adds periodic worker heartbeat reporting to Temporal Server, including a new HeartbeatManager scheduler that aggregates per-namespace heartbeats and disables itself on UNIMPLEMENTED.

Extends WorkflowClientOptions with experimental workerHeartbeatInterval (default 60s; negative disables) and wires heartbeating into WorkerFactory/Worker lifecycle: workers generate detailed WorkerHeartbeat payloads (status, active task queue types, slot usage, poller stats, sticky cache hit/miss counters, host metrics, plugin info, deployment version) and send a final shutting down heartbeat during shutdownWorker.

Refactors poller/task accounting to support heartbeats by introducing PollerTracker (in-flight polls + last successful poll time) and TaskCounter (processed/failed), updating poll tasks/workers to use these trackers, enhancing TrackingSlotSupplier with supplier-kind/used-slot introspection, and enabling CI’s dev server with frontend.ListWorkersEnabled=true for tests.

Written by Cursor Bugbot for commit cb9b3b7. This will update automatically on new commits. Configure here.

Implements periodic worker heartbeat RPCs that report worker status,
slot usage, poller info, and task counters to the server.

Key components:
- HeartbeatManager: per-namespace scheduler that aggregates heartbeats
  from all workers sharing that namespace
- PollerTracker: tracks in-flight poll count and last successful poll time
- WorkflowClientOptions.workerHeartbeatInterval: configurable interval
  (default 60s, range 1-60s, negative to disable)
- TrackingSlotSupplier: extended with slot type reporting
- Worker: builds SharedNamespaceWorker heartbeat data from activity,
  workflow, and nexus worker stats
- TestWorkflowService: implements recordWorkerHeartbeat, describeWorker,
  and shutdownWorker RPCs for testing
@yuandrew yuandrew force-pushed the worker-heartbeat-final branch from 9eeea9b to bfd3fc6 Compare March 27, 2026 19:41
@yuandrew yuandrew marked this pull request as ready for review March 27, 2026 19:58
@yuandrew yuandrew requested a review from a team as a code owner March 27, 2026 19:58
Comment on lines +51 to +52
private final AtomicInteger totalProcessedTasks = new AtomicInteger();
private final AtomicInteger totalFailedTasks = new AtomicInteger();
Copy link
Copy Markdown
Member

@Sushisource Sushisource Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the workers now have something like this - can we abstract these out into something else? Seems like we could use a shared interface or base class.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved into a new TaskCounter class

DescribeNamespaceRequest.newBuilder()
.setNamespace(workflowClient.getOptions().getNamespace())
.build());
boolean heartbeatsSupported =
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my PR I've bundled a bunch of capability stuff up into a class, FYI, this will want to go in there

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up, had Claude pull in your new class, so merge conflict should be minimal now. Feel free to merge first

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a review tho!

if (hasNexusServices) {
types.add(TaskQueueType.TASK_QUEUE_TYPE_NEXUS);
}
return types;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Active queue types reported inaccurately

Medium Severity

getActiveTaskQueueTypes() derives types from object presence (workflowWorker, activityWorker) instead of whether those workers are actually polling. This reports TASK_QUEUE_TYPE_WORKFLOW (and often TASK_QUEUE_TYPE_ACTIVITY) even when no corresponding implementations are registered, so ShutdownWorkerRequest.task_queue_types can be incorrect.

Fix in Cursor Fix in Web

private volatile boolean shuttingDown = false;
private boolean hasNexusServices = false;
private final String workerInstanceKey = UUID.randomUUID().toString();
private final Instant startTime = Instant.now();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heartbeat start time captured too early

Low Severity

startTime is initialized at Worker construction, not when the worker actually starts polling. If a factory creates workers long before start(), heartbeats publish an artificially old start_time, skewing uptime and lifecycle timing data.

Fix in Cursor Fix in Web

} catch (InterruptedException e) {
Thread.currentThread().interrupt();
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heartbeat thread may leak during shutdown

Medium Severity

SharedNamespaceWorker.shutdown() only calls scheduler.shutdown() and waits 5 seconds. If heartbeatTick() is blocked in recordWorkerHeartbeat, the running task is never interrupted, so the scheduler thread can remain alive after shutdown and keep resources pinned.

Additional Locations (1)
Fix in Cursor Fix in Web

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Fix All in Cursor

private final Instant startTime = Instant.now();
private final WorkflowClientOptions clientOptions;
private final @Nonnull WorkflowExecutorCache cache;
private final Map<String, TaskSnapshot> previousSnapshots = new HashMap<>();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Non-thread-safe HashMap used from heartbeat scheduler thread

Low Severity

previousSnapshots is a plain HashMap mutated by buildSlotsInfo from the heartbeat scheduler thread. While currently protected by callbackLock inside the callback closure, the buildSlotsInfo method itself has no synchronization and doesn't document this threading requirement. Using a ConcurrentHashMap or moving the snapshot state into the synchronized callback closure (alongside lastHeartbeatTime) would make the thread-safety guarantee self-evident and prevent future accidental unsynchronized access.

Additional Locations (1)
Fix in Cursor Fix in Web

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Worker Heartbeating

2 participants