fix: Report accurate health for federated workloads by scotwells · Pull Request #127 · datum-cloud/compute

scotwells · 2026-05-29T22:54:21Z

Summary

Federated workloads were reported as Unavailable with 0/0 ready replicas even when all of their instances were running: datumctl compute workloads showed every workload as unhealthy while datumctl compute instances showed the same instances as Running. This PR makes federated workload health and readiness reflect reality across the CLI, API, and Workload status.

Three independent defects in the status pipeline (Instance → WorkloadDeployment → Workload, spanning the cell, Karmada, and project planes) combined to hide healthy instances:

1. A reconcile panic froze deployment status. The cell reconciler dereferenced an instance's controller status (Status.Controller, a nilable pointer) while counting replicas. Infra providers set that field independently of the Programmed condition, so a programmed-but-not-yet-populated instance crashed the reconcile before it wrote status, freezing the deployment at 0 ready replicas and hot-looping. The dereference is now guarded.
2. Status never reached the project plane. Aggregated deployment status lived on the Karmada hub object but was never synced back to the project-namespace deployment that users read. A new WorkloadDeploymentStatusSyncer on the management plane watches hub deployments and writes their status back to the originating project deployment.
3. The CLI city lookup failed. Instance projections carried the cell-plane deployment UID in the workload-deployment-uid label, which never matches the project-plane UID, so label-selector lookups (such as the one behind the CLI CITY column) returned nothing. Projections now stamp the project-side UID.

With all three fixed, a running federated instance flows readiness up the full chain: the cell deployment counts it, Karmada aggregates it, the syncer carries it to the project deployment, and the Workload reports Available / 1/1.

Test plan

make test (envtest) passes, including the updated instance-projector test asserting the project-side UID label
Add regression test: reconcileInstanceGates counts readiness without panicking when an instance is Programmed=True with Status.Controller == nil
In a federated lab (cell + Karmada + project), a workload with a running instance reports Available / 1/1 in datumctl compute workloads
datumctl compute instances resolves the CITY column for projected instances

Notes for reviewers

WorkloadDeploymentStatusSyncer runs only under --enable-management-controllers and must be rolled out to the management-plane controller for project-side status to populate.
The panic guard is load-bearing for any provider that sets Programmed=True before populating Status.Controller (the Unikraft provider today). A complementary provider-side PR reports the template hash so currentReplicas tracks rollouts.

This PR is stacked on feat/federated-deployment-scheduling (#107) and targets that branch, since the changes depend on federation-only code not yet in main.

🤖 Generated with Claude Code

WorkloadDeployment reconciliation read instance.Status.Controller.ObservedTemplateHash to count current replicas without checking that Status.Controller (a nilable pointer) was populated. Infra providers set that field independently of the Programmed condition, so an instance could report Programmed=True with Status.Controller still nil. The dereference panicked, and because it ran before the status write, the deployment's status froze at its create-time values and the reconcile hot-looped -- surfacing as a permanently Unavailable workload with 0 ready replicas even though its instances were running. Guard the dereference so reconciliation completes and the deployment reports accurate readiness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
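The guard described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the types are trimmed-down hypothetical stand-ins, the Programmed condition is reduced to a boolean, and countReplicas is an invented helper name. Only the Status.Controller and ObservedTemplateHash field names come from the commit message.

```go
package main

import "fmt"

// ControllerStatus is a hypothetical stand-in for the real controller status type.
type ControllerStatus struct {
	ObservedTemplateHash string
}

// InstanceStatus is simplified: Programmed stands in for the Programmed
// condition, and Controller stays nil until the provider populates it.
type InstanceStatus struct {
	Programmed bool
	Controller *ControllerStatus
}

// Instance is a hypothetical, minimal instance type.
type Instance struct {
	Name   string
	Status InstanceStatus
}

// countReplicas (hypothetical helper) counts ready and current replicas.
// An instance is treated as ready when it is programmed, and as current
// only when its controller status exists and reports the wanted hash.
func countReplicas(instances []Instance, templateHash string) (ready, current int) {
	for _, inst := range instances {
		if inst.Status.Programmed {
			ready++
		}
		// Guard: check the pointer before reading through it, so a
		// programmed-but-not-yet-populated instance cannot panic here.
		if c := inst.Status.Controller; c != nil && c.ObservedTemplateHash == templateHash {
			current++
		}
	}
	return ready, current
}

func main() {
	instances := []Instance{
		{Name: "a", Status: InstanceStatus{Programmed: true}}, // Controller is still nil
		{Name: "b", Status: InstanceStatus{
			Programmed: true,
			Controller: &ControllerStatus{ObservedTemplateHash: "abc"},
		}},
	}
	ready, current := countReplicas(instances, "abc")
	fmt.Printf("ready=%d current=%d\n", ready, current) // prints "ready=2 current=1"
}
```

Under these assumptions the nil-controller instance still counts toward readiness but not toward current replicas, which matches the behavior the commit message describes: reconciliation completes and the status write is reached.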

Instance projections copied the workload-deployment-uid label verbatim from the cell-plane Instance, where it carries the cell/Karmada WorkloadDeployment UID. That value never matches the project-plane WorkloadDeployment UID, so label-selector lookups in the project cluster (e.g. the CLI CITY column) failed to resolve. Overwrite the label on the projection with the owning project-side WorkloadDeployment's UID, which the projector already resolves for the owner reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
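A rough sketch of the label overwrite, assuming labels are handled as a plain map[string]string. The projectLabels helper and the UID values are hypothetical, and the label key is written exactly as it appears in the commit message (the real key may carry a domain prefix that is not shown here).

```go
package main

import "fmt"

// deploymentUIDLabel uses the key as named in the commit message; any
// domain prefix the real label has is an unknown and is omitted.
const deploymentUIDLabel = "workload-deployment-uid"

// projectLabels (hypothetical helper) copies the cell-plane labels onto the
// projection and overwrites the deployment UID label with the project-side
// UID, leaving the source map untouched.
func projectLabels(cellLabels map[string]string, projectUID string) map[string]string {
	out := make(map[string]string, len(cellLabels)+1)
	for k, v := range cellLabels {
		out[k] = v
	}
	out[deploymentUIDLabel] = projectUID
	return out
}

func main() {
	cell := map[string]string{
		"app":              "web",
		deploymentUIDLabel: "cell-uid-123", // cell/Karmada UID, never matches project side
	}
	projected := projectLabels(cell, "project-uid-456")

	fmt.Println(projected[deploymentUIDLabel]) // prints "project-uid-456"
	fmt.Println(cell[deploymentUIDLabel])      // prints "cell-uid-123"
}
```

Copying into a fresh map rather than mutating the input keeps the cell-plane object unchanged, so a selector on the project-side UID matches the projection while the cell object keeps its own UID.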

Federated WorkloadDeployment status was aggregated onto the Karmada hub object but never propagated back to the project-namespace deployment that users and the Workload controller read. The federator only watched the project-side object, so its downstream status sync never fired reactively, leaving the project deployment's status empty and the parent Workload stuck at 0/0 replicas. Add a WorkloadDeploymentStatusSyncer on the management plane that watches Karmada WorkloadDeployments and writes their aggregated status back to the originating project deployment, mirroring how the InstanceProjector resolves the project cluster and namespace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
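The write-back step of such a syncer might look like the sketch below. It deliberately avoids any controller framework and uses an in-memory map in place of the project cluster's API. Every name here (Key, HubDeployment, ProjectStore, syncStatus) is hypothetical, and how the real syncer resolves the originating cluster and namespace is not shown in the PR text, so that lookup is reduced to a precomputed key.

```go
package main

import "fmt"

// DeploymentStatus is a hypothetical, minimal status shape.
type DeploymentStatus struct {
	Replicas      int
	ReadyReplicas int
}

// Key identifies a project-side deployment (assumed resolution result).
type Key struct {
	Cluster   string
	Namespace string
	Name      string
}

// HubDeployment stands in for the Karmada hub object with aggregated status.
type HubDeployment struct {
	Origin Key
	Status DeploymentStatus
}

// ProjectDeployment stands in for the project-namespace deployment users read.
type ProjectDeployment struct {
	Status DeploymentStatus
}

// ProjectStore is an in-memory substitute for the project cluster's API.
type ProjectStore map[Key]*ProjectDeployment

// syncStatus copies the hub's aggregated status onto the originating project
// deployment. It reports whether a write happened, so unchanged status does
// not trigger a redundant update.
func syncStatus(hub HubDeployment, store ProjectStore) (bool, error) {
	target, ok := store[hub.Origin]
	if !ok {
		return false, fmt.Errorf("project deployment %+v not found", hub.Origin)
	}
	if target.Status == hub.Status {
		return false, nil // already up to date
	}
	target.Status = hub.Status
	return true, nil
}

func main() {
	key := Key{Cluster: "project-a", Namespace: "default", Name: "web"}
	store := ProjectStore{key: &ProjectDeployment{}}
	hub := HubDeployment{Origin: key, Status: DeploymentStatus{Replicas: 1, ReadyReplicas: 1}}

	changed, err := syncStatus(hub, store)
	fmt.Println(changed, err)                    // prints "true <nil>"
	fmt.Println(store[key].Status.ReadyReplicas) // prints "1"
}
```

Skipping the write when the status is already equal is a design choice of this sketch, intended to keep a watch-driven loop from updating the project object on every hub event. Whether the real syncer does the same is not stated in the PR.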

ecv

Do we need these comments? Genuine question, you know what I think but that's not what's important necessarily.

scotwells and others added 3 commits May 29, 2026 17:54

scotwells requested review from a team and savme May 29, 2026 23:11

scotwells marked this pull request as ready for review May 29, 2026 23:11

ecv approved these changes May 30, 2026

scotwells wants to merge 3 commits into feat/federated-deployment-scheduling from fix/federated-workload-health
