fix: Report accurate health for federated workloads #127

Open
scotwells wants to merge 3 commits into feat/federated-deployment-scheduling from fix/federated-workload-health

Conversation

@scotwells
Contributor

Summary

Federated workloads were reported as Unavailable with 0/0 ready replicas even when all of their instances were running: datumctl compute workloads showed every workload as unhealthy, while datumctl compute instances showed the same instances as Running. This PR makes federated workload health and readiness reflect reality across the CLI, API, and Workload status.

Three independent defects in the status pipeline (Instance → WorkloadDeployment → Workload, spanning the cell, Karmada, and project planes) combined to hide healthy instances:

  • A reconcile panic froze deployment status. The cell reconciler dereferenced an instance's controller status (Status.Controller, a nilable pointer) while counting replicas. Infra providers set that field independently of the Programmed condition, so a programmed-but-not-yet-populated instance crashed the reconcile before it wrote status. That froze the deployment at 0 ready replicas and made the reconcile hot-loop. The dereference is now guarded.
  • Status never reached the project plane. Aggregated deployment status lived on the Karmada hub object but was never synced back to the project-namespace deployment users read. A new WorkloadDeploymentStatusSyncer on the management plane watches hub deployments and writes their status back to the originating project deployment.
  • The CLI city lookup failed. Instance projections carried the cell-plane deployment UID in the workload-deployment-uid label, which never matches the project-plane UID, breaking label-selector lookups (the CLI CITY column). Projections now stamp the project-side UID.

With all three fixed, a running federated instance flows readiness up the full chain: the cell deployment counts it, Karmada aggregates it, the syncer carries it to the project deployment, and the Workload reports Available / 1/1.

Test plan

  • make test (envtest) passes, including the updated instance-projector test asserting the project-side UID label
  • Add regression test: reconcileInstanceGates counts readiness without panicking when an instance is Programmed=True with Status.Controller == nil
  • In a federated lab (cell + Karmada + project), a workload with a running instance reports Available / 1/1 in datumctl compute workloads
  • datumctl compute instances resolves the CITY column for projected instances

Notes for reviewers

  • WorkloadDeploymentStatusSyncer runs only under --enable-management-controllers and must be rolled out to the management-plane controller for project-side status to populate.
  • The panic guard is load-bearing for any provider that sets Programmed=True before populating Status.Controller (the Unikraft provider today). A complementary provider-side PR reports the template hash so currentReplicas tracks rollouts.

This PR is stacked on feat/federated-deployment-scheduling (#107) and targets that branch, since the changes depend on federation-only code not yet in main.

🤖 Generated with Claude Code

scotwells and others added 3 commits May 29, 2026 17:54
WorkloadDeployment reconciliation read
instance.Status.Controller.ObservedTemplateHash to count current
replicas without checking that Status.Controller (a nilable pointer)
was populated. Infra providers set that field independently of the
Programmed condition, so an instance could report Programmed=True
with Status.Controller still nil. The dereference panicked, and
because it ran before the status write, the deployment's status
froze at its create-time values and the reconcile hot-looped --
surfacing as a permanently Unavailable workload with 0 ready
replicas even though its instances were running.

Guard the dereference so reconciliation completes and the deployment
reports accurate readiness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
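
The guard described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the diff: the types are simplified stand-ins for the real API types, countReplicas is a hypothetical helper name, and treating Programmed as the readiness signal is an assumption made only to keep the example small. The one detail taken from the commit message is that Status.Controller is a nilable pointer and ObservedTemplateHash is read from it.

```go
package main

import "fmt"

// ControllerStatus is a simplified stand-in for the real controller status type.
type ControllerStatus struct {
	ObservedTemplateHash string
}

// InstanceStatus is a simplified stand-in; Programmed replaces the real
// Programmed condition (assumption for illustration).
type InstanceStatus struct {
	Programmed bool
	Controller *ControllerStatus // nilable: providers populate it independently
}

// Instance is a simplified stand-in for the real Instance type.
type Instance struct {
	Name   string
	Status InstanceStatus
}

// replicaCounts is a hypothetical result type.
type replicaCounts struct {
	Ready   int
	Current int
}

// countReplicas (hypothetical name) tallies ready and current replicas.
// It never dereferences Status.Controller without a nil check, so an instance
// that is Programmed but has no controller status yet still counts as ready
// and simply does not count as current.
func countReplicas(instances []Instance, templateHash string) replicaCounts {
	var c replicaCounts
	for _, inst := range instances {
		if inst.Status.Programmed {
			c.Ready++
		}
		// Guarded dereference: skip the hash comparison when Controller is nil.
		if ctrl := inst.Status.Controller; ctrl != nil && ctrl.ObservedTemplateHash == templateHash {
			c.Current++
		}
	}
	return c
}

func main() {
	instances := []Instance{
		// Programmed, but the provider has not populated Status.Controller yet.
		{Name: "a", Status: InstanceStatus{Programmed: true}},
		// Programmed, with a controller status matching the current template.
		{Name: "b", Status: InstanceStatus{
			Programmed: true,
			Controller: &ControllerStatus{ObservedTemplateHash: "abc"},
		}},
	}
	fmt.Printf("%+v\n", countReplicas(instances, "abc"))
}
```

Because the count completes for every instance, the caller can go on to write status rather than aborting before the write.
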
Instance projections copied the workload-deployment-uid label
verbatim from the cell-plane Instance, where it carries the
cell/Karmada WorkloadDeployment UID. That value never matches the
project-plane WorkloadDeployment UID, so label-selector lookups in
the project cluster (e.g. the CLI CITY column) failed to resolve.

Overwrite the label on the projection with the owning project-side
WorkloadDeployment's UID, which the projector already resolves for
the owner reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
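
As a rough sketch of the label overwrite described in this commit: the projection copies the cell-plane labels, then replaces the deployment UID label with the project-side UID. The function name projectLabels is hypothetical, and the label key is written exactly as it appears in the text above; the real key may carry a prefix.

```go
package main

import "fmt"

// Label key as named in the PR text; the full key in the codebase may differ
// (assumption for illustration).
const workloadDeploymentUIDLabel = "workload-deployment-uid"

// projectLabels (hypothetical name) builds the label set for a projected
// Instance. It copies the cell-plane labels into a new map and overwrites the
// deployment UID label with the project-side WorkloadDeployment UID, so
// label-selector lookups in the project cluster match.
func projectLabels(cellLabels map[string]string, projectDeploymentUID string) map[string]string {
	out := make(map[string]string, len(cellLabels)+1)
	for k, v := range cellLabels {
		out[k] = v
	}
	out[workloadDeploymentUIDLabel] = projectDeploymentUID
	return out
}

func main() {
	cell := map[string]string{
		workloadDeploymentUIDLabel: "cell-uid-123", // cell/Karmada-side UID
		"example/other-label":      "kept",         // unrelated labels pass through
	}
	projected := projectLabels(cell, "project-uid-456")
	fmt.Println(projected[workloadDeploymentUIDLabel])
	fmt.Println(cell[workloadDeploymentUIDLabel]) // source labels are not mutated
}
```

Copying into a fresh map keeps the cell-plane object untouched, which matters if the source labels come from a shared cache.
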
Federated WorkloadDeployment status was aggregated onto the Karmada
hub object but never propagated back to the project-namespace
deployment that users and the Workload controller read. The
federator only watched the project-side object, so its downstream
status sync never fired reactively, leaving the project deployment's
status empty and the parent Workload stuck at 0/0 replicas.

Add a WorkloadDeploymentStatusSyncer on the management plane that
watches Karmada WorkloadDeployments and writes their aggregated
status back to the originating project deployment, mirroring how the
InstanceProjector resolves the project cluster and namespace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
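
The write-back step of the syncer can be pictured with an in-memory sketch like the one below. Everything here is hypothetical: the real controller watches Karmada objects and talks to the project cluster through Kubernetes clients, and the commit message does not say how the originating project deployment is resolved. The ProjectRef field, the ProjectStore map, and syncStatus are stand-ins that only show the shape of the logic: resolve the origin, compare, and write when the status differs.

```go
package main

import "fmt"

// DeploymentStatus is a simplified stand-in for WorkloadDeployment status.
type DeploymentStatus struct {
	Replicas      int
	ReadyReplicas int
}

// ProjectRef is a hypothetical pointer from a hub object back to the
// project-side deployment it was federated from.
type ProjectRef struct {
	Cluster   string
	Namespace string
	Name      string
}

// HubDeployment is a simplified stand-in for the Karmada hub object.
type HubDeployment struct {
	Origin ProjectRef
	Status DeploymentStatus
}

// ProjectStore is an in-memory stand-in for the project clusters.
type ProjectStore map[ProjectRef]*DeploymentStatus

// syncStatus (hypothetical name) copies the hub's aggregated status onto the
// originating project deployment. It reports whether a write happened, and
// skips the write when the status already matches.
func syncStatus(hub HubDeployment, store ProjectStore) (bool, error) {
	current, ok := store[hub.Origin]
	if !ok {
		return false, fmt.Errorf("project deployment %+v not found", hub.Origin)
	}
	if *current == hub.Status {
		return false, nil // already in sync, nothing to write
	}
	*current = hub.Status
	return true, nil
}

func main() {
	ref := ProjectRef{Cluster: "project-a", Namespace: "default", Name: "web"}
	store := ProjectStore{ref: &DeploymentStatus{}} // project status starts empty
	hub := HubDeployment{Origin: ref, Status: DeploymentStatus{Replicas: 1, ReadyReplicas: 1}}

	changed, err := syncStatus(hub, store)
	fmt.Println(changed, err, *store[ref])
}
```

Skipping the write when nothing changed keeps a watch-driven syncer from generating update traffic on every hub event.
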
@scotwells scotwells requested review from a team and savme May 29, 2026 23:11
@scotwells scotwells marked this pull request as ready for review May 29, 2026 23:11

@ecv ecv left a comment

Do we need these comments? Genuine question, you know what I think but that's not what's important necessarily.
