fix: Report accurate health for federated workloads #127
Open
scotwells wants to merge 3 commits into
WorkloadDeployment reconciliation read instance.Status.Controller.ObservedTemplateHash to count current replicas without checking that Status.Controller (a nilable pointer) was populated. Infra providers set that field independently of the Programmed condition, so an instance could report Programmed=True with Status.Controller still nil. The dereference panicked, and because it ran before the status write, the deployment's status froze at its create-time values and the reconcile hot-looped, surfacing as a permanently Unavailable workload with 0 ready replicas even though its instances were running.

Guard the dereference so reconciliation completes and the deployment reports accurate readiness.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
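
A minimal sketch of the kind of guard this commit describes, not the actual change. The types below are simplified, hypothetical stand-ins for the real API types (the `Programmed` condition is reduced to a bool), and `countCurrentReplicas` is an illustrative name rather than a function from the codebase.

```go
package main

import "fmt"

// ControllerStatus is a hypothetical stand-in for the controller status
// that infra providers populate on an Instance.
type ControllerStatus struct {
	ObservedTemplateHash string
}

// InstanceStatus is simplified: Programmed stands in for the Programmed
// condition, and Controller is nilable because providers may set it later.
type InstanceStatus struct {
	Programmed bool
	Controller *ControllerStatus
}

type Instance struct {
	Name   string
	Status InstanceStatus
}

// countCurrentReplicas counts programmed instances whose observed template
// hash matches the desired hash. An instance whose Status.Controller is
// still nil is skipped instead of being dereferenced.
func countCurrentReplicas(instances []Instance, desiredHash string) int {
	current := 0
	for _, inst := range instances {
		if !inst.Status.Programmed {
			continue
		}
		// Guard: Programmed=True does not imply Status.Controller is set.
		if inst.Status.Controller == nil {
			continue
		}
		if inst.Status.Controller.ObservedTemplateHash == desiredHash {
			current++
		}
	}
	return current
}

func main() {
	instances := []Instance{
		{Name: "a", Status: InstanceStatus{Programmed: true, Controller: &ControllerStatus{ObservedTemplateHash: "abc"}}},
		{Name: "b", Status: InstanceStatus{Programmed: true}}, // Controller not yet populated
	}
	fmt.Println("current replicas:", countCurrentReplicas(instances, "abc"))
}
```

Skipping the unpopulated instance keeps the loop running to completion, so the status write that follows it can still happen.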
Instance projections copied the workload-deployment-uid label verbatim from the cell-plane Instance, where it carries the cell/Karmada WorkloadDeployment UID. That value never matches the project-plane WorkloadDeployment UID, so label-selector lookups in the project cluster (e.g. the CLI CITY column) failed to resolve.

Overwrite the label on the projection with the owning project-side WorkloadDeployment's UID, which the projector already resolves for the owner reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
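
The overwrite described above can be sketched as follows. This is an illustration under stated assumptions: the bare label key `workload-deployment-uid` is used without whatever prefix the real label carries, and `projectLabels` is a hypothetical helper, not the projector's actual function.

```go
package main

import "fmt"

// workloadDeploymentUIDLabel is an assumed key; the real label may carry
// a domain prefix that is not shown in this PR.
const workloadDeploymentUIDLabel = "workload-deployment-uid"

// projectLabels copies the cell-plane labels into a new map and then
// overwrites the workload-deployment-uid label with the UID of the owning
// project-side WorkloadDeployment. The input map is left unmodified.
func projectLabels(cellLabels map[string]string, projectDeploymentUID string) map[string]string {
	out := make(map[string]string, len(cellLabels)+1)
	for k, v := range cellLabels {
		out[k] = v
	}
	out[workloadDeploymentUIDLabel] = projectDeploymentUID
	return out
}

func main() {
	cell := map[string]string{
		workloadDeploymentUIDLabel: "cell-uid-123",
		"app":                      "demo",
	}
	projected := projectLabels(cell, "project-uid-456")
	fmt.Println(projected[workloadDeploymentUIDLabel], projected["app"])
}
```

Copying into a fresh map avoids mutating the cell-plane object's labels while the projection is being built.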
Federated WorkloadDeployment status was aggregated onto the Karmada hub object but never propagated back to the project-namespace deployment that users and the Workload controller read. The federator only watched the project-side object, so its downstream status sync never fired reactively, leaving the project deployment's status empty and the parent Workload stuck at 0/0 replicas.

Add a WorkloadDeploymentStatusSyncer on the management plane that watches Karmada WorkloadDeployments and writes their aggregated status back to the originating project deployment, mirroring how the InstanceProjector resolves the project cluster and namespace.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
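
A rough sketch of the write-back step only, assuming the hub and project objects have already been resolved. The status fields and the `syncStatus` helper are hypothetical; the real syncer also has to watch Karmada objects and locate the project cluster and namespace, which is not modeled here.

```go
package main

import "fmt"

// DeploymentStatus is a hypothetical, flattened view of the aggregated
// status carried on a WorkloadDeployment.
type DeploymentStatus struct {
	Replicas        int32
	ReadyReplicas   int32
	CurrentReplicas int32
}

type WorkloadDeployment struct {
	Namespace string
	Name      string
	Status    DeploymentStatus
}

// syncStatus copies the hub deployment's aggregated status onto the project
// deployment. It returns true when the project status changed, so a caller
// can skip the status update when nothing differs.
func syncStatus(hub, project *WorkloadDeployment) bool {
	if project.Status == hub.Status {
		return false
	}
	project.Status = hub.Status
	return true
}

func main() {
	hub := &WorkloadDeployment{Name: "web", Status: DeploymentStatus{Replicas: 1, ReadyReplicas: 1, CurrentReplicas: 1}}
	project := &WorkloadDeployment{Namespace: "default", Name: "web"}

	fmt.Println("changed:", syncStatus(hub, project))
	fmt.Println("ready replicas:", project.Status.ReadyReplicas)
	fmt.Println("changed again:", syncStatus(hub, project))
}
```

Returning whether anything changed lets a reconciler avoid issuing a write on every hub event when the two statuses already agree.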
ecv approved these changes on May 30, 2026
ecv left a comment:
Do we need these comments? Genuine question, you know what I think but that's not what's important necessarily.
Summary
Federated workloads were reported as `Unavailable` with `0/0` ready replicas even when all of their instances were running: `datumctl compute workloads` showed every workload unhealthy while `datumctl compute instances` showed them `Running`. This PR makes federated workload health and readiness reflect reality across the CLI, API, and `Workload` status.

Three independent defects in the status pipeline (Instance → WorkloadDeployment → Workload, spanning the cell, Karmada, and project planes) combined to hide healthy instances:

- WorkloadDeployment reconciliation dereferenced an unpopulated field (`Status.Controller`, a nilable pointer) while counting replicas. Infra providers set that field independently of the `Programmed` condition, so a programmed-but-not-yet-populated instance crashed the reconcile before it wrote status, freezing the deployment at `0` ready replicas and hot-looping. Now guarded.
- Federated status was aggregated onto the Karmada hub deployment but never propagated back to the project deployment. A new `WorkloadDeploymentStatusSyncer` on the management plane watches hub deployments and writes their status back to the originating project deployment.
- Instance projections copied the cell-plane `workload-deployment-uid` label, which never matches the project-plane UID, breaking label-selector lookups (the CLI CITY column). Projections now stamp the project-side UID.

With all three fixed, a running federated instance flows readiness up the full chain: the cell deployment counts it, Karmada aggregates it, the syncer carries it to the project deployment, and the `Workload` reports `Available` / `1/1`.

Test plan

- `make test` (envtest) passes, including the updated instance-projector test asserting the project-side UID label
- `reconcileInstanceGates` counts readiness without panicking when an instance is `Programmed=True` with `Status.Controller == nil`
- A running federated workload reports `Available` / `1/1` in `datumctl compute workloads`
- `datumctl compute instances` resolves the CITY column for projected instances

Notes for reviewers

- `WorkloadDeploymentStatusSyncer` runs only under `--enable-management-controllers` and must be rolled out to the management-plane controller for project-side status to populate.
- The nil guard matters for providers that report `Programmed=True` before populating `Status.Controller` (the Unikraft provider today). A complementary provider-side PR reports the template hash so `currentReplicas` tracks rollouts.

This PR is stacked on `feat/federated-deployment-scheduling` (#107) and targets that branch, since the changes depend on federation-only code not yet in `main`.

🤖 Generated with Claude Code