feat: federated workload scheduling across POP cells by scotwells · Pull Request #116 · datum-cloud/compute

scotwells · 2026-05-28T16:37:01Z

Summary

Before this PR, a Workload could only run within a single control plane cell. There was no mechanism to schedule workloads across geographically distributed POP cells, no way to enforce resource quotas across cells, and no way for the management plane to aggregate status from instances running in different cells.

This PR delivers the full federated deployment scheduling pipeline:

Placement-driven scheduling — the WorkloadReconciler (running in the management cluster) reads a Workload's spec.placements[].cityCodes and creates a WorkloadDeployment for each city. Deployments are reconciled by the per-cell WorkloadDeploymentReconciler, which manages Instance lifecycle within the cell.
Federated write-back — the WorkloadDeploymentFederator pushes each WorkloadDeployment into a shared downstream control plane (Karmada) using namespace-scoped projections. Karmada propagates the deployment to the matching edge cluster via PropagationPolicy.
Instance projection — InstanceReconciler in each POP cell writes a copy of every Instance back to the downstream control plane. The InstanceProjector (management plane) reads these write-backs and creates read-only projections in the project namespace, so status from all cells is visible through a single API surface.
Quota enforcement per cell — each Instance creates a ResourceClaim routed to the Milo project control plane. Quota is evaluated before the scheduling gate is removed from the Instance spec, preventing over-provisioning across cells.
Location-aware admission — workload admission and scheduling now consult LocationBinding objects (project-scoped, created by the service catalog) rather than the global Location list. A project only sees locations that are both healthy and enabled for that project.
Webhook TLS via CSI — removed cert-manager Certificate resources. The webhook server mounts its TLS cert directly from a CSI volume, eliminating the cert-manager dependency for in-cluster issuance.
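
The placement-driven step above, combined with the LocationBinding filter, can be sketched roughly as follows. This is a minimal illustration, not the controller's code: the `Placement` and `PlannedDeployment` types, the `planDeployments` helper, the `LHR` and `XXX` city codes, and the deployment naming scheme are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Placement mirrors the shape of a spec.placements[] entry; only the fields
// needed for this sketch are included.
type Placement struct {
	Name      string
	CityCodes []string
}

// PlannedDeployment is a hypothetical stand-in for a WorkloadDeployment.
type PlannedDeployment struct {
	Name          string
	PlacementName string
	CityCode      string
}

// planDeployments returns one planned deployment per (placement, city code)
// pair, skipping city codes the project has no LocationBinding for.
func planDeployments(workload string, placements []Placement, bound map[string]bool) []PlannedDeployment {
	var out []PlannedDeployment
	for _, p := range placements {
		for _, city := range p.CityCodes {
			if !bound[city] {
				continue // not enabled for this project: no deployment is planned
			}
			out = append(out, PlannedDeployment{
				// The naming scheme is an assumption made for illustration.
				Name:          fmt.Sprintf("%s-%s-%s", workload, p.Name, strings.ToLower(city)),
				PlacementName: p.Name,
				CityCode:      city,
			})
		}
	}
	return out
}

func main() {
	placements := []Placement{{Name: "default", CityCodes: []string{"DFW", "LHR", "XXX"}}}
	bound := map[string]bool{"DFW": true, "LHR": true} // city codes with a LocationBinding
	for _, d := range planDeployments("my-api", placements, bound) {
		fmt.Println(d.Name, d.CityCode)
	}
}
```

In this sketch the unbound city code is skipped silently, which matches the test-plan item about placements referencing a city code outside the project's LocationBindings.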

Test plan

go test ./... passes locally
Chainsaw e2e tests: full-federation, instance-writeback, instance-projection, propagation-policy-lifecycle
Create a Workload with two city-code placements and confirm two WorkloadDeployment objects appear
Confirm PropagationPolicy is created in the downstream control plane for each city code
Confirm Instance write-backs appear in the downstream control plane and are projected back into the project namespace
Verify quota claim is created per Instance and the scheduling gate is removed once granted
Verify a placement referencing a city code not in the project's LocationBindings is skipped (no deployment created)
Verify management controllers do not run in edge-cell mode (--enable-management-controllers=false)

Breaking changes

WorkloadDeployment.spec.location field removed — location is now derived from spec.cityCode
Cert-manager Certificate resources for webhook TLS are removed; overlays must provide a CSI volume source instead (see config/overlays/)

Defines the Karmada-based federation architecture for compute workload scheduling. Covers control plane topology, resource locations, creation and deletion flows, instance visibility, operator changes, auto scaling model, and namespace mapping conventions.
Resolves #85
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Workloads targeting a city location are now automatically routed to the correct physical site via a Karmada-based federation layer. Each POP cell operates independently, instance health is surfaced back to the control plane in real time, and the platform remains available even when parts of the control plane are temporarily unreachable.
Controllers added:
- WorkloadDeploymentFederator: replicates WDs into Karmada and manages PropagationPolicies per city code
- InstanceProjector: mirrors Instance write-backs from Karmada into the project namespace on the control plane
ResourceInterpreterCustomization deployed at config time teaches Karmada how to aggregate replica counts and conditions across POP cells. Operator flags --enable-management-controllers and --enable-cell-controllers allow each deployment to opt into only the controllers it needs.
Includes a 6-test Chainsaw e2e suite covering federation, deletion cascade, propagation policy lifecycle, instance projection, instance write-back, and the full end-to-end chain.
Resolves #85
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…edge Introduces management-plane and cell overlay paths to the compute OCI artifact so the infra repo can deploy compute-manager in the correct mode for each tier of the federation architecture. The management-plane overlay deploys compute-manager with only WorkloadDeploymentFederator and InstanceProjector enabled, connected to the Karmada downstream control plane via projected ServiceAccount token auth. The cell overlay deploys compute-manager with only WorkloadDeploymentReconciler and InstanceReconciler enabled, with no downstream connection or webhook server. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ts for webhook TLS Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


Remove the hardcoded datum-control-plane ClusterIssuer from the csi-webhook-cert component. DNS names stay since they are fixed by the service name and namespace. Each consuming overlay now supplies the issuer via a strategic merge patch, allowing different environments to use different cert issuers without forking the component. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Each WorkloadDeployment is routed to exactly one cell cluster via its PropagationPolicy, so aggregation across multiple members is not needed. Replace the summing logic with a direct pass-through of the single member's status. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The cert issuer name is environment-specific configuration that belongs in the infra repo, not the compute overlay. The infra repo's base manager patch already owns the full webhook-server-tls volume definition including the issuer. Consumers deploying outside infra must patch the issuer in their own overlay. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…moval dev: inline self-signed Issuer + Certificate for host.docker.internal, replace kustomize replacements block with direct annotation patch, remove Certificate-patching from webhook_patch.yaml, and clear webhookServer secretRef from config.yaml. single-cluster: replace cert-manager Certificate approach with the csi-webhook-cert component, matching the main branch overlay. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The WorkloadReconciler watches networkingv1alpha.Network objects, which requires the network-services-operator CRDs to be installed. Cell clusters don't have those CRDs, causing the manager to crash on startup. Gate the WorkloadReconciler behind enableManagementControllers so it only runs where the Network CRDs are present. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extracts server config file reading and decoding into a dedicated loadServerConfig helper, reducing main's cyclomatic complexity from 31 to 29 to satisfy the gocyclo linter limit of 30. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Milo's authorization webhook uses Extra claims on the admission request (iam.miloapis.com/parent-name, iam.miloapis.com/parent-type, etc.) to resolve the correct project-scoped policy binding. Dropping them caused the SAR to return Allowed=false even for users with networks.use, because the authorizer couldn't locate the binding without the project context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

metricRules belongs under spec.quota, not spec.billing. The field is not declared in the ServiceBillingConfig schema, causing Flux dry-run failures in staging with: .spec.billing.metricRules: field not declared in schema

Previously, InstanceReconciler wrote ResourceClaim objects against the local deployment cluster via managementCluster.GetClient(). Those claims were never seen by the Milo quota system, leaving every Instance in QuotaGranted=Unknown indefinitely. This change routes claim creation and deletion to the correct Milo project control plane for each instance using a new ProjectQuotaClientManager that builds per-project REST clients by rewriting the host path — mirroring the URL construction already used by the milomulticluster provider. The management-cluster claim watch is replaced with a multicluster Watches call so that grant/denial status changes in project control planes re-trigger instance reconciles. Claims are stamped with a source-cluster label (discovery.clusterName) so each edge controller only reacts to the claims it created. Co-Authored-By: Claude <claude@anthropic.com>

The admission webhook requires that all metrics referenced in spec.quota.limits[].metric and spec.quota.metricRules[].metricCosts match a name declared in spec.metrics[]. The four quota-tracking metrics (workloads, instances, vcpus, memory) were missing from spec.metrics[], causing the webhook to reject the resource.

…o cell setup Controller flags --enable-management-controllers and --enable-cell-controllers now default to false so kustomize components must explicitly opt in, rather than both groups running by default. This prevented the management-plane deployment from crashing when discovery.clusterName was unset — that field is only required by the InstanceReconciler (a cell controller), so the validation now lives in InstanceReconciler.SetupWithManager instead of initializeClusterDiscovery. Also adds cell-controllers and management-controllers components to the single-cluster overlay, which was silently running with no controllers enabled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…scovery The rebase during cherry-pick propagation introduced a mixed state where cmd/main.go had the edgeClusterName/projectRestConfig return values partially reverted. This cleans up the function signature and call sites to be consistent, while keeping the validation removed from initializeClusterDiscovery (it belongs in InstanceReconciler.SetupWithManager per the original fix intent). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… RBAC The workload-deployment-federator calls ensureDownstreamNamespace before federating WorkloadDeployment resources, but the compute-manager ClusterRole was missing core-group namespace permissions, causing every reconcile to fail with a forbidden error. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Workload scheduling and admission now consult LocationBinding objects (project-scoped, created by the service catalog) rather than the global Location list. This ensures consumers only see locations that are both healthy and available to their specific project. Also upgrades network-services-operator and milo dependencies to versions that introduce LocationBinding and address multicluster-runtime v0.23 API changes (ClusterName type, ProviderRunnable Start lifecycle, generic webhook builder). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ources WorkloadDeploymentReconciler creates and owns NetworkBinding and SubnetClaim resources, and watches Location, NetworkContext, and Subnet. InstanceReconciler watches ResourceClaim for quota. Neither was granted the necessary ClusterRole rules, causing watch failures on cell clusters. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

From the cell cluster's perspective, Karmada is upstream (the federation control plane), not downstream. Rename the flag, env var, and related variables throughout to reflect the actual relationship. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…viderRunnable fix Points go.miloapis.com/milo to the feature branch commit that implements multicluster.ProviderRunnable on the Milo provider, enabling the mc manager to auto-call provider.Start() and set p.mcAware so project clusters can be registered. Without this, p.mcAware was always nil and every project reconcile logged "Multicluster manager not yet started" forever. Also removes the & from ResourceRef in ResourceClaimSpec — the feature branch has ResourceRef as a value type, not a pointer. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove non-existent QuotaRestConfig() call and fix SetupWithManager argument count; pass nil quota config to skip quota enforcement for now. Single-tenant cell mode uses namespace-as-project-id and the fixed 'single' cluster name. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Wires up Milo ResourceClaim-based quota accounting for cells running in single-cell discovery mode (mode: single), where the multicluster ClusterName is always "single" rather than the Milo project name.
Key changes:
- Add QuotaKubeconfigPath config field and QuotaRestConfig() method so quota REST config can be configured independently of discovery mode. Returns (nil, nil) when neither path is set, disabling quota rather than silently targeting the local apiserver.
- Add projectIDForInstance and clusterNameForProject func fields to InstanceReconciler. In single mode, project ID is derived from instance.Namespace; the watch map func always enqueues ClusterName "single" rather than the project namespace, avoiding ErrClusterNotFound on every quota-grant event.
- Guard ResourceClaim watch map func against claims with empty ResourceRef to prevent a nil-dereference panic when a label-matching claim from another actor has no ResourceRef set.
- Add TestReconcileQuotaSingleMode covering the full single-mode quota flow: project ID from namespace, watch re-enqueue to "single" cluster.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


v2.1.5 was built with Go 1.24 and refuses to lint Go 1.25 modules. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


clusterName is only needed when enableCellControllers is true (cell/edge deployments). Management plane deployments use Milo mode without it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Connect the cell/edge InstanceReconciler to the Milo quota API so that ResourceClaim objects are created and watched in single-cell mode the same way they are in Milo-mode deployments.
- Load quotaRestConfig from discovery.quotaKubeconfigPath at startup; quota stays disabled (nil) when the path is not set
- Replace the no-op projectIDForInstance closure with one that reads the meta.datumapis.com/upstream-namespace label from the local cell namespace to resolve the Milo project name
- Pass quotaRestConfig to SetupWithManager instead of nil
- Add config/components/quota-credentials Kustomize component that mounts a compute-quota-credentials Secret into the manager pod
- Enable the component in the cell overlay and add quotaKubeconfigPath to the cell ConfigMap patch
Hardening from code review:
- reconcileDeletion: when projectID is unresolvable, log and remove the finalizer rather than blocking Instance deletion forever
- SetupWithManager: return an error at startup when quota is enabled but edgeClusterName is empty (prevents silent predicate breakage)
- projectIDForInstance: add context.Context parameter so callers pass the per-reconcile context rather than the startup lifetime context
Remove the vendor/ directory; no replace directives are in use and module cache / GOPROXY provides reproducible builds in CI.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…n fatal When the quota-credentials secret is not provisioned on a cell cluster, QuotaRestConfig now returns (nil, nil) instead of propagating the file-not-found error from BuildConfigFromFlags. This lets the cell controllers start up with quota enforcement disabled when the secret is absent. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Allows the compute-manager pod to start on cell clusters where the quota-credentials secret has not been provisioned. QuotaRestConfig already returns nil when the file is absent, so quota enforcement is disabled gracefully. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…en quota enabled in single mode In single mode the projectIDForInstance closure was calling cl.GetClient().Get(...) to read the namespace label that maps ns-{uid} to the Milo project ID. GetClient() returns the controller-runtime cache-backed client; its first cluster-scoped Namespace Get lazily starts a Namespace informer (LIST+WATCH) under the compute-manager SA. That SA has no `namespaces` RBAC, so the LIST returns Forbidden, WaitForCacheSync blocks indefinitely, and the single reconcile worker hangs permanently. Fix: switch to cl.GetAPIReader().Get(...) (the uncached direct-to-API reader) wrapped in a 5 s timeout context. GetAPIReader bypasses the informer machinery entirely — no watch is registered, no cache sync is required — so the call succeeds (or fails fast) regardless of RBAC on list/watch. Also add `// +kubebuilder:rbac:groups="",resources=namespaces,verbs=get` to the InstanceReconciler RBAC markers and regenerate config/components/controller_rbac/role.yaml so the compute-manager ClusterRole explicitly grants namespaces/get. This prevents the same failure mode if the direct reader is ever replaced, and is the minimal correct permission. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…me not upstream-namespace The closure was reading UpstreamOwnerNamespaceLabel (meta.datumapis.com/upstream-namespace) which carries the in-project namespace name (e.g. "default"), not the project identifier. The project is encoded in UpstreamOwnerClusterNameLabel (meta.datumapis.com/upstream-cluster-name) as "cluster-<name>" with "/" replaced by "_" (e.g. "cluster-datum-cloud" → "datum-cloud"). Decode using the same logic as InstanceProjector (instance_projector.go lines 77-78): strip "cluster-" prefix, replace "_" with "/". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
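
The decoding rule described in this commit (strip the "cluster-" prefix, then replace "_" with "/") can be sketched as below. The helper name `decodeProjectID` and its error handling are assumptions for illustration; the actual logic lives in the InstanceProjector and may differ in detail.

```go
package main

import (
	"fmt"
	"strings"
)

// decodeProjectID turns an upstream-cluster-name label value such as
// "cluster-datum-cloud" into the project identifier "datum-cloud".
// Hypothetical helper: the name and the error on a missing prefix are
// assumptions, not taken from the repository.
func decodeProjectID(labelValue string) (string, error) {
	const prefix = "cluster-"
	if !strings.HasPrefix(labelValue, prefix) {
		return "", fmt.Errorf("label value %q lacks the %q prefix", labelValue, prefix)
	}
	name := strings.TrimPrefix(labelValue, prefix)
	// "/" was encoded as "_" when the label was written, so reverse it here.
	return strings.ReplaceAll(name, "_", "/"), nil
}

func main() {
	id, err := decodeProjectID("cluster-datum-cloud")
	fmt.Println(id, err)
}
```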

…ngle mode
Two bugs in reconcileQuotaClaim, both confirmed against staging Milo (POST 403 with old code, 201 Created with fixes):
Bug: wrong claim namespace
The ResourceClaim was created with Namespace = instance.Namespace (the edge namespace, e.g. "ns-efdf8ca1-..."). That namespace does not exist in the project control plane, so every GET/CREATE/DELETE returned 404. The claim must live in the in-project namespace, which is the value of meta.datumapis.com/upstream-namespace on the edge namespace (e.g. "default").
Fix: add projectNamespaceForInstance func field to InstanceReconciler (parallel to projectIDForInstance) with a resolveProjectNamespace helper that falls back to instance.Namespace for Milo mode. In single mode cmd/main.go supplies a closure that reads UpstreamOwnerNamespaceLabel from the edge namespace via GetAPIReader (same pattern as the hang fix). reconcileQuotaClaim and reconcileDeletion both use this resolver for claim namespace lookups.
Bug: wrong ResourceRef (HTTP 403 from quota admission)
ResourceRef was {compute.datumapis.com, Instance, ...}. The ResourceRegistration "compute-instances" declares claimingResources: [{resourcemanager.miloapis.com, Project}]. The admission plugin rejected any claim whose resourceRef didn't match, returning 403.
Fix: set ResourceRef to {resourcemanager.miloapis.com, Project, projectID} (cluster-scoped, so no Namespace field). This matches the ResourceRegistration and is consistent with ConsumerRef, which already used Project.
Also: update TestReconcileQuota makeClaim and TestReconcileQuotaSingleMode to assert the corrected claim shape. TestReconcileQuotaSingleMode now uses correct edgeNS/projectNS/projectID separation and wires projectNamespaceForInstance so the single-mode claim path is fully covered.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… fixes - gofmt: fix import ordering and struct field alignment in cmd/main.go, instance_controller.go, instance_controller_test.go, and three pre-existing pre-gofmt files in internal/controller/ that were already failing on the base branch (workloaddeployment_federator.go, instance_projector_test.go, workloaddeployment_federator_test.go) - lll: extract the two repeated edge-namespace reading closures in cmd/main.go into package-level helpers (singleModeProjectID, singleModeProjectNamespace, readEdgeNamespace). This eliminates the two closure declaration lines that exceeded 120 chars, reduces copy-paste, and brings main() gocyclo from 37 → 31 (was 34 on base). - Named types InstanceProjectIDFunc / InstanceProjectNamespaceFunc added to instance_controller.go so cmd/main.go can reference the types without re-spelling the full func signature (which was itself the source of the lll violations). Remaining lint failures (gocyclo=31 > 30, staticcheck×3 in workload_webhook.go) are pre-existing on the base branch and require unrelated refactoring outside the scope of this fix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add spec.locations.supportedClasses=[datum-managed] to the compute ServiceConfiguration. This is gate 1 of the location three-gate model: the LocationBindingReconciler only projects LocationBindings into entitled projects for locations whose class is listed here. Without it the location-binding feature is inert for compute even when entitlements and ServiceAvailabilities exist. Note: this makes compute willing to serve datum-managed locations; a durable, genuinely-Ready datum-managed Location (e.g. DFW) must still be registered via the platform/infra PoP process — tracked separately. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d reduce gocyclo staticcheck (3× SA1019): migrate workload_webhook.go from deprecated admission.CustomDefaulter/CustomValidator and WithCustomDefaulter/ WithCustomValidator to the generic admission.Defaulter[T]/Validator[T] interfaces and WithDefaulter/WithValidator builder methods. The concrete type parameter (*computev1alpha.Workload) eliminates all runtime type assertions. Behavior is unchanged — Default() was a no-op and the Validate* methods had identical logic. Removes unused `runtime` import. gocyclo (31 > 30): extract the management-plane controller wiring block (WorkloadDeploymentFederator + InstanceProjector + downstream manager) into setupManagementControllers(), bringing main() from 31 → 29. Note: these are pre-existing issues on feat/federated-deployment-scheduling folded into this PR per explicit decision. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ng resolver funcs
CI root cause: Makefile pinned golangci-lint v2.1.5 but CI uses v2.12.2. v2.12.2 fires goconst (37 issues) and prealloc (6 issues) that v2.1.5 did not. All were pre-existing. Pin Makefile to v2.12.2 to match CI.
goconst fixes (37 issues across 6 files):
- instance_controller.go: add package constants quotaResourceTypeInstances, miloProjectAPIGroup, miloProjectKind, msgNotProgrammed, msgInstanceReady, msgInstanceProgrammed, msgInstanceRunning, reasonNetworkFailedToCreate
- workload_controller.go: add workloadConditionTypeAvailable
- workloaddeployment_federator.go: add kindWorkloadDeployment
- instance_validation.go: add diskTypePDStandard, defaultImageName, defaultInstanceType
- instance_controller_test.go: add test constants block
- workload_validation_test.go: add test constants block
- instance_projector_test.go, workloaddeployment_federator_test.go: use testDefaultPlacement for repeated "default" placement name
prealloc fixes (6 issues):
- .golangci.yml: add exclusions for internal/validation/ (field.ErrorList{} is the idiomatic Kubernetes validation init pattern; preallocating requires knowing error count in advance) and internal/controller/instancecontrol/ (test helper slices are clearer without prealloc)
Part B - resolver funcs now return (string, error):
- InstanceProjectIDFunc / InstanceProjectNamespaceFunc signatures changed from func(...) string to func(...) (string, error)
- resolveProjectID / resolveProjectNamespace updated to propagate errors
- reconcileQuotaClaim and reconcileDeletion propagate resolver errors so Reconcile returns them → controller-runtime requeues with backoff on transient failures (GetCluster error, APIReader error)
- Empty string + nil error means "no project affiliation yet": leaves QuotaGranted=Unknown and relies on natural requeue rather than error rate
- singleModeProjectID / singleModeProjectNamespace in cmd/main.go updated to return (string, error); readEdgeNamespace updated to return error
- All tests updated for the new signatures
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ed runtime, observability
Implements the quota failure mode hardening described in docs/compute/development/rfcs/quota-failure-modes.md.
P1 - Startup fail-loud (FM-1):
- internal/config/config.go: QuotaRestConfig() now returns (nil, error) when quotaKubeconfigPath is explicitly set but the file does not exist. Previously returned (nil, nil), silently disabling enforcement whenever a Secret failed to mount. Now fatal at startup (existing os.Exit guard in cmd/main.go handles it). Only (nil, nil) when no path is configured at all, which is the intentional opt-out.
- cmd/main.go: change INFO log on disabled enforcement to Error-level so the disabled state is operationally visible.
P2 - Runtime fail-closed (FM-2 through FM-6):
- reconcileQuotaClaim returns structured *metav1.Condition on every failure path (not just nil,err), so the QuotaGranted condition always carries a specific Reason:
  QuotaBackendUnavailable: network error, GET failure, client build failure
  QuotaProjectIDUnresolvable: namespace label missing or unreadable
  QuotaProjectNotFound: 404 on Create (project CP path absent)
  QuotaNamespaceNotFound: 404 on Create (namespace absent on project CP)
  QuotaMisconfigured: 403/422 (no ResourceRegistration or rule mismatch)
- Reconcile now writes the condition to status before returning the error, so the failure reason is always visible on the Instance, not just in logs.
- QuotaNoBudget (FM-7): claim Granted=False/Pending → QuotaGranted=Unknown/NoBudget instead of generic PendingEvaluation. Distinct from "evaluating".
- QuotaDisabled: quotaClientManager==nil uses reason QuotaDisabled (not the misleading QuotaAvailable).
- FM-9 orphaned claim: upgrade from INFO to Error log + Kubernetes event + increment quota_claim_orphaned_total.
observedGeneration guard:
- removeQuotaSchedulingGate now checks quotaGrantedCond.ObservedGeneration == instance.Generation before removing the gate. Prevents a stale True condition from generation N unblocking a generation-N+1 instance before quota for the new spec has been evaluated.
P3 - Visibility:
- internal/quota/metrics.go: new Prometheus metrics registered with controller-runtime registry:
  compute_quota_enforcement_enabled (gauge: 1=active, 0=disabled)
  compute_quota_eval_failures_total (counter by reason label)
  compute_quota_claim_orphaned_total (counter)
- EventRecorder injected into InstanceReconciler via SetupWithManager; Warning events emitted for each quota failure mode with the reason string.
- api/v1alpha/instance_types.go: 7 new QuotaGranted reason constants and a +kubebuilder:printcolumn for Quota state in kubectl get -o wide.
- cmd/main.go: set quota_enforcement_enabled gauge at startup.
Tests:
- internal/config/config_test.go: TestQuotaRestConfig_{NilWhenNoPath, ErrorWhenPathMissing, SuccessWhenFileExists} covering the FM-1 change.
- internal/controller/instance_controller_test.go: TestReconcileQuotaFailureModes covers FM-2 (backend unavailable), FM-5 (namespace not found), FM-6 (403 misconfigured), FM-7 (no budget), QuotaDisabled reason, and the observedGeneration stale-condition guard.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
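
The observedGeneration guard in this commit can be illustrated with a small sketch. The `Condition` struct is a trimmed-down stand-in for metav1.Condition, and `canRemoveQuotaGate` is a hypothetical name; the real check sits inside removeQuotaSchedulingGate and may carry additional logic.

```go
package main

import "fmt"

// Condition is a trimmed-down stand-in for metav1.Condition so the sketch
// compiles without Kubernetes dependencies.
type Condition struct {
	Type               string
	Status             string // "True", "False", or "Unknown"
	ObservedGeneration int64
}

// canRemoveQuotaGate reports whether the quota scheduling gate may be removed.
// A True condition only counts if it was computed for the current generation
// of the Instance; otherwise it describes an older spec and is ignored.
func canRemoveQuotaGate(quotaGranted *Condition, instanceGeneration int64) bool {
	if quotaGranted == nil || quotaGranted.Status != "True" {
		return false
	}
	return quotaGranted.ObservedGeneration == instanceGeneration
}

func main() {
	granted := &Condition{Type: "QuotaGranted", Status: "True", ObservedGeneration: 1}
	fmt.Println(canRemoveQuotaGate(granted, 1)) // condition matches the current spec
	fmt.Println(canRemoveQuotaGate(granted, 2)) // spec changed since the grant was observed
}
```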

…le signature GetEventRecorder() (the non-deprecated replacement) returns events.EventRecorder whose Eventf method requires a related runtime.Object and an action string that the old record.EventRecorder.Event() API does not. Migrating all emit sites is a separate effort; suppress the staticcheck SA1019 warning with an inline nolint explaining why. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…mer-hang

…Client Management controllers (WorkloadDeploymentFederator, InstanceProjector) now refuse to start when --enable-management-controllers is set but --federation-kubeconfig is omitted, logging a clear error and exiting 1. Previously the controllers were silently skipped — the same fail-open-silent class as the quota P1 issue — leaving federation and instance projection broken with no operator-visible signal. Alongside the fail-loud guard, rename the Karmada/federation client identifiers to a neutral "federation" framing (FederationClient, federationRestConfig, --federation-kubeconfig / FEDERATION_KUBECONFIG) across all three controllers, cmd/main.go, and the kustomize base manifests. The previous --upstream-kubeconfig flag is removed; deployments must migrate to --federation-kubeconfig. Update all comments to match. Coordination note: once this artifact is deployed, management-plane and edge/lab deployments must set FEDERATION_KUBECONFIG (infra PRs in parallel). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
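
A minimal sketch of the fail-loud startup check, using only the standard library. The two flag names come from this commit message; the `parseAndValidate` helper, the flag set name, and the error text are assumptions, and the FEDERATION_KUBECONFIG environment variable is not modeled here.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// parseAndValidate parses the two flags named in the commit message and
// rejects the combination that used to be skipped silently.
func parseAndValidate(args []string) error {
	fs := flag.NewFlagSet("compute-manager", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on parse errors
	enableMgmt := fs.Bool("enable-management-controllers", false, "run management-plane controllers")
	fedKubeconfig := fs.String("federation-kubeconfig", "", "path to the federation control plane kubeconfig")
	if err := fs.Parse(args); err != nil {
		return err
	}
	if *enableMgmt && *fedKubeconfig == "" {
		return errors.New("--enable-management-controllers is set but --federation-kubeconfig is empty")
	}
	return nil
}

func main() {
	err := parseAndValidate([]string{"--enable-management-controllers"})
	fmt.Println("startup error:", err)
	// A real binary would log the error and exit with status 1 at this point.
}
```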

fix(mgmt): fail loud on missing federation kubeconfig; rename federation client

…e cells Introduces a --feature-gates=NetworkingIntegration=false flag so operators can run compute on cells where network-services-operator (VPC) is not yet available. When disabled: no NetworkBinding is created, the Network scheduling gate is omitted from new Instances (and removed from existing ones), and the networking step is treated as immediately ready so Instances reach the runtime. Default is true, leaving all production behavior unchanged. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
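
To make the gate's effect concrete, here is a rough standard-library sketch of parsing a `--feature-gates` value and dropping the network scheduling gate when the integration is off. The helper names and the gate strings "Network" and "Quota" are assumptions for illustration; the actual binary may well use a dedicated feature-gate library instead.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseFeatureGates applies "Name=bool" pairs on top of the given defaults.
// Unknown gate names are rejected so typos do not pass silently.
func parseFeatureGates(s string, defaults map[string]bool) (map[string]bool, error) {
	gates := make(map[string]bool, len(defaults))
	for k, v := range defaults {
		gates[k] = v
	}
	if strings.TrimSpace(s) == "" {
		return gates, nil // empty value: defaults apply unchanged
	}
	for _, pair := range strings.Split(s, ",") {
		name, raw, ok := strings.Cut(strings.TrimSpace(pair), "=")
		if !ok {
			return nil, fmt.Errorf("malformed feature gate %q", pair)
		}
		if _, known := defaults[name]; !known {
			return nil, fmt.Errorf("unknown feature gate %q", name)
		}
		enabled, err := strconv.ParseBool(raw)
		if err != nil {
			return nil, fmt.Errorf("feature gate %q: %w", name, err)
		}
		gates[name] = enabled
	}
	return gates, nil
}

// dropGate returns the scheduling gates without the named one.
func dropGate(gates []string, name string) []string {
	out := make([]string, 0, len(gates))
	for _, g := range gates {
		if g != name {
			out = append(out, g)
		}
	}
	return out
}

func main() {
	defaults := map[string]bool{"NetworkingIntegration": true}
	gates, err := parseFeatureGates("NetworkingIntegration=false", defaults)
	if err != nil {
		panic(err)
	}
	schedulingGates := []string{"Network", "Quota"} // hypothetical gate names
	if !gates["NetworkingIntegration"] {
		schedulingGates = dropGate(schedulingGates, "Network")
	}
	fmt.Println(schedulingGates)
}
```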

feat: make the compute networking integration optional

…e manifest Overlays can now enable or disable feature gates (e.g. NetworkingIntegration) with a strategic-merge patch on the FEATURE_GATES env var rather than appending a raw flag string. The base default is empty, which the binary already guards (only calls Set when non-empty), so all existing cells are unaffected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(config): wire feature gates through FEATURE_GATES env var in base deployment manifest

When the cell plane wrote Instances back to the Karmada control plane, writeBackToUpstream built the downstream object with only the two upstream-owner routing labels, dropping the workload-uid, workload-deployment-uid, and instance-index labels the cell stamps. The management-plane InstanceProjector then had no labels to copy and could not resolve the WorkloadDeployment owner reference, so `datumctl compute instances` rendered WORKLOAD as "orphaned" and CITY as "unknown". Carry the three linking labels onto the downstream object so the projector can propagate them and resolve the owner reference. The update path now merges only the labels this controller owns instead of replacing the whole map, preserving any Karmada-managed labels, and logs a warning when a linking label is missing at write-back. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
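
The "merge only the labels this controller owns" behavior can be sketched as follows. The short label keys and the `example.io/managed-by` label are placeholders (the real keys are not spelled out in this commit message), and `mergeOwnedLabels` is a hypothetical helper name.

```go
package main

import "fmt"

// ownedLabelKeys lists the label keys this controller manages on the
// written-back object. The keys are illustrative placeholders.
var ownedLabelKeys = []string{"workload-uid", "workload-deployment-uid", "instance-index"}

// mergeOwnedLabels copies only the owned keys from src into dst and leaves
// every other label on dst untouched. It also returns the owned keys that
// were missing from src so the caller can log a warning.
func mergeOwnedLabels(dst, src map[string]string) (map[string]string, []string) {
	if dst == nil {
		dst = map[string]string{}
	}
	var missing []string
	for _, k := range ownedLabelKeys {
		v, ok := src[k]
		if !ok {
			missing = append(missing, k)
			continue
		}
		dst[k] = v
	}
	return dst, missing
}

func main() {
	// Labels already on the federation-plane object, including one set by
	// another actor that must survive the update.
	existing := map[string]string{"example.io/managed-by": "federation", "workload-uid": "old"}
	// Labels stamped by the cell on the local Instance.
	cell := map[string]string{"workload-uid": "abc", "workload-deployment-uid": "def"}

	merged, missing := mergeOwnedLabels(existing, cell)
	fmt.Println(merged)
	fmt.Println("missing linking labels:", missing)
}
```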

Exposes per-container entrypoint and argument overrides on the SandboxContainer type, mirroring Kubernetes pod-spec semantics: - Command []string — overrides the image ENTRYPOINT - Args []string — overrides the image CMD; combined with Command when both are set When neither field is set the image's own ENTRYPOINT/CMD are used unchanged, which is the correct default for standard OCI images (e.g. hello-world, nginx). Infrastructure providers that translate Instance specs (such as unikraft-provider) should map these fields through to the underlying runtime. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…stant

Four occurrences in instance_writeback_test.go triggered goconst because testInstanceType = "d1-standard-2" already exists in the same package. Replacing all four with the constant keeps golangci-lint at 0 issues.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…nd-args feat(api): add Command and Args fields to SandboxContainer

Instances are projected back to the project cluster where the CLI reads them. Previously, correlating an Instance with its workload/deployment required joining on workload-deployment-uid, which differs per Karmada plane (the WD UID in the management cluster does not match the UID assigned in the cell).

Add four new label constants and stamp them on every Instance at create and update time:

- workload-deployment-name (deployment.Name)
- city-code (deployment.Spec.CityCode)
- workload-name (deployment.Spec.WorkloadRef.Name)
- placement-name (deployment.Spec.PlacementName)

These self-describing labels let the CLI resolve WORKLOAD/CITY/placement directly from the projected Instance object, without any cross-plane join. All four labels are included in the writeBackToUpstream allowlist so they propagate through InstanceReconciler → Karmada → InstanceProjector into the user-facing project cluster.

Also persist the resolved Location (already discovered during network reconciliation) onto WorkloadDeployment.Status.Location, and propagate it into Instance.Spec.Location best-effort. A nil location never blocks instance creation; the existing scheduling path is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
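A rough sketch of the stamping step, using trimmed-down stand-in types that mirror only the field paths named in the list above. The constant names, the bare label keys, the helper `stampControllerLabels`, and the example values are assumptions for illustration; the real API types are much larger.

```go
package main

import "fmt"

// Hypothetical label keys; the real constants likely carry a domain prefix.
const (
	labelWorkloadDeploymentName = "workload-deployment-name"
	labelCityCode               = "city-code"
	labelWorkloadName           = "workload-name"
	labelPlacementName          = "placement-name"
)

// Stand-in types covering only the fields the four labels are derived from.
type WorkloadRef struct{ Name string }

type WorkloadDeploymentSpec struct {
	CityCode      string
	PlacementName string
	WorkloadRef   WorkloadRef
}

type WorkloadDeployment struct {
	Name string
	Spec WorkloadDeploymentSpec
}

// stampControllerLabels sets the four self-describing labels on an Instance
// label map, creating the map if needed and leaving other keys untouched.
func stampControllerLabels(labels map[string]string, wd WorkloadDeployment) map[string]string {
	if labels == nil {
		labels = map[string]string{}
	}
	labels[labelWorkloadDeploymentName] = wd.Name
	labels[labelCityCode] = wd.Spec.CityCode
	labels[labelWorkloadName] = wd.Spec.WorkloadRef.Name
	labels[labelPlacementName] = wd.Spec.PlacementName
	return labels
}

func main() {
	wd := WorkloadDeployment{
		Name: "test-workload-default-dfw", // illustrative values only
		Spec: WorkloadDeploymentSpec{
			CityCode:      "DFW",
			PlacementName: "default",
			WorkloadRef:   WorkloadRef{Name: "test-workload"},
		},
	}
	labels := stampControllerLabels(nil, wd)
	// A CLI can now render its columns from the labels alone.
	fmt.Printf("WORKLOAD=%s CITY=%s PLACEMENT=%s\n",
		labels[labelWorkloadName], labels[labelCityCode], labels[labelPlacementName])
}
```

Because every value is a name or code rather than a UID, the labels stay meaningful on whichever plane the Instance is read from.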

The owner reference on a projected Instance must reference the actual project-control-plane WorkloadDeployment so that GC cascades and deletes projections when the deployment is removed. The previous implementation compared the WorkloadDeploymentUIDLabel value (which carries the edge/Karmada plane WD UID) against project-cluster WD UIDs, a match that never succeeds because each Kubernetes plane mints its own UID for the same object. The result was that ownerWD stayed nil, no ownerReference was set, and Instance projections leaked indefinitely (e.g. my-api/test-workload orphaned after WD deletion).

Fix: resolve the owning WD by the federation-stable WorkloadDeployment NAME via a direct projectClient.Get against the project cluster, satisfying the core invariant that the owner reference UID/name/GVK must come from a live project-cluster object. The name is read from the new WorkloadDeploymentNameLabel (already stamped by dd3421a); an ordinal-strip fallback handles Instances created before that label was introduced.

If the project WD is NotFound, requeue with RequeueAfter: 5s without creating the projection, so a projection is never created without an owner reference. This handles the transient ordering race where Karmada propagates an Instance back before WorkloadReconciler has created the project WD. Existing ownerless projections self-heal on the next reconcile once the project WD exists.

Tests added:

- "WD name label present, edge UID differs from project UID": asserts ownerRef.UID == projTestWDUID AND != projTestEdgeWDUID (regression guard against reintroducing cross-plane UID matching).
- "WD name label absent, fallback name extraction from instance name": verifies the ordinal-strip path produces a correct owner reference.
- "project WD not found — requeue, no ownerless projection created": asserts RequeueAfter > 0 and no projection object exists.
- "WD name label absent and instance name yields no resolvable WD": verifies unrecognised instance names are skipped cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
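The name-based owner resolution and the requeue rule can be sketched as below. This is a simplified model, not the projector's code: the project cluster is stood in for by a name-to-UID map instead of a projectClient.Get call, the label key is hypothetical, and the fallback assumes instances are named `<deployment-name>-<ordinal>`.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

const workloadDeploymentNameLabel = "workload-deployment-name" // hypothetical key

// ownerWDName returns the owning WorkloadDeployment name: the name label when
// present, otherwise the instance name with a trailing "-<ordinal>" stripped.
// ok is false when neither source yields a name.
func ownerWDName(labels map[string]string, instanceName string) (name string, ok bool) {
	if name := labels[workloadDeploymentNameLabel]; name != "" {
		return name, true
	}
	idx := strings.LastIndex(instanceName, "-")
	if idx <= 0 || idx == len(instanceName)-1 {
		return "", false
	}
	for _, r := range instanceName[idx+1:] {
		if r < '0' || r > '9' {
			return "", false
		}
	}
	return instanceName[:idx], true
}

// planProjection decides what to do with one written-back Instance.
// projectWDs maps project-cluster WD names to their project-cluster UIDs.
func planProjection(projectWDs, labels map[string]string, instanceName string) (ownerUID string, requeueAfter time.Duration, project bool) {
	name, ok := ownerWDName(labels, instanceName)
	if !ok {
		return "", 0, false // unrecognised name: skip cleanly
	}
	uid, found := projectWDs[name]
	if !found {
		// Owner not created yet: retry later, never project without an owner.
		return "", 5 * time.Second, false
	}
	return uid, 0, true
}

func main() {
	projectWDs := map[string]string{"sre-gate-test-default-dfw": "project-uid-1"}

	uid, _, ok := planProjection(projectWDs, nil, "sre-gate-test-default-dfw-0")
	fmt.Println("owner UID:", uid, "project:", ok)

	_, requeue, ok := planProjection(projectWDs, nil, "not-yet-created-3")
	fmt.Println("requeue after:", requeue, "project:", ok)
}
```

The owner UID always comes from the project-cluster lookup, never from a label written in another plane, which is the invariant the regression test guards.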

Newly-introduced controller labels (city-code, workload-name, workload-deployment-name, placement-name) were only stamped on the create path and the template-hash-mismatch update path. A pre-existing instance that is not-Ready with an unchanged template hash takes the Wait branch and was never re-stamped, so the labels were absent on instances like sre-gate-test-default-dfw-0.

Add a dedicated label-backfill pass that runs after the ordered rollout decision and skip-loop. For each existing, non-deleting instance, desiredControllerLabels() computes the full desired label set; if any key differs, a NewPatchLabelsAction (ActionTypePatchLabels) is emitted. The action executes via a client.MergeFrom patch, which sends only the metadata diff; spec, template, and template-hash are never touched (constraint 1). Backfill actions are appended after the rollout skip-loop so they are never subject to the "skip all but first" rule and never counted as an update in progress (constraint 2). The pass is idempotent: it is a no-op when all labels already match.

Fix the misleading comment on addInstanceControllerLabels that overstated coverage by claiming the function ran on "both create and update paths"; the comment now reflects that backfill covers every reconcile pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
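The idempotent backfill pass can be modeled with plain maps. In this sketch the `Instance` type, `labelPatch`, and `backfill` are hypothetical stand-ins; the actual controller emits a patch action and applies it with a merge patch rather than mutating objects in memory.

```go
package main

import "fmt"

// Instance is a stand-in carrying only what the backfill pass looks at.
type Instance struct {
	Name     string
	Labels   map[string]string
	Deleting bool
}

// labelPatch returns the desired labels that are absent from, or differ in,
// existing. An empty result means no patch is needed.
func labelPatch(existing, desired map[string]string) map[string]string {
	patch := map[string]string{}
	for key, want := range desired {
		if got, ok := existing[key]; !ok || got != want {
			patch[key] = want
		}
	}
	return patch
}

// backfill applies label-only patches to every non-deleting instance and
// returns how many instances were patched. Nothing but labels is touched.
func backfill(instances []*Instance, desired map[string]string) int {
	patched := 0
	for _, inst := range instances {
		if inst.Deleting {
			continue
		}
		patch := labelPatch(inst.Labels, desired)
		if len(patch) == 0 {
			continue // already up to date: no-op
		}
		if inst.Labels == nil {
			inst.Labels = map[string]string{}
		}
		for key, value := range patch {
			inst.Labels[key] = value
		}
		patched++
	}
	return patched
}

func main() {
	desired := map[string]string{"city-code": "DFW", "workload-name": "sre-gate-test"}
	instances := []*Instance{
		{Name: "sre-gate-test-default-dfw-0"},
		{Name: "sre-gate-test-default-dfw-1", Deleting: true},
	}
	fmt.Println("first pass patched:", backfill(instances, desired))
	fmt.Println("second pass patched:", backfill(instances, desired))
}
```

Running the pass twice shows the idempotence property: the second pass finds nothing to change.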

…s update

After quota is granted, the Quota scheduling gate was never removed from spec.controller.schedulingGates, leaving instances stuck "Pending (SchedulingGatesPresent)" even though the workload was running.

Root cause: Reconcile returned early after writing QuotaGranted=True to status (statusChanged=true path), before reaching removeQuotaSchedulingGate. Because ResourceClaims are immutable after creation and local Instances are not watched (WithEngageWithLocalCluster(false)), no subsequent event would re-enqueue the instance, so the gate was stranded forever.

Fix: on the success path (quotaErr==nil), fall through to removeQuotaSchedulingGate after persisting the status update rather than returning early. Only return early with quotaErr when it is non-nil, which preserves the transient-failure backoff-requeue behavior.

Also updates existing tests that previously required two reconciles to clear the gate (the second of which could never arrive in production), and adds TestQuotaGateRemovedInSingleReconcile as a regression test.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
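The corrected control flow can be reduced to a few lines. This is a toy model under stated assumptions: `Instance`, `reconcileQuota`, `removeGate`, and `errQuotaUnavailable` are hypothetical, status persistence is a plain field write, and the quota decision is passed in rather than read from a ResourceClaim.

```go
package main

import (
	"errors"
	"fmt"
)

const quotaGate = "Quota"

var errQuotaUnavailable = errors.New("quota service unavailable")

// Instance is a stand-in with only the fields this flow touches.
type Instance struct {
	SchedulingGates []string
	QuotaGranted    bool
}

// removeGate drops every occurrence of gate from the instance's gates.
func removeGate(inst *Instance, gate string) {
	var kept []string
	for _, g := range inst.SchedulingGates {
		if g != gate {
			kept = append(kept, g)
		}
	}
	inst.SchedulingGates = kept
}

// reconcileQuota models the fixed flow: persist status, return early only on
// a quota error, and otherwise fall through to gate removal in the same pass.
func reconcileQuota(inst *Instance, granted bool, quotaErr error) error {
	if inst.QuotaGranted != granted {
		inst.QuotaGranted = granted // stands in for the status update
	}
	if quotaErr != nil {
		return quotaErr // transient failure: caller requeues with backoff
	}
	if granted {
		removeGate(inst, quotaGate) // no early return before this point
	}
	return nil
}

func main() {
	inst := &Instance{SchedulingGates: []string{quotaGate}}

	if err := reconcileQuota(inst, false, errQuotaUnavailable); err != nil {
		fmt.Println("requeue with backoff:", err)
	}
	if err := reconcileQuota(inst, true, nil); err == nil {
		fmt.Println("gates remaining after grant:", len(inst.SchedulingGates))
	}
}
```

The buggy version corresponds to returning right after the status write whenever the status changed, which left gate removal to a second reconcile that nothing would trigger.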

scotwells · 2026-05-29T15:27:24Z

This PR shares the feat/federated-deployment-scheduling branch with #107, so #107 already carries every commit here (including the quota gate-removal fix and the merged #125 work). To keep review on a single source of truth, we're consolidating onto #107 — which targets docs/issue-85-karmada-federation-design as the next link in the merge chain — and closing this duplicate to avoid divergent review. No work is lost: closing rather than merging, since #107 already has the commits.

scotwells and others added 30 commits May 18, 2026 12:46

Merge branch 'main' into docs/issue-85-karmada-federation-design

105c335

Merge branch 'main' into docs/issue-85-karmada-federation-design

9734bf6

Merge branch 'main' into docs/issue-85-karmada-federation-design

9d96bd5

feat: replace cert-manager certificate resources with CSI volume moun…

0f69956

…ts for webhook TLS Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: remove webhook CA injection — Milo trusts the cert issuer directly

a11861e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci: bump Go version to 1.25 to match go.mod requirement

81e73c3

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci: bump golangci-lint to v2.2.2 for Go 1.25 compatibility

bed3d12

v2.1.5 was built with Go 1.24 and refuses to lint Go 1.25 modules. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ci: bump golangci-lint to v2.12.2 (latest, built with Go 1.25)

0d26598

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

scotwells and others added 28 commits May 28, 2026 15:45

fix: remove clusterName requirement in Milo mode for management plane

a5916b9

clusterName is only needed when enableCellControllers is true (cell/edge deployments). Management plane deployments use Milo mode without it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Merge pull request #118 from datum-cloud/fix/instance-namespace-infor…

c1c6261

…mer-hang

Merge pull request #120 from datum-cloud/fix/mgmt-controller-fail-loud

553af62

fix(mgmt): fail loud on missing federation kubeconfig; rename federation client

Merge pull request #121 from datum-cloud/feat/networking-feature-flag

70579e3

feat: make the compute networking integration optional

Merge pull request #122 from datum-cloud/feat/feature-gates-env-var

cd052a6

feat(config): wire feature gates through FEATURE_GATES env var in base deployment manifest

Merge pull request #125 from datum-cloud/feat/sandbox-container-comma…

fa711b9

…nd-args feat(api): add Command and Args fields to SandboxContainer

scotwells closed this May 29, 2026

scotwells wants to merge 62 commits into main from feat/federated-deployment-scheduling
