feat: Route workloads to city locations via distributed scheduling by scotwells · Pull Request #107 · datum-cloud/compute

scotwells · 2026-05-18T22:41:29Z

Summary

Workloads targeting a city location are now automatically routed to the correct physical site, and their status is surfaced back to the platform in real time. Previously, a single central scheduler made all placement decisions; this distributes that responsibility across regional clusters so each site can operate independently.

When you deploy a workload to a city, the platform now routes it to the right physical site based on the city code, reports instance health and readiness back without any manual steps, and continues operating even if other parts of the control plane are temporarily unreachable.

Nothing changes for users — city-code targeting, instance visibility, and the existing API all work exactly as before.

Design: #106

Test plan

New workload deployed → routed to correct city → instance status visible on control plane within one reconcile cycle
Deleting a workload cleans up all federated resources
City-specific routing policies are created and removed as deployments come and go
Instance status propagates correctly through the full scheduling chain
Full end-to-end federation test (workload creation → site placement → instance projection)

Closes #85

Workloads targeting a city location are now automatically routed to the correct physical site via a Karmada-based federation layer. Each POP cell operates independently, instance health is surfaced back to the control plane in real time, and the platform remains available even when parts of the control plane are temporarily unreachable.

Controllers added:
- WorkloadDeploymentFederator: replicates WDs into Karmada and manages PropagationPolicies per city code
- InstanceProjector: mirrors Instance write-backs from Karmada into the project namespace on the control plane

ResourceInterpreterCustomization deployed at config time teaches Karmada how to aggregate replica counts and conditions across POP cells.

Operator flags --enable-management-controllers and --enable-cell-controllers allow each deployment to opt into only the controllers it needs.

Includes a 6-test Chainsaw e2e suite covering federation, deletion cascade, propagation policy lifecycle, instance projection, instance write-back, and the full end-to-end chain.

Resolves #85

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
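The per-city policy lifecycle described above (a routing policy exists only while at least one deployment targets that city code) can be illustrated with a small reference-counting sketch. This is not the federator's actual code: `policyTracker`, its methods, and the city code `dfw` are hypothetical names chosen for illustration, and the real controller works against Karmada PropagationPolicy objects rather than an in-memory map.

```go
package main

import "fmt"

// policyTracker is a hypothetical in-memory stand-in for the federator's view
// of which city codes currently need a routing policy.
type policyTracker struct {
	refs map[string]map[string]struct{} // city code -> set of deployment names
}

func newPolicyTracker() *policyTracker {
	return &policyTracker{refs: map[string]map[string]struct{}{}}
}

// add records a deployment and reports whether a policy must be created,
// which is true only for the first deployment targeting that city code.
func (t *policyTracker) add(cityCode, deployment string) bool {
	set, ok := t.refs[cityCode]
	if !ok {
		set = map[string]struct{}{}
		t.refs[cityCode] = set
	}
	set[deployment] = struct{}{}
	return !ok
}

// remove forgets a deployment and reports whether the policy can be deleted,
// which is true only when no deployment targets that city code any more.
func (t *policyTracker) remove(cityCode, deployment string) bool {
	set, ok := t.refs[cityCode]
	if !ok {
		return false
	}
	delete(set, deployment)
	if len(set) > 0 {
		return false
	}
	delete(t.refs, cityCode)
	return true
}

func main() {
	tracker := newPolicyTracker()
	fmt.Println(tracker.add("dfw", "wd-a"))    // true: first deployment, create policy
	fmt.Println(tracker.add("dfw", "wd-b"))    // false: policy already exists
	fmt.Println(tracker.remove("dfw", "wd-a")) // false: wd-b still needs the policy
	fmt.Println(tracker.remove("dfw", "wd-b")) // true: last deployment gone, delete policy
}
```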

…edge Introduces management-plane and cell overlay paths to the compute OCI artifact so the infra repo can deploy compute-manager in the correct mode for each tier of the federation architecture. The management-plane overlay deploys compute-manager with only WorkloadDeploymentFederator and InstanceProjector enabled, connected to the Karmada downstream control plane via projected ServiceAccount token auth. The cell overlay deploys compute-manager with only WorkloadDeploymentReconciler and InstanceReconciler enabled, with no downstream connection or webhook server. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ts for webhook TLS Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove the hardcoded datum-control-plane ClusterIssuer from the csi-webhook-cert component. DNS names stay since they are fixed by the service name and namespace. Each consuming overlay now supplies the issuer via a strategic merge patch, allowing different environments to use different cert issuers without forking the component. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Each WorkloadDeployment is routed to exactly one cell cluster via its PropagationPolicy, so aggregation across multiple members is not needed. Replace the summing logic with a direct pass-through of the single member's status. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The cert issuer name is environment-specific configuration that belongs in the infra repo, not the compute overlay. The infra repo's base manager patch already owns the full webhook-server-tls volume definition including the issuer. Consumers deploying outside infra must patch the issuer in their own overlay. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…moval

dev: inline self-signed Issuer + Certificate for host.docker.internal, replace kustomize replacements block with direct annotation patch, remove Certificate-patching from webhook_patch.yaml, and clear webhookServer secretRef from config.yaml.

single-cluster: replace cert-manager Certificate approach with the csi-webhook-cert component, matching the main branch overlay.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The WorkloadReconciler watches networkingv1alpha.Network objects, which requires the network-services-operator CRDs to be installed. Cell clusters don't have those CRDs, causing the manager to crash on startup. Gate the WorkloadReconciler behind enableManagementControllers so it only runs where the Network CRDs are present. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Extracts server config file reading and decoding into a dedicated loadServerConfig helper, reducing main's cyclomatic complexity from 31 to 29 to satisfy the gocyclo linter limit of 30. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Milo's authorization webhook uses Extra claims on the admission request (iam.miloapis.com/parent-name, iam.miloapis.com/parent-type, etc.) to resolve the correct project-scoped policy binding. Dropping them caused the SAR to return Allowed=false even for users with networks.use, because the authorizer couldn't locate the binding without the project context. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
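A minimal sketch of the fix, assuming simplified local types in place of the Kubernetes admission and authorization API structs: the point is only that the `Extra` map is copied onto the access review rather than dropped. `userInfo`, `accessReviewSpec`, `buildAccessReview`, and the example claim values are hypothetical.

```go
package main

import "fmt"

// userInfo is a simplified stand-in for the user info on an admission request.
type userInfo struct {
	Username string
	Groups   []string
	Extra    map[string][]string
}

// accessReviewSpec is a simplified stand-in for a SubjectAccessReview spec.
type accessReviewSpec struct {
	User     string
	Groups   []string
	Extra    map[string][]string
	Verb     string
	Resource string
}

// buildAccessReview copies the caller's identity, including every Extra claim,
// so the authorizer still sees the project context it needs.
func buildAccessReview(u userInfo, verb, resource string) accessReviewSpec {
	extra := make(map[string][]string, len(u.Extra))
	for key, values := range u.Extra {
		extra[key] = append([]string(nil), values...)
	}
	return accessReviewSpec{
		User:     u.Username,
		Groups:   append([]string(nil), u.Groups...),
		Extra:    extra,
		Verb:     verb,
		Resource: resource,
	}
}

func main() {
	user := userInfo{
		Username: "alice",
		Extra: map[string][]string{
			// Claim keys come from the commit message; values are made up.
			"iam.miloapis.com/parent-name": {"my-project"},
			"iam.miloapis.com/parent-type": {"Project"},
		},
	}
	review := buildAccessReview(user, "use", "networks")
	fmt.Println(review.Extra["iam.miloapis.com/parent-name"])
}
```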

metricRules belongs under spec.quota, not spec.billing. The field is not declared in the ServiceBillingConfig schema, causing Flux dry-run failures in staging with: .spec.billing.metricRules: field not declared in schema

Previously, InstanceReconciler wrote ResourceClaim objects against the local deployment cluster via managementCluster.GetClient(). Those claims were never seen by the Milo quota system, leaving every Instance in QuotaGranted=Unknown indefinitely. This change routes claim creation and deletion to the correct Milo project control plane for each instance using a new ProjectQuotaClientManager that builds per-project REST clients by rewriting the host path — mirroring the URL construction already used by the milomulticluster provider. The management-cluster claim watch is replaced with a multicluster Watches call so that grant/denial status changes in project control planes re-trigger instance reconciles. Claims are stamped with a source-cluster label (discovery.clusterName) so each edge controller only reacts to the claims it created. Co-Authored-By: Claude <claude@anthropic.com>
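The two mechanisms in this commit, per-project hosts derived by rewriting the path and claims filtered by a source-cluster label, can be sketched as follows. The path layout `/projects/<name>/control-plane`, the label key `example.test/source-cluster`, and the function names are all assumptions for illustration. The real URL scheme is whatever the milomulticluster provider uses, which this page does not show.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// sourceClusterLabel is a hypothetical label key; the real key is not shown here.
const sourceClusterLabel = "example.test/source-cluster"

// projectHost derives a per-project API host from a base host by appending a
// project-scoped path. The path layout is an assumption of this sketch.
func projectHost(baseHost, project string) (string, error) {
	if project == "" {
		return "", fmt.Errorf("project name is required")
	}
	u, err := url.Parse(baseHost)
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("base host %q must include scheme and host", baseHost)
	}
	u.Path = path.Join("/", u.Path, "projects", project, "control-plane")
	return u.String(), nil
}

// ownsClaim reports whether a claim was stamped by this edge cluster, so a
// controller ignores claims created by other cells.
func ownsClaim(labels map[string]string, clusterName string) bool {
	return clusterName != "" && labels[sourceClusterLabel] == clusterName
}

func main() {
	host, err := projectHost("https://milo.example.test", "p1")
	fmt.Println(host, err)
	fmt.Println(ownsClaim(map[string]string{sourceClusterLabel: "cell-a"}, "cell-a"))
}
```

Keeping one client per derived host means a single edge controller can talk to many project control planes without sharing state between them.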

The admission webhook requires that all metrics referenced in spec.quota.limits[].metric and spec.quota.metricRules[].metricCosts match a name declared in spec.metrics[]. The four quota-tracking metrics (workloads, instances, vcpus, memory) were missing from spec.metrics[], causing the webhook to reject the resource.
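The rule the webhook enforces can be approximated with a short check: every referenced metric name must appear in the declared list. This is a sketch of the rule as described, not the webhook's code. `undeclaredMetrics` is a hypothetical helper, and the literal metric names below are placeholders based on the four metrics named in the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// undeclaredMetrics returns the sorted, de-duplicated names that are
// referenced by quota limits or metric costs but missing from the declared
// metrics list.
func undeclaredMetrics(declared, referenced []string) []string {
	known := make(map[string]struct{}, len(declared))
	for _, name := range declared {
		known[name] = struct{}{}
	}
	seen := map[string]struct{}{}
	var missing []string
	for _, name := range referenced {
		if _, ok := known[name]; ok {
			continue
		}
		if _, dup := seen[name]; dup {
			continue
		}
		seen[name] = struct{}{}
		missing = append(missing, name)
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// Before the fix: quota metrics are referenced but not declared.
	declared := []string{"cpu-seconds"}
	referenced := []string{"workloads", "instances", "vcpus", "memory"}
	fmt.Println(undeclaredMetrics(declared, referenced))

	// After the fix: all four are declared, so nothing is missing.
	declared = append(declared, referenced...)
	fmt.Println(len(undeclaredMetrics(declared, referenced)))
}
```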

…o cell setup Controller flags --enable-management-controllers and --enable-cell-controllers now default to false, so kustomize components must explicitly opt in rather than both groups running by default. This stops the management-plane deployment from crashing when discovery.clusterName is unset: that field is only required by the InstanceReconciler (a cell controller), so the validation now lives in InstanceReconciler.SetupWithManager instead of initializeClusterDiscovery. Also adds cell-controllers and management-controllers components to the single-cluster overlay, which was silently running with no controllers enabled. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
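A rough sketch of the opt-in behavior, using only the standard `flag` package: both groups default to off, and the cluster-name check runs only when the cell group is enabled. The two flag names and the controller groupings come from the commit messages in this PR. The `--cluster-name` flag is a hypothetical stand-in for the `discovery.clusterName` config field, and `parseOptions`, `setupCellControllers`, and `setup` are illustrative names rather than functions in cmd/main.go.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

type options struct {
	enableManagement bool
	enableCell       bool
	clusterName      string // stand-in for discovery.clusterName
}

// parseOptions parses the controller opt-in flags; both default to false.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("compute-manager", flag.ContinueOnError)
	fs.BoolVar(&o.enableManagement, "enable-management-controllers", false, "run management-plane controllers")
	fs.BoolVar(&o.enableCell, "enable-cell-controllers", false, "run cell controllers")
	fs.StringVar(&o.clusterName, "cluster-name", "", "hypothetical flag for discovery.clusterName")
	err := fs.Parse(args)
	return o, err
}

// setupCellControllers validates what only the cell controllers need, so a
// management-plane deployment never hits this check.
func setupCellControllers(o options) error {
	if o.clusterName == "" {
		return errors.New("discovery.clusterName is required by the cell controllers")
	}
	return nil
}

// setup returns the names of the controllers that would be registered.
func setup(o options) ([]string, error) {
	var enabled []string
	if o.enableManagement {
		enabled = append(enabled, "WorkloadReconciler", "WorkloadDeploymentFederator", "InstanceProjector")
	}
	if o.enableCell {
		if err := setupCellControllers(o); err != nil {
			return nil, err
		}
		enabled = append(enabled, "WorkloadDeploymentReconciler", "InstanceReconciler")
	}
	return enabled, nil
}

func main() {
	o, err := parseOptions([]string{"--enable-management-controllers"})
	if err != nil {
		panic(err)
	}
	fmt.Println(setup(o))
}
```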

…scovery The rebase during cherry-pick propagation introduced a mixed state where cmd/main.go had the edgeClusterName/projectRestConfig return values partially reverted. This cleans up the function signature and call sites to be consistent, while keeping the validation removed from initializeClusterDiscovery (it belongs in InstanceReconciler.SetupWithManager per the original fix intent). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

scotwells requested review from JoseSzycho, kevwilliams, mattdjenkinson, privateip and savme May 19, 2026 19:31

scotwells mentioned this pull request May 19, 2026

Launch Workload Compute Service ("UFOs") datum-cloud/enhancements#682

Open

scotwells force-pushed the feat/federated-deployment-scheduling branch from 0c0d8df to 134086f Compare May 19, 2026 21:10

scotwells changed the title ~~feat: federated deployment scheduling across POP cells~~ feat: Route workloads to city locations via distributed scheduling May 20, 2026

scotwells force-pushed the feat/federated-deployment-scheduling branch 2 times, most recently from 6dc43ed to 6e9a268 Compare May 20, 2026 21:53

scotwells force-pushed the feat/federated-deployment-scheduling branch from 6e9a268 to 492eb6c Compare May 20, 2026 22:19

mattdjenkinson approved these changes May 22, 2026


scotwells and others added 8 commits May 26, 2026 15:04

feat: replace cert-manager certificate resources with CSI volume moun…

0f69956

…ts for webhook TLS Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat: remove webhook CA injection — Milo trusts the cert issuer directly

a11861e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

scotwells requested a review from mattdjenkinson May 27, 2026 00:15

scotwells and others added 5 commits May 26, 2026 19:18

mattdjenkinson approved these changes May 27, 2026


scotwells and others added 2 commits May 27, 2026 10:58

scotwells wants to merge 16 commits into docs/issue-85-karmada-federation-design from feat/federated-deployment-scheduling
