feat: Route workloads to city locations via distributed scheduling #107

Open
scotwells wants to merge 16 commits into docs/issue-85-karmada-federation-design from feat/federated-deployment-scheduling

Contributor

@scotwells scotwells commented May 18, 2026

Summary

Workloads targeting a city location are now automatically routed to the correct physical site, and their status is surfaced back to the platform in real time. Previously, a single central scheduler made all placement decisions; this distributes that responsibility across regional clusters so each site can operate independently.

When you deploy a workload to a city, the platform now routes it to the right physical site based on the city code, reports instance health and readiness back without any manual steps, and continues operating even if other parts of the control plane are temporarily unreachable.

Nothing changes for users — city-code targeting, instance visibility, and the existing API all work exactly as before.

Design: #106

Test plan

  • New workload deployed → routed to correct city → instance status visible on control plane within one reconcile cycle
  • Deleting a workload cleans up all federated resources
  • City-specific routing policies are created and removed as deployments come and go
  • Instance status propagates correctly through the full scheduling chain
  • Full end-to-end federation test (workload creation → site placement → instance projection)

Closes #85

@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch from 0c0d8df to 134086f Compare May 19, 2026 21:10
@scotwells scotwells changed the title feat: federated deployment scheduling across POP cells feat: Route workloads to city locations via distributed scheduling May 20, 2026
@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch 2 times, most recently from 6dc43ed to 6e9a268 Compare May 20, 2026 21:53
Workloads targeting a city location are now automatically routed to the
correct physical site via a Karmada-based federation layer. Each POP cell
operates independently, instance health is surfaced back to the control
plane in real time, and the platform remains available even when parts of
the control plane are temporarily unreachable.

Controllers added:
- WorkloadDeploymentFederator: replicates WDs into Karmada and manages
  PropagationPolicies per city code
- InstanceProjector: mirrors Instance write-backs from Karmada into the
  project namespace on the control plane
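
For reviewers, a minimal Go sketch of the per-city policy idea described above. It is an illustration only, not the controller's code: `propagationPolicy` is a simplified stand-in for Karmada's PropagationPolicy, and the label key, the `city-` name prefix, and the lower-casing of city codes are all assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// propagationPolicy is a simplified, hypothetical stand-in for Karmada's
// PropagationPolicy. Only the fields needed for the illustration are kept.
type propagationPolicy struct {
	Name          string
	ClusterLabels map[string]string // selects member clusters by label
}

// cityLabelKey is an assumed label key; the real key is not stated in this PR.
const cityLabelKey = "example.com/city-code"

// policyForCity derives one policy per city code, so every deployment that
// targets the same city resolves to the same placement rule.
func policyForCity(cityCode string) (propagationPolicy, error) {
	code := strings.ToLower(strings.TrimSpace(cityCode))
	if code == "" {
		return propagationPolicy{}, errors.New("city code must not be empty")
	}
	return propagationPolicy{
		Name:          "city-" + code,
		ClusterLabels: map[string]string{cityLabelKey: code},
	}, nil
}

func main() {
	p, err := policyForCity("DFW")
	if err != nil {
		panic(err)
	}
	fmt.Println(p.Name, p.ClusterLabels[cityLabelKey]) // prints "city-dfw dfw"
}
```

Deriving the policy name from the city code alone means the policy can be created when the first deployment for a city appears and removed when the last one goes away.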

ResourceInterpreterCustomization deployed at config time teaches Karmada
how to aggregate replica counts and conditions across POP cells.

Operator flags --enable-management-controllers and --enable-cell-controllers
allow each deployment to opt into only the controllers it needs.

Includes a 6-test Chainsaw e2e suite covering federation, deletion cascade,
propagation policy lifecycle, instance projection, instance write-back, and
the full end-to-end chain.

Resolves #85

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@scotwells scotwells force-pushed the feat/federated-deployment-scheduling branch from 6e9a268 to 492eb6c Compare May 20, 2026 22:19
scotwells and others added 8 commits May 26, 2026 15:04
…edge

Introduces management-plane and cell overlay paths to the compute
OCI artifact so the infra repo can deploy compute-manager in the
correct mode for each tier of the federation architecture.

The management-plane overlay deploys compute-manager with only
WorkloadDeploymentFederator and InstanceProjector enabled, connected
to the Karmada downstream control plane via projected ServiceAccount
token auth. The cell overlay deploys compute-manager with only
WorkloadDeploymentReconciler and InstanceReconciler enabled, with no
downstream connection or webhook server.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ts for webhook TLS

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove the hardcoded datum-control-plane ClusterIssuer from the
csi-webhook-cert component. DNS names stay since they are fixed by the
service name and namespace. Each consuming overlay now supplies the issuer
via a strategic merge patch, allowing different environments to use
different cert issuers without forking the component.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each WorkloadDeployment is routed to exactly one cell cluster via its
PropagationPolicy, so aggregation across multiple members is not needed.
Replace the summing logic with a direct pass-through of the single member's
status.
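
A rough Go illustration of the difference between summing and pass-through. The actual logic lives in the ResourceInterpreterCustomization and is not shown in this PR, so the `memberStatus` type and both function names below are hypothetical.

```go
package main

import "fmt"

// memberStatus is a hypothetical, simplified view of the status that one
// member (cell) cluster reports for a WorkloadDeployment.
type memberStatus struct {
	Cluster       string
	Replicas      int32
	ReadyReplicas int32
}

// sumStatus shows the previous approach: add up counts across all members.
func sumStatus(members []memberStatus) memberStatus {
	var total memberStatus
	for _, m := range members {
		total.Replicas += m.Replicas
		total.ReadyReplicas += m.ReadyReplicas
	}
	return total
}

// passThroughStatus shows the replacement: with exactly one target cluster
// per deployment, the single member's status is returned unchanged. The
// boolean is false when no member has reported yet.
func passThroughStatus(members []memberStatus) (memberStatus, bool) {
	if len(members) == 0 {
		return memberStatus{}, false
	}
	return members[0], true
}

func main() {
	members := []memberStatus{{Cluster: "cell-a", Replicas: 3, ReadyReplicas: 2}}
	status, ok := passThroughStatus(members)
	fmt.Println(status.Cluster, status.Replicas, status.ReadyReplicas, ok)
	fmt.Println(sumStatus(members).Replicas)
}
```

With a single member both functions agree on the counts, but pass-through also keeps fields that cannot be meaningfully summed, such as the reporting cluster.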

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The cert issuer name is environment-specific configuration that belongs
in the infra repo, not the compute overlay. The infra repo's base manager
patch already owns the full webhook-server-tls volume definition including
the issuer. Consumers deploying outside infra must patch the issuer in their
own overlay.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…moval

dev: inline self-signed Issuer + Certificate for host.docker.internal,
replace kustomize replacements block with direct annotation patch, remove
Certificate-patching from webhook_patch.yaml, and clear webhookServer
secretRef from config.yaml.

single-cluster: replace cert-manager Certificate approach with the
csi-webhook-cert component, matching the main branch overlay.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The WorkloadReconciler watches networkingv1alpha.Network objects, which
requires the network-services-operator CRDs to be installed. Cell clusters
don't have those CRDs, causing the manager to crash on startup. Gate the
WorkloadReconciler behind enableManagementControllers so it only runs where
the Network CRDs are present.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@scotwells scotwells requested a review from mattdjenkinson May 27, 2026 00:15
scotwells and others added 5 commits May 26, 2026 19:18
Extracts server config file reading and decoding into a dedicated
loadServerConfig helper, reducing main's cyclomatic complexity from
31 to 29 to satisfy the gocyclo linter limit of 30.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Milo's authorization webhook uses Extra claims on the admission request
(iam.miloapis.com/parent-name, iam.miloapis.com/parent-type, etc.) to
resolve the correct project-scoped policy binding. Dropping them caused
the SAR to return Allowed=false even for users with networks.use, because
the authorizer couldn't locate the binding without the project context.
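
The fix amounts to carrying the Extra map through when building the SubjectAccessReview (SAR). A simplified sketch, using local stand-in types rather than the Kubernetes API types; `userInfo`, `accessReviewSpec`, and `reviewSpecFor` are hypothetical names.

```go
package main

import "fmt"

// userInfo is a simplified stand-in for the user info on an admission request.
type userInfo struct {
	Username string
	Groups   []string
	Extra    map[string][]string
}

// accessReviewSpec is a simplified stand-in for a SubjectAccessReview spec.
type accessReviewSpec struct {
	User     string
	Groups   []string
	Extra    map[string][]string
	Verb     string
	Resource string
}

// reviewSpecFor copies the requesting user's identity, including every Extra
// entry, so the authorizer can resolve the project-scoped binding.
func reviewSpecFor(u userInfo, verb, resource string) accessReviewSpec {
	extra := make(map[string][]string, len(u.Extra))
	for key, values := range u.Extra {
		extra[key] = append([]string(nil), values...) // copy, do not alias
	}
	return accessReviewSpec{
		User:     u.Username,
		Groups:   append([]string(nil), u.Groups...),
		Extra:    extra,
		Verb:     verb,
		Resource: resource,
	}
}

func main() {
	u := userInfo{
		Username: "alice",
		Extra: map[string][]string{
			"iam.miloapis.com/parent-name": {"my-project"},
			"iam.miloapis.com/parent-type": {"Project"},
		},
	}
	spec := reviewSpecFor(u, "use", "networks")
	fmt.Println(spec.Extra["iam.miloapis.com/parent-name"])
}
```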

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
metricRules belongs under spec.quota, not spec.billing. The field is
not declared in the ServiceBillingConfig schema, causing Flux dry-run
failures in staging with:

  .spec.billing.metricRules: field not declared in schema
Previously, InstanceReconciler wrote ResourceClaim objects against
the local deployment cluster via managementCluster.GetClient(). Those
claims were never seen by the Milo quota system, leaving every Instance
in QuotaGranted=Unknown indefinitely.

This change routes claim creation and deletion to the correct Milo
project control plane for each instance using a new
ProjectQuotaClientManager that builds per-project REST clients by
rewriting the host path — mirroring the URL construction already used
by the milomulticluster provider.
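
To make the host-path rewrite concrete, here is a small sketch using only the standard library. The path layout `/projects/<name>/control-plane` and the function name `projectHost` are placeholders; the real URL scheme is whatever the milomulticluster provider already uses.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"path"
	"strings"
)

// projectHost rewrites a base API host so that requests are served by the
// control plane of one project. The path segments are assumed for the sketch.
func projectHost(baseHost, project string) (string, error) {
	if project == "" || strings.Contains(project, "/") {
		return "", errors.New("project name must be non-empty and contain no slash")
	}
	u, err := url.Parse(baseHost)
	if err != nil {
		return "", fmt.Errorf("parse base host: %w", err)
	}
	if u.Host == "" {
		return "", errors.New("base host must include a scheme and host")
	}
	// Keep any existing base path and append the project-specific suffix.
	u.Path = path.Join("/", u.Path, "projects", project, "control-plane")
	return u.String(), nil
}

func main() {
	host, err := projectHost("https://api.example.com", "my-project")
	if err != nil {
		panic(err)
	}
	fmt.Println(host) // prints "https://api.example.com/projects/my-project/control-plane"
}
```

A per-project client would then be built from a copy of the base REST config with its host set to the rewritten value.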

The management-cluster claim watch is replaced with a multicluster
Watches call so that grant/denial status changes in project control
planes re-trigger instance reconciles. Claims are stamped with a
source-cluster label (discovery.clusterName) so each edge controller
only reacts to the claims it created.
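
The label-based filtering can be pictured as a stamp on create plus a predicate on watch events. In this sketch the label key and both function names are assumptions; only the idea of stamping claims with the value of discovery.clusterName comes from the commit message.

```go
package main

import "fmt"

// sourceClusterLabel is an assumed label key; the real key is not given here.
const sourceClusterLabel = "example.com/source-cluster"

// stampSourceCluster returns a copy of labels with the source cluster added,
// to be applied to a claim when the edge controller creates it.
func stampSourceCluster(labels map[string]string, clusterName string) map[string]string {
	out := make(map[string]string, len(labels)+1)
	for k, v := range labels {
		out[k] = v
	}
	out[sourceClusterLabel] = clusterName
	return out
}

// ownedByCluster reports whether a claim was created by the given cluster.
// A watch handler would skip events for which this returns false.
func ownedByCluster(labels map[string]string, clusterName string) bool {
	return clusterName != "" && labels[sourceClusterLabel] == clusterName
}

func main() {
	labels := stampSourceCluster(map[string]string{"app": "demo"}, "cell-dfw")
	fmt.Println(ownedByCluster(labels, "cell-dfw")) // prints "true"
	fmt.Println(ownedByCluster(labels, "cell-lhr")) // prints "false"
}
```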

Co-Authored-By: Claude <claude@anthropic.com>
The admission webhook requires that all metrics referenced in
spec.quota.limits[].metric and spec.quota.metricRules[].metricCosts
match a name declared in spec.metrics[]. The four quota-tracking
metrics (workloads, instances, vcpus, memory) were missing from
spec.metrics[], causing the webhook to reject the resource.
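
The rule the webhook enforces can be sketched as a set-membership check. This is not the webhook's code; `undeclaredMetrics` is a hypothetical helper that takes the names declared in spec.metrics[] and the names referenced from the quota section.

```go
package main

import (
	"fmt"
	"sort"
)

// undeclaredMetrics returns, sorted and de-duplicated, every referenced
// metric name that is missing from the declared list.
func undeclaredMetrics(declared, referenced []string) []string {
	known := make(map[string]struct{}, len(declared))
	for _, name := range declared {
		known[name] = struct{}{}
	}
	seen := make(map[string]struct{})
	var missing []string
	for _, name := range referenced {
		if _, ok := known[name]; ok {
			continue
		}
		if _, dup := seen[name]; dup {
			continue
		}
		seen[name] = struct{}{}
		missing = append(missing, name)
	}
	sort.Strings(missing)
	return missing
}

func main() {
	referenced := []string{"workloads", "instances", "vcpus", "memory"}

	// Before the fix: none of the quota-tracking metrics are declared.
	fmt.Println(undeclaredMetrics(nil, referenced)) // prints "[instances memory vcpus workloads]"

	// After the fix: all four are declared, so nothing is missing.
	fmt.Println(len(undeclaredMetrics(referenced, referenced))) // prints "0"
}
```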
scotwells and others added 2 commits May 27, 2026 10:58
…o cell setup

Controller flags --enable-management-controllers and --enable-cell-controllers
now default to false, so kustomize components must explicitly opt in rather than
both groups running by default. This prevents the management-plane deployment
from crashing when discovery.clusterName is unset. That field is only required
by the InstanceReconciler (a cell controller), so the validation now lives in
InstanceReconciler.SetupWithManager instead of initializeClusterDiscovery.

Also adds cell-controllers and management-controllers components to the
single-cluster overlay, which was silently running with no controllers enabled.
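
A compact sketch of the opt-in behavior, assuming the standard library `flag` package. The flag names and controller names come from this PR; the `options` type, `parseOptions`, and `setupControllers` are hypothetical, and the cluster name is passed in directly rather than read from the config file.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// options holds the two opt-in switches plus the cluster name that would
// normally come from discovery.clusterName in the server config.
type options struct {
	enableManagement bool
	enableCell       bool
	clusterName      string
}

// parseOptions registers both flags with a default of false.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("compute-manager", flag.ContinueOnError)
	fs.BoolVar(&o.enableManagement, "enable-management-controllers", false, "run management-plane controllers")
	fs.BoolVar(&o.enableCell, "enable-cell-controllers", false, "run cell controllers")
	err := fs.Parse(args)
	return o, err
}

// setupControllers returns the controllers that would be started. The
// cluster name is validated only when cell controllers are enabled.
func setupControllers(o options) ([]string, error) {
	var enabled []string
	if o.enableManagement {
		enabled = append(enabled, "WorkloadReconciler", "WorkloadDeploymentFederator", "InstanceProjector")
	}
	if o.enableCell {
		if o.clusterName == "" {
			return nil, errors.New("discovery.clusterName is required by InstanceReconciler")
		}
		enabled = append(enabled, "WorkloadDeploymentReconciler", "InstanceReconciler")
	}
	return enabled, nil
}

func main() {
	o, err := parseOptions([]string{"--enable-management-controllers"})
	if err != nil {
		panic(err)
	}
	controllers, err := setupControllers(o)
	fmt.Println(controllers, err)
}
```

With no flags set, the list is empty, which matches the single-cluster overlay symptom described above.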

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…scovery

The rebase during cherry-pick propagation introduced a mixed state where
cmd/main.go had the edgeClusterName/projectRestConfig return values partially
reverted. This cleans up the function signature and call sites to be consistent,
while keeping the validation removed from initializeClusterDiscovery (it belongs
in InstanceReconciler.SetupWithManager per the original fix intent).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>