Tutorial 06 — Rolling module upgrade with canary

Status: active

What you'll learn: Upgrade a module version across a fleet in batches, with health-check gates between batches and an automatic circuit breaker that pauses for operator review when too many instances fail in a row.

Time: ~30 min (most of which is watching batches complete)

Builds on: Tutorial 02 (you understand module versions + promotion) and Tutorial 03 (you have a running fleet of instances with modules assigned).

Sets you up for: Tutorial 07 (CVE response), where CVE remediation orchestrates the same rolling_module_upgrade skill across affected modules.

What you're building

flowchart TD
    Op[Operator] --> Plan[RollingModuleUpgradeExecutor<br/>plan mode<br/>batch_pct=20%]
    Plan --> Plan2[Skill plans batches:<br/>50 instances → 5 × 10]
    Plan2 --> Apr[ApprovalRequest<br/>created]
    Apr --> Op2{Operator<br/>approves?}
    Op2 -->|yes| B1[Batch 1: upgrade 10 instances]
    Op2 -->|edit| Edit[Adjust batch_pct or<br/>max_consecutive_failures]
    Edit --> Op2
    B1 --> H1{Health checks<br/>pass?}
    H1 -->|yes| B2[Batch 2: upgrade 10]
    H1 -->|2 in a row fail| CB[Circuit breaker<br/>tripped]
    CB --> Apr2[New ApprovalRequest:<br/>continue / rollback / abort]
    Apr2 --> Op3{Operator<br/>decides}
    Op3 -->|continue| B2
    Op3 -->|rollback| RB[Rollback completed batches]
    Op3 -->|abort| Stop[Investigate failed instances]
    B2 --> B3[...batches 3-5]
    B3 --> Done[All instances on new version<br/>system.drift_report = false]

By the end you'll have upgraded a module (e.g., nginx 1.24 → 1.26) across a 50-instance fleet, with every batch gated by health checks and a circuit breaker standing by to pause the rollout.

Concept refresher

rolling_module_upgrade is a skill executor (see SKILL_EXECUTORS.md) that:

  1. Plans a batched sequence based on batch_pct (default 10%; this tutorial uses 20%)
  2. Creates an ApprovalRequest per Fleet Autonomy intervention policy (system.fleet_rolling_upgrade is require_approval)
  3. On approval, walks batches one at a time
  4. After each batch, runs health checks (default: instance heartbeats with new module digest in running_module_digests)
  5. Tracks consecutive failures; trips circuit breaker at max_consecutive_failures (default 2)
  6. On trip: emits module.upgrade.circuit_breaker_tripped event + creates a continuation ApprovalRequest with options continue_anyway / rollback_completed_batches / abort
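The planning step (item 1) can be illustrated with a small sketch. This is not the executor's actual code: `plan_batches` is a hypothetical helper, and rounding the batch size up is an assumption. The output keys mirror the plan shape shown later in Step 2.

```ruby
# Hypothetical helper: split instance IDs into batches sized by batch_pct.
# Rounding up is an assumption; the real executor may round differently.
def plan_batches(instance_ids, batch_pct: 10)
  raise ArgumentError, "batch_pct must be in 1..100" unless (1..100).cover?(batch_pct)
  return [] if instance_ids.empty?

  batch_size = (instance_ids.size * batch_pct / 100.0).ceil
  instance_ids.each_slice(batch_size).each_with_index.map do |ids, index|
    { index: index, instance_ids: ids, phase: "pending" }
  end
end

# Placeholder IDs for a 50-instance fleet.
ids  = (1..50).map { |n| "i-#{n}" }
plan = plan_batches(ids, batch_pct: 20)
puts "#{plan.size} batches of #{plan.first[:instance_ids].size}"
```

With 50 instances and batch_pct 20 this yields 5 batches of 10, matching the flowchart; with the default of 10 the same fleet would be split into 10 batches of 5.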

Why batch? Limits blast radius. A bad version reaches at most batch_pct of the fleet before the circuit trips.

Why approval? Even when every batch is healthy, the rollout itself is a controlled production change, so an operator should sign off on timing.

Prerequisites

| Requirement | How |
|---|---|
| Existing fleet: ≥10 NodeInstances assigned a common module (e.g., nginx 1.24.0) | Provision via Tutorial 01 + assign via Tutorial 02 pattern |
| New version (nginx 1.26.0) published + promoted to blessed or live | Tutorial 02 steps 6–8 |
| Operator permission system.fleet_rolling_upgrade (often paired with approval rights) | Default for admin users |

Step 1 — Identify the upgrade target

platform.system_list_module_versions({ module_id: "<nginx-module-id>" })
// → { versions: [
//      { id: "v-1.24.0", promotion_state: "live", ... },
//      { id: "v-1.26.0", promotion_state: "blessed", ... }
//    ] }

platform.system_list_instances({ template_id: "<edge-template>" })
// → { instances: [{ id, status: "running", running_module_digests: { nginx: "sha256:..." } }, ...50] }

Expected outcome: 50 instances running v1.24.0, and v1.26.0 listed with promotion_state blessed (or live), ready to roll out.
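If you script this check, a pre-flight guard along the following lines may help. It is a sketch under assumptions: `rollout_ready?` is a hypothetical helper, and the version hashes simply mirror the sample response above rather than a documented schema.

```ruby
# States the Prerequisites section names as acceptable for a rollout target.
ROLLOUT_READY_STATES = %w[blessed live].freeze

# Hypothetical pre-flight check: does the target version exist, and is it
# promoted far enough to roll out?
def rollout_ready?(versions, target_version_id)
  version = versions.find { |v| v[:id] == target_version_id }
  !version.nil? && ROLLOUT_READY_STATES.include?(version[:promotion_state])
end

# Illustrative data shaped like the sample response.
versions = [
  { id: "v-1.24.0", promotion_state: "live" },
  { id: "v-1.26.0", promotion_state: "blessed" }
]

puts rollout_ready?(versions, "v-1.26.0")
puts rollout_ready?(versions, "v-9.9.9") # an id that does not resolve
```

The same guard catches the case described under Troubleshooting, where a version id no longer resolves.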

Step 2 — Plan the upgrade (dry-run via the executor)

The rolling_module_upgrade skill is a monitor-agent executor: in production the autonomy reconciler runs the batches (Step 3), but you can compute the plan up-front in plan mode by invoking the executor directly. (There is no execute_skill MCP action — the executor is a Ruby class; run it via rails runner or a seed, exactly as db/seeds/example_rolling_upgrade.rb does.)

result = ::System::Ai::Skills::RollingModuleUpgradeExecutor.new(
  account: account, agent: fleet_autonomy_agent
).execute(
  template_id:               "<edge-template>",
  module_id:                 "<nginx-module-id>",
  target_version_id:         "v-1.26.0",
  batch_pct:                 20,
  max_consecutive_failures:  2,
  health_timeout_sec:        300
)
# → {
#      total_instances: 50,
#      batch_size: 10,
#      batch_count: 5,
#      estimated_total_seconds: 1500,
#      circuit_breaker: { trips_after_consecutive_failures: 2, status: "armed" },
#      batches: [
#        { index: 0, instance_ids: [...10], phase: "pending" },
#        ...
#      ]
#    }

(Defaults if you omit them: batch_pct: 10, max_consecutive_failures: 2, health_timeout_sec: 600.)

Expected outcome: plan shows 5 batches of 10 instances each, ~25 min total (5 batches × 5 min health window).
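The 25-minute figure is one full health window per batch. A minimal sketch of that arithmetic follows; `estimate_total_seconds` is a hypothetical helper, and treating the whole health_timeout_sec as the per-batch cost (an upper bound) is an assumption drawn from the "5 batches × 5 min" reasoning above.

```ruby
# Hypothetical helper: upper-bound duration, assuming each batch consumes
# its full health window and batches run strictly one after another.
def estimate_total_seconds(total_instances:, batch_pct:, health_timeout_sec:)
  return 0 if total_instances.zero?

  batch_size  = (total_instances * batch_pct / 100.0).ceil
  batch_count = (total_instances.to_f / batch_size).ceil
  batch_count * health_timeout_sec
end

seconds = estimate_total_seconds(total_instances: 50, batch_pct: 20, health_timeout_sec: 300)
puts "#{seconds} s (~#{seconds / 60} min)"
```

Under the same assumption, the defaults (batch_pct 10, health_timeout_sec 600) would give 10 batches × 600 s = 6000 s for this fleet, which is why the tutorial overrides both.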

Step 3 — Approve the plan

Operator opens /app/approvals UI → reviews the plan → optionally edits batch_pct (smaller for Tier-1 services) or max_consecutive_failures (1 for stricter stop-on-fail) → clicks Approve.

Once approved, the autonomy reconciler picks up the plan on its next 60s tick and starts executing.

Step 4 — Watch progress

platform.recent_events({ kind_prefix: "module.upgrade", limit: 100 })
// → events: [
//      { kind: "module.upgrade.batch_started", batch_index: 0, instance_count: 10, ... },
//      { kind: "module.upgrade.instance_started", instance_id, target_version, ... },
//      { kind: "module.upgrade.instance_health_check", instance_id, healthy: true, ... },
//      { kind: "module.upgrade.batch_completed", batch_index: 0, healthy_count: 10, failed_count: 0, ... },
//      { kind: "module.upgrade.batch_started", batch_index: 1, ... }
//    ]

Or via UI: /app/system/operations → "Active rolling upgrades" panel shows batch status + per-instance progress.
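If you poll events from a script, you may want to condense them into one line per completed batch. The sketch below assumes the events arrive as Ruby hashes with the keys shown in the sample payload; `summarize_batches` is a hypothetical helper, not part of the platform.

```ruby
# Hypothetical helper: reduce upgrade events to one row per completed batch.
# Hash keys mirror the sample payload; the real event shape may differ.
def summarize_batches(events)
  events
    .select { |e| e[:kind] == "module.upgrade.batch_completed" }
    .sort_by { |e| e[:batch_index] }
    .map { |e| { batch_index: e[:batch_index], healthy: e[:healthy_count], failed: e[:failed_count] } }
end

# Illustrative data only, not captured from a real run.
events = [
  { kind: "module.upgrade.batch_started",   batch_index: 0, instance_count: 10 },
  { kind: "module.upgrade.batch_completed", batch_index: 0, healthy_count: 10, failed_count: 0 },
  { kind: "module.upgrade.batch_started",   batch_index: 1, instance_count: 10 }
]

summarize_batches(events).each do |row|
  puts "batch #{row[:batch_index]}: #{row[:healthy]} healthy, #{row[:failed]} failed"
end
```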

Step 5 — Circuit breaker scenario (drill)

To rehearse circuit breaker behavior, deliberately publish a broken version (e.g., nginx with a syntax error in its config):

  1. Build & publish nginx 1.26.0-broken via Tutorial 02
  2. Run Step 2 with target_version_id: "v-1.26.0-broken"
  3. After 2 consecutive instances in batch 1 (batch_index: 0) fail health checks:
// Reconciler emits:
{ kind: "module.upgrade.circuit_breaker_tripped",
  batch_index: 0,
  failed_instance_ids: ["...", "..."],
  reason: "max_consecutive_failures (2) exceeded" }

// And creates an approval request:
{ approval_request: {
    type: "rolling_upgrade_continuation",
    options: ["continue_anyway", "rollback_completed_batches", "abort"]
}}

Operator decides:

  • continue_anyway — ignore the trip, proceed (use only when failures are transient)
  • rollback_completed_batches — restore previously-upgraded instances to v1.24.0 (use when the new version has a fundamental flaw)
  • abort — stop here; investigate failed instances manually
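The trip condition itself can be sketched as a consecutive-failure counter. This is an illustration, not the reconciler's code: `ConsecutiveFailureBreaker` is a hypothetical class, and resetting the streak on any healthy result is an assumption implied by the word "consecutive".

```ruby
# Hypothetical model of the circuit breaker's trip condition.
class ConsecutiveFailureBreaker
  def initialize(max_consecutive_failures: 2)
    @max    = max_consecutive_failures
    @streak = 0
  end

  # Record one health-check result; returns true once the breaker has tripped.
  # Assumption: a healthy result resets the failure streak to zero.
  def record(healthy)
    @streak = healthy ? 0 : @streak + 1
    tripped?
  end

  def tripped?
    @streak >= @max
  end
end

breaker = ConsecutiveFailureBreaker.new(max_consecutive_failures: 2)
results = [true, false, true, false, false] # illustrative health-check results
trip_at = results.index { |healthy| breaker.record(healthy) }
puts "tripped at result index #{trip_at}"
```

In this model an isolated failure between healthy results does not trip the breaker; only two failures in a row do. Setting max_consecutive_failures to 1 would stop on the first failure.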

Verification

After all batches complete:

platform.system_drift_report({ template_id: "<edge-template>" })
// → { drift: false }   (all instances now running v1.26.0)

platform.system_get_instance({ id: "<sample-instance>" })
// → { instance: { running_module_digests: { nginx: "sha256:<v1.26-digest>", ... } } }
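Checking a single sample instance can miss stragglers. A minimal sketch for sweeping the whole fleet follows, assuming you have the instance list as Ruby hashes shaped like the Step 1 response; `stragglers` is a hypothetical helper and the digests are placeholders.

```ruby
# Hypothetical helper: IDs of instances not yet reporting the target digest
# for the given module (including instances that report no digest at all).
def stragglers(instances, module_name:, target_digest:)
  instances
    .reject { |inst| inst.dig(:running_module_digests, module_name) == target_digest }
    .map { |inst| inst[:id] }
end

# Illustrative data; real digests are full sha256 strings.
instances = [
  { id: "i-1", running_module_digests: { nginx: "sha256:new" } },
  { id: "i-2", running_module_digests: { nginx: "sha256:old" } },
  { id: "i-3", running_module_digests: {} }
]

behind = stragglers(instances, module_name: :nginx, target_digest: "sha256:new")
puts behind.empty? ? "fleet converged" : "still behind: #{behind.join(', ')}"
```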

Extract a learning

platform.create_learning({
  title: "nginx 1.24 → 1.26 rolling upgrade — batch_pct=20% works for edge fleet",
  category: "best_practice",
  content: "50-instance edge fleet: 20% batches × 5 batches × ~5min health window = 25 min total. Zero circuit breaker trips. Recommend keeping batch_pct=20% for similar-sized fleets; reduce to 10% for Tier-1 services with smaller blast radius tolerance.",
  tags: ["rolling-upgrade", "nginx", "batch-sizing"]
})

Future similar upgrades surface this learning in the rolling_module_upgrade skill's reasoning.

Cleanup

If you ran the circuit-breaker drill, restore the fleet to a known state:

# If you chose abort or rollback_completed_batches, no further action is needed.
# If you chose continue_anyway and the version really was broken, roll back
# manually by re-running the executor with the previous version as the target:
::System::Ai::Skills::RollingModuleUpgradeExecutor.new(
  account: account, agent: fleet_autonomy_agent
).execute(
  template_id:       "<edge-template>",
  module_id:         "<nginx-module-id>",
  target_version_id: "v-1.24.0",
  batch_pct:         50   # larger batches for a faster rollback
)

Troubleshooting

Approval never appears — check that system.fleet_rolling_upgrade is in the agent's intervention policies and not blocked. Inspect:

platform.agent_introspect({ agent_id: "fleet_autonomy_agent" })
// Look for "intervention_policies" containing system.fleet_rolling_upgrade

Batch never starts after approval — autonomy reconciler is paused or its tick isn't running. Check:

sudo systemctl status powernode-worker@default
journalctl -u powernode-worker@default | grep fleet_autonomy

Health checks always fail — default health check requires the instance to heartbeat with new digest in running_module_digests. If your heartbeat is broken (Tutorial 03 troubleshooting covers diagnosis) or your agent is offline, every batch fails. Fix the underlying connectivity first.

max_consecutive_failures: 2 trips on transient network blips. If your instances flap between healthy and unhealthy because of network conditions, either raise the threshold or fix the underlying networking (the trips are a symptom, not the cause).

Rollback doesn't fully restore previous version — the rollback path re-runs rolling_module_upgrade with the previous version as target. A retired version is still kept for rollback/audit, so retiring the old version doesn't break rollback by itself — but if the version row was deleted outright, rollback fails because target_version_id no longer resolves. Keep the prior version row (even retired) until you're certain you'll never roll back.

What's next

  • Tutorial 07 — CVE response — uses rolling_module_upgrade as the actual remediation; CVE response is essentially "automated rolling upgrade triggered by a CVE signal."
  • Tutorial 08 — Instance pools — for stateless workloads, pool replacement is often safer than in-place upgrade: terminate old instance, claim a fresh new-version one from the pool. Pools cut blast radius further.
  • SKILL_EXECUTORS.md §rolling_module_upgrade — full skill input/output reference.
  • FLEET_SENSORS.md: system.fleet_rolling_upgrade intervention policy.

Last verified: 2026-06-03