Node Provisioning Runbook

Status: active

Step-by-step operator guide for the full Node + NodeInstance lifecycle: from "I have a Template" through "the instance is decommissioned and rows cleaned up." Includes per-state error recovery and the LocalQemuProvider variant for offline / smoke-test environments.

Audience: external operators (open-source consumers), internal Powernode operators, on-call SREs handling stuck instances.

Quick reference

| Phase | What happens | Typical duration | MCP entry point |
| --- | --- | --- | --- |
| 1. Create Node | Logical row representing a future-or-existing host | <1 s | `system_create_node` |
| 2. Provision instance | Provider boots a VM with the netboot image | 30 s – 10 min | `system_provision_instance` |
| 3. Bootstrap | Agent installs, mTLS handshake, module reconcile | ~90 s on warm cache, +30-60 s cold (5-10 min on slow providers) | none (agent-driven) |
| 4. Run | Heartbeats, reconcile loop, task lease | indefinite | `system_get_instance` |
| 5. Drain | Workloads relocated, services stopped | 1-30 min | `system_drain_instance` |
| 6. Decommission | Provider VM destroyed, FK cascades fire | <1 min | `system_terminate_instance` |

Lifecycle diagram

The NodeInstance AASM has 9 states: pending, provisioning, starting, running, stopping, stopped, rebooting, terminated, error. There is no draining state: drain is an operation that ultimately drives the instance running → stopping → stopped, and draining exists only as a pool_state on pooled instances. There is also no failed state; the terminal failure state is error. The agent boot window annotated in the diagram below (often called "bootstrapping") is part of the provisioning state, not a separate AASM state.

  ┌──────────────┐
  │ (no instance)│  Node row exists; no provider VM yet
  └──────┬───────┘
         │ system_provision_instance
         ▼
  ┌──────────────┐  AASM: pending → mark_provisioning → provisioning
  │ provisioning │  Provider creates VM; netboot fetches kernel + initramfs
  └──────┬───────┘
         │  (agent boot window: enroll CSR → mTLS,
         │   modules pulled, fs-verity verified, composefs mounted)
         │                                   ╲  provider error / boot failure
         │ first phase=ready heartbeat        ╲  → mark_errored
         ▼                                     ▼
  ┌──────────────┐  AASM: mark_running    ┌──────────┐
  │   running    │  heartbeats every 30s  │  error   │ terminal failure
  └──────┬───────┘  task lease ready      └────┬─────┘ (no orphaned row)
         │                                      │ terminate
         │ system_drain_instance (graceful):    │ (allowed from error)
         │   stop/cordon → AASM stopping→stopped │
         │ -or- system_terminate_instance (hard)│
         ▼                                       ▼
  ┌──────────────────────────────────────────────────┐
  │              terminated                          │
  │  Cascade FKs: Devops::DockerHost / KubernetesNode│
  │  cleanup, Vault TLS revoked, Sdwan::Peer removed │
  └──────────────────────────────────────────────────┘

The terminate event currently transitions only from running, stopped, and error — not directly from provisioning or starting. See Per-state error recovery for how to clear an instance still mid-provisioning.
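As a rough illustration of the transition rules described above, the sketch below encodes only the events this runbook names. The `EVENTS` table and the `canFire` / `fire` helpers are hypothetical names for illustration, not the platform's AASM definition, and the source states listed for `mark_errored` are an assumption taken from the diagram.

```javascript
// The nine NodeInstance states listed in this runbook.
const STATES = [
  "pending", "provisioning", "starting", "running", "stopping",
  "stopped", "rebooting", "terminated", "error",
];

// Only the events this runbook names. The real AASM definition has more
// (for example the stop events behind running -> stopping -> stopped).
const EVENTS = {
  mark_provisioning: { from: ["pending"], to: "provisioning" },
  mark_running: { from: ["provisioning"], to: "running" },
  // Assumption: the diagram shows mark_errored firing from provisioning only.
  mark_errored: { from: ["provisioning"], to: "error" },
  terminate: { from: ["running", "stopped", "error"], to: "terminated" },
};

// True when the named event is allowed from the given state.
function canFire(state, event) {
  const def = EVENTS[event];
  return Boolean(def) && def.from.includes(state);
}

// Returns the target state, or throws when the transition is not allowed.
function fire(state, event) {
  if (!canFire(state, event)) {
    throw new Error(`${event} cannot fire from ${state}`);
  }
  return EVENTS[event].to;
}

console.log(canFire("provisioning", "terminate")); // false
console.log(fire("error", "terminate"));           // terminated
```

A guard like `canFire` can be useful in tooling that wants to warn an operator before issuing a terminate that the platform would reject.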

Phase 1 — Create Node ✅

The Node is a logical row. No provider resources are touched here.

platform.system_create_node({
  account_id: "<account-id>",                     // current account by default
  hostname: "edge-tokyo-01",                      // human-readable
  node_template_id: "tmpl-edge-base",             // composes assigned modules
  node_platform_id: "platform-ubuntu-2404-amd64", // disk image family
  node_architecture_id: "arch-amd64",             // kernel + boot binaries
  lifecycle_class: "persistent",                  // "persistent" | "ephemeral" | "spot"
  metadata: {                                     // optional; surfaces in dashboard
    "owner": "edge-team",
    "purpose": "tokyo-cdn-edge"
  }
})
// → { node: { id, hostname, status: "no_instance", ... } }

What to watch:

lifecycle_class is set at creation time (persistent for control-plane / database / SaaS tenant; ephemeral for batch / CI / replaceable workers; spot for provider-preemptible workloads). The model validates against the allowed set but does not enforce immutability after first provision — operators should still treat the class as fixed in practice, since changing it after instances exist would invalidate downstream allocation assumptions.
node_template_id determines which modules will be assigned at bootstrap. To reuse an existing fleet template, query first: platform.system_list_templates.
A Node with no NodeInstance is harmless — bookkeeping only.
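Because the model validates lifecycle_class against a fixed set, a client can reject a typo before sending the request. The sketch below is one way to do that; `buildCreateNodeParams` and its camelCase argument names are hypothetical, and only the three class values and the snake_case field names come from this runbook.

```javascript
// Allowed values as listed in this runbook.
const LIFECYCLE_CLASSES = ["persistent", "ephemeral", "spot"];

// Hypothetical helper: builds the system_create_node payload and fails fast
// on an unknown lifecycle_class instead of waiting for a server-side error.
function buildCreateNodeParams({
  hostname,
  templateId,
  platformId,
  architectureId,
  lifecycleClass,
  metadata = {},
}) {
  if (!LIFECYCLE_CLASSES.includes(lifecycleClass)) {
    throw new Error(
      `lifecycle_class must be one of: ${LIFECYCLE_CLASSES.join(", ")}`
    );
  }
  return {
    hostname,
    node_template_id: templateId,
    node_platform_id: platformId,
    node_architecture_id: architectureId,
    lifecycle_class: lifecycleClass,
    metadata,
  };
}
```

The returned object can be passed straight to `platform.system_create_node`; account_id is omitted here because the call defaults to the current account.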

Phase 2 — Provision NodeInstance ✅

platform.system_provision_instance({
  node_id: "<node-id>",
  provider_region_id: "region-aws-us-east-1",   // or "region-local-qemu"
  provider_instance_type_id: "type-t3-medium",  // or "type-qemu-2cpu-4gb"
  // Optional:
  spot: false,
  ssh_key_ids: ["<key-id>"]   // injected via fw-cfg metadata for break-glass
})
// → { instance: { id, status: "provisioning", task_id, ... } }

The platform creates a Task (status=pending), enqueues a worker job, and returns immediately. The worker runs the provider's provision_instance! adapter:

AWS / GCP / Azure / OpenStack: provider-specific API calls; takes 30 s – 5 min
LocalQemuProvider: libvirt domain creation with direct kernel boot from M3 artifacts; takes ~10-30 s in real mode, instant in recorder mode

Instance AASM transitions on the happy path: pending → provisioning → running (mark_provisioning, then mark_running on the first phase=ready heartbeat). If the provider call or boot fails, the worker drives mark_errored so the instance lands in error rather than dangling in provisioning — there is no orphaned-pending row left behind.

Idempotent retries. system_provision_instance accepts an operation_id. A retried call carrying the same operation_id reuses the existing instance row (tagged in config->>'operation_id') instead of creating a second VM — this dedups transient-error retries. A retry without an operation_id falls back to time-stamped naming and will create a distinct instance, so always thread the same operation_id through when retrying a failed provision.
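The retry rule above can be sketched as a small wrapper that sends the same operation_id on every attempt. This is a minimal sketch, not the platform's client: `provisionWithRetry` is a hypothetical helper, `platform` stands for any client object exposing `system_provision_instance` as a promise-returning function, and every error is treated as transient, which real code should not assume.

```javascript
// Retries a provision call with one fixed operation_id so the platform can
// reuse the existing instance row instead of booting a second VM.
async function provisionWithRetry(platform, params, operationId, maxAttempts = 3) {
  if (!operationId) {
    throw new Error("operation_id is required for safe retries");
  }
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      return await platform.system_provision_instance({
        ...params,
        operation_id: operationId, // identical on every attempt
      });
    } catch (err) {
      // Assumption: the failure is transient. Real code should inspect the
      // error and back off between attempts.
      lastError = err;
    }
  }
  throw lastError;
}
```

Generate the operation_id once, outside the wrapper, and persist it alongside the request so that a retry after a process restart still carries the same value.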

Verify provisioning:

platform.system_get_instance({ id: "<instance-id>" })
// → { instance: { status: "provisioning", task_id, last_heartbeat_at: null, ... } }

// To watch task progress, list current tasks for the instance and read the matching row:
platform.system_list_tasks({ resource_type: "system_node_instance", resource_id: "<instance-id>" })
// (system_get_task as a single-record fetch is in ASPIRATIONAL_MCP.md — use system_list_tasks filtered to the resource for now)
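For scripted checks, the verification call can be wrapped in a polling loop. The sketch below assumes `platform.system_get_instance` returns a promise resolving to the `{ instance: { status } }` shape shown above; `waitForRunning`, its option names, and the default intervals are hypothetical choices, with the 10-minute timeout borrowed from the upper bound in the quick reference.

```javascript
// Polls until the instance reaches running or error, or the deadline passes.
// `sleep` and `now` are injectable so the loop can be tested without waiting.
async function waitForRunning(platform, instanceId, {
  timeoutMs = 10 * 60 * 1000,
  intervalMs = 5000,
  sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  now = () => Date.now(),
} = {}) {
  const deadline = now() + timeoutMs;
  while (now() < deadline) {
    const { instance } = await platform.system_get_instance({ id: instanceId });
    if (instance.status === "running" || instance.status === "error") {
      return instance.status;
    }
    await sleep(intervalMs);
  }
  return "timeout";
}
```

A "timeout" result does not change anything on the platform side; it only tells the caller to move on to the stuck-state guidance.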

What to watch:

Provider quota: most providers throttle bulk provisioning. For >10 instances, use the provision_cluster skill (hard cap 50/call) or sequence calls with system_create_instance_pool (slice 7) for warm capacity.
MissingNetbootImageError: the platform-side disk image hasn't been published to OCI yet. Run system_list_disk_image_publications to confirm the publication exists with status=published.
LocalQemuProvider: ensure POWERNODE_LIBVIRT_MODE=real and POWERNODE_IMAGE_BASE points at extensions/system/initramfs/build. See SMOKE_TEST.md.

Phase 3 — Bootstrap ✅

The provider VM POSTs to runtime/handshake once the kernel boots:

Identity discovery — agent reads from cmdline / virtio-fw-cfg / cloud metadata; selects the appropriate IdentityStrategy
Enrollment — agent generates Ed25519 keypair, POSTs CSR to /api/v1/system/node_api/enroll with bootstrap token; receives signed mTLS cert
Module pull — agent fetches OCI artifacts for assigned modules from the registry.example.com registry; verifies cosign signatures + fs-verity digests
Mount union root — composefs lower layer + tmpfs (or /persist) overlay; pivot_root into composed userspace
Service start — systemctl start powernode-agent.service; agent posts phase=ready heartbeat

The platform marks the instance status=running after the first phase=ready POST.

What to watch:

Bootstrap timeline: ~90 s from kernel boot to phase=ready on warm cache; +30-60 s on first run when modules aren't cached. Slice 7 instance pools cut this to <30 s by pre-warming.
Stuck in bootstrapping: usually a module pull failure (signature verify, network, OCI 404). Check journalctl -u powernode-agent on the node, or platform.recent_events for the instance.
Bootstrap token rotation: tokens expire 24 h after issue. Re-provision if you see BootstrapTokenExpiredError.

Phase 4 — Run ✅

The instance heartbeats every 30 s. Per-tick:

Agent posts heartbeat (uptime, version, last reconcile result)
Platform refreshes last_heartbeat_at
instance_status_sensor runs every 60 s; fires instance.silent if no heartbeat in 5 min (default)
Module reconciler walks assigned modules; pulls + verifies + mounts updates if module versions changed
Task lease: agent claims any pending tasks for this instance via worker_api/tasks and runs them
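The silence rule in the list above reduces to a timestamp comparison. The sketch below shows that arithmetic on the client side; it is not the sensor's implementation. `isSilent` is a hypothetical helper, and treating a null last_heartbeat_at as "not silent" (still booting) is an assumption.

```javascript
// Default threshold from this runbook: no heartbeat in 5 minutes.
const SILENT_AFTER_MS = 5 * 60 * 1000;

// Returns true when the last heartbeat is older than the threshold.
function isSilent(lastHeartbeatAt, nowMs = Date.now(), thresholdMs = SILENT_AFTER_MS) {
  // Assumption: an instance that has never heartbeated is still booting,
  // so it is reported as not silent rather than as an error.
  if (lastHeartbeatAt === null || lastHeartbeatAt === undefined) {
    return false;
  }
  const last = Date.parse(lastHeartbeatAt);
  if (Number.isNaN(last)) {
    throw new Error(`unparseable timestamp: ${lastHeartbeatAt}`);
  }
  return nowMs - last > thresholdMs;
}
```

With 30 s heartbeats, the 5-minute default corresponds to roughly ten missed beats before the instance is flagged.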

Verify health:

platform.system_get_instance({ id: "<instance-id>" })
// → { instance: {
//      status: "running",
//      last_heartbeat_at: "2026-05-04T13:42:01Z",
//      running_module_digests: { "system-base": "sha256:abc...", ... },
//      ...
//    }}

platform.system_drift_report({ instance_id: "<instance-id>" })
// → { drift: false } or { drift: true, attach: [...], detach: [...], update: [...] }

If drift: true, the module_drift_sensor will emit module.drift_detected; Fleet Autonomy auto-runs drift_remediate (notify_and_proceed policy) on next tick.
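When surfacing a drift report to an operator, a one-line summary is often enough. The sketch below assumes the report shape shown in the comment above (`drift`, `attach`, `detach`, `update`); `summarizeDrift` is a hypothetical helper and the wording of its output is arbitrary.

```javascript
// Condenses a system_drift_report result into a single status line.
function summarizeDrift(report) {
  if (!report.drift) {
    return "no drift";
  }
  // Missing arrays are counted as empty rather than treated as an error.
  const count = (key) => (Array.isArray(report[key]) ? report[key].length : 0);
  return `drift: ${count("attach")} to attach, ${count("detach")} to detach, ${count("update")} to update`;
}

console.log(summarizeDrift({ drift: false })); // no drift
```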

Phase 5 — Drain (graceful) ⚠️

For persistent instances running workloads (Docker daemon, K3s server), prefer drain over hard terminate:

platform.system_drain_instance({
  id: "<instance-id>",
  timeout_seconds: 600,           // give workloads up to 10 min to relocate
  cordon_only: false              // false → also stop services after cordon
})
// → { instance_id, instance_name, drain_initiated_at, drain_timeout_seconds }
// (records drain markers + emits platform.resilience.drain_started; the
//  instance AASM moves running → stopping → stopped as services wind down —
//  there is no "draining" instance state)

Drain coordinates with Devops layer:

DockerHost: docker stop containers tagged --restart=always first; then daemon shutdown
KubernetesNode (k3s-agent): kubectl cordon + kubectl drain --ignore-daemonsets
k3s-server bootstrap node: triggers slice 3 VIP failover before stopping the API server

What to watch:

Pod relocation requires capacity on remaining nodes — drain can stall if cluster is at capacity. Add capacity first or accept partial drain.
Local-path PVCs don't migrate; pods using them go Pending. Plan stateful workload placement accordingly.
Single-server K3s clusters cannot drain the only server — kubectl loses access. Either add a second k3s-server first, or hard-terminate.
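To decide when a stalled drain should be escalated, the two timing fields in the drain response are sufficient. The sketch below assumes drain_initiated_at is an ISO-8601 string, which the runbook does not state; `drainStatus` is a hypothetical helper.

```javascript
// Computes whether the drain window has elapsed and how much time is left.
function drainStatus(drain, nowMs = Date.now()) {
  // Assumption: drain_initiated_at is an ISO-8601 timestamp string.
  const started = Date.parse(drain.drain_initiated_at);
  if (Number.isNaN(started)) {
    throw new Error(`unparseable drain_initiated_at: ${drain.drain_initiated_at}`);
  }
  const deadline = started + drain.drain_timeout_seconds * 1000;
  const remainingMs = deadline - nowMs;
  return {
    expired: remainingMs <= 0,
    remainingSeconds: Math.max(0, Math.ceil(remainingMs / 1000)),
  };
}
```

An expired window is only a signal to the operator; whether to add capacity or hard-terminate remains a judgment call, as the notes above describe.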

Phase 6 — Decommission ✅

platform.system_terminate_instance({ id: "<instance-id>" })
// → { instance_id, status: "terminated" }
// (single-step AASM transition running/stopped/error → terminated; there is
//  no intermediate "terminating" instance state)

Cascade actions (FK + service-level):

Provider VM destroyed via the same provider adapter that created it
Devops::DockerHost (if managed) destroyed; Vault TLS material revoked
Devops::KubernetesNode (if k3s-*) destroyed
Sdwan::Peer rows for this instance removed (slice 9 cleanup callback)
Sdwan::VirtualIp failover triggered if this instance was a holder (slice 3)
NodeCertificate rows revoked
BootstrapToken rows expired

The Node row remains by design — re-provisioning into the same logical Node preserves history and audit chain. Delete the Node explicitly via system_delete_node only if it's truly retired.

Per-state error recovery

Stuck states map to the 9 real AASM states. "bootstrapping" is the agent boot window inside provisioning; drain/terminate are operations (the instance moves running → stopping → stopped or → terminated), not states.

| Stuck in… | Likely cause | Recovery |
| --- | --- | --- |
| `pending` (>5 min) | Worker queue stalled or provider quota | Check `platform.recent_events` for `provider_quota_exceeded`; restart worker via `sudo systemctl restart powernode-worker@default`; retry with the same `operation_id` |
| `provisioning` (>10 min) | Provider API timeout, libvirt domain creation hung | `platform.system_cancel_task` the provision task; investigate provider. `terminate` does not fire from `provisioning` today; see "Clearing a stuck `provisioning` instance" below |
| `provisioning`, agent up but never `running` (>5 min after first heartbeat) | Module pull failure | SSH to node (if SDWAN attached) → `journalctl -u powernode-agent` shows the failed module + reason; common: cosign signature mismatch, OCI 404, network |
| `running` but no heartbeats >5 min | Network partition or agent crash | `platform.recent_events` for `instance.silent`; SSH or console-access via libvirt; manual restart of `powernode-agent.service` |
| `error` (terminal) | Provider/boot failure drove `mark_errored` | Inspect `platform.recent_events`; once the cause is understood, `system_terminate_instance` (allowed from `error`) to release provider resources, then re-provision with a fresh `operation_id` |
| Drain stalled (>30 min) | Pods can't reschedule (capacity) | Add capacity, or hard-terminate via `system_terminate_instance` |
| Terminate stalled (>5 min) | Provider VM teardown stuck | Check provider console; in the worst case `system_cancel_task` the teardown task and clean orphan rows manually |
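The time thresholds in the table can be kept in one place for alerting scripts. The sketch below copies the four elapsed-time limits from the table; `STUCK_AFTER_MINUTES` and `isStuck` are hypothetical names, and the heartbeat-based rows are left out because they depend on last_heartbeat_at rather than time in state.

```javascript
// Minutes after which each condition counts as stuck, per the table above.
const STUCK_AFTER_MINUTES = {
  pending: 5,
  provisioning: 10,
  drain: 30,     // an operation, not an AASM state
  terminate: 5,  // an operation, not an AASM state
};

// True when the elapsed time exceeds the documented threshold for `kind`.
function isStuck(kind, elapsedMinutes) {
  if (!Object.prototype.hasOwnProperty.call(STUCK_AFTER_MINUTES, kind)) {
    throw new Error(`no threshold documented for: ${kind}`);
  }
  return elapsedMinutes > STUCK_AFTER_MINUTES[kind];
}

console.log(isStuck("pending", 6));       // true
console.log(isStuck("provisioning", 10)); // false
```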

Clearing a stuck `provisioning` instance

Because the terminate event only fires from running, stopped, or error, an instance wedged in provisioning cannot be terminated directly. Recovery path:

system_cancel_task the provision task so the worker stops retrying.
Reconcile real provider state — for all cloud providers (AWS, GCP, Azure, OpenStack) and LocalQEMU/Proxmox, the provider's sync_status reconciles in-flight state against the provider API; a provider that reports the VM as gone maps to terminated, which stops the controller's in-flight wait loop.
If the VM genuinely failed to come up, the worker's mark_errored lands the instance in error; from there system_terminate_instance releases any provider resources and the row is cleaned up cascade-style.
Re-provision with the same operation_id to reuse the row, or a fresh one to start clean.
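The recovery path above can be read as a small decision function: given what is currently known about the instance, which step comes next. The sketch below is illustrative only. `nextRecoveryStep` and the returned labels "wait_for_sync_status" and "reprovision" are hypothetical, and the function deliberately performs no platform calls.

```javascript
// Maps the current instance status to the next step of the recovery path.
function nextRecoveryStep({ status, provisionTaskCancelled }) {
  switch (status) {
    case "provisioning":
      // Cancel first so the worker stops retrying; then let the provider's
      // sync_status reconcile the row against real provider state.
      return provisionTaskCancelled ? "wait_for_sync_status" : "system_cancel_task";
    case "error":
      // terminate is allowed from error and releases provider resources.
      return "system_terminate_instance";
    case "terminated":
      return "reprovision";
    default:
      return "not_applicable";
  }
}
```

Keeping the planner free of side effects makes it easy to log or confirm each step with an operator before anything destructive is invoked.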

A proposed remediation (#7) would let terminate fire directly from provisioning/starting to simplify this path; it is not shipped yet, so use the cancel→sync→error→terminate sequence above today.

For all stuck states, use the attribute_failure skill (bound to the System Concierge) to enumerate recent module/version changes that may have caused the failure. The skill is invoked through the System Concierge chat agent (operator describes the failure; Concierge calls the executor internally and returns the analysis):

// Find the Concierge agent and invoke it with a natural-language ask:
platform.list_agents({ name_contains: "Concierge" })
// → { agents: [{ id: "<concierge-uuid>", name: "System Concierge", ... }] }

platform.execute_agent({
  agent_id: "<concierge-uuid>",
  prompt: "Attribute the recent failure on instance <instance-id> looking back 24 hours; surface the top candidate module/version change and confidence."
})
// The Concierge calls the system-attribute-failure executor internally and returns:
// → { candidates: [...], top_candidate: {...}, confidence: "medium", reasoning: "..." }

There is no direct execute_skill MCP action — skills are executor-shape, invoked by their owning agent. Operators interact with skills by talking to the agent that binds them.

LocalQemuProvider variant (smoke / dev)

For offline development or CI smoke tests, use the LocalQemuProvider:

# Prerequisites: libvirt, dracut, qemu-bridge-helper, Go toolchain
# Build M3 artifacts first:
cd extensions/system/initramfs && ./build.sh

# Run the smoke seed (provisions one NodeInstance to multi-user.target):
cd server && \
  POWERNODE_LIBVIRT_MODE=real \
  POWERNODE_IMAGE_BASE=../extensions/system/initramfs/build \
  bundle exec rails runner \
    "load Rails.root.join('../extensions/system/server/db/seeds/smoke_test_provision.rb')"

The seed creates: 1 Account → 1 Node (lifecycle_class=persistent) → 1 NodeInstance via LocalQemuProvider, watches the AASM Task progression, and reports the kernel boot pipeline through to multi-user.target. Total runtime: ~15 min on cold boot (TCG without /dev/kvm); ~3 min with KVM.

LocalQemuProvider modes:

real — actual libvirt domain creation + QEMU/KVM boot (default for smoke)
recorder — records what the libvirt adapter would do (fast; no VM)
disabled — skips provider entirely; useful for unit tests

Switch via POWERNODE_LIBVIRT_MODE=real|recorder|disabled.
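A wrapper script can check the mode before launching the seed, so a typo fails immediately instead of partway through a long boot. The sketch below is a pre-flight check only and is not part of the platform; `readLibvirtMode` is a hypothetical helper, and it requires the variable to be set explicitly because this runbook does not state what happens when it is unset.

```javascript
// Modes documented in this runbook.
const LIBVIRT_MODES = ["real", "recorder", "disabled"];

// Reads and validates POWERNODE_LIBVIRT_MODE from an environment object.
function readLibvirtMode(env = process.env) {
  const mode = env.POWERNODE_LIBVIRT_MODE;
  if (!LIBVIRT_MODES.includes(mode)) {
    throw new Error(
      `POWERNODE_LIBVIRT_MODE must be one of ${LIBVIRT_MODES.join("|")}, got: ${mode}`
    );
  }
  return mode;
}

console.log(readLibvirtMode({ POWERNODE_LIBVIRT_MODE: "recorder" })); // recorder
```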

How the System Concierge should use this

When an operator chats "I want to add a node" / "provision a new instance" / "decommission edge-tokyo-01":

Identify the requested phase (create / provision / drain / decommission)
Surface the relevant MCP action(s) + required inputs (template, region, instance type)
For destructive actions (drain, terminate), use the request_confirmation skill before invoking
After invoking, watch the Task AASM transitions and report status changes back to the operator
If status hangs, surface the "Per-state error recovery" guidance for the relevant stuck state

The Concierge has 4 read-shape skills useful here: capacity_recommend (for "do I need more nodes?"), attribute_failure (for "why did instance X fail?"), runbook_generate (for template-specific runbooks), cve_runbook_generate (when provisioning is blocked by a CVE).

Related docs

USE_CASE_MATRIX.md — 10 NodeInstance use cases with status badges
CONTAINER_RUNTIMES.md — Phase 1 Docker + Phase 2 K3s lifecycle (depends on this runbook for instance provisioning)
runbooks/sdwan-network-setup.md — attach SDWAN peer (required for managed runtimes)
runbooks/instance-pool-tuning.md — pre-warmed pools (slice 7) for ephemeral workloads
SMOKE_TEST.md — LocalQemuProvider smoke test setup
db/seeds/smoke_test_provision.rb — canonical provisioning seed

Last verified: 2026-06-03
