Status: active
Step-by-step operator guide for the full Node + NodeInstance lifecycle: from "I have a Template" through "the instance is decommissioned and rows cleaned up." Includes per-state error recovery and the LocalQemuProvider variant for offline / smoke-test environments.
Audience: external operators (open-source consumers), internal Powernode operators, on-call SREs handling stuck instances.
| Phase | What happens | Typical duration | MCP entry point |
|---|---|---|---|
| 1. Create Node | Logical row representing a future-or-existing host | <1 s | system_create_node |
| 2. Provision instance | Provider boots a VM with the netboot image | 30 s – 10 min | system_provision_instance |
| 3. Bootstrap | Agent installs, mTLS handshake, module reconcile | ~90 s on warm cache; +30-60 s on first run (5-10 min on slow providers) | none (agent-driven) |
| 4. Run | Heartbeats, reconcile loop, task lease | indefinite | system_get_instance |
| 5. Drain | Workloads relocated, services stopped | 1-30 min | system_drain_instance |
| 6. Decommission | Provider VM destroyed, FK cascades fire | <1 min | system_terminate_instance |
The NodeInstance AASM has 9 states: pending, provisioning, starting,
running, stopping, stopped, rebooting, terminated, error. There is
no draining state (drain is an operation that ultimately drives the
instance running → stopping → stopped; draining is a pool_state on pooled
instances, not an instance AASM state) and no failed state — the terminal
failure state is error. The "bootstrapping" box below is the agent-driven boot
window within the provisioning state, not a separate AASM state.
┌──────────────┐
│ (no instance)│  Node row exists; no provider VM yet
└──────┬───────┘
       │ system_provision_instance
       ▼
┌──────────────┐  AASM: pending → mark_provisioning → provisioning
│ provisioning │  Provider creates VM; netboot fetches kernel + initramfs
└──────┬───────┘
       │ (agent boot window: enroll CSR → mTLS,
       │  modules pulled, fs-verity verified, composefs mounted)
       │                                    ╲ provider error / boot failure
       │ first phase=ready heartbeat         ╲ → mark_errored
       ▼                                      ▼
┌──────────────┐  AASM: mark_running     ┌──────────┐
│   running    │  heartbeats every 30s   │  error   │  terminal failure
└──────┬───────┘  task lease ready       └────┬─────┘  (no orphaned row)
       │                                      │ terminate
       │ system_drain_instance (graceful):    │ (allowed from error)
       │ stop/cordon → AASM stopping→stopped  │
       │ -or- system_terminate_instance (hard)│
       ▼                                      ▼
┌──────────────────────────────────────────────────┐
│                    terminated                    │
│ Cascade FKs: Devops::DockerHost / KubernetesNode │
│ cleanup, Vault TLS revoked, Sdwan::Peer removed  │
└──────────────────────────────────────────────────┘
The terminate event currently transitions only from running, stopped, and error, not directly from provisioning or starting. See Per-state error recovery for how to clear an instance still mid-provisioning.
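The terminate guard described above can be sketched as a small lookup. This is only an illustration of the prose: INSTANCE_STATES, TERMINATE_FROM, and canTerminate are hypothetical names, and the real AASM definition in the NodeInstance model may be structured differently.

```javascript
// Illustrative only: mirrors the nine states and the terminate guard described
// in this runbook. All identifiers here are hypothetical.
const INSTANCE_STATES = [
  "pending", "provisioning", "starting", "running", "stopping",
  "stopped", "rebooting", "terminated", "error",
];

// States from which the terminate event currently fires.
const TERMINATE_FROM = new Set(["running", "stopped", "error"]);

function canTerminate(state) {
  if (!INSTANCE_STATES.includes(state)) {
    throw new Error(`unknown instance state: ${state}`);
  }
  return TERMINATE_FROM.has(state);
}

console.log(canTerminate("error"));        // true
console.log(canTerminate("provisioning")); // false
```

A check like this is useful before calling system_terminate_instance from automation, so that an instance still in provisioning is routed to the recovery path instead.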
The Node is a logical row. No provider resources are touched here.
platform.system_create_node({
account_id: "<account-id>", // current account by default
hostname: "edge-tokyo-01", // human-readable
node_template_id: "tmpl-edge-base", // composes assigned modules
node_platform_id: "platform-ubuntu-2404-amd64", // disk image family
node_architecture_id: "arch-amd64", // kernel + boot binaries
lifecycle_class: "persistent", // "persistent" | "ephemeral" | "spot"
metadata: { // optional; surfaces in dashboard
"owner": "edge-team",
"purpose": "tokyo-cdn-edge"
}
})
// → { node: { id, hostname, status: "no_instance", ... } }
What to watch:
- lifecycle_class is set at creation time (persistent for control-plane / database / SaaS tenant; ephemeral for batch / CI / replaceable workers; spot for provider-preemptible workloads). The model validates against the allowed set but does not enforce immutability after first provision; operators should still treat the class as fixed in practice, since changing it after instances exist would invalidate downstream allocation assumptions.
- node_template_id determines which modules will be assigned at bootstrap. To reuse an existing fleet template, query first: platform.system_list_templates.
- A Node with no NodeInstance is harmless: it is bookkeeping only.
platform.system_provision_instance({
node_id: "<node-id>",
provider_region_id: "region-aws-us-east-1", // or "region-local-qemu"
provider_instance_type_id: "type-t3-medium", // or "type-qemu-2cpu-4gb"
// Optional:
spot: false,
ssh_key_ids: ["<key-id>"] // injected via fw-cfg metadata for break-glass
})
// → { instance: { id, status: "provisioning", task_id, ... } }
The platform creates a Task (status=pending), enqueues a worker job, and returns immediately. The worker runs the provider's provision_instance! adapter:
- AWS / GCP / Azure / OpenStack: provider-specific API calls; takes 30 s – 5 min
- LocalQemuProvider: libvirt domain creation with direct kernel boot from M3 artifacts; takes ~10-30 s in real mode, instant in recorder mode (per project_local_qemu_provider memory)
Instance AASM transitions on the happy path: pending → provisioning → running
(mark_provisioning, then mark_running on the first phase=ready heartbeat).
If the provider call or boot fails, the worker drives mark_errored so the
instance lands in error rather than dangling in provisioning — there is no
orphaned-pending row left behind.
Idempotent retries. system_provision_instance accepts an operation_id.
A retried call carrying the same operation_id reuses the existing instance
row (tagged in config->>'operation_id') instead of creating a second VM —
this dedups transient-error retries. A retry without an operation_id falls
back to time-stamped naming and will create a distinct instance, so always
thread the same operation_id through when retrying a failed provision.
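A minimal sketch of that retry discipline, assuming the caller injects a function that wraps the system_provision_instance call. The helper name provisionWithRetry, the maxAttempts parameter, and the absence of backoff are all assumptions made for illustration, not platform behavior.

```javascript
// Sketch: retry a provision call while threading ONE operation_id through
// every attempt. `provision` is an injected function standing in for the
// system_provision_instance call; provisionWithRetry is a hypothetical helper.
async function provisionWithRetry(provision, params, operationId, maxAttempts = 3) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // Same operation_id on every attempt, so a retry reuses the existing
      // instance row instead of creating a second VM.
      return await provision({ ...params, operation_id: operationId });
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError; // all attempts failed: surface the last error
}
```

Generating the operation_id once, outside the retry loop, is the point of the sketch: an ID minted inside the loop would defeat the dedup and could leave several VMs behind.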
Verify provisioning:
platform.system_get_instance({ id: "<instance-id>" })
// → { instance: { status: "provisioning", task_id, last_heartbeat_at: null, ... } }
// To watch task progress, list current tasks for the instance and read the matching row:
platform.system_list_tasks({ resource_type: "system_node_instance", resource_id: "<instance-id>" })
// (system_get_task as a single-record fetch is in ASPIRATIONAL_MCP.md; use system_list_tasks filtered to the resource for now)
What to watch:
- Provider quota: most providers throttle bulk provisioning. For >10 instances, use the provision_cluster skill (hard cap 50/call) or sequence calls with system_create_instance_pool (slice 7) for warm capacity.
- MissingNetbootImageError: the platform-side disk image hasn't been published to OCI yet. Run system_list_disk_image_publications to confirm the publication exists with status=published.
- LocalQemuProvider: ensure POWERNODE_LIBVIRT_MODE=real and POWERNODE_IMAGE_BASE points at extensions/system/initramfs/build. See SMOKE_TEST.md.
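When a bulk request exceeds the per-call cap, the node list has to be split before it is handed to the skill. The sketch below shows one way to do that; chunkForProvisioning is a hypothetical helper, and only the cap of 50 comes from the text above.

```javascript
// Sketch: split node IDs into batches no larger than the per-call cap.
// chunkForProvisioning is hypothetical; 50 is the hard cap quoted above.
function chunkForProvisioning(nodeIds, cap = 50) {
  if (!Number.isInteger(cap) || cap < 1) {
    throw new RangeError("cap must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < nodeIds.length; i += cap) {
    batches.push(nodeIds.slice(i, i + cap));
  }
  return batches;
}

const ids = Array.from({ length: 120 }, (_, i) => `node-${i}`);
console.log(chunkForProvisioning(ids).map((b) => b.length)); // [ 50, 50, 20 ]
```

Submitting the batches one after another, rather than in parallel, is the more conservative choice given provider throttling.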
The provider VM POSTs to runtime/handshake once the kernel boots:
- Identity discovery: agent reads from cmdline / virtio-fw-cfg / cloud metadata; selects the appropriate IdentityStrategy
- Enrollment: agent generates Ed25519 keypair, POSTs CSR to /api/v1/system/node_api/enroll with bootstrap token; receives signed mTLS cert
- Module pull: agent fetches OCI artifacts for assigned modules from the registry.example.com registry; verifies cosign signatures + fs-verity digests
- Mount union root: composefs lower layer + tmpfs (or /persist) overlay; pivot_root into composed userspace
- Service start: systemctl start powernode-agent.service; agent posts phase=ready heartbeat
The platform marks the instance status=running after the first phase=ready POST.
What to watch:
- Bootstrap timeline: ~90 s from kernel boot to phase=ready on warm cache; +30-60 s on first run when modules aren't cached. Slice 7 instance pools cut this to <30 s by pre-warming.
- Stuck in bootstrapping: usually a module pull failure (signature verify, network, OCI 404). Check journalctl -u powernode-agent on the node, or platform.recent_events for the instance.
- Bootstrap token rotation: tokens expire 24 h after issue. Re-provision if you see BootstrapTokenExpiredError.
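Because bootstrap is agent-driven, an operator script can only observe it by polling. The following is a sketch of such a wait loop, assuming an injected getInstance function that wraps platform.system_get_instance. The helper name waitForRunning and the intervalMs / timeoutMs defaults are hypothetical.

```javascript
// Sketch: poll an instance until it is running, lands in error, or times out.
// getInstance is injected and stands in for platform.system_get_instance;
// waitForRunning and its option names are hypothetical.
async function waitForRunning(getInstance, id, { intervalMs = 5000, timeoutMs = 600000 } = {}) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    const { instance } = await getInstance({ id });
    if (instance.status === "running") return instance;
    if (instance.status === "error") {
      // error is the terminal failure state, so further polling is pointless.
      throw new Error(`instance ${id} entered error during bootstrap`);
    }
    if (Date.now() >= deadline) {
      throw new Error(`instance ${id} not running after ${timeoutMs} ms (last status: ${instance.status})`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The default timeout of 10 minutes is chosen to match the upper bound in the phase table; slow providers may need more.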
The instance heartbeats every 30 s. Per-tick:
- Agent posts heartbeat (uptime, version, last reconcile result)
- Platform refreshes last_heartbeat_at
- instance_status_sensor runs every 60 s; fires instance.silent if no heartbeat in 5 min (default)
- Module reconciler walks assigned modules; pulls + verifies + mounts updates if module versions changed
- Task lease: agent claims any pending tasks for this instance via worker_api/tasks and runs them
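The silence rule in the list above reduces to a timestamp comparison. The sketch below is not the sensor's actual code; isSilent is a hypothetical helper, and treating a never-heartbeated instance (last_heartbeat_at of null) as not silent is an assumption.

```javascript
// Sketch of the silence rule: no heartbeat within the threshold means silent.
// isSilent is hypothetical; 5 min is the default threshold stated above.
const DEFAULT_SILENT_AFTER_MS = 5 * 60 * 1000;

function isSilent(lastHeartbeatAt, now = new Date(), thresholdMs = DEFAULT_SILENT_AFTER_MS) {
  // Assumption: an instance that has never sent a heartbeat is still booting,
  // so it is not reported as silent here.
  if (lastHeartbeatAt === null) return false;
  const last = new Date(lastHeartbeatAt).getTime();
  if (Number.isNaN(last)) throw new TypeError("invalid last_heartbeat_at");
  return now.getTime() - last > thresholdMs;
}

const last = "2026-05-04T13:42:01Z";
console.log(isSilent(last, new Date("2026-05-04T13:46:00Z"))); // false
console.log(isSilent(last, new Date("2026-05-04T13:48:00Z"))); // true
```

Since the sensor runs every 60 s, an instance may be detected as silent up to a minute after it actually crosses the threshold.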
Verify health:
platform.system_get_instance({ id: "<instance-id>" })
// → { instance: {
// status: "running",
// last_heartbeat_at: "2026-05-04T13:42:01Z",
// running_module_digests: { "system-base": "sha256:abc...", ... },
// ...
// }}
platform.system_drift_report({ instance_id: "<instance-id>" })
// → { drift: false } or { drift: true, attach: [...], detach: [...], update: [...] }
If drift: true, the module_drift_sensor will emit module.drift_detected; Fleet Autonomy auto-runs drift_remediate (notify_and_proceed policy) on next tick.
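For dashboards or chat replies, the drift report can be condensed into one line. This sketch assumes only the response shape shown above; summarizeDrift is a hypothetical helper.

```javascript
// Sketch: condense a drift report of the shape shown above into one line.
// summarizeDrift is a hypothetical helper, not a platform API.
function summarizeDrift(report) {
  if (!report.drift) return "no drift";
  // Treat a missing list as empty rather than failing.
  const count = (key) => (Array.isArray(report[key]) ? report[key].length : 0);
  return `drift: ${count("attach")} to attach, ${count("detach")} to detach, ${count("update")} to update`;
}

console.log(summarizeDrift({ drift: false }));
// no drift
console.log(summarizeDrift({ drift: true, attach: ["a"], detach: [], update: ["b", "c"] }));
// drift: 1 to attach, 0 to detach, 2 to update
```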
For persistent instances running workloads (Docker daemon, K3s server), prefer drain over hard terminate:
platform.system_drain_instance({
id: "<instance-id>",
timeout_seconds: 600, // give workloads up to 10 min to relocate
cordon_only: false // false → also stop services after cordon
})
// → { instance_id, instance_name, drain_initiated_at, drain_timeout_seconds }
// (records drain markers + emits platform.resilience.drain_started; the
// instance AASM moves running → stopping → stopped as services wind down —
// there is no "draining" instance state)
Drain coordinates with the Devops layer:
- DockerHost: docker stop containers tagged --restart=always first; then daemon shutdown
- KubernetesNode (k3s-agent): kubectl cordon + kubectl drain --ignore-daemonsets
- k3s-server bootstrap node: triggers slice 3 VIP failover before stopping the API server
What to watch:
- Pod relocation requires capacity on remaining nodes — drain can stall if cluster is at capacity. Add capacity first or accept partial drain.
- Local-path PVCs don't migrate; pods using them go pending. Plan stateful workload placement accordingly.
- Single-server K3s clusters cannot drain the only server, because kubectl loses access. Either add a second k3s-server first, or hard-terminate.
platform.system_terminate_instance({ id: "<instance-id>" })
// → { instance_id, status: "terminated" }
// (single-step AASM transition running/stopped/error → terminated; there is
// no intermediate "terminating" instance state)
Cascade actions (FK + service-level):
- Provider VM destroyed via the same provider adapter that created it
- Devops::DockerHost (if managed) destroyed; Vault TLS material revoked
- Devops::KubernetesNode (if k3s-*) destroyed
- Sdwan::Peer rows for this instance removed (slice 9 cleanup callback)
- Sdwan::VirtualIp failover triggered if this instance was a holder (slice 3)
- NodeCertificate rows revoked
- BootstrapToken rows expired
The Node row remains by design — re-provisioning into the same logical Node preserves history and audit chain. Delete the Node explicitly via system_delete_node only if it's truly retired.
Stuck states map to the 9 real AASM states. "bootstrapping" is the agent
boot window inside provisioning; drain/terminate are operations (the instance
moves running → stopping → stopped or → terminated), not states.
| Stuck in… | Likely cause | Recovery |
|---|---|---|
| pending (>5 min) | Worker queue stalled or provider quota | Check platform.recent_events for provider_quota_exceeded; restart worker via sudo systemctl restart powernode-worker@default; retry with the same operation_id |
| provisioning (>10 min) | Provider API timeout, libvirt domain creation hung | platform.system_cancel_task the provision task; investigate provider. terminate does not fire from provisioning today; see "Clearing a stuck provisioning instance" below |
| provisioning, agent up but never running (>5 min after first heartbeat) | Module pull failure | SSH to node (if SDWAN attached) → journalctl -u powernode-agent shows the failed module + reason; common: cosign signature mismatch, OCI 404, network |
| running but no heartbeats >5 min | Network partition or agent crash | platform.recent_events for instance.silent; SSH or console-access via libvirt; manual restart of powernode-agent.service |
| error (terminal) | Provider/boot failure drove mark_errored | Inspect platform.recent_events; once the cause is understood, system_terminate_instance (allowed from error) to release provider resources, then re-provision with a fresh operation_id |
| Drain stalled (>30 min) | Pods can't reschedule (capacity) | Add capacity, or hard-terminate via system_terminate_instance |
| Terminate stalled (>5 min) | Provider VM teardown stuck | Check provider console; in the worst case system_cancel_task the teardown task and clean orphan rows manually |
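The thresholds in the table can be encoded so that a monitoring script flags stuck instances consistently. This is a sketch only: the condition keys and the isStuck helper are hypothetical, and the numbers are copied from the table rather than read from the platform.

```javascript
// Thresholds in minutes, copied from the recovery table. The keys and the
// isStuck helper are hypothetical and only index that table.
const STUCK_AFTER_MIN = new Map([
  ["pending", 5],
  ["provisioning", 10],
  ["running_no_heartbeat", 5],
  ["drain", 30],
  ["terminate", 5],
]);

function isStuck(condition, elapsedMin) {
  if (!STUCK_AFTER_MIN.has(condition)) {
    throw new Error(`no threshold for: ${condition}`);
  }
  return elapsedMin > STUCK_AFTER_MIN.get(condition);
}

console.log(isStuck("provisioning", 12)); // true
console.log(isStuck("drain", 12));        // false
```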
Because the terminate event only fires from running, stopped, or error,
an instance wedged in provisioning cannot be terminated directly. Recovery
path:
- system_cancel_task the provision task so the worker stops retrying.
- Reconcile real provider state: for all cloud providers (AWS, GCP, Azure, OpenStack) and LocalQEMU/Proxmox, the provider's sync_status reconciles in-flight state against the provider API; a provider that reports the VM as gone maps to terminated, which stops the controller's in-flight wait loop.
- If the VM genuinely failed to come up, the worker's mark_errored lands the instance in error; from there system_terminate_instance releases any provider resources and the row is cleaned up cascade-style.
- Re-provision with the same operation_id to reuse the row, or a fresh one to start clean.
A proposed remediation (#7) would let terminate fire directly from provisioning/starting to simplify this path; it is not shipped yet, so use the cancel→sync→error→terminate sequence above today.
For all stuck states, use the attribute_failure skill (bound to the System Concierge) to enumerate recent module/version changes that may have caused the failure. The skill is invoked through the System Concierge chat agent (operator describes the failure; Concierge calls the executor internally and returns the analysis):
// Find the Concierge agent and invoke it with a natural-language ask:
platform.list_agents({ name_contains: "Concierge" })
// → { agents: [{ id: "<concierge-uuid>", name: "System Concierge", ... }] }
platform.execute_agent({
agent_id: "<concierge-uuid>",
prompt: "Attribute the recent failure on instance <instance-id> looking back 24 hours; surface the top candidate module/version change and confidence."
})
// The Concierge calls the system-attribute-failure executor internally and returns:
// → { candidates: [...], top_candidate: {...}, confidence: "medium", reasoning: "..." }
There is no direct execute_skill MCP action: skills are executor-shape, invoked by their owning agent. Operators interact with skills by talking to the agent that binds them.
For offline development or CI smoke tests, use the LocalQemuProvider:
# Prerequisites: libvirt, dracut, qemu-bridge-helper, Go toolchain
# Build M3 artifacts first:
cd extensions/system/initramfs && ./build.sh
# Run the smoke seed (provisions one NodeInstance to multi-user.target):
cd server && \
POWERNODE_LIBVIRT_MODE=real \
POWERNODE_IMAGE_BASE=../extensions/system/initramfs/build \
bundle exec rails runner \
"load Rails.root.join('../extensions/system/server/db/seeds/smoke_test_provision.rb')"
The seed creates: 1 Account → 1 Node (lifecycle_class=persistent) → 1 NodeInstance via LocalQemuProvider, watches the AASM Task progression, and reports the kernel boot pipeline through to multi-user.target. Total runtime: ~15 min on cold boot (TCG without /dev/kvm); ~3 min with KVM.
LocalQemuProvider modes:
- real: actual libvirt domain creation + QEMU/KVM boot (default for smoke)
- recorder: records what the libvirt adapter would do (fast; no VM)
- disabled: skips provider entirely; useful for unit tests
Switch via POWERNODE_LIBVIRT_MODE=real|recorder|disabled.
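A wrapper script can validate the variable before launching the seed, which catches typos early. The sketch below is a preflight check only, not the platform's own parsing; parseLibvirtMode is a hypothetical helper, and rejecting an unset variable is an assumption, since the platform's behavior in that case is not stated here.

```javascript
// Sketch: preflight validation of POWERNODE_LIBVIRT_MODE in a wrapper script.
// parseLibvirtMode is hypothetical; the three modes come from the list above.
const LIBVIRT_MODES = ["real", "recorder", "disabled"];

function parseLibvirtMode(env) {
  const mode = env.POWERNODE_LIBVIRT_MODE;
  // Assumption: an unset or unknown value is rejected rather than defaulted.
  if (!LIBVIRT_MODES.includes(mode)) {
    throw new Error(
      `POWERNODE_LIBVIRT_MODE must be one of ${LIBVIRT_MODES.join("|")}, got: ${String(mode)}`
    );
  }
  return mode;
}

console.log(parseLibvirtMode({ POWERNODE_LIBVIRT_MODE: "recorder" })); // recorder
```

In a Node script the argument would typically be process.env; passing the environment in explicitly keeps the check easy to test.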
When an operator chats "I want to add a node" / "provision a new instance" / "decommission edge-tokyo-01":
- Identify the requested phase (create / provision / drain / decommission)
- Surface the relevant MCP action(s) + required inputs (template, region, instance type)
- For destructive actions (drain, terminate), use the request_confirmation skill before invoking
- After invoking, watch the Task AASM transitions and report status changes back to the operator
- If status hangs, surface the "Per-state error recovery" guidance for the relevant stuck state
The Concierge has 4 read-shape skills useful here: capacity_recommend (for "do I need more nodes?"), attribute_failure (for "why did instance X fail?"), runbook_generate (for template-specific runbooks), cve_runbook_generate (when provisioning is blocked by a CVE).
- USE_CASE_MATRIX.md: 10 NodeInstance use cases with status badges
- CONTAINER_RUNTIMES.md: Phase 1 Docker + Phase 2 K3s lifecycle (depends on this runbook for instance provisioning)
- runbooks/sdwan-network-setup.md: attach SDWAN peer (required for managed runtimes)
- runbooks/instance-pool-tuning.md: pre-warmed pools (slice 7) for ephemeral workloads
- SMOKE_TEST.md: LocalQemuProvider smoke test setup
- db/seeds/smoke_test_provision.rb: canonical provisioning seed
Last verified: 2026-06-03