Operator Runbooks

Status: active

Step-by-step procedures for production operations on the System extension. Each runbook covers one workflow an operator might run. For the broader learning sequence, see ../tutorials/.

Index

| Runbook | Audience | Prerequisites | Runtime |
| --- | --- | --- | --- |
| acme-issuance.md | SREs, security operators | DNS provider token, Vault transit | ~10–30 min per cert |
| acme-smoke.md | SREs validating ACME, release gate operators | Cloudflare token, test domain, powernode-hub ready | ~30 min |
| cve-response.md | Security operators, on-call SREs | Fleet with SBOM-ingested modules, system.cve_remediate approval | ~1–4 hours per CVE |
| disk-image-ci.md | Platform engineers, CI maintainers | Gitea runner, Vault credentials for OCI registry | ~30 min setup + per-build runtime |
| docker-compose-cutover.md | Platform operators migrating from legacy compose stacks | Existing compose deployment, SDWAN network defined | ~1–3 days (planned downtime) |
| expose-service.md | SREs publishing public services, network operators | SDWAN network + publicly-reachable hub peer, free VIP CIDR, Cloudflare ACME DNS credential (https) | ~10–20 min per service |
| federation-setup.md | Multi-region / multi-account operators | Two reachable platforms, partner trust agreement | ~30 min per pairing |
| federation-troubleshooting.md | Operators triaging federation failures | Established federation peer in degraded state | ~5–60 min depending on cause |
| fleet-imaging-claim-by-id.md | Operators provisioning physical fleets (SD/USB/NVMe) | system.instances.create+read, published generic image for the arch | ~5 min/device after one image flash |
| gitops-reconciliation.md | SREs adopting GitOps, multi-engineer teams | Git remote (Gitea / GitHub), Vault SSH credential | ~30 min initial setup |
| instance-pool-tuning.md | ML engineers, batch operators, CI platform owners | Provider quota for pool members | ~30 min initial sizing |
| k3s-smoke-full-lifecycle.md | System operators validating the K3s + SDWAN surface before a release / post-incident | Local platform running, local_qemu provider, seeded k3s modules | varies by tier (db / single / site / full) |
| module-authoring.md | Module authors, platform contributors | Gitea repo + cosign + oras CLIs | ~45 min per new module |
| multi-cluster-k3s.md | Kubernetes-focused operators | Multiple NodeInstances + SDWAN | ~1 hour per cluster |
| node-provisioning.md | New operators, on-call SREs | Provider connection configured | ~5–15 min per node |
| sdwan-network-setup.md | Network engineers, multi-tenant operators | At least one NodeInstance with publicly-reachable address | ~30 min |
| vault-credential-restoration.md | Security operators handling Vault DR | Vault snapshot, Shamir unseal keys | ~30 min – 2 hours |
| vendored-binary-bump.md | Platform maintainers updating Traefik / rpi4-firmware / dracut / kernel pins | Clean working tree; for ARM-only items, Pi 4 or QEMU-aarch64 | 15–60 min per bump |

When to read which

| If you're… | Start with |
| --- | --- |
| New to the extension | ../tutorials/01-first-boot.md, then specific runbooks |
| Provisioning a new node | node-provisioning.md |
| Imaging a fleet of physical devices | fleet-imaging-claim-by-id.md |
| Setting up SDWAN | sdwan-network-setup.md |
| Publishing a service publicly with TLS | expose-service.md (after sdwan-network-setup.md) |
| Authoring a module | module-authoring.md, plus disk-image-ci.md (if base image too) |
| Responding to a security CVE | cve-response.md |
| Building federation | federation-setup.md, plus federation-troubleshooting.md when stuck |
| Adopting GitOps | gitops-reconciliation.md |
| Managing TLS certs | acme-issuance.md for day-2, acme-smoke.md for release gates |
| Validating K3s + SDWAN before a release | k3s-smoke-full-lifecycle.md |
| Recovering Vault | vault-credential-restoration.md |

Authoring conventions

When writing a new runbook:

  1. Lead with audience + prerequisites — readers should know in 30 seconds whether this is for them
  2. Numbered steps with code blocks; copy-pasteable beats prose
  3. Expected outcome lines after each side-effecting step
  4. Failure mode section — list 5–10 common errors with diagnosis + remediation
  5. Cross-references at the end — link to tutorials, design docs, and sibling runbooks
  6. Add a row to this index when shipping
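
Convention 6 is easy to forget, so a small check can help. The following is a minimal sketch, not part of the extension's tooling: it assumes this index is stored as `README.md` in the same directory as the runbooks, and the function name `missing_index_rows` is hypothetical. It lists runbook files that the index never mentions.

```python
from pathlib import Path

# Assumption: the index file is named README.md and sits beside the runbooks.
INDEX_NAME = "README.md"


def missing_index_rows(runbook_dir: Path, index_name: str = INDEX_NAME) -> list[str]:
    """Return runbook filenames that do not appear anywhere in the index text."""
    index_text = (runbook_dir / index_name).read_text(encoding="utf-8")
    # Every Markdown file in the directory except the index counts as a runbook.
    runbooks = sorted(
        path.name for path in runbook_dir.glob("*.md") if path.name != index_name
    )
    # Plain substring match: a filename counts as indexed if it occurs in the text.
    return [name for name in runbooks if name not in index_text]
```

Calling `missing_index_rows(Path("."))` from the runbook directory returns an empty list when every runbook has a row. Because the match is a plain substring test, a filename that is a suffix of another (for example a hypothetical `setup.md` next to `federation-setup.md`) would be reported as indexed even without its own row.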

For learning-oriented content (concept refreshers, builds-on chains), use ../tutorials/ instead. Runbooks are for operators who already know the concepts and need the procedure.

Last verified: 2026-06-03