fix(planner): guard validation gates in buildRunningPlan#380
Conversation
buildRunningPlan hardcoded validate-signing-key, validate-node-key, and validate-operator-keyring into the InPlace update program unconditionally, unlike buildBasePlan which guards each with its needs* predicate. A Running validator with signingKey+nodeKey but no operatorKeyring (the arctic-1 node-19 migration shape) therefore failed every update plan with "validate-operator-keyring: secretName is empty" the moment image or sidecar drift was detected — bricking the validator's first image bump. Build the running-plan program incrementally with the same needs* guards as buildBasePlan so an unset key source is skipped rather than validated as empty. This also repairs the genesis-ceremony-Running path, which carries no key sources and would have failed on the empty signing-key gate. Splits the test that encoded the old behavior (empty ValidatorSpec asserting all three gates) into declared-keys-only, operator-keyring-set, and genesis-ceremony-Running cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ring End-to-end regression guard for the buildRunningPlan fix. Seeds signingKey + nodeKey Secrets, brings up a replicas:1 validator with no operatorKeyring (the arctic-1 node-19 shape) to Running, bumps the image, and asserts the rollout completes — new image propagates, no child plan fails on the operator-keyring gate, and the SND reaches RolloutComplete. Verified against the pre-fix buildRunningPlan: the child's update plan latches Failed with "validate-operator-keyring: secretName is empty" and the test times out at the second waitFor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
PR SummaryMedium Risk Overview Tests replace the old “always three gates” unit expectation with cases for declared-keys-only, operator keyring set, and genesis-ceremony Running with no gates; an envtest drives a full in-place image bump for that BYO shape and asserts rollout completion without operator-keyring failures. Reviewed by Cursor Bugbot for commit 8aacebb. Bugbot is set up for automated code reviews on this repo. Configure here. |
* docs(runbooks): add validator BYO-secrets migration runbook Cutover procedure for moving a live validator (arctic-1 node-19) off a legacy EC2 host onto the platform carrying its consensus identity via SOPS-encrypted Secrets. Centers the stop-before-start double-sign discipline and the layered equivocation defenses (procedure, replicas:1 CEL guard, double-sign alerts), the controller validation surface, the cutover/rollback sequence, and the four findings from the harbor dry-run (write-mode↔image coupling, networking.tcp DNS race, deploy-clean-not-recreate, deletionPolicy→PVC cascade, operatorKeyring guard #380). Written from the harbor dry-run and cross-reviewed by the platform and kubernetes specialists; verified against the controller source, the platform .sops.yaml layout, and the sei-infra#1034 node-19 removal mechanism. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(runbooks): fix bootstrap task order + SOPS table per cross-review - §5 step 3: config-apply runs before discover-peers (config-apply writes base config, discover-peers writes persistent-peers, config-validate checks last) — matches buildSidecarProgression (discover-peers inserted before config-validate, not before config-apply). kubernetes-specialist blocker. - §2: drop the clusters/harbor row — the platform repo has no clusters/harbor/.sops.yaml (harbor in-repo secrets use explicit --kms). platform-engineer correction. --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
A SeiNode validator with
signingKey+nodeKeyset butoperatorKeyringunset comes up fine via the create path (buildBasePlan), but once it reachesStatus.Phase == Running, the InPlace update pathbuildRunningPlan(internal/planner/validator.go) fails every reconcile that detects image/sidecar drift with:buildBasePlanbuilds its program incrementally and guards each validation gate with itsneeds*predicate (needsValidateSigningKey/needsValidateNodeKey/needsValidateOperatorKeyring).buildRunningPlaninstead hardcoded all three into a fixed slice, so a Running validator with no operatorKeyring emits avalidate-operator-keyringtask whose params carry an emptysecretName, and the task handler fails terminally.This is a production blocker for the arctic-1 node-19 migration: validator-19 has
signingKey+nodeKeyand nooperatorKeyring, so it would bring up cleanly but brick on its first image upgrade — the update plan would wedge beforeapply-statefulset/replace-pod. It was caught during a harbor dry-run of the migration.Fix
Build the running-plan program incrementally with the same
needs*guards asbuildBasePlan, so an unset key source is skipped rather than validated as empty. This also repairs the genesis-ceremony-Running path (no key sources), which the old hardcoded slice would have failed on the empty signing-key gate.Tests
internal/planner/node_update_test.go): replaced the test that encoded the buggy behavior (emptyValidatorSpec{}asserting all three gates) with three cases — declared-keys-only (BYO, no operatorKeyring → operator-keyring gate absent), operatorKeyring-set (all three gates), and genesis-ceremony-Running (no gates).internal/controller/nodedeployment/envtest/inplace_rollout_validator_test.go): end-to-end against a real apiserver — seeds BYO key Secrets, brings areplicas:1no-operatorKeyring validator to Running, bumps the image, and asserts the rollout completes (new image propagates, no child plan fails on the operator-keyring gate, SND reachesRolloutComplete). Verified it fails against the pre-fix code: the child plan latchesvalidate-operator-keyring: secretName is emptyand the test times out.Review
Cross-reviewed by the kubernetes-specialist: verdict COMPATIBLE. Confirmed the fix mirrors
buildBasePlan, audited every other program-builder path (buildBootstrapPlan,buildGenesisPlan, full/archive/replay planners, post-bootstrap progression) — none carries the same bug — and confirmed no skip-regression and noassembleUpdatePlanside effects from a variable-length program.