fix(planner): guard validation gates in buildRunningPlan by bdchatham · Pull Request #380 · sei-protocol/sei-k8s-controller

bdchatham · 2026-06-02T22:18:05Z

Problem

A SeiNode validator with signingKey+nodeKey set but operatorKeyring unset comes up fine via the create path (buildBasePlan), but once it reaches Status.Phase == Running, the InPlace update path buildRunningPlan (internal/planner/validator.go) fails every reconcile that detects image/sidecar drift with:

validate-operator-keyring: secretName is empty

buildBasePlan builds its program incrementally and guards each validation gate with its needs* predicate (needsValidateSigningKey / needsValidateNodeKey / needsValidateOperatorKeyring). buildRunningPlan instead hardcoded all three into a fixed slice, so a Running validator with no operatorKeyring emits a validate-operator-keyring task whose params carry an empty secretName, and the task handler fails terminally.

This is a production blocker for the arctic-1 node-19 migration: validator-19 has signingKey+nodeKey and no operatorKeyring, so it would bring up cleanly but brick on its first image upgrade — the update plan would wedge before apply-statefulset/replace-pod. It was caught during a harbor dry-run of the migration.

Fix

Build the running-plan program incrementally with the same needs* guards as buildBasePlan, so an unset key source is skipped rather than validated as empty. This also repairs the genesis-ceremony-Running path (no key sources), which the old hardcoded slice would have failed on the empty signing-key gate.

Tests

Unit (internal/planner/node_update_test.go): replaced the test that encoded the buggy behavior (empty ValidatorSpec{} asserting all three gates) with three cases — declared-keys-only (BYO, no operatorKeyring → operator-keyring gate absent), operatorKeyring-set (all three gates), and genesis-ceremony-Running (no gates).
Integration (internal/controller/nodedeployment/envtest/inplace_rollout_validator_test.go): end-to-end against a real apiserver — seeds BYO key Secrets, brings a replicas:1 no-operatorKeyring validator to Running, bumps the image, and asserts the rollout completes (new image propagates, no child plan fails on the operator-keyring gate, SND reaches RolloutComplete). Verified it fails against the pre-fix code: the child plan latches validate-operator-keyring: secretName is empty and the test times out.

Review

Cross-reviewed by the kubernetes-specialist: verdict COMPATIBLE. Confirmed the fix mirrors buildBasePlan, audited every other program-builder path (buildBootstrapPlan, buildGenesisPlan, full/archive/replay planners, post-bootstrap progression) — none carries the same bug — and confirmed no skip-regression and no assembleUpdatePlan side effects from a variable-length program.

buildRunningPlan hardcoded validate-signing-key, validate-node-key, and validate-operator-keyring into the InPlace update program unconditionally, unlike buildBasePlan which guards each with its needs* predicate. A Running validator with signingKey+nodeKey but no operatorKeyring (the arctic-1 node-19 migration shape) therefore failed every update plan with "validate-operator-keyring: secretName is empty" the moment image or sidecar drift was detected — bricking the validator's first image bump. Build the running-plan program incrementally with the same needs* guards as buildBasePlan so an unset key source is skipped rather than validated as empty. This also repairs the genesis-ceremony-Running path, which carries no key sources and would have failed on the empty signing-key gate. Splits the test that encoded the old behavior (empty ValidatorSpec asserting all three gates) into declared-keys-only, operator-keyring-set, and genesis-ceremony-Running cases. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ring End-to-end regression guard for the buildRunningPlan fix. Seeds signingKey + nodeKey Secrets, brings up a replicas:1 validator with no operatorKeyring (the arctic-1 node-19 shape) to Running, bumps the image, and asserts the rollout completes — new image propagates, no child plan fails on the operator-keyring gate, and the SND reaches RolloutComplete. Verified against the pre-fix buildRunningPlan: the child's update plan latches Failed with "validate-operator-keyring: secretName is empty" and the test times out at the second waitFor. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor · 2026-06-02T22:18:11Z

PR Summary

Medium Risk
Changes validator rollout task planning for production image upgrades; behavior is narrowly scoped to skipping validation when key sources are unset, with strong regression tests.

Overview
Running validator image/sidecar updates no longer always enqueue signing, node, and operator-keyring validation tasks. buildRunningPlan now builds the task list with the same needsValidate* guards as buildBasePlan, so validators with only BYO signing/node keys (no operatorKeyring) can roll without failing on an empty operator-keyring secret.

Tests replace the old “always three gates” unit expectation with cases for declared-keys-only, operator keyring set, and genesis-ceremony Running with no gates; an envtest drives a full in-place image bump for that BYO shape and asserts rollout completion without operator-keyring failures.

^{Reviewed by Cursor Bugbot for commit 8aacebb. Bugbot is set up for automated code reviews on this repo. Configure here.}

* docs(runbooks): add validator BYO-secrets migration runbook Cutover procedure for moving a live validator (arctic-1 node-19) off a legacy EC2 host onto the platform carrying its consensus identity via SOPS-encrypted Secrets. Centers the stop-before-start double-sign discipline and the layered equivocation defenses (procedure, replicas:1 CEL guard, double-sign alerts), the controller validation surface, the cutover/rollback sequence, and the four findings from the harbor dry-run (write-mode↔image coupling, networking.tcp DNS race, deploy-clean-not-recreate, deletionPolicy→PVC cascade, operatorKeyring guard #380). Written from the harbor dry-run and cross-reviewed by the platform and kubernetes specialists; verified against the controller source, the platform .sops.yaml layout, and the sei-infra#1034 node-19 removal mechanism. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * docs(runbooks): fix bootstrap task order + SOPS table per cross-review - §5 step 3: config-apply runs before discover-peers (config-apply writes base config, discover-peers writes persistent-peers, config-validate checks last) — matches buildSidecarProgression (discover-peers inserted before config-validate, not before config-apply). kubernetes-specialist blocker. - §2: drop the clusters/harbor row — the platform repo has no clusters/harbor/.sops.yaml (harbor in-repo secrets use explicit --kms). platform-engineer correction. --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bdchatham and others added 2 commits June 3, 2026 00:10

bdchatham merged commit 5d03f33 into main Jun 2, 2026
4 of 5 checks passed

bdchatham deleted the fix/running-plan-guard-operator-keyring branch June 2, 2026 22:20

This was referenced Jun 2, 2026

test(planner): extract repeated literals to satisfy goconst #381

Merged

docs(runbooks): validator BYO-secrets migration runbook #382

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(planner): guard validation gates in buildRunningPlan#380

fix(planner): guard validation gates in buildRunningPlan#380
bdchatham merged 2 commits into
mainfrom
fix/running-plan-guard-operator-keyring

bdchatham commented Jun 2, 2026

Uh oh!

cursor Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bdchatham commented Jun 2, 2026

Problem

Fix

Tests

Review

Uh oh!

cursor Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor Bot commented Jun 2, 2026 •

edited

Loading