Skip to content

fix(planner): guard validation gates in buildRunningPlan#380

Merged
bdchatham merged 2 commits into
mainfrom
fix/running-plan-guard-operator-keyring
Jun 2, 2026
Merged

fix(planner): guard validation gates in buildRunningPlan#380
bdchatham merged 2 commits into
mainfrom
fix/running-plan-guard-operator-keyring

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Problem

A SeiNode validator with signingKey+nodeKey set but operatorKeyring unset comes up fine via the create path (buildBasePlan), but once it reaches Status.Phase == Running, the InPlace update path buildRunningPlan (internal/planner/validator.go) fails every reconcile that detects image/sidecar drift with:

validate-operator-keyring: secretName is empty

buildBasePlan builds its program incrementally and guards each validation gate with its needs* predicate (needsValidateSigningKey / needsValidateNodeKey / needsValidateOperatorKeyring). buildRunningPlan instead hardcoded all three into a fixed slice, so a Running validator with no operatorKeyring emits a validate-operator-keyring task whose params carry an empty secretName, and the task handler fails terminally.

This is a production blocker for the arctic-1 node-19 migration: validator-19 has signingKey+nodeKey and no operatorKeyring, so it would bring up cleanly but brick on its first image upgrade — the update plan would wedge before apply-statefulset/replace-pod. It was caught during a harbor dry-run of the migration.

Fix

Build the running-plan program incrementally with the same needs* guards as buildBasePlan, so an unset key source is skipped rather than validated as empty. This also repairs the genesis-ceremony-Running path (no key sources), which the old hardcoded slice would have failed on the empty signing-key gate.

Tests

  • Unit (internal/planner/node_update_test.go): replaced the test that encoded the buggy behavior (empty ValidatorSpec{} asserting all three gates) with three cases — declared-keys-only (BYO, no operatorKeyring → operator-keyring gate absent), operatorKeyring-set (all three gates), and genesis-ceremony-Running (no gates).
  • Integration (internal/controller/nodedeployment/envtest/inplace_rollout_validator_test.go): end-to-end against a real apiserver — seeds BYO key Secrets, brings a replicas:1 no-operatorKeyring validator to Running, bumps the image, and asserts the rollout completes (new image propagates, no child plan fails on the operator-keyring gate, SND reaches RolloutComplete). Verified it fails against the pre-fix code: the child plan latches validate-operator-keyring: secretName is empty and the test times out.

Review

Cross-reviewed by the kubernetes-specialist: verdict COMPATIBLE. Confirmed the fix mirrors buildBasePlan, audited every other program-builder path (buildBootstrapPlan, buildGenesisPlan, full/archive/replay planners, post-bootstrap progression) — none carries the same bug — and confirmed no skip-regression and no assembleUpdatePlan side effects from a variable-length program.

bdchatham and others added 2 commits June 3, 2026 00:10
buildRunningPlan hardcoded validate-signing-key, validate-node-key, and
validate-operator-keyring into the InPlace update program unconditionally,
unlike buildBasePlan which guards each with its needs* predicate. A Running
validator with signingKey+nodeKey but no operatorKeyring (the arctic-1
node-19 migration shape) therefore failed every update plan with
"validate-operator-keyring: secretName is empty" the moment image or sidecar
drift was detected — bricking the validator's first image bump.

Build the running-plan program incrementally with the same needs* guards as
buildBasePlan so an unset key source is skipped rather than validated as
empty. This also repairs the genesis-ceremony-Running path, which carries no
key sources and would have failed on the empty signing-key gate.

Splits the test that encoded the old behavior (empty ValidatorSpec asserting
all three gates) into declared-keys-only, operator-keyring-set, and
genesis-ceremony-Running cases.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ring

End-to-end regression guard for the buildRunningPlan fix. Seeds signingKey +
nodeKey Secrets, brings up a replicas:1 validator with no operatorKeyring (the
arctic-1 node-19 shape) to Running, bumps the image, and asserts the rollout
completes — new image propagates, no child plan fails on the operator-keyring
gate, and the SND reaches RolloutComplete.

Verified against the pre-fix buildRunningPlan: the child's update plan latches
Failed with "validate-operator-keyring: secretName is empty" and the test times
out at the second waitFor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented Jun 2, 2026

PR Summary

Medium Risk
Changes validator rollout task planning for production image upgrades; behavior is narrowly scoped to skipping validation when key sources are unset, with strong regression tests.

Overview
Running validator image/sidecar updates no longer always enqueue signing, node, and operator-keyring validation tasks. buildRunningPlan now builds the task list with the same needsValidate* guards as buildBasePlan, so validators with only BYO signing/node keys (no operatorKeyring) can roll without failing on an empty operator-keyring secret.

Tests replace the old “always three gates” unit expectation with cases for declared-keys-only, operator keyring set, and genesis-ceremony Running with no gates; an envtest drives a full in-place image bump for that BYO shape and asserts rollout completion without operator-keyring failures.

Reviewed by Cursor Bugbot for commit 8aacebb. Bugbot is set up for automated code reviews on this repo. Configure here.

@bdchatham bdchatham merged commit 5d03f33 into main Jun 2, 2026
4 of 5 checks passed
@bdchatham bdchatham deleted the fix/running-plan-guard-operator-keyring branch June 2, 2026 22:20
bdchatham added a commit that referenced this pull request Jun 3, 2026
* docs(runbooks): add validator BYO-secrets migration runbook

Cutover procedure for moving a live validator (arctic-1 node-19) off a legacy
EC2 host onto the platform carrying its consensus identity via SOPS-encrypted
Secrets. Centers the stop-before-start double-sign discipline and the layered
equivocation defenses (procedure, replicas:1 CEL guard, double-sign alerts),
the controller validation surface, the cutover/rollback sequence, and the four
findings from the harbor dry-run (write-mode↔image coupling, networking.tcp DNS
race, deploy-clean-not-recreate, deletionPolicy→PVC cascade, operatorKeyring
guard #380).

Written from the harbor dry-run and cross-reviewed by the platform and
kubernetes specialists; verified against the controller source, the platform
.sops.yaml layout, and the sei-infra#1034 node-19 removal mechanism.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* docs(runbooks): fix bootstrap task order + SOPS table per cross-review

- §5 step 3: config-apply runs before discover-peers (config-apply writes base
  config, discover-peers writes persistent-peers, config-validate checks last) —
  matches buildSidecarProgression (discover-peers inserted before config-validate,
  not before config-apply). kubernetes-specialist blocker.
- §2: drop the clusters/harbor row — the platform repo has no
  clusters/harbor/.sops.yaml (harbor in-repo secrets use explicit --kms).
  platform-engineer correction.

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant