feat(planner): re-apply config.toml as part of buildNodeUpdatePlan#374
feat(planner): re-apply config.toml as part of buildNodeUpdatePlan#374bdchatham wants to merge 2 commits into
Conversation
The image-drift plan previously ran ApplyStatefulSet → ApplyService → ReplacePod → ObserveImage → MarkReady. config.toml was never re-applied, so any change to fields the controller owns via commonOverrides — most notably Spec.ExternalAddress — never reached the running pod. Inbound publishable P2P could work (NLB to pod), but seid never advertised the NLB hostname via PEX/NodeInfo because external-address stayed empty. Add TaskConfigApply + TaskConfigValidate to the progression, between ApplyService and ReplacePod. ConfigApply runs against the OLD pod's sidecar (writes to the PVC); ReplacePod then cycles the pod and the new seid reads the fresh config.toml. external-address has no hot-reload path in seid today (seictl config_reload.go has a TODO for SIGHUP). Two correctness fixes folded in per pre-implementation expert review: 1. buildNodeUpdatePlan now constructs a real ConfigIntent via the mode planner. Previously paramsForTaskType received nil and returned an empty-Mode ConfigIntent — sei-config's ResolveIntent rejects empty Mode, so TaskConfigApply would have failed at runtime on first image bump. New NodePlanner.BuildConfigIntent method extracted from each mode's BuildPlan; same code path on init and day-2, shared source of truth for mode + overrides. 2. Day-2 intent sets Incremental=true. The init applyFull regenerates config.toml from mode defaults + overrides; running that on a Running node would wipe persistent-peers, state-sync trust point, and any operator-managed TOML keys outside the controller's overrides set. applyIncremental reads on-disk config and patches only the keys we own. Tests: - TestBuildRunningPlan_ImageDrift_ReturnsNodeUpdatePlan updated to expect the new 7-task progression (was 5). - TestBuildRunningPlan_ImageDriftWinsOverSidecar updated similarly. - TestBuildNodeUpdatePlan_ConfigApplyHasIntentWithMode — asserts Mode is set and Overrides carries KeyP2PExternalAddress from Spec.ExternalAddress. - TestBuildNodeUpdatePlan_ConfigApplyIsIncremental — pins Incremental=true. - TestBuildNodeUpdatePlan_ConfigApplyIntentMatchesModePlanner — table- driven coverage across full/archive/validator modes. Out of scope (separate follow-up, both experts flagged): - Autonomous ExternalAddress drift detection on Running nodes. With this change ExternalAddress propagates on next image bump (or any other trigger of buildNodeUpdatePlan). Operators currently trigger via an image SHA pin. If autonomous propagation becomes operationally necessary (e.g., NLB hostname rotation, Spec.Overrides drift, opt-out of publishable P2P), file a follow-up to add drift detection in buildRunningPlan that routes to the same buildNodeUpdatePlan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR SummaryMedium Risk Overview
Tests lock task order, non-empty Mode, ExternalAddress in overrides, incremental vs init behavior, and per-mode intent consistency. Reviewed by Cursor Bugbot for commit 43a0018. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 1d4dc7d. Configure here.
| task.TaskTypeApplyStatefulSet, | ||
| task.TaskTypeApplyService, | ||
| TaskConfigApply, | ||
| TaskConfigValidate, |
There was a problem hiding this comment.
Node update plan missing key validation readiness gates
High Severity
buildNodeUpdatePlan includes TaskTypeReplacePod (cycles the pod) but omits the TaskTypeValidateSigningKey, TaskTypeValidateNodeKey, and TaskTypeValidateOperatorKeyring readiness gates that both buildBasePlan and buildBootstrapPlan conditionally include before pod creation. For validator nodes with key secrets mounted as volumes, an image-drift rollout would recreate the pod without first verifying those secrets exist, risking a kubelet volume-mount failure instead of a clear controller-side validation error.
Additional Locations (1)
Triggered by learned rule: Planner readiness gates must apply to all pod-cycling plans
Reviewed by Cursor Bugbot for commit 1d4dc7d. Configure here.
Cross-review feedback from kubernetes-specialist: 1. Defensive nil-guard on each mode planner's BuildConfigIntent. Today plannerForMode() gates which planner gets called, so the nil-deref isn't reachable. But the method is now part of the NodePlanner interface contract — making it independently safe means future refactors that call it through other paths don't accidentally NPE. 2. Pin init plans to Incremental=false via a new test. The day-2 path sets Incremental=true intentionally; the init path must stay on applyFull (regenerate from mode defaults). Without the assertion, a future refactor could silently flip init and we'd only notice when a fresh node came up with stale state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Problem
buildNodeUpdatePlan(the image-drift plan) progression was:No
TaskConfigApply. Result: any change to fields the controller owns viacommonOverrides— most notablySpec.ExternalAddressfrom the publishable-P2P feature — never reached the running pod'sconfig.toml. Inbound NLB→pod worked; outbound advertising (NodeInfo / PEX) didn't, because seid hadexternal-address = ''.Caught during the arctic-1 canary rollout (platform#764–771). With AWS LBC + harbor
SEI_NLB_TARGET_TYPE=instancenow in place, the NLB path is fully working, but seid still doesn't tell other peers how to reach it.Fix
Add
TaskConfigApply+TaskConfigValidateto the update progression, betweenApplyServiceandReplacePod:Conceptual reframe: "deployment" stops being "swap image" and becomes "rebuild the running state from current spec."
commonOverridesalready produces the correct config from currentSpec.ExternalAddress; we just weren't re-running TaskConfigApply outside init.Two correctness fixes from pre-implementation expert review
Both Platform and K8s specialists independently flagged these before I touched a keyboard:
1. Real
ConfigIntent(not nil). OldbuildNodeUpdatePlancalledparamsForTaskType(node, taskType, nil, nil). WithconfigIntent=nil, the params builder returned an emptyConfigIntent{}— empty Mode — andsei-config'sResolveIntentrejects empty Mode. TaskConfigApply would have failed at runtime on the first image bump.Fix: extracted
NodePlanner.BuildConfigIntent(node)from each mode planner'sBuildPlan.buildNodeUpdatePlancalls it viaplannerForMode(node).BuildConfigIntent(node). Same source of truth for mode + overrides on init and day-2.2.
Incremental: trueon the day-2 intent. The init path usesapplyFull— regeneratesconfig.tomlfrom mode defaults + overrides. Running that on a Running node would wipe:persistent-peers(set by the sidecar'swritePeersToConfig)applyIncrementalreads on-disk config and patches only the overrides we own. Day-2 must use it.Tests
TestBuildRunningPlan_ImageDrift_ReturnsNodeUpdatePlan— updated to expect the new 7-task progression.TestBuildRunningPlan_ImageDriftWinsOverSidecar— same.TestBuildNodeUpdatePlan_ConfigApplyHasIntentWithMode— asserts Mode is set +Overrides[KeyP2PExternalAddress]propagates fromSpec.ExternalAddress.TestBuildNodeUpdatePlan_ConfigApplyIsIncremental— pinsIncremental: true.TestBuildNodeUpdatePlan_ConfigApplyIntentMatchesModePlanner— table-driven coverage across full/archive/validator modes.Existing unit + envtest suites: green.
Progression ordering — why between ApplyService and ReplacePod
config.tomlto the PVC; the new pod mounts the PVC and reads the fresh file on seid start.external-addresshas no hot-reload path in seid today (seictl/sidecar/tasks/config_reload.goliterally has aTODO: signal seid to re-read config), so the pod restart is the trigger.baseProgressionorder.classifyPlan unchanged
classifyPlan(planner.go:217) identifies "node-update" plans byTaskTypeObserveImagepresence — still in the progression, classification still works.Out of scope
Autonomous
Spec.ExternalAddressdrift detection on Running nodes. With this change, ExternalAddress propagates on the next trigger ofbuildNodeUpdatePlan(image bump or other image drift). Operators currently trigger via an image SHA pin. If autonomous propagation becomes operationally necessary (NLB hostname rotation,Spec.Overridesdrift, publishable-P2P opt-out), file a follow-up to add drift detection inbuildRunningPlanthat routes to the samebuildNodeUpdatePlan.Test plan
buildNodeUpdatePlanwith the new progression. config.toml on the new pod hasexternal-address = "syncer-0-0-p2p.arctic-1.harbor.platform.sei.io:26656". seid logs show PEX advertising the NLB hostname.🤖 Generated with Claude Code