Skip to content

feat(planner): re-apply config.toml as part of buildNodeUpdatePlan#374

Open
bdchatham wants to merge 2 commits into
mainfrom
feat/update-plan-includes-config-apply
Open

feat(planner): re-apply config.toml as part of buildNodeUpdatePlan#374
bdchatham wants to merge 2 commits into
mainfrom
feat/update-plan-includes-config-apply

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Problem

buildNodeUpdatePlan (the image-drift plan) progression was:

ApplyStatefulSet → ApplyService → ReplacePod → ObserveImage → MarkReady

No TaskConfigApply. Result: any change to fields the controller owns via commonOverrides — most notably Spec.ExternalAddress from the publishable-P2P feature — never reached the running pod's config.toml. Inbound NLB→pod worked; outbound advertising (NodeInfo / PEX) didn't, because seid had external-address = ''.

Caught during the arctic-1 canary rollout (platform#764–771). With AWS LBC + harbor SEI_NLB_TARGET_TYPE=instance now in place, the NLB path is fully working, but seid still doesn't tell other peers how to reach it.

Fix

Add TaskConfigApply + TaskConfigValidate to the update progression, between ApplyService and ReplacePod:

ApplyStatefulSet → ApplyService → TaskConfigApply → TaskConfigValidate → ReplacePod → ObserveImage → MarkReady

Conceptual reframe: "deployment" stops being "swap image" and becomes "rebuild the running state from current spec." commonOverrides already produces the correct config from current Spec.ExternalAddress; we just weren't re-running TaskConfigApply outside init.

Two correctness fixes from pre-implementation expert review

Both Platform and K8s specialists independently flagged these before I touched a keyboard:

1. Real ConfigIntent (not nil). Old buildNodeUpdatePlan called paramsForTaskType(node, taskType, nil, nil). With configIntent=nil, the params builder returned an empty ConfigIntent{} — empty Mode — and sei-config's ResolveIntent rejects empty Mode. TaskConfigApply would have failed at runtime on the first image bump.

Fix: extracted NodePlanner.BuildConfigIntent(node) from each mode planner's BuildPlan. buildNodeUpdatePlan calls it via plannerForMode(node).BuildConfigIntent(node). Same source of truth for mode + overrides on init and day-2.

2. Incremental: true on the day-2 intent. The init path uses applyFull — regenerates config.toml from mode defaults + overrides. Running that on a Running node would wipe:

  • persistent-peers (set by the sidecar's writePeersToConfig)
  • State-sync trust point
  • Genesis-peers
  • Operator-managed TOML keys outside the controller's overrides set

applyIncremental reads on-disk config and patches only the overrides we own. Day-2 must use it.

Tests

  • TestBuildRunningPlan_ImageDrift_ReturnsNodeUpdatePlan — updated to expect the new 7-task progression.
  • TestBuildRunningPlan_ImageDriftWinsOverSidecar — same.
  • TestBuildNodeUpdatePlan_ConfigApplyHasIntentWithMode — asserts Mode is set + Overrides[KeyP2PExternalAddress] propagates from Spec.ExternalAddress.
  • TestBuildNodeUpdatePlan_ConfigApplyIsIncremental — pins Incremental: true.
  • TestBuildNodeUpdatePlan_ConfigApplyIntentMatchesModePlanner — table-driven coverage across full/archive/validator modes.

Existing unit + envtest suites: green.

Progression ordering — why between ApplyService and ReplacePod

  • ConfigApply must run before ReplacePod: the OLD pod's sidecar writes config.toml to the PVC; the new pod mounts the PVC and reads the fresh file on seid start.
  • ConfigApply after ReplacePod would race against new-pod sidecar liveness and waste a restart cycle.
  • external-address has no hot-reload path in seid today (seictl/sidecar/tasks/config_reload.go literally has a TODO: signal seid to re-read config), so the pod restart is the trigger.
  • Matches the init baseProgression order.

classifyPlan unchanged

classifyPlan (planner.go:217) identifies "node-update" plans by TaskTypeObserveImage presence — still in the progression, classification still works.

Out of scope

Autonomous Spec.ExternalAddress drift detection on Running nodes. With this change, ExternalAddress propagates on the next trigger of buildNodeUpdatePlan (image bump or other image drift). Operators currently trigger via an image SHA pin. If autonomous propagation becomes operationally necessary (NLB hostname rotation, Spec.Overrides drift, publishable-P2P opt-out), file a follow-up to add drift detection in buildRunningPlan that routes to the same buildNodeUpdatePlan.

Test plan

  • Unit + envtest green on CI.
  • After merge + ECR build + harbor controller image bump: pinning arctic-1 syncer image to the same digest (no-op semantically) triggers buildNodeUpdatePlan with the new progression. config.toml on the new pod has external-address = "syncer-0-0-p2p.arctic-1.harbor.platform.sei.io:26656". seid logs show PEX advertising the NLB hostname.

🤖 Generated with Claude Code

The image-drift plan previously ran ApplyStatefulSet → ApplyService →
ReplacePod → ObserveImage → MarkReady. config.toml was never re-applied,
so any change to fields the controller owns via commonOverrides — most
notably Spec.ExternalAddress — never reached the running pod. Inbound
publishable P2P could work (NLB to pod), but seid never advertised the
NLB hostname via PEX/NodeInfo because external-address stayed empty.

Add TaskConfigApply + TaskConfigValidate to the progression, between
ApplyService and ReplacePod. ConfigApply runs against the OLD pod's
sidecar (writes to the PVC); ReplacePod then cycles the pod and the new
seid reads the fresh config.toml. external-address has no hot-reload
path in seid today (seictl config_reload.go has a TODO for SIGHUP).

Two correctness fixes folded in per pre-implementation expert review:

1. buildNodeUpdatePlan now constructs a real ConfigIntent via the mode
   planner. Previously paramsForTaskType received nil and returned an
   empty-Mode ConfigIntent — sei-config's ResolveIntent rejects empty
   Mode, so TaskConfigApply would have failed at runtime on first image
   bump. New NodePlanner.BuildConfigIntent method extracted from each
   mode's BuildPlan; same code path on init and day-2, shared source
   of truth for mode + overrides.

2. Day-2 intent sets Incremental=true. The init applyFull regenerates
   config.toml from mode defaults + overrides; running that on a Running
   node would wipe persistent-peers, state-sync trust point, and any
   operator-managed TOML keys outside the controller's overrides set.
   applyIncremental reads on-disk config and patches only the keys we
   own.

Tests:
- TestBuildRunningPlan_ImageDrift_ReturnsNodeUpdatePlan updated to
  expect the new 7-task progression (was 5).
- TestBuildRunningPlan_ImageDriftWinsOverSidecar updated similarly.
- TestBuildNodeUpdatePlan_ConfigApplyHasIntentWithMode — asserts Mode
  is set and Overrides carries KeyP2PExternalAddress from Spec.ExternalAddress.
- TestBuildNodeUpdatePlan_ConfigApplyIsIncremental — pins Incremental=true.
- TestBuildNodeUpdatePlan_ConfigApplyIntentMatchesModePlanner — table-
  driven coverage across full/archive/validator modes.

Out of scope (separate follow-up, both experts flagged):
- Autonomous ExternalAddress drift detection on Running nodes. With
  this change ExternalAddress propagates on next image bump (or any
  other trigger of buildNodeUpdatePlan). Operators currently trigger via
  an image SHA pin. If autonomous propagation becomes operationally
  necessary (e.g., NLB hostname rotation, Spec.Overrides drift, opt-out
  of publishable P2P), file a follow-up to add drift detection in
  buildRunningPlan that routes to the same buildNodeUpdatePlan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented May 30, 2026

PR Summary

Medium Risk
Changes the running-node rollout path and PVC config writes; mitigated by incremental apply and strong unit tests, but mis-ordering or full apply on day-2 could still disrupt P2P/state on live nodes.

Overview
Running-node image-drift plans now re-apply config.toml before pod replacement, so controller-owned settings (e.g. Spec.ExternalAddress / P2P external address) reach the live node instead of only being applied at init.

buildNodeUpdatePlan inserts TaskConfigApply and TaskConfigValidate after apply Service and before ReplacePod (7 tasks total). Config is written to the PVC on the old pod’s sidecar, then the new pod reads it on start.

NodePlanner.BuildConfigIntent is added and implemented on full, archive, validator, and replayer planners so init and day-2 share the same mode + merged overrides. Day-2 sets Incremental: true so sei-config patches only owned keys and does not wipe on-disk peers, state-sync trust, or other operator TOML. Init remains non-incremental.

Tests lock task order, non-empty Mode, ExternalAddress in overrides, incremental vs init behavior, and per-mode intent consistency.

Reviewed by Cursor Bugbot for commit 43a0018. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 1d4dc7d. Configure here.

task.TaskTypeApplyStatefulSet,
task.TaskTypeApplyService,
TaskConfigApply,
TaskConfigValidate,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Node update plan missing key validation readiness gates

High Severity

buildNodeUpdatePlan includes TaskTypeReplacePod (cycles the pod) but omits the TaskTypeValidateSigningKey, TaskTypeValidateNodeKey, and TaskTypeValidateOperatorKeyring readiness gates that both buildBasePlan and buildBootstrapPlan conditionally include before pod creation. For validator nodes with key secrets mounted as volumes, an image-drift rollout would recreate the pod without first verifying those secrets exist, risking a kubelet volume-mount failure instead of a clear controller-side validation error.

Additional Locations (1)
Fix in Cursor Fix in Web

Triggered by learned rule: Planner readiness gates must apply to all pod-cycling plans

Reviewed by Cursor Bugbot for commit 1d4dc7d. Configure here.

Cross-review feedback from kubernetes-specialist:

1. Defensive nil-guard on each mode planner's BuildConfigIntent. Today
   plannerForMode() gates which planner gets called, so the nil-deref
   isn't reachable. But the method is now part of the NodePlanner
   interface contract — making it independently safe means future
   refactors that call it through other paths don't accidentally NPE.

2. Pin init plans to Incremental=false via a new test. The day-2 path
   sets Incremental=true intentionally; the init path must stay on
   applyFull (regenerate from mode defaults). Without the assertion, a
   future refactor could silently flip init and we'd only notice when
   a fresh node came up with stale state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant