tools/stress/device-observer: abort decider + sentinel#3819

Open
nikw9944 wants to merge 3 commits into main from nikw9944/doublezero-3796

nikw9944 commented Jun 1, 2026

Summary of Changes

  • Replace the abort decider stub with a state machine that ticks at --sample-interval and evaluates ten triggers (provision p95 / single-user, deprovision p95, sustained CPU, two doublezero-agent error counters, two agent-log substring patterns, agent silence, ledger heartbeat staleness).
  • On the first match the decider writes a JSON sentinel atomically (<abort-file>.tmp + rename) and cancels the observer's root context via OnFire, so the process exits 0 and the orchestrator's existence-only abort.Watch picks up the file. d.fired is only set after the rename succeeds, so a transient filesystem error retries on the next tick.
  • Expose LatestCPUPercent() on the eAPI sampler: parse the %Cpu(s): line from the show processes top once | json envelope (procps top -bn1 output), sum the non-idle fields, and tolerate locale-decimal commas.
  • Add a stale-sentinel guard at startup: if a sentinel left from a previous run exists, the observer refuses to start; with --force it removes the stale sentinel and proceeds.
  • Document the trigger table, sentinel JSON shape, exit-on-fire behavior, stale-sentinel guard, and ledger-heartbeat contract in the README; the contract is forward-compatible (an absent heartbeat file suppresses the trigger).

Fixes #3796

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
|---|---|---|---|
| Core logic | 2 | +394 / -19 | +375 |
| Scaffolding | 1 | +49 / -10 | +39 |
| Tests | 3 | +591 / -0 | +591 |
| Docs | 2 | +72 / -12 | +60 |
| Total | 8 | +1106 / -41 | +1065 |

Roughly half the change is tests (one case per trigger plus sentinel-atomicity, write-failure retry, and CPU-span guards); the rest is the decider state machine, the CPU parser, and the main.go wiring.

Testing Verification

  • go test ./tools/stress/device-observer/... passes (all trigger tests, sentinel atomicity, write-failure retry, CPU parser fixtures including comma-decimal locale, CPU-span guard, stale-sentinel cases).
  • go vet ./tools/stress/device-observer/... clean.
  • CPU-sustained trigger requires both ≥ minSamples retained AND samples spanning ≥ cpuSustainedWindow (60 s) so the first 4 samples can't fire on their own at startup; TestCPUSustainedBelowSpan covers the guard.
  • Sentinel write failure does not strand the decider: d.fired flips only after rename succeeds; the next tick retries (covered by TestSentinelWriteFailureRetries).

nikw9944 force-pushed the nikw9944/doublezero-3796 branch from ffa2337 to 943d331 on June 1, 2026 23:13
nikw9944 marked this pull request as ready for review on June 1, 2026 23:43
nikw9944 added 3 commits June 1, 2026 23:43
Replace the abort stub with a full state machine: ticks at
--sample-interval, evaluates ten triggers (provision p95/single-user,
deprovision p95, sustained CPU, agent error counters, agent log
patterns, agent silence, ledger heartbeat staleness), and on the first
match writes a JSON sentinel atomically via tmp+rename. Firing cancels
the observer's root context so the process exits 0 and the
orchestrator's abort.Watch picks up the file.

Adds CPU exposure on the sampler: parses %Cpu(s) lines from
'show processes top once | json', sums non-idle fields with
locale-decimal-comma tolerance, and exposes LatestCPUPercent() for the
decider.
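As a rough sketch of the parsing step, the following handles a single procps-style `%Cpu(s):` line, assuming the usual "value label" pairs (us, sy, ni, id, and so on). The function name `parseCPUPercent` and the regular expression are illustrative, not taken from the PR, and the JSON envelope handling is omitted.

```go
package main

import (
	"errors"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// cpuField matches a number (with '.' or ',' as the decimal mark) followed
// by a two-letter label such as "us" or "id".
var cpuField = regexp.MustCompile(`(\d+(?:[.,]\d+)?)\s+([a-z]{2})\b`)

// parseCPUPercent sums every field on a %Cpu(s): line except idle ("id").
func parseCPUPercent(line string) (float64, error) {
	_, rest, ok := strings.Cut(line, "%Cpu(s):")
	if !ok {
		return 0, errors.New("not a %Cpu(s) line")
	}
	matches := cpuField.FindAllStringSubmatch(rest, -1)
	if len(matches) == 0 {
		return 0, errors.New("no CPU fields found")
	}
	var total float64
	for _, m := range matches {
		if m[2] == "id" {
			continue // idle time is excluded from the busy total
		}
		// Normalise a locale decimal comma before parsing.
		v, err := strconv.ParseFloat(strings.Replace(m[1], ",", ".", 1), 64)
		if err != nil {
			return 0, fmt.Errorf("parse %q: %w", m[1], err)
		}
		total += v
	}
	return total, nil
}

func main() {
	lines := []string{
		"%Cpu(s):  3.1 us,  1.2 sy,  0.0 ni, 95.5 id,  0.1 wa,  0.0 hi,  0.1 si,  0.0 st",
		"%Cpu(s):  3,1 us,  1,2 sy,  0,0 ni, 95,5 id,  0,1 wa,  0,0 hi,  0,1 si,  0,0 st",
	}
	for _, l := range lines {
		pct, err := parseCPUPercent(l)
		fmt.Printf("%.1f %v\n", pct, err)
	}
}
```

Matching "number then label" pairs, rather than splitting on commas, is what lets the same code read both decimal styles, since in the comma locale the separator and the decimal mark are the same character.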

Adds --force startup flag that overwrites a stale sentinel from a
previous run; without it the observer refuses to start.

Documents the trigger map and operator contract (sentinel JSON shape,
exit-on-fire, stale-sentinel guard, ledger heartbeat contract) in the
README and updates the CHANGELOG entry.

Don't mark the decider as fired until the sentinel write + rename
actually succeed, so a transient filesystem failure (no space, EACCES,
missing parent dir) doesn't strand the observer with d.fired=true and
no OnFire invocation. With this change, the next tick re-evaluates and
retries the write.

Adds a unit test that confirms a failed write does not call OnFire and
that the decider successfully writes the sentinel on the next tick
once the path becomes writable.

#3796

- cpuSustained: require retained samples to span >= cpuSustainedWindow
  so the trigger can't fire on the first 4 samples (which only span
  ~30s at the 10s sample interval). Adds a guard test.
- Defensive copy on prevCounters to match the prevPatterns pattern, so
  the decider doesn't rely on an undocumented ownership contract from
  Sources.PromSnapshot.
- Pass tick's already-captured 'now' into fire() instead of re-reading
  the clock, so one tick has one timestamp.
- Strip a stale PR ref in sample/eos.go and verbose comments across
  decider.go, sample/eos.go, and main.go that restate what the code
  says.
- README: update the cpu_sustained row to spell out the span guard.

#3796
nikw9944 force-pushed the nikw9944/doublezero-3796 branch from 943d331 to 04e335e on June 1, 2026 23:44

Development

Successfully merging this pull request may close these issues.

3747-4: abort decider + sentinel (~200 LOC code)
