tools/stress/device-observer: abort decider + sentinel #3819
Open
nikw9944 wants to merge 3 commits into
Replace the abort stub with a full state machine: it ticks at --sample-interval, evaluates ten triggers (provision p95/single-user, deprovision p95, sustained CPU, agent error counters, agent log patterns, agent silence, ledger heartbeat staleness), and on the first match writes a JSON sentinel atomically via tmp+rename. Firing cancels the observer's root context so the process exits 0 and the orchestrator's abort.Watch picks up the file.

Adds CPU exposure on the sampler: parses %Cpu(s) lines from 'show processes top once | json', sums non-idle fields with locale-decimal-comma tolerance, and exposes LatestCPUPercent() for the decider.

Adds a --force startup flag that overwrites a stale sentinel from a previous run; without it the observer refuses to start.

Documents the trigger map and operator contract (sentinel JSON shape, exit-on-fire, stale-sentinel guard, ledger heartbeat contract) in the README and updates the CHANGELOG entry.
Don't mark the decider as fired until the sentinel write + rename actually succeed, so a transient filesystem failure (no space, EACCES, missing parent dir) doesn't strand the observer with d.fired=true and no OnFire invocation. With this change, the next tick re-evaluates and retries the write. Adds a unit test that confirms a failed write does not call OnFire and that the decider successfully writes the sentinel on the next tick once the path becomes writable. #3796
- cpuSustained: require retained samples to span >= cpuSustainedWindow so the trigger can't fire on the first 4 samples (which only span ~30s at the 10s sample interval). Adds a guard test.
- Defensive copy on prevCounters to match the prevPatterns pattern, so the decider doesn't rely on an undocumented ownership contract from Sources.PromSnapshot.
- Pass tick's already-captured 'now' into fire() instead of re-reading the clock, so one tick has one timestamp.
- Strip a stale PR ref in sample/eos.go and verbose comments across decider.go, sample/eos.go, and main.go that restate what the code says.
- README: update the cpu_sustained row to spell out the span guard.

#3796
Summary of Changes
- Replaces the abort stub with a state machine that ticks at `--sample-interval` and evaluates ten triggers (provision p95 / single-user, deprovision p95, sustained CPU, two doublezero-agent error counters, two agent-log substring patterns, agent silence, ledger heartbeat staleness).
- On the first match, writes a JSON sentinel atomically (`<abort-file>.tmp` + rename) and cancels the observer's root context via `OnFire`, so the process exits 0 and the orchestrator's existence-only `abort.Watch` picks up the file. `d.fired` is only set after the rename succeeds, so a transient filesystem error retries on the next tick.
- Adds `LatestCPUPercent()` on the eAPI sampler: parses the `%Cpu(s):` line from the `show processes top once | json` envelope (procps `top -bn1` output), sums the non-idle fields, and tolerates locale-decimal commas.
- Adds a stale-sentinel guard with `--force`: the observer refuses to start when a sentinel is left from a previous run, or removes it when `--force` is given.

Fixes #3796
Diff Breakdown
Roughly half the change is tests (one case per trigger plus sentinel-atomicity, write-failure retry, and CPU-span guards); the rest is the decider state machine, the CPU parser, and the main.go wiring.
Key files
- `tools/stress/device-observer/internal/abort/decider.go`: full state machine: `Sources` of function-typed snapshot getters, trigger constants, `tick()` ordered as p95 → single-user → deprovision p95 → CPU sustained → counter deltas → log patterns → agent silence → ledger heartbeat, atomic sentinel write with retry on failure.
- `tools/stress/device-observer/internal/sample/eos.go`: `LatestCPUPercent()` and `parseCPUPercent()` over the `show processes top once | json` envelope; locale-decimal tolerant.
- `tools/stress/device-observer/cmd/device-observer/main.go`: `--force` flag, `checkStaleAbort()`, capture collector values into locals, build `abort.Sources`, wire `OnFire` to the errgroup root cancel.
- `tools/stress/device-observer/internal/abort/decider_test.go`: one test per trigger, sentinel-once + atomic-rename, write-failure retry, OnFire-invoked-once, CPU span guard, percentile helpers.
- `tools/stress/device-observer/internal/sample/eos_test.go`: CPU parser fixtures (procps, busybox, comma-decimal locale, missing line) and the `LatestCPUPercent` snapshot update.
- `tools/stress/device-observer/README.md`: trigger table and Operator-contract section (sentinel format, exit-on-fire, stale-sentinel guard, ledger-heartbeat contract).
- `tools/stress/device-observer/cmd/device-observer/main_test.go`: `checkStaleAbort` tests (refuses, force-removes, missing no-op).

Testing Verification
- `go test ./tools/stress/device-observer/...` passes (all trigger tests, sentinel atomicity, write-failure retry, CPU parser fixtures including comma-decimal locale, CPU-span guard, stale-sentinel cases).
- `go vet ./tools/stress/device-observer/...` clean.
- The sustained-CPU trigger requires `minSamples` retained AND samples spanning ≥ `cpuSustainedWindow` (60 s), so the first 4 samples can't fire on their own at startup; `TestCPUSustainedBelowSpan` covers the guard.
- `d.fired` flips only after the rename succeeds; on a write failure the next tick retries (covered by `TestSentinelWriteFailureRetries`).