tools/stress: fix sweep deprovision and observer cpu parse #3820
Open
nikw9944 wants to merge 5 commits into
Replace the abort stub with a full state machine: ticks at --sample-interval, evaluates ten triggers (provision p95/single-user, deprovision p95, sustained CPU, agent error counters, agent log patterns, agent silence, ledger heartbeat staleness), and on the first match writes a JSON sentinel atomically via tmp+rename. Firing cancels the observer's root context so the process exits 0 and the orchestrator's abort.Watch picks up the file. Adds CPU exposure on the sampler: parses %Cpu(s) lines from 'show processes top once | json', sums non-idle fields with locale-decimal-comma tolerance, and exposes LatestCPUPercent() for the decider. Adds --force startup flag that overwrites a stale sentinel from a previous run; without it the observer refuses to start. Documents the trigger map and operator contract (sentinel JSON shape, exit-on-fire, stale-sentinel guard, ledger heartbeat contract) in the README and updates the CHANGELOG entry. #3796
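A minimal sketch of the tmp+rename sentinel write described above. The `Sentinel` fields, the `abort.json` file name, and the helper names are hypothetical; the real JSON shape is the one documented in the README. The sketch relies on `os.Rename` replacing the target atomically when both paths are on the same filesystem, which is why the temp file is created next to the destination.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Sentinel is a hypothetical shape for the abort file, not the tool's real schema.
type Sentinel struct {
	Trigger string `json:"trigger"`
	Detail  string `json:"detail"`
}

// writeSentinelAtomic writes s to a temp file in the destination directory
// and renames it over path, so a watcher never reads a half-written file.
func writeSentinelAtomic(path string, s Sentinel) error {
	data, err := json.Marshal(s)
	if err != nil {
		return fmt.Errorf("marshal sentinel: %w", err)
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".sentinel-*.tmp")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	tmpName := tmp.Name()
	// Cleans up on failure paths; after a successful rename the temp name
	// no longer exists and the returned error is ignored.
	defer os.Remove(tmpName)

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return fmt.Errorf("write temp file: %w", err)
	}
	if err := tmp.Close(); err != nil {
		return fmt.Errorf("close temp file: %w", err)
	}
	if err := os.Rename(tmpName, path); err != nil {
		return fmt.Errorf("rename into place: %w", err)
	}
	return nil
}

// roundTrip writes s into a fresh temp directory and reads it back.
func roundTrip(s Sentinel) (Sentinel, error) {
	var got Sentinel
	dir, err := os.MkdirTemp("", "sentinel-demo-")
	if err != nil {
		return got, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "abort.json") // hypothetical file name
	if err := writeSentinelAtomic(path, s); err != nil {
		return got, err
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return got, err
	}
	err = json.Unmarshal(data, &got)
	return got, err
}

func main() {
	got, err := roundTrip(Sentinel{Trigger: "cpu_sustained", Detail: "demo"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", got)
}
```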
Don't mark the decider as fired until the sentinel write + rename actually succeed, so a transient filesystem failure (no space, EACCES, missing parent dir) doesn't strand the observer with d.fired=true and no OnFire invocation. With this change, the next tick re-evaluates and retries the write. Adds a unit test that confirms a failed write does not call OnFire and that the decider successfully writes the sentinel on the next tick once the path becomes writable. #3796
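The ordering this commit describes can be sketched as follows. The `decider` type here is a stripped-down, hypothetical stand-in, not the code in decider.go: the point is only that `fired` flips after the write succeeds, so a failed write leaves the next tick free to retry.

```go
package main

import (
	"errors"
	"fmt"
)

// decider is a hypothetical stand-in for the observer's decider.
type decider struct {
	fired  bool
	write  func() error // writes the sentinel (tmp + rename in the real tool)
	onFire func()       // called once, only after the write succeeded
}

// tick handles one evaluation; triggered says whether any trigger matched.
func (d *decider) tick(triggered bool) error {
	if d.fired || !triggered {
		return nil
	}
	if err := d.write(); err != nil {
		// d.fired stays false, so the next tick re-evaluates and retries.
		return fmt.Errorf("write sentinel: %w", err)
	}
	d.fired = true
	d.onFire()
	return nil
}

// simulate runs two ticks against a writer that fails once, and returns
// how many times onFire had run after each tick.
func simulate() (afterFirst, afterSecond int) {
	calls, attempts := 0, 0
	d := &decider{
		write: func() error {
			attempts++
			if attempts == 1 {
				return errors.New("no space left on device")
			}
			return nil
		},
		onFire: func() { calls++ },
	}
	_ = d.tick(true)
	afterFirst = calls
	_ = d.tick(true)
	afterSecond = calls
	return afterFirst, afterSecond
}

func main() {
	a, b := simulate()
	fmt.Println(a, b) // prints: 0 1
}
```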
- cpuSustained: require retained samples to span >= cpuSustainedWindow so the trigger can't fire on the first 4 samples (which only span ~30s at the 10s sample interval). Adds a guard test.
- Defensive copy on prevCounters to match the prevPatterns pattern, so the decider doesn't rely on an undocumented ownership contract from Sources.PromSnapshot.
- Pass tick's already-captured 'now' into fire() instead of re-reading the clock, so one tick has one timestamp.
- Strip a stale PR ref in sample/eos.go and verbose comments across decider.go, sample/eos.go, and main.go that restate what the code says.
- README: update the cpu_sustained row to spell out the span guard.

#3796
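A sketch of the span guard from the first bullet. The 90% threshold and 60s window are illustrative assumptions, and `cpuSample` is a hypothetical type; only the 10s sample interval and the "4 samples span ~30s" arithmetic come from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// cpuSample is a hypothetical retained sample.
type cpuSample struct {
	at  time.Time
	pct float64
}

// cpuSustained reports whether every retained sample is at or above
// threshold AND the samples span at least window. Samples are oldest-first.
func cpuSustained(samples []cpuSample, threshold float64, window time.Duration) bool {
	if len(samples) < 2 {
		return false
	}
	span := samples[len(samples)-1].at.Sub(samples[0].at)
	if span < window {
		return false // not enough history yet, regardless of the values
	}
	for _, s := range samples {
		if s.pct < threshold {
			return false
		}
	}
	return true
}

// spanGuardDemo builds n samples at 99% CPU, 10s apart, and checks them
// against an assumed 90% threshold and 60s window.
func spanGuardDemo(n int) bool {
	start := time.Unix(0, 0)
	samples := make([]cpuSample, 0, n)
	for i := 0; i < n; i++ {
		at := start.Add(time.Duration(i) * 10 * time.Second)
		samples = append(samples, cpuSample{at: at, pct: 99})
	}
	return cpuSustained(samples, 90, 60*time.Second)
}

func main() {
	// 4 samples span 30s (guard blocks); 7 samples span 60s (trigger allowed).
	fmt.Println(spanGuardDemo(4), spanGuardDemo(7)) // prints: false true
}
```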
…waitForAccountGone
The gagliardetto RPC client surfaces a missing account as
(nil, ErrNotFound) rather than (&Result{Value: nil}, nil). The poll
loop only recognized the latter shape, so on a successful close it
kept retrying ErrNotFound until the deadline elapsed and then wrapped
the sentinel into a contradictory "still present: not found" error.
The stress orchestrator's deprovision phase reproduces this every
run on the final user deletion.
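The poll-loop fix can be sketched like this. To stay dependency-free, `errNotFound` stands in for `solanarpc.ErrNotFound`, and `accountResult`, `getAccountFunc`, and the demo helpers are hypothetical; the real loop lives in `Executor.waitForAccountGone` and may differ in its retry and error-wrapping details.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNotFound stands in for solanarpc.ErrNotFound in this sketch.
var errNotFound = errors.New("not found")

// accountResult is a hypothetical mirror of the RPC result shape.
type accountResult struct {
	Value *struct{} // nil once the account is closed
}

type getAccountFunc func(ctx context.Context) (*accountResult, error)

// waitForAccountGone polls until the account is gone or ctx ends.
// Both (nil, ErrNotFound) and (&Result{Value: nil}, nil) count as closure.
func waitForAccountGone(ctx context.Context, get getAccountFunc, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		res, err := get(ctx)
		if errors.Is(err, errNotFound) {
			return nil // missing account surfaced as a sentinel error
		}
		if err == nil && res != nil && res.Value == nil {
			return nil // missing account surfaced as a nil Value
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("account still present: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// demoGone polls a fake account that is present twice, then not found.
func demoGone() error {
	calls := 0
	get := func(context.Context) (*accountResult, error) {
		calls++
		if calls < 3 {
			return &accountResult{Value: &struct{}{}}, nil
		}
		return nil, errNotFound
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	return waitForAccountGone(ctx, get, time.Millisecond)
}

// demoTimeout polls a fake account that never goes away.
func demoTimeout() error {
	get := func(context.Context) (*accountResult, error) {
		return &accountResult{Value: &struct{}{}}, nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Millisecond)
	defer cancel()
	return waitForAccountGone(ctx, get, time.Millisecond)
}

func main() {
	fmt.Println("gone:", demoGone())
	fmt.Println("timeout:", demoTimeout())
}
```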
The sampler invokes `show processes top once` via RunShowJSON, so
the eAPI returns a structured object with cpuInfo."%Cpu(s)".{idle,
user,system,...} rather than the procps text envelope the old parser
expected. The old parser bailed every tick against cEOS / real EOS,
silently leaving LatestCPUPercent invalid and disabling the
CPUSustained abort trigger. Add a structured parser as the primary
path and keep the text-envelope parser as a fallback so an operator
can flip the command to RunShowText without losing the trigger.
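A rough sketch of the structured-first, text-fallback split. Several things here are assumptions rather than the code in sample/eos.go: the fields under `cpuInfo."%Cpu(s)"` are taken to be JSON numbers, busy CPU is derived as `100 - idle` on both paths for brevity (the original text parser is described as summing non-idle fields), the fallback handles decimal points only, and all function names are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
	"strconv"
)

var errNoCPU = errors.New("no CPU data in payload")

// parseStructured reads cpuInfo."%Cpu(s)".idle from the eAPI JSON object.
func parseStructured(raw []byte) (float64, error) {
	var top map[string]json.RawMessage
	if err := json.Unmarshal(raw, &top); err != nil {
		return 0, err
	}
	var info map[string]json.RawMessage
	if err := json.Unmarshal(top["cpuInfo"], &info); err != nil {
		return 0, errNoCPU // cpuInfo missing or not an object
	}
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(info["%Cpu(s)"], &fields); err != nil {
		return 0, errNoCPU
	}
	var idle float64
	if err := json.Unmarshal(fields["idle"], &idle); err != nil {
		return 0, errNoCPU // idle missing or not a number
	}
	return 100 - idle, nil
}

// idleRe captures the idle value on a procps-style "%Cpu(s):" line.
var idleRe = regexp.MustCompile(`%Cpu\(s\):.*?([0-9]+(?:\.[0-9]+)?)\s+id`)

// parseTextEnvelope handles the {"output":"..."} shape as a fallback.
func parseTextEnvelope(raw []byte) (float64, error) {
	var env struct {
		Output string `json:"output"`
	}
	if err := json.Unmarshal(raw, &env); err != nil {
		return 0, err
	}
	m := idleRe.FindStringSubmatch(env.Output)
	if m == nil {
		return 0, errNoCPU
	}
	idle, err := strconv.ParseFloat(m[1], 64)
	if err != nil {
		return 0, err
	}
	return 100 - idle, nil
}

// parseCPUPercent tries the structured shape first, then the text envelope.
func parseCPUPercent(raw []byte) (float64, error) {
	if pct, err := parseStructured(raw); err == nil {
		return pct, nil
	}
	return parseTextEnvelope(raw)
}

func main() {
	structured := []byte(`{"cpuInfo":{"%Cpu(s)":{"idle":75.5,"user":20.0,"system":4.5}}}`)
	text := []byte(`{"output":"%Cpu(s):  3.0 us,  2.0 sy, 95.0 id"}`)
	fmt.Println(parseCPUPercent(structured))
	fmt.Println(parseCPUPercent(text))
}
```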
Summary of Changes
- Fix `Executor.waitForAccountGone` so the gagliardetto RPC client's `ErrNotFound` sentinel is recognized as the success signal post-`DeleteUser`. Previously the poll only matched the `(non-nil result, nil Value)` shape and kept retrying until the deadline elapsed, then wrapped the sentinel into the contradictory error `"account X still present: not found"`. The stress orchestrator's deprovision phase reproduced this on the final user deletion every run.
- Parse the structured `cpuInfo."%Cpu(s)".{idle,user,system,…}` object returned by `show processes top once` over `RunShowJSON`. The old parser scraped a procps `%Cpu(s):` text line out of an `{"output":"…"}` envelope that EOS never emits in JSON mode, so `LatestCPUPercent` stayed invalid every tick and the `CPUSustained` abort trigger silently never fired. The text-envelope parser is retained as a fallback for any future `RunShowText` callers.

Diff Breakdown
Tests dominate — both fixes are 1-line behavioral changes plus a small parser-split refactor, validated by focused unit tests and a captured cEOS fixture.
Key files
- `tools/stress/device-observer/internal/sample/eos.go`: split `parseCPUPercent` into a primary structured parser (`cpuInfo."%Cpu(s)"`) and a text-envelope fallback. Map-decoding is used because `%` in the key blocks struct tags.
- `smartcontract/sdk/go/serviceability/executor.go`: add `errors.Is(err, solanarpc.ErrNotFound)` short-circuit at the top of the `waitForAccountGone` poll loop with a comment explaining why both `ErrNotFound` and `(nil result with nil Value)` mean closure.
- `tools/stress/device-observer/internal/sample/testdata/ceos-show-processes-top-once.json`: verbatim cEOS sample used as the structured-parser test fixture.

Testing Verification
- `TestParseCPUPercentStructured` (idle-only, missing-idle, missing-cpuInfo subcases) drives the new parser against the captured fixture.
- `TestDeleteUserTreatsNotFoundAsClosure` + `TestWaitForAccountGone` (ErrNotFound / nil-Value / timeout / ctx-cancel subtests) cover all four post-delete polling paths.
- `sweep finished` exit 0, zero `could not parse CPU` warnings (was every tick before).
- `make go-lint` clean; `go test -race ./tools/stress/device-observer/... ./smartcontract/sdk/go/serviceability/...` green.