tools/stress: fix sweep deprovision and observer cpu parse #3820

Open
nikw9944 wants to merge 5 commits into nikw9944/doublezero-3796 from nikw9944/stress-sweep-and-cpu-parse-fixes

Conversation

@nikw9944
Contributor

@nikw9944 nikw9944 commented Jun 2, 2026

Summary of Changes

  • Fix Executor.waitForAccountGone so the gagliardetto RPC client's ErrNotFound sentinel is recognized as the success signal post-DeleteUser. Previously the poll only matched the (non-nil result, nil Value) shape and kept retrying until the deadline elapsed, then wrapped the sentinel into the contradictory error "account X still present: not found". The stress orchestrator's deprovision phase reproduced this on the final user deletion every run.
  • Fix the device-observer's CPU parser to handle the eAPI structured JSON shape cpuInfo."%Cpu(s)".{idle,user,system,…} returned by show processes top once over RunShowJSON. The old parser scraped a procps %Cpu(s): text line out of an {"output":"…"} envelope that EOS never emits in JSON mode, so LatestCPUPercent stayed invalid every tick and the CPUSustained abort trigger silently never fired. The text-envelope parser is retained as a fallback for any future RunShowText callers.

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
|---|---|---|---|
| Core logic | 2 | +55 / -3 | +52 |
| Tests | 2 | +146 / -0 | +146 |
| Fixtures | 1 | +1 / -0 | +1 |
| Total | 5 | +202 / -3 | +199 |

Tests dominate — both fixes are 1-line behavioral changes plus a small parser-split refactor, validated by focused unit tests and a captured cEOS fixture.

Key files
  • tools/stress/device-observer/internal/sample/eos.go — split parseCPUPercent into a primary structured parser (cpuInfo."%Cpu(s)") and a text-envelope fallback. Map-decoding is used because % in the key blocks struct tags.
  • smartcontract/sdk/go/serviceability/executor.go — add errors.Is(err, solanarpc.ErrNotFound) short-circuit at the top of the waitForAccountGone poll loop with a comment explaining why both ErrNotFound and (non-nil result with nil Value) mean closure.
  • tools/stress/device-observer/internal/sample/testdata/ceos-show-processes-top-once.json — verbatim cEOS sample used as the structured-parser test fixture.
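As an illustration of the structured path in eos.go, here is a minimal sketch of map-based decoding for the cpuInfo."%Cpu(s)" shape. The function name `cpuPercentFromStructured` is hypothetical, and the sketch assumes utilization is derived as 100 minus idle; the actual parser may instead sum the non-idle fields. The text-envelope fallback is not shown.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// cpuPercentFromStructured extracts CPU utilization from a payload shaped like
// {"cpuInfo": {"%Cpu(s)": {"idle": 93.5, "user": 4.0, ...}}}.
// Assumption for this sketch: utilization = 100 - idle.
func cpuPercentFromStructured(raw []byte) (float64, error) {
	// Decode level by level into maps so unrelated keys of any shape are ignored.
	var top map[string]json.RawMessage
	if err := json.Unmarshal(raw, &top); err != nil {
		return 0, fmt.Errorf("decode payload: %w", err)
	}
	info, ok := top["cpuInfo"]
	if !ok {
		return 0, errors.New("missing cpuInfo")
	}
	var byLabel map[string]json.RawMessage
	if err := json.Unmarshal(info, &byLabel); err != nil {
		return 0, fmt.Errorf("decode cpuInfo: %w", err)
	}
	cpu, ok := byLabel["%Cpu(s)"]
	if !ok {
		return 0, errors.New("missing %Cpu(s)")
	}
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(cpu, &fields); err != nil {
		return 0, fmt.Errorf("decode %%Cpu(s): %w", err)
	}
	rawIdle, ok := fields["idle"]
	if !ok {
		return 0, errors.New("missing idle")
	}
	var idle float64
	if err := json.Unmarshal(rawIdle, &idle); err != nil {
		return 0, fmt.Errorf("decode idle: %w", err)
	}
	busy := 100 - idle
	// Clamp to a valid percentage in case of rounding in the source data.
	if busy < 0 {
		busy = 0
	}
	if busy > 100 {
		busy = 100
	}
	return busy, nil
}
```

Returning an error for a missing cpuInfo or idle key lets a caller fall through to the text-envelope parser instead of reporting a bogus value.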

Testing Verification

  • New TestParseCPUPercentStructured (idle-only, missing-idle, missing-cpuInfo subcases) drives the new parser against the captured fixture.
  • New TestDeleteUserTreatsNotFoundAsClosure + TestWaitForAccountGone (ErrNotFound / nil-Value / timeout / ctx-cancel subtests) cover all four post-delete polling paths.
  • Re-ran the full stress sweep against the local containerized devnet end-to-end: 24/24 runlog rows (4 users × {submit, confirm, activate, deprovision_submit, deprovision_confirm, deprovision_activate}), sweep finished exit 0, zero could not parse CPU warnings (was every tick before).
  • make go-lint clean; go test -race ./tools/stress/device-observer/... ./smartcontract/sdk/go/serviceability/... green.

nikw9944 added 5 commits June 1, 2026 20:24
Replace the abort stub with a full state machine: ticks at
--sample-interval, evaluates ten triggers (provision p95/single-user,
deprovision p95, sustained CPU, agent error counters, agent log
patterns, agent silence, ledger heartbeat staleness), and on the first
match writes a JSON sentinel atomically via tmp+rename. Firing cancels
the observer's root context so the process exits 0 and the
orchestrator's abort.Watch picks up the file.

Adds CPU exposure on the sampler: parses %Cpu(s) lines from
'show processes top once | json', sums non-idle fields with
locale-decimal-comma tolerance, and exposes LatestCPUPercent() for the
decider.
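
A rough sketch of the text-line parsing described above, assuming the usual procps layout of number-then-label pairs (`us`, `sy`, `ni`, `id`, ...). The function name `cpuPercentFromTopLine` and the regular expression are illustrative choices, not the sampler's actual code.

```go
package main

import (
	"errors"
	"regexp"
	"strconv"
	"strings"
)

// cpuField matches one "<number> <two-letter label>" pair and accepts either
// '.' or ',' as the decimal separator.
var cpuField = regexp.MustCompile(`(\d+(?:[.,]\d+)?)\s+([a-z]{2})\b`)

// cpuPercentFromTopLine finds the %Cpu(s) line in top output and sums every
// field except idle ("id").
func cpuPercentFromTopLine(output string) (float64, error) {
	_, rest, ok := strings.Cut(output, "%Cpu(s):")
	if !ok {
		return 0, errors.New("no %Cpu(s) line found")
	}
	// Only the remainder of that one line is relevant.
	rest, _, _ = strings.Cut(rest, "\n")

	matches := cpuField.FindAllStringSubmatch(rest, -1)
	if len(matches) == 0 {
		return 0, errors.New("no CPU fields on %Cpu(s) line")
	}
	var busy float64
	for _, m := range matches {
		// Normalize a locale decimal comma before parsing.
		v, err := strconv.ParseFloat(strings.Replace(m[1], ",", ".", 1), 64)
		if err != nil {
			return 0, err
		}
		if m[2] == "id" {
			continue // idle does not count toward utilization
		}
		busy += v
	}
	return busy, nil
}
```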

Adds --force startup flag that overwrites a stale sentinel from a
previous run; without it the observer refuses to start.

Documents the trigger map and operator contract (sentinel JSON shape,
exit-on-fire, stale-sentinel guard, ledger heartbeat contract) in the
README and updates the CHANGELOG entry.

#3796

Don't mark the decider as fired until the sentinel write + rename
actually succeed, so a transient filesystem failure (no space, EACCES,
missing parent dir) doesn't strand the observer with d.fired=true and
no OnFire invocation. With this change, the next tick re-evaluates and
retries the write.

Adds a unit test that confirms a failed write does not call OnFire and
that the decider successfully writes the sentinel on the next tick
once the path becomes writable.

#3796

- cpuSustained: require retained samples to span >= cpuSustainedWindow
  so the trigger can't fire on the first 4 samples (which only span
  ~30s at the 10s sample interval). Adds a guard test.
- Defensive copy on prevCounters to match the prevPatterns pattern, so
  the decider doesn't rely on an undocumented ownership contract from
  Sources.PromSnapshot.
- Pass tick's already-captured 'now' into fire() instead of re-reading
  the clock, so one tick has one timestamp.
- Strip a stale PR ref in sample/eos.go and verbose comments across
  decider.go, sample/eos.go, and main.go that restate what the code
  says.
- README: update the cpu_sustained row to spell out the span guard.

#3796

…waitForAccountGone

The gagliardetto RPC client surfaces a missing account as
(nil, ErrNotFound) rather than (&Result{Value: nil}, nil). The poll
loop only recognized the latter shape, so on a successful close it
kept retrying ErrNotFound until the deadline elapsed and then wrapped
the sentinel into a contradictory "still present: not found" error.
The stress orchestrator's deprovision phase reproduces this every
run on the final user deletion.

The sampler invokes `show processes top once` via RunShowJSON, so
the eAPI returns a structured object with cpuInfo."%Cpu(s)".{idle,
user,system,...} rather than the procps text envelope the old parser
expected. The old parser bailed every tick against cEOS / real EOS,
silently leaving LatestCPUPercent invalid and disabling the
CPUSustained abort trigger. Add a structured parser as the primary
path and keep the text-envelope parser as a fallback so an operator
can flip the command to RunShowText without losing the trigger.