tools/stress: local containerized harness script by nikw9944 · Pull Request #3821 · malbeclabs/doublezero

nikw9944 · 2026-06-02T13:56:44Z

Depends on: #3820

Summary of Changes

- Add `tools/stress/scripts/run-stress-local.sh`: an end-to-end harness that brings up the e2e ledger / manager / funder via `dev/dzctl`, layers a stress-test device on top of the e2e device image, creates the device, its Loopback255/256 interfaces, and prepaid access passes onchain, and launches device-orchestrator and device-observer against it. It tolerates and works around the broken upstream `doublezero geolocation init` step in `dzctl start`: if the smart contract is already initialized, it continues despite the failure; it starts the controller container itself if dzctl bailed before reaching it; and it creates the dz-local-cyoa Docker network on the fly if missing.
- Add `tools/stress/docker/device/Dockerfile`: a thin layer on dz-local/device:dev that installs an agent wrapper, adds a `stress` system user with /bin/bash, and grants it passwordless sudo via /etc/sudoers.Eos. That file is the cEOS sudoers template that gets overlaid onto /etc/sudoers at boot; /etc/sudoers.d/ is not sourced, and direct edits to /etc/sudoers are wiped. No agent daemon is baked in: the orchestrator owns the agent's lifecycle and starts it over SSH per run.
- Add `tools/stress/docker/device/agent-wrapper.sh`: a shim invoked as doublezero-agent over the SSH agent path. It re-execs itself through sudo so the agent runs as root. The agent shells out to `ip netns exec default /usr/bin/Cli` to compute configure-session diffs, which requires CAP_SYS_ADMIN; without root the agent loops forever on "Operation not permitted" and the sweep aborts. The wrapper also injects `-pubkey` (read from /etc/doublezero/agent/pubkey, planted by the run script) and `-metrics-enable -metrics-addr :50100`. Port 50100 sits in the default 50000-50100 permit range of the controller-pushed MAIN-CONTROL-PLANE-ACL; staying inside that range is necessary because the ACL is fully redefined on every agent apply.
- Add `tools/stress/scripts/README.md`: operator-facing usage, env-var knobs, the `--no-agent` smoke-test path, why the device image differs from the e2e device image, why the metrics port is 50100, and teardown.
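
As an illustration of the "create the network on the fly if missing" workaround, here is a minimal sketch. It is not taken from `run-stress-local.sh`; the function name `ensure_network` and its messages are hypothetical, and only the network name `dz-local-cyoa` comes from the description above. It relies on `docker network inspect` exiting non-zero when the network does not exist.

```shell
# Sketch only: create a Docker network if it does not already exist.
# ensure_network is a hypothetical helper, not the script's actual code.
ensure_network() {
  local name="$1"
  if docker network inspect "$name" >/dev/null 2>&1; then
    # inspect succeeded, so the network is already there
    echo "network $name already present"
  else
    # inspect failed, so assume the network is missing and create it
    docker network create "$name" >/dev/null
    echo "network $name created"
  fi
}
```

A call such as `ensure_network dz-local-cyoa` is safe to repeat, which is the property the harness needs when `dzctl start` bails before creating the network.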

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
| --- | --- | --- | --- |
| Scaffolding | 3 | +589 / -0 | +589 |
| Docs | 1 | +79 / -0 | +79 |
| Total | 4 | +668 / -0 | +668 |

All scaffolding — no production code, no tests. The run script is the bulk of the diff; the Dockerfile and wrapper are small shims.

Key files

tools/stress/scripts/run-stress-local.sh — 515-line bash harness: dzctl build/start with tolerant fallbacks, stress device image build, network introspection, fallback CYOA network + controller container creation, SSH keypair generation, container start + CYOA attach, EOS startup-config render with default-network IP for Management0 routing, onchain device create + loopback interface registration + access-pass seeding, orchestrator + observer launch.
tools/stress/docker/device/Dockerfile — 38-line layer on dz-local/device:dev: install the agent wrapper, create the stress bash user, and grant it passwordless sudo via /etc/sudoers.Eos.
tools/stress/docker/device/agent-wrapper.sh — 36-line shim that re-execs through sudo to run the agent as root, injects -pubkey from disk, and pins the prometheus listener on a controller-ACL-permitted port.

Testing Verification

Validated end-to-end against the local containerized devnet:

- `--no-agent` mode: the orchestrator completed a 4-user sweep (24 runlog rows: all six lifecycle events × 4 users); the observer captured 17 ticks of `show hardware capacity` / `show gre tunnel static` / `show processes top once` / `show logging` without warnings.
- Full SSH-agent mode: the orchestrator opened the SSH session as `stress`, the wrapper re-exec'd through sudo, and the agent ran as root with `-pubkey` and metrics enabled. The agent reached the controller over the docker default network, received 1276-line configs, opened configure sessions, computed diffs, and committed without errors. The orchestrator streamed 38674 agent-log lines into orchestrator.agent.log; the observer captured 728 prometheus metric rows from the agent's :50100 endpoint. The sweep exited cleanly with "sweep finished", 28 runlog rows, 0 agent errors.
- Self-sufficient fresh-controller run: removed the controller and stress device containers and re-ran the script; the fallback path correctly started a new controller container with the right env, hit the "loopback interface already exists" idempotent path, and proceeded to the same end state as above.
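
The idempotent re-run behaviour mentioned in the last item could be sketched as below. How the script actually detects the "already exists" case is not stated here, so matching on the message text is an assumption, and `run_idempotent` is a hypothetical helper name.

```shell
# Sketch only: run a command and treat an "already exists" failure as success.
# Matching on the error text is an assumption about how the case is detected.
run_idempotent() {
  local out rc
  # Capture stdout and stderr together, remembering the exit status.
  out="$("$@" 2>&1)" && rc=0 || rc=$?
  if [ "$rc" -eq 0 ]; then
    printf '%s\n' "$out"
    return 0
  fi
  case "$out" in
    *"already exists"*)
      # The resource was created by an earlier run; carry on.
      echo "skipped: $out"
      return 0
      ;;
  esac
  # Any other failure is passed through unchanged.
  printf '%s\n' "$out" >&2
  return "$rc"
}
```

Wrapping each onchain create call this way lets a second run of the script proceed past objects that the first run already registered, while still failing on unrelated errors.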

Adds a script that brings up the e2e ledger / manager / controller / funder via dev/dzctl, layers a stress-test device on top of the e2e device image (with the agent daemon removed and SSH enabled), creates the device + prepaid access passes onchain, then launches the device-orchestrator and device-observer against it. Supports a --no-agent smoke-test path that exercises the onchain sweep + observer sampling without driving the SSH agent runner.

Two adjustments to make the SSH-driven agent path work end-to-end:
- Add a `stress` system user with /bin/bash to the stress device image and plant the orchestrator's pubkey into its authorized_keys post-boot. cEOS pins admin's NSS shell to /usr/bin/RunCli, which intercepts SSH-exec'd commands and feeds them to the EOS Cli parser, and the orchestrator's `doublezero-agent -verbose …` is not valid EOS Cli. The orchestrator now connects as `stress` so SSH-exec runs through bash and the /usr/local/bin/doublezero-agent wrapper executes.
- Render the device's default-network IP, prefix, and gateway into Management0 after the container starts (so the agent can route to the controller container), and permit the agent's prometheus port in the control-plane ACL. The startup-config render now happens after the container is started and inspected. The device's entrypoint blocks until the config file appears, so this ordering is safe.
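
A minimal sketch of the Management0 render step follows. The function name `render_management0` is hypothetical, and expressing the gateway as a static default route is an assumption; the commit only says the IP, prefix, and gateway are rendered into the startup-config.

```shell
# Sketch only: emit an EOS startup-config fragment for Management0.
# Using a static default route for the gateway is an assumption.
render_management0() {
  local ip="$1" prefix="$2" gateway="$3"
  cat <<EOF
interface Management0
   ip address ${ip}/${prefix}
!
ip route 0.0.0.0/0 ${gateway}
EOF
}

# The three values can be read from a started container, for example:
#   docker inspect -f '{{(index .NetworkSettings.Networks "bridge").IPAddress}}' <container>
# with IPPrefixLen and Gateway read the same way.
```

Because the values only exist once Docker has attached the container to its network, the render has to run after container start, which matches the ordering described above.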

The controller-pushed device config fully redefines MAIN-CONTROL-PLANE-ACL on every agent apply (it starts with `no ip access-list MAIN-CONTROL-PLANE-ACL`), so any port permit we add to the startup-config gets wiped on the first apply. The controller's MAIN-CONTROL-PLANE-ACL is the ACL that's actually bound to `system control-plane in` (our MAIN-CONTROL-PLANE-ACL-MGMT is defined but unbound, so adding our permit there had no effect). That bound ACL does permit TCP 50000-50100 by default, so move the agent's prometheus listener to :50100 and point the observer's `--agent-metrics-url` at the same port. Drop the now-unused MAIN-CONTROL-PLANE-ACL-MGMT block from the startup-config — it was never bound to anything.
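
The port choice reduces to a range check against the bound ACL's default permit. As a small sketch (the variable and function names are hypothetical; the 50000-50100 range is the one stated above):

```shell
# Sketch only: check a TCP port against the default permit range of the
# controller-pushed MAIN-CONTROL-PLANE-ACL (50000-50100, per the text above).
ACL_PERMIT_LOW=50000
ACL_PERMIT_HIGH=50100

port_permitted() {
  local port="$1"
  [ "$port" -ge "$ACL_PERMIT_LOW" ] && [ "$port" -le "$ACL_PERMIT_HIGH" ]
}
```

A guard such as `port_permitted 50100 || echo "metrics port would be blocked" >&2` in the run script would catch a metrics address that falls outside the range before the first apply hides it.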

Two changes that make the harness self-sufficient against a fresh devnet:
- If `dzctl start` bails at the (broken upstream) `doublezero geolocation init` step, the controller container is never started. When the script detects the controller is missing post-dzctl, start it with the same env the e2e harness uses (DZ_LEDGER_URL=http://ledger:8899, DZ_SERVICEABILITY_PROGRAM_ID resolved from the manager's keypair).
- After creating the device onchain, register Loopback255 (vpnv4) and Loopback256 (ipv4). Without these the controller reports "device has pathology" every poll and returns an empty config: the agent runs but has nothing useful to apply. Interface types mirror the e2e harness.

Also update the README to document why the agent's prometheus port lives at :50100 (the controller-pushed MAIN-CONTROL-PLANE-ACL overwrites our startup-config ACL on every apply, and that pushed ACL permits TCP 50000-50100 by default).
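
The controller fallback could look roughly like the sketch below. The function name, the container name, and the image are all placeholders supplied by the caller; only the two environment variables come from the commit message. Network attachment (so that `ledger` resolves) is omitted here and would be needed in practice.

```shell
# Sketch only: start the controller container if it is not already running.
# The container name and image are caller-supplied placeholders.
start_controller_if_missing() {
  local name="$1" image="$2" program_id="$3"
  # docker container inspect fails if the container does not exist.
  if [ "$(docker container inspect --format '{{.State.Running}}' "$name" 2>/dev/null)" = "true" ]; then
    echo "controller $name already running"
    return 0
  fi
  docker run --detach --name "$name" \
    --env DZ_LEDGER_URL=http://ledger:8899 \
    --env "DZ_SERVICEABILITY_PROGRAM_ID=${program_id}" \
    "$image" >/dev/null
  echo "controller $name started"
}
```

This version does not handle a stopped container that still holds the name; a real script would need to remove or restart it first.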

The doublezero-agent shells out to `ip netns exec default /usr/bin/Cli -p 15 -c "show session-config named X diffs"` to inspect the staged config session, and `ip netns exec` requires CAP_SYS_ADMIN. The orchestrator's SSH session lands as the unprivileged `stress` user, so this command fails with "Operation not permitted" on every apply cycle and the agent never settles: every sweep aborts on apply_config_errors / diff_timeout. Grant `stress` passwordless sudo and re-exec the wrapper through sudo so the agent runs as root regardless of which user the SSH session lands as. The sudo rule has to go in /etc/sudoers.Eos because cEOS overlays that template onto /etc/sudoers at boot and does not source /etc/sudoers.d/, so a rule written directly to /etc/sudoers is silently wiped and one placed in /etc/sudoers.d/ is never read. With this fix a 4-user sweep completes cleanly: 28 runlog rows, 0 agent errors, 728 prometheus metric rows captured by the observer.
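
The wrapper's two jobs (escalate to root, then inject flags) can be sketched as follows. This is an illustration rather than the contents of `agent-wrapper.sh`: the helper names are hypothetical, and the path of the real agent binary is not given in this PR text, so it appears only as a placeholder in the comments. The pubkey path and the flags are the ones named in the description.

```shell
# Sketch only: helpers a wrapper like agent-wrapper.sh might use.
PUBKEY_FILE="${PUBKEY_FILE:-/etc/doublezero/agent/pubkey}"
METRICS_ADDR=":50100"

# Print the flags the wrapper injects, one per line.
agent_extra_args() {
  local pubkey
  pubkey="$(cat "$PUBKEY_FILE")" || return 1
  printf '%s\n' "-pubkey" "$pubkey" "-metrics-enable" "-metrics-addr" "$METRICS_ADDR"
}

# Succeeds when the given uid is not root, meaning a sudo re-exec is needed.
needs_sudo_reexec() {
  [ "$1" -ne 0 ]
}

# A wrapper built on these would start roughly like this:
#   if needs_sudo_reexec "$(id -u)"; then exec sudo -n "$0" "$@"; fi
#   exec <real agent binary> "$@" -pubkey ... -metrics-enable -metrics-addr :50100
```

Using `sudo -n` makes the re-exec fail fast instead of prompting if the passwordless rule in /etc/sudoers.Eos is missing, which is easier to diagnose over a non-interactive SSH session.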

Two adjustments for high-user-count runs:
- Fan out the access-pass setup loop via `xargs -P` (default 16 concurrent) so the per-user serial CLI roundtrip doesn't bottleneck setup at high TARGET_USERS. At 1024 users a serial loop is ~12 minutes of sustained txn submission and reliably knocks the local validator over; the parallel version completes in well under a minute.
- Always (re)start the controller container ourselves with `-max-user-tunnel-slots` set to TARGET_USERS (floor 128). The controller defaults the per-device slot count to 128; past that, it silently truncates the device config to the first 128 tunnels and the agent never applies the rest. dzctl currently can't start the controller anyway (its broken geolocation-init step bails first), so this also stops being a fallback and becomes the canonical start path. The entrypoint override preserves the original ledger-readiness wait.

Knobs: DZ_STRESS_ACCESS_PASS_PARALLEL (default 16), DZ_STRESS_CONTROLLER_MAX_SLOTS (default TARGET_USERS).
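
Both adjustments are small enough to sketch. The helper names below are hypothetical, and the per-user command is left to the caller because the actual access-pass CLI invocation is not shown in this PR text. How an explicit DZ_STRESS_CONTROLLER_MAX_SLOTS interacts with the floor is also not stated, so the sketch models only the default.

```shell
# Sketch only: slot count for the controller, with the floor of 128
# described above.
controller_max_slots() {
  local target="$1" floor=128
  if [ "$target" -gt "$floor" ]; then
    echo "$target"
  else
    echo "$floor"
  fi
}

# Read one item per line on stdin and run "<command> <item>" for each,
# with up to $1 invocations in flight at once.
fan_out() {
  local parallel="$1"
  shift
  xargs -P "$parallel" -n 1 "$@"
}
```

For example, `seq 1 "$TARGET_USERS" | fan_out "${DZ_STRESS_ACCESS_PASS_PARALLEL:-16}" some_per_user_command` would run the per-user step 16 at a time. `xargs` exits non-zero if any invocation fails, so a failed access-pass creation still surfaces to the caller.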

nikw9944 marked this pull request as draft June 2, 2026 13:59

nikw9944 force-pushed the nikw9944/stress-test-harness branch from 48f5066 to 685050b Compare June 2, 2026 15:14

nikw9944 changed the base branch from nikw9944/doublezero-3796 to nikw9944/stress-sweep-and-cpu-parse-fixes June 2, 2026 15:14

nikw9944 force-pushed the nikw9944/stress-test-harness branch 9 times, most recently from 44db9ee to 14ce48c Compare June 3, 2026 02:31

nikw9944 added 6 commits June 3, 2026 02:42

nikw9944 force-pushed the nikw9944/stress-test-harness branch from 14ce48c to da22eff Compare June 3, 2026 02:42

nikw9944 wants to merge 6 commits into nikw9944/stress-sweep-and-cpu-parse-fixes from nikw9944/stress-test-harness
