
tools/stress: local containerized harness script #3821

Draft
nikw9944 wants to merge 6 commits into nikw9944/stress-sweep-and-cpu-parse-fixes from nikw9944/stress-test-harness

@nikw9944 nikw9944 commented Jun 2, 2026

Depends on: #3820

Summary of Changes

  • Add tools/stress/scripts/run-stress-local.sh: an end-to-end harness that brings up the e2e ledger / manager / funder via dev/dzctl, layers a stress-test device on top of the e2e device image, creates the device + Loopback255/256 interfaces + prepaid access passes onchain, and launches device-orchestrator and device-observer against it. Tolerates and works around the broken upstream doublezero geolocation init step in dzctl start: if the smart contract is already initialized, continues despite the failure; starts the controller container itself if dzctl bailed before reaching it; and creates the dz-local-cyoa Docker network on the fly if missing.
  • Add tools/stress/docker/device/Dockerfile: thin layer on dz-local/device:dev that installs an agent wrapper, adds a stress system user with /bin/bash, and grants it passwordless sudo via /etc/sudoers.Eos (the cEOS sudoers template that gets overlaid onto /etc/sudoers at boot — /etc/sudoers.d/ is not sourced and direct edits to /etc/sudoers are wiped). No agent daemon is baked in — the orchestrator owns the agent's lifecycle and starts it over SSH per run.
  • Add tools/stress/docker/device/agent-wrapper.sh: shim invoked as doublezero-agent over the SSH agent path. Re-execs itself through sudo so the agent runs as root: the agent shells out to ip netns exec default /usr/bin/Cli to compute configure-session diffs, which requires CAP_SYS_ADMIN, and without root the agent loops forever on Operation not permitted and the sweep aborts. The wrapper also injects -pubkey (read from /etc/doublezero/agent/pubkey, planted by the run script) and -metrics-enable -metrics-addr :50100. Port 50100 sits in the controller-pushed MAIN-CONTROL-PLANE-ACL's default 50000-50100 permit range, which is needed because that ACL is fully redefined on every agent apply.
  • Add tools/stress/scripts/README.md: operator-facing usage, env-var knobs, the --no-agent smoke-test path, why the device image differs from the e2e device image, why the metrics port is 50100, teardown.
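
The tolerant start-up behaviour described in the first bullet can be sketched roughly as below. This is a minimal illustration, not the code in run-stress-local.sh: the function names `ensure_network` and `start_tolerant` are hypothetical, and the caller is assumed to supply its own "is the environment already usable" check. Only `docker network inspect` and `docker network create` are real commands here.

```bash
# Sketch only: function names are hypothetical, not taken from run-stress-local.sh.

# Create a Docker network only when it does not already exist.
ensure_network() {
  local name=$1
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already present"
  else
    docker network create "$name" >/dev/null
    echo "network $name created"
  fi
}

# Run a start command; if it fails, continue only when a caller-supplied
# check reports that the environment is already usable.
start_tolerant() {
  local start_cmd=$1 usable_check=$2
  if "$start_cmd"; then
    return 0
  fi
  if "$usable_check"; then
    echo "start failed but environment is usable; continuing" >&2
    return 0
  fi
  echo "start failed and environment is not usable" >&2
  return 1
}
```

A caller might use `ensure_network dz-local-cyoa` before attaching the device container, and wrap the devnet start in `start_tolerant` with a check that the smart contract is already initialized.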

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
|---|---|---|---|
| Scaffolding | 3 | +589 / -0 | +589 |
| Docs | 1 | +79 / -0 | +79 |
| Total | 4 | +668 / -0 | +668 |

All scaffolding — no production code, no tests. The run script is the bulk of the diff; the Dockerfile and wrapper are small shims.

Key files
  • tools/stress/scripts/run-stress-local.sh — 515-line bash harness: dzctl build/start with tolerant fallbacks, stress device image build, network introspection, fallback CYOA network + controller container creation, SSH keypair generation, container start + CYOA attach, EOS startup-config render with default-network IP for Management0 routing, onchain device create + loopback interface registration + access-pass seeding, orchestrator + observer launch.
  • tools/stress/docker/device/Dockerfile — 38-line layer on dz-local/device:dev: install the agent wrapper, create the stress bash user, and grant it passwordless sudo via /etc/sudoers.Eos.
  • tools/stress/docker/device/agent-wrapper.sh — 36-line shim that re-execs through sudo to run the agent as root, injects -pubkey from disk, and pins the prometheus listener on a controller-ACL-permitted port.

Testing Verification

Validated end-to-end against the local containerized devnet:

  • --no-agent mode: orchestrator completed a 4-user sweep (24 runlog rows, all six lifecycle events × 4 users); observer captured 17 ticks of show hardware capacity / show gre tunnel static / show processes top once / show logging without warnings.
  • Full SSH-agent mode: orchestrator opened the SSH session as stress, the wrapper re-exec'd through sudo, agent ran as root with -pubkey and metrics enabled. Agent reached the controller over the docker default network, received 1276-line configs, opened configure sessions, computed diffs, and committed without errors. Orchestrator streamed 38674 agent-log lines into orchestrator.agent.log; observer captured 728 prometheus metric rows from the agent's :50100 endpoint. Sweep exited cleanly with sweep finished, 28 runlog rows, 0 agent errors.
  • Self-sufficient fresh-controller run: removed the controller + stress device containers and re-ran the script; the fallback path correctly started a new controller container with the right env, hit the "loopback interface already exists" idempotent path, and proceeded to the same end-state as above.

@nikw9944 nikw9944 marked this pull request as draft June 2, 2026 13:59
@nikw9944 nikw9944 force-pushed the nikw9944/stress-test-harness branch from 48f5066 to 685050b Compare June 2, 2026 15:14
@nikw9944 nikw9944 changed the base branch from nikw9944/doublezero-3796 to nikw9944/stress-sweep-and-cpu-parse-fixes June 2, 2026 15:14
@nikw9944 nikw9944 force-pushed the nikw9944/stress-test-harness branch 9 times, most recently from 44db9ee to 14ce48c Compare June 3, 2026 02:31
nikw9944 added 6 commits June 3, 2026 02:42
Adds a script that brings up the e2e ledger / manager / controller /
funder via dev/dzctl, layers a stress-test device on top of the e2e
device image (with the agent daemon removed and SSH enabled), creates
the device + prepaid access passes onchain, then launches the
device-orchestrator and device-observer against it. Supports a
--no-agent smoke-test path that exercises the onchain sweep + observer
sampling without driving the SSH agent runner.
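
A minimal sketch of how a `--no-agent` switch could be parsed is shown below. The `parse_flags` function and the `RUN_AGENT` variable are hypothetical names chosen for illustration; the real script's option handling may differ.

```bash
# Hypothetical flag parser; the real script's option handling may differ.
# Sets RUN_AGENT=1 by default and RUN_AGENT=0 when --no-agent is given.
parse_flags() {
  local arg
  RUN_AGENT=1
  for arg in "$@"; do
    case "$arg" in
      --no-agent) RUN_AGENT=0 ;;
      *) echo "unknown argument: $arg" >&2; return 2 ;;
    esac
  done
}
```

The later launch step would then start the SSH agent runner only when `RUN_AGENT` is 1, while the onchain sweep and observer sampling run either way.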

Two adjustments to make the SSH-driven agent path work end-to-end:

- Add a `stress` system user with /bin/bash to the stress device image and
  plant the orchestrator's pubkey into its authorized_keys post-boot. cEOS
  pins admin's NSS shell to /usr/bin/RunCli, which intercepts SSH-exec'd
  commands and feeds them to the EOS Cli parser — the orchestrator's
  `doublezero-agent -verbose …` is not valid EOS Cli. The orchestrator
  now connects as `stress` so SSH-exec runs through bash and the
  /usr/local/bin/doublezero-agent wrapper executes.
- Render the device's default-network IP, prefix, and gateway into
  Management0 after the container starts (so the agent can route to the
  controller container), and permit the agent's prometheus port in the
  control-plane ACL. The startup-config render now happens after the
  container is started and inspected — the device's entrypoint blocks
  until the config file appears, so this ordering is safe.
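
The Management0 rendering step can be illustrated with a small template function. This is a sketch under assumptions: `render_management0` is a hypothetical name, the exact stanza in the real startup-config is not shown in this PR text, and the sample addresses are placeholders. One possible way to obtain the three values is `docker inspect` with a Go template over `.NetworkSettings.Networks`, which exposes `IPAddress`, `IPPrefixLen`, and `Gateway` per network.

```bash
# Hypothetical renderer for the Management0 portion of an EOS startup-config.
# Arguments: container IP, prefix length, gateway (all on the default network).
render_management0() {
  local ip=$1 prefix=$2 gateway=$3
  cat <<EOF
interface Management0
   ip address ${ip}/${prefix}
!
ip route 0.0.0.0/0 ${gateway}
EOF
}
```

For example, `render_management0 172.17.0.3 16 172.17.0.1` prints an interface stanza plus a default route via the gateway. Because the device entrypoint blocks until the config file appears, this output can be written after the container has been started and inspected.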

The controller-pushed device config fully redefines
MAIN-CONTROL-PLANE-ACL on every agent apply (it starts with
`no ip access-list MAIN-CONTROL-PLANE-ACL`), so any port permit we
add to the startup-config gets wiped on the first apply. The
controller's MAIN-CONTROL-PLANE-ACL is the ACL that's actually
bound to `system control-plane in` (our MAIN-CONTROL-PLANE-ACL-MGMT
is defined but unbound, so adding our permit there had no effect).

That bound ACL does permit TCP 50000-50100 by default, so move the
agent's prometheus listener to :50100 and point the observer's
`--agent-metrics-url` at the same port. Drop the now-unused
MAIN-CONTROL-PLANE-ACL-MGMT block from the startup-config — it was
never bound to anything.
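
The port choice reduces to a range check against the ACL's default permit window. The helper below is a hypothetical sanity check, not part of the harness; the 50000-50100 defaults are the range quoted above.

```bash
# Hypothetical helper: succeed when a TCP port falls inside the permit
# range of the controller-pushed ACL (defaults taken from the text above).
port_permitted() {
  local port=$1 low=${2:-50000} high=${3:-50100}
  [ "$port" -ge "$low" ] && [ "$port" -le "$high" ]
}
```

With these defaults `port_permitted 50100` succeeds and `port_permitted 50101` fails, which is why the listener sits on the upper edge of the range rather than just above it.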

Two changes that make the harness self-sufficient against a fresh
devnet:

- If `dzctl start` bails at the (broken upstream) `doublezero
  geolocation init` step, the controller container is never started.
  When the script detects the controller is missing post-dzctl, start
  it with the same env the e2e harness uses
  (DZ_LEDGER_URL=http://ledger:8899, DZ_SERVICEABILITY_PROGRAM_ID
  resolved from the manager's keypair).
- After creating the device onchain, register Loopback255 (vpnv4) and
  Loopback256 (ipv4). Without these the controller reports "device
  has pathology" every poll and returns an empty config — the agent
  runs but has nothing useful to apply. Interface types mirror the
  e2e harness.

Also update the README to document why the agent's prometheus port
lives at :50100 (the controller-pushed MAIN-CONTROL-PLANE-ACL
overwrites our startup-config ACL on every apply, and that pushed ACL
permits TCP 50000-50100 by default).
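
Re-running the script against an existing device relies on the interface registration being idempotent. One way to get that behaviour is sketched below, assuming the CLI reports a duplicate with a message containing "already exists"; `create_idempotent` is a hypothetical wrapper, and the command passed to it is whatever the script actually uses to register the interface.

```bash
# Hypothetical wrapper: run a create command and treat an "already exists"
# failure as success, so that repeated runs converge on the same end state.
create_idempotent() {
  local output
  if output=$("$@" 2>&1); then
    return 0
  fi
  case "$output" in
    *"already exists"*)
      echo "already exists; continuing" >&2
      return 0
      ;;
  esac
  echo "$output" >&2
  return 1
}
```

Any other failure is still surfaced, so a genuine RPC or validation error is not masked by the idempotency check.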

The doublezero-agent shells out to
`ip netns exec default /usr/bin/Cli -p 15 -c "show session-config
named X diffs"` to inspect the staged config session, and `ip netns
exec` requires CAP_SYS_ADMIN. The orchestrator's SSH session lands as
the unprivileged `stress` user, so this command fails with "Operation
not permitted" every apply cycle and the agent never settles — every
sweep aborts on apply_config_errors / diff_timeout.

Grant `stress` passwordless sudo and re-exec the wrapper through sudo
so the agent runs as root regardless of which user the SSH session
lands as. The sudo rule has to go in /etc/sudoers.Eos because cEOS
overlays that template onto /etc/sudoers at boot and does not source
/etc/sudoers.d/: a direct edit to /etc/sudoers is silently wiped, and a
drop-in under /etc/sudoers.d/ is never read.

With this fix a 4-user sweep completes cleanly: 28 runlog rows, 0
agent errors, 728 prometheus metric rows captured by the observer.
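
The wrapper's two jobs, re-executing as root and injecting flags, can be sketched as follows. This is an illustration rather than the contents of agent-wrapper.sh: the function names are hypothetical, and only the flags (`-pubkey`, `-metrics-enable`, `-metrics-addr :50100`) and the pubkey path come from the PR description.

```bash
# Sketch only: function names are hypothetical.

# Print the extra flags the wrapper prepends, one token per line.
agent_extra_args() {
  local pubkey_file=${1:-/etc/doublezero/agent/pubkey}
  local pubkey
  pubkey=$(cat "$pubkey_file") || return 1
  printf '%s\n' -pubkey "$pubkey" -metrics-enable -metrics-addr :50100
}

# Re-exec the current script through sudo unless already running as root.
# "sudo -n" fails instead of prompting, so a missing sudoers rule is
# reported immediately rather than hanging the SSH session.
reexec_as_root() {
  if [ "$(id -u)" -ne 0 ]; then
    exec sudo -n "$0" "$@"
  fi
}
```

A wrapper built this way would call `reexec_as_root "$@"` first, then exec the real agent binary with the injected flags followed by the caller's arguments.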

Two adjustments for high-user-count runs:

- Fan out the access-pass setup loop via xargs -P (default 16
  concurrent) so the per-user serial CLI roundtrip doesn't bottleneck
  setup at high TARGET_USERS. At 1024 users a serial loop is ~12
  minutes of sustained txn submission and reliably knocks the local
  validator over; the parallel version completes in well under a
  minute.
- Always (re)start the controller container ourselves with
  -max-user-tunnel-slots set to TARGET_USERS (floor 128). The
  controller defaults the per-device slot count to 128; past that, it
  silently truncates the device config to the first 128 tunnels and
  the agent never applies the rest. dzctl currently can't start the
  controller anyway (its broken geolocation-init step bails first), so
  this also stops being a fallback and becomes the canonical start
  path. The entrypoint override preserves the original ledger-readiness
  wait.

Knobs: DZ_STRESS_ACCESS_PASS_PARALLEL (default 16),
DZ_STRESS_CONTROLLER_MAX_SLOTS (default TARGET_USERS).
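
Both adjustments can be sketched in a few lines. The helper names `fan_out` and `controller_slots` are hypothetical, and applying the 128 floor to an explicit DZ_STRESS_CONTROLLER_MAX_SLOTS override is an assumption; the text above only states the floor for the TARGET_USERS default.

```bash
# Sketch only: helper names are hypothetical.

# Run a command once per user index (1..users), with at most `parallel`
# invocations in flight. The index is appended as the last argument.
fan_out() {
  local parallel=$1 users=$2
  shift 2
  seq 1 "$users" | xargs -P "$parallel" -n 1 "$@"
}

# Pick the controller slot count: the override if set, else the target
# user count, never below the controller's default of 128.
controller_slots() {
  local target=$1
  local slots=${DZ_STRESS_CONTROLLER_MAX_SLOTS:-$target}
  if [ "$slots" -lt 128 ]; then
    slots=128
  fi
  echo "$slots"
}
```

For instance, `fan_out 16 1024 sh -c 'echo "seeding user $1"' _` would run the per-user command with 16 workers; output order is not guaranteed because the invocations run concurrently.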
@nikw9944 nikw9944 force-pushed the nikw9944/stress-test-harness branch from 14ce48c to da22eff Compare June 3, 2026 02:42