tools/stress: local containerized harness script #3821
Draft
nikw9944 wants to merge 6 commits into
Adds a script that brings up the e2e ledger / manager / controller / funder via dev/dzctl, layers a stress-test device on top of the e2e device image (with the agent daemon removed and SSH enabled), creates the device + prepaid access passes onchain, then launches the device-orchestrator and device-observer against it. Supports a --no-agent smoke-test path that exercises the onchain sweep + observer sampling without driving the SSH agent runner.
Two adjustments to make the SSH-driven agent path work end-to-end:
- Add a `stress` system user with /bin/bash to the stress device image and plant the orchestrator's pubkey into its authorized_keys post-boot. cEOS pins admin's NSS shell to /usr/bin/RunCli, which intercepts SSH-exec'd commands and feeds them to the EOS Cli parser, and the orchestrator's `doublezero-agent -verbose …` is not valid EOS Cli. The orchestrator now connects as `stress` so SSH-exec runs through bash and the /usr/local/bin/doublezero-agent wrapper executes.
- Render the device's default-network IP, prefix, and gateway into Management0 after the container starts (so the agent can route to the controller container), and permit the agent's prometheus port in the control-plane ACL. The startup-config render now happens after the container is started and inspected; the device's entrypoint blocks until the config file appears, so this ordering is safe.
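To make the Management0 render concrete, here is a minimal sketch of a helper that turns an IP, prefix length, and gateway into an EOS config stanza. The function name `render_management0` is hypothetical, and the use of a default route via the gateway is an assumption; the actual template in the run script may differ.

```shell
# Hypothetical helper: print an EOS stanza for Management0.
# Arguments: <ip> <prefix-length> <gateway>
# Assumption: the agent reaches the controller through a default route
# via the docker network's gateway.
render_management0() {
  local ip="$1" prefix_len="$2" gateway="$3"
  printf 'interface Management0\n'
  printf '   ip address %s/%s\n' "$ip" "$prefix_len"
  printf '!\n'
  printf 'ip route 0.0.0.0/0 %s\n' "$gateway"
}
```

For example, `render_management0 172.17.0.5 16 172.17.0.1` prints a stanza that could be appended to the startup-config once the container's address has been read back with `docker inspect`.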
The controller-pushed device config fully redefines MAIN-CONTROL-PLANE-ACL on every agent apply (it starts with `no ip access-list MAIN-CONTROL-PLANE-ACL`), so any port permit we add to the startup-config gets wiped on the first apply. The controller's MAIN-CONTROL-PLANE-ACL is the ACL that's actually bound to `system control-plane in` (our MAIN-CONTROL-PLANE-ACL-MGMT is defined but unbound, so adding our permit there had no effect). That bound ACL does permit TCP 50000-50100 by default, so move the agent's prometheus listener to :50100 and point the observer's `--agent-metrics-url` at the same port. Drop the now-unused MAIN-CONTROL-PLANE-ACL-MGMT block from the startup-config — it was never bound to anything.
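The port choice can be guarded with a small range check. The sketch below is illustrative only: `port_permitted` is a hypothetical helper, and the 50000-50100 bounds are taken from the default permit range described above.

```shell
# Hypothetical guard: succeed only if <port> falls inside the TCP range
# that the controller-pushed ACL permits by default (50000-50100).
port_permitted() {
  local port="$1" lo="${2:-50000}" hi="${3:-50100}"
  [ "$port" -ge "$lo" ] && [ "$port" -le "$hi" ]
}
```

A harness could call `port_permitted 50100 || exit 1` before launching the observer, so a misconfigured metrics port fails fast instead of being silently dropped by the ACL.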
Two changes that make the harness self-sufficient against a fresh devnet:
- If `dzctl start` bails at the (broken upstream) `doublezero geolocation init` step, the controller container is never started. When the script detects the controller is missing post-dzctl, start it with the same env the e2e harness uses (DZ_LEDGER_URL=http://ledger:8899, DZ_SERVICEABILITY_PROGRAM_ID resolved from the manager's keypair).
- After creating the device onchain, register Loopback255 (vpnv4) and Loopback256 (ipv4). Without these the controller reports "device has pathology" every poll and returns an empty config, so the agent runs but has nothing useful to apply. Interface types mirror the e2e harness.

Also update the README to document why the agent's prometheus port lives at :50100 (the controller-pushed MAIN-CONTROL-PLANE-ACL overwrites our startup-config ACL on every apply, and that pushed ACL permits TCP 50000-50100 by default).
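The "controller is missing post-dzctl" check might look like the sketch below. `container_running` is a hypothetical helper, and the container name passed to it is an assumption; only `docker ps --format '{{.Names}}'` is standard Docker CLI.

```shell
# Hypothetical helper: succeed if a container with exactly this name
# appears in the list of running containers.
container_running() {
  local name="$1"
  docker ps --format '{{.Names}}' | grep -Fxq -- "$name"
}
```

The script could then start the controller itself only when `container_running <controller-name>` fails, passing the DZ_LEDGER_URL and DZ_SERVICEABILITY_PROGRAM_ID values described above.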
The doublezero-agent shells out to `ip netns exec default /usr/bin/Cli -p 15 -c "show session-config named X diffs"` to inspect the staged config session, and `ip netns exec` requires CAP_SYS_ADMIN. The orchestrator's SSH session lands as the unprivileged `stress` user, so this command fails with "Operation not permitted" every apply cycle and the agent never settles — every sweep aborts on apply_config_errors / diff_timeout. Grant `stress` passwordless sudo and re-exec the wrapper through sudo so the agent runs as root regardless of which user SSH lands. The sudo rule has to go in /etc/sudoers.Eos because cEOS overlays that template onto /etc/sudoers at boot and does not source /etc/sudoers.d/, so anything we'd write to either of those gets silently wiped. With this fix a 4-user sweep completes cleanly: 28 runlog rows, 0 agent errors, 728 prometheus metric rows captured by the observer.
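The privilege-escalation step can be sketched as follows. `run_as_root` is a hypothetical helper, not the wrapper's actual code; it takes the caller's UID as an explicit argument so the behaviour is easy to try out, and it assumes a passwordless rule for `stress` is already present in /etc/sudoers.Eos.

```shell
# Hypothetical helper: run a command directly when already root,
# otherwise through non-interactive sudo (-n fails instead of prompting).
# Arguments: <uid> <command> [args...]
run_as_root() {
  local uid="$1"
  shift
  if [ "$uid" -eq 0 ]; then
    "$@"
  else
    sudo -n "$@"
  fi
}
```

A wrapper would typically call this as `run_as_root "$(id -u)" "$0" "$@"`, or use `exec sudo -n "$0" "$@"` so that the re-executed process replaces the original shell.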
Two adjustments for high-user-count runs:
- Fan out the access-pass setup loop via xargs -P (default 16 concurrent) so the per-user serial CLI roundtrip doesn't bottleneck setup at high TARGET_USERS. At 1024 users a serial loop is ~12 minutes of sustained txn submission and reliably knocks the local validator over; the parallel version completes in well under a minute.
- Always (re)start the controller container ourselves with -max-user-tunnel-slots set to TARGET_USERS (floor 128). The controller defaults the per-device slot count to 128; past that, it silently truncates the device config to the first 128 tunnels and the agent never applies the rest. dzctl currently can't start the controller anyway (its broken geolocation-init step bails first), so this also stops being a fallback and becomes the canonical start path. The entrypoint override preserves the original ledger-readiness wait.

Knobs: DZ_STRESS_ACCESS_PASS_PARALLEL (default 16), DZ_STRESS_CONTROLLER_MAX_SLOTS (default TARGET_USERS).
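Both adjustments reduce to a few lines of shell. The sketch below is an illustration under stated assumptions: `controller_max_slots` and `fan_out` are hypothetical names, the per-user command is whatever the harness actually runs to create an access pass, and `xargs -P` is assumed to be available (it is in GNU and BSD xargs).

```shell
# Hypothetical helper: slot count for the controller, never below the
# 128-slot default described above.
controller_max_slots() {
  local target_users="$1" floor=128
  if [ "$target_users" -gt "$floor" ]; then
    echo "$target_users"
  else
    echo "$floor"
  fi
}

# Hypothetical helper: run "<command...> <index>" once per user index
# 1..count, with at most <parallel> invocations in flight.
# Arguments: <parallel> <count> <command> [args...]
fan_out() {
  local parallel="$1" count="$2"
  shift 2
  seq 1 "$count" | xargs -P "$parallel" -I{} "$@" {}
}
```

For instance, `fan_out 16 1024 ./create-pass.sh` (with `create-pass.sh` standing in for the real per-user CLI call) would submit up to 16 requests concurrently, and `controller_max_slots 4` yields 128 while `controller_max_slots 1024` yields 1024.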
Depends on: #3820
Summary of Changes
- `tools/stress/scripts/run-stress-local.sh`: an end-to-end harness that brings up the e2e ledger / manager / funder via `dev/dzctl`, layers a stress-test device on top of the e2e device image, creates the device + Loopback255/256 interfaces + prepaid access passes onchain, and launches `device-orchestrator` and `device-observer` against it. Tolerates and works around the broken upstream `doublezero geolocation init` step in `dzctl start`: if the smart contract is already initialized, continues despite the failure; starts the controller container itself if `dzctl` bailed before reaching it; and creates the `dz-local-cyoa` Docker network on the fly if missing.
- `tools/stress/docker/device/Dockerfile`: thin layer on `dz-local/device:dev` that installs an agent wrapper, adds a `stress` system user with `/bin/bash`, and grants it passwordless sudo via `/etc/sudoers.Eos` (the cEOS sudoers template that gets overlaid onto `/etc/sudoers` at boot; `/etc/sudoers.d/` is not sourced and direct edits to `/etc/sudoers` are wiped). No agent daemon is baked in: the orchestrator owns the agent's lifecycle and starts it over SSH per run.
- `tools/stress/docker/device/agent-wrapper.sh`: shim invoked as `doublezero-agent` over the SSH agent path. Re-execs itself through `sudo` so the agent runs as root: the agent shells out to `ip netns exec default /usr/bin/Cli` to compute configure-session diffs, which requires `CAP_SYS_ADMIN`, and without root the agent loops forever on `Operation not permitted` and the sweep aborts. The wrapper also injects `-pubkey` (read from `/etc/doublezero/agent/pubkey`, planted by the run script) and `-metrics-enable -metrics-addr :50100`. Port 50100 sits in the controller-pushed `MAIN-CONTROL-PLANE-ACL`'s default `50000-50100` permit range, which is needed because that ACL is fully redefined on every agent apply.
- `tools/stress/scripts/README.md`: operator-facing usage, env-var knobs, the `--no-agent` smoke-test path, why the device image differs from the e2e device image, why the metrics port is 50100, teardown.

Diff Breakdown
All scaffolding — no production code, no tests. The run script is the bulk of the diff; the Dockerfile and wrapper are small shims.
Key files
- `tools/stress/scripts/run-stress-local.sh`: 515-line bash harness: dzctl build/start with tolerant fallbacks, stress device image build, network introspection, fallback CYOA network + controller container creation, SSH keypair generation, container start + CYOA attach, EOS startup-config render with default-network IP for Management0 routing, onchain device create + loopback interface registration + access-pass seeding, orchestrator + observer launch.
- `tools/stress/docker/device/Dockerfile`: 38-line layer on `dz-local/device:dev`: install the agent wrapper, create the `stress` bash user, and grant it passwordless sudo via `/etc/sudoers.Eos`.
- `tools/stress/docker/device/agent-wrapper.sh`: 36-line shim that re-execs through sudo to run the agent as root, injects `-pubkey` from disk, and pins the prometheus listener on a controller-ACL-permitted port.

Testing Verification
Validated end-to-end against the local containerized devnet:
- `--no-agent` mode: orchestrator completed a 4-user sweep (24 runlog rows, all six lifecycle events × 4 users); observer captured 17 ticks of `show hardware capacity` / `show gre tunnel static` / `show processes top once` / `show logging` without warnings.
- Agent path: the orchestrator's SSH session landed as `stress`, the wrapper re-exec'd through sudo, and the agent ran as root with `-pubkey` and metrics enabled. The agent reached the controller over the docker default network, received 1276-line configs, opened configure sessions, computed diffs, and committed without errors. The orchestrator streamed 38674 agent-log lines into `orchestrator.agent.log`; the observer captured 728 prometheus metric rows from the agent's `:50100` endpoint. The sweep exited cleanly with `sweep finished`, 28 runlog rows, 0 agent errors.