malbeclabs · nikw9944 · Jun 3, 2026 · Jun 2, 2026
diff --git a/tools/stress/docker/device/Dockerfile b/tools/stress/docker/device/Dockerfile
@@ -0,0 +1,38 @@
+# Stress-test device image.
+#
+# Layers on top of the e2e device image (dz-local/device by default) and
+# replaces the default `doublezero-agent` binary path with a wrapper that the
+# orchestrator's SSH runner can invoke as `doublezero-agent`. The wrapper
+# injects the per-device --pubkey (read from /etc/doublezero/agent/pubkey,
+# populated by run-stress-local.sh) and enables prometheus metrics so the
+# observer can scrape them.
+#
+# This image deliberately does NOT bake an EOS startup-config that runs the
+# agent as a daemon. The orchestrator owns the agent lifecycle.
+ARG DZ_DEVICE_IMAGE=dz-local/device:dev
+FROM ${DZ_DEVICE_IMAGE}
+
+COPY agent-wrapper.sh /usr/local/bin/doublezero-agent
+RUN chmod +x /usr/local/bin/doublezero-agent
+
+# cEOS provisions admin via NSS with shell /usr/bin/RunCli, which intercepts
+# SSH-exec'd commands and feeds them to the EOS Cli parser. The orchestrator's
+# hardcoded SSH command (`doublezero-agent -verbose …`) is not valid EOS Cli,
+# so we add a separate system user with /bin/bash for the orchestrator to use.
+# run-stress-local.sh plants the orchestrator's pubkey into this user's
+# authorized_keys at runtime (the keypair is generated per devnet, so we can't
+# bake it in).
+#
+# The agent shells out to `ip netns exec default /usr/bin/Cli -p 15 -c "show
+# session-config named X diffs"` to inspect the staged config session, and
+# `ip netns exec` requires CAP_SYS_ADMIN. Give `stress` passwordless sudo so
+# the wrapper can run the agent as root; without this every apply cycle ends
+# in "Operation not permitted" from netns_exec and the agent never settles.
+# Append to /etc/sudoers.Eos because cEOS overlays that template onto
+# /etc/sudoers at boot (and it does not source /etc/sudoers.d/), so any
+# rule we'd add to /etc/sudoers or /etc/sudoers.d/ gets wiped.
+RUN useradd -m -s /bin/bash stress \
+    && echo 'stress ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers.Eos \
+    && visudo -cf /etc/sudoers.Eos
+
+EXPOSE 22
diff --git a/tools/stress/docker/device/agent-wrapper.sh b/tools/stress/docker/device/agent-wrapper.sh
@@ -0,0 +1,36 @@
+#!/bin/bash
+# Wrapper invoked over SSH by the orchestrator as `doublezero-agent`.
+# The orchestrator's SSH command is hardcoded today as:
+#   doublezero-agent -verbose [-controller HOST:PORT]
+# It does not pass -pubkey or enable metrics. This wrapper supplies both so
+# the agent can fetch its config and the observer can scrape its counters.
+#
+# The agent must run as root: it shells out to `ip netns exec default
+# /usr/bin/Cli` to inspect staged configure-session diffs, and that requires
+# CAP_SYS_ADMIN. We invoke ourselves through sudo so the agent runs with
+# the privilege it needs even when SSH lands the orchestrator as the `stress`
+# user. (The Dockerfile grants `stress` passwordless sudo.)
+set -eu
+
+if [ "$(id -u)" -ne 0 ]; then
+    exec sudo -E -- "$0" "$@"
+fi
+
+PUBKEY_FILE="/etc/doublezero/agent/pubkey"
+PUBKEY=""
+if [ -r "$PUBKEY_FILE" ]; then
+    PUBKEY="$(tr -d '[:space:]' < "$PUBKEY_FILE")"
+fi
+
+EXTRA_ARGS=()
+if [ -n "$PUBKEY" ]; then
+    EXTRA_ARGS+=(-pubkey "$PUBKEY")
+fi
+# Pick a port the controller-pushed MAIN-CONTROL-PLANE-ACL already permits.
+# `system control-plane` applies that ACL inbound, and the controller fully
+# redefines it on every apply (`no ip access-list MAIN-CONTROL-PLANE-ACL` + recreate),
+# so anything we add gets wiped on the agent's next tick. The default ACL
+# does permit TCP 50000-50100, so park the metrics endpoint there.
+EXTRA_ARGS+=(-metrics-enable -metrics-addr ":50100")
+
+exec /mnt/flash/doublezero-agent "${EXTRA_ARGS[@]}" "$@"
diff --git a/tools/stress/scripts/README.md b/tools/stress/scripts/README.md
@@ -0,0 +1,79 @@
+# Local stress-test harness
+
+`run-stress-local.sh` brings up a containerized devnet (ledger + manager +
+controller + funder, via `dev/dzctl start`), adds one custom **stress device**
+container whose EOS startup config does NOT include the `doublezero-agent`
+daemon, then launches the orchestrator and observer against it. The
+orchestrator owns the agent lifecycle (starts it over SSH); the observer
+samples the device.
+
+Components:
+
+| Piece                                | Image / source                                                  |
+| ------------------------------------ | --------------------------------------------------------------- |
+| Ledger / manager / controller        | e2e harness images (`dz-local/{ledger,manager,controller}`)     |
+| Stress device                        | `tools/stress/docker/device/Dockerfile` (extends e2e device)    |
+| Agent invocation wrapper             | `tools/stress/docker/device/agent-wrapper.sh` (in stress image) |
+| Stress orchestrator                  | `tools/stress/device-orchestrator/cmd/device-orchestrator`      |
+| Stress observer                      | `tools/stress/device-observer/cmd/device-observer`              |
+
+## Quick start
+
+```bash
+# Full build + run (creates a 4-user sweep with 30s holds by default)
+tools/stress/scripts/run-stress-local.sh --clean
+
+# Skip docker build on subsequent runs
+tools/stress/scripts/run-stress-local.sh --no-build
+
+# Tweak the sweep
+tools/stress/scripts/run-stress-local.sh --target-users 8 --hold 60
+```
+
+The script ends by printing the orchestrator/observer PIDs and the run
+working directory (under `dev/.deploy/dz-local/stress/run/<UTC timestamp>/`).
+Both processes keep running in the background. Stop them with the
+`kill $(cat …)` snippet the script prints.
+
+## How the stress device differs from the e2e device
+
+It is the same cEOS base, but the startup config (rendered at run time
+by the script) drops the `daemon doublezero-agent` and
+`daemon doublezero-telemetry` blocks. cEOS pins admin's NSS shell to
+`/usr/bin/RunCli` (the EOS Cli wrapper), so the image adds a separate
+`stress` system user with `/bin/bash`; the script plants the
+orchestrator's pubkey into its authorized_keys at runtime, and the
+orchestrator's SSH session connects as `stress`.
+
+## Agent metrics port: why 50100, not 9100
+
+The agent's prometheus listener is parked on `:50100`, not the default
+`:9100`. The cEOS device's `system control-plane` binds
+`MAIN-CONTROL-PLANE-ACL` (no `-MGMT` suffix), and the
+doublezero-controller's pushed device config fully redefines that ACL
+on every apply (starting with `no ip access-list
+MAIN-CONTROL-PLANE-ACL`). Any port permit we add via our startup-config
+is wiped on the first agent apply. The controller's default ACL does
+permit TCP `50000-50100`, so the wrapper at
+`/usr/local/bin/doublezero-agent` sets `-metrics-addr :50100` and the
+script points the observer's `--agent-metrics-url` at the same port.
+
+## Caveats / known issues
+
+- The orchestrator's hardcoded SSH command is
+  `doublezero-agent -verbose [-controller HOST:PORT]`. It does not pass
+  `-pubkey` or `-metrics-enable`. The stress image works around this
+  with the wrapper at `/usr/local/bin/doublezero-agent`, which injects
+  `-pubkey` from `/etc/doublezero/agent/pubkey` and turns on metrics on
+  `:50100`.
+- Use `--no-agent` to skip launching the agent over SSH; the orchestrator
+  will only drive the onchain sweep and the observer will only see
+  passive device state (no agent-log / metrics rows). Useful as a
+  first smoke test.
+
+## Teardown
+
+```bash
+dev/dzctl destroy -y
+docker rm -f dz-local-device-dzstress 2>/dev/null
+```
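
Review note: the argument assembly in `agent-wrapper.sh` can be exercised without root, sudo, or the agent binary. The sketch below mirrors that logic in a standalone function. `build_agent_args` is a hypothetical helper introduced here for illustration and is not part of this PR. It assumes only what the wrapper itself shows: the pubkey file is stripped of whitespace, `-pubkey` is omitted when the file is missing or empty, and the metrics flags are always appended ahead of the orchestrator's own arguments.

```shell
#!/bin/bash
# Sketch only: mirrors the argument assembly in agent-wrapper.sh so it can be
# checked in isolation. build_agent_args is a hypothetical name, not in the PR.
set -eu

# Usage: build_agent_args PUBKEY_FILE [passthrough args...]
# Prints the resulting agent argv, one argument per line.
build_agent_args() {
    local pubkey_file="$1"
    shift

    local pubkey=""
    if [ -r "$pubkey_file" ]; then
        # Same whitespace stripping the wrapper applies to the pubkey file.
        pubkey="$(tr -d '[:space:]' < "$pubkey_file")"
    fi

    local args=()
    if [ -n "$pubkey" ]; then
        args+=(-pubkey "$pubkey")
    fi
    # Metrics flags are unconditional, matching the wrapper.
    args+=(-metrics-enable -metrics-addr ":50100")

    # Injected flags first, then whatever the orchestrator passed.
    printf '%s\n' "${args[@]}" "$@"
}
```

For example, given a file containing ` ABC123 ` and the passthrough argument `-verbose`, the function prints `-pubkey`, `ABC123`, `-metrics-enable`, `-metrics-addr`, `:50100`, and `-verbose`, one per line. With an unreadable pubkey path it prints only the metrics flags followed by the passthrough arguments.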