38 changes: 38 additions & 0 deletions tools/stress/docker/device/Dockerfile
@@ -0,0 +1,38 @@
# Stress-test device image.
#
# Layers on top of the e2e device image (dz-local/device by default) and
# replaces the default `doublezero-agent` binary path with a wrapper that the
# orchestrator's SSH runner can invoke as `doublezero-agent`. The wrapper
# injects the per-device --pubkey (read from /etc/doublezero/agent/pubkey,
# populated by run-stress-local.sh) and enables prometheus metrics so the
# observer can scrape them.
#
# This image deliberately does NOT bake an EOS startup-config that runs the
# agent as a daemon. The orchestrator owns the agent lifecycle.
ARG DZ_DEVICE_IMAGE=dz-local/device:dev
FROM ${DZ_DEVICE_IMAGE}

COPY agent-wrapper.sh /usr/local/bin/doublezero-agent
RUN chmod +x /usr/local/bin/doublezero-agent

# cEOS provisions admin via NSS with shell /usr/bin/RunCli, which intercepts
# SSH-exec'd commands and feeds them to the EOS Cli parser. The orchestrator's
# hardcoded SSH command (`doublezero-agent -verbose …`) is not valid EOS Cli,
# so we add a separate system user with /bin/bash for the orchestrator to use.
# run-stress-local.sh plants the orchestrator's pubkey into this user's
# authorized_keys at runtime (the keypair is generated per devnet, so we can't
# bake it in).
#
# The agent shells out to `ip netns exec default /usr/bin/Cli -p 15 -c "show
# session-config named X diffs"` to inspect the staged config session, and
# `ip netns exec` requires CAP_SYS_ADMIN. Give `stress` passwordless sudo so
# the wrapper can run the agent as root; without this every apply cycle ends
# in "Operation not permitted" from netns_exec and the agent never settles.
# Append to /etc/sudoers.Eos because cEOS overlays that template onto
# /etc/sudoers at boot (and it does not source /etc/sudoers.d/), so any
# rule we'd add to /etc/sudoers or /etc/sudoers.d/ gets wiped.
RUN useradd -m -s /bin/bash stress \
    && echo 'stress ALL=(ALL) NOPASSWD: ALL' >> /etc/sudoers.Eos \
    && visudo -cf /etc/sudoers.Eos

EXPOSE 22
36 changes: 36 additions & 0 deletions tools/stress/docker/device/agent-wrapper.sh
@@ -0,0 +1,36 @@
#!/bin/bash
# Wrapper invoked over SSH by the orchestrator as `doublezero-agent`.
# The orchestrator's SSH command is hardcoded today as:
# doublezero-agent -verbose [-controller HOST:PORT]
# It does not pass -pubkey or enable metrics. This wrapper supplies both so
# the agent can fetch its config and the observer can scrape its counters.
#
# The agent must run as root: it shells out to `ip netns exec default
# /usr/bin/Cli` to inspect staged configure-session diffs, and that requires
# CAP_SYS_ADMIN. We invoke ourselves through sudo so the agent runs with
# the privilege it needs even when SSH lands the orchestrator as the `stress`
# user. (The Dockerfile grants `stress` passwordless sudo.)
set -eu

if [ "$(id -u)" -ne 0 ]; then
  exec sudo -E -- "$0" "$@"
fi

PUBKEY_FILE="/etc/doublezero/agent/pubkey"
PUBKEY=""
if [ -r "$PUBKEY_FILE" ]; then
  PUBKEY="$(tr -d '[:space:]' < "$PUBKEY_FILE")"
fi

EXTRA_ARGS=()
if [ -n "$PUBKEY" ]; then
  EXTRA_ARGS+=(-pubkey "$PUBKEY")
fi
# Pick a port the controller-pushed MAIN-CONTROL-PLANE-ACL already permits.
# That ACL binds `system control-plane in` and the controller fully redefines
# it on every apply (`no ip access-list MAIN-CONTROL-PLANE-ACL` + recreate),
# so anything we add gets wiped on the agent's next tick. The default ACL
# does permit TCP 50000-50100, so park the metrics endpoint there.
EXTRA_ARGS+=(-metrics-enable -metrics-addr ":50100")

exec /mnt/flash/doublezero-agent "${EXTRA_ARGS[@]}" "$@"
79 changes: 79 additions & 0 deletions tools/stress/scripts/README.md
@@ -0,0 +1,79 @@
# Local stress-test harness

`run-stress-local.sh` brings up a containerized devnet (ledger + manager +
controller + funder, via `dev/dzctl start`), adds one custom **stress device**
container whose EOS startup config does NOT include the `doublezero-agent`
daemon, then launches the orchestrator and observer against it. The
orchestrator owns the agent lifecycle (starts it over SSH); the observer
samples the device.

Components:

| Piece | Image / source |
| ------------------------------------ | -------------------------------------------------------------- |
| Ledger / manager / controller | e2e harness images (`dz-local/{ledger,manager,controller}`) |
| Stress device | `tools/stress/docker/device/Dockerfile` (extends e2e device) |
| Agent invocation wrapper | `tools/stress/docker/device/agent-wrapper.sh` (in stress image) |
| Stress orchestrator | `tools/stress/device-orchestrator/cmd/device-orchestrator` |
| Stress observer | `tools/stress/device-observer/cmd/device-observer` |

## Quick start

```bash
# Full build + run (creates a 4-user sweep with 30s holds by default)
tools/stress/scripts/run-stress-local.sh --clean

# Skip docker build on subsequent runs
tools/stress/scripts/run-stress-local.sh --no-build

# Tweak the sweep
tools/stress/scripts/run-stress-local.sh --target-users 8 --hold 60
```

The script ends by printing the orchestrator/observer PIDs and the run
working directory (under `dev/.deploy/dz-local/stress/run/<UTC timestamp>/`).
Both processes keep running in the background. Stop them with the
`kill $(cat …)` snippet the script prints.

## How the stress device differs from the e2e device

It is the same cEOS base, but the startup config (rendered at run time
by the script) drops the `daemon doublezero-agent` and
`daemon doublezero-telemetry` blocks. cEOS pins admin's NSS shell to
`/usr/bin/RunCli` (the EOS Cli wrapper), so the image adds a separate
`stress` system user with `/bin/bash`; the script plants the
orchestrator's pubkey into its authorized_keys at runtime, and the
orchestrator's SSH session connects as `stress`.
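
How the script plants the key is not shown here. A minimal sketch of the
general technique, assuming a hypothetical helper `install_authorized_key`
and a home directory passed in as an argument (inside the container this
would presumably be `/home/stress`, the `useradd -m` default), could look
like this:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from run-stress-local.sh.
# Appends a public key to <home_dir>/.ssh/authorized_keys with the
# permissions sshd's StrictModes check expects. When run as root inside
# the container, the files would also need a chown to the target user.
install_authorized_key() {
  local home_dir="$1" pubkey_file="$2"
  local ssh_dir="$home_dir/.ssh"
  local auth="$ssh_dir/authorized_keys"
  local key

  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"
  touch "$auth"
  chmod 600 "$auth"

  key="$(cat "$pubkey_file")"
  # Append only if the exact key line is not already present (idempotent).
  if ! grep -qxF -- "$key" "$auth"; then
    printf '%s\n' "$key" >> "$auth"
  fi
}

# Demo against a throwaway directory; the key below is a placeholder.
demo_home="$(mktemp -d)"
printf 'ssh-ed25519 AAAAexample orchestrator@devnet\n' > "$demo_home/key.pub"
install_authorized_key "$demo_home" "$demo_home/key.pub"
install_authorized_key "$demo_home" "$demo_home/key.pub"  # second call adds nothing
cat "$demo_home/.ssh/authorized_keys"
rm -rf "$demo_home"
```

Keeping the helper idempotent means re-running it against a container that
is already up does not accumulate duplicate key lines.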

## Agent metrics port: why 50100, not 9100

The agent's Prometheus listener is parked on `:50100`, not the default
`:9100`. The cEOS device's `system control-plane` binds
`MAIN-CONTROL-PLANE-ACL` (no `-MGMT` suffix), and the
doublezero-controller's pushed device config fully redefines that ACL
on every apply (starting with `no ip access-list
MAIN-CONTROL-PLANE-ACL`). Any port permit we add via our startup-config
is wiped on the first agent apply. The controller's default ACL does
permit TCP `50000-50100`, so the wrapper at
`/usr/local/bin/doublezero-agent` sets `-metrics-addr :50100` and the
script points the observer's `--agent-metrics-url` at the same port.
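
When choosing a different metrics port, a quick guard can catch values the
ACL would drop. The sketch below hardcodes the `50000-50100` range quoted
above; `port_permitted` is a hypothetical helper name, and the range should
be re-checked against the controller's actual ACL template before relying
on it:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when a TCP port lies inside the range the
# controller's default MAIN-CONTROL-PLANE-ACL permits (50000-50100, per the
# README text). Returns 2 for non-numeric input.
port_permitted() {
  local port="$1"
  [[ "$port" =~ ^[0-9]+$ ]] || return 2
  # 10# forces base 10 so a leading zero is not parsed as octal.
  (( 10#$port >= 50000 && 10#$port <= 50100 ))
}

for p in 9100 50100; do
  if port_permitted "$p"; then
    echo "$p: inside the permitted range"
  else
    echo "$p: outside the permitted range"
  fi
done
```

This only checks the number against the documented range. It does not
inspect the ACL on a running device.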

## Caveats / known issues

- The orchestrator's hardcoded SSH command is
`doublezero-agent -verbose [-controller HOST:PORT]`. It does not pass
`-pubkey` or `-metrics-enable`. The stress image works around this
with the wrapper at `/usr/local/bin/doublezero-agent`, which injects
`-pubkey` from `/etc/doublezero/agent/pubkey` and turns on metrics on
`:50100`.
- Use `--no-agent` to skip launching the agent over SSH entirely; the
orchestrator will only drive the onchain sweep and the observer will
only see passive device state (no agent-log / metrics rows). Useful as
a first smoke test.
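
To see which arguments the agent ends up with after the wrapper's
injection, the assembly logic can be mirrored as a dry run. This is a
sketch only: `build_agent_args` is a hypothetical name, it prints the
arguments one per line instead of exec'ing the agent, and the key in the
demo is a placeholder.

```shell
#!/usr/bin/env bash
# Dry-run mirror of the wrapper's argument assembly (hypothetical helper).
# Usage: build_agent_args <pubkey_file> [orchestrator args...]
build_agent_args() {
  local pubkey_file="$1"; shift
  local pubkey=""
  local -a args=()

  # Read the key if the file is readable, stripping all whitespace.
  if [ -r "$pubkey_file" ]; then
    pubkey="$(tr -d '[:space:]' < "$pubkey_file")"
  fi
  # Only pass -pubkey when a key was actually found.
  if [ -n "$pubkey" ]; then
    args+=(-pubkey "$pubkey")
  fi
  # Metrics are always enabled, on the port the default ACL permits.
  args+=(-metrics-enable -metrics-addr ":50100")

  # Injected arguments first, then whatever the orchestrator passed.
  printf '%s\n' "${args[@]}" "$@"
}

# Demo: placeholder key plus the orchestrator's -verbose flag.
tmp="$(mktemp)"
printf 'ExamplePubkey111\n' > "$tmp"
build_agent_args "$tmp" -verbose
rm -f "$tmp"
```

If the pubkey file is missing or empty, the sketch (like the wrapper) omits
`-pubkey` and still enables metrics.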

## Teardown

```bash
dev/dzctl destroy -y
docker rm -f dz-local-device-dzstress 2>/dev/null
```