94 changes: 94 additions & 0 deletions .claude/commands/optimize.md
---
name: optimize
description: Run a profiling-driven optimization loop for a specific function
argument-hint: "<function-name> e.g. executeEVMTxWithGigaExecutor"
allowed-tools:
- Read
- Write
- Edit
- Glob
- Grep
- Bash
- Task
- AskUserQuestion
---

# Optimization Loop for: $ARGUMENTS

You are running a profiling-driven optimization loop focused on the function `$ARGUMENTS`.

## References

Read `benchmark/CLAUDE.md` for benchmark commands, environment variables, profiling, and the full optimization loop steps.

## Workflow

Execute the optimization loop from benchmark/CLAUDE.md section "Optimization loop", but focused on `$ARGUMENTS`:

### Phase 1: Understand the target function

1. Find the function `$ARGUMENTS` in the codebase using Grep (see the sketch after this list)
2. Read the function and its callers/callees to understand the hot path
3. Identify what packages, types, and helpers it uses
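
A hypothetical starting point for step 1 (the patterns are assumptions about naming conventions, not guarantees about this repo's layout):

```bash
# Definition: plain function or method (optional receiver)
grep -rEn "func (\([^)]*\) )?$ARGUMENTS\(" --include='*.go' .

# Call sites
grep -rn "$ARGUMENTS(" --include='*.go' . | head -20
```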

### Phase 2: Profile

4. Run the benchmark: `GIGA_EXECUTOR=true GIGA_OCC=true benchmark/benchmark.sh`
5. Wait for it to complete (default DURATION=120s)

### Phase 3: Analyze (focused on target function)

6. Run pprof analysis focused on `$ARGUMENTS` and its call tree. Run these in parallel:
- CPU: `go tool pprof -top -cum -nodecount=40 /tmp/sei-bench/pprof/cpu.pb.gz 2>&1 | head -60`
- fgprof: `go tool pprof -top -cum -nodecount=40 /tmp/sei-bench/pprof/fgprof.pb.gz 2>&1 | head -60`
- Heap (alloc_space): `go tool pprof -alloc_space -top -cum -nodecount=40 /tmp/sei-bench/pprof/heap.pb.gz 2>&1 | head -60`
- Heap (alloc_objects): `go tool pprof -alloc_objects -top -cum -nodecount=40 /tmp/sei-bench/pprof/heap.pb.gz 2>&1 | head -60`
- Block: `go tool pprof -top -cum -nodecount=40 /tmp/sei-bench/pprof/block.pb.gz 2>&1 | head -60`
- Mutex: `go tool pprof -top -cum -nodecount=40 /tmp/sei-bench/pprof/mutex.pb.gz 2>&1 | head -60`
7. Use `go tool pprof -text -focus='$ARGUMENTS' /tmp/sei-bench/pprof/cpu.pb.gz` to get function-focused breakdown
8. Open flamegraphs on separate ports for the user to inspect:
- `go tool pprof -http=:8080 /tmp/sei-bench/pprof/cpu.pb.gz &`
- `go tool pprof -http=:8081 /tmp/sei-bench/pprof/fgprof.pb.gz &`
- `go tool pprof -http=:8082 -alloc_space /tmp/sei-bench/pprof/heap.pb.gz &`

### Phase 4: Summarize and discuss

9. Present findings to the user:
- TPS from the benchmark run (extract from `/tmp/sei-bench/tps.txt`)
- Where `$ARGUMENTS` and its callees spend the most time (CPU, wall-clock)
- Biggest allocation hotspots within the function's call tree
- Any contention (block/mutex) in the function's path
- Top 2-3 candidate optimizations with expected impact and trade-offs
10. Ask the user which optimization direction to pursue. Do NOT write any code until the user picks.

### Phase 5: Implement

11. Implement the chosen optimization
12. Run `gofmt -s -w` on all modified `.go` files
13. Commit the change

### Phase 6: Compare

14. Record the commit hash before and after the optimization
15. Run comparison: `benchmark/benchmark-compare.sh baseline=<before-commit> candidate=<after-commit>`
16. Open diff flamegraphs for the user:
- `go tool pprof -http=:8083 -diff_base /tmp/sei-bench/baseline/pprof/cpu.pb.gz /tmp/sei-bench/candidate/pprof/cpu.pb.gz &`
- `go tool pprof -http=:8084 -diff_base /tmp/sei-bench/baseline/pprof/fgprof.pb.gz /tmp/sei-bench/candidate/pprof/fgprof.pb.gz &`
- `go tool pprof -http=:8085 -diff_base /tmp/sei-bench/baseline/pprof/heap.pb.gz /tmp/sei-bench/candidate/pprof/heap.pb.gz &`

### Phase 7: Validate

17. Present results:
- TPS delta (baseline vs candidate)
- CPU diff: `go tool pprof -top -diff_base /tmp/sei-bench/baseline/pprof/cpu.pb.gz /tmp/sei-bench/candidate/pprof/cpu.pb.gz`
- Heap diff: `go tool pprof -alloc_space -top -diff_base /tmp/sei-bench/baseline/pprof/heap.pb.gz /tmp/sei-bench/candidate/pprof/heap.pb.gz`
18. Ask the user: keep, iterate, or revert?
19. If the user chooses to keep the change, ask whether to open a PR

## Important rules

- ALWAYS ask the user before writing any optimization code (step 10)
- ALWAYS ask the user before opening a PR (step 19)
- Cross-session benchmark numbers are NOT comparable. Only compare within the same `benchmark-compare.sh` run.
- Run `gofmt -s -w` on all modified Go files before committing
- If `$ARGUMENTS` is empty or not found, ask the user to provide the function name
17 changes: 17 additions & 0 deletions app/benchmark_profiling.go
//go:build benchmark

package app

import "runtime"

func init() {
	// Enable block profiling: sample blocking events at an average of one
	// event recorded per 1000ns (1us) of blocked time. Lower rates capture
	// more events but add overhead that can skew TPS.
	// This lets /debug/pprof/block show time spent waiting on channels and mutexes.
	runtime.SetBlockProfileRate(1000)

	// Enable mutex contention profiling: sample 1 in 5 contention events.
	// Full capture (fraction=1) adds measurable overhead; 1/5 is a good balance.
	// This lets /debug/pprof/mutex show where goroutines contend on locks.
	runtime.SetMutexProfileFraction(5)
}
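
A quick way to confirm both settings took effect in a bench build (a sketch; assumes the pprof endpoint on its default port 6060, adjust for `PORT_OFFSET`):

```bash
go tool pprof -top "http://localhost:6060/debug/pprof/block"
go tool pprof -top "http://localhost:6060/debug/pprof/mutex"
```

Non-empty top tables mean the profiling rates are active; production builds (without the `benchmark` tag) yield empty profiles.
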
150 changes: 136 additions & 14 deletions benchmark/CLAUDE.md
## Single scenario

```bash
GIGA_EXECUTOR=true GIGA_OCC=true benchmark/benchmark.sh
```

By default, the benchmark runs for `DURATION=120` seconds, auto-captures all 6 profile types, extracts TPS stats, and exits. Profiles are saved to `/tmp/sei-bench/pprof/`, TPS data to `/tmp/sei-bench/tps.txt`, and the full log to `/tmp/sei-bench/output.log`.

Use `DURATION=0` to run forever (manual capture, original behavior).

TPS is logged every 5s as `tps=<value>` (with ANSI color codes). For manual extraction:

```bash
sed 's/\x1b\[[0-9;]*m//g' /tmp/sei-bench/output.log | sed -n 's/.*tps=\([0-9.]*\).*/\1/p'
```
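
To summarize a run by hand, here is a minimal sketch that assumes `tps.txt` holds one numeric sample per line:

```bash
sort -n /tmp/sei-bench/tps.txt \
  | awk '{a[NR]=$1; s+=$1} END {printf "median=%s avg=%.1f n=%d\n", a[int((NR+1)/2)], s/NR, NR}'
```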

Available scenarios in `benchmark/scenarios/`: `evm.json` (default), `erc20.json`, `mixed.json`, `default.json`.

```bash
BENCHMARK_CONFIG=benchmark/scenarios/erc20.json benchmark/benchmark.sh
```

## Environment variables

### benchmark.sh

| Var | Default | Purpose |
|-----|---------|---------|
| `BENCHMARK_PHASE` | `all` | `init` (build+init+configure), `start` (run node), `all` (both) |
| `SEI_HOME` | `$HOME/.sei` (or `$HOME/.sei-bench-<offset>` when auto-claimed) | Final chain data dir. Init uses a temp staging dir, then moves here |
| `PORT_OFFSET` | auto-claimed | Added to all ports (RPC, P2P, pprof, gRPC, etc.). Auto-claimed via atomic `mkdir` slots when not set, same mechanism as benchmark-compare.sh |
| `SEID_BIN` | `""` | Pre-built binary path. If set, skip build step |
| `LOG_FILE` | `""` | Redirect seid output to file |
| `BENCHMARK_CONFIG` | `$SCRIPT_DIR/scenarios/evm.json` | Scenario config file (absolute path resolved from script location) |
| `BENCHMARK_TXS_PER_BATCH` | `1000` | Transactions per batch |
| `MOCK_BALANCES` | `true` | Use mock balances during benchmark |
| `DISABLE_INDEXER` | `true` | Disable indexer for benchmark (reduces I/O overhead) |
| `DEBUG` | `false` | Print all log output without filtering |
| `DURATION` | `120` | Auto-stop after N seconds (0 = run forever) |
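
For example, a hypothetical longer run of the ERC-20 scenario against a pre-built binary (paths illustrative):

```bash
DURATION=300 \
BENCHMARK_CONFIG=benchmark/scenarios/erc20.json \
SEID_BIN=/path/to/seid \
GIGA_EXECUTOR=true GIGA_OCC=true \
benchmark/benchmark.sh
```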

### benchmark-compare.sh

Inherits all benchmark.sh vars via delegation. Additionally:

| Var | Default | Purpose |
|-----|---------|---------|
| `DURATION` | `120` | How long (seconds) to run each node before stopping |
| `GIGA_EXECUTOR` | **`true`** | Overrides benchmark.sh default (false) |
| `GIGA_OCC` | **`true`** | Overrides benchmark.sh default (false) |
| `DB_BACKEND` | `goleveldb` | Forwarded to build and init phases |
| `RUN_ID` | `$$` (PID) | Namespaces `BASE_DIR` as `/tmp/sei-bench-${RUN_ID}/` |
| `RUN_PORT_OFFSET` | auto-claimed | Added to all per-scenario port offsets (auto-claimed via atomic `mkdir` slots) |

**Note:** `GIGA_EXECUTOR` and `GIGA_OCC` default to `true` in the compare script but `false` in benchmark.sh. The compare script is designed for performance comparison where Giga Executor is typically enabled.

Usage example (each argument is a `label=commit` pair):

```bash
benchmark/benchmark-compare.sh \
  baseline=<commit> \
  lazy-cms-fix=37a17fd02
```

Each scenario gets its own binary, home dir, and port set (offset by 100). Results are printed at the end with median/avg/min/max TPS. Raw data in `/tmp/sei-bench-<PID>/<label>/tps.txt`.

### Running multiple comparisons concurrently

Port offsets are auto-claimed, so the simplest parallel usage is just:

```bash
benchmark/benchmark-compare.sh baseline=abc123 opt=def456 &
benchmark/benchmark-compare.sh baseline=abc123 fix=789abc &
benchmark/benchmark-compare.sh baseline=abc123 alt=aabbcc &
```

Each instance gets its own `RUN_ID` (PID), `BASE_DIR`, and port range. No coordination needed.
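
The claiming idea, as a minimal sketch (illustrative only; the real script's slot directory and sizing may differ): `mkdir` either creates a directory or fails atomically, so exactly one process wins each slot.

```bash
claim_offset() {
  mkdir -p /tmp/sei-bench-slots
  local slot=0
  # mkdir is atomic: only one process can create a given slot directory
  while ! mkdir "/tmp/sei-bench-slots/slot-${slot}" 2>/dev/null; do
    slot=$((slot + 1))
  done
  echo $((slot * 1000)) # disjoint port range per claimed slot
}
```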

Manual override is available if needed:

```bash
RUN_PORT_OFFSET=0 benchmark/benchmark-compare.sh baseline=abc123 opt=def456 &
RUN_PORT_OFFSET=1000 benchmark/benchmark-compare.sh baseline=abc123 fix=789abc &
```

## Comparing results across sessions

**Important:** Cross-session benchmark numbers (TPS, total allocs) are not directly comparable. Only comparisons within the same `benchmark-compare.sh` run are valid, since all scenarios share identical conditions.

## Profiling

### Available profile types

Both `benchmark.sh` (when `DURATION > 0`) and `benchmark-compare.sh` automatically capture all profile types midway through the run. Profiles are saved to `/tmp/sei-bench/pprof/` (single-scenario) or `/tmp/sei-bench/<label>/pprof/` (compare).

| Profile | File | What it shows |
|---------|------|---------------|
| CPU | `cpu.pb.gz` | On-CPU time only (computation, hashing, EVM execution) |
| fgprof | `fgprof.pb.gz` | Wall-clock time: on-CPU + off-CPU (I/O, blocking, GC pauses) |
| Heap | `heap.pb.gz` | Memory allocations (analyzable with multiple metrics, see below) |
| Goroutine | `goroutine.pb.gz` | Goroutine stacks (find pileups and leaks) |
| Block | `block.pb.gz` | Time waiting on channel ops and mutex locks |
| Mutex | `mutex.pb.gz` | Mutex contention (where goroutines fight over locks) |

**CPU vs fgprof:** Go's built-in CPU profiler uses OS-level `SIGPROF` signals delivered only to running threads — goroutines waiting on I/O, channels, or locks are invisible. fgprof samples all goroutines via `runtime.GoroutineProfile` regardless of scheduling state, showing the full wall-clock picture. Use CPU when you suspect pure computation is the bottleneck; use fgprof when TPS is low but CPU utilization is also low (points to I/O or contention).
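
One way to apply that rule of thumb: pull the top cumulative entries from both profiles for the same run, and treat names that dominate fgprof but barely register in the CPU profile as waiting rather than computing.

```bash
go tool pprof -top -cum -nodecount=20 /tmp/sei-bench/pprof/cpu.pb.gz
go tool pprof -top -cum -nodecount=20 /tmp/sei-bench/pprof/fgprof.pb.gz
```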

**Block and mutex profiles** require `runtime.SetBlockProfileRate` and `runtime.SetMutexProfileFraction` to be enabled. Both are automatically enabled when seid is built with the `benchmark` build tag (`make install-bench`). They are disabled in production builds.
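
So before chasing contention, build the bench-tagged binary and point the benchmark at it (the install path below is an assumption; use wherever `make install-bench` puts `seid`):

```bash
make install-bench
SEID_BIN="$HOME/go/bin/seid" GIGA_EXECUTOR=true GIGA_OCC=true benchmark/benchmark.sh
```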

### Capturing profiles during single-scenario runs

When `DURATION > 0` (default), `benchmark.sh` automatically captures all 6 profile types midway through the run. Profiles are saved to `/tmp/sei-bench/pprof/`.

When `DURATION=0` (run-forever mode), capture manually in another terminal:

```bash
PPROF_PORT=6060 # adjust with PORT_OFFSET if set

# 30-second CPU profile
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/pprof/profile?seconds=30"

# 30-second fgprof (wall-clock)
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/fgprof?seconds=30"

# Heap snapshot
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/pprof/heap"

# Goroutine dump
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/pprof/goroutine"

# Block profile
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/pprof/block"

# Mutex contention
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/pprof/mutex"
```
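
The same endpoints can be captured to disk with `curl` for later diffing; the file names below mirror what the scripts produce:

```bash
curl -so /tmp/sei-bench/pprof/cpu.pb.gz "http://localhost:$PPROF_PORT/debug/pprof/profile?seconds=30"
curl -so /tmp/sei-bench/pprof/fgprof.pb.gz "http://localhost:$PPROF_PORT/debug/fgprof?seconds=30"
curl -so /tmp/sei-bench/pprof/heap.pb.gz "http://localhost:$PPROF_PORT/debug/pprof/heap"
```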

### Comparing profiles (diff)

Always use `pprof -diff_base` to compare profiles between benchmark runs. Never compare profiles side-by-side manually.

```bash
# CPU diff (positive = regression, negative = improvement)
go tool pprof -top -diff_base /tmp/sei-bench/<baseline>/pprof/cpu.pb.gz /tmp/sei-bench/<candidate>/pprof/cpu.pb.gz

# Wall-clock diff (fgprof)
go tool pprof -top -diff_base /tmp/sei-bench/<baseline>/pprof/fgprof.pb.gz /tmp/sei-bench/<candidate>/pprof/fgprof.pb.gz

# Allocation diff (total bytes allocated over time)
go tool pprof -alloc_space -top -diff_base /tmp/sei-bench/<baseline>/pprof/heap.pb.gz /tmp/sei-bench/<candidate>/pprof/heap.pb.gz

# Heap escape diff (objects that should be stack-allocated)
go tool pprof -alloc_objects -top -diff_base /tmp/sei-bench/<baseline>/pprof/heap.pb.gz /tmp/sei-bench/<candidate>/pprof/heap.pb.gz
```
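
To cut noise, a diff can also be restricted to one function's subtree with `-focus` (the function name here is illustrative):

```bash
go tool pprof -top -focus='executeEVMTxWithGigaExecutor' \
  -diff_base /tmp/sei-bench/<baseline>/pprof/cpu.pb.gz /tmp/sei-bench/<candidate>/pprof/cpu.pb.gz
```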

### Heap profile metrics

The heap profile contains multiple metrics. Choose the right one for your question:

| Metric | Flag | Use when |
|--------|------|----------|
| Active memory | `-inuse_space` | Finding memory leaks or high RSS |
| Active objects | `-inuse_objects` | Finding what's holding memory right now |
| Total allocated bytes | `-alloc_space` | Finding hot allocation paths (GC pressure) |
| Total allocated objects | `-alloc_objects` | Finding heap escapes (objects that should live on the stack) |
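
For example, when chasing high RSS or a suspected leak, reach for the in-use metrics:

```bash
go tool pprof -inuse_space -top -nodecount=20 /tmp/sei-bench/pprof/heap.pb.gz
```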

### Interactive analysis and flamegraphs

Text output (`-top`) is useful for quick comparisons, but the web UI with flamegraphs is far more effective for navigating large profiles.

```bash
# Interactive web UI with flamegraphs
go tool pprof -http=:8080 /tmp/sei-bench/<label>/pprof/cpu.pb.gz

# Diff flamegraph (red = regression, blue = improvement)
go tool pprof -http=:8080 -diff_base /tmp/sei-bench/<baseline>/pprof/cpu.pb.gz /tmp/sei-bench/<candidate>/pprof/cpu.pb.gz
```

For drilling into specific functions, use the interactive CLI:

```bash
go tool pprof /tmp/sei-bench/<label>/pprof/cpu.pb.gz
(pprof) top20 -cum # sort by cumulative time (the default flat sort hides expensive callees)
(pprof) list DeliverTx # line-by-line source attribution
(pprof) web DeliverTx # SVG graph focused on one function's callers/callees
```

Run `go tool pprof` from the sei-chain repo root so that `list` and `web` commands can resolve source file paths.

## Optimization loop

Full iteration cycle for profiling, optimizing, and validating performance changes:

1. **Profile:** Run `benchmark/benchmark.sh` (auto-captures all 6 profile types, extracts TPS, exits)
2. **Analyze:** Inspect profiles with `go tool pprof -top -cum /tmp/sei-bench/pprof/cpu.pb.gz` (repeat for fgprof, heap with `-alloc_space`, goroutine, block, mutex)
3. **Flamegraphs:** Open interactive UIs for cpu, fgprof, and heap on separate ports: `go tool pprof -http=:8080 /tmp/sei-bench/pprof/cpu.pb.gz &`, `go tool pprof -http=:8081 /tmp/sei-bench/pprof/fgprof.pb.gz &`, `go tool pprof -http=:8082 /tmp/sei-bench/pprof/heap.pb.gz &`
4. **Summarize:** Present findings to user — biggest bottleneck, candidate optimizations, expected impact, and trade-offs
5. **Discuss:** Ask the user which optimization direction to pursue before writing any code. The user picks the approach or suggests an alternative
6. **Implement:** Make the optimization, commit
7. **Compare:** Run `benchmark/benchmark-compare.sh baseline=<before-commit> candidate=<after-commit>`
8. **Validate:** Open diff flamegraphs for the user (`go tool pprof -http=:808X -diff_base <baseline> <candidate> &` for cpu, fgprof, heap). Present a CLI summary of TPS delta and profile regressions/improvements. Ask the user whether to open a PR
9. **PR** if user approves