94 changes: 94 additions & 0 deletions .claude/commands/optimize.md
---
name: optimize
description: Run a profiling-driven optimization loop for a specific function
argument-hint: "<function-name> e.g. executeEVMTxWithGigaExecutor"
allowed-tools:
- Read
- Write
- Edit
- Glob
- Grep
- Bash
- Task
- AskUserQuestion
---

# Optimization Loop for: $ARGUMENTS

You are running a profiling-driven optimization loop focused on the function `$ARGUMENTS`.

## References

Read `benchmark/CLAUDE.md` for benchmark commands, environment variables, profiling, and the full optimization loop steps.

## Workflow

Execute the optimization loop from benchmark/CLAUDE.md section "Optimization loop", but focused on `$ARGUMENTS`:

### Phase 1: Understand the target function

1. Find the function `$ARGUMENTS` in the codebase using Grep (see the sketch after this list)
2. Read the function and its callers/callees to understand the hot path
3. Identify what packages, types, and helpers it uses
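
A hypothetical starting point for step 1 (the patterns are assumptions about naming conventions, not guarantees about this repo's layout):

```bash
# Definition: plain function or method (optional receiver)
grep -rEn "func (\([^)]*\) )?$ARGUMENTS\(" --include='*.go' .

# Call sites
grep -rn "$ARGUMENTS(" --include='*.go' . | head -20
```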

### Phase 2: Profile

4. Run the benchmark: `GIGA_EXECUTOR=true GIGA_OCC=true benchmark/benchmark.sh`
5. Wait for it to complete (default DURATION=120s)

### Phase 3: Analyze (focused on target function)

6. Run pprof analysis focused on `$ARGUMENTS` and its call tree. Run these in parallel:
- CPU: `go tool pprof -top -cum -nodecount=40 /tmp/sei-bench/pprof/cpu.pb.gz 2>&1 | head -60`
- fgprof: `go tool pprof -top -cum -nodecount=40 /tmp/sei-bench/pprof/fgprof.pb.gz 2>&1 | head -60`
- Heap (alloc_space): `go tool pprof -alloc_space -top -cum -nodecount=40 /tmp/sei-bench/pprof/heap.pb.gz 2>&1 | head -60`
- Heap (alloc_objects): `go tool pprof -alloc_objects -top -cum -nodecount=40 /tmp/sei-bench/pprof/heap.pb.gz 2>&1 | head -60`
- Block: `go tool pprof -top -cum -nodecount=40 /tmp/sei-bench/pprof/block.pb.gz 2>&1 | head -60`
- Mutex: `go tool pprof -top -cum -nodecount=40 /tmp/sei-bench/pprof/mutex.pb.gz 2>&1 | head -60`
7. Use `go tool pprof -text -focus='$ARGUMENTS' /tmp/sei-bench/pprof/cpu.pb.gz` to get function-focused breakdown
8. Open flamegraphs on separate ports for the user to inspect:
- `go tool pprof -http=:8080 /tmp/sei-bench/pprof/cpu.pb.gz &`
- `go tool pprof -http=:8081 /tmp/sei-bench/pprof/fgprof.pb.gz &`
- `go tool pprof -http=:8082 -alloc_space /tmp/sei-bench/pprof/heap.pb.gz &`

### Phase 4: Summarize and discuss

9. Present findings to the user:
- TPS from the benchmark run (extract from `/tmp/sei-bench/tps.txt`)
- Where `$ARGUMENTS` and its callees spend the most time (CPU, wall-clock)
- Biggest allocation hotspots within the function's call tree
- Any contention (block/mutex) in the function's path
- Top 2-3 candidate optimizations with expected impact and trade-offs
10. Ask the user which optimization direction to pursue. Do NOT write any code until the user picks.

### Phase 5: Implement

11. Implement the chosen optimization
12. Run `gofmt -s -w` on all modified `.go` files
13. Commit the change

### Phase 6: Compare

14. Record the commit hash before and after the optimization
15. Run comparison: `benchmark/benchmark-compare.sh baseline=<before-commit> candidate=<after-commit>`
16. Open diff flamegraphs for the user:
- `go tool pprof -http=:8083 -diff_base /tmp/sei-bench/baseline/pprof/cpu.pb.gz /tmp/sei-bench/candidate/pprof/cpu.pb.gz &`
- `go tool pprof -http=:8084 -diff_base /tmp/sei-bench/baseline/pprof/fgprof.pb.gz /tmp/sei-bench/candidate/pprof/fgprof.pb.gz &`
- `go tool pprof -http=:8085 -diff_base /tmp/sei-bench/baseline/pprof/heap.pb.gz /tmp/sei-bench/candidate/pprof/heap.pb.gz &`

### Phase 7: Validate

17. Present results:
- TPS delta (baseline vs candidate)
- CPU diff: `go tool pprof -top -diff_base /tmp/sei-bench/baseline/pprof/cpu.pb.gz /tmp/sei-bench/candidate/pprof/cpu.pb.gz`
- Heap diff: `go tool pprof -alloc_space -top -diff_base /tmp/sei-bench/baseline/pprof/heap.pb.gz /tmp/sei-bench/candidate/pprof/heap.pb.gz`
18. Ask the user: keep, iterate, or revert?
19. If the user chooses to keep the change, ask whether to open a PR

## Important rules

- ALWAYS ask the user before writing any optimization code (step 10)
- ALWAYS ask the user before opening a PR (step 19)
- Cross-session benchmark numbers are NOT comparable. Only compare within the same `benchmark-compare.sh` run.
- Run `gofmt -s -w` on all modified Go files before committing
- If `$ARGUMENTS` is empty or not found, ask the user to provide the function name
17 changes: 17 additions & 0 deletions app/benchmark_profiling.go
//go:build benchmark

package app

import "runtime"

func init() {
	// Enable block profiling: sample blocking events at an average of one
	// event recorded per 1000ns (1us) of blocked time. Lower rates capture
	// more events but add overhead that can skew TPS.
	// This lets /debug/pprof/block show time spent waiting on channels and mutexes.
	runtime.SetBlockProfileRate(1000)

	// Enable mutex contention profiling: sample 1 in 5 contention events.
	// Full capture (fraction=1) adds measurable overhead; 1/5 is a good balance.
	// This lets /debug/pprof/mutex show where goroutines contend on locks.
	runtime.SetMutexProfileFraction(5)
}
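
A quick way to confirm both settings took effect in a bench build (a sketch; assumes the pprof endpoint on its default port 6060, adjust for `PORT_OFFSET`):

```bash
go tool pprof -top "http://localhost:6060/debug/pprof/block"
go tool pprof -top "http://localhost:6060/debug/pprof/mutex"
```

Non-empty top tables mean the profiling rates are active; production builds (without the `benchmark` tag) yield empty profiles.
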
150 changes: 136 additions & 14 deletions benchmark/CLAUDE.md
## Single scenario

```bash
GIGA_EXECUTOR=true GIGA_OCC=true benchmark/benchmark.sh
```

By default, the benchmark runs for `DURATION=120` seconds, auto-captures all 6 profile types, extracts TPS stats, and exits. Profiles are saved to `/tmp/sei-bench/pprof/`, TPS data to `/tmp/sei-bench/tps.txt`, and the full log to `/tmp/sei-bench/output.log`.

Use `DURATION=0` to run forever (manual capture, original behavior).

TPS is logged every 5s as `tps=<value>` (with ANSI color codes). For manual extraction:

```bash
sed 's/\x1b\[[0-9;]*m//g' /tmp/sei-bench/output.log | sed -n 's/.*tps=\([0-9.]*\).*/\1/p'
```
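
To summarize a run by hand, here is a minimal sketch that assumes `tps.txt` holds one numeric sample per line:

```bash
sort -n /tmp/sei-bench/tps.txt \
  | awk '{a[NR]=$1; s+=$1} END {printf "median=%s avg=%.1f n=%d\n", a[int((NR+1)/2)], s/NR, NR}'
```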

Available scenarios in `benchmark/scenarios/`: `evm.json` (default), `erc20.json`, `mixed.json`, `default.json`.

```bash
BENCHMARK_CONFIG=benchmark/scenarios/erc20.json benchmark/benchmark.sh
```

## Environment variables

### benchmark.sh

| Var | Default | Purpose |
|-----|---------|---------|
| `BENCHMARK_PHASE` | `all` | `init` (build+init+configure), `start` (run node), `all` (both) |
| `SEI_HOME` | `$HOME/.sei` (or `$HOME/.sei-bench-<offset>` when auto-claimed) | Final chain data dir. Init uses a temp staging dir, then moves here |
| `PORT_OFFSET` | auto-claimed | Added to all ports (RPC, P2P, pprof, gRPC, etc.). Auto-claimed via atomic `mkdir` slots when not set, same mechanism as benchmark-compare.sh |
| `SEID_BIN` | `""` | Pre-built binary path. If set, skip build step |
| `LOG_FILE` | `""` | Redirect seid output to file |
| `BENCHMARK_CONFIG` | `$SCRIPT_DIR/scenarios/evm.json` | Scenario config file (absolute path resolved from script location) |
| `BENCHMARK_TXS_PER_BATCH` | `1000` | Transactions per batch |
| `MOCK_BALANCES` | `true` | Use mock balances during benchmark |
| `DISABLE_INDEXER` | `true` | Disable indexer for benchmark (reduces I/O overhead) |
| `DEBUG` | `false` | Print all log output without filtering |
| `DURATION` | `120` | Auto-stop after N seconds (0 = run forever) |
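
For example, a hypothetical longer run of the ERC-20 scenario against a pre-built binary (paths illustrative):

```bash
DURATION=300 \
BENCHMARK_CONFIG=benchmark/scenarios/erc20.json \
SEID_BIN=/path/to/seid \
GIGA_EXECUTOR=true GIGA_OCC=true \
benchmark/benchmark.sh
```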

### benchmark-compare.sh

Inherits all benchmark.sh vars via delegation. Additionally:

| Var | Default | Purpose |
|-----|---------|---------|
| `DURATION` | `120` | How long (seconds) to run each node before stopping |
| `GIGA_EXECUTOR` | **`true`** | Overrides benchmark.sh default (false) |
| `GIGA_OCC` | **`true`** | Overrides benchmark.sh default (false) |
| `DB_BACKEND` | `goleveldb` | Forwarded to build and init phases |
| `RUN_ID` | `$$` (PID) | Namespaces `BASE_DIR` as `/tmp/sei-bench-${RUN_ID}/` |
| `RUN_PORT_OFFSET` | auto-claimed | Added to all per-scenario port offsets (auto-claimed via atomic `mkdir` slots) |

**Note:** `GIGA_EXECUTOR` and `GIGA_OCC` default to `true` in the compare script but `false` in benchmark.sh. The compare script is designed for performance comparison where Giga Executor is typically enabled.

Usage example (each argument is a `label=commit` pair):

```bash
benchmark/benchmark-compare.sh \
  baseline=<commit> \
  lazy-cms-fix=37a17fd02
```

Each scenario gets its own binary, home dir, and port set (offset by 100). Results are printed at the end with median/avg/min/max TPS. Raw data in `/tmp/sei-bench-<PID>/<label>/tps.txt`.

### Running multiple comparisons concurrently

Port offsets are auto-claimed, so the simplest parallel usage is just:

```bash
benchmark/benchmark-compare.sh baseline=abc123 opt=def456 &
benchmark/benchmark-compare.sh baseline=abc123 fix=789abc &
benchmark/benchmark-compare.sh baseline=abc123 alt=aabbcc &
```

Each instance gets its own `RUN_ID` (PID), `BASE_DIR`, and port range. No coordination needed.
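
The claiming idea, as a minimal sketch (illustrative only; the real script's slot directory and sizing may differ): `mkdir` either creates a directory or fails atomically, so exactly one process wins each slot.

```bash
claim_offset() {
  mkdir -p /tmp/sei-bench-slots
  local slot=0
  # mkdir is atomic: only one process can create a given slot directory
  while ! mkdir "/tmp/sei-bench-slots/slot-${slot}" 2>/dev/null; do
    slot=$((slot + 1))
  done
  echo $((slot * 1000)) # disjoint port range per claimed slot
}
```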

Manual override is available if needed:

```bash
RUN_PORT_OFFSET=0 benchmark/benchmark-compare.sh baseline=abc123 opt=def456 &
RUN_PORT_OFFSET=1000 benchmark/benchmark-compare.sh baseline=abc123 fix=789abc &
```

## Comparing results across sessions

**Important:** Cross-session benchmark numbers (TPS, total allocs) are not directly comparable. Only comparisons within the same `benchmark-compare.sh` run are valid, since all scenarios share identical conditions.

## Profiling

### Available profile types

Both `benchmark.sh` (when `DURATION > 0`) and `benchmark-compare.sh` automatically capture all profile types midway through the run. Profiles are saved to `/tmp/sei-bench/pprof/` (single-scenario) or `/tmp/sei-bench/<label>/pprof/` (compare).

| Profile | File | What it shows |
|---------|------|---------------|
| CPU | `cpu.pb.gz` | On-CPU time only (computation, hashing, EVM execution) |
| fgprof | `fgprof.pb.gz` | Wall-clock time: on-CPU + off-CPU (I/O, blocking, GC pauses) |
| Heap | `heap.pb.gz` | Memory allocations (analyzable with multiple metrics, see below) |
| Goroutine | `goroutine.pb.gz` | Goroutine stacks (find pileups and leaks) |
| Block | `block.pb.gz` | Time waiting on channel ops and mutex locks |
| Mutex | `mutex.pb.gz` | Mutex contention (where goroutines fight over locks) |

**CPU vs fgprof:** Go's built-in CPU profiler uses OS-level `SIGPROF` signals delivered only to running threads — goroutines waiting on I/O, channels, or locks are invisible. fgprof samples all goroutines via `runtime.GoroutineProfile` regardless of scheduling state, showing the full wall-clock picture. Use CPU when you suspect pure computation is the bottleneck; use fgprof when TPS is low but CPU utilization is also low (points to I/O or contention).
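
One way to apply that rule of thumb: pull the top cumulative entries from both profiles for the same run, and treat names that dominate fgprof but barely register in the CPU profile as waiting rather than computing.

```bash
go tool pprof -top -cum -nodecount=20 /tmp/sei-bench/pprof/cpu.pb.gz
go tool pprof -top -cum -nodecount=20 /tmp/sei-bench/pprof/fgprof.pb.gz
```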

**Block and mutex profiles** require `runtime.SetBlockProfileRate` and `runtime.SetMutexProfileFraction` to be enabled. Both are automatically enabled when seid is built with the `benchmark` build tag (`make install-bench`). They are disabled in production builds.
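
So before chasing contention, build the bench-tagged binary and point the benchmark at it (the install path below is an assumption; use wherever `make install-bench` puts `seid`):

```bash
make install-bench
SEID_BIN="$HOME/go/bin/seid" GIGA_EXECUTOR=true GIGA_OCC=true benchmark/benchmark.sh
```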

### Capturing profiles during single-scenario runs

When `DURATION > 0` (default), `benchmark.sh` automatically captures all 6 profile types midway through the run. Profiles are saved to `/tmp/sei-bench/pprof/`.

When `DURATION=0` (run-forever mode), capture manually in another terminal:

```bash
PPROF_PORT=6060 # adjust with PORT_OFFSET if set

# 30-second CPU profile
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/pprof/profile?seconds=30"

# 30-second fgprof (wall-clock)
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/fgprof?seconds=30"

# Heap snapshot
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/pprof/heap"

# Goroutine dump
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/pprof/goroutine"

# Block profile
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/pprof/block"

# Mutex contention
go tool pprof -http=:8080 "http://localhost:$PPROF_PORT/debug/pprof/mutex"
```
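
The same endpoints can be captured to disk with `curl` for later diffing; the file names below mirror what the scripts produce:

```bash
curl -so /tmp/sei-bench/pprof/cpu.pb.gz "http://localhost:$PPROF_PORT/debug/pprof/profile?seconds=30"
curl -so /tmp/sei-bench/pprof/fgprof.pb.gz "http://localhost:$PPROF_PORT/debug/fgprof?seconds=30"
curl -so /tmp/sei-bench/pprof/heap.pb.gz "http://localhost:$PPROF_PORT/debug/pprof/heap"
```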

### Comparing profiles (diff)

Always use `pprof -diff_base` to compare profiles between benchmark runs. Never compare profiles side-by-side manually.

```bash
# CPU diff (positive = regression, negative = improvement)
go tool pprof -top -diff_base /tmp/sei-bench/<baseline>/pprof/cpu.pb.gz /tmp/sei-bench/<candidate>/pprof/cpu.pb.gz

# Wall-clock diff (fgprof)
go tool pprof -top -diff_base /tmp/sei-bench/<baseline>/pprof/fgprof.pb.gz /tmp/sei-bench/<candidate>/pprof/fgprof.pb.gz

# Allocation diff (total bytes allocated over time)
go tool pprof -alloc_space -top -diff_base /tmp/sei-bench/<baseline>/pprof/heap.pb.gz /tmp/sei-bench/<candidate>/pprof/heap.pb.gz

# Heap escape diff (objects that should be stack-allocated)
go tool pprof -alloc_objects -top -diff_base /tmp/sei-bench/<baseline>/pprof/heap.pb.gz /tmp/sei-bench/<candidate>/pprof/heap.pb.gz
```
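
To cut noise, a diff can also be restricted to one function's subtree with `-focus` (the function name here is illustrative):

```bash
go tool pprof -top -focus='executeEVMTxWithGigaExecutor' \
  -diff_base /tmp/sei-bench/<baseline>/pprof/cpu.pb.gz /tmp/sei-bench/<candidate>/pprof/cpu.pb.gz
```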

### Heap profile metrics

The heap profile contains multiple metrics. Choose the right one for your question:

| Metric | Flag | Use when |
|--------|------|----------|
| Active memory | `-inuse_space` | Finding memory leaks or high RSS |
| Active objects | `-inuse_objects` | Finding what's holding memory right now |
| Total allocated bytes | `-alloc_space` | Finding hot allocation paths (GC pressure) |
| Total allocated objects | `-alloc_objects` | Finding heap escapes (objects that should live on the stack) |
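
For example, when chasing high RSS or a suspected leak, reach for the in-use metrics:

```bash
go tool pprof -inuse_space -top -nodecount=20 /tmp/sei-bench/pprof/heap.pb.gz
```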

### Interactive analysis and flamegraphs

Text output (`-top`) is useful for quick comparisons, but the web UI with flamegraphs is far more effective for navigating large profiles.

```bash
# Interactive web UI with flamegraphs
go tool pprof -http=:8080 /tmp/sei-bench/<label>/pprof/cpu.pb.gz

# Diff flamegraph (red = regression, blue = improvement)
go tool pprof -http=:8080 -diff_base /tmp/sei-bench/<baseline>/pprof/cpu.pb.gz /tmp/sei-bench/<candidate>/pprof/cpu.pb.gz
```

For drilling into specific functions, use the interactive CLI:

```bash
go tool pprof /tmp/sei-bench/<label>/pprof/cpu.pb.gz
(pprof) top20 -cum # sort by cumulative time (the default flat sort hides expensive callees)
(pprof) list DeliverTx # line-by-line source attribution
(pprof) web DeliverTx # SVG graph focused on one function's callers/callees
```

Run `go tool pprof` from the sei-chain repo root so that `list` and `web` commands can resolve source file paths.

## Optimization loop

Full iteration cycle for profiling, optimizing, and validating performance changes:

1. **Profile:** Run `benchmark/benchmark.sh` (auto-captures all 6 profile types, extracts TPS, exits)
2. **Analyze:** Inspect profiles with `go tool pprof -top -cum /tmp/sei-bench/pprof/cpu.pb.gz` (repeat for fgprof, heap with `-alloc_space`, goroutine, block, mutex)
3. **Flamegraphs:** Open interactive UIs for cpu, fgprof, and heap on separate ports: `go tool pprof -http=:8080 /tmp/sei-bench/pprof/cpu.pb.gz &`, `go tool pprof -http=:8081 /tmp/sei-bench/pprof/fgprof.pb.gz &`, `go tool pprof -http=:8082 /tmp/sei-bench/pprof/heap.pb.gz &`
4. **Summarize:** Present findings to user — biggest bottleneck, candidate optimizations, expected impact, and trade-offs
5. **Discuss:** Ask the user which optimization direction to pursue before writing any code. The user picks the approach or suggests an alternative
6. **Implement:** Make the optimization, commit
7. **Compare:** Run `benchmark/benchmark-compare.sh baseline=<before-commit> candidate=<after-commit>`
8. **Validate:** Open diff flamegraphs for the user (`go tool pprof -http=:808X -diff_base <baseline> <candidate> &` for cpu, fgprof, heap). Present a CLI summary of TPS delta and profile regressions/improvements. Ask the user whether to open a PR
9. **PR** if user approves