perf: OCC scheduler + EndBlock optimizations for 20K TPS #2912
Conversation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ogic

benchmark-compare.sh duplicated ~120 lines of chain init/config from benchmark.sh. This refactor adds env vars (SEI_HOME, PORT_OFFSET, SEID_BIN, BENCHMARK_PHASE, LOG_FILE) to benchmark.sh and rewrites benchmark-compare.sh as a thin orchestrator that delegates init/start via those env vars.
- Move scripts/benchmark.sh -> benchmark/benchmark.sh
- Move scripts/benchmark-compare.sh -> benchmark/benchmark-compare.sh
- Move scripts/scenarios/ -> benchmark/scenarios/
- Add benchmark/CLAUDE.md with benchmark docs
- Update path references in scenarios/README.md and app/benchmark/benchmark.go
- Global CLAUDE.md references benchmark/CLAUDE.md for benchmark context

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add missing env vars (MOCK_BALANCES, DISABLE_INDEXER, DURATION)
- Split env var table by script (benchmark.sh vs benchmark-compare.sh)
- Document GIGA_EXECUTOR/GIGA_OCC default difference between scripts
- Fix TPS extraction example to use LOG_FILE instead of a nonexistent path
- Note pprof auto-capture is compare-only; single runs need manual capture
- Add scenario file listing and usage example
- Document DB_BACKEND forwarding in compare script
- Note minimum 2 scenarios required for compare

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e docs

Add wall-clock profiling (fgprof) alongside standard CPU profiling to capture off-CPU time (I/O, blocking, GC pauses). Register the fgprof handler behind the benchmark build tag so production binaries are unaffected. Enable block and mutex contention profiling via runtime calls, also gated behind the benchmark build tag. Use conservative sampling rates (1us block threshold, 1/5 mutex fraction) to minimize overhead on TPS.

Update benchmark-compare.sh to capture all 6 profile types (CPU, fgprof, heap, goroutine, block, mutex) and report sizes for each.

Expand benchmark/CLAUDE.md with:
- Profile type reference table with when-to-use guidance
- CPU vs fgprof explanation
- Heap metric selection guide (inuse_space vs alloc_objects etc)
- Interactive flamegraph and drill-down commands
- Single-scenario manual capture examples
- Source-mapping tip for pprof

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
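Roughly what the gated registration could look like; the package name and exact registration point are assumptions, but fgprof.Handler(), runtime.SetBlockProfileRate, and runtime.SetMutexProfileFraction are the standard APIs involved:

```go
//go:build benchmark

// Hedged sketch, not the actual wiring: register wall-clock profiling and
// enable conservative contention profiling only in benchmark builds.
package profiling

import (
	"net/http"
	"runtime"

	"github.com/felixge/fgprof"
)

func init() {
	// Wall-clock profile (on- and off-CPU time) served next to net/http/pprof.
	http.DefaultServeMux.Handle("/debug/fgprof", fgprof.Handler())

	// Conservative sampling so TPS is barely affected:
	// ~one blocking event per 1us blocked, 1 in 5 mutex contention events.
	runtime.SetBlockProfileRate(1000) // nanoseconds
	runtime.SetMutexProfileFraction(5)
}
```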
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…extraction benchmark.sh now runs for DURATION seconds (default 120), auto-captures all 6 profile types midway, extracts TPS stats, and exits cleanly. DURATION=0 preserves the original run-forever behavior. Also documents the full optimization loop workflow in CLAUDE.md. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…on loop Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ion loop Adds a Claude Code command that runs a structured optimization workflow: profile -> analyze -> discuss -> implement -> compare -> validate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… corruption Go's CPU profiler and fgprof conflict when running concurrently on the same process, producing empty or corrupted profiles. Switch from parallel background captures to sequential execution (CPU first, then fgprof), measure actual capture duration for accurate remaining-time calculation, and make BASE_DIR overridable so benchmark-compare.sh can route profiles to per-label directories. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Namespace `BASE_DIR` per run via `RUN_ID` (defaults to PID):
`/tmp/sei-bench-${RUN_ID}/`
- Auto-claim port offset slots via atomic `mkdir` (supports 30
concurrent runs, zero coordination)
- Replace `git checkout` with git worktrees for isolated builds (no
working tree collisions)
- Replace `~/go/bin/seid` with `GOBIN`-based builds per label (no binary
collisions)
- Replace `~/.sei` staging with `mktemp` + `--home` on all `seid`
commands (no init collisions)
- Pass `SEI_HOME_DIR`/`SEID_BIN` env vars to
`populate_genesis_accounts.py` (backward-compatible defaults)
- Fix pre-existing double lifecycle bug by passing `DURATION=0` to child
start-phase
## Test plan
- [x] Syntax check: `bash -n` on both shell scripts, `py_compile` on
Python
- [x] Two concurrent `benchmark-compare.sh` runs with `DURATION=120` —
both completed, separate `BASE_DIR`s, no port conflicts
- [x] All 6 profile types captured for all 4 nodes (CPU ~145KB, fgprof
~115KB, heap ~248KB, etc.)
- [x] TPS data collected (36-37 readings per node)
- [x] `pprof -diff_base` produces valid analyzable output from both runs
- [x] Port slot locks cleaned up after exit
- [x] Git worktrees cleaned up after exit
- [x] Backward compatible — no env vars needed for single-instance usage
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Baseline: 8600 TPS median with GIGA_EXECUTOR=true GIGA_OCC=true.

Key findings:
- CacheMultiStore snapshot allocation is the #1 target (15 GB, 27% of function)
- cachekv.NewStore creates 9 GB of sync.Map objects per 30s sample
- GC overhead (~24% CPU) is driven by allocation pressure
- Lock contention (30.7% CPU) is partially secondary to GC/alloc

Candidate optimizations: sync.Pool for cachekv.Store, lazy per-store creation, replace sync.Map with regular map in giga path, cache block-level constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace per-tx cachekv.Store allocation with sync.Pool recycling for both standard and giga variants. Add CacheMultiStore.Release() and ReleaseDB() to return stores to pools at lifecycle boundaries (Cleanup, RevertToSnapshot, CleanupForTracer). Release replaced stores in SetKVStores/SetGigaKVStores and unused db store in OCC scheduler. Reset() replaces sync.Map fields with fresh instances (not Clear(), which is slower due to internal trie node walking and causes more allocations when repopulated). Targeting the #1 flat allocator from profiling: cachekv.NewStore at 9 GB / 157M objects over 30s at 8600 TPS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
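A minimal sketch of the pooling pattern this commit describes, with a heavily simplified store type; the real cachekv.Store carries more fields, and the Release/Reset names follow the commit text rather than a confirmed API:

```go
// Sketch only: acquire/release a cache store through sync.Pool and reset it
// by swapping in fresh sync.Maps instead of clearing the old ones.
package main

import "sync"

type cacheStore struct {
	cache    *sync.Map // key -> cached value
	unsorted *sync.Map // keys pending sort on Write
}

var storePool = sync.Pool{
	New: func() any {
		return &cacheStore{cache: &sync.Map{}, unsorted: &sync.Map{}}
	},
}

// acquire returns a recycled store (or a new one) ready for the next tx.
func acquire() *cacheStore {
	return storePool.Get().(*cacheStore)
}

// release resets the store at a lifecycle boundary (Cleanup,
// RevertToSnapshot, ...) and returns it to the pool.
func release(s *cacheStore) {
	// Replace the sync.Maps rather than deleting their entries: dropping the
	// old hash-trie wholesale is cheaper than walking it key by key.
	s.cache = &sync.Map{}
	s.unsorted = &sync.Map{}
	storePool.Put(s)
}

func main() {
	s := acquire()
	s.cache.Store("k", []byte("v"))
	release(s)
	_ = acquire() // often the same object, now holding fresh maps
}
```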
This reverts commit dbd6ad5.
This reverts commit 0244baf.
Cache ChainID, BlockContext, ChainConfig, and BaseFee once per block instead of recomputing them for every transaction. These values are identical for all txs in a block but were previously causing redundant store reads, Keccak256 hashing, and big.Int allocations on every call. The cache is constructed as a local variable and passed as a parameter to executeEVMTxWithGigaExecutor, avoiding any mutable state on App. Both giga execution paths (synchronous and OCC) are covered. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
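An illustrative sketch of the per-block cache idea; the blockCache struct, its field set, and the executor signature here are hypothetical. The point is that the values are built once per block and passed by value, so no mutable state lands on App:

```go
// Sketch: compute block-constant values once, pass them to every tx.
package main

import (
	"fmt"
	"math/big"
)

// blockCache holds values identical for every tx in a block. Names are
// illustrative; the real cache also carries the chain config and EVM block
// context.
type blockCache struct {
	chainID     *big.Int
	baseFee     *big.Int
	blockHeight int64
	blockTime   int64
}

// buildBlockCache runs once per block, so the store reads, hashing, and
// big.Int allocations behind these values happen once instead of per tx.
func buildBlockCache(height, unixTime int64) blockCache {
	return blockCache{
		chainID:     big.NewInt(713715), // placeholder chain ID
		baseFee:     big.NewInt(1_000_000_000),
		blockHeight: height,
		blockTime:   unixTime,
	}
}

// executeTx receives the cache as a parameter instead of recomputing it.
func executeTx(bc blockCache, txIndex int) {
	fmt.Printf("tx %d: height=%d baseFee=%s\n", txIndex, bc.blockHeight, bc.baseFee)
}

func main() {
	bc := buildBlockCache(100, 1_700_000_000)
	for i := 0; i < 3; i++ {
		executeTx(bc, i)
	}
}
```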
Replace sync.Map-based cachekv stores with plain-map FastStore in the giga executor snapshot path. Since each OCC task's snapshot stores are owned by a single goroutine, sync.Map's thread-safety overhead is unnecessary waste.

Changes:
- Add cachekv.FastStore and gigacachekv.FastStore using plain maps
- Add CacheMultiStoreGiga() to cachemulti.Store for lightweight CMS
- DBImpl.Snapshot() uses CacheMultiStoreGiga() when available
- Remove initial Snapshot() from NewDBImpl, use committedCtx instead
- Lazy store creation uses FastStore when CMS is in fast mode

Expected impact: significant reduction in allocation pressure from sync.Map creation and internal node allocation (~30GB/30s savings), with compounding GC overhead reduction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
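A sketch contrasting the two store flavors; field names are illustrative, not the actual cachekv/gigacachekv definitions. The plain-map variant is only valid because each OCC task's snapshot stores are confined to a single goroutine:

```go
// Sketch: thread-safe store vs. single-goroutine FastStore.
package main

import "sync"

// sharedStore is the thread-safe variant: sync.Map pays for atomicity,
// interface boxing, and internal hash-trie node allocation.
type sharedStore struct{ cache sync.Map }

func (s *sharedStore) Set(key, value []byte) { s.cache.Store(string(key), value) }
func (s *sharedStore) Get(key []byte) []byte {
	v, ok := s.cache.Load(string(key))
	if !ok {
		return nil
	}
	return v.([]byte)
}

// FastStore uses a plain map: no locking and far fewer allocations,
// safe only under single-goroutine ownership.
type FastStore struct{ cache map[string][]byte }

func NewFastStore() *FastStore { return &FastStore{cache: make(map[string][]byte)} }

func (s *FastStore) Set(key, value []byte) { s.cache[string(key)] = value }
func (s *FastStore) Get(key []byte) []byte { return s.cache[string(key)] }

func main() {
	fs := NewFastStore()
	fs.Set([]byte("balance/0xabc"), []byte{1})
	_ = fs.Get([]byte("balance/0xabc"))

	var ss sharedStore
	ss.Set([]byte("balance/0xabc"), []byte{1}) // same API, heavier internals
}
```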
Wire CacheMultiStoreGiga() into the OCC scheduler's prepareTask path so each task iteration uses plain-map FastStore instead of sync.Map-based stores. This reduces allocation overhead for per-task CMS creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two optimizations to reduce allocation/GC overhead:
1. CacheMultiStoreForOCC: creates a hollow CMS where stores are populated directly from VersionIndexedStores, skipping creation of intermediate cachekv/FastStore instances that would be immediately discarded.
2. GOGC=off + GOMEMLIMIT=16GB during ProcessTXsWithOCCGiga: disables the percentage-based GC trigger and uses memory-limit-based GC instead. GC only runs when approaching the 16GB limit, dramatically reducing GC frequency during the allocation-heavy OCC execution path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add EVMPool using sync.Pool to reuse vm.EVM structs across txs within a block (same block ctx, chain config, precompiles, interpreter). Uses vm.EVM.Reset() from the local go-ethereum fork.
- Pool TemporaryState: reuse maps (clear instead of reallocate)
- Pool DBImpl: reuse struct and pre-allocated slices
- go.work: use local go-ethereum fork with Reset() method

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
newFastCacheMultiStoreFromCMS was eagerly materializing all parent stores via cachekv.NewStore for ~50 store keys on every Snapshot, even though only 3-4 stores are accessed per EVM transaction. Now passes both stores and parents through to the child CMS, letting getOrCreateStore create FastStore wrappers lazily on demand. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
AccAddress.String() uses a global sync.Mutex to protect an LRU cache. With 24 OCC workers calling it concurrently, this caused 35s of mutex delay per 30s profile sample (8% of total mutex contention). Replace with sync.Map which provides lock-free concurrent reads. Trade-off: no LRU eviction (cache grows unbounded), but the number of unique addresses is bounded by accounts on chain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
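Roughly the trade being made, with simplified types standing in for the bech32 address cache; the encode callback is a placeholder for the real conversion:

```go
// Sketch: lock-free read path via sync.Map instead of a mutex-guarded LRU.
// Trade-off: no eviction, but the key space (on-chain addresses) is bounded.
package main

import (
	"encoding/hex"
	"sync"
)

var accAddrCache sync.Map // hex(address bytes) -> encoded string

func addressToString(aa []byte, encode func([]byte) string) string {
	key := hex.EncodeToString(aa)
	if v, ok := accAddrCache.Load(key); ok {
		return v.(string) // lock-free fast path for concurrent OCC workers
	}
	s := encode(aa)
	accAddrCache.Store(key, s)
	return s
}

func main() {
	enc := func(b []byte) string { return "sei1" + hex.EncodeToString(b) } // stand-in for bech32 encoding
	_ = addressToString([]byte{0xde, 0xad}, enc)
	_ = addressToString([]byte{0xde, 0xad}, enc) // second call hits the cache
}
```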
The worker goroutines used select{} with ctx.Done() + work channel,
causing massive lock contention: each select call locks BOTH channels,
and with 24+ workers competing this consumed 28% CPU on lock2+selectgo.
The ctx.Done() case was only used for cleanup when ProcessAll returns,
never during actual work processing. Replace with for-range over the
work channel and close channels when done. This eliminates selectgo
entirely and halves lock contention (1 channel lock instead of 2).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
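A minimal sketch of the worker-loop change, with a simplified task type; `for range ch` locks one channel per receive, while a select over ctx.Done() plus the work channel locks both on every iteration:

```go
// Sketch: for-range over a closed channel replaces select{ctx.Done(), work}.
package main

import (
	"fmt"
	"sync"
)

type task struct{ index int }

func runWorkers(workers int, work <-chan task) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// for-range exits when the channel is closed: no ctx.Done()
			// case and therefore no selectgo in the hot loop.
			for t := range work {
				_ = t.index // execute the task
			}
		}()
	}
	wg.Wait()
}

func main() {
	work := make(chan task, 8)
	done := make(chan struct{})
	go func() { runWorkers(4, work); close(done) }()
	for i := 0; i < 8; i++ {
		work <- task{index: i}
	}
	close(work) // closing the channel replaces context cancellation for shutdown
	<-done
	fmt.Println("all workers exited")
}
```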
Previously, prepareTask eagerly created a VersionIndexedStore for every store key (~30+) even though EVM transactions only access 3-4 stores. Each VersionIndexedStore allocates readset/writeset maps, so this wasted ~5GB of allocations per 30s benchmark. Now VersionIndexedStores are created lazily on first access via handler callbacks stored in the CMS. The CacheMultiStoreForOCC no longer eagerly calls handlers for all keys - stores are materialized by getOrCreateStore and GetGigaKVStore only when accessed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
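A sketch of the lazy-materialization pattern with simplified stand-ins for VersionIndexedStore and the OCC CMS; handler callbacks create a store only on first access:

```go
// Sketch: stores registered as handlers, materialized only when touched.
package main

import "fmt"

type StoreKey string

type versionIndexedStore struct {
	readset  map[string][]byte
	writeset map[string][]byte
}

type occCMS struct {
	stores   map[StoreKey]*versionIndexedStore
	handlers map[StoreKey]func() *versionIndexedStore
}

func newOCCCMS(handlers map[StoreKey]func() *versionIndexedStore) *occCMS {
	return &occCMS{stores: make(map[StoreKey]*versionIndexedStore), handlers: handlers}
}

// getOrCreateStore materializes a store only when a transaction accesses it,
// so the ~30 registered keys don't all pay the readset/writeset allocation.
func (c *occCMS) getOrCreateStore(key StoreKey) *versionIndexedStore {
	if s, ok := c.stores[key]; ok {
		return s
	}
	s := c.handlers[key]()
	c.stores[key] = s
	return s
}

func main() {
	created := 0
	handlers := map[StoreKey]func() *versionIndexedStore{}
	for _, k := range []StoreKey{"evm", "bank", "acc", "staking", "gov"} {
		handlers[k] = func() *versionIndexedStore {
			created++
			return &versionIndexedStore{readset: map[string][]byte{}, writeset: map[string][]byte{}}
		}
	}
	cms := newOCCCMS(handlers)
	cms.getOrCreateStore("evm")
	cms.getOrCreateStore("bank")
	fmt.Println("stores created:", created) // 2, not 5
}
```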
The giga executor uses infinite gas metering, making the gaskv.NewStore wrapper a pure allocation overhead (~3.34GB/30s). Add WithSkipGasKV() on Context that causes KVStore() and TransientStore() to return the raw store without the gaskv wrapper. Set this flag in executeEVMTxWithGigaExecutor. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
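An illustrative sketch of the SkipGasKV flag with heavily simplified types; the real Context/KVStore plumbing is richer, but the short-circuit works the same way:

```go
// Sketch: a context flag that returns the raw store instead of wrapping it
// in a gas-metering layer that is pure overhead under infinite gas.
package main

type KVStore interface {
	Get(key []byte) []byte
}

type rawStore struct{ data map[string][]byte }

func (s *rawStore) Get(key []byte) []byte { return s.data[string(key)] }

// gasStore is the metering wrapper that becomes pure allocation overhead
// when gas is effectively infinite.
type gasStore struct{ parent KVStore }

func (g *gasStore) Get(key []byte) []byte { return g.parent.Get(key) } // plus gas accounting

type Context struct {
	skipGasKV bool
}

func (c Context) WithSkipGasKV() Context { c.skipGasKV = true; return c }

func (c Context) KVStore(raw KVStore) KVStore {
	if c.skipGasKV {
		return raw // giga executor path: no gaskv wrapper allocated
	}
	return &gasStore{parent: raw}
}

func main() {
	raw := &rawStore{data: map[string][]byte{"k": []byte("v")}}
	ctx := Context{}.WithSkipGasKV()
	_ = ctx.KVStore(raw).Get([]byte("k"))
}
```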
Creating 1000 goroutines per block for validation causes significant scheduling overhead (goroutine creation/destruction, stack allocation). Validation is lightweight (readset comparison), so matching the execution worker count (24) is sufficient. The buffered channel ensures all validation work items can be enqueued without blocking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All OCC tasks shared the parent context's EventManager, causing 24 worker goroutines to fight over a single RWMutex for every EmitEvent call. This accounted for ~60% of all mutex contention (25.4s/41.8s). Create a fresh EventManager per task in prepareTask() to eliminate cross-task contention entirely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Events emitted during OCC task execution are never collected into the response (Events field is nil in giga executor results). Replace the per-task EventManager with a noop version that discards all events, eliminating ~8.4GB of allocations per 30s (MarkEventsToIndex 3.48GB + NewEvent 3.56GB + AppendEvent 1.35GB) and their associated GC pressure. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
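A minimal sketch of the noop EventManager idea (the real sdk EventManager API is larger); the IsNoop() guard shown here anticipates the later commits that skip building event objects at the call sites:

```go
// Sketch: a noop event manager discards events, and an IsNoop() check lets
// emitters avoid allocating the event objects at all.
package main

type Event struct {
	Type       string
	Attributes map[string]string
}

type EventManagerI interface {
	EmitEvent(Event)
	IsNoop() bool
}

type eventManager struct{ events []Event }

func (em *eventManager) EmitEvent(e Event) { em.events = append(em.events, e) }
func (em *eventManager) IsNoop() bool      { return false }

type noopEventManager struct{}

func (noopEventManager) EmitEvent(Event) {}
func (noopEventManager) IsNoop() bool    { return true }

func transfer(em EventManagerI, from, to string) {
	// Guarding on IsNoop avoids building the Event and its attribute map.
	if !em.IsNoop() {
		em.EmitEvent(Event{Type: "transfer", Attributes: map[string]string{"from": from, "to": to}})
	}
}

func main() {
	transfer(noopEventManager{}, "a", "b") // OCC giga path: nothing allocated
	transfer(&eventManager{}, "a", "b")    // normal path: event recorded
}
```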
…on store The multiversion store used three sync.Map instances per store for txWritesetKeys, txReadSets, and txIterateSets (keyed by integer tx indices). With ~30 stores per block, this created ~90 sync.Map objects whose internal hash-trie nodes (newIndirectNode, newEntryNode) drove significant allocation overhead. Replace these with pre-allocated slices sized to the tx count, eliminating all sync.Map overhead for per-tx metadata. The giga scheduler now passes the max absolute index to NewMultiVersionStoreWithTxCount for upfront allocation. A growSlices fallback handles the V2 scheduler and test paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
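A sketch of the slice-per-tx-index layout with illustrative names; the real multiversion store tracks writeset keys, readsets, and iteratesets per store key:

```go
// Sketch: pre-allocated slices indexed by tx index replace per-store sync.Maps,
// with a grow fallback for callers that don't know the tx count up front.
package main

type readSet map[string][]byte

type mvMetadata struct {
	txReadSets     []readSet  // index == transaction index
	txWritesetKeys [][]string // keys written by each transaction
}

// newMVMetadata pre-allocates for a known tx count (the giga scheduler passes
// the max absolute index up front).
func newMVMetadata(txCount int) *mvMetadata {
	return &mvMetadata{
		txReadSets:     make([]readSet, txCount),
		txWritesetKeys: make([][]string, txCount),
	}
}

// growSlices is the fallback for paths (V2 scheduler, tests) without a count.
func (m *mvMetadata) growSlices(idx int) {
	for len(m.txReadSets) <= idx {
		m.txReadSets = append(m.txReadSets, nil)
		m.txWritesetKeys = append(m.txWritesetKeys, nil)
	}
}

func (m *mvMetadata) setReadSet(txIdx int, rs readSet) {
	m.growSlices(txIdx)
	m.txReadSets[txIdx] = rs
}

func main() {
	m := newMVMetadata(100)
	m.setReadSet(3, readSet{"key": []byte("v")})
	m.setReadSet(150, nil) // fallback growth path
}
```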
LockedCoins calls AccountKeeper.GetAccount + protobuf UnpackAny solely to check if an account is a vesting account. For standard (non-vesting) accounts it always returns zero — wasting ~15GB of allocations per 30s. In the giga executor path, all accounts are standard EVM accounts (never vesting), so we skip LockedCoins entirely in both GetBalance and SubUnlockedCoins. This eliminates the redundant account deserialization on every balance check and debit operation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
In the OCC giga executor path, events are never collected. The noop EventManager already discards emitted events, but call sites still allocated event objects (NewEvent, NewAttribute, String conversions) before passing them to EmitEvent. Add IsNoop() check at all 8 emission sites in send.go so the objects are never created. Also change eventsSuppressed path in state/balance.go to use NewNoopEventManager instead of NewEventManager, avoiding both the event allocation and the mutex/slice overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ToIndex Add IsNoop() guards to all 8 event emission sites in the original cosmos-sdk bank keeper (sei-cosmos/x/bank/keeper/send.go), matching the guards already added to the giga fork. This saves ~1.7GB allocs in the OCC execution path. Optimize MarkEventsToIndex: when indexAll is true (empty indexSet), set the Index flag in-place on existing attributes instead of copying the entire events slice. Also replace fmt.Sprintf with string concat for the selective indexing path. Saves ~3.9GB allocs per 30s. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
GC parameter changes (GOGC, GOMEMLIMIT, etc.) are not legitimate optimizations. Focus on reducing allocations and algorithmic efficiency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the single sync.Map with 64 shards keyed by FNV-1a hash of the key string. This reduces cache-line contention when 24 OCC workers write concurrently via LoadOrStore, pushing the batch-size cliff higher and reducing lock spinning time in the runtime. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Snapshot() and Copy() were creating a regular EventManager even when the parent context had a noop one. In the OCC execution path, the scheduler sets a noop EventManager to skip all event allocation, but after the first Snapshot() call this was replaced with a real EventManager, causing ~31M unnecessary event object allocations per 120s benchmark run (~1.5GB). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sync.Map's internal hash-trie nodes (newIndirectNode + newEntryNode) allocated ~5.4GB per 120s benchmark. With 64 shards and only 24 OCC workers, contention per shard is low (~0.375 goroutines/shard), making a simple map+mutex more efficient than sync.Map's lockless read path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
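Roughly the shape after this commit, assuming 64 shards selected by FNV-1a as described in the earlier sharding commit; the real shards hold multiversion items rather than plain byte slices:

```go
// Sketch: fixed shard array, FNV-1a shard selection, plain map + mutex per shard.
package main

import "sync"

const numShards = 64

type shard struct {
	mu sync.Mutex
	m  map[string][]byte
}

type shardedMap struct {
	shards [numShards]shard
}

func newShardedMap() *shardedMap {
	sm := &shardedMap{}
	for i := range sm.shards {
		sm.shards[i].m = make(map[string][]byte)
	}
	return sm
}

// fnv1a hashes the key; the low bits pick the shard.
func fnv1a(key string) uint64 {
	var h uint64 = 14695981039346656037
	for i := 0; i < len(key); i++ {
		h ^= uint64(key[i])
		h *= 1099511628211
	}
	return h
}

func (sm *shardedMap) Store(key string, val []byte) {
	s := &sm.shards[fnv1a(key)%numShards]
	s.mu.Lock()
	s.m[key] = val
	s.mu.Unlock()
}

func (sm *shardedMap) Load(key string) ([]byte, bool) {
	s := &sm.shards[fnv1a(key)%numShards]
	s.mu.Lock()
	v, ok := s.m[key]
	s.mu.Unlock()
	return v, ok
}

func main() {
	sm := newShardedMap()
	sm.Store("acc/0x1", []byte{42})
	_, _ = sm.Load("acc/0x1")
}
```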
GetLatestBeforeIndex accounts for 52.6% of all mutex delay because the plain Mutex forces exclusive locking even for read-only map lookups. Switch to RWMutex so concurrent readers (24 OCC workers) don't block each other. Write paths (SetWriteset, InvalidateWriteset, SetEstimatedWriteset) keep exclusive Lock. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ock)" This reverts commit 2f148ca.
Each multiVersionItem used a btree.BTree (degree 2) + FreeList to store per-transaction versions. With typically only 1-3 entries per key, the btree overhead was massive: ~392 bytes per key for the BTree struct, FreeList, and backing array. Profile showed 2.87GB allocated for btree infrastructure and 34M temporary pivot objects in GetLatestBeforeIndex. Replace with a simple sorted []*valueItem slice (initial capacity 2). Linear scan on 1-3 items is faster than btree traversal and eliminates all btree allocations, saving ~3.4GB per 120s benchmark. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
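A sketch of the sorted-slice replacement under the stated assumption of 1-3 versions per key; valueItem and GetLatestBeforeIndex follow the commit text, but the layout is illustrative:

```go
// Sketch: sorted slice of per-tx versions with a linear scan instead of a btree.
package main

import "fmt"

type valueItem struct {
	txIndex int
	value   []byte
}

type multiVersionItem struct {
	versions []*valueItem // kept sorted ascending by txIndex
}

func (m *multiVersionItem) Set(txIndex int, value []byte) {
	for i, v := range m.versions {
		if v.txIndex == txIndex {
			v.value = value
			return
		}
		if v.txIndex > txIndex {
			m.versions = append(m.versions, nil)
			copy(m.versions[i+1:], m.versions[i:])
			m.versions[i] = &valueItem{txIndex: txIndex, value: value}
			return
		}
	}
	m.versions = append(m.versions, &valueItem{txIndex: txIndex, value: value})
}

// GetLatestBeforeIndex returns the newest version written by a lower-index tx;
// a backward linear scan over 1-3 items beats btree traversal and allocates nothing.
func (m *multiVersionItem) GetLatestBeforeIndex(txIndex int) (*valueItem, bool) {
	for i := len(m.versions) - 1; i >= 0; i-- {
		if m.versions[i].txIndex < txIndex {
			return m.versions[i], true
		}
	}
	return nil, false
}

func main() {
	var item multiVersionItem
	item.Set(3, []byte("a"))
	item.Set(7, []byte("b"))
	v, ok := item.GetLatestBeforeIndex(5)
	fmt.Println(ok, string(v.value)) // true a
}
```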
…llocs in hot path

Mutex profile shows 32.89s contention in GetLatestBeforeIndex (56% of all mutex delay). Two changes to address this:
1. Increase shard count from 64 to 256: reduces collision probability ~4x with 24 OCC workers, directly reducing lock2 (14.21%) and usleep (13.33%) CPU overhead.
2. Add mvShardIdxBytes for byte-slice keys and use Go's m[string(b)] compiler optimization in GetLatest, GetLatestBeforeIndex, and Has. This avoids a string allocation per call (previously string(key) was stored in a variable, defeating the optimization). Also add getLatestBeforeIndexStr for checkReadsetAtIndex to avoid string→[]byte→string round-trips. See the sketch after this commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
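The map-lookup detail in change 2, shown in isolation: the compiler elides the []byte-to-string allocation only when the conversion appears directly inside the index expression, so binding it to a variable first defeats it.

```go
// Sketch of the m[string(b)] lookup optimization.
package main

import "fmt"

func lookup(m map[string]int, key []byte) (int, bool) {
	// Allocation-free: string(key) is used directly as the map index.
	v, ok := m[string(key)]
	return v, ok
}

func lookupSlow(m map[string]int, key []byte) (int, bool) {
	k := string(key) // forces a heap allocation per call
	v, ok := m[k]
	return v, ok
}

func main() {
	m := map[string]int{"tx/1": 7}
	fmt.Println(lookup(m, []byte("tx/1")))
	fmt.Println(lookupSlow(m, []byte("tx/1")))
}
```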
- Guard traceSpan with span.IsRecording() to avoid sha256 + fmt.Sprintf + attribute allocation when tracing is disabled (~1.5GB allocs saved)
- Add WithSkipGasKV() to the EndBlock context since it uses an infinite gas meter, avoiding gaskv.NewStore wrapper allocations (~2.2GB saved)
- OR bloom bytes in-place in SetBlockBloom/SetEvmOnlyBlockBloom to avoid per-iteration allocation of 256-byte slices (see the sketch below)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
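A sketch of the in-place bloom OR from the last bullet, assuming 256-byte blooms as in go-ethereum:

```go
// Sketch: OR each tx bloom into the block bloom without allocating a new slice.
package main

import "fmt"

const bloomLength = 256

// orBloomInPlace ORs src into dst in place.
func orBloomInPlace(dst, src []byte) {
	for i := 0; i < bloomLength && i < len(dst) && i < len(src); i++ {
		dst[i] |= src[i]
	}
}

func main() {
	block := make([]byte, bloomLength)
	txBlooms := [][]byte{make([]byte, bloomLength), make([]byte, bloomLength)}
	txBlooms[0][3] = 0x80
	txBlooms[1][3] = 0x01
	for _, b := range txBlooms {
		orBloomInPlace(block, b)
	}
	fmt.Printf("%08b\n", block[3]) // 10000001
}
```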
Make giga store creation lazy in newFastFromKVStore (Snapshot path). Previously, CacheMultiStoreGiga eagerly created gigacachekv.NewFastStore for ALL giga keys. Now stores are created on first access via gigaHandler, and the parent's handler is chained so lazily-created parent stores are also available. EVM snapshots typically touch 3-4 stores out of ~20 giga keys, saving ~1.5GB of allocations per benchmark run. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…edCoins

Three optimizations targeting sequential block processing overhead:
1. WriteLatestToStore: iterate shards directly instead of collecting all keys into a slice, sorting, and re-looking-up each shard. The parent cachekv store sorts internally during Write(), making the sort redundant.
2. Reduce OCC abort channel capacity from len(multiVersionStores) (~30) to 1. Only the first abort is consumed; subsequent ones are silently dropped via select/default (see the sketch below).
3. Skip the LockedCoins call in EVM EndBlock for per-tx coinbase addresses. These are temporary accounts that are never vesting accounts, so locked coins is always zero.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
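A sketch of the capacity-1 abort channel from item 2; only the first abort is kept, and later ones are dropped via select/default instead of being buffered:

```go
// Sketch: drop-on-full abort channel of capacity 1.
package main

import "fmt"

type abort struct{ txIndex int }

func sendAbort(ch chan abort, a abort) {
	select {
	case ch <- a: // first abort wins
	default: // channel already holds an abort; drop silently
	}
}

func main() {
	aborts := make(chan abort, 1)
	sendAbort(aborts, abort{txIndex: 4})
	sendAbort(aborts, abort{txIndex: 9}) // dropped
	fmt.Println(<-aborts)                // {4}
}
```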
Per-tx coinbase sweeps in EndBlock emit thousands of transfer events per block (NewEvent + AppendEvents = ~4.4GB allocs/benchmark). These are internal bookkeeping operations that move fees from per-tx coinbase addresses to the actual coinbase. Use NoopEventManager for the sweep loop to eliminate event allocation overhead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
```go
for k, v := range store.cache {
	store.cache[k] = types.NewCValue(v.Value(), false)
}
```
Check warning (Code scanning / CodeQL): Iteration over map
```go
for k, v := range store.cache {
	kbz := []byte(k)
	if bytes.Compare(kbz, start) < 0 || bytes.Compare(kbz, end) >= 0 {
		continue
	}
	if v.Value() == nil {
		delete(keyStrs, k)
	} else {
		keyStrs[k] = struct{}{}
	}
}
```
Check warning (Code scanning / CodeQL): Iteration over map
```go
for k := range keyStrs {
	res = append(res, k)
}
```
Check warning (Code scanning / CodeQL): Iteration over map
```go
func (store *FastStore) Iterator(start, end []byte) types.Iterator {
	panic("unexpected iterator call on fast giga cachekv store")
}
```
Check warning (Code scanning / CodeQL): Panic in BeginBock or EndBlock consensus methods
```go
func (store *FastStore) ReverseIterator(start, end []byte) types.Iterator {
	panic("unexpected reverse iterator call on fast giga cachekv store")
}
```
Check warning (Code scanning / CodeQL): Panic in BeginBock or EndBlock consensus methods
```go
return cacheBech32Addr(GetConfig().GetBech32AccountAddrPrefix(), aa, accAddrCache, key)
bech32Addr, err := bech32.ConvertAndEncode(GetConfig().GetBech32AccountAddrPrefix(), aa)
if err != nil {
	panic(err)
}
```
Check warning (Code scanning / CodeQL): Panic in BeginBock or EndBlock consensus methods
x/evm/keeper/abci.go (outdated diff): four additional "Panic in BeginBock or EndBlock consensus methods" warnings from Code scanning / CodeQL on since-updated code.
GC tuning (SetGCPercent, SetMemoryLimit) is not a valid application-level optimization. Remove the GOGC=off + GOMEMLIMIT=16GB block that was added in ProcessTXsWithOCCGiga. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Series of profiling-driven optimizations targeting the executeEVMTxWithGigaExecutor hot path and sequential block processing in the OCC (Optimistic Concurrency Control) scheduler.

Result: ~15K median / ~17K max TPS -> ~19K median / ~20K max TPS (GIGA_EXECUTOR=true GIGA_OCC=true, 12-core macOS, goleveldb)
Key optimizations (in order of impact):
See benchmark/analysis/optimization-breakdown.md for full per-commit impact analysis.

Test plan
🤖 Generated with Claude Code