Add more metrics for snapshot and state sync #2879

Open
yzang2019 wants to merge 7 commits into main from yang/add-metrics-statesync

@yzang2019 yzang2019 commented Feb 12, 2026

Describe your changes and provide context

This PR adds more visibility into MemIAVL snapshot creation, replay, and pruning, as well as the state sync snapshot creation process.

With these metrics, we should be better able to correlate performance changes with snapshot behavior.

Testing performed to validate your change

Tested locally and verified that the metrics work.

* main:
  chore: remove wasm dir on unsafe-reset (#2875)
  fix: respect existing genesis file (#2868)
  fix to halt due to reconstructing block from bad proposal (backported #2823) (#2873)
  chore(refactor): drop unused code (#2811)
  made the peer dialing less aggressive (backported #2799) (#2872)
  perf(store): lazy-init `sortedCache` in `cachekv.Store` (#2804)
  feat: embed genesis for well-known chains (#2835)
  fix: use MADV_RANDOM during loadtree (#2857)
github-actions bot commented Feb 12, 2026

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ✅ passed | ✅ passed | ✅ passed | Feb 13, 2026, 7:04 PM |

	db.logger.Error("failed to prune snapshot", "err", err)
} else {
	db.logger.Info("successfully pruned snapshot", "name", name)
	otelMetrics.SnapshotPruneCount.Add(context.Background(), 1)
Collaborator

We probably want to measure the failure rate too, right?

In that case, you could use the same metric and tag it by status.

Contributor Author

Good idea!


// writeLeaf sends leaf and KV write operations to the pipeline
func (w *snapshotWriter) writeLeaf(version uint32, key, value, hash []byte) error {
	// Track channel fill metrics for all channels
Contributor Author

Removing these since they don't seem to be used.

codecov bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 73.07692% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.17%. Comparing base (9ccd5ca) to head (ab38167).
⚠️ Report is 8 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| sei-db/state_db/sc/memiavl/db.go | 75.00% | 4 Missing and 2 partials ⚠️ |
| sei-cosmos/storev2/rootmulti/store.go | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2879      +/-   ##
==========================================
+ Coverage   57.16%   57.17%   +0.01%     
==========================================
  Files        2091     2091              
  Lines      171145   171528     +383     
==========================================
+ Hits        97827    98068     +241     
- Misses      64609    64710     +101     
- Partials     8709     8750      +41     
| Flag | Coverage Δ |
| --- | --- |
| sei-chain | 52.64% <73.07%> (+0.02%) ⬆️ |
| sei-cosmos | 48.16% <0.00%> (+0.02%) ⬆️ |
| sei-db | 68.72% <ø> (ø) |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
| --- | --- |
| sei-db/state_db/sc/memiavl/metrics.go | 50.00% <ø> (ø) |
| sei-db/state_db/sc/memiavl/multitree.go | 79.22% <100.00%> (+0.06%) ⬆️ |
| sei-db/state_db/sc/memiavl/snapshot.go | 59.37% <ø> (-0.93%) ⬇️ |
| sei-cosmos/storev2/rootmulti/store.go | 41.76% <0.00%> (-0.09%) ⬇️ |
| sei-db/state_db/sc/memiavl/db.go | 65.81% <75.00%> (-0.50%) ⬇️ |

... and 53 files with indirect coverage changes

}

// catchup the remaining entries in rlog
startTime := time.Now()

Check warning (Code scanning / CodeQL): Calling the system time
Calling the system time may be a possible source of non-determinism.
cloned.logger.Info("snapshot rewrite process completed", "duration_sec", totalElapsed, "duration_min", totalElapsed/60)
otelMetrics.SnapshotCreationLatency.Record(
totalRewriteElapsed := time.Since(startTime).Seconds()
cloned.logger.Info("snapshot rewrite process completed", "duration_sec", totalRewriteElapsed, "duration_min", totalRewriteElapsed/60)

Check notice (Code scanning / CodeQL): Floating point arithmetic
Floating point arithmetic operations are not associative and a possible source of non-determinism.
)),
SnapshotRewriteCount: must(meter.Int64Counter(
"memiavl_snapshot_rewrite_count",
metric.WithDescription("Total num of times memiavl snapshot rewrite attempts"),
Contributor

nit: the grammar seems a little off. How about: "Total number of memiavl snapshot rewrite attempts"

Contributor Author

Yup, that looks better!

3 participants
