Add more metrics for snapshot and state sync #2879
* main:
  chore: remove wasm dir on unsafe-reset (#2875)
  fix: respect existing genesis file (#2868)
  fix to halt due to reconstructing block from bad proposal (backported #2823) (#2873)
  chore(refactor): drop unused code (#2811)
  made the peer dialing less aggressive (backported #2799) (#2872)
  perf(store): lazy-init `sortedCache` in `cachekv.Store` (#2804)
  feat: embed genesis for well-known chains (#2835)
  fix: use MADV_RANDOM during loadtree (#2857)
sei-db/state_db/sc/memiavl/db.go
Outdated
        db.logger.Error("failed to prune snapshot", "err", err)
    } else {
        db.logger.Info("successfully pruned snapshot", "name", name)
        otelMetrics.SnapshotPruneCount.Add(context.Background(), 1)
We probably want to measure the failure rate too, right?
In that case, you could use the same metric and tag it by status.
    // writeLeaf sends leaf and KV write operations to the pipeline
    func (w *snapshotWriter) writeLeaf(version uint32, key, value, hash []byte) error {
        // Track channel fill metrics for all channels
Removing these, since they don't appear to be used.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2879      +/-   ##
==========================================
+ Coverage   57.16%   57.17%   +0.01%
==========================================
  Files        2091     2091
  Lines      171145   171528     +383
==========================================
+ Hits        97827    98068     +241
- Misses      64609    64710     +101
- Partials     8709     8750      +41
    // catchup the remaining entries in rlog
    startTime := time.Now()
Check warning: Code scanning / CodeQL: Calling the system time (Warning)
    cloned.logger.Info("snapshot rewrite process completed", "duration_sec", totalElapsed, "duration_min", totalElapsed/60)
    otelMetrics.SnapshotCreationLatency.Record(
    totalRewriteElapsed := time.Since(startTime).Seconds()
    cloned.logger.Info("snapshot rewrite process completed", "duration_sec", totalRewriteElapsed, "duration_min", totalRewriteElapsed/60)
Check notice: Code scanning / CodeQL: Floating point arithmetic (Note)
    )),
    SnapshotRewriteCount: must(meter.Int64Counter(
        "memiavl_snapshot_rewrite_count",
        metric.WithDescription("Total num of times memiavl snapshot rewrite attempts"),
nit: the grammar seems a little off. How about: "Total number of memiavl snapshot rewrite attempts"?
Yup, that looks better!
Describe your changes and provide context
This PR adds more visibility into MemIAVL snapshot creation, replay, and pruning, as well as the state sync snapshot creation process.
With these metrics, we should be better able to correlate timing with performance changes related to snapshot behavior.
Testing performed to validate your change
Tested locally and verified that the metrics work.