
persist: don't crash on undecodable rollup during usage scan#37017

Draft
DAlperin wants to merge 1 commit into main from dovalperin/per-23-thread-tokiowork-9-panicked-at-srcpersist-ed29


@DAlperin

Member

Motivation

The periodic storage-usage collection (storage_usage_fetch -> StorageUsageClient::shards_usage_referenced) scans every live shard and decodes its rollups for billing. Rollup decoding calls expect on the encoded state, so a single corrupt or undecodable rollup in any one shard panics the task. Because the scan runs in a spawned task under the abort-on-panic handler, that panic becomes a full process crash:

internal error: invalid encoded state: DecodeError { description: InvalidKey { key: 1781049768001 }, stack: [] }

This is a best-effort billing scan that touches every shard in the environment, so one bad shard should not be able to take down the whole process.

Description

Catch panics per shard inside shards_usage_referenced via ore_catch_unwind. On a panic, we log the error, bump a new mz_persist_usage_referenced_shard_failures counter, and report that shard as zero usage while the remaining shards are computed normally. The "fail loudly" behavior on the actual data path is unchanged; only this periodic best-effort scan is made resilient.
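For illustration, the per-shard isolation described above can be sketched as follows. This is a minimal standalone sketch, not the code in this PR: it uses `std::panic::catch_unwind` from the standard library in place of the ore helper, and `shard_usage`, the `0xFF` "undecodable" marker, the `BTreeMap` input, and the `SHARD_FAILURES` atomic (standing in for the `mz_persist_usage_referenced_shard_failures` counter) are all hypothetical names invented for the example. It also assumes the binary is built with unwinding panics.

```rust
use std::collections::BTreeMap;
use std::panic::{self, AssertUnwindSafe};
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the `mz_persist_usage_referenced_shard_failures` counter.
pub static SHARD_FAILURES: AtomicU64 = AtomicU64::new(0);

/// Hypothetical per-shard computation. It panics on an "undecodable" rollup,
/// mimicking an `expect` on the decoded state. Here a leading 0xFF byte is
/// treated as undecodable purely for the sake of the example.
pub fn shard_usage(rollup: &[u8]) -> u64 {
    if rollup.first() == Some(&0xFF) {
        panic!("invalid encoded state");
    }
    rollup.len() as u64
}

/// Computes usage for every shard, isolating panics to the shard that caused them.
pub fn shards_usage_referenced(shards: &BTreeMap<String, Vec<u8>>) -> BTreeMap<String, u64> {
    let mut usage = BTreeMap::new();
    for (shard_id, rollup) in shards {
        // Catch a panic from this shard only; other shards are unaffected.
        let result = panic::catch_unwind(AssertUnwindSafe(|| shard_usage(rollup)));
        let bytes = match result {
            Ok(bytes) => bytes,
            Err(payload) => {
                // Panic payloads are usually a &str or a String.
                let msg = payload
                    .downcast_ref::<&str>()
                    .map(|s| s.to_string())
                    .or_else(|| payload.downcast_ref::<String>().cloned())
                    .unwrap_or_else(|| "unknown panic".to_string());
                eprintln!("usage scan failed for shard {shard_id}: {msg}");
                SHARD_FAILURES.fetch_add(1, Ordering::Relaxed);
                // Report the failed shard as zero usage.
                0
            }
        };
        usage.insert(shard_id.clone(), bytes);
    }
    usage
}
```

Calling `shards_usage_referenced` on a map with one valid and one undecodable rollup returns an entry for both shards, with the bad one reported as 0 and the failure counter incremented once. Reporting zero rather than omitting the shard keeps the result shape stable for callers, at the cost of under-reporting usage for that shard until the rollup is repaired.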

Verification

New test usage_referenced_undecodable_rollup writes a shard, overwrites its rollup blob(s) with undecodable bytes, and asserts that shards_usage_referenced returns cleanly (reporting the shard as 0 bytes) and increments the failure counter, rather than panicking.

The periodic storage-usage collection (`storage_usage_fetch` ->
`shards_usage_referenced`) scans every live shard and decodes its
rollups. Rollup decoding `expect`s the encoded state to be valid, so a
single corrupt/undecodable rollup in any one shard panicked the task,
which the abort-on-panic handler turned into a full process crash (e.g.
`invalid encoded state: DecodeError { description: InvalidKey { .. } }`).

This is a best-effort billing scan, so a bad shard should be skipped,
not bring down the environment. Catch panics per shard, log the error,
bump a new `mz_persist_usage_referenced_shard_failures` counter, and
report that shard as zero usage while the rest are computed normally.

Generated with [Linear](https://linear.app/materializeinc/issue/PER-23/thread-tokiowork-9-panicked-at-srcpersist#agent-session-c16c7454)

Co-authored-by: linear-code[bot] <222613912+linear-code[bot]@users.noreply.github.com>