Skip to content

fix(depot): teach doctor to diagnose reclaimed history via shard-seeded replay#5222

Open
NathanFlurry wants to merge 1 commit into
stack/feat-depot-watermark-aligned-delta-reclaim-and-retention-engine-mkqlmntyfrom
stack/fix-depot-teach-doctor-to-diagnose-reclaimed-history-via-shard-seeded-replay-ulwkqwsz
Open

fix(depot): teach doctor to diagnose reclaimed history via shard-seeded replay#5222
NathanFlurry wants to merge 1 commit into
stack/feat-depot-watermark-aligned-delta-reclaim-and-retention-engine-mkqlmntyfrom
stack/fix-depot-teach-doctor-to-diagnose-reclaimed-history-via-shard-seeded-replay-ulwkqwsz

Conversation

@NathanFlurry

Copy link
Copy Markdown
Member

No description provided.

@NathanFlurry

Copy link
Copy Markdown
Member Author

Stack for rivet-dev/rivet

Get stack: forklift get 5222
Push local edits: forklift submit
Merge when ready: forklift merge 5222

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Code Review

Overview

This PR teaches the depot doctor to correctly diagnose databases with reclaimed history. When the reclaimer prunes delta history at or below the hot watermark, the doctor previously saw the gaps as corruption. The fix seeds the sequential replay image from hot shard coverage at the watermark and only replays deltas above it, then threads replay_floor_txid through the analysis functions so gap-detection and sub-replay search are bounded to the safe range.

Code Quality

Correctness

  • The replay_floor_txid.min(selected_txid) re-application inside build_sequential_replay (line 1586) is not redundant even though it was already applied in load_storage_facts. Sub-replays in find_first_bad_txid and find_first_resolver_divergence pass a mid-range commit.txid as selected_txid, so the inner clamp is necessary to prevent the floor from exceeding the target commit. Correct.

  • The min_txid.max(storage.replay_floor_txid.saturating_add(1)) adjustment in both find_first_bad_txid and find_first_resolver_divergence prevents sub-replays below the floor. Correct.

  • analyze_commit_chain now starts the contiguity check at replay_floor_txid + 1 rather than txid 1, and includes replay_floor_txid in the JSON output for diagnostics. Correct.

Minor concern — error message lacks context

Line 1591:

.context("hot watermark commit row is missing for sequential replay")?;

If the watermark commit row is absent (e.g., a race where reclaim removes a keep-set row unexpectedly), the error gives no indication of which txid was expected. Consider:

.with_context(|| format!("hot watermark commit row missing for txid {} in sequential replay", replay_floor_txid))?;

Minor concern — O(pages × shard_pages) coverage check

Lines 1627–1633 iterate 1..=current_size and for each page scan all shard pages to detect uncovered pages. For large databases with many pages, this is quadratic. Building a HashSet<u32> of covered pages before the loop would flatten it to O(total_shard_pages + current_size). Given this runs in a diagnostic tool rather than the hot path, it is acceptable as-is, but worth noting.

No issue — separate transaction for watermark read

read_hot_watermark_txid issues its own transaction after the prefix scans in load_storage_facts complete. There is a narrow TOCTOU window if the watermark advances between these two reads. For a doctor diagnostic tool this is fine; just flagging it as an inherent trade-off of reading at different logical timestamps.

Test Coverage

The new doctor_reclaimed.rs test covers the core scenario well:

  • Two ancient-timestamp commits are folded into shard coverage by hot compaction, then punched by reclaim.
  • A third commit above the watermark survives with delta history intact.
  • The doctor is run and the verdict must be Healthy.

Suggestion — assert replay_floor_txid in the report

The test asserts the verdict is Healthy but does not verify that replay_floor_txid appears with the expected value in the analysis JSON. An assertion on report.analysis["commit_chain_analysis"]["replay_floor_txid"] would make the test a regression guard for the new field itself, not just the verdict.

Conventions

  • Comment style follows project conventions (complete sentences, no em dashes, lowercase log messages).
  • replay_floor_txid doc comment on the struct field is appropriately concise.
  • Test file lives under tests/ as required by the depot CLAUDE.md.
  • No new raw serde_bare encodings introduced.
  • No _ => fallthrough arms on any match introduced.

Summary

The approach is sound and edge cases are handled correctly: no shards when floor > 0 falls through to a zero-filled seed (correctly diagnosed as corrupted), sub-replays are clamped to above the floor, and contiguity analysis is bounded above the floor. The two items above are minor polish rather than blocking issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant