Skip to content

feat(depot): watermark-aligned delta reclaim and retention engine#5221

Open
NathanFlurry wants to merge 1 commit into
stack/test-depot-add-branch-history-snapshot-helper-for-exact-retention-assertions-tlrrrnmufrom
stack/feat-depot-watermark-aligned-delta-reclaim-and-retention-engine-mkqlmnty
Open

feat(depot): watermark-aligned delta reclaim and retention engine#5221
NathanFlurry wants to merge 1 commit into
stack/test-depot-add-branch-history-snapshot-helper-for-exact-retention-assertions-tlrrrnmufrom
stack/feat-depot-watermark-aligned-delta-reclaim-and-retention-engine-mkqlmnty

Conversation

@NathanFlurry

Copy link
Copy Markdown
Member

Collapse SQLite Depot history retention onto the hot watermark:

  • DELTA rows are reclaimed at or below the watermark with no per-shard or
    PIDX proof; the install that advances the watermark is the coverage proof.
  • COMMITS/VTX below the watermark survive only as keep-set islands (pins plus
    retained PITR interval representatives); superseded SHARD versions are GC'd
    once no covered txid reads through them.
  • Snapshot targets (pins, forks, restores) are fenced to covered-or-unfolded
    txids; versionstamp targets snap down to the newest covered point.
  • Truncate preserves shard history via pruned versions instead of deleting it;
    folding is per-coverage-txid and truncate-aware.
  • Bucket-fork pins materialize on covered points with a fail-safe fallback.
  • PITR policy scope is denormalized onto the branch record (vbare v2) to fix
    the plan-vs-validate livelock and drop the global pointer scan.
  • Hardening: bounded/paged reclaim scans, quota credit on delete, immediate
    drain re-arm, orphan-stage cleanup queue, fail-closed PIDX decode,
    future-clock interval coverage.

Collapse SQLite Depot history retention onto the hot watermark:

- DELTA rows are reclaimed at or below the watermark with no per-shard or
  PIDX proof; the install that advances the watermark is the coverage proof.
- COMMITS/VTX below the watermark survive only as keep-set islands (pins plus
  retained PITR interval representatives); superseded SHARD versions are GC'd
  once no covered txid reads through them.
- Snapshot targets (pins, forks, restores) are fenced to covered-or-unfolded
  txids; versionstamp targets snap down to the newest covered point.
- Truncate preserves shard history via pruned versions instead of deleting it;
  folding is per-coverage-txid and truncate-aware.
- Bucket-fork pins materialize on covered points with a fail-safe fallback.
- PITR policy scope is denormalized onto the branch record (vbare v2) to fix
  the plan-vs-validate livelock and drop the global pointer scan.
- Hardening: bounded/paged reclaim scans, quota credit on delete, immediate
  drain re-arm, orphan-stage cleanup queue, fail-closed PIDX decode,
  future-clock interval coverage.
@NathanFlurry

Copy link
Copy Markdown
Member Author

Stack for rivet-dev/rivet

Get stack: forklift get 5221
Push local edits: forklift submit
Merge when ready: forklift merge 5221

@claude

claude Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Code Review — watermark-aligned delta reclaim and retention engine

Overview

This PR replaces the old per-shard coverage-proof approach to delta reclaim with a simpler watermark rule: any DELTA at or below the hot watermark is deletable, with no PIDX or shard-version proof required. The constraint that makes this safe is a new fork-alignment fence that prevents forks, pins, and restores from targeting uncovered below-watermark txids. The PR also fixes a plan/validate livelock caused by reading the default PITR policy at planning time, adds quota credits on delete, and overhauls truncate to preserve shard history.

The design is coherent and the documentation in coverage.rs and db_reclaimer.rs is excellent. Several smaller issues are worth addressing before merge.


Issues

engine/CLAUDE.md shrink-write rule is now stale (line 89)

The existing rule says:

depot shrink writes must delete above-EOF PIDX rows and fully-above-EOF SHARD blobs inside the same commit/takeover transaction

The new truncate implementation keeps old SHARD versions intact for PITR history and instead writes a new pruned version at truncate_txid. The rule should be updated to reflect this, e.g.:

depot shrink writes must delete above-EOF PIDX rows and write a pruned SHARD version at the truncating txid for each affected shard; older SHARD versions are retained for PITR/pin history.

coverage.rs::snapshot_txid_is_resolvable reads its own clock

fn now_ms() -> Result<i64> {
    let millis = SystemTime::now()...
}

Every other call site in this file threads now_ms: i64 as a parameter (e.g. materialize_bucket_fork_pin, read_reclaim_input_snapshot). The standalone private function makes this fence harder to test deterministically and inconsistent with the rest of the module. Consider adding now_ms: i64 to snapshot_txid_is_resolvable's signature and removing the local function.

tracked_row_bytes — usize overflow before try_from

fn tracked_row_bytes(key: &[u8], value: &[u8]) -> i64 {
    i64::try_from(key.len() + value.len()).unwrap_or(i64::MAX)
}

key.len() + value.len() can panic on overflow before try_from even runs. This can't happen in practice with FDB value-size limits, but it's inconsistent with the saturating_add pattern used a few lines above in pidx_credit_bytes. Prefer:

i64::try_from(key.len().saturating_add(value.len())).unwrap_or(i64::MAX)

Staged shard cleanup hardcodes chunk_idx = 0 without a comment or assertion

In install_hot_job_tx, the new staged-row cleanup reads chunk 0 and clears chunk 0:

udb::compare_and_clear(
    tx,
    &keys::branch_compaction_stage_hot_shard_key(
        input.database_branch_id,
        input.job_id,
        output_ref.shard_id,
        output_ref.as_of_txid,
        0,  // chunk_idx always 0?
    ),
    staged_blob,
);

This is consistent with the existing read (also chunk 0), so it's correct if staged shards are always single-chunk. If that invariant holds, a debug_assert_eq!(chunk_idx, 0) or a brief comment ("stage phase writes single-chunk blobs") would prevent a future multi-chunk extension from silently leaving orphaned chunks.

pending_stage_cleanups.remove(0) is O(n)

let pending = state.pending_stage_cleanups.remove(0);

This shifts all remaining elements on each drain. The queue is expected to be tiny (one or two entries), but VecDeque<PendingStageCleanup> with pop_front is the idiomatic Rust choice for a FIFO drain and costs nothing to swap in.


Observations / Questions

prune_truncated_shard_value when all pages are above new EOF

The old code returned Ok(Some(Vec::new())) for an all-pages-above-EOF shard, which triggered a shard_clears path. The new code calls encode_ltx_v3(LtxHeader::delta(truncate_txid, 1, now_ms), &[]), writing a new empty shard version while leaving the old version. This is intentional (preserve old shard for PITR), but it's a silent behavioral change for the empty case — worth a short inline comment.

Shard GC cursor is opaque raw key bytes

shard_gc_cursor: Vec<u8> is threaded from refresh to plan to reclaim as a raw byte cursor. Verify the cursor is bounded to the shard prefix range and that a cursor past the last shard key correctly wraps back to the beginning (empty cursor = front of prefix, per the comment). If the prefix can change across schema versions, document the stability expectation.

reclaim_fdb_job_tx — rejected executes commit

The comment about "Rejected executes still commit this transaction, which is load-bearing: pin materializations must persist" is correct and important. It might be worth surfacing this in reclaim_fdb_job_tx's doc comment as well (or in the surrounding activity declaration), since the "transaction commits even on rejection" pattern is non-obvious to future readers who only look at the calling side.


Positive highlights

  • The watermark delta retention design and the fork-alignment invariant are thoroughly documented in coverage.rs and db_reclaimer.rs. The pairing rule ("removing either half breaks the other") is exactly the kind of invariant that gets missed in reviews without this level of documentation.
  • The DatabaseBranchRecordV1 → V2 migration with identity converters follows the engine's vbare pattern correctly.
  • The fail-safe change for ambiguous bucket-fork proofs (block only commit/VTX, not delta/PIDX) unblocks the common case without weakening correctness.
  • Quota credit on delete fixes a quota ratchet that would have accumulated unbounded debt over time.
  • Test coverage is excellent: alignment_fence.rs, reclaim_retention.rs, policy_scope.rs, truncate_history.rs, and bucket_fork_pins.rs all test the new invariants with concrete UDB-backed scenarios.
  • The has_retained_pitr_intervals wake-arm fix correctly handles idle databases with expiring interval rows.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant