fix(kiloclaw): back off stuck volume deletion retries #4083
Code Review Summary
Status: No Issues Found | Recommendation: Merge
Executive Summary
The incremental commit resolves the prior escalation-threshold SUGGESTION by lowering
Resolved Issues (1)
Files Reviewed (incremental: 2 files)
Notes
Verified the backoff math:
Previous Review Summary (commit 1562de2)
Current summary above is authoritative. Previous snapshots are kept for context only.
Previous review (commit 1562de2)
Status: 1 Issue Found | Recommendation: Address before merge
Overview
Issue Details
SUGGESTION
Files Reviewed (4 files)
Notes
The tiered backoff implementation is sound: the
Reviewed by glm-5.2-20260616 · 625,090 tokens
Review guidance: REVIEW.md from base branch
Lower DESTROY_VOLUME_ESCALATION_ATTEMPTS from 10 to 6 so destroy_volume_retry_escalated fires when the retry backoff reaches its daily tier (~7h in) instead of ~4 days in, where it merely trailed the tier transition. Update the escalation test fixture accordingly.
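The "~7h" and "~4 days" figures in the commit message can be checked with a short calculation. This is a sketch, not the PR's code: it assumes the wait scheduled after failed attempt n uses tier index n - 1, ignores jitter (which averages out to 1x of the tier), and uses hypothetical helper names (`nominalDelayMs`, `nominalElapsedMs`).

```typescript
// Nominal tier delays from the PR description, in milliseconds.
const MINUTE_MS = 60_000;
const HOUR_MS = 60 * MINUTE_MS;
const TIERS_MS: readonly number[] = [
  1 * MINUTE_MS,
  5 * MINUTE_MS,
  15 * MINUTE_MS,
  1 * HOUR_MS,
  6 * HOUR_MS,
  24 * HOUR_MS,
];

// Assumption: the wait after failed attempt n uses tier index n - 1,
// and the last tier repeats for every later attempt.
function nominalDelayMs(attempt: number): number {
  const index = Math.min(Math.max(attempt, 1) - 1, TIERS_MS.length - 1);
  return TIERS_MS[index];
}

// Nominal time between the first failed attempt and attempt `attempt`:
// the sum of the waits scheduled after attempts 1 .. attempt - 1.
function nominalElapsedMs(attempt: number): number {
  let total = 0;
  for (let prior = 1; prior < attempt; prior++) {
    total += nominalDelayMs(prior);
  }
  return total;
}

// 1 + 5 + 15 + 60 + 360 = 441 minutes, i.e. 7.35 hours.
const hoursAtAttemptSix = nominalElapsedMs(6) / HOUR_MS;
// 441 minutes plus four daily waits = 6201 minutes, about 4.3 days.
const daysAtAttemptTen = nominalElapsedMs(10) / (24 * HOUR_MS);
console.log({ hoursAtAttemptSix, daysAtAttemptTen });
```

Under these assumptions, attempt 6 is the first attempt whose follow-up wait is the 24-hour tier, which is why the lower threshold lines the alert up with the tier transition.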
Summary
When a Fly volume deletion keeps failing, the destroy loop retried roughly once
a minute and gave up after 50 attempts (about an hour). A long-lived provider
failure burned through that budget fast and then abandoned the volume. This
changes the retry cadence to back off from one minute up to a daily interval and
raises the attempt cap, so a volume that is stuck on a longer provider outage
gets more total retry coverage without hammering Fly.
Changes
- Add DESTROY_VOLUME_RETRY_DELAYS_MS tiers (1 min, 5 min, 15 min, 1 hr, 6 hr, 24 hr) in config.ts. The last tier repeats until the retry cap.
- Add destroyRetryDelay(attempt, random) in log.ts, which picks the tier for the current attempt and applies proportional jitter (base * (0.5 + random), so 0.5x to 1.5x of the tier).
- nextAlarmTime now takes destroyVolumeAttempts; while destroying with prior attempts, it schedules the next alarm using the backoff delay instead of the fixed 1-minute destroying interval. scheduleAlarm passes the attempt count through.
- Raise MAX_DESTROY_VOLUME_ATTEMPTS from 50 to 100 and update the cap comment to reflect the new backoff cadence.
- Emit reconcile.destroy_volume_retry_escalated telemetry once at attempt 10 (DESTROY_VOLUME_ESCALATION_ATTEMPTS) so a stuck volume surfaces for alerting before it reaches the abandon cap.
- Update the tests so the escalation event fires at attempt 10, and bump the abandon-at-cap fixtures from 49 to 99 attempts.
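The tiered delay and jitter described above could look roughly like the following. This is a minimal sketch rather than the code in config.ts or log.ts: the function names ending in `Sketch`, the 1-based attempt-to-tier mapping, the treatment of `random` as a number in [0, 1), and the rounding are all assumptions made for illustration.

```typescript
// Tier values as listed in the PR description (milliseconds).
const DESTROY_VOLUME_RETRY_DELAYS_MS: readonly number[] = [
  60_000, // 1 min
  300_000, // 5 min
  900_000, // 15 min
  3_600_000, // 1 hr
  21_600_000, // 6 hr
  86_400_000, // 24 hr
];

// Fixed interval used while destroying when there are no prior attempts.
const DESTROYING_INTERVAL_MS = 60_000;

// Assumption: `attempt` counts failed attempts so far (1-based) and
// `random` is a number in [0, 1), injectable so tests are deterministic.
function destroyRetryDelaySketch(
  attempt: number,
  random: number = Math.random(),
): number {
  const lastIndex = DESTROY_VOLUME_RETRY_DELAYS_MS.length - 1;
  // Clamp so attempts past the last tier keep using the 24-hour delay.
  const index = Math.min(Math.max(attempt - 1, 0), lastIndex);
  const base = DESTROY_VOLUME_RETRY_DELAYS_MS[index];
  // Proportional jitter: 0.5x up to (but not including) 1.5x of the tier.
  return Math.round(base * (0.5 + random));
}

// Hypothetical stand-in for the destroying branch of nextAlarmTime:
// back off only once at least one attempt has already failed.
function nextDestroyAlarmTimeSketch(
  nowMs: number,
  destroyVolumeAttempts: number,
  random: number = Math.random(),
): number {
  const delay =
    destroyVolumeAttempts > 0
      ? destroyRetryDelaySketch(destroyVolumeAttempts, random)
      : DESTROYING_INTERVAL_MS;
  return nowMs + delay;
}
```

Passing `random` in as a parameter, as the described signature does, lets a test pin the jitter: with `random = 0.5` the delay equals the nominal tier, and with `random = 0` it is half the tier.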
Verification
kiloclaw durable object; covered by the unit tests in kiloclaw-instance.test.ts (tiered delay/jitter, escalation telemetry, abandon-at-cap).
Visual Changes
N/A
Reviewer Notes
- The escalation event fires once (attempts === 10), not on every attempt past 10.
- Attempts through 100 span a long window by design; the cap now represents a sustained provider failure rather than a short outage.
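The fire-once and abandon-at-cap behaviour in these notes can be expressed as a small decision function. This is an illustrative sketch only: `classifyDestroyAttempt` and the `DestroyRetryAction` type are hypothetical names, and it assumes `attempts` is the counter value after the latest failure has been recorded.

```typescript
type DestroyRetryAction = "retry" | "retry_and_escalate" | "abandon";

// Hypothetical helper: decide what to do after a failed destroy attempt.
// Strict equality on the escalation threshold means the telemetry event
// is emitted exactly once, not on every attempt past the threshold.
function classifyDestroyAttempt(
  attempts: number,
  escalationAttempts: number,
  maxAttempts: number,
): DestroyRetryAction {
  if (attempts >= maxAttempts) return "abandon";
  if (attempts === escalationAttempts) return "retry_and_escalate";
  return "retry";
}

// Walk a volume through 100 failures with the threshold from the notes
// above (10) and the raised cap (100), recording where escalation fires.
const escalatedAt: number[] = [];
for (let attempts = 1; attempts <= 100; attempts++) {
  if (classifyDestroyAttempt(attempts, 10, 100) === "retry_and_escalate") {
    escalatedAt.push(attempts);
  }
}
console.log(escalatedAt); // [10]
```

Keeping the threshold as a parameter means the same check holds after the follow-up commit lowers DESTROY_VOLUME_ESCALATION_ATTEMPTS to 6.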