fix(kiloclaw): back off stuck volume deletion retries #4083

Merged
St0rmz1 merged 2 commits into main from fix/kiloclaw-destroy-retry-backoff on Jun 18, 2026

St0rmz1 commented Jun 17, 2026

Summary

When a Fly volume deletion keeps failing, the destroy loop retried roughly once
a minute and gave up after 50 attempts (about an hour). A long-lived provider
failure burned through that budget fast and then abandoned the volume. This
changes the retry cadence to back off from one minute up to a daily interval and
raises the attempt cap, so a volume that is stuck on a longer provider outage
gets more total retry coverage without hammering Fly.

Changes

  • Add DESTROY_VOLUME_RETRY_DELAYS_MS tiers (1 min, 5 min, 15 min, 1 hr, 6 hr,
    24 hr) in config.ts. The last tier repeats until the retry cap.
  • Add destroyRetryDelay(attempt, random) in log.ts, which picks the tier for
    the current attempt and applies proportional jitter (base * (0.5 + random),
    so 0.5x to 1.5x of the tier).
  • nextAlarmTime now takes destroyVolumeAttempts; while destroying with
    prior attempts, it schedules the next alarm using the backoff delay instead of
    the fixed 1-minute destroying interval. scheduleAlarm passes the attempt
    count through.
  • Raise MAX_DESTROY_VOLUME_ATTEMPTS from 50 to 100 and update the cap comment
    to reflect the new backoff cadence.
  • Emit reconcile.destroy_volume_retry_escalated telemetry once at attempt 10
    (DESTROY_VOLUME_ESCALATION_ATTEMPTS) so a stuck volume surfaces for alerting
    before it reaches the abandon cap.
  • Update tests: cover the tiered delay and jitter bounds, assert the escalation
    event fires at attempt 10, and bump the abandon-at-cap fixtures from 49 to 99
    attempts.
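
As a rough illustration of the tier selection and jitter described above, here is a minimal sketch. It is not the code from config.ts or log.ts, which this thread does not show: the tier values and the jitter formula come from the description, the index rule (attempt - 1, clamped to the last tier) comes from the review notes, and the helper constants MINUTE_MS and HOUR_MS are assumptions made for readability.

```typescript
// Sketch only: reconstructed from the PR description, not copied from the repo.
const MINUTE_MS = 60_000; // assumed helper constant
const HOUR_MS = 60 * MINUTE_MS; // assumed helper constant

// Tiers listed in the description: 1 min, 5 min, 15 min, 1 hr, 6 hr, 24 hr.
const DESTROY_VOLUME_RETRY_DELAYS_MS: readonly number[] = [
  1 * MINUTE_MS,
  5 * MINUTE_MS,
  15 * MINUTE_MS,
  1 * HOUR_MS,
  6 * HOUR_MS,
  24 * HOUR_MS,
];

// Picks the tier for a 1-based attempt count and applies proportional jitter.
// `random` is expected in [0, 1), so the result is 0.5x to 1.5x of the tier.
function destroyRetryDelay(attempt: number, random: number = Math.random()): number {
  const lastIndex = DESTROY_VOLUME_RETRY_DELAYS_MS.length - 1;
  // Attempts past the last tier keep reusing the 24 hr tier.
  const index = Math.min(Math.max(attempt - 1, 0), lastIndex);
  const base = DESTROY_VOLUME_RETRY_DELAYS_MS[index];
  return base * (0.5 + random);
}
```

Passing `random` explicitly keeps the function deterministic under test: `destroyRetryDelay(1, 0)` gives the lower bound of the first tier (30 seconds) and `destroyRetryDelay(6, 0.5)` gives exactly 24 hours.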

Verification

  • No manual testing. This is internal retry-scheduling logic for the
    kiloclaw durable object; covered by the unit tests in
    kiloclaw-instance.test.ts (tiered delay/jitter, escalation telemetry,
    abandon-at-cap).

Visual Changes

N/A

Reviewer Notes

  • The escalation event fires once (attempts === 10), not on every attempt past
    10.
  • With jitter the last tier ranges from 12h to 36h per retry, so attempts 6
    through 100 span a long window by design; the cap now represents a sustained
    provider failure rather than a short outage.
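
To make the first note concrete, the sketch below compares a strict-equality check against a greater-or-equal check over a full run of failed attempts. Every name in it is hypothetical and the threshold is passed in as a parameter; it only illustrates why an equality check emits a single event, and is not the logic in reconcile.ts.

```typescript
// Hypothetical helper: counts how many escalation events a given predicate
// would emit across `totalAttempts` consecutive failed attempts.
function countEscalations(
  totalAttempts: number,
  threshold: number,
  shouldEscalate: (attempts: number, threshold: number) => boolean,
): number {
  let emitted = 0;
  for (let attempts = 1; attempts <= totalAttempts; attempts++) {
    if (shouldEscalate(attempts, threshold)) emitted++;
  }
  return emitted;
}

// Equality: fires only when the counter lands exactly on the threshold.
const firesOnce = (attempts: number, threshold: number): boolean =>
  attempts === threshold;

// Greater-or-equal: fires on the threshold and on every later attempt.
const firesRepeatedly = (attempts: number, threshold: number): boolean =>
  attempts >= threshold;
```

With a cap of 100 and a threshold of 10, `countEscalations(100, 10, firesOnce)` is 1, while `countEscalations(100, 10, firesRepeatedly)` is 91, which is the alert noise the equality check avoids.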

Comment thread services/kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts Outdated
kilo-code-bot Bot commented Jun 17, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

The incremental commit resolves the prior escalation-threshold SUGGESTION by lowering DESTROY_VOLUME_ESCALATION_ATTEMPTS from 10 to 6 so the early signal fires at the daily-tier transition (~7h of accumulated retries) rather than trailing it by days; the test fixture was updated consistently.

Resolved Issues (1)

| File | Line | Issue | Resolution |
| --- | --- | --- | --- |
| services/kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts | 1576 | Escalation threshold calibrated to the old 1-min cadence | Lowered to 6 with explanatory comment; fires at the daily-tier transition |

Files Reviewed (incremental: 2 files)
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts - 0 issues
  • services/kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues

Notes

Verified the backoff math: destroyRetryDelay(attempt) uses index = attempt - 1 clamped to the last tier. After attempt 6 fails, the next alarm is scheduled with destroyVolumeAttempts = 6 → tier[5] = 24h (daily), so the escalation event at attempt 6 correctly coincides with the daily-tier transition. Accumulated retries through attempt 6 sum to ~7h (1m + 5m + 15m + 1h + 6h), matching the comment. The test seeds destroyVolumeAttempts: 5 and asserts escalation fires once and the counter becomes 6, consistent with the new constant. No memory leaks introduced — DESTROY_VOLUME_RETRY_DELAYS_MS is a module-scope immutable constant and destroyRetryDelay's Math.random() default is re-evaluated per call.
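
The arithmetic in these notes can be checked with a short sketch. It assumes the tier values from the PR description and ignores jitter, so the figures are nominal; the function and constant names are illustrative and not taken from the repository.

```typescript
// Illustrative names; tier values are those listed in the PR description.
const MINUTE_MS = 60_000;
const HOUR_MS = 60 * MINUTE_MS;
const TIERS_MS: readonly number[] = [
  1 * MINUTE_MS,
  5 * MINUTE_MS,
  15 * MINUTE_MS,
  1 * HOUR_MS,
  6 * HOUR_MS,
  24 * HOUR_MS,
];

// Nominal (jitter-free) delay scheduled after the given failed attempt:
// index = attempt - 1, clamped to the last tier.
function nominalDelayAfter(attempt: number): number {
  const index = Math.min(Math.max(attempt - 1, 0), TIERS_MS.length - 1);
  return TIERS_MS[index];
}

// Nominal time between the first attempt and the given attempt, i.e. the
// sum of the delays scheduled after each earlier attempt.
function nominalElapsedBeforeAttempt(attempt: number): number {
  let elapsed = 0;
  for (let prior = 1; prior < attempt; prior++) {
    elapsed += nominalDelayAfter(prior);
  }
  return elapsed;
}
```

Under these assumptions, `nominalElapsedBeforeAttempt(6)` is 441 minutes (about 7.35 hours), and `nominalDelayAfter(6)` is the 24-hour tier, so a threshold of 6 lines up with the daily-tier transition. For comparison, `nominalElapsedBeforeAttempt(10)` is those 441 minutes plus four 24-hour waits, which is where the earlier "~4 days" figure comes from.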

Previous Review Summary (commit 1562de2)

Current summary above is authoritative. Previous snapshots are kept for context only.

Status: 1 Issue Found | Recommendation: Address before merge

Overview

| Severity | Count |
| --- | --- |
| CRITICAL | 0 |
| WARNING | 0 |
| SUGGESTION | 1 |

Issue Details

SUGGESTION

| File | Line | Issue |
| --- | --- | --- |
| services/kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts | 1568 | Escalation threshold (attempt 10) still calibrated to the old 1-min cadence; now fires ~4 days into the backoff, after the daily tier is already in effect |

Files Reviewed (4 files)
  • services/kiloclaw/src/config.ts - 0 issues
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/index.ts - 0 issues
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/log.ts - 0 issues
  • services/kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts - 1 issue
  • services/kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues

Notes

The tiered backoff implementation is sound: the destroyRetryDelay index/jitter math is correct, the alarm rescheduling path (scheduleAlarm → nextAlarmTime) correctly picks up the incremented destroyVolumeAttempts, and the test fixtures (cap raised 50→100, abandon seeded at 99, escalation seeded at 9) are consistent with the new constants. No memory leaks introduced: DESTROY_VOLUME_RETRY_DELAYS_MS is a module-scope immutable constant and destroyRetryDelay's Math.random() default is re-evaluated per call.

Reviewed by glm-5.2-20260616 · 625,090 tokens

Review guidance: REVIEW.md from base branch main

  transition

  Lower DESTROY_VOLUME_ESCALATION_ATTEMPTS from 10 to 6 so
  destroy_volume_retry_escalated fires when the retry backoff reaches its
  daily tier (~7h in) instead of ~4 days in, where it merely trailed the
  tier transition. Update the escalation test fixture accordingly.
St0rmz1 merged commit a89d402 into main on Jun 18, 2026
16 checks passed
St0rmz1 deleted the fix/kiloclaw-destroy-retry-backoff branch on June 18, 2026 01:33