fix(kiloclaw): back off stuck volume deletion retries by St0rmz1 · Pull Request #4083 · Kilo-Org/cloud

St0rmz1 · 2026-06-17T21:50:43Z

Summary

When a Fly volume deletion keeps failing, the destroy loop retried roughly once
a minute and gave up after 50 attempts (about an hour). A long-lived provider
failure burned through that budget fast and then abandoned the volume. This
changes the retry cadence to back off from one minute up to a daily interval and
raises the attempt cap, so a volume that is stuck on a longer provider outage
gets more total retry coverage without hammering Fly.

Changes

- Add DESTROY_VOLUME_RETRY_DELAYS_MS tiers (1 min, 5 min, 15 min, 1 hr, 6 hr,
  24 hr) in config.ts. The last tier repeats until the retry cap.
- Add destroyRetryDelay(attempt, random) in log.ts, which picks the tier for
  the current attempt and applies proportional jitter (base * (0.5 + random),
  so 0.5x to 1.5x of the tier).
- nextAlarmTime now takes destroyVolumeAttempts; while destroying with
  prior attempts, it schedules the next alarm using the backoff delay instead of
  the fixed 1-minute destroying interval. scheduleAlarm passes the attempt
  count through.
- Raise MAX_DESTROY_VOLUME_ATTEMPTS from 50 to 100 and update the cap comment
  to reflect the new backoff cadence.
- Emit reconcile.destroy_volume_retry_escalated telemetry once at attempt 10
  (DESTROY_VOLUME_ESCALATION_ATTEMPTS) so a stuck volume surfaces for alerting
  before it reaches the abandon cap.
- Update tests: cover the tiered delay and jitter bounds, assert the escalation
  event fires at attempt 10, and bump the abandon-at-cap fixtures from 49 to 99
  attempts.
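
For readers without the diff open, the tier lookup and jitter described in the list above can be sketched roughly as follows. The tier values, the names, and the base * (0.5 + random) formula come from this description; the function body, the index clamping, and the default argument are an illustrative reconstruction, not the merged code.

```typescript
// Tier values as listed in the PR description; the last tier repeats.
const DESTROY_VOLUME_RETRY_DELAYS_MS: readonly number[] = [
  60_000,           // 1 min
  5 * 60_000,       // 5 min
  15 * 60_000,      // 15 min
  60 * 60_000,      // 1 hr
  6 * 60 * 60_000,  // 6 hr
  24 * 60 * 60_000, // 24 hr
];

// Illustrative reconstruction: attempt is assumed to be 1-based, and random
// is assumed to be a value in [0, 1) such as the result of Math.random().
function destroyRetryDelay(attempt: number, random: number = Math.random()): number {
  // Clamp so attempts past the last tier keep reusing the 24 hr entry.
  const lastIndex = DESTROY_VOLUME_RETRY_DELAYS_MS.length - 1;
  const index = Math.min(Math.max(attempt - 1, 0), lastIndex);
  const base = DESTROY_VOLUME_RETRY_DELAYS_MS[index];
  // Proportional jitter: 0.5x of the tier at random = 0, approaching 1.5x as
  // random approaches 1.
  return base * (0.5 + random);
}
```

Passing random explicitly, as in destroyRetryDelay(6, 0), makes the jitter bounds easy to pin down in a unit test without stubbing Math.random.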

Verification

No manual testing. This is internal retry-scheduling logic for the
kiloclaw durable object; covered by the unit tests in
kiloclaw-instance.test.ts (tiered delay/jitter, escalation telemetry,
abandon-at-cap).

Visual Changes

N/A

Reviewer Notes

- The escalation event fires once (attempts === 10), not on every attempt past
  10.
- With jitter the last tier ranges from 12h to 36h per retry, so attempts 6
  through 100 span a long window by design; the cap now represents a sustained
  provider failure rather than a short outage.
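
The fire-once behaviour in the first note can be sketched as a strict equality check on the incremented counter. This is a hedged illustration only: onDestroyVolumeFailure, RetryOutcome, and the ordering of the two checks are hypothetical, and the escalation threshold is passed in as a parameter rather than hard-coded. Only the cap of 100 is taken from this PR.

```typescript
// Cap from the PR description.
const MAX_DESTROY_VOLUME_ATTEMPTS = 100;

// Hypothetical outcome labels, used only for this sketch.
type RetryOutcome = "retry" | "retry_escalated" | "abandon";

// Hypothetical helper: decide what happens after one more failed deletion.
function onDestroyVolumeFailure(
  previousAttempts: number,
  escalationAttempts: number,
): { attempts: number; outcome: RetryOutcome } {
  const attempts = previousAttempts + 1;
  if (attempts >= MAX_DESTROY_VOLUME_ATTEMPTS) {
    // Reaching the cap abandons the volume (fixtures seed 99 for this case).
    return { attempts, outcome: "abandon" };
  }
  if (attempts === escalationAttempts) {
    // Strict equality means the telemetry event is emitted exactly once.
    return { attempts, outcome: "retry_escalated" };
  }
  return { attempts, outcome: "retry" };
}
```

Using === rather than >= is what keeps the alert from repeating on every later attempt while the volume stays stuck.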

kilo-code-bot · 2026-06-17T21:56:19Z

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

The incremental commit resolves the prior escalation-threshold SUGGESTION by lowering DESTROY_VOLUME_ESCALATION_ATTEMPTS from 10 to 6 so the early signal fires at the daily-tier transition (~7h of accumulated retries) rather than trailing it by days; the test fixture was updated consistently.

Resolved Issues (1)

File	Line	Issue	Resolution
`services/kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts`	1576	Escalation threshold calibrated to the old 1-min cadence	Lowered to 6 with explanatory comment; fires at the daily-tier transition

Files Reviewed (incremental: 2 files)

services/kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts - 0 issues
services/kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues

Notes

Verified the backoff math: destroyRetryDelay(attempt) uses index = attempt - 1 clamped to the last tier. After attempt 6 fails, the next alarm is scheduled with destroyVolumeAttempts = 6 → tier[5] = 24h (daily), so the escalation event at attempt 6 correctly coincides with the daily-tier transition. Accumulated retries through attempt 6 sum to ~7h (1m + 5m + 15m + 1h + 6h), matching the comment. The test seeds destroyVolumeAttempts: 5 and asserts escalation fires once and the counter becomes 6, consistent with the new constant. No memory leaks introduced — DESTROY_VOLUME_RETRY_DELAYS_MS is a module-scope immutable constant and destroyRetryDelay's Math.random() default is re-evaluated per call.
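
The accumulated-delay arithmetic in these notes can be reproduced with a short calculation. This is a sketch that ignores jitter (the factor 0.5 + random averages 1.0 for a uniform random value, so nominal tiers approximate the expected case); nominalMinutesBeforeAttempt is a hypothetical helper, not a function from the PR.

```typescript
// Nominal tier lengths in minutes, taken from the PR description.
const TIER_MINUTES: readonly number[] = [1, 5, 15, 60, 360, 1440];

// Minutes elapsed (ignoring jitter) when the given 1-based attempt runs:
// the sum of the delays scheduled after attempts 1..attempt-1.
function nominalMinutesBeforeAttempt(attempt: number): number {
  let total = 0;
  for (let prior = 1; prior < attempt; prior++) {
    // Same clamping as the review describes: index = attempt - 1, capped at
    // the last tier.
    const index = Math.min(prior - 1, TIER_MINUTES.length - 1);
    total += TIER_MINUTES[index];
  }
  return total;
}

const atSix = nominalMinutesBeforeAttempt(6);  // 441 minutes, about 7.4 hours
const atTen = nominalMinutesBeforeAttempt(10); // 6201 minutes, about 4.3 days
```

The two values line up with the "~7h" figure for a threshold of 6 and the "~4 days" figure the earlier review gave for a threshold of 10.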

Previous Review Summary (commit 1562de2)

Current summary above is authoritative. Previous snapshots are kept for context only.

Status: 1 Issue Found | Recommendation: Address before merge

Overview

Severity	Count
CRITICAL	0
WARNING	0
SUGGESTION	1

Issue Details

SUGGESTION

File	Line	Issue
`services/kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts`	1568	Escalation threshold (attempt 10) still calibrated to the old 1-min cadence; now fires ~4 days into the backoff, after the daily tier is already in effect

Files Reviewed (4 files)

services/kiloclaw/src/config.ts - 0 issues
services/kiloclaw/src/durable-objects/kiloclaw-instance/index.ts - 0 issues
services/kiloclaw/src/durable-objects/kiloclaw-instance/log.ts - 0 issues
services/kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts - 1 issue
services/kiloclaw/src/durable-objects/kiloclaw-instance.test.ts - 0 issues

Notes

The tiered backoff implementation is sound: the destroyRetryDelay index/jitter math is correct, the alarm rescheduling path (scheduleAlarm → nextAlarmTime) correctly picks up the incremented destroyVolumeAttempts, and the test fixtures (cap raised 50→100, abandon seeded at 99, escalation seeded at 9) are consistent with the new constants. No memory leaks introduced — DESTROY_VOLUME_RETRY_DELAYS_MS is a module-scope immutable constant and destroyRetryDelay's Math.random() default is re-evaluated per call.
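
The rescheduling path mentioned in these notes (scheduleAlarm passing the attempt count to nextAlarmTime) might look roughly like the sketch below. It covers only the destroying case; nextDestroyAlarmTime, its parameter list, and the injected retryDelay callback are assumptions made to keep the example self-contained, and only the fixed 1-minute destroying interval comes from the PR description.

```typescript
// Fixed destroying interval named in the PR description.
const DESTROYING_INTERVAL_MS = 60_000;

// Hypothetical, simplified stand-in for the destroying branch of nextAlarmTime.
// retryDelay is injected so the sketch does not depend on the real helper.
function nextDestroyAlarmTime(
  now: number,
  destroyVolumeAttempts: number,
  retryDelay: (attempt: number) => number,
): number {
  // With no prior failed attempts, keep the fixed 1-minute cadence; once
  // attempts have accumulated, defer to the tiered backoff delay.
  const delay =
    destroyVolumeAttempts > 0 ? retryDelay(destroyVolumeAttempts) : DESTROYING_INTERVAL_MS;
  return now + delay;
}
```

Because the count is read after it has been incremented, the delay chosen after the Nth failure is the one for attempt N, which is the behaviour the review describes.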

Reviewed by glm-5.2-20260616 · 625,090 tokens

Review guidance: REVIEW.md from base branch main

Lower DESTROY_VOLUME_ESCALATION_ATTEMPTS from 10 to 6 so destroy_volume_retry_escalated fires when the retry backoff reaches its daily tier (~7h in) instead of ~4 days in, where it merely trailed the tier transition. Update the escalation test fixture accordingly.

fix(kiloclaw): back off stuck volume deletion retries

1562de2

kilo-code-bot Bot reviewed Jun 17, 2026

Comment thread services/kiloclaw/src/durable-objects/kiloclaw-instance/reconcile.ts Outdated

pandemicsyn approved these changes Jun 18, 2026

St0rmz1 merged commit a89d402 into main Jun 18, 2026
16 checks passed

St0rmz1 deleted the fix/kiloclaw-destroy-retry-backoff branch June 18, 2026 01:33

St0rmz1 merged 2 commits into main from fix/kiloclaw-destroy-retry-backoff
