Skip to content

fix(sync): don't mark queued jobs as failed in stale guard (#7469)#7597

Open
aryaminus wants to merge 2 commits into
BasedHardware:mainfrom
aryaminus:fix/sync-job-stale-guard-queued
Open

fix(sync): don't mark queued jobs as failed in stale guard (#7469)#7597
aryaminus wants to merge 2 commits into
BasedHardware:mainfrom
aryaminus:fix/sync-job-stale-guard-queued

Conversation

@aryaminus
Copy link
Copy Markdown
Contributor

Problem

database/sync_jobs.py:get_sync_job() treats queued and processing identically in the stale guard — after 10 minutes both get marked as failed with "Job timed out (background worker likely died)".

This is wrong for queued jobs: they were never picked up by a worker because pools were saturated. No worker died — none was available. Users see "Couldn't process — retrying" and burn retry budgets for what is fundamentally a capacity issue, not a content failure.

Fix

One-line change: the stale guard now only applies to status == 'processing' (where a worker actually died mid-pipeline). queued jobs remain queued and naturally expire via the 24h Redis TTL.

# Before
if job['status'] in ('queued', 'processing'):

# After  
if job['status'] == 'processing':

Why this is correct

Test coverage

  • Existing test test_get_sync_job_detects_stale validates processing jobs are still correctly failed (unchanged)
  • New test test_get_sync_job_does_not_mark_stale_queued_as_failed validates queued jobs are NOT flipped to failed

Evidence from production

12:09:43  job ceee6bb3 → "queued", 0/0 segments
12:20:22  REVERT job 64457303 → "failed", error="Job timed out (background worker likely died)"

executor_pool_health:
  storage: 96/96 workers, 29 deep queue, 100% utilization
  postprocess: 24/24 workers, 80 deep queue, 100% utilization

Closes #7469

aryaminus added 2 commits June 2, 2026 11:51
The stale-job detection in get_sync_job() was treating 'queued' and
'processing' identically — after 10 minutes both got marked as failed
with "Job timed out (background worker likely died)".

This is wrong for 'queued' jobs: they were never picked up by a worker
because pools were saturated. No worker died — none was available.
Users saw "Couldn't process — retrying" and burned retry budgets.

Only fail 'processing' jobs where a worker actually died.

Closes BasedHardware#7469
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea6f7fedfd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# Stale-job detection: if processing for too long, mark failed.
# Only applies to 'processing' jobs where a worker actually died.
# 'queued' jobs are still waiting for capacity — not a failure.
if job['status'] == 'processing':
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Restore an orphan timeout for queued jobs

When the backend process is redeployed or crashes after the POST creates the Redis job but before the in-memory background task enters the semaphore and calls mark_job_processing (backend/routers/sync.py:1396-1402), that job is permanently orphaned in queued. With this condition no longer treating stale queued jobs as failed, GET keeps returning a non-terminal status until the 24h Redis TTL; the app's reconciler explicitly leaves non-terminal queued/processing WALs in uploaded and only “check[s] later” (app/lib/services/wals/local_wal_sync.dart:916-926), so the user will not re-upload or recover for up to a day. Please keep a shorter orphan timeout for queued jobs, or otherwise distinguish live capacity waiters from lost in-memory tasks.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 2, 2026

Greptile Summary

Narrows the stale-guard in get_sync_job so only processing jobs (where a worker was active and died) are marked failed after 10 minutes; queued jobs now remain queued and expire naturally via the 24-hour Redis TTL.

  • backend/database/sync_jobs.py: One-line condition change from in ('queued', 'processing') to == 'processing', with updated comments explaining the rationale.
  • backend/tests/unit/test_sync_v2.py: Adds test_get_sync_job_does_not_mark_stale_queued_as_failed to verify that a stale queued job is returned unchanged; the existing test_get_sync_job_detects_stale confirms processing jobs are still correctly failed.

Confidence Score: 4/5

Safe to merge; the fix correctly restricts the stale-job timeout to workers that were actually running.

The change is a single, well-reasoned condition narrowing backed by production evidence and a new targeted test. The only gap is that the new test does not assert mock_redis.set is never called, so a future regression that accidentally writes back to Redis for queued jobs would not be caught by the test suite.

The new test in backend/tests/unit/test_sync_v2.py would benefit from an additional assertion to guard against inadvertent Redis writes on stale queued jobs.

Important Files Changed

Filename Overview
backend/database/sync_jobs.py Stale-guard condition narrowed from queued
backend/tests/unit/test_sync_v2.py New test validates queued jobs are not flipped to failed; missing a mock_redis.set.assert_not_called() assertion to confirm no Redis write-back occurs.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[get_sync_job called] --> B[Fetch from Redis]
    B --> C{Data found?}
    C -- No --> D[Return None]
    C -- Yes --> E[Parse job dict]
    E --> F{status equals processing?}
    F -- No --> G[Return job as-is]
    F -- Yes --> H{Stale beyond 600s?}
    H -- No --> I[Return job unchanged]
    H -- Yes --> J[Mark status=failed]
    J --> K[Persist to Redis]
    K --> L[Return failed job]
Loading

Reviews (1): Last reviewed commit: "test(sync): verify stale queued jobs are..." | Re-trigger Greptile

Comment on lines +331 to +333
result = mod.get_sync_job('sq-1')
assert result['status'] == 'queued'
assert result.get('error') is None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 The test confirms the returned job is still queued, but it doesn't assert that Redis is never written. Without mock_redis.set.assert_not_called(), a future regression that mutates the job dict in Redis (e.g., refreshes the TTL or stamps an error field) would still pass this test.

Suggested change
result = mod.get_sync_job('sq-1')
assert result['status'] == 'queued'
assert result.get('error') is None
result = mod.get_sync_job('sq-1')
assert result['status'] == 'queued'
assert result.get('error') is None
mock_redis.set.assert_not_called()

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sync-job stale guard incorrectly marks healthy queued jobs as 'failed' (background-busy is mislabelled)

1 participant