fix(sync): don't mark queued jobs as failed in stale guard (#7469) by aryaminus · Pull Request #7597 · BasedHardware/omi

aryaminus · 2026-06-02T18:52:04Z

Problem

database/sync_jobs.py:get_sync_job() treats queued and processing identically in the stale guard — after 10 minutes both get marked as failed with "Job timed out (background worker likely died)".

This is wrong for queued jobs: they were never picked up by a worker because pools were saturated. No worker died — none was available. Users see "Couldn't process — retrying" and burn retry budgets for what is fundamentally a capacity issue, not a content failure.

Fix

One-line change: the stale guard now only applies to status == 'processing' (where a worker actually died mid-pipeline). queued jobs remain queued and naturally expire via the 24h Redis TTL.

# Before
if job['status'] in ('queued', 'processing'):

# After  
if job['status'] == 'processing':

Why this is correct

Jobs start as queued → acquire semaphore → transition to processing (via mark_job_processing())
A queued job simply has not acquired a semaphore slot yet. Under load (pools at 100% utilization, semaphore at 16 concurrent), a job can sit queued for well over 10 minutes
The client-side workaround in feat(sync): graceful cooldowns for 429 fair-use + backend-busy stale-fails; surface the reason, back off, and log why #7451 already pattern-matches the error string — the right fix is on the backend

Test coverage

Existing test test_get_sync_job_detects_stale validates processing jobs are still correctly failed (unchanged)
New test test_get_sync_job_does_not_mark_stale_queued_as_failed validates queued jobs are NOT flipped to failed

Evidence from production

12:09:43  job ceee6bb3 → "queued", 0/0 segments
12:20:22  REVERT job 64457303 → "failed", error="Job timed out (background worker likely died)"

executor_pool_health:
  storage: 96/96 workers, 29 deep queue, 100% utilization
  postprocess: 24/24 workers, 80 deep queue, 100% utilization

Closes #7469

The stale-job detection in get_sync_job() was treating 'queued' and 'processing' identically — after 10 minutes both got marked as failed with "Job timed out (background worker likely died)". This is wrong for 'queued' jobs: they were never picked up by a worker because pools were saturated. No worker died — none was available. Users saw "Couldn't process — retrying" and burned retry budgets. Only fail 'processing' jobs where a worker actually died. Closes BasedHardware#7469

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ea6f7fedfd

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-02T18:54:32Z

+    # Stale-job detection: if processing for too long, mark failed.
+    # Only applies to 'processing' jobs where a worker actually died.
+    # 'queued' jobs are still waiting for capacity — not a failure.
+    if job['status'] == 'processing':


Restore an orphan timeout for queued jobs

When the backend process is redeployed or crashes after the POST creates the Redis job but before the in-memory background task enters the semaphore and calls mark_job_processing (backend/routers/sync.py:1396-1402), that job is permanently orphaned in queued. With this condition no longer treating stale queued jobs as failed, GET keeps returning a non-terminal status until the 24h Redis TTL; the app's reconciler explicitly leaves non-terminal queued/processing WALs in uploaded and only “check[s] later” (app/lib/services/wals/local_wal_sync.dart:916-926), so the user will not re-upload or recover for up to a day. Please keep a shorter orphan timeout for queued jobs, or otherwise distinguish live capacity waiters from lost in-memory tasks.

Useful? React with 👍 / 👎.

greptile-apps · 2026-06-02T18:55:00Z

Greptile Summary

Narrows the stale-guard in get_sync_job so only processing jobs (where a worker was active and died) are marked failed after 10 minutes; queued jobs now remain queued and expire naturally via the 24-hour Redis TTL.

backend/database/sync_jobs.py: One-line condition change from in ('queued', 'processing') to == 'processing', with updated comments explaining the rationale.
backend/tests/unit/test_sync_v2.py: Adds test_get_sync_job_does_not_mark_stale_queued_as_failed to verify that a stale queued job is returned unchanged; the existing test_get_sync_job_detects_stale confirms processing jobs are still correctly failed.

Confidence Score: 4/5

Safe to merge; the fix correctly restricts the stale-job timeout to workers that were actually running.

The change is a single, well-reasoned condition narrowing backed by production evidence and a new targeted test. The only gap is that the new test does not assert mock_redis.set is never called, so a future regression that accidentally writes back to Redis for queued jobs would not be caught by the test suite.

The new test in backend/tests/unit/test_sync_v2.py would benefit from an additional assertion to guard against inadvertent Redis writes on stale queued jobs.

Important Files Changed

Filename	Overview
backend/database/sync_jobs.py	Stale-guard condition narrowed from queued
backend/tests/unit/test_sync_v2.py	New test validates queued jobs are not flipped to failed; missing a mock_redis.set.assert_not_called() assertion to confirm no Redis write-back occurs.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[get_sync_job called] --> B[Fetch from Redis]
    B --> C{Data found?}
    C -- No --> D[Return None]
    C -- Yes --> E[Parse job dict]
    E --> F{status equals processing?}
    F -- No --> G[Return job as-is]
    F -- Yes --> H{Stale beyond 600s?}
    H -- No --> I[Return job unchanged]
    H -- Yes --> J[Mark status=failed]
    J --> K[Persist to Redis]
    K --> L[Return failed job]

_{Reviews (1): Last reviewed commit: "test(sync): verify stale queued jobs are..." | Re-trigger Greptile}

greptile-apps · 2026-06-02T18:55:08Z

+        result = mod.get_sync_job('sq-1')
+        assert result['status'] == 'queued'
+        assert result.get('error') is None


The test confirms the returned job is still queued, but it doesn't assert that Redis is never written. Without mock_redis.set.assert_not_called(), a future regression that mutates the job dict in Redis (e.g., refreshes the TTL or stamps an error field) would still pass this test.

Suggested change

result = mod.get_sync_job('sq-1')

assert result['status'] == 'queued'

assert result.get('error') is None

result = mod.get_sync_job('sq-1')

assert result['status'] == 'queued'

assert result.get('error') is None

mock_redis.set.assert_not_called()

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

aryaminus added 2 commits June 2, 2026 11:51

test(sync): verify stale queued jobs are not marked failed

ea6f7fe

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sync): don't mark queued jobs as failed in stale guard (#7469)#7597

fix(sync): don't mark queued jobs as failed in stale guard (#7469)#7597
aryaminus wants to merge 2 commits into
BasedHardware:mainfrom
aryaminus:fix/sync-job-stale-guard-queued

aryaminus commented Jun 2, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Uh oh!

greptile-apps Bot commented Jun 2, 2026

Uh oh!

greptile-apps Bot Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

aryaminus commented Jun 2, 2026

Problem

Fix

Why this is correct

Test coverage

Evidence from production

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Jun 2, 2026

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

greptile-apps Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant