fix(sync): don't mark queued jobs as failed in stale guard (#7469)#7597
fix(sync): don't mark queued jobs as failed in stale guard (#7469)#7597aryaminus wants to merge 2 commits into
Conversation
The stale-job detection in get_sync_job() was treating 'queued' and 'processing' identically — after 10 minutes both got marked as failed with "Job timed out (background worker likely died)". This is wrong for 'queued' jobs: they were never picked up by a worker because pools were saturated. No worker died — none was available. Users saw "Couldn't process — retrying" and burned retry budgets. Only fail 'processing' jobs where a worker actually died. Closes BasedHardware#7469
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ea6f7fedfd
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # Stale-job detection: if processing for too long, mark failed. | ||
| # Only applies to 'processing' jobs where a worker actually died. | ||
| # 'queued' jobs are still waiting for capacity — not a failure. | ||
| if job['status'] == 'processing': |
There was a problem hiding this comment.
Restore an orphan timeout for queued jobs
When the backend process is redeployed or crashes after the POST creates the Redis job but before the in-memory background task enters the semaphore and calls mark_job_processing (backend/routers/sync.py:1396-1402), that job is permanently orphaned in queued. With this condition no longer treating stale queued jobs as failed, GET keeps returning a non-terminal status until the 24h Redis TTL; the app's reconciler explicitly leaves non-terminal queued/processing WALs in uploaded and only “check[s] later” (app/lib/services/wals/local_wal_sync.dart:916-926), so the user will not re-upload or recover for up to a day. Please keep a shorter orphan timeout for queued jobs, or otherwise distinguish live capacity waiters from lost in-memory tasks.
Useful? React with 👍 / 👎.
Greptile SummaryNarrows the stale-guard in
Confidence Score: 4/5Safe to merge; the fix correctly restricts the stale-job timeout to workers that were actually running. The change is a single, well-reasoned condition narrowing backed by production evidence and a new targeted test. The only gap is that the new test does not assert mock_redis.set is never called, so a future regression that accidentally writes back to Redis for queued jobs would not be caught by the test suite. The new test in backend/tests/unit/test_sync_v2.py would benefit from an additional assertion to guard against inadvertent Redis writes on stale queued jobs. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[get_sync_job called] --> B[Fetch from Redis]
B --> C{Data found?}
C -- No --> D[Return None]
C -- Yes --> E[Parse job dict]
E --> F{status equals processing?}
F -- No --> G[Return job as-is]
F -- Yes --> H{Stale beyond 600s?}
H -- No --> I[Return job unchanged]
H -- Yes --> J[Mark status=failed]
J --> K[Persist to Redis]
K --> L[Return failed job]
Reviews (1): Last reviewed commit: "test(sync): verify stale queued jobs are..." | Re-trigger Greptile |
| result = mod.get_sync_job('sq-1') | ||
| assert result['status'] == 'queued' | ||
| assert result.get('error') is None |
There was a problem hiding this comment.
The test confirms the returned job is still
queued, but it doesn't assert that Redis is never written. Without mock_redis.set.assert_not_called(), a future regression that mutates the job dict in Redis (e.g., refreshes the TTL or stamps an error field) would still pass this test.
| result = mod.get_sync_job('sq-1') | |
| assert result['status'] == 'queued' | |
| assert result.get('error') is None | |
| result = mod.get_sync_job('sq-1') | |
| assert result['status'] == 'queued' | |
| assert result.get('error') is None | |
| mock_redis.set.assert_not_called() |
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Problem
database/sync_jobs.py:get_sync_job()treatsqueuedandprocessingidentically in the stale guard — after 10 minutes both get marked asfailedwith"Job timed out (background worker likely died)".This is wrong for
queuedjobs: they were never picked up by a worker because pools were saturated. No worker died — none was available. Users see "Couldn't process — retrying" and burn retry budgets for what is fundamentally a capacity issue, not a content failure.Fix
One-line change: the stale guard now only applies to
status == 'processing'(where a worker actually died mid-pipeline).queuedjobs remain queued and naturally expire via the 24h Redis TTL.Why this is correct
queued→ acquire semaphore → transition toprocessing(viamark_job_processing())queuedjob simply has not acquired a semaphore slot yet. Under load (pools at 100% utilization, semaphore at 16 concurrent), a job can sitqueuedfor well over 10 minutesTest coverage
test_get_sync_job_detects_stalevalidatesprocessingjobs are still correctly failed (unchanged)test_get_sync_job_does_not_mark_stale_queued_as_failedvalidatesqueuedjobs are NOT flipped to failedEvidence from production
Closes #7469