Skip to content

Persist terminal aborted state so dead jobs stop counting as queued#504

Merged
dereuromark merged 1 commit into
masterfrom
feature/persist-terminal-aborted-state
May 28, 2026
Merged

Persist terminal aborted state so dead jobs stop counting as queued#504
dereuromark merged 1 commit into
masterfrom
feature/persist-terminal-aborted-state

Conversation

@dereuromark
Copy link
Copy Markdown
Owner

Supersedes #503 (which discounted stale-fetched jobs — wrong, since a job with retries left is one the queue itself re-runs, so the scheduler would re-dispatch a duplicate and pile up failed rows). This instead records the terminal verdict the queue already computes.

Problem

A row with completed IS NULL is treated as queued/in-progress by isQueued(), the admin "in_progress" filter, and the status badge. That conflates three states: pending, actively running, and permanently failed. Once a job exhausts its retries it never runs again, yet keeps completed IS NULL forever, so:

  • isQueued() keeps returning true → any non-concurrent caller (notably a QueueScheduler row) wedges permanently behind the dead job: both the cron tick and the admin "Run" return "could not be added to the queue";
  • the admin renders it as "Running" instead of failed — and a job whose worker was killed (OOM/timeout) shows "Running" forever too.

Change

getFailedStatus() already distinguishes a retryable failure (requeued) from a terminal one (aborted). This persists that verdict so the rest of the system can act on it:

  • Add QueuedJobsTable::STATUS_ABORTED and markJobAborted().
  • Processor stamps it when getFailedStatus() === aborted (next to the existing Queue.Job.maxAttemptsExhausted event).
  • isQueued() and the in_progress search filter exclude aborted rows. Null / other statuses are unaffected — fully backward compatible.
  • The status badge renders aborted as "Failed", checked before the fetched branch, so a job whose worker died on its last attempt no longer shows a stuck "Running".

Retryable failures are deliberately untouched: while attempts remain the job is genuinely in flight and still counts as queued, so there is no early re-dispatch and no pile-up. flushFailedJobs() continues to prune old rows.

Tests

testIsQueuedExcludesAborted, testMarkJobAborted, and an assertion in the Processor maxAttemptsExhausted test that the exhausted job is stamped aborted. Full QueuedJobsTableTest + ProcessorTest green; phpcs and phpstan clean on the changed files.

Downstream

Pairs with dereuromark/cakephp-queue-scheduler#47 — but with this change that scheduler PR is no longer needed: plain isQueued() stops counting dead jobs, so the schedule self-heals with no scheduler change. I'll close #47.

A job with `completed IS NULL` is treated as queued/in-progress by
isQueued(), the admin "in_progress" filter, and the status badge. That
conflates three different things: pending, actively running, and
permanently failed. Once a job exhausts its retries it will never run
again, yet it keeps `completed IS NULL` forever — so:

* isQueued() keeps returning true, which wedges any non-concurrent
  caller (e.g. a QueueScheduler row) permanently behind the dead job;
* the admin shows it as "Running" instead of failed.

getFailedStatus() already distinguishes a retryable failure
("requeued") from a terminal one ("aborted"); this only persists that
verdict so the rest of the system can see it.

* Add `QueuedJobsTable::STATUS_ABORTED` and `markJobAborted()`.
* Processor stamps it when getFailedStatus() === aborted (alongside the
  existing maxAttemptsExhausted event).
* isQueued() and the "in_progress" search filter exclude aborted rows
  (null/other statuses are unaffected — fully backward compatible).
* The status badge renders aborted as "Failed", checked before the
  `fetched` branch so a job whose worker died on its last attempt no
  longer shows a stuck "Running".

Retryable failures are untouched: while attempts remain the job is
genuinely in flight and still counts as queued, so there is no early
re-dispatch or pile-up. flushFailedJobs() continues to prune old rows.
@dereuromark dereuromark merged commit 4b4b934 into master May 28, 2026
16 checks passed
@dereuromark dereuromark deleted the feature/persist-terminal-aborted-state branch May 28, 2026 13:41
dereuromark added a commit to dereuromark/cakephp-queue-scheduler that referenced this pull request May 28, 2026
…#48)

* Rerun terminally-failed scheduled jobs in place, with failure backoff

A non-concurrent row gates its next dispatch on QueuedJobsTable::isQueued().
Once cakephp-queue records terminal "aborted" state, a dead job stops
counting as queued, so the schedule resumes — but it would then queue a
brand-new job every tick, leaving one failed row behind per interval
(e.g. ~1440/day for an every-minute task).

#1 — Rerun in place: when the row's previous dispatch terminally failed
(queue status = aborted), run() resets that same job (clears attempts and
the stale aborted status) instead of creating a new one. Only one
recycled row per logical scheduled job, no pile-up. Checked before the
in-flight hold-back so it works whether or not the installed queue
release already excludes aborted jobs from isQueued() — an aborted job is
terminal and never blocks. Still-retrying jobs are untouched: while
attempts remain the job is genuinely in flight and the next tick is held
back, so there is no early re-dispatch.

#2 — Backoff: a new `consecutive_failures` counter tracks aborts without
an intervening success. After `QueueScheduler.maxConsecutiveFailures`
(default 0 = unlimited) the row is disabled and a
`QueueScheduler.Row.disabled` event is fired so the host app can alert. A
successful/fresh dispatch resets the counter to 0.

The disable write commits while run() still reports "not dispatched":
the transactional() callback returns true (commit) for both a dispatch
and a backoff-disable, and only false (rollback) for hold-back / lost
compare-and-swap. The actual dispatched/not-dispatched result is tracked
in an outer flag so the disable is not rolled back.

Adds the `consecutive_failures` column (migration + fixture + entity),
three SchedulerRowsTable tests, and documents the config key in
docs/README.md and config/app.example.php.

Pairs with dereuromark/cakephp-queue#504 (which persists the terminal
aborted state); dormant and behaviour-preserving on older queue releases.

* Fix json_decode() arg order in SchedulerRow::_getJobData()

JSON_THROW_ON_ERROR was passed as the 3rd argument ($depth) instead of
the 4th ($flags), so the throw-on-error behaviour never actually engaged
and a malformed param would have returned null rather than throwing.
Pass an explicit depth (512, the PHP default) and the flag in its proper
slot. Newer phpstan flags this as argument.invalidConstant; the bug
predates this branch but only surfaced now via a phpstan release.

* Harden scheduler failure backoff

- maxConsecutiveFailures now counts reruns granted rather than aborts: the
  cap is checked before incrementing, so re-enabling a disabled row recovers
  through the in-place rerun path on every queue version. Previously a cap of
  1 could never retry, and re-enabling immediately re-disabled the row.
- Reset the consecutive_failures streak in beforeSave() when a row is
  re-enabled, granting it a fresh round of reruns.
- Dispatch the QueueScheduler.Row.disabled event and log the warning after
  the transaction commits, so a listener never runs mid-transaction.
- Tidy a detached test docblock left over from the earlier insertion.

* Reference cakephp-queue#505 in the reset-status shim comment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Shell is not working with cake 2.x

1 participant