Skip to content

Rerun terminally-failed scheduled jobs in place, with failure backoff#48

Merged
dereuromark merged 4 commits into
masterfrom
feature/scheduler-rerun-aborted-and-backoff
May 28, 2026
Merged

Rerun terminally-failed scheduled jobs in place, with failure backoff#48
dereuromark merged 4 commits into
masterfrom
feature/scheduler-rerun-aborted-and-backoff

Conversation

@dereuromark
Copy link
Copy Markdown
Owner

@dereuromark dereuromark commented May 28, 2026

Pairs with dereuromark/cakephp-queue#504 (which persists terminal aborted state). Dormant and behaviour-preserving on older queue releases.

Problem

A non-concurrent row gates its next dispatch on QueuedJobsTable::isQueued(). Once the queue records terminal aborted state, a dead job stops counting as queued and the schedule resumes — but it would then queue a brand-new job every tick, leaving one failed row behind per interval (≈1440/day for an every-minute task). That's a different failure mode from the wedge, and it's noisy.

1 — Rerun in place

When the row's previous dispatch terminally failed (queue status = aborted), run() resets that same job (clears attempts + the stale aborted status) instead of creating a new one. One recycled row per logical scheduled job, no pile-up.

  • Checked before the in-flight hold-back, so it works whether or not the installed queue release already excludes aborted jobs from isQueued() (#504) — an aborted job is terminal and never blocks.
  • Still-retrying jobs are untouched: while attempts remain the job is genuinely in flight and the next tick is held back as before, so there is no early re-dispatch.

2 — Backoff + disable + alert

A new consecutive_failures counter tracks aborts without an intervening success. After QueueScheduler.maxConsecutiveFailures (default 0 = unlimited) the row is disabled and a QueueScheduler.Row.disabled event is dispatched so the host app can alert. A successful / fresh dispatch resets the counter.

$this->getEventManager()->on('QueueScheduler.Row.disabled', function ($event) {
    $row = $event->getData('row');
    $failures = $event->getData('consecutiveFailures');
    // notify ops…
});

Transaction note

The disable write must commit while run() still reports "not dispatched". The transactional() callback returns true (commit) for both a dispatch and a backoff-disable, and only false (rollback) for hold-back / lost compare-and-swap. The actual dispatched result is tracked in an outer flag — returning false from the callback would roll the disable back (that was a real bug caught in testing).

Changes

  • consecutive_failures column: migration + fixture + entity property.
  • run() rerun/backoff logic + abortedJobId() helper.
  • 3 tests (testRunRerunsAbortedJobInsteadOfCreatingNew, testRunDisablesRowAfterMaxConsecutiveFailures, testRunResetsFailureCounterOnFreshDispatch).
  • Config documented in docs/README.md + config/app.example.php.

SchedulerRowsTableTest green (28); phpcs clean; phpstan clean on the changed table file. (A pre-existing json_decode flags/depth phpstan note in SchedulerRow::_getJobData() is unrelated and only surfaces under newer phpstan than the pinned CI version.)

A non-concurrent row gates its next dispatch on QueuedJobsTable::isQueued().
Once cakephp-queue records terminal "aborted" state, a dead job stops
counting as queued, so the schedule resumes — but it would then queue a
brand-new job every tick, leaving one failed row behind per interval
(e.g. ~1440/day for an every-minute task).

#1 — Rerun in place: when the row's previous dispatch terminally failed
(queue status = aborted), run() resets that same job (clears attempts and
the stale aborted status) instead of creating a new one. Only one
recycled row per logical scheduled job, no pile-up. Checked before the
in-flight hold-back so it works whether or not the installed queue
release already excludes aborted jobs from isQueued() — an aborted job is
terminal and never blocks. Still-retrying jobs are untouched: while
attempts remain the job is genuinely in flight and the next tick is held
back, so there is no early re-dispatch.

#2 — Backoff: a new `consecutive_failures` counter tracks aborts without
an intervening success. After `QueueScheduler.maxConsecutiveFailures`
(default 0 = unlimited) the row is disabled and a
`QueueScheduler.Row.disabled` event is fired so the host app can alert. A
successful/fresh dispatch resets the counter to 0.

The disable write commits while run() still reports "not dispatched":
the transactional() callback returns true (commit) for both a dispatch
and a backoff-disable, and only false (rollback) for hold-back / lost
compare-and-swap. The actual dispatched/not-dispatched result is tracked
in an outer flag so the disable is not rolled back.

Adds the `consecutive_failures` column (migration + fixture + entity),
three SchedulerRowsTable tests, and documents the config key in
docs/README.md and config/app.example.php.

Pairs with dereuromark/cakephp-queue#504 (which persists the terminal
aborted state); dormant and behaviour-preserving on older queue releases.
JSON_THROW_ON_ERROR was passed as the 3rd argument ($depth) instead of
the 4th ($flags), so the throw-on-error behaviour never actually engaged
and a malformed param would have returned null rather than throwing.
Pass an explicit depth (512, the PHP default) and the flag in its proper
slot. Newer phpstan flags this as argument.invalidConstant; the bug
predates this branch but only surfaced now via a phpstan release.
- maxConsecutiveFailures now counts reruns granted rather than aborts: the
  cap is checked before incrementing, so re-enabling a disabled row recovers
  through the in-place rerun path on every queue version. Previously a cap of
  1 could never retry, and re-enabling immediately re-disabled the row.
- Reset the consecutive_failures streak in beforeSave() when a row is
  re-enabled, granting it a fresh round of reruns.
- Dispatch the QueueScheduler.Row.disabled event and log the warning after
  the transaction commits, so a listener never runs mid-transaction.
- Tidy a detached test docblock left over from the earlier insertion.
@dereuromark dereuromark merged commit 41794e0 into master May 28, 2026
16 checks passed
@dereuromark dereuromark deleted the feature/scheduler-rerun-aborted-and-backoff branch May 28, 2026 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant