Fix anyio 4.13 100% CPU hot-loop in CancelScope._deliver_cancellation #128
Open
constkolesnyak wants to merge 3 commits into
anyio 4.13.0's CancelScope._deliver_cancellation sets should_retry=True unconditionally for every task in self._tasks, then reschedules itself via call_soon(). When every task in the scope is the *current* task, nothing gets cancelled but the callback re-queues on every event-loop tick, pinning one CPU core at 100% with ~45k epoll_pwait syscalls/sec. Observed on April 22 and again on April 23 (24h+ of 97% CPU, no work done).

The existing _safe_disconnect() workaround in agent.engine only clears the stuck scope during client.disconnect(), so spins triggered by Telegram polling / cron / live SDK requests weren't covered.

The patch sets should_retry=True only when we actually delivered a cancel or the task is still waiting pickup (_must_cancel). Semantics otherwise match upstream byte-for-byte. Applied via nerve/__init__.py so any import path picks it up.

Includes a regression test that exercises the exact pathological shape (scope whose only task is the current task) and asserts the scope stops rescheduling itself.
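The rescheduling behaviour described in this commit message can be reproduced in miniature without anyio. The sketch below is a toy model, not anyio's code: `count_reschedules` and `deliver` are hypothetical names, and `retry_always` stands in for an unconditional `should_retry=True`. It only shows that a callback which re-queues itself through `call_soon()` runs once per event-loop iteration for as long as nothing stops it.

```python
import asyncio


async def count_reschedules(retry_always: bool, duration: float = 0.05) -> int:
    """Count how often a callback runs when it re-queues itself via call_soon()."""
    loop = asyncio.get_running_loop()
    calls = 0
    stop = False

    def deliver() -> None:
        nonlocal calls
        calls += 1
        # Toy stand-in for should_retry: nothing can ever be cancelled here,
        # so an unconditional retry re-queues on every loop iteration.
        if retry_always and not stop:
            loop.call_soon(deliver)

    loop.call_soon(deliver)
    await asyncio.sleep(duration)  # timers still fire while the callback spins
    stop = True
    return calls


spinning = asyncio.run(count_reschedules(retry_always=True))
idle = asyncio.run(count_reschedules(retry_always=False))
print(f"unconditional retry: {spinning} calls, conditional retry: {idle} call")
```

The loop never blocks in the selector while the ready queue is non-empty, which is consistent with the high `epoll_pwait` rate reported above.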
The April 23 patch (ec0f8f7) fixed the 96%-CPU hot loop where should_retry was set unconditionally, but left a narrower version of the same bug: the `_must_cancel` branch still set should_retry=True. When the current task itself sits with _must_cancel=True while running the cancel callback (observed in production today, nerve/SDK path), this re-queues _deliver_cancellation on every event-loop tick:

  before fix: 20% CPU, ~61k epoll_pwait/sec
  after fix: should be idle (~5% CPU range)

Changes:
- Skip the current task entirely: it cannot cancel itself from inside the callback it is running.
- Drop should_retry=True in the _must_cancel branch: asyncio's Task.__step raises CancelledError when the task resumes, no retry needed from us.
- should_retry is now True only when we actually called task.cancel() in this pass.
- Add regression test that poisons the _must_cancel branch with a fake task. The existing "current-task-only" test passed without reproducing this variant because it didn't set _must_cancel.
Third iteration of the anyio _deliver_cancellation spin. This time a CancelScope retained already-finished tasks in self._tasks (anyio doesn't always prune them before the cancel callback fires). For a done task _must_cancel=False (cleared on final step), _task_started is True, _fut_waiter is None, so the previous patch fell into the "waiter not done → cancel()" branch. task.cancel() is a no-op on done tasks, but should_retry was flagged anyway, so call_soon kept re-queuing forever.

Observed live: three zombie-scopes in one process producing ~55k epoll_pwait/sec combined, 100% CPU on MainThread, load 1.6, 60°C. Confirmed via py-spy (stack parked on lines 91/95/97/98 of the patch) and a gc-scan dump of CancelScope objects (three active _cancel_handle scopes, each with a single done=True task).

Fix: add `if task.done(): continue` at the top of the loop. Also add regression test that reproduces the zombie-scope shape with a stubbed done task and asserts no cancel() call, no retry, and no pending _cancel_handle.
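Taken together, the three commit messages describe a retry rule that can be sketched as a small pure function. This is an illustration under assumptions, not the patch itself: `StubTask` and `deliver_once` are hypothetical names, the `must_cancel` field stands in for asyncio's private `_must_cancel` flag, and the real method has further conditions (task start state, waiter checks) that are omitted here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(eq=False)
class StubTask:
    """Hypothetical stand-in for an asyncio task, for illustration only."""
    name: str
    finished: bool = False
    must_cancel: bool = False  # mirrors the private _must_cancel flag
    cancel_calls: int = 0

    def done(self) -> bool:
        return self.finished

    def cancel(self) -> bool:
        self.cancel_calls += 1
        return not self.finished


def deliver_once(tasks: Iterable[StubTask], current: Optional[StubTask]) -> bool:
    """One delivery pass; returns True only if another pass is worth scheduling."""
    should_retry = False
    for task in tasks:
        if task.done():
            continue  # third commit: cancel() is a no-op on finished tasks
        if task is current:
            continue  # second commit: a task cannot cancel itself from here
        if task.must_cancel:
            continue  # second commit: the cancel is delivered when the task resumes
        task.cancel()
        should_retry = True  # retry only when a cancel was actually issued
    return should_retry


done_task = StubTask("done", finished=True)
current_task = StubTask("current")
pending_task = StubTask("pickup-pending", must_cancel=True)
live_task = StubTask("live")

zombie_retry = deliver_once([done_task], current=None)
current_only_retry = deliver_once([current_task], current=current_task)
pending_retry = deliver_once([pending_task], current=None)
live_retry = deliver_once([live_task], current=None)
```

With these rules, only the last scope asks for another pass; the zombie, current-task-only, and pickup-pending shapes all return without rescheduling.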
Problem

anyio 4.13.0's `CancelScope._deliver_cancellation` re-queues itself via `call_soon()` on every event-loop tick when the cancel can't actually land. Two shapes hit this:
- Upstream sets `should_retry = True` for every task in `self._tasks`, then reschedules, but a task cannot cancel itself from inside its own cancel callback. If the scope's only member is the current task (common shape from anyio task groups / Claude Agent SDK), the loop reschedules forever.
- Settled tasks (`_must_cancel`). A child task that already finished still sits in `_tasks` with `_must_cancel=True` and yields nothing to cancel. The loop again sets `should_retry=True` and reschedules forever.

Both produce the same observed symptom: one core pinned at ~97% CPU with ~45k–61k `epoll_pwait` syscalls/sec, no work done. Caught twice in production (24h+ stuck before manual restart). The existing `_safe_disconnect()` workaround in `nerve/agent/engine.py` only clears the scope during `client.disconnect()`, so any spin triggered by Telegram polling, cron, or a live SDK request isn't covered.

Fix
Monkeypatch `anyio.CancelScope._deliver_cancellation` (applied from `nerve/__init__.py` so any import path picks it up). Semantics match upstream byte-for-byte except:
- Skip tasks that are already done (`task.done()`).
- Only set `should_retry=True` when we actually delivered a cancel or the task is still waiting pickup (`_must_cancel` is False but the task is not done).

Net effect: a scope whose only tasks are (current task ∪ done tasks ∪ pickup-pending tasks) stops rescheduling itself and the loop becomes idle. Before the fix: 20–97% CPU. After: idle range (~5% CPU).
Applied via a tiny monkeypatch instead of a vendored anyio because (a) the upstream `anyio` change is one line in `_deliver_cancellation` and (b) we want it to disappear automatically once anyio ships a fix.

Tests
`tests/test_anyio_patch.py` (8 tests, all passing on this branch):
- `test_patch_applied`: patch is installed at import time
- `test_does_not_reschedule_when_only_task_is_current`: the original 100% CPU shape
- `test_skips_done_tasks`: settled-tasks shape (the second 100% CPU bug)
- `test_must_cancel_pickup_pending_still_retries`: semantics preservation, legit pickup-pending cancels still retry
- `test_delivered_cancel_still_retries`: semantics preservation, real cancellations retry until landed
- `test_idempotent_apply`: applying twice is a no-op
- `test_original_signature_preserved`: no API drift
- `test_no_effect_when_scope_finished`: scope without pending cancellation is untouched

Full suite: `1252 passed, 2 skipped, 2 failed`. The 2 failures are pre-existing in `tests/test_cron.py::TestMaybeRotateContext` (unrelated rotate_at branch), present on `main` too.

Files
- `nerve/__init__.py`: apply patch at import
- `nerve/_anyio_patch.py`: the patched `_deliver_cancellation` (new)
- `tests/test_anyio_patch.py`: regression suite (new)