Rectify: Stale Detector Kills Actively Working Sessions During run_skill#3391
Merged
Trecek merged 17 commits intoMay 31, 2026
Merged
Conversation
af5551c to
81fc2d9
Compare
Adds TDD tests that fail until the execution marker glob is generalized: - test_stale_NOT_fired_when_execution_marker_active: session monitor must suppress stale when run-skill-in-progress-* marker is fresh - test_stale_fired_when_execution_marker_expired: marker present but stale still fires when execution marker is expired - test_execution_marker_suppression_bounded_by_max_suppression_seconds: suppression is bounded even with continuously-touched run-skill marker - test_stdout_idle_NOT_fired_when_execution_marker_active: idle watchdog must suppress IDLE_STALL when run-skill-in-progress-* marker is fresh Current dispatch-in-progress-* glob does not match run-skill-in-progress-* markers, so tests 1a, 1c, 1d fail before the fix. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…_marker
Generalizes the marker detection function and its glob pattern to match
both dispatch-in-progress-* and run-skill-in-progress-* markers:
Old glob: dispatch-in-progress-{session_id}-*.marker
New glob: *-in-progress-{session_id}-*.marker
This makes the stale detector's suppression mechanism work for any
execution context, not just fleet dispatches. All callers in
_process_race.py (_watch_stdout_idle, _watch_child_activity) and
_process_monitor.py (_session_log_monitor) are updated.
Updated all test references: monkeypatch targets, imports, class names,
frozenset, arch guard predicate string, and CLAUDE.md table entry.
Adds parametrized backward-compat test confirming both prefixes match
and a wrong-session-id guard test.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds marker_dir: Path | None and caller_session_id: str | None to: - HeadlessExecutor.run() Protocol - DefaultHeadlessExecutor.run() concrete method - run_headless_core() function Both params are forwarded to _execute_claude_headless() as marker_dir and session_id respectively, making the stale detector's marker-based suppression available for run_skill sessions (not just dispatch_food_truck). Adds integration tests verifying the forwarding chain: - test_run_headless_core_forwards_marker_dir_and_caller_session_id - test_default_executor_run_forwards_marker_dir_and_caller_session_id - test_headless_executor_protocol_includes_marker_dir_params Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ker creation Create core/_execution_marker.py with a shared async context manager that handles marker write, heartbeat, and cleanup. Refactor fleet/_api.py to use execution_marker instead of inline marker logic, and wire run_skill in tools_execution.py to create markers before blocking on executor.run(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fication gap Verifies that execution_marker context manager creates a marker file on entry and deletes it on exit — the lifecycle unit test that was missing from the stale-detector run_skill marker blind spot audit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d_dispatch Add TestResumeSuccessGuard to test_resume_preflight.py covering the _run_dispatch early-return path when prior dispatch already succeeded. test_resume_rejected_when_prior_dispatch_succeeded fails until the guard is wired in Phase 2. Add TestHasCompletedDispatch to test_state.py and implement has_completed_dispatch in state_recovery.py (name-specific SUCCESS check, fail-open on missing/corrupted state). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e-exports Wire _build_success_short_circuit into _run_dispatch: when prior dispatch already has DispatchStatus.SUCCESS, return the cached result immediately without launching a subprocess. Add has_completed_dispatch re-exports through state.py and fleet/__init__ so callers can use `from autoskillit.fleet import has_completed_dispatch`. Add dispatch_food_truck MCP tool pre-flight guard: if campaign state already shows SUCCESS for effective_name, return the cached result before execute_dispatch is called (prevents semaphore acquisition and state handle creation for duplicate calls). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…scanner depth Add stale_threshold: 2400 to all run_diagnostic* steps in implementation, remediation, and merge-prs recipes (10 steps total). The default 1200s is too short for multi-subagent scanner work; 2400s matches the existing convention for long-running steps (implement, retry_worktree). Add scanner maxTurns and wall-clock soft-deadline guidance to analyze-pipeline-health SKILL.md to bound investigation per scanner and prevent the stale detector misfiring window. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Move DispatchStateHandle.open_continued inside the try block so that FileNotFoundError (a subclass of OSError) is caught by the existing except handler. Add a create_fresh fallback in the except so handle is always bound after the try/except, enabling fail-open resume when the prior state file is absent. Extract recipe_snapshot before the if/else so it is available in both the except fallback and the else branch. Add test_resume_proceeds_when_prior_state_missing to TestResumeSuccessGuard to verify the fail-open behaviour. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace duplicate label suffix in marker filename with uuid4 hex to prevent collisions between concurrent executions sharing the same label + session_id. Replace asyncio.create_task with anyio task group for structured concurrency and proper exception visibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add find_completed_dispatch helper that returns the full DispatchRecord for a completed dispatch. Use it in the dispatch_food_truck MCP tool pre-flight guard to forward actual dispatch_id and dispatched_session_id instead of hardcoded empty strings. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ch fields Add monitor_status emptiness assertion in test_stale_NOT_fired_when_execution_marker_active to distinguish STALE from other early-exit conditions. Add dispatch_id, dispatched_session_id, and reason field assertions in test_resume_rejected_when_prior_dispatch_succeeded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert from anyio task group (yield inside task group body causes cleanup issues with @asynccontextmanager) to asyncio.create_task with split exception handling: CancelledError silenced, other exceptions logged. Update tests to match new _touch_marker 2-arg signature. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ppression test Adds assert elapsed < 5.0 after the lower-bound check so a broken suppression path (marker never matched) cannot silently pass when the test happens to run slowly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…checks Adds assert "marker_dir" in sig.parameters and assert "caller_session_id" in sig.parameters before accessing .default, so a missing parameter produces an actionable assertion message instead of an uninformative KeyError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
New core/ module _execution_marker was missing from MODULE_CASCADE_CORE, causing test_all_core_stems_classified to fail. Maps it to the four consuming packages: core, execution, fleet, server. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
81fc2d9 to
e2edc1b
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The stale detector uses JSONL file-size growth as its sole liveness signal. When a session is blocked on a long-running MCP tool call (e.g.,
run_skillexecutinganalyze-pipeline-health), the JSONL stops growing because Claude writes only on conversation turns, not while awaiting tool results. The stale detector interprets this silence as staleness and kills the session.The architectural weakness:
dispatch_food_truckcreates a dispatch marker that suppresses stale kills in all three watchers (_session_log_monitor,_watch_stdout_idle,_watch_child_activity), butrun_skillcreates no such marker. Furthermore, therun_skillcall chain throughDefaultHeadlessExecutor.run()→run_headless_core()never passesmarker_dirorcaller_session_idto_execute_claude_headless()— even though that function already accepts both parameters. This means the monitoring tasks insiderun_managed_asyncreceivemarker_dir=Noneand skip all marker-based suppression unconditionally.The solution: extend the execution marker protocol to
run_skillsessions by (1) renaming and generalizing the marker detection function, (2) threadingmarker_dirandcaller_session_idthrough therun()call chain, and (3) creating markers inrun_skillbefore blocking onexecutor.run().Requirements
Two
dispatch_food_trucksessions (issue #3246, issue #3317) completed ALL substantive work — planning, implementation, code review, PR merge via merge queue, and issue release — but were killed by the stale detector during the finalrun_diagnosticstep. The orchestrator then triggered unnecessary resume dispatches, wasting tokens on sessions whose PRs were already merged 18–70 minutes earlier.Closes #3349
Implementation Plan
Plan file:
/home/talon/projects/autoskillit-runs/remediation-20260530-174510-498188/.autoskillit/temp/rectify/rectify_stale_detector_mcp_blind_spot_2026-05-30_174510_part_a.md🤖 Generated with Claude Code via AutoSkillit
Token Usage Summary
* Step used a non-Anthropic provider; caching behavior may differ.
Token Efficiency
Model Usage Breakdown