Skip to content

Rectify: Channel B Resume-Boundary False-Fire#3387

Merged
Trecek merged 11 commits into
developfrom
channel-b-false-fires-on-stale-completion-marker-when-resumi/3360
May 31, 2026
Merged

Rectify: Channel B Resume-Boundary False-Fire#3387
Trecek merged 11 commits into
developfrom
channel-b-false-fires-on-stale-completion-marker-when-resumi/3360

Conversation

@Trecek
Copy link
Copy Markdown
Collaborator

@Trecek Trecek commented May 31, 2026

Summary

Channel B's _session_log_monitor has no resume-boundary awareness. Phase 2 unconditionally initializes scan_pos = 0 and last_size = 0, causing it to read the entire JSONL file — including content from the original session that already contains the completion marker. On resumed sessions (claude --resume), this produces a false ChannelBStatus.COMPLETION that kills the session in ~14 seconds with 0 tokens.

The architectural weakness is that _session_log_monitor treats every JSONL file as if it were freshly created, with no concept of a content boundary between prior and current session activity. The fix makes Phase 2's scan position a function of file state at discovery time, eliminating the entire class of stale-content false-fires.

Closes #3360

Implementation Plan

Plan file: /home/talon/projects/autoskillit-runs/remediation-20260530-172850-168963/.autoskillit/temp/rectify/rectify_channel_b_resume_boundary_2026-05-30_172850_part_a.md

🤖 Generated with Claude Code via AutoSkillit

Token Usage Summary

Step Model count uncached output cache_read peak_ctx turns cache_write time
rectify* opus[1m] 1 93 37.3k 3.0M 151.4k 241 173.3k 24m 29s
review_approach* sonnet 1 4.7k 8.3k 199.6k 52.0k 133 40.1k 7m 52s
dry_walkthrough* opus 2 2.4k 34.5k 1.3M 80.6k 164 97.2k 15m 39s
implement* sonnet 2 144 7.9k 650.5k 47.0k 73 32.7k 8m 16s
audit_impl* sonnet 2 1.0k 13.2k 3.8M 72.4k 142 247.5k 15m 18s
prepare_pr* sonnet 1 94.2k 3.3k 193.0k 27.6k 22 40.5k 1m 12s
compose_pr* sonnet 1 37.9k 1.4k 162.6k 27.6k 14 15.5k 41s
review_pr* sonnet 3 506 125.8k 4.0M 119.0k 207 329.1k 29m 32s
resolve_review* opus 3 195 48.3k 5.9M 91.2k 219 273.9k 32m 48s
Total 141.2k 280.0k 19.2M 151.4k 1.2M 2h 15m

* Step used a non-Anthropic provider; caching behavior may differ.

Token Efficiency

Step LoC Changed cache_read/LoC cache_write/LoC output/LoC
rectify 0
review_approach 0
dry_walkthrough 0
implement 195 3335.9 167.8 40.3
audit_impl 0
prepare_pr 0
compose_pr 0
review_pr 0
resolve_review 47 124837.1 5827.4 1028.0
Total 242 79375.1 5164.7 1157.1

Model Usage Breakdown

Model steps uncached output cache_read cache_write time
opus[1m] 1 93 37.3k 3.0M 173.3k 24m 29s
sonnet 6 138.4k 159.9k 9.0M 705.5k 1h 2m
opus 2 2.6k 82.8k 7.2M 371.0k 48m 27s

@Trecek Trecek force-pushed the channel-b-false-fires-on-stale-completion-marker-when-resumi/3360 branch 2 times, most recently from 1139207 to 3ce98db Compare May 31, 2026 04:33
Copy link
Copy Markdown
Collaborator Author

@Trecek Trecek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AutoSkillit PR Review — Verdict: approved_with_comments

Comment thread src/autoskillit/execution/process/_process_monitor.py
Comment thread src/autoskillit/execution/process/_process_monitor.py
last_size = len(_initial_content.encode("utf-8"))
except OSError:
logger.warning(
"session_log_phase2_init_read_failed",
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning] bugs: scan_pos is a Unicode char count (len of decoded str) while last_size is a UTF-8 byte count. The debug log at L310-L311 emits both as if comparable, but when the file contains multi-byte characters they diverge silently. Add a comment clarifying the units or use explicit variable names (scan_pos_chars / last_size_bytes).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid observation — flagged for design decision. scan_pos (char count for read_text slicing) and last_size (byte count for stat comparison) intentionally track different units for different consumers. Renaming variables or adding unit-suffix comments is a design decision for a follow-up.

Comment thread src/autoskillit/execution/process/_process_monitor.py
Comment thread tests/execution/test_process_channel_b.py Outdated
assert result.session_id == ""


class TestResumeBoundary:
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[warning] cohesion: TestResumeBoundary is added to test_process_session_log_monitor.py only. test_process_channel_b.py exercises the same scan_pos boundary through subprocess helpers but has no explicit resume-boundary assertion class — a regression in the channel-B integration path would not surface as a named boundary-test failure.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid observation — flagged for design decision. Adding a parallel TestResumeBoundary class in test_process_channel_b.py would exercise the same scan_pos boundary through subprocess helpers. This is a meaningful test infrastructure decision beyond the scope of this bugfix PR.

Copy link
Copy Markdown
Collaborator Author

@Trecek Trecek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AutoSkillit review: warning-only findings detected. See inline comments — no blocking changes required.

Copy link
Copy Markdown
Collaborator Author

@Trecek Trecek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observations accumulated from 2 local review rounds:

f.flush()
# Delay must exceed session_id_timeout + Phase 1 poll so Phase 2 initializes
# scan_pos from discovery boundary before the marker arrives.
time.sleep(1.0)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation from local review round 0:

Timing-based synchronization between subprocess time.sleep(1.0) and the async monitor poll cadence is fragile under xdist -n 4 load. A readiness sentinel (flag file) would make ordering guaranteed rather than probabilistic.

Evidence: The sleep(1.0) is intentionally generous (1.0s vs actual constraint of ~0.02s), designed to avoid flakiness under WSL2+xdist load. Replacing with a flag-file sentinel is a meaningful redesign beyond PR scope.

f.write(json.dumps(init) + "\\n")
f.flush()
# Delay must exceed session_id_timeout + Phase 1 poll so Phase 2 initializes
# scan_pos from discovery boundary before the marker arrives.
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation from local review round 1:

Hard-coded 1.0s sleep to exceed session_id_timeout + Phase 1 poll is a timing assumption. The actual timeout values passed to the monitor are not visible here to confirm 1.0s is sufficient under heavy xdist load.

Evidence: Same underlying concern classified DISCUSS in round 0 (line 38). The 1.0s sleep is intentionally generous (actual constraint ~0.02s), designed to absorb WSL2+xdist event-loop jitter. Replacing with flag-file sentinel is a meaningful redesign beyond PR scope.

f.flush()
# Delay must exceed _session_id_timeout + Phase 1 poll so the monitor
# discovers the file and initializes Phase 2 before the marker arrives.
time.sleep(1.0)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Observation from local review round 1:

1.0s sleep repeated across multiple script variants with the same comment. A shared helper or named constant would reduce fragility.

Evidence: The 1.0s values live inside textwrap.dedent subprocess script strings, making extraction to a named constant non-trivial. This is a test infrastructure design decision beyond PR scope.

Trecek and others added 11 commits May 30, 2026 21:54
On resumed sessions (claude --resume), the JSONL file already contains
records from the prior session including the completion marker.
Previously, Phase 2 initialized scan_pos=0 and last_size=0 causing an
immediate false-fire COMPLETION result within ~14 seconds with 0 tokens.

Initialize both scan_pos (character offset) and last_size (byte size)
from the file's state at Phase 1 discovery time so Phase 2 only scans
content written after monitoring began, eliminating stale-content false-fires.

Adds TestResumeBoundary with three regression tests covering:
- pre-existing completion marker must not false-fire (was failing)
- new marker appended after monitoring starts still fires correctly
- stale threshold triggers correctly with no new writes (was failing)
Tests that pre-populated JSONL files with completion markers and expected
immediate COMPLETION now append the marker after monitoring starts,
matching the new Phase 2 behavior where scan_pos starts at the file's
discovery boundary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Subprocess scripts in test_process_channel_b.py now write an initial
non-marker record before appending the completion marker, ensuring the
marker arrives after Phase 2 initialization. Heartbeat orphan detection
tests in test_process_heartbeat.py now append records after monitoring
starts rather than pre-populating.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…hannel B tests

Scripts now delay writing the completion marker by 1-2s (up from 0.15s)
to ensure Phase 2 initializes scan_pos from the discovery boundary before
the marker arrives. Tests that don't write stdout reduce _session_id_timeout
to 0.01s so Phase 1 starts promptly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Derive last_size from the same read_text() content used for scan_pos,
closing the TOCTOU window where a write between read_text() and stat()
could cause the poll loop to never detect the completion marker.

Add logger.warning on OSError fallback to scan_pos=0 so operators can
distinguish a transient read failure from intentional boundary-skip.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The comment referenced _phase1_poll=1.0 but the test passes 0.01 after
commit b28e328. Update to match the actual constraint.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move json import to module level, removing three inline occurrences in
TestResumeBoundary. Revise docstring of
test_phase2_no_spurious_read_on_preexisting_content to match what is
actually asserted (STALE status, not no-read invariant).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…or handler

The comment about "two reads in separate try/except blocks" was left stale
after commit edf7f1d merged them into one block. Removed the misleading
comment. Added exc_info=True to the OSError handler so errno/strerror are
preserved in production logs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tatus

Replace isinstance(result_box[0].status, ChannelBStatus) — which is
trivially true for any status value — with an assertion checking the
expected COMPLETION status.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… test

The marker is written after 0.3s, so 5s provides ample headroom while
avoiding a 30s CI hang if the marker write fails.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reorder the _session_id_timeout docstring so the JSONL marker timing
comes first, followed by the session ID extraction constraint, matching
the logical flow of the test setup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@Trecek Trecek force-pushed the channel-b-false-fires-on-stale-completion-marker-when-resumi/3360 branch from 3ce98db to 1c01bf9 Compare May 31, 2026 04:54
@Trecek Trecek added this pull request to the merge queue May 31, 2026
Merged via the queue into develop with commit 6d58d85 May 31, 2026
3 checks passed
@Trecek Trecek deleted the channel-b-false-fires-on-stale-completion-marker-when-resumi/3360 branch May 31, 2026 05:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant