Symptom
After cf0557e shipped structured logging across harness + engines + substrates (Wedge 1.9), every lifecycle step now emits a stderr log line. But on fast agent.dispose() flows (sub-second sessions, e.g. a one-shot toy task), the last 1-2 log lines are missing from the terminal:
[12:24:36] INFO substrate.local.boot ...
[12:24:36] INFO harness.boot ...
[12:24:37] INFO harness.ready ...
[12:24:37] INFO substrate.local.harness.ready ...
[12:24:37] INFO harness.session.create ...
[12:24:37] INFO harness.session.start ...
[12:24:37] INFO harness.engine.start ...
---
done • 276 tokens
shutting down harness subprocess
[12:24:43] INFO substrate.local.dispose ...
^^ but `harness.engine.turn.end` and `harness.session.end`
should have appeared BEFORE this line, and they don't
Both engine.turn.end and session.end ARE called inside the harness child — run-session.ts:118 and engine-claude-agent-sdk/src/engine.ts:118 respectively — but the bytes never reach the parent terminal.
Root cause
agent.dispose() flow:
- SDK POSTs
/v1/sessions/:id/end-input and awaits the response.
- The route handler synchronously calls
session.endUserMessages() and returns {ok:true}.
- The POST resolves — but inside the child, the engine's cleanup is still in flight async: SDK's
query() iterator wraps up, the for-await over it exits, log.info(\"engine.turn.end\") fires, then runSession's drain IIFE fires log.info(\"session.end\").
- SDK calls
b.shutdown() → killProcess sends SIGTERM.
- Child receives SIGTERM and exits before its stderr pipe drains.
So the lines are written by the child, but Node's pipe buffer hasn't flushed by the time the parent reads EOF.
A 250ms drain in killProcess (currently in packages/runtime-local/src/local-substrate.ts:163) covers some cases but not all — the engine's actual async cleanup can take longer than 250ms even on a session that returned a 5-word response.
What works today
| Scenario |
Final lines logged? |
| Normal agent run (multi-second response) |
✅ yes |
Sub-second one-shot tasks (toy tests, echo > file.txt) |
✗ last 1-2 truncated |
| Errored runs (uncaught throw, network failure) |
mixed — error line lands, turn.end may not |
Cancelled runs (/cancel while in-flight) |
✅ yes (the cancellation path emits session.end synchronously before the loop returns) |
Why it doesn't hurt the user (today)
The user always sees every step UP TO the engine's final response. The two missing lines are the terminator events — useful for telemetry parsers but not for understanding what the agent did. And on any session that actually does meaningful work, they show up.
But it's still a bug, because:
- Truncated structured logs corrupt downstream parsers (Datadog, Loki, jq pipelines that count
event=session.end to compute concurrent sessions).
- Telemetry consumers can't distinguish "the session ended cleanly" from "the harness crashed mid-turn" without
session.end.
- The race is timing-dependent — a CI runner on a slower box may or may not see the lines, making test output non-deterministic.
Options for a fix
-
Wait for the engine to fully drain before responding to /end-input. Have the route handler await a per-session "engine drained" promise that resolves after runSession's drain IIFE completes. This makes the POST response semantically meaningful ("engine is done") rather than just "queue closed". ~30 LOC across routes/end-input.ts + session.ts. Recommended.
-
Drain child stderr in killProcess before exit. Subscribe to child.stderr.on('end') and wait for it (with a hard timeout). Cleaner than a fixed sleep but still racy if the harness is doing async work past the pipe close.
-
Use line-buffered stderr in the harness logger. Currently process.stderr.write is non-blocking when piped. Switch to process.stderr.write(line + \"\\n\", () => {}) and chain writes through a serialized queue — guarantees each line lands in the pipe before the next is queued. Doesn't help with the SIGTERM race itself, but reduces the window.
-
Document and move on. The user gets every meaningful log line during a real session; the very-fast-exit truncation is cosmetic.
Option 1 is the right architectural call — it makes /end-input a real synchronization point. Option 4 is the cheapest. The middle two are tactical patches.
Definition of done
/end-input returns only AFTER the engine has finished emitting all its lifecycle logs for the current turn (or after a hard 5s timeout, surfaced as http.warn log line).
- A test in
packages/sdk/src/computer-agent.test.ts that runs a trivial one-shot chat with a bufferingLogger test double and asserts both engine.turn.end and session.end events fire before dispose() resolves.
- Existing tests stay green.
Severity
Low — cosmetic for interactive use; potentially nasty for downstream log-aggregation pipelines that depend on session.end to close out telemetry records. No silent data loss in the SDK itself.
Repro
COMPUTERAGENT_LOG=info COMPUTERAGENT_LOG_FORMAT=pretty node --experimental-strip-types examples/your-fast-test.ts
Where the test does one short `agent.chat("echo hi")` and disposes immediately. Compare the log output to the expected event sequence in cf0557e's commit message.
Related
cf0557e (Wedge 1.9): the wedge that introduced the logging this issue is about.
packages/runtime-local/src/local-substrate.ts:163 — current 250ms band-aid.
Symptom
After
cf0557eshipped structured logging across harness + engines + substrates (Wedge 1.9), every lifecycle step now emits a stderr log line. But on fastagent.dispose()flows (sub-second sessions, e.g. a one-shot toy task), the last 1-2 log lines are missing from the terminal:Both
engine.turn.endandsession.endARE called inside the harness child —run-session.ts:118andengine-claude-agent-sdk/src/engine.ts:118respectively — but the bytes never reach the parent terminal.Root cause
agent.dispose()flow:/v1/sessions/:id/end-inputand awaits the response.session.endUserMessages()and returns{ok:true}.query()iterator wraps up, the for-await over it exits,log.info(\"engine.turn.end\")fires, thenrunSession's drain IIFE fireslog.info(\"session.end\").b.shutdown()→killProcesssends SIGTERM.So the lines are written by the child, but Node's pipe buffer hasn't flushed by the time the parent reads EOF.
A 250ms drain in
killProcess(currently inpackages/runtime-local/src/local-substrate.ts:163) covers some cases but not all — the engine's actual async cleanup can take longer than 250ms even on a session that returned a 5-word response.What works today
echo > file.txt)/cancelwhile in-flight)Why it doesn't hurt the user (today)
The user always sees every step UP TO the engine's final response. The two missing lines are the terminator events — useful for telemetry parsers but not for understanding what the agent did. And on any session that actually does meaningful work, they show up.
But it's still a bug, because:
event=session.endto compute concurrent sessions).session.end.Options for a fix
Wait for the engine to fully drain before responding to
/end-input. Have the route handler await a per-session "engine drained" promise that resolves afterrunSession's drain IIFE completes. This makes the POST response semantically meaningful ("engine is done") rather than just "queue closed". ~30 LOC acrossroutes/end-input.ts+session.ts. Recommended.Drain child stderr in
killProcessbefore exit. Subscribe tochild.stderr.on('end')and wait for it (with a hard timeout). Cleaner than a fixed sleep but still racy if the harness is doing async work past the pipe close.Use line-buffered stderr in the harness logger. Currently
process.stderr.writeis non-blocking when piped. Switch toprocess.stderr.write(line + \"\\n\", () => {})and chain writes through a serialized queue — guarantees each line lands in the pipe before the next is queued. Doesn't help with the SIGTERM race itself, but reduces the window.Document and move on. The user gets every meaningful log line during a real session; the very-fast-exit truncation is cosmetic.
Option 1 is the right architectural call — it makes
/end-inputa real synchronization point. Option 4 is the cheapest. The middle two are tactical patches.Definition of done
/end-inputreturns only AFTER the engine has finished emitting all its lifecycle logs for the current turn (or after a hard 5s timeout, surfaced ashttp.warnlog line).packages/sdk/src/computer-agent.test.tsthat runs a trivial one-shot chat with abufferingLoggertest double and asserts bothengine.turn.endandsession.endevents fire beforedispose()resolves.Severity
Low — cosmetic for interactive use; potentially nasty for downstream log-aggregation pipelines that depend on
session.endto close out telemetry records. No silent data loss in the SDK itself.Repro
Where the test does one short `agent.chat("echo hi")` and disposes immediately. Compare the log output to the expected event sequence in
cf0557e's commit message.Related
cf0557e(Wedge 1.9): the wedge that introduced the logging this issue is about.packages/runtime-local/src/local-substrate.ts:163— current 250ms band-aid.