Skip to content

Fix process crash when testing Stderr#1449

Open
ericstj wants to merge 15 commits intomodelcontextprotocol:mainfrom
ericstj:fix1448
Open

Fix process crash when testing Stderr#1449
ericstj wants to merge 15 commits intomodelcontextprotocol:mainfrom
ericstj:fix1448

Conversation

@ericstj
Copy link
Copy Markdown
Contributor

@ericstj ericstj commented Mar 19, 2026

Fix #1448

Summary of fixes:

  • Unhandled exceptions in child processes — The conformance client could crash with an unhandled exception,
    generating crash dumps that abort the parent test host. Wrapped the main logic in try/catch to report errors to
    stderr instead.
  • Throwing stderr callbacks crash the process — A user-provided StandardErrorLines callback that throws would
    propagate to the background ErrorDataReceived dispatch thread, crashing the process. Added try/catch with a
    warning-level log.
  • Event handler lifetime — ErrorDataReceived/OutputDataReceived handlers could fire after disposal, including
    ITestOutputHelper.WriteLine which throws after test completion. Stored handler delegates as named fields and detach
    them during cleanup.
  • WaitForExit() hangs — The parameterless WaitForExit() waits for stdio stream EOF and hangs indefinitely when
    grandchild processes hold pipe handles open. Replaced with WaitForExitAsync(ct) on .NET and WaitForExit(timeout) on
    .NET Framework.
  • Cleanup race condition — When ReadMessagesAsync and DisposeAsync enter CleanupAsync concurrently, the second
    caller returned without completing the message channel, causing await Completion to hang forever. The second caller
    now calls SetDisconnected to ensure the channel completes.
  • Thread pool starvation from in-process StdioServerTransport — Tests that only validate DI/config were
    instantiating StdioServerTransport, which permanently blocks a thread pool thread via Console.OpenStandardInput().
    On 2-vCPU CI runners, multiple instances starved the pool. Changed these tests to use
    StreamServerTransport(Stream.Null, Stream.Null)

@stephentoub

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Mar 20, 2026

This may not be the root cause of the problem. We've now got dumps on both windows and linux. I'll have a look and see if the dumps reveal a better root-cause.

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Mar 23, 2026

Crashes fixed, but we have tests hanging. Could be due to the WaitForExit is now hanging. Downloaded the dumps and having a look.

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Mar 23, 2026

Yeah, WaitForExit is now hanging... I'm not convinced this is the right change. I'm wondering more "why aren't processes being cleaned up" and do we have a hang or test bug somewhere. Also - the fragility here where missing something means a crash seems busted. I think we should have a design that's resilient to not cleaning up external state.

@ericstj
Copy link
Copy Markdown
Contributor Author

ericstj commented Mar 24, 2026

Found leaked StdioServerTransport objects in dumps, these kept handles open to StdIn and were causing threadpool starvation. -- at least that's the theory.

What this fixes is the test host hang, which CI reports as a failure. The causal chain:

  1. 7 tests create StdioServerTransport via DI without disposing ServiceProvider
  2. Each transport starts Task.Run(ReadMessagesAsync) → blocks a thread pool thread on Console.OpenStandardInput()
    (native read that never returns)
  3. On 2-vCPU CI runners, 7 permanently blocked TP threads exhaust the pool
  4. Other tests needing TP threads for async I/O (like Process.Start completions) starve and hang
  5. CI blame collector detects the hang → kills the test host → reported as failure

The await using ensures DisposeAsync() is called on the ServiceProvider, which calls
StdioServerTransport.DisposeAsync(), which cancels the CancellationTokenSource and unblocks the read loop — freeing
those TP threads.

ericstj added 3 commits March 25, 2026 10:19
StdioServerTransport does not reliably close on linux, which causes a
threadpool thread to be blocked on read.  This can lead to threadpool
starvation and application hangs.  Since the tests in question do not
require stdio transport, switch to using StreamServerTransport with null
streams to avoid this issue.
@ericstj ericstj requested a review from stephentoub March 26, 2026 03:50
if (!processRunning && HasExited(process))
{
process.WaitForExit();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't we now possibly miss some error messages being reported over stderr from the child process?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WaitForExit() that was removed from DisposeProcess was only called when !processRunning && HasExited(process) — i.e. when the process had already exited on its own before we got here. The stderr flushing for that case is now handled earlier in StdioClientSessionTransport.GetUnexpectedExitExceptionAsync (line 99-106), which calls WaitForExitAsync(ct) on .NET or WaitForExit(timeout) on Framework. So stderr is still flushed before we build the error message and completion details.

In the processRunning case (where we call KillTree), the old code never called WaitForExit() either — the !processRunning guard prevented it. So there's no behavior change for that path.

The reason for removing it from DisposeProcess: DisposeProcess is a static helper called from both the session transport cleanup and the ConnectAsync error path. The session transport cleanup already handles stderr flushing, and the ConnectAsync error path doesn't need it. Keeping the no-arg WaitForExit() in DisposeProcess risked hangs when called from contexts where the streams aren't fully set up.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or WaitForExit(timeout) on Framework. So stderr is still flushed before we build the error message and completion details

WaitForExit(timeout) doesn't wait for EOF of the streams though, right?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — WaitForExit(timeout) does not wait for stream EOF.

I checked the .NET runtime source for WaitForExitAsync. It calls WaitUntilOutputEOF(cancellationToken) which does _error.EOF.WaitAsync(cancellationToken) — so it DOES wait for stream EOF, and the cancellation token IS respected for the stream-drain wait, not just the process-exit wait. So on .NET, WaitForExitAsync(ct) is strictly better than the old WaitForExit() — it flushes streams AND can be cancelled via CancelShutdown() if grandchild processes hold pipe handles open.

On .NET Framework there's no WaitForExitAsync, so we use WaitForExit(timeout) which doesn't guarantee stream flush. The risk of missing events is low in practice: by this point HasExited has returned true (line 93), the error handler is still attached and receiving events, and the ErrorDataReceived dispatching happens asynchronously — events that were already in the OS pipe buffer will still be delivered during the timeout window. We'd only miss events if there's a significant delay between the process exiting and the runtime dispatching buffered pipe data, which shouldn't happen when the process has cleanly exited. The alternative — no-arg WaitForExit() — has no cancellation mechanism on Framework and hangs indefinitely when grandchild processes hold pipe handles open.

ericstj and others added 3 commits March 26, 2026 11:08
When CleanupAsync is entered from ReadMessagesAsync (stdout EOF), the
cancellation token is _shutdownCts.Token which hasn't been canceled yet
(base.CleanupAsync runs later). If stderr EOF hasn't been delivered by
the threadpool yet, WaitForExitAsync hangs indefinitely.

Use a linked CTS with ShutdownTimeout to bound the wait. The process is
already dead; we're just draining stderr pipe buffers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CreateAsync_ValidProcessInvalidServer_StdErrCallbackInvoked Test crash in CI

2 participants