Skip to content

Reconnect disconnected exec-server websocket clients with fresh sessions#23867

Merged
starr-openai merged 5 commits into
mainfrom
starr/exec-reconnect-min-cand-a-rework
May 21, 2026
Merged

Reconnect disconnected exec-server websocket clients with fresh sessions#23867
starr-openai merged 5 commits into
mainfrom
starr/exec-reconnect-min-cand-a-rework

Conversation

@starr-openai
Copy link
Copy Markdown
Contributor

@starr-openai starr-openai commented May 21, 2026

Summary

  • replace the one-shot lazy remote exec-server cache with a lock-protected current client
  • when the cached websocket client is already disconnected, create one fresh websocket client/session on the next get()
  • keep existing disconnect failure behavior for old process sessions and HTTP body streams; do not add session resume or request retry

Why

The prior PR direction was trying to grow into session restore: resume the old session_id, preserve existing process handles, and add reconnect retry policy. That is more machinery than we want for this slice.

For now, the useful minimum is simpler: later fresh remote operations should not be stuck behind a dead cached websocket client, but anything already attached to the dead connection should fail loudly through the existing disconnect path. The server already has detached-session cleanup via its existing TTL, so this PR does not need to add client-side session preservation.

What Changed

  • LazyRemoteExecServerClient::get() now keeps the current concrete client in a small mutex-protected cache plus one async connect lock.
  • If that cached client is still connected, get() returns it.
  • If that cached websocket client has observed the transport close, get() creates a brand-new websocket client with a brand-new exec-server session and replaces the cache.
  • If that cached client is stdio-backed, behavior stays one-shot: the dead client is returned and later work surfaces the existing disconnect error.
  • No resume_session_id, backoff, request replay, or existing RemoteExecProcess rebinding is added here.
  • Added focused websocket coverage that proves two concurrent get() calls after disconnect share one fresh replacement client/session.

Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 42f2426c23

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread codex-rs/exec-server/src/client.rs Outdated
Copy link
Copy Markdown

@usgenes39-dotcom usgenes39-dotcom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed current head ac38595ed814802019b0584eb0e168074aab24cb.

No actionable findings from this pass. I re-checked the lazy remote exec-server cache/reconnect path against the current diff and the surrounding call chain (RemoteProcess, RemoteFileSystem, remote HTTP streams, and the lower JSON-RPC disconnect watch). The earlier stale-client race appears addressed by including RpcClient::is_disconnected() in ExecServerClient::is_disconnected(), and the semaphore-guarded cache replacement keeps concurrent post-disconnect get() callers sharing one fresh websocket session while preserving the intended one-shot behavior for stdio transports.

Local verification: git diff --check FETCH_HEAD HEAD -- codex-rs/exec-server/src/client.rs codex-rs/exec-server/src/rpc.rs passed. I could not run rustfmt locally because the installed Rust toolchain is missing the rustfmt component.

Copy link
Copy Markdown

@usgenes39-dotcom usgenes39-dotcom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Second-pass review on current head ac38595ed814802019b0584eb0e168074aab24cb.

I still do not see actionable issues. I re-reviewed the reconnect/cache path and the surrounding callers for fresh process starts, fs calls, and remote HTTP streams. The current implementation keeps the intended boundaries: old session/stream handles stay bound to the old connection and fail through the existing disconnect paths, while later get() calls for websocket-backed environments replace a disconnected cached client under a single async connect gate. The added transport-watch predicate covers the stale-client window from the earlier review comment.

Verification checked:

  • local rustfmt --edition 2024 --check codex-rs/exec-server/src/client.rs codex-rs/exec-server/src/rpc.rs
  • local git diff --check for the PR files
  • GitHub Actions for this head are green, including rust-ci, Bazel, ci, cargo-deny, Codespell, sdk, and blob-size-policy; the Bazel run includes Linux/macOS/Windows test and clippy jobs.

Copy link
Copy Markdown

@usgenes39-dotcom usgenes39-dotcom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Business/product-logic pass on current head ac38595ed814802019b0584eb0e168074aab24cb.

I do not see actionable issues.

Checked areas:

  • Economic / billing / quota model: this PR does not touch one. I checked the diff and relevant exec-server call chain for billing, quota, credit, price, cost, usage, metering, rate-limit, charge, subscription, and token-accounting concepts; none are present here.
  • New work after websocket disconnect: LazyRemoteExecServerClient::get() now double-checks the cached client around a single async connect gate and treats either the high-level disconnect latch or the lower transport watch as disconnected before reusing a websocket-backed client.
  • Old process sessions: existing RemoteExecProcess instances keep their old Session / ExecServerClient, so this does not silently rebind old process handles onto a new server session. That matches the stated no-resume/no-retry scope.
  • Fresh process starts: RemoteProcess::start() fetches through the lazy client before register_session and exec, so starts after an observed websocket close get the replacement session; start-time races still fail through the existing disconnect path rather than replaying.
  • Remote filesystem and HTTP calls: filesystem calls and the HttpClient adapter also enter through get(), so later fresh calls move to the replacement websocket client while in-flight HTTP body streams remain tied to their original connection and fail via existing stream failure handling.
  • Stdio transport: disconnected stdio clients are still returned instead of reconnected, preserving the prior one-shot command-backed behavior.
  • Resource/lifecycle boundary: replacing the cached websocket client only drops the cache's strong reference. Any active sessions/streams keep their own old client handles and get drained/failed by that old client's disconnect logic.
  • Retry/backoff semantics: failed reconnect attempts do not poison the cache; later callers can try again. There is intentionally no request replay, resume session id, or backoff policy added in this slice.

Verification:

  • rustfmt --edition 2024 --check codex-rs/exec-server/src/client.rs codex-rs/exec-server/src/rpc.rs
  • git diff --check for the PR files
  • GitHub Actions for this head are green, including rust-ci, Bazel, ci, cargo-deny, Codespell, sdk, and blob-size-policy; the Bazel workflow includes Linux/macOS/Windows test and clippy jobs.

@etraut-openai
Copy link
Copy Markdown
Collaborator

@usgenes39-dotcom, please stop spamming our PRs. That's not helpful.

Copy link
Copy Markdown
Collaborator

@jif-oai jif-oai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks way better to me

@starr-openai starr-openai merged commit de80fa6 into main May 21, 2026
45 of 47 checks passed
@starr-openai starr-openai deleted the starr/exec-reconnect-min-cand-a-rework branch May 21, 2026 16:43
@github-actions github-actions Bot locked and limited conversation to collaborators May 21, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants