Reconnect disconnected exec-server websocket clients with fresh sessions#23867
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 42f2426c23
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
usgenes39-dotcom
left a comment
There was a problem hiding this comment.
Reviewed current head ac38595ed814802019b0584eb0e168074aab24cb.
No actionable findings from this pass. I re-checked the lazy remote exec-server cache/reconnect path against the current diff and the surrounding call chain (RemoteProcess, RemoteFileSystem, remote HTTP streams, and the lower JSON-RPC disconnect watch). The earlier stale-client race appears addressed by including RpcClient::is_disconnected() in ExecServerClient::is_disconnected(), and the semaphore-guarded cache replacement keeps concurrent post-disconnect get() callers sharing one fresh websocket session while preserving the intended one-shot behavior for stdio transports.
Local verification: git diff --check FETCH_HEAD HEAD -- codex-rs/exec-server/src/client.rs codex-rs/exec-server/src/rpc.rs passed. I could not run rustfmt locally because the installed Rust toolchain is missing the rustfmt component.
usgenes39-dotcom
left a comment
There was a problem hiding this comment.
Second-pass review on current head ac38595ed814802019b0584eb0e168074aab24cb.
I still do not see actionable issues. I re-reviewed the reconnect/cache path and the surrounding callers for fresh process starts, fs calls, and remote HTTP streams. The current implementation keeps the intended boundaries: old session/stream handles stay bound to the old connection and fail through the existing disconnect paths, while later get() calls for websocket-backed environments replace a disconnected cached client under a single async connect gate. The added transport-watch predicate covers the stale-client window from the earlier review comment.
Verification checked:
- local
rustfmt --edition 2024 --check codex-rs/exec-server/src/client.rs codex-rs/exec-server/src/rpc.rs - local
git diff --checkfor the PR files - GitHub Actions for this head are green, including
rust-ci,Bazel,ci,cargo-deny,Codespell,sdk, andblob-size-policy; the Bazel run includes Linux/macOS/Windows test and clippy jobs.
usgenes39-dotcom
left a comment
There was a problem hiding this comment.
Business/product-logic pass on current head ac38595ed814802019b0584eb0e168074aab24cb.
I do not see actionable issues.
Checked areas:
- Economic / billing / quota model: this PR does not touch one. I checked the diff and relevant exec-server call chain for billing, quota, credit, price, cost, usage, metering, rate-limit, charge, subscription, and token-accounting concepts; none are present here.
- New work after websocket disconnect:
LazyRemoteExecServerClient::get()now double-checks the cached client around a single async connect gate and treats either the high-level disconnect latch or the lower transport watch as disconnected before reusing a websocket-backed client. - Old process sessions: existing
RemoteExecProcessinstances keep their oldSession/ExecServerClient, so this does not silently rebind old process handles onto a new server session. That matches the stated no-resume/no-retry scope. - Fresh process starts:
RemoteProcess::start()fetches through the lazy client beforeregister_sessionandexec, so starts after an observed websocket close get the replacement session; start-time races still fail through the existing disconnect path rather than replaying. - Remote filesystem and HTTP calls: filesystem calls and the
HttpClientadapter also enter throughget(), so later fresh calls move to the replacement websocket client while in-flight HTTP body streams remain tied to their original connection and fail via existing stream failure handling. - Stdio transport: disconnected stdio clients are still returned instead of reconnected, preserving the prior one-shot command-backed behavior.
- Resource/lifecycle boundary: replacing the cached websocket client only drops the cache's strong reference. Any active sessions/streams keep their own old client handles and get drained/failed by that old client's disconnect logic.
- Retry/backoff semantics: failed reconnect attempts do not poison the cache; later callers can try again. There is intentionally no request replay, resume session id, or backoff policy added in this slice.
Verification:
rustfmt --edition 2024 --check codex-rs/exec-server/src/client.rs codex-rs/exec-server/src/rpc.rsgit diff --checkfor the PR files- GitHub Actions for this head are green, including rust-ci, Bazel, ci, cargo-deny, Codespell, sdk, and blob-size-policy; the Bazel workflow includes Linux/macOS/Windows test and clippy jobs.
|
@usgenes39-dotcom, please stop spamming our PRs. That's not helpful. |
Summary
get()Why
The prior PR direction was trying to grow into session restore: resume the old
session_id, preserve existing process handles, and add reconnect retry policy. That is more machinery than we want for this slice.For now, the useful minimum is simpler: later fresh remote operations should not be stuck behind a dead cached websocket client, but anything already attached to the dead connection should fail loudly through the existing disconnect path. The server already has detached-session cleanup via its existing TTL, so this PR does not need to add client-side session preservation.
What Changed
LazyRemoteExecServerClient::get()now keeps the current concrete client in a small mutex-protected cache plus one async connect lock.get()returns it.get()creates a brand-new websocket client with a brand-new exec-server session and replaces the cache.resume_session_id, backoff, request replay, or existingRemoteExecProcessrebinding is added here.get()calls after disconnect share one fresh replacement client/session.