feat(relay): surface connection loss in the workspace header #623
Open
tlongwell-block wants to merge 2 commits into
tlongwell-block pushed a commit that referenced this pull request on May 20, 2026
Max's review of #623 caught a correctness bug.

When the relay rejects AUTH (kind:22242 OK=false), handleOk calls resetConnection(err, { reconnect: false }), which sets state to 'disconnected' and clears the reconnect timer. But the reconnect timer's catch handler (`void this.ensureConnected().catch(() => this.scheduleReconnect())`) and the retry wrappers in publishEvent / sendRawWithReconnectRetry would then immediately race the disconnected state back to 'reconnecting'. The same hazard applies to any future call that goes through ensureConnected().

Fix: add a sticky 'terminal' latch.
- Set in resetConnection when reconnect:false.
- Guards scheduleReconnect() (no new timers) and ensureConnected() (throws a terminal error).
- Cleared on explicit re-engagement: disconnect() (workspace switch) and preconnect() (caller is asking us to come back up).

While here:
- Extract the reconnect/refuse predicates to a pure relayReconnectPolicy module so the state-machine rules live in one legible place. This also addresses Max's elegance note about distributing edge cases across connect/reset/scheduleReconnect.
- Fix the stale 'two consecutive misses' doc comment: the implementation stalls on a single missed probe (or send failure) within the STALL_PROBE_TIMEOUT_MS window.

8 new unit tests cover the policy: baseline schedules, terminal wins over every other reason, pending timer suppresses, live socket suppresses, no-subs+no-keepalive idles, keep-alive alone is enough, and shouldRefuseConnect mirrors terminal.
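The policy rules listed in the commit message above can be sketched as two pure predicates. This is only an illustration under stated assumptions: `shouldRefuseConnect` is named in the commit message, but `shouldScheduleReconnect`, the `ReconnectPolicyInput` shape, and all of its field names are hypothetical and may not match the actual relayReconnectPolicy module.

```typescript
// Hypothetical input shape; the field names are assumptions, not the PR's types.
interface ReconnectPolicyInput {
  terminal: boolean; // sticky latch, set by resetConnection(err, { reconnect: false })
  hasPendingTimer: boolean; // a reconnect timer is already scheduled
  hasLiveSocket: boolean; // a socket is already open or opening
  activeSubscriptionCount: number; // subscriptions that still want the relay
  keepAlive: boolean; // caller asked to stay connected even with no subscriptions
}

// Hypothetical name. Decides whether a new reconnect timer should be armed.
function shouldScheduleReconnect(input: ReconnectPolicyInput): boolean {
  // The terminal latch wins over every other reason.
  if (input.terminal) return false;
  // A pending timer or a live socket means a reconnect would be redundant.
  if (input.hasPendingTimer) return false;
  if (input.hasLiveSocket) return false;
  // With no subscriptions and no keep-alive the client idles instead.
  return input.activeSubscriptionCount > 0 || input.keepAlive;
}

// Named in the commit message; assumed here to mirror the terminal latch only.
function shouldRefuseConnect(input: Pick<ReconnectPolicyInput, "terminal">): boolean {
  return input.terminal;
}
```

Keeping both predicates free of timers and sockets is what makes the eight listed cases checkable as plain input/output pairs.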
Adds an observable `ConnectionState` to the relay singleton and a status indicator next to the workspace name so users can tell when the relay is unreachable. Addresses Alec's bug report about Warp's half-connected state silently breaking Sprout.

The half-open case (Warp orange icon, asleep VPN, etc.) is the interesting one: tungstenite auto-pongs and the OS keeps the TCP socket open, so the WS layer reports nothing wrong. We add an app-level liveness probe, a periodic NIP-01 REQ with a filter that matches nothing real, and treat a missing EOSE within ~30s as a stalled socket. When stalled, the watchdog tears down the WS so the existing reconnect path runs.

UI: when state is `reconnecting`, `stalled`, or `disconnected` (after a 2s debounce to avoid flashing on brief blips), the 🌱 emoji is replaced with a pulsing red WifiOff icon, the workspace name turns red, and a tooltip explains what's happening.

Split out of relayClientSession.ts:
- relayConnectionStateEmitter.ts: pure observable
- relayStallWatchdog.ts: probe + onStall callback

Tests: 16 new unit tests covering the emitter, watchdog probe shape, EOSE handling, timeout/send-failure stalls, stop() cancellation, and idempotent start().

Refs Sprout bug: relay connection loss warning

Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com>
Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
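As a rough illustration of the "pure observable" described in the commit message above, the emitter could look something like the sketch below. The state names and `isRelayConnectionDegraded` come from this PR; the class name, method names, the `idle` initial state, and the assumption that `clear()` drops all listeners are hypothetical and may differ from relayConnectionStateEmitter.ts.

```typescript
type ConnectionState =
  | "idle"
  | "connecting"
  | "connected"
  | "reconnecting"
  | "stalled"
  | "disconnected";

type ConnectionStateListener = (state: ConnectionState) => void;

// The three states that the UI treats as a warning.
function isRelayConnectionDegraded(state: ConnectionState): boolean {
  return state === "reconnecting" || state === "stalled" || state === "disconnected";
}

// Hypothetical class and method names; a sketch, not the PR's implementation.
class ConnectionStateEmitter {
  private state: ConnectionState = "idle"; // assumed initial state
  private readonly listeners = new Set<ConnectionStateListener>();

  get(): ConnectionState {
    return this.state;
  }

  set(next: ConnectionState): void {
    // No-op when unchanged, so listeners never see duplicate states.
    if (next === this.state) return;
    this.state = next;
    // Copy first so a listener may unsubscribe during notification.
    Array.from(this.listeners).forEach((listener) => this.notify(listener));
  }

  subscribe(listener: ConnectionStateListener): () => void {
    this.listeners.add(listener);
    // Replay the current state so late subscribers start in sync.
    this.notify(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Assumption: clear() removes every listener without changing the state.
  clear(): void {
    this.listeners.clear();
  }

  private notify(listener: ConnectionStateListener): void {
    try {
      listener(this.state);
    } catch {
      // Isolate listener failures so one bad subscriber cannot break the rest.
    }
  }
}
```

Because the emitter holds no socket or timer, each behaviour named in the test list (replay on subscribe, no-op when unchanged, listener-throw isolation) can be exercised synchronously.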
5eda560 to 5ef8aff
Closes the bug Alec filed in #sprout-bugs: when Warp is in its half-connected state (orange icon), Sprout looks like it's working but silently does weird stuff because the WS layer reports nothing wrong.
What changed

RelayClient now exposes a connection state.
- `ConnectionState` enum: `idle | connecting | connected | reconnecting | stalled | disconnected`.
- `relayClient.getConnectionState()` + `relayClient.subscribeToConnectionState(listener)`.
- State transitions are driven from `connect()`, `resetConnection()`, `scheduleReconnect()`, and `disconnect()`.

Half-open WS detection.
Tungstenite auto-pongs and the OS keeps the TCP socket open in the Warp orange-icon case, so we can't rely on Close/Error frames. Instead, a lightweight watchdog sends a periodic NIP-01 `REQ` with a filter that matches nothing real and waits for the matching EOSE. If the EOSE doesn't arrive within `STALL_PROBE_TIMEOUT_MS` (10s) or the send itself fails, state flips to `stalled` and the socket is force-reset so the existing reconnect path runs. Probe cadence: 20s.

UI: status indicator next to the workspace name.
- When the connection is degraded (`reconnecting` / `stalled` / `disconnected`), after a 2s debounce to avoid flashing on brief blips: the 🌱 emoji is replaced with a pulsing red `WifiOff` icon, the workspace name turns red and pulses, and a tooltip explains what's happening (`Reconnecting to relay…` / `Connection lost — relay is not responding` / `Disconnected from relay`).

File split.
`relayClientSession.ts` was near its size cap; the new logic landed mostly in two new files for testability:
- `relayConnectionStateEmitter.ts`: tiny pub-sub for the state.
- `relayStallWatchdog.ts`: probe + onStall callback, no opinion on connection state.

Tests

16 new unit tests via `node:test`:
- `relayClientShared.test.mjs`: `isRelayConnectionDegraded` matrix.
- `relayConnectionStateEmitter.test.mjs`: initial state, replay on subscribe, no-op when unchanged, unsubscribe, listener-throw isolation, `clear()`.
- `relayStallWatchdog.test.mjs`: probe payload shape (kind 9999, limit 0, future `since`), EOSE clears in-flight, EOSE for non-probe subId returns false, timeout triggers `onStall`, send failure triggers `onStall`, `stop()` cancels pending stall, `start()` is idempotent.

All pass.

Verification

`pnpm typecheck`, `pnpm check`, `pnpm build` all green.

Tuning notes
- `STALL_PROBE_INTERVAL_MS = 20_000`, `STALL_PROBE_TIMEOUT_MS = 10_000`. Worst-case time-to-warn after Warp turns orange is ~30s, plus the 2s UI debounce.
- The probe filter (`{ kinds: [9999], limit: 0, since: <far future> }`) is shaped to match nothing real, so the relay only ever replies with EOSE, at zero throughput cost.
- The warning is debounced only when entering a degraded state; `connected` and `idle` clear it immediately.

Refs the bug thread in #sprout-bugs (root `ea21e878…`).
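The probe shape and timing budget from the tuning notes can be sketched as follows. The two `STALL_PROBE_*` constants and the filter fields (kind 9999, limit 0, future `since`) come from this PR; the function names, `UI_DEBOUNCE_MS`, and the ten-year offset used for "far future" are assumptions made for illustration. The message shapes `["REQ", <subId>, <filter>]` and `["EOSE", <subId>]` follow NIP-01.

```typescript
const STALL_PROBE_INTERVAL_MS = 20_000;
const STALL_PROBE_TIMEOUT_MS = 10_000;
const UI_DEBOUNCE_MS = 2_000; // hypothetical name for the 2s UI debounce

// Assumption: "far future" is modelled as roughly ten years from now.
const FAR_FUTURE_OFFSET_SECONDS = 10 * 365 * 24 * 60 * 60;

interface ProbeFilter {
  kinds: number[];
  limit: number;
  since: number; // unix seconds
}

type ProbeRequest = ["REQ", string, ProbeFilter];

// Hypothetical helper: builds a REQ that no stored event should match.
function buildProbeRequest(subId: string, nowMs: number): ProbeRequest {
  const nowSeconds = Math.floor(nowMs / 1000);
  return [
    "REQ",
    subId,
    { kinds: [9999], limit: 0, since: nowSeconds + FAR_FUTURE_OFFSET_SECONDS },
  ];
}

// Hypothetical helper: true only for the EOSE that answers our own probe.
function isProbeEose(message: unknown, probeSubId: string): boolean {
  return (
    Array.isArray(message) &&
    message.length >= 2 &&
    message[0] === "EOSE" &&
    message[1] === probeSubId
  );
}

// One full probe interval, plus the probe timeout, plus the UI debounce.
function worstCaseTimeToWarnMs(): number {
  return STALL_PROBE_INTERVAL_MS + STALL_PROBE_TIMEOUT_MS + UI_DEBOUNCE_MS;
}
```

Under these constants the budget works out to 20s + 10s + 2s = 32s, which matches the "~30s, plus the 2s UI debounce" figure in the notes.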