feat(relay): surface connection loss in the workspace header #623

Open
tlongwell-block wants to merge 2 commits into main from dawn/relay-connection-status

@tlongwell-block
Collaborator

Closes the bug Alec filed in #sprout-bugs: when Warp is in its half-connected state (orange icon), Sprout looks like it's working but silently does weird stuff because the WS layer reports nothing wrong.

What changed

RelayClient now exposes a connection state.

  • New ConnectionState enum: idle | connecting | connected | reconnecting | stalled | disconnected.
  • relayClient.getConnectionState() + relayClient.subscribeToConnectionState(listener).
  • State transitions wired into connect(), resetConnection(), scheduleReconnect(), and disconnect().
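
For readers who have not opened the diff, the state surface can be pictured roughly as below. This is a minimal sketch, not the code in the PR: it uses a string union where the description says "enum", and the body of isRelayConnectionDegraded is inferred from the degraded states listed under the UI section.

```typescript
// Sketch only: state names come from the PR description, the bodies are assumed.
type ConnectionState =
  | "idle"
  | "connecting"
  | "connected"
  | "reconnecting"
  | "stalled"
  | "disconnected";

// The three states the description treats as degraded in the header.
const DEGRADED_STATES: ReadonlySet<ConnectionState> = new Set<ConnectionState>([
  "reconnecting",
  "stalled",
  "disconnected",
]);

function isRelayConnectionDegraded(state: ConnectionState): boolean {
  return DEGRADED_STATES.has(state);
}
```

Keeping the predicate separate from the client means the header component can depend on one pure function rather than on the session object.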

Half-open WS detection.
Tungstenite auto-pongs and the OS keeps the TCP socket open in the Warp orange-icon case, so we can't rely on Close/Error frames. Instead a lightweight watchdog sends a periodic NIP-01 REQ with a filter that matches nothing real and waits for the matching EOSE. If the EOSE doesn't arrive within STALL_PROBE_TIMEOUT_MS (10s) or the send itself fails, state flips to stalled and the socket is force-reset so the existing reconnect path runs. Probe cadence: 20s.
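
The probe loop described above could look something like the following. Treat it as a hedged sketch rather than the contents of relayStallWatchdog.ts: the class name, the `send` callback (assumed to throw when the write fails), the sub-id format, and the ten-year "far future" offset are all assumptions made for illustration.

```typescript
// Sketch only: StallWatchdog, SendFrame and TEN_YEARS_S are hypothetical names.
const STALL_PROBE_INTERVAL_MS = 20_000;
const STALL_PROBE_TIMEOUT_MS = 10_000;
const TEN_YEARS_S = 10 * 365 * 24 * 60 * 60; // assumed meaning of "far future"

type SendFrame = (frame: string) => void; // assumed to throw if the write fails

class StallWatchdog {
  private readonly send: SendFrame;
  private readonly onStall: () => void;
  private interval: ReturnType<typeof setInterval> | null = null;
  private timeout: ReturnType<typeof setTimeout> | null = null;
  private inFlightSubId: string | null = null;
  private seq = 0;

  constructor(send: SendFrame, onStall: () => void) {
    this.send = send;
    this.onStall = onStall;
  }

  start(): void {
    if (this.interval !== null) return; // idempotent
    this.interval = setInterval(() => this.probe(), STALL_PROBE_INTERVAL_MS);
  }

  stop(): void {
    if (this.interval !== null) clearInterval(this.interval);
    this.interval = null;
    this.clearInFlight(); // also cancels a pending stall
  }

  // Sends one REQ that should match nothing, so the relay answers with EOSE only.
  probe(): void {
    if (this.inFlightSubId !== null) return; // previous probe still pending
    const subId = `stall-probe-${++this.seq}`;
    const since = Math.floor(Date.now() / 1000) + TEN_YEARS_S;
    this.inFlightSubId = subId;
    try {
      this.send(JSON.stringify(["REQ", subId, { kinds: [9999], limit: 0, since }]));
    } catch {
      this.stall(); // a failed send counts as a stall
      return;
    }
    this.timeout = setTimeout(() => this.stall(), STALL_PROBE_TIMEOUT_MS);
  }

  // Returns true only when the EOSE belongs to the in-flight probe.
  handleEose(subId: string): boolean {
    if (subId !== this.inFlightSubId) return false;
    this.clearInFlight();
    return true;
  }

  private clearInFlight(): void {
    if (this.timeout !== null) clearTimeout(this.timeout);
    this.timeout = null;
    this.inFlightSubId = null;
  }

  private stall(): void {
    this.clearInFlight();
    this.onStall();
  }
}
```

The watchdog only reports the stall; flipping the state to stalled and resetting the socket stay with the caller. A real implementation would probably also send a NIP-01 CLOSE for each probe subscription, which this sketch leaves out.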

UI: status indicator next to the workspace name.

  • Healthy: unchanged 🌱 + name.
  • Degraded (reconnecting / stalled / disconnected), after a 2s debounce to avoid flashing on brief blips: the 🌱 emoji is replaced with a pulsing red WifiOff icon, the workspace name turns red and pulses, and a tooltip explains what's happening (Reconnecting to relay… / Connection lost — relay is not responding / Disconnected from relay).
  • ARIA label on the trigger reflects the degraded state for screen readers.
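
The debounce rule is easy to get subtly wrong, so here is a framework-free sketch of it. DegradedIndicator and Schedule are hypothetical names, and the choice to leave the indicator untouched while the state is `connecting` is an assumption; the description only says that `connected` and `idle` clear the warning immediately.

```typescript
// Sketch only: DegradedIndicator and Schedule are hypothetical names.
type ConnectionState =
  | "idle" | "connecting" | "connected"
  | "reconnecting" | "stalled" | "disconnected";

const DEGRADED_DEBOUNCE_MS = 2_000; // the 2s debounce from the description

// Runs fn after ms and returns a cancel function (injectable for tests).
type Schedule = (fn: () => void, ms: number) => () => void;

const timeoutSchedule: Schedule = (fn, ms) => {
  const id = setTimeout(fn, ms);
  return () => clearTimeout(id);
};

class DegradedIndicator {
  private readonly onChange: (visible: boolean) => void;
  private readonly schedule: Schedule;
  private cancelPending: (() => void) | null = null;
  private visible = false;

  constructor(onChange: (visible: boolean) => void, schedule: Schedule = timeoutSchedule) {
    this.onChange = onChange;
    this.schedule = schedule;
  }

  update(state: ConnectionState): void {
    if (state === "reconnecting" || state === "stalled" || state === "disconnected") {
      // Arm the debounce once; further degraded states do not restart it.
      if (this.visible || this.cancelPending !== null) return;
      this.cancelPending = this.schedule(() => {
        this.cancelPending = null;
        this.setVisible(true);
      }, DEGRADED_DEBOUNCE_MS);
    } else if (state === "connected" || state === "idle") {
      // Healthy states cancel a pending warning and clear a visible one at once.
      if (this.cancelPending !== null) this.cancelPending();
      this.cancelPending = null;
      this.setVisible(false);
    }
    // "connecting" leaves the indicator as it is (an assumption of this sketch).
  }

  private setVisible(next: boolean): void {
    if (next === this.visible) return;
    this.visible = next;
    this.onChange(next);
  }
}
```

In a component, `onChange` would drive the icon swap and the red, pulsing name; the scheduler is injected only so the 2s wait can be faked in unit tests.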

File split. relayClientSession.ts was near its size cap; the new logic landed mostly in two new files for testability:

  • relayConnectionStateEmitter.ts — tiny pub-sub for the state.
  • relayStallWatchdog.ts — probe + onStall callback, no opinion on connection state.
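
As a rough picture of the emitter half of the split, the sketch below follows the behaviours named in the test list (replay on subscribe, no-op when unchanged, listener-throw isolation, clear()). The class and method names, the `idle` starting state, and the reading of clear() as "drop all listeners" are assumptions, not the actual contents of relayConnectionStateEmitter.ts.

```typescript
// Sketch only: ConnectionStateEmitter and its method names are hypothetical.
type ConnectionState =
  | "idle" | "connecting" | "connected"
  | "reconnecting" | "stalled" | "disconnected";

type Listener = (state: ConnectionState) => void;

class ConnectionStateEmitter {
  private state: ConnectionState = "idle"; // assumed initial state
  private readonly listeners = new Set<Listener>();

  get(): ConnectionState {
    return this.state;
  }

  set(next: ConnectionState): void {
    if (next === this.state) return; // no-op when unchanged
    this.state = next;
    // Copy first so listeners may unsubscribe while being notified.
    for (const listener of Array.from(this.listeners)) this.notify(listener);
  }

  // Replays the current state to the new listener and returns an unsubscribe.
  subscribe(listener: Listener): () => void {
    this.listeners.add(listener);
    this.notify(listener);
    return () => {
      this.listeners.delete(listener);
    };
  }

  // Assumed semantics: drop every listener, keep the current state.
  clear(): void {
    this.listeners.clear();
  }

  private notify(listener: Listener): void {
    try {
      listener(this.state);
    } catch {
      // A throwing listener must not break the others.
    }
  }
}
```

Replaying on subscribe lets a late-mounting header render the right state without a separate getConnectionState() call.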

Tests

16 new unit tests via node:test:

  • relayClientShared.test.mjs: isRelayConnectionDegraded matrix.
  • relayConnectionStateEmitter.test.mjs — initial state, replay on subscribe, no-op when unchanged, unsubscribe, listener-throw isolation, clear().
  • relayStallWatchdog.test.mjs — probe payload shape (kind 9999, limit 0, future since), EOSE clears in-flight, EOSE for non-probe subId returns false, timeout triggers onStall, send failure triggers onStall, stop() cancels pending stall, start() is idempotent.

All pass:

node --test --experimental-strip-types src/shared/api/*.test.mjs
ℹ tests 16  pass 16  fail 0

Verification

  • pnpm typecheck, pnpm check, pnpm build all green.
  • Local pre-push pipeline (desktop-check, desktop-build, desktop-tauri-check, web-build, mobile-check, mobile-test, rust-clippy, rust-tests) all green.

Tuning notes

  • STALL_PROBE_INTERVAL_MS = 20_000, STALL_PROBE_TIMEOUT_MS = 10_000. Worst-case time-to-warn after Warp turns orange is ~30s, plus the 2s UI debounce.
  • The probe filter ({ kinds: [9999], limit: 0, since: <far future> }) is shaped to match nothing real so the relay only ever replies with EOSE — zero throughput cost.
  • The 2s UI debounce means brief reconnect blips (initial AUTH, momentary network flap) never paint the warning. connected and idle clear it immediately.
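
The ~30s figure follows from adding the probe cadence to the probe timeout. A small worked sketch, assuming the link degrades just after a probe has been answered and the next probe is silently dropped (UI_DEBOUNCE_MS is a hypothetical name for the 2s header debounce):

```typescript
const STALL_PROBE_INTERVAL_MS = 20_000;
const STALL_PROBE_TIMEOUT_MS = 10_000;
const UI_DEBOUNCE_MS = 2_000; // hypothetical name for the 2s header debounce

// Worst case on the timeout path: the link degrades right after an EOSE,
// so a full interval passes before the next probe, which then times out.
const worstCaseStallMs = STALL_PROBE_INTERVAL_MS + STALL_PROBE_TIMEOUT_MS;

// Best case on the timeout path: the link degrades right before a probe is sent.
const bestCaseStallMs = STALL_PROBE_TIMEOUT_MS;

// The header warning appears one debounce period after the state flips.
const worstCaseWarnMs = worstCaseStallMs + UI_DEBOUNCE_MS;
const bestCaseWarnMs = bestCaseStallMs + UI_DEBOUNCE_MS;
```

Lowering either constant tightens the bound at the cost of more probe traffic or more false stalls on slow links.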

Refs the bug thread in #sprout-bugs (root ea21e878…).

tlongwell-block pushed a commit that referenced this pull request May 20, 2026
Max's review of #623 caught a correctness bug. When the relay rejects
AUTH (kind:22242 OK=false), handleOk calls
resetConnection(err, { reconnect: false }) which sets state to
'disconnected' and clears the reconnect timer. BUT — the reconnect
timer's catch handler (`void this.ensureConnected().catch(() =>
this.scheduleReconnect())`) and the retry wrappers in publishEvent /
sendRawWithReconnectRetry would then immediately race the disconnected
state back to 'reconnecting'. Same hazard from any future call that
goes through ensureConnected().

Fix: add a sticky 'terminal' latch.
- Set in resetConnection when reconnect:false.
- Guards scheduleReconnect() (no new timers) and ensureConnected()
  (throws a terminal error).
- Cleared on explicit re-engagement: disconnect() (workspace switch)
  and preconnect() (caller is asking us to come back up).
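
A stripped-down illustration of the latch, for reviewers who want the shape without the diff. LatchedSession is a hypothetical, synchronous stand-in: the real ensureConnected() is asynchronous, timers are modelled as a boolean, and the state after disconnect() is assumed to be idle.

```typescript
// Sketch only: a synchronous stand-in for the session; only the latch matters.
type LatchState = "idle" | "connected" | "reconnecting" | "disconnected";

class LatchedSession {
  private terminal = false;
  private reconnectPending = false; // stands in for the reconnect timer
  private state: LatchState = "idle";

  getState(): LatchState {
    return this.state;
  }

  hasPendingReconnect(): boolean {
    return this.reconnectPending;
  }

  resetConnection(options: { reconnect: boolean }): void {
    this.reconnectPending = false;
    if (!options.reconnect) {
      this.terminal = true; // sticky until disconnect() or preconnect()
      this.state = "disconnected";
      return;
    }
    this.scheduleReconnect();
  }

  scheduleReconnect(): void {
    if (this.terminal || this.reconnectPending) return; // latch: no new timers
    this.reconnectPending = true;
    this.state = "reconnecting";
  }

  ensureConnected(): void {
    if (this.terminal) throw new Error("relay connection is terminal");
    this.reconnectPending = false;
    this.state = "connected"; // stands in for opening the socket
  }

  // Workspace switch: explicit re-engagement clears the latch.
  disconnect(): void {
    this.terminal = false;
    this.reconnectPending = false;
    this.state = "idle"; // assumed resting state
  }

  // Caller asks the relay to come back up: clear the latch, then connect.
  preconnect(): void {
    this.terminal = false;
    this.ensureConnected();
  }
}
```

With the latch in place, a retry wrapper that calls ensureConnected() after an AUTH rejection gets an error instead of quietly moving the state back to reconnecting.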

While here:
- Extract the reconnect/refuse predicates to a pure relayReconnectPolicy
  module so the state-machine rules live in one legible place — also
  addresses Max's elegance note about distributing edge cases across
  connect/reset/scheduleReconnect.
- Fix the stale 'two consecutive misses' doc comment — implementation
  stalls on a single missed probe (or send failure) within the
  STALL_PROBE_TIMEOUT_MS window.

8 new unit tests cover the policy: baseline schedules, terminal wins
over every other reason, pending timer suppresses, live socket
suppresses, no-subs+no-keepalive idles, keep-alive alone is enough,
and shouldRefuseConnect mirrors terminal.
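
One plausible shape for the policy predicates, reconstructed from the test names above. The input field names and `shouldScheduleReconnect` are hypothetical; only `shouldRefuseConnect` and the relayReconnectPolicy module name appear in the commit message.

```typescript
// Sketch only: ReconnectInputs and shouldScheduleReconnect are hypothetical names.
interface ReconnectInputs {
  terminal: boolean;              // sticky latch set by reconnect:false
  reconnectTimerPending: boolean; // a reconnect is already scheduled
  socketLive: boolean;            // a socket is open or opening
  activeSubscriptions: number;
  keepAlive: boolean;
}

function shouldScheduleReconnect(input: ReconnectInputs): boolean {
  if (input.terminal) return false;              // terminal wins over everything
  if (input.reconnectTimerPending) return false; // do not stack timers
  if (input.socketLive) return false;            // nothing to reconnect
  // With no subscriptions and no keep-alive there is no reason to come back.
  return input.activeSubscriptions > 0 || input.keepAlive;
}

function shouldRefuseConnect(input: Pick<ReconnectInputs, "terminal">): boolean {
  return input.terminal;
}
```

Because both functions are pure, each rule in the list above maps to a one-line assertion with no socket or timer mocking.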
tlongwell-block and others added 2 commits May 19, 2026 22:47
Adds an observable `ConnectionState` to the relay singleton and a status
indicator next to the workspace name so users can tell when the relay is
unreachable. Addresses Alec's bug report about Warp's half-connected
state silently breaking Sprout.

The half-open case (Warp orange icon, asleep VPN, etc.) is the
interesting one: tungstenite auto-pongs and the OS keeps the TCP socket
open, so the WS layer reports nothing wrong. We add an app-level
liveness probe (a periodic NIP-01 REQ with a filter that matches
nothing real) and treat an EOSE that fails to arrive within the 10s
probe timeout as a stalled socket; with the 20s probe cadence the worst
case is ~30s. When stalled, the watchdog tears down the WS so the
existing reconnect path runs.

UI: when state is `reconnecting`, `stalled`, or `disconnected` (after a
2s debounce to avoid flashing on brief blips), the 🌱 emoji is replaced
with a pulsing red WifiOff icon, the workspace name turns red, and a
tooltip explains what's happening.

Split out of relayClientSession.ts:
  - relayConnectionStateEmitter.ts — pure observable
  - relayStallWatchdog.ts          — probe + onStall callback

Tests: 16 new unit tests covering the emitter, watchdog probe shape,
EOSE handling, timeout/send-failure stalls, stop() cancellation, and
idempotent start().

Refs Sprout bug: relay connection loss warning

Signed-off-by: Tyler Longwell <109685178+tlongwell-block@users.noreply.github.com>
Co-authored-by: Dawn (sprout agent) <c6237ef84fa537c78dcee78efd2d4e59f728859c7f194da42ac51ededfa0be05@sprout-oss.stage.blox.sqprod.co>
tlongwell-block force-pushed the dawn/relay-connection-status branch from 5eda560 to 5ef8aff on May 20, 2026 02:48