feat: realtime omni voice (Gemini 3.1 Flash Live + GPT Realtime 2) for desktop PTT #7601
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…speed selector Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n-breaking release) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Greptile Summary
This PR replaces the floating-bar PTT pipeline (Deepgram STT + cloud TTS) with a single realtime omni model (Gemini 3.1 Flash Live or GPT Realtime 2), routing audio through a new authenticated backend WebSocket relay so provider keys never leave the server. An "Auto" mode picks the best provider daily using Artificial Analysis benchmark scores, with client- and server-side caching.
Confidence Score: 3/5
Not safe to merge as-is: the audio buffering bugs in the Swift client mean the user's first words are silently dropped on every PTT turn, and the URL injection in the relay puts the server's OpenAI key into a controllable URL parameter. The relay has a URL injection point where an authenticated user can append arbitrary query parameters to the upstream OpenAI request using the server's API key. Two independent audio-ordering bugs in the Swift layer mean the pre-connection audio buffer is always flushed before the WebSocket task exists (so the audio is dropped) and
| Filename | Overview |
|---|---|
| backend/routers/omni_relay.py | New authenticated WS relay to OpenAI/Gemini; contains URL injection via unsanitized client-supplied model parameter passed directly into the upstream URL string. |
| backend/routers/auto_model.py | New daily-cached provider picker; has a concurrent-refresh race condition (no async lock) and no authentication on the endpoint. |
| desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniService.swift | Core WS client for Gemini/OpenAI; sendAudio() never gates on isOpen or enqueues to pendingAudio, so audio sent during the TCP/TLS handshake arrives at the provider before the session setup message. |
| desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift | PTT integration with omni STT; pre-connect audio buffer is flushed before service.start() is called, so buffered audio is silently dropped because the WebSocket task is still nil at flush time. |
| desktop/Desktop/Sources/RealtimeOmni/AutoModelSelector.swift | Client-side daily refresh of Artificial Analysis scores; tolerant parser, graceful fallback to Gemini, and applyServerPick hook for backend override. Looks correct. |
| desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniSettings.swift | Provider enum and settings store; straightforward UserDefaults persistence with auto-resolve for .auto via AutoModelSelector. |
| desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniTestHarness.swift | Headless fake-voice E2E harness; uses CheckedContinuation correctly with a done-guard against double resume. Inherits the pendingAudio ordering issue from RealtimeOmniService. |
| desktop/Desktop/Sources/MainWindow/Pages/SettingsPage.swift | Adds Voice Model picker in Advanced settings; straightforward SwiftUI binding to @AppStorage, calls refreshIfStale() when Auto is selected. |
| desktop/Desktop/Sources/DesktopAutomationBridge.swift | Adds omni_test_turn automation action for headless E2E testing; reads PCM file from disk and runs through RealtimeOmniTestHarness. Debug-bundle only per description. |
| backend/main.py | Registers the two new routers; no issues. |
| desktop/Desktop/Sources/OmiApp.swift | Adds AutoModelSelector.shared.refreshIfStale() call at app launch; straightforward and correct. |
Sequence Diagram
sequenceDiagram
participant Desktop as Desktop (PTT)
participant Relay as omi backend /v1/omni/relay
participant Provider as OpenAI / Gemini
Desktop->>Relay: "WS connect (Authorization: Bearer token) ?provider=openai|gemini&model=..."
Note over Relay: Authenticates via get_current_user_uid_ws_listen
Relay->>Provider: WS connect (API key server-side)
Desktop->>Relay: session setup JSON
Relay->>Provider: forward session setup
Provider-->>Relay: session.updated / setupComplete
Relay-->>Desktop: forward response
loop PTT turn (mic open)
Desktop->>Relay: input_audio_buffer.append / realtimeInput
Relay->>Provider: forward audio
Provider-->>Relay: inputTranscription delta
Relay-->>Desktop: forward transcript
end
Desktop->>Relay: commitInputTurn / activityEnd
Relay->>Provider: forward commit
Provider-->>Relay: transcription completed / turnComplete
Relay-->>Desktop: final transcript
Desktop->>Desktop: sendTranscript() to ChatProvider to Claude + tools
Comments Outside Diff (3)
- desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift, line 375-378
Pre-connect buffer flushed before service.start(): all audio silently dropped
The buffer flush calls service.sendAudio() before service.start() is invoked. At flush time task is still nil inside RealtimeOmniService, so every send(json:) call resolves to task?.send(...), which is a no-op. The audio captured during auth fetch is discarded. The PR's stated goal ("user's first words aren't clipped") requires moving service.start() before the flush loop so that task exists when audio is queued.
- desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniService.swift, line 903-911
sendAudio() bypasses pendingAudio buffer: audio arrives at provider before session setup
sendAudio() calls send(json:) → task?.send(...) immediately, without checking isOpen. After start() / t.resume(), URLSessionWebSocketTask buffers any messages sent before the connection handshake and delivers them to the server in FIFO order. Because didOpenWithProtocol (where sendSessionSetup() is called) fires only after the connection is established, any audio queued during the TCP/TLS handshake window will arrive at OpenAI or Gemini before the session configuration message, triggering a protocol error. The pendingAudio property and its drain in markReady() already exist for this purpose but are never populated; sendAudio() should enqueue to pendingAudio when !isOpen instead of calling send() directly.
- desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift, line 389-398
This function is defined but never called anywhere in the diff or the surrounding code: the relay path handles key resolution server-side and RealtimeOmniService takes an authHeader, not a provider API key. It can be removed to avoid confusion.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
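The ordering requirement behind the first two comments can be modelled in a few lines. The following is a minimal Python sketch of the queue-until-ready idea, not the Swift implementation: the class name, the `transport_send` callable, and the message tuples are all hypothetical, and setup sending and readiness are merged into one `mark_ready()` step for brevity.

```python
from collections import deque


class OmniSessionModel:
    """Toy model of the send path: audio waits until session setup has gone out."""

    def __init__(self, transport_send):
        self._send = transport_send   # hypothetical: writes one message to the socket
        self._ready = False           # plays the role of isOpen
        self._pending = deque()       # plays the role of pendingAudio

    def send_audio(self, chunk: bytes) -> None:
        if not self._ready:
            # Socket not ready yet: keep the chunk instead of sending or dropping it.
            self._pending.append(chunk)
            return
        self._send(("audio", chunk))

    def mark_ready(self) -> None:
        """Call once the socket is open: send setup first, then drain in order."""
        self._send(("setup", b""))
        self._ready = True
        while self._pending:
            self._send(("audio", self._pending.popleft()))


wire = []
session = OmniSessionModel(wire.append)
session.send_audio(b"first words")   # captured during the handshake window
session.mark_ready()
session.send_audio(b"rest")
print([kind for kind, _ in wire])
```

In this model the early chunk is neither dropped nor sent ahead of the setup message, which is the behaviour the two comments ask for.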
Reviews (1): Last reviewed commit: "feat(desktop): fall back to Deepgram if ..."
    key = os.getenv("OPENAI_API_KEY")
    if not key:
        return None, "OPENAI_API_KEY not configured"
    url = OPENAI_URL.format(model=model or "gpt-realtime-2")
URL injection via unencoded model query param
The model value from the client is interpolated into the OpenAI WebSocket URL with a plain Python .format() call, so any & or = in the string appends extra query parameters to the upstream request. A valid authenticated user could pass model=gpt-realtime-2&api-version=preview, and the injected fragment is forwarded verbatim to api.openai.com. While the immediate blast radius is limited to OpenAI's own parameter parsing, this is an injection point on a server-managed API key. The fix is to URL-encode the model value before substitution using urllib.parse.quote.
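A minimal sketch of the suggested fix follows. The `OPENAI_URL` template shown here is an assumption (the real constant is not visible in the review), and `build_openai_url` is a hypothetical helper name; only `urllib.parse.quote` is the standard-library call the comment refers to.

```python
from typing import Optional
from urllib.parse import quote

# Assumption: the relay's template looks roughly like this.
OPENAI_URL = "wss://api.openai.com/v1/realtime?model={model}"
DEFAULT_MODEL = "gpt-realtime-2"


def build_openai_url(model: Optional[str]) -> str:
    """Return the upstream URL with the client-supplied model percent-encoded."""
    # safe="" also encodes "/", so nothing in the value can act as a URL delimiter.
    return OPENAI_URL.format(model=quote(model or DEFAULT_MODEL, safe=""))


print(build_openai_url("gpt-realtime-2&api-version=preview"))
```

With encoding in place, `&` and `=` become `%26` and `%3D`, so the whole string is treated as a single model value. Validating the value against an allow-list of known model names would be a stricter alternative.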
    TTL_SECONDS = 24 * 3600
    AA_URL = "https://artificialanalysis.ai/api/v2/data/llms/models"


    _cache = {"provider": None, "ts": 0.0, "detail": {}}
Concurrent cache misses trigger multiple AA fetches
The stale-cache check is not guarded by a lock, so N concurrent requests arriving when the cache is expired will all evaluate the condition as true and each launch a separate _fetch_and_score() coroutine. This can fire many HTTP calls to the Artificial Analysis API in a short window, which may exhaust rate limits or quota. An asyncio.Lock acquired around the check-and-refresh block fixes this.
    - _cache = {"provider": None, "ts": 0.0, "detail": {}}
    + _cache = {"provider": None, "ts": 0.0, "detail": {}}
    + _cache_lock = asyncio.Lock()
    now = time.time()
    if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
        try:
            provider, detail = await _fetch_and_score()
            _cache.update(provider=provider, ts=now, detail=detail)
        except Exception as e:
            logger.error(f"auto model-pick fetch failed: {e}")
            if _cache["provider"] is None:
                _cache.update(provider="geminiFlashLive", ts=now, detail={"reason": f"error: {e}"})
Use the lock here to serialize concurrent refresh attempts.
    - now = time.time()
    - if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
    -     try:
    -         provider, detail = await _fetch_and_score()
    -         _cache.update(provider=provider, ts=now, detail=detail)
    -     except Exception as e:
    -         logger.error(f"auto model-pick fetch failed: {e}")
    -         if _cache["provider"] is None:
    -             _cache.update(provider="geminiFlashLive", ts=now, detail={"reason": f"error: {e}"})
    + now = time.time()
    + if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
    +     async with _cache_lock:
    +         now = time.time()  # re-read after acquiring
    +         if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
    +             try:
    +                 provider, detail = await _fetch_and_score()
    +                 _cache.update(provider=provider, ts=now, detail=detail)
    +             except Exception as e:
    +                 logger.error(f"auto model-pick fetch failed: {e}")
    +                 if _cache["provider"] is None:
    +                     _cache.update(provider="geminiFlashLive", ts=now, detail={"reason": f"error: {e}"})
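To check that the double-checked lock above does collapse concurrent cache misses into one upstream call, a self-contained sketch can count fetches under load. Everything here is illustrative: `PickCache`, `count_fetches`, and the stub fetch are hypothetical names, the fetch is a stand-in for the real Artificial Analysis request, and error handling is omitted for brevity.

```python
import asyncio
import time

TTL_SECONDS = 24 * 3600


class PickCache:
    """Daily cache whose refresh is serialized by an asyncio.Lock."""

    def __init__(self, fetch):
        self._fetch = fetch            # coroutine function returning (provider, detail)
        self._lock = asyncio.Lock()    # created while the event loop is running
        self.provider = None
        self.ts = 0.0
        self.detail = {}

    def _stale(self, now: float) -> bool:
        return self.provider is None or (now - self.ts) > TTL_SECONDS

    async def get(self) -> str:
        if self._stale(time.time()):       # fast path skips the lock when fresh
            async with self._lock:
                now = time.time()
                if self._stale(now):       # re-check: a peer may have refreshed
                    self.provider, self.detail = await self._fetch()
                    self.ts = now
        return self.provider


async def count_fetches(callers: int = 20):
    calls = 0

    async def fake_fetch():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)          # simulate network latency
        return "geminiFlashLive", {"reason": "stub"}

    cache = PickCache(fake_fetch)
    picks = await asyncio.gather(*(cache.get() for _ in range(callers)))
    return calls, picks


calls, picks = asyncio.run(count_fetches())
print(calls, len(picks))
```

The first caller refreshes while the others wait on the lock; each waiter then re-checks staleness and returns the cached value without fetching again.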
    @router.get("/v1/auto/model-pick")
    async def auto_model_pick():
No authentication on /v1/auto/model-pick
The endpoint is unauthenticated and publicly reachable. It exposes the cached ARTIFICIALANALYSIS_API_KEY-derived pick (and the detail dict including raw scores), and any caller can force a cache-bust by hitting it repeatedly once the TTL expires. Adding the standard uid: str = Depends(get_current_user_uid) dependency would restrict it to signed-in users consistent with the rest of the API surface.
…pt-5) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erver-side) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Replaces the floating-bar PTT cascade (Deepgram STT + cloud TTS) with one realtime omni model for voice I/O. Adds an Advanced → Voice Model picker (Auto / Gemini 3.1 Flash Live / GPT Realtime 2). Reasoning/agents/tools/text chat are unchanged: only the STT transport changes; the final transcript still flows through ChatProvider → AgentBridge (Claude + tools).

Architecture (keys server-side)
desktop → omi backend /v1/omni/relay (WS) → OpenAI/Gemini. Apple's WebSocket stacks can't hold a direct Gemini connection (Google frontend resets them), and provider keys shouldn't ship in the client, so the backend relays. Auto reads /v1/auto/model-pick (daily Artificial Analysis quality/speed score, cached server-side).

Backend (this PR)
- routers/omni_relay.py: authenticated WS relay to OpenAI/Gemini realtime (keys server-side).
- routers/auto_model.py: GET /v1/auto/model-pick, daily-cached Artificial Analysis quality/speed pick.

Desktop (this PR)
- RealtimeOmni/: settings (+Auto), backend-relay STT/TTS client, Auto selector, fake-voice test harness.
- PushToTalkManager: PTT uses omni STT via the relay (buffers audio until connected).
- omni_test_turn automation action: headless fake-voice E2E test (debug bundles only).

Verified (fake-voice, headless, through the deployed dev relay)
desktop/auto-releases via Codemagic)
/v1/omni/relay + /v1/auto/model-pick are dev-only; prod returns 404. Desktop release must not precede this.
ARTIFICIALANALYSIS_API_KEY to backend secrets; without it Auto defaults to Gemini (no live benchmark).

🤖 Generated with Claude Code