
feat: realtime omni voice (Gemini 3.1 Flash Live + GPT Realtime 2) for desktop PTT #7601

Merged
kodjima33 merged 16 commits into main from nik/realtime-voice-demos
Jun 2, 2026

Conversation

@kodjima33
Collaborator

What

Replaces the floating-bar PTT cascade (Deepgram STT + cloud TTS) with one realtime omni model for voice I/O. Adds an Advanced → Voice Model picker (Auto / Gemini 3.1 Flash Live / GPT Realtime 2). Reasoning/agents/tools/text chat are unchanged — only the STT transport changes; the final transcript still flows through ChatProviderAgentBridge (Claude + tools).

Architecture (keys server-side)

desktop → omi backend /v1/omni/relay (WS) → OpenAI/Gemini. Apple's WebSocket stacks can't hold a direct Gemini connection (Google frontend resets them), and provider keys shouldn't ship in the client — so the backend relays. Auto reads /v1/auto/model-pick (daily Artificial Analysis quality/speed score, cached server-side).
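The relay idea can be sketched in a few lines of Python. This is a minimal illustration, not the code in routers/omni_relay.py: `QueueEndpoint`, `pump`, and `relay` are hypothetical names, and an in-memory queue stands in for each WebSocket. A real relay would also authenticate the client and attach the provider key when opening the upstream connection.

```python
import asyncio


class QueueEndpoint:
    """In-memory stand-in for a WebSocket (hypothetical, for illustration only)."""

    def __init__(self, inbound):
        self._inbound = asyncio.Queue()
        for frame in inbound:
            self._inbound.put_nowait(frame)
        self._inbound.put_nowait(None)  # None marks "connection closed"
        self.sent = []  # frames written to this endpoint

    async def recv(self):
        return await self._inbound.get()

    async def send(self, frame):
        self.sent.append(frame)


async def pump(src, dst):
    """Forward frames from src to dst, unmodified, until src closes."""
    while True:
        frame = await src.recv()
        if frame is None:
            return
        await dst.send(frame)


async def relay(client, upstream):
    """Run both directions concurrently and stop once either side closes."""
    tasks = [
        asyncio.create_task(pump(client, upstream)),
        asyncio.create_task(pump(upstream, client)),
    ]
    _done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)


async def demo():
    # Endpoints are created inside the running loop.
    client = QueueEndpoint(["session setup", "audio chunk"])
    upstream = QueueEndpoint(["setupComplete", "transcript"])
    await relay(client, upstream)
    return client.sent, upstream.sent
```

Because the relay only forwards frames, the desktop client still speaks each provider's native protocol; the backend's job is limited to authentication, key injection, and connection lifetime.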

Backend (this PR)

  • routers/omni_relay.py — authenticated WS relay to OpenAI/Gemini realtime (keys server-side).
  • routers/auto_model.py: GET /v1/auto/model-pick, daily-cached Artificial Analysis quality/speed pick.

Desktop (this PR)

  • RealtimeOmni/ — settings (+Auto), backend-relay STT/TTS client, Auto selector, fake-voice test harness.
  • PushToTalkManager — PTT uses omni STT via the relay (buffers audio until connected).
  • Advanced Voice Model picker (no toggle — the picker is the control).
  • omni_test_turn automation action — headless fake-voice E2E test (debug bundles only).

Verified (fake-voice, headless, through the deployed dev relay)

  • ✅ Gemini 3.1 Flash Live → connected, correct transcript.
  • ✅ GPT Realtime 2 → connected, correct transcript.
  • ✅ Real-voice PTT (OpenAI) confirmed in live sessions; Claude reasoning + chat replies intact.

⚠️ Before merge/release (merging desktop/ auto-releases via Codemagic)

  1. Deploy backend to prod: /v1/omni/relay and /v1/auto/model-pick are dev-only, and prod returns 404. The desktop release must not precede this.
  2. Add ARTIFICIALANALYSIS_API_KEY to backend secrets — without it Auto defaults to Gemini (no live benchmark).

🤖 Generated with Claude Code

kodjima33 and others added 12 commits June 2, 2026 16:21
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…speed selector

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n-breaking release)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@greptile-apps
Contributor

greptile-apps Bot commented Jun 2, 2026

Greptile Summary

This PR replaces the floating-bar PTT pipeline (Deepgram STT + cloud TTS) with a single realtime omni model (Gemini 3.1 Flash Live or GPT Realtime 2), routing audio through a new authenticated backend WebSocket relay so provider keys never leave the server. An "Auto" mode picks the best provider daily using Artificial Analysis benchmark scores, with client- and server-side caching.

  • backend/routers/omni_relay.py — new authenticated WS relay that pipes raw frames between the desktop client and OpenAI/Gemini, resolving keys server-side.
  • backend/routers/auto_model.py — new GET /v1/auto/model-pick endpoint that caches a daily quality/speed-weighted provider pick from the Artificial Analysis API.
  • desktop/…/RealtimeOmni/ — Swift RealtimeOmniService (WS client), AutoModelSelector (daily AA fetch), settings store, test harness, and PTT integration in PushToTalkManager.

Confidence Score: 3/5

Not safe to merge as-is — the audio buffering bugs in the Swift client mean the user's first words are silently dropped on every PTT turn, and the URL injection in the relay puts the server's OpenAI key into a controllable URL parameter.

The relay has a URL injection point where an authenticated user can append arbitrary query parameters to the upstream OpenAI request using the server's API key. Two independent audio-ordering bugs in the Swift layer mean the pre-connection audio buffer is always flushed before the WebSocket task exists (so the audio is dropped) and sendAudio() never gates on session readiness (so audio can arrive at the provider before the session setup handshake completes). The concurrent cache-miss race in the auto-model endpoint is a third correctness issue on a fresh-deploy path.

backend/routers/omni_relay.py (URL injection), backend/routers/auto_model.py (concurrency + auth), desktop/.../RealtimeOmniService.swift (audio-before-setup), and desktop/.../PushToTalkManager.swift (buffer flushed before start).

Security Review

  • URL injection (backend/routers/omni_relay.py, line 44): The client-supplied model query parameter is interpolated directly into the OpenAI upstream WebSocket URL via Python .format(), without URL encoding. A character like & in the value appends extra query parameters to the request made with the server's OPENAI_API_KEY. Mitigation: apply urllib.parse.quote(model) before substitution.

Important Files Changed

Filename Overview
backend/routers/omni_relay.py New authenticated WS relay to OpenAI/Gemini; contains URL injection via unsanitized client-supplied model parameter passed directly into the upstream URL string.
backend/routers/auto_model.py New daily-cached provider picker; has a concurrent-refresh race condition (no async lock) and no authentication on the endpoint.
desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniService.swift Core WS client for Gemini/OpenAI; sendAudio() never gates on isOpen or enqueues to pendingAudio, so audio sent during the TCP/TLS handshake arrives at the provider before the session setup message.
desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift PTT integration with omni STT; pre-connect audio buffer is flushed before service.start() is called, so buffered audio is silently dropped because the WebSocket task is still nil at flush time.
desktop/Desktop/Sources/RealtimeOmni/AutoModelSelector.swift Client-side daily refresh of Artificial Analysis scores; tolerant parser, graceful fallback to Gemini, and applyServerPick hook for backend override. Looks correct.
desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniSettings.swift Provider enum and settings store; straightforward UserDefaults persistence with auto-resolve for .auto via AutoModelSelector.
desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniTestHarness.swift Headless fake-voice E2E harness; uses CheckedContinuation correctly with a done-guard against double resume. Inherits the pendingAudio ordering issue from RealtimeOmniService.
desktop/Desktop/Sources/MainWindow/Pages/SettingsPage.swift Adds Voice Model picker in Advanced settings; straightforward SwiftUI binding to @AppStorage, calls refreshIfStale() when Auto is selected.
desktop/Desktop/Sources/DesktopAutomationBridge.swift Adds omni_test_turn automation action for headless E2E testing; reads PCM file from disk and runs through RealtimeOmniTestHarness. Debug-bundle only per description.
backend/main.py Registers the two new routers; no issues.
desktop/Desktop/Sources/OmiApp.swift Adds AutoModelSelector.shared.refreshIfStale() call at app launch; straightforward and correct.

Sequence Diagram

sequenceDiagram
    participant Desktop as Desktop (PTT)
    participant Relay as omi backend /v1/omni/relay
    participant Provider as OpenAI / Gemini

    Desktop->>Relay: "WS connect (Authorization: Bearer token) ?provider=openai|gemini&model=..."
    Note over Relay: Authenticates via get_current_user_uid_ws_listen
    Relay->>Provider: WS connect (API key server-side)
    Desktop->>Relay: session setup JSON
    Relay->>Provider: forward session setup
    Provider-->>Relay: session.updated / setupComplete
    Relay-->>Desktop: forward response

    loop PTT turn (mic open)
        Desktop->>Relay: input_audio_buffer.append / realtimeInput
        Relay->>Provider: forward audio
        Provider-->>Relay: inputTranscription delta
        Relay-->>Desktop: forward transcript
    end

    Desktop->>Relay: commitInputTurn / activityEnd
    Relay->>Provider: forward commit
    Provider-->>Relay: transcription completed / turnComplete
    Relay-->>Desktop: final transcript
    Desktop->>Desktop: sendTranscript() to ChatProvider to Claude + tools

Comments Outside Diff (3)

  1. desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift, line 375-378 (link)

    P1 Pre-connect buffer flushed before service.start() — all audio silently dropped

    The buffer flush calls service.sendAudio() before service.start() is invoked. At flush time task is still nil inside RealtimeOmniService, so every send(json:) call resolves to task?.send(...) which is a no-op. The audio captured during auth fetch is discarded. The PR's stated goal — "user's first words aren't clipped" — requires moving service.start() before the flush loop so that task exists when audio is queued.

  2. desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniService.swift, line 903-911 (link)

    P1 sendAudio() bypasses pendingAudio buffer — audio arrives at provider before session setup

    sendAudio() calls send(json:), which forwards to task?.send(...) immediately, without checking isOpen. After start() / t.resume(), URLSessionWebSocketTask buffers any messages sent before the connection handshake and delivers them to the server in FIFO order. Because didOpenWithProtocol (where sendSessionSetup() is called) fires only after the connection is established, any audio queued during the TCP/TLS handshake window will arrive at OpenAI or Gemini before the session configuration message, triggering a protocol error. The pendingAudio property and its drain in markReady() already exist for this purpose but are never populated; sendAudio() should enqueue to pendingAudio when !isOpen instead of calling send() directly.

  3. desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift, line 389-398 (link)

    P2 resolveOmniKey is dead code

    This function is defined but never called anywhere in the diff or the surrounding code — the relay path handles key resolution server-side and RealtimeOmniService takes an authHeader, not a provider API key. It can be removed to avoid confusion.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
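The ordering rule behind items 1 and 2 can be modelled in a few lines. The sketch below is written in Python for brevity and is not the Swift RealtimeOmniService: `OmniSessionModel`, `did_open`, and the tuple-based frames are hypothetical stand-ins. It assumes the fix is to hold audio in a pending buffer until setup has been sent and the session is marked ready.

```python
from collections import deque


class OmniSessionModel:
    """Python model of the audio-ordering rule (hypothetical, not the Swift class)."""

    def __init__(self, transport_send):
        self._send = transport_send      # callable that puts a frame on the wire
        self._is_open = False            # becomes True only after setup is sent
        self._pending_audio = deque()    # audio captured before the session is ready

    def send_audio(self, chunk):
        # Gate on readiness: buffer instead of writing to the socket early.
        if not self._is_open:
            self._pending_audio.append(chunk)
            return
        self._send(("audio", chunk))

    def did_open(self):
        # Connection established: the setup message must be the first frame.
        self._send(("setup", None))

    def mark_ready(self):
        # Session is configured: drain buffered audio in capture order.
        self._is_open = True
        while self._pending_audio:
            self._send(("audio", self._pending_audio.popleft()))


wire = []
session = OmniSessionModel(wire.append)
session.send_audio(b"first words")   # captured during auth fetch / handshake
session.did_open()
session.send_audio(b"more words")    # still before the session is ready
session.mark_ready()
session.send_audio(b"live audio")
```

With this gating the caller's ordering no longer matters: audio passed in before the connection opens is kept rather than dropped, and the setup frame always reaches the wire first.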

Reviews (1): Last reviewed commit: "feat(desktop): fall back to Deepgram if ..." | Re-trigger Greptile

Comment thread backend/routers/omni_relay.py Outdated
    key = os.getenv("OPENAI_API_KEY")
    if not key:
        return None, "OPENAI_API_KEY not configured"
    url = OPENAI_URL.format(model=model or "gpt-realtime-2")

P1 security URL injection via unencoded model query param

The model value from the client is interpolated into the OpenAI WebSocket URL with a plain Python .format() call, so any & or = in the string appends extra query parameters to the upstream request. A valid authenticated user could pass model=gpt-realtime-2&api-version=preview, and the injected fragment is forwarded verbatim to api.openai.com. While the immediate blast radius is limited to OpenAI's own parameter parsing, this is an injection point on a server-managed API key. The fix is to URL-encode the model value before substitution using urllib.parse.quote.
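A minimal sketch of the suggested mitigation, using only the standard library. The exact value of `OPENAI_URL` is an assumption (the PR only shows `OPENAI_URL.format(model=...)`), and `build_openai_url` is a hypothetical helper name.

```python
from urllib.parse import quote

# Assumed template shape; the real constant lives in backend/routers/omni_relay.py.
OPENAI_URL = "wss://api.openai.com/v1/realtime?model={model}"


def build_openai_url(model):
    """Percent-encode the client-supplied model before substituting it."""
    # safe="" also encodes "/" so the value cannot alter the path or query.
    return OPENAI_URL.format(model=quote(model or "gpt-realtime-2", safe=""))


print(build_openai_url("gpt-realtime-2&api-version=preview"))
# wss://api.openai.com/v1/realtime?model=gpt-realtime-2%26api-version%3Dpreview
```

Encoding keeps the injected text inside the model value. Checking the value against a short allowlist of known model names would be a stricter alternative, at the cost of a code change whenever a model is added.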

TTL_SECONDS = 24 * 3600
AA_URL = "https://artificialanalysis.ai/api/v2/data/llms/models"

_cache = {"provider": None, "ts": 0.0, "detail": {}}

P1 Concurrent cache misses trigger multiple AA fetches

The stale-cache check is not guarded by a lock, so N concurrent requests arriving when the cache is expired will all evaluate the condition as true and each launch a separate _fetch_and_score() coroutine. This can fire many HTTP calls to the Artificial Analysis API in a short window, which may exhaust rate limits or quota. An asyncio.Lock acquired around the check-and-refresh block fixes this.

Suggested change
- _cache = {"provider": None, "ts": 0.0, "detail": {}}
+ _cache = {"provider": None, "ts": 0.0, "detail": {}}
+ _cache_lock = asyncio.Lock()

Comment thread backend/routers/auto_model.py Outdated
Comment on lines +78 to +86
    now = time.time()
    if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
        try:
            provider, detail = await _fetch_and_score()
            _cache.update(provider=provider, ts=now, detail=detail)
        except Exception as e:
            logger.error(f"auto model-pick fetch failed: {e}")
            if _cache["provider"] is None:
                _cache.update(provider="geminiFlashLive", ts=now, detail={"reason": f"error: {e}"})

P1 Use the lock here to serialize concurrent refresh attempts.

Suggested change
-     now = time.time()
-     if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
-         try:
-             provider, detail = await _fetch_and_score()
-             _cache.update(provider=provider, ts=now, detail=detail)
-         except Exception as e:
-             logger.error(f"auto model-pick fetch failed: {e}")
-             if _cache["provider"] is None:
-                 _cache.update(provider="geminiFlashLive", ts=now, detail={"reason": f"error: {e}"})
+     now = time.time()
+     if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
+         async with _cache_lock:
+             now = time.time()  # re-read after acquiring
+             if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
+                 try:
+                     provider, detail = await _fetch_and_score()
+                     _cache.update(provider=provider, ts=now, detail=detail)
+                 except Exception as e:
+                     logger.error(f"auto model-pick fetch failed: {e}")
+                     if _cache["provider"] is None:
+                         _cache.update(provider="geminiFlashLive", ts=now, detail={"reason": f"error: {e}"})
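The effect of the lock can be checked with a small self-contained demo. This is a sketch under assumptions, not the endpoint itself: `PickCache` is a hypothetical wrapper, the fetch is replaced by a short sleep, and the fetch counter exists only so the demo can observe how many upstream calls were made.

```python
import asyncio
import time

TTL_SECONDS = 24 * 3600


class PickCache:
    """Double-checked refresh guarded by an asyncio.Lock (illustrative only)."""

    def __init__(self):
        self.provider = None
        self.ts = 0.0
        self.fetch_calls = 0          # instrumentation for the demo
        self._lock = asyncio.Lock()

    def _is_stale(self, now):
        return self.provider is None or (now - self.ts) > TTL_SECONDS

    async def _fetch_and_score(self):
        # Stand-in for the Artificial Analysis HTTP call.
        self.fetch_calls += 1
        await asyncio.sleep(0.01)
        return "geminiFlashLive"

    async def get_pick(self):
        if self._is_stale(time.time()):          # cheap check without the lock
            async with self._lock:
                now = time.time()
                if self._is_stale(now):          # re-check: another task may have refreshed
                    self.provider = await self._fetch_and_score()
                    self.ts = now
        return self.provider


async def demo():
    cache = PickCache()  # created inside the running loop
    picks = await asyncio.gather(*(cache.get_pick() for _ in range(10)))
    return picks, cache.fetch_calls
```

All ten concurrent callers see a stale cache on the first check, but only the first to take the lock fetches; the rest find a fresh value on the second check and return it.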

Comment thread backend/routers/auto_model.py Outdated
Comment on lines +75 to +76
@router.get("/v1/auto/model-pick")
async def auto_model_pick():

P2 No authentication on /v1/auto/model-pick

The endpoint is unauthenticated and publicly reachable. It exposes the cached ARTIFICIALANALYSIS_API_KEY-derived pick (and the detail dict including raw scores), and any caller can force a cache-bust by hitting it repeatedly once the TTL expires. Adding the standard uid: str = Depends(get_current_user_uid) dependency would restrict it to signed-in users consistent with the rest of the API surface.

kodjima33 and others added 4 commits June 2, 2026 16:55
…pt-5)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erver-side)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@kodjima33 kodjima33 merged commit 38dcbb5 into main Jun 2, 2026
2 checks passed
@kodjima33 kodjima33 deleted the nik/realtime-voice-demos branch June 2, 2026 21:07