feat: realtime omni voice (Gemini 3.1 Flash Live + GPT Realtime 2) for desktop PTT by kodjima33 · Pull Request #7601 · BasedHardware/omi

kodjima33 · 2026-06-02T20:36:59Z

What

Replaces the floating-bar PTT cascade (Deepgram STT + cloud TTS) with one realtime omni model for voice I/O. Adds an Advanced → Voice Model picker (Auto / Gemini 3.1 Flash Live / GPT Realtime 2). Reasoning/agents/tools/text chat are unchanged — only the STT transport changes; the final transcript still flows through ChatProvider → AgentBridge (Claude + tools).

Architecture (keys server-side)

desktop → omi backend /v1/omni/relay (WS) → OpenAI/Gemini. Apple's WebSocket stacks can't hold a direct Gemini connection (Google frontend resets them), and provider keys shouldn't ship in the client — so the backend relays. Auto reads /v1/auto/model-pick (daily Artificial Analysis quality/speed score, cached server-side).

Backend (this PR)

routers/omni_relay.py — authenticated WS relay to OpenAI/Gemini realtime (keys server-side).
routers/auto_model.py — GET /v1/auto/model-pick, daily-cached Artificial Analysis quality/speed pick.

Desktop (this PR)

RealtimeOmni/ — settings (+Auto), backend-relay STT/TTS client, Auto selector, fake-voice test harness.
PushToTalkManager — PTT uses omni STT via the relay (buffers audio until connected).
Advanced Voice Model picker (no toggle — the picker is the control).
omni_test_turn automation action — headless fake-voice E2E test (debug bundles only).

Verified (fake-voice, headless, through the deployed dev relay)

✅ Gemini 3.1 Flash Live → connected, correct transcript.
✅ GPT Realtime 2 → connected, correct transcript.
✅ Real-voice PTT (OpenAI) confirmed in live sessions; Claude reasoning + chat replies intact.

⚠️ Before merge/release (merging `desktop/` auto-releases via Codemagic)

Deploy backend to prod — /v1/omni/relay + /v1/auto/model-pick are dev-only; prod returns 404. Desktop release must not precede this.
Add ARTIFICIALANALYSIS_API_KEY to backend secrets — without it Auto defaults to Gemini (no live benchmark).
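The Auto fallback described above could look roughly like the sketch below. It is illustrative only: the equal weighting, the normalized `(quality, speed)` score pairs, the `gptRealtime` identifier, and the `pick_provider` helper are all assumptions, not the scoring in routers/auto_model.py. Only the `geminiFlashLive` default comes from this PR.

```python
# Hypothetical sketch of a quality/speed-weighted pick with a Gemini fallback.
FALLBACK_PROVIDER = "geminiFlashLive"  # default named in this PR


def pick_provider(scores, quality_weight=0.5):
    """Return the provider with the best weighted score.

    `scores` maps provider id -> (quality, speed), both assumed to be
    normalized to the 0..1 range. An empty mapping models the case where
    no benchmark data is available (for example, the API key is missing).
    """
    if not scores:
        return FALLBACK_PROVIDER

    def combined(provider):
        quality, speed = scores[provider]
        return quality_weight * quality + (1.0 - quality_weight) * speed

    return max(scores, key=combined)
```

With no benchmark data the function returns the fallback; with data, raising `quality_weight` shifts the pick toward the higher-quality provider.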

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


greptile-apps · 2026-06-02T20:44:33Z

Greptile Summary

This PR replaces the floating-bar PTT pipeline (Deepgram STT + cloud TTS) with a single realtime omni model (Gemini 3.1 Flash Live or GPT Realtime 2), routing audio through a new authenticated backend WebSocket relay so provider keys never leave the server. An "Auto" mode picks the best provider daily using Artificial Analysis benchmark scores, with client- and server-side caching.

backend/routers/omni_relay.py — new authenticated WS relay that pipes raw frames between the desktop client and OpenAI/Gemini, resolving keys server-side.
backend/routers/auto_model.py — new GET /v1/auto/model-pick endpoint that caches a daily quality/speed-weighted provider pick from the Artificial Analysis API.
desktop/…/RealtimeOmni/ — Swift RealtimeOmniService (WS client), AutoModelSelector (daily AA fetch), settings store, test harness, and PTT integration in PushToTalkManager.

Confidence Score: 3/5

Not safe to merge as-is: the audio buffering bugs in the Swift client mean the user's first words are silently dropped on every PTT turn, and the URL injection in the relay lets an authenticated client add arbitrary query parameters to an upstream request made with the server's OpenAI key.

The relay has a URL injection point where an authenticated user can append arbitrary query parameters to the upstream OpenAI request using the server's API key. Two independent audio-ordering bugs in the Swift layer mean the pre-connection audio buffer is always flushed before the WebSocket task exists (so the audio is dropped) and sendAudio() never gates on session readiness (so audio can arrive at the provider before the session setup handshake completes). The concurrent cache-miss race in the auto-model endpoint is a third correctness issue on a fresh-deploy path.

backend/routers/omni_relay.py (URL injection), backend/routers/auto_model.py (concurrency + auth), desktop/.../RealtimeOmniService.swift (audio-before-setup), and desktop/.../PushToTalkManager.swift (buffer flushed before start).

Security Review

URL injection (backend/routers/omni_relay.py, line 44): The client-supplied model query parameter is interpolated directly into the OpenAI upstream WebSocket URL via Python .format(), without URL encoding. A character like & in the value appends extra query parameters to the request made with the server's OPENAI_API_KEY. Mitigation: apply urllib.parse.quote(model) before substitution.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/routers/omni_relay.py | New authenticated WS relay to OpenAI/Gemini; contains URL injection via unsanitized client-supplied `model` parameter passed directly into the upstream URL string. |
| backend/routers/auto_model.py | New daily-cached provider picker; has a concurrent-refresh race condition (no async lock) and no authentication on the endpoint. |
| desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniService.swift | Core WS client for Gemini/OpenAI; `sendAudio()` never gates on `isOpen` or enqueues to `pendingAudio`, so audio sent during the TCP/TLS handshake arrives at the provider before the session setup message. |
| desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift | PTT integration with omni STT; pre-connect audio buffer is flushed before `service.start()` is called, so buffered audio is silently dropped because the WebSocket task is still nil at flush time. |
| desktop/Desktop/Sources/RealtimeOmni/AutoModelSelector.swift | Client-side daily refresh of Artificial Analysis scores; tolerant parser, graceful fallback to Gemini, and `applyServerPick` hook for backend override. Looks correct. |
| desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniSettings.swift | Provider enum and settings store; straightforward UserDefaults persistence with auto-resolve for `.auto` via AutoModelSelector. |
| desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniTestHarness.swift | Headless fake-voice E2E harness; uses `CheckedContinuation` correctly with a done-guard against double resume. Inherits the `pendingAudio` ordering issue from `RealtimeOmniService`. |
| desktop/Desktop/Sources/MainWindow/Pages/SettingsPage.swift | Adds Voice Model picker in Advanced settings; straightforward SwiftUI binding to `@AppStorage`, calls `refreshIfStale()` when Auto is selected. |
| desktop/Desktop/Sources/DesktopAutomationBridge.swift | Adds `omni_test_turn` automation action for headless E2E testing; reads PCM file from disk and runs through `RealtimeOmniTestHarness`. Debug-bundle only per description. |
| backend/main.py | Registers the two new routers; no issues. |
| desktop/Desktop/Sources/OmiApp.swift | Adds `AutoModelSelector.shared.refreshIfStale()` call at app launch; straightforward and correct. |

Sequence Diagram

sequenceDiagram
    participant Desktop as Desktop (PTT)
    participant Relay as omi backend /v1/omni/relay
    participant Provider as OpenAI / Gemini

    Desktop->>Relay: "WS connect (Authorization: Bearer token) ?provider=openai|gemini&model=..."
    Note over Relay: Authenticates via get_current_user_uid_ws_listen
    Relay->>Provider: WS connect (API key server-side)
    Desktop->>Relay: session setup JSON
    Relay->>Provider: forward session setup
    Provider-->>Relay: session.updated / setupComplete
    Relay-->>Desktop: forward response

    loop PTT turn (mic open)
        Desktop->>Relay: input_audio_buffer.append / realtimeInput
        Relay->>Provider: forward audio
        Provider-->>Relay: inputTranscription delta
        Relay-->>Desktop: forward transcript
    end

    Desktop->>Relay: commitInputTurn / activityEnd
    Relay->>Provider: forward commit
    Provider-->>Relay: transcription completed / turnComplete
    Relay-->>Desktop: final transcript
    Desktop->>Desktop: sendTranscript() to ChatProvider to Claude + tools
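
The two-way forwarding in the diagram can be approximated with the sketch below. It is not the code in omni_relay.py: in-memory `asyncio.Queue` objects stand in for the client and provider WebSockets, `None` is an assumed end-of-stream marker, and `pump`, `relay`, and `demo` are hypothetical names.

```python
import asyncio


async def pump(src: asyncio.Queue, dst: asyncio.Queue) -> None:
    """Forward frames from src to dst unchanged until the None sentinel."""
    while True:
        frame = await src.get()
        await dst.put(frame)
        if frame is None:  # assumed end-of-stream marker
            return


async def relay(client_in, upstream_out, upstream_in, client_out) -> None:
    """Run both directions concurrently, as a relay between two sockets would."""
    await asyncio.gather(
        pump(client_in, upstream_out),   # desktop -> provider
        pump(upstream_in, client_out),   # provider -> desktop
    )


def drain(queue: asyncio.Queue) -> list:
    """Collect every queued frame except the sentinel."""
    frames = []
    while not queue.empty():
        frame = queue.get_nowait()
        if frame is not None:
            frames.append(frame)
    return frames


async def demo():
    client_in, upstream_out, upstream_in, client_out = (
        asyncio.Queue() for _ in range(4)
    )
    for frame in ("session setup", "audio chunk", "commit", None):
        client_in.put_nowait(frame)
    for frame in ("setupComplete", "final transcript", None):
        upstream_in.put_nowait(frame)
    await relay(client_in, upstream_out, upstream_in, client_out)
    return drain(upstream_out), drain(client_out)
```

Running `asyncio.run(demo())` returns the frames seen by the provider and by the desktop, each in the order they were sent.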

Comments Outside Diff (3)

desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift, lines 375-378

Pre-connect buffer flushed before service.start() — all audio silently dropped

The buffer flush calls service.sendAudio() before service.start() is invoked. At flush time task is still nil inside RealtimeOmniService, so every send(json:) call resolves to task?.send(...) which is a no-op. The audio captured during auth fetch is discarded. The PR's stated goal — "user's first words aren't clipped" — requires moving service.start() before the flush loop so that task exists when audio is queued.

desktop/Desktop/Sources/RealtimeOmni/RealtimeOmniService.swift, lines 903-911

sendAudio() bypasses pendingAudio buffer — audio arrives at provider before session setup

sendAudio() calls send(json:) → task?.send(...) immediately, without checking isOpen. After start() / t.resume(), URLSessionWebSocketTask buffers any messages sent before the connection handshake and delivers them to the server in FIFO order. Because didOpenWithProtocol (where sendSessionSetup() is called) fires only after the connection is established, any audio queued during the TCP/TLS handshake window will arrive at OpenAI or Gemini before the session configuration message, triggering a protocol error. The pendingAudio property and its drain in markReady() already exist for this purpose but are never populated — sendAudio() should enqueue to pendingAudio when !isOpen instead of calling send() directly.
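
The ordering this comment asks for can be modeled in a few lines. The sketch below is a Python stand-in for the Swift client, not the actual `RealtimeOmniService`; the class name, the `transport_send` callback, and the tuple frames are assumptions made for illustration.

```python
class OmniClientSketch:
    """Queues audio until the session is ready, then drains it in order."""

    def __init__(self, transport_send):
        self._send = transport_send   # hypothetical callable that writes one frame
        self.is_ready = False
        self.pending_audio = []

    def on_open(self):
        # The connection is established: the setup message goes out first.
        self._send(("setup", b""))

    def send_audio(self, chunk: bytes):
        if not self.is_ready:
            # Not ready yet: hold the chunk instead of writing it to the socket.
            self.pending_audio.append(chunk)
            return
        self._send(("audio", chunk))

    def mark_ready(self):
        # The provider confirmed the session: flush buffered audio, oldest first.
        self.is_ready = True
        for chunk in self.pending_audio:
            self._send(("audio", chunk))
        self.pending_audio.clear()
```

Audio captured before the session is ready is kept and reaches the wire only after the setup frame, which is the behavior both outside-diff audio comments are after.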

desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift, lines 389-398

resolveOmniKey is dead code

This function is defined but never called anywhere in the diff or the surrounding code — the relay path handles key resolution server-side and RealtimeOmniService takes an authHeader, not a provider API key. It can be removed to avoid confusion.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "feat(desktop): fall back to Deepgram if ..."

greptile-apps · 2026-06-02T20:44:37Z

+        key = os.getenv("OPENAI_API_KEY")
+        if not key:
+            return None, "OPENAI_API_KEY not configured"
+        url = OPENAI_URL.format(model=model or "gpt-realtime-2")


URL injection via unencoded model query param

The model value from the client is interpolated into the OpenAI WebSocket URL with a plain Python .format() call, so any & or = in the string appends extra query parameters to the upstream request. A valid authenticated user could pass model=gpt-realtime-2&api-version=preview, and the injected fragment is forwarded verbatim to api.openai.com. While the immediate blast radius is limited to OpenAI's own parameter parsing, this is an injection point on a server-managed API key. The fix is to URL-encode the model value before substitution using urllib.parse.quote.
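
A minimal sketch of the suggested mitigation, assuming the upstream template looks like the `OPENAI_URL` below (the real constant in omni_relay.py is not shown in this thread, and `build_openai_url` is a hypothetical helper name):

```python
from urllib.parse import parse_qsl, quote, urlsplit

# Assumed shape of the upstream template; the real constant may differ.
OPENAI_URL = "wss://api.openai.com/v1/realtime?model={model}"


def build_openai_url(model):
    """Percent-encode the client-supplied model before substituting it."""
    # safe="" also encodes "/", so "&", "=", "?" and "#" cannot escape the value.
    encoded = quote(model or "gpt-realtime-2", safe="")
    return OPENAI_URL.format(model=encoded)


def query_params(url):
    """Parse the query string back into (name, value) pairs for inspection."""
    return parse_qsl(urlsplit(url).query)
```

With encoding in place, the payload from the comment stays inside the single `model` value instead of becoming a second parameter. Validating `model` against an allowlist of known model names would be a stricter alternative.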

greptile-apps · 2026-06-02T20:44:38Z

+TTL_SECONDS = 24 * 3600
+AA_URL = "https://artificialanalysis.ai/api/v2/data/llms/models"
+
+_cache = {"provider": None, "ts": 0.0, "detail": {}}


Concurrent cache misses trigger multiple AA fetches

The stale-cache check is not guarded by a lock, so N concurrent requests arriving when the cache is expired will all evaluate the condition as true and each launch a separate _fetch_and_score() coroutine. This can fire many HTTP calls to the Artificial Analysis API in a short window, which may exhaust rate limits or quota. An asyncio.Lock acquired around the check-and-refresh block fixes this.

Suggested change

-_cache = {"provider": None, "ts": 0.0, "detail": {}}
+_cache = {"provider": None, "ts": 0.0, "detail": {}}
+_cache_lock = asyncio.Lock()

greptile-apps · 2026-06-02T20:44:39Z

+    now = time.time()
+    if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
+        try:
+            provider, detail = await _fetch_and_score()
+            _cache.update(provider=provider, ts=now, detail=detail)
+        except Exception as e:
+            logger.error(f"auto model-pick fetch failed: {e}")
+            if _cache["provider"] is None:
+                _cache.update(provider="geminiFlashLive", ts=now, detail={"reason": f"error: {e}"})


Use the lock here to serialize concurrent refresh attempts.

Suggested change

-    now = time.time()
-    if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
-        try:
-            provider, detail = await _fetch_and_score()
-            _cache.update(provider=provider, ts=now, detail=detail)
-        except Exception as e:
-            logger.error(f"auto model-pick fetch failed: {e}")
-            if _cache["provider"] is None:
-                _cache.update(provider="geminiFlashLive", ts=now, detail={"reason": f"error: {e}"})
+    now = time.time()
+    if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
+        async with _cache_lock:
+            now = time.time()  # re-read after acquiring
+            if _cache["provider"] is None or (now - _cache["ts"]) > TTL_SECONDS:
+                try:
+                    provider, detail = await _fetch_and_score()
+                    _cache.update(provider=provider, ts=now, detail=detail)
+                except Exception as e:
+                    logger.error(f"auto model-pick fetch failed: {e}")
+                    if _cache["provider"] is None:
+                        _cache.update(provider="geminiFlashLive", ts=now, detail={"reason": f"error: {e}"})
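
The effect of the check, lock, re-check pattern can be seen in a self-contained sketch. The `PickCache` class and the fake fetch below are hypothetical stand-ins (the real `_fetch_and_score()` calls the Artificial Analysis API); the lock is created inside the running event loop so the example does not depend on a specific Python version.

```python
import asyncio
import time

TTL_SECONDS = 24 * 3600


class PickCache:
    """Hypothetical stand-in for the module-level cache plus its lock."""

    def __init__(self):
        self.provider = None
        self.ts = 0.0
        self.fetch_calls = 0
        self.lock = asyncio.Lock()

    def is_stale(self, now):
        return self.provider is None or (now - self.ts) > TTL_SECONDS

    async def _fake_fetch_and_score(self):
        self.fetch_calls += 1
        await asyncio.sleep(0.01)  # simulate the upstream HTTP round trip
        return "geminiFlashLive"

    async def get_pick(self):
        if self.is_stale(time.time()):            # cheap check without the lock
            async with self.lock:
                now = time.time()                 # re-read after acquiring
                if self.is_stale(now):            # another task may have refreshed
                    self.provider = await self._fake_fetch_and_score()
                    self.ts = now
        return self.provider


async def demo():
    cache = PickCache()
    results = await asyncio.gather(*(cache.get_pick() for _ in range(10)))
    return results, cache.fetch_calls
```

Ten concurrent callers all see a stale cache, but only the first one to take the lock fetches; the rest find a fresh value on the re-check and skip the upstream call.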

greptile-apps · 2026-06-02T20:44:42Z

+@router.get("/v1/auto/model-pick")
+async def auto_model_pick():


No authentication on /v1/auto/model-pick

The endpoint is unauthenticated and publicly reachable. It exposes the cached ARTIFICIALANALYSIS_API_KEY-derived pick (and the detail dict including raw scores), and any caller can force a cache-bust by hitting it repeatedly once the TTL expires. Adding the standard uid: str = Depends(get_current_user_uid) dependency would restrict it to signed-in users consistent with the rest of the API surface.


kodjima33 and others added 12 commits June 2, 2026 16:21

feat(backend): omni realtime WS relay (Gemini/OpenAI, keys server-side)

190603a

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(backend): /v1/auto/model-pick daily Artificial Analysis quality/…

e316dec

…speed selector Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(backend): register omni_relay + auto_model routers

0ca5027

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(desktop): realtime omni provider settings + Auto

b6ad500

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(desktop): Auto model selector (reads backend /v1/auto/model-pick)

368523e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(desktop): realtime omni STT/TTS client via backend relay

ef39574

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(desktop): fake-voice omni harness (debug automation action)

e2a403b

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(desktop): PTT uses omni STT (replaces Deepgram) via backend relay

b5846a2

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(desktop): Advanced Voice Model picker (Auto/Gemini/OpenAI)

261e3ba

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(desktop): refresh Auto model pick at launch

7db99de

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

test(desktop): omni_test_turn automation action for fake-voice E2E

4ff5e11

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(desktop): fall back to Deepgram if omni relay is unreachable (no…

6dc4c18

…n-breaking release) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

greptile-apps Bot reviewed Jun 2, 2026


kodjima33 and others added 4 commits June 2, 2026 16:55

fix(backend): Auto proxy slugs to real AA models (gemini-3-5-flash, g…

dcb33ea

…pt-5) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(desktop): Auto reads backend /v1/auto/model-pick (AA key stays s…

52fa6d7

…erver-side) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(backend): URL-encode omni relay model param (greptile P1 injection)

6d2e7e8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(backend): lock auto-pick refresh + require auth (greptile P1/P2)

8a1c2a7

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

kodjima33 merged commit 38dcbb5 into main Jun 2, 2026
2 checks passed

kodjima33 deleted the nik/realtime-voice-demos branch June 2, 2026 21:07

kodjima33 merged 16 commits into main from nik/realtime-voice-demos
