making omi desktop faster#7583
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds end-to-end query tracing and evaluation tooling to monitor desktop query performance (tokens/TTFT), reduces unnecessary tool/screenshot overhead for floating-bar queries, and introduces prompt caching support in the Rust Anthropic bridge.
Changes:
- Introduces
QueryTracer(Swift) with JSONL logging, span tracking, TTFT, gap detection, and tool/request/response capture + unit tests. - Updates floating-bar/chat flow to use tracing, selectively capture screenshots, and auto-reset bloated ACP sessions.
- Updates Rust Anthropic request translation to emit an ephemeral-cached system block and propagates cached-token usage in OpenAI-style usage.
Reviewed changes
Copilot reviewed 17 out of 18 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| desktop/eval-results/eval.py | Adds a CLI summarizer/differ for traces.jsonl to track tokens/TTFT over time. |
| desktop/eval-results/baseline-*.jsonl | Adds baseline trace logs to compare regressions/improvements. |
| desktop/agent/src/omi-tools-stdio.ts | Expands execute_sql schema guidance (FTS + relationships). |
| desktop/Desktop/Tests/QueryTracerTests.swift | Adds unit tests for the new QueryTracer behavior. |
| desktop/Desktop/Sources/Services/QueryTracer.swift | Implements QueryTracer, JSONL logging, and request/tool/response capture. |
| desktop/Desktop/Sources/Providers/ChatProvider.swift | Threads tracer through bridge + streaming, captures tool executions, and resets floating session on token bloat. |
| desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift | Creates/propagates tracers for PTT flows and adds spans around transcription/cleanup/context capture. |
| desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift | Formatting-only refactors and routes floating queries through a QueryTracer task-local. |
| desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift | Adds tracing hooks for TTS playback start timing. |
| desktop/Desktop/Sources/Chat/ChatPrompts.swift | Adjusts tool-use guidance and removes {database_schema} expansion in favor of execute_sql docs. |
| desktop/Desktop/Sources/Chat/AgentBridge.swift | Adds quota-check spans to traces. |
| desktop/Backend-Rust/src/routes/chat_completions.rs | Emits system prompt as an ephemeral-cached content-block array; updates tests accordingly. |
| desktop/Backend-Rust/src/models/chat_completions.rs | Adds prompt_tokens_details.cached_tokens and changes AnthropicRequest.system to JSON Value. |
| .gitignore | Ignores .worktrees/. |
Comments suppressed due to low confidence (2)
desktop/Desktop/Sources/Chat/AgentBridge.swift:1
- When
fetchChatUsageQuota()returnsnil(network/server error), the trace currently recordsresult=allowed, which is misleading. Consider distinguishing outcomes:allowedwhen quota is present + allowed,exceededwhen present + not allowed, andunknown/errorwhen quota couldn’t be fetched.
import Foundation
desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift:1
- If
AVAudioPlayercreation/playback throws, thetts_startspan is never closed, which will skew timing (and may only get force-closed later asunclosed=true). Close the span in thecatchpath as well (ideally with metadata indicating the failure).
import AVFoundation
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| print(f"{i:>2} {g(r,'input_tokens',default=0):>7} {g(r,'output_tokens',default=0):>5} " | ||
| f"{int(ttft):>8} {tps:>5.0f} {g(r,'cache_read_tokens',default=0):>7} " | ||
| f"{cost:>7.4f} {('Y' if req.get('has_screenshot') else '-'):>4} " | ||
| f"{(str(int(shot_ms)) if shot_ms else '-'):>7} {(str(sysprompt_chars(r)) if sysprompt_chars(r) else '-'):>7} {q}") |
| recs = [] | ||
| if not Path(path).exists(): | ||
| return recs | ||
| for line in Path(path).read_text().splitlines(): |
| func end(_ name: String, metadata: [String: String]? = nil) { | ||
| lock.withLock { state in | ||
| let endInstant = ContinuousClock.now | ||
| // Search stack from top for matching name | ||
| guard let idx = state.spanStack.lastIndex(where: { $0.name == name }) else { return } | ||
|
|
||
| var span = state.spanStack.remove(at: idx) | ||
|
|
||
| // Merge metadata | ||
| if let meta = metadata { | ||
| if span.metadata == nil { | ||
| span.metadata = meta | ||
| } else { | ||
| for (k, v) in meta { | ||
| span.metadata?[k] = v | ||
| } | ||
| } | ||
| } | ||
|
|
||
| let startMs = (span.startInstant - state.origin).milliseconds | ||
| let endMs = (endInstant - state.origin).milliseconds | ||
| let built = BuiltSpan( | ||
| name: span.name, | ||
| start_ms: startMs, | ||
| end_ms: endMs, | ||
| dur_ms: endMs - startMs, | ||
| meta: span.metadata, | ||
| children: span.children.isEmpty ? nil : span.children | ||
| ) | ||
|
|
||
| if state.spanStack.isEmpty { | ||
| state.completedSpans.append(built) | ||
| } else { | ||
| state.spanStack[state.spanStack.count - 1].children.append(built) | ||
| } | ||
| } | ||
| } |
| /// Run `body` inside a QueryTracer task-local, reusing an existing tracer or creating a new one. | ||
| private func withQueryTracer(query: String, fromVoice: Bool, _ body: () async -> Void) async { | ||
| let tracer = | ||
| QueryTracerContext.current | ||
| ?? QueryTracer(query: query, inputMode: fromVoice ? .voicePTTBatch : .text) | ||
| await QueryTracerContext.$current.withValue(tracer) { | ||
| await body() | ||
| } | ||
| } |
| contextCaptureTask?.cancel() | ||
| contextCaptureTask = nil | ||
| activeTracer = nil | ||
| stopAudioTranscription() |
| /// Capture the full system prompt + message history sent to the API. | ||
| func captureRequest( | ||
| systemPrompt: String, | ||
| messages: [[String: String]], | ||
| hasScreenshot: Bool = false | ||
| ) { | ||
| lock.withLock { state in | ||
| state.systemPrompt = systemPrompt | ||
| state.messages = messages | ||
| state.hasScreenshot = hasScreenshot | ||
| } | ||
| } | ||
|
|
||
| /// Capture the final response text + finish reason. | ||
| func captureResponse(text: String, finishReason: String? = nil) { | ||
| lock.withLock { state in | ||
| state.responseText = text | ||
| state.finishReason = finishReason | ||
| } | ||
| } | ||
|
|
||
| /// Capture a tool call execution (name, input, output, duration). | ||
| func captureToolExecution(toolUseId: String?, name: String, input: String, output: String, durationMs: Int64? = nil) { | ||
| lock.withLock { state in | ||
| state.toolExecutions.append(TraceToolExecution( | ||
| tool_use_id: toolUseId, | ||
| name: name, | ||
| input: input, | ||
| output: output, | ||
| dur_ms: durationMs | ||
| )) | ||
| } | ||
| } |
| static func queryNeedsScreenshot(_ text: String) -> Bool { | ||
| // Tokenize: lowercase, split on non-alphanumerics (so "see?", "this." match as words). | ||
| let tokens = text.lowercased() | ||
| .replacingOccurrences(of: "[^a-z0-9]+", with: " ", options: .regularExpression) | ||
| .split(separator: " ").map(String.init) | ||
| let cues: Set<String> = [ | ||
| "screen", "screenshot", "see", "seeing", "look", "looking", "show", "showing", | ||
| "display", "displayed", "view", "viewing", "visible", "highlighted", "selected", | ||
| "this", "that", "these", "those", "here", "picture", "image", "photo", "page", | ||
| "window", "tab", "chart", "graph", "diagram", "button", "dialog", "popup", "menu", | ||
| // Deictic "choice" cues — the chat prompt treats these as screen-grounded | ||
| // (see ChatPrompts "which one / which option / which suits me"). | ||
| "which", "option", "options", "choose", "suits", | ||
| ] | ||
| // Deictic words are visual only when NOT part of a temporal phrase: "this week", | ||
| // "that day" are about time, not the screen — don't trigger a capture for those. | ||
| let deictic: Set<String> = ["this", "that", "these", "those"] | ||
| let timeWords: Set<String> = [ | ||
| "week", "weeks", "month", "months", "year", "years", "day", "days", "today", | ||
| "tonight", "morning", "afternoon", "evening", "night", "weekend", "weekends", | ||
| "quarter", "hour", "hours", "minute", "minutes", "time", "moment", "yesterday", | ||
| "tomorrow", "summer", "winter", "spring", "fall", | ||
| ] | ||
| for (i, tok) in tokens.enumerated() where cues.contains(tok) { | ||
| if deictic.contains(tok), i + 1 < tokens.count, timeWords.contains(tokens[i + 1]) { | ||
| continue // temporal ("this week") — not a screen reference | ||
| } | ||
| return true | ||
| } | ||
| return false | ||
| } |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1e4bf2d976
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| guard let jsonData = try? encoder.encode(trace), | ||
| let jsonString = String(data: jsonData, encoding: .utf8) | ||
| else { return } | ||
|
|
||
| let logFile = Self.logDir.appendingPathComponent("traces.jsonl") |
There was a problem hiding this comment.
Avoid writing raw chat payloads to disk
When normal desktop tracing runs, this encodes the full QueryTrace into ~/Library/Logs/Omi/traces.jsonl, including the captured system prompt, recent messages, final response, and tool inputs/outputs from captureRequest/captureResponse/captureToolExecution. For queries that use memories, SQL, screenshots, or conversations, this leaves sensitive user content in a rotating log file that can be exposed through local log readers, backups, or support-log collection; please gate this behind an explicit debug mode or redact/drop raw payloads and keep only timing/token metadata by default.
Useful? React with 👍 / 👎.
Greptile SummaryThis PR attacks floating-bar latency from the input-token side: a new
Confidence Score: 3/5Not safe to merge as-is: agent routing is completely removed and five baseline JSONL files with session data are committed; both need resolution before landing. The floating-bar now skips AgentPillsManager.classify() entirely, silently breaking multi-step agent flows for all users. The isFirstResponse/isGenerating flags are plain vars mutated by two closures that could fire concurrently, risking a double-open on the generation span. The eval-result JSONL files are committed to permanent git history contrary to the PR's own guidance. desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift (agent routing removed), desktop/Desktop/Sources/Providers/ChatProvider.swift (isFirstResponse/isGenerating flags, session ceiling thrash), desktop/eval-results/*.jsonl (should be deleted) Important Files Changed
Sequence DiagramsequenceDiagram
participant User
participant FCB as FloatingControlBarManager
participant QT as QueryTracer TaskLocal
participant CP as ChatProvider
participant AB as AgentBridge actor
participant Rust as Backend-Rust
participant Claude as Anthropic API
User->>FCB: typed query or PTT voice
FCB->>QT: withQueryTracer begin pre_llm
FCB->>FCB: queryNeedsScreenshot conditional capture
FCB->>CP: sendMessage
CP->>QT: begin bridge_ensure
CP->>AB: ensureBridgeStarted
AB->>QT: begin end quota_check
CP->>QT: begin llm_request and ttft
CP->>AB: query prompt systemPrompt
AB->>Rust: POST chat completions
Note over Rust: system ephemeral cache_control block
Rust->>Claude: Anthropic API streaming
Claude-->>Rust: stream chunks
Rust-->>AB: SSE chunks with cache_read_tokens
AB-->>CP: textDeltaHandler or toolActivityHandler
CP->>QT: end ttft markTTFT on first token
CP->>QT: begin end tool span per tool call
AB-->>CP: QueryResult tokens cost cacheRead
CP->>QT: finalize to traces.jsonl
CP->>CP: floatingInputTokens over 21k resets session
CP-->>FCB: messages updated
FCB-->>User: response displayed
|
|
|
||
| // Query the active bridge with streaming | ||
| // Callbacks for agent bridge | ||
| // Mutable flags for TTFT / generation tracking. | ||
| var isFirstResponse = true // TTFT: first output of any kind (text or tool_use) | ||
| var isGenerating = false // generation span: tracks actual text streaming | ||
| let textDeltaHandler: AgentBridge.TextDeltaHandler = { [weak self] delta in | ||
| // Tracer is thread-safe (OSAllocatedUnfairLock) — call directly so | ||
| // spans close before finalize(). Wrapping in Task { @MainActor } | ||
| // made them fire-and-forget, racing with finalize(). | ||
| if isFirstResponse { | ||
| isFirstResponse = false | ||
| tracer?.end("ttft") | ||
| tracer?.markTTFT() | ||
| } | ||
| if !isGenerating { | ||
| isGenerating = true | ||
| tracer?.begin("generation") | ||
| } |
There was a problem hiding this comment.
Potential data race on
isFirstResponse / isGenerating flags
isFirstResponse and isGenerating are plain vars captured by both textDeltaHandler and toolActivityHandler. Both closures modify these flags without any synchronization. If the two handlers are ever invoked concurrently from within agentBridge.query(), both could observe isFirstResponse == true simultaneously and both call tracer?.begin("generation") — opening the span twice. The defensive end("generation") call after query() returns would close only one instance, leaving the other permanently unclosed in the trace.
| init(query: String, inputMode: QueryInputMode) { | ||
| let now = ContinuousClock.now | ||
| let id = "t_" + (0..<6).map { _ in String(format: "%x", Int.random(in: 0...15)) }.joined() | ||
| lock = OSAllocatedUnfairLock(initialState: State( | ||
| origin: now, | ||
| query: query, | ||
| inputMode: inputMode, | ||
| traceId: id | ||
| )) |
There was a problem hiding this comment.
Low entropy trace ID — collision risk in longer sessions
Each trace ID is 6 random hex nibbles (24 bits, ~16.7 M unique values). With the birthday paradox, ~4 000 traces give a ~50% collision chance, and traces.jsonl can hold thousands of entries within the 5 MB cap. A collision silently aliases two traces when comparing them in the eval script. Using UUID().uuidString costs nothing and eliminates the issue.
| init(query: String, inputMode: QueryInputMode) { | |
| let now = ContinuousClock.now | |
| let id = "t_" + (0..<6).map { _ in String(format: "%x", Int.random(in: 0...15)) }.joined() | |
| lock = OSAllocatedUnfairLock(initialState: State( | |
| origin: now, | |
| query: query, | |
| inputMode: inputMode, | |
| traceId: id | |
| )) | |
| let id = "t_" + UUID().uuidString.lowercased().replacingOccurrences(of: "-", with: "").prefix(12) |
| /// Capture the full system prompt + message history sent to the API. | ||
| func captureRequest( | ||
| systemPrompt: String, | ||
| messages: [[String: String]], | ||
| hasScreenshot: Bool = false | ||
| ) { | ||
| lock.withLock { state in | ||
| state.systemPrompt = systemPrompt | ||
| state.messages = messages | ||
| state.hasScreenshot = hasScreenshot | ||
| } | ||
| } |
There was a problem hiding this comment.
Full system prompt and message history written to a log file
captureRequest stores the complete system prompt and the last 40 messages, then serializes them to ~/Library/Logs/Omi/traces.jsonl. The floating-bar system prompt contains the user's personal memories, facts, tasks, goals, and conversation history — all written to disk with only a 5 MB rotation policy and no TTL. Consider stripping or truncating PII fields before writing, or gating capture behind an explicit opt-in flag.
| await agentBridge.invalidateSession(sessionKey: sessionKey) | ||
| } | ||
|
|
||
| /// Input-token ceiling for the floating-bar ACP session. The ACP SDK keeps the full | ||
| /// turn history (including large tool results) inside the session, so input climbs every | ||
| /// turn — observed drifting 17k → 26k across a handful of queries, slowing TTFT and | ||
| /// tripping the 30k-tokens/min rate limit. When a floating query crosses this, we reset | ||
| /// the session in the background so the next one starts back near the base-prompt floor. | ||
| private static let floatingSessionTokenCeiling = 21_000 | ||
|
|
||
| /// Guards against overlapping resets when several bloated queries land in quick succession. | ||
| private var floatingSessionResetInFlight = false | ||
|
|
||
| /// Reset the floating-bar ACP session to shed accumulated context (conversation history + | ||
| /// tool-result bloat). Runs a small background Task that re-records the floating session with a | ||
| /// freshly re-seeded system prompt (recent chat history only). This call does NOT hit the model | ||
| /// and burns no tokens. | ||
| /// | ||
| /// Cost / mechanism (pi-mono path): pi-mono is single-session and bakes the system prompt as a | ||
| /// process launch flag, so the reset actually takes effect on the NEXT floating query — because | ||
| /// the re-seeded prompt differs from the running process's, `createSession` restarts the Pi | ||
| /// subprocess, which is what wipes the bloated history. So the restart cost (≈ a process spawn) | ||
| /// lands on that next query, NOT here in the background. Confirm via "[pi-mono] subprocess | ||
| /// restarted with new system prompt" in the app log. Mirrors the floating warmup in | ||
| /// ensureBridgeStarted — keep them in sync. | ||
| func resetFloatingSessionContext() { | ||
| guard agentBridgeStarted, !modeSwitchInProgress, !floatingSessionResetInFlight else { return } | ||
| floatingSessionResetInFlight = true | ||
| Task { @MainActor in | ||
| defer { floatingSessionResetInFlight = false } | ||
| await preparePromptContextIfNeeded() | ||
| let mainSystemPrompt = buildSystemPrompt(contextString: formatMemoriesSection()) | ||
| let floatingSystemPrompt = Self.floatingBarSystemPromptPrefix + "\n\n" + mainSystemPrompt | ||
| let floatingModel = ShortcutSettings.shared.selectedModel.isEmpty | ||
| ? ModelQoS.Claude.defaultSelection | ||
| : ShortcutSettings.shared.selectedModel | ||
| cachedMainSystemPrompt = mainSystemPrompt | ||
| await agentBridge.invalidateSession(sessionKey: "floating") | ||
| await agentBridge.warmupSession(cwd: workingDirectory, sessions: [ | ||
| .init(key: "floating", model: floatingModel, systemPrompt: floatingSystemPrompt) | ||
| ]) | ||
| log("ChatProvider: floating session context reset — next floating query rebuilds on a fresh process") | ||
| } | ||
| } | ||
|
|
||
| /// Test that the Playwright Chrome extension is connected and working. | ||
| /// Ensures the bridge is started (restarting if needed to pick up new token), | ||
| /// then sends a lightweight test query that triggers a browser_snapshot tool call. |
There was a problem hiding this comment.
Session token ceiling likely causes frequent Pi subprocess restarts
The ceiling is 21 000 tokens and the base prompt is ~14 831 tokens (per the PR's benchmarks). A single tool-use round adds 3–5 k tokens, so the ceiling is crossed in roughly 2–3 turns. Each crossing schedules a warmupSession that restarts the Pi subprocess, with the spawn cost landing on the next user query. With prompt caching, cacheReadTokens (~11 k) counts toward the ceiling, so even cache-warm sessions reset quickly. Multi-turn data queries will pay a subprocess spawn penalty roughly every third response, partially offsetting the TTFT gains.
| @@ -0,0 +1,13 @@ | |||
| {"cache_read_tokens":0,"cache_write_tokens":0,"cost_usd":0,"flagged_gaps":[],"input_mode":"text","input_tokens":17730,"model":"claude-sonnet-4-6","output_tokens":48,"query_text":"warmup","spans":[{"children":[{"dur_ms":232,"end_ms":266,"name":"screenshot_capture","start_ms":34}],"dur_ms":237,"end_ms":271,"name":"pre_llm","start_ms":34},{"dur_ms":109,"end_ms":380,"gap_before_ms":0,"meta":{"status":"ok"},"name":"bridge_ensure","start_ms":271},{"children":[{"children":[{"dur_ms":177,"end_ms":557,"meta":{"endpoint":"\/v1\/usage\/quota","result":"allowed"},"name":"quota_check","start_ms":380}],"dur_ms":2754,"end_ms":3134,"name":"ttft","start_ms":380},{"dur_ms":1530,"end_ms":4664,"gap_before_ms":0,"name":"generation","start_ms":3134}],"dur_ms":4284,"end_ms":4664,"gap_before_ms":0,"meta":{"model":"claude-sonnet-4-6"},"name":"llm_request","start_ms":380}],"timestamp":"2026-05-28T20:57:18Z","token_count":48,"total_ms":4664,"tps":31.372549019607842,"trace_id":"t_86e4f7","ttft_ms":3134} | |||
There was a problem hiding this comment.
Eval result files should not be committed to the repository
The PR description explicitly states these files "can be removed completely before merging." Five baseline-run*.jsonl files with real query/response traces, token counts, and timing data are committed here, permanently bloating git history. They should be deleted from the branch before landing.
kodjima33
left a comment
There was a problem hiding this comment.
Floating-bar latency wins: conditional screenshot, scoped tool use, compact schema, session bounding, Anthropic prompt caching with cached_tokens telemetry. Approve only — 1427 lines (>500), leaving merge to Nik to review the temporary agent-routing bypass and the eval-results scaffolding.
Summary
The floating-bar assistant was slow (2.7–3.5s to first token, 9–11s for tool queries) and unreliable (~16% of baseline requests hard-failed on 429 rate limits). This PR adds a query tracer to measure the path end-to-end, then uses those traces to land a set of latency/cost optimizations and validate each one against saved baselines.
Two parts:
QueryTracer— lightweight span/timing instrumentation across the floating-bar → pi-mono → Rust → Anthropic path. Records per-stage timings, token/cost data, gaps, and full request/response capture to~/Library/Logs/Omi/traces.jsonl. This is what made the rest measurable.Agent routing is temporarily disabled. Every floating-bar query now goes straight through the inline chat path, I bypassed
AgentPillsManager.classify()(the router that decides chat-vs-agent and can spawn an agent pill). It was unstable and firing agent requests spuriously, whicheval-resultsare only for testing purposes and can be removed completely before merging.What i found after tracing each request
Trace decomposition of a "first query" (~18k input tokens) showed the cost wasn't generation speed, it was input tokens and round-trips:
What changed
execute_sqltool hint (+sqlite_masterfor rare columns), instead of aget_schemaround-trip (round-trips re-detonate the rate limit).system+tools static prefix as anephemeralcache_controlblock and surfaceprompt_tokens_details.cached_tokensso cache hits propagate back into traces. claude allows 5 min cache (1.25x cost) and 1 hr cache (2x cost). based on users communication style, one of this cache strategies can be chosen to optimize cost and speed.Results — measured vs baseline
Tokens/sec (generation) was never the bottleneck and is unchanged — all gains come from sending fewer / cached input tokens and fewer round-trips.
How to reproduce
~/Library/Logs/Omi/traces.jsonl· Baselines:eval-results/baseline-run*.jsonlcache_read ≈ 14k, lowerttft).YES. but not within the current architecture. This PR squeezes out the wins available without changing the stack: fewer input tokens, prompt caching, and fewer tool round-trips. What's left is the irreducible floor of the current path — the model's own time-to-first-token (~2–3s), which scales with input size and is paid on every uncached request across the pipeline. More prompt-trimming gives diminishing returns from here.
To go meaningfully faster while keeping response quality, the next step is architectural. My recommendation is to move the floating-bar / voice path onto a low-latency realtime model API (OpenAI's Realtime API). it's built for sub-second, streaming, voice-native interaction and attacks the TTFT floor directly instead of just shrinking the payload, which is especially valuable for the PTT/voice flow.
I deliberately didn't take that route here because it's a substantial refactor (new transport, streaming, and session model), and I wasn't sure you'd want that level of architectural change at this stage. I'd rather land these measurable, low-risk, reversible wins first and treat the realtime-API migration as a separate, scoped decision — happy to spec it out if there's appetite.