making omi desktop faster by vendz · Pull Request #7583 · BasedHardware/omi

vendz · 2026-06-01T20:36:26Z

Summary

The floating-bar assistant was slow (2.7–3.5s to first token, 9–11s for tool queries) and unreliable (~16% of baseline requests hard-failed on 429 rate limits). This PR adds a query tracer to measure the path end-to-end, then uses those traces to land a set of latency/cost optimizations and validate each one against saved baselines.

Two parts:

QueryTracer — lightweight span/timing instrumentation across the floating-bar → pi-mono → Rust → Anthropic path. Records per-stage timings, token/cost data, gaps, and full request/response capture to ~/Library/Logs/Omi/traces.jsonl. This is what made the rest measurable.
Optimizations driven by the trace data (5 cycles, each validated).

⚠️ Few Things to note:

Agent routing is temporarily disabled. Every floating-bar query now goes straight through the inline chat path, I bypassed AgentPillsManager.classify() (the router that decides chat-vs-agent and can spawn an agent pill). It was unstable and firing agent requests spuriously, which
- is a separate bug that's out of scope for this PR, and
- would have added noise to the latency numbers and made the before/after eval comparison unreliable. It's a single, localized call site, so re-enabling it is effectively a one-line revert once the agent path is stabilized — easy to do on top of this PR.
eval-results are only for testing purposes and can be removed completely before merging.

What i found after tracing each request

Trace decomposition of a "first query" (~18k input tokens) showed the cost wasn't generation speed, it was input tokens and round-trips:

Everything was sent on every query: full system prompt + ~12k-char DB schema + 14 tool definitions + a screenshot, regardless of need.
Tools were forced on every message ("ALWAYS check tools"), so even "hi" fired 2–3 tool round-trips, each re-sends the full ~15–18k context → one chit-chat query ≈ 45–50k tokens in seconds, self-detonating the 30k/min BYOK limit (claude has this limit for tier 1 API keys)
Context grew unbounded within a session (17k → 26k+).
No prompt caching - every query paid full price for the same static prefix.

What changed

Conditional screenshot — capture only when the query is visual/deictic. it used to take screenshot even for queries like "hi".
Scoped tool use — chit-chat / general-knowledge answer directly; tools fire only for the user's own data. Removes the forced-tool storm.
Compact schema — moved the ~12k-char DB schema out of the prompt into the execute_sql tool hint (+ sqlite_master for rare columns), instead of a get_schema round-trip (round-trips re-detonate the rate limit).
Session token bounding — reset & re-seed the pi-mono session when input balloons past a ceiling, capping runaway growth.
Prompt caching (Rust) — emit the system+tools static prefix as an ephemeral cache_control block and surface prompt_tokens_details.cached_tokens so cache hits propagate back into traces. claude allows 5 min cache (1.25x cost) and 1 hr cache (2x cost). based on users communication style, one of this cache strategies can be chosen to optimize cost and speed.

Results — measured vs baseline

Scenario	Baseline	After
Greeting — input tokens	18,618	14,831 (−20%)
Greeting — time-to-first-token	~2.9s	~2.0–2.6s
"Hi" / chit-chat	2–3 tool calls → 9–11s, 429s	0 tools, ~2.0s
Data query ("what should I do today?")	6.6–6.9s, frequent 429	1.9s, no 429
Repeat query (warm cache)	— (no caching)	~1.2s (−40%), ~350 billed input tokens (−97%), ~90% cheaper input
429 failure rate (claude tier 1 limits)	~16% of requests	largely eliminated for chit-chat

Tokens/sec (generation) was never the bottleneck and is unchanged — all gains come from sending fewer / cached input tokens and fewer round-trips.

How to reproduce

cd desktop && ./run.sh            # local backend incl. caching;  or  ./run.sh --yolo  (changes 1–4 only)
# run a few floating-bar queries, then diff against a baseline:
python3 eval-results/eval.py --baseline baseline-run5-extended.jsonl

Live traces: ~/Library/Logs/Omi/traces.jsonl · Baselines: eval-results/baseline-run*.jsonl
Caching is per-session: first query writes the cache, the next ~5 min read it (cache_read ≈ 14k, lower ttft).

⚠️ Key questions is can this be optimized further?

YES. but not within the current architecture. This PR squeezes out the wins available without changing the stack: fewer input tokens, prompt caching, and fewer tool round-trips. What's left is the irreducible floor of the current path — the model's own time-to-first-token (~2–3s), which scales with input size and is paid on every uncached request across the pipeline. More prompt-trimming gives diminishing returns from here.

To go meaningfully faster while keeping response quality, the next step is architectural. My recommendation is to move the floating-bar / voice path onto a low-latency realtime model API (OpenAI's Realtime API). it's built for sub-second, streaming, voice-native interaction and attacks the TTFT floor directly instead of just shrinking the payload, which is especially valuable for the PTT/voice flow.

I deliberately didn't take that route here because it's a substantial refactor (new transport, streaming, and session model), and I wasn't sure you'd want that level of architectural change at this stage. I'd rather land these measurable, low-risk, reversible wins first and treat the realtime-API migration as a separate, scoped decision — happy to spec it out if there's appetite.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds end-to-end query tracing and evaluation tooling to monitor desktop query performance (tokens/TTFT), reduces unnecessary tool/screenshot overhead for floating-bar queries, and introduces prompt caching support in the Rust Anthropic bridge.

Changes:

Introduces QueryTracer (Swift) with JSONL logging, span tracking, TTFT, gap detection, and tool/request/response capture + unit tests.
Updates floating-bar/chat flow to use tracing, selectively capture screenshots, and auto-reset bloated ACP sessions.
Updates Rust Anthropic request translation to emit an ephemeral-cached system block and propagates cached-token usage in OpenAI-style usage.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
desktop/eval-results/eval.py	Adds a CLI summarizer/differ for `traces.jsonl` to track tokens/TTFT over time.
desktop/eval-results/baseline-*.jsonl	Adds baseline trace logs to compare regressions/improvements.
desktop/agent/src/omi-tools-stdio.ts	Expands execute_sql schema guidance (FTS + relationships).
desktop/Desktop/Tests/QueryTracerTests.swift	Adds unit tests for the new QueryTracer behavior.
desktop/Desktop/Sources/Services/QueryTracer.swift	Implements QueryTracer, JSONL logging, and request/tool/response capture.
desktop/Desktop/Sources/Providers/ChatProvider.swift	Threads tracer through bridge + streaming, captures tool executions, and resets floating session on token bloat.
desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift	Creates/propagates tracers for PTT flows and adds spans around transcription/cleanup/context capture.
desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift	Formatting-only refactors and routes floating queries through a QueryTracer task-local.
desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift	Adds tracing hooks for TTS playback start timing.
desktop/Desktop/Sources/Chat/ChatPrompts.swift	Adjusts tool-use guidance and removes `{database_schema}` expansion in favor of execute_sql docs.
desktop/Desktop/Sources/Chat/AgentBridge.swift	Adds quota-check spans to traces.
desktop/Backend-Rust/src/routes/chat_completions.rs	Emits system prompt as an ephemeral-cached content-block array; updates tests accordingly.
desktop/Backend-Rust/src/models/chat_completions.rs	Adds `prompt_tokens_details.cached_tokens` and changes AnthropicRequest.system to JSON Value.
.gitignore	Ignores `.worktrees/`.

Comments suppressed due to low confidence (2)

desktop/Desktop/Sources/Chat/AgentBridge.swift:1

When fetchChatUsageQuota() returns nil (network/server error), the trace currently records result=allowed, which is misleading. Consider distinguishing outcomes: allowed when quota is present + allowed, exceeded when present + not allowed, and unknown/error when quota couldn’t be fetched.

import Foundation

desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift:1

If AVAudioPlayer creation/playback throws, the tts_start span is never closed, which will skew timing (and may only get force-closed later as unclosed=true). Close the span in the catch path as well (ideally with metadata indicating the failure).

import AVFoundation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        print(f"{i:>2}  {g(r,'input_tokens',default=0):>7} {g(r,'output_tokens',default=0):>5} "
+              f"{int(ttft):>8} {tps:>5.0f} {g(r,'cache_read_tokens',default=0):>7} "
+              f"{cost:>7.4f} {('Y' if req.get('has_screenshot') else '-'):>4} "
+              f"{(str(int(shot_ms)) if shot_ms else '-'):>7} {(str(sysprompt_chars(r)) if sysprompt_chars(r) else '-'):>7}  {q}")


+    recs = []
+    if not Path(path).exists():
+        return recs
+    for line in Path(path).read_text().splitlines():


+    func end(_ name: String, metadata: [String: String]? = nil) {
+        lock.withLock { state in
+            let endInstant = ContinuousClock.now
+            // Search stack from top for matching name
+            guard let idx = state.spanStack.lastIndex(where: { $0.name == name }) else { return }
+
+            var span = state.spanStack.remove(at: idx)
+
+            // Merge metadata
+            if let meta = metadata {
+                if span.metadata == nil {
+                    span.metadata = meta
+                } else {
+                    for (k, v) in meta {
+                        span.metadata?[k] = v
+                    }
+                }
+            }
+
+            let startMs = (span.startInstant - state.origin).milliseconds
+            let endMs = (endInstant - state.origin).milliseconds
+            let built = BuiltSpan(
+                name: span.name,
+                start_ms: startMs,
+                end_ms: endMs,
+                dur_ms: endMs - startMs,
+                meta: span.metadata,
+                children: span.children.isEmpty ? nil : span.children
+            )
+
+            if state.spanStack.isEmpty {
+                state.completedSpans.append(built)
+            } else {
+                state.spanStack[state.spanStack.count - 1].children.append(built)
+            }
+        }
+    }


+    /// Run `body` inside a QueryTracer task-local, reusing an existing tracer or creating a new one.
+    private func withQueryTracer(query: String, fromVoice: Bool, _ body: () async -> Void) async {
+        let tracer =
+            QueryTracerContext.current
+            ?? QueryTracer(query: query, inputMode: fromVoice ? .voicePTTBatch : .text)
+        await QueryTracerContext.$current.withValue(tracer) {
+            await body()
        }
    }


    contextCaptureTask?.cancel()
    contextCaptureTask = nil
+    activeTracer = nil
    stopAudioTranscription()


+    /// Capture the full system prompt + message history sent to the API.
+    func captureRequest(
+        systemPrompt: String,
+        messages: [[String: String]],
+        hasScreenshot: Bool = false
+    ) {
+        lock.withLock { state in
+            state.systemPrompt = systemPrompt
+            state.messages = messages
+            state.hasScreenshot = hasScreenshot
+        }
+    }
+
+    /// Capture the final response text + finish reason.
+    func captureResponse(text: String, finishReason: String? = nil) {
+        lock.withLock { state in
+            state.responseText = text
+            state.finishReason = finishReason
+        }
+    }
+
+    /// Capture a tool call execution (name, input, output, duration).
+    func captureToolExecution(toolUseId: String?, name: String, input: String, output: String, durationMs: Int64? = nil) {
+        lock.withLock { state in
+            state.toolExecutions.append(TraceToolExecution(
+                tool_use_id: toolUseId,
+                name: name,
+                input: input,
+                output: output,
+                dur_ms: durationMs
+            ))
+        }
+    }


+    static func queryNeedsScreenshot(_ text: String) -> Bool {
+        // Tokenize: lowercase, split on non-alphanumerics (so "see?", "this." match as words).
+        let tokens = text.lowercased()
+            .replacingOccurrences(of: "[^a-z0-9]+", with: " ", options: .regularExpression)
+            .split(separator: " ").map(String.init)
+        let cues: Set<String> = [
+            "screen", "screenshot", "see", "seeing", "look", "looking", "show", "showing",
+            "display", "displayed", "view", "viewing", "visible", "highlighted", "selected",
+            "this", "that", "these", "those", "here", "picture", "image", "photo", "page",
+            "window", "tab", "chart", "graph", "diagram", "button", "dialog", "popup", "menu",
+            // Deictic "choice" cues — the chat prompt treats these as screen-grounded
+            // (see ChatPrompts "which one / which option / which suits me").
+            "which", "option", "options", "choose", "suits",
+        ]
+        // Deictic words are visual only when NOT part of a temporal phrase: "this week",
+        // "that day" are about time, not the screen — don't trigger a capture for those.
+        let deictic: Set<String> = ["this", "that", "these", "those"]
+        let timeWords: Set<String> = [
+            "week", "weeks", "month", "months", "year", "years", "day", "days", "today",
+            "tonight", "morning", "afternoon", "evening", "night", "weekend", "weekends",
+            "quarter", "hour", "hours", "minute", "minutes", "time", "moment", "yesterday",
+            "tomorrow", "summer", "winter", "spring", "fall",
+        ]
+        for (i, tok) in tokens.enumerated() where cues.contains(tok) {
+            if deictic.contains(tok), i + 1 < tokens.count, timeWords.contains(tokens[i + 1]) {
+                continue  // temporal ("this week") — not a screen reference
+            }
+            return true
+        }
+        return false
+    }


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e4bf2d976

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-01T20:42:28Z

+            guard let jsonData = try? encoder.encode(trace),
+                  let jsonString = String(data: jsonData, encoding: .utf8)
+            else { return }
+
+            let logFile = Self.logDir.appendingPathComponent("traces.jsonl")


Avoid writing raw chat payloads to disk

When normal desktop tracing runs, this encodes the full QueryTrace into ~/Library/Logs/Omi/traces.jsonl, including the captured system prompt, recent messages, final response, and tool inputs/outputs from captureRequest/captureResponse/captureToolExecution. For queries that use memories, SQL, screenshots, or conversations, this leaves sensitive user content in a rotating log file that can be exposed through local log readers, backups, or support-log collection; please gate this behind an explicit debug mode or redact/drop raw payloads and keep only timing/token metadata by default.

Useful? React with 👍 / 👎.

greptile-apps · 2026-06-01T20:49:37Z

Greptile Summary

This PR attacks floating-bar latency from the input-token side: a new QueryTracer instruments the full Swift → pi-mono → Rust → Anthropic round-trip, and five optimisation cycles (conditional screenshot, scoped tool use, compact schema, session bounding, prompt caching) cut first-token times and largely eliminate 429s for chit-chat.

QueryTracer — lightweight OSAllocatedUnfairLock-based span recorder writing per-query JSONL traces to ~/Library/Logs/Omi/traces.jsonl; well-tested with 9 unit tests covering nesting, gap detection, TTFT, async closures, and serialisation.
Rust layer — system prompt emitted as an Anthropic cache_control: ephemeral content block; PromptTokensDetails.cached_tokens propagates cache-hit counts through the OpenAI-compatible usage payload to the Swift tracer.
Swift floating-bar path — conditional screenshot capture (queryNeedsScreenshot), scoped tool-use instruction, DB schema moved into the execute_sql tool hint, session token ceiling (21 k) with background reset, and a withQueryTracer wrapper seeding the @TaskLocal tracer across all query entry points."

Confidence Score: 3/5

Not safe to merge as-is: agent routing is completely removed and five baseline JSONL files with session data are committed; both need resolution before landing.

The floating-bar now skips AgentPillsManager.classify() entirely, silently breaking multi-step agent flows for all users. The isFirstResponse/isGenerating flags are plain vars mutated by two closures that could fire concurrently, risking a double-open on the generation span. The eval-result JSONL files are committed to permanent git history contrary to the PR's own guidance.

desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift (agent routing removed), desktop/Desktop/Sources/Providers/ChatProvider.swift (isFirstResponse/isGenerating flags, session ceiling thrash), desktop/eval-results/*.jsonl (should be deleted)

Important Files Changed

Filename	Overview
desktop/Desktop/Sources/Services/QueryTracer.swift	New QueryTracer: well-structured OSAllocatedUnfairLock-based span recorder with file I/O rotation. Low-entropy trace IDs and full system-prompt capture (PII) are the two issues to address.
desktop/Desktop/Sources/Providers/ChatProvider.swift	Session token bounding, tracer integration, and tool-span tracking added. Potential data race on isFirstResponse/isGenerating flags, and 21k-token ceiling may trigger Pi subprocess restarts every 2-3 tool-heavy turns.
desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift	Agent routing (AgentPillsManager.classify) completely removed from routeQuery — all queries bypass agent pills. withQueryTracer wrapper and queryNeedsScreenshot heuristic are well-implemented.
desktop/Backend-Rust/src/routes/chat_completions.rs	Correctly transforms the system prompt into an Anthropic cache-control content block. Updated tests verify the new structure. No issues found.
desktop/Backend-Rust/src/models/chat_completions.rs	PromptTokensDetails added to propagate cache_read_input_tokens through the OpenAI-compatible usage payload. Logic is correct; field skipped when zero.
desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift	Tracer lifecycle through PTT session well-handled via ownership transfer and explicit nil clear. No issues found.
desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift	tts_start span correctly opened on first chunk and closed in audio-player success path and TTS fallback. Tracer nulled in reset().
desktop/Desktop/Sources/Chat/ChatPrompts.swift	Tool instruction updated to skip tools for chit-chat; DB schema moved into execute_sql tool hint. Both changes are sound.
desktop/eval-results/baseline-run5-extended.jsonl	Five baseline JSONL files committed to the repo; should be removed before merging per PR description.
desktop/Desktop/Tests/QueryTracerTests.swift	Good unit test coverage for span recording, nesting, gap detection, TTFT, metadata merge, async closure API, and JSONL serialization.

Sequence Diagram

sequenceDiagram
    participant User
    participant FCB as FloatingControlBarManager
    participant QT as QueryTracer TaskLocal
    participant CP as ChatProvider
    participant AB as AgentBridge actor
    participant Rust as Backend-Rust
    participant Claude as Anthropic API
    User->>FCB: typed query or PTT voice
    FCB->>QT: withQueryTracer begin pre_llm
    FCB->>FCB: queryNeedsScreenshot conditional capture
    FCB->>CP: sendMessage
    CP->>QT: begin bridge_ensure
    CP->>AB: ensureBridgeStarted
    AB->>QT: begin end quota_check
    CP->>QT: begin llm_request and ttft
    CP->>AB: query prompt systemPrompt
    AB->>Rust: POST chat completions
    Note over Rust: system ephemeral cache_control block
    Rust->>Claude: Anthropic API streaming
    Claude-->>Rust: stream chunks
    Rust-->>AB: SSE chunks with cache_read_tokens
    AB-->>CP: textDeltaHandler or toolActivityHandler
    CP->>QT: end ttft markTTFT on first token
    CP->>QT: begin end tool span per tool call
    AB-->>CP: QueryResult tokens cost cacheRead
    CP->>QT: finalize to traces.jsonl
    CP->>CP: floatingInputTokens over 21k resets session
    CP-->>FCB: messages updated
    FCB-->>User: response displayed

Comments Outside Diff (1)

desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift, line 1390-1402 (link)

Agent routing permanently bypassed in this PR

AgentPillsManager.classify() is entirely removed from routeQuery, so every floating-bar query — including ones that previously spawned a structured agent pill — goes straight to inline chat. This silently removes a user-visible feature: users who relied on the agent route for multi-step tasks will get an inline chat response instead, with no indication the route changed. The PR description says re-enabling is "a one-line revert", but there's no tracking issue or follow-up stub to ensure it actually happens before release.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

_{Reviews (1): Last reviewed commit: "perf(desktop): reduce floating-bar query..." | Re-trigger Greptile}

greptile-apps · 2026-06-01T20:49:42Z


            // Query the active bridge with streaming
            // Callbacks for agent bridge
+            // Mutable flags for TTFT / generation tracking.
+            var isFirstResponse = true   // TTFT: first output of any kind (text or tool_use)
+            var isGenerating = false     // generation span: tracks actual text streaming
            let textDeltaHandler: AgentBridge.TextDeltaHandler = { [weak self] delta in
+                // Tracer is thread-safe (OSAllocatedUnfairLock) — call directly so
+                // spans close before finalize(). Wrapping in Task { @MainActor }
+                // made them fire-and-forget, racing with finalize().
+                if isFirstResponse {
+                    isFirstResponse = false
+                    tracer?.end("ttft")
+                    tracer?.markTTFT()
+                }
+                if !isGenerating {
+                    isGenerating = true
+                    tracer?.begin("generation")
+                }


Potential data race on isFirstResponse / isGenerating flags

isFirstResponse and isGenerating are plain vars captured by both textDeltaHandler and toolActivityHandler. Both closures modify these flags without any synchronization. If the two handlers are ever invoked concurrently from within agentBridge.query(), both could observe isFirstResponse == true simultaneously and both call tracer?.begin("generation") — opening the span twice. The defensive end("generation") call after query() returns would close only one instance, leaving the other permanently unclosed in the trace.

greptile-apps · 2026-06-01T20:49:43Z

+    init(query: String, inputMode: QueryInputMode) {
+        let now = ContinuousClock.now
+        let id = "t_" + (0..<6).map { _ in String(format: "%x", Int.random(in: 0...15)) }.joined()
+        lock = OSAllocatedUnfairLock(initialState: State(
+            origin: now,
+            query: query,
+            inputMode: inputMode,
+            traceId: id
+        ))


Low entropy trace ID — collision risk in longer sessions

Each trace ID is 6 random hex nibbles (24 bits, ~16.7 M unique values). With the birthday paradox, ~4 000 traces give a ~50% collision chance, and traces.jsonl can hold thousands of entries within the 5 MB cap. A collision silently aliases two traces when comparing them in the eval script. Using UUID().uuidString costs nothing and eliminates the issue.

Suggested change

init(query: String, inputMode: QueryInputMode) {

let now = ContinuousClock.now

let id = "t_" + (0..<6).map { _ in String(format: "%x", Int.random(in: 0...15)) }.joined()

lock = OSAllocatedUnfairLock(initialState: State(

origin: now,

query: query,

inputMode: inputMode,

traceId: id

))

let id = "t_" + UUID().uuidString.lowercased().replacingOccurrences(of: "-", with: "").prefix(12)

greptile-apps · 2026-06-01T20:49:43Z

+    /// Capture the full system prompt + message history sent to the API.
+    func captureRequest(
+        systemPrompt: String,
+        messages: [[String: String]],
+        hasScreenshot: Bool = false
+    ) {
+        lock.withLock { state in
+            state.systemPrompt = systemPrompt
+            state.messages = messages
+            state.hasScreenshot = hasScreenshot
+        }
+    }


Full system prompt and message history written to a log file

captureRequest stores the complete system prompt and the last 40 messages, then serializes them to ~/Library/Logs/Omi/traces.jsonl. The floating-bar system prompt contains the user's personal memories, facts, tasks, goals, and conversation history — all written to disk with only a 5 MB rotation policy and no TTL. Consider stripping or truncating PII fields before writing, or gating capture behind an explicit opt-in flag.

greptile-apps · 2026-06-01T20:49:45Z

        await agentBridge.invalidateSession(sessionKey: sessionKey)
    }

+    /// Input-token ceiling for the floating-bar ACP session. The ACP SDK keeps the full
+    /// turn history (including large tool results) inside the session, so input climbs every
+    /// turn — observed drifting 17k → 26k across a handful of queries, slowing TTFT and
+    /// tripping the 30k-tokens/min rate limit. When a floating query crosses this, we reset
+    /// the session in the background so the next one starts back near the base-prompt floor.
+    private static let floatingSessionTokenCeiling = 21_000
+
+    /// Guards against overlapping resets when several bloated queries land in quick succession.
+    private var floatingSessionResetInFlight = false
+
+    /// Reset the floating-bar ACP session to shed accumulated context (conversation history +
+    /// tool-result bloat). Runs a small background Task that re-records the floating session with a
+    /// freshly re-seeded system prompt (recent chat history only). This call does NOT hit the model
+    /// and burns no tokens.
+    ///
+    /// Cost / mechanism (pi-mono path): pi-mono is single-session and bakes the system prompt as a
+    /// process launch flag, so the reset actually takes effect on the NEXT floating query — because
+    /// the re-seeded prompt differs from the running process's, `createSession` restarts the Pi
+    /// subprocess, which is what wipes the bloated history. So the restart cost (≈ a process spawn)
+    /// lands on that next query, NOT here in the background. Confirm via "[pi-mono] subprocess
+    /// restarted with new system prompt" in the app log. Mirrors the floating warmup in
+    /// ensureBridgeStarted — keep them in sync.
+    func resetFloatingSessionContext() {
+        guard agentBridgeStarted, !modeSwitchInProgress, !floatingSessionResetInFlight else { return }
+        floatingSessionResetInFlight = true
+        Task { @MainActor in
+            defer { floatingSessionResetInFlight = false }
+            await preparePromptContextIfNeeded()
+            let mainSystemPrompt = buildSystemPrompt(contextString: formatMemoriesSection())
+            let floatingSystemPrompt = Self.floatingBarSystemPromptPrefix + "\n\n" + mainSystemPrompt
+            let floatingModel = ShortcutSettings.shared.selectedModel.isEmpty
+                ? ModelQoS.Claude.defaultSelection
+                : ShortcutSettings.shared.selectedModel
+            cachedMainSystemPrompt = mainSystemPrompt
+            await agentBridge.invalidateSession(sessionKey: "floating")
+            await agentBridge.warmupSession(cwd: workingDirectory, sessions: [
+                .init(key: "floating", model: floatingModel, systemPrompt: floatingSystemPrompt)
+            ])
+            log("ChatProvider: floating session context reset — next floating query rebuilds on a fresh process")
+        }
+    }
+
    /// Test that the Playwright Chrome extension is connected and working.
    /// Ensures the bridge is started (restarting if needed to pick up new token),
    /// then sends a lightweight test query that triggers a browser_snapshot tool call.


Session token ceiling likely causes frequent Pi subprocess restarts

The ceiling is 21 000 tokens and the base prompt is ~14 831 tokens (per the PR's benchmarks). A single tool-use round adds 3–5 k tokens, so the ceiling is crossed in roughly 2–3 turns. Each crossing schedules a warmupSession that restarts the Pi subprocess, with the spawn cost landing on the next user query. With prompt caching, cacheReadTokens (~11 k) counts toward the ceiling, so even cache-warm sessions reset quickly. Multi-turn data queries will pay a subprocess spawn penalty roughly every third response, partially offsetting the TTFT gains.

greptile-apps · 2026-06-01T20:49:46Z

@@ -0,0 +1,13 @@
+{"cache_read_tokens":0,"cache_write_tokens":0,"cost_usd":0,"flagged_gaps":[],"input_mode":"text","input_tokens":17730,"model":"claude-sonnet-4-6","output_tokens":48,"query_text":"warmup","spans":[{"children":[{"dur_ms":232,"end_ms":266,"name":"screenshot_capture","start_ms":34}],"dur_ms":237,"end_ms":271,"name":"pre_llm","start_ms":34},{"dur_ms":109,"end_ms":380,"gap_before_ms":0,"meta":{"status":"ok"},"name":"bridge_ensure","start_ms":271},{"children":[{"children":[{"dur_ms":177,"end_ms":557,"meta":{"endpoint":"\/v1\/usage\/quota","result":"allowed"},"name":"quota_check","start_ms":380}],"dur_ms":2754,"end_ms":3134,"name":"ttft","start_ms":380},{"dur_ms":1530,"end_ms":4664,"gap_before_ms":0,"name":"generation","start_ms":3134}],"dur_ms":4284,"end_ms":4664,"gap_before_ms":0,"meta":{"model":"claude-sonnet-4-6"},"name":"llm_request","start_ms":380}],"timestamp":"2026-05-28T20:57:18Z","token_count":48,"total_ms":4664,"tps":31.372549019607842,"trace_id":"t_86e4f7","ttft_ms":3134}


Eval result files should not be committed to the repository

The PR description explicitly states these files "can be removed completely before merging." Five baseline-run*.jsonl files with real query/response traces, token counts, and timing data are committed here, permanently bloating git history. They should be deleted from the branch before landing.

kodjima33

Floating-bar latency wins: conditional screenshot, scoped tool use, compact schema, session bounding, Anthropic prompt caching with cached_tokens telemetry. Approve only — 1427 lines (>500), leaving merge to Nik to review the temporary agent-routing bypass and the eval-results scaffolding.

vendz added 4 commits May 27, 2026 17:44

chore: add .worktrees/ to gitignore for development worktrees

c0d5838

feat(evals): added baseline evals to varify results and optimizations

da9ba6b

fix(evals): some parts were skipped in trace

29ee0be

perf(desktop): reduce floating-bar query latency and tokens

1e4bf2d

Copilot AI review requested due to automatic review settings June 1, 2026 20:36

Copilot AI reviewed Jun 1, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 1, 2026

View reviewed changes

greptile-apps Bot reviewed Jun 1, 2026

View reviewed changes

kodjima33 approved these changes Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

making omi desktop faster#7583

making omi desktop faster#7583
vendz wants to merge 4 commits into
BasedHardware:mainfrom
vendz:feature/query-tracer

vendz commented Jun 1, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 1, 2026

Uh oh!

greptile-apps Bot commented Jun 1, 2026 •

edited

Loading

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Jun 1, 2026

Uh oh!

greptile-apps Bot Jun 1, 2026

Uh oh!

greptile-apps Bot Jun 1, 2026

Uh oh!

greptile-apps Bot Jun 1, 2026

Uh oh!

greptile-apps Bot Jun 1, 2026

Uh oh!

kodjima33 left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		@@ -0,0 +1,13 @@
		{"cache_read_tokens":0,"cache_write_tokens":0,"cost_usd":0,"flagged_gaps":[],"input_mode":"text","input_tokens":17730,"model":"claude-sonnet-4-6","output_tokens":48,"query_text":"warmup","spans":[{"children":[{"dur_ms":232,"end_ms":266,"name":"screenshot_capture","start_ms":34}],"dur_ms":237,"end_ms":271,"name":"pre_llm","start_ms":34},{"dur_ms":109,"end_ms":380,"gap_before_ms":0,"meta":{"status":"ok"},"name":"bridge_ensure","start_ms":271},{"children":[{"children":[{"dur_ms":177,"end_ms":557,"meta":{"endpoint":"\/v1\/usage\/quota","result":"allowed"},"name":"quota_check","start_ms":380}],"dur_ms":2754,"end_ms":3134,"name":"ttft","start_ms":380},{"dur_ms":1530,"end_ms":4664,"gap_before_ms":0,"name":"generation","start_ms":3134}],"dur_ms":4284,"end_ms":4664,"gap_before_ms":0,"meta":{"model":"claude-sonnet-4-6"},"name":"llm_request","start_ms":380}],"timestamp":"2026-05-28T20:57:18Z","token_count":48,"total_ms":4664,"tps":31.372549019607842,"trace_id":"t_86e4f7","ttft_ms":3134}

Conversation

vendz commented Jun 1, 2026

Summary

⚠️ Few Things to note:

What i found after tracing each request

What changed

Results — measured vs baseline

How to reproduce

⚠️ Key questions is can this be optimized further?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 3/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (1)

Uh oh!

greptile-apps Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

kodjima33 left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

greptile-apps Bot commented Jun 1, 2026 •

edited

Loading