Skip to content

making omi desktop faster#7583

Open
vendz wants to merge 4 commits into
BasedHardware:mainfrom
vendz:feature/query-tracer
Open

making omi desktop faster#7583
vendz wants to merge 4 commits into
BasedHardware:mainfrom
vendz:feature/query-tracer

Conversation

@vendz
Copy link
Copy Markdown
Contributor

@vendz vendz commented Jun 1, 2026

Summary

The floating-bar assistant was slow (2.7–3.5s to first token, 9–11s for tool queries) and unreliable (~16% of baseline requests hard-failed on 429 rate limits). This PR adds a query tracer to measure the path end-to-end, then uses those traces to land a set of latency/cost optimizations and validate each one against saved baselines.

Two parts:

  1. QueryTracer — lightweight span/timing instrumentation across the floating-bar → pi-mono → Rust → Anthropic path. Records per-stage timings, token/cost data, gaps, and full request/response capture to ~/Library/Logs/Omi/traces.jsonl. This is what made the rest measurable.
  2. Optimizations driven by the trace data (5 cycles, each validated).

⚠️ Few Things to note:

  • Agent routing is temporarily disabled. Every floating-bar query now goes straight through the inline chat path, I bypassed AgentPillsManager.classify() (the router that decides chat-vs-agent and can spawn an agent pill). It was unstable and firing agent requests spuriously, which

    • is a separate bug that's out of scope for this PR, and
    • would have added noise to the latency numbers and made the before/after eval comparison unreliable. It's a single, localized call site, so re-enabling it is effectively a one-line revert once the agent path is stabilized — easy to do on top of this PR.
  • eval-results are only for testing purposes and can be removed completely before merging.

What i found after tracing each request

Trace decomposition of a "first query" (~18k input tokens) showed the cost wasn't generation speed, it was input tokens and round-trips:

  • Everything was sent on every query: full system prompt + ~12k-char DB schema + 14 tool definitions + a screenshot, regardless of need.
  • Tools were forced on every message ("ALWAYS check tools"), so even "hi" fired 2–3 tool round-trips, each re-sends the full ~15–18k context → one chit-chat query ≈ 45–50k tokens in seconds, self-detonating the 30k/min BYOK limit (claude has this limit for tier 1 API keys)
  • Context grew unbounded within a session (17k → 26k+).
  • No prompt caching - every query paid full price for the same static prefix.

What changed

  1. Conditional screenshot — capture only when the query is visual/deictic. it used to take screenshot even for queries like "hi".
  2. Scoped tool use — chit-chat / general-knowledge answer directly; tools fire only for the user's own data. Removes the forced-tool storm.
  3. Compact schema — moved the ~12k-char DB schema out of the prompt into the execute_sql tool hint (+ sqlite_master for rare columns), instead of a get_schema round-trip (round-trips re-detonate the rate limit).
  4. Session token bounding — reset & re-seed the pi-mono session when input balloons past a ceiling, capping runaway growth.
  5. Prompt caching (Rust) — emit the system+tools static prefix as an ephemeral cache_control block and surface prompt_tokens_details.cached_tokens so cache hits propagate back into traces. claude allows 5 min cache (1.25x cost) and 1 hr cache (2x cost). based on users communication style, one of this cache strategies can be chosen to optimize cost and speed.

Results — measured vs baseline

Scenario Baseline After
Greeting — input tokens 18,618 14,831 (−20%)
Greeting — time-to-first-token ~2.9s ~2.0–2.6s
"Hi" / chit-chat 2–3 tool calls → 9–11s, 429s 0 tools, ~2.0s
Data query ("what should I do today?") 6.6–6.9s, frequent 429 1.9s, no 429
Repeat query (warm cache) — (no caching) ~1.2s (−40%), ~350 billed input tokens (−97%), ~90% cheaper input
429 failure rate (claude tier 1 limits) ~16% of requests largely eliminated for chit-chat

Tokens/sec (generation) was never the bottleneck and is unchanged — all gains come from sending fewer / cached input tokens and fewer round-trips.

How to reproduce

cd desktop && ./run.sh            # local backend incl. caching;  or  ./run.sh --yolo  (changes 1–4 only)
# run a few floating-bar queries, then diff against a baseline:
python3 eval-results/eval.py --baseline baseline-run5-extended.jsonl
  • Live traces: ~/Library/Logs/Omi/traces.jsonl · Baselines: eval-results/baseline-run*.jsonl
  • Caching is per-session: first query writes the cache, the next ~5 min read it (cache_read ≈ 14k, lower ttft).

⚠️ Key questions is can this be optimized further?

YES. but not within the current architecture. This PR squeezes out the wins available without changing the stack: fewer input tokens, prompt caching, and fewer tool round-trips. What's left is the irreducible floor of the current path — the model's own time-to-first-token (~2–3s), which scales with input size and is paid on every uncached request across the pipeline. More prompt-trimming gives diminishing returns from here.

To go meaningfully faster while keeping response quality, the next step is architectural. My recommendation is to move the floating-bar / voice path onto a low-latency realtime model API (OpenAI's Realtime API). it's built for sub-second, streaming, voice-native interaction and attacks the TTFT floor directly instead of just shrinking the payload, which is especially valuable for the PTT/voice flow.

I deliberately didn't take that route here because it's a substantial refactor (new transport, streaming, and session model), and I wasn't sure you'd want that level of architectural change at this stage. I'd rather land these measurable, low-risk, reversible wins first and treat the realtime-API migration as a separate, scoped decision — happy to spec it out if there's appetite.

Copilot AI review requested due to automatic review settings June 1, 2026 20:36
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds end-to-end query tracing and evaluation tooling to monitor desktop query performance (tokens/TTFT), reduces unnecessary tool/screenshot overhead for floating-bar queries, and introduces prompt caching support in the Rust Anthropic bridge.

Changes:

  • Introduces QueryTracer (Swift) with JSONL logging, span tracking, TTFT, gap detection, and tool/request/response capture + unit tests.
  • Updates floating-bar/chat flow to use tracing, selectively capture screenshots, and auto-reset bloated ACP sessions.
  • Updates Rust Anthropic request translation to emit an ephemeral-cached system block and propagates cached-token usage in OpenAI-style usage.

Reviewed changes

Copilot reviewed 17 out of 18 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
desktop/eval-results/eval.py Adds a CLI summarizer/differ for traces.jsonl to track tokens/TTFT over time.
desktop/eval-results/baseline-*.jsonl Adds baseline trace logs to compare regressions/improvements.
desktop/agent/src/omi-tools-stdio.ts Expands execute_sql schema guidance (FTS + relationships).
desktop/Desktop/Tests/QueryTracerTests.swift Adds unit tests for the new QueryTracer behavior.
desktop/Desktop/Sources/Services/QueryTracer.swift Implements QueryTracer, JSONL logging, and request/tool/response capture.
desktop/Desktop/Sources/Providers/ChatProvider.swift Threads tracer through bridge + streaming, captures tool executions, and resets floating session on token bloat.
desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift Creates/propagates tracers for PTT flows and adds spans around transcription/cleanup/context capture.
desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift Formatting-only refactors and routes floating queries through a QueryTracer task-local.
desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift Adds tracing hooks for TTS playback start timing.
desktop/Desktop/Sources/Chat/ChatPrompts.swift Adjusts tool-use guidance and removes {database_schema} expansion in favor of execute_sql docs.
desktop/Desktop/Sources/Chat/AgentBridge.swift Adds quota-check spans to traces.
desktop/Backend-Rust/src/routes/chat_completions.rs Emits system prompt as an ephemeral-cached content-block array; updates tests accordingly.
desktop/Backend-Rust/src/models/chat_completions.rs Adds prompt_tokens_details.cached_tokens and changes AnthropicRequest.system to JSON Value.
.gitignore Ignores .worktrees/.
Comments suppressed due to low confidence (2)

desktop/Desktop/Sources/Chat/AgentBridge.swift:1

  • When fetchChatUsageQuota() returns nil (network/server error), the trace currently records result=allowed, which is misleading. Consider distinguishing outcomes: allowed when quota is present + allowed, exceeded when present + not allowed, and unknown/error when quota couldn’t be fetched.
import Foundation

desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift:1

  • If AVAudioPlayer creation/playback throws, the tts_start span is never closed, which will skew timing (and may only get force-closed later as unclosed=true). Close the span in the catch path as well (ideally with metadata indicating the failure).
import AVFoundation

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

print(f"{i:>2} {g(r,'input_tokens',default=0):>7} {g(r,'output_tokens',default=0):>5} "
f"{int(ttft):>8} {tps:>5.0f} {g(r,'cache_read_tokens',default=0):>7} "
f"{cost:>7.4f} {('Y' if req.get('has_screenshot') else '-'):>4} "
f"{(str(int(shot_ms)) if shot_ms else '-'):>7} {(str(sysprompt_chars(r)) if sysprompt_chars(r) else '-'):>7} {q}")
recs = []
if not Path(path).exists():
return recs
for line in Path(path).read_text().splitlines():
Comment on lines +195 to +231
func end(_ name: String, metadata: [String: String]? = nil) {
lock.withLock { state in
let endInstant = ContinuousClock.now
// Search stack from top for matching name
guard let idx = state.spanStack.lastIndex(where: { $0.name == name }) else { return }

var span = state.spanStack.remove(at: idx)

// Merge metadata
if let meta = metadata {
if span.metadata == nil {
span.metadata = meta
} else {
for (k, v) in meta {
span.metadata?[k] = v
}
}
}

let startMs = (span.startInstant - state.origin).milliseconds
let endMs = (endInstant - state.origin).milliseconds
let built = BuiltSpan(
name: span.name,
start_ms: startMs,
end_ms: endMs,
dur_ms: endMs - startMs,
meta: span.metadata,
children: span.children.isEmpty ? nil : span.children
)

if state.spanStack.isEmpty {
state.completedSpans.append(built)
} else {
state.spanStack[state.spanStack.count - 1].children.append(built)
}
}
}
Comment on lines +1373 to 1381
/// Run `body` inside a QueryTracer task-local, reusing an existing tracer or creating a new one.
private func withQueryTracer(query: String, fromVoice: Bool, _ body: () async -> Void) async {
let tracer =
QueryTracerContext.current
?? QueryTracer(query: query, inputMode: fromVoice ? .voicePTTBatch : .text)
await QueryTracerContext.$current.withValue(tracer) {
await body()
}
}
Comment on lines 315 to 318
contextCaptureTask?.cancel()
contextCaptureTask = nil
activeTracer = nil
stopAudioTranscription()
Comment on lines +252 to +284
/// Capture the full system prompt + message history sent to the API.
func captureRequest(
systemPrompt: String,
messages: [[String: String]],
hasScreenshot: Bool = false
) {
lock.withLock { state in
state.systemPrompt = systemPrompt
state.messages = messages
state.hasScreenshot = hasScreenshot
}
}

/// Capture the final response text + finish reason.
func captureResponse(text: String, finishReason: String? = nil) {
lock.withLock { state in
state.responseText = text
state.finishReason = finishReason
}
}

/// Capture a tool call execution (name, input, output, duration).
func captureToolExecution(toolUseId: String?, name: String, input: String, output: String, durationMs: Int64? = nil) {
lock.withLock { state in
state.toolExecutions.append(TraceToolExecution(
tool_use_id: toolUseId,
name: name,
input: input,
output: output,
dur_ms: durationMs
))
}
}
Comment on lines +1687 to +1717
static func queryNeedsScreenshot(_ text: String) -> Bool {
// Tokenize: lowercase, split on non-alphanumerics (so "see?", "this." match as words).
let tokens = text.lowercased()
.replacingOccurrences(of: "[^a-z0-9]+", with: " ", options: .regularExpression)
.split(separator: " ").map(String.init)
let cues: Set<String> = [
"screen", "screenshot", "see", "seeing", "look", "looking", "show", "showing",
"display", "displayed", "view", "viewing", "visible", "highlighted", "selected",
"this", "that", "these", "those", "here", "picture", "image", "photo", "page",
"window", "tab", "chart", "graph", "diagram", "button", "dialog", "popup", "menu",
// Deictic "choice" cues — the chat prompt treats these as screen-grounded
// (see ChatPrompts "which one / which option / which suits me").
"which", "option", "options", "choose", "suits",
]
// Deictic words are visual only when NOT part of a temporal phrase: "this week",
// "that day" are about time, not the screen — don't trigger a capture for those.
let deictic: Set<String> = ["this", "that", "these", "those"]
let timeWords: Set<String> = [
"week", "weeks", "month", "months", "year", "years", "day", "days", "today",
"tonight", "morning", "afternoon", "evening", "night", "weekend", "weekends",
"quarter", "hour", "hours", "minute", "minutes", "time", "moment", "yesterday",
"tomorrow", "summer", "winter", "spring", "fall",
]
for (i, tok) in tokens.enumerated() where cues.contains(tok) {
if deictic.contains(tok), i + 1 < tokens.count, timeWords.contains(tokens[i + 1]) {
continue // temporal ("this week") — not a screen reference
}
return true
}
return false
}
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e4bf2d976

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +442 to +446
guard let jsonData = try? encoder.encode(trace),
let jsonString = String(data: jsonData, encoding: .utf8)
else { return }

let logFile = Self.logDir.appendingPathComponent("traces.jsonl")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid writing raw chat payloads to disk

When normal desktop tracing runs, this encodes the full QueryTrace into ~/Library/Logs/Omi/traces.jsonl, including the captured system prompt, recent messages, final response, and tool inputs/outputs from captureRequest/captureResponse/captureToolExecution. For queries that use memories, SQL, screenshots, or conversations, this leaves sensitive user content in a rotating log file that can be exposed through local log readers, backups, or support-log collection; please gate this behind an explicit debug mode or redact/drop raw payloads and keep only timing/token metadata by default.

Useful? React with 👍 / 👎.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

This PR attacks floating-bar latency from the input-token side: a new QueryTracer instruments the full Swift → pi-mono → Rust → Anthropic round-trip, and five optimisation cycles (conditional screenshot, scoped tool use, compact schema, session bounding, prompt caching) cut first-token times and largely eliminate 429s for chit-chat.

  • QueryTracer — lightweight OSAllocatedUnfairLock-based span recorder writing per-query JSONL traces to ~/Library/Logs/Omi/traces.jsonl; well-tested with 9 unit tests covering nesting, gap detection, TTFT, async closures, and serialisation.
  • Rust layer — system prompt emitted as an Anthropic cache_control: ephemeral content block; PromptTokensDetails.cached_tokens propagates cache-hit counts through the OpenAI-compatible usage payload to the Swift tracer.
  • Swift floating-bar path — conditional screenshot capture (queryNeedsScreenshot), scoped tool-use instruction, DB schema moved into the execute_sql tool hint, session token ceiling (21 k) with background reset, and a withQueryTracer wrapper seeding the @TaskLocal tracer across all query entry points."

Confidence Score: 3/5

Not safe to merge as-is: agent routing is completely removed and five baseline JSONL files with session data are committed; both need resolution before landing.

The floating-bar now skips AgentPillsManager.classify() entirely, silently breaking multi-step agent flows for all users. The isFirstResponse/isGenerating flags are plain vars mutated by two closures that could fire concurrently, risking a double-open on the generation span. The eval-result JSONL files are committed to permanent git history contrary to the PR's own guidance.

desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift (agent routing removed), desktop/Desktop/Sources/Providers/ChatProvider.swift (isFirstResponse/isGenerating flags, session ceiling thrash), desktop/eval-results/*.jsonl (should be deleted)

Important Files Changed

Filename Overview
desktop/Desktop/Sources/Services/QueryTracer.swift New QueryTracer: well-structured OSAllocatedUnfairLock-based span recorder with file I/O rotation. Low-entropy trace IDs and full system-prompt capture (PII) are the two issues to address.
desktop/Desktop/Sources/Providers/ChatProvider.swift Session token bounding, tracer integration, and tool-span tracking added. Potential data race on isFirstResponse/isGenerating flags, and 21k-token ceiling may trigger Pi subprocess restarts every 2-3 tool-heavy turns.
desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift Agent routing (AgentPillsManager.classify) completely removed from routeQuery — all queries bypass agent pills. withQueryTracer wrapper and queryNeedsScreenshot heuristic are well-implemented.
desktop/Backend-Rust/src/routes/chat_completions.rs Correctly transforms the system prompt into an Anthropic cache-control content block. Updated tests verify the new structure. No issues found.
desktop/Backend-Rust/src/models/chat_completions.rs PromptTokensDetails added to propagate cache_read_input_tokens through the OpenAI-compatible usage payload. Logic is correct; field skipped when zero.
desktop/Desktop/Sources/FloatingControlBar/PushToTalkManager.swift Tracer lifecycle through PTT session well-handled via ownership transfer and explicit nil clear. No issues found.
desktop/Desktop/Sources/FloatingControlBar/FloatingBarVoicePlaybackService.swift tts_start span correctly opened on first chunk and closed in audio-player success path and TTS fallback. Tracer nulled in reset().
desktop/Desktop/Sources/Chat/ChatPrompts.swift Tool instruction updated to skip tools for chit-chat; DB schema moved into execute_sql tool hint. Both changes are sound.
desktop/eval-results/baseline-run5-extended.jsonl Five baseline JSONL files committed to the repo; should be removed before merging per PR description.
desktop/Desktop/Tests/QueryTracerTests.swift Good unit test coverage for span recording, nesting, gap detection, TTFT, metadata merge, async closure API, and JSONL serialization.

Sequence Diagram

sequenceDiagram
    participant User
    participant FCB as FloatingControlBarManager
    participant QT as QueryTracer TaskLocal
    participant CP as ChatProvider
    participant AB as AgentBridge actor
    participant Rust as Backend-Rust
    participant Claude as Anthropic API
    User->>FCB: typed query or PTT voice
    FCB->>QT: withQueryTracer begin pre_llm
    FCB->>FCB: queryNeedsScreenshot conditional capture
    FCB->>CP: sendMessage
    CP->>QT: begin bridge_ensure
    CP->>AB: ensureBridgeStarted
    AB->>QT: begin end quota_check
    CP->>QT: begin llm_request and ttft
    CP->>AB: query prompt systemPrompt
    AB->>Rust: POST chat completions
    Note over Rust: system ephemeral cache_control block
    Rust->>Claude: Anthropic API streaming
    Claude-->>Rust: stream chunks
    Rust-->>AB: SSE chunks with cache_read_tokens
    AB-->>CP: textDeltaHandler or toolActivityHandler
    CP->>QT: end ttft markTTFT on first token
    CP->>QT: begin end tool span per tool call
    AB-->>CP: QueryResult tokens cost cacheRead
    CP->>QT: finalize to traces.jsonl
    CP->>CP: floatingInputTokens over 21k resets session
    CP-->>FCB: messages updated
    FCB-->>User: response displayed
Loading

Comments Outside Diff (1)

  1. desktop/Desktop/Sources/FloatingControlBar/FloatingControlBarWindow.swift, line 1390-1402 (link)

    P1 Agent routing permanently bypassed in this PR

    AgentPillsManager.classify() is entirely removed from routeQuery, so every floating-bar query — including ones that previously spawned a structured agent pill — goes straight to inline chat. This silently removes a user-visible feature: users who relied on the agent route for multi-step tasks will get an inline chat response instead, with no indication the route changed. The PR description says re-enabling is "a one-line revert", but there's no tracking issue or follow-up stub to ensure it actually happens before release.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Reviews (1): Last reviewed commit: "perf(desktop): reduce floating-bar query..." | Re-trigger Greptile

Comment on lines 2697 to +2715

// Query the active bridge with streaming
// Callbacks for agent bridge
// Mutable flags for TTFT / generation tracking.
var isFirstResponse = true // TTFT: first output of any kind (text or tool_use)
var isGenerating = false // generation span: tracks actual text streaming
let textDeltaHandler: AgentBridge.TextDeltaHandler = { [weak self] delta in
// Tracer is thread-safe (OSAllocatedUnfairLock) — call directly so
// spans close before finalize(). Wrapping in Task { @MainActor }
// made them fire-and-forget, racing with finalize().
if isFirstResponse {
isFirstResponse = false
tracer?.end("ttft")
tracer?.markTTFT()
}
if !isGenerating {
isGenerating = true
tracer?.begin("generation")
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Potential data race on isFirstResponse / isGenerating flags

isFirstResponse and isGenerating are plain vars captured by both textDeltaHandler and toolActivityHandler. Both closures modify these flags without any synchronization. If the two handlers are ever invoked concurrently from within agentBridge.query(), both could observe isFirstResponse == true simultaneously and both call tracer?.begin("generation") — opening the span twice. The defensive end("generation") call after query() returns would close only one instance, leaving the other permanently unclosed in the trace.

Comment on lines +165 to +173
init(query: String, inputMode: QueryInputMode) {
let now = ContinuousClock.now
let id = "t_" + (0..<6).map { _ in String(format: "%x", Int.random(in: 0...15)) }.joined()
lock = OSAllocatedUnfairLock(initialState: State(
origin: now,
query: query,
inputMode: inputMode,
traceId: id
))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Low entropy trace ID — collision risk in longer sessions

Each trace ID is 6 random hex nibbles (24 bits, ~16.7 M unique values). With the birthday paradox, ~4 000 traces give a ~50% collision chance, and traces.jsonl can hold thousands of entries within the 5 MB cap. A collision silently aliases two traces when comparing them in the eval script. Using UUID().uuidString costs nothing and eliminates the issue.

Suggested change
init(query: String, inputMode: QueryInputMode) {
let now = ContinuousClock.now
let id = "t_" + (0..<6).map { _ in String(format: "%x", Int.random(in: 0...15)) }.joined()
lock = OSAllocatedUnfairLock(initialState: State(
origin: now,
query: query,
inputMode: inputMode,
traceId: id
))
let id = "t_" + UUID().uuidString.lowercased().replacingOccurrences(of: "-", with: "").prefix(12)

Comment on lines +252 to +263
/// Capture the full system prompt + message history sent to the API.
func captureRequest(
systemPrompt: String,
messages: [[String: String]],
hasScreenshot: Bool = false
) {
lock.withLock { state in
state.systemPrompt = systemPrompt
state.messages = messages
state.hasScreenshot = hasScreenshot
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Full system prompt and message history written to a log file

captureRequest stores the complete system prompt and the last 40 messages, then serializes them to ~/Library/Logs/Omi/traces.jsonl. The floating-bar system prompt contains the user's personal memories, facts, tasks, goals, and conversation history — all written to disk with only a 5 MB rotation policy and no TTL. Consider stripping or truncating PII fields before writing, or gating capture behind an explicit opt-in flag.

Comment on lines 839 to 886
await agentBridge.invalidateSession(sessionKey: sessionKey)
}

/// Input-token ceiling for the floating-bar ACP session. The ACP SDK keeps the full
/// turn history (including large tool results) inside the session, so input climbs every
/// turn — observed drifting 17k → 26k across a handful of queries, slowing TTFT and
/// tripping the 30k-tokens/min rate limit. When a floating query crosses this, we reset
/// the session in the background so the next one starts back near the base-prompt floor.
private static let floatingSessionTokenCeiling = 21_000

/// Guards against overlapping resets when several bloated queries land in quick succession.
private var floatingSessionResetInFlight = false

/// Reset the floating-bar ACP session to shed accumulated context (conversation history +
/// tool-result bloat). Runs a small background Task that re-records the floating session with a
/// freshly re-seeded system prompt (recent chat history only). This call does NOT hit the model
/// and burns no tokens.
///
/// Cost / mechanism (pi-mono path): pi-mono is single-session and bakes the system prompt as a
/// process launch flag, so the reset actually takes effect on the NEXT floating query — because
/// the re-seeded prompt differs from the running process's, `createSession` restarts the Pi
/// subprocess, which is what wipes the bloated history. So the restart cost (≈ a process spawn)
/// lands on that next query, NOT here in the background. Confirm via "[pi-mono] subprocess
/// restarted with new system prompt" in the app log. Mirrors the floating warmup in
/// ensureBridgeStarted — keep them in sync.
func resetFloatingSessionContext() {
guard agentBridgeStarted, !modeSwitchInProgress, !floatingSessionResetInFlight else { return }
floatingSessionResetInFlight = true
Task { @MainActor in
defer { floatingSessionResetInFlight = false }
await preparePromptContextIfNeeded()
let mainSystemPrompt = buildSystemPrompt(contextString: formatMemoriesSection())
let floatingSystemPrompt = Self.floatingBarSystemPromptPrefix + "\n\n" + mainSystemPrompt
let floatingModel = ShortcutSettings.shared.selectedModel.isEmpty
? ModelQoS.Claude.defaultSelection
: ShortcutSettings.shared.selectedModel
cachedMainSystemPrompt = mainSystemPrompt
await agentBridge.invalidateSession(sessionKey: "floating")
await agentBridge.warmupSession(cwd: workingDirectory, sessions: [
.init(key: "floating", model: floatingModel, systemPrompt: floatingSystemPrompt)
])
log("ChatProvider: floating session context reset — next floating query rebuilds on a fresh process")
}
}

/// Test that the Playwright Chrome extension is connected and working.
/// Ensures the bridge is started (restarting if needed to pick up new token),
/// then sends a lightweight test query that triggers a browser_snapshot tool call.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Session token ceiling likely causes frequent Pi subprocess restarts

The ceiling is 21 000 tokens and the base prompt is ~14 831 tokens (per the PR's benchmarks). A single tool-use round adds 3–5 k tokens, so the ceiling is crossed in roughly 2–3 turns. Each crossing schedules a warmupSession that restarts the Pi subprocess, with the spawn cost landing on the next user query. With prompt caching, cacheReadTokens (~11 k) counts toward the ceiling, so even cache-warm sessions reset quickly. Multi-turn data queries will pay a subprocess spawn penalty roughly every third response, partially offsetting the TTFT gains.

@@ -0,0 +1,13 @@
{"cache_read_tokens":0,"cache_write_tokens":0,"cost_usd":0,"flagged_gaps":[],"input_mode":"text","input_tokens":17730,"model":"claude-sonnet-4-6","output_tokens":48,"query_text":"warmup","spans":[{"children":[{"dur_ms":232,"end_ms":266,"name":"screenshot_capture","start_ms":34}],"dur_ms":237,"end_ms":271,"name":"pre_llm","start_ms":34},{"dur_ms":109,"end_ms":380,"gap_before_ms":0,"meta":{"status":"ok"},"name":"bridge_ensure","start_ms":271},{"children":[{"children":[{"dur_ms":177,"end_ms":557,"meta":{"endpoint":"\/v1\/usage\/quota","result":"allowed"},"name":"quota_check","start_ms":380}],"dur_ms":2754,"end_ms":3134,"name":"ttft","start_ms":380},{"dur_ms":1530,"end_ms":4664,"gap_before_ms":0,"name":"generation","start_ms":3134}],"dur_ms":4284,"end_ms":4664,"gap_before_ms":0,"meta":{"model":"claude-sonnet-4-6"},"name":"llm_request","start_ms":380}],"timestamp":"2026-05-28T20:57:18Z","token_count":48,"total_ms":4664,"tps":31.372549019607842,"trace_id":"t_86e4f7","ttft_ms":3134}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Eval result files should not be committed to the repository

The PR description explicitly states these files "can be removed completely before merging." Five baseline-run*.jsonl files with real query/response traces, token counts, and timing data are committed here, permanently bloating git history. They should be deleted from the branch before landing.

Copy link
Copy Markdown
Collaborator

@kodjima33 kodjima33 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Floating-bar latency wins: conditional screenshot, scoped tool use, compact schema, session bounding, Anthropic prompt caching with cached_tokens telemetry. Approve only — 1427 lines (>500), leaving merge to Nik to review the temporary agent-routing bypass and the eval-results scaffolding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants