Skip to content

test(desktop): lock chat prompt ↔ tool-registry & schema consistency#7573

Closed
eulicesl wants to merge 2 commits into
BasedHardware:mainfrom
eulicesl:feat/desktop-chat-prompt-schema-tests
Closed

test(desktop): lock chat prompt ↔ tool-registry & schema consistency#7573
eulicesl wants to merge 2 commits into
BasedHardware:mainfrom
eulicesl:feat/desktop-chat-prompt-schema-tests

Conversation

@eulicesl
Copy link
Copy Markdown

Summary

  • Problem: Two prompt↔runtime drifts fail silently and read to the user as "the AI is dumb": (1) a tool declared in the chat prompt's <tools> block but not dispatched by ChatToolExecutor.execute(_:) → the model emits calls that come back "Unknown tool"; (2) a hardcoded SQL example referencing a table that no longer exists → runtime SQL errors. Neither is caught by the compiler.
  • What changed: Adds three Set<String> registries on ChatToolExecutor as the source of truth for the tool names it dispatches, and a PromptSchemaConsistencyTests suite that validates the prompt's formal regions against those registries and the schema annotations.
  • What did NOT change (scope boundary): No prompt rewrite, no runtime behavior change. Tests are scoped to the formal <tools> block and **Common SQL patterns:** section — prose narrative is intentionally out of scope (a regex over narrative would false-positive on ordinary English near SQL keywords).

Linked Issue / Bounty

  • N/A — independent contribution.

Root Cause (bug fixes only)

  • N/A — this is a regression-guard / test PR. It prevents a class of silent drift rather than fixing a current break (the tests pass against today's prompt).

How It Works

  • ChatToolExecutor gains piMonoChatToolNames (the 7 default chat tools), onboardingOnlyToolNames, and allRegisteredToolNames (their union), documented to stay in lockstep with the execute(_:) cases.
  • PromptSchemaConsistencyTests extracts the <tools> block and the **Common SQL patterns:** section from ChatPrompts.desktopChat and asserts:
    • every **tool**: marker is in allRegisteredToolNames, and every piMonoChatToolNames entry has a marker (both directions);
    • the prompt's "You have N tools" claim equals piMonoChatToolNames.count;
    • every table referenced in the SQL-patterns section exists in ChatPrompts.tableAnnotations (or a small FTS/aux allowlist);
    • no columnAnnotations set is orphaned after its table was removed.

Change Type

  • Chore / infra (test guardrail)

Scope

  • Desktop app (Swift/macOS)

Security Impact

  • New permissions or capabilities introduced? No
  • Auth or token handling changed? No
  • Data access scope changed (screen data, OCR, user data)? No
  • New or changed network calls? No
  • Plugin or tool execution surface changed? No — the registries are a declarative mirror of the existing execute(_:) dispatch; nothing new is executable.

Testing

  • Built locally — xcrun swift build -c debug --package-path DesktopBuild complete!
  • Manual verification: compile + test run on macOS.
  • Existing tests pass
  • New tests added — PromptSchemaConsistencyTests (5 cases), all green against the current prompt.

Human Verification

  • Verified scenarios: clean debug build against current main; all 5 new tests pass — confirming the shipping prompt's <tools> block, tool-count claim, and SQL examples are already consistent with the executor and schema (no drift today).
  • Edge cases checked: the tool-name extraction regex is [a-z0-9_\-]+ (not [a-z_]+) so a future hyphenated/numeric tool name can't slip past the check; SQL extraction covers FROM/JOIN/INSERT INTO/UPDATE/DELETE FROM.
  • What I did not verify (and why): no live omi.db introspection — the ChatPrompts annotation dictionaries are themselves what gets serialized into the prompt's schema, so validating against them gives the same guarantee without a DB fixture.

Evidence

  • Test output:
    PromptSchemaConsistencyTests — Executed 5 tests, 0 failures
    Build complete!
    

Docs

  • N/A — no docs impact

Performance, Privacy, and Reliability

  • No runtime impact (static sets + tests). Reliability: turns a class of silent prompt↔runtime drift into a CI failure at edit time.

Migration / Backward Compatibility

  • Backward compatible? Yes
  • Migration needed? No — additive registries + a new test file.

Risks and Mitigations

  • Risk: The registries could fall out of sync with execute(_:) if a tool is added without updating them. Mitigation: that's exactly what the tests catch — a new prompt tool with no registry entry (or vice versa) fails CI. The registries are documented as needing to track the execute(_:) cases.
  • Risk: The section-extraction regexes assume the prompt keeps its <tools>…</tools> and **Common SQL patterns:** structure. Mitigation: if those markers disappear, the tests XCTFail loudly rather than silently passing.

AI Disclosure

  • Tools used: Claude Code (Claude Opus).
  • AI-assisted portions: the registries, the tests, and this PR description.
  • How I verified correctness: ran the local debug build and the full PromptSchemaConsistencyTests suite (all green); confirmed the registries match the executor's actual execute(_:) dispatch cases.

A tool declared in the chat prompt's <tools> block but not dispatched by
ChatToolExecutor.execute(_:) makes the model emit calls that fail with
"Unknown tool"; a hardcoded SQL example referencing a dropped table
fails at runtime. Both read to the user as "the AI is dumb", and neither
is caught by the type checker.

Adds three Set<String> registries on ChatToolExecutor as the source of
truth for the tool names it dispatches (the 7 piMono chat tools, the
onboarding-only tools, and their union), kept in lockstep with the
execute(_:) cases. PromptSchemaConsistencyTests then validates the
formal regions of the prompt against them:

- every **tool**: marker in the <tools> block is a registered tool, and
  every piMono tool has a marker (both directions);
- the "You have N tools" claim matches the registry size;
- every table referenced in the Common SQL patterns section exists in
  tableAnnotations (or the FTS aux set);
- no column-annotation set orphaned after its table was removed.

Scoped to the formal <tools> block and SQL-patterns section so prose
narrative doesn't trigger false matches. All 5 pass against the current
prompt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8288ec7d83

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread desktop/Desktop/Sources/Providers/ChatToolExecutor.swift Outdated
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 31, 2026

Greptile Summary

This PR adds a declarative tool-name registry to ChatToolExecutor and a new PromptSchemaConsistencyTests suite that validates the chat prompt's <tools> block and SQL patterns section against those registries and ChatPrompts schema annotations — turning silent prompt↔runtime drift into a CI failure.

  • Adds three Set<String> statics (piMonoChatToolNames, onboardingOnlyToolNames, allRegisteredToolNames) as the authoritative list of dispatchable tools, with documentation tying them to the execute(_:) switch cases.
  • Adds 5 XCTest cases covering tool-name bidirectional consistency, the prompt's tool-count claim, SQL table references against tableAnnotations, and orphaned column annotations.

Confidence Score: 3/5

Safe to merge for today's prompt, but the registry is missing 7 dispatcher-known backend tools, so the guard it adds won't fire correctly if those tools are ever introduced into the formal prompt block.

The core purpose of the PR — catching prompt↔executor drift — works for the 17 tools it does enumerate. However, allRegisteredToolNames omits the 7 backend RAG tools that execute(_:) already dispatches via executeBackendTool. The documented invariant ("every tool the executor knows how to dispatch") is incorrect today; and if one of those tools is later added to desktopChat's <tools> block, the test that's supposed to accept it will reject it instead, inverting the intended behavior of the regression guard.

desktop/Desktop/Sources/Providers/ChatToolExecutor.swift — the allRegisteredToolNames union needs a backendRagToolNames set added to include the 7 omitted dispatch cases.

Important Files Changed

Filename Overview
desktop/Desktop/Sources/Providers/ChatToolExecutor.swift Adds three static Set registries (piMonoChatToolNames, onboardingOnlyToolNames, allRegisteredToolNames), but allRegisteredToolNames omits the 7 backend RAG tools (get_conversations, search_conversations, get_memories, search_memories, get_action_items, create_action_item, update_action_item) that execute(_:) already dispatches, making the documented invariant incorrect.
desktop/Desktop/Tests/PromptSchemaConsistencyTests.swift New test suite with 5 well-structured tests; the extractToolNames regex could match inline param_name: markers inside tool descriptions, creating false failures if the prompt format ever gains parameter-level bold markers in the tools block.

Reviews (1): Last reviewed commit: "test(desktop): lock chat prompt ↔ tool-r..." | Re-trigger Greptile

Comment thread desktop/Desktop/Sources/Providers/ChatToolExecutor.swift Outdated
Comment thread desktop/Desktop/Sources/Providers/ChatToolExecutor.swift
Comment thread desktop/Desktop/Tests/PromptSchemaConsistencyTests.swift
Review on BasedHardware#7573:

- allRegisteredToolNames omitted the 7 backend RAG tools execute(_:)
  already dispatches via executeBackendTool (get_conversations,
  search_conversations, get_memories, search_memories, get_action_items,
  create_action_item, update_action_item). If a prompt ever declared one
  in its <tools> block, testDesktopChatToolNamesAreAllRegistered would
  false-fail despite working dispatch. Added a backendRagToolNames set and
  unioned it in, so the registered set is a true superset of dispatch.
- Removed the misleading onboardingOnlyToolNames docstring line implying
  save_knowledge_graph is a member (it lives only in piMonoChatToolNames).
- Anchored extractToolNames to a line-start marker (^[ \t]*\*\*name\*\*:)
  so parameter docs written with the same bold convention — inline or as
  `- **query**:` bullets — can't be harvested as tool names and trip the
  check.

All 5 tests still pass against the current prompt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@eulicesl
Copy link
Copy Markdown
Author

Closing this one — on reflection it's a standalone test/guardrail PR and the project keeps test scaffolding out of upstream. The ChatToolExecutor registries here exist only to feed the consistency test; runtime tool dispatch uses its own switch in execute(_:) and doesn't reference them, so there's no functional code to keep. The actual chat-reliability changes are in #7552, #7555, and #7559. Thanks for the reviews!

@eulicesl eulicesl closed this May 31, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Hey @eulicesl 👋

Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request.

After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:

  • Project standards — Ensuring consistency across the codebase
  • User needs — Making sure changes align with what our users need
  • Code best practices — Maintaining code quality and maintainability
  • Project direction — Keeping aligned with our roadmap and vision

Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out.

Thank you for being part of the Omi community! 💜

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant