test(desktop): lock chat prompt ↔ tool-registry & schema consistency#7573
test(desktop): lock chat prompt ↔ tool-registry & schema consistency#7573eulicesl wants to merge 2 commits into
Conversation
A tool declared in the chat prompt's <tools> block but not dispatched by ChatToolExecutor.execute(_:) makes the model emit calls that fail with "Unknown tool"; a hardcoded SQL example referencing a dropped table fails at runtime. Both read to the user as "the AI is dumb", and neither is caught by the type checker. Adds three Set<String> registries on ChatToolExecutor as the source of truth for the tool names it dispatches (the 7 piMono chat tools, the onboarding-only tools, and their union), kept in lockstep with the execute(_:) cases. PromptSchemaConsistencyTests then validates the formal regions of the prompt against them: - every **tool**: marker in the <tools> block is a registered tool, and every piMono tool has a marker (both directions); - the "You have N tools" claim matches the registry size; - every table referenced in the Common SQL patterns section exists in tableAnnotations (or the FTS aux set); - no column-annotation set orphaned after its table was removed. Scoped to the formal <tools> block and SQL-patterns section so prose narrative doesn't trigger false matches. All 5 pass against the current prompt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8288ec7d83
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Greptile SummaryThis PR adds a declarative tool-name registry to
Confidence Score: 3/5Safe to merge for today's prompt, but the registry is missing 7 dispatcher-known backend tools, so the guard it adds won't fire correctly if those tools are ever introduced into the formal prompt block. The core purpose of the PR — catching prompt↔executor drift — works for the 17 tools it does enumerate. However, desktop/Desktop/Sources/Providers/ChatToolExecutor.swift — the allRegisteredToolNames union needs a backendRagToolNames set added to include the 7 omitted dispatch cases. Important Files Changed
Reviews (1): Last reviewed commit: "test(desktop): lock chat prompt ↔ tool-r..." | Re-trigger Greptile |
Review on BasedHardware#7573: - allRegisteredToolNames omitted the 7 backend RAG tools execute(_:) already dispatches via executeBackendTool (get_conversations, search_conversations, get_memories, search_memories, get_action_items, create_action_item, update_action_item). If a prompt ever declared one in its <tools> block, testDesktopChatToolNamesAreAllRegistered would false-fail despite working dispatch. Added a backendRagToolNames set and unioned it in, so the registered set is a true superset of dispatch. - Removed the misleading onboardingOnlyToolNames docstring line implying save_knowledge_graph is a member (it lives only in piMonoChatToolNames). - Anchored extractToolNames to a line-start marker (^[ \t]*\*\*name\*\*:) so parameter docs written with the same bold convention — inline or as `- **query**:` bullets — can't be harvested as tool names and trip the check. All 5 tests still pass against the current prompt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Closing this one — on reflection it's a standalone test/guardrail PR and the project keeps test scaffolding out of upstream. The |
|
Hey @eulicesl 👋 Thank you so much for taking the time to contribute to Omi! We truly appreciate you putting in the effort to submit this pull request. After careful review, we've decided not to merge this particular PR. Please don't take this personally — we genuinely try to merge as many contributions as possible, but sometimes we have to make tough calls based on:
Your contribution is still valuable to us, and we'd love to see you contribute again in the future! If you'd like feedback on how to improve this PR or want to discuss alternative approaches, please don't hesitate to reach out. Thank you for being part of the Omi community! 💜 |
Summary
<tools>block but not dispatched byChatToolExecutor.execute(_:)→ the model emits calls that come back "Unknown tool"; (2) a hardcoded SQL example referencing a table that no longer exists → runtime SQL errors. Neither is caught by the compiler.Set<String>registries onChatToolExecutoras the source of truth for the tool names it dispatches, and aPromptSchemaConsistencyTestssuite that validates the prompt's formal regions against those registries and the schema annotations.<tools>block and**Common SQL patterns:**section — prose narrative is intentionally out of scope (a regex over narrative would false-positive on ordinary English near SQL keywords).Linked Issue / Bounty
Root Cause (bug fixes only)
How It Works
ChatToolExecutorgainspiMonoChatToolNames(the 7 default chat tools),onboardingOnlyToolNames, andallRegisteredToolNames(their union), documented to stay in lockstep with theexecute(_:)cases.PromptSchemaConsistencyTestsextracts the<tools>block and the**Common SQL patterns:**section fromChatPrompts.desktopChatand asserts:**tool**:marker is inallRegisteredToolNames, and everypiMonoChatToolNamesentry has a marker (both directions);piMonoChatToolNames.count;ChatPrompts.tableAnnotations(or a small FTS/aux allowlist);columnAnnotationsset is orphaned after its table was removed.Change Type
Scope
Security Impact
execute(_:)dispatch; nothing new is executable.Testing
xcrun swift build -c debug --package-path Desktop→Build complete!PromptSchemaConsistencyTests(5 cases), all green against the current prompt.Human Verification
main; all 5 new tests pass — confirming the shipping prompt's<tools>block, tool-count claim, and SQL examples are already consistent with the executor and schema (no drift today).[a-z0-9_\-]+(not[a-z_]+) so a future hyphenated/numeric tool name can't slip past the check; SQL extraction coversFROM/JOIN/INSERT INTO/UPDATE/DELETE FROM.omi.dbintrospection — theChatPromptsannotation dictionaries are themselves what gets serialized into the prompt's schema, so validating against them gives the same guarantee without a DB fixture.Evidence
Docs
Performance, Privacy, and Reliability
Migration / Backward Compatibility
Risks and Mitigations
execute(_:)if a tool is added without updating them. Mitigation: that's exactly what the tests catch — a new prompt tool with no registry entry (or vice versa) fails CI. The registries are documented as needing to track theexecute(_:)cases.<tools>…</tools>and**Common SQL patterns:**structure. Mitigation: if those markers disappear, the testsXCTFailloudly rather than silently passing.AI Disclosure
PromptSchemaConsistencyTestssuite (all green); confirmed the registries match the executor's actualexecute(_:)dispatch cases.