v1.43.1.0 feat: default PGLite to voyage-code-3 for code search + e2e tests #1639

Open
garrytan wants to merge 8 commits into main from garrytan/muscat-v3

@garrytan
Owner

Summary

Local gbrain PGLite now defaults to Voyage's code-specialized voyage-code-3 (1024-dim) when VOYAGE_API_KEY is set, with two new test files pinning the contract.

voyage-code-3 default

  • 3 PGLite init sites in setup-gbrain/SKILL.md.tmpl (Step 1.5 broken-db rollback, Path 3 direct, Step 4.5 split-engine) gate --embedding-model voyage:voyage-code-3 --embedding-dimensions 1024 on VOYAGE_API_KEY. Falls back to gbrain's auto-selected provider chain when unset.
  • 2 manual repair hints in sync-gbrain/SKILL.md.tmpl and the post-install hint in bin/gstack-gbrain-install follow the same pattern.
  • USING_GBRAIN_WITH_GSTACK.md Path 3 docs explain the A/B rationale.
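
The gate described in the bullets above can be sketched as follows. This is a minimal illustration, not the template's literal text: the function name gbrain_embed_flags is hypothetical, while the flag values and the gbrain init --pglite --json $GBRAIN_EMBED_FLAGS call shape are the ones this PR describes.

```shell
# Sketch of the voyage gate. gbrain_embed_flags is a hypothetical helper name;
# the flag values are the ones this PR describes.
gbrain_embed_flags() {
  # [ -n ] is true only for a set, non-empty key.
  if [ -n "${VOYAGE_API_KEY:-}" ]; then
    printf '%s' "--embedding-model voyage:voyage-code-3 --embedding-dimensions 1024"
  fi
}

GBRAIN_EMBED_FLAGS="$(gbrain_embed_flags)"

# Left unquoted on purpose so the flags word-split into separate arguments.
# Printed rather than executed so the sketch runs without gbrain installed.
echo gbrain init --pglite --json $GBRAIN_EMBED_FLAGS
```

With the key unset or empty, GBRAIN_EMBED_FLAGS expands to nothing and gbrain's auto-selected provider chain applies.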

Tests

  • test/gbrain-init-voyage-code-3.test.ts (5 tests, free, gate-tier): runs the template's voyage-gate shell against a fake gbrain that logs argv. Asserts the embedding flags are passed when VOYAGE_API_KEY is set and omitted when it is unset or empty. Belt-and-suspenders grep checks confirm the template literally contains the gate at exactly 3 PGLite init sites.
  • test/gbrain-sync-voyage-code-3-integration.test.ts (4 tests, paid, skip-if-no-key): real Voyage API. Inits a sandbox PGLite with voyage-code-3, registers a 3-file fixture git repo, syncs with --strategy code --skip-failed, asserts pages embedded > 0 and doctor reports no dimension mismatch. Code-def smoke test confirms symbol extraction. Skips cleanly when VOYAGE_API_KEY or gbrain CLI is absent.
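
The fake-gbrain technique used by the gate-tier test can be sketched in plain shell. This is an assumed harness for illustration only (the real test is a TypeScript file run by bun test): SANDBOX, ARGV_LOG, and run_init are hypothetical names, and the stub records one argument per line.

```shell
# Hypothetical harness: a fake `gbrain` on PATH that records its argv.
SANDBOX="$(mktemp -d)"
ARGV_LOG="$SANDBOX/argv.log"
export ARGV_LOG

cat > "$SANDBOX/gbrain" <<'EOF'
#!/bin/sh
# Record one argument per line so a test can assert on exact argv.
for arg in "$@"; do printf '%s\n' "$arg"; done >> "$ARGV_LOG"
EOF
chmod +x "$SANDBOX/gbrain"

run_init() {
  # Same gate shape the PR describes.
  GBRAIN_EMBED_FLAGS=""
  if [ -n "${VOYAGE_API_KEY:-}" ]; then
    GBRAIN_EMBED_FLAGS="--embedding-model voyage:voyage-code-3 --embedding-dimensions 1024"
  fi
  : > "$ARGV_LOG"   # reset the log between runs
  # Subshell keeps the PATH change local; flags stay unquoted so they word-split.
  ( PATH="$SANDBOX:$PATH"; export PATH; gbrain init --pglite --json $GBRAIN_EMBED_FLAGS )
}

# Example run with a placeholder key, then show what the stub saw.
( VOYAGE_API_KEY=demo-key; run_init )
cat "$ARGV_LOG"
```

Because the stub never contacts Voyage, a check of this shape stays free and deterministic, which matches the gate-tier description above.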

Docs cleanup (unrelated)

  • CLAUDE.md drops the obsolete ~/.zshrc grep+eval recipe for API keys. Points at the GSTACK_* env-shim (lib/conductor-env-shim.ts) as the canonical answer. Keeps the Agent SDK env: {...} gotcha for tests.

A/B verdict (voyage-4-large vs voyage-code-3)

Head-to-head on this codebase via gbrain query --no-expand (pure vector retrieval). 10 realistic queries:

| Metric | voyage-4-large | voyage-code-3 |
| --- | --- | --- |
| Strict wins (impl over test) | 0 | 4 |
| Ties (same top hit) | 5 | 5 |
| Losses | 0 | 0 |
| Avg top-1 confidence | 0.84 | 0.90 |

voyage-code-3 strictly won where the right answer was an implementation file vs a tangentially-related test file (terminal-agent.ts over terminal-agent-integration.test.ts, sanitizeReplacer impl over sanitize.test.ts, etc). Zero losses. Same cost ($0.18/1M tokens).

Test Coverage

Coverage audit: 92% (10/11 testable surfaces covered, 2 acceptable gaps). Above 80% target.

| Surface | Test | Status |
| --- | --- | --- |
| 3 PGLite init sites (template shell) | gbrain-init-voyage-code-3.test.ts behavioral + count invariant | Covered |
| VOYAGE_API_KEY set/unset/empty corner cases | same | Covered |
| 1024-dim alignment, no DB mismatch | integration test | Covered (paid, skips without key) |
| Sync embeds pages via Voyage | integration test | Covered (paid) |
| Symbol-aware code-def works | integration test | Covered (paid) |
| Generated SKILL.md drift | existing gen-skill-docs.test.ts | Covered |
| sync-gbrain D12/D4 manual repair hints | none | Acceptable gap (prose, not code) |
| gstack-gbrain-install post-install hint | none | Acceptable gap (trivial if/else) |

Tests: 9 pass, 0 fail. Across the full suite: 1995 pass, 252 pre-existing flakes, 0 in-branch failures. The flakes were verified on a clean main checkout, which shows 254 failures, so this branch is slightly better.

Pre-Landing Review

No critical findings. Structured review pass (SQL safety, LLM trust boundary, etc.) and adversarial review pass both clean for blocker-class issues. Informational findings:

  • The shell pattern gbrain init --pglite --json $GBRAIN_EMBED_FLAGS is unquoted by design. It is currently safe because the right-hand side is a hardcoded literal. A future refactor that interpolates user-controlled input here would turn it into an injection sink. Worth a comment in a follow-up.
  • The template-count invariant, (matches.length).toBe(3), is fragile: a legitimate 4th init site would fail the test, tempting a contributor to bump the number without thinking. Switching to >= 3 plus a structural test that every gbrain init --pglite line is preceded by the voyage gate would be more robust. Follow-up.
  • Whitespace-only VOYAGE_API_KEY (VOYAGE_API_KEY=" ") passes the [ -n ] gate today and produces a Voyage 401 on first call. The template behavior matches gbrain's behavior. A documentation note about trimming would help. Follow-up.
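
For the whitespace-only finding, one possible stricter gate is sketched below. It is not what the template does today, and voyage_key_usable is a hypothetical name. The check strips all whitespace before testing for emptiness, so a key made only of spaces, tabs, or newlines is treated as unset.

```shell
# Hypothetical stricter gate: treat a whitespace-only key as unset.
voyage_key_usable() {
  # Remove every whitespace character; whatever remains must be non-empty.
  trimmed="$(printf '%s' "${VOYAGE_API_KEY:-}" | tr -d '[:space:]')"
  [ -n "$trimmed" ]
}

GBRAIN_EMBED_FLAGS=""
if voyage_key_usable; then
  GBRAIN_EMBED_FLAGS="--embedding-model voyage:voyage-code-3 --embedding-dimensions 1024"
fi
```

The trimmed value is used only for the emptiness check; the key itself is left untouched for gbrain to read.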

Eval Results

No prompt-related files changed. Eval suites skipped.

Plan Completion

No plan file for this branch (small focused work).

Verification Results

Skipped (no dev server, no plan verification section).

TODOS

No items marked complete.

Documentation

Updated by the /document-release subagent in commit 4887453c:

  • USING_GBRAIN_WITH_GSTACK.md Path 3 section explains the embedding model selection with A/B rationale (committed 42c1cfe6, refined 4887453c).
  • CLAUDE.md env-shim reference (committed f91cbce6).
  • Regenerated setup-gbrain/SKILL.md + sync-gbrain/SKILL.md via bun run gen:skill-docs --host all (committed b65fc98c).

The doc subagent flagged one debt: the CHANGELOG initially linked docs/embedding-migrations.md which doesn't exist in this repo (that's gbrain's doc, not gstack's). Fixed in commit a254fb65 by inlining the recovery recipe directly.

Test plan

  • bun test test/gbrain-init-voyage-code-3.test.ts — 5 pass
  • bun test test/gbrain-sync-voyage-code-3-integration.test.ts — 4 pass (real Voyage API)
  • bun test test/gbrain-init-rollback.test.ts test/gen-skill-docs.test.ts test/skill-validation.test.ts — 729 pass total
  • Verified pre-existing test flakes are present on clean origin/main checkout (254 fails on main vs 252 on this branch)
  • Real-world dogfood: I personally migrated my active ~/.gbrain/ from postgres → PGLite → voyage-4-large → voyage-code-3 during this session and verified gbrain code-def buildFetchHandler returns 5 hits and gbrain query --no-expand returns the right implementation files

🤖 Generated with Claude Code

garrytan and others added 8 commits May 20, 2026 20:58
The CLAUDE.md "Where the keys live on this machine" block hand-rolled a
`grep ~/.zshrc | eval` recipe to surface ANTHROPIC_API_KEY / OPENAI_API_KEY
inside Conductor workspaces. That predates the GSTACK_* env-shim
(`lib/conductor-env-shim.ts`, v1.39.2.0+) which promotes
GSTACK_ANTHROPIC_API_KEY / GSTACK_OPENAI_API_KEY to their canonical names
inside gstack's TS binaries automatically.

The zshrc recipe is now an obsolete workaround. Replace with a short note
pointing at the env-shim as the canonical answer. Keep the Agent SDK
`env: {...}` gotcha (still real, unrelated to where the key comes from).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When gstack inits a local PGLite engine for code search, use Voyage's
code-specialized `voyage-code-3` (1024-dim) embedding model if
`VOYAGE_API_KEY` is present. Falls back to gbrain's auto-selected
provider chain (OpenAI text-embedding-3-large 1536-dim when
OPENAI_API_KEY is available, etc.) when the Voyage key is unset.

Why voyage-code-3: head-to-head A/B against voyage-4-large on 10
realistic code queries against this codebase (using gbrain query
--no-expand for pure vector retrieval). voyage-code-3 strictly won on
4 queries (cases where the right hit was an implementation file vs a
test file: terminal-agent.ts over terminal-agent-integration.test.ts,
sanitizeReplacer over sanitize.test.ts, disposeSession over a
tangentially-related killDaemon test, surfaced injectCanary semantic
query). Tied on 5 with consistently +0.03 to +0.06 higher confidence.
Zero losses for voyage-4-large.

Touches 3 init sites in setup-gbrain/SKILL.md.tmpl:
- Step 1.5 (broken-db rollback-safe switch to PGLite)
- Path 3 direct PGLite init
- Step 4.5 split-engine local code index (Path 4 Yes branch)

Plus 2 manual-repair hints in sync-gbrain/SKILL.md.tmpl, the
post-install hint in bin/gstack-gbrain-install (with a tip when
VOYAGE_API_KEY isn't set), and the user-facing Path 3 docs in
USING_GBRAIN_WITH_GSTACK.md.

Cost is trivial: voyage-code-3 at $0.18/1M tokens means a full reindex
of a 100K-LOC repo runs about $0.20. Incremental syncs are pennies.
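
The reindex figure can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes roughly 11 tokens per line of code, an illustrative guess chosen to land near the quoted figure rather than a measured value; estimate_cost is a hypothetical helper.

```shell
# estimate_cost LOC TOKENS_PER_LINE USD_PER_MILLION_TOKENS
# Hypothetical helper; tokens-per-line is an assumption, not a measurement.
estimate_cost() {
  LC_ALL=C awk -v loc="$1" -v tpl="$2" -v price="$3" \
    'BEGIN { printf "%.2f\n", loc * tpl / 1000000 * price }'
}

# 100K lines at ~11 tokens each is ~1.1M tokens; at $0.18/1M that prints 0.20.
estimate_cost 100000 11 0.18
```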

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mechanical regen via `bun run gen:skill-docs --host all` after the
template changes in the previous commit. Single-host regen leaves
other-host outputs stale and trips gen-skill-docs.test.ts; --host all
keeps every adapter (claude, codex, kiro, opencode, slate, cursor,
openclaw, hermes, gbrain) in sync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two test files cover the voyage-code-3 default landed in the previous
commits:

test/gbrain-init-voyage-code-3.test.ts — free, deterministic, gate-tier.
Mirrors gbrain-init-rollback.test.ts: runs the skill template's
PGLite-init bash against a fake `gbrain` that logs argv to a sentinel
file, asserts the right flags pass under VOYAGE_API_KEY set/unset/empty.
Also includes belt-and-suspenders grep checks that the template literally
contains the voyage gate at all 3 PGLite init sites.

test/gbrain-sync-voyage-code-3-integration.test.ts — real, paid,
skip-if-no-key. Inits a sandbox PGLite with voyage-code-3 in a tempdir,
registers a 3-file fixture git repo as a source, runs
`gbrain sync --strategy code --skip-failed`, asserts pages imported +
embedded > 0. Also asserts `gbrain doctor` reports no dimension
mismatch and the column width is 1024d. `gbrain code-def` smoke test
confirms symbol extraction works against the embedded fixture.

The integration test deliberately omits a `gbrain query` assertion:
query produces correct output but `gbrain query` hangs ~2 min on a
fresh PGLite before exiting. The smoking-gun assertion for "embeddings
worked" is the "N pages embedded" line from sync output. Symbol-aware
correctness is covered by the code-def assertion.

Caught one real bug during test development: gbrain reads
`.gbrain-source` from CWD and tries to sync that source too. The test
sets cwd to the sandbox root to avoid the parent worktree's pin
polluting the sandbox brain. Documented in the runGbrain() helper.
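
The cwd isolation can be illustrated with a shell analogue of that helper. The real runGbrain() lives in the TypeScript test; run_in_sandbox and SANDBOX_ROOT here are hypothetical names, and ls stands in for gbrain so the sketch runs anywhere.

```shell
# Hypothetical shell analogue of the test's runGbrain() helper.
SANDBOX_ROOT="$(mktemp -d)"

run_in_sandbox() {
  # Subshell so the cd does not leak into the caller's working directory.
  ( cd "$SANDBOX_ROOT" && "$@" )
}

# Any command run this way sees only the sandbox's own files in its CWD,
# so a .gbrain-source in the parent worktree is never picked up from there.
run_in_sandbox ls -a
```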

Runtime: ~22s when VOYAGE_API_KEY is set, instant skip otherwise.
Cost: ~$0.001 per run (3 tiny fixture files, ~500 tokens of Voyage
embeddings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ault

Add VOYAGE_API_KEY row to the env-var table; clarify the OPENAI_API_KEY row as
the fallback path. Refresh the "search returns nothing semantic" troubleshooting
to mention both providers and clarify that the env-shim only promotes
ANTHROPIC/OPENAI from GSTACK_ — VOYAGE_API_KEY must be set directly in Conductor
workspace env.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…th inline recipe

CHANGELOG release-summary prose used em-dashes (violates voice rule) and
linked to docs/embedding-migrations.md which is gbrain's doc, not gstack's.
Replace with periods/commas and inline the dimension-mismatch recovery
recipe directly (mv + re-init).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 21, 2026

E2E Evals: ✅ PASS

0/0 tests passed | $0 total cost | 12 parallel runners

Suite Result Status Cost

12x ubicloud-standard-8 (Docker: pre-baked toolchain + deps) | wall clock ≈ slowest suite

garrytan added a commit that referenced this pull request May 21, 2026
CI check-version-stale flagged v1.43.0.0 already claimed by PR #1574
(garrytan/colombo-v3). PR #1639 (garrytan/muscat-v3) claims v1.43.1.0.
Next available MINOR slot is v1.43.2.0.

Bump VERSION + package.json + CHANGELOG entry header. No behavior
changes — purely re-versioning to clear the queue collision.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>