feat(outside-voice): add llm CLI fallback between Codex and Claude subagent #1631

Open
diogolealassis wants to merge 1 commit into garrytan:main from diogolealassis:feat/outside-voice-llm-grok-fallback

@diogolealassis

What

When Codex is unavailable or errors, plan-review and second-opinion skills currently fall back to the Claude subagent. That fallback works, but loses cross-model independence — the subagent is the same model family as the primary review. This PR adds a middle step: try the llm CLI (datasette/llm) with a configured model before falling back to Claude.

New chain: Codex → llm → Claude subagent

When llm is installed and the configured model is registered (e.g. via llm install llm-xai for Grok), the outside voice goes through a genuinely different model family before falling through to Claude. When llm isn't available, behavior is unchanged.

Why this matters today

Codex CLI currently fails at startup for users whose ~/.codex/config.toml has an xai provider with wire_api = "chat" — which is the default per xAI's setup docs but deprecated per openai/codex#7782:

Error loading config.toml: `wire_api = "chat"` is no longer supported.
in `model_providers.xai.wire_api`

The error fires at CLI startup BEFORE the provider is selected, so calls targeted at OpenAI also fail. Users in this state get the Claude subagent fallback on every outside-voice step. They lose the cross-model value the skill is designed to deliver.

This hit twice in a single session on 2026-05-20, on /plan-eng-review and /plan-ceo-review. The user-side fix is one line (wire_api = "responses"), but users who never had OpenAI auth in the first place (they use Codex or llm because they prefer Grok or Anthropic models) gain nothing from a restored Codex: they need a non-Codex cross-model path.
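To check whether a given machine is in this state, a minimal sketch along these lines may help. The function name is hypothetical and not part of this PR, and the pattern assumes the setting sits on its own line in the TOML file:

```shell
# Hypothetical helper (not part of this PR): returns 0 when the given Codex
# config file still contains the deprecated wire_api = "chat" setting.
codex_config_has_chat_wire_api() {
  config_file="${1:-$HOME/.codex/config.toml}"
  # No config file means nothing to flag.
  [ -f "$config_file" ] || return 1
  grep -Eq '^[[:space:]]*wire_api[[:space:]]*=[[:space:]]*"chat"' "$config_file"
}
```

Calling `codex_config_has_chat_wire_api` with no argument checks the default `~/.codex/config.toml` path quoted in the error above.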

How

Two functions in scripts/resolvers/review.ts produce the outside-voice section that gets templated into 4 skill files:

  • generateCodexPlanReview — used by /plan-*-review skills
  • generateCodexSecondOpinion — used by /office-hours

Both now produce a Codex → llm → Claude subagent chain. The llm step is gated on:

  1. New outside_voice_llm_model config key (default grok-4-fast)
  2. command -v llm succeeds
  3. llm models lists the configured model

If any check fails, the chain falls through cleanly to the existing Claude subagent path. No behavior change for users who don't have llm installed.
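The three gates can be sketched roughly as below. This is an illustration under assumptions, not the text the resolver emits: `llm_step_enabled` is a hypothetical name, and the model lookup is a loose substring match against whatever `llm models` prints.

```shell
# Sketch of the three gates; the function name is hypothetical.
# Returns 0 only when the llm step should run for the given model.
llm_step_enabled() {
  model="$1"
  # Gate 1: an empty outside_voice_llm_model disables the step.
  [ -n "$model" ] || return 1
  # Gate 2: the llm CLI must be resolvable.
  command -v llm >/dev/null 2>&1 || return 1
  # Gate 3: `llm models` must list the model. This is a loose fixed-string
  # match, so "gpt-4o" would also match "gpt-4o-mini".
  models="$(llm models 2>/dev/null)" || return 1
  printf '%s\n' "$models" | grep -Fq -- "$model"
}
```

In use, the argument would come from `gstack-config get outside_voice_llm_model`; any non-zero return means the chain skips straight to the Claude subagent.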

The default is deliberately grok-4-fast rather than grok-4-latest: the latter's reasoning mode can stall for 2-5 minutes on big prompts (the llm stream stays empty during hidden reasoning, which looks like a hang). grok-4-fast is a better fit for routine outside-voice work.

Configuration

gstack-config set outside_voice_llm_model grok-4-fast    # default
gstack-config set outside_voice_llm_model ""             # disable, skip llm step
gstack-config set outside_voice_llm_model gpt-4o         # any model `llm models` lists

Empty string skips the llm step entirely (Codex → subagent direct).

Files changed

bin/gstack-config                |  12 ++++
office-hours/SKILL.md            |  64 +++++++++++++++++++++--
plan-ceo-review/SKILL.md         |  47 +++++++++++++++++-
plan-devex-review/SKILL.md       |  47 +++++++++++++++++-
plan-eng-review/SKILL.md         |  47 +++++++++++++++++-
scripts/resolvers/review.ts      | 111 ++++++++++++++++++++++++++++++++++++++--
6 files changed, 314 insertions(+), 14 deletions(-)

The SKILL.md changes are mechanically regenerated by bun run gen:skill-docs — the only hand-edits are in bin/gstack-config and scripts/resolvers/review.ts.

Verification

  • bun run scripts/skill-check.ts passes, all freshness checks green
  • Spot-checked generated SKILL.md files — llm fallback section present, bash blocks syntactically correct, outside_voice_llm_model referenced consistently
  • Manually traced the 4-state matrix (codex/llm available × works/errors) through the generated bash to confirm fallthrough behavior
  • Sequencing fix in generateCodexSecondOpinion: $CODEX_PROMPT_FILE deletion moved from "after Codex" to "after chain end" since llm step reuses the prompt file
  • Did NOT test end-to-end against a real llm invocation with my Grok key — the resolver functions produce templated prose instructions for the agent, not executable code per se. Happy to do this if you want a smoke-test screenshot.
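The fallthrough order and the deferred prompt-file cleanup from the list above can be sketched as follows. This is a simplified model only: the resolver emits templated instructions for the agent rather than one function, and `try_codex`, `try_llm`, and `run_claude_subagent` are hypothetical stand-ins for those steps.

```shell
# Sketch only: try_codex, try_llm and run_claude_subagent are hypothetical
# stand-ins for the steps the generated instructions describe.
outside_voice_chain() {
  prompt_file="$1"
  if try_codex "$prompt_file"; then
    voice="codex"
  elif try_llm "$prompt_file"; then
    # The llm step reads the same prompt file, so it must still exist here.
    voice="llm"
  else
    run_claude_subagent "$prompt_file"
    voice="claude-subagent"
  fi
  # Cleanup happens once, at the end of the chain, not right after Codex.
  rm -f "$prompt_file"
  # Report which step produced the outside voice.
  printf '%s\n' "$voice"
}
```

Deleting the file before the `elif` branch would reproduce the sequencing bug: the llm step would find no prompt to read.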

What this doesn't do

  • Doesn't change generateAdversarialStep (the /ship + /review adversarial flow). Different shape (always-on, not configurable). Could benefit from the same fallback but out of scope for this PR; happy to do as a follow-up.
  • Doesn't bump VERSION or add a CHANGELOG.md entry — leaving those to your release workflow.
  • Doesn't add a first-run prompt for outside_voice_llm_model via AskUserQuestion. The default grok-4-fast is sensible enough that most users won't need to touch it. If you want a first-run prompt I can add it.
  • Doesn't add tests. The resolver functions are pure string-returning, and the generated bash is exercised at agent-runtime not at gstack-build-time. Open to suggestions on how you'd want this tested.

Generated by

A real /plan-eng-review + outside-voice flow on a downstream user project (diogolealassis/gws-platform) hit the Codex failure twice in one session and surfaced the gap. Existing project memory codex-cli-broken-for-xai and the newly logged codex-cli-global-config-blocks-all-providers document the trigger. This PR closes the loop on the gstack side.

Happy to break this into smaller commits or split out the generateCodexSecondOpinion cleanup-sequencing fix if you'd prefer.

🤖 Generated with Claude Code

…bagent

When Codex is unavailable or errors, plan-review and second-opinion skills
fall back to the Claude subagent. That fallback works, but loses cross-model
independence — the subagent is the same model family as the primary review.

This adds a middle step: try the `llm` CLI (datasette/llm) with a configured
model. When `llm` is installed and the model is registered (e.g. `llm-xai`
plugin for Grok), the outside voice goes through a genuinely different
model family before falling back to Claude.

## Why this matters today

Codex CLI currently fails at startup for any user whose `~/.codex/config.toml`
has an xAI provider with `wire_api = "chat"` — which is the default per xAI's
setup docs but deprecated per github.com/openai/codex#7782:

    Error loading config.toml: `wire_api = "chat"` is no longer supported.
    in `model_providers.xai.wire_api`

The error fires at CLI startup BEFORE the provider is selected, so calls
targeted at OpenAI also fail. Users in this state currently get the Claude
subagent fallback on every outside-voice step. They lose the cross-model
value the skill is designed to deliver.

This hit twice in a single session on 2026-05-20, on /plan-eng-review and
/plan-ceo-review. Existing memory `codex-cli-broken-for-xai` documents the
upstream Codex side; this PR adds the gstack-side fallback path.

## What changed

Two functions in scripts/resolvers/review.ts produce the outside-voice
section that gets templated into 4 skill files (office-hours,
plan-ceo-review, plan-devex-review, plan-eng-review):

- `generateCodexPlanReview` — used by /plan-*-review skills
- `generateCodexSecondOpinion` — used by /office-hours

Both now produce a Codex → llm → Claude subagent chain. The llm step is
gated on:

1. `outside_voice_llm_model` config key (new, default `grok-4-fast`)
2. `command -v llm` succeeds
3. `llm models` lists the configured model

If any check fails, the chain falls through cleanly to the existing Claude
subagent path. No behavior change for users who don't have llm installed.

The default model is `grok-4-fast` rather than `grok-4-latest` because the
latter's reasoning mode can hang 2-5 minutes on big prompts (observed
2026-05-18 — the llm stream is empty during hidden reasoning, looks like
a hang). `grok-4-fast` is the right shape for routine outside-voice work.

## Config key

```
gstack-config set outside_voice_llm_model grok-4-fast    # default
gstack-config set outside_voice_llm_model ""             # disable, skip llm step
gstack-config set outside_voice_llm_model gpt-4o         # any model `llm models` lists
```

Empty string skips the llm step entirely (back to Codex → subagent direct).

## File-level diff

- `bin/gstack-config` — new key registration in `lookup_default` case statement + CONFIG_HEADER docs section
- `scripts/resolvers/review.ts` — llm fallback block inserted between Codex
  error handling and the "If CODEX_NOT_AVAILABLE" section in both
  generator functions. Codex cleanup in `generateCodexSecondOpinion`
  changed to defer `$CODEX_PROMPT_FILE` deletion until end of chain
  (the llm step reads it).
- Regenerated SKILL.md outputs for the 4 affected skills via
  `bun run gen:skill-docs`. `.gbrain/` outputs also regenerated locally
  (gitignored, not committed).

## Verification

- `bun run scripts/skill-check.ts` — passes, all freshness checks green
- Spot-checked generated SKILL.md files: llm fallback section present,
  bash blocks syntactically correct, `outside_voice_llm_model` referenced
  consistently
- Manually traced the 4-state matrix (codex/llm available × works/errors)
  through the generated bash to confirm fallthrough behavior

## What this doesn't do

- Doesn't change `generateAdversarialStep` (the /ship + /review adversarial
  flow) — that's a different shape (always-on, not configurable). Could
  benefit from the same fallback but out of scope for this PR; happy to
  do as a follow-up if you want it.
- Doesn't bump VERSION or add a CHANGELOG entry — leaving that to the
  maintainer's release workflow.
- Doesn't make `outside_voice_llm_model` settable via AskUserQuestion
  during a first-run flow. The default is sensible enough that most
  users won't need to touch it.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jbetala7
Contributor

One config-surface gap: this adds outside_voice_llm_model to lookup_default() and the header, but not to the list / defaults key loops in bin/gstack-config.

That means gstack-config get outside_voice_llm_model returns the new default, while gstack-config defaults and the active-values section of gstack-config list omit the key. Users then cannot discover or audit the new fallback model through the normal config surface, and the generated docs drift from the CLI output.

Please add outside_voice_llm_model to both loops and pin it with a focused config regression, similar to the explain_level default/list/defaults issue tracked by #1607/#1608.
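A focused regression could be as small as the sketch below. It assumes each subcommand prints key names somewhere in its output (the exact format is not shown in this thread), and the function name and the optional command argument are hypothetical, added only so the check can be pointed at a stub:

```shell
# Sketch of the regression check. Assumes `defaults` and `list` print key
# names in their output; the second argument lets a test substitute a stub.
config_surface_lists_key() {
  key="$1"
  cmd="${2:-gstack-config}"
  for sub in defaults list; do
    out="$("$cmd" "$sub" 2>/dev/null)" || return 1
    printf '%s\n' "$out" | grep -Fq -- "$key" || {
      echo "missing from $cmd $sub: $key" >&2
      return 1
    }
  done
}
```

Run as `config_surface_lists_key outside_voice_llm_model`, it would fail on the current branch and pass once the key is in both loops.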
