Skip to content

CLI: capture eval-runner failures and timeouts in result#3330

Merged
chubes4 merged 1 commit intotrunkfrom
eval-runner-failure-capture
May 4, 2026
Merged

CLI: capture eval-runner failures and timeouts in result#3330
chubes4 merged 1 commit intotrunkfrom
eval-runner-failure-capture

Conversation

@chubes4
Copy link
Copy Markdown
Contributor

@chubes4 chubes4 commented May 4, 2026

Summary

Extends the eval-runner result JSON with two new fields that any consumer of EVAL_RUNNER_RESULT_FILE can read:

  • error: string | null — set when an exception is thrown inside the message loop (auth blip, MCP crash, network error, SDK throw, etc.).
  • timedOut: boolean — flipped by the timeout callback before it calls query.interrupt().

Together these address the finalError ask in #3262 section 1 ("First actionable failure") and complete the failure-visibility story started by #3273 (firstToolError, toolEvents, phaseTimingsMs).

Why

Three failure modes today, three different visibility outcomes:

Failure shape Today After this PR
Model returns success: false cleanly ✅ Captured fully Same
Timeout fires (query.interrupt() after timeoutMs) ⚠️ Captured as success: false, indistinguishable from a clean model failure timedOut: true distinguishes it
Mid-loop exception (auth, MCP, network, SDK throw) ⚠️ Caught at main()'s top-level → emits stripped-down { success: false, error } JSON, losing all timings/tools/turns ✅ Caught inside runEval() → full structured result with error set alongside everything else

The third row is the most consequential: today, a run that completes meaningful work and then hits a late exception loses all the diagnostic state that was already captured. The new try { … } catch keeps the structured result intact.

What this enables

A consumer can now answer:

  • Did the eval time out vs. the model gave up cleanly? (Today: indistinguishable. After: check result.timedOut.)
  • Was there a runtime exception during the run, even if success was true? (Today: not representable — success: true and the exception path are mutually exclusive. After: both fields can coexist.)
  • When an exception fires, what tools ran before it, how long did each phase take? (Today: lost in the main()-level fallback JSON. After: preserved in the structured result.)

This is the same shape as #3273's contribution — producer-side observability improvements that any downstream consumer benefits from. Studio's own npm run eval, future internal CI, promptfoo configurations, and external benchmark harnesses all read the same JSON contract.

Diff

8 lines:

  • let error: string | null = null and let timedOut = false declarations
  • Timeout callback rewritten to set timedOut = true before calling query.interrupt()
  • try { … } catch ( caught ) { error = … } wrapper around the message loop
  • Two new fields in the return shape

The catch only changes behavior on exception paths; the existing successful and clean-failure paths are unchanged.

Origin

Originally drafted on the experimental Static Site Importer branch (#3309) where it was being used by an out-of-tree benchmark harness. Split out for upstream review because:

  1. The change is generic — useful to any consumer of the eval-runner's structured output, not just the SSI experiment.
  2. It shouldn't ship gated on the SSI experiment merging.
  3. Reviewing it on its own keeps the surface area small (8 lines vs. ~200).

The SSI draft PR will rebase on top of this once it lands.

Tests

  • npm run typecheck (all workspaces) — clean
  • npx eslint apps/cli/ai/eval-runner.ts — clean

The change is small enough that it's covered by the existing eval-runner exercise paths (npm run eval, etc.). If reviewers want explicit unit tests around the new fields I'm happy to add them.

Refs

AI assistance

  • AI assistance: Yes
  • Tool(s): Claude Code (Sonnet 4.5)
  • Used for: Drafted the catch wrapper, the timedOut flag, and the result-shape extension under Chris's direction. Chris reviewed the diff, the issue framing, and the split rationale.

Extends the eval-runner result JSON with two new fields that any
consumer of `EVAL_RUNNER_RESULT_FILE` can read:

- `error: string | null` — set when an exception is thrown inside
  the message loop (auth blip, MCP crash, network error, SDK throw,
  etc.). Today those exceptions propagate out of `runEval()` to
  `main()`'s top-level catch, which emits a stripped-down
  `{ success: false, error }` JSON without the timings, tools, turns,
  or partial state that were captured before the throw. With this
  change, the exception is caught inside `runEval()`, so consumers
  get the full structured result alongside the error message.
- `timedOut: boolean` — flipped by the timeout callback before it
  calls `query.interrupt()`. Today timeouts are indistinguishable
  from clean model failures: both surface as `success: false` with
  no other differentiator. Consumers can now tell them apart and
  attribute time-budget exhaustion correctly.

Together these address the `finalError` ask in #3262
section 1 ("First actionable failure") and complete the failure-
visibility story started by #3273 (which added `firstToolError`,
`toolEvents`, `phaseTimingsMs`).

The change is generic — it benefits any consumer of the eval-runner's
structured output (Studio's own scripts, `npm run eval`, future
internal CI, third-party benchmark harnesses).

Originally drafted on the experimental Static Site Importer branch
(#3309); split out for upstream review because the eval-runner
changes are independently valuable and shouldn't ship gated on
that experiment.

Refs: #3262, #3273, #3309

## AI assistance
- **AI assistance:** Yes
- **Tool(s):** Claude Code (Sonnet 4.5)
- **Used for:** Drafted the catch wrapper, the `timedOut` flag,
  and the result-shape extension under Chris's direction. Chris
  reviewed the diff, the issue framing, and the split rationale.
@chubes4 chubes4 requested a review from youknowriad May 4, 2026 12:10
chubes4 added a commit to chubes4/homeboy-rigs that referenced this pull request May 4, 2026
Pairs with Automattic/studio#3330, which adds two new fields to the
eval-runner result JSON:

- result.error: string | null  — set when an exception is caught
  inside the message loop (auth blip, MCP crash, network error, SDK
  throw, etc.)
- result.timedOut: boolean     — set when the timeout callback fires
  before query.interrupt()

Bench changes:

- studio-agent-runtime + studio-agent-site-info: failure-detail
  message now distinguishes 'timed out after Nms', 'exception: ...',
  and 'exit=N'. Lifts the timeout literal into a named constant so
  the message stays in sync with the actual budget.
- studio-agent-site-build: agentSucceeded gate now also requires
  !result.timedOut. Adds two new metrics:
  - timed_out (1 when the run hit the time budget)
  - agent_runner_error (1 when an exception landed in result.error)
  These categorize regressions that today look like generic agent
  failures.

Backwards-compatible: on Studio versions older than #3330, both
fields are nullish and the bench behaves exactly as it did before.

Refs: Automattic/studio#3330, Automattic/studio#3262, Automattic/studio#3273

## AI assistance
- **AI assistance:** Yes
- **Tool(s):** Claude Code (Sonnet 4.5)
- **Used for:** Drafted the gate-tightening, the constants, and the
  failure-detail messages under Chris's direction.
@wpmobilebot
Copy link
Copy Markdown
Collaborator

📊 Performance Test Results

Comparing f4ab2be vs trunk

app-size

Metric trunk f4ab2be Diff Change
App Size (Mac) 1557.92 MB 1557.92 MB +0.00 MB ⚪ 0.0%

site-editor

Metric trunk f4ab2be Diff Change
load 1522 ms 1491 ms 31 ms ⚪ 0.0%

site-startup

Metric trunk f4ab2be Diff Change
siteCreation 8092 ms 8088 ms 4 ms ⚪ 0.0%
siteStartup 4949 ms 4948 ms 1 ms ⚪ 0.0%

Results are median values from multiple test runs.

Legend: 🟢 Improvement (faster) | 🔴 Regression (slower) | ⚪ No change (<50ms diff)

@chubes4 chubes4 merged commit b42be68 into trunk May 4, 2026
11 checks passed
@chubes4 chubes4 deleted the eval-runner-failure-capture branch May 4, 2026 13:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants