feat(evals): Level 3 DOCX agent benchmark suite#2664

Merged
tupizz merged 43 commits into andrii/sd-2451-refactor-mcp-set-up from feat/level3-agent-benchmark
Apr 3, 2026

Conversation

@tupizz tupizz commented Mar 31, 2026

Summary

  • Adds Level 3 benchmark to the existing evals/ Promptfoo infrastructure
  • 10 provider conditions (5 Claude Code + 5 Codex) × 18 DOCX tasks = 180 runs
  • Compares agent performance with and without SuperDoc across reading and editing tasks
  • Deterministic assertions (no LLM judge), Markdown + CSV report generation

What's new

Providers:

  • claude-code-agent.mjs — Claude Agent SDK (@anthropic-ai/claude-agent-sdk) with query()
  • codex-agent.mjs — OpenAI Codex SDK (@openai/codex-sdk) with Codex.startThread().run()

Tests:

  • 7 reading tasks (extract headings, entities, financial figures, table data, comments)
  • 11 editing tasks (replace text, add paragraphs, fill placeholders, tracked changes)
  • All validated against real fixture content

Shared utilities:

  • extractDocxText() — lightweight DOCX text extractor (SuperDoc CLI with unzip+XML fallback)
  • benchmarkMetrics assertion — captures steps, cost, duration, tokens, and path used

Report:

  • benchmark-report.mjs — generates summary table, path usage, per-task breakdown, recommendation

Config:

  • promptfooconfig.benchmark.yaml — 10-condition matrix
  • 4 new scripts: eval:benchmark, eval:benchmark:claude, eval:benchmark:codex, eval:benchmark:report

Smoke test

Codex baseline verified end-to-end:

  • Read: extract all headings on report-with-formatting.docx — PASS
  • Found all 7 headings, document unchanged, 4 tool calls, 84s, ~62k tokens

Test plan

  • Unit tests pass (67/67)
  • All 8 fixtures extract text correctly via extractDocxText
  • All assertions validated against actual fixture content
  • Codex provider smoke test passes end-to-end
  • Claude Code provider smoke test (blocked by API credits)
  • Full matrix run (180 runs, ~$6, ~2hrs)

tupizz added 8 commits March 31, 2026 14:31
- Fix cwd ENOENT: create stateDir before passing to SDK query()
- Fix Claude Code provider: clean up, remove pathToClaudeCodeExecutable hacks
- Fix Codex provider: match real SDK API (command_execution items, approvalPolicy)
- Fix test assertions: match actual fixture content
  - contract.docx -> report-with-formatting.docx for heading tasks
  - [Employee Name] -> [Candidate Name] for employment-offer.docx
  - Fix $150M collateral check (XML extraction splits as "1 50")
- Upgrade @anthropic-ai/claude-agent-sdk to ^0.2.87
@edoversb

Ooo!

tupizz added 20 commits March 31, 2026 21:29
- Copy fixture into stateDir so agents can write within their sandbox
- Add stateDir fallback for output file detection
- Add useClaudeSettings option to inherit local Claude Code config
  (MCP servers, skills, CLAUDE.md) via settingSources
- Add CC-local condition for testing with user's own Claude Code setup
- Wire superdocMcp config to attach SuperDoc MCP server via mcpServers
- Add preeval:benchmark script to build MCP server before runs
- Add model, maxTurns, systemPrompt config options
Standalone test script that verifies both providers end-to-end:
- Claude baseline read/edit (without SuperDoc)
- Claude superdoc-skill with MCP (superdoc_open → get_content → close)
- Claude local with useClaudeSettings
- Codex baseline read/edit (without SuperDoc)
- Codex with SuperDoc MCP

Run: node evals/scripts/smoke-test-benchmark.mjs --claude --codex
- Add system prompt for superdoc conditions instructing agents to use
  SuperDoc MCP tools exclusively, not raw unzip/XML
- Write AGENTS.md in working directory reinforcing SuperDoc tool usage
- Restrict CC-superdoc-skill allowedTools to Read/Glob/Grep (no Bash)
  so agents cannot fall back to raw DOCX manipulation
- Add prompt reinforcement for Codex superdoc conditions
- Verified: Claude superdoc-skill read + edit both use MCP exclusively
  (superdoc_open → search → edit → save → close, zero Bash calls)
- Pass process.env.OPENAI_API_KEY to new Codex({ apiKey }) so the SDK
  uses API key auth instead of relying on codex login session
- Add Claude edit + MCP tests to smoke test script
- Verified: Codex baseline read + edit pass with API key auth
- Known: Codex MCP calls fail due to rmcp protocol incompatibility
  in the Codex CLI (serde error on tool calls, Transport closed)
Root cause: console.debug('[super-editor] Telemetry: enabled') in
Editor.ts writes to stdout when superdoc_open initializes the editor.
The Codex CLI's Rust MCP client (rmcp) parses stdout as JSON-RPC and
dies with "serde error expected value at line 1 column 2" on the
non-JSON line, closing the transport.

Fixes:
- Redirect all console methods (log/info/debug/warn) to stderr in
  the MCP server entry point, before any imports run
- Add mcp_auto_approve config for Codex to auto-approve MCP tool calls
  (approval_policy=never only covers shell commands, not MCP)
- Add stdio wrapper script for transport debugging (logs raw bytes)
- Use runStreamed() in Codex provider to capture full MCP event lifecycle
- Pass minimal env to prevent other stdout pollution from deps
- Add preflight check for MCP server build artifact
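The stdout-protection fix described in this commit can be sketched like this. The redirect must run before any imports; the entry-point module name in the comment is hypothetical.

```javascript
// Sketch of the stdout-protection fix: an MCP stdio server owns stdout
// for JSON-RPC, so console logging from dependencies must be rerouted
// to stderr BEFORE those dependencies are imported.
for (const method of ['log', 'info', 'debug', 'warn']) {
  console[method] = (...args) => process.stderr.write(args.join(' ') + '\n');
}

// Imports come after the redirect, so module-level logging (like the
// '[super-editor] Telemetry: enabled' line) cannot reach stdout and
// break the rmcp client's JSON-RPC parser.
// const { startServer } = await import('./server.js'); // hypothetical entry point
```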
Reduce from 18 to 6 tasks (3 reading + 3 editing) for faster iteration.
Full suite: 12 runs in 3 minutes, 100% pass rate on Codex baseline +
superdoc-skill conditions.

Tasks: extract headings, extract entities, extract financials,
replace entity, insert section, fill placeholders.
- Add per-task detail table with every metric per condition
- Add input/output token breakdown (not just total)
- Add p95 latency alongside median
- Add estimated cost per task (based on model token pricing)
- Add comprehensive recommendation with latency, token, cost, steps,
  and collateral comparisons between conditions
- Fix task description extraction from vars.task fallback
Replace single benchmarkMetrics assertion with separate per-metric
assertions (steps, latency, tokens, path), each with its own metric
tag. Promptfoo displays these as individual columns with actual numeric
values instead of a single "efficiency 1.00" score.

Columns visible in UI: correctness, collateral, steps, latency, tokens, path
…ition

The superdocOnPath flag was a no-op because the SuperDoc CLI was never
installed as a binary on PATH. Now creates a shell wrapper script in
the stateDir's bin/ that delegates to apps/cli/dist/index.js, and
prepends it to the agent's PATH.

Finding: even with superdoc on PATH, Codex doesn't discover or use it
without explicit instruction. All superdoc-cli runs fall back to raw
unzip/XML. This is valid benchmark data.
- benchmarkPath assertion now FAILS when superdoc-skill or superdoc-cli
  conditions don't use SuperDoc (was always passing before)
- Add AGENTS.md + prompt hint for superdoc-cli condition telling agents
  the CLI exists on PATH with common commands
- Split MCP and CLI AGENTS.md templates in both providers
- Verified: all 3 Codex conditions use correct path
  (baseline=raw, superdoc-skill=MCP, superdoc-cli=CLI)
Add a _summary line at the top of provider JSON output showing
path | steps | latency | tokens at a glance. Promptfoo renders the
start of the output in each table cell, so this gives immediate
visibility without clicking into the detail view.
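The convention can be sketched as a small formatter; the field names below are assumptions, not the provider's actual metric keys.

```javascript
// Sketch of the _summary convention: a single human-readable line placed
// first in the provider's JSON output, so Promptfoo's table-cell preview
// shows the key metrics without opening the detail view.
function formatSummary(metrics) {
  return `${metrics.path} | ${metrics.steps} steps | ${metrics.latencyMs}ms | ${metrics.tokens} tok`;
}

function buildOutput(metrics, detail) {
  // _summary comes first so it lands at the start of the rendered cell.
  return JSON.stringify({ _summary: formatSummary(metrics), ...detail }, null, 2);
}
```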
- Add derivedMetrics: avg_latency, avg_steps, avg_tokens,
  superdoc_usage_pct - computed per provider after evaluation
- Set weight: 0 on steps/latency/tokens assertions so they report
  values without affecting pass/fail score
- Only correctness, collateral, and path drive pass/fail
- Click "Show Charts" in Promptfoo UI for visual comparison
Add the Anthropic DOCX skill (from anthropics/skills repo) as the
vendor condition. When vendorSkill: true, the skill is installed as
AGENTS.md in the working directory, teaching agents to use unzip/XML
for reading and docx-js for creation.

This completes the benchmark matrix:
- baseline: no skill, agent figures it out
- vendor: Anthropic's DOCX skill (unzip + docx-js)
- superdoc-skill: SuperDoc MCP server
- superdoc-cli: SuperDoc CLI on PATH
- choice: all available, agent picks
Claude Agent SDK reads CLAUDE.md (not AGENTS.md) for project context.
Write vendor skill and CLI instructions as CLAUDE.md in the stateDir,
and enable settingSources: ['project'] so the SDK loads it.
tupizz added 12 commits April 2, 2026 11:52
Creates 4 DOCX fixtures designed to be fragile under raw XML edits:
- consulting-agreement.docx: bold defined terms, italic refs, 6 heading sections, $250k indemnification cap, net 45 payment terms
- pricing-proposal.docx: 4-row pricing table with shaded header, right-aligned prices, US Letter page size
- contract-redlines.docx: 3 tracked insertions + 2 deletions by Jane Editor, 2 reviewer comments by Bob Reviewer
- policy-manual.docx: 3-level nested numbered list (1./1.1/a)), header/footer with page numbers, page breaks between sections

Adds create-v2-fixtures.mjs generator script and docx@9.6.1 dev dependency.
New capabilities:
- docx-fidelity.mjs: OOXML structural checker (formatting, styles,
  numbering, tracked changes, comments, tables, XML diff)
- benchmarkFidelity assertion: runs fidelity checks on output DOCX
- benchmarkDiff assertion: measures XML change ratio (surgical vs rewrite)

New fixtures (all synthetic names):
- consulting-agreement.docx: bold terms, italic refs, numbered sections
- pricing-proposal.docx: table with alignment and styled header
- contract-redlines.docx: existing tracked changes and comments
- policy-manual.docx: 3-level nested numbered lists

6 new fidelity tasks (CEO examples):
- Mixed formatting replace (bold preservation)
- Table cell edit (structure preservation)
- Tracked changes edit (annotation survival)
- Nested list insert (numbering continuation)
- Multi-step workflow (heading style check)
- Edit with existing annotations (comment survival)

92 tests total: 69 checks.cjs + 23 docx-fidelity
1. outputFile pointed to unedited fixture copy instead of localDocPath
   (the file the agent actually edits in stateDir)
2. Comment IDs in fidelity checks used "0","1" but fixture has "1","2"
3. Table cell text used exact match instead of includes
4. Remove overly strict paragraphStyle check on multi-step task
Category A — Structural creation (SuperDoc proven):
- Create heading with Heading1 style
- Create table with borders and data rows

Category B — Formatting (SuperDoc proven):
- Make specific text bold
- Replace text preserving formatting

Category C — Complex edits (track improvement):
- Tracked change replacement
- Add comment to clause
Remove settingSources which loaded ALL user MCP servers (43 Linear,
5 Excalidraw, Gmail, etc.) adding ~4000 tokens per turn. Pass
CLAUDE.md content as systemPrompt instead.

Result: 30% cost reduction ($0.97 -> $0.68 for NDA creation).
@tupizz tupizz changed the base branch from main to andrii/sd-2451-refactor-mcp-set-up April 3, 2026 01:51
tupizz added 3 commits April 2, 2026 23:07
…r clarity

Changed several provider labels in promptfooconfig.benchmark.yaml to better reflect their functionality: 'CC-vendor' -> 'CC-with-docx-skill', 'CC-superdoc-skill' -> 'CC-superdoc-mcp', and similar renames for consistency.
@tupizz tupizz marked this pull request as ready for review April 3, 2026 14:58
@tupizz tupizz merged commit bc66b32 into andrii/sd-2451-refactor-mcp-set-up Apr 3, 2026
1 check passed
@tupizz tupizz deleted the feat/level3-agent-benchmark branch April 3, 2026 14:59

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

const { toolKeys: FORMAT_TOOL_KEYS, inlineKeys: FORMAT_INLINE_KEYS } = loadFormatSchemaInfo();

P2 Badge Defer format schema loading to avoid crashing checks on startup

checks.cjs eagerly loads packages/sdk/tools/tools.openai.json at module import time and throws if it is missing, but that file is generated and may not exist in a fresh clone/CI job. This makes every eval assertion fail to load (including checks unrelated to formatting), breaking evals test execution before any test logic runs. Load this schema lazily inside correctFormatArgs (or provide a fallback) so non-format checks can still run.


Comment on lines +186 to +193
codexOpts.config = {
  mcp_servers: {
    superdoc: {
      command: process.execPath, // Use exact node binary, not bare 'node'
      args: [MCP_WRAPPER_PATH, process.execPath, MCP_SERVER_PATH],
    },
  },
};

P1 Badge Preserve MCP auto-approve when configuring Codex MCP server

When superdocMcp is enabled, codexOpts.config is reassigned to a new object containing only mcp_servers, which drops the earlier mcp_auto_approve setting. In unattended benchmark runs this can force interactive approval for MCP tool calls, causing superdoc-skill conditions to stall or fail instead of executing end-to-end. Merge mcp_servers into the existing config instead of replacing it.
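The suggested merge can be sketched like this, assuming the wrapper and server paths from the snippet above; the helper name is illustrative.

```javascript
// Sketch of the merge fix: extend the existing Codex config rather than
// replacing it, so earlier settings like mcp_auto_approve survive.
function withSuperdocMcp(codexOpts, wrapperPath, serverPath) {
  codexOpts.config = {
    ...codexOpts.config, // preserve mcp_auto_approve and any prior keys
    mcp_servers: {
      ...(codexOpts.config && codexOpts.config.mcp_servers),
      superdoc: {
        command: process.execPath, // exact node binary, not bare 'node'
        args: [wrapperPath, process.execPath, serverPath],
      },
    },
  };
  return codexOpts;
}
```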


maxTurns: this.config.maxTurns || 20,
permissionMode: 'bypassPermissions',
allowDangerouslySkipPermissions: true,
settingSources: [], // SDK isolation mode: don't load user MCP servers (Linear, Excalidraw, etc.)

P2 Badge Honor useClaudeSettings when building Claude query options

The provider documents a useClaudeSettings mode, but queryOptions always sets settingSources: [], so local Claude settings are never loaded even when the flag is true. This makes the advertised local-settings condition nonfunctional and can skew benchmark comparisons because the runtime behavior cannot match the configured scenario.
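One possible fix is to derive the sources from the flag instead of hard-coding isolation mode. The `'project'` value mirrors the settingSources usage mentioned in the commit messages; the config shape here is an assumption.

```javascript
// Sketch of honoring useClaudeSettings when building query options:
// isolation by default (don't load the user's MCP servers), but load
// local settings when the documented flag is set.
function buildSettingSources(config) {
  return config.useClaudeSettings ? ['project'] : [];
}
```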

