Redesign 59 structurally-broken challenges + fix runner repeatability #3

Open
jobordu wants to merge 3 commits into ci/benchmark-regression-gate from benchmark/redesign-broken-challenges

@jobordu jobordu commented May 19, 2026

Summary

Stacked on #2. Fixes the structurally-broken challenges found while triaging the
benchmark toward an honest 100%, and a runner bug that made file-create
challenges non-reproducible.

The problem

Triage of all 230 challenges found ~59 that no solver could ever pass:

  • Phantom-file challenges (12–18, 26): file-modify on src/database.js,
    src/components/*.js, Dockerfile, etc., none of which exist in the target
    project (QGSD, a planning/formal-methods CLI). file-modify on a missing file
    is a silent no-op → no residual change → unmeasurable.
  • Dead-layer challenges (03, 04, 07): only target t_to_c/f_to_c,
    which the benchmark's --fast invocation deliberately skips → residual is
    always -1 → undetectable.
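
The two failure classes above can be checked for mechanically. The following is a minimal sketch, not the triage code actually used: the challenge shape ({ type, target_file, layers }) and the function name classifyChallenge are assumptions made for illustration; only the file-modify type and the t_to_c/f_to_c layer names come from this PR.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Layers that the --fast invocation skips, per the description above.
const SKIPPED_UNDER_FAST = new Set(['t_to_c', 'f_to_c']);

// Hypothetical challenge shape: { type, target_file, layers }.
function classifyChallenge(challenge, projectRoot) {
  // file-modify on a file that is absent from the project is a silent no-op.
  if (challenge.type === 'file-modify' &&
      !fs.existsSync(path.join(projectRoot, challenge.target_file))) {
    return 'phantom-file';
  }
  // A challenge that only targets skipped layers can never move a residual.
  if (challenge.layers.length > 0 &&
      challenge.layers.every((layer) => SKIPPED_UNDER_FAST.has(layer))) {
    return 'dead-layer';
  }
  return 'measurable';
}
```

Anything classified as phantom-file or dead-layer is a candidate for redesign rather than for a solver fix.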

Changes

59 challenges redesigned to the verified c_to_r file-create pattern
(BENCH-188): create an untraced bin/rf-*.cjs utility with no @req
annotation, which the code→requirements layer flags as a coverage gap. Each new
file has a unique name, distinct realistic JS content, and honest title/tags.
Verified: BENCH-111 scores PASS (c_to_r residual 124→125, mutation detected).
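
For illustration, a redesigned challenge's new file might look roughly like the sketch below. The file name, the formatDuration function, and the isTraced check are all invented here; the real code→requirements layer's detection logic is not part of this PR, so isTraced is only a naive stand-in that looks for an @req mention in the source.

```javascript
// Hypothetical contents of a new utility such as bin/rf-format-duration.cjs.
// Only the bin/rf-*.cjs naming pattern comes from this PR.
const utilitySource = `'use strict';

// Formats milliseconds as "1m 05s". No requirement annotation on purpose.
function formatDuration(ms) {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = String(totalSeconds % 60).padStart(2, '0');
  return minutes + 'm ' + seconds + 's';
}

module.exports = { formatDuration };
`;

// Naive stand-in for the coverage check: a file counts as traced
// only if its source mentions an @req annotation.
function isTraced(source) {
  return /@req\b/.test(source);
}
```

Under this assumption, isTraced(utilitySource) is false, which corresponds to the file being counted as one more coverage gap in c_to_r.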

Runner repeatability fix (lib/runner.cjs): the snapshot model runs solve
in place, but restoreSnapshot only cleaned .planning/formal — so the 125
file-create challenges (116 into bin/) left their files behind. On a repeat
run the file already exists, the residual doesn't increase, and the challenge
falsely fails. createSnapshot now records pre-existing files in the
create-guard dirs; restoreSnapshot deletes anything a challenge added.
.jsonl is also snapshotted so the solver's trend log is restored — runs now
leave the project pristine.

Verification

  • All 230 challenges schema-valid (npm run validate).
  • All 125 file-create target_files unique.
  • BENCH-111 PASS, twice, leaving QGSD with 0 dirty files.
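
The uniqueness check could be expressed along these lines. This is a sketch under assumptions: the record shape ({ id, type, target_file }) and the function name findDuplicateTargets are hypothetical, and the actual check run for this PR may differ.

```javascript
// Returns [firstId, duplicateId, target_file] for every file-create
// challenge whose target collides with an earlier one.
function findDuplicateTargets(challenges) {
  const seen = new Map(); // target_file -> id of the first challenge using it
  const duplicates = [];
  for (const challenge of challenges) {
    if (challenge.type !== 'file-create') continue;
    if (seen.has(challenge.target_file)) {
      duplicates.push([seen.get(challenge.target_file), challenge.id, challenge.target_file]);
    } else {
      seen.set(challenge.target_file, challenge.id);
    }
  }
  return duplicates;
}
```

An empty result corresponds to the "all unique" claim; a collision matters because the second challenge would find its file already present within the same run.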

A full re-measurement (multi-hour serial run) is the follow-up step to get the
new baseline score.

🤖 Generated with Claude Code

jobordu and others added 3 commits May 19, 2026 13:29
BENCH-111/112/113/115 mutated src/*.js files that don't exist in QGSD
(file-modify no-op → unmeasurable). Redesigned as file-create c_to_r
challenges following the verified BENCH-188 pattern: create an untraced
bin/rf-*.cjs utility with no @req annotation, which the code→requirements
layer flags as a coverage gap.

Verified: BENCH-111 PASS (c_to_r residual 124→125, mutation detected).
BENCH-114 left untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phantom-file challenges (mutated src/*.js, Dockerfile, etc. absent from
QGSD) and t_to_c/f_to_c-only challenges (layers skipped under --fast)
were structurally unmeasurable. Redesigned all 55 to the verified
BENCH-188 c_to_r pattern: file-create an untraced bin/rf-*.cjs utility.

Files: 02,03,04,07,13,14,15,16,17,18,26. All 230 challenges schema-valid;
all file-create target_files unique.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…enges

The snapshot model runs solve in place against the project. restoreSnapshot
only cleaned .planning/formal, so the 125 file-create challenges (116 into
bin/) left their created files behind — on a repeat run the file already
exists, the c_to_r residual doesn't increase, and the challenge falsely fails.

createSnapshot now records every pre-existing file under the create-guard
dirs (bin, test, src, hooks, templates, scripts, .planning); restoreSnapshot
deletes anything a challenge added. Snapshot shape is now {content, guardPaths}.
Also snapshot .jsonl so the solver's trend log is restored — runs leave the
project pristine.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented May 19, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.
