Redesign 59 structurally-broken challenges + fix runner repeatability by jobordu · Pull Request #3 · nForma-AI/nf-benchmark

jobordu · 2026-05-19T12:40:52Z

Summary

Stacked on #2. Fixes the structurally-broken challenges found while triaging the
benchmark toward an honest 100%, and a runner bug that made file-create
challenges non-reproducible.

The problem

Triage of all 230 challenges found ~59 that no solver could ever pass:

Phantom-file challenges (12–18, 26): file-modify on src/database.js,
src/components/*.js, Dockerfile, etc. — files that don't exist in the target
project (QGSD, a planning/formal-methods CLI). file-modify on a missing file
is a silent no-op → no residual change → unmeasurable.
Dead-layer challenges (03, 04, 07): only target t_to_c/f_to_c,
which the benchmark's --fast invocation deliberately skips → residual is
always -1 → undetectable.
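Both failure modes can in principle be caught statically, before any solver runs. The following is a minimal sketch, not the triage script used for this PR: the challenge shape (`id`, `type`, `target_file`, `layers`) and the helper names are assumptions made for illustration, and the real nf-benchmark schema may differ.

```javascript
'use strict';
const fs = require('node:fs');
const path = require('node:path');

// Layers the PR description says are skipped under --fast.
const FAST_SKIPPED_LAYERS = new Set(['t_to_c', 'f_to_c']);

// Hypothetical challenge shape: { id, type, target_file, layers }.
// A file-modify challenge is "phantom" when its target does not exist
// in the project, so the mutation would be a silent no-op.
function findPhantomChallenges(challenges, projectRoot) {
  return challenges.filter(
    (c) =>
      c.type === 'file-modify' &&
      !fs.existsSync(path.join(projectRoot, c.target_file))
  );
}

// A challenge is "dead-layer" when every layer it targets is skipped,
// so no residual can ever be reported for it.
function isDeadLayer(challenge) {
  return (
    challenge.layers.length > 0 &&
    challenge.layers.every((layer) => FAST_SKIPPED_LAYERS.has(layer))
  );
}
```

Running both checks over the full challenge set would give a candidate list to review by hand; they flag structural problems only and say nothing about whether a measurable challenge is well designed.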

Changes

59 challenges redesigned to the verified c_to_r file-create pattern
(BENCH-188): create an untraced bin/rf-*.cjs utility with no @req
annotation, which the code→requirements layer flags as a coverage gap. Each new
file has a unique name, distinct realistic JS content, and honest title/tags.
Verified: BENCH-111 scores PASS (c_to_r residual 124→125, mutation detected).
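To illustrate why creating an unannotated file moves the residual by exactly one, here is a rough sketch of a coverage-gap count. It assumes the code→requirements layer treats any file under bin/ without an `@req` marker as untraced; the actual layer logic, the annotation syntax, and the function name `countUntraced` are assumptions, not taken from the project.

```javascript
'use strict';

// files: plain object mapping a project-relative path to its source text.
// Assumption: a file counts as traced when it contains an "@req" marker.
function countUntraced(files) {
  return Object.entries(files).filter(
    ([filePath, source]) =>
      filePath.startsWith('bin/') && !/@req\b/.test(source)
  ).length;
}

// Example: one untraced file before the challenge, two after it adds
// a new utility with no annotation.
const before = {
  'bin/traced.cjs': '// @req REQ-1\nmodule.exports = {};\n',
  'bin/legacy.cjs': 'module.exports = {};\n',
};
const after = { ...before, 'bin/rf-example.cjs': 'module.exports = {};\n' };

console.log(countUntraced(before), countUntraced(after));
```

Under this model a challenge is measurable only if the created file is new to the project, which is why each redesigned challenge needs a unique file name.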

Runner repeatability fix (lib/runner.cjs): the snapshot model runs solve
in place, but restoreSnapshot only cleaned .planning/formal — so the 125
file-create challenges (116 into bin/) left their files behind. On a repeat
run the file already exists, the residual doesn't increase, and the challenge
falsely fails. createSnapshot now records pre-existing files in the
create-guard dirs; restoreSnapshot deletes anything a challenge added.
.jsonl is also snapshotted so the solver's trend log is restored — runs now
leave the project pristine.
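The snapshot and restore behaviour described above can be sketched roughly as follows. This is not the code in lib/runner.cjs: the guard directory list and the `{content, guardPaths}` shape come from the commit message, while the function signatures, the `trackedFiles` parameter, and the `listFiles` helper are assumptions for illustration.

```javascript
'use strict';
const fs = require('node:fs');
const path = require('node:path');

// Create-guard directories as listed in the commit message.
const GUARD_DIRS = ['bin', 'test', 'src', 'hooks', 'templates', 'scripts', '.planning'];

// Recursively list files under dir, relative to root. Missing dirs yield [].
function listFiles(root, dir) {
  const abs = path.join(root, dir);
  if (!fs.existsSync(abs)) return [];
  return fs.readdirSync(abs, { withFileTypes: true }).flatMap((entry) => {
    const rel = path.join(dir, entry.name);
    return entry.isDirectory() ? listFiles(root, rel) : [rel];
  });
}

// Record the bytes of tracked files plus the set of files that already
// exist under the guard dirs. trackedFiles is a hypothetical parameter.
function createSnapshot(root, trackedFiles = []) {
  const content = {};
  for (const rel of trackedFiles) {
    const abs = path.join(root, rel);
    if (fs.existsSync(abs)) content[rel] = fs.readFileSync(abs);
  }
  const guardPaths = new Set(GUARD_DIRS.flatMap((dir) => listFiles(root, dir)));
  return { content, guardPaths };
}

// Delete any guard-dir file that was not present at snapshot time,
// then write the tracked files back with their original bytes.
function restoreSnapshot(root, snapshot) {
  for (const dir of GUARD_DIRS) {
    for (const rel of listFiles(root, dir)) {
      if (!snapshot.guardPaths.has(rel)) fs.rmSync(path.join(root, rel));
    }
  }
  for (const [rel, bytes] of Object.entries(snapshot.content)) {
    fs.writeFileSync(path.join(root, rel), bytes);
  }
}
```

Recording the pre-existing paths rather than the challenge's declared target means the restore also cleans up files a solver creates as a side effect. This sketch leaves empty directories behind and does not recreate tracked files whose parent directory was removed; a real runner may need to handle both.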

Verification

All 230 challenges schema-valid (npm run validate).
All 125 file-create target_files unique.
BENCH-111 PASS, twice, leaving QGSD with 0 dirty files.
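The uniqueness check on file-create targets could look something like the sketch below. The field names `type` and `target_file` and the function name are assumptions about the challenge schema, and this is not necessarily how `npm run validate` performs the check.

```javascript
'use strict';

// Group file-create challenges by target path and return only the
// paths claimed by more than one challenge, as [path, [ids]] pairs.
function findDuplicateTargets(challenges) {
  const idsByTarget = new Map();
  for (const challenge of challenges) {
    if (challenge.type !== 'file-create') continue;
    const ids = idsByTarget.get(challenge.target_file) ?? [];
    ids.push(challenge.id);
    idsByTarget.set(challenge.target_file, ids);
  }
  return [...idsByTarget].filter(([, ids]) => ids.length > 1);
}
```

An empty result means no two file-create challenges collide on the same path, which matters here because a second challenge creating an already-counted file would not raise the residual.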

A full re-measurement (multi-hour serial run) is the follow-up step to get the
new baseline score.

🤖 Generated with Claude Code


BENCH-111/112/113/115 mutated src/*.js files that don't exist in QGSD (file-modify no-op → unmeasurable). Redesigned as file-create c_to_r challenges following the verified BENCH-188 pattern: create an untraced bin/rf-*.cjs utility with no @req annotation, which the code→requirements layer flags as a coverage gap. Verified: BENCH-111 PASS (c_to_r residual 124→125, mutation detected). BENCH-114 left untouched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phantom-file challenges (mutated src/*.js, Dockerfile, etc. absent from QGSD) and t_to_c/f_to_c-only challenges (layers skipped under --fast) were structurally unmeasurable. Redesigned all 55 to the verified BENCH-188 c_to_r pattern: file-create an untraced bin/rf-*.cjs utility. Files: 02,03,04,07,13,14,15,16,17,18,26. All 230 challenges schema-valid; all file-create target_files unique. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…enges The snapshot model runs solve in place against the project. restoreSnapshot only cleaned .planning/formal, so the 125 file-create challenges (116 into bin/) left their created files behind — on a repeat run the file already exists, the c_to_r residual doesn't increase, and the challenge falsely fails. createSnapshot now records every pre-existing file under the create-guard dirs (bin, test, src, hooks, templates, scripts, .planning); restoreSnapshot deletes anything a challenge added. Snapshot shape is now {content, guardPaths}. Also snapshot .jsonl so the solver's trend log is restored — runs leave the project pristine. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-05-19T12:41:31Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 53595417-e273-4c92-8840-1fcaa96fdf4a

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


jobordu and others added 3 commits May 19, 2026 13:29

jobordu wants to merge 3 commits into ci/benchmark-regression-gate from benchmark/redesign-broken-challenges
