Redesign 59 structurally-broken challenges + fix runner repeatability #3
Open
jobordu wants to merge 3 commits into
BENCH-111/112/113/115 mutated src/*.js files that don't exist in QGSD (file-modify no-op → unmeasurable). Redesigned as file-create c_to_r challenges following the verified BENCH-188 pattern: create an untraced bin/rf-*.cjs utility with no @Req annotation, which the code→requirements layer flags as a coverage gap. Verified: BENCH-111 PASS (c_to_r residual 124→125, mutation detected). BENCH-114 left untouched.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phantom-file challenges (mutated src/*.js, Dockerfile, etc. absent from QGSD) and t_to_c/f_to_c-only challenges (layers skipped under --fast) were structurally unmeasurable. Redesigned all 55 to the verified BENCH-188 c_to_r pattern: file-create an untraced bin/rf-*.cjs utility. Files: 02,03,04,07,13,14,15,16,17,18,26. All 230 challenges schema-valid; all file-create target_files unique.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
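For readers unfamiliar with the c_to_r pattern these commits rely on, here is a minimal sketch of how a code→requirements check could count untraced files. It is not the solver's actual implementation: the function names `findUntracedFiles` and `coverageResidual` are hypothetical, and treating any case-insensitive `@req` occurrence as "traced" is an assumption made for illustration.

```javascript
'use strict';
const fs = require('node:fs');
const path = require('node:path');

// Assumption: a file counts as traced if it contains "@req" anywhere
// (case-insensitive). The real layer's matching rules are not shown in this PR.
const REQ_ANNOTATION = /@req\b/i;

// Hypothetical helper: list .cjs files directly under `dir` that carry no annotation.
function findUntracedFiles(dir) {
  if (!fs.existsSync(dir)) return [];
  return fs
    .readdirSync(dir, { withFileTypes: true })
    .filter((entry) => entry.isFile() && entry.name.endsWith('.cjs'))
    .map((entry) => entry.name)
    .filter((name) => !REQ_ANNOTATION.test(fs.readFileSync(path.join(dir, name), 'utf8')))
    .sort();
}

// Hypothetical residual: the number of untraced files. Creating one more
// unannotated utility raises it by exactly one.
function coverageResidual(dir) {
  return findUntracedFiles(dir).length;
}
```

Under these assumptions, a file-create challenge is measurable because the created file always exists afterwards and always lacks the annotation, so the residual has to move.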
…enges
The snapshot model runs solve in place against the project. restoreSnapshot
only cleaned .planning/formal, so the 125 file-create challenges (116 into
bin/) left their created files behind — on a repeat run the file already
exists, the c_to_r residual doesn't increase, and the challenge falsely fails.
createSnapshot now records every pre-existing file under the create-guard
dirs (bin, test, src, hooks, templates, scripts, .planning); restoreSnapshot
deletes anything a challenge added. Snapshot shape is now {content, guardPaths}.
Also snapshot .jsonl so the solver's trend log is restored — runs leave the
project pristine.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Stacked on #2. Fixes the structurally-broken challenges found while triaging the
benchmark toward an honest 100%, and a runner bug that made file-create
challenges non-reproducible.
The problem
Triage of all 230 challenges found ~59 that no solver could ever pass:
- 12–18,26): file-modify on src/database.js, src/components/*.js, Dockerfile, etc., which don't exist in the target project (QGSD, a planning/formal-methods CLI). file-modify on a missing file is a silent no-op → no residual change → unmeasurable.
- 03,04,07): only target t_to_c/f_to_c, which the benchmark's --fast invocation deliberately skips → residual is always -1 → undetectable.
Changes
- 59 challenges redesigned to the verified c_to_r file-create pattern (BENCH-188): create an untraced bin/rf-*.cjs utility with no @req annotation, which the code→requirements layer flags as a coverage gap. Each new file has a unique name, distinct realistic JS content, and honest title/tags. Verified: BENCH-111 scores PASS (c_to_r residual 124→125, mutation detected).
- Runner repeatability fix (lib/runner.cjs): the snapshot model runs solve in place, but restoreSnapshot only cleaned .planning/formal, so the 125 file-create challenges (116 into bin/) left their files behind. On a repeat run the file already exists, the residual doesn't increase, and the challenge falsely fails. createSnapshot now records pre-existing files in the create-guard dirs; restoreSnapshot deletes anything a challenge added. .jsonl is also snapshotted so the solver's trend log is restored; runs now leave the project pristine.
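The runner fix described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in lib/runner.cjs: the function signatures, the `trackedFiles` parameter, and the `listFiles` helper are hypothetical. Only the names createSnapshot and restoreSnapshot, the guard-dir list, and the {content, guardPaths} shape come from the commit message.

```javascript
'use strict';
const fs = require('node:fs');
const path = require('node:path');

// Create-guard dirs as listed in the commit message.
const GUARD_DIRS = ['bin', 'test', 'src', 'hooks', 'templates', 'scripts', '.planning'];

// Hypothetical helper: recursively list files under root/dir, relative to root.
function listFiles(root, dir) {
  const abs = path.join(root, dir);
  if (!fs.existsSync(abs)) return [];
  const out = [];
  for (const entry of fs.readdirSync(abs, { withFileTypes: true })) {
    const rel = path.join(dir, entry.name);
    if (entry.isDirectory()) out.push(...listFiles(root, rel));
    else out.push(rel);
  }
  return out;
}

// Record the content of tracked files (e.g. a .jsonl trend log) plus the set of
// files that already exist in the guard dirs. `trackedFiles` is an assumed parameter.
function createSnapshot(root, trackedFiles = []) {
  const content = {};
  for (const rel of trackedFiles) {
    const abs = path.join(root, rel);
    if (fs.existsSync(abs)) content[rel] = fs.readFileSync(abs, 'utf8');
  }
  const guardPaths = new Set(GUARD_DIRS.flatMap((dir) => listFiles(root, dir)));
  return { content, guardPaths };
}

// Delete every guard-dir file that was not present at snapshot time,
// then write the recorded content back.
function restoreSnapshot(root, snapshot) {
  for (const dir of GUARD_DIRS) {
    for (const rel of listFiles(root, dir)) {
      if (!snapshot.guardPaths.has(rel)) fs.rmSync(path.join(root, rel));
    }
  }
  for (const [rel, text] of Object.entries(snapshot.content)) {
    fs.mkdirSync(path.dirname(path.join(root, rel)), { recursive: true });
    fs.writeFileSync(path.join(root, rel), text);
  }
}
```

Recording the pre-existing paths rather than the created ones means the runner does not need to know what a challenge did; anything new in a guard dir is removed on restore.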
Verification
- All 230 challenges schema-valid (npm run validate).
- file-create target_files unique.
- BENCH-111 PASS, twice, leaving QGSD with 0 dirty files.
A full re-measurement (multi-hour serial run) is the follow-up step to get the new baseline score.
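The uniqueness check on file-create targets could look something like the sketch below. The challenge shape is an assumption: `id`, `type`, and an array-valued `target_files` are hypothetical field names chosen for illustration, and `findDuplicateTargets` is not a function from this repository.

```javascript
'use strict';

// Hypothetical challenge shape: { id: string, type: string, target_files: string[] }.
// Returns one entry per target path claimed by more than one file-create challenge.
function findDuplicateTargets(challenges) {
  const owners = new Map(); // target path -> ids of file-create challenges claiming it
  for (const challenge of challenges) {
    if (challenge.type !== 'file-create') continue;
    for (const target of challenge.target_files ?? []) {
      if (!owners.has(target)) owners.set(target, []);
      owners.get(target).push(challenge.id);
    }
  }
  return [...owners]
    .filter(([, ids]) => ids.length > 1)
    .map(([target, ids]) => ({ target, ids }));
}

// Usage with made-up example data (not real challenge ids or paths):
const duplicates = findDuplicateTargets([
  { id: 'EX-1', type: 'file-create', target_files: ['bin/rf-a.cjs'] },
  { id: 'EX-2', type: 'file-create', target_files: ['bin/rf-a.cjs'] },
  { id: 'EX-3', type: 'file-modify', target_files: ['bin/rf-a.cjs'] },
]);
console.log(duplicates.length === 0 ? 'all targets unique' : duplicates);
```

Uniqueness matters for the same reason as the snapshot fix: two challenges creating the same path would interfere with each other if a file were ever left behind.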
🤖 Generated with Claude Code