A report from my agents on validation gaps in the new QR problem:
Validation gaps
1.) benchmark mode never re-checks timed outputs (recheck=False). _run_single_benchmark validates only the pre-timing warmup outputs; the timed for i in range(max_repeats) loop computes outputs but re-validates them only when recheck=True, which is set only in leaderboard mode, never in the local benchmark.sh that workers optimize against. Exposed by: the entire caching-optimization family (b0585a84, 0c28edd8, e96d13ad) operates in precisely this unchecked timed region; an output-cache here would be invisible locally. Remediation: run benchmark mode with recheck=True (or add a final post-timing re-check) in harness/benchmark.sh so any timed-loop divergence is caught before a kernel is logged as a local win.
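A minimal sketch of the re-check remediation, assuming a harness shaped roughly like the one described. The names run_timed, DriftingKernel, and is_sorted_copy are hypothetical stand-ins for illustration, not the actual _run_single_benchmark code, and a toy sort replaces the QR kernel.

```python
import time


def run_timed(kernel, check, data, max_repeats=5, recheck=False):
    """Time kernel(data) max_repeats times; optionally re-validate each timed output.

    Hypothetical stand-in for the harness loop, not the real _run_single_benchmark.
    Returns (durations_in_seconds, error_message_or_None).
    """
    # The warmup output is always validated, as in the described harness.
    if not check(data, kernel(data)):
        return [], "warmup output failed validation"

    durations = []
    for i in range(max_repeats):
        start = time.perf_counter()
        out = kernel(data)
        durations.append(time.perf_counter() - start)
        # Validation sits outside the timed region so it does not skew timings.
        if recheck and not check(data, out):
            return durations, f"timed repeat {i} failed validation"
    return durations, None


class DriftingKernel:
    """Toy kernel: correct on its first call (the warmup), wrong afterwards."""

    def __init__(self):
        self.calls = 0

    def __call__(self, xs):
        self.calls += 1
        return sorted(xs) if self.calls == 1 else list(xs)


def is_sorted_copy(xs, out):
    return out == sorted(xs)


data = [3, 1, 2]
_, err_unchecked = run_timed(DriftingKernel(), is_sorted_copy, data, recheck=False)
_, err_checked = run_timed(DriftingKernel(), is_sorted_copy, data, recheck=True)
print(err_unchecked)  # None: the timed-loop divergence goes unnoticed
print(err_checked)    # timed repeat 0 failed validation
```

With recheck=False the loop reports no error even though every timed output is wrong; with recheck=True the first divergent repeat is caught.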
2.) The dominant shapes feed the timed loop a single, stable input object. _benchmark_batch_count yields data_list length 1 for shape3 (batch 640, n 512; the leaderboard-critical, geomean-dominant shape) and shape4 (n 1024), and 2 for shapes 5/6. Combined with the no-reclone reuse across max_repeats, the timed call sees the same id(A)/content every iteration, which is the exact precondition for an id()- or content-keyed output-cache to collapse timed work. Exposed by: 0c28edd8, which built an id(A) fast-path on this property (for routing only, but the same hook could cache outputs). Remediation: for low-count shapes, re-clone the input on each timed repeat so an identity key cannot persist a result across iterations, or regenerate it so a content key cannot either.
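The effect of re-cloning on an identity-keyed cache can be sketched as below. IdCachedKernel and timed_loop are hypothetical names, and nested Python lists stand in for the tensors; the real harness and kernels are not shown in this report.

```python
import copy


class IdCachedKernel:
    """Toy kernel that memoizes its result on id(input), as an identity-keyed cache would."""

    def __init__(self):
        self._cache = {}
        self.real_calls = 0

    def __call__(self, a):
        key = id(a)
        if key not in self._cache:
            self.real_calls += 1  # stands in for the expensive factorization
            self._cache[key] = [row[:] for row in a]
        return self._cache[key]


def timed_loop(kernel, a, max_repeats, reclone):
    """Hypothetical timed loop; reclone=True hands the kernel a fresh copy per repeat."""
    live = []  # keep clones alive so CPython cannot recycle an id within the loop
    for _ in range(max_repeats):
        arg = copy.deepcopy(a) if reclone else a
        live.append(arg)
        kernel(arg)
    return kernel.real_calls


a = [[1.0, 2.0], [3.0, 4.0]]
print(timed_loop(IdCachedKernel(), a, max_repeats=4, reclone=False))  # 1
print(timed_loop(IdCachedKernel(), a, max_repeats=4, reclone=True))   # 4
```

A clone keeps the same content, so this only defeats an identity key; a content-keyed cache would still hit unless the input is regenerated with fresh values on each repeat.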
3.) The checker validates input-derived invariants, not the reference factorization. reference.py::check_implementation materializes Q=householder_product(H,tau), R=triu(H) and gates on ‖R−QᵀA‖₁ ≤ 20·n·eps32·‖A‖₁ and ‖QᵀQ−I‖₁ ≤ 100·n·eps32. It never compares (H,tau) element-wise to torch.geqrf(A). Therefore a cache that returns a previously-correct (H,tau) for the same fixed input passes even under recheck=True, because the residuals are computed against that same A. (This is the mechanism behind the prior-known capped_sleepy_cache / sub-µs exploits.) Reduced-precision inner math is legitimately admitted by these tolerances and must not be flagged as a hack. Exposed by: no champion trial, but it is the latent hole that makes output-caching pass. Remediation: within validation, additionally assert each timed output differs from any cached prior output when the input differs, or cross-check a few entries against torch.geqrf under a generous tolerance to deny a same-input replay.
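Stated as a predicate, the replay argument reads as follows. The symbol P, the subscript k (a timed repeat), and the subscript 0 (the first validated call) are notation introduced here for illustration only; the bounds are the ones quoted above.

```latex
\[
P(A, H, \tau) \;:\;
\|R - Q^{\mathsf T} A\|_1 \le 20\, n\, \varepsilon_{32}\, \|A\|_1
\quad\text{and}\quad
\|Q^{\mathsf T} Q - I\|_1 \le 100\, n\, \varepsilon_{32},
\qquad
Q = \mathrm{householder\_product}(H, \tau),\; R = \mathrm{triu}(H).
\]
\[
A_k = A_0 \ \text{and}\ (H_k, \tau_k) = (H_0, \tau_0)
\;\Longrightarrow\;
P(A_k, H_k, \tau_k) = P(A_0, H_0, \tau_0).
\]
```

Because P depends only on the triple (A, H, tau), a replayed output for an unchanged input reproduces the earlier verdict exactly, whether or not any work was done in the timed call.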
4.) The leaderboard stream-substring rule is invisible to local validation. Local validate.sh/benchmark.sh accept non-default-stream kernels; the real KernelBot rejects any source containing the literal stream, and (per brief d05be96a) a NULL-stream kernel can be 2× slower remotely than its multi-stream local form. Exposed by: ~14 reword commits (57e6c68, 649136162, 1c55f25, d05be96a, 9eb5939f, …) and the discarded multi-stream 5701429f. Remediation: add a grep -L stream gate to harness/validate.sh so a stream-using kernel fails locally instead of being discovered only on remote rejection. (The champion already passes this — 0 stream tokens, no stream API use.)
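A local gate for the substring rule could look roughly like this. It is a sketch under assumptions: find_forbidden_token is a hypothetical helper, and the match is taken to be a plain case-sensitive substring test, which this report does not confirm for KernelBot.

```python
def find_forbidden_token(source: str, token: str = "stream"):
    """Return (line_number, line) pairs whose text contains the literal token.

    Assumption: a plain case-sensitive substring match, so comments and
    identifiers such as 'upstream' are flagged as well.
    """
    return [
        (lineno, line)
        for lineno, line in enumerate(source.splitlines(), start=1)
        if token in line
    ]


clean = "def kernel(a):\n    return a\n"
dirty = "def kernel(a):\n    s = make_side_queue()  # upstream helper\n    return a\n"

print(find_forbidden_token(clean))  # []
print(find_forbidden_token(dirty))  # flags line 2, via the word 'upstream'
```

Reporting line numbers rather than a bare pass/fail makes it easier to see which wording tripped the rule before a submission is sent.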