linalg/qr validation gaps

A report from my agents on validation gaps in the new QR problem:

## Validation gaps

- [ ] 1.) **`benchmark` mode never re-checks timed outputs (`recheck=False`).** `_run_single_benchmark` validates only the pre-timing warmup outputs. The timed `for i in range(max_repeats)` loop computes outputs but re-validates them only when `recheck=True`, which is set **only** in `leaderboard` mode and never in the local `benchmark.sh` that workers optimize against. **Exposed by:** the entire caching-optimization family (`b0585a84`, `0c28edd8`, `e96d13ad`) operates in precisely this unchecked timed region; an output cache here would be invisible locally. **Remediation:** run `benchmark` mode with `recheck=True` (or add a final post-timing re-check) in `harness/benchmark.sh` so any timed-loop divergence is caught before a kernel is logged as a local win.
- [ ] 2.) **The dominant shapes feed the timed loop a single, stable input object.** `_benchmark_batch_count` yields a `data_list` of length **1** for shape3 (batch 640, n 512; the leaderboard-critical, geomean-dominant shape) and shape4 (n 1024), and of length **2** for shapes 5/6. Because inputs are also reused without re-cloning across `max_repeats`, the timed call sees the *same* `id(A)` and content on every iteration, which is the exact precondition for an `id()`- or content-keyed output cache to collapse timed work. **Exposed by:** `0c28edd8`, which built an `id(A)` fast path on this property (for routing only, but the same hook could cache outputs). **Remediation:** re-clone (or regenerate) the input on each timed repeat for low-`count` shapes, so a content/identity key cannot persist a result across iterations.
- [ ] 3.) **The checker validates input-derived invariants, not the reference factorization.** `reference.py::check_implementation` materializes `Q=householder_product(H,tau)`, `R=triu(H)` and gates on `‖R−QᵀA‖₁ ≤ 20·n·eps32·‖A‖₁` and `‖QᵀQ−I‖₁ ≤ 100·n·eps32`. It never compares `(H,tau)` element-wise to `torch.geqrf(A)`. Therefore **a cache that returns a previously-correct `(H,tau)` for the same fixed input passes even under `recheck=True`**, because the residuals are computed against that same `A`. (This is the mechanism behind the prior-known `capped_sleepy_cache` / sub-µs exploits.) Reduced-precision inner math is *legitimately* admitted by these tolerances and must **not** be flagged as a hack. **Exposed by:** no champion trial, but it is the latent hole that makes output-caching pass. **Remediation:** within validation, additionally assert that each timed output differs from any cached prior output when the input differs, or cross-check a few entries against `torch.geqrf` under a generous tolerance to deny a same-input replay.
- [ ] 4.) **The leaderboard `stream`-substring rule is invisible to local validation.** Local `validate.sh`/`benchmark.sh` accept non-default-stream kernels; the real KernelBot rejects any source containing the literal `stream`, and (per brief `d05be96a`) a NULL-stream kernel can be 2× slower remotely than its multi-stream local form. **Exposed by:** ~14 reword commits (`57e6c68`, `649136162`, `1c55f25`, `d05be96a`, `9eb5939f`, …) and the discarded multi-stream `5701429f`. **Remediation:** add a `grep -L stream` gate to `harness/validate.sh` so a stream-using kernel fails locally instead of being discovered only on remote rejection. (The champion already passes this — 0 `stream` tokens, no stream API use.)
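
The interaction between gaps 1 to 3 can be illustrated with a toy model of the timed loop. This is only a sketch under stated assumptions: `timed_loop`, `make_id_cached`, `toy_kernel`, and `toy_check` are hypothetical names, nested lists stand in for tensors, and a row-sum stands in for the factorization. Only the `reclone`/`recheck` behaviour mirrors what the report describes; it is not the harness's actual code.

```python
import copy
from typing import Callable, Dict, List, Tuple

Matrix = List[List[float]]


def toy_kernel(a: Matrix) -> List[float]:
    """Stand-in for the real kernel: one row sum per row."""
    return [sum(row) for row in a]


def toy_check(a: Matrix, out: List[float]) -> bool:
    """Stand-in checker: validates an invariant derived from the input only."""
    return abs(sum(out) - sum(sum(row) for row in a)) <= 1e-9


def make_id_cached(kernel: Callable[[Matrix], List[float]]):
    """Wrap `kernel` with an output cache keyed on id(input).

    The cache keeps a reference to each input so that ids cannot be
    recycled by the interpreter while an entry is alive.
    """
    cache: Dict[int, Tuple[Matrix, List[float]]] = {}
    stats = {"hits": 0, "misses": 0}

    def wrapped(a: Matrix) -> List[float]:
        key = id(a)
        if key in cache:
            stats["hits"] += 1
            return cache[key][1]
        stats["misses"] += 1
        out = kernel(a)
        cache[key] = (a, out)
        return out

    return wrapped, stats


def timed_loop(kernel, a: Matrix, check, max_repeats: int = 5,
               reclone: bool = False, recheck: bool = False) -> int:
    """Run `kernel` max_repeats times; return the number of failed re-checks."""
    failures = 0
    for _ in range(max_repeats):
        x = copy.deepcopy(a) if reclone else a  # fresh object only if requested
        out = kernel(x)
        if recheck and not check(x, out):
            failures += 1
    return failures


a = [[1.0, 2.0], [3.0, 4.0]]

# Stable input: the cache serves every repeat after the first, and the
# re-check still passes because it is evaluated against the same input.
cached, stats = make_id_cached(toy_kernel)
print(timed_loop(cached, a, toy_check, recheck=True), stats)

# Re-cloned input: every repeat sees a new object, so the id key never hits.
cached, stats = make_id_cached(toy_kernel)
print(timed_loop(cached, a, toy_check, reclone=True, recheck=True), stats)
```

In the stable-input run the re-check reports no failures even though four of the five repeats did no work, which is the gap 3 behaviour. Re-cloning changes only object identity; a cache keyed on content would still hit unless the input data itself is regenerated.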

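The local gate proposed in gap 4 could look roughly like the following. It is a minimal sketch, assuming the remote rule is a plain case-sensitive substring match on the submitted source, comments included; `find_forbidden` is a hypothetical helper and not part of the existing harness.

```python
from typing import List

FORBIDDEN_TOKEN = "stream"  # literal named in the report; case-sensitivity is an assumption


def find_forbidden(source: str, token: str = FORBIDDEN_TOKEN) -> List[int]:
    """Return the 1-based numbers of all lines in `source` that contain `token`.

    The match is a raw substring test, so comments and identifiers count too.
    """
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if token in line
    ]


sample = "import torch\n# uses a side stream\ny = 1\n"
offending = find_forbidden(sample)
if offending:
    print(f"rejected: '{FORBIDDEN_TOKEN}' found on lines {offending}")
```

A validation script could call this on the kernel source and exit non-zero when the returned list is non-empty, so the rejection surfaces before a remote submission.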
linalg/qr validation gaps #148
