Skip to content

linalg/qr validation gaps #148

@brycelelbach

Description

@brycelelbach

A report from my agents on validation gaps in the new QR problem:

Validation gaps

  • 1.) benchmark mode never re-checks timed outputs (recheck=False). _run_single_benchmark validates only the pre-timing warmup outputs; the timed for i in range(max_repeats) loop computes outputs but re-validates them only when recheck=True, which is set only in leaderboard mode — never in the local benchmark.sh workers optimize against. Exposed by: the entire caching-optimization family (b0585a84, 0c28edd8, e96d13ad) operates in precisely this unchecked timed region; an output-cache here would be invisible locally. Remediation: run benchmark mode with recheck=True (or add a final post-timing re-check) in harness/benchmark.sh so any timed-loop divergence is caught before a kernel is logged as a local win.
  • 2.) The dominant shapes feed the timed loop a single, stable input object. _benchmark_batch_count yields data_list length 1 for shape3 (batch 640, n 512 — the leaderboard-critical, geomean-dominant shape) and shape4 (n 1024), and 2 for shapes 5/6. Combined with the no-reclone reuse across max_repeats, the timed call sees the same id(A)/content every iteration — the exact precondition for an id()- or content-keyed output-cache to collapse timed work. Exposed by: 0c28edd8, which built an id(A) fast-path on this property (for routing only — but the same hook could cache outputs). Remediation: re-clone (or regenerate) the input on each timed repeat for low-count shapes, so a content/identity key cannot persist a result across iterations.
  • 3.) The checker validates input-derived invariants, not the reference factorization. reference.py::check_implementation materializes Q=householder_product(H,tau), R=triu(H) and gates on ‖R−QᵀA‖₁ ≤ 20·n·eps32·‖A‖₁ and ‖QᵀQ−I‖₁ ≤ 100·n·eps32. It never compares (H,tau) element-wise to torch.geqrf(A). Therefore a cache that returns a previously-correct (H,tau) for the same fixed input passes even under recheck=True — the residuals are computed against that same A. (This is the mechanism behind the prior-known capped_sleepy_cache / sub-µs exploits.) Reduced-precision inner math is legitimately admitted by these tolerances and must not be flagged as a hack. Exposed by: no champion trial — but it is the latent hole that makes output-caching pass. Remediation: within validation, additionally assert each timed output differs from any cached prior output when the input differs, or cross-check a few entries against torch.geqrf under a generous tolerance to deny a same-input replay.
  • 4.) The leaderboard stream-substring rule is invisible to local validation. Local validate.sh/benchmark.sh accept non-default-stream kernels; the real KernelBot rejects any source containing the literal stream, and (per brief d05be96a) a NULL-stream kernel can be 2× slower remotely than its multi-stream local form. Exposed by: ~14 reword commits (57e6c68, 649136162, 1c55f25, d05be96a, 9eb5939f, …) and the discarded multi-stream 5701429f. Remediation: add a grep -L stream gate to harness/validate.sh so a stream-using kernel fails locally instead of being discovered only on remote rejection. (The champion already passes this — 0 stream tokens, no stream API use.)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions