benchmark: replace simulation with walltime macro benches + e2e correctness #91
natemoo-re wants to merge 6 commits into
Drives createInput as a consumer would (read chunk, scan, flush a pending ESC, dispatch) and asserts the same event stream whether input is fed whole, byte-by-byte, or split mid-sequence. Covers the lone-ESC flush path.
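The chunking-invariance property this test asserts can be sketched in isolation. The example below does not use the project's `createInput`; it substitutes the standard `TextDecoder` in streaming mode as a stand-in for any stateful consumer that has to carry a partial sequence across reads. The helper names `decodeChunks`, `splitEvery`, and `splitAt` are hypothetical.

```typescript
// Stand-in consumer (NOT the project's createInput): a streaming UTF-8
// decoder that must carry partial sequences from one read to the next.
function decodeChunks(chunks: Uint8Array[]): string {
  const decoder = new TextDecoder("utf-8");
  let out = "";
  for (const chunk of chunks) out += decoder.decode(chunk, { stream: true });
  // Final call with no input flushes any pending state.
  return out + decoder.decode();
}

// Split `bytes` into consecutive chunks of at most `size` bytes.
function splitEvery(bytes: Uint8Array, size: number): Uint8Array[] {
  const chunks: Uint8Array[] = [];
  for (let i = 0; i < bytes.length; i += size) {
    chunks.push(bytes.subarray(i, i + size));
  }
  return chunks;
}

// Split `bytes` into exactly two chunks at `index`.
function splitAt(bytes: Uint8Array, index: number): Uint8Array[] {
  return [bytes.subarray(0, index), bytes.subarray(index)];
}

// Mixed input: ASCII, a 3-byte character, a 4-byte character, an escape sequence.
const input = new TextEncoder().encode("a中🎉\x1b[A");

const whole = decodeChunks([input]);
const byteByByte = decodeChunks(splitEvery(input, 1));
// Every possible two-way split, including splits inside a multi-byte sequence.
const splitResults = Array.from(
  { length: input.length + 1 },
  (_, i) => decodeChunks(splitAt(input, i)),
);
```

The same three feeding strategies (whole, byte-by-byte, split at every offset) would apply to an event stream instead of a string; only the comparison changes.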
In-process bench over a large mixed corpus, measured in WallTime where real work dominates placement/JIT/alloc noise (the Simulation micro-benches sit on codegen cliffs). Feeds the corpus in small reads: a single scan() drains at most 256 events (the wasm event-buffer cap), so small reads keep every call under the cap and process the whole corpus.
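The reasoning about read size can be illustrated with a toy model. The sketch below assumes a worst case of one event per byte and a hypothetical `fakeScan` that drains at most 256 events per call and leaves the rest undrained; the real `scan()` may behave differently past the cap. Only the 256 figure comes from the commit message.

```typescript
const EVENT_CAP = 256; // event-buffer cap stated in the commit message

// Hypothetical stand-in for scan(): worst case of one event per byte,
// with anything past the cap left undrained in this model.
function fakeScan(chunk: Uint8Array): number {
  return Math.min(chunk.length, EVENT_CAP);
}

// Feed `corpus` in reads of `readSize` bytes; return total events drained.
function drain(corpus: Uint8Array, readSize: number): number {
  let events = 0;
  for (let offset = 0; offset < corpus.length; offset += readSize) {
    events += fakeScan(corpus.subarray(offset, offset + readSize));
  }
  return events;
}

const corpus = new Uint8Array(10_000);

// One large read hits the cap and drains only a fraction of the corpus.
const oneBigRead = drain(corpus, corpus.length);

// Reads no larger than the cap can never exceed it, so nothing is lost.
const smallReads = drain(corpus, 64);
```

In this model any read size up to 256 bytes is safe, because a read cannot produce more events than it has bytes.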
The Simulation input micro-benches (long input burst, printable ASCII single char) move by 50-90% on unrelated changes, even a test-only rename, because their simulated cost snaps to a different value when the combined wasm shifts. Input perf is now gated by the throughput WallTime bench; correctness by the event-loop integration test.
Pin ubuntu-24.04 so the wasm toolchain is stable and drop the wasm cache so main and PRs always rebuild identically. The cache froze main's baseline on a stale build, so every PR compared a fresh build against a stale baseline and produced phantom regressions.
Merging this PR will not alter performance
CodSpeed's walltime tinybench plugin only populates result.latency on its async path; a sync task fn leaves it undefined and crashes (Cannot destructure 'min' of result.latency). startup.bench works because its tasks return a promise (spawnFixture). Return Promise.resolve() from the throughput task so the plugin takes the async path — a bare async fn with no await would trip deno's require-await lint. The walltime job runs startup and throughput as separate node processes.
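The task shape described here can be shown without the benchmark library. This is a minimal sketch: `work` is a hypothetical stand-in for the synchronous throughput loop, and nothing below depends on how the plugin itself detects async tasks.

```typescript
// Hypothetical synchronous workload standing in for the throughput loop.
function work(): number {
  let sum = 0;
  for (let i = 0; i < 1_000; i++) sum += i;
  return sum;
}

let calls = 0;

// Synchronous body, promise return. There is no `async` keyword, so a
// require-await lint rule has nothing to flag, yet the caller still
// receives a Promise.
const task = (): Promise<void> => {
  work();
  calls++;
  return Promise.resolve();
};

const returned = task();
const isPromise = returned instanceof Promise;
```

A function of this shape is what gets passed to `.add(name, fn)`, as in the `pack complex layout` bench in this PR.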
… job CodSpeed Simulation (Valgrind) is unviable for CI here: flaky measurements (dashboard layout swung 20x, diff render 17% on changes that touch no render code) and unpredictable runtime — the same commit's simulation job finished in ~2 min one run and hung past 30 min the next. Convert render/ops to ms-scale WallTime macro benches (looped, promise-returning, ~7-11ms at <1% variance) run as separate node processes in the walltime job, and drop the simulation job entirely. mod.ts is now a local aggregator for deno task bench.
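A looped, promise-returning macro bench can be sketched as a small wrapper. The `looped` helper below is hypothetical (the PR inlines the loop in each task), and the timing shown is a plain `performance.now()` measurement rather than what CodSpeed records. The iteration count 1500 is taken from the `pack complex layout` bench.

```typescript
// Hypothetical helper: wrap a sync function so that one task invocation
// runs it `iterations` times and then returns a resolved promise.
function looped(iterations: number, fn: () => void): () => Promise<void> {
  return () => {
    for (let i = 0; i < iterations; i++) fn();
    return Promise.resolve();
  };
}

let runs = 0;
const task = looped(1500, () => {
  runs++; // stand-in for the real work, e.g. one pack() call
});

// The loop runs synchronously inside task(), so the elapsed time below
// covers all 1500 iterations even though the result is a promise.
const start = performance.now();
const pending = task();
const elapsedMs = performance.now() - start;
```

Raising the iteration count until one invocation takes several milliseconds is what moves the measurement away from timer resolution and scheduling noise.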
jbolda left a comment
The change to walltime makes sense. We were just discussing this aspect of codspeed on Thursday for some effection code. Might be worth just generally writing down the whys and whats for some of the decisions we made though!
pack(complexOps, buf, 0);
.add("pack complex layout", () => {
  for (let i = 0; i < 1500; i++) pack(complexOps, buf, 0);
  return Promise.resolve();
Could we add a comment or perhaps a README.md in this folder explaining why this is required? I am assuming it is to force a last microtask or something? I don't see anything within the tinybench API that would point to the value of doing this.
Yes, I'll add that! It's a limitation of the @codspeed/tinybench-plugin package, which only populates result.latency when the task is async. With a synchronous task it throws an error like "Cannot destructure 'min' of result.latency".
new Uint8Array([0x1b, 0x5b, 0x41]), // ArrowUp
str("\x1b[<0;35;12M"), // SGR mouse press
new Uint8Array([0xe4, 0xb8, 0xad]), // 中
str("\x1b[97;3u"), // Kitty a+Alt
new Uint8Array([0xf0, 0x9f, 0x8e, 0x89]), // 🎉
rather than using comments, might as well have functions:
arrowUp(),
mousePress(),
more reusable, descriptive, and repeatable
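One way the suggestion could look, as a sketch only: the byte values are copied from the diff above, `arrowUp()` and `mousePress()` are the reviewer's names, and `kittyAltA`, `cjk`, `party`, and `concat` are hypothetical. The `str` helper is assumed to be a UTF-8 encode; the PR's own definition is not shown.

```typescript
const encoder = new TextEncoder();

// Assumed equivalent of the diff's `str` helper: UTF-8 encode a string.
const str = (s: string): Uint8Array => encoder.encode(s);

// Named fixtures; each call returns a fresh buffer so tests cannot
// mutate shared state.
const arrowUp = (): Uint8Array => new Uint8Array([0x1b, 0x5b, 0x41]); // ESC [ A
const mousePress = (): Uint8Array => str("\x1b[<0;35;12M"); // SGR mouse press
const kittyAltA = (): Uint8Array => str("\x1b[97;3u"); // Kitty a+Alt
const cjk = (): Uint8Array => new Uint8Array([0xe4, 0xb8, 0xad]); // 中
const party = (): Uint8Array => new Uint8Array([0xf0, 0x9f, 0x8e, 0x89]); // 🎉

// Concatenate fixtures into a single input buffer.
function concat(...parts: Uint8Array[]): Uint8Array {
  const out = new Uint8Array(parts.reduce((n, p) => n + p.length, 0));
  let offset = 0;
  for (const p of parts) {
    out.set(p, offset);
    offset += p.length;
  }
  return out;
}

const corpus = concat(arrowUp(), mousePress(), cjk(), kittyAltA(), party());
```

The function names then carry what the trailing comments carried in the diff, and the same fixtures can be shared between the integration test and the throughput bench.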
Our CodSpeed benchmarks were frequently inaccurate: they flagged big "regressions"/"improvements" on PRs that don't touch the measured code (a test-file rename alone showed −54% on input; "dashboard layout" swung 20×). And the simulation job's runtime was wildly unpredictable.