Skip to content

benchmark: replace simulation with walltime macro benches + e2e correctness#91

Open
natemoo-re wants to merge 6 commits into
mainfrom
ci-bench
Open

benchmark: replace simulation with walltime macro benches + e2e correctness#91
natemoo-re wants to merge 6 commits into
mainfrom
ci-bench

Conversation

@natemoo-re
Copy link
Copy Markdown
Member

@natemoo-re natemoo-re commented Jun 7, 2026

Our CodSpeed benchmarks were frequently inaccurate: they flagged big "regressions"/"improvements" on PRs that don't touch the measured code (a test-file rename alone showed −54% on input; "dashboard layout" swung 20×). And the simulation job's runtime was wildly unpredictable.

  • Both problems are baked into CodSpeed's Valgrind simulation of tiny (µs) benchmarks, confirmed by their own docs, and chasing them one bench at a time is whack-a-mole.
  • This PR drops the simulation job entirely and moves all perf gating to WallTime macro benchmarks: bigger, realistic workloads where the real work dominates and noise averages out (stable to <1%, predictable runtime).
  • Replacements:
    • input parsing → an event-loop correctness test + a throughput benchmark;
    • render/ops → WallTime macro benches (render dashboard frames, pack complex layouts).
  • Also pinned the CI build (fixed compiler, no stale wasm cache) so the baseline can't silently drift.
  • Net: green CI actually means green — no phantom red/green on unrelated PRs.

Drives createInput as a consumer would (read chunk, scan, flush a pending ESC, dispatch) and asserts the same event stream whether input is fed whole, byte-by-byte, or split mid-sequence. Covers the lone-ESC flush path.
In-process bench over a large mixed corpus, measured in WallTime where real work dominates placement/JIT/alloc noise (the Simulation micro-benches sit on codegen cliffs). Feeds the corpus in small reads: a single scan() drains at most 256 events (the wasm event-buffer cap), so small reads keep every call under the cap and process the whole corpus.
The Simulation input micro-benches (long input burst, printable ASCII single char) move by 50-90% on unrelated changes, even a test-only rename, because their simulated cost snaps to a different value when the combined wasm shifts. Input perf is now gated by the throughput WallTime bench; correctness by the event-loop integration test.
Pin ubuntu-24.04 so the wasm toolchain is stable and drop the wasm cache so main and PRs always rebuild identically. The cache froze main's baseline on a stale build, so every PR compared a fresh build against a stale baseline and produced phantom regressions.
@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented Jun 7, 2026

Open in StackBlitz

npm i https://pkg.pr.new/clayterm@91

commit: 154a293

@natemoo-re natemoo-re marked this pull request as draft June 7, 2026 02:35
@codspeed-hq
Copy link
Copy Markdown
Contributor

codspeed-hq Bot commented Jun 7, 2026

Merging this PR will not alter performance

✅ 5 untouched benchmarks
🆕 5 new benchmarks
⏩ 18 skipped benchmarks1

Performance Changes

Mode Benchmark BASE HEAD Efficiency
🆕 WallTime pack complex layout N/A 45 ms N/A
🆕 WallTime pack large list N/A 59.1 ms N/A
🆕 WallTime input throughput (mixed corpus, chunked read loop) N/A 9.6 ms N/A
🆕 WallTime render mixed frames N/A 36.5 ms N/A
🆕 WallTime render steady diff N/A 35.7 ms N/A

Comparing ci-bench (154a293) with main (48ff4bb)

Open in CodSpeed

Footnotes

  1. 18 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

CodSpeed's walltime tinybench plugin only populates result.latency on its async path; a sync task fn leaves it undefined and crashes (Cannot destructure 'min' of result.latency). startup.bench works because its tasks return a promise (spawnFixture). Return Promise.resolve() from the throughput task so the plugin takes the async path — a bare async fn with no await would trip deno's require-await lint. The walltime job runs startup and throughput as separate node processes.
@natemoo-re natemoo-re changed the title Replace cliff-prone input micro-benches with an E2E test + macro throughput bench Replace flaky input benches with an E2E test + macro throughput bench Jun 7, 2026
… job

CodSpeed Simulation (Valgrind) is unviable for CI here: flaky measurements (dashboard layout swung 20x, diff render 17% on changes that touch no render code) and unpredictable runtime — the same commit's simulation job finished in ~2 min one run and hung past 30 min the next. Convert render/ops to ms-scale WallTime macro benches (looped, promise-returning, ~7-11ms at <1% variance) run as separate node processes in the walltime job, and drop the simulation job entirely. mod.ts is now a local aggregator for deno task bench.
@natemoo-re natemoo-re changed the title Replace flaky input benches with an E2E test + macro throughput bench Make benchmarks reliable: WallTime macro benches + E2E correctness (retire Simulation) Jun 7, 2026
@natemoo-re natemoo-re marked this pull request as ready for review June 7, 2026 05:05
@natemoo-re natemoo-re changed the title Make benchmarks reliable: WallTime macro benches + E2E correctness (retire Simulation) benchmark: replace simulation with walltime macro benches + e2e correctness Jun 7, 2026
@natemoo-re natemoo-re requested review from cowboyd and rauhryan June 7, 2026 22:37
Copy link
Copy Markdown
Collaborator

@jbolda jbolda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change to walltime makes sense. We were just discussing this aspect of codspeed on Thursday for some effection code. Might be worth just generally writing down the whys and whats for some of the decisions we made though!

Comment thread bench/ops.bench.ts
pack(complexOps, buf, 0);
.add("pack complex layout", () => {
for (let i = 0; i < 1500; i++) pack(complexOps, buf, 0);
return Promise.resolve();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a comment or perhaps a README.md in this folder explaining why this is required? I am assuming it is to force a last microtask or something? I don't see anything within the tinybench API that would point to the value of doing this.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll add that! It's a limitation of the @codspeed/plugin-tinybench package, which only expects the task to be async. With synchronous tasks, it throws an error like "cannot destructure 'latency' from undefined"

Comment thread test/input-loop.test.ts
Comment on lines +82 to +86
new Uint8Array([0x1b, 0x5b, 0x41]), // ArrowUp
str("\x1b[<0;35;12M"), // SGR mouse press
new Uint8Array([0xe4, 0xb8, 0xad]), // 中
str("\x1b[97;3u"), // Kitty a+Alt
new Uint8Array([0xf0, 0x9f, 0x8e, 0x89]), // 🎉
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rather than using comments, might as well have functions:

arrowUp(),
mousePress(),

morre re-usable, descriptive and repeatable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants