Add live-path benchmark harness for transcription-quality tuning by Vortiago · Pull Request #69 · Vortiago/TapScribe

Vortiago · 2026-05-26T07:39:59Z

Update (2026-05-26): the "red baseline" was a mislabeled fixture, not broken live

Running the sweep on a real-model box surfaced ~0.92 WER on armstrong-en across every model and gate, including backend-gate mode at 100% pass-through. That uniformity was the tell: the cause was not the gate (backend mode forwards everything) and not the model (WER was flat across tiny/base/small.en). The transcripts were actually coherent and accurate; they just didn't match the reference:

base.en → "I'm going to step off the limb now" — a near-perfect transcription of Armstrong's lead-in "I'm going to step off the LM now," missing only the acronym.

The fixture was the first ~12 s of the source recording — that lead-in utterance — not the iconic line the reference claimed. The benchmark was scoring good transcripts against the wrong text. The live pipeline transcribes accurately.

Resolved in this PR:

tools/recut_armstrong.py + regenerated armstrong-en.wav — the clip now contains "That's one small step…" (boundaries located by word-timestamp transcription, not a fixed offset), matching the reference. README provenance corrected.
Per-language sweep: marlene-nb (Norwegian) was being run through English-only .en models, so those rows were also noise rather than a measurement. The sweep now uses the nb-whisper-* family for Norwegian fixtures (_sweep_matrix_for).
live_relay — fixed a latent bug where an already-committed (non-tail) line that grows in a later snapshot had its new suffix dropped until close, then delivered out of order. Now re-scans non-tail positions each snapshot (idempotent dedup keeps it cheap); red-green regression test added.
The "21% gate / clipping speech / tiny.en suspect" findings below were premised on the bogus fixture. The gate's forward-rate on continuous speech may still merit a look, but armstrong was never evidence of it.

Next: re-run --sweep on the corrected fixtures, then tighten _MAX_WER in test_live_quality.py to the confirmed clean base.en number.

Original description (kept for history; its "Early findings" are superseded by the update above):

Why

The live transcription path is its own pipeline, separate from the batch path that tools/bench_backends.py measures:

WAV → 20 ms PCM frames → SpeechGate (Silero VAD) → WlKRelay
    → whisperlivekit-server subprocess → settled lines

Live captions have been poor and we had no way to measure why. This adds a harness that pipes a known audio file through the real production live objects so it gets segmented and decoded exactly as in production, captures the captions, and scores them — establishing a baseline ("red") so the data, not a guess, drives what to tune.

What's here

tools/bench_live.py — drives WhisperLiveKitChannel + build_gate_for_config + WlKRelay (no FastAPI server). Single-config mode and --sweep (config matrix over every fixture → bench-results/*.json). Frames are paced in real time by default for fidelity.
Broad metric set (the point of the exercise): WER/CER plus the substitution / deletion / insertion split (subs ⇒ weak model, deletions ⇒ dropped/clipped speech, insertions ⇒ hallucination), decode lag, finalization delay, and gate pass-through %.
tests/e2e/test_live_quality.py — a real_live-marked gate that runs the harness on armstrong-en and asserts a WER threshold. Skips cleanly when whisperlivekit / faster-whisper / jiwer or the model aren't available. This is the red→green target.
docs/live-tuning-research.md — compares our live defaults to WhisperLiveKit/Silero recommendations and lists hypotheses to test.
pyproject.toml — [bench] extra (jiwer) + the real_live marker. .gitignore — bench-results/.
tapscribe/live.py — closes the WlK child's stdout pipe at log-pump EOF instead of leaving it for GC (leaked an fd per live-child restart; also tripped ResourceWarning-as-error under the harness).
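
To make the substitution / deletion / insertion split concrete, here is a minimal pure-Python sketch of a word-level alignment. It is not the harness's scorer (the [bench] extra pulls in jiwer for that); `wer_breakdown` and its lowercase, whitespace-split normalization are assumptions made for illustration only.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Word-level Levenshtein alignment with a sub/del/ins split (illustrative)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Each cell is (edits, subs, dels, ins) for ref[:i] vs hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]  # empty ref: all insertions
    for i in range(1, len(ref) + 1):
        cur = [(i, 0, i, 0)]  # empty hyp: all deletions
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cur.append(prev[j - 1])  # a match costs nothing
                continue
            e, s, d, n = prev[j - 1]
            sub = (e + 1, s + 1, d, n)
            e, s, d, n = prev[j]
            dele = (e + 1, s, d + 1, n)
            e, s, d, n = cur[j - 1]
            ins = (e + 1, s, d, n + 1)
            cur.append(min(sub, dele, ins))  # fewest total edits wins
        prev = cur
    edits, subs, dels, ins = prev[-1]
    return {"wer": edits / max(len(ref), 1), "subs": subs, "dels": dels, "ins": ins}


# The lead-in quoted in the update: "LM" heard as "limb".
print(wer_breakdown("I'm going to step off the LM now",
                    "I'm going to step off the limb now"))
```

Read against the list above, a result dominated by deletions would suggest clipped speech, while one dominated by insertions would suggest hallucination.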

Test plan

Unit suite green (pytest --ignore=tests/e2e, minus the pre-existing hypothesis-not-installed module).
Live/relay/gate unit tests green; live.py fd-close change non-regressing.
Scoring, fixture discovery, framing, and Silero gate construction validated without a model.
ruff check (CI scope) + ruff format --check clean; real_live test skips cleanly.
Full sweep run on a real-model box — surfaced the mislabeled-fixture issue (see update above), now fixed.

🤖 Generated with Claude Code


The live transcription path (WhisperLiveKit subprocess + Silero SpeechGate + WlKRelay) is distinct from the batch path that tools/bench_backends.py measures, and we had no way to evaluate its caption quality on known audio. This adds tools/bench_live.py, which drives the real production live objects so audio is segmented and decoded exactly as in production, captures the settled captions, and scores them with a broad metric set (WER/CER plus the substitution/deletion/insertion breakdown, decode lag, finalization delay, and gate pass-through) so the numbers reveal what to tune rather than guessing. Supports single-config and --sweep matrix modes. Also adds a real_live pytest gate that runs the harness on a fixture and asserts a WER threshold (skips cleanly when models/deps are absent), a research note comparing our defaults to WhisperLiveKit/Silero recommendations, and a jiwer-only [bench] extra. Fixes an fd leak in WhisperLiveKitChannel: the log-pump thread now closes the child's stdout pipe at EOF instead of leaving it for GC, which otherwise leaks a descriptor per live-child restart (and tripped ResourceWarning-as-error under the harness). https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
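
The fd-leak fix can be pictured with a small stand-alone sketch. It assumes a log-pump thread that iterates the child's stdout; `pump_logs` and `run_child_and_pump` are hypothetical names, and the child here is a throwaway Python process rather than whisperlivekit-server.

```python
import subprocess
import sys
import threading


def pump_logs(proc: subprocess.Popen, sink: list) -> None:
    """Drain the child's stdout, then close the pipe explicitly at EOF."""
    stdout = proc.stdout
    try:
        for line in stdout:  # iteration ends when the child closes its end
            sink.append(line.rstrip("\n"))
    finally:
        stdout.close()  # release the descriptor now instead of waiting for GC


def run_child_and_pump() -> tuple:
    # Stand-in child; the real one would be the live server subprocess.
    proc = subprocess.Popen(
        [sys.executable, "-c", "print('ready')"],
        stdout=subprocess.PIPE,
        text=True,
    )
    lines: list = []
    pump = threading.Thread(target=pump_logs, args=(proc, lines), daemon=True)
    pump.start()
    pump.join()
    proc.wait()
    return lines, proc.stdout.closed
```

Closing in `finally` means the descriptor is released even if the pump loop raises, which matters when every restart spawns a new child.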

pip-audit flags fastapi 0.136.3 (the latest release) under MAL-2026-4750, a malicious-package advisory in the OSV malicious-packages dataset that matches the name `fastapi`. It does not describe a vulnerability in the legitimate PyPI fastapi we depend on — the dataset tracks typosquats / backdoored impostors — and records no fixed version, so there is nothing to bump to. Exclude the advisory ID with a written justification, mirroring the existing torch/silero audit exclusion. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd

The SpeechGate runs locally (Silero needs no network), so --gate-only measures how much audio each gate config forwards without an ASR model — runnable on boxes that can't download weights. Real data on the fixtures shows the default gate forwards only ~21% of armstrong-en (continuous speech) as a single 2.2s segment, vs ~92% of the clean marlene-nb reading: the gate collapses on noisy/low-level continuous audio, the strongest lead for poor live captions. Captured in the research note. Also generalize the full-pipeline sweep to apply arbitrary LiveConfig overrides via dataclasses.replace (the matrix previously silently ignored confidence_validation and gate knobs) and expand it to cover the confidence/accuracy trade and the gate A/B. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
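
A rough sketch of the override mechanism, assuming a frozen dataclass config. The `LiveConfig` fields shown are a hypothetical subset (only `confidence_validation` is named in this PR), and `apply_overrides` and `SWEEP` are illustrative names.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LiveConfig:
    # Hypothetical subset of fields; the real LiveConfig has more.
    model: str = "base.en"
    confidence_validation: bool = True
    gate: str = "silero"


def apply_overrides(base: LiveConfig, overrides: dict) -> LiveConfig:
    """Return a copy of `base` with `overrides` applied."""
    # dataclasses.replace raises TypeError for a key that is not a field,
    # so a misspelled knob fails loudly instead of being ignored.
    return replace(base, **overrides)


SWEEP = [
    {"model": "tiny.en"},
    {"confidence_validation": False},
    {"gate": "backend"},
]
configs = [apply_overrides(LiveConfig(), row) for row in SWEEP]
```

The design point is that each matrix row is just a dict of field overrides, so adding a knob to the sweep needs no new plumbing.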

The consumer cached a lower-bound scan index (_last_emit_scan_upto) on the assumption that WlK only ever mutates the current tail line, so once a newer line appeared a position was frozen. A late LocalAgreement correction that grows an already-committed line then had its new suffix silently dropped mid-session and only recovered (out of order) at close. Re-scan all non-tail positions each snapshot instead. _consider_emit_line is (speaker,start)-keyed and idempotent, so re-walking settled lines is a dict lookup per line — negligible at WlK's few-Hz tick rate. Adds a red-green regression test driving a non-tail line that grows after a newer line exists. Verified separately that the speech-gate "end-frame drop" and missing speech_pad flagged in review are intentional (pinned by test_gate_closes_on_vad_end_event with a documented hangover rationale), so those are left as-is. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
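
The re-scan idea, sketched under simplifying assumptions: lines are keyed by (speaker, start), the last line of a snapshot is the tail, and a committed line only ever grows. `SettledLineEmitter` and `Line` are hypothetical stand-ins, not the relay's actual `_consider_emit_line`.

```python
from typing import NamedTuple


class Line(NamedTuple):
    speaker: int
    start: float
    text: str


class SettledLineEmitter:
    """Illustrative consumer: re-scan every non-tail line on each snapshot."""

    def __init__(self) -> None:
        self._emitted: dict = {}  # (speaker, start) -> text already delivered
        self.out: list = []

    def on_snapshot(self, lines: list) -> None:
        # The last line is the tail and may still change, so it is held back.
        for line in lines[:-1]:
            key = (line.speaker, line.start)
            seen = self._emitted.get(key, "")
            if line.text == seen:
                continue  # idempotent: an unchanged line costs one dict lookup
            if line.text.startswith(seen):
                # Deliver only the newly grown suffix, in order.
                self.out.append(line.text[len(seen):].strip())
                self._emitted[key] = line.text
            # Rewrites that are not pure growth are ignored in this sketch.
```

With a cached lower-bound index, the third snapshot in the test below would skip position 0 and the "step" suffix would be lost until close.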

The sweep applied an English-only matrix (tiny/base/small.en) to every fixture, including the Norwegian marlene-nb clip. English-only Whisper cannot transcribe Norwegian regardless of --lan, so those rows scored noise (>1.0 WER) and measured nothing. Split the matrix per language: SWEEP_MATRIX_EN keeps the .en sweep for English fixtures, SWEEP_MATRIX_NB sweeps the NB-Whisper family for Norwegian. _sweep_matrix_for picks by fixture language. The live channel already auto-downloads nb-whisper CT2 weights and build_live_cmd routes them via --model-path + localagreement (every other sweep knob still applies), so no further wiring was needed. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd

The armstrong-en fixture was the "first ~12 s" of the source recording, which is the lead-in utterance ("I'm going to step off the LM now") — NOT the iconic line in the reference transcript. The live benchmark therefore scored accurate transcriptions against a reference the audio never spoke (~0.92 WER across every model and gate), which read as a broken pipeline but was a mislabeled fixture. Add tools/recut_armstrong.py: it locates the iconic line by word-timestamp transcription (not a fixed offset) and regenerates the WAV (~9 s, 14.0-23.3 s of the source), self-verifying by re-transcribing the cut. Update the README provenance to match. The regenerated armstrong-en.wav is committed separately (binary, produced on a networked host). https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd

Replace the mislabeled clip (the "step off the LM now" lead-in) with the segment that actually matches the reference — "That's one small step for man, one giant leap for mankind." Produced from the source OGG at the word-timestamp-located window (14.05-23.29s of the 24.1s recording, ~9.24s @ 16 kHz mono int16), verified by re-transcription to read exactly as the reference. Also correct the now-stale rationale comment in test_live_quality.py: the ~0.92 WER that motivated the loose 0.5 bar was the mislabeled fixture, not a model deficiency, so the bar should be tightened once a clean-fixture sweep confirms the real base.en number. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
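
For orientation, a sketch of locating a phrase from word timestamps and checking the frame count of the resulting window. This is not tools/recut_armstrong.py; `locate_phrase` and `frame_count` are hypothetical helpers, and the (token, start, end) tuple shape is an assumption about what a word-timestamp transcription yields.

```python
SAMPLE_RATE = 16_000  # Hz, mono int16 per the commit message
FRAME_MS = 20         # live-path frame size


def locate_phrase(words: list, phrase: str) -> tuple:
    """Return (start_s, end_s) of the first run of `words` matching `phrase`.

    `words` is a list of (token, start_s, end_s) tuples.
    """
    target = phrase.lower().split()
    tokens = [w.lower().strip(".,!?") for w, _, _ in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i][1], words[i + len(target) - 1][2]
    raise ValueError("phrase not found in transcription")


def frame_count(start_s: float, end_s: float) -> int:
    """Number of whole 20 ms frames in the window."""
    samples = round((end_s - start_s) * SAMPLE_RATE)
    return samples // (SAMPLE_RATE * FRAME_MS // 1000)


# The window reported above, 14.05 s to 23.29 s, in 20 ms frames.
print(frame_count(14.05, 23.29))
```

Searching by text rather than by a fixed offset is what keeps the clip and the reference from drifting apart again if the source recording changes.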

Single-stream sweeps can't see the real failure operators hit: several SpatialChat participants talking at once. Production gives every active /tap its own relay into the single shared whisperlivekit-server (one model, one GPU), so simultaneous talkers contend for one decoder.

Add `--concurrency N1,N2,...`: feeds N concurrent copies of a fixture (full overlap, worst case) into one WlK server and reports how per-stream lag, finalization delay, and WER degrade as N grows. Extract the per-stream feed loop into `_drive_one_stream`, shared by run_one and the new concurrency mode (single source of truth, run_one signature unchanged so test_live_quality keeps working). Lag climbing with N confirms inference serialization; WER rising / errored streams confirm WlK trimming away starved audio. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd

The concurrency mode fed all streams perfectly aligned speech = 100% overlap, which is precisely the case the TapScribe silence gate cannot help (no silence to drop when everyone talks at once). That measures the gate's blind spot, not whether the gate works. --stagger offsets each stream's speech by N seconds using leading/trailing silence (which the gate drops), so stagger=0 is full overlap and stagger>=clip-length is pure turn-taking. Sweeping it shows whether turn-taking stays flat (gate collapsing idle taps to ~1x load, as designed) and locates where real overlap starts to break — distinguishing "overlap is the killer, needs arbitration" from "the gate isn't closing on idle mics, needs tuning". https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd

The dashboard's per-tap "current best pick" (buffer_transcription) never appears. Traced the whole chain — relay reads the correct wire key (a plain string in WlK's full snapshot), the fan-out forwards it, /api/state serializes it, and active-taps.js renders it with live defaulting on — so the plumbing is sound; the buffer is simply arriving empty. Capture every on_buffer update in _drive_one_stream and report buffer_updates / buffer_nonempty / buffer_sample. A single-fixture run now distinguishes "WlK never emits a non-empty buffer in this config" (gated bursts commit straight to lines / backend-policy) from "it emits transiently and the dashboard's /api/state poll misses it" — which need different fixes. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
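
The three counters can be pictured as a small probe object. The attribute names mirror the reported metrics; the `BufferProbe` class, its `on_buffer` signature, and the `diagnosis` labels are hypothetical.

```python
class BufferProbe:
    """Counts in-flight buffer updates seen during one run (illustrative)."""

    def __init__(self) -> None:
        self.buffer_updates = 0
        self.buffer_nonempty = 0
        self.buffer_sample = ""

    def on_buffer(self, text: str) -> None:
        self.buffer_updates += 1
        if text.strip():
            self.buffer_nonempty += 1
            if not self.buffer_sample:
                self.buffer_sample = text  # keep the first non-empty value as evidence

    def diagnosis(self) -> str:
        if self.buffer_nonempty == 0:
            return "never-nonempty"  # no in-flight text was emitted in this config
        return "transient"           # text exists, so a slow poll may be missing it
```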

buffer_updates=0 on the default config confirmed WlK never emits an in-flight buffer: with confidence_validation on it commits tokens the instant they're confident (the word-by-word settled lines show it), so nothing lingers for the dashboard's "current best pick" preview. Expose --confidence-validation / --no-confidence-validation on the single run so the hypothesis is directly testable: turning it off should switch WlK to LocalAgreement, populate buffer_transcription (buffer_nonempty>0), and likely improve punctuation per WlK's own guidance. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd

confidence_validation=False still produced buffer_updates=0, ruling it out. The real driver of the always-empty in-flight buffer is the policy: WlK defaults to simulstreaming (AlignAtt commits tokens as it decodes, keeps no unvalidated buffer), and tapscribe only overrides to localagreement for nb-whisper — so every whisper model runs simulstreaming and buffer_transcription is always "". Add an optional backend_policy field to LiveConfig (emitted to argv only for non-nb models; nb still forces localagreement) and a --backend-policy flag on bench_live so simulstreaming vs localagreement can be A/B'd directly: whether localagreement populates the buffer, and its WER/lag cost, are both empirical questions to settle before changing any default. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd

Two changes to make the concurrency numbers trustworthy and test the process-per-tap hypothesis:

Lag fix: WlK's remaining_time_transcription is wall_clock minus processed-audio-time, so it grows unbounded while the gate drops silence and we forward nothing (the audio timeline freezes while the clock ticks). That inflated the staggered/turn-taking runs and likely overstated the "turn-taking is worse than overlap" result. Only count a lag sample when a real frame was forwarded within ACTIVE_LAG_WINDOW_S, so dropped silence no longer masquerades as decode backlog.

--per-tap-instance: spawn one WlK server per stream (fresh per N) instead of one shared server with N connections. The turn-taking collapse happened while the GPU was mostly idle (one speaker at a time), which points at per-connection process overhead rather than GPU compute. If this topology stays flat where the shared server collapses, that's confirmed and process-per-tap is a real fix. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd

WlK keeps emitting remaining_time_transcription after we stop feeding it; that value is wall_clock - last_processed_audio, so it climbs purely from elapsed silence the gate drops, painting a phantom growing backlog for a tap whose speaker went quiet. Only forward lag while the gate is open (genuine), clear it the instant the gate closes, and leave backend-gate mode (continuous feed) untouched. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
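
The suppression rule reduces to a few lines. This sketch assumes the gate state is available as a boolean next to the two clock values; `reported_lag` is a hypothetical helper, and backend-gate mode (continuous feed) is not modelled here.

```python
from typing import Optional


def reported_lag(wall_clock_s: float, processed_audio_s: float,
                 gate_open: bool) -> Optional[float]:
    """Lag to surface for a tap, or None while the gate is closed.

    The raw value is wall_clock - processed_audio, which keeps growing while
    dropped silence freezes the audio timeline.
    """
    if not gate_open:
        return None  # silence is being dropped, so the raw value is not backlog
    return max(0.0, wall_clock_s - processed_audio_s)
```

Returning `None` rather than 0.0 lets a caller distinguish "no lag" from "no measurement right now".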

The bench hand-rolled its own gate + WlKRelay + lag sampling in _drive_one_stream, so it never exercised the real TapFanOut/Recorder and its lag accounting drifted from production (the silence-window hack didn't even match the gate-closed suppression in tap_fan_out). A benchmark that doesn't use the production paths can't be trusted.

Rewrite _drive_one_stream to open a real TapFanOut against a throwaway Recorder and feed it frames exactly as the /tap WS does: the SpeechGate, WlKRelay, per-tap lag (gate-closed suppression included), in-flight buffer, and settled captions now all come from production code. lag/gate_open/buffer are sampled from recorder.streams (the /api/state source); settled lines are read from recorder.transcripts, filtered per stream identity.

Drop the dead silence-window lag hack and the per-stream silence warmup (non-production: the gate drops it anyway). Remove --per-tap-instance: it was non-production and the data showed it identical to the shared server, confirming the turn-taking "collapse" was a metric artifact, not a bottleneck. Add a fake-WlK smoke test pinning the shared code path. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd

Investigating the lag rework surfaced the real cause of the staggered-run inflation: the bridge contract is one /tap WebSocket per utterance (bridges/README.md), so a production connection opens at speech onset — there is never 30s of leading silence sitting on a live connection. The old stagger modelled overlap by padding ONE long-lived connection with leading silence the gate drops, which left WlK's per-connection clock (remaining_time = wall_clock - processed_audio) anchored at t0 and running through all that silence. That inflated lag was a benchmark-modelling artifact, not a production bug. Fix the model instead of the metric: stagger when each stream OPENS its tap (start_delay_s = stagger*i), so each tap is a fresh per-utterance connection anchored at its own speech onset — exactly what the bridge does. stagger=0 = all taps open at once (overlap); large stagger = taps open/close in sequence (turn-taking, ~1 live). Lag is now clean without parsing WlK's caption timestamps, and the gate-closed suppression already handles trailing/within-utterance silence. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
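
A small model of the new stagger semantics, assuming each tap stays open for exactly the clip length (connection setup and close latency are ignored). `start_delays` and `max_live_taps` are hypothetical helpers; the 9.24 s clip length in the usage line is the armstrong-en figure quoted earlier.

```python
def start_delays(n_streams: int, stagger_s: float) -> list:
    """Open time of each tap: stream i opens at stagger * i seconds."""
    return [stagger_s * i for i in range(n_streams)]


def max_live_taps(n_streams: int, stagger_s: float, clip_s: float) -> int:
    """Peak number of taps open at once, treating each tap as [open, open + clip)."""
    opens = start_delays(n_streams, stagger_s)
    # The peak always occurs at some tap's open time, so only those are checked.
    return max(
        sum(1 for o in opens if o <= t < o + clip_s)
        for t in opens
    )


for stagger in (0.0, 5.0, 10.0):
    print(stagger, max_live_taps(4, stagger, 9.24))
```

The two extremes match the description: stagger 0 puts every tap live at once, and a stagger at or above the clip length leaves a single live tap.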

The production-path rewrite made the per-frame work heavier (real gate + relay + ActiveStream lock) and I paced with a fixed sleep(20ms) AFTER that work, so each frame took work+20ms. As N grew, per-frame work grew and the feed silently drifted below real time — which is why lag DROPPED as streams increased (1.5s -> 0.7s at N=4): WlK was being under-fed, not keeping up. Pace against an absolute schedule (target = start + i*interval) so the per-frame work is absorbed instead of stacked on the sleep; the feed stays real time until the host genuinely can't keep up, at which point it falls behind honestly. Move per-tap state sampling to a background task so its overhead never inflates feed timing. Add a pacing_slip_s metric (worst real-time slip) surfaced as the slipX column: slipX > ~0.1s flags a row whose lag/WER are soft because the feed under-loaded WlK. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
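
The pacing change, sketched with an injectable clock so it can be checked without real sleeping. `feed_paced` is a hypothetical stand-in for the harness's feed loop; the returned value plays the role of pacing_slip_s.

```python
import time
from typing import Callable, Iterable


def feed_paced(
    frames: Iterable,
    send: Callable,
    interval_s: float = 0.02,
    clock: Callable = time.monotonic,
    sleep: Callable = time.sleep,
) -> float:
    """Send frames on an absolute schedule and return the worst slip in seconds."""
    start = clock()
    worst_slip = 0.0
    for i, frame in enumerate(frames):
        target = start + i * interval_s  # absolute deadline for frame i
        delay = target - clock()
        if delay > 0:
            sleep(delay)  # per-frame work is absorbed into the wait
        else:
            worst_slip = max(worst_slip, -delay)  # the host fell behind real time
        send(frame)
    return worst_slip
```

With a fixed `sleep(interval)` after each send, ten frames costing 5 ms each would take about 250 ms; on the absolute schedule the last frame is sent at 180 ms, so the feed stays at real time until the work itself exceeds the interval.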

Review of the branch surfaced three issues:

#1 (CI red): the recut armstrong-en.wav is ~9.24s/462 frames ending on the loud iconic line, but two level-meter tests baked in the old 12s/quiet-tail clip: `len(frames) > 500` (now 462) and decay `< 0.05` after 600ms (the louder ending only reaches 0.0527). Update the frame-count assertion to the real clip and extend the silence tail to 1s so the meter drains past 0.05.
#2: bench `errored` was dead code: _drive_one_stream never set an "error" key, so the concurrency table always printed errored=0. Detect a tap whose relay never attached (fan._relay_alive) and report it errored; skip errored streams in the WER mean so a half-dead run no longer reads as clean.
#3: a mid-feed exception leaked the tap's relay + sampler task and (via gather) aborted the whole sweep. Wrap the feed in try/finally that always cancels the sampler and closes the tap, and gather with return_exceptions=True so one bad stream becomes an errored row instead of killing the run. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
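
Issues #2 and #3 combine into one pattern, sketched here with stand-ins. `drive_stream` is not the real `_drive_one_stream`, the sampler is a placeholder sleep task, and the 0.1 WER is a made-up value used only to exercise the mean.

```python
import asyncio


async def drive_stream(index: int, fail: bool) -> dict:
    """Stand-in for one stream's feed loop."""
    sampler = asyncio.create_task(asyncio.sleep(3600))  # placeholder background sampler
    try:
        await asyncio.sleep(0)
        if fail:
            raise RuntimeError("relay never attached")
        return {"stream": index, "wer": 0.1}  # placeholder score
    finally:
        sampler.cancel()  # always runs, so a failed feed cannot leak the task


async def run_streams(n: int, failing: set) -> list:
    results = await asyncio.gather(
        *(drive_stream(i, i in failing) for i in range(n)),
        return_exceptions=True,  # one bad stream becomes a value, not an abort
    )
    return [
        r if isinstance(r, dict) else {"stream": i, "error": repr(r)}
        for i, r in enumerate(results)
    ]


def mean_wer(rows: list) -> float:
    """Average WER over healthy streams only."""
    ok = [r["wer"] for r in rows if "error" not in r]
    return sum(ok) / len(ok) if ok else float("nan")
```

Calling `asyncio.run(run_streams(3, {1}))` yields two scored rows and one errored row, and `mean_wer` averages only the two healthy ones.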

…2igTO # Conflicts: # .github/workflows/ci.yml

test_audio.py::test_int16_peak_norm_armstrong_speech_wav also baked in the old 12s/600-frame armstrong-en.wav (`len(frames) > 500`); the recut 9.24s clip is 462 frames, so it failed once the full unit suite ran post-merge. Align it with the other two level tests and refresh the two now-stale "12 s" docstrings to "~9 s". https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd

argparse's p.error() exits, but CodeQL doesn't model it as no-return, so it saw a path where the int() parse raised ValueError, the except 'returned', and the validation below read an uninitialized `counts`. Bind it to [] up front — behavior is unchanged (the `if not counts` check still fires on a parse failure), and the use-before-init alert clears. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
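
The shape of that fix, as a self-contained sketch. `parse_counts` is a hypothetical helper and the error messages are invented; only the up-front binding of `counts` and the use of `p.error()` come from the commit message.

```python
import argparse


def parse_counts(parser: argparse.ArgumentParser, raw: str) -> list:
    """Parse a comma-separated list such as "1,2,4" into positive ints."""
    counts: list = []  # bound up front so no path reads an unassigned name
    try:
        counts = [int(part) for part in raw.split(",")]
    except ValueError:
        # error() prints usage and raises SystemExit(2); a static analyser
        # may not know that and assume execution continues past this line.
        parser.error(f"expected comma-separated integers, got {raw!r}")
    if not counts or any(c < 1 for c in counts):
        parser.error("need at least one positive integer")
    return counts
```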

claude added 8 commits May 26, 2026 07:39

github-advanced-security AI found potential problems May 26, 2026


Comment thread tools/bench_live.py Fixed

claude added 13 commits May 26, 2026 09:41

Merge remote-tracking branch 'origin/main' into claude/pensive-gauss-…

7cbd612

…2igTO # Conflicts: # .github/workflows/ci.yml

Vortiago wants to merge 21 commits into main from claude/pensive-gauss-2igTO

3 participants
