Add live-path benchmark harness for transcription-quality tuning#69
Draft
Vortiago wants to merge 21 commits into
Draft
Add live-path benchmark harness for transcription-quality tuning#69Vortiago wants to merge 21 commits into
Vortiago wants to merge 21 commits into
Conversation
The live transcription path (WhisperLiveKit subprocess + Silero SpeechGate + WlKRelay) is distinct from the batch path that tools/bench_backends.py measures, and we had no way to evaluate its caption quality on known audio. This adds tools/bench_live.py, which drives the real production live objects so audio is segmented and decoded exactly as in production, captures the settled captions, and scores them with a broad metric set (WER/CER plus the substitution/deletion/insertion breakdown, decode lag, finalization delay, and gate pass-through) so the numbers reveal what to tune rather than guessing. Supports single-config and --sweep matrix modes. Also adds a real_live pytest gate that runs the harness on a fixture and asserts a WER threshold (skips cleanly when models/deps are absent), a research note comparing our defaults to WhisperLiveKit/Silero recommendations, and a jiwer-only [bench] extra. Fixes an fd leak in WhisperLiveKitChannel: the log-pump thread now closes the child's stdout pipe at EOF instead of leaving it for GC, which otherwise leaks a descriptor per live-child restart (and tripped ResourceWarning-as-error under the harness). https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
pip-audit flags fastapi 0.136.3 (the latest release) under MAL-2026-4750, a malicious-package advisory in the OSV malicious-packages dataset that matches the name `fastapi`. It does not describe a vulnerability in the legitimate PyPI fastapi we depend on — the dataset tracks typosquats / backdoored impostors — and records no fixed version, so there is nothing to bump to. Exclude the advisory ID with a written justification, mirroring the existing torch/silero audit exclusion. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
The SpeechGate runs locally (Silero needs no network), so --gate-only measures how much audio each gate config forwards without an ASR model — runnable on boxes that can't download weights. Real data on the fixtures shows the default gate forwards only ~21% of armstrong-en (continuous speech) as a single 2.2s segment, vs ~92% of the clean marlene-nb reading: the gate collapses on noisy/low-level continuous audio, the strongest lead for poor live captions. Captured in the research note. Also generalize the full-pipeline sweep to apply arbitrary LiveConfig overrides via dataclasses.replace (the matrix previously silently ignored confidence_validation and gate knobs) and expand it to cover the confidence/accuracy trade and the gate A/B. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
The consumer cached a lower-bound scan index (_last_emit_scan_upto) on the assumption that WlK only ever mutates the current tail line, so once a newer line appeared a position was frozen. A late LocalAgreement correction that grows an already-committed line then had its new suffix silently dropped mid-session and only recovered (out of order) at close. Re-scan all non-tail positions each snapshot instead. _consider_emit_line is (speaker,start)-keyed and idempotent, so re-walking settled lines is a dict lookup per line — negligible at WlK's few-Hz tick rate. Adds a red-green regression test driving a non-tail line that grows after a newer line exists. Verified separately that the speech-gate "end-frame drop" and missing speech_pad flagged in review are intentional (pinned by test_gate_closes_on_vad_end_event with a documented hangover rationale), so those are left as-is. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
The sweep applied an English-only matrix (tiny/base/small.en) to every fixture, including the Norwegian marlene-nb clip. English-only Whisper cannot transcribe Norwegian regardless of --lan, so those rows scored noise (>1.0 WER) and measured nothing. Split the matrix per language: SWEEP_MATRIX_EN keeps the .en sweep for English fixtures, SWEEP_MATRIX_NB sweeps the NB-Whisper family for Norwegian. _sweep_matrix_for picks by fixture language. The live channel already auto-downloads nb-whisper CT2 weights and build_live_cmd routes them via --model-path + localagreement (every other sweep knob still applies), so no further wiring was needed. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
The armstrong-en fixture was the "first ~12 s" of the source recording,
which is the lead-in utterance ("I'm going to step off the LM now") — NOT
the iconic line in the reference transcript. The live benchmark therefore
scored accurate transcriptions against a reference the audio never spoke
(~0.92 WER across every model and gate), which read as a broken pipeline
but was a mislabeled fixture.
Add tools/recut_armstrong.py: it locates the iconic line by
word-timestamp transcription (not a fixed offset) and regenerates the WAV
(~9 s, 14.0-23.3 s of the source), self-verifying by re-transcribing the
cut. Update the README provenance to match.
The regenerated armstrong-en.wav is committed separately (binary, produced
on a networked host).
https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
Replace the mislabeled clip (the "step off the LM now" lead-in) with the segment that actually matches the reference — "That's one small step for man, one giant leap for mankind." Produced from the source OGG at the word-timestamp-located window (14.05-23.29s of the 24.1s recording, ~9.24s @ 16 kHz mono int16), verified by re-transcription to read exactly as the reference. Also correct the now-stale rationale comment in test_live_quality.py: the ~0.92 WER that motivated the loose 0.5 bar was the mislabeled fixture, not a model deficiency, so the bar should be tightened once a clean-fixture sweep confirms the real base.en number. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
Single-stream sweeps can't see the real failure operators hit: several SpatialChat participants talking at once. Production gives every active /tap its own relay into the single shared whisperlivekit-server (one model, one GPU), so simultaneous talkers contend for one decoder. Add `--concurrency N1,N2,...`: feeds N concurrent copies of a fixture (full overlap, worst case) into one WlK server and reports how per-stream lag, finalization delay, and WER degrade as N grows. Extract the per- stream feed loop into `_drive_one_stream`, shared by run_one and the new concurrency mode (single source of truth, run_one signature unchanged so test_live_quality keeps working). Lag climbing with N confirms inference serialization; WER rising / errored streams confirm WlK trimming away starved audio. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
The concurrency mode fed all streams perfectly aligned speech = 100% overlap, which is precisely the case the TapScribe silence gate cannot help (no silence to drop when everyone talks at once). That measures the gate's blind spot, not whether the gate works. --stagger offsets each stream's speech by N seconds using leading/trailing silence (which the gate drops), so stagger=0 is full overlap and stagger>=clip-length is pure turn-taking. Sweeping it shows whether turn-taking stays flat (gate collapsing idle taps to ~1x load, as designed) and locates where real overlap starts to break — distinguishing "overlap is the killer, needs arbitration" from "the gate isn't closing on idle mics, needs tuning". https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
The dashboard's per-tap "current best pick" (buffer_transcription) never appears. Traced the whole chain — relay reads the correct wire key (a plain string in WlK's full snapshot), the fan-out forwards it, /api/state serializes it, and active-taps.js renders it with live defaulting on — so the plumbing is sound; the buffer is simply arriving empty. Capture every on_buffer update in _drive_one_stream and report buffer_updates / buffer_nonempty / buffer_sample. A single-fixture run now distinguishes "WlK never emits a non-empty buffer in this config" (gated bursts commit straight to lines / backend-policy) from "it emits transiently and the dashboard's /api/state poll misses it" — which need different fixes. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
buffer_updates=0 on the default config confirmed WlK never emits an in-flight buffer: with confidence_validation on it commits tokens the instant they're confident (the word-by-word settled lines show it), so nothing lingers for the dashboard's "current best pick" preview. Expose --confidence-validation / --no-confidence-validation on the single run so the hypothesis is directly testable: turning it off should switch WlK to LocalAgreement, populate buffer_transcription (buffer_nonempty>0), and likely improve punctuation per WlK's own guidance. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
confidence_validation=False still produced buffer_updates=0, ruling it out. The real driver of the always-empty in-flight buffer is the policy: WlK defaults to simulstreaming (AlignAtt commits tokens as it decodes, keeps no unvalidated buffer), and tapscribe only overrides to localagreement for nb-whisper — so every whisper model runs simulstreaming and buffer_transcription is always "". Add an optional backend_policy field to LiveConfig (emitted to argv only for non-nb models; nb still forces localagreement) and a --backend-policy flag on bench_live so simulstreaming vs localagreement can be A/B'd directly: whether localagreement populates the buffer, and its WER/lag cost, are both empirical questions to settle before changing any default. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
Two changes to make the concurrency numbers trustworthy and test the process-per-tap hypothesis: Lag fix: WlK's remaining_time_transcription is wall_clock minus processed-audio-time, so it grows unbounded while the gate drops silence and we forward nothing (the audio timeline freezes while the clock ticks). That inflated the staggered/turn-taking runs and likely overstated the "turn-taking is worse than overlap" result. Only count a lag sample when a real frame was forwarded within ACTIVE_LAG_WINDOW_S, so dropped silence no longer masquerades as decode backlog. --per-tap-instance: spawn one WlK server per stream (fresh per N) instead of one shared server with N connections. The turn-taking collapse happened while the GPU was mostly idle (one speaker at a time), which points at per-connection process overhead rather than GPU compute — if this topology stays flat where the shared server collapses, that's confirmed and process-per-tap is a real fix. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
WlK keeps emitting remaining_time_transcription after we stop feeding it; that value is wall_clock - last_processed_audio, so it climbs purely from elapsed silence the gate drops, painting a phantom growing backlog for a tap whose speaker went quiet. Only forward lag while the gate is open (genuine), clear it the instant the gate closes, and leave backend-gate mode (continuous feed) untouched. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
The bench hand-rolled its own gate + WlKRelay + lag sampling in _drive_one_stream, so it never exercised the real TapFanOut/Recorder and its lag accounting drifted from production (the silence-window hack didn't even match the gate-closed suppression in tap_fan_out). A benchmark that doesn't use the production paths can't be trusted. Rewrite _drive_one_stream to open a real TapFanOut against a throwaway Recorder and feed it frames exactly as the /tap WS does: the SpeechGate, WlKRelay, per-tap lag (gate-closed suppression included), in-flight buffer, and settled captions now all come from production code. lag/gate_open/ buffer are sampled from recorder.streams (the /api/state source); settled lines are read from recorder.transcripts, filtered per stream identity. Drop the dead silence-window lag hack and the per-stream silence warmup (non-production: the gate drops it anyway). Remove --per-tap-instance: it was non-production and the data showed it identical to the shared server, confirming the turn-taking "collapse" was a metric artifact, not a bottleneck. Add a fake-WlK smoke test pinning the shared code path. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
Investigating the lag rework surfaced the real cause of the staggered-run inflation: the bridge contract is one /tap WebSocket per utterance (bridges/README.md), so a production connection opens at speech onset — there is never 30s of leading silence sitting on a live connection. The old stagger modelled overlap by padding ONE long-lived connection with leading silence the gate drops, which left WlK's per-connection clock (remaining_time = wall_clock - processed_audio) anchored at t0 and running through all that silence. That inflated lag was a benchmark-modelling artifact, not a production bug. Fix the model instead of the metric: stagger when each stream OPENS its tap (start_delay_s = stagger*i), so each tap is a fresh per-utterance connection anchored at its own speech onset — exactly what the bridge does. stagger=0 = all taps open at once (overlap); large stagger = taps open/close in sequence (turn-taking, ~1 live). Lag is now clean without parsing WlK's caption timestamps, and the gate-closed suppression already handles trailing/within-utterance silence. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
The production-path rewrite made the per-frame work heavier (real gate + relay + ActiveStream lock) and I paced with a fixed sleep(20ms) AFTER that work, so each frame took work+20ms. As N grew, per-frame work grew and the feed silently drifted below real time — which is why lag DROPPED as streams increased (1.5s -> 0.7s at N=4): WlK was being under-fed, not keeping up. Pace against an absolute schedule (target = start + i*interval) so the per-frame work is absorbed instead of stacked on the sleep; the feed stays real time until the host genuinely can't keep up, at which point it falls behind honestly. Move per-tap state sampling to a background task so its overhead never inflates feed timing. Add a pacing_slip_s metric (worst real-time slip) surfaced as the slipX column: slipX > ~0.1s flags a row whose lag/WER are soft because the feed under-loaded WlK. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
Review of the branch surfaced three issues: #1 (CI red) — the recut armstrong-en.wav is ~9.24s/462 frames ending on the loud iconic line, but two level-meter tests baked in the old 12s/quiet-tail clip: `len(frames) > 500` (now 462) and decay `< 0.05` after 600ms (the louder ending only reaches 0.0527). Update the frame-count assertion to the real clip and extend the silence tail to 1s so the meter drains past 0.05. #2 — bench `errored` was dead code: _drive_one_stream never set an "error" key, so the concurrency table always printed errored=0. Detect a tap whose relay never attached (fan._relay_alive) and report it errored; skip errored streams in the WER mean so a half-dead run no longer reads as clean. #3 — a mid-feed exception leaked the tap's relay + sampler task and (via gather) aborted the whole sweep. Wrap the feed in try/finally that always cancels the sampler and closes the tap, and gather with return_exceptions=True so one bad stream becomes an errored row instead of killing the run. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
…2igTO # Conflicts: # .github/workflows/ci.yml
test_audio.py::test_int16_peak_norm_armstrong_speech_wav also baked in the old 12s/600-frame armstrong-en.wav (`len(frames) > 500`); the recut 9.24s clip is 462 frames, so it failed once the full unit suite ran post-merge. Align it with the other two level tests and refresh the two now-stale "12 s" docstrings to "~9 s". https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
argparse's p.error() exits, but CodeQL doesn't model it as no-return, so it saw a path where the int() parse raised ValueError, the except 'returned', and the validation below read an uninitialized `counts`. Bind it to [] up front — behavior is unchanged (the `if not counts` check still fires on a parse failure), and the use-before-init alert clears. https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Update (2026-05-26): the "red baseline" was a mislabeled fixture, not broken live
Running the sweep on a real-model box surfaced ~0.92 WER on
armstrong-enacross every model and gate — includingbackendat 100% pass-through. That uniformity was the tell: not the gate (backend forwards everything) and not the model (flat across tiny/base/small.en). The transcripts were actually coherent and accurate; they just didn't match the reference:base.en→ "I'm going to step off the limb now" — a near-perfect transcription of Armstrong's lead-in "I'm going to step off the LM now," missing only the acronym.The fixture was the first ~12 s of the source recording — that lead-in utterance — not the iconic line the reference claimed. The benchmark was scoring good transcripts against the wrong text. The live pipeline transcribes accurately.
Resolved in this PR:
tools/recut_armstrong.py+ regeneratedarmstrong-en.wav— the clip now contains "That's one small step…" (boundaries located by word-timestamp transcription, not a fixed offset), matching the reference. README provenance corrected.marlene-nb(Norwegian) was being run through English-only.enmodels (also noise). The sweep now uses thenb-whisper-*family for Norwegian fixtures (_sweep_matrix_for).live_relay— fixed a latent bug where an already-committed (non-tail) line that grows in a later snapshot had its new suffix dropped until close, then delivered out of order. Now re-scans non-tail positions each snapshot (idempotent dedup keeps it cheap); red-green regression test added.armstrongwas never evidence of it.Next: re-run
--sweepon the corrected fixtures, then tighten_MAX_WERintest_live_quality.pyto the confirmed cleanbase.ennumber.Original description (kept for history — its "Early findings" are superseded by the update above):
Why
The live transcription path is its own pipeline, separate from the batch path that
tools/bench_backends.pymeasures:Live captions have been poor and we had no way to measure why. This adds a harness that pipes a known audio file through the real production live objects so it gets segmented and decoded exactly as in production, captures the captions, and scores them — establishing a baseline ("red") so the data, not a guess, drives what to tune.
What's here
tools/bench_live.py— drivesWhisperLiveKitChannel+build_gate_for_config+WlKRelay(no FastAPI server). Single-config mode and--sweep(config matrix over every fixture →bench-results/*.json). Frames are paced in real time by default for fidelity.lag, finalization delay, and gate pass-through %.tests/e2e/test_live_quality.py— areal_live-marked gate that runs the harness onarmstrong-enand asserts a WER threshold. Skips cleanly when whisperlivekit / faster-whisper / jiwer or the model aren't available. This is the red→green target.docs/live-tuning-research.md— compares our live defaults to WhisperLiveKit/Silero recommendations and lists hypotheses to test.pyproject.toml—[bench]extra (jiwer) + thereal_livemarker..gitignore—bench-results/.tapscribe/live.py— closes the WlK child's stdout pipe at log-pump EOF instead of leaving it for GC (leaked an fd per live-child restart; also tripped ResourceWarning-as-error under the harness).Test plan
pytest --ignore=tests/e2e, minus the pre-existinghypothesis-not-installed module).live.pyfd-close change non-regressing.ruff check(CI scope) +ruff format --checkclean;real_livetest skips cleanly.🤖 Generated with Claude Code
Generated by Claude Code