Add live-path benchmark harness for transcription-quality tuning #69

Draft
Vortiago wants to merge 21 commits into main from claude/pensive-gauss-2igTO

Vortiago commented May 26, 2026

Update (2026-05-26): the "red baseline" was a mislabeled fixture, not broken live

Running the sweep on a real-model box surfaced ~0.92 WER on armstrong-en across every model and gate — including backend at 100% pass-through. That uniformity was the tell: not the gate (backend forwards everything) and not the model (flat across tiny/base/small.en). The transcripts were actually coherent and accurate; they just didn't match the reference:

  • base.en: "I'm going to step off the limb now", a near-perfect transcription of Armstrong's lead-in "I'm going to step off the LM now," missing only the acronym.

The fixture was the first ~12 s of the source recording — that lead-in utterance — not the iconic line the reference claimed. The benchmark was scoring good transcripts against the wrong text. The live pipeline transcribes accurately.

Resolved in this PR:

  • tools/recut_armstrong.py + regenerated armstrong-en.wav — the clip now contains "That's one small step…" (boundaries located by word-timestamp transcription, not a fixed offset), matching the reference. README provenance corrected.
  • Per-language sweep: marlene-nb (Norwegian) was being run through English-only .en models, which produced noise as well. The sweep now uses the nb-whisper-* family for Norwegian fixtures (_sweep_matrix_for).
  • live_relay — fixed a latent bug where an already-committed (non-tail) line that grows in a later snapshot had its new suffix dropped until close, then delivered out of order. Now re-scans non-tail positions each snapshot (idempotent dedup keeps it cheap); red-green regression test added.
  • The "21% gate / clipping speech / tiny.en suspect" findings below were premised on the bogus fixture. The gate's forward-rate on continuous speech may still merit a look, but armstrong was never evidence of it.

Next: re-run --sweep on the corrected fixtures, then tighten _MAX_WER in test_live_quality.py to the confirmed clean base.en number.


Original description (kept for history — its "Early findings" are superseded by the update above):

Why

The live transcription path is its own pipeline, separate from the batch path that tools/bench_backends.py measures:

WAV → 20 ms PCM frames → SpeechGate (Silero VAD) → WlKRelay
    → whisperlivekit-server subprocess → settled lines

Live captions have been poor and we had no way to measure why. This adds a harness that pipes a known audio file through the real production live objects so it gets segmented and decoded exactly as in production, captures the captions, and scores them — establishing a baseline ("red") so the data, not a guess, drives what to tune.
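
As a rough illustration of the first pipeline stage, here is a minimal sketch of cutting 16 kHz mono int16 PCM into 20 ms frames. `iter_frames` and the constants are illustrative names, not the harness's actual code, and dropping a trailing partial frame is an assumption.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # fixtures are described as 16 kHz mono int16
FRAME_MS = 20             # frame duration from the pipeline above
BYTES_PER_SAMPLE = 2      # int16
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000 * BYTES_PER_SAMPLE  # 640 bytes


def iter_frames(pcm: bytes) -> Iterator[bytes]:
    """Yield consecutive 20 ms frames; a trailing partial frame is dropped (assumption)."""
    for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[offset:offset + FRAME_BYTES]


# 9.24 s of silence stands in for a clip of the recut fixture's length.
CLIP_MS = 9_240
pcm = bytes(SAMPLE_RATE_HZ * CLIP_MS // 1000 * BYTES_PER_SAMPLE)
frames = list(iter_frames(pcm))
```

At 20 ms per frame, a 9.24 s clip yields 462 frames, which matches the frame count quoted for the recut armstrong-en.wav later in this thread.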

What's here

  • tools/bench_live.py — drives WhisperLiveKitChannel + build_gate_for_config + WlKRelay (no FastAPI server). Single-config mode and --sweep (config matrix over every fixture → bench-results/*.json). Frames are paced in real time by default for fidelity.
  • Broad metric set (the point of the exercise): WER/CER plus the substitution / deletion / insertion split (subs ⇒ weak model, deletions ⇒ dropped/clipped speech, insertions ⇒ hallucination), decode lag, finalization delay, and gate pass-through %.
  • tests/e2e/test_live_quality.py — a real_live-marked gate that runs the harness on armstrong-en and asserts a WER threshold. Skips cleanly when whisperlivekit / faster-whisper / jiwer or the model aren't available. This is the red→green target.
  • docs/live-tuning-research.md — compares our live defaults to WhisperLiveKit/Silero recommendations and lists hypotheses to test.
  • pyproject.toml: [bench] extra (jiwer) + the real_live marker. .gitignore: bench-results/.
  • tapscribe/live.py — closes the WlK child's stdout pipe at log-pump EOF instead of leaving it for GC (leaked an fd per live-child restart; also tripped ResourceWarning-as-error under the harness).
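
To make the substitution / deletion / insertion split concrete, here is a self-contained sketch that computes it with a word-level edit-distance table. The harness uses jiwer for scoring; this stand-in is only meant to show what the three counts mean. `error_split` and `ErrorSplit` are hypothetical names, and the only normalization assumed is lowercasing and whitespace splitting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSplit:
    substitutions: int
    deletions: int
    insertions: int
    ref_words: int

    @property
    def wer(self) -> float:
        errors = self.substitutions + self.deletions + self.insertions
        return errors / self.ref_words if self.ref_words else 0.0


def error_split(reference: str, hypothesis: str) -> ErrorSplit:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # cost[i][j] = (edits, subs, dels, ins) for aligning ref[:i] with hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)          # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)          # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]   # match costs nothing
                continue
            e, s, d, n = cost[i - 1][j - 1]
            substitute = (e + 1, s + 1, d, n)
            e, s, d, n = cost[i - 1][j]
            delete = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insert = (e + 1, s, d, n + 1)
            cost[i][j] = min(substitute, delete, insert)  # fewest edits wins
    _, subs, dels, ins = cost[-1][-1]
    return ErrorSplit(subs, dels, ins, len(ref))


lead_in = error_split(
    "I'm going to step off the LM now",
    "I'm going to step off the limb now",
)
```

Scoring the base.en transcript against the lead-in it actually contained gives one substitution over eight reference words (WER 0.125), which is why the ~0.92 figure pointed at the reference text rather than the model.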

Test plan

  • Unit suite green (pytest --ignore=tests/e2e, minus the pre-existing hypothesis-not-installed module).
  • Live/relay/gate unit tests green; live.py fd-close change non-regressing.
  • Scoring, fixture discovery, framing, and Silero gate construction validated without a model.
  • ruff check (CI scope) + ruff format --check clean; real_live test skips cleanly.
  • Full sweep run on a real-model box — surfaced the mislabeled-fixture issue (see update above), now fixed.

🤖 Generated with Claude Code



claude added 8 commits May 26, 2026 07:39
The live transcription path (WhisperLiveKit subprocess + Silero
SpeechGate + WlKRelay) is distinct from the batch path that
tools/bench_backends.py measures, and we had no way to evaluate its
caption quality on known audio. This adds tools/bench_live.py, which
drives the real production live objects so audio is segmented and
decoded exactly as in production, captures the settled captions, and
scores them with a broad metric set (WER/CER plus the
substitution/deletion/insertion breakdown, decode lag, finalization
delay, and gate pass-through) so the numbers reveal what to tune rather
than guessing. Supports single-config and --sweep matrix modes.

Also adds a real_live pytest gate that runs the harness on a fixture and
asserts a WER threshold (skips cleanly when models/deps are absent), a
research note comparing our defaults to WhisperLiveKit/Silero
recommendations, and a jiwer-only [bench] extra.

Fixes an fd leak in WhisperLiveKitChannel: the log-pump thread now closes
the child's stdout pipe at EOF instead of leaving it for GC, which
otherwise leaks a descriptor per live-child restart (and tripped
ResourceWarning-as-error under the harness).

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
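
The fd fix above follows a common pattern: the thread that drains the child's stdout also closes the pipe once it hits EOF. A minimal sketch, assuming a text-mode pipe; `pump_logs` and `run_child` are illustrative names, not the functions in tapscribe/live.py.

```python
import subprocess
import sys
import threading


def pump_logs(proc: subprocess.Popen, sink: list[str]) -> None:
    """Drain the child's stdout, then close the pipe explicitly at EOF."""
    assert proc.stdout is not None
    try:
        for line in proc.stdout:          # iteration ends when the child closes stdout
            sink.append(line.rstrip("\n"))
    finally:
        proc.stdout.close()               # release the descriptor now, not at GC time


def run_child(argv: list[str]) -> tuple[list[str], subprocess.Popen]:
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines: list[str] = []
    pump = threading.Thread(target=pump_logs, args=(proc, lines), daemon=True)
    pump.start()
    proc.wait()    # safe: the pump thread keeps the pipe drained
    pump.join()
    return lines, proc


child_lines, child = run_child([sys.executable, "-c", "print('ready'); print('bye')"])
```

Closing in `finally` means the descriptor is released even if the log sink raises, so repeated child restarts do not accumulate open pipes.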
pip-audit flags fastapi 0.136.3 (the latest release) under
MAL-2026-4750, a malicious-package advisory in the OSV
malicious-packages dataset that matches the name `fastapi`. It does not
describe a vulnerability in the legitimate PyPI fastapi we depend on —
the dataset tracks typosquats / backdoored impostors — and records no
fixed version, so there is nothing to bump to. Exclude the advisory ID
with a written justification, mirroring the existing torch/silero audit
exclusion.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
The SpeechGate runs locally (Silero needs no network), so --gate-only
measures how much audio each gate config forwards without an ASR model —
runnable on boxes that can't download weights. Real data on the fixtures
shows the default gate forwards only ~21% of armstrong-en (continuous
speech) as a single 2.2s segment, vs ~92% of the clean marlene-nb
reading: the gate collapses on noisy/low-level continuous audio, the
strongest lead for poor live captions. Captured in the research note.

Also generalize the full-pipeline sweep to apply arbitrary LiveConfig
overrides via dataclasses.replace (the matrix previously silently ignored
confidence_validation and gate knobs) and expand it to cover the
confidence/accuracy trade and the gate A/B.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
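
A sketch of how arbitrary overrides can be applied with `dataclasses.replace`. The `LiveConfig` fields shown here are placeholders for illustration; the real dataclass has its own fields. The useful property is that an unknown key raises instead of being ignored.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class LiveConfig:
    # Hypothetical fields, chosen only to mirror the knobs named in this PR.
    model: str = "base.en"
    confidence_validation: bool = True
    gate: str = "silero"


def apply_overrides(base: LiveConfig, overrides: dict[str, Any]) -> LiveConfig:
    """Return a copy of `base` with `overrides` applied; unknown keys raise TypeError."""
    return replace(base, **overrides)


base = LiveConfig()
row = apply_overrides(base, {"model": "small.en", "confidence_validation": False})
```

Because `replace` builds a new instance through the dataclass constructor, a misspelled knob in a sweep row fails loudly rather than producing a row that silently measures the default.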
The consumer cached a lower-bound scan index (_last_emit_scan_upto) on
the assumption that WlK only ever mutates the current tail line, so once
a newer line appeared a position was frozen. A late LocalAgreement
correction that grows an already-committed line then had its new suffix
silently dropped mid-session and only recovered (out of order) at close.

Re-scan all non-tail positions each snapshot instead. _consider_emit_line
is (speaker,start)-keyed and idempotent, so re-walking settled lines is a
dict lookup per line — negligible at WlK's few-Hz tick rate. Adds a
red-green regression test driving a non-tail line that grows after a
newer line exists.

Verified separately that the speech-gate "end-frame drop" and missing
speech_pad flagged in review are intentional (pinned by
test_gate_closes_on_vad_end_event with a documented hangover rationale),
so those are left as-is.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
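
The re-scan can be sketched as follows. This is a simplified model, not the relay's code: `SettledLineEmitter` is a hypothetical class, lines are plain dicts with `speaker`, `start`, and `text`, and only prefix-preserving growth is handled (a rewritten line is ignored here).

```python
from typing import Callable


class SettledLineEmitter:
    """Emit settled text once per (speaker, start) key, including late-grown suffixes."""

    def __init__(self, emit: Callable[[str], None]) -> None:
        self._emit = emit
        self._sent: dict[tuple[int, float], str] = {}

    def _consider(self, line: dict) -> None:
        key = (line["speaker"], line["start"])
        already = self._sent.get(key, "")
        text = line["text"]
        # Idempotent: an unchanged line costs one dict lookup and emits nothing.
        if len(text) > len(already) and text.startswith(already):
            self._emit(text[len(already):])
            self._sent[key] = text

    def on_snapshot(self, lines: list[dict]) -> None:
        # Walk every non-tail line on every snapshot; the last line is still mutable.
        for line in lines[:-1]:
            self._consider(line)


out: list[str] = []
emitter = SettledLineEmitter(out.append)
emitter.on_snapshot([
    {"speaker": 0, "start": 0.0, "text": "hello"},
    {"speaker": 0, "start": 1.2, "text": "wor"},
])
emitter.on_snapshot([
    {"speaker": 0, "start": 0.0, "text": "hello there"},   # committed line grew late
    {"speaker": 0, "start": 1.2, "text": "world"},
    {"speaker": 0, "start": 2.5, "text": ""},
])
```

With a cached lower-bound index, the second snapshot would skip position 0 and the " there" suffix would be held back until close. Re-walking from position 0 delivers it in order.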
The sweep applied an English-only matrix (tiny/base/small.en) to every
fixture, including the Norwegian marlene-nb clip. English-only Whisper
cannot transcribe Norwegian regardless of --lan, so those rows scored
noise (>1.0 WER) and measured nothing.

Split the matrix per language: SWEEP_MATRIX_EN keeps the .en sweep for
English fixtures, SWEEP_MATRIX_NB sweeps the NB-Whisper family for
Norwegian. _sweep_matrix_for picks by fixture language. The live channel
already auto-downloads nb-whisper CT2 weights and build_live_cmd routes
them via --model-path + localagreement (every other sweep knob still
applies), so no further wiring was needed.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
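
A sketch of the per-language selection. The matrix contents and the rule for deriving a language from the fixture name are assumptions for illustration: the specific nb-whisper sizes are placeholders, and reading the language from the `-en` / `-nb` suffix is a guess at one workable convention.

```python
# Illustrative rows only; the real matrices carry more knobs per row.
SWEEP_MATRIX_EN = [{"model": m} for m in ("tiny.en", "base.en", "small.en")]
SWEEP_MATRIX_NB = [
    {"model": m} for m in ("nb-whisper-tiny", "nb-whisper-base", "nb-whisper-small")
]

_MATRIX_BY_LANGUAGE = {"en": SWEEP_MATRIX_EN, "nb": SWEEP_MATRIX_NB}


def fixture_language(fixture_name: str) -> str:
    """Assumed convention: the language is the suffix after the last hyphen."""
    return fixture_name.rsplit("-", 1)[-1]


def sweep_matrix_for(fixture_name: str) -> list[dict]:
    language = fixture_language(fixture_name)
    try:
        return _MATRIX_BY_LANGUAGE[language]
    except KeyError:
        # Refuse to fall back to English: a wrong-language sweep only scores noise.
        raise ValueError(f"no sweep matrix for language {language!r}") from None
```

Raising on an unknown language, rather than defaulting to the English matrix, avoids repeating the original failure mode for any future fixture.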
The armstrong-en fixture was the "first ~12 s" of the source recording,
which is the lead-in utterance ("I'm going to step off the LM now") — NOT
the iconic line in the reference transcript. The live benchmark therefore
scored accurate transcriptions against a reference the audio never spoke
(~0.92 WER across every model and gate), which read as a broken pipeline
but was a mislabeled fixture.

Add tools/recut_armstrong.py: it locates the iconic line by
word-timestamp transcription (not a fixed offset) and regenerates the WAV
(~9 s, 14.0-23.3 s of the source), self-verifying by re-transcribing the
cut. Update the README provenance to match.

The regenerated armstrong-en.wav is committed separately (binary, produced
on a networked host).

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
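
Locating a cut window from word timestamps can be sketched like this. It is not the code in tools/recut_armstrong.py: `Word`, `locate_phrase`, and the padding parameter are hypothetical, and the timestamps in the usage lines are synthetic.

```python
from typing import NamedTuple, Sequence


class Word(NamedTuple):
    text: str
    start: float   # seconds from the start of the source recording
    end: float


def _norm(token: str) -> str:
    """Lowercase and strip punctuation so "step," matches "step"."""
    return "".join(ch for ch in token.lower() if ch.isalnum())


def locate_phrase(words: Sequence[Word], phrase: str, pad_s: float = 0.0) -> tuple[float, float]:
    """Return (start, end) of the first run of words matching `phrase`."""
    target = [_norm(t) for t in phrase.split()]
    tokens = [_norm(w.text) for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            first, last = words[i], words[i + len(target) - 1]
            return max(0.0, first.start - pad_s), last.end + pad_s
    raise LookupError(f"phrase not found: {phrase!r}")


# Synthetic timestamps, for illustration only.
transcript = [
    Word("now.", 3.0, 3.4),
    Word("That's", 10.0, 10.5),
    Word("one", 10.6, 10.9),
    Word("small", 11.0, 11.4),
    Word("step", 11.5, 12.0),
]
window = locate_phrase(transcript, "that's one small step", pad_s=0.25)
```

Searching by content rather than by a fixed offset is what guards against the original mistake, where "first ~12 s" happened to contain a different utterance.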
Replace the mislabeled clip (the "step off the LM now" lead-in) with the
segment that actually matches the reference — "That's one small step for
man, one giant leap for mankind." Produced from the source OGG at the
word-timestamp-located window (14.05-23.29s of the 24.1s recording, ~9.24s
@ 16 kHz mono int16), verified by re-transcription to read exactly as the
reference.

Also correct the now-stale rationale comment in test_live_quality.py: the
~0.92 WER that motivated the loose 0.5 bar was the mislabeled fixture, not
a model deficiency, so the bar should be tightened once a clean-fixture
sweep confirms the real base.en number.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
Single-stream sweeps can't see the real failure operators hit: several
SpatialChat participants talking at once. Production gives every active
/tap its own relay into the single shared whisperlivekit-server (one
model, one GPU), so simultaneous talkers contend for one decoder.

Add `--concurrency N1,N2,...`: feeds N concurrent copies of a fixture
(full overlap, worst case) into one WlK server and reports how per-stream
lag, finalization delay, and WER degrade as N grows. Extract the per-
stream feed loop into `_drive_one_stream`, shared by run_one and the new
concurrency mode (single source of truth, run_one signature unchanged so
test_live_quality keeps working).

Lag climbing with N confirms inference serialization; WER rising / errored
streams confirm WlK trimming away starved audio.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
claude added 13 commits May 26, 2026 09:41
The concurrency mode fed all streams perfectly aligned speech = 100%
overlap, which is precisely the case the TapScribe silence gate cannot
help (no silence to drop when everyone talks at once). That measures the
gate's blind spot, not whether the gate works.

--stagger offsets each stream's speech by N seconds using leading/trailing
silence (which the gate drops), so stagger=0 is full overlap and
stagger>=clip-length is pure turn-taking. Sweeping it shows whether
turn-taking stays flat (gate collapsing idle taps to ~1x load, as
designed) and locates where real overlap starts to break — distinguishing
"overlap is the killer, needs arbitration" from "the gate isn't closing on
idle mics, needs tuning".

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
The dashboard's per-tap "current best pick" (buffer_transcription) never
appears. Traced the whole chain — relay reads the correct wire key (a
plain string in WlK's full snapshot), the fan-out forwards it, /api/state
serializes it, and active-taps.js renders it with live defaulting on — so
the plumbing is sound; the buffer is simply arriving empty.

Capture every on_buffer update in _drive_one_stream and report
buffer_updates / buffer_nonempty / buffer_sample. A single-fixture run now
distinguishes "WlK never emits a non-empty buffer in this config" (gated
bursts commit straight to lines / backend-policy) from "it emits
transiently and the dashboard's /api/state poll misses it" — which need
different fixes.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
buffer_updates=0 on the default config confirmed WlK never emits an
in-flight buffer: with confidence_validation on it commits tokens the
instant they're confident (the word-by-word settled lines show it), so
nothing lingers for the dashboard's "current best pick" preview.

Expose --confidence-validation / --no-confidence-validation on the single
run so the hypothesis is directly testable: turning it off should switch
WlK to LocalAgreement, populate buffer_transcription (buffer_nonempty>0),
and likely improve punctuation per WlK's own guidance.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
confidence_validation=False still produced buffer_updates=0, ruling it
out. The real driver of the always-empty in-flight buffer is the policy:
WlK defaults to simulstreaming (AlignAtt commits tokens as it decodes,
keeps no unvalidated buffer), and tapscribe only overrides to
localagreement for nb-whisper — so every whisper model runs simulstreaming
and buffer_transcription is always "".

Add an optional backend_policy field to LiveConfig (emitted to argv only
for non-nb models; nb still forces localagreement) and a --backend-policy
flag on bench_live so simulstreaming vs localagreement can be A/B'd
directly: whether localagreement populates the buffer, and its WER/lag
cost, are both empirical questions to settle before changing any default.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
Two changes to make the concurrency numbers trustworthy and test the
process-per-tap hypothesis:

Lag fix: WlK's remaining_time_transcription is wall_clock minus
processed-audio-time, so it grows unbounded while the gate drops silence
and we forward nothing (the audio timeline freezes while the clock ticks).
That inflated the staggered/turn-taking runs and likely overstated the
"turn-taking is worse than overlap" result. Only count a lag sample when a
real frame was forwarded within ACTIVE_LAG_WINDOW_S, so dropped silence no
longer masquerades as decode backlog.

--per-tap-instance: spawn one WlK server per stream (fresh per N) instead
of one shared server with N connections. The turn-taking collapse happened
while the GPU was mostly idle (one speaker at a time), which points at
per-connection process overhead rather than GPU compute — if this topology
stays flat where the shared server collapses, that's confirmed and
process-per-tap is a real fix.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
WlK keeps emitting remaining_time_transcription after we stop feeding it;
that value is wall_clock - last_processed_audio, so it climbs purely from
elapsed silence the gate drops, painting a phantom growing backlog for a
tap whose speaker went quiet. Only forward lag while the gate is open
(genuine), clear it the instant the gate closes, and leave backend-gate
mode (continuous feed) untouched.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
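
The suppression rule reduces to a small decision function. A sketch under the definition given above (lag is wall clock minus processed audio time); `wlk_remaining_time` and `visible_lag` are illustrative names, not the production functions.

```python
from typing import Optional


def wlk_remaining_time(wall_clock_s: float, processed_audio_s: float) -> float:
    """Backlog as described above: wall clock minus processed audio time."""
    return wall_clock_s - processed_audio_s


def visible_lag(
    reported_s: Optional[float], gate_open: bool, backend_gate: bool = False
) -> Optional[float]:
    """Lag to surface for a tap; None means "no genuine backlog to show"."""
    if backend_gate:
        return reported_s   # continuous feed: the reported value is always meaningful
    return reported_s if gate_open else None


# 30 s of wall clock but only 9 s of audio forwarded: the rest was dropped silence.
phantom = wlk_remaining_time(wall_clock_s=30.0, processed_audio_s=9.0)
```

Here the reported value is 21 s even though nothing is queued, so it is only surfaced while the gate is open.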
The bench hand-rolled its own gate + WlKRelay + lag sampling in
_drive_one_stream, so it never exercised the real TapFanOut/Recorder and
its lag accounting drifted from production (the silence-window hack didn't
even match the gate-closed suppression in tap_fan_out). A benchmark that
doesn't use the production paths can't be trusted.

Rewrite _drive_one_stream to open a real TapFanOut against a throwaway
Recorder and feed it frames exactly as the /tap WS does: the SpeechGate,
WlKRelay, per-tap lag (gate-closed suppression included), in-flight buffer,
and settled captions now all come from production code. lag/gate_open/
buffer are sampled from recorder.streams (the /api/state source); settled
lines are read from recorder.transcripts, filtered per stream identity.

Drop the dead silence-window lag hack and the per-stream silence warmup
(non-production: the gate drops it anyway). Remove --per-tap-instance: it
was non-production and the data showed it identical to the shared server,
confirming the turn-taking "collapse" was a metric artifact, not a
bottleneck. Add a fake-WlK smoke test pinning the shared code path.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
Investigating the lag rework surfaced the real cause of the staggered-run
inflation: the bridge contract is one /tap WebSocket per utterance
(bridges/README.md), so a production connection opens at speech onset —
there is never 30s of leading silence sitting on a live connection. The
old stagger modelled overlap by padding ONE long-lived connection with
leading silence the gate drops, which left WlK's per-connection clock
(remaining_time = wall_clock - processed_audio) anchored at t0 and running
through all that silence. That inflated lag was a benchmark-modelling
artifact, not a production bug.

Fix the model instead of the metric: stagger when each stream OPENS its
tap (start_delay_s = stagger*i), so each tap is a fresh per-utterance
connection anchored at its own speech onset — exactly what the bridge
does. stagger=0 = all taps open at once (overlap); large stagger =
taps open/close in sequence (turn-taking, ~1 live). Lag is now clean
without parsing WlK's caption timestamps, and the gate-closed suppression
already handles trailing/within-utterance silence.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
The production-path rewrite made the per-frame work heavier (real gate +
relay + ActiveStream lock) and I paced with a fixed sleep(20ms) AFTER that
work, so each frame took work+20ms. As N grew, per-frame work grew and the
feed silently drifted below real time — which is why lag DROPPED as streams
increased (1.5s -> 0.7s at N=4): WlK was being under-fed, not keeping up.

Pace against an absolute schedule (target = start + i*interval) so the
per-frame work is absorbed instead of stacked on the sleep; the feed stays
real time until the host genuinely can't keep up, at which point it falls
behind honestly. Move per-tap state sampling to a background task so its
overhead never inflates feed timing. Add a pacing_slip_s metric (worst
real-time slip) surfaced as the slipX column: slipX > ~0.1s flags a row
whose lag/WER are soft because the feed under-loaded WlK.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
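
Pacing against an absolute schedule can be sketched as below. `feed_paced` is a hypothetical helper, not the bench's feed loop; the clock and sleep functions are injectable only so the behavior can be checked without waiting in real time.

```python
import time
from typing import Callable, Iterable


def feed_paced(
    frames: Iterable[bytes],
    send: Callable[[bytes], None],
    interval_s: float = 0.02,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Send frame i at start + i * interval_s; return the worst real-time slip in seconds."""
    start = clock()
    worst_slip_s = 0.0
    for i, frame in enumerate(frames):
        target = start + i * interval_s
        delay = target - clock()
        if delay > 0:
            sleep(delay)                      # per-frame work is absorbed by the wait
        else:
            worst_slip_s = max(worst_slip_s, -delay)   # behind schedule: record it
        send(frame)
    return worst_slip_s
```

With a fixed `sleep(interval)` after each send, every frame costs work plus 20 ms and the feed drifts below real time. With an absolute target, the sleep shrinks by however long the work took, and any slip that remains is reported rather than hidden.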
Review of the branch surfaced three issues:

#1 (CI red) — the recut armstrong-en.wav is ~9.24s/462 frames ending on the
loud iconic line, but two level-meter tests baked in the old 12s/quiet-tail
clip: `len(frames) > 500` (now 462) and decay `< 0.05` after 600ms (the
louder ending only reaches 0.0527). Update the frame-count assertion to the
real clip and extend the silence tail to 1s so the meter drains past 0.05.

#2 — bench `errored` was dead code: _drive_one_stream never set an "error"
key, so the concurrency table always printed errored=0. Detect a tap whose
relay never attached (fan._relay_alive) and report it errored; skip errored
streams in the WER mean so a half-dead run no longer reads as clean.

#3 — a mid-feed exception leaked the tap's relay + sampler task and (via
gather) aborted the whole sweep. Wrap the feed in try/finally that always
cancels the sampler and closes the tap, and gather with
return_exceptions=True so one bad stream becomes an errored row instead of
killing the run.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
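
The cleanup in #3 can be sketched with a stand-in tap. `FakeTap`, `drive_one_stream`, and `run_streams` below are illustrative, not the bench's real objects: the point is the try/finally around the feed and `return_exceptions=True` on the outer gather.

```python
import asyncio
from typing import Optional


class FakeTap:
    """Stand-in for a production tap; it only records whether it was closed."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def _sample_state(tap: FakeTap) -> None:
    while True:                     # background sampler, runs until cancelled
        await asyncio.sleep(0.01)


async def drive_one_stream(tap: FakeTap, frames: list, fail_at: Optional[int] = None) -> dict:
    sampler = asyncio.create_task(_sample_state(tap))
    fed = 0
    try:
        for i, _frame in enumerate(frames):
            if i == fail_at:
                raise RuntimeError(f"feed failed at frame {i}")
            fed += 1
            await asyncio.sleep(0)
        return {"frames_fed": fed, "error": None}
    finally:
        # Always runs: stop the sampler and close the tap, even on a mid-feed error.
        sampler.cancel()
        await asyncio.gather(sampler, return_exceptions=True)
        await tap.close()


async def run_streams(n: int, fail_index: Optional[int] = None):
    taps = [FakeTap() for _ in range(n)]
    frames = [b"\x00" * 640] * 5
    jobs = [
        drive_one_stream(tap, frames, fail_at=2 if i == fail_index else None)
        for i, tap in enumerate(taps)
    ]
    results = await asyncio.gather(*jobs, return_exceptions=True)
    rows = [
        {"frames_fed": 0, "error": repr(r)} if isinstance(r, BaseException) else r
        for r in results
    ]
    return rows, taps


rows, taps = asyncio.run(run_streams(3, fail_index=1))
```

One failing stream becomes a row with a non-empty `error`, the other streams finish normally, and every tap is closed either way.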
…2igTO

# Conflicts:
#	.github/workflows/ci.yml
test_audio.py::test_int16_peak_norm_armstrong_speech_wav also baked in the
old 12s/600-frame armstrong-en.wav (`len(frames) > 500`); the recut 9.24s
clip is 462 frames, so it failed once the full unit suite ran post-merge.
Align it with the other two level tests and refresh the two now-stale "12 s"
docstrings to "~9 s".

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
argparse's p.error() exits, but CodeQL doesn't model it as no-return, so it
saw a path where the int() parse raised ValueError, the except 'returned',
and the validation below read an uninitialized `counts`. Bind it to [] up
front — behavior is unchanged (the `if not counts` check still fires on a
parse failure), and the use-before-init alert clears.

https://claude.ai/code/session_01ES1s8KXJTntmQYMEewiKdd
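
A sketch of the shape of that fix, assuming a comma-separated `--concurrency` value. `parse_counts` and the error messages are illustrative, not the code in tools/bench_live.py.

```python
import argparse


def parse_counts(parser: argparse.ArgumentParser, raw: str) -> list[int]:
    """Parse "1,2,4" into [1, 2, 4]; invalid input exits via parser.error()."""
    counts: list[int] = []   # bound up front so every path below sees a value
    try:
        counts = [int(tok) for tok in raw.split(",") if tok.strip()]
    except ValueError:
        # parser.error() raises SystemExit, but static analysis may not know that.
        parser.error(f"--concurrency expects comma-separated integers, got {raw!r}")
    if not counts or any(c < 1 for c in counts):
        parser.error("--concurrency needs at least one positive integer")
    return counts


parser = argparse.ArgumentParser(prog="bench_live")
counts = parse_counts(parser, "1,2,4")
```

Runtime behavior is the same with or without the early binding. It only gives the analyzer a defined value on the path it believes can fall through the `except`.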