feat(voice): add console audio IO and SessionHost audio routing #1694
toubatbrian wants to merge 1 commit into
Port the python tcp_console audio IO to agents-js. TcpAudioInput resamples inbound audio_input frames from the 48 kHz wire rate to the 24 kHz agent rate and feeds them to the STT pipeline; TcpAudioOutput resamples the agent's TTS frames back up, streams them as audio_output messages, and drives the flush/clear playout handshake (blocking the agent turn until the broker reports audio_playback_finished, or reporting an interruption when the buffer is cleared). SessionHost now accepts optional audio IO and routes inbound audio_input/audio_playback_finished messages to it. Co-authored-by: Cursor <cursoragent@cursor.com>
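The flush/clear playout handshake described above can be illustrated with a small, self-contained sketch. This is not the PR's `TcpAudioOutput`: the `PlayoutTracker` class, its method names, the injectable clock, and the clamping rule (elapsed time clamped to the duration of audio pushed so far) are all assumptions made for illustration.

```typescript
// Hypothetical sketch of a flush/clear playout handshake.
// Names and the clamping rule are assumptions, not the PR's implementation.
interface PlaybackFinished {
  playbackPosition: number; // seconds
  interrupted: boolean;
}

class PlayoutTracker {
  private now: () => number;
  private pushedSeconds = 0;
  private startedAt: number | undefined;
  private resolveFinished: ((ev: PlaybackFinished) => void) | undefined;
  lastEvent: PlaybackFinished | undefined;

  // `now` returns seconds; it is injectable so the example is deterministic.
  constructor(now: () => number) {
    this.now = now;
  }

  // Account for one frame of audio handed to the wire.
  captureFrame(samples: number, sampleRate: number): void {
    if (this.startedAt === undefined) this.startedAt = this.now();
    this.pushedSeconds += samples / sampleRate;
  }

  // End of a segment: the returned promise is what an agent turn would await.
  flush(): Promise<PlaybackFinished> {
    return new Promise((resolve) => {
      this.resolveFinished = resolve;
    });
  }

  // The broker reported that playback finished normally.
  notifyPlayoutFinished(): void {
    this.finish({ playbackPosition: this.pushedSeconds, interrupted: false });
  }

  // The buffer was cleared mid-playout: report elapsed time, clamped to pushed audio.
  clearBuffer(): void {
    if (this.startedAt === undefined) return;
    const elapsed = this.now() - this.startedAt;
    const position = Math.min(Math.max(elapsed, 0), this.pushedSeconds);
    this.finish({ playbackPosition: position, interrupted: true });
  }

  private finish(ev: PlaybackFinished): void {
    this.lastEvent = ev;
    this.resolveFinished?.(ev);
    this.resolveFinished = undefined;
    this.pushedSeconds = 0;
    this.startedAt = undefined;
  }
}
```

In this sketch, an uninterrupted segment resolves with the full pushed duration, while a cleared buffer resolves with `interrupted: true` and a position that can never exceed the audio actually pushed.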
🦋 Changeset detected
Latest commit: 126992f
The changes in this PR will be included in the next version bump. This PR includes changesets to release 34 packages.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 126992faef
    void this.transport.sendMessage(
      new pb.AgentSessionMessage({
        message: {
          case: 'audioPlaybackFlush',
          value: new pb.AgentSessionMessage_ConsoleIO_AudioPlaybackFlush(),
        },
      }),
    );
Drain the resampler before flushing playback
When a TTS segment ends, this sends audioPlaybackFlush without first emitting any frames returned by this.resampler.flush(). AudioResampler can hold tail samples until it is flushed (the existing resampling paths in utils.ts/generation.ts explicitly drain it at stream boundaries), so console playback can clip the end of each response or carry buffered samples into the next segment before the broker is told playback is complete. Please write the flushed resampled frames before sending the playback flush.
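To make the tail-sample issue concrete, here is a toy sketch. `ToyUpsampler2x` is a hypothetical stand-in, not the real `AudioResampler`: it doubles the sample rate by linear interpolation and therefore has to hold the last input sample until it sees the next one. The `WireMessage` type and `endSegment` helper are likewise illustrative assumptions.

```typescript
// Toy 2x upsampler (hypothetical): holds the last input sample until the next
// push, which is the kind of internal buffering the review comment describes.
class ToyUpsampler2x {
  private held: number | undefined;

  push(input: number[]): number[] {
    const out: number[] = [];
    for (const sample of input) {
      if (this.held !== undefined) {
        // Emit the held sample and the midpoint to the new sample.
        out.push(this.held, (this.held + sample) / 2);
      }
      this.held = sample;
    }
    return out;
  }

  // Emit whatever is still buffered; afterwards the resampler is empty.
  flush(): number[] {
    if (this.held === undefined) return [];
    const tail = [this.held, this.held];
    this.held = undefined;
    return tail;
  }
}

type WireMessage =
  | { kind: 'audioOutput'; samples: number[] }
  | { kind: 'audioPlaybackFlush' };

// Drain the resampler first, then send the flush marker.
function endSegment(resampler: ToyUpsampler2x, send: (m: WireMessage) => void): void {
  const tail = resampler.flush();
  if (tail.length > 0) send({ kind: 'audioOutput', samples: tail });
  send({ kind: 'audioPlaybackFlush' });
}
```

Without the `flush()` call in `endSegment`, the held sample would stay inside the resampler and be emitted at the start of the next segment instead.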
    override flush(): void {
      super.flush();
      void this.transport.sendMessage(
        new pb.AgentSessionMessage({
          message: {
            case: 'audioPlaybackFlush',
            value: new pb.AgentSessionMessage_ConsoleIO_AudioPlaybackFlush(),
          },
        }),
      );
🟡 Internal resampler not flushed in TcpAudioOutput.flush(), losing tail audio and leaking samples across segments
When TcpAudioOutput.flush() is called, the internal AudioResampler (24 kHz → 48 kHz) is not flushed. The resampler buffers a small number of trailing samples internally during the push() call in captureFrame() (console_io.ts:111). On flush, those buffered samples are never sent to the broker. Instead, they silently leak into the first frame of the next audio segment (when the next captureFrame() pushes data through the same resampler). This is inconsistent with how every other resampler in the codebase is used—see generation.ts:901-904, utils.ts:768-795, fallback_adapter.ts:366,529, recorder_io.ts:284—where resampler.flush() is always called at segment boundaries. The remaining frames should be sent as audioOutput messages before the audioPlaybackFlush marker.
Prompt for agents
In TcpAudioOutput.flush() (console_io.ts:120-129), the internal AudioResampler (this.resampler, 24kHz -> 48kHz) is never flushed, losing its internal buffer of trailing samples. The fix needs to call this.resampler.flush() and send the remaining resampled frames via the transport BEFORE sending the audioPlaybackFlush message. However, since flush() is a synchronous void method (matching the base class signature), the transport.sendMessage calls for flushed frames must also be fire-and-forget (using void). The key concern is ordering: the flushed audio frames must be sent before the flush marker so the broker receives them as part of the current segment. Since both are dispatched via void (fire-and-forget) in the same synchronous call, the microtask ordering should be preserved. Add a loop like: for (const resampled of this.resampler.flush()) { void this.transport.sendMessage(...audioOutput message with rtcFrameToConsole(resampled)...); } before the audioPlaybackFlush sendMessage call.
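The ordering argument above relies on a general property of JavaScript: an `async` function runs synchronously up to its first `await`. The sketch below shows that property with a hypothetical `sendMessage` that records each message before awaiting. Whether the real `transport.sendMessage` writes to the wire before its first `await` is not shown in this PR, so treat that as an assumption to verify.

```typescript
// Hypothetical transport: the write happens before the first await,
// so back-to-back fire-and-forget calls keep their call order.
const wire: string[] = [];

async function sendMessage(label: string): Promise<void> {
  wire.push(label); // runs synchronously, inside the caller's stack
  await Promise.resolve(); // stands in for asynchronous I/O
}

void sendMessage('audioOutput(tail)');
void sendMessage('audioPlaybackFlush');

// At this point both labels are already recorded, in call order.
console.log(wire.join(' -> '));
```

If a transport instead awaited something before writing, ordering would depend on its internals, and chaining the sends on a single promise would be the safer design.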
    private interrupted = new Future();

    constructor(transport: SessionTransport) {
      super(AGENT_SAMPLE_RATE, undefined, { pause: true });
🟡 TcpAudioOutput declares pause capability but never implements pause()/resume()
The constructor passes { pause: true } to the base class (console_io.ts:92), causing canPause to return true. However, TcpAudioOutput does not override pause() or resume(), and with no nextInChain, the inherited base methods are no-ops. When the pipeline's resumeFalseInterruption feature is enabled, agent_activity.ts:3842-3849 checks canPause and uses pause()/resume() to handle suspected false interruptions. Because these calls do nothing on TcpAudioOutput, audio continues playing during the "paused" state, defeating the false-interruption detection. Compare with RoomAudioOutput (room_io/_output.ts:389,400-415) which also declares { pause: true } but actually implements both methods.
Suggested change:

    - super(AGENT_SAMPLE_RATE, undefined, { pause: true });
    + super(AGENT_SAMPLE_RATE, undefined, { pause: false });
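The mismatch between a declared capability and a no-op implementation can be reproduced in a few lines. Everything in this sketch is hypothetical: `BaseOutput`, the two subclasses, and `handleSuspectedInterruption` only mimic the shape of the behaviour described above, and the fallback to a full interruption when `canPause` is false is an assumption rather than the documented behaviour of `agent_activity.ts`.

```typescript
// Hypothetical base class: pause()/resume() are no-ops unless overridden.
class BaseOutput {
  readonly canPause: boolean;
  playing = true;

  constructor(capabilities: { pause: boolean }) {
    this.canPause = capabilities.pause;
  }

  pause(): void {}
  resume(): void {}
}

// Declares the capability but inherits the no-op methods.
class NoopPauseOutput extends BaseOutput {
  constructor() {
    super({ pause: true });
  }
}

// Declares the capability and actually implements it.
class RealPauseOutput extends BaseOutput {
  constructor() {
    super({ pause: true });
  }

  pause(): void {
    this.playing = false;
  }

  resume(): void {
    this.playing = true;
  }
}

// Assumed caller logic: pause when possible, otherwise interrupt outright.
function handleSuspectedInterruption(out: BaseOutput): 'paused' | 'interrupted' {
  if (out.canPause) {
    out.pause();
    return 'paused';
  }
  out.playing = false;
  return 'interrupted';
}
```

With `NoopPauseOutput`, the caller believes playback is paused while `playing` is still `true`. Either implementing `pause()`/`resume()` or declaring `{ pause: false }` removes that inconsistency.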
Summary
Second PR in the series porting Python's TCP console session support to `agents-js` (follows #1693, the transport + `updateIo` plumbing). This adds the audio IO that lets a console-mode session exchange audio with a local broker (the LiveKit CLI `lk session` daemon).

- `TcpAudioInput` (`agents/src/voice/console_io.ts`): resamples inbound `audio_input` frames from the 48 kHz wire rate to the 24 kHz agent rate and feeds them into the base `AudioInput` stream the STT pipeline reads from.
- `TcpAudioOutput`: resamples the agent's TTS frames back up to the wire rate, streams them as `audio_output` messages, and drives the flush/clear playout handshake: a flush blocks the agent turn until the broker reports `audio_playback_finished`, or reports an interruption (with a clamped playback position) when the buffer is cleared.
- `SessionHost` now accepts optional `audioInput`/`audioOutput` and routes inbound `audio_input`/`audio_playback_finished` messages to them in `recvLoop`.

Notes / divergences from the Python port

- The Python `TcpAudioInput` uses a stdlib queue + `run_in_executor` to bridge the producer and consumer event loops under `JobExecutorType.THREAD`. The JS console job runs in-process on a single event loop, so a `StreamChannel` is sufficient; no cross-thread queue is needed.
- `PlaybackFinishedEvent.playbackPosition` is reported in seconds to match the base `AudioOutput` contract.
- `SessionHost` audio fields are typed via `import type` from `console_io.ts` (the TS equivalent of Python's `TYPE_CHECKING` import) so there's no runtime import cycle.

Reference

Ports from Python `livekit-agents`: `cli/tcp_console.py` (`TcpAudioInput`/`TcpAudioOutput`) and `voice/remote_session.py` (`SessionHost._dispatch_transport_message`).

Test plan

New `agents/src/voice/console_io.test.ts` (5 cases, all green):

- `TcpAudioInput` resamples 48 kHz wire frames to 24 kHz and exposes them on the stream
- `TcpAudioInput` drops frames pushed after close
- `TcpAudioOutput` streams resampled `audio_output` + `audio_playback_flush`, and the flush handshake completes (uninterrupted) on `notifyPlayoutFinished`
- `TcpAudioOutput` reports interruption (clamped position) when the buffer is cleared mid-playout
- `SessionHost` routes `audio_input` -> `pushFrame` and `audio_playback_finished` -> `notifyPlayoutFinished`

Also verified: `pnpm build:agents`, ESLint, and Prettier clean on the changed files. The existing room-based path is untouched.

Follow-up (next stacked PR)

`JobContext` fake-job support + CLI `console` subcommand wiring the transport + audio IO + `SessionHost` together.

Made with Cursor