Pull request overview
This PR implements streaming ASR (Automatic Speech Recognition) functionality, enabling real-time speech-to-text conversion with Voice Activity Detection (VAD). The changes introduce a new stream_asr mode that allows incremental ASR results to be sent to clients as speech is detected, rather than waiting for complete audio submission.
Changes:
- Added streaming ASR support for both Whisper and Paraformer ASR backends with real-time VAD integration
- Integrated Silero VAD model for server-side speech detection with configurable parameters
- Refactored session handling to support new WebSocket commands for streaming audio and VAD events
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/util.rs | Added bidirectional audio format conversion utilities and RIFF tag validation |
| src/services/ws/stable/mod.rs | Added stream_asr flag and helper methods for sending ASR results and control messages |
| src/services/ws/stable/asr.rs | Implemented streaming ASR methods for Whisper and Paraformer with VAD integration |
| src/services/ws.rs | Added EndVad command and stream_asr parameter support |
| src/services/mod.rs | Added stream_asr parameter to connection query params |
| src/protocol.rs | Added EndVad server event |
| src/main.rs | Added /version endpoint |
| src/config.rs | Added SileroVadconfig for VAD parameters and updated WhisperASRConfig |
| src/ai/vad.rs | Implemented VadSession and VadFactory for Silero VAD integration |
| src/ai/mod.rs | Changed logging from messages array to last_message only |
| src/ai/bailian/realtime_asr.rs | Added semantic_punctuation_enabled parameter and streaming test |
| Cargo.toml | Added silero_vad_burn and burn dependencies |
src/services/ws/stable/asr.rs
Outdated
recv_audio_bytes += data.len();
if !recv_any_asr_result && recv_audio_bytes >= 16000 * 10 {
    log::warn!(
        "`{}` paraformer asr received more than 30s audio without StartChat, starting automatically",
The condition triggers at 16000 * 10 = 160,000 bytes, which the code treats as 10 seconds of audio, but the warning message says '30s'. Either change the threshold to 16000 * 30 or change the message to '10s' so that the log matches the actual logic.
Suggested change:
- "`{}` paraformer asr received more than 30s audio without StartChat, starting automatically",
+ "`{}` paraformer asr received more than 10s audio without StartChat, starting automatically",
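One way to keep the threshold and the log message from drifting apart is to derive both from a single duration constant. The sketch below is illustrative only: the constant and function names are hypothetical, and it assumes 16 kHz, mono, 16-bit PCM input, which this review thread does not confirm.

```rust
// ASSUMPTION: 16 kHz, mono, 16-bit PCM. The PR snippet does not state the format.
const SAMPLE_RATE_HZ: usize = 16_000;
const BYTES_PER_SAMPLE: usize = 2;
const CHANNELS: usize = 1;

/// Number of raw PCM bytes that correspond to `seconds` of audio.
const fn bytes_for_seconds(seconds: usize) -> usize {
    SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * seconds
}

// Hypothetical names: the duration is stated once and reused everywhere.
const AUTO_START_SECS: usize = 10;
const AUTO_START_BYTES: usize = bytes_for_seconds(AUTO_START_SECS);

/// Builds the warning text from the same constant that drives the threshold.
fn auto_start_warning(session_id: &str) -> String {
    format!(
        "`{}` paraformer asr received more than {}s audio without StartChat, starting automatically",
        session_id, AUTO_START_SECS
    )
}

/// True once enough bytes have arrived without any ASR result.
fn should_auto_start(recv_any_asr_result: bool, recv_audio_bytes: usize) -> bool {
    !recv_any_asr_result && recv_audio_bytes >= AUTO_START_BYTES
}
```

Under the 16-bit assumption, 10 seconds is 320,000 bytes, so a literal `16000 * 10` would correspond to roughly 5 seconds. If the stream is counted in samples or uses 8-bit audio, `BYTES_PER_SAMPLE` would need to change accordingly.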
src/config.rs
Outdated
pub struct SileroVadconfig {
    #[serde(default = "SileroVadconfig::default_threshold")]
    pub threshold: f32,
    #[serde(default = "SileroVadconfig::default_neg_threshold")]
    pub neg_threshold: Option<f32>,
    #[serde(default = "SileroVadconfig::default_min_speech_duration_ms")]
    pub min_speech_duration_ms: usize,
    #[serde(default = "SileroVadconfig::default_max_silence_duration_ms")]
    pub max_silence_duration_ms: usize,
    #[serde(default = "SileroVadconfig::hangover_ms")]
    pub hangover_ms: usize,
}

impl SileroVadconfig {
The type name SileroVadconfig has inconsistent casing. Rename it to SileroVadConfig (capital 'C') to follow Rust's UpperCamelCase convention for type names.
Suggested change:
- pub struct SileroVadconfig {
-     #[serde(default = "SileroVadconfig::default_threshold")]
-     pub threshold: f32,
-     #[serde(default = "SileroVadconfig::default_neg_threshold")]
-     pub neg_threshold: Option<f32>,
-     #[serde(default = "SileroVadconfig::default_min_speech_duration_ms")]
-     pub min_speech_duration_ms: usize,
-     #[serde(default = "SileroVadconfig::default_max_silence_duration_ms")]
-     pub max_silence_duration_ms: usize,
-     #[serde(default = "SileroVadconfig::hangover_ms")]
-     pub hangover_ms: usize,
- }
- impl SileroVadconfig {
+ pub struct SileroVadConfig {
+     #[serde(default = "SileroVadConfig::default_threshold")]
+     pub threshold: f32,
+     #[serde(default = "SileroVadConfig::default_neg_threshold")]
+     pub neg_threshold: Option<f32>,
+     #[serde(default = "SileroVadConfig::default_min_speech_duration_ms")]
+     pub min_speech_duration_ms: usize,
+     #[serde(default = "SileroVadConfig::default_max_silence_duration_ms")]
+     pub max_silence_duration_ms: usize,
+     #[serde(default = "SileroVadConfig::hangover_ms")]
+     pub hangover_ms: usize,
+ }
+ impl SileroVadConfig {
    "streaming": "duplex"
},
"payload": {
    "task_group": "audio",
The task_group field is added without explanation or documentation. Consider adding a comment explaining why this field is necessary and what impact it has on the ASR behavior.
src/services/ws.rs
Outdated
.map_err(|_| anyhow::anyhow!("audio_tx closed"))?;

if DEBUG_WAV {
    if debug_wav_data.len() > 0 {
Use !debug_wav_data.is_empty() instead of debug_wav_data.len() > 0. It is the idiomatic Rust emptiness check and is what Clippy's len_zero lint recommends.
Suggested change:
- if debug_wav_data.len() > 0 {
+ if !debug_wav_data.is_empty() {
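As a small illustration of the suggested idiom, the check can be wrapped in a helper. The function name below is hypothetical and is not part of the PR; only the `is_empty` call mirrors the suggestion.

```rust
/// Hypothetical helper: returns true when there is buffered debug audio to write out.
/// `is_empty()` states the intent directly, whereas `len() > 0` makes the reader
/// infer it from a numeric comparison.
fn has_pending_debug_audio(debug_wav_data: &[u8]) -> bool {
    !debug_wav_data.is_empty()
}
```

The two forms behave identically for slices and `Vec`, so the change should be purely stylistic.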