Feat/stream asr by L-jasmine · Pull Request #37 · second-state/echokit_server

L-jasmine · 2026-01-22T19:16:40Z

No description provided.

Copilot

Pull request overview

This PR implements streaming ASR (Automatic Speech Recognition) functionality, enabling real-time speech-to-text conversion with Voice Activity Detection (VAD). The changes introduce a new stream_asr mode that allows incremental ASR results to be sent to clients as speech is detected, rather than waiting for complete audio submission.

Changes:

Added streaming ASR support for both Whisper and Paraformer ASR backends with real-time VAD integration
Integrated Silero VAD model for server-side speech detection with configurable parameters
Refactored session handling to support new WebSocket commands for streaming audio and VAD events

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 4 comments.

File	Description
src/util.rs	Added bidirectional audio format conversion utilities and RIFF tag validation
src/services/ws/stable/mod.rs	Added stream_asr flag and helper methods for sending ASR results and control messages
src/services/ws/stable/asr.rs	Implemented streaming ASR methods for Whisper and Paraformer with VAD integration
src/services/ws.rs	Added EndVad command and stream_asr parameter support
src/services/mod.rs	Added stream_asr parameter to connection query params
src/protocol.rs	Added EndVad server event
src/main.rs	Added /version endpoint
src/config.rs	Added SileroVadconfig for VAD parameters and updated WhisperASRConfig
src/ai/vad.rs	Implemented VadSession and VadFactory for Silero VAD integration
src/ai/mod.rs	Changed logging from messages array to last_message only
src/ai/bailian/realtime_asr.rs	Added semantic_punctuation_enabled parameter and streaming test
Cargo.toml	Added silero_vad_burn and burn dependencies
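
The src/util.rs row mentions bidirectional audio format conversion. Those helpers are not shown on this page, so the following is only a minimal sketch of what such a conversion often looks like. It assumes 16-bit little-endian PCM on the wire and f32 samples on the model side. The function names are hypothetical and are not taken from the repository.

```rust
/// Convert little-endian 16-bit PCM bytes to f32 samples in [-1.0, 1.0).
/// (Hypothetical helper; the PR's actual utilities are not shown on this page.)
fn pcm16_bytes_to_f32(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(2) // a trailing odd byte, if any, is ignored
        .map(|b| i16::from_le_bytes([b[0], b[1]]) as f32 / 32768.0)
        .collect()
}

/// Convert f32 samples back to little-endian 16-bit PCM bytes.
/// Out-of-range input is clamped to [-1.0, 1.0] before scaling.
fn f32_to_pcm16_bytes(samples: &[f32]) -> Vec<u8> {
    samples
        .iter()
        .flat_map(|s| {
            let v = (s.clamp(-1.0, 1.0) * 32767.0).round() as i16;
            v.to_le_bytes()
        })
        .collect()
}
```

The two directions use slightly different scale factors (32768 and 32767), so a round trip is not bit-exact at the negative extreme. A real implementation would need to pick whichever convention its ASR and VAD backends expect.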

Copilot · 2026-01-22T19:17:44Z

src/services/ws/stable/asr.rs

+                                recv_audio_bytes += data.len();
+                                if !recv_any_asr_result && recv_audio_bytes >= 16000 * 10 {
+                                    log::warn!(
+                                        "`{}` paraformer asr received more than 30s audio without StartChat, starting automatically",


The condition checks for 10 seconds of audio (16000 * 10 = 160,000 bytes) but the warning message states '30s'. Either the calculation should be 16000 * 30 or the message should say '10s' to match the actual logic.

Suggested change

-"`{}` paraformer asr received more than 30s audio without StartChat, starting automatically",
+"`{}` paraformer asr received more than 10s audio without StartChat, starting automatically",
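
How many seconds the byte threshold represents depends on the sample format, which this excerpt does not state. The sketch below does the arithmetic under explicit assumptions. `pcm_duration` is a hypothetical helper, and the 16 kHz, 16-bit, mono figures are assumptions rather than values confirmed by the PR.

```rust
use std::time::Duration;

/// Duration covered by `bytes` of raw PCM.
/// All three format parameters must be non-zero (otherwise this divides by zero).
fn pcm_duration(bytes: usize, sample_rate_hz: u32, bytes_per_sample: u32, channels: u32) -> Duration {
    let bytes_per_second =
        u64::from(sample_rate_hz) * u64::from(bytes_per_sample) * u64::from(channels);
    // Integer arithmetic in milliseconds keeps the result exact for round inputs.
    Duration::from_millis(bytes as u64 * 1000 / bytes_per_second)
}
```

Under these assumptions, `pcm_duration(16000 * 10, 16_000, 2, 1)` is 5 seconds, while the same byte count read as 8-bit samples (`bytes_per_sample = 1`) is 10 seconds. If the stream really is 16-bit PCM, neither "10s" nor "30s" would describe the threshold accurately, so the wire format is worth confirming before settling on the message text.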

Copilot · 2026-01-22T19:17:45Z

src/config.rs

+pub struct SileroVadconfig {
+    #[serde(default = "SileroVadconfig::default_threshold")]
+    pub threshold: f32,
+    #[serde(default = "SileroVadconfig::default_neg_threshold")]
+    pub neg_threshold: Option<f32>,
+    #[serde(default = "SileroVadconfig::default_min_speech_duration_ms")]
+    pub min_speech_duration_ms: usize,
+    #[serde(default = "SileroVadconfig::default_max_silence_duration_ms")]
+    pub max_silence_duration_ms: usize,
+    #[serde(default = "SileroVadconfig::hangover_ms")]
+    pub hangover_ms: usize,
+}
+
+impl SileroVadconfig {


Type name has inconsistent casing. Should be SileroVadConfig (capital 'C') to follow Rust naming conventions for type names.

Suggested change

-pub struct SileroVadconfig {
-    #[serde(default = "SileroVadconfig::default_threshold")]
-    pub threshold: f32,
-    #[serde(default = "SileroVadconfig::default_neg_threshold")]
-    pub neg_threshold: Option<f32>,
-    #[serde(default = "SileroVadconfig::default_min_speech_duration_ms")]
-    pub min_speech_duration_ms: usize,
-    #[serde(default = "SileroVadconfig::default_max_silence_duration_ms")]
-    pub max_silence_duration_ms: usize,
-    #[serde(default = "SileroVadconfig::hangover_ms")]
-    pub hangover_ms: usize,
-}
-
-impl SileroVadconfig {
+pub struct SileroVadConfig {
+    #[serde(default = "SileroVadConfig::default_threshold")]
+    pub threshold: f32,
+    #[serde(default = "SileroVadConfig::default_neg_threshold")]
+    pub neg_threshold: Option<f32>,
+    #[serde(default = "SileroVadConfig::default_min_speech_duration_ms")]
+    pub min_speech_duration_ms: usize,
+    #[serde(default = "SileroVadConfig::default_max_silence_duration_ms")]
+    pub max_silence_duration_ms: usize,
+    #[serde(default = "SileroVadConfig::hangover_ms")]
+    pub hangover_ms: usize,
+}
+
+impl SileroVadConfig {
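
To give a rough idea of how parameters like these usually interact, here is a minimal sketch of a hysteresis-based speech segmenter driven by per-frame speech probabilities. It is not the PR's VadSession and does not use the silero_vad_burn API. `VadParams`, `Segmenter`, and `VadEvent` are hypothetical names, the meaning given to each field is an assumption, `hangover_ms` is left out because its semantics are not shown here, and the `threshold - 0.15` fallback for `neg_threshold` is assumed from the convention in the upstream Silero VAD helpers.

```rust
#[derive(Debug, PartialEq)]
enum VadEvent {
    SpeechStart,
    SpeechEnd,
}

/// Hypothetical subset of the VAD parameters; field meanings are assumed.
struct VadParams {
    threshold: f32,                 // probability needed to count a frame as speech
    neg_threshold: Option<f32>,     // probability below which a frame counts as silence
    min_speech_duration_ms: usize,  // speech needed before SpeechStart fires
    max_silence_duration_ms: usize, // silence needed before SpeechEnd fires
}

struct Segmenter {
    params: VadParams,
    frame_ms: usize,
    in_speech: bool,
    speech_ms: usize,  // consecutive speech while idle
    silence_ms: usize, // consecutive silence while in speech
}

impl Segmenter {
    fn new(params: VadParams, frame_ms: usize) -> Self {
        Segmenter { params, frame_ms, in_speech: false, speech_ms: 0, silence_ms: 0 }
    }

    /// Exit threshold: explicit value, or an assumed fallback of threshold - 0.15.
    fn exit_threshold(&self) -> f32 {
        self.params
            .neg_threshold
            .unwrap_or((self.params.threshold - 0.15).max(0.01))
    }

    /// Feed one frame's speech probability; returns an event on a state change.
    fn push(&mut self, prob: f32) -> Option<VadEvent> {
        if !self.in_speech {
            if prob < self.params.threshold {
                self.speech_ms = 0;
                return None;
            }
            self.speech_ms += self.frame_ms;
            if self.speech_ms >= self.params.min_speech_duration_ms {
                self.in_speech = true;
                self.silence_ms = 0;
                return Some(VadEvent::SpeechStart);
            }
            return None;
        }
        if prob >= self.exit_threshold() {
            self.silence_ms = 0;
            return None;
        }
        self.silence_ms += self.frame_ms;
        if self.silence_ms >= self.params.max_silence_duration_ms {
            self.in_speech = false;
            self.speech_ms = 0;
            return Some(VadEvent::SpeechEnd);
        }
        None
    }
}
```

Using two thresholds keeps the segmenter from flapping when the probability hovers near a single cut-off. In a streaming setup, a SpeechEnd event would be a natural point at which to finalize an utterance, although how the PR maps VAD state to its EndVad event is not shown on this page.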

Copilot · 2026-01-22T19:17:45Z

src/ai/bailian/realtime_asr.rs

                "streaming": "duplex"
            },
            "payload": {
+                "task_group": "audio",


The task_group field is added without explanation or documentation. Consider adding a comment explaining why this field is necessary and what impact it has on the ASR behavior.

Copilot · 2026-01-22T19:17:46Z

src/services/ws.rs

+                        .map_err(|_| anyhow::anyhow!("audio_tx closed"))?;
+
+                    if DEBUG_WAV {
+                        if debug_wav_data.len() > 0 {


Use !debug_wav_data.is_empty() instead of debug_wav_data.len() > 0 for better idiomatic Rust code.

Suggested change

-if debug_wav_data.len() > 0 {
+if !debug_wav_data.is_empty() {

L-jasmine added 5 commits January 21, 2026 19:02

feat: add version endpoint

041ce61

feat: embeding silero_vad for whisper

d6a9608

feat: stream whisper-asr

faea950

feat: impl stream vad & asr when use ParaformerRealtimeV2Asr

0d1ab01

fix: don't recv asr event before start_pcm_recognition

af4ef7f

L-jasmine requested a review from Copilot January 22, 2026 19:16

Copilot AI reviewed Jan 22, 2026

fix typo

c738ca8

L-jasmine merged commit 89b1913 into main Jan 22, 2026
1 check passed

L-jasmine merged 6 commits into main from feat/stream_asr