feat: voice-activity streaming mode & inner-vad for speech-to-text module#1160

Open
IgorSwat wants to merge 12 commits into main from @is/vad-streaming

Conversation

@IgorSwat
Contributor

@IgorSwat IgorSwat commented May 20, 2026

Description

This PR introduces changes focused on the voice-activity-detection (VAD) module and its use within the library:

  • Native-side VAD streaming - introduces a continuous voice-activity-detection mechanism with a callback-based API. Example usage from the demo app:
  await model.stream({
    onSpeechBegin: () => {...},
    onSpeechEnd: () => {...},
    options: {...},
  });
  • VAD x STT integration - adds an option to use voice-activity detection within the speech-to-text module, significantly improving the effective performance of the STT.
  • Demo apps - adds a new screen to the speech demo app, VoiceActivityDetectionScreen, and changes the behavior of SpeechToTextScreen by adding a toggle that switches the VAD submodule for STT on or off.

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

  • To test VAD streaming: run the VoiceActivityDetectionScreen within the Speech demo app.
  • To test the VAD & STT integration: run the SpeechToTextScreen within the Speech demo app, with the VAD toggle on.

Screenshots

Related issues

#1118

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

@IgorSwat IgorSwat requested review from chmjkb and msluszniak May 20, 2026 13:09
@IgorSwat IgorSwat force-pushed the @is/vad-streaming branch from 694fe4f to 1c2411e Compare May 20, 2026 13:15
@IgorSwat IgorSwat changed the base branch from main to @is/speech-to-text-ultimate May 20, 2026 13:26
@IgorSwat IgorSwat force-pushed the @is/speech-to-text-ultimate branch from 02113ff to 6bba141 Compare May 20, 2026 15:46
Comment thread apps/speech/screens/SpeechToTextScreen.tsx
Comment thread apps/speech/screens/VoiceActivityDetectionScreen.tsx
Base automatically changed from @is/speech-to-text-ultimate to main May 21, 2026 08:20
@IgorSwat IgorSwat force-pushed the @is/vad-streaming branch from 1c2411e to 0ea858d Compare May 21, 2026 08:55
@msluszniak msluszniak added the feature PRs that implement a new feature label May 21, 2026
@IgorSwat IgorSwat requested a review from benITo47 May 21, 2026 12:49
@msluszniak
Member

Please also fix these warnings:
[image: screenshot of the warnings]

Comment thread docs/docs/03-hooks/01-natural-language-processing/useSpeechToText.md Outdated
Comment thread docs/docs/04-typescript-api/01-natural-language-processing/VADModule.md Outdated
}
})();

while (this.isStreaming && !finished) {
Member

stream() resolves as soon as this.isStreaming flips, but the native loop only re-checks the flag at the top of the next iteration — so for up to timeout + one inference after await streamStop() returns, the native streamer is still alive, can still queue callInvoker_->invokeAsync callbacks, and still touches audioBuffer_. If the caller then runs unload() (or the host object is destroyed) we're in UAF / use-after-unload territory.

Two options: (a) actually join — stream() doesn't resolve until the native stream() call returns, and streamStop() awaits that; or (b) document explicitly that unload() is not safe immediately after streamStop() and that callbacks may fire after the promise resolves. (a) is the safer contract.
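
A minimal sketch of option (a), under stated assumptions: `JoinedStreamer` and its `nativeStream` parameter are hypothetical stand-ins rather than names from this PR, and the native call is assumed to return a promise that settles only once the native loop has fully exited. It illustrates the contract, not the actual fix.

```typescript
// Hypothetical wrapper illustrating a "join on stop" contract.
// None of these names are taken from the PR; they exist only for the sketch.
class JoinedStreamer {
  private isStreaming = false;
  // Promise tracking the in-flight native loop; null when idle.
  private loop: Promise<void> | null = null;
  // Stand-in for the native stream() call. Assumed to settle only after
  // the native loop has returned and no further callbacks can be queued.
  private readonly nativeStream: (shouldContinue: () => boolean) => Promise<void>;

  constructor(nativeStream: (shouldContinue: () => boolean) => Promise<void>) {
    this.nativeStream = nativeStream;
  }

  stream(): Promise<void> {
    if (this.loop) return Promise.reject(new Error("already streaming"));
    this.isStreaming = true;
    this.loop = (async () => {
      try {
        await this.nativeStream(() => this.isStreaming);
      } finally {
        // Only reached once the native side has actually returned.
        this.isStreaming = false;
        this.loop = null;
      }
    })();
    return this.loop;
  }

  // Flip the flag, then wait for the native loop to exit (the "join").
  async streamStop(): Promise<void> {
    const loop = this.loop;
    this.isStreaming = false;
    // Errors from the loop surface through stream()'s promise, not here.
    if (loop) await loop.catch(() => undefined);
  }

  async unload(): Promise<void> {
    await this.streamStop();
    // Native resources could be released here: no native work remains.
  }
}
```

With this shape, `await streamStop()` followed by `unload()` is ordered after the last native iteration, at the cost of `streamStop()` taking up to one timeout plus one inference to resolve.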

runForward((inst) => inst.stream(input));

const streamInsert = (waveform: Float32Array) =>
runForward((inst) => {
Member

@msluszniak msluszniak May 21, 2026

Both streamInsert and streamStop go through runForward, which gates on isGenerating. While stream() is running, isGenerating is true, so every streamInsert(buffer) call rejects with "model is currently generating" — which is what Jakub hit on the rapid-tap repro. That thread is marked resolved, but I don't see the fix in either useVAD.ts or VADModule.ts. At minimum streamInsert (a buffer push) must bypass runForward for the streaming API to function; arguably streamStop should bypass too — you may want to stop precisely because inference is stuck.
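
To make the failure mode concrete, here is a small self-contained model of the gate. Only `runForward`, `isGenerating`, `stream`, `streamInsert`, and `streamStop` echo names from the comment above; the class `SketchVADModule`, the `streamInsertGated` variant, and the internals are assumptions for illustration and do not reproduce the module factory's real code.

```typescript
// Hypothetical model of a module whose calls are serialized by one flag.
class SketchVADModule {
  private isGenerating = false;
  private chunks: Float32Array[] = [];
  private resolveStream: (() => void) | null = null;

  // Gate: rejects any call made while another gated call is in flight.
  private async runForward<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isGenerating) throw new Error("model is currently generating");
    this.isGenerating = true;
    try {
      return await fn();
    } finally {
      this.isGenerating = false;
    }
  }

  // Long-running call: holds the gate until streamStop() is called.
  stream(): Promise<void> {
    return this.runForward(
      () => new Promise<void>((resolve) => { this.resolveStream = resolve; })
    );
  }

  // Gated variant: always rejects while stream() is running.
  streamInsertGated(waveform: Float32Array): Promise<void> {
    return this.runForward(async () => { this.chunks.push(waveform); });
  }

  // Bypassing variant: a plain buffer push that never touches the gate.
  streamInsert(waveform: Float32Array): void {
    this.chunks.push(waveform);
  }

  // Also bypasses the gate, so a stuck stream can still be stopped.
  streamStop(): void {
    if (this.resolveStream) this.resolveStream();
    this.resolveStream = null;
  }

  get queuedChunks(): number {
    return this.chunks.length;
  }
}
```

In this model, calling `streamInsertGated` while `stream()` is pending rejects with "model is currently generating", whereas `streamInsert` and `streamStop` keep working because they never consult the flag.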

Contributor Author

@IgorSwat IgorSwat May 22, 2026

If it's true, then that's a Module Factory issue - and I don't think this PR is a good place to fix it.

@msluszniak msluszniak linked an issue May 21, 2026 that may be closed by this pull request
Collaborator

@chmjkb chmjkb left a comment

besides my previous comment, I think it looks good, great work!

Labels

feature PRs that implement a new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement continuous voice activity detection

4 participants