Description
Feature Type
Nice to have
Feature Description
Voice users often produce filler or backchannel speech (e.g. “um”, “uh”, “hmm”, “yeah”) while listening.
Today, such utterances can unintentionally trigger interruption logic, causing the agent to stop speaking even when the user does not intend to interrupt.
This feature allows agents to ignore configurable filler words during interruption detection, preventing unnecessary interruptions when users speak only filler content.
Solution Overview
Introduce an optional configuration:
interruption_ignore_words=["um", "uh", "like", "hmm"]
If a transcript contains only ignored words, the interruption is suppressed.
If at least one non-ignored word is present, interruption proceeds normally.
This acts as a content-based filter layered on top of existing interruption logic.
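The rule above can be illustrated with a minimal sketch. The function name `is_filler_only` is hypothetical and not part of any existing API. The sketch assumes whitespace tokenization and that an empty transcript should not count as filler.

```python
import string


def is_filler_only(transcript: str, ignore_words: set[str]) -> bool:
    """Return True when every word in the transcript is an ignored filler word.

    Hypothetical helper for illustration; not the project's implementation.
    """
    # Lowercase and strip surrounding punctuation so "Um," matches "um".
    words = [w.strip(string.punctuation).lower() for w in transcript.split()]
    # Drop tokens that consisted only of punctuation.
    words = [w for w in words if w]
    # An empty transcript has no filler words, so it is not "filler-only".
    return bool(words) and all(w in ignore_words for w in words)


ignore = {"um", "uh", "like", "hmm"}
print(is_filler_only("Um, uh...", ignore))          # True: interruption suppressed
print(is_filler_only("um, wait a second", ignore))  # False: interruption proceeds
```

A single non-ignored word is enough to let the interruption through, which matches the "at least one non-ignored word" rule.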
Behavior Summary
- Filler-only speech does not interrupt
- Meaningful speech interrupts as usual
- Existing timing and word-count thresholds remain unchanged
- Feature is fully opt-in
Integration Across Turn Detection Modes
The ignore-word filter applies consistently across:
- VAD / STT-based interruption detection
- Realtime LLM turn detection
- Preemptive generation
- False interruption resume logic
No changes are made to interruption timing or scheduling semantics.
Backward Compatibility
- Default value is None
- Existing behavior is preserved
- No API-breaking changes
- Ignore words act only as an additional filter layer
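To make the opt-in layering concrete, here is a rough sketch of how the filter might sit behind an existing word-count check. The function `should_interrupt` and its simplified threshold logic are assumptions for illustration only; the real interruption path also involves timing, which is omitted here.

```python
from typing import Optional, Sequence


def should_interrupt(
    transcript: str,
    min_interruption_words: int = 0,
    interruption_ignore_words: Optional[Sequence[str]] = None,
) -> bool:
    """Hypothetical decision function; a simplified stand-in, not the real logic."""
    words = transcript.lower().split()
    # Existing check (simplified): require a minimum number of words.
    if len(words) < min_interruption_words:
        return False
    # Opt-in: with the default None, the filter is skipped entirely,
    # so the outcome is whatever the existing check decided.
    if interruption_ignore_words is None:
        return True
    ignored = {w.lower() for w in interruption_ignore_words}
    tokens = [w.strip(".,!?") for w in words]
    # Interrupt only if at least one non-empty token is not an ignored word.
    return any(t not in ignored for t in tokens if t)


print(should_interrupt("um"))                                    # True (filter off)
print(should_interrupt("um", interruption_ignore_words=["um"]))  # False
print(should_interrupt("um, stop", interruption_ignore_words=["um"]))  # True
```

Because the ignore-word check runs only after the existing threshold has passed, it can suppress an interruption but never create one.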
Performance Impact
- Latency: <1ms per interruption decision
- CPU: Constant-time set lookups
- Memory: Minimal (small word lists)
Notes
- Requires STT to be enabled for transcript-based filtering
- Case-insensitive matching with punctuation stripping
- Intended to improve conversational naturalness, not replace timing-based controls
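The matching note above (case-insensitive, punctuation stripped) could be implemented in several ways. One possible sketch uses a regular expression to extract words, so that typographic quotes and punctuation without surrounding spaces are also handled. The name `normalize_words` is hypothetical.

```python
import re

# A word is a run of letters/digits, optionally joined by apostrophes ("don't").
# Everything else (punctuation, quotes, whitespace) acts as a separator.
_WORD_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def normalize_words(text: str) -> list[str]:
    """Split text into case-folded words with punctuation removed (illustrative)."""
    return [m.group(0).casefold() for m in _WORD_RE.finditer(text)]


# Normalize the configured list the same way as the transcript, and store it
# in a frozenset so each membership test is a constant-time lookup.
ignore = frozenset(w.casefold() for w in ["Um", "uh", "hmm"])

words = normalize_words("Um...uh, HMM!")
print(words)                            # ['um', 'uh', 'hmm']
print(all(w in ignore for w in words))  # True
```

Normalizing both sides identically avoids mismatches such as a configured "Um" never matching a transcribed "um,".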
Workarounds / Alternatives
Currently, users can partially mitigate this by increasing min_interruption_words, but this also delays legitimate interruptions and does not distinguish filler speech from meaningful input.