Utterance detection
Detect end of speech utterances by configuring the utterance_end_ms parameter.
Utterance detection identifies when a speaker has finished a turn by monitoring for a period of silence. When the silence period exceeds a threshold you specify, the server sends an UtteranceEnd message.
Why do you need utterance detection?
An utterance is a continuous segment of speech that contains everything a person says before they stop talking. It might be a single sentence, several sentences, or just a few words. The key distinction is the silence that follows: when a speaker pauses long enough, that silence marks the boundary between one utterance and the next.
In a conversation, utterances roughly correspond to "turns." For example, in a voice assistant interaction, the user's question is one utterance and the assistant's response is another. You can use utterance detection to segment conversations, trigger processing between turns, or build turn-based interaction flows.
UtteranceEnd events require word timestamps, which are included by default in Results messages.
Configure the silence threshold
To enable utterance detection, add the utterance_end_ms parameter to the WebSocket query string, setting it to the silence duration, in milliseconds, that should trigger an utterance end:
```
wss://stt-api.subq.ai/v1/listen?utterance_end_ms=1000&encoding=mp3
```
This example fires an UtteranceEnd event after 1 second of silence. Choose a value based on your use case: a shorter threshold such as 500 ms responds quickly but might trigger during natural mid-thought pauses, while a longer threshold such as 1500 ms is more reliable for detecting true turn endings but adds delay.
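As a minimal sketch, the query string can be assembled with the standard library before opening the WebSocket connection. The base endpoint and parameter names come from this page; the helper function name is illustrative:

```python
# Build the streaming URL with utterance detection enabled.
# The 1000 ms threshold is an example value you would tune for your use case.
from urllib.parse import urlencode

BASE_URL = "wss://stt-api.subq.ai/v1/listen"

def build_listen_url(utterance_end_ms: int, encoding: str = "mp3") -> str:
    """Return a WebSocket URL with the silence threshold in the query string."""
    params = {"utterance_end_ms": utterance_end_ms, "encoding": encoding}
    return f"{BASE_URL}?{urlencode(params)}"

url = build_listen_url(1000)
# wss://stt-api.subq.ai/v1/listen?utterance_end_ms=1000&encoding=mp3
print(url)
```

Pass the resulting URL to whatever WebSocket client your application already uses.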
UtteranceEnd message
When the silence threshold is reached, the server sends a message like this:
```json
{
  "type": "UtteranceEnd",
  "channel": [0],
  "last_word_end": 2.5
}
```
| Field | Description |
|---|---|
| `type` | Always `"UtteranceEnd"` |
| `channel` | Array indicating which channel detected the utterance end |
| `last_word_end` | Timestamp (in seconds) of the last detected word |
Utterance detection vs. endpointing
Both utterance detection and endpointing respond to silence, but they serve different purposes. Endpointing finalizes individual transcript segments within a turn; utterance detection signals that the entire turn is over. You can use both together: endpointing finalizes sentences as they are spoken, and utterance detection tells you when the speaker is done.
| Feature | Purpose | Output |
|---|---|---|
| Endpointing | Finalizes a transcript segment (is_final: true) | Updated Results message |
| Utterance detection | Signals that a speaker finished a turn | Separate UtteranceEnd message |
Related content
- Endpointing to fine-tune sentence finalization timing
- VAD events to detect when speech starts
- WebSocket protocol