In an industry obsessed with making AI not just sound human, but feel human, I’ve found myself fixated on a less glamorous but crucial part of the puzzle: Turn Detection.
Turn Detection is about teaching Voice AI when to speak and when to hang back, so it doesn’t feel like it’s constantly interrupting you.
It's the invisible process that, when executed poorly, can transform promising technology into an awkward digital parody of human-agent interaction.
Most systems today rely on Voice Activity Detection (VAD) to determine when someone is talking. VAD is great for picking up the raw sound of speech, but real human conversations are about more than just sound waves (or the lack thereof).
We rely on a complex mix of semantics, context, tone, and those subtle, unspoken cues that tell us when someone is pausing to think, or catching their breath. See below for an example...
Turn detection in conversational AI also has real financial consequences. For enterprises racing to deploy voice AI agents, the stakes of getting turn-taking wrong are high.
You face an annoying dilemma: make your system wait too long before responding (feeling sluggish), or risk interrupting customers mid-thought (feeling rude).
I've seen this play out across industries.
In finance, customers spelling out names or reading complex account numbers get cut off mid-sequence when they pause to check information.
In healthcare, patients spelling medical terms or recalling patient IDs are interrupted mid-sentence, hindering accurate data capture and eroding trust.
In retail, shoppers providing delivery addresses experience interruptions during natural pauses as they recall information.
This may sound all too familiar if you've ever experienced a call like this:
---
Agent: "Can I have your 8-digit account ID?"
Customer: "Yeah sure, the ID is um, Two-One-Four... (pauses to check document)**
Agent (interrupting): "I'm sorry, I didn't catch that. Can you please provide me with your full 8-digit account ID?"
Customer (frustrated): "Ugh, okay. It’s Two-One-Four-Three-Seven-Nine..."
---
In this case, the voice agent misinterpreted a natural pause as the end of the turn and mistakenly assumed the ID was incomplete, prompting an unnecessary re-prompt. A human agent would likely have the patience to wait for the full response, understanding that pauses are part of natural speech.
The financial impact extends beyond customer frustration. If your voice AI's "brain" is powered by a large language model (LLM), especially one hosted by an external provider, prematurely sending incomplete user responses leads directly to unnecessary expenses.
In an age where every LLM API call incurs a cost, you're essentially paying to process and respond to incomplete thoughts, then paying again to fix the misunderstandings they create.
For me, the moment that crystallized this problem came during our daily testing sessions with our Voice AI Agent product, Flow. While interacting with our voice agent, I found myself unnaturally rushing my speech and avoiding normal pauses.
Why? Because I knew from experience that any natural hesitation would trigger an interruption.
In human conversations, we expect those brief moments when someone needs to think or recall information. But our AI lacked this fundamental understanding, creating interactions that felt rushed and mechanical.
For businesses, the consequences extend far beyond awkward conversations:
- You're watching customers flee: When people find interactions with AI voice agents frustrating, they abandon automated systems for human agents, negating the very efficiency gains you implemented the system to achieve.
- You're hemorrhaging compute costs: Every premature interruption drives up LLM API expenses through unnecessary reprocessing, misinterpretations, and more frequent retries. This dramatically increases compute usage and cloud spend.
- Your competitive advantage is slipping: As consumers grow more sophisticated in their AI interactions, their tolerance for clumsy conversational mechanics is plummeting.
The businesses poised to win the conversational AI race aren't just those with the most powerful models, but those who master the subtle art of conversational rhythm, knowing not just what to say, but when to say it.
But to achieve this mastery, we first need to understand why current systems fail so consistently at what seems like a simple human task.
The root cause of these costly interruptions lies in the fundamental limitations of using only VAD for Turn Detection.
VAD is a system that listens to audio and identifies when someone is speaking. It typically uses a neural network to check if the incoming audio is human speech. If VAD detects speech, we can assume the person is still talking. But if it detects silence for a certain amount of time (usually about 1 second), we can treat that as the end of the person's turn.
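To make this concrete, here is a minimal sketch of VAD-only end-of-turn logic. It assumes 16 kHz, 16-bit mono PCM audio arriving in 30 ms frames and uses the open-source webrtcvad package purely for illustration; it is not our production VAD.

```python
import webrtcvad

SAMPLE_RATE = 16000          # 16 kHz, 16-bit mono PCM assumed throughout
FRAME_MS = 30                # webrtcvad accepts 10, 20 or 30 ms frames
SILENCE_TIMEOUT_MS = 1000    # ~1 second of silence is treated as end of turn

vad = webrtcvad.Vad(2)       # aggressiveness from 0 (permissive) to 3 (strict)


def wait_for_end_of_turn(frames):
    """Consume audio frames and return once the speaker has been
    'silent' for SILENCE_TIMEOUT_MS."""
    silence_ms = 0
    for frame in frames:                    # each frame: 30 ms of raw PCM bytes
        if vad.is_speech(frame, SAMPLE_RATE):
            silence_ms = 0                  # speech detected: reset the silence clock
        else:
            silence_ms += FRAME_MS          # accumulate trailing silence
        if silence_ms >= SILENCE_TIMEOUT_MS:
            return                          # assume the user has finished speaking
```

The weakness is visible in the code itself: silence is the only signal, so a pause to check a document looks exactly like a finished sentence.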
The challenge with VAD-based turn detection systems is that they only consider acoustic signals from the audio, making them prone to misinterpreting natural pauses. Users often pause mid-sentence to think, recall information, or formulate a response, and these pauses can sound identical to an end-of-turn.
Unlike humans, who rely on a symphony of non-verbal cues like facial expressions, eye contact, or body language, voice AI must make the same judgments using only audio. I call this the "uncanny valley of conversation," where interactions feel just human enough to set expectations, but not sophisticated enough to meet them.
Consider this example: if someone says, "I understand your point, but…" and pauses for a second (or more) before continuing, VAD would likely call this an end of turn, but a human listener would intuitively keep listening.
Digging into this technical challenge reveals another counterintuitive insight: state-of-the-art voice agents can now respond as little as 200 milliseconds after a user finishes speaking.
Technically, that's impressive, but it's not natural at all.
In human conversations, a typical pause between turns is around 600 milliseconds. That slight delay feels thoughtful and respectful, giving the impression the listener is processing what was said, not robotically waiting to respond.
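One simple way to respect that rhythm is to hold back a reply that is ready early. The sketch below is illustrative only; `generate_reply` and `last_speech_time` are hypothetical stand-ins for whatever your pipeline provides.

```python
import time

MIN_RESPONSE_GAP_S = 0.6  # ~600 ms gap tends to feel natural in conversation


def respond_with_natural_pacing(last_speech_time, generate_reply):
    """Delay a reply so that at least MIN_RESPONSE_GAP_S has elapsed since
    the user last spoke, even if the model finished sooner.

    last_speech_time: a time.monotonic() timestamp of the last speech frame.
    generate_reply:   a callable that returns the agent's reply.
    """
    reply = generate_reply()  # may come back in well under 200 ms
    elapsed = time.monotonic() - last_speech_time
    if elapsed < MIN_RESPONSE_GAP_S:
        time.sleep(MIN_RESPONSE_GAP_S - elapsed)  # hold the reply briefly
    return reply
```

Pacing alone doesn't fix premature cut-offs, though; it only governs how quickly the agent speaks once it has (rightly or wrongly) decided the turn is over.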
The solution requires a complete rethinking of how AI listens.
That's why, at Speechmatics, we're building a turn detection system that incorporates genuine semantic understanding.
Large language models (LLMs) are ideally suited for this task, as they are trained to deeply understand natural language and conversational patterns. However, real-time voice AI systems are highly latency-sensitive. We can't rely on off-the-shelf LLM APIs that introduce significant delays.
Our work centers on exploring smaller, optimized language models that can deliver the necessary level of semantic comprehension for accurate end-of-turn detection without compromising on latency. This balance between accuracy and performance is critical for delivering smooth user experiences in voice interactions.
Our approach goes beyond analyzing just the current utterance. It incorporates meaning and context from the broader conversation.
Referring back to the earlier example, when your AI voice agent asks for an eight-digit account ID and hears "Two-One-Four..." followed by a pause, it needs contextual awareness to recognize that this response is likely incomplete.
Without this understanding, you're stuck in an endless cycle of interruptions and frustrations.
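To make the idea concrete, here is a deliberately simplified sketch of how an acoustic silence signal and a semantic completeness score can be fused. This is not our production system: `semantic_model.completeness()` stands in for a small, latency-optimized language model, and the thresholds are illustrative.

```python
from dataclasses import dataclass


@dataclass
class TurnState:
    agent_prompt: str         # e.g. "Can I have your 8-digit account ID?"
    transcript_so_far: str    # e.g. "Yeah sure, the ID is um, Two-One-Four"
    trailing_silence_ms: int  # silence measured by VAD since the last speech frame


def is_end_of_turn(state, semantic_model,
                   base_timeout_ms=1000, max_timeout_ms=3000):
    """End the turn quickly when the utterance reads as complete,
    and keep listening when it clearly does not."""
    # Probability in [0, 1] that the user has finished their thought,
    # given what the agent asked and what has been said so far.
    p_complete = semantic_model.completeness(
        context=state.agent_prompt,
        text=state.transcript_so_far,
    )

    if p_complete > 0.9:
        # e.g. all eight digits heard: respond after a short, natural gap
        return state.trailing_silence_ms >= 300
    if p_complete < 0.3:
        # e.g. "I understand your point, but..." : keep listening much longer
        return state.trailing_silence_ms >= max_timeout_ms
    # Ambiguous cases fall back to a conventional silence timeout
    return state.trailing_silence_ms >= base_timeout_ms
```

The exact numbers don't matter; the point is that silence stops being the only signal, so "Two-One-Four..." no longer looks like a finished answer.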
The journey to solve this technical challenge has revealed something profoundly human. Conversation isn't merely an exchange of information; it's a combination of attention, respect, and understanding.
The most sophisticated thing your AI assistant can learn isn't generating sub-200-millisecond responses; it's knowing when to stay silent.
Working on this problem daily has reinforced my belief that the future belongs to companies who recognize that true intelligence isn't just about processing speed or vocabulary size.
It's about rhythmic intelligence and the ability to participate in the subtle, unspoken processes that make conversation feel natural.
As we enter this new era of voice-first interfaces, perhaps the most valuable thing your artificial assistants can learn isn't what to say, but when to say nothing at all.