Nov 21, 2025 | Read time 5 min

Voice AI doesn't need to be faster. It needs to read the room.

Is it time to rethink responsiveness as a conversational problem, not just a technical one?
Sam Sykes, Senior Director of Innovation

Think about the last time you spoke with someone who replied before you'd finished your sentence. Someone so eager to respond that they missed your actual point, derailed your train of thought, and left you undeniably frustrated.

That's what we've built into voice AI.

For years, the industry has optimized for a single metric: speed.

We've built ultra-low-latency systems that can turn speech into text faster than most humans can finish a thought - like our own Real-Time Speech-to-Text API. We've created infrastructure that scales to millions of concurrent conversations. And models that achieve accuracy rates that would have seemed impossible five years ago.

Extraordinary technology.

But while we're shaving milliseconds off latency every other week, we're still building systems that interrupt users mid-sentence. We're solving some of the hardest technical problems, but conversations still feel jarringly unnatural. 

What's clear to me is that we're not doing a good enough job on the conversational problem. And as a result, we're measuring the wrong things.

We've passed the point of diminishing returns

We've convinced ourselves there's a simple equation: faster = better.

But there's a threshold below which speed improvements become invisible, maybe even counterproductive. And I don't think that threshold is where we imagine it to be.

To be clear: there are absolutely contexts where milliseconds matter:

  • Emergency services

  • Navigation while driving

  • Customer-service queues where efficiency is the point

  • Any task where delay creates friction

Every millisecond genuinely matters in those scenarios.

But we've equated responsiveness with quality across the board. We assume that faster means better, that lower latency equals superior user experience.

Human conversation doesn't work that way.

Real dialogue has rhythm, breath, and space.

It requires the ability to recognize when someone has actually finished speaking versus when they're simply pausing to think.

If you and I were having this conversation right now and I hadn't replied yet, it wouldn't be because I'm slow; it would be because I'm still forming my thought.

There's no reason it should be any different when you're talking to a voice assistant.

But our current systems can't make that distinction. Silence = end of turn. Always.
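
To make that limitation concrete, here's a minimal sketch of the fixed-threshold endpointing most voice pipelines effectively use today. The function name and threshold value are illustrative assumptions, not settings from any particular SDK.

```python
# A naive end-of-turn detector: any silence longer than a fixed threshold
# ends the user's turn, whether they were finished or just thinking.
FIXED_SILENCE_THRESHOLD_S = 0.7  # illustrative value

def is_end_of_turn(silence_duration_s: float) -> bool:
    # No notion of task type, speaker habits, or disfluencies:
    # silence alone decides that the turn is over.
    return silence_duration_s >= FIXED_SILENCE_THRESHOLD_S
```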

How humans actually experience waiting

Recently, I built a voice assistant for ordering Chinese takeaway. I added restaurant background noise during the silences. Ambient sound, dishes clattering, the hum of conversation. It transformed the experience. 

The pause while the AI thinks doesn't feel awkward anymore. It feels natural, like you're really on the phone with a busy restaurant.

The background noise reveals a deeper truth: We’re so obsessed with eliminating latency that we’ve forgotten how humans actually experience time in conversation.

Three variables we're ignoring

The problem isn't latency itself, it's that we're measuring it wrong.

If you're building a voice agent today, you need to rethink what "responsive" actually means.

I call this conversational latency: not how fast the system responds, but whether that response rhythm matches the conversation's actual needs.

Working out exactly how this redefined form of latency behaves keeps me up most nights. But I've made a start with three initial variables (that I intend to expand on):

1. Task type

When someone asks for train times or needs a quick yes or no answer, speed is everything.

That's a transactional exchange. But most conversations aren't like that. Most are exploratory, iterative, messy. They need room to breathe. The system needs to recognize the difference and adjust accordingly; task type should dictate how aggressively you optimize for speed versus space.
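
As a rough illustration of that idea (the categories and threshold values below are my own assumptions, not product settings), task type could set the latency budget directly:

```python
from enum import Enum

class TaskType(Enum):
    TRANSACTIONAL = "transactional"  # train times, quick yes/no answers
    EXPLORATORY = "exploratory"      # open-ended, iterative, messy conversations

# Illustrative values: transactional exchanges get aggressive endpointing,
# exploratory ones get room to breathe.
SILENCE_THRESHOLDS_S = {
    TaskType.TRANSACTIONAL: 0.5,
    TaskType.EXPLORATORY: 1.5,
}

def silence_threshold_for(task: TaskType) -> float:
    return SILENCE_THRESHOLDS_S[task]
```

The exact numbers matter far less than the principle: the system commits to a pacing strategy per task, rather than chasing one global latency target.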

2. Speaker profile

Some people talk fast and expect instant replies.

Others pause, ramble, think aloud.

I've been experimenting with categorizing speakers: really slow, slow, normal, fast.

If the system knows Person A speaks quickly and gets to the point, it can respond instantly. If it recognizes that Person B needs time to articulate, it can wait. 

Breaking this down further, if someone consistently pauses mid-sentence and then continues, the system should learn to treat those pauses as thinking gaps, not end-of-turn signals.

When we built our voice agent, we programmed it to wait 1.5 seconds after someone says "um" or "ahh" before responding. Those verbal disfluencies are tells that someone's still thinking, still forming their point.

During that pause, the agent keeps processing in the background - but it stays ready to adapt if the person starts talking again. If they do continue, it doesn't bulldoze ahead with whatever it was about to say. It listens, recalibrates, adjusts. That sort of approach pushes us beyond technical tuning and into real conversational awareness, something our Speaker Diarization already supports by differentiating speakers with high precision.
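
Here's a sketch of how those signals might combine into one adaptive end-of-turn decision. The speaker pace categories and the 1.5-second disfluency wait come from above; everything else (the names, the base thresholds, the mid-sentence-pause adjustment) is assumed purely for illustration.

```python
from dataclasses import dataclass

DISFLUENCIES = {"um", "uh", "ahh", "er"}

# Illustrative base silence thresholds per speaker pace category.
PACE_THRESHOLDS_S = {
    "fast": 0.4,
    "normal": 0.8,
    "slow": 1.2,
    "really_slow": 1.6,
}

@dataclass
class SpeakerProfile:
    pace: str = "normal"               # "really_slow" | "slow" | "normal" | "fast"
    pauses_mid_sentence: bool = False  # learned habit: pauses to think, then continues

def end_of_turn(profile: SpeakerProfile, last_word: str, silence_s: float) -> bool:
    threshold = PACE_THRESHOLDS_S[profile.pace]
    if profile.pauses_mid_sentence:
        # Treat this speaker's pauses as thinking gaps, not end-of-turn signals.
        threshold += 0.5
    if last_word.lower().strip(".,!?") in DISFLUENCIES:
        # "um" or "ahh" is a tell that the thought isn't finished:
        # hold for at least 1.5 seconds before responding.
        threshold = max(threshold, 1.5)
    return silence_s >= threshold
```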

3. Perceived waiting time

Silence feels different depending on context.

The restaurant test I mentioned proves this: add ambient noise during pauses, and suddenly the wait while your agent completes its task doesn't feel awkward anymore. The sound gives us something to focus on other than the silence.

Small design choices matter:

  • Subtle backchannels (“I’m checking that now…”)

  • Ambient sound

  • Soft tones

  • Clear signals the system is still listening

When you and I talk, we're not just listening to words. We're processing visual cues, audio cues, breathing patterns. We're constantly assessing whether the other person has truly finished or whether they're still forming their thoughts. Voice AI strips away most of those signals, yet we've done almost nothing to compensate for that loss.
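
As a sketch of those design choices: the waiting experience can be shaped independently of the processing time itself. The agent call and audio cues below are hypothetical placeholders, not a real API, but they show the shape of the idea - keep the channel alive while the work happens.

```python
import asyncio

async def run_agent_task(request: str) -> str:
    # Stand-in for the real work: lookups, tool calls, generation.
    await asyncio.sleep(2.0)
    return f"Right, I've got that down: {request}"

async def fill_the_silence() -> None:
    # Keep the line alive while the agent works: ambient sound, a soft tone,
    # or a spoken backchannel like "I'm checking that now..."
    print("[ambient restaurant noise]")
    await asyncio.sleep(1.0)
    print("agent: I'm checking that now...")

async def respond(request: str) -> None:
    # The filler masks perceived waiting time without changing actual latency.
    filler = asyncio.create_task(fill_the_silence())
    answer = await run_agent_task(request)
    filler.cancel()
    print(f"agent: {answer}")

asyncio.run(respond("two portions of salt and pepper tofu"))
```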

From speed metrics to conversation metrics

Currently, we ask: How fast was it? How accurate was the transcription? How clean was the punctuation?

The better question: did it do what it needed to do?

Accuracy matters - especially in domains like medical transcription.

But in many instances, grammatical accuracy doesn't.

Neither the user nor the agent cares if a comma is missing.

They care if the intent is understood.

We've spent years optimizing for how transcripts look on a screen, perfecting capitalization and punctuation that humans never consciously process in actual speech. Meanwhile, we've largely ignored the conversational dynamics that make or break real interactions.

Developers have given us the infrastructure. They have solved, or are solving, many of the hard problems: scale, accuracy, reliability, multilingual support. They've created the foundation that makes any of this possible.

The next layer won’t come from shaving another 50ms off response time.

It will come from understanding conversational latency; from measuring whether the rhythm of interaction actually serves the conversation.