Nov 21, 2025 | Read time 5 min

Voice AI doesn't need to be faster. It needs to read the room.

Is it time to rethink responsiveness as a conversational problem, not just a technical one?
Sam Sykes, Senior Director of Innovation

Think about the last time you spoke with someone who replied before you'd finished your sentence. Someone so eager to respond that they missed your actual point, derailed your train of thought, and left you undeniably frustrated.

That's what we've built into voice AI.

For years, the industry has optimized for a single metric: speed.

We've built ultra-low latency systems that can turn speech into text faster than most humans can finish a thought - like our own Real-Time Speech-to-Text API. We've created infrastructure that scales to millions of concurrent conversations, and models that achieve accuracy rates that would have seemed impossible five years ago.

Extraordinary technology.

But while we're shaving milliseconds off latency every other week, we're still building systems that interrupt users mid-sentence. We're solving some of the hardest technical problems, but conversations still feel jarringly unnatural. 

What's clear to me is that we're not doing a good enough job on the conversational problem. And as a result, we're measuring the wrong things.

We've passed the point of diminishing returns

We've convinced ourselves there's a simple equation: faster = better.

But there's a threshold below which speed improvements become invisible, maybe even counterproductive. And I don't think that threshold is where we imagine it to be.

To be clear: there are absolutely contexts where milliseconds matter:

  • Emergency services

  • Navigation while driving

  • Customer-service queues where efficiency is the point

  • Any task where delay creates friction

Every millisecond genuinely matters in those scenarios.

But we've equated responsiveness with quality across the board. We assume that faster means better, that lower latency equals superior user experience.

Human conversation doesn't work that way.

Real dialogue has rhythm, breath, and space.

It requires the ability to recognize when someone has actually finished speaking versus when they're simply pausing to think.

If you and I were having this conversation right now and I hadn't replied, it's not because I'm slow, it's because I'm still forming my thought.

There's no reason that shouldn't be the same when you're talking to a voice assistant.

But our current systems can't make that distinction. Silence = end of turn. Always.
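The rule in most pipelines today really is that blunt. A minimal sketch (the cutoff value is illustrative, not from any particular product):

```python
# Naive end-of-turn detection: any silence past a fixed cutoff ends
# the user's turn. This is the behavior the article critiques.
END_OF_TURN_SILENCE_S = 0.7  # illustrative fixed threshold

def naive_turn_ended(silence_s: float) -> bool:
    # The system has no other signal: silence length is the whole story.
    return silence_s >= END_OF_TURN_SILENCE_S
```

A 0.9-second thinking pause and a 0.9-second finished thought produce the same answer, so the agent barges in on the first.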

How humans actually experience waiting

Recently, I built a voice assistant for ordering Chinese takeaway. I added restaurant background noise during the silences. Ambient sound, dishes clattering, the hum of conversation. It transformed the experience. 

The pause while the AI thinks doesn't feel awkward anymore. It feels natural, like you're really on the phone with a busy restaurant.

The background noise reveals a deeper truth: We’re so obsessed with eliminating latency that we’ve forgotten how humans actually experience time in conversation.

Three variables we're ignoring

The problem isn't latency itself, it's that we're measuring it wrong.

If you're building a voice agent today, you need to rethink what "responsive" actually means.

I call this conversational latency: not how fast the system responds, but whether that response rhythm matches the conversation's actual needs.

Working out exactly how this redefined form of latency works keeps me up most nights. But I've made a start with three initial variables (that I intend to expand on):

1. Task type

When someone asks for train times or needs a quick yes or no answer, speed is everything.

That's a transactional exchange. But most conversations aren't like that. Most are exploratory, iterative, messy. They need room to breathe. The system needs to recognize the difference and adjust accordingly; task type should dictate how aggressively you optimize for speed versus space.
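One way to make that adjustment concrete is a task-type lookup for pause tolerance. A hypothetical sketch; the categories and values are mine, not tuned numbers:

```python
# Hypothetical mapping from task type to end-of-turn pause tolerance:
# how many seconds of silence before the agent takes its turn.
PAUSE_TOLERANCE_S = {
    "transactional": 0.3,  # train times, yes/no answers: speed wins
    "exploratory": 1.2,    # messy, iterative conversation: leave space
}

def end_of_turn_threshold(task_type: str) -> float:
    # Default to the generous setting when the task is unclassified;
    # interrupting a thinker costs more than a slightly slow reply.
    return PAUSE_TOLERANCE_S.get(task_type, 1.2)
```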

2. Speaker profile

Some people talk fast and expect instant replies.

Others pause, ramble, think aloud.

I've been experimenting with categorizing speakers: really slow, slow, normal, fast.

If the system knows Person A speaks quickly and gets to the point, it can respond instantly. If it recognizes that Person B needs time to articulate, it can wait. 

Breaking this down further, if someone consistently pauses mid-sentence and then continues, the system should learn to treat those pauses as thinking gaps, not end-of-turn signals.
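That learning could be as simple as tracking a speaker's observed mid-sentence pauses and deriving both a category and a per-speaker threshold from them. A sketch under my own assumptions; the bucket boundaries and multiplier are illustrative:

```python
from statistics import mean

class SpeakerProfile:
    """Learn a speaker's rhythm from observed mid-utterance pauses."""

    def __init__(self):
        self.pauses = []  # mid-sentence pauses, in seconds

    def observe_pause(self, seconds: float) -> None:
        self.pauses.append(seconds)

    def category(self) -> str:
        # Bucket speakers roughly as the article suggests.
        if not self.pauses:
            return "normal"
        avg = mean(self.pauses)
        if avg < 0.3:
            return "fast"
        if avg < 0.7:
            return "normal"
        if avg < 1.2:
            return "slow"
        return "really slow"

    def end_of_turn_threshold(self) -> float:
        # Wait noticeably longer than this speaker's typical thinking
        # pause before treating silence as end of turn.
        if not self.pauses:
            return 1.0
        return 1.5 * mean(self.pauses)
```

A fast talker gets near-instant replies; someone who routinely pauses for a second mid-thought earns a longer grace period.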

When we built our voice agent, we programmed it to wait 1.5 seconds after someone says "um" or "ahh" before responding. Those verbal disfluencies are tells that someone's still thinking, still forming their point.

During that pause, the agent keeps processing in the background, but it stays ready to adapt if the person starts talking again. If they do continue, it doesn't bulldoze ahead with whatever it was about to say. It listens, recalibrates, adjusts. That approach pushes us beyond technical tuning and into real conversational awareness, building on the speaker-level understanding our Speaker Diarization already provides in differentiating speakers with high precision.
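The disfluency rule itself is tiny to express. A sketch of the 1.5-second hold described above; the function, constant names, and baseline value are mine, not Speechmatics APIs:

```python
# Hold longer after a filler word: it's a tell that the speaker is
# still forming their point, not handing over the turn.
DISFLUENCIES = {"um", "uh", "ahh", "er", "hmm"}
DISFLUENCY_HOLD_S = 1.5  # the wait described in the article
DEFAULT_HOLD_S = 0.7     # illustrative baseline

def hold_before_reply(last_word: str) -> float:
    # Normalize away case and trailing punctuation before matching.
    if last_word.lower().strip(".,!?") in DISFLUENCIES:
        return DISFLUENCY_HOLD_S
    return DEFAULT_HOLD_S
```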

3. Perceived waiting time

Silence feels different depending on context.

The restaurant test I mentioned proves this: add ambient noise during pauses, and suddenly the wait for your agent to complete its task doesn't feel awkward anymore. The sound gives our attention somewhere to go, so we don't fixate on the silence.

Small design choices matter:

  • Subtle backchannels ("I'm checking that now…")

  • Ambient sound

  • Soft tones

  • Clear signals the system is still listening

When you and I talk, we're not just listening to words. We're processing visual cues, audio cues, breathing patterns. We're constantly assessing whether the other person has truly finished or whether they're still forming their thoughts. Voice AI strips away most of those signals, yet we've done almost nothing to compensate for that loss.
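Those design choices can be staged by how long the wait has lasted. A hypothetical sketch; the thresholds and cue names are illustrative, not measured values:

```python
def perceived_wait_cue(elapsed_s: float):
    """Decide what, if anything, to play while the agent is still working.

    The idea is the restaurant test above: mask the silence so the
    wait feels natural rather than awkward.
    """
    if elapsed_s < 1.0:
        return None            # short gaps read as normal rhythm
    if elapsed_s < 4.0:
        return "ambient"       # background noise fills the space
    return "backchannel"       # e.g. "I'm checking that now..."
```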

From speed metrics to conversation metrics

Currently, we ask: How fast was it? How accurate was the transcription? How clean was the punctuation?

The better question: did it do what it needed to do?

Accuracy matters - especially in domains like medical transcription.

But in many instances, grammatical accuracy doesn't.

Neither the user nor the agent cares if a comma is missing.

They care if the intent is understood.

We've spent years optimizing for how transcripts look on a screen, perfecting capitalization and punctuation that humans never consciously process in actual speech. Meanwhile, we've largely ignored the conversational dynamics that make or break real interactions.

Developers have given us the infrastructure. They've solved, or are solving, the hard problems of scale, accuracy, reliability, and multilingual support. They created the foundation that makes any of this possible.

The next layer won’t come from shaving another 50ms off response time.

It will come from understanding conversational latency; from measuring whether the rhythm of interaction actually serves the conversation.
