Jan 21, 2026 | Read time 4 min

You can’t hurry love, but you can hurry final transcripts

Introducing 250ms final transcripts for Voice AI
Archie McMullan, Speechmatics Graduate

TL;DR

  • For responsive Voice AI, you need transcripts straight away. Once speech has finished, send us the new ForceEndOfUtterance message and we’ll do the rest.

  • Turn Detection → Send ForceEndOfUtterance → Receive Final Transcript (~250ms)

  • Trigger the message using your own client-side logic, or handle the entire pipeline automatically with our new Voice SDK, Pipecat integration, and upcoming LiveKit update.

Introduction

In the world of voice agents, the "awkward silence" is the enemy.

We all know the feeling. You finish asking a voice assistant a question.

And then you wait.

That gap…

Between you finishing your sentence…

And the AI responding…

Is where the magic of conversational AI lives or dies.

For us at Speechmatics, solving this isn't just about raw processing speed (though we have that). It's about having the confidence to know when to send you the finals.

We are introducing Forced End of Utterance (FEOU) to let you tell us when you want them. 

The philosophy is simple: If you are happy that speech is over, then so are we.

The Problem: The Transcription Waiting Game

To understand why this feature matters, we have to look at how transcription works under the hood. Real-time transcription engines typically output two things:

  1. Partials: Low-latency, evolving best guesses of what is being said. (e.g., "I want two...")

  2. Finals: Highest-confidence, punctuated, stable text. (e.g., "I want to go to the park.")
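To make the distinction concrete, here is a simplified sketch of what the two message types look like on the wire. The field shapes are illustrative, not the exact payload schema; check the Speechmatics real-time API reference for the full format.

```python
import json

# Simplified shapes of the two message types a real-time engine emits.
partial = {
    "message": "AddPartialTranscript",  # low-latency best guess; may still change
    "transcript": "I want two",
}
final = {
    "message": "AddTranscript",  # stable, punctuated text; safe to act on
    "transcript": "I want to go to the park.",
}

# A client typically displays partials immediately, but only forwards
# finals to downstream consumers such as an LLM.
for msg in (partial, final):
    stable = msg["message"] == "AddTranscript"
    print(json.dumps(msg["transcript"]), "stable" if stable else "evolving")
```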

Most systems (voice agents, scribes, etc.) wait for the accurate "Finals" before spending LLM tokens on text that could still change. The problem? To generate "Finals", standard engines hold back for a fixed period of silence to ensure the user has truly finished speaking.

Traditionally, the engine plays it safe. It waits for a preset amount of silence or a "Max Delay" timer to expire before locking in the text. It needs to ensure that the silence after "Two..." isn't just a pause before "...hundred."

For captioning, this ensures the highest accuracy. For a conversational voice agent, that buffer feels like an eternity.

The Solution: Force End of Utterance (FEOU)

With the new ForceEndOfUtterance message, we have decoupled "Turn Detection" from "Transcription". Your client can send a ForceEndOfUtterance message at the right moment, forcing the current segment to close and immediately emitting a final transcript, ready to be passed straight to the LLM.

ForceEndOfUtterance diagram

If you have a Voice Activity Detector (VAD), a "Push-to-Talk" button, or a sophisticated multimodal model that predicts when a user has finished speaking, you can now immediately signal our engine.

When we receive an FEOU message, we stop waiting. We immediately process all the audio in our buffer, apply punctuation, finalize the transcript, and send it back to you.
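The message itself is tiny. Here is a minimal sketch of building it as JSON; the message name comes from this post, but treat the exact payload as an assumption and verify it against the real-time API reference.

```python
import json

def force_end_of_utterance() -> str:
    """Build the client -> server message that forces finalization.

    The message name is from this post; any additional required fields
    are an assumption -- check the Speechmatics RT API docs.
    """
    return json.dumps({"message": "ForceEndOfUtterance"})

# In a real client you would send this text frame over the open websocket
# the moment your turn-detection logic fires, e.g.:
#   await websocket.send(force_end_of_utterance())
print(force_end_of_utterance())
```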

The triple lock of latency

With this release, you now have three distinct levers to pull to optimize your agent's responsiveness:

1. The Manual Override: ForceEndOfUtterance

This is the new power feature. You send a JSON message to the websocket, and we finalize immediately.

  • Best for: Sophisticated voice agents using client-side VAD, push-to-talk interfaces, or whatever else you can dream up.

  • Benefit: Lowest possible latency. You define the rules.

2. The Safety Net: end_of_utterance_silence_trigger

This is a server-side setting. You tell us: "If you hear X milliseconds without speech, send an EndOfUtterance (EOU) message."

  • Best for: A backup to ensure turns are eventually closed if the client-side signal fails.

  • Recommendation: Keep this lower than max_delay (e.g., 0.5s - 0.8s).

3. The Backstop: max_delay

The classic setting. A consistent delay before finalizing transcription; the longer it is, the more audio context the engine has before it must finalize.

  • Best for: Ensuring text eventually appears on screen during long monologues or dictation.

  • Recommendation: 1s - 2s for conversational agents.
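Putting the two server-side levers together, a session config might look like the sketch below. The nesting of end_of_utterance_silence_trigger and the surrounding field names are an assumption based on this post; verify them against the Speechmatics real-time API reference before relying on them.

```python
# Sketch of a StartRecognition payload combining the two server-side levers.
# Field nesting is an assumption -- confirm against the RT API reference.
start_recognition = {
    "message": "StartRecognition",
    "audio_format": {"type": "raw", "encoding": "pcm_s16le", "sample_rate": 16000},
    "transcription_config": {
        "language": "en",
        "max_delay": 1.0,  # the backstop: finalize within 1s regardless
        "conversation_config": {
            # the safety net: emit EndOfUtterance after 0.6s of silence
            "end_of_utterance_silence_trigger": 0.6,
        },
    },
}

# Sanity-check the recommendation above: silence trigger < max_delay.
tc = start_recognition["transcription_config"]
assert tc["conversation_config"]["end_of_utterance_silence_trigger"] < tc["max_delay"]
print("config ok")
```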

Using ForceEndOfUtterance

The easiest way to start bringing your end-of-turn latency down is to use our new Python Voice SDK. It brings together a host of helpful features for voice AI and can handle the entire turn detection pipeline for you. It includes the open-source Silero voice activity detector and the Pipecat turn detection model (Smart Turn V3) to detect the end of a turn and send the FEOU message.

Alternatively, in both our real-time (RT) and Voice SDKs, you can use your own logic to decide when to finalize and send the ForceEndOfUtterance message yourself. The engine will reply with an EndOfUtterance message. For a complete, runnable example of how to implement this logic, check out our Voice Agent Turn Detection tutorial on GitHub.
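The client-side shape of that logic can be sketched as below. The VAD and websocket are stand-ins (real code would wire in e.g. Silero VAD and your websocket client); only the ForceEndOfUtterance message name comes from this post.

```python
import json
from typing import Callable

def make_turn_handler(send: Callable[[str], None]) -> Callable[[bool], None]:
    """Return a VAD callback that fires ForceEndOfUtterance once per turn.

    `send` is whatever pushes a text frame onto your open websocket.
    The VAD integration is a stand-in for a real detector.
    """
    in_turn = {"active": False}

    def on_vad(speaking: bool) -> None:
        if speaking:
            in_turn["active"] = True
        elif in_turn["active"]:
            # User stopped speaking: force finalization immediately.
            send(json.dumps({"message": "ForceEndOfUtterance"}))
            in_turn["active"] = False

    return on_vad

# Simulate one turn: speech starts, then stops -> exactly one FEOU is sent.
sent = []
handler = make_turn_handler(sent.append)
for speaking in (True, True, False, False):
    handler(speaking)
print(sent)
```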

Our new Pipecat integration and upcoming LiveKit update will, out of the box, use their internal turn detection to trigger finalization for you automatically.

Start building now with the integrations section of the Speechmatics Academy.

Let's talk numbers

250ms: The New Standard for Finals

From the moment your client sends the ForceEndOfUtterance signal, you can expect to receive a final transcript in approximately 250ms (network dependent).

To understand why this is significant, we need to look at the total latency equation.

The Old Equation (Standard Engines)

In a traditional cloud STT setup, you are at the mercy of the server's safety settings.

Latency = Server Silence Buffer + Processing Time

Most engines enforce a silence buffer of 700ms–1000ms. You pay this "waiting tax" on every single turn to ensure accuracy, regardless of the context.

The New Equation (FEOU)

With FEOU, we remove the server-side wait entirely.

Latency = Your Turn Detection + Speechmatics Processing (250ms)

This puts the latency budget entirely in your hands.

  • Your Turn Detection: You decide the logic.

  • VAD or Smart Turn: You can even send us the signal pre-emptively, so that once the turn is complete the transcript is ready.

  • Push-to-Talk: If you use a hardware button, this is effectively 0ms. The moment the user releases the button, we begin finalization.

  • Our Processing: We handle the heavy lifting of finalizing text and formatting punctuation in that ~250ms window.
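The two equations are easy to compare directly. Using illustrative figures from this section (a 700ms server-side buffer at the low end, push-to-talk as the zero-cost turn detector):

```python
# Illustrative latency budgets using the figures quoted above.
old_silence_buffer_ms = 700  # typical fixed server-side wait (700-1000ms)
processing_ms = 250          # the Speechmatics finalization window

old_latency = old_silence_buffer_ms + processing_ms  # the "waiting tax"
new_latency = 0 + processing_ms                      # push-to-talk: ~0ms detection

print(f"standard engine: {old_latency}ms, FEOU + push-to-talk: {new_latency}ms")
print(f"saved per turn: {old_latency - new_latency}ms")
```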

Conclusion

Don’t wait to reduce your latency.

By using ForceEndOfUtterance, you get the best of both worlds: the unparalleled accuracy of Speechmatics' transcription, with the snappy, real-time responsiveness required for modern voice agents. Try it out for yourself in the Speechmatics Academy.

If your system is confident the user is done, don't wait for us.

Force the finals.
