Oct 2, 2025 | Read time 5 min

Why we built our low-latency Text-to-Speech

Most TTS sounds great in demos but breaks in real conversations. We built ours for sub-150ms latency, natural voices, and global scale.
Stuart Wood, Product Manager

TL;DR:

  • Voices that sound natural in offline demos often degrade once a voice agent demands sub-300ms latency, especially in languages other than English.

  • Our streaming TTS delivers <150ms response times without sacrificing quality.

  • Built for real-time applications, with SaaS or on-prem deployment.

  • Free to try until Nov 1 in the Portal and API.

  • We built our new streaming TTS API for developers who need consistent, humanlike speech at low latency.

High latency kills conversation

Every 100ms matters in live conversation. Traditional TTS systems often introduce a 1 to 3 second delay, which is fine for static content but terrible for voice agents.

In demos or batch processing you might hear ultra-realistic voices, but when those same voices must respond in under 300ms, vendors often switch to smaller, faster models and quality drops.

We have seen this first-hand with several partners building voice agents: the voice that initially wowed stakeholders can disappoint when integrated into the agent.

Our approach to text-to-speech

We built our TTS from day one with real-time streaming in mind.

The preview delivers speech with latency under 150 milliseconds without sacrificing naturalness, so you get fluid, human-like voices and the snappy response time needed for interactive applications.
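For interactive applications, the number to watch is time-to-first-audio: how long after you send text until the first playable chunk arrives. A minimal sketch of how you might measure it, assuming a streaming client that yields audio as a byte iterator (the chunk source below is simulated, not a real API call):

```python
import time

def time_to_first_chunk_ms(chunks):
    """Milliseconds from now until the first non-empty audio chunk arrives."""
    start = time.perf_counter()
    for chunk in chunks:
        if chunk:  # skip empty keep-alive chunks
            return (time.perf_counter() - start) * 1000.0
    raise RuntimeError("stream ended without audio")

# Simulated stream: in practice this would be the byte iterator
# returned by your TTS client's streaming call.
def simulated_stream(delay_s=0.05):
    time.sleep(delay_s)       # stand-in for network + synthesis latency
    yield b"\x00" * 3200      # ~100ms of 16kHz 16-bit mono audio
    yield b"\x00" * 3200

latency = time_to_first_chunk_ms(simulated_stream())
print(f"time to first audio: {latency:.0f} ms")
```

Measured this way, against a real endpoint, the figure includes network round-trip as well as synthesis, which is what your users actually experience.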

Beyond English: Tackling robotic voices in other languages

Another big reason we built our own TTS is the language gap in voice synthesis.

English TTS voices have become impressively natural, but many non-English voices still sound unnatural or "robotic".

We kept hearing this from clients using existing vendors, across languages like Dutch, French, and Thai.

One-size-fits-all systems, older techniques, or limited data for smaller languages lead to flat intonation and choppy cadence.

Making multilingual TTS more human

Speechmatics has a strong legacy in multilingual speech (our speech-to-text supports 55+ languages with high accuracy), and we’re bringing that expertise to TTS.

Our preview delivers a highly natural English voice now, and we’re actively working on more languages.

Our goal is to provide authentic voices in each language, capturing the nuances of local accents and speech patterns.

Building truly natural voices beyond English is a challenge, but we have a decade of experience building speech tech from the ground up.

More languages are coming, and we will not be satisfied until “robotic” is a thing of the past for all of them.

The hidden cost of talking at scale

Most TTS APIs charge per character or per minute of audio, and those costs add up quickly.

We’ve seen organizations hesitate to add voice everywhere or limit spoken content due to cost.

At millions of sentences or hours of speech, bills get hefty, which is not great for scaling or experimentation.

Keeping TTS costs grounded

We want TTS to be cost-effective, so you do not have to think twice about volume. By leveraging our own models and infrastructure, and being thoughtful about model size and optimization, we aim to keep costs reasonable and predictable.

We also simplify pricing and packaging alongside our speech-to-text: one contract, one platform. This simplifies integration, budgeting, and support.

Our TTS is currently free to try in the Portal and API until November 1st, so you can gauge fit.

By making TTS more affordable, we hope more applications move from proof-of-concept to real-world use.

One platform, zero compromises

Consolidating to one trusted vendor for both directions of speech simplifies procurement and compliance.

Many industries have strict data privacy rules. Just as our STT can be deployed on-premises or in your private cloud, our TTS is built with the same privacy-first ethos.

Many customers need on-prem or edge TTS for sensitive use cases (think healthcare, finance) or connectivity reasons.

Our TTS engine can be used in our cloud, your cloud, or on your own servers next to our STT engine, so you can keep voice data in-house and meet latency requirements without sacrificing quality.

It slots into existing workflows, from cloud API prototypes to containerized models on thousands of devices worldwide.

Built on a decade of speech experience (and why that matters)

We are not starting from scratch. Speechmatics has spent over 10 years at the bleeding edge of speech technology, primarily in speech-to-text. TTS and STT are two sides of the same coin.

Our expertise in acoustic modelling, pronunciation, prosody, and handling different accents and noise conditions feeds directly into generating realistic speech. We are pouring that knowledge into our TTS models.

As one of the leading speech recognition companies, we have solved hard problems in multilingual audio and will now leverage that foundation to make synthesized speech inclusive and accurate.

Our mission has always been to understand every voice. Adding TTS was a natural step toward that mission. We built this preview to address real-world problems we saw again and again, expanding on our past R&D in speech. Imagine voice agents that truly sound human and respond in real time, in any language.

That is the future we are working toward.

Try it out and tell us what you think

The Text-to-Speech Preview is live today. Log in to the Speechmatics Portal and enter some text to generate speech.

For developers, our API documentation shows how to integrate streaming TTS with just a few lines of code.
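The general shape of such an integration is to send text and consume audio chunks as they arrive, rather than waiting for a complete file. A sketch under that assumption (the endpoint, headers, and parameters in the comment are illustrative placeholders, not the actual Speechmatics API; see the API documentation for the real interface):

```python
import io

def stream_tts_to_buffer(chunk_iter, out):
    """Write audio chunks to `out` as they arrive, so playback can start
    before synthesis finishes. Returns total bytes written."""
    total = 0
    for chunk in chunk_iter:
        if not chunk:
            continue
        out.write(chunk)
        total += len(chunk)
    return total

# With a real HTTP streaming API the chunk iterator would come from
# something like this (hypothetical endpoint and fields):
#
#   import requests
#   resp = requests.post("https://api.example.com/v1/tts/stream",
#                        headers={"Authorization": f"Bearer {API_KEY}"},
#                        json={"text": "Hello, world", "voice": "en-1"},
#                        stream=True)
#   chunks = resp.iter_content(chunk_size=4096)

# Simulated here with in-memory chunks:
buf = io.BytesIO()
written = stream_tts_to_buffer([b"RIFF", b"....", b"data"], buf)
print(f"wrote {written} bytes")
```

The key design point is that nothing downstream waits for the full utterance: the moment the first chunk lands, it can be handed to an audio device or forwarded over a call leg.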

Spin up a voice agent demo, plug it into your call center software, or build that talking IoT device you have been imagining.

Where to use it today:

  • Real-time support agents that offer natural, clear speech

  • Healthcare where private, offline speech is important

  • Live translation of events, media or conversations where speed is essential

These are just a few ideas.

We would love to see what you build.

Because this is a preview, we are actively seeking your feedback.

Does the voice quality meet your expectations? How is the latency in your setup? Are there languages or voice styles you want? Let me know.

This is our chance to iterate with you and ensure the full product checks the right boxes. Voice interfaces are entering a new era of natural, real-time interaction and we want to be the foundation you build on.

Give it a try, and drop me a message on LinkedIn with how we can make it even better.

Happy coding, and happy listening!

Try Speechmatics TTS

Experience how natural text-to-speech can sound across languages and test our new voices today.
