Oct 2, 2025 | Read time 5 min

Why we built our low-latency Text-to-Speech

Most TTS sounds great in demos but breaks in real conversations. We built ours for sub-150ms latency, natural voices, and global scale.
Stuart Wood, Product Manager

TL;DR:

  • Natural voices heard in offline demos often degrade in voice agents, where responses must arrive in under 300ms, especially for languages other than English.

  • Our streaming TTS delivers <150ms response times without sacrificing quality.

  • Built for real-time applications, with SaaS or on-prem deployment.

  • Free to try until Nov 1 in the Portal and API.

  • We built our new streaming TTS API for developers who need consistent, humanlike speech at low latency.

High latency kills conversation

Every 100ms matters in live conversation. Traditional TTS systems often introduce a 1 to 3 second delay, which is fine for static content but terrible for voice agents.

In demos or batch mode, you might hear ultra-realistic voices. But in low-latency environments, where responses need to arrive in under 300ms, vendors often switch those same voices to smaller, faster models, and quality drops.

We have seen this first-hand with several partners building voice agents: the voice that initially wowed stakeholders can disappoint when integrated into the agent.

Our approach to text-to-speech

We built our TTS from day one with real-time streaming in mind.

The preview delivers speech with latency under 150 milliseconds without sacrificing naturalness, so you get fluid, human-like voices and the snappy response time needed for interactive applications.

Beyond English: Tackling robotic voices in other languages

Another big reason we built our own TTS is the language gap in voice synthesis.

English TTS voices have become impressively natural, but many non-English voices still sound unnatural or "robotic".

We kept hearing this from clients about their existing vendors, across languages like Dutch, French, and Thai.

One-size-fits-all systems, older techniques, or limited data for smaller languages lead to flat intonation and choppy cadence.

Making multilingual TTS more human

Speechmatics has a strong legacy in multilingual speech (our speech-to-text supports 55+ languages with high accuracy), and we’re bringing that expertise to TTS.

Our preview delivers a highly natural English voice now, and we’re actively working on more languages.

Our goal is to provide authentic voices in each language, capturing the nuances of local accents and speech patterns.

Building truly natural voices beyond English is a challenge, but we have a decade of experience building speech tech from the ground up.

More languages are coming, and we will not be satisfied until “robotic” is a thing of the past for all of them.

The hidden cost of talking at scale

TTS APIs charge per character or per minute of audio, and those costs add up quickly.

We’ve seen organizations hesitate to add voice everywhere or limit spoken content due to cost.

At millions of sentences or hours of speech, bills get hefty, which is not great for scaling or experimentation.
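
To make the arithmetic concrete, here is a rough sketch of how per-character pricing scales with volume. The price of 15.0 per million characters and the traffic figures are hypothetical placeholders chosen for illustration, not Speechmatics pricing or any vendor's actual rate.

```python
def estimate_monthly_cost(
    chars_per_response: int,
    responses_per_day: int,
    price_per_million_chars: float,
    days: int = 30,
) -> float:
    """Estimate a monthly bill under simple per-character pricing."""
    total_chars = chars_per_response * responses_per_day * days
    return total_chars / 1_000_000 * price_per_million_chars


# Hypothetical workload: 50,000 agent responses a day, about 200 characters each,
# at an assumed (made-up) price of 15.0 per million characters.
monthly = estimate_monthly_cost(200, 50_000, 15.0)
print(f"{monthly:.2f} per month")  # 300 million characters -> 4500.00
```

Swap in your own volumes and your vendor's published rate to see how quickly the total grows as you add voice to more of your product.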

Keeping TTS costs grounded

We want TTS to be cost-effective, so you do not have to think twice about volume. By leveraging our own models and infrastructure, and being thoughtful about model size and optimization, we aim to keep costs reasonable and predictable.

We also package and price TTS alongside our speech-to-text: one contract, one platform. This simplifies integration, budgeting, and support.

Our TTS is currently free to try in the Portal and API until November 1st, so you can gauge fit.

By making TTS more affordable, we hope more applications move from proof-of-concept to real-world use.

One platform, zero compromises

Consolidating to one trusted vendor for both directions of speech simplifies procurement and compliance.

Many industries have strict data privacy rules. Just as our STT can be deployed on-premises or in your private cloud, our TTS is built with the same privacy-first ethos.

Many customers need on-prem or edge TTS for sensitive use cases (think healthcare, finance) or connectivity reasons.

Our TTS engine can be used in our cloud, your cloud, or on your own servers next to our STT engine, so you can keep voice data in-house and meet latency requirements without sacrificing quality.

It slots into existing workflows, from cloud API prototypes to containerized models on thousands of devices worldwide.

Built on a decade of speech experience (and why that matters)

We are not starting from scratch. Speechmatics has spent over 10 years at the bleeding edge of speech technology, primarily in speech-to-text. TTS and STT are two sides of the same coin.

Our expertise in acoustic modelling, pronunciation, prosody, and handling different accents and noise conditions feeds directly into generating realistic speech. We are pouring that knowledge into our TTS models.

As one of the leading speech recognition companies, we have solved hard problems in multilingual audio and will now leverage that foundation to make synthesized speech inclusive and accurate.

Our mission has always been to understand every voice. Adding TTS was a natural step toward that mission. We built this preview to address real-world problems we saw again and again, expanding on our past R&D in speech. Imagine voice agents that truly sound human and respond in real time, in any language.

That is the future we are working toward.

Try it out and tell us what you think

The Text-to-Speech Preview is live today. Log in to the Speechmatics Portal and enter some text to generate speech.

For developers, our API documentation shows how to integrate streaming TTS with just a few lines of code.

Spin up a voice agent demo, plug it into your call center software, or build that talking IoT device you have been imagining.
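
If you want to check latency in your own setup, one simple approach is to time how long a streaming response takes to produce its first audio chunk. The sketch below is not the Speechmatics API: `time_to_first_chunk` and `fake_tts_stream` are hypothetical names, and the fake stream is a stand-in that only simulates a delay so the example runs without a network connection.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_chunk(chunks: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first non-empty chunk, total bytes received)."""
    start = time.perf_counter()
    first_chunk_at = None
    total_bytes = 0
    for chunk in chunks:
        if chunk and first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
        total_bytes += len(chunk)
    if first_chunk_at is None:
        raise ValueError("stream produced no audio")
    return first_chunk_at, total_bytes


def fake_tts_stream(delay_s: float = 0.05, n_chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a real streaming TTS response."""
    time.sleep(delay_s)  # simulated wait before the first audio arrives
    for _ in range(n_chunks):
        yield b"\x00" * 320  # placeholder audio bytes


latency, size = time_to_first_chunk(fake_tts_stream())
print(f"first audio after {latency * 1000:.0f} ms, {size} bytes total")
```

To measure a real service, pass in the chunk iterator from your HTTP or WebSocket client instead of the fake stream. The timer starts when the function is called, so for a fair number make sure the request is sent inside the iterator rather than before it.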

Where to use it today:

  • Real-time support agents that offer natural, clear speech

  • Healthcare, where private, offline speech is important

  • Live translation of events, media or conversations where speed is essential

These are just a few ideas.

We would love to see what you build.

Because this is a preview, we are actively seeking your feedback.

Does the voice quality meet your expectations? How is the latency in your setup? Are there languages or voice styles you want? Let me know.

This is our chance to iterate with you and ensure the full product checks the right boxes. Voice interfaces are entering a new era of natural, real-time interaction and we want to be the foundation you build on.

Give it a try, and drop me a message on LinkedIn with how we can make it even better.

Happy coding, and happy listening!
