Most TTS sounds great in demos but falls apart in real-time conversations.
Our preview TTS streams with sub-150ms latency and still sounds natural.
Designed for multilingual voices that avoid the “robotic” effect outside English.
Cost-effective pricing so you can scale without worrying about usage caps.
Deploy in the cloud, on-prem, or at the edge — one platform for STT + TTS.
Today, we’re introducing our low-latency, natural-sounding TTS preview to tackle these pain points head-on.
It’s built for streaming (think sub-150ms latency), can be deployed in the cloud or on-prem for privacy, and pairs seamlessly with our speech-to-text (STT).
All this enables you to build lifelike voice agents and real-time voice experiences without the usual headaches.
Traditional TTS systems often introduce a 1–3 second delay before you hear the full audio, which is fine for static content but terrible for live conversations.
In demos or batch processing, you might get ultra-realistic speech, but when you try to use those same voices in a live, low-latency (<200ms) environment, things change.
To achieve real-time streaming, vendors often switch to smaller or faster models, and suddenly the once-natural voice can sound a bit off, with noticeable artifacts or a drop in quality. It’s a classic trade-off: speed vs. quality.
We have seen this first-hand with clients.
The voice that wowed stakeholders in the voice catalogue would sometimes disappoint once it was integrated into a voice agent.
We built our TTS from day one with real-time streaming in mind. The preview version delivers speech with latency under 150 milliseconds without sacrificing naturalness.
In other words, you get the fluid, human-like voice and the snappy response time needed for interactive applications.
No more awkward pauses or robotic quirks just because you need speed. Our streaming TTS ensures your voice app feels responsive and lifelike, keeping users engaged.
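For interactive use, the number that matters most is the time until the first audio chunk arrives, not the time until the whole utterance is ready. The sketch below is one way to measure that for any chunked audio stream. It is not based on our API: `measure_stream` and `fake_tts_stream` are hypothetical names, and the fake stream stands in for whatever iterator of audio bytes your TTS client returns.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Return (time_to_first_audio_s, total_time_s, total_bytes) for a chunk stream."""
    start = time.perf_counter()
    first = None
    total_bytes = 0
    for chunk in chunks:
        # Record the moment the first non-empty chunk arrives.
        if first is None and chunk:
            first = time.perf_counter() - start
        total_bytes += len(chunk)
    total = time.perf_counter() - start
    if first is None:
        raise ValueError("stream produced no audio")
    return first, total, total_bytes


def fake_tts_stream(
    first_delay_s: float = 0.05, n_chunks: int = 5, chunk_size: int = 320
) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (silence bytes only)."""
    time.sleep(first_delay_s)  # simulated synthesis + network delay before first audio
    for _ in range(n_chunks):
        yield b"\x00" * chunk_size
        time.sleep(0.01)  # simulated gap between chunks


if __name__ == "__main__":
    ttfa, total, size = measure_stream(fake_tts_stream())
    print(f"time to first audio: {ttfa * 1000:.0f} ms")
    print(f"total stream time:   {total * 1000:.0f} ms for {size} bytes")
```

To test a real service, replace `fake_tts_stream()` with the chunk iterator from your own client. Start the timer just before the request is sent so that network time is included.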
Another big reason we built our own TTS is the language gap in voice synthesis.
English TTS voices have become impressively natural in recent years, thanks to massive datasets and vendor focus.
But move beyond English, and the story changes. We kept hearing feedback that many non-English TTS voices sound unnatural, and the word “robotic” comes up a lot.
For example, one user pointed out that the Thai voice on a language platform was “quite robotic sounding,” highlighting the lack of good options.
This is a common theme: whether it’s Bulgarian, French, or Thai, users notice that TTS beyond English still often sounds like a Dalek (to quote a frustrated user).
Why does this happen? In part, because many TTS systems are one-size-fits-all, sometimes using older synthesis techniques or limited data for smaller languages, resulting in flat intonation and choppy cadence.
Speechmatics has a strong legacy in multilingual speech (our speech-to-text supports 55+ languages with high accuracy), so we’re bringing that expertise to TTS.
Our preview focuses on delivering a highly natural English voice now, and we’re actively working on more languages.
Our goal is authentic voices in each language, capturing the nuances of local accents and speech patterns.
So when a Spanish or Hindi TTS voice comes from us, it should sound as warm and human as the English one.
Building truly natural-sounding voices beyond English is harder than it looks (there’s a reason so many offerings all sound the same), but we’ve learned a lot from our decade in speech tech.
As we roll out new languages, we’ll be leaning on advanced neural synthesis techniques and our rich multilingual data to raise the bar for non-English TTS.
Stay tuned: more languages are coming, and we won’t be satisfied until “robotic” is a thing of the past for all of them.
Let’s talk pricing, another practical pain point for developers.
TTS APIs typically charge per character or per minute of audio, and those costs add up quickly when you have a large-scale voice application.
We’ve seen organizations hesitate to add voice responses everywhere they’d like, or limit the length of spoken content, purely due to cost concerns.
If you’re generating millions of sentences or hours of speech, you can end up with a hefty bill. Not great for scaling your app or for experimenting during development.
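To see how quickly per-character pricing adds up, here is a back-of-the-envelope estimate. The function name and every number in it are illustrative assumptions. In particular, the price per million characters is a placeholder and is not our rate or any vendor’s.

```python
def estimate_monthly_cost(
    chars_per_reply: int,
    replies_per_day: int,
    price_per_million_chars: float,
    days: int = 30,
) -> float:
    """Estimate monthly spend under simple per-character pricing."""
    total_chars = chars_per_reply * replies_per_day * days
    return total_chars / 1_000_000 * price_per_million_chars


if __name__ == "__main__":
    # Placeholder inputs: 200-character replies, 50,000 replies a day,
    # and a made-up price of 10.0 currency units per million characters.
    cost = estimate_monthly_cost(200, 50_000, 10.0)
    print(f"estimated monthly cost: {cost:,.2f}")
```

With these placeholder inputs the monthly volume is 300 million characters. That shows why teams start trimming spoken responses once usage grows.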
We want our TTS to be cost-effective so you don’t have to think twice about using it in volume.
Part of that comes from efficient tech: by leveraging our own models and infrastructure, we aim to keep costs reasonable and predictable. Being thoughtful about model size and optimization means lower compute costs, which we can pass on to users.
Another part is simplified pricing and packaging alongside our Speech-to-Text.
With Speechmatics, you’re not juggling separate vendors and pricing schemes for STT and TTS: it’s one contract, one platform. This not only simplifies integration but often makes budgeting and support easier. Our TTS is currently free to try in the Portal, so you can gauge how it fits your use case.
By making TTS more affordable, we hope to see more innovative voice applications make it out of the proof-of-concept stage and into real-world use.
If you’re an enterprise developer or IT manager, consolidating to one trusted vendor for both directions of speech simplifies procurement and compliance. And speaking of compliance, we know many industries have strict data privacy rules.
Just as our STT can be deployed on-premises or in your private cloud for full data control, our TTS is built with the same privacy-first ethos. In fact, a big motivator for us was seeing how many customers need an on-prem or edge TTS solution, whether for sensitive use cases (think healthcare, finance) or connectivity reasons.
So our TTS engine is designed to run anywhere: in our cloud, in your cloud, or on your own servers next to our STT engine. This means you can keep voice data in-house and meet latency requirements without sacrificing quality.
In short, we built this TTS to slot right into your existing workflows, whether you’re calling our cloud API for a quick prototype or deploying a containerized model to thousands of kiosks worldwide.
Why do we believe we can deliver all of the above (low latency, natural quality, multilingual support, cost efficiency) when so many others have struggled with pieces of it?
The key is that we’re not starting from scratch.
Speechmatics has spent over 10 years at the bleeding edge of speech technology, primarily in speech-to-text.
This matters because TTS and STT are two sides of the same coin in many ways.
Our expertise in understanding the nuances of human speech (acoustic modelling, pronunciation, prosody, handling different accents and noise conditions) directly feeds into building a system that can generate realistic speech.
We’ve poured all that knowledge into our TTS models. As one of the leading speech recognition companies, we’ve already solved hard problems in multilingual audio, such as transcribing dozens of languages and capturing different dialects, and now we’re leveraging that foundation to make synthesized speech just as inclusive and accurate.
At the end of the day, our mission has always been to understand every voice. Adding TTS to our portfolio was a natural step (no pun intended) toward that mission.
We built this preview to address the real-world problems we saw our clients encounter again and again, and we’re doing it by standing on the shoulders of our past R&D in speech.
We’re excited about where this is headed: imagine voice agents that truly sound human and respond in real time, in any language, without you having to assemble five different services to make it happen. That’s the future we’re working toward.
The Text-to-Speech Preview is live today for all Speechmatics users, and we invite you to take it for a spin.
Getting started is as easy as logging into the Speechmatics Portal and entering some text to generate voice.
For developers, our API documentation will show you how to integrate the streaming TTS into your applications with just a few lines of code.
Spin up a voice agent demo, plug it into your call center software, or build that talking IoT device you’ve been prototyping. We’d love to see it all.
Most importantly, because this is a preview, we’re actively seeking your feedback.
Does the voice quality meet your expectations? How’s the latency feeling in your particular network setup? Are there languages or voice styles you’re itching for? Let us know.
This preview period is our chance to iterate with you, the developer community, and ensure that when we launch the full product, it checks all the right boxes.
Voice interfaces are entering a new era of natural, real-time interaction. We built our low-latency TTS to remove the traditional barriers and help you build the voice experiences of the future.
Give it a try, and tell us how we can make it even better.
Happy coding, and happy listening!