Jun 5, 2025 | Read time 4 min

Custom Voice AI in 2025: The Open Source Boom

Open-source TTS is reshaping the landscape — unlocking custom voices, on-device deployment, and fine-tuned speech that fits your domain and your users.
Customization and fine-tuning
Will Knottenbelt, Machine Learning Engineer

There’s a new shift underway in Voice AI: the open-sourcing of ultra-realistic text-to-speech (TTS) models that can be shaped for your own use case, your own domain, and your own users.

As product teams expect voice assistants and agents to be everywhere, demand for high-quality, on-device voice AI is spiking, and the open-source community is racing to meet it.

We are seeing several notable examples of this shift, all of which have appeared in the last few months: Sesame's Conversational Speech Model (CSM), Nari Labs' Dia, and Canopy Labs' Orpheus-TTS. This article is our take on what it means for the future of speech tech, and why great specialized models are more accessible than ever before.

(If you’re looking to get under the hood, we’ve also published a detailed technical guide to fine-tuning CSM on your own data.)

Democratization of Voice

For most of the past decade, voice was a service, piped in from a cloud API. What you got was what everyone else got.

That’s changing. 

We’re seeing open-source models that are genuinely competitive with closed-source industry leaders like ElevenLabs, plug-and-play fine-tuning tools like Unsloth that slash VRAM requirements and training time, and latency optimizations that make on-device inference realistic for large 1-3B-parameter models.

This is a democratization of high-quality TTS, and it means a growing share of production TTS will be fine-tuned models shaped around specific accents, domains, and contexts.

Most models are trained for the general case. Fine-tuning is how you adapt them for the specific one.

It’s what lets a model generate medical speech without stumbling over acronyms and jargon. Or replicate the tone of your customer service team. Or produce speech across three regional dialects in one product. Below are examples from fine-tuning Sesame’s CSM for French and German.

We see three main technical factors tipping the balance in 2025.

Firstly, brand-new ultra-realistic open models like Dia, Orpheus, and Sesame’s CSM let you start from near-human quality rather than from scratch.

Secondly, community contributions and tooling have made fine-tuning easier than ever. Cheap adapters like LoRA and QLoRA let you freeze 99% of a model and train just a few million parameters, so a large 3B-parameter TTS model can now be fine-tuned on a single 16 GB GPU (a minimal sketch of the adapter approach follows below). Or, for those who need bigger domain shifts like new languages and have the compute budget to back them up, full fine-tuning is also readily accessible for most models via community contributions.
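To make that concrete, here’s a minimal sketch of the adapter approach using the Hugging Face peft library. The checkpoint ID is a placeholder, and the right target_modules depend on the model’s architecture, so treat this as a starting point rather than a recipe for any specific model named above.

```python
# A minimal LoRA fine-tuning sketch with Hugging Face transformers + peft.
# The checkpoint ID is a placeholder; target_modules vary by architecture.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "your-org/your-tts-model",  # hypothetical LLM-style TTS checkpoint
    torch_dtype=torch.bfloat16,
)

lora_cfg = LoraConfig(
    r=16,                                 # low-rank dim: a few million trainable params
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common default
    lora_dropout=0.05,
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total weights
# From here, train with your usual loop or the transformers Trainer.
```

The frozen base weights are shared across every fine-tune, so the adapter itself is only a few megabytes, cheap to store, swap, and ship per domain or dialect.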

Finally, latency optimizations like (fairly aggressive) quantization now cost very little in perceptual quality for speech synthesis, making edge deployment practical and opening up a new realm of privacy-first applications.
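As a rough illustration, here’s what aggressive weight quantization can look like using 4-bit loading via bitsandbytes in transformers. Again, the model ID is a placeholder, and quality after quantization should be verified per model; this is a sketch of the technique, not a specific deployment path.

```python
# Sketch: loading a TTS backbone with 4-bit quantized weights via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, a common 4-bit scheme
    bnb_4bit_compute_dtype=torch.bfloat16,  # activations stay in higher precision
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-tts-model",  # hypothetical checkpoint
    quantization_config=bnb_cfg,
    device_map="auto",
)
# Weight memory drops to roughly a quarter of fp16, which is what makes
# 1-3B-parameter models plausible on edge hardware.
```

For fully on-device targets, the same idea applies through runtime-specific formats; the point is that 4-bit weights now lose little perceptual quality for speech while quartering the memory footprint.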

The industry signal

Fine-tuning-ready open models are more than a technical curiosity; they’re also a shift in power, from platform dependency to open experimentation.

Developers are building multilingual voice assistants from the ground up, not with off-the-shelf tools but with customized pipelines. Researchers are training models on underrepresented dialects and niche audio data to test the boundaries of what’s possible outside dominant languages.

And regional products can weave in dialects or code-switching, reflecting how real customers actually speak.

These new open-source models aren't ready-made, but they offer something more valuable: a solid, open foundation. One that invites experimentation, iteration, and ownership.

Discover more

Want to dive deeper into the mechanics of fine-tuning CSM? We’ve got you covered.

Read the technical tutorial >

Curious about how we’re applying this internally—or how it might apply to your team? We’re open to conversations.
