Jun 5, 2025 | Read time 4 min

Custom Voice AI in 2025: The Open Source Boom

Open-source TTS is reshaping the landscape — unlocking custom voices, on-device deployment, and fine-tuned speech that fits your domain and your users.
Customization and fine-tuning
Will Knottenbelt, Machine Learning Engineer

There’s a new shift underway in Voice AI: the open-sourcing of ultra-realistic text-to-speech (TTS) models that can be shaped for your own use case, your own domain, and your own users.

As product teams expect voice assistants and agents to be everywhere, the demand for high-quality on-device voice AI is spiking, and the open-source community is racing to meet it.

We are seeing quite a few interesting examples of this shift, all of which have popped up in the last few months: Sesame's Conversational Speech Model (CSM), Nari Labs' Dia, and Canopy Labs' Orpheus-TTS. This article is our take on what this shift means for the future of speech tech, and why great specialized models are more accessible than ever before.

(If you’re looking to get under the hood, we’ve also published a detailed technical guide to fine-tuning CSM on your own data.)

Democratization of Voice

For most of the past decade, voice was a service, piped in from a cloud API. What you got was what everyone else got.

That’s changing. 

We’re seeing open-source models that are genuinely competitive with closed-source industry leaders like ElevenLabs, plug-and-play fine-tuning tools like Unsloth that slash VRAM requirements and training time, and latency optimizations that make on-device inference realistic for large 1-3B-parameter models.

This amounts to a democratization of high-quality TTS: a growing share of production TTS is now made up of fine-tuned models shaped around specific accents, domains, and contexts.

Most models are trained for the general case. Fine-tuning is how you adapt them for the specific one.

It’s what lets a model generate medical speech without stumbling over acronyms and jargon. Or replicate the tone of your customer service team. Or produce speech across three regional dialects in one product. Below are examples from fine-tuning Sesame CSM for French and German.

We see three main technical factors tipping the balance in 2025.

Firstly, brand-new ultra-realistic open models like Dia, Orpheus, and Sesame’s CSM let you start from near-human quality rather than from scratch.

Secondly, community contributions and tooling have made fine-tuning easier than ever. Parameter-efficient adapters like LoRA and QLoRA let you freeze 99% of a model and train just a few million new parameters, so a 3B-parameter TTS model can now be fine-tuned on a single 16 GB GPU. And for teams taking on bigger domain shifts, like new languages, who have the compute budget to back it up, full fine-tuning is also readily accessible for most models via community contributions.
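To make the adapter arithmetic concrete, here is a minimal NumPy sketch of the LoRA idea for a single linear layer (not tied to any specific library; the dimensions and hyperparameters are illustrative). The frozen weight W is augmented by a trainable low-rank product B·A, so only rank·(d_in + d_out) parameters train instead of d_in·d_out:

```python
import numpy as np

d_in, d_out, rank = 2048, 2048, 16  # illustrative hidden sizes for a ~1-3B model layer

# Frozen pretrained weight: never updated during fine-tuning.
W = np.random.randn(d_out, d_in) * 0.02

# Trainable low-rank adapter factors. B starts at zero, so the adapted
# layer initially behaves exactly like the pretrained one.
A = np.random.randn(rank, d_in) * 0.01
B = np.zeros((d_out, rank))
alpha = 32  # LoRA scaling hyperparameter; effective scale is alpha / rank

def lora_forward(x):
    # y = W x + (alpha / rank) * B (A x): base path plus adapter path.
    return W @ x + (alpha / rank) * (B @ (A @ x))

frozen = W.size
trainable = A.size + B.size
print(f"trainable fraction of this layer: {trainable / (frozen + trainable):.2%}")
```

With rank 16, the trainable factors account for well under 2% of the layer's parameters, which is what makes single-GPU fine-tuning of multi-billion-parameter models feasible.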

Finally, latency optimizations like (fairly aggressive) quantization now incur very little perceptual quality loss for speech synthesis, making edge deployment practical and opening up a new realm of privacy-first applications.
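As a toy illustration of why quantization is so cheap in practice, here is a minimal NumPy sketch of symmetric per-channel int8 weight quantization (the matrix size is arbitrary and this is not tied to any particular inference runtime). Each output channel gets its own scale, so an outlier in one channel does not degrade the rest of the matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, size=(1024, 1024)).astype(np.float32)

# Symmetric per-row (per-output-channel) int8 quantization:
# map each row's range [-max, max] onto the integers [-127, 127].
scales = np.abs(W).max(axis=1, keepdims=True) / 127.0
W_q = np.clip(np.round(W / scales), -127, 127).astype(np.int8)

# Dequantize and measure the relative reconstruction error.
W_hat = W_q.astype(np.float32) * scales
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"int8 storage: 4x smaller than float32; relative error: {rel_err:.4f}")
```

The weights shrink 4x relative to float32 while the relative reconstruction error stays below 1% for this matrix, which is why carefully applied quantization is hard to hear in synthesized speech.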

The industry signal

Fine-tuning-ready open models are more than a technical curiosity; they mark a shift in power, from platform dependency to open experimentation.

Developers are building multilingual voice assistants from the ground up—not with off-the-shelf tools, but with customized pipelines. Researchers are training models on underrepresented dialects and niche audio data to test the boundaries of what’s possible outside dominant languages.

And regional products can weave in dialects or code-switching, reflecting how real customers actually speak.

These new open-source models aren't ready-made, but they offer something more valuable: a solid, open foundation. One that invites experimentation, iteration, and ownership.

Discover more

Want to dive deeper into the mechanics of fine-tuning CSM? We’ve got you covered.

Read the technical tutorial >

Curious about how we’re applying this internally—or how it might apply to your team? We’re open to conversations.

