Jun 5, 2025 | Read time 4 min

Custom Voice AI in 2025: The Open Source Boom

Open-source TTS is reshaping the landscape — unlocking custom voices, on-device deployment, and fine-tuned speech that fits your domain and your users.
Will Knottenbelt, Machine Learning Engineer

There’s a new shift underway in Voice AI: the open-sourcing of ultra-realistic text-to-speech (TTS) models that can be shaped for your own use case, your own domain, and your own users.

As product teams expect voice assistants and agents to be everywhere, the demand for high quality on-device voice AI is spiking, and the open source community is racing to meet it.

We are seeing several interesting examples of this shift, all of which have appeared in the last few months: Sesame's Conversational Speech Model (CSM), Nari Labs' Dia, and Canopy Labs' Orpheus-TTS. This article is our take on what this means for the future of speech tech, and why great specialized models are more accessible than ever before.

(If you’re looking to get under the hood, we’ve also published a detailed technical guide to fine-tuning CSM on your own data.)

Democratization of Voice

For most of the past decade, voice was a service, piped in from a cloud API. What you got was what everyone else got.

That’s changing. 

We’re seeing open-source models that are actually competitive with closed-source industry leaders like ElevenLabs, plug-and-play fine-tuning tools like Unsloth that slash VRAM requirements and training time, and latency optimizations that make on-device inference realistic for large 1-3B-parameter models.

This amounts to a democratization of high-quality TTS: a growing share of production TTS is now made up of fine-tuned models shaped around specific accents, domains, and contexts.

Why fine-tuning is becoming popular

Most models are trained for the general case. Fine-tuning is how you adapt them for the specific one.

It’s what lets a model generate medical speech without stumbling over acronyms and jargon. Or replicate the tone of your customer service team. Or produce speech across three regional dialects in one product. We’ve seen this first-hand when fine-tuning Sesame CSM into French and German.

We see three main technical factors tipping the balance in 2025.

Firstly, the brand-new ultra-realistic open models like Dia, Orpheus, and Sesame’s CSM allow you to start from near-human quality rather than from scratch.

Secondly, community contributions and tooling have made fine-tuning easier than ever before. Parameter-efficient adapters like LoRA and QLoRA let you freeze 99% of a model and train just a few million parameters, so a large 3B-parameter TTS model can now be fine-tuned on a single 16 GB GPU. For bigger domain shifts, like new languages, full fine-tuning is also readily accessible for most models via community contributions, provided you have the compute budget to back it up.
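To make the "freeze 99%" point concrete, here is a minimal NumPy sketch of the LoRA idea applied to a single weight matrix. The layer sizes and rank are made up for illustration, not taken from any of the models above: the frozen pretrained weight stays fixed, and only two small low-rank matrices are trained.

```python
import numpy as np

# Hypothetical layer sizes for illustration; r is the LoRA rank.
d_in, d_out, r = 2048, 2048, 8

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init

def forward(x, alpha=16.0):
    # Frozen path plus a low-rank update; since B starts at zero,
    # the adapted layer initially behaves exactly like the base layer.
    return W @ x + (alpha / r) * (B @ (A @ x))

frozen, trainable = W.size, A.size + B.size
print(f"trainable share: {trainable / (frozen + trainable):.2%}")
```

Here the trainable matrices hold well under 1% of the layer's parameters, which is what makes single-GPU fine-tuning of multi-billion-parameter models feasible; in practice you'd use a library like Hugging Face PEFT or Unsloth rather than hand-rolling this.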

Finally, latency optimizations like (fairly aggressive) quantization now incur very little perceptual quality loss for speech synthesis, making edge deployment practical and opening up a new realm of privacy-first applications.
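For a rough sense of what quantization buys, here is a toy sketch of symmetric per-tensor int8 quantization on a stand-in weight matrix (pure NumPy, with invented sizes, not any production inference stack): weights shrink 4x versus fp32, at the cost of a small, bounded rounding error per weight.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one fp32 weight matrix of a speech model.
W = rng.normal(0.0, 0.02, size=(1024, 1024)).astype(np.float32)

# Symmetric per-tensor int8 quantization: one scale for the whole tensor.
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).clip(-127, 127).astype(np.int8)
W_hat = W_q.astype(np.float32) * scale  # dequantized for use in matmuls

max_err = float(np.abs(W - W_hat).max())
print(f"memory: {W.nbytes // W_q.nbytes}x smaller, max abs error: {max_err:.6f}")
```

Real deployments typically use per-channel scales or 4-bit schemes for better quality-size trade-offs, but the principle is the same: smaller weights mean less memory traffic, which is what makes on-device inference fast.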

The industry signal

Fine-tuning-ready open models are more than a technical curiosity; they also represent a shift in power, from platform dependency to open experimentation.

Developers are building multilingual voice assistants from the ground up—not with off-the-shelf tools, but with customized pipelines. Researchers are training models on underrepresented dialects and niche audio data to test the boundaries of what’s possible outside dominant languages.

And regional products can weave in dialects or code-switching, reflecting how real customers actually speak.

These new open-source models aren't ready-made, but they offer something more valuable: a solid, open foundation. One that invites experimentation, iteration, and ownership.

Discover more

Want to dive deeper into the mechanics of fine-tuning CSM? We’ve got you covered.

Read the technical tutorial >

Curious about how we’re applying this internally—or how it might apply to your team? We’re open to conversations.
