There’s a new shift underway in Voice AI — open-sourcing of ultra-realistic text-to-speech (TTS) models that can be shaped for your own use case, your own domain, and your own users.
We are seeing quite a few interesting examples of this shift, all of which have popped up in the last few months: Sesame's Conversational Speech Model (CSM), Nari Labs' Dia-1.6B, and Canopy Labs' Orpheus-TTS. This article is our take on what it means for the future of speech tech, and why fine-tuning great models is more accessible than ever before.
(If you’re looking to get under the hood, we’ve also published a detailed technical guide to fine-tuning CSM on your own data.)
For most of the past decade, voice was a service, piped in from a cloud API. What you got was what everyone else got.
That’s changing.
We’re seeing open-source models that are genuinely competitive with closed-source industry leaders like ElevenLabs, plug-and-play fine-tuning tools like Unsloth that slash VRAM requirements and training time, and latency optimizations that make on-device inference realistic even for 1-3B-parameter models.
This democratization of high-quality TTS is driving a shift from one-size-fits-all voices to models shaped around specific accents, domains, or contexts. The world is gearing up for a new wave of personal voice assistants, and specialization is key.
We see three main factors tipping the balance in 2025.
Firstly, there are brand-new ultra-realistic open models like Dia, Orpheus, and Sesame’s CSM popping up every couple of months, so you can start from near-human quality rather than from scratch.
Secondly, community contributions and tooling have made fine-tuning easier than ever before. Lightweight adapters like LoRA and QLoRA let you freeze 99% of a model and learn just a few million parameters, so a 3B-parameter TTS model can now be fine-tuned on a single 16 GB GPU (a short sketch follows below). Or, for those who want bigger domain shifts like new languages and have the compute budget to back it up, full fine-tuning is also readily accessible for most models via community contributions.
Finally, latency optimizations like fairly aggressive quantization now cost very little in perceptual quality for speech synthesis, making edge deployment practical and opening up a new realm of privacy-first applications (also sketched below).
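To make the second point concrete, here is a minimal sketch of attaching LoRA adapters with Hugging Face's peft library. The checkpoint name and target module names are illustrative assumptions (they differ between CSM, Dia, and Orpheus), so treat this as a template rather than a recipe.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative checkpoint: Orpheus-style TTS models use a Llama-like backbone,
# so they load through the standard causal-LM classes. Swap in the model you use.
BASE_MODEL = "canopylabs/orpheus-3b-0.1-ft"  # assumption, check the model card

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16)

# LoRA: freeze the base weights and learn small low-rank adapters on the
# attention projections (and optionally the MLP projections too).
lora_config = LoraConfig(
    r=16,                      # adapter rank
    lora_alpha=32,             # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # names vary by architecture
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)

# Typically reports trainable params in the tens of millions vs. ~3B total.
model.print_trainable_parameters()
```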
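And for the third point, a rough sketch of the simplest form of aggressive quantization: dynamic int8 quantization of the linear layers for CPU inference with PyTorch. Production edge deployments often use heavier schemes (4-bit exports, GGUF, and the like), and you should always compare the audio by ear before and after; this is just to show how little code the first step takes.

```python
import torch
from transformers import AutoModelForCausalLM

# Load the (fine-tuned) model in full precision first.
model = AutoModelForCausalLM.from_pretrained("path/to/your-finetuned-tts")  # placeholder path
model.eval()

# Dynamic int8 quantization: weights of nn.Linear layers are stored as int8
# and dequantized on the fly, shrinking memory use and speeding up CPU inference.
quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},  # module types to quantize
    dtype=torch.qint8,
)

# `quantized` is a drop-in replacement for `model` at inference time;
# run your usual synthesis call on it and listen to the output side by side.
```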
Most models are trained for the general case. Fine-tuning is how you adapt them for the specific one.
It’s what lets a model generate medical speech without stumbling over acronyms and jargon. Or replicate the tone of your customer service team. Or produce speech across three regional dialects in one product.
There is greater interest in models that can be hosted, audited, and adapted, and a growing need for voice systems that reflect real users—not just a training corpus.
Fine-tuning-ready open models are more than a technical curiosity; they mark a shift in power, from platform dependency to open experimentation.
Developers are building multilingual voice assistants from the ground up—not with off-the-shelf tools, but with customised pipelines. Researchers are training models on underrepresented dialects and niche audio data to test the boundaries of what’s possible outside dominant languages.
And regional products can weave in dialects or code-switching, reflecting how real customers actually speak.
These new open-source models aren't ready-made, but they offer something more valuable: a solid, open foundation. One that invites experimentation, iteration, and ownership.
Start by assembling at least three hours of clean, domain-specific audio and transcripts (note that the bigger the domain shift you want, the more data you need). Remember that the model will try to sound like the data it trains on, so if you don't feed it good data, don't expect a good model.
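One way to organize that data is with the Hugging Face datasets library, so the audio/transcript pairs drop straight into most fine-tuning scripts. A minimal sketch, assuming a flat folder of WAV files with matching .txt transcripts and a 24 kHz target sampling rate; adjust both to whatever your model expects.

```python
from pathlib import Path
from datasets import Dataset, Audio

# Assumed layout: clips/utt_001.wav next to clips/utt_001.txt holding the transcript.
clip_dir = Path("clips")
rows = []
for wav in sorted(clip_dir.glob("*.wav")):
    txt = wav.with_suffix(".txt")
    rows.append({"audio": str(wav), "text": txt.read_text().strip()})

ds = Dataset.from_list(rows)

# Decode and resample audio lazily; use the sampling rate your model was trained on.
ds = ds.cast_column("audio", Audio(sampling_rate=24_000))

# Hold out a small slice for listening tests during and after training.
splits = ds.train_test_split(test_size=0.05, seed=42)
print(splits)
```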
The next step is to pick your fine-tuning method. If you are limited by compute, we recommend parameter-efficient fine-tuning with Unsloth TTS. Use a 16-bit LoRA if you have 16 GB of VRAM or more, provided that it fits your desired model and batch size (it should easily accommodate models up to 3B parameters at a low batch size). If memory is tighter, fall back to a 4-bit QLoRA, which runs comfortably on a single RTX 4090 or even smaller cards.
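In code, the LoRA-versus-QLoRA decision is often a single flag. A sketch assuming Unsloth's FastLanguageModel interface and a Llama-based TTS checkpoint in the Orpheus family; the model name is a placeholder, so check Unsloth's TTS notebooks for the identifiers and arguments they currently support.

```python
from unsloth import FastLanguageModel

# load_in_4bit=False -> 16-bit LoRA (roughly 16 GB of VRAM for a 3B model)
# load_in_4bit=True  -> 4-bit QLoRA (fits on smaller cards)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/orpheus-3b-0.1-ft",  # placeholder, see Unsloth's model list
    max_seq_length=2048,
    load_in_4bit=False,
)

# Attach LoRA adapters; only these weights are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # saves memory at a small speed cost
)
```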
Unfortunately, LoRA and QLoRA can be a little unreliable for TTS, particularly when introducing big distribution shifts like fine-tuning into new languages or wildly different accents. If you have the compute budget for it, go for full fine-tuning. Luckily, most models now have open community contributions that support full fine-tuning.
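If you do go the full fine-tuning route, the code looks like any other transformers training run, just with every weight trainable, which is where the memory cost comes from. A bare-bones sketch; the checkpoint, hyperparameters, and the model-specific audio tokenization step are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained(
    "canopylabs/orpheus-3b-0.1-ft",      # placeholder checkpoint
    torch_dtype=torch.bfloat16,
)
# No adapters: every parameter gets gradients and optimizer state.
assert all(p.requires_grad for p in model.parameters())

args = TrainingArguments(
    output_dir="tts-full-ft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,                  # full fine-tunes want smaller LRs than LoRA
    num_train_epochs=3,
    bf16=True,
    gradient_checkpointing=True,         # trades compute for memory
    logging_steps=10,
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=processed_train,       # your data after the model-specific audio-tokenization step
)
trainer.train()
```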
Lastly, embed a watermark now, before the compliance teams start knocking.
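Watermarking adds surprisingly little friction. As an illustration (not something these models require you to use), here is a sketch with Meta's open-source AudioSeal package; the card names and the 16 kHz expectation come from its README, and the file path is a placeholder.

```python
import torch
import torchaudio
from audioseal import AudioSeal

SAMPLE_RATE = 16_000  # AudioSeal's released models expect 16 kHz audio

# Load generated speech, force mono, resample if needed; AudioSeal wants (batch, channels, samples).
wav, sr = torchaudio.load("generated_speech.wav")  # placeholder path
wav = wav.mean(dim=0, keepdim=True)
if sr != SAMPLE_RATE:
    wav = torchaudio.functional.resample(wav, sr, SAMPLE_RATE)
wav = wav.unsqueeze(0)

# Embed an imperceptible watermark into the waveform.
generator = AudioSeal.load_generator("audioseal_wm_16bits")
watermark = generator.get_watermark(wav, SAMPLE_RATE)
watermarked = wav + watermark

# Later, anyone with the detector can check whether a clip came from your system.
detector = AudioSeal.load_detector("audioseal_detector_16bits")
result, message = detector.detect_watermark(watermarked, SAMPLE_RATE)
print(f"watermark probability: {result:.3f}")
```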
Want to dive deeper into the mechanics of fine-tuning CSM? We’ve got you covered.
Curious about how we’re applying this internally—or how it might apply to your team? We’re open to conversations.