Jun 5, 2025 | Read time 4 min

Custom Voice AI in 2025: The Open Source Boom

Open-source TTS is reshaping the landscape — unlocking custom voices, on-device deployment, and fine-tuned speech that fits your domain and your users.
Customization and fine-tuning
Will Knottenbelt, Machine Learning Engineer

There’s a new shift underway in Voice AI: the open-sourcing of ultra-realistic text-to-speech (TTS) models that can be shaped for your own use case, your own domain, and your own users.

As product teams expect voice assistants and agents to be everywhere, demand for high-quality, on-device voice AI is spiking, and the open-source community is racing to meet it.

We are seeing several notable examples of this shift, all of which have appeared in the last few months: Sesame's Conversational Speech Model (CSM), Nari Labs' Dia, and Canopy Labs' Orpheus-TTS. This article is our take on what it means for the future of speech tech, and why great specialized models are more accessible than ever before.

(If you’re looking to get under the hood, we’ve also published a detailed technical guide to fine-tuning CSM on your own data.)

Democratization of Voice

For most of the past decade, voice was a service, piped in from a cloud API. What you got was what everyone else got.

That’s changing. 

We’re seeing open-source models that are genuinely competitive with closed-source industry leaders like ElevenLabs, plug-and-play fine-tuning tools like Unsloth that slash VRAM requirements and training time, and latency optimizations that make on-device inference realistic for large 1-3B-parameter models.

This is a democratization of high-quality TTS, and it means a growing share of production TTS will be fine-tuned models shaped around specific accents, domains, and contexts.

Most models are trained for the general case. Fine-tuning is how you adapt them for the specific one.

It’s what lets a model generate medical speech without stumbling over acronyms and jargon. Or replicate the tone of your customer service team. Or produce speech across three regional dialects in one product. Below are examples from fine-tuning Sesame’s CSM for French and German.

We see three main technical factors tipping the balance in 2025.

Firstly, brand-new ultra-realistic open models like Dia, Orpheus, and Sesame’s CSM let you start from near-human quality rather than from scratch.

Secondly, community contributions and tooling have made fine-tuning easier than ever. Cheap adapters like LoRA and QLoRA let you freeze 99% of a model and train just a few million parameters, so a large 3B-parameter TTS model can now be fine-tuned on a single 16 GB GPU (a minimal sketch of the adapter approach follows below). Or, for those who need bigger domain shifts like new languages and have the compute budget to back them up, full fine-tuning is also readily accessible for most models via community contributions.
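To make that concrete, here’s a minimal sketch of the adapter approach using the Hugging Face peft library. The checkpoint ID is a placeholder, and the right target_modules depend on the model’s architecture, so treat this as a starting point rather than a recipe for any specific model named above.

```python
# A minimal LoRA fine-tuning sketch with Hugging Face transformers + peft.
# The checkpoint ID is a placeholder; target_modules vary by architecture.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "your-org/your-tts-model",  # hypothetical LLM-style TTS checkpoint
    torch_dtype=torch.bfloat16,
)

lora_cfg = LoraConfig(
    r=16,                                 # low-rank dim: a few million trainable params
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections, a common default
    lora_dropout=0.05,
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total weights
# From here, train with your usual loop or the transformers Trainer.
```

The frozen base weights are shared across every fine-tune, so the adapter itself is only a few megabytes, cheap to store, swap, and ship per domain or dialect.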

Finally, latency optimizations like (fairly aggressive) quantization now cost very little in perceptual quality for speech synthesis, making edge deployment practical and opening up a new realm of privacy-first applications.
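As a rough illustration, here’s what aggressive weight quantization can look like using 4-bit loading via bitsandbytes in transformers. Again, the model ID is a placeholder, and quality after quantization should be verified per model; this is a sketch of the technique, not a specific deployment path.

```python
# Sketch: loading a TTS backbone with 4-bit quantized weights via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, a common 4-bit scheme
    bnb_4bit_compute_dtype=torch.bfloat16,  # activations stay in higher precision
)

model = AutoModelForCausalLM.from_pretrained(
    "your-org/your-tts-model",  # hypothetical checkpoint
    quantization_config=bnb_cfg,
    device_map="auto",
)
# Weight memory drops to roughly a quarter of fp16, which is what makes
# 1-3B-parameter models plausible on edge hardware.
```

For fully on-device targets, the same idea applies through runtime-specific formats; the point is that 4-bit weights now lose little perceptual quality for speech while quartering the memory footprint.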

The industry signal

Fine-tuning-ready open models are more than a technical curiosity; they’re also a shift in power, from platform dependency to open experimentation.

Developers are building multilingual voice assistants from the ground up, not with off-the-shelf tools but with customized pipelines. Researchers are training models on underrepresented dialects and niche audio data to test the boundaries of what’s possible outside dominant languages.

And regional products can weave in dialects or code-switching, reflecting how real customers actually speak.

These new open-source models aren't ready-made, but they offer something more valuable: a solid, open foundation. One that invites experimentation, iteration, and ownership.

Discover more

Want to dive deeper into the mechanics of fine-tuning CSM? We’ve got you covered.

Read the technical tutorial >

Curious about how we’re applying this internally—or how it might apply to your team? We’re open to conversations.
