Jan 17, 2023 | Read time 4 min

How to Rapidly Train New Languages Using Common Voice and OSCAR

Speechmatics' Data Engineer, Steve Kingsley, discusses Common Voice and OSCAR, and how their data can help to rapidly train languages.
Steve Kingsley, Data Engineer

Data is at the heart of machine learning and the field of speech-to-text. A huge abundance of it is available online to train and refine our models, especially in the most commonly spoken languages. For "under-resourced" languages, however, far less data is available, which makes speech-to-text systems harder to train and leaves the end product less effective for speakers of those languages.

When it comes to rapidly deploying a new language that has not previously been supported, speech recognition providers need to turn to existing datasets to fill the gaps. When we launched Ukrainian last year, we were helped massively by two organizations: Common Voice, from the Mozilla Foundation, and a possibly lesser-known project, OSCAR, created by a research team from Inria, a French research institute.

Building Ukrainian

Since February 2022, the people of Ukraine have had an urgent need to be heard around the world. Ukrainian speakers reporting on their experiences need to have their stories shared. To enable rapid communication, tools such as speech-to-text can be incredibly useful. Transcribing spoken content into text for things such as captioning and translation can make a huge difference.

At Speechmatics, we tasked our team with getting Ukrainian into our offering as quickly as they could, while maintaining our expected high levels of accuracy. We turned to Common Voice and OSCAR to help.

The Common Voice Project

Common Voice is a project that allows anyone to contribute to a massive corpus of recorded and transcribed audio and text. This is known as "labelled data" and is vital to the accuracy of machine learning results. By crowdsourcing both the data and its validation, the project aims to bring all voices to the table, and by releasing the resulting data under an open license it enables companies like Speechmatics to embrace that inclusivity and bring it to our platform.

Users contribute to the project by submitting recordings of their speech along with the corresponding transcripts, creating a huge influx of data. To ensure the data is correct, users also help validate each other's submissions. Without this extra step, the submitted content wouldn't meet the standard we need for training.
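The validation step works by having multiple reviewers vote on each clip before it enters the released dataset. The sketch below illustrates the idea with a simple vote-margin rule; the `Clip` class, the margin of 2, and `is_validated` are our own illustrative names and thresholds, not Common Voice's exact rules.

```python
# Illustrative sketch of crowd validation in the Common Voice style:
# a clip is only accepted once reviewer upvotes outnumber downvotes
# by a margin. (Common Voice applies its own exact acceptance rules.)

from dataclasses import dataclass


@dataclass
class Clip:
    text: str
    up_votes: int = 0
    down_votes: int = 0


def is_validated(clip: Clip, margin: int = 2) -> bool:
    """Accept a clip only when enough reviewers agree it is correct."""
    return clip.up_votes - clip.down_votes >= margin


clips = [
    Clip("добрий день", up_votes=3, down_votes=0),  # clearly agreed
    Clip("дякую", up_votes=1, down_votes=1),        # disputed, rejected
]
validated = [c.text for c in clips if is_validated(c)]
```

Filtering on reviewer agreement like this is what turns raw crowdsourced submissions into labelled data clean enough to train on.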

The OSCAR Dataset

The project's name is an acronym for "Open Super-large Crawled Aggregated Corpus". It is a huge multilingual corpus created by distilling and classifying content by language from another amazing open-source project, Common Crawl. The OSCAR dataset allowed us to quickly access the language content we needed to train a new language model, which we then fed into our pipeline to build full support for the new language.
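Conceptually, that classification step amounts to running a language identifier over every crawled document and keeping only the ones in the target language. The sketch below shows the shape of such a pipeline; `detect_language` is a toy stand-in based on script ranges (the real OSCAR pipeline uses a trained language-identification model, and the function and corpus names here are our own).

```python
# Illustrative sketch of an OSCAR-style step: classify crawled
# documents by language, then keep only the target language.
# detect_language is a toy heuristic; a real pipeline would use
# a trained language-identification classifier.

def detect_language(text: str) -> str:
    """Toy language ID: mostly-Cyrillic text -> 'uk', else 'en'."""
    cyrillic = sum(1 for ch in text if "\u0400" <= ch <= "\u04FF")
    return "uk" if cyrillic > len(text) / 2 else "en"


def build_corpus(documents: list[str], lang: str) -> list[str]:
    """Keep only documents classified as the target language."""
    return [doc for doc in documents if detect_language(doc) == lang]


crawled = [
    "Привіт, світе!",
    "Hello, world!",
    "Слава Україні!",
]
ukrainian_corpus = build_corpus(crawled, "uk")
```

Run at Common Crawl scale, this kind of filtering is what turns a mixed web crawl into the per-language text corpora a language model can be trained on.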

By providing a large, diverse, and well-annotated dataset, OSCAR helps us build more accurate and robust speech-to-text systems. When it came to Ukrainian, having this data was hugely helpful.

Helping Under-Resourced Languages

When we talk about "under-resourced" languages, we mean languages and dialects with very little readily available data. In speech-to-text, this quickly becomes a self-perpetuating cycle: a lack of support for your preferred native language forces you to use a second language from the commonly supported options. That creates yet more content in the dominant languages, further skewing the availability of data out there.

Both Common Voice and OSCAR help to redress this imbalance by providing the data needed to support the ongoing development of systems like ours and others in the field. They help bring the same advantages and access to all.

We used these two datasets not only to rapidly train and deploy Ukrainian support, but also to add support for 15 further languages. Without these two open-source projects and their contributors, we wouldn't have been able to take such a large leap forward in our supported languages and improve the accuracy of our existing ones.

Open access to data is a vital part of bringing equality to speech-to-text. Inclusivity and equity are core values we hold here at Speechmatics. Whoever you are, whatever your native language, accent, ethnicity, or demographic, contribute your voice to the Common Voice project and help them, to help us, to help you, and help everyone… equally.

Steve Kingsley, Data Engineer, Speechmatics
