Jan 17, 2023 | Read time 4 min

How to Rapidly Train New Languages Using Common Voice and OSCAR

Speechmatics' Data Engineer, Steve Kingsley, discusses Common Voice and OSCAR, and how their data can help to rapidly train languages.
Steve Kingsley, Data Engineer

Data is at the heart of machine learning and the field of speech-to-text. There’s an abundance of it available online that can be used to train and refine our models, especially for the most widely spoken languages. For “under-resourced languages”, however, far less data is available. That makes speech-to-text systems harder to train, and the end product less effective for the people who speak those languages.

When it comes to rapidly deploying a language that has not previously been supported, speech recognition providers need to turn to existing datasets to fill the gaps. When we launched Ukrainian last year, we were helped massively by two projects: Common Voice, from the Mozilla Foundation, and the possibly lesser-known OSCAR, created by a research team from Inria, a French research institute.

Building Ukrainian

Since February 2022, the people of Ukraine have had an urgent need to be heard around the world. Ukrainian speakers reporting on their experiences need to have their stories shared. To enable rapid communication, tools such as speech-to-text can be incredibly useful. Transcribing spoken content into text for things such as captioning and translation can make a huge difference.

At Speechmatics, we tasked our team with getting Ukrainian into our offering as quickly as they could, while maintaining our expected high levels of accuracy. We turned to Common Voice and OSCAR to help.

The Common Voice Project

Common Voice is a project that allows everyone to contribute to a massive corpus of recorded audio paired with its transcribed text. This is known as “labelled data” and is vital to the accuracy of machine learning results. By crowdsourcing both data and validation, the project aims to bring all voices to the table. By releasing the resulting data under an open license, it enables companies like Speechmatics to embrace this inclusivity and bring it to our platform.

As users contribute to the project by submitting recordings of their speech and the corresponding transcripts, this leads to a huge influx of data. To ensure the data is correct, users also help validate the submissions. Without this extra step, the content submitted wouldn’t be of a high enough standard for us to train on.
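The validation step described above can be sketched in a few lines of Python. This is a minimal illustration rather than our production pipeline. It assumes a tab-separated metadata file with `path`, `sentence`, `up_votes` and `down_votes` columns, as found in Common Voice releases. The function name, the sample rows and the vote threshold are all hypothetical choices made for this example.

```python
import csv
import io


def select_validated_clips(tsv_text, min_up_votes=2):
    """Return (path, sentence) pairs that pass a simple vote threshold.

    The threshold is an assumption for this sketch: a clip is kept when it
    has at least `min_up_votes` up-votes and more up-votes than down-votes.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    selected = []
    for row in reader:
        up = int(row["up_votes"])
        down = int(row["down_votes"])
        sentence = row["sentence"].strip()
        # Skip clips with an empty transcript or too little reviewer agreement.
        if sentence and up >= min_up_votes and up > down:
            selected.append((row["path"], sentence))
    return selected


# Hypothetical sample rows, for illustration only.
SAMPLE_TSV = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\n"
    "a1\tclip_001.mp3\tДобрий день\t2\t0\n"
    "b2\tclip_002.mp3\tЯк справи\t1\t2\n"
    "c3\tclip_003.mp3\tДякую\t3\t1\n"
)

if __name__ == "__main__":
    for path, sentence in select_validated_clips(SAMPLE_TSV):
        print(path, sentence)
```

With the sample rows, the second clip is dropped because its down-votes outnumber its up-votes. Raising `min_up_votes` trades data volume for stricter agreement among reviewers.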

The OSCAR Dataset

The project’s name comes from the phrase “Open Super-large Crawled Aggregated Corpus”. It is a huge multilingual corpus created by distilling content from another amazing open-source project, Common Crawl, and classifying it by language. The OSCAR dataset gave us quick access to the text we needed to train a language model for a new language. We then used this data in our pipeline to build full support for that language.

By providing a large, diverse, and well-annotated dataset, OSCAR helps us build more accurate and robust speech-to-text systems. When it came to Ukrainian, having this data was hugely helpful.
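To make the idea of language-classified text concrete, here is a rough Python sketch of selecting one language from a crawled corpus stored as JSON Lines. It is an illustration under stated assumptions, not a description of our pipeline. The record layout (`content`, plus `metadata.identification` with `label` and `prob`) is modelled loosely on OSCAR's JSON Lines releases and should be checked against the schema of the version you download. The function name, the confidence threshold and the sample records are hypothetical.

```python
import json


def filter_by_language(jsonl_lines, language, min_confidence=0.8):
    """Keep unique documents labelled as `language` with enough confidence.

    Assumed record layout (verify against the real dataset schema):
    {"content": str, "metadata": {"identification": {"label": str, "prob": float}}}
    """
    seen = set()
    kept = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        ident = record.get("metadata", {}).get("identification") or {}
        if ident.get("label") != language:
            continue
        if ident.get("prob", 0.0) < min_confidence:
            continue
        # Collapse runs of whitespace so near-identical copies deduplicate.
        text = " ".join(record.get("content", "").split())
        if text and text not in seen:
            seen.add(text)
            kept.append(text)
    return kept


def make_record(content, label, prob):
    """Build one hypothetical JSON Lines record for the demo below."""
    return json.dumps(
        {"content": content, "metadata": {"identification": {"label": label, "prob": prob}}}
    )


# Hypothetical sample records, for illustration only.
SAMPLE_LINES = [
    make_record("Це  приклад тексту.", "uk", 0.97),
    make_record("This is English.", "en", 0.99),
    make_record("Змішаний text", "uk", 0.41),
    make_record("Це приклад тексту.", "uk", 0.95),
]

if __name__ == "__main__":
    print(filter_by_language(SAMPLE_LINES, "uk"))
```

In this sample only one Ukrainian document survives: the English record is skipped, the low-confidence record falls below the threshold, and the fourth record duplicates the first once whitespace is normalized.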

Helping Under-Resourced Languages

When we talk about “under-resourced” languages, we mean languages and dialects that have very little data easily available. In speech-to-text, this quickly becomes a self-perpetuating cycle. The lack of support for your native language forces you to use a second language from the commonly supported options. This creates yet more content in those languages, further skewing the data that is available.

Both Common Voice and OSCAR help to redress this imbalance by providing the data needed to support the ongoing development of systems like ours and others in the field. They help bring the same advantages and access to all.

We used these two datasets not only to rapidly train and deploy Ukrainian support, but also to add support for 15 further languages. Without these two open-source projects and their contributors, we wouldn’t have been able to take such a large leap forward in the number of languages we support, or to improve the accuracy of the languages we already supported.

Open access to data is a vital part of bringing equality to speech-to-text. Inclusivity and equity are core values we hold here at Speechmatics. Whoever you are, whatever your native language, accent, ethnicity, or demographic, contribute your voice to the Common Voice project and help them, to help us, to help you, and help everyone… equally.

Steve Kingsley, Data Engineer, Speechmatics
