Jan 17, 2023

How to Rapidly Train New Languages Using Common Voice and OSCAR

Speechmatics' Data Engineer, Steve Kingsley, discusses Common Voice and OSCAR, and how their data can help to rapidly train languages.
Steve Kingsley, Data Engineer

Data is at the heart of machine learning and the field of speech-to-text. There’s an abundance of it available online that can be used to train and refine our models, especially in the most commonly spoken languages. However, for “under-resourced languages” there’s less data available. This makes it hard to train speech-to-text systems for those languages, and the end product is less effective for everyone.

When it comes to rapidly deploying a language that has not previously been supported, speech recognition providers need to turn to existing datasets to fill the gaps. When we launched Ukrainian last year, we were helped massively by two projects: Common Voice, from the Mozilla Foundation, and the possibly lesser-known OSCAR, created by a research team at Inria, a French research institute.

Building Ukrainian

Since February 2022, the people of Ukraine have had an urgent need to be heard around the world. Ukrainian speakers reporting on their experiences need to have their stories shared. To enable rapid communication, tools such as speech-to-text can be incredibly useful. Transcribing spoken content into text for things such as captioning and translation can make a huge difference.

At Speechmatics, we tasked our team with getting Ukrainian into our offering as quickly as they could, while maintaining our expected high levels of accuracy. We turned to Common Voice and OSCAR to help.

The Common Voice Project

Common Voice is a project that allows everyone to contribute to a massive corpus of recorded audio and matching transcripts. This pairing of audio and text is known as “labelled data” and is vital to the accuracy of machine learning results. By crowdsourcing both data and validation, the project aims to bring all voices to the table. By providing the resulting data under an open-source license, it enables companies like Speechmatics to embrace this inclusivity and bring it to our platform.

As users contribute to the project by submitting recordings of their speech and the corresponding transcripts, this leads to a huge influx of data. To ensure the data is correct, users also help validate the submissions. Without this extra step, the content submitted wouldn’t be of a high enough standard for us to train on.
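
As a rough illustration of how community validation can be used when preparing training data, here is a minimal Python sketch that keeps only clips whose up-votes outweigh their down-votes. It assumes a tab-separated metadata file with `path`, `sentence`, `up_votes` and `down_votes` columns; the threshold of two up-votes, the function name `filter_validated` and the sample rows are illustrative assumptions, not a description of the Speechmatics pipeline.

```python
import csv
import io


def filter_validated(tsv_text, min_up_votes=2):
    """Return (path, sentence) pairs for clips the community has approved.

    The column names and the voting threshold are assumptions made for
    this sketch, not a documented rule of any particular pipeline.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    kept = []
    for row in reader:
        up = int(row["up_votes"])
        down = int(row["down_votes"])
        # Keep a clip only if enough listeners approved it and
        # approvals outnumber rejections.
        if up >= min_up_votes and up > down:
            kept.append((row["path"], row["sentence"]))
    return kept


# Hypothetical sample metadata, invented for illustration.
SAMPLE_TSV = (
    "path\tsentence\tup_votes\tdown_votes\n"
    "clip_001.mp3\tДоброго ранку\t2\t0\n"
    "clip_002.mp3\tЯк справи\t1\t1\n"
    "clip_003.mp3\tДякую\t3\t1\n"
)

if __name__ == "__main__":
    for path, sentence in filter_validated(SAMPLE_TSV):
        print(path, sentence)
```

With the sample rows above, the second clip is dropped because its votes are tied, while the first and third are kept.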

The OSCAR Dataset

This project name comes from the phrase “Open Super-large Crawled Aggregated Corpus”. This is a huge multilingual corpus created by distilling and classifying content by language from another amazing open-source project, Common Crawl. The OSCAR dataset allowed us to quickly access the language content that we need to train a new language model. We then used this data in the pipeline to build full support for the new language.
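
To make the idea of selecting text by language more concrete, here is a minimal Python sketch of the kind of filtering a text pipeline might apply to language-classified crawl data before language model training. It assumes each line already carries a language label and a confidence score from some classifier; the `CrawledLine` type, the `select_language` function, the thresholds and the sample lines are all hypothetical and do not reflect OSCAR's actual schema or our production pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawledLine:
    text: str
    lang: str     # language label from a hypothetical classifier
    score: float  # classifier confidence in [0, 1]


def select_language(lines, lang, min_score=0.8, min_chars=10):
    """Keep confident, non-trivial, de-duplicated lines in one language."""
    seen = set()
    kept = []
    for line in lines:
        # Collapse runs of whitespace so near-identical lines compare equal.
        text = " ".join(line.text.split())
        if line.lang != lang or line.score < min_score:
            continue
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        kept.append(text)
    return kept


# Hypothetical sample data, invented for illustration.
SAMPLE = [
    CrawledLine("Сьогодні у Києві сонячно.", "uk", 0.97),
    CrawledLine("Сьогодні  у Києві сонячно.", "uk", 0.95),  # duplicate once whitespace is normalized
    CrawledLine("The weather is sunny today.", "en", 0.99),  # wrong language
    CrawledLine("Так", "uk", 0.90),  # too short to be useful
    CrawledLine("Можливо, це українська мова.", "uk", 0.40),  # low confidence
]

if __name__ == "__main__":
    for text in select_language(SAMPLE, "uk"):
        print(text)
```

Only the first sample line survives: the others are filtered out as a duplicate, a different language, too short, or too uncertain.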

By providing a large, diverse, and well-annotated dataset, OSCAR helps us build more accurate and robust speech-to-text systems. When it came to Ukrainian, having this data was hugely helpful.

Helping Under-Resourced Languages

When we talk about “under-resourced” languages, we refer to languages and dialects that have very little data easily available. In speech-to-text, this quickly becomes a self-perpetuating cycle. The lack of support for your preferred native language forces you to use a second language from the commonly supported options. This creates yet more content in the languages that are already well supported, further skewing the balance of data available.

Both Common Voice and OSCAR help to redress this imbalance by providing the data needed to support the ongoing development of systems like ours and others in the field. They help bring the same advantages and access to all.

We used these two datasets not only to rapidly train and deploy Ukrainian support, but also to add support for 15 further languages. Without these two open-source projects and their contributors, we wouldn’t have been able to take such a large leap forward in the number of languages we support, or to improve the accuracy of our existing languages.

Open access to data is a vital part of bringing equality to speech-to-text. Inclusivity and equity are core values we hold here at Speechmatics. Whoever you are, whatever your native language, accent, ethnicity, or demographic, contribute your voice to the Common Voice project and help them, to help us, to help you, and help everyone… equally.

Steve Kingsley, Data Engineer, Speechmatics