110 million more voices are now understood
Understanding every voice is crucial for bringing Speech Intelligence to the world.
To be truly valuable, AI can’t just understand American English voices – English is spoken by fewer than 20% of the world’s population and is only one of over 7,000 languages – it must be accurate for a broad range of languages around the globe.
Continuing our mission to understand every voice, we recently added our 49th language to the roster of those we support for transcription: Persian.
With it, 110 million more voices are now understood. Persian is a widely spoken language in the Middle East and Central Asia, ranking among the world’s 20 most widely spoken first languages. Persian is known as Farsi, Dari and Tajik in Iran, Afghanistan and Tajikistan respectively, and has 62 million native speakers and at least 50 million second-language speakers.
So, how does Speechmatics’ speech recognition learn a new language? Here, we lift the lid on how this process works, as we think it’s rather interesting.
A Running Start
Many polyglots say that after the first few languages, learning additional languages becomes easier. The shared rules, grammar, and words within language families help us transfer learnings from one to another. The universals of language, such as nouns, verbs, consonants, and vowels, mean the structure and syntax of languages can also be shared.
This helps Speechmatics a lot. Every time we want to learn a new language, we don't have to start from scratch. We've already 'learnt' 48 languages using self-supervised learning, immersing our models in a huge mix of speech data from a broad range of languages, dialects, and accents.
Recognizing 48 languages gives us a head start on learning our 49th: Persian...
Getting Started With Persian
Turning the Persian spoken word into an accurate transcript raises two questions:
How does Persian sound?
How is Persian written?
Let’s consider each in turn.
The key to understanding spoken Persian is variety. We want a mix of clean audio, such as audiobooks, and messier audio, like someone shouting next to a loud washing machine. The speech needs to include a range of vocabulary too, including technical language, informal vernacular, and regional-specific words. We try to create a bank of diverse voices that reflect how Persian is heard in the real world – different contexts, quality of recordings, and accents.
Why is this important?
Our end goal is to provide an accurate transcript of Persian in the real world. Sometimes speech audio is clean, but sometimes it is a messy, low-quality phone call with multiple speakers and plenty of background noise. Both scenarios are important for us. Utility in the real world is a priority, so having a dataset that reflects this is vital.
Our training data also needs to cover as diverse a set of voices as possible so the models can generalize well to many users. Since our voices are shaped by where we're from, our gender, our background, and many other variables, we try to make sure the accents and dialects in the data are as diverse as possible. We also want to ensure the audio comes from a variety of sources. Newsreaders sound very different to people phoning into a contact center helpline.
Capturing such a wide range of voices is significantly helped by our self-supervised approach. When building our bank of speech, we're not only looking for labeled data (i.e. audio recordings that come accompanied by a human-written transcript) but also unlabeled audio, of which there is much more. This opens the pool of audio to be learned from since we're not restricted to perfect datasets of recorded and transcribed Persian – we can potentially use any spoken Persian.
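To give a flavour of the general idea behind self-supervised learning on unlabeled audio, here's a toy sketch in the spirit of masked prediction (as popularized by wav2vec-style pretraining). This is purely illustrative and not Speechmatics' actual system: random spans of the input features are hidden, and a model can then be trained to predict what was hidden – no human transcript required.

```python
import numpy as np

def mask_frames(features: np.ndarray, mask_frac: float = 0.15,
                span: int = 10, seed: int = 0):
    """Hide random spans of audio feature frames for masked-prediction
    pretraining. Returns the masked features and a boolean mask marking
    which frames the model should learn to reconstruct."""
    rng = np.random.default_rng(seed)
    masked = features.copy()
    n_frames = len(features)
    mask = np.zeros(n_frames, dtype=bool)
    # Choose enough spans to cover roughly mask_frac of the frames.
    n_spans = max(1, int(n_frames * mask_frac / span))
    for start in rng.integers(0, max(1, n_frames - span), size=n_spans):
        mask[start:start + span] = True
    masked[mask] = 0.0  # masked frames become the prediction targets
    return masked, mask
```

The pretraining objective is then simply: given `masked`, reconstruct the original values at the positions where `mask` is true.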
To cover as many voices as possible, we typically train on thousands of hours of speech per language. For Persian, we started with about 300 hours of relatively 'high-quality' speech, which means generally people speaking into a laptop-quality microphone.
We then augmented this with about 1,000 hours more of 'mixed quality' speech, which included audio from bad microphones, background noise and music, slang, and so on. This improved the coverage of diverse speakers in different scenarios, and also included other languages from the region – for example, a few dialects of Arabic – which helps us recognize words borrowed from other languages.
The Mozilla Common Voice project, for example, contains both clean, near-perfect Persian recordings and messy, low-quality ones – and both are equally important for Speechmatics to learn and understand.
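To make the clean-versus-messy distinction concrete, here's a minimal sketch of a standard way to simulate low-quality recordings: mixing background noise into clean speech at a chosen signal-to-noise ratio (SNR). The function and its parameters are illustrative assumptions, not Speechmatics' pipeline.

```python
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix background noise into clean speech at a target SNR in decibels."""
    # Loop or trim the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # avoid division by zero
    # Scale the noise so 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

A low `snr_db` (say 0–5 dB) approximates the 'shouting next to a washing machine' end of the spectrum; a high value leaves the speech nearly clean.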
The second part of the early puzzle is to learn the nuances of how the language is written. If the aim is to provide an easy-to-understand transcript, a big question is how a reader of that language expects the written text to look and whether that changes for non-native speakers or more formal settings.
For example, in all languages, colloquialisms may be written a specific way by native speakers, but we ideally want the formal form ('want to' instead of 'wanna'). This helps non-native speakers and people learning the language – non-native English speakers are less likely to recognize 'wanna' when written down.
Another reason is to help 'normalize' the language before training our models on it. We build a normalizer to transform the text into a common standard before training our models. For example, when training our models, we lowercase all of the text and decide which diacritics should be output in various scenarios. Diacritics are the symbols added above or below certain letters that modify them. While French and German use diacritics to indicate changes in pronunciation, Persian also uses diacritics to indicate the positions of vowels.
For Persian, we asked ourselves important questions:
When are diacritics used, and what for?
Does the written form differ across regions?
How are vowels represented?
How are numbers and punctuation written?
Are there different digit systems for numbers?
When, or how, are they used?
Once we start understanding the nuances of the language, we bake this knowledge into our code. Before training our models, the normalizer strips punctuation to iron out inconsistencies in the data – the serial comma is removed, for example. We then train our models to re-apply punctuation and formatting consistently, based on a smaller set of high-quality written data.
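As a sketch of what such a normalizer might look like for Persian – the specific character mappings, diacritic range, and punctuation set here are illustrative assumptions, not Speechmatics' actual rules:

```python
import re
import unicodedata

# Arabic-script diacritics: tanwin, fatha, damma, kasra, shadda, sukun, etc.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Persian and Latin punctuation, including the Arabic comma/semicolon/question mark.
PUNCTUATION = re.compile(r"[\u060C\u061B\u061F\u00AB\u00BB.,;:!?\"'()\[\]-]")
# Fold Arabic codepoints into their Persian counterparts.
CHAR_MAP = str.maketrans({"\u064A": "\u06CC",   # Arabic yeh  -> Farsi yeh
                          "\u0643": "\u06A9"})  # Arabic kaf  -> keheh

def normalize(text: str) -> str:
    """Transform raw Persian text into a common standard before training."""
    text = unicodedata.normalize("NFC", text)  # one canonical codepoint form
    text = text.translate(CHAR_MAP)            # unify Arabic/Persian variants
    text = DIACRITICS.sub("", text)            # drop optional vowel marks
    text = PUNCTUATION.sub(" ", text)          # punctuation is re-applied later
    text = text.lower()                        # no-op for Persian script; matters
                                               # for embedded Latin words
    return re.sub(r"\s+", " ", text).strip()
```

Every transcript passes through the same function, so the model never has to learn that two spellings of the same word are equivalent – the normalizer has already decided.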
By combining a wide variety of spoken and written Persian data with an understanding of the nuances of how Persian is written, we give our models a strong foundation to build upon.
Next: Training 🏋️
Once we have all this audio data, we split it into three parts for training, validating, and evaluating our models.
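A minimal sketch of that carve-up might look like the following. The fractions are illustrative, and note that in practice the evaluation ('final exam') set comes from entirely separate, out-of-domain sources rather than the training pool, so here we only carve validation data off the in-domain pool.

```python
import random

def split_in_domain(files: list, val_frac: float = 0.05, seed: int = 42):
    """Carve a validation set off the in-domain training pool.

    The evaluation set is kept entirely separate and comes from
    different, out-of-domain sources, so it is not split off here.
    """
    files = files[:]                        # don't mutate the caller's list
    random.Random(seed).shuffle(files)      # deterministic shuffle
    n_val = max(1, int(len(files) * val_frac))
    return files[n_val:], files[:n_val]     # (training set, validation set)
```

Using a fixed seed keeps the split reproducible, so every training run sees the same validation set and accuracy numbers stay comparable between runs.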
In many ways, training our models follows a similar path to how a human might learn a language at school.
Let's use that analogy to explain the process.
Step 1: The Lessons
We start by training our models in a similar fashion to taking a language class or course.
We immerse the models in the language and watch as they pick it up over time. Like a good linguist, large models soak up the data and learn from it very quickly. Others may need to see a lot more data, or the same data a few times, to meet our required levels of accuracy.
We train the models on the majority of our data and refer to this as the training dataset. This could be in the tens of hours, but we generally start with a little more.
Step 2: The Mock Exams
The second dataset is used for testing the models as they're being trained, to make sure they're improving; this set typically looks a lot like the training set. The process is known as validating the models, and the dataset we use is called a validation dataset.
It's usually a portion of data from the same sources we use for training that we carve off for validation – we call it 'in-domain' because it comes from the same sources as the training data. In our analogy, this might be the teacher setting you tests based on what they taught you the previous week.

Humans might also (wisely) want to take a couple of mock or practice exams. You wouldn't want to dive straight into the finals, so you take some practice tests throughout your language course to see how you're doing and learn from any mistakes you made. This is the same for us. We test the models against the validation dataset a few times, looking for common mistakes and trying to iron them out – is there something we missed in our code for handling numbers? Are we using the wrong gender for gendered languages (though Persian isn't one)? Are we including or excluding diacritics or punctuation that the validation data says should or shouldn't be there?

Like learning a language, a couple of practice tests are a great way to solidify your learning. And just as in real life, if you have a friend who speaks the language fluently, you might be brave enough to engage them in conversation, with them providing pointers and feedback on where you're making mistakes. For us, one customer was just such a friend, and we could send them some of our models for feedback.
Step 3: The Final Exam
In real life, the final exam can't be something you've seen before, or be too closely tied to your teacher's approach. It is an exam, after all. Our evaluation data is our 'final' exam. This evaluation dataset comes from different sources that we trust highly, and we use this third set for testing the models once they're built, to see how well we've done and to benchmark ourselves in the market.

To ensure that we can understand every voice and be useful in the real world, we check against a diverse range of speakers and vocabulary from different sources than our training data ('out-of-domain' rather than 'in-domain'). We put that audio through the models and compare the output against what the evaluation transcript says it should be.

We also run our evaluation past native speakers and customers to make sure they're happy with the quality of the transcription, and that we're not making any obvious mistakes. This covers the formality of the written form, the punctuation, the formatting of numbers, dates, and addresses, and how we do on borrowed words in the audio (e.g. non-Persian names, brands, etc.).
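The comparison behind accuracy figures like the ones below is typically word error rate (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the model's output into the reference transcript, divided by the reference length, with accuracy then roughly 100% minus WER. A textbook dynamic-programming implementation (not Speechmatics' internal tooling):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance over words: (subs + ins + dels) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[-1][-1] / max(len(ref), 1)
```

Both strings would first pass through the same normalizer as the training data, so the score reflects recognition mistakes rather than formatting differences.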
Unlike an exam at school or on a language course, we won't stop if our results aren't good enough. We'll get back to work and keep improving to ensure we hit the consistently high accuracy that Speechmatics prides itself on.
For Persian, all of the above took no more than a couple of weeks, which is incredibly fast by any standard, let alone a human one. Much of this time was spent learning about Persian and adapting our code to handle some of its nuances. The rest was spent on model training.
How can we create highly accurate transcription models in such relatively short times? The secret lies in our use of self-supervised learning and the small amount of data we need. This technology gives us a tremendous head start, so we're not learning from scratch every time.
High Accuracy, Low Turnaround Time
Despite this super quick addition of a new language, our customers are delighted with our accuracy figures. On the Common Voice dataset, we evaluated Speechmatics as having 82.8% accuracy. In contrast, OpenAI's Whisper Large (v2) model only achieved 60.6% accuracy.
Next, we'll be building more languages in the same way as Persian. The quality of our transcription remains our top priority, and we'll always work with customers and those interested to ensure they get what they need.
The process described above can be used for any language, so if you're reading this and Speechmatics doesn't currently support a language you need, let us know. Head here and pop us a message; we'd love to talk.
Transcription coverage for over half the world's population
With Speechmatics, the only real question you need to ask is... where next?