
With speech-to-text, our competitors are names everyone knows. Companies like Google, Microsoft, and Amazon have vast resources at their disposal, allowing them to pursue projects beyond accuracy. For example, Amazon uses its Alexa technology in vehicles to drive customers to the rest of Amazon's product range.
At Speechmatics, our direction is much more focused. We believe in our mission to understand every voice, so accuracy is of paramount importance. In pursuit of that, we recently published a whitepaper looking at continuous content and comparing the accuracy of our Autonomous Speech Recognition (ASR) engine with our competitors.
Here's some of what we learned.
We understand that it's challenging to be first in the AI industry. Competitors with more resources are more likely to be first on the market. We care about being the best. So, while our innovation rapidly grows year on year, we prioritize accuracy.
However, when you combine accuracy with speed, you produce a powerful, inclusive ASR: an engine that, when compared with our competitors across a wide-ranging set of 24 YouTube videos, has accuracy levels of 90%. We managed to prioritize accuracy while also improving speed through the introduction of self-supervised learning.
Essentially, this means that we can use unlabeled data to build better models without the need for human supervision, saving time without compromising on accuracy. Before self-supervised learning, we trained our ASR on approximately 30,000 hours of labeled audio data. Now that we can also learn from unlabeled audio, that number is closer to 1.1 million hours. That's a lot more data for a lot more accuracy.
Let's say you're watching a YouTube video. You press the 'CC' button, and words start appearing at the bottom of your screen. Unfortunately, they don't all seem to match the audio.
In most cases, speech-to-text engines aren't as accurate as they should be. As a result, an editor must manually fix a transcript. When you have an inaccurate engine, the process taken to reach your end product slows drastically.
However, the accuracy-focused AI rooted within our ASR minimizes the need for manual improvements. Instead, our ASR fixes initial errors. Again, this is down to self-supervised learning – it masks a word from the content file, trains a model to predict the missing word, and then learns which words match.
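To make the masking idea concrete, here is a minimal sketch in Python. It is a toy stand-in that works on text rather than audio and is not a description of our production models: the function names, the `<mask>` token, and the tiny corpus are all assumptions made up for illustration. It hides each word in turn, records which word filled the gap between its two neighbors, and later uses those counts to predict a masked word. No human labels are needed, because the hidden word itself is the training target.

```python
from collections import Counter, defaultdict

MASK = "<mask>"  # hypothetical placeholder token, chosen for this sketch


def train_masked_model(sentences):
    """Hide each interior word in turn and count which word filled the gap.

    The 'label' is simply the hidden word, so no human annotation is needed.
    """
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i in range(1, len(words) - 1):
            context = (words[i - 1], words[i + 1])  # left and right neighbors
            model[context][words[i]] += 1
    return model


def predict_masked(model, sentence):
    """Return the most frequent word seen in the masked position, or None."""
    words = sentence.lower().split()
    i = words.index(MASK)  # raises ValueError if there is no mask token
    if i == 0 or i == len(words) - 1:
        return None  # this toy model needs a neighbor on both sides
    candidates = model.get((words[i - 1], words[i + 1]))
    if not candidates:
        return None  # context never seen during training
    return candidates.most_common(1)[0][0]


corpus = [
    "press the cc button",
    "press the play button",
    "press the cc button again",
]
model = train_masked_model(corpus)
print(predict_masked(model, "press the <mask> button"))  # prints "cc"
```

A real system would replace the counting table with a neural network and mask stretches of audio rather than written words, but the training signal is the same: predict what was hidden from what surrounds it.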
At Speechmatics, we have roughly 160 employees. This is in stark contrast to Microsoft and Amazon, some of our primary competitors. Of course, with the higher employee numbers, these industry giants have the resources needed to enter new markets and keep making new products, as seen with Amazon's Alexa.
Amazon has integrated its voice recognition software into BMW cars. As a result, drivers are encouraged to use the Amazon store. A win-win for Amazon. At Speechmatics, however, we're not looking to send customers to other products or stores. Our sole focus is on the innovation and accuracy of our award-winning ASR.
Thanks to this laser focus, our speech-to-text engine produced an overall accuracy of 82.8% for African American voices compared to 68.6% for Amazon, based on the datasets used in Stanford's 'Racial Disparities in Speech Recognition' study.
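For readers who want to see what a gap like that means in terms of errors, here is a short Python sketch. It assumes that accuracy is defined as 100% minus the word error rate (WER), which is a common convention in speech recognition but is an assumption here rather than a statement of how the figures above were computed. The function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(accuracy_a, accuracy_b):
    """Fraction of engine B's errors that engine A avoids.

    Both inputs are percentages; assumes accuracy = 100 - error rate.
    """
    error_a = 100.0 - accuracy_a
    error_b = 100.0 - accuracy_b
    return (error_b - error_a) / error_b


# One substituted word out of four gives a WER of 0.25, i.e. 75% accuracy.
print(word_error_rate("press the cc button", "press the sea button"))

# Using the two accuracy figures quoted above.
print(round(relative_error_reduction(82.8, 68.6), 3))  # prints 0.452
```

Under that assumption, 82.8% against 68.6% corresponds to error rates of 17.2% and 31.4%, or roughly 45% fewer word errors for an editor to fix.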
Of course, a speech-to-text engine is more likely to have higher accuracy levels in widely spoken languages such as English, Hindi, and Spanish. And while that makes media consumption incredibly accessible for millions of people, it still leaves gaping holes. Nuanced, slightly different versions of all languages exist, and there are over 7,500 languages worldwide.
Therefore, an inaccurate ASR would leave many people out in the cold. It's why we're always working on adding to the 34 languages our ASR currently caters for, with further improvements to specific dialects and accents within existing language packs, such as Canadian French and Brazilian Portuguese.
This leads us to our fifth and final point.
If you've heard of us before, you likely know our primary mission: to understand every voice. This drives everything we do, but how do we achieve that? By continually improving the accuracy of our engine.
It's a simple message but one worth repeating. An accurate speech-to-text engine builds bridges between people, making communication more accessible for all. People will feel more confident stepping into new cultures. As we wrote on our website, an accurate ASR has far-reaching benefits in healthcare, finance, advertising, home living, driving, and productivity, to name a few.
An accurate speech-to-text engine helps remove the often-daunting communication barrier and foster a sense of understanding.
So, there you have it: that's why we prioritize accuracy. Here's a brief summary of the five advantages of having an accuracy-focused speech-to-text engine:
Self-supervised learning makes combining speed and accuracy possible.
Less time spent manually fixing errors in transcription.
Singular focus helps use resources to their maximum potential.
Prioritizing accuracy means prioritizing accessibility.
Removing the barrier of communication makes for a more harmonious existence.
To ensure our ASR stays ahead of the pack, we'll continue to explore ways we can innovate. It's what we do.