99% accurate!? A sceptic's guide to assessing speech-to-text accuracy
Looking beyond the numbers...
Stuart Wood, Product Manager
Benedetta Cevoli, Senior Data Scientist
The quest for real-world accuracy
We recently added a new open test set, covering many languages, to our evaluation suite and ran the numbers to compare ourselves against other providers. For this, we used open, publicly available data: the FLEURS dataset.
We were very happy with the results 😊:
We performed really well in languages that are typically underrepresented: in Mandarin, for example, we were on average 8.45% more accurate, and 4.77% better than our nearest competitor.
More accurate than Amazon, AssemblyAI and Deepgram in every language we offer.
When comparing every major ASR vendor in every language, Speechmatics is most accurate 93.73% of the time.
What followed was a discussion about what these results meant, and whether we should plaster the full test results on our website. We haven't, and for good reason. Whilst the FLEURS test data is interesting (more on that later), we think there's an important nuance to discuss about how you should judge any speech-to-text technology, and how important it is to get as close as possible to real-world, messy data.
We felt this discussion was worth sharing because, for us, it's even more important to be transparent and helpful than to massage figures to show that we come out on top. We know it can be difficult to make sense of the various claims and figures that companies in our space throw out and make the right decision for your business.
It also provides insight into how hard it is to come up with a single figure that shows in objective terms how 'good' or 'bad' a transcription provider is at actually transcribing media you give them.
So here we'll talk you through how these figures are calculated, and how sometimes keeping it simple can lead to an incomplete picture (and potentially some sub-par buying decisions).
The trouble with testing
Companies that provide speech-to-text (otherwise known as ASR, or STT) services will often test their transcription using a dataset. These datasets are labeled, which means that a human has painstakingly listened to them (often multiple times) and written a transcript of the audio in question (this is considered the gold standard). The test then compares this manual transcript with the output of the company's own product. You can then measure the success or failure of this in two distinct but closely related ways (a short code sketch follows the list):
Word Error Rate (WER) – what percentage of words did the service get wrong? So, a WER of 33% would mean that one in every three words was transcribed incorrectly.
Accuracy – what percentage of words does the service get right? So, an accuracy figure of 66% would mean two out of three words were transcribed correctly (and the eagle-eyed will notice that Accuracy is simply 100% - WER).
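To make that concrete, here's a minimal sketch of how WER is typically computed: a word-level edit distance between the reference transcript and the ASR output, divided by the number of words in the reference. The exact text normalisation rules vary between evaluations, so treat this as illustrative rather than a description of any particular vendor's scoring pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()

    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i              # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j              # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1,   # insertion
                          substitution)      # substitution (or match)
    return d[-1][-1] / len(ref)


reference = "the cat sat on the mat"
hypothesis = "the cat sat on a mat"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}  |  Accuracy: {1 - wer:.1%}")  # WER: 16.7%  |  Accuracy: 83.3%
```

One quirk worth knowing: because inserted words count as errors, WER can technically exceed 100% on very poor output, which is another reason a single headline figure only tells part of the story.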
For our tests, we used the FLEURS dataset. You can read more about it here, but in short, it's a good dataset for testing your accuracy across a wide range of languages. For any company offering a broad range of languages and translation services, this dataset is a great initial way to see how they are doing.
So far, so sensible. It's good to use attributed, publicly available data as this means that you're not hiding your evaluation data and in theory, anyone could replicate the results (yay for science).
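To give a feel for what that replication might look like in practice, here's a rough sketch that scores a slice of FLEURS with the word_error_rate function from the sketch above. The google/fleurs dataset name and its transcription field refer to the copy hosted on the Hugging Face hub, and transcribe() is a hypothetical stand-in for whichever ASR service you want to test; none of this describes our internal evaluation harness.

```python
from datasets import load_dataset  # pip install datasets

def transcribe(audio) -> str:
    """Placeholder: call whichever ASR service you are evaluating here."""
    raise NotImplementedError

# The FLEURS test split for US English, as published on the Hugging Face hub.
fleurs = load_dataset("google/fleurs", "en_us", split="test")

total_errors, total_words = 0.0, 0
for sample in fleurs.select(range(100)):       # a small slice for a quick check
    reference = sample["transcription"]        # the human 'gold standard' transcript
    hypothesis = transcribe(sample["audio"])   # the system output being evaluated
    ref_words = len(reference.split())
    total_errors += word_error_rate(reference, hypothesis) * ref_words
    total_words += ref_words

print(f"Corpus WER over the slice: {total_errors / total_words:.1%}")
```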
As a side note – look out for any company that doesn't show (or 'attribute') its evaluation data. If you don't know what's in the dataset, neither you nor anyone else can replicate the results, and you should be very dubious of them.
But there's a big drawback. These datasets are simply too artificial, too clean, too good. The FLEURS dataset is great in that it contains a lot of audio from different speakers across a range of languages. But they are all speakers reading from Wikipedia into a microphone. If you listen to the samples, you can hear people reading at a relatively slow and steady pace, reading well-written and formatted informational text. Why is this a problem? Well, if you're trying to make products that use speech data in the real world, chances are you won't have the luxury of every piece of speech being this, well, clean. For example:
People talking over each other, as in regular conversations, phone calls and podcasts.
People speaking 'as they are thinking' – ums, ers, oohs, incomplete words, half-finished thoughts.
Noisy backgrounds, for example in a contact center, obscuring an agent's voice.
Noisy backgrounds at sporting events (thinking of trackside F1 presenters here).
Low-quality audio.
Low-volume audio.
Any machine learning product or model designed only to excel at some of these publicly available datasets will fall down when you actually want it to deliver value in the real world, which is messy and loud. Of course, it's great to show high accuracy for these artificially clean scenarios and datasets, but this isn't our North Star.
At Speechmatics, our North Star is usefulness. We want to create something that works for real product companies looking to provide value in the real world. If we create a product that looks great on paper but fails for our customers, we have failed. If we create a product that only excels in one language or fails to understand a particular demographic and not others, we have failed. Our mission is to Understand Every Voice, and we take that mission seriously.
How then, can you judge the accuracy of a provider? How do you know if it will work with your real-world media?
To do that, you need to get messy.
Making things more realistic
If the aim is to provide a better indication of how a product will perform in the real world, you intuitively need to make the test data more realistic (we call this representative).
You can do this by throwing some more varied audio data into the mix (one way to simulate the noisy case is sketched just after the list below).
For example:
Data from speakers with more varied social and economic backgrounds.
Data from speakers with different accents.
Data from speakers having a spontaneous ‘natural’ conversation rather than reading from a text.
Data from multiple speakers talking to each other to simulate a phone conversation.
Data with speakers in noisy environments.
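If you don't have genuinely noisy recordings to hand, a crude way to approximate the last item on that list is to mix background noise into clean audio at a chosen signal-to-noise ratio. The sketch below assumes mono WAV files with matching sample rates and uses placeholder file names; synthetic mixing is no substitute for truly representative data, but it's a useful starting point.

```python
import numpy as np
import soundfile as sf  # pip install soundfile numpy

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech so the result has roughly the requested signal-to-noise ratio (dB)."""
    # Loop or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise
    # Normalise if needed to avoid clipping when the file is written out.
    return mixed / max(1.0, np.max(np.abs(mixed)))

# Placeholder file names: assumes mono audio at the same sample rate.
speech, rate = sf.read("clean_utterance.wav")
noise, _ = sf.read("contact_centre_babble.wav")
sf.write("noisy_utterance_10db.wav", mix_at_snr(speech, noise, snr_db=10.0), rate)
```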
Now what happens when we do this?
Below we look at tests comparing OpenAI Whisper and Speechmatics on different datasets, each designed to address one of the points above. We report Word Error Rate, to give you a sense of the percentage of mistakes made. In this case, lower is better.
Conversations across broad demographics
Casual Conversations is Meta's dataset, composed of over 45,000 videos (3,011 participants). The videos feature paid individuals who agreed to participate in the project and explicitly provided age and gender labels themselves. The videos were recorded in the US with a diverse set of adults across various age, gender and apparent skin tone groups.
(Don't forget, lower is better when it comes to WER.)
This shows that with this data, Speechmatics is making 18.95% fewer errors for those with light skin tones and 23.99% fewer errors for those with dark skin tones.
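A quick note on what 'X% fewer errors' means here: it's the relative reduction in WER, not a difference in absolute percentage points. The numbers in the snippet below are made up purely to illustrate the arithmetic.

```python
def relative_error_reduction(our_wer: float, their_wer: float) -> float:
    """How many fewer errors we make, relative to the other system's error count."""
    return (their_wer - our_wer) / their_wer

# Hypothetical WER values for illustration only, not actual test results.
our_wer, their_wer = 0.06, 0.08
print(f"{relative_error_reduction(our_wer, their_wer):.0%} fewer errors")  # 25% fewer errors
```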
Telephone conversations
Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the US. It was created to give a close approximation of a telephone call and is therefore close to the kind of audio that call centers and CCaaS providers would look to transcribe.
In this scenario, Speechmatics leads again, recording 47.25% fewer errors.
In a noisy moving car(!)
Now, in what definitely would have been the most fun to record, the AVICAR dataset includes very short utterances recorded in a moving vehicle under a range of conditions (different speeds, with the windows up and down). This adds varying degrees of background noise to the recordings, making it a great dataset for reflecting a model's ability to capture speech over background noise.
Here, Speechmatics makes 44.48% fewer errors.
In fact, if you look across 8 datasets that are closer to the messy real world (Common Voice, CORAAL, AVICAR, Switchboard, Rev16, Casual Conversations and two internal noisy datasets), Speechmatics makes over 32% fewer errors than OpenAI Whisper.
Now for us, this is the true test of a good product. We want our customers to know and understand that we don't only train things in a lab or a vacuum. We train our models on data that contains challenging scenarios to ensure they are robust in the real world. We also test our product with realistic data, so that when customers use Speechmatics, they can trust that we're not going to let them down. They know that if they use us on noisy, low-quality phone conversations recorded in a call center, we're not going to get one in every two words wrong.
And, when we fall short, we listen. When they tell us that they are not seeing the accuracy they need, we get to work.
How then should you judge these figures?
The dangers of making things too simple
The old marketing adage of KISS ('keep it simple, stupid') is often useful, but it can lead to some over-simplification. We like to say, 'make it as simple as possible, but no simpler'.
Sometimes, a single WER or accuracy figure is too simple. It loses the nuance needed to show how something will perform in the real world. But nuance is hard to convey quickly, and we understand this too. People want to slap an impressive figure on a webpage and hope that their customers and people evaluating their product won't ask too many questions. It's a tough balancing act.
What this means is that for every provider in this space, you'll be able to find accuracy or WER figures, comparing them to others and doubtless showing that they come out on top. We do too!
Our commitment has never been to make the most accurate transcription if all that means is transcribing clean, un-messy audio as well as possible. Our commitment is instead to achieve high accuracy regardless of the input – to provide transcripts that clear a high bar even when the audio quality is low.
A great story about this: a few years ago, we hit a strange scenario with some test data. Our accuracy scores came back incredibly low, which was confusing because it didn't match the scores we were getting with other datasets. Upon investigation, we found that if you cranked the volume WAY up, you could hear a very faint second speaker being picked up by the microphone. Speechmatics was transcribing this (accurately), but the human transcriber hadn't been able to hear it, so had omitted it from their transcript. In other words, Speechmatics was hearing audio that the person in charge of providing the clean transcription could not. In this instance, we were too good, and actually not useful!
Our priority is to create something valuable, something useful. In order to do that it has to work in the real world. And for that to happen, you need to get your hands dirty (when it comes to audio). The only true test is to try out your audio and see the results for yourself.
Luckily, we have a portal - which is free - where you can do just that. We think you'll be impressed by the results.
If you have any feedback about this blog, or have questions for us here at Speechmatics then please let us know.
To learn more about Speechmatics, please visit our Docs, and you can immediately test out our WER for yourselves on our portal.
Don't just take our word for it.
SourceForge is the world's largest software and services comparison website - see how Speechmatics compares to others in the market.