
Jun 5, 2019

How to accurately benchmark speech technology providers

Accurate transcription of voice data using any-context speech recognition enables enterprise businesses to extract insights from their audio automatically.

Speech-to-text accuracy (usually measured via word error rate, or WER) and the number of supported languages are the benchmarks most often quoted when comparing speech-to-text products.

But from an operational perspective, many other factors come into play that can make an integration exercise either very simple or much harder than it needs to be.

As a Senior Product Manager for a speech technology company, I regularly benchmark our products against our competitors so that we understand how we compare in the market. To do this, I read the documentation, integrate with each product's API and consume the output just as end users would.

I’ve learnt a lot along the way, so I thought I’d share my experiences on some of the hurdles we've had to overcome. I’ve pulled out four key factors I feel are the most important to consider – and watch out for – when benchmarking speech technology providers from an operational perspective. In my testing I looked at IBM Watson, Microsoft Video Indexer and Google Cloud Speech-to-Text.

1. Speed

How quickly a transcription comes back is measured by real-time factor (RTF): the time taken to transcribe the audio divided by the duration of the audio. Users expect a fast turnaround, and many providers deliver one; however, there are some impediments that can slow things down.
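As a rough illustration, this is how we time it in our benchmark scripts: a minimal Python sketch that measures the whole round trip (upload, queueing, retries), since that is the turnaround the end user actually experiences. The `transcribe` callable is a placeholder for whatever client code submits the file and waits for the result.

```python
import time

def real_time_factor(transcribe, audio_path, audio_duration_s):
    """RTF = wall-clock time to get the transcript back / duration of the audio.

    `transcribe` is whatever callable submits the file and blocks until the
    transcript is returned, so the measurement includes upload, queueing and
    any retries -- not just the recognition step itself.
    """
    start = time.monotonic()
    transcribe(audio_path)
    return (time.monotonic() - start) / audio_duration_s

# Example: a 600-second file returned after 240 seconds gives an RTF of 0.4.
```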

Here’s what to look out for when it comes to benchmarking operational speed.

  • Understanding how long an auth token lasts is important, as you may need to refresh it after a certain number of requests or a set period. One of the providers I tested has tokens that need to be refreshed after 1 hour, so we had to account for this when running our 4-hour test set (see the sketch at the end of this section).

  • It is very unlikely, but it does sometimes happen, that access to the SaaS service is lost for any number of reasons and jobs fail. Having to resubmit your jobs adds time to the RTF value you measure for an individual transcription, as well as additional transcription costs. So, to ensure that a test set runs to completion, we've had to add logic which retries submitting any jobs that fail, as shown in the sketch after this list. At the end of the test, we can then compare SaaS stability between the providers for a given test run.
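A minimal sketch of the kind of logic we mean, assuming a generic REST API with bearer-token auth. The endpoints, field names and one-hour token lifetime below are placeholders for illustration, not any particular provider's API.

```python
import time
import requests

AUTH_URL = "https://provider.example.com/oauth/token"  # placeholder endpoint
JOBS_URL = "https://provider.example.com/v1/jobs"      # placeholder endpoint

def fetch_token():
    # Placeholder: exchange client credentials for a short-lived bearer token.
    resp = requests.post(AUTH_URL, data={"grant_type": "client_credentials"})
    resp.raise_for_status()
    return resp.json()["access_token"], time.monotonic() + 3500  # refresh just before the hour is up

def submit_with_retries(audio_path, max_attempts=3):
    token, expires_at = fetch_token()
    for attempt in range(1, max_attempts + 1):
        if time.monotonic() >= expires_at:           # token about to lapse mid test run
            token, expires_at = fetch_token()
        with open(audio_path, "rb") as audio:
            resp = requests.post(JOBS_URL,
                                 headers={"Authorization": f"Bearer {token}"},
                                 files={"data_file": audio})
        if resp.status_code == 401:                  # token rejected: refresh and retry
            token, expires_at = fetch_token()
            continue
        if resp.ok:
            return resp.json()["id"]                 # hypothetical job identifier field
        time.sleep(2 ** attempt)                     # transient failure: back off, then resubmit
    raise RuntimeError(f"{audio_path} failed after {max_attempts} attempts")
```

Counting how often each provider needs a retry over a full run is what lets us compare SaaS stability at the end of the test.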

2. Error handling
  • You need to throttle your own requests so that you do not hit limits imposed by the SaaS service. When you do hit them, some providers are helpful enough to return errors such as HTTP 429 "The project exceeded the rate limit for creating and deleting buckets". Others just fail with an unhelpful HTTP 503 "Service Unavailable" or HTTP 500 "Internal Server Error". To work around this when testing, we had to pace the requests to reduce the impact of hitting these limits (see the sketch after this list).

  • This is really a user error but occasionally you can ask for a language that the provider does not support and it's interesting to see the error returned. One provider returns the error "You need to specify MediaSampleRate"!
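A rough sketch of that pacing logic, using the same placeholder job endpoint idea as above: keep a steady gap between submissions and back off when the service pushes back, whether it does so with a descriptive 429 or an opaque 5xx.

```python
import time
import requests

def submit_paced(session, jobs_url, audio_path,
                 min_interval_s=1.0, max_backoff_s=60.0, max_attempts=10):
    """Pace submissions and back off when the service signals (or hides) rate limiting."""
    backoff = min_interval_s
    for _ in range(max_attempts):
        with open(audio_path, "rb") as audio:
            resp = session.post(jobs_url, files={"data_file": audio})
        if resp.status_code == 429:
            # The helpful case: honour Retry-After if the provider sets it (in seconds).
            wait = float(resp.headers.get("Retry-After", backoff))
        elif resp.status_code in (500, 503):
            # The unhelpful case: treat opaque server errors as throttling and slow down.
            wait = backoff
        else:
            resp.raise_for_status()
            time.sleep(min_interval_s)   # keep a steady gap between successful submissions
            return resp
        time.sleep(wait)
        backoff = min(backoff * 2, max_backoff_s)
    raise RuntimeError(f"{audio_path}: still being throttled after {max_attempts} attempts")
```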

3. Content preparation
  • In some cases, you need to prepare the files according to the provider's requirements. That means converting an mp3 file to a different file type yourself, which adds to the time it takes to get a transcription (see the conversion sketch after this list). This is another example of the RTF number not telling the full story.

  • We have seen APIs return an HTTP 500 error and when you dig a bit deeper (via support tickets across several days!) it turns out that the provider was really grumbling about the encoding of the audio file due to a protocol violation with a reason of "content length mismatch". This is despite the fact that the files play perfectly well on any well-known media player.

  • Some providers just don't like large files. With video transcription being a common use case, you would think that uploading a video file would be handled quite happily. We have seen some providers complain if the files are > 100MB in size, which is an easy limit to reach if you are in the business of video.

  • One interesting edge case we found was audio files with > 30s of silence. To work around this, you need to set an option telling the service not to time out.

  • Some cloud providers also offer cloud storage and require the audio file to be uploaded there first, with its URL passed in the API call. In one case, a provider also had a file upload function but generated an HTTP 500 error because it prefers content to be uploaded to its cloud storage first. From a cost perspective, this means a transcription request also incurs a storage cost, plus there's the overhead of managing the content once it's there.
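When a provider won't take the file as-is, we normalise the audio ourselves before submission. That conversion time never shows up in the provider's RTF, but it is real time your pipeline spends. Here is a sketch using ffmpeg (assumed to be installed and on the PATH) to convert anything it understands into 16 kHz mono PCM WAV:

```python
import subprocess
from pathlib import Path

def normalise_audio(src: Path, sample_rate: int = 16000) -> Path:
    """Convert an input file (e.g. mp3 or a video container) to 16-bit mono PCM WAV."""
    dst = src.with_suffix(".wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-vn",                      # drop any video stream
         "-ac", "1",                 # mix down to mono
         "-ar", str(sample_rate),    # resample to the rate the model expects
         "-c:a", "pcm_s16le",        # 16-bit PCM
         str(dst)],
        check=True,
    )
    return dst
```

The same resampling flag is also how you would work around the 8kHz/16kHz model mismatch described in the next section.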

4. Language models
  • Sometimes it's not obvious which language model you need to use. It's common for providers to have multiple English models, so what should you do if there are different accents? Some customers put the audio through multiple models and take the best result (a brute-force approach like the sketch after this list), so in the case where there are three accents (US, GB and AU), that will cost three times as much, plus the time to review the results and pick the one you want to use.

  • There can also be different models for 16kHz and 8kHz audio. But what happens if you have an audio file with a sample rate of 11kHz: which one should you use? To use either, you need to upsample or downsample the audio file to match one of the supported models.
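The brute-force approach mentioned above looks roughly like this. It assumes a `transcribe(audio_path, locale)` helper that returns (word, confidence) pairs, and it uses mean word confidence as a crude proxy for quality, since without a reference transcript there is no WER to compare. Both the helper and the proxy are assumptions for illustration, not any provider's API.

```python
from statistics import fmean

def best_of_accents(transcribe, audio_path, locales=("en-US", "en-GB", "en-AU")):
    """Run the same file through several English models and keep the most confident output."""
    results = {}
    for locale in locales:
        words = transcribe(audio_path, locale)   # hypothetical helper: [(word, confidence), ...]
        results[locale] = (fmean(conf for _, conf in words), words)
    best = max(results, key=lambda loc: results[loc][0])
    return best, results[best][1]
```

Three submissions mean three times the transcription cost and three times the wall-clock wait, which is exactly the operational overhead a single multi-accent model avoids.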

So how do we at Speechmatics compare when benchmarking speech technology providers?

Through continuous testing and benchmarking of our own solution, we have used the experiences and discoveries discussed in this blog to build, develop and improve it.

Speed

For the Speechmatics SaaS platform, we scale to handle the volume of requests by spinning up more resources, in order to meet our target SLA of 0.5 RTF or better, i.e. a 10-minute file returns a transcription in no more than 5 minutes.

Error handling

Speechmatics will provide you with helpful error messages, for example, HTTP 400 "requested product not available" if you ask for a language that we don't support. If you notice any unhelpful errors, then please let us know and we'll look at fixing them.

Content preparation

Speechmatics looks after any file conversion for you, so you don't need to trawl the documentation for supported file types and then convert to one of them yourself. Nor do you need to spend days waiting for support tickets to be answered just to find out what an HTTP error really means! We also handle long periods of silence. In terms of storage and hidden costs, we won't make you use our cloud storage and charge you for the privilege.

A simplified workflow is essential. With some providers you have to upload the file to cloud storage before it can be processed, and then remember to remove it afterwards. With Speechmatics, that's all part of the service: give us the file, and we process it and remove it.

Language models

Speechmatics offers a single Global English language pack that handles multiple accents. You don't need to worry about sample rates either; we handle those seamlessly too. Both of these attributes mean a transcription requires less compute and therefore costs less overall. It's yet another example of why measuring RTF alone doesn't tell the whole story.

So, as you can see, there are many factors to consider when benchmarking speech technology providers – not just accuracy (WER) and the number of supported languages. Client applications should not need to work around software idiosyncrasies or bizarre edge cases.