Nov 19, 2020 | Read time 5 min

How big is the gap between academic and commercial speech recognition systems?

Header image
Remi Francis
Remi FrancisPrincipal Speech Architect

Speech recognition has been making a lot of noise in the last few years, both academically and commercially. Let’s begin with commercial. Many big companies have begun launching their own voice assistants, such as Apple Siri, Microsoft Cortana, Amazon Alexa and Google Assistant.

Speech is the biggest form of communication used by humans but due to its complexities.

It is one of the hardest challenges to overcome. Speech as an interface is made possible by ever-increasing accuracy and speed of speech recognition systems. Speech interfaces are especially important in places with low literacy rates like third world countries where speech is the only form of communication. Or for example, in China where typing in pinyin is not convenient or accessible for many people, especially the older generation.

There are now speech recognition systems available as cloud-based APIs which are making speech as an interface more accessible from the likes of IBM, Microsoft, Google and us.

From an academic perspective, there have been recent claims about achieving human parity from Microsoft and even better systems from IBM and text to speech systems are also making a lot of progress recently.

But what are the differences?

Datasets

In academia, to allow for fair comparison, the datasets are fixed both for training and testing. However, collecting data is quite expensive, so often the datasets that are used are quite old thus there is a selection bias for models that end up performing on these specific datasets.

The switchboard datasets used in Microsoft's and IBM's paper use about 300 hours of training data, whereas other academic datasets go to up to 2000 hours.

For a commercial use, the only limitations are the cost of collecting the data and training the algorithm. For example, Baidu uses 40,000 hours. Major vendors typically use thousands of hours of data for their commercial systems, but the data is not disclosed due to its importance. The test sets they use internally are also kept confidential as they are tailored to the use of their customers.

Companies do not publish the accuracy of their commercial system on an open test set for these reasons, so the only way to discover which system is the best on a particular application is to test them all.

Vocabulary size

In the switchboard testing task, Microsoft used a vocabulary of 30k, and IBM used 85k. Today, large test sets (LVCSR) often have a few hundred thousand words in the vocabulary as standard.

However large vocabularies are often restricted for specific use cases. For example, for an embedded speech recognition system, Google used 64k. And on a popular natural language processing dataset, Google used a vocabulary size of about 800k.

New words appear every year and commercial systems need to take them into account when drawing new results.

Types of errors

In papers, word error rates (WERs) are reported, however we can see in Microsoft's and IBM's papers that most of the mistakes are made on short functional words such as “and”, “is”, “the”, and these errors are given the same weighting as errors on keywords.

In a commercial setting, verbatim transcripts are not always necessary. For example, voice search and voice assistants, content extraction and indexing applications often ignore the short functional words when dealing with a transcript.

Languages

Although there are open datasets in many languages, academics tend to focus on English. Open datasets also don't have as much data as the English language. There are common issues found in other languages that are not found in English such as, tone, agglutination and different script (i.e. better wording or a non-Latin script).

A provider of speech recognition systems need to offer languages that cover the most common languages in the world. The top 10 languages in the world only cover about 46% of the world population with the top 100 languages still only covering about 85% of the world population.

Speed

Academics rarely report the real-time factor (RTF) ­– the time taken to transcribe the audio divided by duration of the audio – of their models, the best systems proposed in academic papers are a combination of multiple large models. This makes them unusable commercially because the compute cost is too high.

Users often expect a fast turnaround time which has made real time systems increasingly important and desirable especially when embedded on a device.

Robustness

Popular academic datasets tend to be too clean whereas audio in real life applications are often very noisy. On a noisier dataset, the WER is about 40%, it would be good if more real-world data was available to academics to make the results more commercially viable and applicable.

Additional functionalities

Real world systems need more than good ASR accuracy, they also need diarisation (who speaks when), punctuation and capitalisation, whereas it is something expected from transcriptionist, and makes transcripts much easier and nicer for commercial use.

Audio segmentation – knowing what kind of noise or music happened when.

So, what does this mean for the future of speech recognition systems?

There is a gap between academic and commercial systems, however academic research is important to show that improvement can be made in speech recognition applications. So, then the problem becomes how can it be more efficient?

The academic method relies on dividing the problem into parts that can be improved in isolation. This has resulted in good progress being made on the somewhat artificial tasks we have set ourselves. In contrast, commercial speech recognition needs to take a more holistic approach, it’s the performance of the overall system that matters. Companies are continuing to build both types of ASR technology helping to close the gap between academic and commercial systems.

Rémi Francis, Speechmatics

Latest Articles

[alt: Bilingual medical model featuring terms related to various health conditions and medications in Arabic and English. Key terms include "Chronic kidney disease," "Heart attack," "Diabetes," and "Insulin," among others, displayed in an organized layout.]
Product

Speechmatics achieves a world first in bilingual Voice AI with new Arabic–English model

Sets a new accuracy bar for real-world code-switching: 35% fewer errors than the closest competitor.

Speechmatics
SpeechmaticsEditorial Team
[alt: Illuminated ancient mud-brick structures stand against a dusk sky, showcasing architectural details and textures. Palm trees are in the foreground, adding to the setting's ambiance. Visually captures a historic site in twilight.]
Product

Your voice agent speaks perfect Arabic. That's the problem.

Most voice AI models are trained on formal Arabic, but real conversations across the Middle East mix dialects and English in ways those systems aren’t built to handle.

Yahia Abaza
Yahia AbazaSenior Product Manger
new blog image header
Technical

How Nvidia Dominates the HuggingFace Leaderboards in This Key Metric

A technical deep-dive into Token Duration Transducers (TDT) — the frame-skipping architecture behind Nvidia's Parakeet models. Covers inference mechanics, training with forward-backward algorithm, and how TDT achieves up to 2.82x faster decoding than standard RNN-T.

Oliver Parish
Oliver Parish Machine Learning Engineer
[alt: Healthcare professionals in scrubs and lab coats walk briskly down a hospital corridor. A nurse uses a tablet while others carry patient charts and attend to a gurney. The setting conveys a busy, clinical environment focused on patient care.]
Use Cases

Why AI-native EHR platforms will treat speech as core infrastructure in 2026

As clinical workflows become automated and AI-driven, real-time speech is shifting from a transcription feature to the foundational intelligence layer inside modern EHR systems.

Vamsi Edara
Vamsi EdaraFounder and CEO, Edvak EHR
[alt: Logos of Speechmatics and Edvak are displayed side by side, interconnected by a stylized x symbol. The background features soft, wavy lines in light blue, creating a modern and tech-focused aesthetic.]
Company

One word changes everything: Speechmatics and Edvak EHR partner to make voice AI safe for clinical automation at scale

Turning real-time clinical speech into trusted, EHR-native automation.

Speechmatics
SpeechmaticsEditorial Team
[alt: Concentric circles radiate outward from a central orange icon with a white Speechmatics logo. The background is dark blue, enhancing the orange glow. A thin green line runs horizontally across the lower part of the image.]
Technical

Speed you can trust: The STT metrics that matter for voice agents

What “fast” actually means for voice agents — and why Pipecat’s TTFS + semantic accuracy is the clearest benchmark we’ve seen.

Archie McMullan
Archie McMullanSpeechmatics Graduate