Oct 30, 2020 | Read time 4 min

The Speechmatics approach to accent-independent speech recognition

Read how Speechmatics has created an accent-independent speech recognition language pack enabling the accurate transcription of all major English accents.

A global approach to transcription

The days of having to adjust your accent so that speech recognition systems can understand you are well and truly over. You don't even have to choose between American English and British English in a drop-down menu of speech recognition languages. Our Global English language pack for speech-to-text transcription encompasses all major English accents and dialects.

Trained on thousands of hours of spoken data from more than 40 countries, and on tens of billions of words drawn from global sources, Global English is the result of our accent-independent approach to creating speech-to-text technology. It is built with The Automatic Linguist, our machine learning framework that is capable of learning new languages quickly. The technology was a winner in the Innovation category of the 2019 Queen's Awards for Enterprise.

How we harnessed machine learning to create the Global English language pack

As an industry pioneer, Speechmatics has taken advantage of recent advances in machine learning and applied proprietary language training techniques – allowing a more universal, accent-independent and any-context approach to speech recognition languages than has been possible until now.

Speech recognition has advanced hugely in recent years, delivering step-change improvements in a field used to marginal gains. In particular, modern neural network architectures can generalize across variations in speech by using representation learning. Deep neural networks feature multiple layers between input and output, allowing Speechmatics to filter out everything but the phonetic content. This effectively gives us the performance of a variety of specialized models, all in one comprehensive language pack.

In addition, a single modern server is more powerful than the room-filling supercomputers of the past. This rise in compute power, coupled with the repurposing of GPUs from gaming hardware into serious computing machines, allows us to train models on more data and to support more variation in speech.

Investing in new ways of solving problems with high levels of speech-to-text accuracy

By investing more time in gathering data from a wide range of sources, we have created a huge and diverse training corpus – allowing us to train models with a much wider range of applications than ever before.

Traditionally, speech recognition, like other machine learning and big data products, has relied on huge amounts of labeled data to achieve real-world results. This approach works well in the short to medium term, but as accuracy continues to improve it will become harder to drive further gains from training that relies on big data alone. Small data becomes increasingly important for precise and focused use cases, because data also needs to be representative to deliver exceptional levels of accuracy.

Reaching the next stage of improvements with this approach alone would require investing significant amounts of time in collecting and labeling data for machine learning. While Speechmatics is already delivering leading levels of accuracy, we are also investing in new ways of solving problems that enable high levels of accuracy without an ever-growing, unsustainable commitment to labeling data.

One accent-independent speech recognition model to rule them all

We have compared our Global English model with those of other providers of speech-to-text technology for the most common English accents. Test sets comprised diverse audio and transcribed text, with accented test files including variations in gender, age and region. In every case, our Global English language pack produced a more accurate transcription than our competitors' variant-specific language packs.
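The article does not say which metric was used for these comparisons. As an illustration only, the sketch below assumes word error rate (WER), a common measure of transcription accuracy, and shows how it could be aggregated per accent. The function names and the two toy samples are hypothetical; they are not Speechmatics code or benchmark data.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edits needed to turn the reference words seen so far
    # into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_accent(samples):
    """samples: iterable of (accent, reference, hypothesis) triples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for accent, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[accent] += e
        words[accent] += n
    # Pool errors over all words for an accent rather than averaging per file.
    return {a: errors[a] / words[a] for a in words if words[a] > 0}


# Hypothetical toy data, purely for illustration.
samples = [
    ("british", "the schedule was changed at the last minute",
     "the schedule was changed at the last minute"),
    ("indian", "please switch off the lights",
     "please switch of the light"),
]
print(wer_by_accent(samples))  # {'british': 0.0, 'indian': 0.4}
```

Pooling errors across all words for an accent, rather than averaging per-file rates, keeps short files from dominating the result. A real benchmark would also normalize punctuation and number formats before scoring.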

Speechmatics is committed to undertaking regular comparisons against other providers, with frequent testing and benchmarking to ensure we provide the best automatic speech recognition on the market. By moving from multiple specialist speech recognition language packs to a more comprehensive, single language pack, we have streamlined our portfolio and maximized the resources available for Global English.

Fast, accurate, reliable and now more flexible, convenient and inclusive, Global English offers users speech recognition for the future.
