Jan 12, 2023 | Read time 4 min

Speech Technology in a Global World: Reducing Inequalities Across Languages

Speechmatics’ Data Science Engineer, Benedetta Cevoli, looks at why speech technology still has some distance to travel in reducing inequality across different languages.
Benedetta Cevoli, Senior Machine Learning Engineer

Speech technology already plays a huge part in our everyday lives, from common applications on our phones and computers, to unseen uses in customer services and advertising. As artificial intelligence continues to make huge leaps in everything from structuring materials to content creation, it won’t be long before voice technology plays an even bigger role in our lives, including in areas like legal and healthcare. And while it might not be a matter of life and death if Alexa plays you the wrong song, in a health setting, it very well could be.

Speech Technology Today

Recent advancements in speech technology have been hugely impressive. AI-led speech-to-text is now the only real choice for transcription at scale, with human-led transcription being prohibitively expensive and time-consuming. But are we really at the stage where we can put our hands up and say, “The problem is solved”?

Speech recognition is an incredible technology that we’ll rely on more and more in the future. Yet it can also be a barrier, and as a non-native speaker it’s one I’m all too aware of. I’m originally from Italy but have been living in the UK for several years. A few years ago, my partner and I bought our first smart speaker. We usually speak Italian at home, so we excitedly started to interact with it in Italian, our native language. We quickly switched to English. We didn’t switch because we were more comfortable with it; we switched because it didn’t work for us in Italian. With our accents, it didn’t work well for us in English either, but it was the better of the two options.

For the Few, Not the Many?

In the past few years, research has shown that language, accent, race, gender and age are the main factors that influence the accuracy of speech recognition. Researchers at Stanford have found that speech-to-text systematically misunderstands Black speakers twice as often as White speakers. Another study reported robust differences in accuracy across both gender and dialect, with lower accuracy for women and speakers from Scotland.
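Comparisons like these are usually expressed as word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of reference words. As a minimal sketch of how a per-group comparison could be computed, the Python below scores a handful of transcripts and aggregates WER by speaker group. The group labels and transcripts are invented purely for illustration; they are not data from the studies mentioned above, and the function names are hypothetical.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Aggregate WER per group from (group, reference, hypothesis) tuples."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Invented toy data: the same two commands spoken by two hypothetical groups.
SAMPLES = [
    ("group_a", "turn on the kitchen lights", "turn on the kitchen lights"),
    ("group_a", "play some jazz music", "play some jazz music please"),
    ("group_b", "turn on the kitchen lights", "turn on the chicken light"),
    ("group_b", "play some jazz music", "lay some jazz"),
]

rates = wer_by_group(SAMPLES)
for group, rate in sorted(rates.items()):
    print(f"{group}: WER = {rate:.2f}")
```

Summing errors and reference words per group before dividing, rather than averaging per-utterance rates, keeps short utterances from dominating the result. Real evaluations also normalise text (punctuation, numbers, spelling variants) before scoring, which this sketch leaves out.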

It’s worth noting at this stage that accuracy results are complicated. After all, our voices are extremely rich and unique; no two are alike. But any sort of barrier, any digital divide that gives people unequal access to digital technologies, deserves dissection. As Halcyon Lawrence, an assistant professor of technical communication and information design at Towson University, told Claudia Lopez Lloreda in a piece for Scientific American: “I don’t get to negotiate with these devices unless I adapt my identity”.

This is simply not inclusive. Why should some people have to adapt their own voices and others not? Why should some get inferior results and others not?

A Deprivation of Data

It’s an issue that reaches beyond the speech recognition world. English and a handful of other languages are generally the focus of today’s language technologies. Despite there being over 6,500 languages in the world today, only a handful are systematically represented in academia and industry. The near-human results reported for language translation and understanding usually apply to only those few languages; the vast majority fall far below such standards.

Modern deep learning systems are data-hungry: they rely on enormous amounts of data for accuracy. This is problematic for languages for which only a limited amount of data is available. Without enough data to drive improvement, those languages will stand still while well-resourced languages continue to improve. The inequality gap will grow.

Exclusive vs Inclusive

There’s a vast difference between speech-to-text working for some people, most of the time, and for all people, all of the time. At Speechmatics, we’re battling hard to make the latter a reality. We strongly believe speech technology must help us interact with the digital world fairly. Until it works for everyone all the time, true fairness is a target, not an accomplishment.

We currently support 50 languages, covering over half of the world’s population, with leading, consistent accuracy that doesn’t depend on the language. But we’re not stopping here. As we continue to move forward, expand our coverage, and improve our technology, we’ll keep pushing the limits of what inclusivity means for commercially ready speech recognition.

Benedetta Cevoli, Data Science Engineer, Speechmatics
