Jan 12, 2022 | Read time 4 min

What Makes Up Your Voice: Understanding the Best Speech-to-Text

Read about how important it is that speech recognition understands every voice in every situation and how Speechmatics' is ensuring more voices are better represented.
Benedetta Cevoli, Senior Machine Learning Engineer


What gives your voice its individuality is as complex and varied as everything that makes you the person you are. The best transcription tools understand that the make-up of your voice will be influenced by everything from your gender at birth, to your state of emotion, to the levels of pollution in your area. It may be guided by your parents, your siblings, your friends, and your education. But there is no single contributing factor that makes your voice yours.

Opinions are formed about us when we speak. We can be judged on how we sound, and at times what we say is misconstrued or ignored entirely because of how we say it. These opinions often drive behavior and decision-making. For example, a recent article in American Scientist describes research showing that a political candidate’s pitch – one simple element of the voice – can have a major influence on how voters perceive them.

The different factors that make up a voice can also have exclusionary consequences. When it comes to transcribing speech to text, we believe this shouldn’t be the case. When the technology is at its best, every voice is treated the same: transcribed equally well, and as accurately as possible.

Inaccuracy Means Exclusion

At Speechmatics, we specialize in automatic voice-to-text transcription. We turn what someone has said into the written word for assistance, reference, and analysis. If our results are inaccurate, someone isn’t being heard, and we’re no closer to fulfilling our mission to understand every voice. We constantly have to consider all the factors that make a voice a voice, and make sure they don’t negatively influence our technology and lead to inaccurate results.

The primary differentiator in what makes up your voice is the size of your vocal cords. Males typically have larger vocal cords than females, which is why most men have deeper voices. The same is, of course, true of adults and children. When it comes to voice recognition, children are still not represented as well as adults, mostly because voice recognition models have been trained primarily on adult voices.

Emotion and Voice Recognition

Our emotions play a huge role in how we’re heard too. The way we speak is greatly influenced by what we feel in each moment. Our voices can sound quite different when we’re sad, happy, worried, or excited, for example. The best speech-to-text technology must accurately transcribe every voice no matter its emotional charge. The crossover of emotion and speech recognition also remains largely unexplored, with use cases from healthcare to finance and customer service ready to benefit from future technologies that can recognize whether a caller is feeling nervous, excited, or anxious.

Then there are those in society with completely unique voice patterns, such as people with Down syndrome. When they use voice recognition designed primarily for able-bodied speakers, they’re often let down by technology that should benefit everyone. Ventures such as Project Understood recognize that without enough data to train on, voice recognition will serve these voices poorly.

The same obstacles to receiving accurate results affect those who have suffered strokes, sustained injuries to their vocal cords, or live with dementia. The more speech recognition engines use self-supervised learning and unlabeled data – as Speechmatics does – the sooner we can get to systems that work for everyone. Before our machine learning experts unlocked self-supervised learning, we were training on around 30,000 hours of audio; now it’s over 1,000,000 hours.
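The key idea behind self-supervised learning is that the training target comes from the audio itself, so no human-written transcripts are needed – which is what lets training scale from tens of thousands to millions of hours. The toy sketch below illustrates one common flavor of this, masked prediction: hide some frames of an audio feature sequence and train a model to reconstruct them from their neighbors. All names and numbers here are illustrative, not Speechmatics’ actual training setup.

```python
# Toy sketch of self-supervised "masked prediction" on unlabeled audio.
# No transcripts (labels) appear anywhere: the targets are the audio
# frames themselves. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are acoustic feature frames from unlabeled audio:
# 200 frames, 8 features each.
frames = rng.normal(size=(200, 8))

# Mask roughly 15% of frames; the "model" must predict them
# from their surrounding context.
mask = rng.random(200) < 0.15

# Context for each frame: the average of its two neighbors.
context = (np.roll(frames, 1, axis=0) + np.roll(frames, -1, axis=0)) / 2

# A single linear layer, fit by least squares, that reconstructs
# masked frames from their context. Real systems use deep networks
# and gradient descent; the objective is the point.
W, *_ = np.linalg.lstsq(context[mask], frames[mask], rcond=None)
error = np.mean((context[mask] @ W - frames[mask]) ** 2)

print(f"reconstruction error on masked frames: {error:.3f}")
```

Because every hour of raw audio supplies its own targets, the amount of usable training data is bounded by available recordings rather than by available transcripts – including recordings of underrepresented voices.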

Better Representation in Voice

In a world where digital assistants are everywhere, it’s particularly important that speech recognition understands every voice in every situation. Our recent work at Speechmatics has focused on making sure a wider variety of voices are better represented. One example is the performance of our recently launched Autonomous Speech Recognition on children’s voices: Speechmatics shows the best speech-to-text accuracy for adults as well as children, and the smallest accuracy gap between younger and older voices when compared to competitors.

But, for us, this is just the beginning. Every day we’re setting our sights on understanding every voice, in every situation.

Benedetta Cevoli, Data Science Engineer, Speechmatics 
