Jan 12, 2022 | Read time 4 min

What Makes Up Your Voice: Understanding the Best Speech-to-Text

Read about how important it is that speech recognition understands every voice in every situation and how Speechmatics is ensuring more voices are better represented.
Benedetta Cevoli, Senior Machine Learning Engineer

What gives your voice its individuality is as complex and varied as everything that makes you the person you are. The best transcription tools understand that the make-up of your voice will be influenced by everything from your gender at birth, to your state of emotion, to the levels of pollution in your area. It may be guided by your parents, your siblings, your friends, and your education. But there is no single contributing factor that makes your voice yours.

Opinions are formed about us when we speak. We can be judged on how we sound and there are times when how we speak can either be misconstrued or completely ignored. These opinions often drive behavior and decision-making. For example, in a recent article from American Scientist, researchers discovered that a political candidate’s pitch – a simple element that makes up our voice – can have a major influence on how voters perceive them.

The different factors that make up a voice can also have exclusionary consequences. When it comes to being able to transcribe speech to text, we believe this shouldn’t be the case. When the technology is at its optimum, every voice should be treated the same. Every voice should be transcribed as equally and as accurately as possible.

Inaccuracy Means Exclusion

At Speechmatics, we specialize in automatic voice-to-text transcription. We turn what someone has said into the written word for assistance, reference, and analysis. If our results are inaccurate, it means someone isn’t being heard and we’re no closer to achieving our mission to understand every voice. We constantly have to consider all the factors that make a voice a voice and make sure these don’t negatively influence our technology and lead to inaccurate results.

The primary factor in how your voice sounds is the size of your vocal cords. Most males have larger vocal cords than females, which is why most men have deeper voices. The same is, of course, true of adults compared with children. When it comes to voice recognition, children’s voices are still not represented as well as adults’ voices, mostly because voice recognition models have been trained primarily on adult voices.

Emotion and Voice Recognition

Our emotions play a huge role in how we’re heard too. The way we speak is greatly influenced by what we feel in each moment. Our voices can sound quite different when we are sad, happy, worried, or excited, for example. The best speech-to-text technology must accurately transcribe every voice no matter its emotional charge. There’s also a largely unexplored crossover between emotion and speech recognition, with use cases from Health to Finance and Customer Service ready to benefit from future technologies that recognize if a caller is feeling nervous, excited, or anxious.

Then there are those in society with unique voice patterns, such as people with Down syndrome. When they use voice recognition designed primarily for able-bodied speakers, they’re often let down by technology that should be beneficial for everyone. Ventures such as Project Understood recognize that without enough data to train on, voice recognition will give these voices poor results.

The same obstacles to receiving accurate results are faced by people who have suffered strokes, who have received injuries to their vocal cords, or who suffer from dementia. The more speech recognition engines use self-supervised learning and unlabeled data, as Speechmatics does, the quicker we can get to systems that work for everyone. Before our machine learning experts unlocked self-supervised learning, we were training on around 30,000 hours of audio. Now it’s over 1,000,000 hours.

Better Representation in Voice

In a world where digital assistants are everywhere, it’s particularly important that speech recognition understands every voice in every situation. The recent work we’ve done at Speechmatics has focused on making sure a variety of voices are better represented. One example of this is the performance of our recently launched Autonomous Speech Recognition on children’s voices. Speechmatics shows the best speech-to-text accuracy for adults as well as children, and we’ve seen the smallest gap in accuracy between younger and older voices when compared to competitors.
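
For readers who want to measure this kind of accuracy gap on their own data, here is a minimal sketch. It assumes accuracy is scored as word error rate (WER), a common metric for speech-to-text; the article does not say which metric was used. The function names and the sample transcripts are hypothetical and made up for illustration. They are not Speechmatics results or part of any Speechmatics API.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping one row of the table at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """samples: iterable of (group, reference, hypothesis) tuples.

    Pools errors and reference words per group, then divides, so longer
    utterances carry proportionally more weight.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Made-up transcripts, for illustration only.
samples = [
    ("adult", "turn the lights on please", "turn the lights on please"),
    ("adult", "what is the weather today", "what is the weather to day"),
    ("child", "can you play my favourite song", "can you play my favorite son"),
    ("child", "tell me a story", "tell me story"),
]

rates = wer_by_group(samples)
gap = abs(rates["child"] - rates["adult"])
print(rates, gap)
```

A smaller gap between groups means the system treats those voices more equally, which is the property described above. In practice the comparison is only meaningful on a large, representative test set with consistent text normalization.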

But, for us, this is just the beginning. Every day we’re setting our sights on understanding every voice, in every situation.

Benedetta Cevoli, Data Science Engineer, Speechmatics 
