Feb 23, 2022 | Read time 4 min

Emotion and Voice: The Next Goal for Speech-to-Text

Learn more about the importance of emotion in speech-to-text communication and the advancements Speechmatics are making to help understand every voice.
Benedetta Cevoli, Senior Machine Learning Engineer


Your voice is an incredible indicator of how you're feeling. How often have we “sensed” that someone is not doing well, even when they tell us they are? In many ways, your voice is equal to body language or facial expression. It communicates more than we know we’re letting on and, critically, emotion in our voice is challenging to mask.

As we interact more with machines, especially voice assistants, how important is knowing how someone is feeling – as opposed to what someone is saying – when it comes to speech-to-text? We can mask emotion or misread cues in visual communication. If we're able to unlock how someone is feeling through their voice, is there more scope for voice technology to help us understand each other?

Emotional Intelligence

Speaking to the Inside Voice podcast, Rana Gujral, the CEO of Behavioural Signals (specialists in emotional conversations with AI), discussed where we are now with voice and emotion. "We're talking to machines, but it's a very one-sided interaction where we're giving commands," Rana explained. "We're not really having a dialogue; we're not really having a conversation. And that was the promise of these virtual assistants."

If a machine cannot relate to our emotions, the two-way street of dialogue breaks down. Without empathy or sympathy, a considerable barrier can be created. So how can we make machines emotionally intelligent? What will the benefits be once we can?

One clear use case for improved emotion in voice technology is in the contact center industry. From a server point of view, machines have guided workers for years by informing them they are talking too slowly, or their client sounds tired, for example. But to repeat Rana’s point, if the other side of the conversation doesn’t understand emotion, is it even a conversation? Emotion recognition is essential for an empathetic and affective dialogue between humans and machines.

Accuracy and Emotion

When it comes to understanding every voice, emotion is too crucial a factor not to consider. Speech recognition accuracy varies significantly with a speaker's emotional state. If automatic speech recognition (ASR) works only with neutral speech, any real-life application becomes problematic. Emotions are part of every natural human interaction and play a considerable role in speech production and comprehension.

At Speechmatics, research is at the heart of everything we do, and we plan on delving into emotion more this year. In 2020, we conducted initial research into using self-supervised learning for emotion recognition. With the introduction of self-supervised models into our latest engine, we can now look to leverage this research and explore the “rich underlying structure of audio” otherwise missed by human-led data. Moreover, we can effectively explore different domains such as TV broadcasts and phone calls with less human-labeled data.

To begin to understand the external impacts of human emotion, we wanted to see how our technology would handle poor-quality – or noisy – data. We tested six hours of audio taken from meetings, earnings calls, online videos, and a host of other real-world examples that included ambient sounds (phones, machine noise, different conversations, etc.).

To make it even more of a challenge, we randomly changed the pitch, reverb, and volume levels, too – the sort of aspects varied emotions would also affect.
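As a rough illustration of what randomly changing pitch, reverb, and volume can look like in practice, here is a minimal sketch in Python using only NumPy. It is not the pipeline Speechmatics used: the function names, parameter ranges, naive resampling-based pitch shift, and synthetic impulse response are all assumptions chosen to keep the example short.

```python
import numpy as np

def random_gain(audio, rng, low_db=-6.0, high_db=6.0):
    """Scale the volume by a random gain drawn uniformly in decibels."""
    gain_db = rng.uniform(low_db, high_db)
    return audio * 10.0 ** (gain_db / 20.0)

def random_pitch(audio, rng, max_semitones=2.0):
    """Naive pitch shift by resampling. Note: this also changes duration."""
    semitones = rng.uniform(-max_semitones, max_semitones)
    rate = 2.0 ** (semitones / 12.0)
    new_len = max(1, int(round(len(audio) / rate)))
    # Read the original signal at evenly spaced fractional positions.
    positions = np.linspace(0.0, len(audio) - 1, new_len)
    return np.interp(positions, np.arange(len(audio)), audio)

def random_reverb(audio, rng, sample_rate, max_decay_s=0.3):
    """Convolve with a synthetic, exponentially decaying noise tail."""
    decay_s = rng.uniform(0.05, max_decay_s)
    n = max(1, int(decay_s * sample_rate))
    t = np.arange(n) / sample_rate
    impulse = rng.standard_normal(n) * np.exp(-6.0 * t / decay_s)
    impulse[0] = 1.0  # keep the direct (dry) sound
    wet = np.convolve(audio, impulse)[: len(audio)]
    # Rescale to the input peak so reverb alone does not change loudness.
    peak = np.max(np.abs(wet))
    return wet if peak == 0 else wet * (np.max(np.abs(audio)) / peak)

def augment(audio, sample_rate, seed=None):
    """Apply random pitch, reverb, and volume changes to a mono signal."""
    rng = np.random.default_rng(seed)
    out = random_pitch(audio, rng)
    out = random_reverb(out, rng, sample_rate)
    out = random_gain(out, rng)
    return np.clip(out, -1.0, 1.0)  # stay within the usual float audio range

# Usage: perturb one second of a 220 Hz test tone (a stand-in for speech).
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)
augmented = augment(tone, sample_rate, seed=0)
```

Passing a fixed seed makes the perturbation reproducible, which helps when comparing transcription results on the same altered audio. A production setup would more likely use a dedicated audio library with a duration-preserving pitch shift and recorded room impulse responses.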

These real-world factors are an accessible, relatable start to uncovering how emotion impacts speech-to-text technology, as these different scenarios bring out diverse levels of emotion in a person's voice. As we look to increase the understanding of every voice in our technology, analyzing situations like these is vital to understanding that the way we talk, and our emotions, are shaped not only by our feelings but also by our surroundings.

Emotion recognition is a key aspect of Speechmatics' aim of understanding every voice.

Benedetta Cevoli - Data Science Engineer, Speechmatics
