Emotion and Voice: The Next Goal for Speech-to-Text

Your voice is an incredible indicator of how you're feeling. How often have we “sensed” that someone is not doing well, even when they tell us they are? In many ways, your voice is equal to body language or facial expression. It communicates more than we know we’re letting on and, critically, emotion in our voice is challenging to mask.

As we interact more with machines, especially voice assistants, how important is knowing how someone is feeling – as opposed to what someone is saying – when it comes to speech-to-text? We can mask emotion or misread cues in visual communication. If we're able to unlock how someone is feeling through their voice, is there more scope for voice technology to help us understand each other?

Emotional Intelligence

Speaking to the Inside Voice podcast, Rana Gujral, the CEO of Behavioral Signals (specialists in emotional conversations with AI), discussed where we are now with voice and emotion. "We're talking to machines, but it's a very one-sided interaction where we're giving commands," Rana explained. "We're not really having a dialogue; we're not really having a conversation. And that was the promise of these virtual assistants."

If a machine cannot relate to our emotions, the two-way street of dialogue breaks down. Without empathy or sympathy, a considerable barrier can be created. So how can we make machines emotionally intelligent? What will the benefits be once we can?

One clear use case for improved emotion in voice technology is in the contact center industry. From a server point of view, machines have guided workers for years by informing them they are talking too slowly, or their client sounds tired, for example. But to repeat Rana’s point, if the other side of the conversation doesn’t understand emotion, is it even a conversation? Emotion recognition is essential for an empathetic and affective dialogue between humans and machines.

Accuracy and Emotion

When it comes to understanding every voice, emotion is too crucial a factor not to consider. Speech recognition accuracy varies significantly according to someone's emotional state. If automatic speech recognition (ASR) works ‘only’ with neutral speech, any real-life application becomes problematic. Emotions are part of every natural human interaction and have a considerable role in speech production and comprehension.

At Speechmatics, research is at the heart of everything we do, and we plan on delving into emotion more this year. In 2020, we conducted initial research into using self-supervised learning for emotion recognition. With the introduction of self-supervised models into our latest engine, we can now look to leverage this research and explore the “rich underlying structure of audio” otherwise missed by human-led data. Moreover, we can effectively explore different domains such as TV broadcasts and phone calls with less human-labeled data.

To begin to understand the external impacts of human emotion, we wanted to see how our technology would handle poor quality – or noisy – data. We tested six hours of audio taken from meetings, earnings calls, online videos, and a host of other real-world examples that included ambient sounds (phones, machine noise, different conversations, etc.).

To make it even more of a challenge, we randomly changed the pitch, reverb, and volume levels, too – the sort of aspects varied emotions would also affect.
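To give a feel for what randomly changing pitch, reverb, and volume can look like in practice, here is a minimal sketch in Python using NumPy. It is an illustration only, not the pipeline used for the test described above: the function names, the parameter ranges, the synthetic impulse response used for reverb, and the naive resampling used for pitch (which also changes the clip's duration) are all assumptions made for this example.

```python
import numpy as np


def change_volume(audio, gain_db):
    """Scale the signal by a gain given in decibels."""
    return audio * 10.0 ** (gain_db / 20.0)


def change_pitch(audio, semitones):
    """Shift pitch by naive resampling (assumption: duration may change too)."""
    rate = 2.0 ** (semitones / 12.0)
    new_len = max(1, int(round(len(audio) / rate)))
    old_pos = np.arange(len(audio))
    new_pos = np.linspace(0, len(audio) - 1, new_len)
    return np.interp(new_pos, old_pos, audio)


def add_reverb(audio, sample_rate, decay_s, rng):
    """Convolve with a synthetic impulse response: exponentially decaying noise."""
    n = max(1, int(decay_s * sample_rate))
    t = np.arange(n) / sample_rate
    impulse = rng.standard_normal(n) * np.exp(-6.9 * t / decay_s)
    impulse[0] = 1.0  # keep the direct (dry) sound in the mix
    wet = np.convolve(audio, impulse)[: len(audio)]
    # Rescale so reverb does not change the peak level; volume is handled separately.
    peak = np.max(np.abs(wet))
    return wet if peak == 0 else wet * (np.max(np.abs(audio)) / peak)


def augment(audio, sample_rate, rng):
    """Apply random pitch, reverb, and volume changes (ranges are hypothetical)."""
    audio = change_pitch(audio, rng.uniform(-2.0, 2.0))               # semitones
    audio = add_reverb(audio, sample_rate, rng.uniform(0.1, 0.5), rng)  # seconds
    audio = change_volume(audio, rng.uniform(-6.0, 6.0))              # decibels
    return audio


# Usage: perturb one second of a 220 Hz tone sampled at 16 kHz.
sample_rate = 16000
tone = np.sin(2 * np.pi * 220 * np.arange(sample_rate) / sample_rate)
perturbed = augment(tone, sample_rate, np.random.default_rng(0))
```

Passing a seeded random generator keeps each perturbation reproducible, which makes it easier to compare transcription results on the original and perturbed versions of the same clip.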

These real-world factors are an accessible, relatable start to uncovering how emotion impacts speech-to-text technology, as these different scenarios bring diverse levels of emotion to a person’s voice. As we look to increase the understanding of every voice in our technology, analyzing situations like these is vital: the way we talk and the emotions we convey are shaped not only by our feelings but also by our surroundings.

Emotion recognition is a key aspect of Speechmatics' aim of understanding every voice.

Benedetta Cevoli - Data Science Engineer, Speechmatics