
Feb 12, 2024 | Read time 9 min

AI assistants don't do much assisting - here's why...

Discover why AI assistants often fail to deliver real assistance, and the gap between promise and performance.

Will Williams

Chief Technology Officer

Trevor Back

Chief Product Officer

Voice - The missing ingredient for true AI assistance

True AI assistants lie in our future.

ChatGPT, Bard and Pi from Inflection are moving fast to build upon their brittle, and often frustrating, predecessors Siri, Cortana and Google Assistant.

The idea is simple. Why not have AI-powered technology that takes on much of the administrative burden of simply going about daily life? Fact-finding, planning meals, booking transport, sifting through emails.

We share in this optimistic vision for the power of AI. But we believe the key missing ingredient is that this technology must be accessible through our most natural communication method:  

Our voices.

Our North Star at Speechmatics is driven by the technology depicted in science fiction, like Samantha from the film “Her”.

We asked ourselves what it means to have a truly seamless assistant. Samantha is not only useful to Theodore, providing him with information when he asks for it, but she understands his tone of voice, his mood, all of their previous interactions, context and more.

Why is this worth pursuing?

We asked ourselves why these current assistants don’t yet pass the “Toothbrush Test” - why don’t billions of people use voice-powered AI assistants every single day?

Towards effortless, real-time interaction with speech technology

Our voices are natural and ingrained in our daily lives, and no training should be required to utilize the incredible capabilities available through AI today. Billions more people should be able to access these benefits, including people from every age group, gender, socio-economic class, race and country.

This should be truly inclusive, game-changing technology. It should be effortless, seamless, and available to all.

Compare this with where speech technology is today.

In the best-case scenario, the results can be useful, but the interaction is stilted. We can ask for recipe ideas for dinner, or for inspiration for a new blog post, but the answers we receive are delayed, transactional, and far from intuitive.

We are still far from passing a Turing Test for speech-based AI systems. Even the latest ChatGPT app feels stifled by the multiple-second gaps between responses.

For technology to be as valuable as possible, it must mirror how we as people interact by default. This is, of course, in real time.

Not just real-time in the sense of being able to have a relatively laborious back and forth with some pauses, but actual, instantaneous, intuitive, seamless communication. 

That is our goal, and Speechmatics’ current engine is a strong place from which to build. 

Where Speechmatics are today with live transcription 

Our wider vision relies on being able to Understand Every Voice. There are three foundations to this vision:

1) Rock-solid, highly accurate speech recognition 

2) Consistent and broad language, accent and dialect coverage 

3) Equivalent accuracy in both file and real-time modes 

So, where are we today with the real-time aspect of this?  

  • Our real-time speech-to-text engine offers anyone industry-leading accuracy at 400-800ms latencies, and the flexibility to reduce or increase this latency for an accuracy trade-off.

  • This real-time service is available in every language we support in file transcription. 

  • Any translation service available in batch transcription is also available in real-time. 

  • We can distinguish between different speakers, even in a live environment. 

  • Our real-time transcription is available in the cloud, on-prem or on-device. 
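To make the latency/accuracy trade-off concrete, here is a toy sketch (not the Speechmatics API; all names are illustrative) of how a streaming recognizer might buffer incoming audio into decode windows, where a larger window gives the model more right-context at the cost of delay:

```python
def chunk_stream(num_samples, sample_rate=16_000, max_delay_s=0.4):
    """Split an incoming sample stream into decode windows.

    A larger max_delay_s gives the recognizer more right-context
    (typically better accuracy) at the cost of higher latency.
    """
    window = int(sample_rate * max_delay_s)
    # (start, end) sample offsets of each window the decoder processes in turn.
    return [(i, min(i + window, num_samples))
            for i in range(0, num_samples, window)]

# One second of 16 kHz audio at a low-latency setting (0.4 s) decodes
# in 3 windows; at a higher-accuracy setting (0.8 s), in 2.
low_latency = chunk_stream(16_000, max_delay_s=0.4)
high_accuracy = chunk_stream(16_000, max_delay_s=0.8)
```

In a real system the choice is more subtle (windows can overlap, and partial transcripts can be revised as more audio arrives), but the core trade-off is the one sketched here.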

So far, so good.

How have we done this?

We have built our whole AI pipeline to be ‘real-time first’. Our batch system also uses all our real-time models. Moreover, our core self-supervised models are “fully causal”, meaning they never depend on future audio and are trained to expect the real-time use case.
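As a loose illustration of what “fully causal” means in practice (a generic sketch, not our actual model code), a causal attention layer masks out future frames so that each output depends only on audio the model has already heard:

```python
import numpy as np

def causal_attention(q, k, v):
    """Self-attention where frame i may only attend to frames 0..i,
    so no future audio is required -- the real-time constraint."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    mask = np.tril(np.ones((t, t), dtype=bool))  # lower-triangular: past only
    scores = np.where(mask, scores, -np.inf)     # future frames get zero weight
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# The first frame can only attend to itself, so its output is exactly v[0]:
# the model's earliest prediction never waits on later audio.
```

Because the batch and real-time systems share these causal models, accuracy in the two modes stays aligned rather than diverging.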

We have built our acoustic models to balance both accuracy and total cost of ownership, which allows on-premise deployments to also be cost effective.

All this is underpinned by some breakthroughs in Self-Supervised Learning which give us an edge, even when context windows are small (as they are in live transcription).

This approach has also allowed us to train our model with messy, real-world audio data, which in turn leads to reliable accuracy even in noisy environments.

This makes Speechmatics uniquely positioned to offer the best real-time service in the industry today.

Does that mean that the challenges of instant speech recognition are solved? The short answer is... not quite.

The many challenges of true real-time speech interaction 

We take our brand promise to “Understand Every Voice” seriously.

“Understanding” has enormous breadth and depth as a concept, and that helps us frame the challenges that remain before achieving truly intuitive interaction with technology: 

Universal language coverage 

We support 50 languages including all dialects, with more on the way. There are over 7,000 languages and dialects in the world, many of which have little or no ‘training data’ for ASR systems.

This number does not include accents and different environments within the same language.

Tone and cadence 

Words often have intrinsic meaning, but not always. Tone and cadence can shift the meaning of a word completely; sarcasm is the archetypal example, where the meaning of the utterance is the opposite of what is said. “This meal was disgusting” can mean exactly that, or be a classic dad joke uttered when handing back an emptied plate. 

Context 

The meaning of words can be shifted by the context in which they are said. This can be a matter of the time window that we’re looking in, but also be shifted by environmental factors.

As comedian Demetri Martin said: “Saying ‘I’m sorry’ is the same as saying ‘I apologize.’ Except at a funeral.” 

Consider another example. An AI assistant knows from your calendar that you had an important job interview. However, when it asks you how your day was, you do not mention the interview. That probably implies it did not go well, or at least that you are not ready to talk about it yet.

Intuitive conversation 

Should an AI assistant ever interrupt you? If your intuition is to say ‘no’ then think again. We jump in when speaking with each other all the time. A great, seamless speech-powered piece of technology must be able to understand pauses, ums, errs, and potentially interrupt us to seek clarification or clear up potential misunderstandings. The “understanding gap” here is clear.

In science fiction depictions, all of the above are solved challenges. This is what makes such depictions of the future of this technology so powerful. And when we say ‘Understand Every Voice’, we take a deep interpretation of the meaning of ‘understanding’. It’s more than the words people say: it’s their deeper intentions, preferences and the meaning being conveyed.

There’s a long way to go. 

How can we achieve truly helpful AI assistants? 

The pace of progress in AI research is phenomenal and is only likely to increase. We believe there are two interesting paths of research that Speechmatics is uniquely positioned to drive forward in the audio domain.

1. Multi-scale Representation Learning 

This involves changing the context window of models so that they can learn different information depending on the horizon, or context length, in the audio. Current models often look frame by frame, syllable by syllable or word by word to try to predict what should come next. There’s an opportunity to shift this so that much longer stretches of time can be considered.

For example, after an hour of spirited discussion and disagreement about what color the front door of an office should be, someone might say ‘well, that meeting was great’. Looking only at the context of those words, it would likely be interpreted literally. Taken in the context of the entire meeting, sarcasm is far more likely.

What’s key here is that we’re using context to mean both the words being said and how they are being said. A larger context window can help to accurately work out that when someone says ‘green park’ they actually mean ‘Green Park’ the place, given the words said around it. But it can also be used to understand the tone of voice used to say those words. Saying ‘I’m fine’ might mean just that. But if you know how someone says it when they are in fact not fine, you can infer deeper meaning.
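One simple way to picture multi-scale representation learning (a toy sketch with made-up scales, not our training code) is to pool frame-level features over several horizons at once, from individual frames up to long stretches of discourse, so a model can draw on all of them together:

```python
import numpy as np

def multi_scale_features(frames, scales=(1, 8, 64)):
    """Summarize a (time, dim) feature matrix over several horizons.

    scale 1  ~ frame level     (what sound is this?)
    scale 8  ~ phrase level    (what words are these?)
    scale 64 ~ discourse level (what is the tone of this meeting?)
    """
    t, d = frames.shape
    views = []
    for s in scales:
        pad = (-t) % s                            # pad so t divides evenly
        x = np.pad(frames, ((0, pad), (0, 0)))
        pooled = x.reshape(-1, s, d).mean(axis=1)  # average over each window
        # repeat back to the frame rate so all scales can be concatenated
        views.append(np.repeat(pooled, s, axis=0)[:t])
    return np.concatenate(views, axis=1)           # (time, dim * len(scales))
```

At the shortest scale the features pass through unchanged, while the longer scales smooth over whole phrases and passages; a model seeing all three can weigh the literal words against the longer arc of the conversation.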

This represents a far more ground-up, organic building of understanding, and is vital for AI assistants to also become AI agents. In ‘Her’, Samantha is especially useful because her ‘scale’ or context covers every single interaction with Theodore since the moment she was activated. She can appropriately draw on all of that to better understand Theodore's emotions, wants and needs.

2. Multi-modal Representation Learning 

The world is inherently multi-modal. Text. Images. Audio. Video. These are modalities. Large Language Models (LLMs) and generative AI have to date primarily existed within one of these, and are only just beginning to expand into Large Multimodal Models (LMMs), such as Google’s Gemini.

This frontier of research asks how we make our AI not just output across audio, text, video and imagery, but benefit deeply from the learning that transfers across these modalities. This approach is also vital for gaining a deeper understanding of speech and making it more useful.

Both of these avenues represent hugely exciting opportunities for discovery. We see both as necessary to achieve truly seamless future AI assistants.

Striving for meaningful understanding in AI assistants

For the future of AI assistants and agents (and beyond), technology needs to not only hear our words, but also understand the meaning we convey. 

We’re fortunate that we’re in a strong position to push the envelope in this area of speech technology. Transcription is just the starting point - our end goal is a much deeper understanding of how people convey meaning in what they say, and to build machines that can reply in turn with the same depth.   

If AI assistants lie in our future, they will have to do these things to be useful. We need them to be able to understand people’s goals, preferences and intent. 

We don’t just want to transcribe every word.   

We want to Understand Every Voice, and we’re firmly committed to doing so.

Astonishingly accurate ASR is here, in real time.

What are you waiting for?