Building Speech Intelligence on Solid Foundations

Advancing Speech Technology

Over the last decade, we've seen many foundational machine learning technologies come of age. Computer Vision, Speech, and Natural Language Processing (NLP) have all been revolutionized thanks to large training regimes paired with large and willing neural networks. At Speechmatics, we've seen error rates drop steadily over the last decade as we've pushed hard on the frontier of speech recognition accuracy.

In just two years, the word error rate (WER) of our best-performing model has fallen by 50%, as measured on our internal test data.
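For readers less familiar with the metric, word error rate is conventionally the number of word-level substitutions, deletions and insertions needed to turn the transcript into the reference, divided by the number of reference words. The snippet below is a minimal sketch of that standard definition, assuming plain whitespace tokenization and no text normalization. The function names and example numbers are illustrative assumptions, not our internal evaluation code or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(old_wer: float, new_wer: float) -> float:
    """Fraction by which WER fell, relative to the old value."""
    return (old_wer - new_wer) / old_wer


# Illustrative values only: one dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
# A hypothetical drop from 10% to 5% WER is a 50% relative reduction.
reduction = relative_reduction(0.10, 0.05)
```

Note that a "50% reduction" in this sense is relative: halving the error rate, whatever its starting value, rather than lowering it by 50 percentage points.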

But where does this leave us and the wider tech community? What’s next?

As an AI company, we have a clear answer to that question. We propose to build a seamless AI stack, which we are calling Speech Intelligence, that connects the latest AI technologies to the spoken world. Speech is the most natural and fluid form of communication, so speech technology promises a seamless entry and exit point; however, our interactions with machines today are stilted, frustrating and underwhelming.

We want to fix that.

Three pieces come together to form our initial approach:

1) Core ASR with global language coverage - this executes our vision to 'Understand Every Voice' and provides a compounding advantage throughout the stack.

2) Capabilities over one or more transcripts - exploit LLMs to build compelling APIs.

3) Verticalized solutions - take a highly customer-centric view on packaging and exposing this stack.

Having been a part of the team at Speechmatics since the very early days, I've been primarily building technology related to part 1 of the picture.

As we move into this next phase, the importance of ASR has not diminished. If anything, it matters more than ever, and here’s why.

Garbage in, garbage out

Speechmatics has always prioritized the accuracy of our transcriptions. This commitment to accuracy has not merely been a competitive advantage but a fundamental principle rooted in our belief that for speech to be truly valuable, it must mirror reality. We believe the highest impact applications will be those where the ASR layer is rock solid and errors don't diffuse throughout the wider system. Imagine a world where voice assistants "just get it", and can participate in seamless conversations, and where AI-generated summaries never hallucinate in either the ASR or the LLM. If the speech foundations are brittle, the whole AI stack becomes brittle very quickly. Garbage in, garbage out.

Unlike some machine learning challenges, global transcription faces a unique obstacle - the scarcity of labeled data, especially for diverse languages and speakers. To make ASR rock-solid we must have an approach that has a fighting chance of dealing with the long tail of accents, words and acoustic conditions. To tackle this challenge, we've set up a long-running research program on self-supervised learning which has increasingly given us the ability to achieve wider coverage and higher accuracy with less and less labeled data, consistently across all the languages we offer.

A recent update to our Enhanced model led to some significant reductions in word error rates, including a 40% improvement for Norwegian.

A key aspect of Intelligence is the rate of skill acquisition – the reason humans don't have this same long-tail problem is that we can generalize from a small number of hours of real-life interactions. Any worthwhile ASR (or indeed AI) system of the future must evolve in the same direction.

In today's landscape of Large Language Models (LLMs), automation, and generative AI, the significance of this accuracy has only grown. The value of any downstream application built on speech data depends directly on the accuracy of the original transcript. It's a classic case of "garbage in, garbage out".

Given the above, whoever achieves superb accuracy across every language will be the preferred choice for any 'conversational AI stack'. If you plan to use speech, you need the best at capturing speech. We are, and we will continue to pursue this.

Our place in generative AI

Many might argue that the advancements in speech-to-text technology only impact those already embedded in the voice-driven ecosystem — CCaaS providers, captioning companies, podcasters and the like. Yet, the scope is broader. Most AI interactions today revolve around rigid scenarios such as asking Alexa to set a timer or getting an automated set of meeting notes. We must think bigger.

AI assistants today are just the beginning. The goal is for people to talk to most tech tools, just like they have been communicating for thousands of years – with their voices. Speaking is often both more natural and impactful than typing. The tech of the future should not just hear but truly understand our spoken words. Instead of fumbling with clunky LLM prompts and handoffs between multiple systems, we should be able to simply talk, and the AI should respond in kind. No latency and no misheard words. The vision is a seamless voice interface powered by a fully speech-to-speech neural network.

In the future, people in both their work environments and in their personal lives will be able to interact with technology in this way. Come and join us to help build that future.

Sure, Speech Technology isn't a panacea here. It won't be that we only use our voices to interact with tech - but it's still a string that we simply don't have in our collective bow right now. Current generation systems are good, but we need to get much closer to 'perfect' before this kind of future will come into view.

What's next?

Short term, we're looking to build our next-generation self-supervised models to strengthen our ASR foundations. As always, we will be adding key languages to our offering. On the capabilities front, we have some new APIs that we expect to pair really well with our high-accuracy ASR, with more on that soon. We will also be expanding our Speech Intelligence stack and increasing its impact by providing solutions tailored to specific customer needs.

Long term, as an AI company we remain committed to investing in paradigm changes that make these seamless voice interfaces of the future a reality. We're excited about our direction and the future development of Speech Intelligence – our success in powering speech technology will always be built on the inclusive and consistently accurate foundations of automatic speech recognition.

Oct 17, 2023 | Read time 6 min
