May 22, 2024 | Read time 5 min

Why Google and OpenAI’s latest announcements don’t solve all the challenges of AI Assistants

Understanding every voice is necessary for creating great AI assistants, but it is also a huge challenge.
Trevor Back, Chief Product Officer
Will Williams, Chief Technology Officer

A renewed focus on Conversational AI

Back in February, we shared our belief that “Her”-like AI assistants lie in our future. The movie is a great North Star for our vision because of its focus on natural, truly seamless interaction.

We also claimed that speech will be a core part of any future AGI stack.

Last week’s demos from OpenAI and Google have shown that Big Tech shares this vision.

In the space of just a week, both companies held events to showcase their latest updates. Both showed a renewed focus on audio-centric interaction layers on top of their LLM stacks, and on how this enables a more conversational AI device. These demos reignited media enthusiasm for assistants after some less well-received product reviews of Humane AI and Rabbit.

Some of the most impressive demos focused on a ‘multimodal’ application, able to use text, images and audio inputs together. A particularly exciting moment for many observers was the ability to use this multimodal input to provide a personalized tutor to help a student learn calculus and trigonometry.

Our main takeaway, however, was that the demos all used audio input, our voices, as the method for interaction. No keyboards, no touch screens, no mice, no eye tracking (not even a brain interface). But our most natural, most seamless, most human way of interacting with our world: our voice.

The shift to audio-driven interaction

A simple way to think about these new assistant models is as three core components:

  1. Multimodal input, including Audio-in for user interaction (using speech-to-text)

  2. Intelligence stack (often using LLMs)

  3. Audio-out (using speech synthesis)
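The three components above can be sketched as a simple pipeline. This is an illustrative sketch only: `transcribe`, `think`, and `synthesize` are hypothetical stand-ins for a real speech-to-text service, an LLM, and a text-to-speech engine, not any vendor's actual API.

```python
from dataclasses import dataclass

@dataclass
class AssistantTurn:
    transcript: str     # 1. multimodal input: audio-in -> text (speech-to-text)
    reply_text: str     # 2. intelligence stack (often an LLM)
    reply_audio: bytes  # 3. audio-out (speech synthesis)

def transcribe(audio: bytes) -> str:
    # placeholder for a real ASR service
    return "tell me a bedtime story about robots and love"

def think(prompt: str) -> str:
    # placeholder for an LLM call
    return f"A bedtime story about {prompt.rsplit('about ', 1)[-1]}? I got you covered..."

def synthesize(text: str) -> bytes:
    # placeholder for a TTS engine; real systems return audio samples
    return text.encode("utf-8")

def run_turn(audio: bytes) -> AssistantTurn:
    transcript = transcribe(audio)
    reply = think(transcript)
    return AssistantTurn(transcript, reply, synthesize(reply))
```

The point of the sketch is that each stage is a separate component: improving the synthesis stage (as OpenAI's demo did) says nothing about the quality of the transcription stage.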

Google I/O showcased several demos utilizing Gemini’s impressive multimodal capabilities.

OpenAI’s demo showcased their impressive progress in speech synthesis – how to get an application to talk in an emotive, human-like way – which, over the last year, has seen incredible leaps forward from what we’re used to with Siri, Alexa, or Google Assistant.

A few commentators (The Guardian and Bloomberg, to name two) were critical of just how similar OpenAI’s assistant sounded to the AI from “Her” (played by Scarlett Johansson), but nonetheless the range of emotion was impressive to experience.

One of the major benefits of this progress in speech synthesis is that it can give the impression of a deeper understanding in the other components of the tech stack, even when that understanding may not be present.

For example, in the OpenAI demo, we heard the following interaction:

“I want you to tell my friend a bedtime story about Robots and Love” 

“Ooooooo, a bedtime story about Robots and Love? I got you covered…”

By responding quickly to utterances with human-like (but ultimately filler) phrases, the perception of responsiveness undoubtedly increases, but this does not demonstrate that the actual speech understanding has also improved. Lots of these filler responses were demonstrated – often repeating the question back in the same order, with some personality thrown in for good measure.

Speechmatics' commitment to shaping the future of AI assistants  

At Speechmatics, we believe that deeply understanding the audio-in is required to build the seamless AI assistant of the future. This is why we focus so much of our time and energy on the first part of the component stack: the audio-in.

We’ve been focused on this for over a decade, and here are some reasons we think it truly can make a difference for the AI assistants of the (near) future:

Feature: Real-time
OpenAI: Likely a “mini-batch” methodology – blocks of question and response.
Speechmatics: Real-time first.

GPT-4o likely uses a “mini-batch” methodology, where an end of user speech is detected and a full response is immediately generated. This can lead to moments where the system responds unnaturally fast, interrupts, or fails to recognize that it was interrupted.

We use streaming ASR to recognize each word as it is spoken. Our future systems will be able to reformulate responses mid-response, interrupt gracefully, handle crosstalk, and incorporate background events without triggering awkward transitions.

Feature: Diarization
OpenAI: Unknown.
Speechmatics: Best-in-class.

For OpenAI, there were moments when the assistant failed to recognize it was speaking to multiple people. Understanding multiple speakers is critical to many use cases.

Speechmatics offers best-in-class speaker diarization. This includes speaker (rather than channel-wise) diarization, where we are able to recognize multiple different speakers on the same audio channel.

Feature: Background noise handling
OpenAI: Unreliable.
Speechmatics: Best-in-class.

On a few occasions, GPT-4o reacted to audience audio rather than the presenter. For an assistant to be useful in the real world, it needs to be robust to background noise (like traffic, a crowd, or an airport) and to background speakers (such as when you’re sitting on a train).

Our ASR is specifically trained to be robust to background noise (just take this example of a referee at a basketball game with a noisy crowd).

Feature: Accents, dialects, and non-English languages
OpenAI: American English and simple Spanish showcased; full evaluation pending.
Speechmatics: 50+ languages, including strong accent and dialect coverage.

The OpenAI employees showcasing the demo had easily understandable American-English voices, and ASR is known to work extremely well for such voices. But how does it work for understanding every voice? Every accent? Every language?

Speechmatics is independently verified as the most accurate ASR for a wide range of languages, accents, and dialects. We want our technology to enable the understanding of every voice, not just the most prominently represented. The devil is in the detail when it comes to accents.

Feature: Latency
OpenAI: Variable.
Speechmatics: Consistently low.

While OpenAI’s assistant responded with very low latency most of the time, there were still occasions when the wait for a response had an unknown duration. This variability in response time can be frustrating for a user. We provide premium products for our premium enterprise customers: robust, consistent, and reliable products that “just work”.

At Speechmatics we work to ensure you always get a response at the same latency, no matter the question.
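The difference between the "mini-batch" and streaming approaches described above can be shown with a toy example. The two functions below are illustrative, not real recognizers: the mini-batch version returns nothing until the whole utterance has ended, while the streaming version emits a growing partial transcript word by word, which is what lets downstream components react mid-utterance.

```python
from typing import Iterator, List

def mini_batch_asr(words: List[str]) -> str:
    # waits for end-of-speech, then returns the full utterance at once
    return " ".join(words)

def streaming_asr(words: List[str]) -> Iterator[str]:
    # emits a growing partial transcript as each word is spoken, so a
    # downstream LLM can reformulate its response before the user finishes
    partial: List[str] = []
    for word in words:
        partial.append(word)
        yield " ".join(partial)

utterance = ["tell", "me", "a", "bedtime", "story"]
final = mini_batch_asr(utterance)          # one result, only after a pause
partials = list(streaming_asr(utterance))  # five results, word by word
```

With mini-batch recognition the assistant has exactly one chance to respond per detected pause, which is why interruptions and crosstalk are awkward; with streaming partials it can begin planning, or gracefully abandon, a response at any word boundary.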

Overcoming challenges to enhance AI accessibility for all

Understanding every voice is both necessary for creating great AI assistants and a huge challenge.

Whilst the announcements were truly impressive, and represent a further evolution of AI technology, we’re 100% focused on developing the speech understanding required to make these AI assistants accessible to everyone.

What is clear is that others share our vision for the critical role of speech in how we interact with technology. Speech will be a core component of any future AGI stack.

We’re in an enviable position to capitalize on our decade of expertise in audio. Of course, we’ll be making a few of our own announcements over the next few months… stay tuned 👀.
