
Dr Tony Robinson began his PhD research into speech recognition at Cambridge in 1985 and built one of the first neural network-based systems four years later.
He founded Speechmatics, shaped a generation of Cambridge researchers who went on to build the Voice AI infrastructure at Microsoft, Amazon and Apple, and has spent four decades on one problem: Voice AI works in demos and breaks in production.
Last week he answered 24 hours' worth of questions from builders on Reddit.
Here are some of his top insights.

Automatic speech recognition has crossed a threshold. In Tony's framing, it is now more like software than machine learning. You can plug it in and it works for a wide range of use cases. The accuracy is there. The barrier to entry is low.
The problem is everything else. In a real-time STT to LLM to TTS pipeline, the speech-to-text component is rarely where things go wrong. The LLM is. In his own testing, the fun conversations last until the LLM hits a loop and just becomes wrong. Getting reliable, fact-checkable language model output is the unsolved layer of Voice AI in 2026.
What the industry actually needs, Tony argues, is not better LLMs but provably correct ones: systems where the model is doing nothing more than enabling search of a verified facts database, with fact-checking built into the output.
A wrong answer delivered in a natural, human-sounding voice can be more dangerous because people may trust it too much. The accountability problem has got worse as the voice quality has got better.
The one-second rule has held across decades of voice interface design. If the user does not get a reply within one second of finishing their turn, they lose faith in the system. Go over it and the interaction degrades, regardless of how accurate the transcription is.
The implication for pipeline design is stark. A cascaded system is only as fast as its slowest component. One delay anywhere in the chain and the answer arrives late, which in Voice AI is effectively a wrong answer.
Network delays have to be minimized. Whatever can be run locally should be.
Any latency introduced at the ASR stage is unrecoverable. But Tony also flags an opposite failure mode: a response that arrives unnaturally fast puts people off too. That one, at least, is easy to fix.
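To make the budget concrete, here is a minimal sketch of a latency check for a single turn. It assumes the stages run strictly one after another, so their latencies add up; a pipeline that streams or overlaps stages would do better. The `StageLatency` type, the `plan_response` function, and the 200 ms "too fast" floor are illustrative assumptions, not figures from Tony or from any Speechmatics API. Only the one-second budget comes from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class StageLatency:
    name: str
    ms: int  # measured latency of this stage, in milliseconds


def plan_response(stages, budget_ms=1000, min_delay_ms=200):
    """Check one turn against the reply budget.

    budget_ms is the one-second rule; min_delay_ms is an assumed floor
    below which a reply feels unnaturally fast.
    """
    total = sum(s.ms for s in stages)  # sequential stages: latencies add
    slowest = max(stages, key=lambda s: s.ms)
    # If the reply is ready too early, hold it back until the floor.
    hold = max(0, min_delay_ms - total)
    return {
        "total_ms": total,
        "within_budget": total <= budget_ms,
        "slowest_stage": slowest.name,
        "hold_ms": hold,
    }


if __name__ == "__main__":
    turn = [StageLatency("asr", 150), StageLatency("llm", 600), StageLatency("tts", 200)]
    print(plan_response(turn))
```

The asymmetry Tony describes shows up directly: a late reply can only be fixed by making a stage faster, while an early one is fixed by waiting `hold_ms` before speaking.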
One of the most substantive debates in Voice AI is whether modular, cascaded pipelines will eventually be superseded by end-to-end speech-to-speech models that skip text as an intermediate representation entirely. Tony has held a clear position on this for decades.
He starts by pushing back on the word "cascaded", which he considers a misnomer. A real production pipeline is not a simple sequential hand-off. You can interrupt the TTS. There is metadata beyond text. Components can run in parallel. The architecture is more fluid than the name implies.
His case for text in the middle comes down to factoring. If you can break a problem into separable components, you can debug each one, optimize each section independently, and train different parts on different datasets. You also get accountability: text is something you can inspect, verify, and parse into SQL queries or structured workflows in a way raw audio simply cannot be.
Speech-to-speech will win eventually, he concedes. But the first organization to get there will need to train on the full internet, have built a credible world model, and have a data center or two spare. The modular approach is not a compromise in the meantime. For many use cases, it is the better architecture on its own terms.
Ask any developer shipping a real-time Voice AI product what keeps them up at night, and turn detection appears quickly. When has a speaker finished their turn? The naive answer: when they stop talking. The real answer is considerably harder.
Tony has been working on this since 1989, and is candid that it still has a long way to go. A slow speaker might pause mid-thought. A filler word like "umm" signals continuation, not completion. A second voice in the room introduces ambiguity. The ASR pipeline has to have completed, and the LLM pipeline may already be in motion, by the time a turn-continuation signal arrives. If "umm" comes in at the half-second mark, the system has to invalidate partial results, discard intermediate state, and restore state. In real time.
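One common way to handle that invalidation is a generation counter: every speculative LLM request is stamped with the current generation, and any new speech bumps the counter so stale results can be thrown away. The sketch below is a toy illustration under that assumption. The `TurnTracker` class, the filler list, and the 500 ms silence threshold are hypothetical and do not describe how any production system, Speechmatics included, implements turn detection.

```python
FILLERS = {"um", "umm", "uh", "er", "erm"}  # assumed filler list


class TurnTracker:
    def __init__(self, silence_ms=500):
        self.silence_ms = silence_ms
        self.generation = 0   # bumped whenever speculative work is invalidated
        self.words = []       # content words in the current user turn
        self.pending = None   # (generation, text) handed to the LLM, if any

    def on_silence(self, elapsed_ms):
        """Tentatively end the turn after enough silence and start LLM work."""
        if self.pending is None and self.words and elapsed_ms >= self.silence_ms:
            self.pending = (self.generation, " ".join(self.words))
        return self.pending

    def on_word(self, word):
        """Any new speech, filler or not, means the turn is continuing."""
        if self.pending is not None:
            self.generation += 1  # invalidate in-flight results
            self.pending = None   # discard intermediate state
        # The transcript so far is kept, so state is restored for free.
        if word.lower().strip(".,!?") not in FILLERS:
            self.words.append(word)

    def is_current(self, generation):
        """An LLM result may be spoken only if its generation still matches."""
        return generation == self.generation


if __name__ == "__main__":
    t = TurnTracker()
    for w in ["book", "a", "table"]:
        t.on_word(w)
    gen, text = t.on_silence(500)  # speculative end of turn
    t.on_word("umm")               # continuation signal arrives
    print(t.is_current(gen))       # the earlier LLM request is now stale
```

The sketch only covers the bookkeeping. Deciding when silence really means "done" is the hard part Tony is describing.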
For many production use cases, the current state of the art is good enough. But you can always break it if you try hard enough, and the question for builders is whether it holds in the specific context they are deploying into.
On whether Voice AI should be able to interrupt the user: yes, Tony says, and frames it as a world model problem. If the system has enough context to know the user is heading somewhere incorrect, it should be able to redirect. Most current LLMs are too nice. Tony suggests that this can help cover up hallucinations. A system with a genuinely good model of the user could interrupt appropriately. Not there yet.
Tony's answer is immediate: speaker diarization. The ability to distinguish between different speakers in an audio stream is treated as a secondary concern in many speech-to-text implementations. It should not be.
In a meeting or multi-party conversation, feeding speaker-tagged transcripts into an LLM changes what the model can do. Who said what, and when, is often as important as what was said. One wrong speaker assignment can change the meaning of the conversation entirely. Diarization is an ASR problem, not something a better LLM will fix, and passing clean speaker labels downstream unlocks substantially richer analysis.
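As a rough illustration of what "speaker-tagged" means in practice, the sketch below turns diarized segments into a transcript an LLM can read. The `Segment` shape, the `S1`/`S2` labels, and the line format are assumptions made for the example; real ASR output will have its own schema.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # diarization label, e.g. "S1" (label format is assumed)
    start: float  # start time in seconds
    text: str


def to_speaker_transcript(segments):
    """Merge consecutive segments from the same speaker into tagged lines."""
    lines = []
    last_speaker = None
    for seg in sorted(segments, key=lambda s: s.start):
        if seg.speaker == last_speaker:
            lines[-1] += " " + seg.text  # same speaker keeps talking
        else:
            lines.append(f"[{seg.start:.1f}s] {seg.speaker}: {seg.text}")
            last_speaker = seg.speaker
    return "\n".join(lines)


if __name__ == "__main__":
    meeting = [
        Segment("S1", 0.0, "Shall we ship Friday?"),
        Segment("S1", 1.5, "I think we can."),
        Segment("S2", 3.0, "No."),
    ]
    print(to_speaker_transcript(meeting))
```

Swap the labels on the last two segments and the transcript says something quite different, which is the point: the LLM can only reason about who agreed to what if the labels it receives are right.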
Barge-in detection, the ability to handle a user interrupting the system mid-response, is a different kind of problem: a pipeline problem, and not a particularly hard one. If the ASR detects new audio, the TTS should shut down. With the right signalling architecture, it can shut down faster still. Solvable.
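A minimal sketch of that signalling, assuming the TTS plays audio in small chunks and checks a shared flag before each one. `TtsPlayer`, `barge_in`, and the `on_chunk` hook are hypothetical names for illustration; a real system would write to an audio device and receive the signal from its ASR or voice-activity detector.

```python
import threading


class TtsPlayer:
    def __init__(self):
        self.stop_event = threading.Event()  # set by the ASR side on new user audio
        self.played = []                     # stand-in for audio written to the device

    def play(self, chunks, on_chunk=None):
        """Play chunks in order; return False if interrupted by barge-in."""
        self.stop_event.clear()
        for chunk in chunks:
            if self.stop_event.is_set():
                return False  # user started talking: stop immediately
            self.played.append(chunk)
            if on_chunk is not None:
                on_chunk(chunk)  # hook used here to simulate ASR activity
        return True

    def barge_in(self):
        """Called when the ASR detects new audio from the user."""
        self.stop_event.set()


if __name__ == "__main__":
    player = TtsPlayer()

    def fake_asr(chunk):
        if chunk == "b":  # pretend the user speaks during chunk "b"
            player.barge_in()

    finished = player.play(["a", "b", "c", "d"], on_chunk=fake_asr)
    print(finished, player.played)
```

In this design the shutdown delay is bounded by the chunk length, so smaller chunks mean a faster stop.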
Speech recognition still has real challenges in low-resource languages, but Tony's position is that the barrier is economic rather than architectural.
The Speechmatics model shares parameters across all languages. When the deep learning architecture improves, every language improves. The problem is that the money to build and deploy these systems has historically concentrated where English is spoken, which has priced parts of the world out.
That is shifting. Speech-to-text costs are falling and volumes are growing. The languages seeing fastest adoption are moving beyond the English-speaking world already.
For communities with low literacy rates, voice interfaces carry an access advantage that text-based systems cannot offer. When the tipping point comes for any given language, Tony says, it tends to come fast. Predicting exactly when is the hard part.
The honest answer is not about technology. It is about timing.
He was working with neural networks when he met Geoff Hinton in late 1985. His connectionist speech recognition work in the 1990s was technically ahead of its time. He started a company to distribute music over dial-up internet.
Speechmatics has been running for 20 years, and for the first several of those years ASR was not good enough to sell as a startup product. It took Siri before the market understood that speech recognition actually worked.
"Don't do things before the market is ready for them," Tony says, "but have them done when the market is ready."
The distinction matters. Being early is often indistinguishable from being wrong. The skill is building during the quiet period so you can move when the window opens.
Late in the AMA, Tony was asked whether Voice AI would become the dominant human-machine interface, or whether brain-machine interfaces would eventually take over. He said he hoped neither would dominate.
Speech is faster than typing. More natural. Better suited to devices too small for a keyboard, situations where hands are occupied, and any context where frictionless input matters. But it is not universal. In a room full of people, you might reasonably prefer the soft keyboard. Neuralink has its first happy customer, Tony notes, but he is personally never going to want a neural implant.
Dr Tony Robinson is founder of Speechmatics, which provides speech-to-text, speaker diarization and real-time transcription across 55+ languages. Get building with Speechmatics today via our Portal.



