Jul 2, 2025

Inside the future of Voice AI: Speechmatics at the LiveKit developer showcase

From stage to strategy: core takeaways from a lead engineer on the future of live, usable Voice AI.
Eli Landau, Solutions Engineer

Last week I had the honor of speaking on behalf of Speechmatics at an event hosted by the fantastic team at LiveKit.  

The event focused on Voice AI and on building and integrating speech-to-text (STT) and text-to-speech (TTS) into real-time workflows and applications. Naturally, as a Speechmatics Solutions Engineer I had quite a bit to share on the subject 😉.

Setting the stage at Fort Mason 

The event took place at Fort Mason in San Francisco, in a wonderfully suitable space run by Founders, Inc.

To the LiveKit team's further credit, attendance was excellent: a packed amphitheater of more than 80 developers (from enterprises and individual builders alike) eager to learn more about the secret sauce that goes into building these impressive machine learning models.

I shared the stage with Jean Wang, a product manager at InWorld, who focused on TTS while I covered the STT side of things.

Making voice as intuitive as human conversation 

After some initial developer demos (including one from PrepPal that incorporated Speechmatics into their presentation), we kicked off the panel. We dove deep into the core problems that Speechmatics and real-time STT aim to solve, and how we solve them through machine learning.

Voice is an intuitive mode of communication between human beings, and our aim is to bring that natural ease of communication to STT. The result: Speechmatics now transcribes millions of hours of audio each month with extremely high accuracy and low real-time latency, across 56 languages (and counting), covering a wide range of dialects and accents.

We support a wide range of use cases, from conversational AI and healthcare to broadcasting, call centers, edtech, and more.

Beyond simply transcribing what was said, we can also determine who said what with our state-of-the-art diarization. 

Training for the real world, not the lab 

But how do we do all of this? And how do we do it well? We pre-train our machine learning model on millions of hours of unlabeled data (audio only), which exposes it to a wide variety of phonemes across different dialects, accents, and languages in the training audio itself. This enables the model to learn new languages with just a small amount of labeled data.

On top of that inclusive and diverse training data, basing our training on the sound of words rather than pure text allows our models to perform well not only with newscasters, but also with regional dialects and non-native speakers. In the same vein, unlabeled data is much easier to find and cheaper than labeled data (which requires both audio and a transcription), and using it has allowed us to build models for lesser-known languages that simply have less labeled data available.
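To make the idea of learning from unlabeled audio a bit more concrete, here is a deliberately toy sketch of self-supervised pre-training: hide random frames of an audio feature sequence and train a model to reconstruct them, with no transcripts involved. This is illustrative PyTorch only, not Speechmatics' actual architecture or training recipe.

```python
import torch
import torch.nn as nn

class TinyAcousticEncoder(nn.Module):
    """Toy encoder that reconstructs masked audio frames (illustration only)."""
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.reconstruct = nn.Linear(dim, n_mels)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:  # (batch, frames, n_mels)
        return self.reconstruct(self.encoder(self.proj(mels)))

model = TinyAcousticEncoder()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def pretrain_step(mels: torch.Tensor, mask_prob: float = 0.3) -> float:
    """One self-supervised step: hide random frames, predict what was there."""
    mask = torch.rand(mels.shape[:2]) < mask_prob   # (batch, frames) boolean mask
    corrupted = mels.clone()
    corrupted[mask] = 0.0                           # zero out the masked frames
    predicted = model(corrupted)
    loss = nn.functional.mse_loss(predicted[mask], mels[mask])  # score masked frames only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# A fake batch of unlabeled audio features: 8 clips x 200 frames x 80 mel bins.
print(pretrain_step(torch.randn(8, 200, 80)))
```

Fine-tuning a model pre-trained like this on a comparatively small labeled set is what adapts it to actual transcription, which is the general pattern behind needing only a little labeled data per language.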

The end result? More inclusivity and higher accuracy. 

Who said what: The power of understanding who spoke 

An ancillary benefit of this approach is that, by focusing on acoustics during training, our models also become more adept at distinguishing between speakers' voices and cadences and labeling them as such in real time, a capability known in the industry as diarization.

Our training data also extends to audio that is low quality, full of background noise, or even pure silence, so the engine learns to focus on the speakers and speech that actually matter.

This is critical for real-world use cases: audio in the field is sometimes cacophonous, rarely perfect, and requires STT models that can sift through the noise.

Our training through augmentation allows us to perform well in these scenarios and helps users distinguish between speakers through diarization, which is often essential in call centers, drive-thrus, meeting rooms, and conferences. This capability unlocks powerful use cases: in multi-person meetings, voice agents can understand and respond to multiple participants seamlessly, while in drive-thrus the system can focus on the primary speaker (the customer) and filter out background conversations.
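For a sense of what this looks like from a developer's point of view, here is a minimal sketch of requesting real-time transcription with speaker diarization via the speechmatics-python SDK. The file name and API key are placeholders, and the exact field names (such as the "S1"/"S2" speaker labels and the endpoint URL) should be checked against the current SDK and API docs for your version.

```python
import speechmatics
from speechmatics.models import (
    AudioSettings,
    ConnectionSettings,
    ServerMessageType,
    TranscriptionConfig,
)

API_KEY = "YOUR_API_KEY"  # placeholder

ws = speechmatics.client.WebsocketClient(
    ConnectionSettings(url="wss://eu2.rt.speechmatics.com/v2", auth_token=API_KEY)
)

def print_labeled_words(msg):
    # Final transcript messages carry word-level results; with diarization
    # enabled, each word also carries a speaker label (e.g. "S1", "S2").
    for result in msg.get("results", []):
        best = result["alternatives"][0]
        print(f'{best.get("speaker", "UU")}: {best["content"]}')

ws.add_event_handler(
    event_name=ServerMessageType.AddTranscript,  # final (not partial) transcripts
    event_handler=print_labeled_words,
)

conf = TranscriptionConfig(
    language="en",
    diarization="speaker",  # ask the engine to attribute each word to a speaker
)
settings = AudioSettings()  # defaults; set encoding/sample_rate for raw PCM streams

with open("meeting.wav", "rb") as audio:
    ws.run_synchronously(audio, conf, settings)
```

Every word in the final transcript then arrives already attributed to a speaker, which is exactly the raw material the agent demo below builds on.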

What's next for speech technology? 

We rounded out the panel by discussing the future of STT, which looks very exciting. Paralinguistics, the ability to detect tone and sentiment from speech, will unlock many new use cases, especially in call centers and conversational AI.

And real-time, on-device STT will allow smaller but still effective models to run directly on hardware, opening the door to transcription and conversational AI that is private and can operate with little to no internet connectivity.

The chatbot that knows who's talking 

After the panel the only thing left to do was showcase the great tech we talked about!  

Our Senior Director of Innovation, Sam Sykes, built a fantastic real-time conversational AI bot using the LiveKit framework, Speechmatics for STT, Azure OpenAI for the LLM, and ElevenLabs for TTS.

The result? A chatbot that can distinguish between the different speakers in the conversation, all thanks to the speaker labels provided by our diarization.
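For the curious, here is a rough sketch of how such a pipeline can be wired together with the LiveKit Agents framework for Python. It follows the publicly documented quickstart pattern and plugin packages rather than Sam's actual code, so treat the constructor arguments and model name as assumptions; the demo used Azure OpenAI, which the OpenAI plugin can be configured for.

```python
# Rough sketch of an STT -> LLM -> TTS voice agent on LiveKit Agents.
# Plugin names follow the published livekit-plugins-* packages; constructor
# arguments and the model name are illustrative assumptions, not the demo's code.
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import elevenlabs, openai, speechmatics


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()

    session = AgentSession(
        stt=speechmatics.STT(),          # Speechmatics real-time STT; enable speaker
                                         # diarization in its config to get labels
        llm=openai.LLM(model="gpt-4o"),  # the demo used Azure OpenAI; swap in the
                                         # plugin's Azure configuration as needed
        tts=elevenlabs.TTS(),            # ElevenLabs voice for the spoken replies
    )

    await session.start(
        room=ctx.room,
        agent=Agent(
            instructions=(
                "You are a helpful voice assistant. Speaker labels in the "
                "transcript tell you who is talking; respond accordingly."
            )
        ),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```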

My favorite way to showcase this is to have our chatbot respond positively to an audience member's suggestions for what to do while disliking everything that I suggest, differentiating between our suggestions purely by voice.

But don't take my word for it: some of our audience members were equally impressed, and you can catch a recap of the demo on TikTok.

 

Building the developer community 

Overall, the event was a great success! A key indicator: after the demos concluded, people hung around for a few hours talking about tech and asking more about Speechmatics STT.

We’re eager to continue building out our developer community, and this ultimately means more developer showcases in the near future.  

Whether you're new to machine learning or an expert, we'd love to see you at the next event. But if you can't wait and want to get hands-on with our STT, sign up for free in the portal here and start building 😊!