Jul 31, 2025 | Read time 2 min
Pipecat and Speechmatics: Building Voice Agents that know exactly ‘Who’ said ‘What’
Build smarter voice agents on Pipecat with Speechmatics speech-to-text, now with powerful speaker diarization for real-world, multi-speaker conversations.
Speechmatics Editorial Team
Speechmatics is thrilled to announce our partnership with Pipecat, the open source framework for voice (and multimodal) conversational AI. For the first time, Speechmatics brings speaker diarization to Pipecat voice agents - enabling them to understand ‘who’ said ‘what’. This unlocks valuable new use cases and helps tackle some key challenges in voice agent design.
This collaboration sets a new benchmark in conversational AI - bringing an unprecedented level of clarity and speaker awareness to voice interfaces across the Pipecat community and beyond.
The team at @Speechmatics just shipped a really clean integration of realtime speaker diarization for voice agents. I've tinkered quite a bit with multi-speaker voice agent pipelines, and this is the best implementation I've seen so far.
Today’s voice AI agents are built for one-on-one interactions. But the real world isn’t that simple.
Speaker diarization - the ability to identify ‘who’ said ‘what’ - unlocks a new level of context-awareness, enabling voice agents to thrive in complex, multi-speaker environments like households, meetings, or shared workspaces.
Here are some examples of how diarization solves some of the most common (and stubborn) problems in voice AI:
Interruptions from background speakers - letting the agent focus on the right voice and ignore unintended inputs (see the sketch after this list)
Lack of multi-speaker understanding - giving agents the ability to track and respond to individuals, not just voices - essential for context-rich conversations
Self-interruption in multi-device setups - when microphones and speakers are on separate systems, agents can mistakenly interpret their own output as user input, often because echo cancellation is weak or fails entirely
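To make the first point concrete, here is a minimal sketch of a custom Pipecat frame processor that keeps only the primary speaker's transcriptions and drops everything else. It assumes the Speechmatics integration surfaces its diarization label (for example "S1") in each TranscriptionFrame's user_id field; the exact field and label format may differ, so treat this as illustrative rather than the official API.

```python
from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class PrimarySpeakerFilter(FrameProcessor):
    """Pass frames through only for the designated primary speaker."""

    def __init__(self, primary_speaker: str = "S1"):
        super().__init__()
        self._primary_speaker = primary_speaker

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)

        # Assumption: the diarization label arrives in frame.user_id.
        # Drop transcriptions attributed to anyone other than the primary
        # speaker, so background voices never reach the LLM.
        if isinstance(frame, TranscriptionFrame) and frame.user_id != self._primary_speaker:
            return

        await self.push_frame(frame, direction)
```

Dropped this processor between the STT service and the LLM, it ensures only the chosen speaker drives the conversation; the same pattern can be inverted to let the agent ignore its own voice in multi-device setups.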
As Voice AI evolves beyond one-on-one interactions, speaker-aware intelligence becomes non-negotiable. Diarization is the key to unlocking this future.
Deploy anywhere!
Even better, this technology isn't limited to the cloud. It can run on your own computer, on-premise, on-device, and even on small embedded hardware.
The demo below showcases a Pipecat-powered voice agent using Speechmatics’ speech-to-text, with audio input and output routed through an ESP32 microcontroller. It plays a real-time Guess Who-style game, highlighting speaker diarization in action.
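For orientation, here is a rough sketch of how such a pipeline might be wired up. The SpeechmaticsSTTService import path and the enable_diarization parameter are assumptions for illustration, and the transport, LLM, and TTS services are placeholders, so check the Pipecat and Speechmatics documentation for the exact integration API.

```python
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask

# Assumed import path for the Speechmatics plugin -- verify against the Pipecat docs.
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService


async def run_agent(transport, llm, tts):
    """Wire Speechmatics STT (with diarization) into a Pipecat pipeline.

    `transport`, `llm`, and `tts` are whichever Pipecat transport and
    services your agent uses (e.g. an ESP32-connected transport as in the demo).
    """
    stt = SpeechmaticsSTTService(
        api_key=os.getenv("SPEECHMATICS_API_KEY"),
        enable_diarization=True,  # parameter name assumed for illustration
    )

    pipeline = Pipeline([
        transport.input(),   # audio in from the device
        stt,                 # speaker-labelled transcriptions
        llm,                 # conversational logic
        tts,                 # agent speech out
        transport.output(),  # audio back to the device
    ])

    await PipelineRunner().run(PipelineTask(pipeline))
```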
Beyond diarization, Speechmatics continues to redefine expectations with consistently high accuracy across more than 55 languages. While many speech-to-text (STT) providers struggle with accuracy outside of English, our models set the benchmark across languages, dialects, and accents.
Empowering Bilingual Conversations
Our pioneering bilingual models, including Spanish-English, Mandarin-English, Malay-English, and Tamil-English, are uniquely designed to capture and transcribe fluid multilingual conversations without sacrificing accuracy when speakers code-switch. This empowers global businesses to interact effortlessly in diverse linguistic environments, removing language barriers and promoting effective communication.
The world’s most accurate, inclusive, and low-latency STT
Speechmatics remains committed to diversity and inclusion, offering the highest accuracy and lowest latency STT engine, built to understand all dialects and accents. In voice experiences, even a single misunderstanding can break the flow and disrupt the entire user journey. Our approach ensures every voice is heard and accurately represented, regardless of linguistic background.
Together, Speechmatics and Daily’s Pipecat are unlocking transformative possibilities for voice-driven interactions, demonstrating how advanced technology can deliver unmatched conversational clarity and inclusive communication on a global scale.