Jul 31, 2025 | Read time 2 min

Pipecat and Speechmatics: Building Voice Agents that know exactly ‘Who’ said ‘What’

Build smarter voice agents on Pipecat with Speechmatics speech-to-text, now with powerful speaker diarization for real-world, multi-speaker conversations.
Speechmatics Editorial Team

Speechmatics is thrilled to announce our partnership with Pipecat, the open source framework for voice (and multimodal) conversational AI. For the first time, Speechmatics brings speaker diarization to Pipecat voice agents - enabling them to understand ‘who’ said ‘what’. This unlocks valuable new use cases and helps tackle some key challenges in voice agent design.

Start building with:

This collaboration sets a new benchmark in conversational AI - bringing an unprecedented level of clarity and speaker awareness to voice interfaces across the Pipecat community and beyond.

Why Diarization matters in Voice AI

Today’s voice AI agents are built for one-on-one interactions. But the real world isn’t that simple.

Speaker diarization - the ability to identify ‘who’ said ‘what’ - unlocks a new level of context-awareness, enabling voice agents to thrive in complex, multi-speaker environments like households, meetings, or shared workspaces.

Here are some examples of how diarization solves some of the most common (and stubborn) problems in Voice AI:

  • Interruptions from background speakers - letting the agent focus on the right voice, ignoring unintended inputs

  • Lack of multi-speaker understanding - giving agents the ability to track and respond to individuals, not just voices - essential for context-rich conversations

  • Self-interruption in multi-device setups - when microphones and speakers sit on separate systems, an agent can mistake its own output for user input, especially when echo cancellation is weak or fails entirely. Diarization lets the agent label and discard its own voice

As Voice AI evolves beyond one-on-one interactions, speaker-aware intelligence becomes non-negotiable. Diarization is the key to unlocking this future.
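The core idea behind all three fixes can be sketched in a few lines of plain Python. This is a hypothetical illustration, not Pipecat or Speechmatics code: the `Segment` class and speaker labels ("S1", "S2") stand in for the diarized transcript segments the STT service emits. Once every utterance carries a speaker label, the agent can route only the primary speaker's words onward and drop everything else:

```python
from dataclasses import dataclass

# Hypothetical shape of a diarized transcript segment: the STT engine
# attaches a speaker label (e.g. "S1", "S2") to each piece of speech.
@dataclass
class Segment:
    speaker: str
    text: str

def filter_for_agent(segments, primary_speaker):
    """Keep only the primary speaker's utterances, discarding background
    voices and (in multi-device setups) the agent's own echoed output,
    which diarization assigns its own label."""
    return [s.text for s in segments if s.speaker == primary_speaker]

segments = [
    Segment("S1", "Turn on the kitchen lights."),
    Segment("S2", "Are you watching the game tonight?"),  # background speaker
    Segment("S1", "And set a timer for ten minutes."),
]

print(filter_for_agent(segments, "S1"))
# → ['Turn on the kitchen lights.', 'And set a timer for ten minutes.']
```

In a real pipeline the same filtering happens on streaming transcript frames rather than a list, but the principle is identical: the speaker label, not the raw audio, decides what the agent responds to.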

Deploy anywhere!

Even better, this technology isn't limited to the cloud. It can run on your own computer, on-premise, on-device, and even on small embedded hardware. 

The demo below showcases a Pipecat-powered voice agent using Speechmatics’ speech-to-text, with audio input and output routed through an ESP32 microcontroller. It plays a real-time Guess Who-style game, highlighting speaker diarization in action.

GitHub: Code for the Guess Who? demo

Unmatched accuracy across all languages

Beyond diarization, Speechmatics continues to redefine expectations with consistently high accuracy across more than 55 languages. While many speech-to-text (STT) providers struggle with accuracy outside of English, our models set the benchmark across languages, dialects, and accents.

Empowering Bilingual Conversations

Our pioneering bilingual models, including Spanish-English, Mandarin-English, Malay-English, and Tamil-English, are uniquely designed to capture and transcribe fluid multilingual conversations without sacrificing accuracy when speakers code-switch. This empowers global businesses to interact effortlessly in diverse linguistic environments, removing language barriers and promoting effective communication.

The world’s most accurate, inclusive, and low-latency STT

Speechmatics remains committed to diversity and inclusion, offering the highest-accuracy, lowest-latency STT engine, built to understand all dialects and accents. In voice experiences, even a single misunderstanding can break the flow and disrupt the entire user journey. Our approach ensures every voice is heard and accurately represented, regardless of linguistic background.

Together, Speechmatics and Daily’s Pipecat are unlocking transformative possibilities for voice-driven interactions, demonstrating how advanced technology can deliver unmatched conversational clarity and inclusive communication on a global scale.

Start building today with Speechmatics + Pipecat!

Links and resources: