Jul 31, 2025 | Read time 2 min

Pipecat and Speechmatics: Building Voice Agents that know exactly ‘Who’ said ‘What’

Build smarter voice agents on Pipecat with Speechmatics speech-to-text, now with powerful speaker diarization for real-world, multi-speaker conversations.
Pipecat and Speechmatics Voice AI integration is now live
Speechmatics Editorial Team

Speechmatics is thrilled to announce our partnership with Pipecat, the open source framework for voice (and multimodal) conversational AI. For the first time, Speechmatics brings speaker diarization to Pipecat voice agents - enabling them to understand ‘who’ said ‘what’. This unlocks valuable new use cases and helps tackle some key challenges in voice agent design. This collaboration sets a new benchmark in conversational AI - bringing an unprecedented level of clarity and speaker awareness to voice interfaces across the Pipecat community and beyond.

Why Diarization matters in Voice AI

Today’s voice AI agents are built for one-on-one interactions. But the real world isn’t that simple.

Speaker diarization - the ability to identify ‘who’ said ‘what’ - unlocks a new level of context-awareness, enabling voice agents to thrive in complex, multi-speaker environments like households, meetings, or shared workspaces.

Here are some examples of how diarization solves some of the most common (and stubborn) problems in Voice AI:

  • Interruptions from background speakers - letting the agent focus on the right voice, ignoring unintended inputs

  • Lack of multi-speaker understanding - giving agents the ability to track and respond to individuals, not just voices, which is essential for context-rich conversations

  • Self-interruption in multi-device setups - when microphones and speakers are on separate systems, agents can mistakenly interpret their own output as user input. This often happens when echo cancellation is weak or fails entirely
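
To make the first and third points concrete, here is a minimal Python sketch of what an agent can do once every piece of transcript carries a speaker label. It does not use the Pipecat or Speechmatics APIs: the Segment class, both function names, and the labels S1, S2 and S3 are made up for illustration, and it assumes the agent's own voice shows up under a known label.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One diarized chunk of transcript (hypothetical shape, not a real API type)."""
    speaker: str  # diarization label, assumed here to look like "S1"
    text: str


def select_user_turns(segments, focus_speakers, agent_speaker=None):
    """Keep only the segments spoken by people the agent should respond to."""
    kept = []
    for seg in segments:
        if agent_speaker is not None and seg.speaker == agent_speaker:
            continue  # the agent's own output, picked up by a microphone
        if seg.speaker not in focus_speakers:
            continue  # a background speaker
        kept.append(seg)
    return kept


def format_for_llm(segments):
    """Tag each line with its speaker so the model can tell who said what."""
    return "\n".join(f"<{seg.speaker}> {seg.text}" for seg in segments)


# Illustrative input: S1 is the player, S2 is the agent's own voice,
# and S3 is someone talking in the background.
segments = [
    Segment("S1", "Is your character wearing glasses?"),
    Segment("S3", "Dinner is ready!"),
    Segment("S2", "No, my character is not wearing glasses."),
    Segment("S1", "Okay, does she have red hair?"),
]

turns = select_user_turns(segments, focus_speakers={"S1"}, agent_speaker="S2")
print(format_for_llm(turns))
```

With S1 as the only focus speaker, just the two S1 lines are passed on; the background remark and the agent's own reply are dropped before they can trigger an interruption. Adding more labels to focus_speakers would let the agent follow several people while still keeping their turns apart.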

As Voice AI evolves beyond one-on-one interactions, speaker-aware intelligence becomes non-negotiable. Diarization is the key to unlocking this future.

Deploy anywhere!

Even better, this technology isn't limited to the cloud. It can run on your own computer, on-premises, on-device, and even on small embedded hardware.

The demo below showcases a Pipecat-powered voice agent using Speechmatics’ speech-to-text, with audio input and output routed through an ESP32 microcontroller. It plays a real-time Guess Who-style game, highlighting speaker diarization in action.

GitHub: Code for Guess Who? demo

Unmatched accuracy across all languages

Beyond diarization, Speechmatics continues to redefine expectations with consistently high accuracy across more than 55 languages. While many speech-to-text (STT) providers struggle with accuracy outside of English, our models set the benchmark for languages regardless of dialects and accents.

Empowering Bilingual Conversations

Our pioneering bilingual models, including Spanish-English, Mandarin-English, Malay-English, and Tamil-English, are designed to capture and transcribe fluid multilingual conversations without losing accuracy when speakers code-switch. This empowers global businesses to interact effortlessly in diverse linguistic environments, removing language barriers and promoting effective communication.

The world’s most accurate, inclusive, and low-latency STT

Speechmatics remains committed to diversity and inclusion, offering the highest accuracy and lowest latency STT engine, built to understand all dialects and accents. In voice experiences, even a single misunderstanding can break the flow and disrupt the entire user journey. Our approach ensures every voice is heard and accurately represented, regardless of linguistic background or diversity.

Together, Speechmatics and Daily’s Pipecat are unlocking transformative possibilities for voice-driven interactions, demonstrating how advanced technology can deliver unmatched conversational clarity and inclusive communication on a global scale.

Start building today with Speechmatics + Pipecat!

