Jul 31, 2025 | Read time 2 min

Pipecat and Speechmatics: Building Voice Agents that know exactly ‘Who’ said ‘What’

Build smarter voice agents on Pipecat with Speechmatics speech-to-text, now with powerful speaker diarization for real-world, multi-speaker conversations.
Pipecat and Speechmatics Voice AI integration is now live
Speechmatics Editorial Team

Speechmatics is thrilled to announce our partnership with Pipecat, the open-source framework for voice (and multimodal) conversational AI. For the first time, Speechmatics brings speaker diarization to Pipecat voice agents, enabling them to understand ‘who’ said ‘what’. This unlocks valuable new use cases and helps tackle some of the toughest challenges in voice agent design. The collaboration sets a new benchmark in conversational AI, bringing an unprecedented level of clarity and speaker awareness to voice interfaces across the Pipecat community and beyond.
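
To make this concrete, here is a minimal sketch of wiring the Speechmatics STT service into a Pipecat pipeline with diarization switched on. The exact parameter names (for example `enable_diarization`) are assumptions here, not confirmed API; check the current Pipecat documentation for the precise SpeechmaticsSTTService interface.

```python
# Minimal sketch (not the official example) of a diarization-enabled
# Speechmatics STT service inside a Pipecat agent.
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService


def build_stt() -> SpeechmaticsSTTService:
    # Construct the STT service with speaker diarization turned on.
    return SpeechmaticsSTTService(
        api_key=os.environ["SPEECHMATICS_API_KEY"],
        params=SpeechmaticsSTTService.InputParams(
            language="en",
            enable_diarization=True,  # assumed flag name: tag each utterance with a speaker label
        ),
    )


# In a full agent the service sits between the transport input and the LLM,
# where transport, llm, and tts are whichever Pipecat services you use:
# pipeline = Pipeline([transport.input(), build_stt(), llm, tts, transport.output()])
```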

Why Diarization matters in Voice AI

Today’s voice AI agents are built for one-on-one interactions. But the real world isn’t that simple.

Speaker diarization - the ability to identify ‘who’ said ‘what’ - unlocks a new level of context-awareness, enabling voice agents to thrive in complex, multi-speaker environments like households, meetings, or shared workspaces.

Here are some examples of how diarization solves some of the most common (and stubborn) problems in Voice AI:

  • Interruptions from background speakers - diarization lets the agent focus on the right voice and ignore unintended input (see the sketch after this list)

  • Lack of multi-speaker understanding - diarization gives agents the ability to track and respond to individuals, not just voices - essential for context-rich conversations

  • Self-interruption in multi-device setups - when microphones and speakers sit on separate systems, an agent can mistake its own output for user input, especially when echo cancellation is weak or fails entirely
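
As an illustration of the first point, below is a hedged sketch of a Pipecat-style frame processor that gates transcripts to a single primary speaker. The `speaker_id` attribute on the transcription frame is an assumption about how the diarization label is exposed; adapt it to however your STT service actually tags frames.

```python
# Sketch: drop transcription frames attributed to background speakers so the
# LLM only ever sees the primary speaker's words.
from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class PrimarySpeakerGate(FrameProcessor):
    def __init__(self, primary_speaker: str = "S1"):
        super().__init__()
        self._primary = primary_speaker

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            speaker = getattr(frame, "speaker_id", None)  # assumed attribute name
            if speaker is not None and speaker != self._primary:
                return  # swallow the frame: the agent never hears this speaker
        await self.push_frame(frame, direction)
```

A processor like this would sit in the pipeline directly after the STT service, so everything downstream is already filtered.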

As Voice AI evolves beyond one-on-one interactions, speaker-aware intelligence becomes non-negotiable. Diarization is the key to unlocking this future.

Deploy anywhere!

Even better, this technology isn't limited to the cloud. It can run on your own machine, on-premises, on-device, and even on small embedded hardware.

The demo below showcases a Pipecat-powered voice agent using Speechmatics’ speech-to-text, with audio input and output routed through an ESP32 microcontroller. It plays a real-time Guess Who-style game, highlighting speaker diarization in action.

GitHub: Code for the Guess Who? demo

Unmatched accuracy across all languages

Beyond diarization, Speechmatics continues to redefine expectations with consistently high accuracy across more than 55 languages. While many speech-to-text (STT) providers struggle with accuracy outside English, our models set the benchmark across languages, dialects, and accents.
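
For reference, here is a minimal sketch of requesting diarized real-time transcription from Speechmatics directly with the speechmatics-python SDK. The endpoint URL, language code, and file name are illustrative; bilingual packs (next section) are selected through the same transcription config.

```python
# Sketch: real-time transcription with speaker diarization via the
# speechmatics-python SDK.
import os

from speechmatics.client import WebsocketClient
from speechmatics.models import (
    AudioSettings,
    ConnectionSettings,
    ServerMessageType,
    TranscriptionConfig,
)

client = WebsocketClient(
    ConnectionSettings(
        url="wss://eu2.rt.speechmatics.com/v2",  # illustrative region endpoint
        auth_token=os.environ["SPEECHMATICS_API_KEY"],
    )
)

conf = TranscriptionConfig(
    language="en",          # any supported language code
    diarization="speaker",  # attach a speaker label to each utterance
    enable_partials=True,
)


def print_transcript(msg):
    # Final transcript segments; with diarization on, the results also carry
    # the speaker label assigned by the engine.
    print(msg["metadata"]["transcript"])


client.add_event_handler(ServerMessageType.AddTranscript, print_transcript)

with open("example.wav", "rb") as audio:
    client.run_synchronously(audio, conf, AudioSettings())
```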

Empowering Bilingual Conversations

Our pioneering bilingual models, including Spanish-English, Mandarin-English, Malay-English, and Tamil-English, are uniquely designed to capture and transcribe fluid multilingual conversations without sacrificing accuracy when speakers code-switch. This empowers global businesses to interact effortlessly in diverse linguistic environments, removing language barriers and promoting effective communication.

The world’s most accurate, inclusive, and low-latency STT

Speechmatics remains committed to diversity and inclusion, offering the highest-accuracy, lowest-latency STT engine, built to understand all dialects and accents. In voice experiences, even a single misunderstanding can break the flow and disrupt the entire user journey. Our approach ensures every voice is heard and accurately represented, regardless of linguistic background.

Together, Speechmatics and Daily’s Pipecat are unlocking transformative possibilities for voice-driven interactions, demonstrating how advanced technology can deliver unmatched conversational clarity and inclusive communication on a global scale.

Start building today with Speechmatics + Pipecat!

