Jul 31, 2025 | Read time 2 min

Pipecat and Speechmatics: Building Voice Agents that know exactly ‘Who’ said ‘What’

Build smarter voice agents on Pipecat with Speechmatics speech-to-text, now with powerful speaker diarization for real-world, multi-speaker conversations.
Pipecat and Speechmatics Voice AI integration is now live
Speechmatics Editorial Team

Speechmatics is thrilled to announce our partnership with Pipecat, the open-source framework for voice (and multimodal) conversational AI. For the first time, Speechmatics brings speaker diarization to Pipecat voice agents, enabling them to understand ‘who’ said ‘what’. This unlocks valuable new use cases and helps tackle some of the toughest challenges in voice agent design. The collaboration sets a new benchmark in conversational AI, bringing an unprecedented level of clarity and speaker awareness to voice interfaces across the Pipecat community and beyond.
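
To make this concrete, here is a minimal sketch of wiring the Speechmatics STT service into a Pipecat pipeline with diarization switched on. The exact parameter names (for example `enable_diarization`) are assumptions here, not confirmed API; check the current Pipecat documentation for the precise SpeechmaticsSTTService interface.

```python
# Minimal sketch (not the official example) of a diarization-enabled
# Speechmatics STT service inside a Pipecat agent.
import os

from pipecat.pipeline.pipeline import Pipeline
from pipecat.services.speechmatics.stt import SpeechmaticsSTTService


def build_stt() -> SpeechmaticsSTTService:
    # Construct the STT service with speaker diarization turned on.
    return SpeechmaticsSTTService(
        api_key=os.environ["SPEECHMATICS_API_KEY"],
        params=SpeechmaticsSTTService.InputParams(
            language="en",
            enable_diarization=True,  # assumed flag name: tag each utterance with a speaker label
        ),
    )


# In a full agent the service sits between the transport input and the LLM,
# where transport, llm, and tts are whichever Pipecat services you use:
# pipeline = Pipeline([transport.input(), build_stt(), llm, tts, transport.output()])
```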

Why Diarization matters in Voice AI

Today’s voice AI agents are built for one-on-one interactions. But the real world isn’t that simple.

Speaker diarization - the ability to identify ‘who’ said ‘what’ - unlocks a new level of context-awareness, enabling voice agents to thrive in complex, multi-speaker environments like households, meetings, or shared workspaces.

Here are some examples of how diarization solves some of the most common (and stubborn) problems in Voice AI:

  • Interruptions from background speakers - diarization lets the agent focus on the right voice and ignore unintended input (see the sketch after this list)

  • Lack of multi-speaker understanding - diarization gives agents the ability to track and respond to individuals, not just voices - essential for context-rich conversations

  • Self-interruption in multi-device setups - when microphones and speakers sit on separate systems, an agent can mistake its own output for user input, especially when echo cancellation is weak or fails entirely
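
As an illustration of the first point, below is a hedged sketch of a Pipecat-style frame processor that gates transcripts to a single primary speaker. The `speaker_id` attribute on the transcription frame is an assumption about how the diarization label is exposed; adapt it to however your STT service actually tags frames.

```python
# Sketch: drop transcription frames attributed to background speakers so the
# LLM only ever sees the primary speaker's words.
from pipecat.frames.frames import Frame, TranscriptionFrame
from pipecat.processors.frame_processor import FrameDirection, FrameProcessor


class PrimarySpeakerGate(FrameProcessor):
    def __init__(self, primary_speaker: str = "S1"):
        super().__init__()
        self._primary = primary_speaker

    async def process_frame(self, frame: Frame, direction: FrameDirection):
        await super().process_frame(frame, direction)
        if isinstance(frame, TranscriptionFrame):
            speaker = getattr(frame, "speaker_id", None)  # assumed attribute name
            if speaker is not None and speaker != self._primary:
                return  # swallow the frame: the agent never hears this speaker
        await self.push_frame(frame, direction)
```

A processor like this would sit in the pipeline directly after the STT service, so everything downstream is already filtered.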

As Voice AI evolves beyond one-on-one interactions, speaker-aware intelligence becomes non-negotiable. Diarization is the key to unlocking this future.

Deploy anywhere!

Even better, this technology isn't limited to the cloud. It can run on your own machine, on-premises, on-device, and even on small embedded hardware.

The demo below showcases a Pipecat-powered voice agent using Speechmatics’ speech-to-text, with audio input and output routed through an ESP32 microcontroller. It plays a real-time Guess Who-style game, highlighting speaker diarization in action.

GitHub: Code for the Guess Who? demo

Unmatched accuracy across all languages

Beyond diarization, Speechmatics continues to redefine expectations with consistently high accuracy across more than 55 languages. While many speech-to-text (STT) providers struggle with accuracy outside English, our models set the benchmark across languages, dialects, and accents.
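
For reference, here is a minimal sketch of requesting diarized real-time transcription from Speechmatics directly with the speechmatics-python SDK. The endpoint URL, language code, and file name are illustrative; bilingual packs (next section) are selected through the same transcription config.

```python
# Sketch: real-time transcription with speaker diarization via the
# speechmatics-python SDK.
import os

from speechmatics.client import WebsocketClient
from speechmatics.models import (
    AudioSettings,
    ConnectionSettings,
    ServerMessageType,
    TranscriptionConfig,
)

client = WebsocketClient(
    ConnectionSettings(
        url="wss://eu2.rt.speechmatics.com/v2",  # illustrative region endpoint
        auth_token=os.environ["SPEECHMATICS_API_KEY"],
    )
)

conf = TranscriptionConfig(
    language="en",          # any supported language code
    diarization="speaker",  # attach a speaker label to each utterance
    enable_partials=True,
)


def print_transcript(msg):
    # Final transcript segments; with diarization on, the results also carry
    # the speaker label assigned by the engine.
    print(msg["metadata"]["transcript"])


client.add_event_handler(ServerMessageType.AddTranscript, print_transcript)

with open("example.wav", "rb") as audio:
    client.run_synchronously(audio, conf, AudioSettings())
```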

Empowering Bilingual Conversations

Our pioneering bilingual models, including Spanish-English, Mandarin-English, Malay-English, and Tamil-English, are uniquely designed to capture and transcribe fluid multilingual conversations without sacrificing accuracy when speakers code-switch. This empowers global businesses to interact effortlessly in diverse linguistic environments, removing language barriers and promoting effective communication.

The world’s most accurate, inclusive, and low-latency STT

Speechmatics remains committed to diversity and inclusion, offering the highest-accuracy, lowest-latency STT engine, built to understand all dialects and accents. In voice experiences, even a single misunderstanding can break the flow and disrupt the entire user journey. Our approach ensures every voice is heard and accurately represented, regardless of linguistic background.

Together, Speechmatics and Daily’s Pipecat are unlocking transformative possibilities for voice-driven interactions, demonstrating how advanced technology can deliver unmatched conversational clarity and inclusive communication on a global scale.

Start building today with Speechmatics + Pipecat!

