An AI agent orchestration platform is the backbone of modern voice and conversational systems, connecting speech, reasoning, and action so multiple AI agents can work together seamlessly.
In 2025, the term “voice agent” or “voice AI agent” no longer refers to a narrow category of bots with scripted responses.
It now spans a wide range of real-time AI-powered systems that handle spoken input: interpreting, responding and routing conversations across industries.
Whether embedded in customer service flows, voice-driven apps, or interactive voice response systems, voice agents are increasingly designed to mirror natural human dialogue while handling complex back-end tasks.
But as this new generation of voice agents becomes more sophisticated, so too does the decision around how to build them.
While everyone’s talking about the latest LLM or smart prompting strategy, the most critical choice often goes overlooked: the platform you use to orchestrate the entire experience.
Whether you’re a solo developer experimenting on weekends or you’re deploying across an enterprise-grade stack, the orchestration layer is what defines the voice experience.
It governs latency, interruptions, session control, speaker switching, and ultimately, how ‘human’ your voice agent feels.
This article breaks down the top platforms for voice agent development in 2025, explains what each is best suited for, and helps you choose the right foundation.
Because building great agents starts with choosing great infrastructure.
Best for: Developers who want full control and stable enterprise-grade builds
Pipecat is gaining traction as one of the most technically robust options for voice agent builders.
Created by the team behind Daily, it offers deep control and enterprise-grade reliability for teams deploying real-time agents.
Key features:
Cascaded architecture with built-in orchestration
WebRTC-ready for low-latency streaming
Partial-transcript support, function calling, and LLM triggers
Enterprise partnership momentum (e.g. NVIDIA)
Developer community: Active and enterprise-focused, with 100M+ hours of annual voice traffic. Discord support, GitHub activity, and solid documentation make this a confident build environment.
STT integration: Speechmatics integration with end-of-turn detection, diarization, and latency optimization. A strong fit for developers who want to own the stack.
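To make the cascaded architecture concrete, here is a minimal sketch of a Pipecat pipeline. The transport, stt, llm and tts objects are placeholders for whichever providers you wire in, and module paths vary between Pipecat releases, so treat this as illustrative rather than copy-paste ready.

```python
# A minimal sketch of Pipecat's cascaded pipeline. Module paths and service
# names differ across Pipecat versions -- check the docs for your release.
import asyncio

from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask


async def run_agent(transport, stt, llm, tts):
    # Cascaded flow: user audio -> STT -> LLM -> TTS -> audio back out.
    pipeline = Pipeline([
        transport.input(),   # WebRTC audio in from the caller
        stt,                 # speech-to-text service of your choice
        llm,                 # LLM service that decides what to say
        tts,                 # text-to-speech service
        transport.output(),  # synthesized audio back to the caller
    ])
    await PipelineRunner().run(PipelineTask(pipeline))

# asyncio.run(run_agent(transport, stt, llm, tts)) once the services are
# constructed from your chosen providers.
```

The cascaded design is the point: each stage is a swappable processor in one pipeline, which is what gives you control over latency and interruptions at every hop.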
Best for: Scalable B2C workflows and community-backed reliability
LiveKit is a full-stack, open-source voice/video infrastructure with strong voice agent capabilities.
Its agent framework is ideal for products needing scale, stability and a smooth WebRTC-based experience.
Key features:
WebRTC-first design, ideal for global reach
Turn-taking and state management
Agent SDK with plugin support
Developer community: 13,000+ developers in Slack. A major player for audio/video infra in the open-source ecosystem.
STT integration: Speechmatics plugin available with diarization and real-time improvements. Ideal for polished, multi-device voice agents.
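As a rough sketch of what the agent SDK looks like in practice, the worker below follows the shape of LiveKit's Python quickstart. The exact AgentSession API differs between SDK versions, and the stt, llm and tts arguments are placeholders for your chosen plugins.

```python
# Sketch of a LiveKit Agents worker (Python SDK). The API surface changes
# between versions -- verify against the current LiveKit Agents docs.
from livekit import agents
from livekit.agents import Agent, AgentSession


async def entrypoint(ctx: agents.JobContext):
    await ctx.connect()
    session = AgentSession(
        stt=...,  # e.g. the Speechmatics plugin with diarization enabled
        llm=...,  # your LLM plugin
        tts=...,  # your TTS plugin
    )
    await session.start(
        room=ctx.room,
        agent=Agent(instructions="You are a helpful voice assistant."),
    )


if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```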
Best for: Fast prototyping and solo builders
Vapi offers a fast-moving platform with a lightweight plugin model, making it a strong option for individual developers and small teams looking to prototype quickly.
Key features:
Simple API for real-time audio to LLM workflows
Built-in support for ElevenLabs, OpenAI and more
Modular and extensible with low switching costs
Developer community: Vibrant and fast-growing, with 17,000+ users on Discord. Especially popular with indie hackers and early-stage startups.
STT integration: Speechmatics' enhanced model is now available, making Vapi an easy place to start building voice agents that can later be pushed into production.
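To illustrate the plugin model, here is a hedged sketch of creating an assistant over Vapi's REST API. The endpoint and payload fields are assumptions based on Vapi's public docs at the time of writing, so verify them against the current reference before use.

```python
# Illustrative only: creating a Vapi assistant via its REST API. The
# endpoint and payload fields below are assumptions -- check Vapi's docs.
import requests

resp = requests.post(
    "https://api.vapi.ai/assistant",  # assumed create-assistant endpoint
    headers={"Authorization": "Bearer YOUR_VAPI_API_KEY"},
    json={
        "name": "weekend-demo-agent",
        "transcriber": {"provider": "deepgram"},             # swap in your STT
        "model": {"provider": "openai", "model": "gpt-4o"},  # LLM choice
        "voice": {"provider": "11labs"},                     # e.g. ElevenLabs
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["id"])  # assistant ID used when starting calls
```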
Best for: Contact center simulation and sales workflows
Retell.ai provides vertical integration across call simulation, routing, and memory logic—making it well-suited for voice agents in customer service and sales.
Key features:
Built-in testing and demo tools
Agent logic and memory management
Suitable for recorded or real-time calls
Developer community: Moderate but growing. Used in sales automation, contact centers and regulated industries.
Best for: Low-code/no-code teams entering voice UX
Synthflow is a low-code environment designed to make voice agent building more accessible. Its drag-and-drop interface allows quick assembly of logic for non-technical teams.
Key features:
No-code interface
Simple integration with major voice providers
Pre-built voice UX templates
Developer community: Quiet but growing. Strong appeal for design-led and CX teams.
Best for: Natural, expressive voice output and seamless agent responses
ElevenLabs has evolved beyond text-to-speech into a full multimodal voice AI platform. A solid choice for teams prioritizing lifelike audio and fluid conversational flow between human and machine.
Key features:
Ultra-realistic voices and streaming TTS
Low-latency responses for real-time voice agents
Voice cloning and emotional range support
API-level control for integration into any orchestration framework
Developer community: Massive and creator-led, with an expanding developer SDK and plugin ecosystem. Perfect for builders designing polished conversational experiences.
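For a flavor of that API-level control, here is a hedged sketch using the elevenlabs Python SDK. Method and parameter names vary across SDK versions, and the voice_id and model_id values are placeholders.

```python
# Hedged sketch: text-to-speech with the elevenlabs Python SDK. Method and
# parameter names vary across SDK versions -- check the current docs.
from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_ELEVENLABS_API_KEY")

# Convert text to speech; voice_id and model_id here are placeholders.
audio = client.text_to_speech.convert(
    voice_id="YOUR_VOICE_ID",
    text="Thanks for calling. How can I help today?",
    model_id="eleven_multilingual_v2",
)

with open("reply.mp3", "wb") as f:
    for chunk in audio:  # the SDK returns audio as an iterator of bytes
        f.write(chunk)
```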
Best for: Integrated voice routing and modern contact center automation
Layercode provides a new generation of voice infrastructure—bridging the gap between agentic AI and telephony. Built for developers modernizing contact center and support flows, it combines low-latency orchestration with flexible API design.
Key features:
Realtime call routing and voice API integration
CRM-ready with modern, AI-first architecture
Secure, reliable infrastructure for enterprise workloads
Excellent WebRTC and SIP support
Developer community: Emerging but highly technical. Backed by open SDKs and strong documentation for integration into complex workflows.
There’s no one-size-fits-all answer. Your ideal platform depends on what you’re building, how much control you want, and the type of team you’re working with. Here’s a breakdown.
| You are... | You want... | Best platform to explore |
| --- | --- | --- |
| A solo builder or side-project hacker | To move fast, test voice logic and build demos | Vapi |
| A product engineer at a growing startup | To build robust real-time agents with deep control | Pipecat |
| Building for consumer-scale or B2C UX | Stable infrastructure and flexibility across devices | LiveKit |
| Running a contact centre or sales use case | Full-stack demo tooling and agent management | Retell.ai |
| A non-technical team building internal tools | No-code setup and simple voice automation | Synthflow |
| Creating expressive, human-like voice agents | Real-time speech synthesis and TTS | ElevenLabs |
| Working inside an existing call stack | Integration into telephony and CRM infrastructure | Layercode |
Before locking into a framework, assess what your agent actually needs to succeed. These six areas can guide your thinking.
1. Latency requirements
Will your agent need fast turn-taking and responsiveness?
Are partials important for mid-sentence actions?
2. Customization needs
Can you bring your own STT, LLM or memory system?
Does it allow session-specific parameters or plug-ins?
3. Audio handling
Does it use WebRTC or WebSockets for stability?
How does it perform in noisy or mobile environments?
4. Multilingual support
Will your users switch languages in one conversation?
Are dialects or accents accurately recognized?
5. Speaker diarization
Do you need to track who is speaking at any moment?
Does the platform or plugin handle speaker handoff?
6. Developer experience
Is documentation up to date?
Are GitHub issues resolved quickly?
Speech recognition is the foundation of every voice agent.
To understand the architecture behind a modern voice agent, it's useful to trace how spoken input travels through the system: voice capture, speech-to-text (STT), agent logic and LLM reasoning, then spoken output via text-to-speech. The STT layer plays a crucial role in determining overall performance, responsiveness, and accuracy.
The STT layer handles everything before the agent logic kicks in. When it’s accurate and responsive, it powers smooth, human-like interactions. When it fails, the agent fumbles.
| Issue | What breaks | What to prioritize |
| --- | --- | --- |
| Slow final transcripts | Delayed replies | Use low-latency STT with partials |
| Missed interruptions | Awkward turn-taking | Enable end-of-turn detection |
| Wrong speaker attribution | Broken session logic | Diarization trained for real-time use |
| Accent or dialect confusion | Misinterpreted intent | Diverse language models with accent handling |
| Disfluency misreads | Early cutoffs or skipped meaning | STT tuned for conversational context |
From timing to tone, language switching to background noise, the STT layer determines whether your agent responds naturally or fails before it starts.
Speechmatics provides the speech recognition infrastructure behind some of the world’s most advanced voice applications. Our real-time API is designed to perform under pressure, whether that’s mid-sentence language switches, speaker interruptions, or live customer interactions.
Real-time transcription with <300ms latency
Bilingual and multilingual support across 55+ languages
Advanced speaker diarization
End-of-turn detection built on context, timing and disfluency
Accurate partials for live responsiveness
Custom dictionaries and session-specific formatting
Deployment flexibility: cloud or on-premises
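For a sense of how this looks in code, here is a minimal real-time transcription sketch using the speechmatics-python SDK. Configuration names can differ between SDK versions, so check the current docs; the endpoint URL and audio file are placeholders.

```python
# Minimal real-time transcription sketch with the speechmatics-python SDK
# (parameter names may differ across SDK versions -- check the docs).
from speechmatics.client import WebsocketClient
from speechmatics.models import (
    AudioSettings, ConnectionSettings, ServerMessageType, TranscriptionConfig,
)

client = WebsocketClient(ConnectionSettings(
    url="wss://eu2.rt.speechmatics.com/v2",  # real-time endpoint
    auth_token="YOUR_API_KEY",
))

# Low-latency config: partials for responsiveness, speaker diarization on.
config = TranscriptionConfig(
    language="en",
    enable_partials=True,
    diarization="speaker",
    max_delay=1.0,
)

# Print partial transcripts as they arrive, finals when segments complete.
client.add_event_handler(
    ServerMessageType.AddPartialTranscript,
    lambda msg: print("partial:", msg["metadata"]["transcript"]),
)
client.add_event_handler(
    ServerMessageType.AddTranscript,
    lambda msg: print("final:", msg["metadata"]["transcript"]),
)

with open("audio.wav", "rb") as audio:
    client.run_synchronously(audio, config, AudioSettings())
```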
We integrate into the most popular frameworks and continue to expand support:
LiveKit: plugin supports diarization, partials and end-of-turn
Pipecat: integration in development, with enterprise-first design
Vapi: updated configuration and improved documentation underway
Flow: internal demo platform showcasing best-practice voice agents
Global brands rely on Speechmatics to deliver accurate, inclusive and responsive speech recognition at scale.
Want to try it out? Head to our portal to see how your agents could perform with the right STT foundation. Try for free.
A voice agent is an AI system designed to interact with people through spoken language. It listens, processes and responds in real time, acting as an interface between human intent and machine capability. Think of it as a digital assistant, concierge, or customer service rep that communicates purely through voice.
Voice agents carry out tasks based on spoken input. That could mean answering questions, booking appointments, taking orders, resolving support issues or guiding users through a process. Increasingly, they're embedded across industries, from retail and travel to banking and healthcare.
They rely on a chain of technologies: speech-to-text (STT) to transcribe what’s said, an orchestration layer to determine what to do with it, a large language model (LLM) to generate a response, and text-to-speech (TTS) to speak back. The smoother this flow, the more natural the interaction feels.
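The schematic below captures that chain in code. Every function is a hypothetical stub standing in for a real provider call; it exists only to show the order of operations in a single turn.

```python
# Schematic of the STT -> orchestration -> LLM -> TTS chain described above.
# Every function here is a hypothetical stub, not a real provider API.
def transcribe(audio: bytes) -> str:
    return "book a table for two"     # stand-in for a real STT call

def generate_reply(text: str) -> str:
    return f"Sure, handling: {text}"  # stand-in for an LLM call

def synthesize(text: str) -> bytes:
    return text.encode()              # stand-in for a real TTS call

def handle_turn(audio: bytes) -> bytes:
    text = transcribe(audio)          # 1. speech-to-text
    reply = generate_reply(text)      # 2. orchestration + LLM response
    return synthesize(reply)          # 3. text-to-speech back to the user

print(handle_turn(b"..."))
```

Each hop adds latency, which is why the orchestration platforms above compete so hard on streaming and partial results between stages.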
Voice agents can be real-time (think live phone calls or drive-thru bots), asynchronous (e.g. voice note-based replies), embedded in physical hardware (like kiosks or smart devices), or purely software-based (like in-app support). Some are transactional, others conversational. Some are powered by scripts, others by LLMs.
Common challenges for voice agent builders include:
Maintaining low latency in noisy environments
Handling multiple speakers or overlapping speech
Accurately detecting intent and context
Managing diverse accents, dialects and languages
Designing fallback logic when things go wrong
👉 Build your voice agent now on the Speechmatics Portal