Sep 9, 2025 | Read time 4 min

How we built real-time concurrency for Voice AI at scale

Why supporting thousands of parallel sessions is more than a backend problem.

Owen O'Loan, Director of Engineering Operations

TL;DR

  • Session-based concurrency ≠ request throughput. It’s about supporting thousands of persistent audio streams in parallel, often for hours at a time.

  • It’s a product-critical experience factor, not just an engineering concern. One missed session is one failed product moment.

  • At Speechmatics, we’ve built real-time concurrency into our architecture from the start. From 100 millisecond start times to multi-day persistent sessions, we’ve learned how to scale for bursty, real-world usage.

Each week, my team at Speechmatics processes millions of hours of conversation. That includes meetings, customer service calls, medical consultations and voice assistant interactions. Much of that happens in real-time, and that share is growing. Whether live or post-recorded, all of it depends on systems that can handle pressure without breaking.

To make this work, the speed and accuracy of our world-leading speech recognition models are fundamental. It's also critical to ensure that our models are available instantly when our customers need them: hence concurrency.

Concurrency means handling many live speech sessions at the same time, with each one starting immediately and continuing smoothly, often for hours. 

It’s the difference between a demo that runs smoothly and a product that delivers at scale. Without it, even high-performance models fall short when demand spikes.

In this post, we’ll unpack how we’ve built session-based concurrency into Speechmatics' architecture—what it is, why it matters, and what we’ve learned from supporting thousands of live voice sessions in parallel, every single day.

Why concurrency in Voice AI is a different beast

In traditional web services, concurrency usually means handling more requests per second. But in real-time voice AI, concurrency means something else entirely: supporting thousands of long-running, persistent audio streams, and doing it reliably.

When a live meeting starts, or a customer calls a helpline, the transcription needs to start instantly. There’s no loading screen, no buffer. And it needs to keep running for hours, sometimes even days, without breaking.

| Request-Based Concurrency | Session-Based Concurrency |
| --- | --- |
| Short bursts (e.g. API calls) | Long, persistent streams |
| Easy to load-balance | Requires session state |
| Retry on failure is feasible | Must remain uninterrupted |
| Scaling by request volume | Scaling by number of active sessions |
| Stateless | Stateful |
| Low individual request duration | Sessions can last minutes to days |
| Traditional backend logic applies | Needs concurrency-aware orchestration |

This is what we mean when we say concurrency is non-negotiable in voice AI.
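The distinction above can be made concrete with a toy model. The sketch below (hypothetical names, not Speechmatics' actual code) shows why capacity in a session-based system is measured in simultaneously held slots rather than requests per second: a live stream occupies its slot for its entire lifetime.

```python
import time

class SessionPool:
    """Toy model of session-based concurrency: capacity is counted in
    simultaneously active streams, not requests per second."""

    def __init__(self, max_sessions: int):
        self.max_sessions = max_sessions
        self.active = {}  # session_id -> start timestamp

    def start(self, session_id: str) -> bool:
        # A new live stream is admitted only if a slot is free; unlike a
        # short request, it then holds that slot until the stream ends.
        if len(self.active) >= self.max_sessions:
            return False
        self.active[session_id] = time.monotonic()
        return True

    def end(self, session_id: str) -> float:
        # Releasing the slot returns how long the stream was held.
        started = self.active.pop(session_id)
        return time.monotonic() - started

pool = SessionPool(max_sessions=2)
assert pool.start("meeting-1")
assert pool.start("call-1")
assert not pool.start("call-2")   # at capacity: third stream is refused
pool.end("meeting-1")
assert pool.start("call-2")       # slot freed, new stream admitted
```

Retrying a failed request is cheap; "retrying" a dropped hour-long stream is a failed product moment, which is why the slots above must stay held and healthy.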

Concurrency is a product experience, not just infrastructure

At Speechmatics, many of the conversations we process are real-time sessions that span anywhere from 30 seconds to 48 hours. In fact, one of our longest sessions ran unbroken for over 100 days.

That kind of scale demands an entirely different engineering mindset. Where traditional applications expect short bursts of user activity, our workloads are long-lived, persistent, and unpredictable. We designed our infrastructure to match.

[Chart: Real-time session length vs total hours processed]

💡Session-based concurrency is fundamentally different from request-based concurrency.

💡Scaling for live speech isn’t just about handling volume but about handling time.

Scaling pains and lessons learned

The biggest lessons came from our customers.

In healthcare, session demand is tied to real-world rhythms: clinics open at 9am, emergency care spikes after hours. In media and meetings, concurrency can double in seconds, based on breaking news or a global webinar launch.

That meant we had to move beyond conservative limits. We evolved our quotas, tuned our autoscaling, and adjusted orchestration logic so customers never feel like they’re outgrowing us. Our system is designed to flex before the customer even notices a surge.

💡If you wait for demand to hit you, you’re already too late.
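One way to read "flex before the customer notices" is a headroom-based scaling policy: add capacity whenever free session slots fall below a threshold, rather than waiting for saturation. The sketch below is illustrative only; the thresholds and step size are assumptions, not Speechmatics' actual autoscaling parameters.

```python
def target_capacity(active_sessions: int, capacity: int,
                    headroom: float = 0.3, step: int = 10) -> int:
    """Scale up *before* demand hits the ceiling: whenever free slots
    drop below the headroom fraction of capacity, add a fixed step.
    (Hypothetical policy for illustration.)"""
    free = capacity - active_sessions
    if free < capacity * headroom:
        return capacity + step
    return capacity

# Plenty of headroom (50 free slots >= 30): hold capacity steady.
assert target_capacity(active_sessions=50, capacity=100) == 100
# Headroom breached (25 free slots < 30): scale up before saturation.
assert target_capacity(active_sessions=75, capacity=100) == 110
```

Because sessions are long-lived, active-session counts decay slowly; a burst that would be invisible in a request-rate metric can pin capacity for hours, which is why the trigger here is free slots, not request volume.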

How we built it

To support persistent sessions at scale, our architecture is built around a 'real-time first' principle.

From day one, we optimized our system for:

  • Ultra-low latency models and infrastructure

  • Fast, flexible session starts (often under 100 milliseconds)

  • High-availability orchestration that handles bursty demand

  • Session persistence across hours, days, or however long it takes


Here’s how it works:

  1. A user requests a new session

  2. Our load balancer routes it based on current demand

  3. A concurrency-aware orchestration layer allocates the right resources

  4. A persistent STT engine instance takes over, managing the stream from start to finish
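The four steps above can be sketched as a minimal asyncio flow. All names here (SttEngine, Orchestrator) are hypothetical stand-ins for illustration, not Speechmatics' real components; the "recognition" step is simulated.

```python
import asyncio

class SttEngine:
    """Step 4: a persistent engine instance owns one stream start to finish."""
    def __init__(self, name: str):
        self.name = name
        self.busy = False

    async def run_session(self, chunks):
        self.busy = True
        transcript = []
        async for chunk in chunks:
            transcript.append(chunk.upper())  # stand-in for real recognition
        self.busy = False
        return " ".join(transcript)

class Orchestrator:
    """Step 3: concurrency-aware allocation picks an idle engine, or rejects."""
    def __init__(self, engines):
        self.engines = engines

    def allocate(self):
        for engine in self.engines:
            if not engine.busy:
                return engine
        return None

async def audio(*chunks):
    # Simulated live audio stream.
    for c in chunks:
        yield c

async def main():
    orch = Orchestrator([SttEngine("engine-a"), SttEngine("engine-b")])
    engine = orch.allocate()  # steps 1-3: session requested, routed, allocated
    return await engine.run_session(audio("hello", "world"))  # step 4

print(asyncio.run(main()))  # HELLO WORLD
```

The key property the sketch preserves is ownership: once allocated, one engine instance holds the stream for its full lifetime, so load balancing happens at admission time, not per chunk.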

💡Real-world concurrency means treating every live session like a critical connection, not just a spike in throughput.

What this means for builders

If you're building a voice-powered product (anything from transcription tooling to live subtitling or agent support) concurrency should be a day-one conversation.

Here's why:

  • You can’t fake real-time.

  • Your users won’t wait.

  • Your architecture needs to scale before you scale.

Pick the wrong Speech-to-Text provider, and you’ll hit the ceiling before you hit product-market fit.
