Sep 9, 2025 | Read time 4 min

How we built real-time concurrency for Voice AI at scale

Why supporting thousands of parallel sessions is more than a backend problem.
Owen O'Loan, Director of Engineering Operations

TL;DR

  • Session-based concurrency ≠ request throughput. It’s about supporting thousands of persistent audio streams in parallel, often for hours at a time.

  • It’s a product-critical experience factor, not just an engineering concern. One missed session is one failed product moment.

  • At Speechmatics, we’ve built real-time concurrency into our architecture from the start. From 100-millisecond start times to multi-day persistent sessions, we’ve learned how to scale for bursty, real-world usage.

Each week, my team at Speechmatics processes millions of hours of conversation. That includes meetings, customer service calls, medical consultations and voice assistant interactions. Much of that happens in real time, and that share is growing. Whether live or post-recorded, all of it depends on systems that can handle pressure without breaking.

To make this work, the speed and accuracy of our world-leading speech recognition models are fundamental. It's also critical that our models are available instantly whenever our customers need them: hence concurrency.

Concurrency means handling many live speech sessions at the same time, with each one starting immediately and continuing smoothly, often for hours. 

It’s the difference between a demo that runs smoothly and a product that delivers at scale. Without it, even high-performance models fall short when demand spikes.

In this post, we’ll unpack how we’ve built session-based concurrency into Speechmatics' architecture—what it is, why it matters, and what we’ve learned from supporting thousands of live voice sessions in parallel, every single day.

Why concurrency in Voice AI is a different beast

In traditional web services, concurrency usually means handling more requests per second. But in real-time voice AI, concurrency means something else entirely: supporting thousands of long-running, persistent audio streams, and doing it reliably.

When a live meeting starts, or a customer calls a helpline, the transcription needs to start instantly. There’s no loading screen, no buffer. And it needs to keep running for hours, sometimes even days, without breaking.

Request-Based Concurrency           | Session-Based Concurrency
------------------------------------|----------------------------------------
Short bursts (e.g. API calls)       | Long, persistent streams
Easy to load-balance                | Requires session state
Retry on failure is feasible        | Must remain uninterrupted
Scaling by request volume           | Scaling by number of active sessions
Stateless                           | Stateful
Low individual request duration     | Sessions can last minutes to days
Traditional backend logic applies   | Needs concurrency-aware orchestration

This is what we mean when we say concurrency is non-negotiable in voice AI.
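
To make the distinction concrete, here's a minimal sketch of what the session side looks like from a client's point of view: one persistent connection carrying audio in and transcripts out for the life of the conversation. The endpoint, handshake message, and `websockets` usage below are illustrative placeholders, not the actual Speechmatics API.

```python
import asyncio
import json

import websockets  # pip install websockets


async def stream_session(url: str, audio_chunks) -> None:
    """One persistent connection is one live session: audio flows in and
    transcripts flow back for as long as the conversation lasts."""
    async with websockets.connect(url) as ws:
        # Illustrative handshake; real APIs define their own start message.
        await ws.send(json.dumps({"message": "StartRecognition"}))

        async def send_audio():
            async for chunk in audio_chunks:   # live audio keeps arriving...
                await ws.send(chunk)           # ...as binary frames, unbuffered

        async def receive_transcripts():
            async for message in ws:           # transcripts come back on the
                print(message)                 # same long-lived connection

        # Both directions run concurrently for the lifetime of the session,
        # which may be minutes, hours, or days. If this connection drops,
        # the live session drops with it; there is no "retry the request".
        await asyncio.gather(send_audio(), receive_transcripts())
```

The key point is the last one in the table above: there is nothing to retry. The connection itself is the product experience.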

Concurrency is a product experience, not just infrastructure

At Speechmatics, many of the conversations we process are real-time sessions that span anywhere from 30 seconds to 48 hours. In fact, one of our longest sessions ran unbroken for over 100 days.

That kind of scale demands an entirely different engineering mindset. Where traditional applications expect short bursts of user activity, our workloads are long-lived, persistent, and unpredictable. We designed our infrastructure to match.

[Chart: real-time session length vs. total hours processed]

💡Session-based concurrency is fundamentally different from request-based concurrency.

💡Scaling for live speech isn’t just about handling volume but about handling time.

Scaling pains and lessons learned

The biggest lessons came from our customers.

In healthcare, session demand is tied to real-world rhythms: clinics open at 9am, emergency care spikes after hours. In media and meetings, concurrency can double in seconds, based on breaking news or a global webinar launch.

That meant we had to move beyond conservative limits. We evolved our quotas, tuned our autoscaling, and adjusted orchestration logic so customers never feel like they’re outgrowing us. Our system is designed to flex before the customer even notices a surge.

💡If you wait for demand to hit you, you’re already too late.
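
As an illustration of "flexing before the surge", here's a simplified sketch of scaling on active session count with built-in headroom. The capacity model, thresholds, and numbers are made up for explanation; they are not our production values.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    active_sessions: int    # live streams right now
    capacity_per_node: int  # concurrent sessions one STT node can hold
    nodes: int              # nodes currently running (for comparison)


def desired_nodes(pool: Pool, headroom: float = 0.3) -> int:
    """Scale out *before* the surge: keep spare session capacity in reserve.

    Scaling in is deliberately conservative because sessions are stateful;
    you cannot drain a node that is mid-conversation the way you can with
    stateless request workers.
    """
    # Capacity we want available, including headroom for a sudden burst.
    target_capacity = int(pool.active_sessions * (1 + headroom))
    needed = -(-target_capacity // pool.capacity_per_node)        # ceiling division
    # Never scale below what current sessions already require.
    in_use = -(-pool.active_sessions // pool.capacity_per_node)   # ceiling division
    return max(needed, in_use, 1)


# Example: 950 live sessions at 100 per node -> keep ~13 nodes warm, not 10.
print(desired_nodes(Pool(active_sessions=950, capacity_per_node=100, nodes=10)))
```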

How we built it

To support persistent sessions at scale, our architecture is built around a 'real-time first' principle.

From day one, we optimized our system for:

  • Ultra-low latency models and infrastructure

  • Fast, flexible session starts (often under 100 milliseconds)

  • High-availability orchestration that handles bursty demand

  • Session persistence across hours, days, or however long it takes
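
As a rough sketch of how a builder might sanity-check the second bullet from the client side, here's a small session-start latency probe. The URL and handshake message are placeholders, not the real API.

```python
import asyncio
import json
import time

import websockets  # pip install websockets


async def measure_session_start(url: str) -> float:
    """Milliseconds from 'user wants a session' to 'session ready for audio'."""
    t0 = time.perf_counter()
    async with websockets.connect(url) as ws:
        # Illustrative handshake message.
        await ws.send(json.dumps({"message": "StartRecognition"}))
        await ws.recv()  # wait for the server's "recognition started" acknowledgement
        return (time.perf_counter() - t0) * 1000


# Example (hypothetical endpoint):
# print(asyncio.run(measure_session_start("wss://example.invalid/v2/realtime")))
```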


Here’s how it works:

  1. A user requests a new session

  2. Our load balancer routes it based on current demand

  3. A concurrency-aware orchestration layer allocates the right resources

  4. A persistent STT engine instance takes over, managing the stream from start to finish
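
Step 3 is where session-based concurrency departs most from stateless load balancing. Here's a deliberately stripped-down, illustrative sketch of that allocation step; the class and method names are ours for explanation only, and the real orchestration layer also handles health checks, draining, and autoscaling.

```python
from dataclasses import dataclass, field


@dataclass
class EngineInstance:
    name: str
    max_sessions: int
    sessions: set[str] = field(default_factory=set)

    @property
    def free_slots(self) -> int:
        return self.max_sessions - len(self.sessions)


class SessionAllocator:
    """Routes each new session to the instance with the most free capacity."""

    def __init__(self, instances: list[EngineInstance]):
        self.instances = instances

    def allocate(self, session_id: str) -> EngineInstance:
        best = max(self.instances, key=lambda i: i.free_slots)
        if best.free_slots == 0:
            # In production this is the point where autoscaling adds capacity
            # rather than refusing the session.
            raise RuntimeError("no capacity: scale out before rejecting sessions")
        best.sessions.add(session_id)  # the session stays pinned here until it ends
        return best

    def release(self, session_id: str) -> None:
        for instance in self.instances:
            instance.sessions.discard(session_id)
```

Once allocated, a session stays pinned to its engine instance until it ends, which is why step 4 matters just as much as the routing decision itself.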

💡Real-world concurrency means treating every live session like a critical connection, not just a spike in throughput.

What this means for builders

If you're building a voice-powered product (anything from transcription tooling to live subtitling or agent support), concurrency should be a day-one conversation.

Here's why:

  • You can’t fake real-time.

  • Your users won’t wait.

  • Your architecture needs to scale before you scale.

Pick the wrong Speech-to-Text provider, and you’ll hit the ceiling before you hit product-market fit.
