Session-based concurrency ≠ request throughput. It’s about supporting thousands of persistent audio streams in parallel, often for hours at a time.
It’s a product-critical experience factor, not just an engineering concern. One missed session is one failed product moment.
At Speechmatics, we’ve built real-time concurrency into our architecture from the start. From 100-millisecond start times to multi-day persistent sessions, we’ve learned how to scale for bursty, real-world usage.
Each week, my team at Speechmatics processes millions of hours of conversation: meetings, customer service calls, medical consultations and voice assistant interactions. Much of that happens in real-time, and that share is growing. Whether live or post-recorded, all of it depends on systems that can handle pressure without breaking.
To make this work, the speed and accuracy of our world-leading speech recognition models are fundamental. It's also critical that our models are available instantly when our customers need them: hence concurrency.
Concurrency means handling many live speech sessions at the same time, with each one starting immediately and continuing smoothly, often for hours.
It’s the difference between a demo that runs smoothly and a product that delivers at scale. Without it, even high-performance models fall short when demand spikes.
In this post, we’ll unpack how we’ve built session-based concurrency into Speechmatics' architecture: what it is, why it matters, and what we’ve learned from supporting thousands of live voice sessions in parallel, every single day.
In traditional web services, concurrency usually means handling more requests per second. But in real-time voice AI, concurrency means something else entirely: supporting thousands of long-running, persistent audio streams, and doing it reliably.
When a live meeting starts, or a customer calls a helpline, the transcription needs to start instantly. There’s no loading screen, no buffer. And it needs to keep running for hours, sometimes even days, without breaking.
| Request-Based Concurrency | Session-Based Concurrency |
|---|---|
| Short bursts (e.g. API calls) | Long, persistent streams |
| Easy to load-balance | Requires session state |
| Retry on failure is feasible | Must remain uninterrupted |
| Scaling by request volume | Scaling by number of active sessions |
| Stateless | Stateful |
| Low individual request duration | Sessions can last minutes to days |
| Traditional backend logic applies | Needs concurrency-aware orchestration |
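The stateful column of that comparison can be made concrete with a minimal sketch. The handler below is not Speechmatics' implementation; it's an illustrative asyncio coroutine that, unlike a request handler, stays alive for a session's whole duration and carries per-session state (the growing transcript). `fake_stream` is a hypothetical stand-in for a live audio feed.

```python
import asyncio

async def handle_session(session_id: str, audio_chunks) -> list[str]:
    """One persistent session: consume chunks until the stream ends.

    The partial transcript is per-session state that must survive for
    the stream's full lifetime -- the essence of "stateful" above.
    """
    transcript = []
    async for chunk in audio_chunks:
        # A real system would feed an STT engine here; we just record
        # that the chunk passed through this session.
        transcript.append(f"{session_id}:{chunk}")
    return transcript

async def fake_stream(n: int):
    """Hypothetical stand-in for a live audio stream."""
    for i in range(n):
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield i

async def main():
    # Many such coroutines can share one event loop, because each one
    # spends most of its time waiting on audio I/O, not computing.
    return await asyncio.gather(
        *(handle_session(f"s{i}", fake_stream(3)) for i in range(5))
    )

results = asyncio.run(main())
```

The key point is that each coroutine's lifetime matches the session's, which is why retry-on-failure and stateless load balancing stop being viable tools.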
This is what we mean when we say concurrency is non-negotiable in voice AI.
At Speechmatics, many of the conversations we process are real-time sessions that span anywhere from 30 seconds to 48 hours. In fact, one of our longest sessions ran unbroken for over 100 days.
That kind of scale demands an entirely different engineering mindset. Where traditional applications expect short bursts of user activity, our workloads are long-lived, persistent, and unpredictable. We designed our infrastructure to match.
💡Session-based concurrency is fundamentally different from request-based concurrency.
💡Scaling for live speech isn’t just about handling volume but about handling time.
The biggest lessons came from our customers.
In healthcare, session demand is tied to real-world rhythms: clinics open at 9am, emergency care spikes after hours. In media and meetings, concurrency can double in seconds, based on breaking news or a global webinar launch.
That meant we had to move beyond conservative limits. We evolved our quotas, tuned our autoscaling, and adjusted our orchestration logic so customers never feel like they’re outgrowing us. Our system is designed to flex before the customer even notices a surge.
💡If you wait for demand to hit you, you’re already too late.
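One way to flex before demand hits is a headroom rule: keep spare capacity proportional to current load, so a sudden doubling lands on engines that already exist while new ones spin up. The sketch below is illustrative only; the function name, capacity figure and 50% headroom are assumptions, not Speechmatics' actual policy.

```python
import math

def engines_needed(active_sessions: int,
                   sessions_per_engine: int = 50,
                   headroom: float = 0.5,
                   min_engines: int = 2) -> int:
    """Target engine count: current sessions plus proactive headroom.

    headroom=0.5 keeps 50% spare capacity at all times, and a floor of
    min_engines keeps starts fast even when the system is idle.
    All numbers here are illustrative assumptions.
    """
    target = active_sessions * (1 + headroom) / sessions_per_engine
    return max(min_engines, math.ceil(target))

print(engines_needed(0))    # overnight lull: floor applies -> 2
print(engines_needed(400))  # 9am clinic spike -> 12
```

The design choice is that scaling is driven by a leading target (load plus headroom), not a trailing reaction to saturation: by the time utilisation alarms fire, live sessions are already degrading.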
To support persistent sessions at scale, our architecture is built around a 'real-time first' principle.
From day one, we optimized our system for:
Ultra-low latency models and infrastructure
Fast, flexible session starts (often under 100 milliseconds)
High-availability orchestration that handles bursty demand
Session persistence across hours, days, or however long it takes
Here’s how it works:
A user requests a new session
Our load balancer routes it based on current demand
A concurrency-aware orchestration layer allocates the right resources
A persistent STT engine instance takes over, managing the stream from start to finish
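The four steps above can be sketched as a toy orchestrator. This is a hedged illustration, not Speechmatics' code: class and engine names are invented, and a real system would track engine load and health rather than a simple capacity count.

```python
import itertools

class Orchestrator:
    """Toy concurrency-aware orchestration for the four steps above."""

    def __init__(self, engine_capacity: int = 2):
        self.engine_capacity = engine_capacity     # illustrative limit
        self.engines: dict[str, list[str]] = {}    # engine -> session ids
        self._ids = itertools.count()

    def start_session(self, user: str) -> tuple[str, str]:
        # Steps 1-2: a session request arrives and is routed here.
        session_id = f"{user}-{next(self._ids)}"
        # Step 3: allocate the least-loaded engine with spare capacity,
        # spinning up a new one when every engine is full.
        engine = min(
            (e for e, s in self.engines.items()
             if len(s) < self.engine_capacity),
            key=lambda e: len(self.engines[e]),
            default=None,
        )
        if engine is None:
            engine = f"engine-{len(self.engines)}"
            self.engines[engine] = []
        # Step 4: this engine owns the stream from start to finish.
        self.engines[engine].append(session_id)
        return session_id, engine

orch = Orchestrator()
placements = [orch.start_session("u") for _ in range(5)]
```

Because sessions are long-lived, placement is effectively permanent, which is why the allocation decision matters far more here than in stateless request routing.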
💡Real-world concurrency means treating every live session like a critical connection, not just a spike in throughput.
If you're building a voice-powered product (anything from transcription tooling to live subtitling or agent support) concurrency should be a day-one conversation.
Here's why:
You can’t fake real-time.
Your users won’t wait.
Your architecture needs to scale before you scale.
Pick the wrong Speech-to-Text provider, and you’ll hit the ceiling before you hit product-market fit.