May 9, 2023 | Read time 6 min

Best-in-class real-time ASR system

Speechmatics’ ASR system (Ursa) surpasses competitors such as Amazon, Microsoft, and Google in real-time accuracy even when operating at a latency setting as low as 2s, while competitors are set to their highest latency, most accurate configurations. Furthermore, Ursa maintains near batch level performance across different latency requirements.
Steve Kingsley, Data Engineer

Real-Time ASR Systems

Automatic Speech Recognition (ASR) systems have two common modes of operation that serve different use cases: batch and real-time.

In batch ASR, audio content is provided in complete files which are transcribed in their entirety, with a single transcript provided as an output. This allows the speech recognition system to fully understand the context of speech and allows for higher levels of accuracy. In contrast, real-time audio is provided as a stream of data, typically at the time of creation, and sent to the ASR system, which then returns short segments of the transcription at regular intervals. The time between the words being spoken and these segments being returned is known as the latency of the system, and it is important to minimize. Latency and accuracy trade off continuously: returning transcripts sooner reduces the amount of context available to the model, and that context is important for accuracy.
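As a rough illustration of the latency definition above, the sketch below measures, for each returned segment, the gap between the moment its last word was spoken and the moment its transcript arrived. The `Segment` type, its field names, and the example timings are hypothetical and not part of any vendor API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One returned piece of transcript (hypothetical structure)."""
    text: str
    spoken_end_s: float  # when the last word of the segment finished being spoken
    received_s: float    # when the transcript for the segment arrived back


def segment_latency(segment: Segment) -> float:
    """Latency: delay between the words being spoken and the text being returned."""
    return segment.received_s - segment.spoken_end_s


# Both timestamps are measured in seconds from the start of the stream.
segments = [
    Segment("good evening", spoken_end_s=1.2, received_s=3.0),
    Segment("and welcome to the news", spoken_end_s=3.1, received_s=5.0),
]

latencies = [segment_latency(s) for s in segments]
worst_latency = max(latencies)
```

Tracking the worst case rather than the average is one reasonable choice for captioning, since a single late segment is what a viewer notices.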

To illustrate latency in different use-cases, consider generating captions for a live spoken event. In this scenario, the user requires the text to be available with minimal delay after the speaker has uttered the words. However, for a live news broadcast, there is often a longer delay, providing more time for the captioning process. The balance between timeliness and accuracy is determined by the user's specific requirements.

Evaluating Batch versus Real-Time ASR

To show the effectiveness of our real-time engine compared to batch transcription, we transcribed six internal test sets that cover a wide range of use cases applicable to real-time, such as news reports for captions or meetings for accessibility needs. We then compared the results of transcribing in real-time with the output of batch transcription, and with the output of other real-time ASR systems. The files were streamed to each service to simulate live microphone input. Each transcript was then normalised using the open-source OpenAI Whisper normaliser to give a comparable output for word error rate (WER) scoring. WER is the standard metric used to track the mistakes made in transcription.
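For readers unfamiliar with the metric, here is a minimal sketch of WER scoring: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The `simple_normalise` function is only a stand-in (lowercasing and punctuation stripping) and is far simpler than the Whisper normaliser used in the evaluation; both function names are illustrative.

```python
import re


def simple_normalise(text: str) -> str:
    """Stand-in normaliser: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = simple_normalise("The cat sat on the mat.")
hypothesis = simple_normalise("the cat sat on mat")
wer = word_error_rate(reference, hypothesis)  # one deleted word out of six
```

Normalising both sides before scoring matters because differences in casing or punctuation would otherwise be counted as word errors.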

We previously reported the substantial accuracy improvements we obtained with our latest release, Ursa. Whereas those results focused on batch mode, in this blog we show that, with the increased scale of its neural networks, Ursa has also made significant improvements in real-time transcription across a range of latencies, offering near-batch accuracy as shown in Table 1. At the lowest latency setting (2s), Ursa achieves a WER of 11.20%, which is only an 8.5% relative degradation compared to the batch WER of 10.24%. Linking to the results in the Ursa release blog, our low-latency real-time ASR system is significantly more accurate than OpenAI Whisper, which only supports batch processing.

At Speechmatics, we prioritize real-time performance and base all our modelling decisions around it. This means that we use the same models for both batch and real-time allowing us to achieve parity in accuracy as the latency is increased to 10s. By taking this approach, we deliver the most accurate results for all ranges of latencies that might be required for your application.

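| Mode | max_delay | WER (%) | Relative degradation vs. batch |
|---|---|---|---|
| Batch | | 10.24 | |
| Real-time | 10s | 10.25 | -0.06% |
| Real-time | 5s | 10.35 | -0.97% |
| Real-time | 4s | 10.41 | -1.56% |
| Real-time | 3s | 10.55 | -2.9% |
| Real-time | 2s | 11.20 | -8.5% |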

Table 1: Word Error Rate (WER) across multiple max_delay settings measured in seconds. It also includes the relative degradation of WER as you decrease the latency compared to batch.
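The article does not state how the relative degradation is calculated. The sketch below assumes it is the difference between the batch and real-time WER divided by the real-time WER, which is an inference from the published numbers rather than a documented formula. Recomputed from the rounded WER values in Table 1, it lands within about a tenth of a percentage point of each published figure; the small gaps are consistent with the originals being computed from unrounded WERs.

```python
def relative_degradation(batch_wer: float, realtime_wer: float) -> float:
    """Assumed formula: change relative to the real-time WER, in percent."""
    return (batch_wer - realtime_wer) / realtime_wer * 100.0


batch_wer = 10.24
realtime_wer = {"10s": 10.25, "5s": 10.35, "4s": 10.41, "3s": 10.55, "2s": 11.20}

degradation = {
    delay: relative_degradation(batch_wer, wer)
    for delay, wer in realtime_wer.items()
}

for delay, pct in degradation.items():
    print(f"max_delay={delay}: {pct:.2f}%")
```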

Ursa Leads the Competition

As part of our focus on continuous improvement, we regularly compare our releases to other competitors. It can be difficult to compare settings between vendors as each uses different methods for prioritizing between fast transcription and accurate results. Therefore, we compared our fastest real-time transcription (2s max delay) with each provider’s most accurate settings, regardless of the speed of transcription, to offer them the most favorable environment when measuring WER on our test sets.

As shown in Table 2, Ursa achieves higher accuracy than the competition in settings that favour competitor products. For example, Ursa removes an extra 2 out of 5 errors compared to Amazon.

| | Speechmatics (2s max delay) | Amazon | Microsoft | Google |
|---|---|---|---|---|
| WER (%) | 11.20 | 20.19 | 13.06 | 15.48 |
| Relative Difference | | -44.55% | -14.28% | -27.68% |

Table 2: Word Error Rate (WER) averaged across six test-sets when running real-time ASR across different vendors. Speechmatics is run at the lowest latency, while other competitors are run at their most accurate and high latency configurations.
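The "2 out of 5 errors" claim can be checked directly from the WER values in Table 2. This sketch assumes the relative difference is measured against each competitor's WER, which matches the published percentages to within rounding but is not stated in the article.

```python
speechmatics_wer = 11.20
competitor_wer = {"Amazon": 20.19, "Microsoft": 13.06, "Google": 15.48}


def error_reduction(own_wer: float, other_wer: float) -> float:
    """Fraction of the other system's errors that the own system avoids."""
    return (other_wer - own_wer) / other_wer


reductions = {
    name: error_reduction(speechmatics_wer, wer)
    for name, wer in competitor_wer.items()
}

# Express each reduction as "errors removed out of every five".
errors_removed_per_five = {name: 5 * r for name, r in reductions.items()}
```

For Amazon the reduction is a little under 45%, which is slightly more than two errors removed out of every five.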

Approaches to Latency

Latency is an important factor within real-time ASR technology. To provide transcripts faster, real-time ASR systems offer provisional transcripts of segments very quickly (typically within 1 second), known as partials. This is acceptable when the display of partials can be updated as soon as a better result is available. Finals, the revised segment transcripts, then provide the best accuracy possible given the time constraints and will no longer be altered as further context is provided.
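To make the partial/final distinction concrete, here is a minimal sketch of a caption display that overwrites its provisional text whenever a new partial arrives and locks text in once a final is received. The class and method names are hypothetical and do not correspond to any vendor SDK.

```python
class CaptionDisplay:
    """Keeps committed finals plus one replaceable partial (illustrative only)."""

    def __init__(self) -> None:
        self.finals: list[str] = []
        self.partial: str = ""

    def on_partial(self, text: str) -> None:
        # Partials are provisional, so each new one replaces the previous one.
        self.partial = text

    def on_final(self, text: str) -> None:
        # Finals are never altered again, so they are appended permanently.
        self.finals.append(text)
        # Simplification: assume the final supersedes the current partial.
        self.partial = ""

    def render(self) -> str:
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)


display = CaptionDisplay()
display.on_partial("the whether")
display.on_partial("the weather today")
display.on_final("The weather today")
display.on_partial("is sun")
caption = display.render()
```

A display that cannot redraw text already shown would need to ignore partials and wait for finals, at the cost of higher perceived latency.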

For Ursa, latency is controlled with two settings, `max_delay` and `max_delay_mode`. `max_delay` is set in seconds and tells the server to return the final transcript for each segment within the specified number of seconds. This allows you to control the ASR’s latency and accuracy for your specific use case. The other setting, `max_delay_mode`, optionally allows the system to achieve slightly higher accuracy by waiting a little longer than the `max_delay` setting when it detects that it is partway through an entity.
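As a sketch of how these two settings might be assembled into a request, the helper below builds a small configuration dictionary. Only the setting names `max_delay` and `max_delay_mode` come from this article; the helper name, the surrounding dictionary, and the mode values `"fixed"` and `"flexible"` are assumptions that should be checked against the Speechmatics real-time API documentation.

```python
import json


def make_latency_config(max_delay: float, allow_entity_wait: bool = True) -> dict:
    """Build a latency configuration (hypothetical helper).

    allow_entity_wait=True selects the mode that may exceed max_delay slightly
    when the engine is partway through an entity.
    """
    if max_delay <= 0:
        raise ValueError("max_delay must be a positive number of seconds")
    return {
        "max_delay": max_delay,
        # Assumed mode values; verify against the vendor documentation.
        "max_delay_mode": "flexible" if allow_entity_wait else "fixed",
    }


# Lowest-latency setting used in the comparison: finals within 2 seconds.
low_latency = make_latency_config(max_delay=2)
# Accuracy-oriented setting: finals within 10 seconds, no extra waiting.
high_accuracy = make_latency_config(max_delay=10, allow_entity_wait=False)

payload = json.dumps(low_latency)
```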

Each of the other vendors offers distinct features for managing transcription partials. Amazon Transcribe employs a stabilization approach, marking partials as stable or unstable*. For our evaluation, we only utilized results that were deemed stable. Microsoft’s Azure, on the other hand, allows users to set a numeric threshold for partial confirmation before returning results†. In our comparison, we set the threshold to 3 and used only the results from the 'recognized' event for our calculations. Meanwhile, Google’s Cloud Speech Service provides the option for interim results‡, which deliver quicker results that may be subject to change. During our test, we opted not to use this setting, ensuring that only finalized results for each utterance were considered.

Final Thoughts

Speechmatics’ latest release, Ursa, shows only an 8.5% relative degradation in WER at a 2-second latency compared to batch processing, shrinking to near zero (0.06%) as the latency is increased to 10s. Ursa is more accurate than major speech-to-text vendors such as Amazon, Microsoft, and Google, even when Ursa runs at its lowest latency and the competitors run at their most accurate, highest-latency settings. This demonstrates our commitment to delivering state-of-the-art, low-latency ASR systems without sacrificing accuracy, which makes us the top performer for demanding real-time applications.

Footnotes

* Amazon Transcribe stabilization returns results of partials flagged as stable or not. This feature allows you to show transcription parts that are subject to change and requires you to update the transcription at the point it’s flagged as stable.

† Azure offers a threshold for the partials before they are returned. You set a numeric value to determine the number of times the result has been confirmed before it’s sent back.

‡ For the cloud speech service, you have the option to set interim_results to true to allow you to get results back faster that again may be subject to change.
Author: Steve Kingsley
Acknowledgements: Benedetta Cevoli, John Hughes, Liam Steadman and Stuart Wood
