Real-Time ASR Systems
Automatic Speech Recognition (ASR) systems have two common modes of operation, batch and real-time, each suited to different use cases.
In batch ASR, audio is provided as complete files, each transcribed in its entirety to produce a single transcript. This gives the speech recognition system the full context of the speech and allows for higher accuracy. In contrast, real-time audio is provided as a stream of data, typically as it is created, and sent to the ASR system, which returns short segments of the transcript at regular intervals. The time between words being spoken and the corresponding segments being returned is known as the latency of the system, and it is important to minimize. There is a continuous trade-off between latency and accuracy: returning transcripts more quickly reduces the amount of context available to the model, and that context is important for accuracy.
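To make the definition of latency concrete, here is a minimal Python sketch. It assumes you already know when each word ended in the audio and when its transcript segment arrived, both in seconds from the start of the stream. The function name and inputs are hypothetical and are not part of any vendor API.

```python
from statistics import mean


def word_latencies(word_end_times, arrival_times):
    """Return per-word latency in seconds.

    Both inputs are assumed to be lists of seconds measured from the
    start of the stream, aligned so that index i refers to the same word.
    """
    if len(word_end_times) != len(arrival_times):
        raise ValueError("inputs must have the same length")
    return [arrival - end for end, arrival in zip(word_end_times, arrival_times)]


# Two words ending at 1.0s and 2.0s whose text arrived at 2.5s and 4.0s.
latencies = word_latencies([1.0, 2.0], [2.5, 4.0])
average_latency = mean(latencies)
worst_latency = max(latencies)
```

Tracking the worst case as well as the average is useful for captioning, where a single late segment is noticeable to the viewer.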
To illustrate latency in different use-cases, consider generating captions for a live spoken event. In this scenario, the user requires the text to be available with minimal delay after the speaker has uttered the words. However, for a live news broadcast, there is often a longer delay, providing more time for the captioning process. The balance between timeliness and accuracy is determined by the user's specific requirements.
Evaluating Batch versus Real-Time ASR
To show the effectiveness of our real-time engine compared to batch transcription, we transcribed six internal test sets that cover a wide range of use cases applicable to real-time, such as news reports for captions or meetings for accessibility needs. We then compared the real-time results with the output of batch transcription and with the output of other real-time ASR systems. The files were streamed to each service, simulating input from a streaming microphone. Each transcript was then normalised using the open-source OpenAI Whisper normaliser to give a comparable output for word error rate (WER) scoring. WER is the standard metric used to track the mistakes made in transcription.
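As a rough illustration of WER scoring, the sketch below computes the word-level edit distance (substitutions, deletions and insertions) between a reference and a hypothesis, divided by the number of reference words. The `normalise` helper is a deliberately simple stand-in that lowercases and strips punctuation. It is an assumption for this example and is not the Whisper normaliser or the scoring pipeline used for the results in this post.

```python
def normalise(text):
    """Simplified stand-in normaliser: lowercase, drop punctuation, split into words."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() or ch == "'" else " "
        for ch in text.lower()
    )
    return cleaned.split()


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = normalise(reference)
    hyp = normalise(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
example_wer = word_error_rate("a b c d", "a x c d")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis contains many extra words.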
We previously reported the substantial accuracy improvements we obtained with our latest release, Ursa. Whereas the results we presented before were more focused on batch mode, in this blog we show that, with the increased scale of its neural networks, Ursa has also made significant improvements in real-time transcription across a range of latencies, offering near-batch levels of accuracy as shown in Table 1. The table shows that at the lowest latency setting, Ursa achieves a WER of 11.2%, which is only an 8.5% relative degradation compared to the batch accuracy of 10.25%. Taken together with the results in the Ursa release blog, this means our low-latency real-time ASR system is significantly ahead of OpenAI Whisper in accuracy, even though Whisper only supports batch processing.
At Speechmatics, we prioritize real-time performance and base all our modelling decisions around it. This means that we use the same models for both batch and real-time allowing us to achieve parity in accuracy as the latency is increased to 10s. By taking this approach, we deliver the most accurate results for all ranges of latencies that might be required for your application.
Batch | Real-time (max_delay in seconds)
Table 1: Word Error Rate (WER) for batch and for real-time transcription across multiple max_delay settings, measured in seconds. It also includes the relative degradation in WER compared to batch as the latency is decreased.
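Relative degradation expresses the extra real-time error as a percentage of the batch error. The sketch below shows the arithmetic. The function name is hypothetical and the values in the example are illustrative rather than taken from Table 1.

```python
def relative_degradation(batch_wer, realtime_wer):
    """Percentage increase in WER when moving from batch to real-time.

    Both arguments are WERs in the same units (for example, percent).
    """
    if batch_wer <= 0:
        raise ValueError("batch_wer must be positive")
    return 100.0 * (realtime_wer - batch_wer) / batch_wer


# Illustrative values only: a batch WER of 10.0% and a real-time WER of 11.0%
# is a 1.0 point absolute gap, which is a 10% relative degradation.
illustrative = relative_degradation(10.0, 11.0)
```

A result of zero means the real-time WER matches batch, which is the parity case described for higher max_delay settings.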
Ursa Leads the Competition
As part of our focus on continuous improvement, we regularly compare our releases with those of competitors. It can be difficult to compare settings between vendors, as each uses different methods to trade off fast transcription against accurate results. Therefore, we compared our fastest real-time transcription (2s max delay) with each provider’s most accurate settings, regardless of the speed of transcription, to offer them the most favorable environment when measuring WER on our test sets.
As shown in Table 2, Ursa achieves higher accuracy than the competition even in settings that favour competitor products. For example, Ursa makes roughly 2 in 5 fewer errors than Amazon.
Table 2: Word Error Rate (WER) averaged across six test sets when running real-time ASR across different vendors. Speechmatics is run at its lowest latency (2s max delay), while competitors are run at their most accurate, higher-latency configurations.
Approaches to Latency
Latency is an important factor within real-time ASR technology. To provide transcripts faster, real-time ASR systems offer provisional transcripts of segments very quickly (typically within 1 second), known as partials. This is acceptable when the display of partials can be updated as soon as a better result is available. Finals, the revised segment transcripts, then provide the best accuracy possible given the time constraints and will no longer be altered as further context is provided.
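The relationship between partials and finals can be sketched as a small caption buffer: each partial overwrites the previous provisional text, while each final is committed and never changed again. The class and method names below are hypothetical and do not correspond to any vendor SDK. Clearing the partial whenever a final arrives is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class CaptionBuffer:
    """Holds committed finals plus the latest provisional partial."""

    finals: list = field(default_factory=list)
    partial: str = ""

    def on_partial(self, text):
        # Partials are provisional, so a newer one simply replaces the old one.
        self.partial = text

    def on_final(self, text):
        # Finals are never altered once received, so they are appended.
        self.finals.append(text)
        # Simplification: assume the final supersedes the current partial.
        self.partial = ""

    def render(self):
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)


captions = CaptionBuffer()
captions.on_partial("hello wor")
captions.on_partial("hello world")
captions.on_final("Hello world.")
captions.on_partial("how are")
current_display = captions.render()
```

A display built this way can show text within about a second of it being spoken while still converging on the more accurate finals.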
For Ursa, latency is controlled with two settings, `max_delay` and `max_delay_mode`. `max_delay` is set in seconds and tells the server to return each final transcript segment to you within the specified number of seconds. This allows you to control the ASR’s latency and accuracy for your specific use case. The other setting, `max_delay_mode`, optionally allows the system to achieve slightly higher accuracy by waiting a little longer than the `max_delay` setting when it detects that it is partway through an entity.
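As a sketch of how these two settings might be assembled into a request, the helper below builds a small configuration dictionary and serialises it to JSON. Only the setting names `max_delay` and `max_delay_mode` come from the text above. The helper name, the mode values `"fixed"` and `"flexible"`, and the shape of the dictionary are assumptions for illustration and should be checked against the API reference.

```python
import json

# Assumed mode values, not confirmed by this post.
ASSUMED_MODES = ("fixed", "flexible")


def build_latency_config(max_delay, max_delay_mode="flexible"):
    """Build a hypothetical latency configuration fragment."""
    if max_delay <= 0:
        raise ValueError("max_delay must be a positive number of seconds")
    if max_delay_mode not in ASSUMED_MODES:
        raise ValueError(f"max_delay_mode must be one of {ASSUMED_MODES}")
    return {"max_delay": max_delay, "max_delay_mode": max_delay_mode}


# Lowest latency used in this evaluation (2 seconds).
low_latency = build_latency_config(2)
# Higher latency, where accuracy approaches batch (10 seconds).
high_accuracy = build_latency_config(10, "fixed")
payload = json.dumps(low_latency)
```

Keeping the latency settings in one place makes it easy to run the same audio at several `max_delay` values and compare the resulting WER.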
Each of the other vendors offers distinct features for managing transcription partials. Amazon Transcribe employs a stabilization approach, marking partials as stable or unstable*. For our evaluation, we only utilized results that were deemed stable. Microsoft’s Azure, on the other hand, allows users to set a numeric threshold for partial confirmation† before returning results. In our comparison, we set the threshold to 3 and used only the results from the 'recognized' event for our calculations. Meanwhile, Google’s Cloud Speech Service provides the option for interim results‡, which deliver quicker results that may be subject to change. During our test, we opted not to use this setting, ensuring that only finalized results for each utterance were considered.
Speechmatics’ latest release, Ursa, demonstrates outstanding performance at a latency of just 2 seconds, with only an 8.5% relative difference in WER compared to batch processing, reducing to zero as the latency is increased to 10s. Ursa outperforms major speech-to-text vendors such as Amazon, Microsoft, and Google in accuracy, even when Ursa is configured to prioritize speed over accuracy. This demonstrates our commitment to delivering state-of-the-art, low-latency ASR systems without sacrificing accuracy, which makes us the top performer for demanding real-time applications.