Apr 4, 2023 | Read time 4 min

Ursa: Scaling up as a Solution to Domain Generalization

Speechmatics' newest model, Ursa, performs better than ever on specialized topics like medicine and finance.

Ana Olssen

Machine Learning Engineer

Speechmatics recently introduced Ursa, a new generation of models for our Automatic Speech Recognition (ASR) system, which achieves market-leading accuracy in speech transcription. We’ve previously discussed the overall performance and new features of Ursa and demonstrated our ongoing commitment to understanding every voice, with outstanding results across a range of demographics. In addition, Ursa shows impressive performance in specialized domains.

Here, when we talk about domains, we mean specific contexts in which language is used. Each domain has its own style and vocabulary. For example, the style and vocabulary of language in a legal domain, such as a court hearing[1], will differ significantly from that in a medical domain, such as a physician-patient conversation[2]. Ursa can accurately transcribe speech across these different domains, an ability known as domain generalization.

Ursa’s impressive performance is driven by significant scaling up of both our self-supervised learning model and our neural language model. We increased our self-supervised learning model to 2 billion parameters, enabling us to better understand every voice. We also increased our language model to 30 times its previous size, greatly expanding our coverage of domain-specific vocabulary. The ability to improve domain generalization by boosting the language model in particular is one of the great benefits of maintaining our modular approach to ASR.

To test Ursa’s domain generalization performance, we identified utterances within one of our internal datasets that relate to five specific domains (Medical, Financial, Technical, Political, and Construction) and measured Ursa against Speechmatics' previous Enhanced model using the word error rate (WER) metric. While our previous model already boasted market-leading accuracy, Figure 1 shows that Ursa achieves relative WER improvements of up to 18.2%.

Figure 1: WER improvement by domain, showing Ursa's relative improvements of up to 18.2%
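For readers less familiar with the metric, the comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code behind Figure 1: the function names, the lowercase-and-split normalization, and the toy transcripts are all assumptions made for the example. WER is taken as the word-level edit distance between reference and hypothesis divided by the number of reference words, pooled per domain, and the relative improvement is the reduction in WER as a fraction of the baseline WER.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_domain(samples):
    """Pool errors and reference words per domain.

    samples: iterable of (domain, reference_text, hypothesis_text).
    Assumes every domain has at least one non-empty reference.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for domain, reference, hypothesis in samples:
        # Simplistic normalization (an assumption): lowercase, split on spaces.
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[domain] += edit_distance(ref, hyp)
        words[domain] += len(ref)
    return {domain: errors[domain] / words[domain] for domain in words}


def relative_improvement(baseline_wer, new_wer):
    """Fractional WER reduction relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer


# Invented toy transcripts, NOT real model output or real results.
reference = "the radiologist used the greulich and pyle method"
baseline_output = "the radiologist used the growl and pile method"
improved_output = "the radiologist used the greulich and pile method"

baseline = wer_by_domain([("Medical", reference, baseline_output)])
improved = wer_by_domain([("Medical", reference, improved_output)])
gain = relative_improvement(baseline["Medical"], improved["Medical"])
print(f"baseline WER {baseline['Medical']:.3f}, "
      f"improved WER {improved['Medical']:.3f}, "
      f"relative improvement {gain:.1%}")
```

Note that a relative improvement is not the same as an absolute drop in WER. As a purely hypothetical illustration, a model going from 11.0% to 9.0% WER gains 2 points in absolute terms, which is roughly an 18.2% relative improvement.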

This improvement on domain-specific data means that Ursa can be used by companies and individuals with a range of different needs. We accurately transcribe challenging vocabulary, including people and product names, and specialist terms.

In summary, we’ve already shown that Ursa has the lowest average WER compared to major ASR competitors, but if you need to transcribe a builder discussing clerestory windows, a radiologist considering the Greulich and Pyle method of bone age assessment, or a mathematician explaining Poisson distributions, Ursa also has you covered.

References

[1] Saadany, Hadeel, et al. "Better Transcription of UK Supreme Court Hearings." arXiv preprint arXiv:2211.17094, 2022.

[2] Soltau, Hagen, et al. "Understanding Medical Conversations: Rich Transcription, Confidence Scores & Information Extraction." Interspeech, 2021.

Authors: Ana Olssen
Acknowledgements: Benedetta Cevoli & John Hughes