Apr 4, 2023 | Read time 4 min

Ursa: Scaling up as a Solution to Domain Generalization

Speechmatics' newest model, Ursa, performs better than ever on specialized topics like medicine and finance.
Ana Olssen, Machine Learning Engineer

Speechmatics recently introduced Ursa, a new generation of models for our Automatic Speech Recognition (ASR) system, which achieves market-leading accuracy in speech transcription. We’ve previously discussed the overall performance and new features of Ursa and demonstrated our ongoing commitment to understanding every voice, with outstanding results across a range of demographics. In addition, Ursa shows impressive performance in specialized domains.

Here, when we talk about domains, we mean specific contexts in which language is used. Each domain has its own style and vocabulary. For example, the style and vocabulary of language in a legal domain, such as a court hearing[1], will differ significantly from that in a medical domain, such as a physician-patient conversation[2]. Ursa can accurately transcribe speech across these different domains, a capability known as domain generalization.

Ursa’s impressive performance is driven by significant scaling up of both our self-supervised learning model and our neural language model. We increased our self-supervised learning model to 2 billion parameters, enabling us to better understand every voice. We also increased our language model to 30 times its previous size, greatly expanding our coverage of domain-specific vocabulary. The ability to improve domain generalization by boosting the language model in particular is one of the great benefits of maintaining our modular approach to ASR.

To test Ursa’s domain generalization performance, we identified utterances relating to five specific domains within one of our internal datasets (Medical, Financial, Technical, Political, and Construction) and measured Ursa against Speechmatics' previous Enhanced model using the word error rate (WER) metric. While our previous model already boasted market-leading accuracy, Figure 1 shows that Ursa achieves relative WER improvements of up to 18.2%.

Figure 1: WER improvement by domain, showing Ursa's relative improvements of up to 18.2%
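
For readers less familiar with the metric, the sketch below shows one common way to compute WER (word-level edit distance divided by the number of reference words) and a relative improvement between two models. It is a minimal illustration and not our evaluation pipeline: the function names, the lowercase-and-split normalization, and the example sentences are all assumptions made for this example. As a point of reference, a hypothetical drop in WER from 10.0% to 8.18% would be an 18.2% relative improvement.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    # Assumed normalization for this sketch: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical sentences, not taken from the evaluation data.
reference = "the radiologist used the greulich and pyle method"
baseline_hyp = "the radiologist used the grey lick and pile method"
new_hyp = "the radiologist used the greulich and pile method"

baseline_wer = word_error_rate(reference, baseline_hyp)
new_wer = word_error_rate(reference, new_hyp)
print(f"baseline WER: {baseline_wer:.3f}")
print(f"new WER:      {new_wer:.3f}")
print(f"relative improvement: {relative_improvement(baseline_wer, new_wer):.1%}")
```

Note that a relative improvement is measured against the baseline's own error rate, so the same percentage can correspond to very different absolute changes depending on how difficult the domain is.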

This improvement on domain-specific data means that Ursa can be used by companies and individuals with a range of different needs. We accurately transcribe challenging vocabulary, including the names of people and products, and specialist terms. Listen to some examples below and see the difference in transcription between Ursa and our previous Enhanced model:

And so global workspace theory. So that's, that's Bernie Baars (Baars's) was originated by Bernie Baars and has been developed by Stanislas Dehaene and colleagues.
The comparison text for ASR providers shows how the recognized output compares to the reference. Words in red indicate errors: substitutions are in italic, deletions are crossed out, and insertions are underlined. Hovering over a substitution error shows the ground truth.
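
The three error types described above can be recovered by aligning the recognized words against the reference words. The following is a rough sketch under stated assumptions: `align_words` is a hypothetical helper written for this example (it is not part of any Speechmatics tool), it splits on whitespace only, and the sentences are invented. When several alignments are equally short, it returns just one of them.

```python
from collections import Counter


def align_words(reference: str, hypothesis: str) -> list[tuple[str, str, str]]:
    """Return (operation, ref_word, hyp_word) triples from one minimum-edit alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    rows, cols = len(ref) + 1, len(hyp) + 1
    # dist[i][j] is the edit distance between ref[:i] and hyp[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i
    for j in range(cols):
        dist[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j - 1] + cost,  # match or substitution
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
            )
    # Walk back from the bottom-right corner to recover the operations.
    ops = []
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "match" if ref[i - 1] == hyp[j - 1] else "substitution"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("deletion", ref[i - 1], ""))  # reference word missing from output
            i -= 1
        else:
            ops.append(("insertion", "", hyp[j - 1]))  # extra word in the output
            j -= 1
    ops.reverse()
    return ops


# Hypothetical reference and recognized output.
reference = "it was originated by bernie baars and colleagues"
hypothesis = "it was was originated by bernie bars colleagues"

ops = align_words(reference, hypothesis)
counts = Counter(op for op, _, _ in ops)
for op, ref_word, hyp_word in ops:
    if op != "match":
        print(f"{op:12s} ref={ref_word!r} hyp={hyp_word!r}")
```

Summing the substitution, deletion, and insertion counts and dividing by the number of reference words gives the WER for the utterance.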

In summary, we’ve already shown that Ursa has the lowest average WER compared to major ASR competitors, but if you need to transcribe a builder discussing clerestory windows, a radiologist considering the Greulich and Pyle method of bone age assessment, or a mathematician explaining Poisson distributions, Ursa also has you covered.

References

[1] Saadany, Hadeel, et al. "Better Transcription of UK Supreme Court Hearings." arXiv preprint arXiv:2211.17094.

[2] Soltau, Hagen, et al. "Understanding Medical Conversations: Rich Transcription, Confidence Scores & Information Extraction." Interspeech, 2021.

Authors: Ana Olssen
Acknowledgements: Benedetta Cevoli & John Hughes
