Mar 9, 2023 | Read time 10 min

Achieving Accessibility Through Incredible Accuracy with Ursa

Our latest release, Ursa, breaks down accessibility barriers in speech technology by offering ground-breaking accuracy for every voice. Regardless of accent, dialect, or other demographic factors, Ursa is consistently the most accurate speech-to-text engine on the market, with a relative lead of up to 30% over Amazon, Google, Microsoft, and OpenAI’s Whisper.
Benedetta Cevoli, Senior Machine Learning Engineer
They were known as seers and they were held in fear by women and the elderly.
People (They) have (were) noticed (known) seals (as) seers and they were held in fear by women and the elderly.
The comparison text for ASR providers shows how the recognized output compares to the reference. Errors are marked in red: substitutions in italics, deletions crossed out, and insertions underlined. For each substitution, the ground-truth word is given in parentheses after the recognized word.
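The substitution, deletion, and insertion errors described above are the ingredients of word error rate (WER), which divides their sum by the number of words in the reference. The following is a minimal sketch of that calculation, not the evaluation pipeline used for the figures in this post: the `tokenize` and `word_errors` helpers are hypothetical names, and the text normalization (lowercasing, dropping punctuation) is a simplifying assumption.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_errors(reference, hypothesis):
    """Align hypothesis to reference by word-level edit distance and count error types."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = (total errors, substitutions, deletions, insertions) for ref[:i] vs hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # words match: no new error
                continue
            t, s, d, n = dp[i - 1][j - 1]
            substitution = (t + 1, s + 1, d, n)
            t, s, d, n = dp[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = dp[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare by total error count first, so this keeps a cheapest alignment.
            dp[i][j] = min(substitution, deletion, insertion)

    total, subs, dels, ins = dp[len(ref)][len(hyp)]
    return {
        "wer": total / len(ref),
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }


# The example clip above: the ground truth and the output containing four substitutions.
reference = "They were known as seers and they were held in fear by women and the elderly."
hypothesis = "People have noticed seals seers and they were held in fear by women and the elderly."
result = word_errors(reference, hypothesis)
print(result)  # wer is 0.25: 4 substituted words out of 16 reference words
```

On the example clip, four of the sixteen reference words are substituted, giving a WER of 25% for that single utterance. Dataset-level figures aggregate over many utterances and typically apply more careful text normalization than this sketch does.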

Figure 1. Compared with Amazon, Google, Microsoft, and OpenAI’s Whisper, Ursa’s enhanced model is 24% more accurate than the best of these competitors when transcribing English speakers with a wide variety of accents. Word error rate (WER) calculated on the Common Voice dataset[1], 26 hours of speech from speakers across the globe with varied accents (lower is better; error bars show standard error).
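The captions quote relative leads without spelling out the formula. A common reading, assumed in this sketch, is the relative reduction in WER against the comparison system. The `relative_lead` helper is a hypothetical name, and the WER values below are illustrative only; they are not taken from the figures.

```python
def relative_lead(wer_system, wer_baseline):
    """Relative WER reduction of a system over a baseline, as a fraction of the baseline WER."""
    if wer_baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return (wer_baseline - wer_system) / wer_baseline


# Illustrative (made-up) values: a system at 7.6% WER against a baseline at 10% WER.
lead = relative_lead(0.076, 0.10)
print(f"{lead:.0%}")  # prints "24%"
```

Under this reading, a 24% lead means the system makes roughly 24% fewer word errors than the baseline on the same data, not that its accuracy is 24 percentage points higher.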

Figure 2. Ursa provides a 22% lead over the next best competitor on specific dialects that have been historically underrepresented in the data. Word error rate (WER) calculated on the CORAAL dataset[3], more than 100 hours of African American Vernacular English (error bars represent standard errors).

Figure 3. Ursa is consistently the most accurate speech-to-text system across age and gender, with relative leads over Google of 30% on male voices and 25% on senior voices.

Figure 4. Ursa is consistently 32% more accurate than Google across skin tones (results based on the Casual Conversations dataset[2]; skin tone follows the Fitzpatrick scale from 1 to 6, where a higher number indicates a darker skin tone[2]).

Figure 5. Ursa is consistently the most accurate speech-to-text engine across socio-economic status and levels of education, with an approximate lead of 30% over Google for people from a lower socio-economic background and with less formal education (results based on CORAAL[3]).

References

[1] Ardila, Rosana, et al. "Common voice: A massively-multilingual speech corpus." arXiv preprint arXiv:1912.06670 (2019).

[2] Liu, Chunxi, et al. "Towards measuring fairness in speech recognition: casual conversations dataset transcriptions." ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.

[3] Kendall, Tyler and Charlie Farrington. 2021. The Corpus of Regional African American Language. Version 2021.07. Eugene, OR: The Online Resources for African American Language Project.

[4] Martin, Joshua L., and Kelly Elizabeth Wright. "Bias in Automatic Speech Recognition: The Case of African American Language." Applied Linguistics (2022).

[5] Tatman, Rachael. "Gender and dialect bias in YouTube’s automatic captions." Proceedings of the first ACL workshop on ethics in natural language processing. 2017.

[6] Koenecke, Allison, et al. "Racial disparities in automated speech recognition." Proceedings of the National Academy of Sciences 117.14 (2020): 7684-7689.

[7] Garnerin, Mahault, Solange Rossato, and Laurent Besacier. "Investigating the impact of gender representation in speech-to-text training data: A case study on librispeech." 3rd Workshop on Gender Bias in Natural Language Processing. Association for Computational Linguistics, 2021.

[8] Feng, Siyuan, et al. "Quantifying bias in automatic speech recognition." arXiv preprint arXiv:2103.15122 (2021).

[9] Buolamwini, Joy, and Timnit Gebru. "Gender shades: Intersectional accuracy disparities in commercial gender classification." Conference on fairness, accountability and transparency. PMLR, 2018.

[10] Cho, Won Ik, et al. "Towards cross-lingual generalization of translation gender bias." Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 2021.

[11] Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." arXiv preprint arXiv:2212.04356 (2022).

[12] Ware, Olivia R., et al. "Racial limitations of Fitzpatrick skin type." Cutis 105.2 (2020): 77-80.

[13] Juhn, Young J., et al. "Assessing socioeconomic bias in machine learning algorithms in health care: a case study of the HOUSES index." Journal of the American Medical Informatics Association 29.7 (2022): 1142-1151.
Author: Benedetta Cevoli
Acknowledgements: Ana Olssen, Ben Leaman, Emma Davidson, Georgina Robertson, Harish Kumar, John Hughes, Liam Steadman, Markus Hennerbichler, Tom Young