
Mar 9, 2023 | Read time 10 min

Achieving Accessibility Through Incredible Accuracy with Ursa

Our latest release, Ursa, breaks down accessibility barriers in speech technology by offering ground-breaking accuracy for every voice. Irrespective of accent, dialect, and other demographic factors, Ursa is consistently the most accurate speech-to-text engine on the market, with a relative lead of up to 30% over Amazon, Google, Microsoft, and OpenAI's Whisper across the board.
Benedetta Cevoli, Senior Data Scientist

When it comes to speech-to-text, accuracy goes hand-in-hand with accessibility. Understanding every voice means accurately transcribing speakers regardless of where they come from, their accent, and their background. Our latest release, Ursa, delivers unprecedented accuracy with a 22% lead over the next nearest vendor, breaking the accessibility barriers of speech technologies while taking a step closer towards understanding every voice. Listen to the samples below to hear the difference in accuracy between Ursa and other vendors for yourself.

[Interactive demo: audio clips with side-by-side transcripts. For the reference "They were known as seers and they were held in fear by women and the elderly.", one competitor's output reads "People have noticed seals seers and they were held in fear by women and the elderly." In the comparison view, errors are shown in red: substitutions in italics (hover to reveal the ground truth), deletions crossed out, and insertions underlined.]

To understand how Ursa compares to other speech-to-text systems across a variety of voice cohorts, we built a series of evaluation sets combining the Common Voice[1], Casual Conversations[2], and Corpus of Regional African American Language (CORAAL)[3] datasets. We then evaluated speech recognition performance by calculating the weighted average word error rate (WER) for different speaker groups. WER is a common metric for assessing the accuracy of speech-to-text models, defined as the number of errors in a transcript divided by the total number of words in the reference transcript. On the Common Voice[1] dataset, which contains a variety of English accents from across the globe, Ursa is 24% more accurate than the best competitor (see Figure 1).
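For readers who want to reproduce the metric, here is a minimal sketch of a WER computation using the standard word-level Levenshtein alignment; it is illustrative only, not our evaluation pipeline. The same alignment is what classifies each error as a substitution, deletion, or insertion in the comparison view above. Run on the demo sentence, it counts 4 substitutions over 16 reference words, i.e. a WER of 25%.

```python
# Minimal WER sketch: word-level Levenshtein distance over reference/hypothesis.
# Illustrative only; not Speechmatics' internal evaluation code.

def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # d[i][j] = minimum edits turning ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match / substitution
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / len(ref)

reference = "They were known as seers and they were held in fear by women and the elderly"
hypothesis = "People have noticed seals seers and they were held in fear by women and the elderly"
print(wer(reference, hypothesis))  # 4 substitutions / 16 words = 0.25
```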

Figure 1. Ursa's enhanced model is 24% more accurate than the best of Amazon, Google, Microsoft, and OpenAI's Whisper when transcribing English speakers from across the globe with a wide variety of accents. Word error rate (WER) calculated on the Common Voice dataset[1]: 26 hours of speech from speakers across the globe with varied accents (lower is better; error bars show standard error).
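As an aside on how to read these percentages: the sketch below assumes "X% more accurate" means the relative WER reduction against the competitor. The WER values used are made-up placeholders, not the measured numbers behind Figure 1.

```python
# Relative accuracy lead expressed as relative WER reduction.
# The WERs below are invented placeholders for illustration.

def relative_lead(wer_ours: float, wer_theirs: float) -> float:
    """E.g. 0.076 vs 0.100 -> 0.24, i.e. '24% more accurate'."""
    return (wer_theirs - wer_ours) / wer_theirs

print(f"{relative_lead(0.076, 0.100):.0%}")  # -> 24%
```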

Ursa maintains a 22% lead over the next best competitor on specific dialects that have been historically underrepresented in speech data, such as African American Vernacular English[4] (see Figure 2). We also see similar results across various demographic factors such as age, gender, skin tone, and socio-economic status.

Figure 2. Ursa provides a 22% lead over the next best competitor on specific dialects that have been historically underrepresented in the data. Word error rate (WER) calculated on the CORAAL dataset[3], more than 100 hours of African American Vernacular English (error bars represent standard errors).

Bias in Speech Recognition Performance

Accent and dialect variation is only one of the many factors that have been shown to influence speech recognition performance[4]. Speech-to-text systems have also been shown to exhibit systematic inaccuracies, or biases, towards groups of speakers of different ages, genders, and other demographic backgrounds[5-8]. While some of these variables affect our voices more than others, the algorithmic biases observed in speech-to-text are thought to reflect broader historical and societal biases and prejudices.

Artificial intelligence (AI) bias in speech-to-text not only affects the reliability of speech technologies in real-world applications but can also perpetuate discrimination at scale. At Speechmatics, we strive to reduce bias as much as possible by utilizing rich representations of speech learnt from millions of hours of unlabeled audio with self-supervised learning. With the release of Ursa, we've scaled our machine learning models to create additional capacity to learn from our multilingual data and further reduce bias across diverse voice cohorts. This is how Ursa sets new standards for speech-to-text fairness, with the best accuracy across the spectrum.
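To make the idea of learning from unlabeled audio concrete, here is a toy masked-prediction sketch in PyTorch: hide a fraction of the audio frames and train the model to reconstruct them, so the training signal comes from the audio itself rather than from transcripts. Everything here (the architecture, the sizes, and the masked-reconstruction objective itself) is invented for illustration and is not Ursa's actual model or training objective.

```python
import torch
import torch.nn as nn

# Toy masked-prediction self-supervised objective on audio features.
# All sizes and the architecture are illustrative inventions, not Ursa.

class MaskedPredictor(nn.Module):
    def __init__(self, feat_dim: int = 80, hidden: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           dim_feedforward=hidden,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(feat_dim, feat_dim)   # predict the hidden frames

    def forward(self, feats, mask):
        masked = feats.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked frames
        return self.head(self.encoder(masked))

# No transcripts needed: the supervision comes from the audio itself.
batch, time, dim = 4, 100, 80
feats = torch.randn(batch, time, dim)      # stand-in for log-mel frames
mask = torch.rand(batch, time) < 0.15      # mask ~15% of frames
pred = MaskedPredictor()(feats, mask)
loss = nn.functional.mse_loss(pred[mask], feats[mask])  # score masked frames only
loss.backward()
```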

Accuracy Across Demographics

Based on a combination of the Casual Conversations[2] and CORAAL[3] datasets, we evaluated Ursa's transcription performance against other speech-to-text providers across several demographic factors. We found that independent of age, gender identity, skin tone, socio-economic status, and level of education, Ursa offers the best transcription accuracy across all demographics, with a 10% lead over the nearest vendor (see Figures 3-5).
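One simple way to produce such per-cohort numbers is sketched below, under assumed field names: weight each utterance's errors by its reference length to get a length-weighted WER per group, with a standard error for the error bars. This mirrors the weighted average WER described earlier; it is not our exact evaluation pipeline.

```python
from collections import defaultdict
from math import sqrt
from statistics import stdev

# Per-cohort, length-weighted WER with a standard error for error bars.
# The field names ('group', 'errors', 'ref_len') are hypothetical.

def cohort_wer(utterances):
    groups = defaultdict(list)
    for u in utterances:
        groups[u["group"]].append(u)
    report = {}
    for group, utts in groups.items():
        # Length-weighted WER: total errors over total reference words.
        weighted = sum(u["errors"] for u in utts) / sum(u["ref_len"] for u in utts)
        rates = [u["errors"] / u["ref_len"] for u in utts]
        se = stdev(rates) / sqrt(len(rates)) if len(rates) > 1 else 0.0
        report[group] = (weighted, se)
    return report

demo = [{"group": "19-30", "errors": 5, "ref_len": 100},
        {"group": "19-30", "errors": 9, "ref_len": 120},
        {"group": "60-81", "errors": 14, "ref_len": 110}]
print(cohort_wer(demo))  # {'19-30': (0.0636..., 0.0125), '60-81': (0.1272..., 0.0)}
```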

It's no news that a wide variety of AI applications have been reported to be biased against women. Numerous studies have shown consistent algorithmic differences between women and men, with better performance for men in tasks from face recognition[9] to language translation[10], and these issues have been extensively covered by the media[11]. In the context of speech-to-text, Ursa stands out from the competition as the most accurate speech-to-text system across both female and male speakers (see Figure 3). Specifically, Ursa is almost 30% more accurate than Google on male speech.

Figure 3. Ursa is consistently the most accurate speech-to-text system across age and gender, with a 30% and 25% relative lead on male and senior voices compared to Google, respectively.

A recent study found that teenagers' speech is recognized better than that of children and of people over 65 years old[8]. Similarly, we found that speech-to-text typically struggles with the speech of people over 60 compared to other age groups. Ursa is consistently the most accurate transcription engine across age groups and is 25% more accurate than Google for speakers aged 60 to 81.

Figure 4. Ursa is consistently 32% more accurate than Google across skin tones (results based on the Casual Conversations dataset[2]; skin tone is measured on the Fitzpatrick scale from 1 to 6, where a higher number indicates a darker skin tone[2]).

Alongside age and gender bias, several studies have also reported that people of color are misunderstood roughly twice as often as white people[6]. We found that Ursa's performance is the most consistent across varying skin tones when compared to competitors (see Figure 4). Specifically, Ursa is consistently 32% more accurate than Google across skin tones. It is important to stress that, generally, the more similar a given voice is to those included in the training data, the better a speech-to-text system is at transcribing it[11]. As skin tone is a purely visual factor, we suspect that these differences are caused by people with darker skin tones being underrepresented in training datasets compared to people with lighter skin tones[9]. Similar discussions can be found in the literature[12].

Socio-economic bias in AI applications has also been observed in a wide variety of healthcare use cases[13], with people from lower socio-economic backgrounds served less accurately than those from more affluent backgrounds. We found that both socio-economic status and level of education are likewise related to varying levels of speech-to-text performance (see Figure 5). Ursa is consistently the most accurate compared to the competition across socio-economic status and levels of education, with a lead of 30% over Google for people of lower socio-economic status and 28% for people with less formal education.

Figure 5. Ursa is consistently the most accurate speech-to-text engine across socio-economic status and levels of education, with an approximate lead of 30% over Google for people from a lower socio-economic background and less formal education (results based on CORAAL[3]).

A Step Change in Reducing Inequalities in Speech Technologies

In summary, Speechmatics' Ursa achieves ground-breaking accuracy across a wide range of voice cohorts, from age to socio-economic status. By exploiting self-supervised learning at scale, Ursa brings speech technologies closer to everyone. We're incredibly proud to offer a product that increases accessibility for everyone, irrespective of who they are, and takes a major step closer to understanding every voice.

Ready to test Ursa for yourself? Head over to the Ursa blog to try out our Real-Time demo or sign up to our Portal for free and get immediate access to the best speech-to-text system ever created.

References

[1] Ardila, Rosana, et al. "Common Voice: A massively-multilingual speech corpus." arXiv preprint arXiv:1912.06670 (2019).

[2] Liu, Chunxi, et al. "Towards measuring fairness in speech recognition: casual conversations dataset transcriptions." ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022.

[3] Kendall, Tyler and Charlie Farrington. 2021. The Corpus of Regional African American Language. Version 2021.07. Eugene, OR: The Online Resources for African American Language Project.

[4] Martin, Joshua L., and Kelly Elizabeth Wright. "Bias in Automatic Speech Recognition: The Case of African American Language." Applied Linguistics (2022).

[5] Tatman, Rachael. "Gender and dialect bias in YouTube’s automatic captions." Proceedings of the first ACL workshop on ethics in natural language processing. 2017.

[6] Koenecke, Allison, et al. "Racial disparities in automated speech recognition." Proceedings of the National Academy of Sciences 117.14 (2020): 7684-7689.

[7] Garnerin, Mahault, Solange Rossato, and Laurent Besacier. "Investigating the impact of gender representation in speech-to-text training data: A case study on librispeech." 3rd Workshop on Gender Bias in Natural Language Processing. Association for Computational Linguistics, 2021.

[8] Feng, Siyuan, et al. "Quantifying bias in automatic speech recognition." arXiv preprint arXiv:2103.15122 (2021).

[9] Buolamwini, Joy, and Timnit Gebru. "Gender shades: Intersectional accuracy disparities in commercial gender classification." Conference on fairness, accountability and transparency. PMLR, 2018.

[10] Cho, Won Ik, et al. "Towards cross-lingual generalization of translation gender bias." Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. 2021.

[11] Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." arXiv preprint arXiv:2212.04356 (2022).

[12] Ware, Olivia R., et al. "Racial limitations of Fitzpatrick skin type." Cutis 105.2 (2020): 77-80.

[13] Juhn, Young J., et al. "Assessing socioeconomic bias in machine learning algorithms in health care: a case study of the HOUSES index." Journal of the American Medical Informatics Association 29.7 (2022): 1142-1151.
Author: Benedetta Cevoli
Acknowledgements: Ana Olssen, Ben Leaman, Emma Davidson, Georgina Robertson, Harish Kumar, John Hughes, Liam Steadman, Markus Hennerbichler, Tom Young
