When it comes to speech-to-text, accuracy goes hand in hand with accessibility. Understanding every voice means accurately transcribing speakers regardless of where they come from, their accent, or their background. Our latest release, Ursa, delivers unprecedented accuracy with a 22% lead over the nearest vendor, breaking down the accessibility barriers of speech technologies and taking a step closer towards understanding every voice. Listen to the samples below to hear for yourself the astonishing difference in accuracy between Ursa and other vendors.
To understand how Ursa compares to other speech-to-text systems across a variety of voice cohorts, we built a series of evaluation sets combining the Common Voice, Casual Conversations, and Corpus of Regional African American Language (CORAAL) datasets. We then evaluated speech recognition performance by calculating the weighted average word error rate (WER) for different speaker groups. WER is a standard metric for assessing the accuracy of speech-to-text models, defined as the number of errors in a transcript divided by the total number of words in the reference transcript. On the Common Voice dataset, which contains a variety of English accents from across the globe, Ursa is 24% more accurate than the next best competitor (see Figure 1).
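To make the metric concrete, here is a minimal sketch of WER and of a weighted average across speaker groups. This is purely illustrative, not Speechmatics' evaluation code; the function names are our own.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words,
    computed with a word-level edit-distance dynamic program."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def weighted_wer(pairs):
    """Weighted average over utterances: total errors / total reference words,
    so longer utterances contribute proportionally more."""
    total_errors = sum(wer(r, h) * len(r.split()) for r, h in pairs)
    total_words = sum(len(r.split()) for r, _ in pairs)
    return total_errors / total_words
```

For example, a transcript that drops one word from a six-word reference scores a WER of 1/6; pooling utterances this way per cohort gives the weighted averages reported in the figures.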
Ursa maintains a 22% lead over the next best competitor on specific dialects that have been historically underrepresented in speech data, such as African American Vernacular English (see Figure 2). We also see similar results across various demographic factors such as age, gender, skin tone, and socio-economic status.
Bias in Speech Recognition Performance
Accent and dialect variation is only one of many factors shown to influence speech recognition performance. Speech-to-text systems have also been shown to exhibit systematic inaccuracies, or biases, against groups of speakers of differing age, gender, and other demographic characteristics [5-8]. While some of these variables affect our voices more than others, the algorithmic biases observed in speech-to-text are thought to reflect broader historical and societal biases and prejudices.
Artificial intelligence (AI) bias in speech-to-text not only affects the reliability of speech technologies in real-world applications but can also perpetuate discrimination at scale. At Speechmatics, we strive to reduce bias as much as possible by utilizing rich representations of speech learnt from millions of hours of unlabeled audio with self-supervised learning. With the release of Ursa, we’ve scaled our machine learning models to create additional capacity to learn from our multilingual data and further reduce bias across diverse voice cohorts. This is how Ursa sets a new standard for speech-to-text fairness, with the best accuracy across the spectrum.
Accuracy Across Demographics
Using a combination of the Casual Conversations and CORAAL datasets, we evaluated Ursa’s transcription performance against other speech-to-text providers across several demographic factors. We found that regardless of age, gender identity, skin tone, socio-economic status, or level of education, Ursa offers the best transcription accuracy, with a 10% lead over the nearest vendor (see Figures 3-6).
It’s no secret that a wide variety of AI applications have been reported to be biased against women. Numerous studies have shown consistent algorithmic differences between women and men, with better performance for men in tasks ranging from face recognition to language translation, and these issues have been extensively covered by the media. In the context of speech-to-text, Ursa stands out from the competition as the most accurate system across both female and male speakers (see Figure 3). Specifically, Ursa is almost 30% more accurate than Google on male speech.
A recent study found that teenagers’ speech is recognized better than that of children and of people over 65 years old. Similarly, we found that speech-to-text systems typically struggle with the speech of people over 60 compared to other age groups. Ursa is consistently the most accurate transcription engine across age groups and is 25% more accurate than Google for speakers aged 60 to 81.
Alongside age and gender bias, several studies have also reported that people of color are misunderstood roughly twice as often as white people. We found that Ursa’s performance is the most consistent across skin tones when compared to competitors (see Figure 4); specifically, Ursa is consistently 32% more accurate than Google across skin tones. It is important to stress that, in general, the more similar a given voice is to those in the training data, the better a speech-to-text system is at transcribing it. As skin tone is a purely visual factor, we suspect these differences are caused by people with darker skin tones being underrepresented in training datasets compared to people with lighter skin tones. Similar discussions can be found in the literature.
Socio-economic bias in AI has also been observed in a wide variety of healthcare use cases, with people from lower socio-economic backgrounds served less accurately than those from more affluent backgrounds. We found that both socio-economic status and level of education are likewise associated with varying levels of speech-to-text performance (see Figure 5). Ursa is consistently the most accurate across socio-economic status and levels of education, with a lead of 30% over Google for people of lower socio-economic status and 28% for people with less formal education.
A Step Change in Reducing Inequalities in Speech Technologies
In summary, Speechmatics’ Ursa achieves ground-breaking accuracy across a wide range of voice cohorts, from age to socio-economic status. By exploiting self-supervised learning at scale, Ursa brings speech technologies closer to everyone. We’re incredibly proud to offer a product that increases accessibility for everyone, irrespective of who they are, taking a major step closer to understanding every voice.