
| Footnotes | * Our quoted percentages are relative reductions in word error rate (WER) across 21 open-source test sets when comparing two systems. A gain of 10%, for example, means that on average 1 in 10 errors is removed. WER is calculated by dividing the number of errors by the number of words in the reference, so a lower number indicates a better system. Ursa's enhanced model is used for the comparisons unless stated otherwise. |
| | † To replicate our experiment showing that Speechmatics surpasses human-level transcription on the Kincaid46 dataset, use this Python notebook. It calls our latest API and therefore requires an API key, which can be generated on our portal. |
| | ‡ We would like to extend a special thanks to FluidStack, who provided the infrastructure and a month of GPU training time to make this possible. |
| | ** We are aware of the limitations of WER; one major issue is that errors which introduce misinformation are given the same weight as simple spelling mistakes. To address this, we normalize our transcriptions so that differences in contractions, or between British and American spellings, that humans would still consider correct are not penalized. Going forward, we intend to adopt a metric based on NER. |
| | †† Tests conducted in January 2023 against Amazon Transcribe, Microsoft Azure Video Indexer, Google Cloud Speech-to-Text (latest_long model), and OpenAI's Whisper (large-v2 model), compared to Ursa's enhanced model in the Speechmatics Batch SaaS. |
| | ‡‡ Our quoted numbers for Whisper large-v2 differ from the paper [7] for a few reasons. First, we found that the Whisper models tend to hallucinate, inflating WER through many insertion errors, and that they behave non-deterministically. Second, although we endeavored to minimize this, our preparation of some of these test sets may differ from theirs; the numbers in the tables nevertheless show consistent comparisons. |
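The footnotes above define WER as the number of errors divided by the number of reference words, with gains quoted as relative reductions between two systems. As a minimal illustrative sketch (not Speechmatics' evaluation code, which additionally normalizes transcripts before scoring), WER can be computed with a word-level Levenshtein distance:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Rolling-row Levenshtein: d[j] is the edit distance between the
    # reference prefix processed so far and the first j hypothesis words.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev_diag, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            cur = min(
                d[j] + 1,               # deletion (reference word missing)
                d[j - 1] + 1,           # insertion (extra hypothesis word)
                prev_diag + (r != h),   # substitution, or free match
            )
            prev_diag, d[j] = d[j], cur
    return d[-1] / len(ref)


def relative_reduction(wer_baseline: float, wer_new: float) -> float:
    """Fraction of the baseline's errors removed by the new system."""
    return (wer_baseline - wer_new) / wer_baseline


print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words ≈ 0.167
print(relative_reduction(0.10, 0.09))  # ≈ 0.1, i.e. 1 in 10 errors removed
```

This matches the footnote's framing: a system that moves a baseline from 10% WER to 9% WER shows a 10% relative reduction, even though the absolute difference is only one percentage point.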
| References | [1] Kincaid, Jason. "Which Automatic Transcription Service Is the Most Accurate? - 2018." Medium, 5 Sept. 2018. Accessed 24 Feb. 2023. |
| | [2] Hoffmann, Jordan, et al. "Training compute-optimal large language models." arXiv preprint arXiv:2203.15556 (2022). |
| | [3] Panayotov, Vassil, et al. "Librispeech: an ASR corpus based on public domain audio books." 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015. |
| | [4] Del Rio, Miguel, et al. "Earnings-22: A Practical Benchmark for Accents in the Wild." arXiv preprint arXiv:2203.15591 (2022). |
| | [5] Kendall, Tyler, and Charlie Farrington. "The Corpus of Regional African American Language." Version 6 (2018). |
| | [6] Ardila, Rosana, et al. "Common Voice: A massively-multilingual speech corpus." arXiv preprint arXiv:1912.06670 (2019). |
| | [7] Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." arXiv preprint arXiv:2212.04356 (2022). |
| | [8] Papineni, Kishore, et al. "BLEU: a method for automatic evaluation of machine translation." Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002. |
| | [9] Wang, Changhan, Anne Wu, and Juan Pino. "CoVoST 2 and massively multilingual speech-to-text translation." arXiv preprint arXiv:2007.10310 (2020). |
| Author | John Hughes |
| Acknowledgements | Aaron Ng, Adam Walford, Ajith Selvan, Alex Raymond, Alex Wicks, Ana Olssen, Anand Mishra, Anartz Nuin, André Mansikkaniemi, Andrew Innes, Baskaran Mani, Ben Gorman, Ben Walker, Benedetta Cevoli, Bethan Thomas, Brad Phipps, Callum Hackett, Caroline Dockes, Chris Waple, Claire Schaefer, Daniel Nurkowski, David Agmen-Smith, David Gray, David Howlett, David MacLeod, David Mrva, Dominik Jochec, Dumitru Gutu, Ed Speyer, Edward Rees, Edward Weston, Ellena Reid, Gareth Rickards, George Lodge, Georgios Hadjiharalambous, Greg Richards, Hannes Unterholzner, Harish Kumar, James Gilmore, James Olinya, Jamie Dougherty, Jan Pesan, Janani T E, Jindrich Dolezal, John Hughes, Kin Hin Wong, Lawrence Atkins, Lenard Szolnoki, Liam Steadman, Manideep Karimireddy, Markus Hennerbichler, Matt Nemitz, Mayank Kalbande, Michal Polkowski, Neil Stratford, Nelson Kondia, Owais Aamir Thungalwadi, Owen O'Loan, Parthiban Selvaraj, Peter Uhrin, Philip Brown, Pracheta Phadnis, Pradeep Kumar, Rajasekaran Radhakrishnan, Rakesh Venkataraman, Remi Francis, Ross Thompson, Sakthy Vengatesh, Sathishkumar Durai, Seth Asare, Shuojie Fu, Simon Lawrence, Sreeram P, Stefan Fisher, Steve Kingsley, Stuart Wood, Tej Birring, Theo Clark, Tom Young, Tomasz Swider, Tudor Evans, Venkatesh Chandran, Vignesh Umapathy, Vyanktesh Tadkod, Waldemar Maleska, Will Williams, Wojciech Kruzel, Yahia Abaza. Special thanks to Will Williams, Harish Kumar, Georgina Robertson, Liam Steadman, Benedetta Cevoli, Emma Davidson, Edward Rees and Lawrence Atkins for reviewing drafts. |
| Citation | For attribution in academic contexts, please cite this work using the accompanying BibTeX citation. |