Description

Number recognition is a notoriously difficult problem in automatic speech recognition (ASR). Unlike most words, which have only one written form in a transcript, numbers can be written as either digits or words. This creates inconsistencies that affect both readability for human readers and compatibility with machine tools that expect a certain output format. When numbers are a crucial part of an interaction, as in credit card and phone number use cases, unpredictable output is a problem wherever those numbers need to be transcribed. Speechmatics’ ASR delivers a standardized and consistent format by transcribing numbers less than 10 as words.

Number recognition

ASR has evolved significantly in recent years, and so have the expectations of users. Battles are no longer fought over word error rates. Top providers consistently deliver accuracy in the mid-to-high 90 percent range, especially for English. The battleground has shifted, with providers looking beyond word error rate in pursuit of capturing more of the intricacies of voice and speech. Last year, for example, Speechmatics rolled out the most advanced punctuation in the market to its top languages. Work is in progress to add Advanced Punctuation to even more languages.

ASR has many applications and capabilities that add value to businesses. It enables businesses to innovate with the voice data in their organization, from the voice of their employees to the voice of the customers they serve. Organizations are looking to integrate ASR alongside other third-party solutions to build workflows around voice. These use cases range from straightforward transcription and captioning to media monitoring, call interaction capture, call routing, call center agent assist solutions, compliance monitoring, and analytics. In these situations, a consistent and accurate representation of common entities such as numbers is not only necessary but expected.

ASR solutions are highly effective and accurate at transcribing speech. When it comes to numbers, however, the format in which they are transcribed can be mixed. In some cases, transcribed numbers are unpredictable because of how the models are trained. For example, the output might mix words and digits because the transcription product cannot tell that the entity it has recognized is a number and not a word.
Speechmatics’ enhanced number recognition and consistent formatting

Accurate number recognition enhances the quality of the Speechmatics Global English language pack. It delivers accurate recognition of numbers within speech and provides a consistent output format of words for numbers less than 10.

Previous output

“Yes, please call me back. The best number to get me on is 0 7 seven 2 3 four 5 six 7 eight nine”

New output

“Yes, please call me back. The best number to get me on is zero seven seven two three four five six seven eight nine”

Numbers less than 10 are now always output as words. This standardized transcription output delivers predictability. If the customer requires a format other than words, this standard approach allows a simple mapping so that numbers can be normalized (or converted) to suit the customer’s specific needs.

The benefits

The focus on number recognition and a consistent format uplifts the quality of the Speechmatics output. The demand on customers to review and edit transcripts can be significantly reduced. This accelerates the time to market of finished transcripts for applications like closed captioning, especially in real time. The predictable output of numbers less than 10 means that transcripts require less triage from human editors, optimizing the workforce and their efforts.

Another example of the benefits of this feature is within the contact center. Speechmatics can significantly optimize agent tasks like interaction and call note capture, which can be done automatically and accurately through Speechmatics’ ASR. Number recognition can also uplift the capabilities of automated customer-facing tools such as interactive voice response (IVR) and support privacy and compliance scenarios. These use cases rely on recognizing numbers in the speaker’s voice, and they also require a specific format from the ASR solution so that it works seamlessly with the other products in the workflow.

Where and how can you use Speechmatics’ enhanced number recognition?

Great news: you don’t have to do anything to use it! It comes as standard in Speechmatics’ Global English language pack in all deployment options (SaaS, Batch and Real-Time Virtual Appliances, and Batch and Real-Time Containers). Other numbers (including 10 and larger) will be addressed in subsequent releases.
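For readers who need digits rather than words, the mapping from the standardized word output to another format can be quite small. The following is a minimal sketch in Python, not part of any Speechmatics product or API: the function name `digit_words_to_digits`, the `min_run` parameter, and the choice to join consecutive digit words without spaces are all assumptions made for illustration. It only handles the words zero through nine, matching the range the article says is output consistently as words.

```python
import re

# Spoken forms of 0-9, the only numbers the article says are
# consistently output as words.
WORD_TO_DIGIT = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

# A run is one or more digit words separated by whitespace. The word
# boundaries stop "nine" from matching inside "nineteen".
_DIGIT_WORD = "|".join(WORD_TO_DIGIT)
_RUN = re.compile(
    rf"\b(?:{_DIGIT_WORD})(?:\s+(?:{_DIGIT_WORD}))*\b",
    re.IGNORECASE,
)


def digit_words_to_digits(text: str, min_run: int = 1) -> str:
    """Replace runs of digit words with digits (hypothetical helper).

    Runs shorter than `min_run` words are left untouched, which lets a
    caller convert phone-number-like sequences while keeping a lone
    "one" or "two" in ordinary prose as a word.
    """
    def replace(match: re.Match) -> str:
        words = match.group(0).split()
        if len(words) < min_run:
            return match.group(0)
        return "".join(WORD_TO_DIGIT[w.lower()] for w in words)

    return _RUN.sub(replace, text)


transcript = (
    "Yes, please call me back. The best number to get me on is "
    "zero seven seven two three four five six seven eight nine"
)
print(digit_words_to_digits(transcript, min_run=3))
# Yes, please call me back. The best number to get me on is 07723456789
```

Setting `min_run` above 1 is a design choice rather than a requirement: it trades a little recall for not rewriting everyday phrases such as "one moment". A real workflow would likely need its own rules for grouping, separators, and which entities to convert.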