Engaging with multilingual audiences, particularly in linguistically diverse regions, presents a unique challenge. American businesses, for instance, often operate in environments where English and Spanish interchange fluidly.
The ability to accurately identify and transcribe varying accents or dialects within audio content is not just a challenge; it's a necessity.
This is especially crucial in real-time applications like live news captioning. Traditional approaches requiring manual model switches for accent-specific transcription are impractical and inefficient.
Enter Speechmatics' solution: a global approach to language transcription that simplifies this complexity.
Our latest bilingual release, supporting both Spanish and English, exemplifies our commitment to understanding every voice. This development is particularly significant given that these languages rank among the top five globally by number of speakers.
Check out this great example from Ricardo Herreros-Symons, our VP of Corporate Development, testing our bilingual model:
The blending of cultures and languages in media content has become increasingly common, necessitating sophisticated solutions for accurate captioning. Captions have become a staple for broadcasters to improve the accessibility of their content, but the stakes are high: mismatches between the expected language and the actual spoken content can result in confusing or even embarrassing subtitles. Ensuring accurate captions in these situations saves broadcasters both time and embarrassment.
This challenge extends beyond broadcasting to businesses that rely on transcribing calls or meetings. In scenarios where a conversation might switch from Spanish to English fluidly, traditional Automatic Speech Recognition (ASR) systems falter. Words spoken may be missed or transcribed as nonsense. This compromises analytics and compliance efforts with incomplete or inaccurate transcriptions.
We're excited to introduce our new bilingual Spanish and English transcription to address these challenges. It allows you to understand and transcribe the speech of over 2 billion people, a quarter of the world's population, in a single API call.
The transcription can switch organically between the two languages, whether in a real-time audio stream or a batch file. We also provide the ability to translate the result into a single language. For example, in the same API call we can return the complete transcript in English, so that users who do not speak Spanish can follow the entire conversation without missing any details.
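As an illustration, a batch job combining bilingual transcription with translation into English might be configured like this. This is a sketch only: the language code used for the bilingual model and the exact field values are assumptions, so check the Speechmatics API reference for the authoritative names.

```python
import json

# Hypothetical sketch of a Speechmatics batch job config that requests
# bilingual Spanish/English transcription plus an English translation
# in the same call. The "es" language code standing in for the bilingual
# model is an assumption; consult the official API docs.
config = {
    "type": "transcription",
    "transcription_config": {
        "language": "es",  # assumed code for the bilingual ES/EN model
    },
    "translation_config": {
        # Return the full transcript in English as well, so non-Spanish
        # speakers can follow the whole conversation.
        "target_languages": ["en"],
    },
}

print(json.dumps(config, indent=2))
```

The key point is that transcription and translation are requested together, in a single job, rather than as two separate pipeline steps.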
This advancement not only enhances accessibility and inclusivity but also represents a significant leap forward in the global reach and impact of products and services in the digital age.
Modern ASR systems do an amazing job learning the particular nuances of a given language, and the rules that govern how sounds and words can be combined. Some examples of these language specifics are:
Phonetic inventory – the phones that can be produced. Phones are the smallest sound units of a language that contribute to meaning. For instance, Portuguese has nasalized vowels, while Xhosa uses clicks.
Phonotactic rules – how different phones can be combined. “Spr” (as in “spring”) is a valid combination in English, but not a possible sound combination in many other languages.
Syntactic rules – grammatical rules governing the order of words in a sentence. E.g. in Māori and Welsh, the verb comes before the noun, whereas in Korean and Turkish it comes last.
Alphabet – the script the language is written in. Italian, Arabic, and Thai all use different scripts.
An ASR model that covers more than one language has to learn all of the linguistic rules for each language. Not only can this require changes to the models themselves, but there is a huge requirement for high-quality training and evaluation data in each language and, ideally, data from combined, i.e. code-switching, scenarios.
Code-switching is when a person switches between languages as they speak. This can happen within or across sentences. Without high-quality code-switching data, this type of speech is particularly challenging for ASR models. Because ASR models predict sequences of words based on their training input, they tend to learn that English is most commonly followed by English and likewise for any other language input.
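This bias is easy to see in a toy bigram model: counts estimated from monolingual text alone assign zero probability to transitions that cross languages, so a decoder guided by such a model resists switching mid-sentence. A minimal sketch (the training sentences are illustrative):

```python
from collections import Counter

# Toy bigram counts estimated from monolingual training text only.
english = "the cat sat on the mat".split()
spanish = "el gato se sentó en la alfombra".split()

bigrams = Counter()
for sent in (english, spanish):
    for a, b in zip(sent, sent[1:]):
        bigrams[(a, b)] += 1

# Within-language transitions were observed during training...
print(bigrams[("the", "cat")])   # -> 1

# ...but a code-switched transition like "the gato" never was, so the
# model assigns it zero count (and near-zero probability even after
# smoothing), discouraging mid-sentence language switches.
print(bigrams[("the", "gato")])  # -> 0
```

Code-switching training data fills in exactly these cross-language transitions.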
Our first bilingual model is focused on the language-pair of Spanish and English.
Their linguistic similarities make some tasks easier. They have similar phonetic inventories, and both are Indo-European languages, so their phonotactic and syntactic rules are more alike than those of languages from completely disparate families, e.g. English and Japanese. Both are written in the Latin alphabet, which also simplifies the problem. But the languages are still different enough that the problem is non-trivial.
Their roles as global languages introduce specific challenges. Spanish is the official language of 21 countries and therefore, like English, has a wide range of accents and varieties. An ASR model that can truly handle more than one language not only has to deal with the variation across languages, but also with the variation within languages.
No two speakers are the same, and here at Speechmatics, we make it our mission to truly understand every voice.
Our bilingual model is powered by Ursa and performs well across a diverse range of voices and scenarios. Ursa’s powerful self-supervised learning (SSL) makes it particularly suited to multilingual scenarios.
During pre-training, Ursa is exposed to examples of audio from many different languages. Sample efficiency means that it can learn to map acoustics to phones with ease, no matter what language those phones belong to. On top of Ursa, we build a pipeline of models which can utilize this SSL. The main components of the pipeline are an acoustic model, which maps the SSL features to phones, and then a language model which maps into words.
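Conceptually, the pipeline is two stages on top of the SSL features: an acoustic model emits a phone per frame, and a language-model stage with a pronunciation lexicon maps phone sequences to words. The sketch below is a toy stand-in for that structure, not the production implementation:

```python
# Illustrative two-stage pipeline: SSL features -> phones -> words.
# The models here are toy placeholders, not Speechmatics' actual ones.

def acoustic_model(ssl_frames):
    """Map per-frame feature vectors to phone labels (toy argmax)."""
    phones = ["g", "a", "t", "o"]
    return [phones[f.index(max(f))] for f in ssl_frames]

def word_decoder(phone_seq, lexicon):
    """Map a phone sequence to a word via a pronunciation lexicon."""
    return lexicon.get("".join(phone_seq), "<unk>")

# A merged lexicon can contain entries from both languages.
lexicon = {"gato": "gato", "kat": "cat"}

frames = [
    [0.9, 0.1, 0.0, 0.0],  # scores highest for "g"
    [0.1, 0.8, 0.0, 0.1],  # "a"
    [0.0, 0.1, 0.9, 0.0],  # "t"
    [0.0, 0.1, 0.0, 0.9],  # "o"
]
print(word_decoder(acoustic_model(frames), lexicon))  # -> gato
```

Because the acoustic stage only deals in phones, it is largely language-agnostic; the language-specific knowledge is concentrated in the later stage.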
Due to the similarity between the sound inventories of Spanish and English, we can reuse the acoustic model from our existing monolingual pipeline. In theory this could fail, since some sounds exist in one language but not the other; in practice, we find our pipeline approach is powerful enough to overcome this potential issue.
In practice, only the language model component which deals with the vocabulary needs to be retrained with examples from both Spanish and English. By combining the vocabularies of the two languages, we can seamlessly output words in both languages.
Crucially, we do not rely on extra language identification modules; both Spanish and English are handled by a single model. This reduces latency and increases efficiency when running our bilingual model.
Here at Speechmatics, we pride ourselves on constant development and innovation. Our bilingual Spanish-English model is just another step in our mission to Understand Every Voice, and an exciting step ahead for our multilingual capabilities.
So, whether you're in a call center with multilingual customers, sitting in a Microsoft Teams call with colleagues from around the world, or watching your favorite YouTuber, Speechmatics has you covered.