Feb 29, 2024 | Read time 8 min

Mastering Bilingual Communication: Transcribe English And Spanish Simultaneously With Advanced ASR

The key to unlocking bilingual transcription.
Ana Olssen, Machine Learning Engineer
Bethan Thomas, Senior Machine Learning Engineer

Engaging with multilingual audiences, particularly in linguistically diverse regions, is a unique challenge. American businesses, for instance, often navigate environments where English and Spanish interchange fluidly.

The ability to accurately identify and transcribe varying accents or dialects within audio content is not just a challenge - it's a necessity.

This is especially crucial in real-time applications like live news captioning. Traditional approaches requiring manual model switches for accent-specific transcription are impractical and inefficient. 

Enter Speechmatics' innovative solution: a global approach to language transcription that simplifies this complexity.

Empowering Communication Across Borders With ASR

Our latest bilingual release, supporting both Spanish and English, exemplifies our commitment to understanding every voice. This development is particularly significant given these languages rank among the top five globally in terms of the number of speakers.

Check out this great example from Ricardo Herreros-Symons, our VP of Corporate Development, testing our bilingual model:

The blending of cultures and languages in media content has become increasingly common, necessitating sophisticated solutions for accurate captioning. Captions have become a staple for broadcasters to improve the accessibility of their content, but the stakes are high; mismatches between expected language and actual spoken content can result in confusing or even embarrassing subtitles. Ensuring accurate captions in these situations saves broadcasters time and avoids embarrassment. 

This challenge extends beyond broadcasting to businesses that rely on transcribing calls or meetings. In scenarios where a conversation might switch from Spanish to English fluidly, traditional Automatic Speech Recognition (ASR) systems falter. Words spoken may be missed or transcribed as nonsense. This compromises analytics and compliance efforts with incomplete or inaccurate transcriptions.

Introducing Seamless Bilingual Transcription For Enhanced Accessibility And Inclusivity 

We’re excited to introduce our new bilingual Spanish and English transcription to address these challenges. It allows you to understand and transcribe over 2 billion people, a quarter of the world's population, in a single API call.  

The transcription can organically switch between the two languages, whether in a real-time audio stream or a batch file. We also provide the ability to translate it into a single language. So, for example, in the same API call, we can return the complete transcription in English, so that users who do not speak Spanish can fully comprehend the entire conversation without missing any details.
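
As a rough illustration of what "a single API call" could look like, the sketch below builds one job configuration that asks for bilingual transcription plus an English translation. The function name is hypothetical, and the field names (`transcription_config`, `domain`, `translation_config`, `target_languages`) and the value `bilingual-en` are assumptions made for illustration rather than details taken from this post, so check the Speechmatics API reference for the exact schema.

```python
import json


def build_bilingual_job_config(translate_to="en"):
    """Build a hypothetical job config for Spanish-English transcription.

    All keys and values are illustrative assumptions, not a confirmed schema.
    """
    config = {
        "type": "transcription",
        "transcription_config": {
            "language": "es",          # assumed: primary language code
            "domain": "bilingual-en",  # assumed: option that also enables English
        },
    }
    if translate_to is not None:
        # Assumed: translation is requested inside the same config object,
        # so transcript and translation come back from one request.
        config["translation_config"] = {"target_languages": [translate_to]}
    return config


if __name__ == "__main__":
    print(json.dumps(build_bilingual_job_config(), indent=2))
```

Passing `translate_to=None` would leave the translation section out, giving the mixed-language transcript only.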

This advancement not only enhances accessibility and inclusivity but also represents a significant leap forward in the global reach and impact of products and services in the digital age.

The Challenges Of Bilingual ASR

Modern ASR systems do an amazing job learning the particular nuances of a given language, and the rules that govern how sounds and words can be combined. Some examples of these language specifics are: 

  • Phonetic inventory - the phones that can be produced. Phones are the smallest sound units of a language which contribute to meaning. For instance, Portuguese has nasalized vowels, Xhosa uses clicks.

  • Phonotactic rules - how different phones can be combined. “Spr” (as in “spring”) is a valid combination in English, but not a possible sound combination in many other languages.

  • Syntactic rules - grammatical rules governing the order of words in a sentence. E.g. in Māori and Welsh, the verb typically comes first in the sentence, whereas in Korean and Turkish it comes last.

  • Alphabet - the script the language is written in. Italian, Arabic, and Thai all use different scripts.

An ASR model that covers more than one language has to learn all of the linguistic rules for each language. Not only can this require changes to the models themselves, but there is also a huge requirement for high-quality training and evaluation data in each language and, ideally, data from combined, i.e. code-switching, scenarios.

Code-switching is when a person switches between languages as they speak. This can happen within or across sentences. Without high-quality code-switching data, this type of speech is particularly challenging for ASR models. Because ASR models predict sequences of words based on their training input, they tend to learn that English is most commonly followed by English and likewise for any other language input.
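
To make the bias concrete, here is a minimal sketch that estimates how often one language follows another in a language-tagged corpus. The function name and the toy corpus are invented for illustration and say nothing about how Speechmatics trains its models. The point is only that when switches are rare in the data, a model that learns from sequence statistics will rarely predict them.

```python
from collections import Counter, defaultdict


def language_transition_probs(tagged_utterances):
    """Estimate P(next language | current language) from tagged words.

    Each utterance is a list of (word, language) pairs.
    """
    counts = defaultdict(Counter)
    for utterance in tagged_utterances:
        # Pair every word with its successor inside the same utterance.
        for (_, lang), (_, next_lang) in zip(utterance, utterance[1:]):
            counts[lang][next_lang] += 1
    probs = {}
    for lang, following in counts.items():
        total = sum(following.values())
        probs[lang] = {nxt: n / total for nxt, n in following.items()}
    return probs


# Toy corpus (invented): two monolingual utterances and one that
# switches from English to Spanish mid-sentence.
corpus = [
    [("how", "en"), ("are", "en"), ("you", "en"), ("today", "en")],
    [("muy", "es"), ("bien", "es"), ("gracias", "es")],
    [("see", "en"), ("you", "en"), ("mañana", "es")],
]

probs = language_transition_probs(corpus)
print(probs["en"])  # {'en': 0.8, 'es': 0.2}
print(probs["es"])  # {'es': 1.0}
```

In this toy data Spanish is never followed by English, so a purely count-based model would give that switch zero probability. Adding real code-switching examples is what moves those numbers.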

Our first bilingual model is focused on the language-pair of Spanish and English.

Their linguistic similarities make some tasks easier. They have similar phonetic inventories and belong to the same language family: both are Indo-European languages. This means that their phonotactic and syntactic rules are more alike than those of languages from completely different families, e.g. English and Japanese. Both use the Latin alphabet, which also simplifies the problem. But the languages are still different enough that this problem is non-trivial.

Their roles as global languages introduce specific challenges. Spanish is the official language of 21 countries and therefore, like English, has a wide range of accents and varieties. An ASR model that can truly handle more than one language not only has to deal with the variation across languages, but also with the variation within languages.

No two speakers are the same, and here at Speechmatics, we make it our mission to truly understand every voice.

Real-Time And Batch Transcription For 2 Billion People

Our bilingual model is powered by Ursa and performs well across a diverse range of voices and scenarios. Ursa’s powerful self-supervised learning (SSL) makes it particularly suited to multilingual scenarios.  

During pre-training, Ursa is exposed to examples of audio from many different languages. Sample efficiency means that it can learn to map acoustics to phones with ease, no matter what language those phones belong to. On top of Ursa, we build a pipeline of models which can utilize this SSL. The main components of the pipeline are an acoustic model, which maps the SSL features to phones, and then a language model which maps into words.  
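
The staged structure described above can be sketched as three functions composed in order. This is a toy outline under stated assumptions: the type aliases, the `transcribe` function, and the three stand-in stages are all hypothetical and bear no resemblance to the real Ursa models. It only shows the data flow from audio to features to phones to words.

```python
from typing import Callable, List

Audio = List[float]
Features = List[List[float]]
Phones = List[str]


def transcribe(
    audio: Audio,
    ssl_encoder: Callable[[Audio], Features],
    acoustic_model: Callable[[Features], Phones],
    language_model: Callable[[Phones], List[str]],
) -> List[str]:
    """Run the three stages in order: audio -> features -> phones -> words."""
    features = ssl_encoder(audio)
    phones = acoustic_model(features)
    return language_model(phones)


# Toy stand-ins so the sketch runs; none of these resemble real models.
def toy_encoder(audio: Audio) -> Features:
    # One single-value "feature vector" per audio sample.
    return [[sample] for sample in audio]


def toy_acoustic_model(features: Features) -> Phones:
    # Threshold each frame into one of two made-up phone labels.
    return ["o" if frame[0] > 0.5 else "n" for frame in features]


def toy_language_model(phones: Phones) -> List[str]:
    # Look the whole phone string up in a tiny lexicon.
    lexicon = {"no": "no"}
    return [lexicon.get("".join(phones), "<unk>")]


print(transcribe([0.1, 0.9], toy_encoder, toy_acoustic_model, toy_language_model))
# ['no']
```

Keeping each stage behind its own interface is what makes it possible to replace one stage without touching the others.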

Due to the similarity between the sound inventory of Spanish and English, we can reuse the acoustic model from our existing monolingual pipeline. Theoretically, this shouldn’t work, as some sounds exist in one language that don't appear within the other. However, we find our pipeline approach is powerful enough to overcome this potential issue.  

In practice, only the language model component which deals with the vocabulary needs to be retrained with examples from both Spanish and English. By combining the vocabularies of the two languages, we can seamlessly output words in both languages.  
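
A minimal sketch of the vocabulary-combining idea follows, assuming each vocabulary is a plain list of words. The function name and word lists are hypothetical, and a production vocabulary would be far larger and may not be word-based at all. The sketch merges two lists into one word-to-id mapping so a single output layer can emit words from either language.

```python
def build_joint_vocabulary(*vocabularies):
    """Merge per-language word lists into one word -> id mapping.

    Words shared by several languages (e.g. "no") get a single entry.
    """
    joint = {}
    for vocab in vocabularies:
        for word in vocab:
            if word not in joint:
                # Ids are assigned in order of first appearance.
                joint[word] = len(joint)
    return joint


english = ["no", "hello", "thanks"]
spanish = ["no", "hola", "gracias"]

joint = build_joint_vocabulary(english, spanish)
print(len(joint))        # 5, because "no" is shared
print(joint["gracias"])  # 4
```

Because both languages share one id space, nothing downstream has to know which language a word came from, which fits the single-model design.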

Crucially, we do not rely on extra language identification modules: both Spanish and English are handled by a single model. This reduces latency and increases efficiency when running our bilingual model.

Our First Multilingual Model - But Not The Last

Here at Speechmatics, we pride ourselves on constant development and innovation. Our bilingual Spanish-English model is just another step in our mission to Understand Every Voice, and an exciting step ahead for our multilingual capabilities. 

So, whether you’re in a call center with multilingual customers, sitting in a Microsoft Teams call with colleagues from around the world, or watching your favorite YouTuber, Speechmatics has got you covered.

Transcription coverage for over half the world's population

With Speechmatics, the only real question you need to ask is... Where next?