Apr 20, 2023

Introducing Real-Time Translation: Breaking Down Language Barriers

Speechmatics are proud to announce our real-time voice translation service. Building on our existing best-in-class speech-to-text lets us offer highly accurate real-time translation in our single speech API. Try it out today!
Caroline Dockes, Machine Learning Engineer

Following the release of batch translation in February, real-time translation is now available in our SaaS offering. We provide translation of speech to and from English for 34 languages, tightly integrated with our high-accuracy transcription through a single real-time or batch API. Customers can start using this through our API; further details on how to use it can be found in our docs.

Our translation builds on top of our state-of-the-art speech-to-text system, and benefits from the substantial improvement in transcription accuracy offered by the Ursa generation models. We previously showed how the quality of automatic speech recognition (ASR) affects various downstream tasks. Here we discuss this in the context of translation.

Translation cannot recover from breakdowns in transcription

Unsurprisingly, when transcription breaks down, it is impossible for translation to recover the meaning of the original sentence. Here are some examples from the CoVoST2[1] test set:

System | Transcription | Translation
Google | Gets truly how Mr. Your creators. | Ruft sind wirklich wie Mr. Your Creators.
Speechmatics | Cats truly are mysterious creatures. | Katzen sind wirklich geheimnisvolle Kreaturen.
Reference | Cats truly are mysterious creatures. | Katzen sind wirklich geheimnisvolle Geschöpfe.

System | Transcription | Translation
Google | the Sheep at 13 him that | das Schaf um 13 das beigebracht
Speechmatics | The sheep had taught him that. | Die Schafe hatten ihm das beigebracht.
Reference | The sheep had taught him that. | Die Schafe hatten ihm das beigebracht.

System | Transcription | Translation
Google | No activo la frika tropical. | I do not activate the tropical freak.
Speechmatics | Es Nativo de África tropical. | Native of tropical Africa
Reference | Es nativo del África tropical. | Native of tropical Africa.

Of course, the examples above are rather extreme, but we find that even small mistakes from transcription can have a large impact on the resulting translation. Here is an example:

System | Transcription | Translation
Google | Elle croit en Tanzanie. | She believes in Tanzania.
Speechmatics | Elle croît en Tanzanie. | It is growing in Tanzania.
Reference | Elle croît en Tanzanie. | It grows in Tanzania

In this context, the French word "croit" means "believes", and the word "croît" means "grows". However, the two are pronounced exactly the same! From the perspective of transcription, substituting one for the other is a minor mistake: a single missing accent. Still, as you can see from the Google translation, the mistake causes the English translation to entirely lose the meaning of the original sentence.

Word Error Rates and BLEU Scores

Evaluating the two systems more systematically, we observe that Speechmatics’ lower word error rates (WERs) are associated with higher average BLEU on the CoVoST2 test set. BLEU[2] scores are a very commonly used automatic metric for translation quality. They measure the overlap, in terms of words and short word sequences (n-grams), between the machine-generated translation and one or more human-generated references.

Figure 1: Transcription Word Error Rate (WER) from Google and Speechmatics on the CoVoST2 speech translation test set. Lower scores are better.

Figure 2: BiLingual Evaluation Understudy (BLEU) scores from Google and Speechmatics on the CoVoST2 speech translation test set. Higher scores are better.
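
For readers less familiar with WER, here is a minimal sketch of how it is typically computed: the word-level edit distance (substitutions, deletions and insertions) between hypothesis and reference, divided by the number of reference words. The normalization step (lowercasing and stripping punctuation) and the function names are assumptions made for illustration; the exact normalization behind the figures above is not described in this post.

```python
import re


def normalize(text):
    """Lowercase and strip punctuation so that only word choice is scored.

    This normalization is an assumption for illustration only.
    """
    return re.sub(r"[^\w\s]", "", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Two of the examples shown earlier in this post:
print(word_error_rate("The sheep had taught him that.",
                      "the Sheep at 13 him that"))   # 2 errors / 6 words
print(word_error_rate("Elle croît en Tanzanie.",
                      "Elle croit en Tanzanie."))    # 1 error / 4 words
```

Under these assumptions the "croit"/"croît" example counts as a single substitution in a four-word sentence, even though it changes the meaning of the translation completely.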

Beyond BLEU Scores

BLEU scores are a convenient way to measure translation quality because they can be computed easily and in a standardized way. However, they are also limited in some ways. They penalize any deviation from the reference translation, even ones that preserve meaning and have the same level of fluency. They put the same weight on every word, even though sometimes a single word can flip the meaning of the entire sentence (e.g. “not”).

Here is an example that illustrates the limitations of BLEU:

System | Transcription | Translation | BLEU Score*
Google | Comme partons-nous pour faire. | How are we going to do. | 41.11
Speechmatics | Quand partons nous pour Ferrare? | When do we leave for Ferrara? | 8.64
Reference | Quand partons-nous pour Ferrare? | When are we going to Ferrare? | 100.00

*BLEU is a corpus-based metric and isn’t generally used to evaluate individual sentences. We only include sentence-level BLEU scores here for illustration.

Compared with the reference translation, the Speechmatics hypothesis substitutes words 2, 4, 5 and 6, while the Google hypothesis substitutes only words 1 and 6. From the point of view of BLEU scores, the latter is strictly better, despite the fact that the Speechmatics hypothesis matches the meaning of the reference translation much more closely.
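
To make the comparison above concrete, here is a minimal sketch of a sentence-level BLEU computation: the geometric mean of 1- to 4-gram precisions, multiplied by a brevity penalty. Two details are assumptions, since this post does not say which tool or settings produced the scores in the table: punctuation is split off as separate tokens, and n-gram orders with zero matches are smoothed with exponentially decaying pseudo-counts. The function names are illustrative.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split into words and standalone punctuation marks (an assumption)."""
    return re.findall(r"\w+|[^\w\s]", text)


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU on a 0-100 scale with simple smoothing."""
    hyp, ref = tokenize(hypothesis), tokenize(reference)
    log_precision_sum = 0.0
    smooth = 1.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(hyp_counts.values())
        if total == 0:
            return 0.0  # hypothesis is shorter than n tokens
        # Clipped matches: an n-gram counts at most as often as in the reference.
        matches = sum(min(count, ref_counts[gram])
                      for gram, count in hyp_counts.items())
        if matches == 0:
            if n == 1:
                return 0.0  # no word overlap at all
            smooth *= 2.0   # assumed smoothing: 1/2, 1/4, ... of one match
            precision = 1.0 / (smooth * total)
        else:
            precision = matches / total
        log_precision_sum += math.log(precision) / max_n
    # Brevity penalty: punish hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * brevity * math.exp(log_precision_sum)


reference = "When are we going to Ferrare?"
print(round(sentence_bleu("How are we going to do.", reference), 2))
print(round(sentence_bleu("When do we leave for Ferrara?", reference), 2))
```

Under these assumptions the sketch gives roughly 41.1 for the Google hypothesis and 8.6 for the Speechmatics hypothesis, in line with the table. The reason is visible in the n-gram counts: "are we going to" is one unbroken four-word match for the first hypothesis, whereas the second shares only isolated tokens with the reference and so has no matching bigrams at all.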

In response to the limitations of BLEU, people have tried to find better metrics of translation quality, ones that align more closely with human judgement. One such metric is the COMET score[3], submitted to the WMT20 Metrics Shared Task by Unbabel. It relies on a pretrained multilingual encoder, XLM-RoBERTa[4], to map the source text, the reference text, and the translation hypothesis into a shared feature space. The resulting representations are then fed to a feed-forward network, which is trained to predict human-generated quality assessments. While the absolute values of the scores are hard to interpret, the authors of [3] show that they correlate better with human judgements than BLEU scores do, indicating that they are a more meaningful way to rank different systems.
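
The data flow described above can be sketched in a few lines. This is a toy illustration only, not the COMET implementation: `toy_embed` is a hypothetical hash-based stand-in for the XLM-RoBERTa encoder, the feed-forward weights are random rather than trained on human assessments, and the way the three embeddings are combined (concatenating the hypothesis and reference vectors with their element-wise products and absolute differences against source and reference) is our understanding of the estimator design and should be treated as an assumption. The resulting score is therefore meaningless; only the shape of the computation is the point.

```python
import hashlib
import math
import random
from typing import List

DIM = 16  # toy embedding size; real encoders use far larger vectors


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a multilingual encoder (NOT XLM-RoBERTa).

    Hashes each lowercased word to a fixed-size vector and averages them.
    """
    words = text.lower().split()
    vec = [0.0] * DIM
    for word in words:
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        for i in range(DIM):
            vec[i] += digest[i] / 255.0 - 0.5
    return [v / max(len(words), 1) for v in vec]


def combine(src: List[float], hyp: List[float], ref: List[float]) -> List[float]:
    """Assumed feature layout: [h; r; h*s; h*r; |h-s|; |h-r|]."""
    prod_src = [h * s for h, s in zip(hyp, src)]
    prod_ref = [h * r for h, r in zip(hyp, ref)]
    diff_src = [abs(h - s) for h, s in zip(hyp, src)]
    diff_ref = [abs(h - r) for h, r in zip(hyp, ref)]
    return hyp + ref + prod_src + prod_ref + diff_src + diff_ref


def feed_forward(features: List[float], hidden: int = 8, seed: int = 0) -> float:
    """One hidden layer with random (untrained) weights, for shape only."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0, 0.1) for _ in features] for _ in range(hidden)]
    w2 = [rng.gauss(0, 0.1) for _ in range(hidden)]
    activations = [math.tanh(sum(w * x for w, x in zip(row, features)))
                   for row in w1]
    return sum(w * a for w, a in zip(w2, activations))


source = toy_embed("Pas une trace de Makololos.")
hypothesis = toy_embed("Not a trace of Mako Lolo.")
reference = toy_embed("No sign of Makololos.")
features = combine(source, hypothesis, reference)
print(len(features), feed_forward(features))
```

In the real metric, the encoder and the feed-forward network are trained so that the output tracks human quality judgements; here both are placeholders.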

Looking at COMET scores on the CoVoST2 test set, we again find that Speechmatics outperforms Google.

Figure 3: COMET scores from Google and Speechmatics on the CoVoST2 speech translation test set. Higher scores are better.

Looking beyond WER and BLEU scores to COMET scores also highlights the importance of capitalization and punctuation. In the following example, the Speechmatics and Google transcriptions each have one substitution and one insertion. Neither gets the tricky proper noun “Makololos” right, but the capitalization in the Speechmatics hypothesis helps preserve the original sentence’s meaning in the translation.

System | Transcription | Translation
Google | Pas une trace de ma coloros | Not a trace of my color
Speechmatics | Pas une trace de Mako Lolo. | Not a trace of Mako Lolo.
Reference | Pas une trace de Makololos. | No sign of Makololos.

Challenges of Real-Time Translation

Delivering a high-quality real-time translation system poses several challenges beyond translation quality. For one, we would like to minimize the delay between when a word is spoken and when the corresponding translation is returned. However, different languages have very different rules about word order, which can make this tricky. For example, German sentences often have the verb at the end. To translate such a sentence into English, we have to wait until the end of the sentence; we cannot do it incrementally. Waiting for the end of the sentence in turn means that we need a high-quality punctuation model to signal where the sentence ends. Striking the right balance between gathering enough context for high-quality translation and minimizing delay is something we are still actively working on.
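
The trade-off described above can be illustrated with a minimal sketch that buffers transcribed words until sentence-final punctuation arrives and only then hands the sentence to a translator. Everything here is hypothetical and for illustration only: the function names, the `translate` callback, and the `max_words` cap (a crude way to bound delay when no punctuation arrives) are assumptions, not a description of the Speechmatics system.

```python
from typing import Callable, Iterable, Iterator, List

SENTENCE_END = (".", "?", "!")


def buffer_sentences(words: Iterable[str], max_words: int = 30) -> Iterator[str]:
    """Group a stream of punctuated words into translatable chunks.

    A chunk is released when a word ends with sentence-final punctuation,
    or when max_words is reached (an assumed cap that limits delay at the
    cost of translating with incomplete context).
    """
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        if word.endswith(SENTENCE_END) or len(buffer) >= max_words:
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush whatever is left when the stream ends
        yield " ".join(buffer)


def translate_stream(words: Iterable[str],
                     translate: Callable[[str], str],
                     max_words: int = 30) -> Iterator[str]:
    """Translate each buffered chunk with a caller-supplied function."""
    for sentence in buffer_sentences(words, max_words):
        yield translate(sentence)


# The German verb "gelesen" only arrives at the end of the first sentence,
# so nothing is released for translation until the full stop is seen.
stream = "Ich habe gestern das Buch gelesen. Wann fahren wir?".split()
for chunk in buffer_sentences(stream):
    print(chunk)
```

A smaller `max_words` lowers the worst-case delay but makes it more likely that a chunk is cut before the verb arrives; the sketch also shows why the approach depends on the transcription supplying reliable end-of-sentence punctuation.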


Real-time translation is a new area for us, but we are excited that our strong foundation in ASR enables us to offer a competitive system, which we expect will keep improving in line with our transcription accuracy. In the coming months, we plan to roll out more APIs based on our ASR system, and we hope that these will also benefit from our state-of-the-art word error rates.

References

[1] Wang, C., et al. "CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus." arXiv:2007.10310 (2020).

[2] Papineni, K., et al. "Bleu: a method for automatic evaluation of machine translation." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (2002).

[3] Rei, R., et al. "Unbabel’s Participation in the WMT20 Metrics Shared Task." In Proceedings of the Fifth Conference on Machine Translation, pages 911–920, Online. Association for Computational Linguistics (2020).

[4] Conneau, A., et al. "Unsupervised Cross-lingual Representation Learning at Scale." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics (2020).

Author: Caroline Dockes
Acknowledgements: Ana Olssen, Andrew Innes, Benedetta Cevoli, Chris Waple, Dominik Jochec, Dumitru Gutu, Georgina Robertson, James Gilmore, John Hughes, Markus Hennerbichler, Nelson Kondia, Nick Gerig, Owais Aamir Thungalwadi, Owen O'Loan, Stuart Wood, Tom Young, Tomasz Swider, Tudor Evans, Venkatesh Chandran, Vignesh Umapathy and Yahia Abaza.