Apr 20, 2023 | Read time 5 min

Introducing Real-Time Translation: Breaking Down Language Barriers

Speechmatics are proud to announce our real-time voice translation service. Combining translation with our existing best-in-class speech-to-text allows us to offer highly accurate real-time translation through a single speech API. Try it out today!
Caroline Dockes, Machine Learning Engineer

Following the release of batch translation in February, real-time translation is now available in our SaaS offering. We provide translation of speech to and from English for 34 languages, tightly integrated with our high-accuracy transcription through a single real-time or batch API. Customers can start using it through our API; further details on how to use it can be found in our docs.


Our translation builds on top of our state-of-the-art speech-to-text system, and benefits from the substantial improvement in transcription accuracy offered by the Ursa generation of models. We previously showed how the quality of automatic speech recognition (ASR) impacts various downstream tasks. Here we discuss this in the context of translation.

Translation cannot recover from breakdowns in transcription

Unsurprisingly, when transcription breaks down, it is impossible for translation to recover the meaning of the original sentence. Here are some examples from the CoVoST2[1] test set:

| System | Transcription | Translation |
| --- | --- | --- |
| Google | Gets truly how Mr. Your creators. | Ruft sind wirklich wie Mr. Your Creators. |
| Speechmatics | Cats truly are mysterious creatures. | Katzen sind wirklich geheimnisvolle Kreaturen. |
| Reference | Cats truly are mysterious creatures. | Katzen sind wirklich geheimnisvolle Geschöpfe. |


| System | Transcription | Translation |
| --- | --- | --- |
| Google | the Sheep at 13 him that | das Schaf um 13 das beigebracht |
| Speechmatics | The sheep had taught him that. | Die Schafe hatten ihm das beigebracht. |
| Reference | The sheep had taught him that. | Die Schafe hatten ihm das beigebracht. |

| System | Transcription | Translation |
| --- | --- | --- |
| Google | No activo la frika tropical. | I do not activate the tropical freak. |
| Speechmatics | Es Nativo de África tropical. | Native of tropical Africa |
| Reference | Es nativo del África tropical. | Native of tropical Africa. |

Of course, the examples above are rather extreme, but we find that even small mistakes from transcription can have a large impact on the resulting translation. Here is an example:

| System | Transcription | Translation |
| --- | --- | --- |
| Google | Elle croit en Tanzanie. | She believes in Tanzania. |
| Speechmatics | Elle croît en Tanzanie. | It is growing in Tanzania. |
| Reference | Elle croît en Tanzanie. | It grows in Tanzania |

In this context, the French word "croit" means "believe", and the word "croît" means "grow". However, the two are pronounced exactly the same! From the perspective of transcription, substituting one for the other is a minor mistake. Still, as you can see from the Google translation, the mistake causes the English translation to entirely lose the meaning of the original sentence.

Word Error Rates and BLEU Scores

Evaluating the two systems more systematically, we observe that Speechmatics’ lower word error rates (WERs) are associated with higher average BLEU on the CoVoST2 test set. BLEU[2] scores are a very commonly used automatic metric for translation quality. They measure the overlap, in terms of words and short word sequences (n-grams), between the machine-generated translation and one or more human-generated references.
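To make the WER metric concrete, here is a minimal sketch of how it can be computed: the word-level edit distance (substitutions, deletions and insertions) between hypothesis and reference, divided by the number of reference words. The `normalize` helper, which lowercases and strips punctuation, is an assumption made for illustration; the normalization behind the figures in this post may differ.

```python
import re


def normalize(text):
    # Assumption for illustration: lowercase and drop punctuation.
    return re.findall(r"[\w']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


reference = "Cats truly are mysterious creatures."
print(word_error_rate(reference, "Gets truly how Mr. Your creators."))     # 1.0
print(word_error_rate(reference, "Cats truly are mysterious creatures."))  # 0.0
```

With this normalization, the first Google example above needs four substitutions and one insertion against a five-word reference, giving a WER of 1.0.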

Figure 1: Transcription Word Error Rate (WER) from Google and Speechmatics on the CoVoST2 speech translation test set. Lower scores are better.

Figure 2: BiLingual Evaluation Understudy (BLEU) scores from Google and Speechmatics on the CoVoST2 speech translation test set. Higher scores are better.

Beyond BLEU Scores

BLEU scores are a convenient way to measure translation quality because they can be computed easily and in a standardized way. However, they are also limited in some ways. They penalize any deviation from the reference translation, even ones that preserve meaning and have the same level of fluency. They put the same weight on every word, even though sometimes a single word can flip the meaning of the entire sentence (e.g. “not”).

Here is an example that illustrates the limitations of BLEU:

| System | Transcription | Translation | BLEU Score* |
| --- | --- | --- | --- |
| Google | Comme partons-nous pour faire. | How are we going to do. | 41.11 |
| Speechmatics | Quand partons nous pour Ferrare? | When do we leave for Ferrara? | 8.64 |
| Reference | Quand partons-nous pour Ferrare? | When are we going to Ferrare? | 100.00 |

*BLEU is a corpus-based metric and isn’t generally used to evaluate individual sentences. We only include sentence-level BLEU scores here for illustration.

The Speechmatics hypothesis substitutes words 2, 4, 5 and 6. The Google hypothesis substitutes only words 1 and 6. From the point of view of BLEU scores, the latter is strictly better, despite the fact that the Speechmatics hypothesis matches the meaning of the reference translation much more closely.
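The effect can be seen by counting matching n-grams directly. The sketch below computes the clipped (modified) n-gram precisions that BLEU is built on, for n = 1 to 4. It is an illustration only: the tokenization is a simplifying assumption, and it leaves out the geometric mean, brevity penalty and smoothing of a full BLEU implementation, so it does not attempt to reproduce the scores in the table.

```python
import re
from collections import Counter


def tokens(text):
    # Assumption for illustration: lowercase and drop punctuation.
    return re.findall(r"[\w']+", text.lower())


def ngram_counts(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def modified_precision(reference, hypothesis, n):
    """Fraction of hypothesis n-grams also found in the reference (clipped)."""
    ref_counts = ngram_counts(tokens(reference), n)
    hyp_counts = ngram_counts(tokens(hypothesis), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's count to the number of times it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


reference = "When are we going to Ferrare?"
hypotheses = {
    "Google": "How are we going to do.",
    "Speechmatics": "When do we leave for Ferrara?",
}
for name, hypothesis in hypotheses.items():
    precisions = [round(modified_precision(reference, hypothesis, n), 2) for n in range(1, 5)]
    print(name, precisions)
# Google [0.67, 0.6, 0.5, 0.33]
# Speechmatics [0.33, 0.0, 0.0, 0.0]
```

The Google hypothesis shares a long run of words with the reference ("are we going to"), so it scores well at every n-gram order, while the Speechmatics hypothesis shares only two isolated words, even though it is the one that conveys the right meaning.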

In response to BLEU scores’ limitations, people have tried to find better metrics of translation quality, ones that align more closely with human judgement. One such metric is the COMET score[3], submitted to the WMT20 Metrics Shared Task by Unbabel. It relies on a pretrained multilingual encoder, XLM-RoBERTa[4], to map the source text, the reference text, and the translation hypothesis into a shared feature space. The representations are then fed to a feed-forward network, which is trained to predict human-generated quality assessments. While the absolute values of the scores are hard to interpret, Rei et al. [3] show that they correlate better with human judgements than BLEU scores do, indicating that they are a more meaningful way to rank different systems.

Looking at COMET scores on the CoVoST2 test set, we again find that Speechmatics outperforms Google.

Figure 3: COMET scores from Google and Speechmatics on the CoVoST2 speech translation test set. Higher scores are better.

Looking beyond WER and BLEU scores to COMET scores also highlights the importance of capitalization and punctuation. In the following example, the Speechmatics and Google transcriptions each have one substitution and one insertion. Neither gets the tricky proper noun “Makololos” right, but the capitalization in the Speechmatics hypothesis helps preserve the original sentence’s meaning in the translation.

| System | Transcription | Translation |
| --- | --- | --- |
| Google | Pas une trace de ma coloros | Not a trace of my color |
| Speechmatics | Pas une trace de Mako Lolo. | Not a trace of Mako Lolo. |
| Reference | Pas une trace de Makololos. | No sign of Makololos. |

Challenges of Real-Time Translation

Delivering a high-quality real-time translation system poses several challenges beyond translation quality. For one, we would like to minimize the delay between when a word is spoken and when the corresponding translation is returned. However, different languages have very different rules about word order, which can make this tricky. For example, German sentences often have the verb at the end. To translate such a sentence into English, we have to wait until the end of the sentence; we cannot do it incrementally. Waiting for the end of the sentence also means that we need a high-quality punctuation model to signal where the sentence ends. Striking the right balance between gathering enough context for high-quality translation and minimizing delay is something we are still actively working on.
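The trade-off can be illustrated with a simple sentence-level buffering strategy: hold back transcribed words until the punctuation signals the end of a sentence, and only then hand the full sentence to the translator. This is a hypothetical sketch, not a description of how the Speechmatics system works; `SentenceBuffer` and `translate_stub` are names invented for illustration, and the German sentence is our own example.

```python
SENTENCE_END = (".", "?", "!")


class SentenceBuffer:
    """Collects streamed words and releases them one full sentence at a time."""

    def __init__(self):
        self._words = []

    def push(self, word):
        """Add one transcribed word; return the sentence if this word ends it, else None."""
        self._words.append(word)
        if word.endswith(SENTENCE_END):
            sentence = " ".join(self._words)
            self._words = []
            return sentence
        return None


def translate_stub(sentence):
    # Hypothetical placeholder: a real system would call a translation model here.
    return f"<translation of: {sentence}>"


buffer = SentenceBuffer()
# The verb ("gelesen") only arrives as the last word of the sentence.
for word in "Ich habe gestern das Buch gelesen.".split():
    sentence = buffer.push(word)
    if sentence is not None:
        print(translate_stub(sentence))
```

Nothing is emitted until the final word arrives, which is exactly the delay described above, and the approach works only as well as the punctuation it relies on: a missed full stop delays the translation further, while a spurious one cuts the sentence before the verb.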

Conclusion

Real-time translation is a new area for us, but we are excited that our strong foundation in ASR enables us to offer a competitive system, which we expect will keep improving in line with our transcription accuracy. In the coming months, we plan to roll out more APIs based on our ASR system, and we hope that these will also benefit from our state-of-the-art word error rates.

References

[1] Wang, C., et al. "CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus." arXiv:2007.10310 (2020).

[2] Papineni, K., et al. "Bleu: a method for automatic evaluation of machine translation." In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (2002).

[3] Rei, R., et al. "Unbabel’s Participation in the WMT20 Metrics Shared Task." In Proceedings of the Fifth Conference on Machine Translation, pages 911–920, Online. Association for Computational Linguistics (2020).

[4] Conneau, A., et al. "Unsupervised Cross-lingual Representation Learning at Scale." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics (2020).

Author: Caroline Dockes
Acknowledgements: Ana Olssen, Andrew Innes, Benedetta Cevoli, Chris Waple, Dominik Jochec, Dumitru Gutu, Georgina Robertson, James Gilmore, John Hughes, Markus Hennerbichler, Nelson Kondia, Nick Gerig, Owais Aamir Thungalwadi, Owen O'Loan, Stuart Wood, Tom Young, Tomasz Swider, Tudor Evans, Venkatesh Chandran, Vignesh Umapathy and Yahia Abaza.
