Apr 20, 2023 | Read time 5 min

Introducing Real-Time Translation: Breaking Down Language Barriers

Speechmatics is proud to announce our real-time voice translation service. Built on top of our existing best-in-class speech-to-text, it offers highly accurate real-time translation through a single speech API. Try it out today!
Real-Time Translation
Caroline Dockes, Machine Learning Engineer

Following the release of batch translation in February, real-time translation is now available in our SaaS offering. We provide translation of speech to and from English for 34 languages, tightly integrated with our high-accuracy transcription through a single real-time or batch API. Customers can start using this through our API; further details on how to use it can be found in our docs.
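For a feel of how translation slots into a transcription session, here is an illustrative sketch of a real-time session request with translation enabled. The message shape and field names below (e.g. `translation_config`, `target_languages`) are assumptions for illustration; our docs are the authoritative reference for the actual schema.

```python
import json

# Illustrative sketch only: the field names below ("translation_config",
# "target_languages") are assumptions, not the authoritative schema --
# consult the Speechmatics docs before use.
start_recognition = {
    "message": "StartRecognition",
    "audio_format": {
        "type": "raw",
        "encoding": "pcm_f32le",
        "sample_rate": 16000,
    },
    "transcription_config": {
        "language": "en",
        "enable_partials": True,
    },
    "translation_config": {
        # Request German and French translations alongside the transcript.
        "target_languages": ["de", "fr"],
    },
}

# This JSON payload would be sent over the websocket to start a session.
payload = json.dumps(start_recognition)
```

The key point is that transcription and translation are configured in one request, against one API, rather than chaining two separate services.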


Our translation builds on top of our state-of-the-art speech-to-text system and benefits from the substantial improvement in transcription accuracy offered by the Ursa generation of models. We previously showed how the quality of ASR impacts various downstream tasks; here we discuss this in the context of translation.

Translation cannot recover from breakdowns in transcription

Unsurprisingly, when transcription breaks down, it is impossible for translation to recover the meaning of the original sentence. Here are some examples from the CoVoST2[1] test set:

| System | Transcription | Translation |
| --- | --- | --- |
| Google | Gets truly how Mr. Your creators. | Ruft sind wirklich wie Mr. Your Creators. |
| Speechmatics | Cats truly are mysterious creatures. | Katzen sind wirklich geheimnisvolle Kreaturen. |
| Reference | Cats truly are mysterious creatures. | Katzen sind wirklich geheimnisvolle Geschöpfe. |


| System | Transcription | Translation |
| --- | --- | --- |
| Google | the Sheep at 13 him that | das Schaf um 13 das beigebracht |
| Speechmatics | The sheep had taught him that. | Die Schafe hatten ihm das beigebracht. |
| Reference | The sheep had taught him that. | Die Schafe hatten ihm das beigebracht. |

| System | Transcription | Translation |
| --- | --- | --- |
| Google | No activo la frika tropical. | I do not activate the tropical freak. |
| Speechmatics | Es Nativo de África tropical. | Native of tropical Africa |
| Reference | Es nativo del África tropical. | Native of tropical Africa. |

Of course, the examples above are rather extreme, but we find that even small mistakes from transcription can have a large impact on the resulting translation. Here is an example:

| System | Transcription | Translation |
| --- | --- | --- |
| Google | Elle croit en Tanzanie. | She believes in Tanzania. |
| Speechmatics | Elle croît en Tanzanie. | It is growing in Tanzania. |
| Reference | Elle croît en Tanzanie. | It grows in Tanzania |

In this context, the French word "croit" means "believe", and the word "croît" means "grow". However, the two are pronounced exactly the same! From the perspective of transcription, substituting one for the other is a minor mistake. Still, as you can see from the Google translation, the mistake causes the English translation to entirely lose the meaning of the original sentence.
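To make "minor mistake" concrete: word error rate is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch of the standard dynamic-programming computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One homophone substitution in a four-word sentence: WER is only 0.25,
# yet the downstream translation loses the meaning entirely.
print(wer("Elle croît en Tanzanie", "Elle croit en Tanzanie"))  # 0.25
```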

Word Error Rates and BLEU Scores

Evaluating the two systems more systematically, we observe that Speechmatics’ lower word error rates (WERs) are associated with higher average BLEU scores on the CoVoST2 test set. BLEU[2] is a very commonly used automatic metric for translation quality: it measures the overlap, in terms of words, between the machine-generated translation and one or more human-generated references.
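As a rough illustration of the mechanics (real evaluations use a standardized corpus-level implementation such as sacrebleu), sentence-level BLEU combines clipped n-gram precisions with a brevity penalty. A simplified sketch:

```python
import math
from collections import Counter

def sentence_bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Toy sentence-level BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty. For illustration only -- not sacrebleu."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = max(sum(hyp_ngrams.values()), 1)
        # Tiny floor avoids log(0) when a sentence has no n-gram overlap.
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * bp * math.exp(sum(log_precisions) / max_n)
```

A hypothesis identical to the reference scores 100, and every surface-form mismatch drags the n-gram precisions (and hence the score) down, regardless of whether the meaning is preserved.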

Figure 1: Transcription Word Error Rate (WER) from Google and Speechmatics on the CoVoST2 speech translation test set. Lower scores are better.

Figure 2: BiLingual Evaluation Understudy (BLEU) scores from Google and Speechmatics on the CoVoST2 speech translation test set. Higher scores are better.

Beyond BLEU Scores

BLEU scores are a convenient way to measure translation quality because they can be computed easily and in a standardized way. However, they are also limited in some ways. They penalize any deviation from the reference translation, even ones that preserve meaning and have the same level of fluency. They put the same weight on every word, even though sometimes a single word can flip the meaning of the entire sentence (e.g. “not”).

Here is an example that illustrates the limitations of BLEU:

| System | Transcription | Translation | BLEU Score* |
| --- | --- | --- | --- |
| Google | Comme partons-nous pour faire. | How are we going to do. | 41.11 |
| Speechmatics | Quand partons nous pour Ferrare? | When do we leave for Ferrara? | 8.64 |
| Reference | Quand partons-nous pour Ferrare? | When are we going to Ferrare? | 100.00 |

*BLEU is a corpus-based metric and isn’t generally used to evaluate individual sentences. We only include sentence-level BLEU scores here for illustration.

The Speechmatics hypothesis substitutes words 2, 4, 5, and 6; the Google hypothesis substitutes only words 1 and 6. From the point of view of BLEU, the latter is strictly better, despite the fact that the Speechmatics hypothesis matches the meaning of the reference translation much more closely.

In response to BLEU’s limitations, researchers have sought metrics of translation quality that align more closely with human judgement. One such metric is the COMET score[3], submitted to the WMT20 Metrics Shared Task by Unbabel. It relies on a pretrained multilingual encoder, XLM-RoBERTa[4], to embed the source text, the reference text, and the translation hypothesis into a shared feature space. The representations are then fed to a feed-forward network trained to predict human-generated quality assessments. While the absolute values of the scores are hard to interpret, Rei et al.[3] show that they correlate better with human judgements than BLEU scores, making them a more meaningful way to rank different systems.

Looking at COMET scores on the CoVoST2 test set, we again find that Speechmatics outperforms Google.

Figure 3: COMET scores from Google and Speechmatics on the CoVoST2 speech translation test set. Higher scores are better.

Looking beyond WER and BLEU scores to COMET scores also highlights the importance of capitalization and punctuation. In the following example, the Speechmatics and Google transcriptions each have one substitution and one insertion. Neither gets the tricky proper noun “Makololos” right, but the capitalization in the Speechmatics hypothesis helps preserve the original sentence’s meaning in the translation.

| System | Transcription | Translation |
| --- | --- | --- |
| Google | Pas une trace de ma coloros | Not a trace of my color |
| Speechmatics | Pas une trace de Mako Lolo. | Not a trace of Mako Lolo. |
| Reference | Pas une trace de Makololos. | No sign of Makololos. |

Challenges of Real-Time Translation

Delivering a high-quality real-time translation system poses several challenges beyond translation quality itself. For one, we want to minimize the delay between when a word is spoken and when the corresponding translation is returned. However, different languages have very different rules about word order, which can make this tricky. For example, German sentences often place the verb at the end; to translate such a sentence into English, we have to wait until the end of the sentence rather than translating incrementally. Waiting for the end of the sentence in turn requires a high-quality punctuation model to signal sentence boundaries. Striking the right balance between gathering enough context for high-quality translation and minimizing delay is something we are still actively working on.
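As a toy illustration of the trade-off, one simple policy is to buffer the incoming word stream until a sentence-final punctuation mark arrives, then hand the whole segment to the translator. (This is a deliberately simplified sketch, not our production policy.)

```python
from typing import Iterable, Iterator

def segments_for_translation(words: Iterable[str]) -> Iterator[str]:
    """Buffer streamed words and emit one segment per finished sentence.

    Illustrative policy only: waiting for sentence-final punctuation
    maximizes context (e.g. for verb-final German) at the cost of latency.
    """
    buffer = []
    for word in words:
        buffer.append(word)
        if word.endswith((".", "?", "!")):  # relies on the punctuation model
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush any trailing words at end of stream
        yield " ".join(buffer)

stream = ["Ich", "habe", "das", "Buch", "gelesen.", "Es", "war", "gut."]
print(list(segments_for_translation(stream)))
# ['Ich habe das Buch gelesen.', 'Es war gut.']
```

Note how latency here is bounded only by sentence length: a long verb-final sentence delays its translation until the final word, which is exactly the tension described above.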

Conclusion

Real-time translation is a new area for us, but we are excited that our strong foundation in ASR enables us to offer a competitive system, which we expect will keep improving in line with our transcription accuracy. In the coming months, we plan to roll out more APIs based on our ASR system, and we hope that these will also benefit from our state-of-the-art word error rates.

References

[1] Wang, C., et al. "CoVoST 2: A Massively Multilingual Speech-to-Text Translation Corpus." arXiv:2007.10310 (2020).

[2] Papineni, et al. "Bleu: a method for automatic evaluation of machine translation." Proceedings of the 40th annual meeting of the Association for Computational Linguistics (2002).

[3] Rei, R., et al. "Unbabel’s Participation in the WMT20 Metrics Shared Task." In Proceedings of the Fifth Conference on Machine Translation, pages 911–920, Online. Association for Computational Linguistics (2020).

[4] Conneau, A., et al. "Unsupervised Cross-lingual Representation Learning at Scale." In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics (2020).

Author: Caroline Dockes
Acknowledgements: Ana Olssen, Andrew Innes, Benedetta Cevoli, Chris Waple, Dominik Jochec, Dumitru Gutu, Georgina Robertson, James Gilmore, John Hughes, Markus Hennerbichler, Nelson Kondia, Nick Gerig, Owais Aamir Thungalwadi, Owen O'Loan, Stuart Wood, Tom Young, Tomasz Swider, Tudor Evans, Venkatesh Chandran, Vignesh Umapathy and Yahia Abaza.
