Following the release of batch translation in February, real-time translation is now available in our SaaS offering. We provide translation of speech to and from English for 34 languages, tightly integrated with our high-accuracy transcription through a single real-time or batch API. Customers can start using this through our API; further details on how to use it can be found in our docs.
You can see a live demo with a select few languages below:
Our translation builds on top of our state-of-the-art speech-to-text system, and benefits from the substantial improvement in transcription accuracy offered by the Ursa generation models. We previously showed how the quality of ASR affects various downstream tasks. Here we discuss this in the context of translation.
Translation cannot recover from breakdowns in transcription
Unsurprisingly, when transcription breaks down, it is impossible for translation to recover the meaning of the original sentence. Here are some examples from the CoVoST2 test set:
| System | Transcription | Translation |
|---|---|---|
| Google | Gets truly how Mr. Your creators. | Ruft sind wirklich wie Mr. Your Creators. |
| Speechmatics | Cats truly are mysterious creatures. | Katzen sind wirklich geheimnisvolle Kreaturen. |
| Reference | Cats truly are mysterious creatures. | Katzen sind wirklich geheimnisvolle Geschöpfe. |

| System | Transcription | Translation |
|---|---|---|
| Google | the Sheep at 13 him that | das Schaf um 13 das beigebracht |
| Speechmatics | The sheep had taught him that. | Die Schafe hatten ihm das beigebracht. |
| Reference | The sheep had taught him that. | Die Schafe hatten ihm das beigebracht. |

| System | Transcription | Translation |
|---|---|---|
| Google | No activo la frika tropical. | I do not activate the tropical freak. |
| Speechmatics | Es Nativo de África tropical. | Native of tropical Africa |
| Reference | Es nativo del África tropical. | Native of tropical Africa. |
Of course, the examples above are rather extreme, but we find that even small mistakes from transcription can have a large impact on the resulting translation. Here is an example:
| System | Transcription | Translation |
|---|---|---|
| Google | Elle croit en Tanzanie. | She believes in Tanzania. |
| Speechmatics | Elle croît en Tanzanie. | It is growing in Tanzania. |
| Reference | Elle croît en Tanzanie. | It grows in Tanzania |
In this context, the French word "croit" means "believe", and the word "croît" means "grow". However, the two are pronounced exactly the same! From the perspective of transcription, substituting one for the other is a minor mistake. Still, as you can see from the Google translation, the mistake causes the English translation to entirely lose the meaning of the original sentence.
Word Error Rates and BLEU Scores
Evaluating the two systems more systematically, we observe that Speechmatics’ lower word error rates (WERs) are associated with higher average BLEU on the CoVoST2 test set. BLEU is a widely used automatic metric for translation quality. It measures the overlap, in terms of words and short word sequences (n-grams), between the machine-generated translation and one or more human-generated references.
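To make the two metrics concrete, here is a minimal Python sketch, not the evaluation code behind the numbers in this post. It computes WER as the word-level edit distance divided by the reference length, and a clipped n-gram precision, which is the basic ingredient of BLEU. The helper names (`normalize`, `wer`, `ngram_precision`) and the simple normalization (lowercasing and stripping ASCII punctuation) are assumptions made for illustration. Full BLEU additionally combines the precisions for n = 1 to 4 with a geometric mean and applies a brevity penalty, which this sketch leaves out.

```python
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, strip ASCII punctuation and split on whitespace.

    This is an assumed, deliberately simple normalization for illustration.
    """
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def ngram_precision(reference: str, hypothesis: str, n: int) -> float:
    """Clipped n-gram precision of the hypothesis against one reference."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count per n-gram ("clipping").
    matched = sum((hyp_counts & ref_counts).values())
    return matched / total


reference = "The sheep had taught him that."
hypothesis = "the Sheep at 13 him that"
print(wer(reference, hypothesis))                 # 2 errors over 6 words
print(ngram_precision(reference, hypothesis, 1))  # 4 of 6 unigrams match
print(ngram_precision(reference, hypothesis, 2))  # 2 of 5 bigrams match
```

Applied to the transcription from the sheep example above, the two substituted words give a WER of 2/6, while the bigram precision drops faster than the unigram precision because each wrong word breaks two neighbouring bigrams.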
Beyond BLEU Scores
BLEU scores are a convenient way to measure translation quality because they can be computed easily and in a standardized way. However, they are also limited in some ways. They penalize any deviation from the reference translation, even ones that preserve meaning and have the same level of fluency. They put the same weight on every word, even though sometimes a single word can flip the meaning of the entire sentence (e.g. “not”).
Here is an example that illustrates the limitations of BLEU:
| System | Transcription | Translation | BLEU* |
|---|---|---|---|
| Google | Comme partons-nous pour faire. | How are we going to do. | 41.11 |
| Speechmatics | Quand partons nous pour Ferrare? | When do we leave for Ferrara? | 8.64 |
| Reference | Quand partons-nous pour Ferrare? | When are we going to Ferrare? | 100.00 |
*BLEU is a corpus-based metric and isn’t generally used to evaluate individual sentences. We only include sentence-level BLEU scores here for illustration.
Compared with the reference translation, the Speechmatics hypothesis substitutes words 2, 4, 5 and 6, while the Google hypothesis substitutes only words 1 and 6. From the point of view of BLEU scores, the latter is strictly better, despite the fact that the Speechmatics hypothesis matches the meaning of the reference translation much more closely.
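The word positions quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function name `substituted_positions` is made up for this example, and it compares words position by position, which works here only because all three translations happen to have the same number of words.

```python
from typing import List


def substituted_positions(reference: str, hypothesis: str) -> List[int]:
    """Return the 1-based positions where the hypothesis word differs from the reference.

    Assumes both sentences have the same number of whitespace-separated words,
    so no alignment step is needed.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if len(ref) != len(hyp):
        raise ValueError("sentences must have the same number of words")
    return [i for i, (r, h) in enumerate(zip(ref, hyp), start=1) if r != h]


reference = "When are we going to Ferrare?"
print(substituted_positions(reference, "When do we leave for Ferrara?"))  # [2, 4, 5, 6]
print(substituted_positions(reference, "How are we going to do."))        # [1, 6]
```

A purely overlap-based metric sees four mismatches against two and prefers the second hypothesis, even though the two words it gets wrong ("How" and "do.") are the ones that carry the question and the destination.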
In response to BLEU scores’ limitations, people have tried to find better metrics of translation quality, ones that align more closely with human judgement. One such metric is the COMET score submitted to the WMT20 Metrics Shared Task by Unbabel. It relies on a pretrained multilingual encoder, XLM-RoBERTa, to map the source text, the reference text, and the translation hypothesis into a shared feature space. The representations are then fed to a feed-forward network which is trained to predict human-generated quality assessments. While the absolute values of the scores are hard to interpret, they have been shown to correlate better with human judgements than BLEU scores, indicating that they are a more meaningful way to rank different systems.
Looking at COMET scores on the CoVoST2 test set, we again find that Speechmatics outperforms Google.
Looking beyond WER and BLEU scores to COMET scores also highlights the importance of capitalization and punctuation. In the following example, the Speechmatics and Google transcriptions each have one substitution and one insertion. Neither gets the tricky proper noun “Makololos” right, but the capitalization in the Speechmatics hypothesis helps preserve the original sentence’s meaning in the translation.
| System | Transcription | Translation |
|---|---|---|
| Google | Pas une trace de ma coloros | Not a trace of my color |
| Speechmatics | Pas une trace de Mako Lolo. | Not a trace of Mako Lolo. |
| Reference | Pas une trace de Makololos. | No sign of Makololos. |
Challenges of Real-Time Translation
Delivering a high-quality real-time translation system poses several challenges beyond translation quality. For one, we would like to minimize the delay between when a word is spoken and when the corresponding translation is returned. However, languages differ widely in their word order rules, which makes this tricky. For example, German sentences often place the verb at the end, so to translate such a sentence into English we have to wait until the sentence is complete; we cannot translate it incrementally. Waiting for the end of the sentence in turn requires a high-quality punctuation model to signal where the sentence ends. Striking the right balance between gathering enough context for high-quality translation and minimizing delay is something we are still actively working on.
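The trade-off described above can be illustrated with a small Python sketch of the simplest possible strategy: buffer the transcribed words and only translate once the punctuation signals the end of a sentence. This is not a description of our production system. The function `translate_by_sentence`, the `SENTENCE_END` marker set, and the `translate` callback (a stand-in for any translation function) are all hypothetical names introduced for this example.

```python
from typing import Callable, Iterable, Iterator, List

# Assumed end-of-sentence markers, as produced by a punctuation model.
SENTENCE_END = (".", "?", "!")


def translate_by_sentence(
    words: Iterable[str],
    translate: Callable[[str], str],
) -> Iterator[str]:
    """Buffer incoming words and translate one full sentence at a time.

    `translate` is a hypothetical stand-in for a translation function.
    """
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        # Wait for sentence-final punctuation before translating, so that
        # a verb at the end of the sentence is available as context.
        if word.endswith(SENTENCE_END):
            yield translate(" ".join(buffer))
            buffer = []
    # Flush any words left over when the stream ends mid-sentence.
    if buffer:
        yield translate(" ".join(buffer))


# Usage with a dummy "translator" that just upper-cases its input.
stream = "Ich habe den Hund gesehen. Wo ist er?".split()
for translated in translate_by_sentence(stream, str.upper):
    print(translated)
```

In this sketch the delay for the first word of a sentence grows with the sentence length, and a missed or late punctuation mark delays the whole translation, which is why the quality of the punctuation model matters so much in the real-time setting.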
Real-time translation is a new area for us, but we are excited that our strong foundation in ASR enables us to offer a competitive system, which we expect will keep improving in line with our transcription accuracy. In the coming months, we plan to roll out more APIs based on our ASR system, and we hope that these will also benefit from our state-of-the-art word error rates.