May 26, 2023 | Read time 8 min

Speechmatics Unified Speech Translation API is here

Speechmatics breaks down language barriers of spoken communication in real-time with our brand new unified speech translation API for files and live audio. Find out more!
Stuart Wood, Product Manager

In an increasingly connected world, the ability to communicate with people from all corners of the globe is more important than ever. For media organizations, this means being able to provide content to an international audience. For enterprises, it’s about enabling international teams to function effectively while preserving the richness of their individual cultures. 

Building on the world-leading accuracy from our Ursa models, Speechmatics now enables our customers to easily exploit the benefits of speech translation in both batch and real-time scenarios. You can read the documentation here, and in this blog we wanted to lift the lid on some of our motivations for building this into our product and discuss some of the main challenges we had to overcome in its development.

A Personal Experience of the Need for Translation

Last year, during a visit to my partner’s family in Taiwan, I became acutely aware of the limitations of my Mandarin skills. Keeping up with the conversation around me was a significant challenge. Had I had access to our newly launched real-time speech translation technology, I would have been able to understand a significant part of the dialogue unfolding around me. Unlike Google Translate, which is built for short audio snippets, Speechmatics’ real-time translation software allows for transcribing continuous, lengthy discussions, such as a 30-minute conversation.  

Will it be perfect? Of course not; no transcription is entirely free of errors. But can it facilitate my overall understanding of the dialogue? Absolutely.

Given that this need to understand has been around for as long as different languages have existed, it’s no surprise that there is a rich history of trying to use technology to help with translation.

Developments in Machine Translation

In the last decade, machine translation has taken a significant leap forward with the advent of neural machine translation (NMT). NMT utilizes artificial neural networks, modelled loosely on the human brain, to process and translate language. These systems consider a larger context rather than translating word by word, leading to a far more accurate and natural-sounding translation. 

With the advances in speech recognition that Speechmatics have fuelled, we can now take advantage of the advances in NMT to transcribe and translate spoken language in real-time, opening up this technology to help our customers understand every voice. 

It hasn’t been straightforward though, and we’ve had to overcome some tricky challenges in being able to offer real-time translation that we’re proud of.

Challenges in Real-Time Translation

Real-time translation presents unique challenges, particularly when latency is a critical factor. Despite overcoming several of these issues, some linguistic challenges still exist which you may have come across if you’ve learnt a foreign language.  

A nice illustration of this is the word ordering in German, which can complicate the translation of sentence fragments. Take, for example, the sentence “Ich wollte meine Oma anrufen,” which translates to “I wanted to call my grandma.” If we split this sentence into two segments and translate each separately to reduce the time needed for translation, we would have “Ich wollte meine” and “Oma anrufen.” However, combining the translated segments would result in the nonsensical phrase “I wanted mine call grandma.” To provide optimal speed and accuracy, Speechmatics has implemented the concept of partial translations, which can be continuously updated as we get more context, and final translations, which are usually at sentence boundaries and give the highest accuracy.
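To make the partial/final distinction concrete, here is a minimal sketch of how a client might keep a caption display up to date. It is an illustration only: the class and method names (`TranslationBuffer`, `on_partial`, `on_final`) are hypothetical and are not taken from the Speechmatics API. The point it shows is that a partial replaces whatever partial came before it, while a final commits a sentence for good.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationBuffer:
    """Holds committed (final) sentences plus the latest revisable partial.

    Hypothetical helper for illustration; not part of any Speechmatics SDK.
    """

    finals: list[str] = field(default_factory=list)
    partial: str = ""

    def on_partial(self, text: str) -> None:
        # A partial overwrites the previous partial; it is never appended,
        # because later context may change the whole fragment.
        self.partial = text

    def on_final(self, text: str) -> None:
        # A final commits the sentence and clears the pending partial.
        self.finals.append(text)
        self.partial = ""

    def display(self) -> str:
        # Show committed sentences followed by the current partial, if any.
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(parts)


buffer = TranslationBuffer()
buffer.on_partial("I wanted my")          # early guess, low context
buffer.on_partial("I wanted my grandma")  # revised as more words arrive
buffer.on_final("I wanted to call my grandma.")
print(buffer.display())
```

A captioning UI built this way can show the partial immediately for low latency and then quietly swap in the final once the sentence boundary is reached.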

The demand for real-time, or near real-time, translation introduces the challenge of processing speed. The system must be able to transcribe speech and translate it, all in a matter of seconds, while maintaining high accuracy. Achieving this requires a delicate balance between speed and precision. So, we’ve optimized our models and infrastructure to ensure quick, real-time translations without compromising quality. 

Language variation is another hurdle to overcome: accents, pronunciation and speed vary considerably among speakers. This presents an additional challenge for speech recognition, the first step in speech translation. However, this is an area Speechmatics excels in, with Ursa delivering exceptional accuracy across a broad spectrum of speakers.


For quality translation, the machine learning models’ accuracy is important. In this case, we must consider speech-to-text accuracy for transcribing what a person said and the accuracy of the machine translation. 

To find out more about the accuracy of our combined offering and how we measure it, you can check out our previous blog. Caroline provides excellent examples of why accurate transcription is essential. Meaning and context can easily be lost in either process, but translation is an even bigger challenge since there are multiple ways to translate a sentence. 

The journey towards perfect real-time speech translation is ongoing, but we’re excited about our progress so far and are committed to continually improving and refining our technology.

Ease of Implementation

A key focus for Speechmatics is on the ease of implementation for our customers. We understand that integrating multiple APIs for translation services can be a daunting task, necessitating custom development and resulting in increased latency, which is critical to minimize in real-time operations.  

To address this, Speechmatics has created a single, unified API which enables sending audio once and receiving both transcription and translation. This streamlined approach significantly simplifies the setup and consumption of translation services. This simplicity helps with challenging scenarios such as live captioning of events or broadcasts. Here, multiple factors need to be simultaneously considered, including partial sentences and updates in the context as more words are spoken. Our unified API ensures these variables are expertly managed while keeping latency to a bare minimum, making real-time captioning more efficient and effective. 
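As a rough sketch of what "send audio once, receive both" can look like from the client side, the snippet below builds a single job configuration that asks for transcription and translation together. The helper `build_job_config` is hypothetical, and the field names (`transcription_config`, `translation_config`, `target_languages`) are assumptions made for illustration; check them against the documentation before relying on them.

```python
import json


def build_job_config(source_language: str, target_languages: list[str]) -> dict:
    """Build one request config covering transcription and translation.

    Field names here are assumptions for illustration, not a confirmed schema.
    """
    if not target_languages:
        raise ValueError("at least one target language is required")
    return {
        "type": "transcription",
        # Language spoken in the audio.
        "transcription_config": {"language": source_language},
        # Languages to translate the transcript into, in the same request.
        "translation_config": {"target_languages": list(target_languages)},
    }


config = build_job_config("de", ["en"])
print(json.dumps(config, indent=2))
```

The design point is that there is one request and one audio upload; no second service has to be called with the transcript, which is where the extra latency and integration work would otherwise come from.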

Using Speechmatics speech-to-text and translation in one API enables businesses to reach a wider geographical audience without additional effort. We believe these developments will help businesses greatly reduce costs, speed up processes, enter new markets and reach wider audiences. Media, EdTech, and Contact Centers are three industries in particular that we believe can capitalize on these features; explore how they can do so below.


Enable Localization of Media Content with the most powerful Real-time Speech Translation API

As a content creator, you always want to reach the broadest possible audience with your work. And in today’s globalized world, that means expanding beyond the English-speaking market. However, producing accurate multilingual captions can be daunting, especially if you don’t have the budget to hire professional translators. In a live stream, it’s even more challenging due to the short turn-around time.

This is where our machine translation, combined with speech recognition, comes in. Using our translation API, we hope to see platforms offer automatic captioning in many languages for their creators and audiences, for both videos and live streams, which human translators can then edit to ensure accuracy if required.

This approach has several benefits. Firstly, it enables organizations to reach a much wider audience, as they are no longer limited to those who speak the same language. Secondly, it saves time and resources that would otherwise be spent manually translating content. 

Localization has been a key strategy for media companies, TV production houses, and platforms that power video to millions of audiences. There is an increasing demand from viewers in different regions for automatic captioning of videos in multiple languages, across TV shows and broadcasts, live events, game streaming, podcasts, YouTube, and more.

Content production is seeing a seismic shift with the rise and democratization of Generative AI, Voice cloning, and AI Dubbing. A good example is this viral tweet of popular machine learning expert and YouTuber Lex Fridman’s podcast with Jordan Peterson, which was translated and then voice cloned in Spanish. Reactions from the wider community, many of whom are looking to adopt these developments into their content and businesses, have been positive. At the forefront of the Voice cloning revolution has been Eleven Labs, the platform used to clone the translated audio for Lex Fridman’s podcast.

Currently, content creator, OTT, and other video monetization platforms like YouTube, Vimeo, and Patreon offer automatic captioning for their users. These platforms can automatically detect the language of an uploaded video and generate captions in it. However, creators have been left to find their own methods of translating for their global audiences, typically by hand, and the results are often inaccurate.


Inclusive Classrooms: Speech Translation Encourages Diversity and Inclusion in Education

The internet has revolutionized education, making lectures and videos globally accessible and facilitating learning on a global scale. Creating content that is available with captions in multiple languages can help share knowledge globally and include students who are non-native speakers. 

With accurate real-time translation at your disposal, you can enable hyper-localization and enter new markets, increasing revenue, accessibility and engagement. Customer-centric companies like Coursera and Skillshare are already utilizing tools like Papercup for dubbing their courses in multiple languages to reach students globally. There is a clear need to cater to a global audience in education by utilizing voice as the foundation to deliver more accessible experiences to learners.

We believe lectures and course content with multilingual captions can help facilitate collaboration and exchange between students, teachers, and researchers from different language backgrounds.

Contact Center as a Service (CCaaS)

Revolutionizing Customer Service and Contact Centers with Speech Translation.

The increasing scale of contact centers is driving demand for translation that is accurate, fast, and easy to integrate. By integrating Speechmatics APIs, agents can be equipped to understand users and provide fast support in their native language, in real time.

With integrated translation, it’s now possible for supervisors in multilingual customer service roles to review calls in languages they don’t understand, leading to improved agent training and development.

It’s also possible to use translation to enable call analytics workflows in more languages without developing bespoke machine learning models for each, helping expand your CCaaS platform to international markets.

A Future Where Language is no Longer a Barrier

As machine translation continues to improve, I anticipate we are moving rapidly towards a future where language is no longer a barrier, but a bridge. It is in this context that we introduce our new Speech Translation API that carries the potential to redefine communication in our increasingly interconnected world.  

We’re excited for you to try it out and want to hear feedback about what you build with it. I will certainly be using the real-time translation on my next trip to Taiwan!

Speechmatics Unified Translation and Transcription API now offers: 

  • Translation of speech to and from English for 34 languages, through a single real-time or batch API. 

  • Our market-leading, high-accuracy transcription, which in turn gives the best translation possible.

  • Fast, low-latency translation for real-time use cases such as media broadcasting.

  • A single unified API, which means we can now provide both transcription and translation from a single send of audio, giving you both speed and simplicity in setup.

  • Your transcription will also include metadata such as:

    • Timing of sentences for captions or user interfaces. 

    • Attribution of translations to the speaker who said them.
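As an illustration of why sentence timing and speaker attribution matter, the sketch below turns a list of translated sentences into SRT-style captions. The input shape is an assumption: the keys `start_time`, `end_time`, `speaker` and `content` are hypothetical names chosen for this example, and the sample sentences are made up, so map them to whatever the API actually returns.

```python
def to_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(sentences: list[dict]) -> str:
    """Render translated sentences as numbered SRT caption blocks.

    Each sentence dict is assumed to carry start_time, end_time,
    speaker and content (hypothetical field names).
    """
    blocks = []
    for index, sentence in enumerate(sentences, start=1):
        start = to_timestamp(sentence["start_time"])
        end = to_timestamp(sentence["end_time"])
        # Prefix each caption with the speaker label for attribution.
        text = f"[{sentence['speaker']}] {sentence['content']}"
        blocks.append(f"{index}\n{start} --> {end}\n{text}")
    return "\n\n".join(blocks) + "\n"


# Made-up sample data, for illustration only.
sample = [
    {"start_time": 0.5, "end_time": 2.75, "speaker": "S1",
     "content": "I wanted to call my grandma."},
    {"start_time": 3.0, "end_time": 4.5, "speaker": "S2",
     "content": "Then call her."},
]
print(to_srt(sample))
```

Because each translated sentence carries its own timing and speaker, captions can be produced directly from the response without re-aligning the translation against the audio.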

See our new real-time translation and transcription features in action today. View our tutorial video below on how you can get started quickly, for free and without code.

Start Translating Today!

Sign up today and test translation without code, and once you’re ready, read the documentation on how to integrate with your application and get set up quickly.