Developer Resources

Discover the answers to your questions and learn more about Speechmatics.

Find out how and why our cutting-edge speech-to-text technology works for you.
We've made it easy to learn how to deploy, operate and manage our technology.

Our documentation gives you interactive code examples for interfacing with the engine and use of all available features. It includes a full API reference to help you integrate our ASR engine into your own applications.

With it, you can:

  • Search keywords across the entire documentation site.

  • Download PDFs to use offline.

  • Access historical and current versions of the documentation.

  • Get notified of any documentation changes or new versions.

  • Contact our support team directly.

Every question you want to ask, answered

Frequently Asked Questions


What audio files do you support?

Speechmatics supports a number of audio and video file formats, including:

aac, amr, flac, m4a, mp3, mp4, mpeg, ogg, wav.

Other formats may work but should be validated under user acceptance testing. For the best possible results, avoid formats that use lossy compression where you can. See our product sheets for a full list of features, including supported file input formats.


What languages do you support?

As of September 2022, we support: Arabic, Bashkir, Basque, Belarusian, Bulgarian, Catalan, Cantonese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Interlingua, Italian, Japanese, Korean, Latvian, Lithuanian, Malay, Mandarin (Traditional and Simplified), Marathi, Mongolian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovakian, Slovenian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Uyghur, Vietnamese and Welsh.

As well as operating in these languages, we also cater to additional accents and dialects.


Do I have to submit a job with a supported language specified?

Yes. All audio/video files submitted to the Speechmatics ASR engine require a language code. A list of all the language codes can be found in the customer documentation.
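As a minimal sketch, this is what a batch job configuration with the required language code looks like. The field names follow the publicly documented Speechmatics v2 batch API; verify them against the current API reference before use.

```python
import json

def build_job_config(language_code: str) -> str:
    """Return the JSON `config` part for a batch transcription job.

    A job submitted without a language code is rejected, so the
    "language" field is always required.
    """
    config = {
        "type": "transcription",
        "transcription_config": {"language": language_code},
    }
    return json.dumps(config)

# Example: an English-language job.
print(build_job_config("en"))
```

This JSON is sent alongside the media file when creating the job; only the language code varies here, everything else is the minimal required structure.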


What sampling rate do I need on my audio files?

All Speechmatics language packs are optimized for sample rates of 8kHz and 16kHz. If you’re unsure, or your audio files are varied and have mixed sample rates, please submit them as is. Our Speechmatics engine will automatically optimize them for the best possible results.


Do you support multi-channel transcription?

Yes. Our Batch ASR solutions support audio/video files with up to 6 channels. When enabled, each channel in the submitted file is processed separately, and the results are combined into a single output.

With channel diarization we can apply custom channel labels, for example, Agent and Caller. We can also eliminate cross-talk and deliver timing information in a single output. With one speaker per channel, speaker identification is unambiguous.
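A sketch of a transcription config enabling channel diarization with custom labels. The field names (`diarization`, `channel_diarization_labels`) follow the publicly documented Speechmatics batch API; confirm them against the current API reference.

```python
import json

def channel_diarization_config(labels):
    """Build a batch config that labels each channel, one label per channel."""
    return {
        "type": "transcription",
        "transcription_config": {
            "language": "en",
            "diarization": "channel",
            # e.g. ["Agent", "Caller"] for a two-channel call recording
            "channel_diarization_labels": list(labels),
        },
    }

print(json.dumps(channel_diarization_config(["Agent", "Caller"]), indent=2))
```

The labels then appear in the transcript output in place of generic channel names, which is useful for contact-center recordings.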


What impacts the accuracy of the transcription?

Our engine is capable of understanding all manner of audio. But like all ASR, it’s at its best when the speech is clear and free from background noise. Cross-talk will also have an impact on accuracy, as will the speaker being some distance from the microphone. Multiple languages being spoken in the same audio file will also have an impact.


How can I improve transcription accuracy?

We’ve worked hard to make our latest engine as accurate as it can be, regardless of the audio quality. But speech recognition is always at its best when the quality of the audio presented to the system is high. To get the best out of our system, try to follow these simple steps.

First, speak clearly. You don’t need to slow down your voice, but ideally, it should flow over the microphone. If possible, think about or even rehearse what you want to say. If you have time, record a test and play it back to check everything sounds okay.

When it comes to the environment you’re speaking in, try to record in as quiet a location as you can. Try to avoid multiple people speaking at the same time. If possible, minimize the reverberation of sound bouncing off the walls.

Use a good quality microphone, such as a USB noise-cancelling (or directional) microphone. A mounted microphone is best. Recording audio at 16kHz or greater will give the best results. You do not need to compress the audio, but if you do, avoid over-compressing: use bitrates above 96 kbps for AAC or 128 kbps for MP3. If available, use two channels, and do not apply any transcoding before you submit your file to the Speechmatics ASR.


Do you provide an on-premises solution?

Our flexible approach to ASR allows you to deploy our on-premises solutions for batch and real-time use cases. This will provide peace of mind regarding data security and allows you to meet any data compliance requirements.

Check out Our Technology page to see the various deployment options available.



What forms of diarization do you support?

We support both speaker diarization and channel diarization to suit different use cases. The former detects and labels different speakers within the same channel, the latter detects and labels different speakers on up to six streams or channels.


How can I improve the accuracy of the alignment output?

Any words in the text that are not present in the audio need to be removed or enclosed in angle brackets (< >); otherwise, accuracy will be reduced and the alignment job may take significantly longer. Alignment is only available on our SaaS.
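As an illustration, a small helper (hypothetical, not part of the Speechmatics API) that wraps known non-spoken tokens in angle brackets before an alignment job is submitted:

```python
def mark_unspoken(text: str, unspoken: set) -> str:
    """Wrap tokens that do not occur in the audio in < >,
    so the aligner knows to skip them. `unspoken` is the
    caller-supplied set of non-spoken tokens."""
    out = []
    for token in text.split():
        out.append(f"<{token}>" if token in unspoken else token)
    return " ".join(out)

# Stage directions in a script are a typical example of text
# that appears on the page but is never spoken aloud.
print(mark_unspoken("ladies and gentlemen applause welcome", {"applause"}))
```

This is only a convenience for simple whitespace-separated text; punctuation handling would need more care in practice.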


How accurate are you?

Actions speak louder than words. We like our customers to try audio files representative of their use case. This provides a true measure of accuracy and what it means for you. Why not try transcribing a media file or live speech using our free demo and see for yourself?



What data do you collect and store when your service is used?

Our on-premises solutions won’t collect or store any video/audio files or transcribed output, so you can have full control over your data. Users of our public cloud ASR can refer to our Privacy Policy.



What’s the difference between Batch and Real-time ASR?

Our Batch ASR allows you to transcribe pre-recorded media files whenever you want. Real-time ASR lets you transcribe live, to gather actionable data instantly. With Batch you can schedule a transcription at a time that suits you and optimize your available resources.


How long does it take to transcribe a file?

We refer to the processing time taken to transcribe a media file compared to the length of the file as the real time factor (RTF). Our batch ASR will transcribe a file with an RTF of 0.5. So, for example, a 10-minute file will be transcribed within 5 minutes. Files less than 5 minutes long may vary in RTF.
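The arithmetic above can be expressed directly: processing time is the file duration multiplied by the RTF.

```python
def estimated_processing_seconds(audio_seconds: float, rtf: float = 0.5) -> float:
    """Estimate transcription time from the real time factor (RTF).

    With the stated batch RTF of 0.5, a 10-minute (600 s) file
    takes about 5 minutes (300 s). Files under 5 minutes may
    deviate from this estimate.
    """
    return audio_seconds * rtf

print(estimated_processing_seconds(600))  # 10-minute file -> 300.0 seconds
```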


Are there any limitations to the number of jobs I can submit on your Cloud?

We ask that customers limit the rate of files they submit to a maximum of 2 jobs per second, with a maximum of 100 jobs in progress at any one time. We also ask that customers limit the rate of polling for the status of submitted jobs to a maximum of 20 queries per second (across all jobs).

Speechmatics reserves the right to change the rate limits at any time in order to ensure continuity of service for all customers of the Cloud.
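A client-side guard for the published limits (2 submissions per second, 100 jobs in progress) might look like the sketch below. This is purely illustrative; the service enforces its own limits regardless of what the client does.

```python
from collections import deque

class SubmissionGuard:
    """Track recent submissions and in-flight jobs to stay within limits."""

    def __init__(self, max_per_second: int = 2, max_in_progress: int = 100):
        self.max_per_second = max_per_second
        self.max_in_progress = max_in_progress
        self.recent = deque()   # timestamps of submissions in the last second
        self.in_progress = 0

    def can_submit(self, now: float) -> bool:
        # Drop timestamps older than one second, then check both limits.
        while self.recent and now - self.recent[0] >= 1.0:
            self.recent.popleft()
        return (len(self.recent) < self.max_per_second
                and self.in_progress < self.max_in_progress)

    def record_submit(self, now: float) -> None:
        self.recent.append(now)
        self.in_progress += 1

    def record_done(self) -> None:
        self.in_progress -= 1
```

In use, a submitter would call `can_submit(time.monotonic())` before each job, and `record_done()` when polling reports the job finished; polling itself should similarly stay under 20 queries per second.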


What operating points do Speechmatics offer?

We offer two operating points. The Standard Model is comparable to our previous models in both transcription accuracy and transcription speed. By default, the Speechmatics engine will choose the "standard" operating point.

Our Enhanced Model is larger in size and can produce more accurate transcriptions. To optimize the performance for transcription processing, we recommend running transcription workloads on AVX512_VNNI (Advanced Vector Extensions 512 Vector Neural Network Instructions) compatible CPUs. If you'd like to use the Enhanced Model, please contact your Account Manager and we will enable it on your account.
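Once enabled on an account, selecting the Enhanced Model is a single config field. The `operating_point` field name follows the publicly documented Speechmatics API ("standard" is the default when it is omitted); verify against the current API reference.

```python
import json

# Batch config selecting the Enhanced operating point.
config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "operating_point": "enhanced",  # "standard" is the default
    },
}

print(json.dumps(config, indent=2))
```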


What is Autonomous Speech Recognition?

Speechmatics’ innovation is leading the industry from Automatic Speech Recognition towards Autonomous Speech Recognition (ASR).

Fuelled by the introduction of self-supervised learning for our training, Autonomous Speech Recognition delivers a step-change in accuracy and inclusion by leveraging a wide range of voices using the scale and diversity of the internet.


What is self-supervised learning?

In the past, supervised learning from labeled data was the only way to ensure the levels of accuracy needed to feel confident text was truly representative of the words spoken. Now, using self-supervised learning, our models can autonomously learn to spot salient patterns in unlabeled data.

This means we can learn by using much wider datasets from a huge variety of sources that include both labeled and unlabeled data.

Before we moved to self-supervised learning we were training on around 30,000 hours of audio. Now, that number is closer to 1,100,000 hours.

Interested in working at Speechmatics? Find out more on our Careers page.