Discover the answers to your questions and learn more about Speechmatics.
Our documentation gives you interactive code examples for interfacing with the engine and using all available features. It includes a full API reference to help you integrate our ASR engine into your own applications.
With it, you can:
Search keywords across the entire documentation site.
Download PDFs to use offline.
Access the historic and latest versions of the documentation.
Get notified of any documentation changes or new versions.
Contact our support team directly.
Every question you want to ask, answered
Speechmatics supports a number of audio and video file formats, including:
aac, amr, flac, m4a, mp3, mp4, mpeg, ogg, wav.
Other formats may work but should be validated under user acceptance testing. For the best possible results, avoid file formats that use compression. See our product sheets for a full list of features, including file input formats.
As of December 2021, we support: Arabic, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Italian, Japanese, Korean, Latvian, Lithuanian, Malay, Mandarin (Traditional and Simplified), Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish and Turkish.
As well as operating in these languages, we also cater to additional accents and dialects.
Yes. All audio/video files submitted to the Speechmatics ASR require a language code to be supplied. A list of all the language codes can be found in the related customer documentation.
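As a minimal sketch of how the language code is supplied, the snippet below builds the JSON job configuration for a batch transcription request. The field names follow the public Speechmatics transcription config, but check the API reference for your version; the helper function name is our own.

```python
import json

def make_job_config(language_code: str) -> dict:
    """Build a batch transcription job config.

    The "language" field carries the required language code
    (e.g. "en" for English, "de" for German).
    """
    return {
        "type": "transcription",
        "transcription_config": {"language": language_code},
    }

config = make_job_config("en")
print(json.dumps(config))
```

The resulting JSON is what you would send in the job-creation request alongside your media file.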
All Speechmatics language packs are optimized for sample rates of 8kHz and 16kHz. If you’re unsure, or your audio files are varied and have mixed sample rates, please submit them as is. Our Speechmatics engine will automatically optimize them for the best possible results.
Yes. Our Batch ASR solutions support audio/video files with up to 6 channels. When enabled, each channel of the submitted audio/video file is processed separately, then combined into a single output.
With channel diarization we can label custom channels, for example, Agents and Callers. We can also eliminate cross-talk and deliver a single output for timing information. With one speaker per channel, we can deliver perfect speaker identification.
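To illustrate the custom channel labels described above, here is a sketch of a transcription config enabling channel diarization. The `diarization` and `channel_diarization_labels` fields follow the public Speechmatics config schema, but treat the exact names as something to confirm against the API reference for your deployment.

```python
# Illustrative config: channel diarization with custom labels,
# so each channel's words are tagged "Agent" or "Caller" in the output.
config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "diarization": "channel",
        "channel_diarization_labels": ["Agent", "Caller"],
    },
}

# One label per channel, up to the 6-channel limit.
labels = config["transcription_config"]["channel_diarization_labels"]
print(f"{len(labels)} channels labelled")
```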
Our engine is capable of understanding all manner of audio. But like all ASR, it’s at its best when the speech is clear and free from background noise. Cross-talk will also have an impact on accuracy, as will the speaker being some distance from the microphone. Multiple languages being spoken in the same audio file will also have an impact.
We’ve worked hard to make our latest engine as accurate as it can be, regardless of the audio quality. But speech recognition is always at its best when the quality of the audio presented to the system is high. To get the best out of our system, try to follow these simple steps.
First, speak clearly. You don’t need to slow down your voice, but ideally, it should flow over the microphone. If possible, think about or even rehearse what you want to say. If you have time, record a test and play it back to check everything sounds okay.
When it comes to the environment you’re speaking in, try to record in as quiet a location as you can. Try to avoid multiple people speaking at the same time. If possible, minimize the reverberation of sound bouncing off the walls.
Use a good quality microphone, such as a USB noise-cancelling (or directional) microphone. A mounted microphone is best. Recording audio at 16kHz or greater will give the best results. You do not need to compress the audio, but if you do, avoid over-compressing: use a bitrate above 96 kbps for AAC or 128 kbps for MP3. If available, use two channels, and do not apply any transcoding before you submit your file to the Speechmatics ASR.
Our flexible approach to ASR allows you to deploy our on-premises solutions for batch and real-time use cases. This will provide peace of mind regarding data security and allows you to meet any data compliance requirements.
Check out our Technology page to see the various deployment options available.
We support both speaker diarization and channel diarization to suit different use cases. The former detects and labels different speakers within the same channel, the latter detects and labels different speakers on up to six streams or channels.
Any words in the text that do not appear in the audio provided need to be removed or surrounded by angle brackets (< and >); otherwise, accuracy will be reduced and the alignment job may take significantly longer. Alignment is only available on our SaaS.
Actions speak louder than words. We like our customers to try audio files representative of their use case. This provides a true measurement of accuracy and what it means for you. Why not try transcribing a media file or live speech using our free demo and see for yourself?
Our Batch ASR allows you to transcribe pre-recorded media files whenever you want. Real-time ASR lets you transcribe live, to gather actionable data instantly. With Batch you can schedule a transcription at a time that suits you and optimize your available resources.
We refer to the processing time taken to transcribe a media file compared to the length of the file as the real time factor (RTF). Our batch ASR will transcribe a file with an RTF of 0.5. So, for example, a 10-minute file will be transcribed within 5 minutes. Files less than 5 minutes long may vary in RTF.
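The RTF arithmetic above can be written out directly; the helper below is our own illustration of the calculation, not part of the Speechmatics API.

```python
def transcription_time(duration_minutes: float, rtf: float = 0.5) -> float:
    """Estimated processing time in minutes for a media file,
    given the real time factor (RTF). Batch ASR targets RTF 0.5,
    though files under 5 minutes may vary."""
    return duration_minutes * rtf

# A 10-minute file at RTF 0.5 takes about 5 minutes to transcribe.
print(transcription_time(10))  # 5.0
```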
We ask that customers limit the rate of files they submit to a maximum of 2 jobs per second, with a maximum of 100 jobs in progress at any one time. We also ask that customers limit the rate of polling for the status of submitted jobs to a maximum of 20 queries per second (across all jobs).
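A client can stay within these limits with a simple interval-based throttle. The sketch below is one way to do it (the class and constant names are ours); it spaces out calls so submissions stay under 2 per second and status polls under 20 per second.

```python
import time

SUBMIT_RATE = 2     # max job submissions per second
POLL_RATE = 20      # max status queries per second (across all jobs)
MAX_IN_FLIGHT = 100  # max jobs in progress at any one time

class RateLimiter:
    """Interval-based limiter: wait() blocks so that successive
    calls are spaced at least 1/rate seconds apart."""

    def __init__(self, rate: float):
        self.interval = 1.0 / rate
        self.last = 0.0

    def wait(self) -> None:
        now = time.monotonic()
        sleep_for = self.last + self.interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last = time.monotonic()

submit_limiter = RateLimiter(SUBMIT_RATE)  # call .wait() before each submit
poll_limiter = RateLimiter(POLL_RATE)      # call .wait() before each poll
```

Before each job submission or status poll, call the corresponding limiter's `wait()`; a counter of unfinished jobs checked against `MAX_IN_FLIGHT` covers the concurrency limit.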
Speechmatics reserves the right to change the rate limits at any time in order to ensure continuity of service for all customers of the Cloud.
We offer two operating points. The Standard Model is comparable to our previous models in both transcription accuracy and the speed at which media files are transcribed. By default, the Speechmatics engine will use the "standard" operating point.
Our Enhanced Model is larger in size and can produce more accurate transcriptions. To optimize the performance for transcription processing, we recommend running transcription workloads on AVX512_VNNI (Advanced Vector Extensions 512 Vector Neural Network Instructions) compatible CPUs. If you'd like to use the Enhanced Model, please contact your Account Manager and we will enable it on your account.
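As a sketch, selecting an operating point is a single field in the transcription config. The `operating_point` field name follows the public Speechmatics config; the helper function is our own illustration.

```python
def transcription_config(language: str, operating_point: str = "standard") -> dict:
    """Build a transcription config with an explicit operating point.

    "standard" is the default; "enhanced" is larger and more accurate,
    and must be enabled on your account before use.
    """
    if operating_point not in ("standard", "enhanced"):
        raise ValueError(f"unknown operating point: {operating_point}")
    return {
        "language": language,
        "operating_point": operating_point,
    }

cfg = transcription_config("en", operating_point="enhanced")
```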
Speechmatics’ innovation is leading the industry from Automatic Speech Recognition towards Autonomous Speech Recognition (ASR).
Fuelled by the introduction of self-supervised learning for our training, Autonomous Speech Recognition delivers a step-change in accuracy and inclusion by leveraging a wide range of voices using the scale and diversity of the internet.
In the past, supervised learning from labeled data was the only way to ensure the levels of accuracy needed to feel confident text was truly representative of the words spoken. Now, using self-supervised learning, our models can autonomously learn to spot salient patterns in unlabeled data.
This means we can learn by using much wider datasets from a huge variety of sources that include both labeled and unlabeled data.
Before we moved to self-supervised learning we were training on around 30,000 hours of audio. Now, that number is closer to 1,100,000 hours.
Interested in working at Speechmatics? Find out more on our Careers page.