
To stay at the summit of the speech-to-text mountain, we recognize that not everyone will immediately know how to use our award-winning Autonomous Speech Recognition (ASR) engine.
Moreover, our mission to understand every voice centers on accessibility in media – open and closed captions, for example, are becoming less of an accessory and more of a necessity. So, naturally, we want to keep our highly advanced AI accessible and easy to use.
In pursuit of that, we've created a helpful guide with five easy steps for getting started with Speechmatics.
Firstly, you will need to choose your deployment option. You have four choices: virtual appliance, containers, SaaS, and hybrid. Here's a short summary of each:

**Virtual Appliance** – A pre-configured virtual machine capable of Real-Time or Batch processing, deployed directly in your on-premises environment.

**Containers** – Our Docker containers enable you to build scalable transcription services within your own infrastructure, in Real-Time or Batch processing.

**SaaS** – We deliver all the benefits of Speechmatics ASR without the complexity of deploying it within your team and environment. Choose public (hosted by Speechmatics) or your own cloud.

**Hybrid** – Hybrid deployment suits those with mixed data requirements that call for both cloud and on-premises processing.
First choice down, four to go.
Next on the guide is to choose your offering. In this instance, we provide you with two options: Batch and Real-Time.
First, there's Batch, where you transcribe pre-recorded media files at your convenience using our ASR. You can schedule a transcription at a time that suits you.
With the Real-Time offering, you get your speech-to-text as the audio arrives, with results available instantly. As a result, you can gather actionable data as soon as you need it. If you're worried that accuracy would be compromised, fear not; our proprietary technology delivers best-in-class accuracy even at low latencies - proven in recent research against our competitors.
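To make the Batch flow concrete, here is a minimal sketch of submitting a pre-recorded file over the REST API. The endpoint URL, header, and field names (`data_file`, `config`) follow the public v2 jobs API as we understand it, but treat them as assumptions and check the API reference for your account.

```python
# Sketch: submit a media file for Batch transcription.
# API_URL and the multipart field names are assumptions based on the
# public v2 jobs API; verify against the current API reference.
import json

API_URL = "https://asr.api.speechmatics.com/v2/jobs"  # assumed endpoint
API_KEY = "YOUR_API_KEY"  # placeholder credential

def build_job_config(language="en"):
    """Minimal transcription config for a batch job."""
    return {
        "type": "transcription",
        "transcription_config": {"language": language},
    }

def submit_job(audio_path, language="en"):
    # requests is a third-party dependency; any HTTP client works.
    import requests
    with open(audio_path, "rb") as audio:
        response = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={
                "data_file": audio,
                "config": (None, json.dumps(build_job_config(language))),
            },
        )
    response.raise_for_status()
    return response.json()["id"]  # job id, used to poll for the transcript
```

Once the job is accepted, you poll for completion and fetch the transcript with the returned job id.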
At this point, your options open some more. Our features range from channel diarization to flexible endpointing. Here's a complete list:
Entity Formatting
Notifications
Speaker Diarization
Partials
Channel Diarization
Transcript Finalization
Custom Dictionary and Sounds Feature
Flexible Endpointing
Speaker Change
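Several of the features above are switched on through the job configuration. As an illustration, here is a hypothetical config enabling speaker diarization and a custom dictionary; the field names (`diarization`, `additional_vocab`, `sounds_like`) reflect the public batch API, but verify them against the documentation before relying on them.

```python
# Illustrative job config enabling two of the features listed above.
# Field names are assumptions based on the public batch API.
def feature_config():
    return {
        "type": "transcription",
        "transcription_config": {
            "language": "en",
            "diarization": "speaker",   # label each distinct speaker
            "additional_vocab": [       # custom dictionary entries
                {"content": "Speechmatics"},
                {"content": "ASR", "sounds_like": ["ay ess are"]},
            ],
        },
    }
```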
Speech-to-text is not a solved problem yet, so we're always looking to innovate our ASR – expect to see more features in the future.
Next up, you will need to choose your format. Happily, we support all major audio and video formats, reducing the time users spend preparing files. After all, we're all about speech-to-text accessibility.
For clarity, our default output format is JSON. Users also have the option of SRT or TXT as an alternative.
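One reason JSON is the default is that it is trivial to post-process into the other formats. As a toy sketch, this converts a timed span (start and end in seconds, plus its text) into an SRT cue; the function names and inputs here are illustrative, not the exact JSON schema.

```python
# Toy sketch: turn a timed transcript span into an SRT subtitle cue.
def to_srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def srt_cue(index, start, end, text):
    """Build one numbered SRT cue block."""
    return f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n"
```

For example, `srt_cue(1, 0.0, 3.5, "Hello")` yields a cue running from `00:00:00,000` to `00:00:03,500`.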
Despite competition from the most prominent names in speech-to-text and the surrounding AI industry (Google, Microsoft, etc.), we are proud of the 50 languages our ASR covers.
This includes Arabic, Bashkir, Basque, Belarusian, Bulgarian, Cantonese, Catalan, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Interlingua, Italian, Japanese, Korean, Latvian, Lithuanian, Malay, Mandarin (Traditional & Simplified), Marathi, Mongolian, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Uyghur, Vietnamese and Welsh.
There you have it. This guide was designed to be simple, so with that, here's a short summary:
Choose your deployment options: virtual appliance, containers, SaaS, and hybrid.
Choose your offering: batch or real-time.
Choose your features: entity formatting, notifications, speaker diarization, partials, channel diarization, transcript finalization, custom dictionary, flexible endpointing, and speaker change.
Choose your format: all major audio and video formats.
Choose your language: 50 languages at your disposal.
Paul Gordon, Product Marketing Manager, Speechmatics