In the absence of any physical events, Speechmatics has launched a new Live Sessions series to bring our product to your screens. The first episode of the series was ‘Demo with Damir’, where Senior Sales Engineer Damir Derd gave a live Speechmatics product experience. In the session, Damir covered:
You can catch up on the content below!
We had lots of questions both before and during the session, so we thought we’d follow up with a brief Q&A. Scroll down for the answers.
Speechmatics tests word error rate (WER) against different test sets based on use cases. WER differs from language to language and use case to use case. We work very closely with our partners to help them validate the accuracy of Speechmatics’ any-context speech recognition engine with their own representative data, so the measured WER reflects their use case as closely as possible.
If you would like to see what Speechmatics’ WER is on your use case, get in touch! We’d be happy to set up a free trial for you.
Speechmatics focuses on delivering the best speech recognition on the market; by concentrating solely on speech recognition, we can deliver the best possible experience. We offer a large language portfolio (31+ languages) with global language packs to minimize language drift and user confusion.
Speechmatics offers a highly accurate, any-context speech recognition engine and gives customers the ability to deploy on-premises so they have full control of their data and always know where it is. As well as on-premises deployment, Speechmatics offers a cloud service.
A combination of the most accurate speech recognition engine, flexible deployment options, a competitive language portfolio and a versatile, any-context offering is what makes Speechmatics unique.
This is slightly dependent upon what “significant” means. Are we talking about a revenue-generating context, a time-saving context or even a life-improving context?
One of the underlying drivers behind automatic speech recognition (ASR) is the concept of bringing structure to unstructured human spoken conversation. Text is minable, interpretable and quantifiable – speech is not.
The main goal of ASR is powering that transition to enable interactions with a machine interface. Ultimately the most significant role that ASR plays is enabling companies to make use of their voice data.
For shorter utterances as in the case of an IVR, Speechmatics’ engine would rely more on acoustic modelling. With longer utterances, language modelling should help as there would be better context around these words.
Yes, we can work with TV broadcasts. In post-processing (Batch) ASR, we detect the segments that contain speech and run recognition on those. In real-time, we run recognition on everything that is streamed.
Yes, customers can run Speechmatics without access to the internet. We have customers that have really secure and locked down environments where there is no outbound internet access. Speechmatics’ Batch or Real-time ASR can be delivered as Virtual Appliance (virtual machine) or through Docker containers that can be installed in networks without internet connectivity.
This is particularly useful for customers with sensitive or private data such as media, security or government use cases.
Real-time transcription is currently offered as an on-premises solution through Virtual Appliance (virtual machine) or Docker container delivery methods. It is not yet available in our cloud offering.
Speechmatics offers a cloud service for the Batch ASR. However, we understand data security and privacy are important to our customers and partners and so our ASR can be deployed on-premises through the Virtual Appliance (virtual machine) or Docker container delivery method. Choosing one deployment method over another depends on the integration, scale, delivery to the end users and more.
Speechmatics supports 31 languages. Please take a look at the languages page on our website for the most up-to-date list.
Yes. Advanced Punctuation, which outputs question marks, exclamation marks, periods and commas (? ! . ,) in the transcription, is currently available in Global English, Spanish, German, French, Dutch, Malay, and Turkish. The other languages are capable of including periods (.) in the transcription.
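As a sketch of how the permitted marks might be restricted per request, the config below follows the V2-style JSON shape; the `punctuation_overrides` field name is an assumption to verify against the Speech API documentation.

```python
import json

# Illustrative config limiting which marks Advanced Punctuation may emit;
# the "punctuation_overrides" field name should be verified against the
# Speech API documentation for your deployment.
config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "punctuation_overrides": {
            "permitted_marks": ["?", "!", ".", ","],
        },
    },
}
print(json.dumps(config))
```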
There are three types of speaker segmentation or diarization. Speechmatics’ Speaker Diarization identifies different speakers in the audio and labels them in the output. This is effective for mono media files. Channel Diarization adds labels on the individual audio tracks in a file. This works well for stereo files. Finally, Speaker Change – which is currently part of our BETA program – has the ability to detect a change in speaker and provide a token in the transcription output.
All three are available across all of Speechmatics’ languages. In the Batch ASR, all three are available while the Real-time ASR is capable of Speaker Change only.
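The three modes above typically map to a single field in the job configuration. A minimal sketch, assuming a V2-style JSON config; the mode names used here are illustrative and should be checked against the Speech API documentation.

```python
import json

def make_config(language, diarization):
    """Build a batch transcription config selecting a diarization mode.

    The three modes mirror the types described above: "speaker",
    "channel" and "speaker_change" (names are illustrative).
    """
    allowed = {"speaker", "channel", "speaker_change"}
    if diarization not in allowed:
        raise ValueError(f"unknown diarization mode: {diarization}")
    return {
        "type": "transcription",
        "transcription_config": {
            "language": language,
            "diarization": diarization,
        },
    }

config = make_config("en", "speaker")
print(json.dumps(config))
```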
You would need to make separate API calls with the identified languages selected in each.
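Since there is no single multilingual mode, a multilingual workflow simply issues one job per identified language. A sketch of the per-language configs, using the V2-style config shape as an assumption to verify against the docs:

```python
def configs_for_languages(languages):
    # One transcription job -- and hence one API call -- per language.
    return [
        {
            "type": "transcription",
            "transcription_config": {"language": lang},
        }
        for lang in languages
    ]

# e.g. the same file submitted once for English and once for Spanish
jobs = configs_for_languages(["en", "es"])
```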
We are close to having feature parity for all languages that we offer. Advanced Punctuation is being added to more languages as they are rebuilt.
When you submit a media file (audio or video), the additional vocabulary is passed through a configuration object alongside other features. Because it is passed through for every audio/video file that is submitted, the word list can be changed between submissions. You can read more about how to use Custom Dictionary in the “API How-To Guide”.
Please note that Custom Dictionary is available in our V2 Cloud offering. If you’d like access, get in touch!
You can upload multiple words at once — up to 1,000 words and phrases can be passed to the Speechmatics engine. The word list is passed through per media file you submit for transcription.
It’s accessed and used through the API. Please look at the relevant Speech API documentation for the deployment you are using (V2 Cloud Service, Virtual Appliance and Docker Containers). If you are using the V1 Cloud service (https://api.speechmatics.com or https://app.speechmatics.com) then Custom Dictionary is not available.
To get you started, you can find an example of how to construct and pass through the Custom Dictionary using the Cloud API. Look at “API How-To Guide” in the documentation.
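As a hedged sketch of what constructing that config might look like: the `additional_vocab` field and the `sounds_like` key below follow the documented V2 shape, but treat the exact schema as something to verify against the “API How-To Guide”.

```python
import json

MAX_ENTRIES = 1000  # up to 1,000 words and phrases per submitted file

def add_custom_dictionary(config, words):
    """Attach a Custom Dictionary word list to a transcription config."""
    if len(words) > MAX_ENTRIES:
        raise ValueError(f"custom dictionary limited to {MAX_ENTRIES} entries")
    config.setdefault("transcription_config", {})["additional_vocab"] = words
    return config

config = {"type": "transcription", "transcription_config": {"language": "en"}}
config = add_custom_dictionary(config, [
    "Speechmatics",                                        # plain entry
    {"content": "gnocchi", "sounds_like": ["nyohki"]},     # with pronunciation hint
])
print(json.dumps(config, indent=2))
```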
No, Speaker Diarization is only available in the Batch ASR. The Real-time engine is capable of sending a token in the transcription output to detect a change in speaker, however, this functionality doesn’t add speaker labels like Speaker Diarization.
The Speaker Change option in the demo application could be enabled to indicate if there was a speaker change. It is used in scenarios when there is more than one person speaking. A token is sent in the transcription to identify when the change occurred. Depending on your use case you could, for example, do a line break when you see this token to make the transcript more readable instead of having a big block of text.
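For example, a post-processing step that turns the speaker-change token into line breaks might look like the sketch below; the token string is a placeholder, as the real marker is defined by the Real-time API.

```python
def format_with_breaks(tokens, change_token="[SPEAKER_CHANGE]"):
    """Insert a line break wherever the speaker-change token appears.

    `tokens` is a flat list of transcript words; the token name here is
    illustrative only.
    """
    lines, current = [], []
    for tok in tokens:
        if tok == change_token:
            if current:
                lines.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)

text = format_with_breaks(
    ["hello", "there", "[SPEAKER_CHANGE]", "hi", "how", "are", "you"])
```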
Yes, we support multiple speakers on a single audio track/channel. With Speaker Diarization the transcript will have a label to identify a speaker on a per word basis. We also have Channel Diarization if you have multi-channel audio files where the engine can label the individual tracks. If you have one speaker per channel, then you can customize the labels if you wish.
Yes. We provide metadata that includes the start time and end time for each word that is successfully transcribed.
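As an illustration, given a hypothetical slice of the JSON output with per-word timings, extracting (word, start, end) triples is straightforward; the exact field names below are assumptions to check against the Speech API documentation.

```python
# Hypothetical slice of a transcription result: per-word start/end times
# in seconds (the real JSON shape is described in the Speech API docs).
results = [
    {"alternatives": [{"content": "hello"}], "start_time": 0.10, "end_time": 0.48},
    {"alternatives": [{"content": "world"}], "start_time": 0.52, "end_time": 0.95},
]

def word_timings(results):
    """Flatten results into (word, start_time, end_time) triples."""
    return [
        (r["alternatives"][0]["content"], r["start_time"], r["end_time"])
        for r in results
    ]

timings = word_timings(results)
```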
No, our valued partners provide this functionality on top of Speechmatics’ transcription output.
No, we do not currently offer language detection. When submitting audio to be transcribed, a current prerequisite is to specify the language to be used for recognition.
If you are running this on-premises through the Virtual Appliance or Docker container deployment method, you will need 1 vCPU for each concurrent transcription. The solution is very scalable. Please see Speechmatics’ product sheets for more information. For a more detailed conversation, please get in touch.
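That sizing rule makes capacity planning a simple multiplication; the helper below is a sketch of the arithmetic only, not an official sizing tool.

```python
def vcpus_needed(concurrent_transcriptions, vcpus_per_stream=1):
    """Rough on-premises sizing: 1 vCPU per concurrent transcription."""
    return concurrent_transcriptions * vcpus_per_stream

# e.g. a deployment expected to handle 16 simultaneous streams
required = vcpus_needed(16)
```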
Damir Derd, Speechmatics
Sign up for our newsletter!