Jul 7, 2020 | Read time 3 min

Q&A | Live Sessions: Demo with Damir

Read the Q&A from our new Live Sessions series to learn more about Speechmatics' any-context speech recognition engine and how it can add value to your business.

In the absence of physical events, Speechmatics has launched a new Live Sessions series to bring our product to your screens. The first episode of the series was ‘Demo with Damir’, in which Senior Sales Engineer Damir Derd gave a live demonstration of the Speechmatics product. In the session, Damir covered:

  • The Speechmatics real-time demo environment

  • A high-level look at the Custom Dictionary feature

  • The Speechmatics Advanced Punctuation feature

  • The Speechmatics Partials feature

While word error rate (WER) doesn't fully encapsulate accuracy, what is Speechmatics’ WER?

Speechmatics measures WER against different test sets based on use cases. WER differs from language to language and use case to use case. We work closely with our partners to help them validate the accuracy of Speechmatics’ any-context speech recognition engine with their own representative data, so the WER is as close to their use case as possible.

If you would like to see what Speechmatics’ WER is on your use case, get in touch! We’d be happy to set up a free trial for you.

What is unique in your service/product?

Speechmatics focuses solely on delivering the best speech recognition on the market, and that focus is what delivers the best experience. We offer a large language portfolio (31+ languages) with global language packs to minimize language drift and user confusion.

Speechmatics offers a highly accurate, any-context speech recognition engine and gives customers the ability to deploy on-premises, so they have full control of their data and always know where it is. As well as on-premises deployment, Speechmatics offers a cloud service.

A combination of the most accurate speech recognition engine, flexible deployment options, a competitive language portfolio and a versatile, any-context offering is what makes Speechmatics unique.

Use cases
What are the most significant applications of speech-to-text?

This is slightly dependent upon what "significant" means. Are we talking about a revenue-generating context, a time-saving context or even a life-improving context?

One of the underlying drivers behind automatic speech recognition (ASR) is the concept of bringing structure to unstructured human spoken conversation. Text is minable, interpretable and quantifiable – speech is not.

The main goal of ASR is powering that transition to enable interactions with a machine interface. Ultimately the most significant role that ASR plays is enabling companies to make use of their voice data.

The main use case is for long speech – how would Speechmatics cope with use in IVR, where the speech to be transcribed is relatively short per interaction?

For shorter utterances as in the case of an IVR, Speechmatics’ engine would rely more on acoustic modelling. With longer utterances, language modelling should help as there would be better context around these words.

Can you work on TV broadcasts with commercials, songs?

Yes, we can work with TV broadcasts. In the post-processing (Batch) ASR, we detect segments where there is speech and we do recognition on those. In real-time we do recognition on everything that is streamed.

Is it possible to run Speechmatics without access to the internet?

Yes, customers can run Speechmatics without access to the internet. We have customers with highly secure, locked-down environments where there is no outbound internet access. Speechmatics’ Batch or Real-time ASR can be delivered as a Virtual Appliance (virtual machine) or through Docker containers that can be installed in networks without internet connectivity.

This is particularly useful for customers with sensitive or private data such as media, security or government use cases.

Is the real-time transcription service offered in your cloud offering or only as an appliance?

Real-time transcription is currently offered as an on-premises solution through Virtual Appliance (virtual machine) or Docker container delivery methods. It is not yet available in our cloud offering.

How can I deploy Speechmatics?

Speechmatics offers a cloud service for the Batch ASR. However, we understand data security and privacy are important to our customers and partners and so our ASR can be deployed on-premises through the Virtual Appliance (virtual machine) or Docker container delivery method. Choosing one deployment method over another depends on the integration, scale, delivery to the end users and more.

What languages do you support?

Speechmatics supports 31 languages. Please take a look at the languages page on our website for the most up-to-date list.

Are there differences in the punctuation feature between different languages?

Yes. Advanced Punctuation, which outputs question marks, exclamation marks, periods and commas (? ! . ,) in the transcription, is currently available in Global English, Spanish, German, French, Dutch, Malay, and Turkish. The other languages are capable of including periods (.) in the transcription.

What are the different kinds of speaker segmentation and how do they differ between languages?

There are three types of speaker segmentation, or diarization. Speechmatics’ Speaker Diarization identifies different speakers in the audio and labels them in the output; this is effective for mono media files. Channel Diarization adds labels to the individual audio tracks in a file; this works well for stereo files. Finally, Speaker Change – currently part of our BETA program – detects a change in speaker and provides a token in the transcription output.

All three are available across all of Speechmatics’ languages. The Batch ASR supports all three, while the Real-time ASR supports Speaker Change only.
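The three modes above can be sketched as batch-job configuration objects. This is a minimal sketch assuming the key names of Speechmatics’ V2 API (“diarization”, “channel_diarization_labels”); verify the exact names in the Speech API documentation for your deployment.

```python
# Sketch of the three diarization modes as job configuration objects.
# Key names are assumptions based on Speechmatics' V2 API conventions;
# check the Speech API docs for your deployment before relying on them.

def diarization_config(mode, labels=None):
    """Build a transcription config for the chosen diarization mode."""
    config = {"language": "en", "diarization": mode}
    if mode == "channel" and labels:
        # Optional custom labels, one per audio channel (e.g. a stereo file).
        config["channel_diarization_labels"] = labels
    return config

speaker = diarization_config("speaker")                       # mono files
channel = diarization_config("channel", ["agent", "caller"])  # stereo files
change = diarization_config("speaker_change")                 # BETA feature
```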

How would I handle multilingual captioning?

You would need to make separate API calls with the identified languages selected in each.
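As a sketch of that workflow, the snippet below builds one job configuration per identified language. The field names follow the pattern of Speechmatics’ V2 batch API but should be treated as assumptions and checked against the Speech API documentation.

```python
# Sketch: multilingual captioning means one transcription job per language.
# Field names ("type", "transcription_config", "language") are assumed from
# Speechmatics' V2 API conventions; confirm them in the Speech API docs.

def build_configs(languages):
    """Return one job configuration per identified language code."""
    return [
        {"type": "transcription",
         "transcription_config": {"language": lang}}
        for lang in languages
    ]

# One call per language identified in the media file:
configs = build_configs(["en", "fr"])
```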

When will all your amazing features be available in other languages?

We are close to having feature parity for all languages that we offer. Advanced Punctuation is being added to more languages as they are rebuilt.

Custom Dictionary
How can I use the Custom Dictionary feature for the upload of audio and video files?

When you submit a media file (audio or video), the additional vocabulary is passed in a configuration object alongside other feature settings. Because it is passed with every audio/video file submitted, the word list can be changed per file. You can read more about how to use Custom Dictionary in the “API How-To Guide”.
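A minimal sketch of such a configuration object is shown below. The “additional_vocab” and “sounds_like” fields follow the shape of Speechmatics’ V2 API, but treat the exact names as assumptions and confirm them against the “API How-To Guide” for your deployment.

```python
# Sketch: a job configuration carrying a Custom Dictionary word list.
# Field names ("additional_vocab", "sounds_like") are assumptions based on
# Speechmatics' V2 API; verify them in the "API How-To Guide".

job_config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "additional_vocab": [
            # A plain vocabulary entry:
            {"content": "Speechmatics"},
            # An entry with pronunciation hints for a hard-to-spell term:
            {"content": "gnocchi", "sounds_like": ["nyohki", "nokey"]},
        ],
    },
}
```

Because the configuration travels with each job, submitting a different word list per file is just a matter of changing the “additional_vocab” entries.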

Please note that Custom Dictionary is available in our V2 Cloud offering. If you’d like access, get in touch!

How many words can be added to the Custom Dictionary?

Up to 1,000 words and phrases can be passed through as part of the configuration object submitted with each media file.

Can you load multiple words at once or is it only 1 at a time?

You can upload multiple words at once. You can pass up to 1,000 words and phrases to the Speechmatics engine. The word list is passed through per media file you submit for transcription.

How do we access the Custom Dictionary function?

It’s accessed and used through the API. Please look at the relevant Speech API documentation for the deployment you are using (V2 Cloud Service, Virtual Appliance or Docker Containers). If you are using the V1 Cloud service, Custom Dictionary is not available.

To get you started, you can find an example of how to construct and pass through the Custom Dictionary using the Cloud API. Look at “API How-To Guide” in the documentation.

Speaker Diarization
Can you do Speaker Diarization for real-time ASR?

No, Speaker Diarization is only available in the Batch ASR. The Real-time engine is capable of sending a token in the transcription output to detect a change in speaker, however, this functionality doesn’t add speaker labels like Speaker Diarization.

What is the "Speaker Change” option used for?

The Speaker Change option in the demo application can be enabled to indicate when a speaker change occurs. It is used in scenarios where more than one person is speaking. A token is sent in the transcription to identify when the change occurred. Depending on your use case you could, for example, insert a line break when you see this token, making the transcript more readable instead of one big block of text.
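The line-break idea above can be sketched as follows. The item shapes and the token name “speaker_change” are illustrative assumptions; the real output format is described in the Speech API documentation.

```python
# Sketch: turning a word stream containing speaker-change tokens into a
# line-broken transcript. The item shapes and the "speaker_change" token
# name are illustrative; see the Speech API docs for the real format.

def format_transcript(items):
    """Join words into lines, starting a new line at each speaker change."""
    lines, current = [], []
    for item in items:
        if item["type"] == "speaker_change":
            if current:
                lines.append(" ".join(current))
                current = []
        else:
            current.append(item["content"])
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)

items = [
    {"type": "word", "content": "Hello"},
    {"type": "word", "content": "there."},
    {"type": "speaker_change", "content": ""},
    {"type": "word", "content": "Hi."},
]
# format_transcript(items) -> "Hello there.\nHi."
```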

Do you support multi-speaker speech recognition and diarization?

Yes, we support multiple speakers on a single audio track/channel. With Speaker Diarization the transcript will have a label to identify a speaker on a per word basis. We also have Channel Diarization if you have multi-channel audio files where the engine can label the individual tracks. If you have one speaker per channel, then you can customize the labels if you wish.

Time stamping
Can we do time stamping as per client requirement?

Yes. We provide metadata that includes the start time and end time for each word that is successfully transcribed.
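As a sketch of consuming that metadata, the snippet below pulls per-word timings out of a result list. The JSON shape mirrors the general pattern of Speechmatics’ V2 output, but treat the exact field names as assumptions and check the output format documentation.

```python
# Sketch: reading per-word start/end times from a transcription result.
# The field names ("alternatives", "start_time", "end_time") are assumed
# from Speechmatics' V2 output conventions; verify in the output docs.

def word_timings(results):
    """Return (word, start_time, end_time) tuples for each transcribed word."""
    return [
        (r["alternatives"][0]["content"], r["start_time"], r["end_time"])
        for r in results
        if r.get("type") == "word"
    ]

results = [
    {"type": "word", "start_time": 0.1, "end_time": 0.4,
     "alternatives": [{"content": "Hello"}]},
    {"type": "punctuation", "start_time": 0.4, "end_time": 0.4,
     "alternatives": [{"content": "."}]},
]
# word_timings(results) -> [("Hello", 0.1, 0.4)]
```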

Has your team ever built NLP models on top of transcription like sentiment analysis?

No, our valued partners provide this functionality on top of Speechmatics’ transcription output.

Does Speechmatics automatically detect the language when there is more than one language in the speech?

No, we currently do not have language detection. When submitting audio to be transcribed, a current prerequisite is to define the language to be used for recognition.

Is there a significant CPU overhead attributed to the Speechmatics API?

If you are running this on-premises through the Virtual Appliance or Docker container deployment method, you will need 1 vCPU for each concurrent transcription. The solution is very scalable. Please see Speechmatics’ product sheets for more information, or get in touch for a more detailed conversation.

Damir Derd, Speechmatics