There seem to be new technologies and new implementations of existing technology every day. Video is also growing. 78% of people watch videos online every week, and that share is only going to increase. It’s estimated that 82% of the world’s internet traffic will be video by 2022.
With these rapid advances in technology, combined with a growing trend in video, it’s no wonder that using machine learning and artificial intelligence – more specifically ASR – for video captioning is a hot topic!
Automatic speech recognition (ASR) transcribes spoken word to text. Common uses include Siri, Alexa, and YouTube auto-captions. ASR is a critical piece of our captioning process at 3Play Media.
In order to closely follow trends with captioning accuracy, and stay on top of our own process, we are constantly testing to make sure we are using the best ASR engine. Our most recent research findings are published in the 2019 State of ASR report.
For that report, we tested the most popular ASR technologies across content from the eCommerce, higher education, fitness, media and entertainment, and enterprise industries. All testing used real content, and lots of it, reflective of the most common types and volumes that we receive at 3Play Media. When evaluating ASR for captioning, there are a number of different things to consider.
Word Error Rate (WER) is widely used across the speech recognition community to judge quality. But looking at WER alone is not enough.
WER is concerned with capturing the phonetic similarity between what was spoken and what is transcribed, since that is the primary concern of ASR researchers. While WER is certainly an important component of measuring accuracy, it is clearly limited in capturing the requirements of the captioning task. At 3Play Media, we use an additional measure called Formatted Error Rate (FER) to measure accuracy.
FER is the percentage of word errors when formatting elements such as punctuation, grammar, speaker identification, non-speech elements, capitalization, and other notations are taken into account. Formatting errors are particularly common with ASR technology.
When it comes to captioning and transcription, both WER and FER are important components for evaluating accuracy.
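To make the difference between the two measures concrete, here is a minimal Python sketch. It rests on assumptions: the exact way 3Play Media computes FER is not described here, so the sketch simply treats WER as a word-level edit distance on text with case and punctuation stripped, and approximates FER as the same edit distance on the text with capitalization and punctuation left intact. The function names (`edit_distance`, `word_error_rate`, `formatted_error_rate`) are hypothetical and chosen for illustration only.

```python
import string


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def normalize(text):
    """Lowercase and strip ASCII punctuation, then split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_error_rate(reference, hypothesis):
    """WER on normalized words (assumes a non-empty reference)."""
    ref = normalize(reference)
    return edit_distance(ref, normalize(hypothesis)) / len(ref)


def formatted_error_rate(reference, hypothesis):
    """Assumed FER approximation: compare tokens with formatting intact."""
    ref = reference.split()
    return edit_distance(ref, hypothesis.split()) / len(ref)


reference = "Hello, Dr. Smith. How are you?"
hypothesis = "hello dr smith how are you"

print(f"WER: {word_error_rate(reference, hypothesis):.0%}")
print(f"FER: {formatted_error_rate(reference, hypothesis):.0%}")
```

In this made-up example every word is recognized correctly, so the WER is zero, yet five of the six tokens differ once capitalization and punctuation count. A real FER would also account for elements such as speaker identification and non-speech notations, which this sketch does not attempt.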
Deletion, insertion, etc.
We evaluated the different ASR engines based on several different factors, including the percent error, percent correct, percent substitution error, percent insertion error, and percent deletion error, the components of Word Error Rate.
These factors are critical to evaluating ASR for captioning, where leaving out, substituting, or inserting words can change the meaning entirely. For those who rely on captions as an accommodation, captions with these errors do not provide equal access.
Additionally, with the 3Play Media process, caption editors use an initial transcript captured with ASR technology. If words are missing or inserted, this can throw off the timecodes, and cause our editors a lot more work to make sure the captions are accurate.
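As a rough illustration of how substitution, deletion, and insertion errors add up to a Word Error Rate, the following Python sketch aligns a reference transcript with ASR output and counts each error type. It is a generic word-level edit-distance alignment, not the scoring tool used for the report, and the function name `wer_breakdown` and the sample sentences are invented for the example.

```python
def wer_breakdown(reference, hypothesis):
    """Count substitutions, deletions, and insertions between two transcripts.

    Rates are expressed relative to the number of reference words.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Each cell holds (total_errors, substitutions, deletions, insertions)
    # for the best alignment of ref[:i] with hyp[:j].
    prev = [(j, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            e, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, n)          # words match, no new error
            else:
                diagonal = (e + 1, s + 1, d, n)  # substitution
            e, s, d, n = prev[j]
            deletion = (e + 1, s, d + 1, n)      # reference word missing
            e, s, d, n = curr[j - 1]
            insertion = (e + 1, s, d, n + 1)     # extra word in the output
            curr.append(min(diagonal, deletion, insertion))
        prev = curr

    errors, subs, dels, ins = prev[-1]
    total = len(ref)
    return {
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "percent_error": 100 * errors / total,  # this is the WER
        "percent_correct": 100 * (total - subs - dels) / total,
    }


# A single dropped word reverses the meaning of the sentence.
result = wer_breakdown("please do not close the door",
                       "please do close the door")
print(result)
```

Here the only error is one deletion out of six reference words, which looks small as a percentage even though the caption now says the opposite of what was spoken. That is why the breakdown by error type matters alongside the overall rate.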
Why is Siri so good, but my automatic captions so bad?
This information helps us answer the next common question we receive: “Why are Siri and Alexa so good, but my automatic captions so bad?”
It is important to draw a careful distinction between “automated assistant” applications like Siri and automatic speech recognition technology.
Some of the factors that make automated assistants like Siri an easier problem than ASR for captioning include:
Automated assistants respond to a single speaker and adapt over time to that speaker’s voice and language idiosyncrasies.
The tasks that automated assistants can complete are very constrained, so the possible output is limited. (Have you ever tried holding a conversation with Siri?)
If an automated assistant doesn’t initially understand, it can ask the user to repeat what they said.
Automated assistants work well as long as the gist of the speaker’s intent is captured.
In contrast, captioning and transcription are much more challenging. This task is primarily characterized by long-form content where the speaker is completely unknown and where it is essential to transcribe almost every word that is spoken (some words like “um” and “uh” are often discarded).
Some state-of-the-art automatic speech recognition systems can achieve very high accuracy rates, even in the 90% range, if the following conditions are true:
There is only one speaker.
The speaker is reading from a script, or is equivalently concise, with virtually no grammatical or speech errors.
All of the speakers are using high-quality microphones and speaking at an appropriate distance from the microphone.
There is little to no background noise in the audio.
All of the above conditions remain constant through the majority of the audio file.
Once the above conditions begin to waver, the quality of the transcript is immediately affected. More often than not, the majority of these conditions are not present unless the audio was recorded in a professional studio. If even two or three of the conditions don’t hold, accuracy rates may drop as low as 50%, meaning that half of the transcript would be inaccurate.
The following example shows what it looks like when ASR fails on complex vocabulary. The meaning is completely changed, and the passage no longer makes any sense.
So, while technology continues to advance, and ASR continues to provide some laughs, humans are still needed when it comes to true captioning accuracy. Read more findings in the 2019 State of ASR report.
Written by 3Play Media.