Accuracy is often quoted when discussing automatic speech recognition (ASR) and its application to different market areas. ASR accuracy is usually given as a percentage, but as a percentage of what? The number of correct words? Does that include omissions, punctuation, spelling and grammar errors? Comparing ASR systems on the basis of accuracy figures is often not easy, and care needs to be taken that you are comparing apples with apples. And how does ASR accuracy impact its use in captioning?
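To make the 'percentage of what?' question concrete, the sketch below computes one common form of word accuracy: one minus the number of word substitutions, omissions and insertions divided by the number of words in a reference transcript (the complement of what is usually called word error rate). It is a minimal illustration rather than the method used by any particular ASR vendor. The function names and the normalisation choices (lower-casing, optional stripping of ASCII punctuation) are assumptions made for this example.

```python
import string


def normalise(text, strip_punctuation=True):
    """Lower-case the text, optionally strip ASCII punctuation, split into words."""
    text = text.lower()
    if strip_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    return text.split()


def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, deletions and insertions
    needed to turn the reference word list into the hypothesis word list."""
    # Edit-distance dynamic programme over words, keeping one row at a time.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion: the ASR omitted a word
                current[j - 1] + 1,      # insertion: the ASR added a word
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def word_accuracy(reference_text, hypothesis_text, strip_punctuation=True):
    """Word accuracy as a fraction of the reference length (can be negative
    if the hypothesis contains many inserted words)."""
    ref = normalise(reference_text, strip_punctuation)
    hyp = normalise(hypothesis_text, strip_punctuation)
    if not ref:
        raise ValueError("reference transcript contains no words")
    return 1.0 - word_errors(ref, hyp) / len(ref)
```

With the reference 'The cat sat on the mat, quietly.' and the ASR output 'the cat sat on a mat quietly', this sketch gives roughly 86% when punctuation is ignored and roughly 57% when it is counted. The same output earns two very different accuracy figures depending on what is measured.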
There are many challenges involved in producing captions (or subtitles for the hard of hearing), regardless of the production method. The introduction of ASR has not significantly removed these challenges, but it has created possibilities that did not previously exist. In principle, the use of ASR is cheaper than traditional manual captioning. However, this premise would only be completely sound if an ASR system delivered a result that was identical to that produced by a human captioner. The reality is that ASR systems do not directly produce equivalent output and may require a degree of manual work to make them meet the ‘quality standards’ traditionally used in captioning. Whether these traditional standards are still relevant to all captioning scenarios is a different matter!
Given the nature of ASR, in particular the lack of true comprehension of the source material, these systems are also prone to errors that are not commonly seen in manually created captions. ASR systems have problems with proper nouns (names), and are also very sensitive to poor audio, whereas human captioners have considerably greater tolerance and comprehension of background noises, accents and non-speech sounds. Additionally, it should be realised that ASR is not automatic captioning: most ASR systems do not currently format the output text in accordance with the subtle conventions used in captioning, e.g. timing adjustments for readability, the removal of repetition and redundant speech, the addition of sound effect information*, or indications that a speaker has changed. [* Captions should always include text that describes these non-speech events to a non-hearing audience, an aspect that ASR systems cannot currently provide.]
ASR errors are often measured and quoted as accuracy figures, but given that good captioning is not just about accurate text, but also about delivering that text in a manner that can be easily comprehended, how important are these measures? Pragmatically, for some markets, the absence of captions for sound effects, a few misrecognised words and a less than fluid presentation may be totally acceptable. This is arguably the situation for high volume, low value or ephemeral content (e.g. YouTube videos). In this arena, totally automated captioning using ASR may be acceptable, and adding the ability to ‘touch up’ the captions to fix the bigger errors can resolve any shortcomings of using an ASR system. In essence, the use of ASR technology enables captioning that would previously have been uneconomic, i.e. it delivers an adequate captioning experience at a lower cost.
Perhaps the most impressive difference is speed. ASR systems produce text very quickly compared to the time typically required by a human captioner on the same material, and can usually return results in less than the ‘real-time’ duration of the media. An hour long movie, for example, might be automatically captioned in a matter of minutes, which can have a definite impact in reducing the ‘turn-around’ time for producing captions. And of course, you can always speed up a software process by running it on more powerful computers! Naturally (and conveniently), the quality expectations for fast turnaround captions from a manual process are lower, as the captioner has less time to spend perfecting the output, so the output from an ASR system and that from a fast turnaround manual process are more directly comparable.
Clearly then, there are scenarios where ASR significantly enhances captioning, but it will only do so if the benefits outweigh any downsides. If a human captioner has to correct every caption in an ASR generated output, then in reality there is little advantage gained, as the overhead in opening, adjusting and closing each subtitle is significant. Fortunately, ASR systems tend to make predictable and repeatable errors, and they are usually able to indicate how well they have performed in recognising any given section of speech by providing ‘confidence scores’. This information can be used to focus human effort where it is most needed for correction. ASR systems can also be pre-loaded with anticipated proper names to help reduce incorrect spellings, and it is often possible to ‘train’ an ASR system to avoid similar errors in the future.
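As an illustration of how confidence scores might be used to focus human effort, the sketch below flags only those captions that contain at least one low-confidence word. The data layout (the `Word` and `Caption` classes), the assumption that scores lie between 0.0 and 1.0, and the 0.8 threshold are all hypothetical choices for this example. Real ASR systems report confidence in their own formats and scales.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)


@dataclass
class Caption:
    index: int
    words: list  # list of Word objects making up this caption


def captions_needing_review(captions, threshold=0.8):
    """Return (caption index, low-confidence words) for every caption that
    contains at least one word scored below the threshold."""
    flagged = []
    for caption in captions:
        doubtful = [w.text for w in caption.words if w.confidence < threshold]
        if doubtful:
            flagged.append((caption.index, doubtful))
    return flagged
```

A corrector working from such a list would open only the flagged captions and leave the rest untouched, which avoids the overhead of opening, adjusting and closing every subtitle. The threshold is a tuning choice: set too low it lets errors through, set too high it flags nearly everything.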
If you are using an ASR system, then it is the starting point of the process and it has a marked impact on the overall result. The difference between an ASR system with a ‘measured accuracy’ of 85% and one of 95% is very significant. This may seem counter-intuitive, but at 85% accuracy, for typical captioning applications where each caption averages 10 words, more than one word in ten is incorrect, so on average every caption contains an error (1.5 errors per caption). However, if the accuracy is 95%, then there is on average only one error for every two captions, so typically only every other caption may require adjustment (i.e. only about half the captions need correcting manually). A 10 percentage point improvement in accuracy therefore achieves roughly a 50% saving in correction effort. So the higher the accuracy that can be achieved by the ASR system, the less costly any subsequent manual correction process becomes, and even small improvements in accuracy can have big impacts on cost.
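The arithmetic behind this comparison can be checked with a short sketch. It computes the average number of errors per caption and, under the simplifying assumption (not made in the text above) that each word is misrecognised independently of its neighbours, the chance that a caption contains at least one error. The function names and the default of 10 words per caption are illustrative.

```python
def expected_errors_per_caption(accuracy, words_per_caption=10):
    """Average number of incorrect words in a caption of the given length."""
    return (1.0 - accuracy) * words_per_caption


def probability_caption_has_error(accuracy, words_per_caption=10):
    """Chance that a caption contains at least one incorrect word.

    Assumes every word is wrong independently with the same probability,
    which is a simplification: real ASR errors tend to cluster.
    """
    return 1.0 - accuracy ** words_per_caption


if __name__ == "__main__":
    for accuracy in (0.85, 0.95):
        print(
            f"accuracy {accuracy:.0%}: "
            f"{expected_errors_per_caption(accuracy):.1f} errors per caption, "
            f"{probability_caption_has_error(accuracy):.0%} of captions affected"
        )
```

Under that independence assumption, about 80% of captions contain an error at 85% accuracy and about 40% at 95% accuracy, so the share of captions needing attention still roughly halves.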
The bottom line is that because manual quality control (QC) and correction of captioning is the most time-consuming and costly part of any captioning workflow, choosing and using an ASR system with high accuracy, especially across a broad range of vocabulary, has clear value.
John Birch, Strategy & Business Development Manager at Screen Systems
About Screen Systems
Screen was founded as Screen Electronics by Laurie Atkin in 1976, and pioneered the first ever electronic subtitling system, providing the first digital character generator to the BBC. Throughout the 1970s and 80s, Screen continued to lead the market, developing a number of new subtitling technologies including fully automated transmission using timecode, the first PC based subtitle preparation system and the first multi-channel, multi-language subtitling systems.
In 2001 Screen took subtitling technologies into the 21st century with the Polistream transmission and Poliscript preparation products. In 2011 it diversified by acquiring SysMedia Ltd, a leader in the fields of subtitle preparation and teletext content production and publishing systems. Then in 2018, the company itself was acquired by BroadStream Holdings Ltd (BHL), bringing Integrated Playout into the fold of its capability via parent company BroadStream Solutions.
Screen is now the number 1 provider of subtitling production and delivery systems in the world, and with its broader product portfolio now builds on that success with products that enhance broadcast content with value-add information services across multiple platforms and devices.