Using punctuation in speech recognition improves the readability and usability of transcripts. We’re going to dive into its importance in transcription.
The addition of punctuation narrows the gap between a mere transcript and a more intelligent understanding of language. Punctuation, used in the right place, significantly improves the readability of transcripts and reduces the time it takes to edit them. But how important is punctuation, really?
If you’ve been using a speech-to-text service, you’ll know that companies are starting to introduce more sophisticated punctuation models. This helps to improve the readability and usability of transcripts. The recent backlash from users of Google’s speech-to-text dictation service shows that getting punctuation wrong is as bad as, if not worse than, providing no punctuation at all. Sure, having no punctuation makes transcripts very difficult to follow, but at least it doesn’t require the user to spend hours editing incorrect punctuation and re-interpreting sentences. A frustrated Google speech-to-text user said: “This adding in of punctuation started happening today for the first time and is as annoying and inaccurate as the other posts [in the forum] have already described. It has meant more time editing the mistakes in the dictation than ever before.”
Let’s take an example. What turns a sequence of words on the page (or screen) into a comprehensible phrase, sentence or paragraph? Consider the following spoken sentence that was transcribed by Speechmatics’ ASR from a radio news broadcast.
“if the weather cooperates as they hope crews might be able to start talking about letting people back into their homes today”
Now see the difference if we add in capitalization and punctuation.
“If the weather cooperates, as they hope, crews might be able to start talking about letting people back into their homes today.”
Punctuation is a huge assist with readability, and in certain cases can help to avoid ambiguity. A recent example in the news was the furore around the omission of the so-called "Oxford comma" from the inscription "Peace, prosperity and friendship with all nations" on a recently minted fifty pence coin (with author Philip Pullman calling for it to be "boycotted by all literate people").
Traditionally, in order to have punctuation marks appear in transcribed text, it was necessary to pronounce each character by name, such as “full stop”, “comma”, “question mark”, etc. However, advances in machine learning have enabled the development of automatic punctuation placement. In this section, we’ll look at how Speechmatics handles punctuation in automatic speech recognition, according to Machine Learning Engineer, Tom Ash.
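That traditional dictation approach was essentially a lookup: replace each spoken punctuation name with its symbol. Here is an illustrative sketch (not any particular product’s code), which also shows the approach’s well-known weakness: it mangles sentences that legitimately contain words like “comma”.

```python
# Illustrative sketch of traditional dictated punctuation: spoken
# punctuation names are replaced with their symbols, attached to the
# preceding word. Not production code from any real dictation product.

SPOKEN_MARKS = {
    "full stop": ".",
    "question mark": "?",
    "exclamation mark": "!",
    "comma": ",",
}

def apply_spoken_punctuation(transcript: str) -> str:
    """Replace spoken punctuation names with marks, glued to the previous word."""
    for phrase, mark in SPOKEN_MARKS.items():
        # Note: this naively replaces *every* occurrence, so a sentence
        # genuinely about "the comma" would be mangled too.
        transcript = transcript.replace(" " + phrase, mark)
    return transcript

print(apply_spoken_punctuation(
    "if the weather cooperates comma crews might start today full stop"
))
# -> if the weather cooperates, crews might start today.
```

Automatic punctuation placement removes the need for the speaker to do this work at all.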
When building our language models with advanced punctuation, we don't internalize the rules of The Oxford Style Manual, The Chicago Manual of Style, Fowler's Dictionary of Modern English Usage or any other venerable guide to good language usage. Instead, Speechmatics uses machine learning techniques to filter training data so we can get a good picture of what appropriate punctuation looks like for sentences in the target language, whether that be English, French, Japanese or Turkish, for example.
The steps we took
Our first step in creating advanced punctuation was to curate relevant data to learn from. This actually meant undoing a lot of our standard pipelines for ASR, where we normally try to remove punctuation in preparation for language modeling. We spent time refactoring our code to leave the ‘good’ punctuation in, and also to filter out lines that had either too many or too few punctuation marks.
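The kind of filter described here can be sketched roughly as follows. The thresholds are illustrative assumptions, not Speechmatics’ actual values:

```python
# Rough sketch of filtering training lines by punctuation density.
# The lo/hi thresholds below are illustrative assumptions only.

PUNCTUATION = set(".,?!")

def punctuation_ratio(line: str) -> float:
    """Punctuation marks per word in a candidate training line."""
    words = line.split()
    if not words:
        return 0.0
    marks = sum(ch in PUNCTUATION for ch in line)
    return marks / len(words)

def keep_line(line: str, lo: float = 0.02, hi: float = 0.35) -> bool:
    """Keep lines whose punctuation density looks plausible; drop the rest."""
    return lo <= punctuation_ratio(line) <= hi

print(keep_line("If the weather cooperates, as they hope, crews can start today."))  # True
print(keep_line("a line with no punctuation at all"))                                # False
```

Lines that survive the filter give the model a reasonably clean picture of where punctuation belongs.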
At first, we wanted to cover all the punctuation marks we could think of. However, we soon realized that any punctuation marks that come in pairs (quotation marks, parentheses) were going to be almost impossible to integrate with a streaming ASR system. In a streaming system, words come out in chunks smaller than a sentence. We would, therefore, have to insert opening quotes, for example, before we even realized we were in a quotation! We then looked at colons and semi-colons and realized that, because they are used so rarely, it was going to be hard to gather enough examples to train on. In the end, we boiled it down to full stops, commas, question marks and exclamation marks. These were also the punctuation marks our customers were most interested in, and the most useful for their use cases.
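Paring training text down to those four marks might look something like this (a hedged sketch, not Speechmatics’ pipeline; apostrophes are kept so contractions survive):

```python
# Hedged sketch: normalize training text down to the four supported marks.
# Paired marks (quotes, brackets) and rare marks (colons, semicolons) are
# simply dropped; apostrophes are kept so contractions like "don't" survive.

KEEP = set(".,?!")

def normalize_punctuation(text: str) -> str:
    kept = "".join(
        ch for ch in text
        if ch.isalnum() or ch.isspace() or ch == "'" or ch in KEEP
    )
    return " ".join(kept.split())  # collapse whitespace left behind

print(normalize_punctuation('He said: "wait here"; then left.'))
# -> He said wait here then left.
```

The resulting text contains only the marks the model is actually expected to predict.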
Key decisions made
One of the biggest trade-offs we had to make was in latency versus accuracy. The higher the latency, the more context you will be able to take into account when deciding on a punctuation mark to output. However, latency is a crucial issue for some of our use cases. We, therefore, had to balance it with accuracy to get the most appropriate system. We worked closely with the team that works on our streaming system to get the right operating point. This required us to take into account the intricacies of how we endpoint our chunks of text when streaming transcripts on live audio.
Another key decision we faced when designing our punctuation system was how much audio versus textual context was required. The academic literature gives somewhat mixed views on this, and different researchers use different approaches in their systems. Taking both into account brings extra engineering challenges into play that are not present when only using one information source. However, it does give the fullest picture to train a model on.
Research with our customers found that most people prefer transcriptions with full punctuation. It introduces pauses in the correct places and makes the transcript easier to follow. For captioning use cases, punctuation significantly improves the readability of those captions and enables the audience to better understand the context of the audio.
Hewson Maxwell, Head of Technology Development, Access Services, Red Bee Media said:
“In captioning, accurate punctuation is extremely important. If the punctuation is absent or misleading it is significantly more difficult for the audience to understand the meaning of the dialogue as transcribed, especially where the speech is dense and quickly delivered. Accurate punctuation both improves fully automated captioning, but also can speed productivity in workflows based on partial automation with human review.”
For customers that prefer minimal (or no) punctuation, Speechmatics provides the ability to dial back the punctuation and restrict the set of punctuation marks that will appear. Punctuation quality matters especially for editing use cases: good punctuation significantly reduces the time it takes to produce a perfect transcript and improves its readability, while poor punctuation adds to that time.
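As an illustration, restricting the marks that appear might be expressed through a job configuration along these lines. The field names here (such as "punctuation_overrides" and "permitted_marks") are assumptions for the sketch, to be checked against the current Speechmatics API documentation:

```python
import json

# Configuration sketch for dialing back punctuation in a transcription job.
# Field names ("punctuation_overrides", "permitted_marks") are assumptions
# here -- consult the current Speechmatics API docs for the exact schema.

config = {
    "type": "transcription",
    "transcription_config": {
        "language": "en",
        "punctuation_overrides": {
            # Only full stops and question marks; commas and
            # exclamation marks are suppressed.
            "permitted_marks": [".", "?"],
        },
    },
}

print(json.dumps(config, indent=2))
```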
Ian Firth, VP Products at Speechmatics said:
“We recognize that there’s much more to making a useful transcription than the words on the page. Our advanced punctuation model is a significant step in improving the readability of transcripts. Our customers derive even greater meaning and value from their transcriptions using our advanced punctuation feature.”
To make speech recognition systems a really useful tool for human beings, the transcription output should look natural, as if a person had written down the spoken words themselves. Punctuation helps to convey the grammatical structure of a sentence, as well as guiding the reader to pause for breath in the right places and to modify their intonation.
To that end, Speechmatics has always provided full stops (reading a transcript that did not have full stops would be akin to reading Finnegans Wake). Question marks are relatively straightforward as they are the stop used for interrogatives. Commas help with pacing the sentence, as well as making lists much easier to follow. Exclamation marks are also supported, although they are seen very rarely in transcripts. These four common punctuation marks help to make transcripts more usable.
Organizations are looking for more ways to innovate with voice and utilize the rich sources of information locked within. Unlocking voice data helps companies to deliver better experiences, understand more about their customers, their staff and their business processes. Through the transformation of voice data to text and with the ability to analyze this information, businesses can make significant improvements to all these areas.
Speech recognition technology transforms voice into text, helping to make audio content more widely accessible. This, coupled with advanced punctuation, further improves the usability of voice data by adding readability and a deeper level of understanding into the mix. It transforms voice data into something consumable and enables better analysis.
Interested to learn how Speechmatics’ any-context speech recognition engine can help your business innovate with voice? Download our report to learn how voice technology can unlock new business opportunities for you.