Voice technology and its use cases are evolving and growing every day. These new applications demand even more voice-to-text features to align the technology with consumer expectations.
What voice-to-text features will be developed in the next three years?
A survey for the Speechmatics report on Trends and Predictions for Voice Technology in 2021 asked respondents which voice technology features they thought would be crucial over the next three years:
Improved word error rate accuracy
English speech recognition is reaching a level of accuracy that is hard to surpass, at around 95% (a word error rate, or WER, of 5%). So, speech recognition technology providers need to look at the accuracy of other languages and ensure they are fit for purpose for global businesses. Providers can also shift focus to delivering supporting voice-to-text features that enhance the quality of the output provided to users.
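To make the 95%/5% relationship concrete: WER is conventionally computed as the word-level edit distance (substitutions, insertions and deletions) between a reference transcript and the recognizer's output, divided by the number of reference words. A minimal sketch in Python (the function name is illustrative, not any provider's API):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed here with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return float(len(hyp) > 0)
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)
```

One wrong word in a twenty-word reference gives a WER of 5%, i.e. roughly the 95% accuracy figure quoted above.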
Features like entity tagging within the audio, identification of the languages spoken and better diarization will all contribute to a more accurate representation of the transcribed audio. Providers also need to ensure that the levels of accuracy they advertise hold up in real-world use cases: for example, delivering quality transcription output in noisy environments, from audio recorded on low-quality devices, or from speakers with different accents and dialects.
Improved speaker diarization
Speaker diarization is used to understand which speaker is speaking in single-channel media files. It does this by detecting unique speakers and assigning speaker labels to the corresponding portions of text within the transcript.
Speaker diarization is one of the most challenging elements of speech recognition technology. It poses a challenge for automated systems due to the fluctuations in a single speaker’s voice depending on their mood, hesitations, word emphasis, noise, etc. Improved speaker diarization will uplift use cases that benefit from being able to match a speaker with the words spoken.
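One common way to implement the labeling step described above is to compute a voice embedding for each segment of speech and cluster similar embeddings under one speaker label. The sketch below assumes such embeddings already exist (how they are produced is the hard, model-dependent part) and uses a simple greedy cosine-similarity match; the threshold value is illustrative:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def diarize(segments, threshold=0.8):
    """Assign a speaker label (S1, S2, ...) to each (embedding, text) segment
    by greedily matching it against previously seen speakers."""
    speakers = []   # one representative embedding per detected speaker
    labelled = []
    for emb, text in segments:
        best, best_sim = None, threshold
        for idx, rep in enumerate(speakers):
            sim = cosine(emb, rep)
            if sim > best_sim:
                best, best_sim = idx, sim
        if best is None:            # no close match: a new speaker appears
            speakers.append(emb)
            best = len(speakers) - 1
        labelled.append((f"S{best + 1}", text))
    return labelled
```

The fluctuations mentioned above (mood, hesitation, background noise) show up as embedding drift, which is exactly what makes the clustering step hard in practice.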
Language identification
Detecting the language of speakers within a video or audio file automates the manual task of selecting the correct language pack for transcription.
By automating the language identification step of the transcription process, businesses can save time and human resource costs while unlocking information that would previously have been lost. For example, in stock trading, where compliance and monitoring are vital, detecting that a call contains multiple languages could flag it for additional investigation.
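The routing and compliance logic around a language-ID model is straightforward to sketch. Here the per-segment language detections are assumed to come from some upstream model; the pack names and the flagging rule are hypothetical:

```python
from collections import Counter

def route_for_transcription(detected_languages, packs, flag_threshold=2):
    """Pick the dominant language pack for a call and flag calls that
    contain multiple languages for compliance review.

    detected_languages: per-segment codes, e.g. ["en", "en", "de"],
    assumed to come from a language-identification model."""
    counts = Counter(detected_languages)
    dominant, _ = counts.most_common(1)[0]
    pack = packs.get(dominant, packs.get("en"))   # fall back to English
    needs_review = len(counts) >= flag_threshold  # multiple languages present
    return pack, needs_review
```

In the stock-trading example, `needs_review` is the signal that routes a multilingual call to a human compliance team instead of straight to the archive.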
Customer-specific language models trained on customer text data (language model adaptation)
Customization of language models has existed for many years, with users importing their own custom dictionary lists of brand names, acronyms and so on. But this approach lacks the finesse that context provides.
The ability to tune models on the user's own data has the potential to deliver the elusive extra 5% in accuracy that standard language packs might not. Users and providers will need to work more closely together, sharing data and improving incrementally until joint goals are achieved.
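The gap between the two approaches can be seen in a small example. A custom-dictionary pass is essentially context-free post-correction: snap near-miss words onto a user-supplied word list. The sketch below uses Python's standard-library fuzzy matcher to illustrate this; real systems typically bias the recognizer itself rather than patching its output, and model adaptation goes further still by retraining on the customer's text:

```python
import difflib

def apply_custom_dictionary(transcript, custom_words, cutoff=0.8):
    """Snap near-miss words in a transcript onto entries from a
    user-supplied custom dictionary (brand names, acronyms, ...).
    Context-free: each word is corrected in isolation."""
    corrected = []
    for word in transcript.split():
        match = difflib.get_close_matches(word, custom_words, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)
```

Because each word is matched in isolation, this cannot disambiguate cases where the right correction depends on the surrounding sentence, which is precisely the "finesse of context" that adapted language models are meant to supply.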
Short utterance accuracy
The desire for improvements in short utterance accuracy is not surprising, given the increased focus on virtual assistants and their use in more edge-based platforms like phones, cars, and other devices.
Language coverage
The global adoption of virtual assistants has encouraged providers to continue to add high-quality speech recognition technology in even more languages, for more accents and dialects than ever before. Consumers expect their virtual assistants to understand them, irrespective of their accent, dialect or language. In fact, 19% of respondents highlighted language coverage as something they expect providers to improve on in the next three years.
Spoken language translation
Organizations with aspirations to deliver a global service must unify communications and messaging across all their work environments and employees, irrespective of location. Translation might provide an answer to this. However, it presents challenges that still require work to solve.
Audio can be transcribed in one language, translated word for word, and then fed into a text-to-speech engine. The output, however, will never sound like natural language. Achieving natural results will require additional understanding and experimentation, with specialist providers dedicating effort to delivering a transcribed, translated and machine-spoken output that is almost indistinguishable from a natural speaker.
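The cascade described above can be sketched as three stages wired together. The stage functions below are deliberately naive stand-ins, not any real engine; the word-for-word `translate` stub in particular demonstrates why the raw output of such a pipeline does not read as natural language:

```python
def transcribe(audio):
    # Stand-in for a speech-to-text engine: returns text from "audio".
    return audio["spoken_text"]

def translate(text, glossary):
    # Stand-in: naive word-for-word lookup. This literal approach is
    # exactly what produces unnatural output in a real cascade.
    return " ".join(glossary.get(word, word) for word in text.split())

def synthesize(text, voice="neutral"):
    # Stand-in for a text-to-speech engine: returns a marker string
    # where a real engine would return audio.
    return f"<{voice} voice: {text}>"

def speech_translation_pipeline(audio, glossary):
    """Cascade: speech-to-text -> machine translation -> text-to-speech."""
    text = transcribe(audio)
    translated = translate(text, glossary)
    return synthesize(translated)
```

Each stage also compounds the errors of the one before it, which is why closing the gap to a natural speaker takes work across the whole pipeline rather than in any single stage.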
Want to know more about the future of voice-to-text features?
For more information – and the full survey results – download Trends and Predictions for Voice Technology in 2021.
Ready to Try Speechmatics?
Try for free and we'll guide you through the implementation of our API. We pride ourselves on offering the best support for your business needs. If you have any questions, just ask.