Speech technology has progressed rapidly over the past 5-10 years due to the rise of graphics processing and cloud computing, but how do we keep up this momentum and continue to advance Automatic Speech Recognition (ASR)?
I was recently in Hyderabad, India, at Interspeech, the largest dedicated speech conference in the world, learning about new developments in speech science and technology and the enabling effects AI and Machine Learning have on them. Whilst listening to talks and speaking to other speech experts, it was evident that we’ve come a long way, with speech technology now surpassing the accuracy of humans in some tasks. Who’d ever have thought?
What was more interesting was seeing the steady progression of speech technology since the 90s, where improving accuracy by reducing word error rate has been top of the agenda. It became clear that whilst the quality of current speech systems has enabled them to be applied in our everyday lives, such as through Siri, Alexa and Google Home, real applications of speech come with new problems and challenges that we must overcome to see any further progress.
The challenges we now face include far-field speech and noisy environments. These two problems arise when more ambitious applications of ASR are created: on-air interviews, indoor speech transcription where the voice is coming from a distance (imagine a family dinner) and meeting transcription, to mention some examples.
Another challenge worth noting is speech recognition for under-resourced languages and domains. Current technologies rely on enormous quantities of data to produce high-quality systems. However, this kind of data does not exist for most languages in the world, which limits where speech recognition can be applied.
Despite these ASR challenges, I expect speech recognition will be adopted more and more into our everyday lives. As the world becomes increasingly connected, we’ll be required to improve communication systems, and speech is communication in its rawest form. My predictions for speech recognition include the development of noise-robust ASR systems, coupled with new forms of model adaptation that would enable us to deal with noisy environments. Moreover, model adaptation is crucial for dealing with under-resourced languages and domains, so I expect a great deal of activity in this sub-area of ASR.
It really is an exciting time to be part of this speech revolution, and I’m looking forward to working with the Machine Learning experts at Speechmatics to take this industry to the next level.
Nicolás Serrano, Speechmatics