A couple of the team at Speechmatics and I attended the BBC News Labs and BBC Connected Studio’s TextAV event in London on 18-19 September. TextAV is a two-day working group attracting leading technologists, application developers, and practitioners working in the area of online audio and video, with a particular focus on the use of captions and transcripts to facilitate and speed up the production process.
The goal of the hack event was to provide an overview of current software projects and practices, to explore challenges in the ecosystem, from transcription to URL addressability, GUIs for collaboration, automation, muxing etc., and to catalyse new open source collaborations to further build out the ecosystem.
The event consisted of knowledge sharing and presentations as well as showcasing key projects and libraries. We were split into sub-groups for a mini-hackathon on our domains and topics and were then required to follow up by documenting and presenting key learnings and findings.
Jumping into our first day, we engaged in a discussion about the difficulties of benchmarking ASR providers. In particular, we talked at length about the issue of collating the correct data set, and how this could be made easier. Our stance on this, however, is always that the test set should represent the data to be used in production, so it needs to be bespoke: you wouldn’t compare defenders in football by how well they take penalties.
Interestingly, a lot of the things that people told us they care about in an ASR provider (languages, accents, etc.) are a data set problem!
With this assumption aside, we moved on to discussing the metrics people use to differentiate between providers, and how we could produce a project that is capable of fairly assessing these. The idea is that, once presented with the pure facts, users should be able to easily choose the provider that best suits them and the data they care about.
From these conversations, we learned that while word error rate is used almost exclusively as the standard of ASR performance, users are starting to care about much more than just the accuracy of the words. The overall readability of the transcript, the quality of punctuation and capitalisation, and the speaker diarisation were just some of the issues that were raised. These factors cannot be fairly assessed by counting insertions, deletions and substitutions, and a tool that could present all of these data points is clearly much needed in the community.
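To make the point concrete, here is a minimal sketch of how word error rate is typically computed: a word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. This is not the code from the hack event; the function names `normalise` and `word_error_counts` and the normalisation rule (lower-casing and dropping punctuation) are assumptions made purely for illustration.

```python
import re


def normalise(text: str) -> list[str]:
    """Assumed normalisation: lower-case and keep only letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_counts(reference: str, hypothesis: str) -> dict:
    """Return substitution, deletion and insertion counts plus WER."""
    ref = normalise(reference)
    hyp = normalise(hypothesis)
    if not ref:
        raise ValueError("reference transcript contains no words")

    # dp[i][j] = (edits, substitutions, deletions, insertions)
    # for aligning the first i reference words with the first j hypothesis words.
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # every hypothesis word inserted

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]  # words match, no new edit
                continue
            e, s, d, n = dp[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = dp[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = dp[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Tuples compare on total edits first, so this picks a cheapest alignment.
            dp[i][j] = min(substitution, deletion, insertion)

    edits, subs, dels, ins = dp[len(ref)][len(hyp)]
    return {
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
        "wer": edits / len(ref),
    }


if __name__ == "__main__":
    print(word_error_counts("the cat sat on the mat", "the cat sit on mat"))
    # Punctuation and capitalisation vanish under this normalisation:
    print(word_error_counts("Hello, world.", "hello world"))
```

Under this assumed normalisation, a transcript with no punctuation or capitalisation at all scores exactly the same as a perfectly formatted one, which is the gap described above. Speaker labels are likewise invisible to this metric.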
Our motivation for working on this particular problem stemmed from discussions between James Dobson, Will Williams and me about the frustration that multiple groups have to essentially replicate the same work to compare multiple providers. Datasets aside, the evaluation code should be more or less exactly the same, and it’s a complete waste of resources for companies to reproduce the same code internally.
Creating a tool that evaluates in an agreed upon and standardised way saves time and facilitates reproducible and fair comparisons.
We intend to continue contributing to this project, and to the wider discussion, off the back of this event. We may even have some exciting news to come in the future, so stay tuned! In the meantime, you can try our real-time demo here!
Tom Wordsworth, Speechmatics