Whisper Speech to Text Deep-Dive

Last month, OpenAI launched Whisper, an open-source speech-to-text model that can perform multiple tasks including ASR, language identification, voice-activity detection, and translation.

In typical OpenAI fashion, Whisper is surprisingly different from current state-of-the-art speech-to-text systems. There are three immediate key takeaways from Whisper’s approach:

1. Increasing the quantity of labeled data can have an instant impact: Whisper's training data contains 680,000 hours of audio. This is at least an order of magnitude larger than typical previous training regimes.
2. Neither model architecture nor loss needs to be complex. What matters is choosing an approach that scales well. Whisper uses a simple autoregressive encoder-decoder structure (see Figure 1), much like the original transformer paper [Vaswani et al. (2017)] [1]. The loss is simply the cross-entropy of the next token, conditioned on the audio.
3. If 1 & 2 are done well, then supervised learning can produce impressive systems.
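
To make the loss in the second takeaway concrete, here is a minimal sketch in plain Python. It assumes the decoder has already produced one vector of scores (logits) per output position, each computed from the audio and the preceding reference tokens. The function names and the toy numbers are ours, for illustration only; this is not Whisper's training code.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_cross_entropy(step_logits, target_ids):
    """Mean negative log-likelihood of the reference tokens.

    step_logits[t] is assumed to be the decoder's scores over the vocabulary
    at step t, conditioned on the audio and the reference tokens before t.
    """
    if len(step_logits) != len(target_ids):
        raise ValueError("need exactly one logit vector per target token")
    nll = 0.0
    for logits, target in zip(step_logits, target_ids):
        nll -= math.log(softmax(logits)[target])
    return nll / len(target_ids)


# Toy vocabulary of 4 tokens and a 3-token reference transcript.
targets = [0, 1, 2]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in targets]
confident_logits = [[10.0, 0.0, 0.0, 0.0],
                    [0.0, 10.0, 0.0, 0.0],
                    [0.0, 0.0, 10.0, 0.0]]

loss_uniform = next_token_cross_entropy(uniform_logits, targets)
loss_confident = next_token_cross_entropy(confident_logits, targets)
print(loss_uniform, loss_confident)
```

A model that spreads its probability evenly over the vocabulary pays log(4) per token here, while a model that puts almost all of its mass on the reference token pays close to zero.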

This is super exciting! As is always the case when innovations are released, there are now multiple questions that need answering. This blog aims to answer some of them and think through some of the deeper implications. We’ll develop some more takeaways from Whisper speech-to-text and point to what they mean for future AI research.

Digging Deeper

OpenAI is famous for its use of scale: its researchers were among the first to formulate the neural scaling laws [2], and its GPT-series models made previous language models look very small indeed. OpenAI Whisper is no different. It takes the scaling hypothesis and applies it rigorously to the domain of speech. Whisper’s authors push their setup to the limit with internet-scale labeled datasets and a simple Transformer model. Extensive internal testing at Speechmatics has shown that this approach is by no means perfect; however, there is much to learn from it.

Importance of testing out-of-distribution and danger of overfitting in-distribution

A perennial question in ASR is whether “human-level” accuracy has been achieved. In 2017 Microsoft announced [3] that it had reached human parity on a conversational speech recognition task, citing a word error rate of 5.1%, in line with its measured human transcription error rate. However, the key question was never asked: how does that result generalize to producing superhuman systems across all domains? Indeed, Professor Naomi Harte called this exact claim into question at this year’s UK Speech conference in Edinburgh, asking whether we really are any closer to general superhuman performance. The same suspicion that something isn’t quite right is brought out strikingly in the Whisper paper [4]:

“In 2015, Deep Speech 2 (Amodei et al., 2015) reported a speech recognition system matched human-level performance when transcribing the LibriSpeech test-clean split. As part of their analysis they concluded: “Given this result, we suspect that there is little room for a generic speech system to further improve on clean read speech without further domain adaptation.” Yet seven years later the SOTA WER on LibriSpeech test-clean has dropped another 73% from their 5.3% to 1.4% (Zhang et al., 2021), far below their reported human-level error rate of 5.8%. Despite this massive and unanticipated further improvement in performance on held-out but in-distribution data, speech recognition models trained on LibriSpeech remain far above human error rates when used in other settings. What explains this gap between reportedly superhuman performance in-distribution and subhuman performance out-of-distribution?”

We suspect there are several important factors at play here. First, the authors quite rightly point the finger at a systemic failure in common evaluation protocols: the key to demonstrating generalized superhuman accuracy is to evaluate ‘out-of-distribution’. In other words, we must always evaluate on data distributions from which no training data was drawn. This is almost a given when building large-scale robust ASR systems, but it is an important departure from common academic evaluation methodologies. The gap between ‘Ideal Robustness’ and current systems in Figure 2 shows that it remains an ongoing issue.

Second, we must ask why word error rates (WER) can go so low in-distribution but fail to generalize. Driving WER on LibriSpeech’s test set so low has clearly been useful in propelling the research community forward, but why does it transfer so poorly to out-of-distribution test sets?

We speculate it’s a similar phenomenon to the vision community’s experience with ImageNet: it’s eminently possible to overfit to spurious correlations in a given dataset, quirks which are correlated with correct classification but which don’t hold when the data distribution is shifted. For speech, this could be the neural network building a deep reliance on certain frequencies appearing in the input when a certain phone is spoken, with those frequencies then being absent under distributional shift.

For a perfectly robust system, a single percentage point decrease in in-distribution WER corresponds to a single percentage point decrease across all out-of-distribution test sets. Figure 2 shows that the Whisper system makes great strides in this direction by leveraging large quantities of broadly distributed labeled data, but a gap still remains.
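
For readers less familiar with the metric, the sketch below computes WER as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. It also shows one simple way to summarise a robustness gap. The `robustness_gap` helper and the WER figures passed to it are hypothetical illustrations of ours, not numbers from the Whisper paper, and real evaluations normalise the text before scoring.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def robustness_gap(in_dist_wer, out_of_dist_wers):
    """Mean out-of-distribution WER minus in-distribution WER (0 would be ideal)."""
    return sum(out_of_dist_wers) / len(out_of_dist_wers) - in_dist_wer


example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
# Hypothetical WERs: 3% in-distribution, 10% and 14% on two other test sets.
example_gap = robustness_gap(0.03, [0.10, 0.14])
print(example_wer, example_gap)
```

In the example, one substitution and one deletion against a six-word reference give a WER of one third.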

Key takeaways:

We must train and evaluate broadly and out-of-distribution if we truly want a handle on the human-level accuracy question
It’s far too easy to overfit in-distribution; robust systems are non-trivial to create, and we need a mechanism to enable out-of-distribution generalisation

Supervised learning is amazing at scale

By keeping the architectural setup very simple, Whisper can tell us lots about the impact of dataset scaling. The architecture is a vanilla encoder-decoder transformer setup that performs a simple token prediction task. Although this presents challenges in forming a real-time system, adding custom vocabulary and performing reliable long-form recognition, it’s a perfect sandbox for analysing the impact of training on internet-scale datasets.
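
To illustrate what a simple token prediction task means at inference time, here is a toy sketch of greedy autoregressive decoding: each new token is chosen from scores that depend on the audio and on the tokens emitted so far. The `toy_scores` function is a hypothetical stand-in for a real encoder-decoder model and simply copies its input, and the token ids are made up. Because the transcript is built one token at a time, each step waits on the previous one.

```python
def greedy_decode(next_token_scores, audio_features, start_id, end_id, max_len=32):
    """Repeatedly append the highest-scoring next token until end_id appears."""
    tokens = [start_id]
    for _ in range(max_len):
        scores = next_token_scores(audio_features, tokens)
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
        if best == end_id:
            break
    return tokens


# Hypothetical toy setup: a 5-token vocabulary with start and end markers.
VOCAB_SIZE = 5
START, END = 0, 4


def toy_scores(audio_features, prefix):
    """Stand-in model: 'transcribes' by copying audio_features one id per step."""
    position = len(prefix) - 1  # number of tokens emitted after START
    target = audio_features[position] if position < len(audio_features) else END
    return [1.0 if token == target else 0.0 for token in range(VOCAB_SIZE)]


decoded = greedy_decode(toy_scores, [2, 3, 1], START, END)
print(decoded)  # [0, 2, 3, 1, 4]
```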

Their results show that models can be scaled to around the billion-parameter mark and continue to exploit weakly labeled datasets. This is perhaps the key idea in the whole paper: it’s possible to both obtain and exploit 680,000 hours of mostly correct transcriptions, the majority of them English. At these data and model sizes, architectural and loss innovations become secondary and wash out.

By following this simple recipe, the Whisper model approaches human-level performance on the Kincaid46 dataset, which is a great achievement.

Key takeaways:

It pays to collect the largest weakly labeled dataset possible, with the bar for inclusion set high
Heuristics to auto-clean or reject data are as important as architectural or training-loss innovations, particularly if your largest models are data-bottlenecked
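
As a rough illustration of what such heuristics can look like, the sketch below rejects transcripts that are entirely upper-case or entirely lower-case (a common sign of machine-generated text) and transcripts whose speaking rate is implausible for the audio duration. The function names and thresholds are our own assumptions, chosen for the example; this is not the filtering pipeline used to build Whisper's dataset.

```python
def looks_machine_generated(transcript):
    """Flag transcripts with no letters or with no case variation at all."""
    letters = [char for char in transcript if char.isalpha()]
    if not letters:
        return True
    all_upper = all(char.isupper() for char in letters)
    all_lower = all(char.islower() for char in letters)
    return all_upper or all_lower


def keep_example(transcript, duration_seconds, min_wps=0.5, max_wps=6.0):
    """Keep an (audio, transcript) pair only if it passes both checks.

    min_wps and max_wps are hypothetical bounds on words per second.
    """
    if duration_seconds <= 0:
        return False
    if looks_machine_generated(transcript):
        return False
    words_per_second = len(transcript.split()) / duration_seconds
    return min_wps <= words_per_second <= max_wps


candidates = [
    ("Hello there, how are you today?", 3.0),  # plausible human transcript
    ("HELLO THERE HOW ARE YOU", 3.0),          # all upper-case
    ("hello there how are you", 3.0),          # all lower-case
    ("Hi.", 60.0),                             # far too few words for a minute
]
kept = [text for text, seconds in candidates if keep_example(text, seconds)]
print(kept)
```

Cheap checks like these trade some recall for precision: they will throw away some good data, which is acceptable when the raw pool is internet-scale.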

Internet-scale supervised learning is plateauing for English ASR

At Speechmatics, we believe that scale is a key ingredient for almost all the AI systems of the future. For both supervised and unsupervised systems, scaling data along with parameter count in a transformer model is now a proven recipe. However, what happens when the labeled data runs out? What’s the play when we need to collect or pay for exponentially more data to realise the next step-change in accuracy? This is the challenge in ASR today.

The Whisper setup tells us what happens when model and data scaling saturate. If the authors are correct, the Whisper model has converged to the implicit error level in the training data. To hit the next step change, we'd then need an order of magnitude more perfect-level (not human-level) transcribed audio. That would be ~10M hours of perfectly labeled data, which is out of reach even for companies with the deepest pockets.

As the authors highlight: “Performance improves rapidly on English speech recognition from 3,000 to 13,000 hours and then slows down noticeably between 13,000 and 54,000 hours. Using the full dataset, which corresponds to another 12.5× increase in size results in only a further 1 point drop in WER. This mirrors the diminishing returns observed with model size scaling for English speech recognition and could similarly be explained by saturation effects when approaching human-level performance.”
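
The arithmetic behind that quote can be sketched as follows, using only the rounded hour counts quoted above (680,000 hours stands in for the full dataset). The snippet reports how many times larger each stage is and how many doublings of data that represents; it deliberately leaves out WER values.

```python
import math

# (start_hours, end_hours) for the three stages described in the quote.
stages = [(3_000, 13_000), (13_000, 54_000), (54_000, 680_000)]


def growth(start_hours, end_hours):
    """Return the size ratio of a stage and the number of doublings it spans."""
    ratio = end_hours / start_hours
    return ratio, math.log2(ratio)


report = [growth(start, end) for start, end in stages]
for (start, end), (ratio, doublings) in zip(stages, report):
    print(f"{start:>7,} -> {end:>7,} hours: x{ratio:.1f}, {doublings:.1f} doublings")
```

The final stage spans more doublings of data than either of the first two, yet according to the quote it buys only a further 1 point drop in WER.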

It's therefore clear that supervised learning is powerful, scales well and remains necessary, but it can plateau even on internet-scale datasets when the number of edge cases is large and out-of-distribution generalisation is tough. As is the case here, you can be left with a far-from-perfect model.

Here at Speechmatics we are investing in a longer game, which we believe is both the faster and the more economically feasible route to perfect-level ASR. In particular, self-supervised learning allows us to reach a similar word error rate with 100x less labeled data. This kind of data-efficient learning not only catalyses ASR training but also gives us a route to combat the saturation and exponential data requirements of the classic supervised regime. On this view, labeled data is necessary but not sufficient for training next-generation ASR systems.

Most strikingly, young children learn to understand speech after only several years of play and with minimal supervision, whereas the supervised ASR systems of today require a lifetime of constant second-by-second supervision. As such we believe a key component for both AGI and the perfect-level ASR systems of the future will involve data-efficient representation learning. Moreover, we’ll know we are doing well when we find we need less and less labeled data because we’ll have learnt all the important things about speech from pretraining. Crucially, this approach aligns with the core of what we believe it means for a system to be increasingly intelligent: we observe larger amounts of generalization from smaller amounts of supervision.

Key takeaways:

Supervised learning scales but is prone to brittleness in out-of-distribution scenarios and can plateau before your problem is solved
Data-efficient pre-training such as self-supervised learning offers an exciting alternative trajectory to combat saturation effects seen in models trained on internet-scale ASR datasets
The Whisper results indicate that internet-scale supervised learning has plateaued for English ASR

Summary

This blog was written to provide some insight into OpenAI Whisper’s approach to speech-to-text, and its implications for speech research.

However, deploying ASR systems in production is hard. Good WERs are not enough – for a system to be useful, there is an array of extra factors to be considered. In our next post, we'll look at some of these factors and dig into how our latest systems are scaling and performing.

Will Williams, VP Machine Learning and Lawrence Atkins, Machine Learning Engineer

[1]: Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).

[2]: Kaplan, Jared, et al. "Scaling laws for neural language models." arXiv preprint arXiv:2001.08361 (2020).

[3]: Xiong, Wayne, et al. "Achieving human parity in conversational speech recognition." arXiv preprint arXiv:1610.05256 (2016).

[4]: Radford, Alec, et al. "Robust speech recognition via large-scale weak supervision." Technical report, OpenAI (2022). https://cdn.openai.com/papers/whisper.pdf

Oct 12, 2022 | Read time 9 min

