With our latest Ursa release we talked about the exciting accuracy gains we obtained by scaling our self-supervised learning models, which ultimately boosted our sample efficiency. This blog aims to explain what we mean by this and demonstrate the ability of scaled self-supervised learning to perform in low-resource settings where training with high sample efficiency is key.
Most deep learning algorithms rely on labeled data; in the case of automatic speech recognition (ASR), this means pairs of audio and text. The model learns to map input feature representations to output labels. Self-supervised learning (SSL) is instead the task of learning patterns from unlabeled data. It takes input speech and maps it to rich speech representations.
In the case of SSL, the final output of the model is not so important; instead, it is the internal outputs of layers within the model that we utilize. These models are generally trained via some kind of proxy task, for example predicting masked portions of the input, as in BERT. This task allows the model to learn rich representations of the input features at intermediate layers within the model. These representations can be used as input to a supervised deep learning task, instead of more traditional speech features such as MFCCs or Fbanks.
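To make the proxy task concrete, here is a minimal sketch of masking spans of input frames so that a model could be asked to predict them from context. The function name, span length, masking probability and the zero vector used as the mask value are all illustrative assumptions, not details of how the Ursa SSL models were trained.

```python
import random

def mask_spans(frames, span_len=3, mask_prob=0.2, seed=0):
    """Return (masked_frames, masked_indices) for a list of feature frames.

    Each frame index is picked as a span start with probability mask_prob,
    and up to span_len frames from that start are replaced by a zero vector.
    All hyperparameters are illustrative assumptions.
    """
    rng = random.Random(seed)
    dim = len(frames[0])
    masked = [list(f) for f in frames]  # copy so the input is left untouched
    masked_idx = set()
    for start in range(len(frames)):
        if rng.random() < mask_prob:
            for i in range(start, min(start + span_len, len(frames))):
                masked[i] = [0.0] * dim
                masked_idx.add(i)
    return masked, sorted(masked_idx)

# Toy input: 10 frames of 4-dimensional features (made-up values).
frames = [[float(t + d + 1) for d in range(4)] for t in range(10)]
masked, idx = mask_spans(frames, mask_prob=0.5)

# The proxy task: predict the original content at the masked positions
# using only the surrounding unmasked context.
targets = [frames[i] for i in idx]
```

No labels are involved at any point: the training targets come from the input itself, which is why this kind of task can use unlabeled audio.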
Through the proxy task, the model learns to encode the input signal in a meaningful way, with information that is useful to a downstream task such as ASR. Deep learning is very good at spotting patterns. The SSL task finds the patterns in the input, and similar portions of the input will be encoded in a similar way. For example, two segments of speech with the same speaker may appear close to each other in the embedding space. At a more granular level, two segments where the same sound is spoken will also share similar representations. This is the property we rely on for good sample efficiency in ASR.
Figure 1: Left: a simplified diagram of a traditional ASR system mapping input speech features directly to output labels. Right: Intermediate layer representations from the SSL model are fed into the acoustic model as input. The final projection layer of the SSL model is ignored.
Good sample efficiency is the ability of a model to reach a certain level of performance with fewer samples of labeled data; it is more efficient at using the labels it is given. This also means that with more labeled data, the model should achieve even better performance. Traditional speech features have been shown to reach great performance in ASR; however, it usually takes a very large amount of labeled data to reach this performance because the input feature representations are low dimensional and limited in terms of the information they contain.
Labeled data is often scarce for ASR; it is expensive and difficult to label speech, and the labeled data that exists generally comes from a few domains and doesn’t capture the real variety of spoken language. This further limits the ability of supervised ASR methods to learn efficiently as they are constrained in terms of the variety of speech they are exposed to.
On the other hand, a model trained with SSL features is much more efficient at learning the task of ASR. Take this example: the SSL model has seen many examples of the sound /b/, which share certain acoustic properties. It learns to represent these segments of audio in a similar way, despite having no understanding that these portions correspond to the sound /b/. Then some labeled data comes along. The supervised model sees that one of the representations in that cluster maps to a /b/ in the output space. Therefore, it is likely that the other features in the cluster are also a /b/. Thus, the model starts to learn a mapping, not just of one feature to /b/, but of a whole cluster of varied features.
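The cluster intuition can be sketched with a deliberately simplified stand-in: give each unlabeled representation the label of its most similar labeled representation. A real supervised acoustic model learns this mapping through training rather than nearest-neighbour lookup, and the 2-D vectors, labels and function names below are made up for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def propagate_labels(labeled, unlabeled):
    """Give each unlabeled vector the label of its most similar labeled vector."""
    result = []
    for vec in unlabeled:
        best_label, _ = max(
            ((label, cosine(vec, ref)) for ref, label in labeled),
            key=lambda pair: pair[1],
        )
        result.append(best_label)
    return result

# Made-up 2-D "SSL representations" forming two tight clusters.
labeled = [([1.0, 0.1], "b"), ([0.1, 1.0], "s")]  # a single label per cluster
unlabeled = [[0.9, 0.2], [0.95, 0.05], [0.2, 0.8], [0.05, 1.1]]

predicted = propagate_labels(labeled, unlabeled)
# predicted == ["b", "b", "s", "s"]
```

One label per cluster is enough here only because the representations already group similar sounds together, which is the property the SSL pre-training is meant to provide.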
By training on unlabeled data, we are also able to capture many more domains in the training data, meaning that the model will learn features of more diverse speech. Through pre-training, we prime the model with a huge amount of speech information so that the labeled data has a much greater impact. All of this contributes to an ASR system that is more data efficient and better at understanding every voice.
With Ursa we scaled our self-supervised learning both in terms of data and model size. This has led to richer representations and greater sample efficiency, as the representations reflect more diversity within the input data.
Generally, deep learning benefits from more data and more training. In production, you will normally train on as much data as is available for as long as possible. However, not all languages have multiple thousands of hours of training data, and the point of sample efficiency is that we shouldn’t need tens of thousands of hours to get excellent performance. Scaling SSL should lead to better sample efficiency; therefore, we decided to test how accuracy changes as we reduce the amount of English ASR training data.
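One way to build reduced training sets for this kind of experiment is to draw a random subset up to a fixed budget of labeled hours. The sketch below assumes each utterance is described by an ID and a duration in hours; the corpus, the function name and the stopping rule are hypothetical and are not taken from the actual experimental setup.

```python
import random

def random_subset(utterances, budget_hours, seed=0):
    """Shuffle (id, hours) pairs and keep them until the next one would exceed the budget."""
    rng = random.Random(seed)
    shuffled = list(utterances)  # copy so the caller's list is not reordered
    rng.shuffle(shuffled)
    subset, total = [], 0.0
    for utt_id, hours in shuffled:
        if total + hours > budget_hours:
            break
        subset.append(utt_id)
        total += hours
    return subset, total

# Hypothetical corpus: 1,000 chunks of labeled audio, one hour each.
corpus = [(f"chunk{i:04d}", 1.0) for i in range(1000)]

subset_500, hours_500 = random_subset(corpus, budget_hours=500)
subset_50, hours_50 = random_subset(corpus, budget_hours=50)
```

Fixing the seed means the smaller subset is a prefix of the larger one, so each reduction removes data rather than swapping in different data.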
In these experiments, we used two different SSL models to generate the input features for our downstream supervised ASR training. One model uses 2B parameters, and the second model has 500M parameters and was pre-trained on less data. We hypothesized that we should see greater sample efficiency from the larger model. We used no data selection methods for ASR training, so the ordering of the labeled data was completely random. The models were tested on a variety of publicly available test sets (explained in the Ursa blog) and we report weighted word error rates in Figure 2.
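For reference, a weighted word error rate over several test sets can be sketched as follows. The weighting scheme is not specified here, so this example assumes each test set is weighted by its number of reference words, which is equivalent to pooling all errors and all reference words. The test sets and transcripts are made up.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def count_errors(pairs):
    """Return (total_errors, total_reference_words) for (reference, hypothesis) pairs."""
    errors = words = 0
    for ref, hyp in pairs:
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        words += len(ref_words)
    return errors, words

def weighted_wer(test_sets):
    """Pool errors and reference words across test sets (assumed weighting)."""
    totals = [count_errors(pairs) for pairs in test_sets.values()]
    return sum(e for e, _ in totals) / sum(w for _, w in totals)

# Made-up transcripts: (reference, hypothesis).
test_sets = {
    "set_a": [("the cat sat on the mat", "the cat sat on a mat")],
    "set_b": [("hello world", "hello word"), ("good morning", "good morning")],
}

wer = weighted_wer(test_sets)  # (1 + 1) errors / (6 + 4) words = 0.2
```

Pooling in this way stops a small test set from dominating the average, which an unweighted mean of per-set error rates would allow.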
We have 3 main findings:
Progress saturates quickly above 10,000 hours with our more powerful SSL model. We can see from Figure 2 that for the larger SSL model, there is only limited improvement as we progress from 10,000 hours to 100,000+ hours of labeled data.
In the low-resource regime, performance is still very strong. With 5,000 hours of labeled data, we see just a 2.5% relative degradation. With 500 hours of labeled data, we see around 10% relative degradation. However, given that we are 22% ahead of the nearest large cloud provider with our fully trained system, this means that we are still roughly 12% ahead with just 500 hours of labeled training data.
Scaling SSL leads to greater sample efficiency and generally better performance. At a basic level, the absolute word error rates for the smaller SSL model are well above those of the scaled SSL model, even when we have reduced the labeled data by 300x. We also see that performance drops off for the smaller model as we reduce the labeled data to 10,000 hours, showing that our larger SSL model is more robust to data reduction.
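The relative figures in the second finding can be worked through explicitly. The sketch below assumes that "22% ahead" means a word error rate 22% lower than the competitor's and that relative degradation compounds multiplicatively; neither definition is stated above, and the competitor's error rate is normalized to 1.0 because the absolute value is not given.

```python
def degraded_wer(baseline_wer, relative_degradation):
    """WER after a relative degradation, e.g. 0.10 for 10% worse."""
    return baseline_wer * (1.0 + relative_degradation)

def relative_lead(our_wer, competitor_wer):
    """Fraction by which our WER is lower than the competitor's."""
    return (competitor_wer - our_wer) / competitor_wer

competitor = 1.0                        # normalized; the actual WER is not stated
full_system = competitor * (1.0 - 0.22)  # assumed reading of "22% ahead"

wer_5000h = degraded_wer(full_system, 0.025)
wer_500h = degraded_wer(full_system, 0.10)

lead_5000h = relative_lead(wer_5000h, competitor)  # about 0.20
lead_500h = relative_lead(wer_500h, competitor)    # about 0.14
```

Under these assumptions the lead with 500 hours comes out at around 14%, while directly subtracting the two percentages gives the 12% quoted above; the exact figure depends on how the relative lead is defined.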
We have demonstrated that scaling SSL is effective at increasing sample efficiency and boosting ASR performance. Our scaled SSL model is robust to a 10x reduction in labeled training data, with little discernible degradation in performance. In a lower-resource setting, with a few thousand, or even a few hundred, hours of ASR training data, we still see excellent performance.
We have shown that the key to excellence is scaling self-supervised learning rather than scaling our labeled training data. With our powerful 2B parameter SSL model, we are able to beat every other speech-to-text vendor with a fraction of the hours of labeled data generally used for ASR. This is a very promising result, both for scaling SSL further and for training models for low-resource languages as we continue our mission to understand every voice.