Apr 12, 2023 | Read time 7 min

Boosting sample efficiency through Self-Supervised Learning

Our latest Ursa release achieved incredible accuracy partly by scaling self-supervised learning. In this blog we demonstrate the power of self-supervised learning and challenge the assumption that scaling labeled data is the key to greater accuracy. We show that with 300x less labeled data we still beat the nearest vendor by 12% relative.
Self-Supervised Learning
Bethan Thomas, Senior Machine Learning Engineer
Sample Efficiency

Figure 1: Left: a simplified diagram of a traditional ASR system mapping input speech features directly to output labels. Right: Intermediate layer representations from the SSL model are fed into the acoustic model as input. The final projection layer of the SSL model is ignored.
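The right-hand side of Figure 1 can be sketched in a few lines. This is only a toy illustration under stated assumptions, not the Ursa implementation: `ToySSLModel`, `make_layer`, `extract_features`, the layer sizes and the choice of `layer_index=1` are all hypothetical, and the random weights stand in for a pretrained encoder. The point it shows is that the acoustic model consumes the output of an intermediate hidden layer, while the final projection layer is never applied.

```python
import math
import random


def make_layer(dim_in, dim_out, rng):
    """Return a toy dense layer (random weights + tanh) acting on one frame."""
    weights = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_out)]

    def layer(frame):
        return [math.tanh(sum(w * x for w, x in zip(row, frame))) for row in weights]

    return layer


class ToySSLModel:
    """Hypothetical stand-in for a pretrained SSL encoder."""

    def __init__(self, feat_dim=4, hidden_dim=8, num_layers=3, proj_dim=2, seed=0):
        rng = random.Random(seed)
        dims = [feat_dim] + [hidden_dim] * num_layers
        self.hidden_layers = [
            make_layer(dims[i], dims[i + 1], rng) for i in range(num_layers)
        ]
        # The projection exists on the model but is not used for feature extraction.
        self.projection = make_layer(hidden_dim, proj_dim, rng)

    def hidden_states(self, frames):
        """Return per-frame outputs of every hidden layer; the projection is skipped."""
        states = []
        current = frames
        for layer in self.hidden_layers:
            current = [layer(frame) for frame in current]
            states.append(current)
        return states


def extract_features(model, frames, layer_index):
    """Select one intermediate layer's representations as acoustic-model input."""
    return model.hidden_states(frames)[layer_index]


ssl_model = ToySSLModel()
speech_frames = [[0.1, -0.2, 0.3, 0.0], [0.5, 0.4, -0.1, 0.2]]  # 2 frames, 4 features
acoustic_input = extract_features(ssl_model, speech_frames, layer_index=1)
print(len(acoustic_input), len(acoustic_input[0]))  # 2 frames, 8 features per frame
```

Which layer to read from is a design choice; the sketch picks one arbitrarily, and in practice it would be tuned for the downstream task.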

Figure 2: The word error rate (WER) of two SSL models with different parameter counts, plotted against the amount of labeled training data. For the larger model, WER improves more slowly as labeled data is added, showing the diminishing returns of training on more data with a more powerful SSL model. The absolute difference shows that the larger model generally performs better, even as labeled data drastically decreases.
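For readers unfamiliar with the metric in Figure 2, WER is conventionally the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words. The sketch below is a minimal version of that standard definition, together with a relative-reduction helper of the kind behind figures such as "12% relative". The function names and the example numbers are illustrative assumptions, not values or code from the post.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# One deleted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(round(wer, 4))

# Made-up numbers for illustration: going from 10.0% to 8.8% WER.
print(round(relative_reduction(0.100, 0.088), 2))
```

A relative figure depends on the baseline: the same absolute gap is a larger relative improvement when the baseline WER is already low.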


Authors: Bethan Thomas
Acknowledgements: Benedetta Cevoli, John Hughes, Will Williams