Figure 1: Left: a simplified diagram of a traditional ASR system mapping input speech features directly to output labels. Right: Intermediate layer representations from the SSL model are fed into the acoustic model as input. The final projection layer of the SSL model is ignored.
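Below is a minimal sketch of the right-hand configuration in Figure 1, not the exact setup used here: it assumes a wav2vec 2.0-style SSL backbone loaded via HuggingFace's Wav2Vec2Model, an arbitrary intermediate layer index (8), and a toy LSTM acoustic model with 32 output labels, all chosen purely for illustration.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model  # assumed SSL backbone, for illustration only

# Pre-trained SSL model; its pre-training projection head is not used here,
# matching the caption's note that the final projection layer is ignored.
ssl_model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
ssl_model.eval()

class CTCAcousticModel(nn.Module):
    """Toy acoustic model operating on SSL features (illustrative only)."""
    def __init__(self, feat_dim: int, num_labels: int):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, 256, batch_first=True, bidirectional=True)
        self.head = nn.Linear(512, num_labels)  # per-frame logits, e.g. for a CTC loss

    def forward(self, feats):
        hidden, _ = self.encoder(feats)
        return self.head(hidden)

acoustic_model = CTCAcousticModel(feat_dim=ssl_model.config.hidden_size, num_labels=32)

waveform = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
with torch.no_grad():
    outputs = ssl_model(waveform, output_hidden_states=True)

# hidden_states[0] is the feature projection; later entries are transformer layers.
# Take an intermediate layer's representations rather than the final output.
intermediate_feats = outputs.hidden_states[8]

logits = acoustic_model(intermediate_feats)  # (batch, frames, num_labels)
print(logits.shape)
```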
Figure 2: The word error rate (WER) of two SSL models with different parameter sizes, plotted against the amount of labeled training data. For the larger model, the rate of improvement is slower, showing the diminishing returns of additional labeled data when a more powerful SSL model is used. The absolute difference shows that the larger model generally performs better, even as the amount of labeled data decreases drastically.
Authors: Bethan Thomas

Acknowledgements: Benedetta Cevoli, John Hughes, Will Williams