This article is a continuation of ‘Machine learning is getting BIG (Part I)’ previously published on this blog.
Beyond the dollar cost, people have started calculating the cost of training these huge models in terms of greenhouse gas emissions. When neural architecture search was included, training a single ‘transformer’ model was found to generate more carbon dioxide than four cars emit over their entire lifetimes. It is true that modern hardware is much more efficient than older hardware, but the improvement in efficiency has not kept pace with the increase in energy used on these projects.
Those are the running costs. There are also the costs of manufacturing. A GPU or TPU is a very precise and complicated piece of technology, which carries significant energy and material costs of its own. The array of materials required to make one is vast, and includes elements, such as tantalum, whose complicated supply chains are surrounded by ethical concerns. Disposal of used devices also contributes to our growing problem of electronic waste.
As machine learning gets bigger, these costs are only going to rise. It is by no means a problem limited to ML – any technological field has to consider its environmental footprint. But this recent obsession with scale means that ML is suddenly starting to be large enough to matter, and not in a good way. Of course, we may deem this acceptable – the advances being made in ML may improve our lives in ways that outweigh the environmental damage – but we need to start having that discussion more openly.
Ultimately, an end-user doesn’t care much about what is going on under the hood of their ML-powered products. As long as it does the right thing, most consumers will be happy.
However, the sheer size of these large ML models restricts how they can be deployed. A huge model pretty much mandates a server-based deployment – so any users of these models must have an internet connection, and must rely on the provider keeping its servers up whenever they want to use the product.
And that also leaves users beholden to the continued existence of those servers. As owners of Revolv’s smart home hub will tell you, that is not guaranteed (after acquisition by Nest, the Revolv servers were shut down, leaving the physical hub as nought but an expensive reminder of early-adoption hubris).
Ultimately, if a provider could host their ML solution directly on a device (mobile phones, cars, Internet of Things devices etc.) they would gain a huge competitive advantage. Consumers are becoming increasingly wary of allowing their devices to share their information, while also expecting ever faster and more convenient interfaces with their technology. Both of those are best achieved with local ‘on-device’ processing. And that is much more straightforward if your models are not monsters that cannot fit on, for example, a current mobile phone with at most 12GB of RAM.
This may be a problem that solves itself as companies realize that they need smaller models for commercial success. However, we also need to change how we measure success even in academic discussions of ML: rather than raw quality, we need to be looking at metrics that take efficiency into account.
So, we’ve covered that ML is getting large in its model size, data usage and hardware demands – and that this is concerning because it concentrates power in the hands of a few large industry players, pollutes the planet, and prevents ML from making its way into the places it is needed most.
The sky is not completely falling in, however. There are ways that the field is moving to adapt and correct some of these excesses. And if you’re working in ML but not for one of the very large industry players, looking in some of these directions might be fruitful for you too.
Dealing with the last problem, usability: there are various ways to shrink your models after training, and increasing interest is being shown in these techniques. A common approach gaining traction at present is known as ‘distillation’. This involves taking your large model (or even an ensemble of models) and using it as a ‘teacher’ to ‘teach’ a much smaller ‘student’ model. The resulting student can reach accuracies close to the original larger model, but with a much smaller footprint and overhead.
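To make the idea concrete, here is a minimal numpy sketch of the core of distillation: the teacher’s logits are softened with a temperature, and the student is trained to match those soft distributions rather than hard labels. The specific models, logits and temperature below are illustrative assumptions, not from any particular paper.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax with a temperature: higher T gives softer probabilities."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions.

    The student learns from the teacher's 'soft' outputs, which carry
    more information than hard labels (e.g. which wrong classes the
    teacher also considered plausible).
    """
    soft_targets = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature))
    return -(soft_targets * student_log_probs).sum(axis=-1).mean()

teacher = np.array([[4.0, 1.0, 0.2]])
good_student = np.array([[3.9, 1.1, 0.1]])  # closely matches the teacher
bad_student = np.array([[0.1, 3.9, 1.0]])   # disagrees with the teacher
```

In a real training loop this loss (often mixed with an ordinary hard-label loss) is minimized by gradient descent over the student’s weights.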
Some approaches are far more aggressive. Researchers have shown that completely removing 38 of the 48 attention ‘heads’ in a ‘Transformer’ trained for machine translation barely affected performance, despite those heads representing a large proportion of the model’s parameters.
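A toy sketch of what ‘removing a head’ means: in multi-head attention, each head is an independent attention computation whose outputs are concatenated, so a pruned head’s contribution (and all its parameters) can simply be dropped. The dimensions and random weights below are hypothetical, chosen only to make a runnable example.

```python
import numpy as np

def multi_head_attention(x, heads, head_mask):
    """Toy multi-head self-attention where pruned heads are zeroed out.

    x: (seq_len, d_model); heads: one (Wq, Wk, Wv) weight triple per head;
    head_mask: 1.0 keeps a head, 0.0 prunes it entirely.
    """
    outputs = []
    for (Wq, Wk, Wv), keep in zip(heads, head_mask):
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(q.shape[-1])          # scaled dot-product
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)
        outputs.append(keep * (weights @ v))             # pruned head contributes nothing
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))  # 5 tokens, model width 8
heads = [tuple(rng.normal(size=(8, 4)) for _ in range(3)) for _ in range(4)]
full = multi_head_attention(x, heads, head_mask=[1, 1, 1, 1])
pruned = multi_head_attention(x, heads, head_mask=[1, 0, 0, 1])  # drop 2 of 4 heads
```

The output shape is unchanged, which is what makes this kind of pruning a drop-in size reduction: the rest of the network never notices the missing heads.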
The quickest and easiest way to make a model smaller is almost certainly quantization. This is now a standardized operation in the major ML toolkits. This involves taking your model and storing its weights in a more compact form.
Typically, this means mapping 32-bit floats to 16-bit floats, or even all the way down to integers. It can take some tuning to find the right trade-off between size and accuracy, and you need to be aware that not all platforms support quantized calculations by default (so your model may be smaller on disk, but inflate back up again at runtime), but it offers by far the simplest way to shrink your models.
These approaches still require you to train a very large model in the first place. Other researchers are going in the opposite direction: starting with a very small model and letting it grow during training, so that it determines its own architecture within constrained resources.
The problem of data availability can be tackled by looking at different data sources. The large industry players are doing well using either very large and expensive proprietary labeled data sources, or very hardware intensive reinforcement learning setups that take more compute than most can afford.
However, there is another way – and that is to look at the whole cake, and not just the icing, or the cherry on top. I’m referring to the famous ‘cake’ analogy of Yann LeCun (one of the ‘Godfathers’ of ML), that says that self-supervised learning is where we should be looking to get the bulk of our data and training signal from.
The techniques are different from reinforcement or supervised learning, but if you can get it right you should find data acquisition a much simpler task. With the modern internet, there is no shortage of raw data in almost any domain – whether it be audio, image, video or text, there is plenty of data there for you. The difficult part is in labeling it, so develop techniques that only need a small amount of labeling, and that can cope with large swathes of unlabeled data.
One of the key methods of using this data is prediction – mask out part of your data and try to predict it from the rest. This is done, for example, in parts of training systems for speech recognition (predict the next word in a stream of words) and image segmentation (hide part of the image and train the model to predict what is in the gap). Prediction provides an endless supply of labels to train your system on, without expensive manual annotation.
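For the text case, generating these ‘free’ training pairs is almost trivial. Here is a minimal sketch (the mask rate, mask token string and helper name are illustrative choices, loosely in the style of masked language modelling):

```python
import random

def make_masked_examples(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Turn raw text into self-supervised training pairs for free.

    Randomly hide tokens; the model's job is to predict each hidden
    token from its context. The original token is the label, so no
    manual annotation is needed.
    """
    rng = random.Random(seed)
    masked, targets = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets.append((i, tok))  # (position, true token) is the label
        else:
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = make_masked_examples(tokens, mask_rate=0.3)
```

Run over the vast amount of raw text on the internet, this turns unlabeled data into an effectively unlimited supervised training set.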
This approach still uses a lot of compute, however – self-supervised learning is typically much less ‘sample efficient’ than supervised learning, often requiring an order of magnitude more data to make the same impact on model accuracy. So, although it might help solve the data problem, it will exacerbate the compute problem.
An emerging way of doing things leans into the inequality between the large industry players (Google, Amazon, Facebook, etc.) and the rest. Here, the large industry players make their big models, get their deserved plaudits for them, then publish them for the rest of the world to use.
Typically, this means fine-tuning the large model for a specific use case, which requires far less hardware and data than creating an equivalent model from scratch. In the field of NLP this has become so prevalent that there are easy-to-follow tutorials on how to fine-tune ‘BERT’ – one of the more famous large models released by Google. Others have followed suit – recently, for example, OpenAI released their full GPT-2 language models to great acclaim.
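The cheapest form of fine-tuning can be sketched with toy numbers: treat the pre-trained model as a frozen feature extractor and train only a small task-specific ‘head’ on top. The random ‘encoder’, dataset and learning rate below are stand-in assumptions – a real setup would use an actual pre-trained network such as BERT – but the division of labor is the same.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_encoder(x, W):
    """Stand-in for a large pre-trained model (BERT, GPT-2, ...):
    a fixed feature extractor whose weights are never updated."""
    return np.tanh(x @ W)

def logistic_loss(p, y):
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

# A tiny labeled dataset for the downstream task -- far less data than
# would be needed to train the encoder itself.
x = rng.normal(size=(200, 8))
y = (x[:, 0] + x[:, 1] > 0).astype(float)
W = rng.normal(size=(8, 16))       # pretend these weights were pre-trained
feats = frozen_encoder(x, W)       # computed once; the encoder stays frozen

# 'Fine-tuning' here trains only a small logistic-regression head
# (17 parameters) by gradient descent on the frozen features.
w, b = np.zeros(feats.shape[1]), 0.0
losses = []
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    losses.append(logistic_loss(p, y))
    grad = p - y
    w -= 0.1 * feats.T @ grad / len(y)
    b -= 0.1 * grad.mean()
```

Only the head’s handful of parameters are ever updated, which is why this style of adaptation fits on modest hardware while the original pre-training did not.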
This way of doing things seems to play into the strengths of the world of Big ML – by allowing the large industry players to do their big stuff, but then for the rest of the field to benefit by adapting the output to their specific niches, which the big industry players wouldn’t want to get involved in. However, it is reliant upon the continued seeming altruism of those big players.
That is not guaranteed. The release of GPT-2 by OpenAI came only after they initially warned that they would not release it, due to ‘concerns about malicious applications of the technology’. There is every chance that as ML matures, ethical concerns about the wider release of technology will weigh more heavily, and, combined with business reasons not to release models, may limit the scope of this approach.
If you’re not working for a large industry player, and you want to be involved in ML, your final option is to be smart. Don’t try to compete with the big players in the big model game. Look at other areas where there is opportunity. With large models comes an even larger problem: the lack of explainability of deep learning. If you can crack that (why a model makes a specific decision) you will help the field immeasurably. If you can design new ways of doing things that are less data, hardware or parameter hungry, you will have no shortage of places to apply them.
Likewise, if you focus on efficiency and can improve it by even a small percentage, you can make a large difference at the scale of current ML. If you can shrink a stunning new huge model to a size more suited to a smaller research lab, you will have the gratitude of the field.
We have talked today about how BIG machine learning is getting, and there is no doubt that this emphasis on size is helping to push the boundaries of what is possible in the field.
However, the field of machine learning has a duty to itself to think about the implications of this rise in size, and what can be done to make sure machine learning helps keep pushing humanity in the right directions, and not accelerating its plunge in the wrong ones.
Tom Ash, Machine Learning Engineer, Speechmatics