
Jan 23, 2020 | Read time 8 min

Machine learning is getting BIG (Part I)

Machine learning is getting big and the tech giants are heavily increasing their investment in it. What makes machine learning so attractive?

Machine learning (ML) is everywhere right now. A cursory look at Google Trends shows how interest has spiked in the past 5 years as it has been applied more widely, and more successfully, to a range of applications.

The field has been through a few lean times (sometimes referred to as the AI Winters) when its promise went unfulfilled, but in the past few years, it has grown really, really BIG.

And machine learning certainly has got big.

Billions of pounds and dollars a year are spent on developing ML, and notably all of the big household-name technology giants of the age – Amazon, Apple, Facebook and Google, for example (all top-level ‘Diamond sponsors’ of NeurIPS, 2019’s most prestigious ML conference) – are going deep into ML to help develop their current products and innovate towards new ones. These big companies are leading the way in their own very big style – which will become important later.

But the type of big I’m going to talk about here isn’t how far ML’s influence is spreading or how much money it’s bringing in. I’m going to be talking about just how big its core internals are – the size of modern ML models, the datasets that train them and the hardware required to support all that.

Just how big are we talking?

Let’s start with a brief illustration of how size has changed.

In 1991, Speechmatics' founder, Dr Tony Robinson, published one of the first attempts to successfully use recurrent neural networks in the context of speech recognition. In this paper, he used a SparcStation 2, which, as far as records show, had 4.2 MFLOPS of processing power (FLOPS are floating-point operations per second, so 1 MFLOPS means 1 million calculations per second). A quick back-of-the-envelope calculation shows that his RNN had around 38 thousand parameters. The only dataset explicitly mentioned is TIMIT – around 6 hours of audio data with transcripts totalling around 66 thousand words.
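To give a feel for how such a back-of-the-envelope count works, here is a minimal sketch for a single-layer vanilla RNN. The layer sizes are purely hypothetical – they are not the dimensions used in the 1991 paper – the point is only that networks of that era comfortably fitted into tens of thousands of parameters.

```python
# Back-of-the-envelope parameter count for a single-layer vanilla RNN.
# The layer sizes below are illustrative guesses, not the dimensions
# from the 1991 paper.

def rnn_parameter_count(n_input: int, n_hidden: int, n_output: int) -> int:
    """Count the weights and biases in a simple (Elman-style) RNN."""
    input_to_hidden = n_input * n_hidden      # W_xh
    hidden_to_hidden = n_hidden * n_hidden    # W_hh, the recurrent weights
    hidden_to_output = n_hidden * n_output    # W_hy
    biases = n_hidden + n_output
    return input_to_hidden + hidden_to_hidden + hidden_to_output + biases

# A few dozen acoustic features in, ~150 hidden units, ~60 phone classes out.
print(rnn_parameter_count(n_input=32, n_hidden=150, n_output=61))  # 36661
```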

In 2015, Speechmatics published a new paper on using recurrent neural networks for a different aspect of speech recognition. At this stage we used an Nvidia GeForce GTX Titan, capable of 4.7 TFLOPS (a trillion floating-point operations per second). The largest recurrent models we quote using had 541 million parameters. And we used data of up to 8 billion words.

That’s about a million times as much compute power, a model 14 thousand times larger, and 120 thousand times as many words of data. These numbers aren’t entirely comparable, as the papers are on different aspects of a speech recognition system (the first paper was on acoustic modelling, the second on language modelling), but they should at least give you some idea of how the ML world changed in the intervening 24 years.
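For the curious, those growth factors are just the ratios of the figures quoted from the two papers – a rough comparison, for the reasons given above:

```python
# Growth factors between the 1991 and 2015 setups, as simple ratios.
compute_1991, compute_2015 = 4.2e6, 4.7e12   # FLOPS: 4.2 MFLOPS vs 4.7 TFLOPS
params_1991, params_2015 = 38e3, 541e6       # model parameters
words_1991, words_2015 = 66e3, 8e9           # words of training data

print(f"compute:    {compute_2015 / compute_1991:>12,.0f}x")  # ~1,119,048x
print(f"model size: {params_2015 / params_1991:>12,.0f}x")    # ~14,237x
print(f"data:       {words_2015 / words_1991:>12,.0f}x")      # ~121,212x
```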

The importance of ML model size

The size of some modern machine learning models is staggering. At the recent NeurIPS conference, the GPipe team (a Google project) presented a platform on which they had been training a model of 90 BILLION parameters, for a total model size of 0.8TB. That’s a crazy big number. If you put all of that data on old-fashioned 3.5” floppy disks and laid them side by side, they would more than cover a marathon course.
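If you want to sanity-check the floppy-disk comparison, the arithmetic is straightforward; the disk width below is an assumed round figure.

```python
# Rough check of the floppy-disks-versus-marathon comparison.
model_bytes = 0.8e12        # 0.8 TB of model
floppy_bytes = 1.44e6       # a standard 1.44 MB 3.5" floppy disk
floppy_width_m = 0.09       # a 3.5" disk is roughly 9 cm across (assumed)
marathon_m = 42_195         # marathon distance in metres

disks = model_bytes / floppy_bytes          # ~556,000 disks
line_length_m = disks * floppy_width_m      # ~50,000 m, i.e. about 50 km
print(line_length_m > marathon_m)           # True
```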

The surprising thing is that this is actually a very good approach if you want success. Time and again, machine learning has shown that if you want to improve your model, the easiest way is to make it bigger (provided you have the hardware and data to support it – we’ll come onto that later). These gains come from the additional ‘capacity’ of the model to learn and make more complex connections. There is really no need to spend time and risk on novel approaches if you can instead just scale up what’s already out there and do just as well, if not better.

The GPipe example above is clearly extreme. But it points to a common trend – we are moving away from small models with handcrafted features and towards huge systems with very little pre-supplied knowledge, which we expect to learn it all from scratch instead. The ImageNet challenge illustrates this. Set up in 2010, the challenge is to train a system on a provided dataset so that it can classify images correctly. For the first couple of years, an error rate of around 25% was enough to win the competition, typically using a Support Vector Machine approach, which uses about the same number of parameters as training samples – just over a million in this case. Then in 2012, AlexNet won (and is often credited with launching the modern ML craze by outperforming the competition so admirably) with a neural network featuring 62 million parameters and an error rate of just 16%. By 2014, the error rates were around 7%, using architectures like VGG with 140 million or so parameters. More recently, people have been using AmoebaNets with hundreds of millions of parameters to get error rates under 5%.

The benefits are clear – you need to get as big a model as possible.

Bring in the data

Throughout the history of machine learning, there has been plenty of evidence that using more training data will lead to higher quality models. Famously, Peter Norvig (Google’s Director of Research at the time) once said: “We don’t have better algorithms than anyone else; we just have more data”.

This relationship between the amount of training data available to a machine learning model and its eventual quality is amplified in the modern world of deep neural networks, which contain billions of parameters to be trained. Smaller models don’t need as much training data before their performance reaches a plateau: they have learnt as much as they can, given their limited capacity. However, the super-large models we’re starting to see these days require lots of data before they even exhibit good performance, because their large number of parameters means they are very good at ‘overfitting’ to small amounts of data if you are not careful. As a rule of thumb, you might expect your accuracy to increase with the logarithm of the amount of data you have, provided you can scale your model size to learn it all. So you often need an extra order of magnitude more data to get each similarly-sized increment of performance improvement.
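Here is a toy illustration of that rule of thumb. The base accuracy and the gain per decade of data are made-up numbers; real curves are task- and model-dependent. The point is simply that each extra order of magnitude of data buys roughly the same increment.

```python
import math

def toy_accuracy(n_words: float, base: float = 60.0, gain_per_decade: float = 5.0) -> float:
    """Hypothetical accuracy curve: a fixed gain for every 10x more data."""
    return base + gain_per_decade * math.log10(n_words / 1e6)

for n in [1e6, 1e7, 1e8, 1e9]:
    print(f"{n:>13,.0f} words -> {toy_accuracy(n):.1f}%")
# 1,000,000 words  -> 60.0%
# 10,000,000 words -> 65.0%
# ... each 10x more data adds the same 5 points in this toy model.
```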

Interestingly, the size of the open-source and publicly available datasets used as standardized testbeds for machine learning problems hasn’t actually increased all that fast compared to how fast models and their demand for data have grown. There are some large datasets being publicly released and developed, though their size is still dwarfed by the datasets used by industry giants. These big players have whole teams and jobs dedicated to creating and curating their datasets, and hard information on how much actual data they have access to is difficult to come by. Given Peter Norvig’s quote above, however, we can assume it is at least an order of magnitude more than is publicly available.

Let’s compute for ML

The availability of compute has been pointed to as the big driver of our current machine learning renaissance. The humble GPU – once the darling of gamers and bitcoin miners – has now become an emblem of our current ML age. It’s easy to see why this has made such a difference. A single Titan X (a graphics card launched in 2015) is capable of several times more operations per second than the largest supercomputers of the twentieth century. That in itself should give us pause – a relatively small and inexpensive consumer device outperforming specialist equipment from 15 years before.

But of course, the world has not stopped there. Google, in particular, have developed ‘TPUs’ that are purpose-built for ML. These are offered in their cloud service, with pods of them for hire at over 100 petaflops. Other, smaller companies, like Graphcore with their IPUs or Cerebras with their WSE, are also entering the field with specialised offerings.

The biggest models that have been making the biggest splashes have been using this hardware extensively. BERT, the model that has led a Sesame-Street-themed revolution in NLP, was trained using 16 TPUs. The GPT-2 language model that was released in late 2019 to much publicity was, we are told, trained on 256 TPU v3 cores. The XLNet-Large model that beat BERT was trained on 512 TPU v3 chips. These numbers are many orders of magnitude higher than the largest compute available to any previous generation of ML researchers, which is why we are able to train such large models on so much data.

Why does this matter?

There is no inherent problem with pursuing size as the solution to making ML models perform better. It works, and if you have the resources required, go for it. Scaling up your setup is a very legitimate way to get better at your task. But there are larger concerns.

All-access and gatekeeping

You may have noticed that the ‘big’ models I’ve been talking about have come from a handful of research laboratories, primarily the likes of Google, OpenAI, Facebook and similar ‘tech giants’. Of the 1,428 papers accepted to NeurIPS in 2019, 170 were from Google’s labs and 76 from Microsoft, alongside large numbers from Facebook, IBM and Amazon. This trend for going big in ML has meant that it is hard for anyone but the biggest industry players to get the top, state-of-the-art results.

Although the hardware to do the necessary computation is theoretically available to everyone through platforms such as Google Cloud Compute, the costs effectively bar most researchers from entry to the ‘big club’. Various attempts have been made to estimate the cost of training these ultra-large models, and although the exact figures are disputed, it is clear that the biggest models cost at least thousands of dollars to train, if not orders of magnitude more than that. A typical PhD student (traditionally a demographic that would contribute heavily to a conference such as NeurIPS) simply does not have access to those sorts of funds. And in reality, the costs will be much higher than that once failed ideas, parameter sweeps and other model builds are taken into account – you rarely build your best model first time in ML.
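To see how the bill grows, here is a deliberately rough, hypothetical back-of-the-envelope estimate. None of these figures come from a real invoice or published cost breakdown – the chip count, run length and hourly price are placeholder assumptions, chosen only to show how quickly on-demand accelerator costs add up.

```python
# Hypothetical training cost estimate (every number here is an assumption).
chips = 512                  # a large accelerator allocation
hours = 2.5 * 24             # a multi-day training run
price_per_chip_hour = 2.00   # assumed on-demand price in $ per chip-hour

single_run = chips * hours * price_per_chip_hour
print(f"one training run: ~${single_run:,.0f}")   # ~$61,440

# Failed ideas and hyperparameter sweeps multiply the bill.
attempts = 10
print(f"with {attempts} attempts: ~${single_run * attempts:,.0f}")  # ~$614,400
```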

There is a similar story with regard to datasets. The large industry players have the largest datasets because they have spent millions of dollars creating them. And they are not particularly willing to share them, as they provide such a strong competitive advantage. So PhD students and non-industry players are left with the open-source datasets, which are smaller and less powerful.

Aside from reducing the number of clever minds able to produce meaningful innovation at the sharp end of ML, this also leaves us beholden to the big tech giants to lead the way in research. Perhaps that’s okay? I will leave that as an ethical question for the reader. Suffice it to say, there are plenty of people who do not trust these companies.

Tom Ash, Machine Learning Engineer, Speechmatics