LLMs really are large

TL;DR: the memory requirements really are huge, and the tech of GPUs is amazing

Memory is important

Mostly I had thought that training neural nets was expensive and running them was cheap.

After all, you are only talking about multiplying a few numbers together, maybe even a few thousand numbers... but training over millions of pixels of millions of images, I can see how that was expensive.

But Large Language Models are... Large. Look at the models that are out there: a small model is around 5 billion parameters, a large one is maybe 50 billion parameters.

That means each of those parameters is a floating-point number you need to store and multiply, so 50 billion parameters => 50 billion floating-point numbers. A 64-bit double is 8 bytes, so that works out to around 400 GB. Just to hold the model in memory it's hundreds of gig.
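As a quick sanity check on that arithmetic, here is a back-of-the-envelope sketch. The parameter counts are just the illustrative figures above, and it counts one stored number per parameter and nothing else (no optimiser state, no activations, no KV cache):

```python
# Rough model-memory arithmetic: one stored number per parameter, nothing else.
PARAM_COUNTS = {"small model": 5e9, "large model": 50e9}   # illustrative figures
BYTES_PER_NUMBER = {"float64": 8, "float32": 4, "float16": 2, "int8": 1}

for name, n_params in PARAM_COUNTS.items():
    for fmt, nbytes in BYTES_PER_NUMBER.items():
        print(f"{name}, {fmt}: {n_params * nbytes / 1e9:,.0f} GB")

# 50e9 parameters * 8 bytes (float64) = 400 GB, i.e. "hundreds of gig"
```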

And GPT-4 is probably even bigger; the details don't seem to be public, but idle speculation has put the number at a trillion parameters.

GPUs are amazing for LLMs

It’s a common thread in the past few years that Moore’s law on the increase in CPU speed is not holding up as well as it did, and yet things keep getting faster.

So GPUs to the rescue.

As this talk discusses, there have been many innovations in GPUs over the last 10 years, which together have contributed to a 1000x speed-up.

One of the interesting parts is the shift to optimise for lower-precision calculations, using 16-bit or 8-bit numbers, without much loss in accuracy. I’ve also seen this in other talks about LLMs: there are converters to move models from one number format to another.
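Here is a minimal sketch of the idea behind such a converter, assuming nothing more than NumPy and some made-up "weights": cast float32 down to float16, scale down to int8, and check how far the values drift. Real quantisation tools are far more careful than this; this is just to show why the memory savings come so cheaply.

```python
import numpy as np

# Fake float32 "weights", roughly the scale you see in trained models.
rng = np.random.default_rng(0)
weights_fp32 = rng.normal(0, 0.02, size=1_000_000).astype(np.float32)

# float16: a straight cast, half the memory.
weights_fp16 = weights_fp32.astype(np.float16)

# int8: scale so the largest weight maps to 127, then round.
scale = np.abs(weights_fp32).max() / 127.0
weights_int8 = np.round(weights_fp32 / scale).astype(np.int8)
dequantized = weights_int8.astype(np.float32) * scale

sizes = {"float32": weights_fp32.nbytes,
         "float16": weights_fp16.nbytes,
         "int8": weights_int8.nbytes}
errors = {"float32": 0.0,
          "float16": np.abs(weights_fp32 - weights_fp16.astype(np.float32)).max(),
          "int8": np.abs(weights_fp32 - dequantized).max()}

for fmt in sizes:
    print(f"{fmt}: {sizes[fmt] / 1e6:.1f} MB, max error vs float32: {errors[fmt]:.2e}")
```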

I enjoyed the part of the talk showing the cost of running a GPU cluster to train an LLM: around $10M for training, working out to roughly $3×10⁻⁴ per word of output.
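One way to read those two numbers together (my own back-of-the-envelope, not the talk's accounting, and it assumes the per-word figure is simply the training cost spread over output) is to ask how many words it takes to pay back the training run:

```python
# Back-of-the-envelope: how many output words amortise the training cost alone?
# Assumes the per-word figure is just training cost spread over output.
training_cost_usd = 10_000_000   # ~$10M to train
cost_per_word_usd = 3e-4         # ~$3x10^-4 per word of output

words_to_amortise = training_cost_usd / cost_per_word_usd
print(f"{words_to_amortise:.1e} words")   # ~3.3e+10, tens of billions of words
```

Tens of billions of words before the training run alone is paid off, which puts the scale of these deployments in perspective.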