Stanford CS224N NLP with Deep Learning | Spring 2022 | Guest Lecture: Scaling Language Models

0:00:00 - 0:00:09     Text: In theoretical physics, to get this kind of audience, you have to win the Nobel Prize

0:00:09 - 0:00:10     Text: or something.

0:00:10 - 0:00:15     Text: But, of course, I've been working on ML recently, and it's been much more exciting.

0:00:15 - 0:00:17     Text: There's a huge amount of interest.

0:00:17 - 0:00:22     Text: So to some extent, part of what I'll be talking about may be an implicit theme will be sort

0:00:22 - 0:00:29     Text: of why there's so much excitement and why you might expect that excitement to continue.

0:00:29 - 0:00:37     Text: So an outline of my talk is that I'll first start by discussing motivations for language

0:00:37 - 0:00:38     Text: modeling.

0:00:38 - 0:00:42     Text: I'm sure you're all very well motivated because this is an NLP class.

0:00:42 - 0:00:47     Text: And I'll also talk about sort of orders of magnitude of data and compute that go into

0:00:47 - 0:00:49     Text: contemporary language modeling.

0:00:49 - 0:00:57     Text: And that will kind of set the stage for talking about scaling laws for neural language modeling.

0:00:57 - 0:01:03     Text: And further realization that these scaling laws seem to be quite universal for generative

0:01:03 - 0:01:07     Text: models and maybe for sort of machine learning more generally.

0:01:07 - 0:01:12     Text: And then, finally, after discussing that, I'll talk about what happens when we actually

0:01:12 - 0:01:15     Text: do scale up language models.

0:01:15 - 0:01:17     Text: I'll talk about the GPT-3 model.

0:01:17 - 0:01:22     Text: And if there's time, I'll talk about some lessons from all of these ideas for research,

0:01:22 - 0:01:27     Text: which I imagine many of you are excited to be involved with soon.

0:01:27 - 0:01:32     Text: So I'll start by talking about why we do language modeling and Fermi estimates for language

0:01:32 - 0:01:33     Text: modeling.

0:01:33 - 0:01:41     Text: By Fermi estimates, I mean questions like estimating whether there are a million or a hundred thousand or

0:01:41 - 0:01:43     Text: ten thousand piano tuners in Chicago.

0:01:43 - 0:01:45     Text: Fermi famously asked this kind of question.

0:01:45 - 0:01:48     Text: And there are a lot of estimates like this that we can kind of do in a back of the envelope

0:01:48 - 0:01:54     Text: way to really get a sense for what's going on.

0:01:54 - 0:01:57     Text: But before going into that, why should you study

0:01:57 - 0:01:58     Text: language?

0:01:58 - 0:01:59     Text: This is sort of my motivation.

0:01:59 - 0:02:00     Text: You might have all sorts of other motivations.

0:02:00 - 0:02:04     Text: Language is obviously a very fascinating

0:02:04 - 0:02:08     Text: intellectual creation by our species.

0:02:08 - 0:02:12     Text: But I think another reason why it's particularly exciting for AI is that language is, in some

0:02:12 - 0:02:18     Text: sense, our species best attempt to encode everything about the world in as efficient and

0:02:18 - 0:02:20     Text: compressed way as possible.

0:02:20 - 0:02:27     Text: And that means that it's very valuable for an AI.

0:02:27 - 0:02:29     Text: There's a lot of noise, but

0:02:29 - 0:02:34     Text: there's a huge quantity of writing freely available on the internet.

0:02:34 - 0:02:39     Text: And there are also a huge number of books, for example, I think very roughly speaking

0:02:39 - 0:02:41     Text: in this sort of Fermi estimate level.

0:02:41 - 0:02:46     Text: There's something like ten million books in the Library of Congress.

0:02:46 - 0:02:50     Text: And very, very roughly that might mean there's something like a trillion words

0:02:50 - 0:02:51     Text: in those books.

0:02:51 - 0:02:56     Text: And then there's actually much more language information out on the internet.

0:02:56 - 0:03:01     Text: And so there's therefore a lot of data for AI models to learn from.

0:03:01 - 0:03:05     Text: And then a third reason, at least to some extent for me and maybe for many of you,

0:03:05 - 0:03:12     Text: is that if you're actually able to get an AI that kind of knows, quote, unquote, understands

0:03:12 - 0:03:16     Text: language, then you can communicate with it in a kind of natural way.

0:03:16 - 0:03:21     Text: You can ask it about anything, and you can get a lot of intuition from the responses

0:03:21 - 0:03:26     Text: and behaviors and all sorts of different kinds of evaluations you can perform on such

0:03:26 - 0:03:28     Text: a model.

0:03:28 - 0:03:34     Text: If you compare it to sort of ancient history of AI like excitement about classifying images

0:03:34 - 0:03:40     Text: from, I don't know, AlexNet, ten years ago in sort of the distant past.

0:03:40 - 0:03:44     Text: And from, say, AlphaGo, again in the distant past five years ago, there's a lot more

0:03:44 - 0:03:46     Text: intuition you can get.

0:03:46 - 0:03:51     Text: And then you can use that to sort of understand what these models know and don't know and

0:03:51 - 0:03:52     Text: can do.

0:03:52 - 0:03:59     Text: And you can also think about this in terms of how to make these models aligned with what

0:03:59 - 0:04:00     Text: humans prefer.

0:04:00 - 0:04:08     Text: There's a lot of work on trying to understand language model bias, racism, other such issues,

0:04:08 - 0:04:12     Text: and there's really a lot that you can kind of explore and dig into.

0:04:12 - 0:04:18     Text: So I imagine this is all very basic for everyone here, but just so we're on the same page.

0:04:18 - 0:04:22     Text: If you're doing kind of contemporary neural network based machine learning, the ingredients

0:04:22 - 0:04:26     Text: that you need to get started are really surprisingly simple.

0:04:26 - 0:04:30     Text: You need some kind of model to parameterize a function.

0:04:30 - 0:04:32     Text: You need a data set.

0:04:32 - 0:04:36     Text: You need some computers with plenty of computation.

0:04:36 - 0:04:41     Text: You need a loss function and you need some choice of optimizer.

0:04:41 - 0:04:47     Text: And basically for pretty much everything in this talk, I'll be thinking about language

0:04:47 - 0:04:54     Text: modeling as a task where the loss function is simply to predict the next word in some

0:04:54 - 0:04:56     Text: sentence or paragraph or book.

0:04:56 - 0:05:00     Text: And so that's how basically all of the models that I'll be talking about are trained.

0:05:00 - 0:05:05     Text: They have a loss function which incentivizes them to predict the correct probability

0:05:05 - 0:05:08     Text: distribution for the next word.

0:05:08 - 0:05:13     Text: So what about these other ingredients like the models that we use, the data sets that

0:05:13 - 0:05:17     Text: we use, and how much computation do we use?

0:05:17 - 0:05:22     Text: What are those sort of order of magnitude figures?

0:05:22 - 0:05:28     Text: So one way to think about this is sort of how much language do we consume as a person

0:05:28 - 0:05:29     Text: for comparison.

0:05:29 - 0:05:35     Text: So you can imagine that if you were a very voracious reader, maybe you'd read a long

0:05:35 - 0:05:41     Text: book every day and you'd spend your life doing that, maybe you'd live for 70 years, if

0:05:41 - 0:05:48     Text: you did that, you'd end up reading something like two billion words over your lifetime.

0:05:48 - 0:05:56     Text: For comparison, a canonical large language model, GPT-3, was trained for on the order of

0:05:56 - 0:05:58     Text: 200 billion words.

0:05:58 - 0:06:04     Text: So that's about 100 times more language data than maybe you'd see in your lifetime if

0:06:04 - 0:06:09     Text: you kind of tried really hard to attend to written text.

0:06:09 - 0:06:15     Text: There are other data sets, of course, that are much, much bigger than GPT-3's training

0:06:15 - 0:06:17     Text: set.

0:06:17 - 0:06:22     Text: There's Common Crawl, which is a sort of snapshot of the internet that anyone can

0:06:22 - 0:06:28     Text: go out and download if you like, this has very roughly on the order of 10 to the 15

0:06:28 - 0:06:31     Text: words.

0:06:31 - 0:06:37     Text: I said earlier that the Library of Congress has something like maybe 10 million books,

0:06:37 - 0:06:39     Text: each book is maybe 100,000 words.

0:06:39 - 0:06:45     Text: So the Library of Congress in total maybe has something like a trillion words.

0:06:45 - 0:06:52     Text: And as another sort of smaller data set example, English Wikipedia is very roughly of

0:06:52 - 0:06:54     Text: order three billion words.

0:06:54 - 0:06:59     Text: So maybe if you spent your whole life reading Wikipedia, you could just barely do it if

0:06:59 - 0:07:03     Text: that was your mission.
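
As a quick sketch of these Fermi estimates, here is a back-of-the-envelope calculation in Python. All of the numbers are just the rough orders of magnitude quoted above, not precise figures.

```python
# Rough word-count estimates quoted in the talk (orders of magnitude only).
words_per_book = 100_000                 # a longish book
books_per_lifetime = 365 * 70            # one book a day for 70 years
lifetime_words = words_per_book * books_per_lifetime
print(f"lifetime of voracious reading: ~{lifetime_words:.1e} words")   # ~2.6e9, i.e. a couple of billion

gpt3_training_words = 2e11               # ~200 billion words, as quoted above
print(f"GPT-3 training set vs. lifetime: ~{gpt3_training_words / lifetime_words:.0f}x")  # roughly the 100x mentioned

library_of_congress_words = 10_000_000 * words_per_book   # ~10 million books -> ~1e12 words
english_wikipedia_words = 3e9                              # ~3 billion words
common_crawl_words = 1e15                                  # ~1e15 words on the order of Common Crawl
```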

0:07:03 - 0:07:09     Text: So what about the actual neural networks that I'll be talking about that we currently

0:07:09 - 0:07:13     Text: seem to be using fairly effectively to model language?

0:07:13 - 0:07:19     Text: So I'll be talking about transformer language models, so-called decoder-only transformer

0:07:19 - 0:07:23     Text: language models of which GPT-3 is an example.

0:07:23 - 0:07:29     Text: And just to sort of put things in perspective, these models have, with kind of the standard way that they're

0:07:29 - 0:07:36     Text: set up, a number of parameters which is something like 12 times the number of layers in the

0:07:36 - 0:07:37     Text: network

0:07:37 - 0:07:38     Text: (GPT-3 has 96 layers; you can make deeper or shallower such networks)

0:07:38 - 0:07:46     Text: times the sort of activation dimension squared.

0:07:46 - 0:07:55     Text: So D model, this D model parameter is just the dimension of the vector space that each

0:07:55 - 0:08:03     Text: token occupies or word, if you were to use words as tokens, when you run this model on

0:08:03 - 0:08:05     Text: language data.

0:08:05 - 0:08:09     Text: And so this gives you some sense for where the parameter count comes from.

0:08:09 - 0:08:16     Text: I think D model for GPT-3 is of order 10,000 and layer is 96 and that's how you get roughly

0:08:16 - 0:08:22     Text: 200 billion parameters in that model and other models scale similarly.
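
A minimal sketch of that parameter estimate, using the round numbers mentioned above rather than GPT-3's exact hyperparameters:

```python
def approx_params(n_layer: int, d_model: int) -> int:
    """Rough decoder-only transformer parameter count: N ~ 12 * n_layer * d_model^2."""
    return 12 * n_layer * d_model ** 2

# GPT-3-like round numbers from the talk: 96 layers, d_model of order 10,000.
print(f"{approx_params(96, 10_000):.1e}")   # ~1.2e11, the same order as the ~200 billion quoted
```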

0:08:22 - 0:08:29     Text: Now, how much computation do you actually do when you train this kind of model?

0:08:29 - 0:08:34     Text: Well it turns out that different neural network architectures have different properties which

0:08:34 - 0:08:40     Text: affect this question, but transformers are actually quite simple in that in a forward

0:08:40 - 0:08:48     Text: pass of a transformer, every parameter on every token performs roughly one add and one

0:08:48 - 0:08:53     Text: multiply and then about twice this in the backward pass.

0:08:53 - 0:08:57     Text: And so that gives us a very simple formula that the number of floating point operations

0:08:57 - 0:09:06     Text: that a model like this performs during training is 6, which is 2 times (1 plus 2), times the

0:09:06 - 0:09:11     Text: number of parameters in the model times the number of tokens D, which is sort of the

0:09:11 - 0:09:15     Text: size of the data set in tokens that you process.
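
As a one-line sketch, that training-compute rule of thumb (2 floating point operations per parameter per token in the forward pass, roughly twice that in the backward pass) can be written as:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute for a dense transformer: C ~ 6 * N * D."""
    return 6 * n_params * n_tokens
```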

0:09:15 - 0:09:22     Text: And one other point that sort of I'll make while kind of going over these estimates is

0:09:22 - 0:09:27     Text: that you might wonder whether or not there's a lot of computation involved in processing

0:09:27 - 0:09:29     Text: long sequences.

0:09:29 - 0:09:38     Text: There's sort of a famous point that dense attention in transformer models is n squared

0:09:38 - 0:09:41     Text: with respect to context length and that's absolutely true.

0:09:41 - 0:09:47     Text: However, if you actually work out the sort of coefficients, the ratio of the amount of

0:09:47 - 0:09:54     Text: computation you do in a forward pass or during training in the context direction versus

0:09:54 - 0:10:00     Text: in the direction of sort of moving up the layers of the model is roughly n context over

0:10:00 - 0:10:02     Text: 12 times D model.

0:10:02 - 0:10:10     Text: So I note this just because if you think, as I'll kind of suggest, that this is a likely

0:10:10 - 0:10:16     Text: direction for the world to be heading, that models might continue to get bigger, then

0:10:16 - 0:10:18     Text: D model for GPT-3 is already 10,000.

0:10:18 - 0:10:21     Text: So the denominator here is order 100,000.

0:10:21 - 0:10:25     Text: And so actually even if you have quite long contexts with the sort of dumbest possible

0:10:25 - 0:10:30     Text: dense attention, the amount of compute you actually do in the context direction is not

0:10:30 - 0:10:35     Text: always so much.
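
A quick numerical check of that ratio, using round numbers in the spirit of the talk (the context lengths below are just illustrative):

```python
def attention_vs_dense_ratio(n_ctx: int, d_model: int) -> float:
    """Rough ratio of attention (context-direction) compute to parameter-matmul compute:
    n_ctx / (12 * d_model)."""
    return n_ctx / (12 * d_model)

print(attention_vs_dense_ratio(2_048, 10_000))   # ~0.017: a couple of percent for a GPT-3-scale model
print(attention_vs_dense_ratio(1_000, 128))      # ~0.65: comparable to the dense compute for a tiny model
```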

0:10:35 - 0:10:40     Text: What about actually numerical values for this compute?

0:10:40 - 0:10:44     Text: So the largest models that we have so far, if we're in kind of Fermi estimate mode, we

0:10:44 - 0:10:48     Text: can round up and say they have say order a trillion parameters.

0:10:48 - 0:10:55     Text: If you have a model with a trillion parameters, then what kind of hardware are you going

0:10:55 - 0:10:56     Text: to run it on?

0:10:56 - 0:11:00     Text: Well, you might run it on an A100 GPU, at least this year.

0:11:00 - 0:11:07     Text: And an A100 GPU performs about 3 times 10 to the 14 floating point operations per second,

0:11:07 - 0:11:12     Text: or about 2 times 10 to the 19 floating point operations per day.

0:11:12 - 0:11:17     Text: This means that it's sort of convenient to sometimes use units of petaflop-days, which

0:11:17 - 0:11:22     Text: is 10 to the 15 floating point operations per second times a day.

0:11:22 - 0:11:26     Text: And that means that's about 3 A100-days.

0:11:26 - 0:11:32     Text: And that's about 8.6 times 10 to the 19 or order 10 to the 20 floating point operations

0:11:32 - 0:11:34     Text: in a day.

0:11:34 - 0:11:40     Text: So how does sort of the compute available on hardware compare to the compute that we

0:11:40 - 0:11:42     Text: do when we train these gigantic models?

0:11:42 - 0:11:50     Text: Well, if we have a model with a trillion parameters and we train it for 300 billion tokens,

0:11:50 - 0:11:55     Text: then we get 6 times 10 to the 12 times 3 times 10 to the 11.

0:11:55 - 0:12:01     Text: And so we get on the order of 10 to the 24 floating point operations to train a trillion

0:12:01 - 0:12:06     Text: parameter model on one of these large data sets.
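
Putting these Fermi estimates together as a sketch (a trillion parameters, ~300 billion tokens, and the rough A100 throughput quoted above; the per-chip rate is an approximation, not a spec):

```python
SECONDS_PER_DAY = 86_400
a100_flops_per_sec = 3e14                       # rough A100 throughput from the talk
petaflop_day = 1e15 * SECONDS_PER_DAY           # ~8.6e19 floating point operations

total_flops = 6 * 1e12 * 3e11                   # C ~ 6 * N * D ~ 1.8e24 FLOPs
print(f"training compute: ~{total_flops:.1e} FLOPs")
print(f"~{total_flops / petaflop_day:,.0f} petaflop-days")
print(f"~{total_flops / (a100_flops_per_sec * SECONDS_PER_DAY):,.0f} A100-days")
```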

0:12:06 - 0:12:09     Text: So these are the numbers involved, and I think the thing that I find most amazing about this

0:12:09 - 0:12:14     Text: is that I still remember taking chemistry in high school.

0:12:14 - 0:12:19     Text: And in chemistry, you learn that sort of a macroscopic amount of stuff is sort of

0:12:19 - 0:12:23     Text: an Avogadro's number of atoms, which is like 6 times 10 to the 23.

0:12:23 - 0:12:29     Text: So somehow we're actually able to build computers that, working together, do more than

0:12:29 - 0:12:33     Text: an Avogadro's number of computations to train these neural models.

0:12:33 - 0:12:37     Text: So anyway, I find these numbers kind of mind-boggling and also useful to sort of have in the

0:12:37 - 0:12:41     Text: back of your head to understand what's going on.

0:12:41 - 0:12:47     Text: So with that, unless there are any questions, I'll start talking about

0:12:47 - 0:12:54     Text: scaling laws for these kinds of language models.

0:12:54 - 0:13:03     Text: So what I'll basically be arguing is that there are very surprisingly precise empirical

0:13:03 - 0:13:10     Text: scaling laws for the performance of machine learning systems, machine learning models,

0:13:10 - 0:13:16     Text: as a function of kind of gross macroscopic inputs like how many parameters does the model

0:13:16 - 0:13:23     Text: have, how big is the data set, and how much compute is used for training.

0:13:23 - 0:13:29     Text: And I'll also make the point that if you're sort of in an airplane at 30,000 feet looking

0:13:29 - 0:13:33     Text: down on what's going on in the field, a lot of the other details in these systems don't

0:13:33 - 0:13:37     Text: matter all that much, or at least they don't matter as much as you might have expected

0:13:37 - 0:13:39     Text: that they would.

0:13:39 - 0:13:45     Text: Very often they just change some kind of constant pre-factor in these kinds of scaling

0:13:45 - 0:13:50     Text: laws, which give you kind of a big picture of what's changing as you really increase

0:13:50 - 0:13:52     Text: these inputs.

0:13:52 - 0:13:58     Text: And one way of sort of turning this into sort of a theme, what do you learn from it, how

0:13:58 - 0:14:04     Text: do you summarize it, is that getting these models to perform better is to a large extent

0:14:04 - 0:14:07     Text: about kind of avoiding bottlenecks.

0:14:07 - 0:14:09     Text: It's avoiding being blocked by something.

0:14:09 - 0:14:15     Text: And there are a lot of things that can block improvements in performance.

0:14:15 - 0:14:19     Text: The most obvious one, which is what scaling laws are studying, is you could not have enough

0:14:19 - 0:14:25     Text: data, you could not have a large enough model, you could not have enough computation to train

0:14:25 - 0:14:26     Text: that model.

0:14:26 - 0:14:31     Text: And then there are also a lot of other literal bottlenecks that you can think about, many

0:14:31 - 0:14:35     Text: of which involve sort of that information propagation through the network.

0:14:35 - 0:14:39     Text: So I guess like one way that I would summarize a lot of the most highly cited papers in machine

0:14:39 - 0:14:46     Text: learning in the last 10 years, papers like Resnets and LayerNorm, BatchNorm, things like

0:14:46 - 0:14:52     Text: that, is that there's sort of alleviating bottlenecks where information wasn't propagating

0:14:52 - 0:14:54     Text: nicely through your network.

0:14:54 - 0:15:00     Text: And the sort of simplest possible picture to sort of illustrate this, which perhaps is

0:15:00 - 0:15:04     Text: a cartoon of what's going on, something that I'll talk about later on with LSTMs, is

0:15:04 - 0:15:10     Text: that if you take a matrix, I mean neural networks are really just fancy systems that do a lot

0:15:10 - 0:15:11     Text: of matrix multiplication.

0:15:11 - 0:15:17     Text: If you take a matrix and you multiply it a large number of times, then very roughly speaking

0:15:17 - 0:15:24     Text: what you end up with is a projection onto its largest eigenspace.

0:15:24 - 0:15:29     Text: And so very roughly speaking, even with a deep network, if you sort of don't set it up

0:15:29 - 0:15:35     Text: correctly, it's very easy to be in a situation where you lose signal or lose information

0:15:35 - 0:15:37     Text: and you get like a literally linear model.

0:15:37 - 0:15:43     Text: But anyway, that's sort of the philosophy that, at least at zeroth order, you might

0:15:43 - 0:15:48     Text: sort of reach from thinking about some of these results.

0:15:48 - 0:15:55     Text: So this slide is really about the kind of core results for scaling laws for language

0:15:55 - 0:15:56     Text: models.

0:15:56 - 0:15:59     Text: I'll explain it in some detail.

0:15:59 - 0:16:05     Text: So I'm actually going to start with the plot on the far right, which is about scaling

0:16:05 - 0:16:11     Text: laws with respect to the number of parameters in a neural network.

0:16:11 - 0:16:19     Text: And so what we did to generate this plot was get a very large data set such that we weren't

0:16:19 - 0:16:23     Text: worried about models overfitting at all.

0:16:23 - 0:16:30     Text: And train all of our models for a very long time so that they were essentially at convergence.

0:16:30 - 0:16:34     Text: So in other words, training time or compute was not constrained on performance.

0:16:34 - 0:16:41     Text: And then plot the resulting test loss of language models, trained to predict the next word

0:16:41 - 0:16:45     Text: as a function of parameter count on a nice log scale.

0:16:45 - 0:16:49     Text: And so what you see is that there's this power law, which is a straight line on a log

0:16:49 - 0:16:56     Text: log plot of the loss as a function of the parameter count of these models.

0:16:56 - 0:17:01     Text: In the middle plot, we do the same thing, but switch the role of the amount of data that

0:17:01 - 0:17:03     Text: we have with parameter count.

0:17:03 - 0:17:08     Text: So we train a model that's very large, maybe one of the largest models on the plot on

0:17:08 - 0:17:15     Text: the right, so that model size is not a constraint on performance on data sets of various sizes.

0:17:15 - 0:17:17     Text: And we apply early stopping.

0:17:17 - 0:17:21     Text: So we measure the test loss at the point where the test loss is at its minimum during

0:17:21 - 0:17:24     Text: otherwise pretty naive straightforward training.

0:17:24 - 0:17:31     Text: And we find again a very clear power law for loss as a function of data set size.

0:17:31 - 0:17:35     Text: And then the most complicated plot is the one on the left.

0:17:35 - 0:17:45     Text: So on the left, we plot all of the learning curves for many different models.

0:17:45 - 0:17:48     Text: We provide these models with plenty of data so they're not overfitting.

0:17:48 - 0:17:52     Text: They're in the under parameterized regime.

0:17:52 - 0:17:57     Text: But we train all of these different model sizes for a very, very long time.

0:17:57 - 0:18:03     Text: And we measure on the x-axis not the number of training steps or training tokens, but

0:18:03 - 0:18:08     Text: the amount of compute that has been used so far during training.

0:18:08 - 0:18:12     Text: And as a consequence of one of the formulas that I wrote a couple of slides ago, that

0:18:12 - 0:18:18     Text: compute is six times parameter count times the amount of training data.

0:18:18 - 0:18:24     Text: If you take the logarithm of both sides, the log of parameters times data is log of parameters

0:18:24 - 0:18:25     Text: plus log of data.

0:18:25 - 0:18:29     Text: So what that means is that learning curves for models at different sizes are just shifted

0:18:29 - 0:18:34     Text: over left and right by constant amounts with the largest models on the sort of the far

0:18:34 - 0:18:38     Text: right of this curve and the smallest models on the left.

0:18:38 - 0:18:42     Text: So we have the learning curves for all of these models all put together.

0:18:42 - 0:18:47     Text: And so a question you can ask is sort of what is the best loss you can get for any given

0:18:47 - 0:18:52     Text: amount of training compute where you're allowing yourself to choose the model that does best

0:18:52 - 0:18:54     Text: for that amount of training compute?

0:18:54 - 0:18:59     Text: And that's what sort of the heavy black line and the orange fit are picking out.

0:18:59 - 0:19:05     Text: I mean formally you could call this the convex hull of all of these curves.

0:19:05 - 0:19:12     Text: And that again somewhat surprisingly seems to obey a very nice power law fit over many,

0:19:12 - 0:19:17     Text: many orders of magnitude in computation.

0:19:17 - 0:19:21     Text: And it's crucial for all of these experiments that you're only limiting performance with

0:19:21 - 0:19:23     Text: one thing at a time.

0:19:23 - 0:19:27     Text: On the far right you have plenty of data and compute, but you're limiting the number of

0:19:27 - 0:19:30     Text: parameters. In the middle you're limiting the amount of data but you have a big model.

0:19:30 - 0:19:36     Text: On the left you're looking at training compute but you have all sorts of different model sizes

0:19:36 - 0:19:39     Text: and again plenty of data.

0:19:39 - 0:19:43     Text: So in other words in each of these cases there's sort of one of these parameters that's

0:19:43 - 0:19:48     Text: bottlenecking performance and otherwise you have plenty of resources.
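
As a sketch of the kind of fit behind these plots: a pure power law L(N) = (N_c / N)^alpha is a straight line in log-log coordinates, so the exponent can be recovered with an ordinary linear fit of log L against log N. The constants below are made up for illustration, not the fitted values from the paper.

```python
import numpy as np

# Synthetic "loss vs. parameter count" data following L(N) = (N_c / N) ** alpha.
alpha_true, N_c = 0.076, 8.8e13             # illustrative values only
N = np.logspace(5, 10, 20)                  # model sizes from 1e5 to 1e10 parameters
L = (N_c / N) ** alpha_true

# A power law is linear on a log-log plot, so fit a straight line to (log N, log L).
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
print(f"fitted exponent: {-slope:.3f}")                 # recovers ~0.076
print(f"fitted N_c: {np.exp(intercept / -slope):.2e}")  # recovers ~8.8e13
```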

0:19:48 - 0:19:50     Text: There's a question?

0:19:50 - 0:20:00     Text: I hope that's not true.

0:20:00 - 0:20:03     Text: So there's a minus sign in the exponent.

0:20:03 - 0:20:07     Text: I'm not sure if you're looking at the lines or the function.

0:20:07 - 0:20:09     Text: On the bottom, I see.

0:20:09 - 0:20:10     Text: I'll be right.

0:20:10 - 0:20:11     Text: Oh, I see.

0:20:11 - 0:20:12     Text: Okay.

0:20:12 - 0:20:17     Text: Yeah, they're just on a log-scale plot.

0:20:17 - 0:20:20     Text: Yeah, please ask any questions.

0:20:20 - 0:20:22     Text: Great.

0:20:22 - 0:20:27     Text: And then the x-axis on this compute plot is this petaflop-day unit.

0:20:27 - 0:20:31     Text: That's why it's actually a small number.

0:20:31 - 0:20:36     Text: Any other questions about anything about this plot?

0:20:36 - 0:20:39     Text: Cool.

0:20:39 - 0:20:44     Text: So there's another thing that you can do that's kind of interesting with the plot on the

0:20:44 - 0:20:53     Text: left, which is you can ask, for any given quantity of compute that you have available:

0:20:53 - 0:20:58     Text: someone kindly donates to you some number of A100s to use for a few weeks and you want

0:20:58 - 0:21:01     Text: to use it to train the best possible language model you can.

0:21:01 - 0:21:07     Text: And so you can ask based on this plot on the left, how should I allocate the computation

0:21:07 - 0:21:14     Text: that was given to me in terms of making a bigger model or training longer?

0:21:14 - 0:21:20     Text: And it turns out there's sort of a simplified cartoon for the answer that we found with

0:21:20 - 0:21:26     Text: our language data, which was that you mostly want to allocate most of your compute, basically

0:21:26 - 0:21:32     Text: two-thirds on a geometric scale to making models bigger.

0:21:32 - 0:21:38     Text: And you can allocate about a third to training for longer on more data.

0:21:38 - 0:21:43     Text: And so this at least for us wasn't an obvious conclusion.

0:21:43 - 0:21:49     Text: It suggests that a lot of the gains that you're going to get if you want to get better

0:21:49 - 0:21:53     Text: performance with a fixed amount of compute, a fixed budget, is going to come from making

0:21:53 - 0:21:55     Text: your models bigger.
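
A sketch of that rule of thumb: if roughly two-thirds of each increase in compute (on a log scale) goes to model size and about a third to data, then N grows like C^(2/3) and D like C^(1/3), which is consistent with C ~ 6ND. The reference model and data sizes here are hypothetical, just to show the scaling.

```python
def scale_model_and_data(compute_multiplier: float,
                         n_ref: float = 1e9, d_ref: float = 1e10) -> tuple[float, float]:
    """Split a compute increase roughly 2/3 into model size and 1/3 into data
    (on a log scale), following the allocation described in the talk."""
    n = n_ref * compute_multiplier ** (2 / 3)
    d = d_ref * compute_multiplier ** (1 / 3)
    return n, d

# With 1000x more compute: ~100x bigger model, ~10x more data.
print(scale_model_and_data(1_000.0))
```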

0:21:55 - 0:21:58     Text: And it turns out that in practice, I won't go into it in detail.

0:21:58 - 0:22:02     Text: You can, to some extent, just make your batch size bigger during training.

0:22:02 - 0:22:06     Text: And that means that the total number of serial steps that you train for doesn't have to

0:22:06 - 0:22:07     Text: increase all that much.

0:22:07 - 0:22:11     Text: You don't necessarily need to train for vastly longer.

0:22:11 - 0:22:16     Text: You seemingly just largely need a bigger model.

0:22:16 - 0:22:20     Text: And that's something that you read off from this compute plot that I showed.

0:22:20 - 0:22:28     Text: That's a general question.

0:22:28 - 0:22:37     Text: The way that you get this graph is you basically do an analysis where you look at any given

0:22:37 - 0:22:44     Text: point for compute and you look up and you pick out the blue curve that's closest to the

0:22:44 - 0:22:45     Text: black line.

0:22:45 - 0:22:49     Text: And that gives you a model size and an amount of training.

0:22:49 - 0:22:56     Text: And so you can do that for all of these different points on the x-axis.

0:22:56 - 0:22:59     Text: And then for any given point on this x-axis that tells you a model size.

0:22:59 - 0:23:04     Text: You learn model size as a function of your compute budget.

0:23:04 - 0:23:09     Text: And then, conversely, you also learn an amount of training, which is sort of a data set size.

0:23:09 - 0:23:16     Text: And so that's the explanation for the sort of million x model size versus a thousand

0:23:16 - 0:23:18     Text: x in data.

0:23:18 - 0:23:23     Text: I probably won't try to explain the batch size question.

0:23:23 - 0:23:27     Text: But it's basically based on some empirical analysis where you ask, how big can you make

0:23:27 - 0:23:28     Text: your batch size?

0:23:28 - 0:23:33     Text: How far can you push data parallelism without seeing diminishing returns?

0:23:33 - 0:23:37     Text: And that's sort of the rough answer from that question.

0:23:37 - 0:23:41     Text: They're all trained from scratch.

0:23:41 - 0:23:46     Text: So almost everything that I'll talk about in this talk is trained from

0:23:46 - 0:23:47     Text: scratch.

0:23:47 - 0:23:51     Text: Any other questions?

0:23:51 - 0:23:54     Text: Okay.

0:23:54 - 0:23:58     Text: And then there's another point that, I mean, I don't want to overemphasize.

0:23:58 - 0:24:04     Text: But like I said, from a sort of very zeroth-order, naive perspective, for some of these

0:24:04 - 0:24:07     Text: results, architecture isn't the most crucial thing.

0:24:07 - 0:24:13     Text: So I think one of the biggest advances in machine learning in the last five or ten years

0:24:13 - 0:24:17     Text: has been the development of the transformer models that I'm talking about.

0:24:17 - 0:24:22     Text: But of course, you can do language modeling with a recurrent model that reads words in

0:24:22 - 0:24:24     Text: order.

0:24:24 - 0:24:30     Text: And of course, LSTMs or stacked LSTMs are sort of the standard way to do that.

0:24:30 - 0:24:36     Text: And so you can compare what you actually get if you study LSTMs versus transformers.

0:24:36 - 0:24:40     Text: And at zeroth order, it doesn't seem like LSTMs are so bad.

0:24:40 - 0:24:44     Text: It looks like as you make them bigger, they are scaling up quite nicely.

0:24:44 - 0:24:48     Text: But there's basically a constant offset where transformers are something like five or ten

0:24:48 - 0:24:53     Text: times more efficient for a given model size than LSTMs.

0:24:53 - 0:24:56     Text: And so I think this is a very, very convincing plot that tells you the transformers are in

0:24:56 - 0:24:58     Text: fact better.

0:24:58 - 0:25:03     Text: But you don't necessarily need a transformer to see that making models bigger is giving

0:25:03 - 0:25:05     Text: you gains.

0:25:05 - 0:25:10     Text: And really the sort of more interesting limitation of LSTMs that I'll also talk about a little

0:25:10 - 0:25:13     Text: more later is if we plot something else.

0:25:13 - 0:25:21     Text: So if we look at a thousand tokens, which is something like 600 words of context, we

0:25:21 - 0:25:25     Text: can look at what the loss is as a function of the position in the context.

0:25:25 - 0:25:30     Text: Because if you've read more of a document already, you're going to be better at predicting

0:25:30 - 0:25:33     Text: what the next word is because you have more context available.

0:25:33 - 0:25:34     Text: And these are very smooth curves, it turns out

0:25:34 - 0:25:41     Text: also power laws, for the loss as a function of context position.

0:25:41 - 0:25:47     Text: But the thing that you notice is that the red lines are LSTMs and the blue lines are

0:25:47 - 0:25:48     Text: transformers.

0:25:48 - 0:25:54     Text: And LSTMs tend to sort of plateau in performance after on the order of a hundred tokens.

0:25:54 - 0:26:00     Text: And this is sort of another bottleneck in a different direction.

0:26:00 - 0:26:05     Text: This is the famous fact that transformers are much better at learning long context information.

0:26:05 - 0:26:08     Text: And this is obviously a limitation of LSTMs.

0:26:08 - 0:26:14     Text: But sort of the basic parameter scaling law seems like it holds for many architectures.

0:26:14 - 0:26:15     Text: And then there are much more refined questions

0:26:15 - 0:26:18     Text: you can ask; I won't go into too much detail on this.

0:26:18 - 0:26:21     Text: But there are all sorts of hyper parameters in transformer models.

0:26:21 - 0:26:25     Text: And you might ask how much does it matter if I really optimize those?

0:26:25 - 0:26:28     Text: Do I get qualitatively different behavior if I optimize those better?

0:26:28 - 0:26:33     Text: And what all of these plots show is that for various different kinds of hyper parameters

0:26:33 - 0:26:38     Text: and transformer models, there's some broad basin where you get quite good performance.

0:26:38 - 0:26:42     Text: I mean, maybe a factor of three in either direction where performance doesn't change

0:26:42 - 0:26:43     Text: all that much.

0:26:43 - 0:26:48     Text: Of course, you might want to optimize that, I'm not saying you shouldn't, but kind of qualitatively

0:26:48 - 0:26:53     Text: it's not an enormous difference.

0:26:53 - 0:26:59     Text: So I think this is also a place where, as I'm going to tell you in a few slides,

0:26:59 - 0:27:04     Text: a lot of these features are true more generally beyond language.

0:27:04 - 0:27:11     Text: And they really sort of say that much of what's going on when machines learn is quite universal.

0:27:11 - 0:27:14     Text: But there are features that are not universal.

0:27:14 - 0:27:22     Text: So this is kind of a nicer plot of loss versus token index.

0:27:22 - 0:27:29     Text: And I've included some power law fits, which are dotted lines, which show that this is

0:27:29 - 0:27:34     Text: actually, this performance is also highly predictable.

0:27:34 - 0:27:38     Text: That just says the obvious thing: when you've read more, it's easier for you

0:27:38 - 0:27:40     Text: to predict what's coming next.

0:27:40 - 0:27:45     Text: But you can train models on images, I'll briefly talk about that later, you can train models

0:27:45 - 0:27:48     Text: identically on images.

0:27:48 - 0:27:51     Text: And there you see the performance as a function of context position

0:27:51 - 0:27:52     Text: is very different.

0:27:52 - 0:27:58     Text: So here you have a model that reads pixels row by row.

0:27:58 - 0:28:02     Text: And as you might expect, there's usually much more non-trivial stuff going on in the

0:28:02 - 0:28:05     Text: middle of an image rather than in the background.

0:28:05 - 0:28:08     Text: And that's represented by the fact that models do much worse.

0:28:08 - 0:28:12     Text: Their loss is higher in the center of images as compared to near the edges.

0:28:12 - 0:28:17     Text: So while some properties of transformers and language models are universal, and I'll

0:28:17 - 0:28:22     Text: talk about those later on, there are features of language data that are totally different

0:28:22 - 0:28:24     Text: from other data distributions.

0:28:24 - 0:28:30     Text: And this is a very stark example of that.

0:28:30 - 0:28:35     Text: But generally, there are these kinds of nice patterns lurking whenever you

0:28:35 - 0:28:37     Text: optimize a model,

0:28:37 - 0:28:39     Text: and I think that is very common.

0:28:39 - 0:28:42     Text: So any questions about this?

0:28:42 - 0:28:43     Text: Yeah.

0:28:43 - 0:28:45     Text: Do you mind going back a slide?

0:28:45 - 0:28:52     Text: Do you mind explaining what it means to have a loss on the first token versus the

0:28:52 - 0:28:54     Text: 1000th token?

0:28:54 - 0:29:01     Text: Yeah, so if you imagine you have a thousand words extracted randomly from a book, then

0:29:01 - 0:29:05     Text: the very first thing you can ask the model to do is try to predict the very first word.

0:29:05 - 0:29:09     Text: Then you ask it to predict the second word, the third word, et cetera.

0:29:09 - 0:29:15     Text: The very first word basically all the model can possibly do is predict the unigram distribution

0:29:15 - 0:29:16     Text: for its training set.

0:29:16 - 0:29:20     Text: It just doesn't have any information to go on, otherwise to predict what's happening.

0:29:20 - 0:29:23     Text: And so that's why it's lost is very high.

0:29:23 - 0:29:28     Text: But by the time you get to the end of the passage, you've read a lot of some little short story,

0:29:28 - 0:29:30     Text: and you know a lot about what's going to happen.

0:29:30 - 0:29:32     Text: You know what kinds of words are likely to come next.

0:29:32 - 0:29:35     Text: You know about the author's style and vocabulary.

0:29:35 - 0:29:38     Text: You know about what characters exist, et cetera.

0:29:38 - 0:29:43     Text: And so your model has gotten much, much better at prediction by the end of the context.

0:29:43 - 0:29:48     Text: And so literally to make this plot, you take maybe a thousand, ten thousand different

0:29:48 - 0:29:51     Text: passages with a thousand words in them.

0:29:51 - 0:29:55     Text: You compute the model's loss on all of the words in the passage, and then you take the

0:29:55 - 0:29:57     Text: mean, and you get some nice plot like this.
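
A sketch of that averaging step, assuming you already have a per-token loss for each passage (the array names here are hypothetical):

```python
import numpy as np

def mean_loss_by_position(per_token_losses: np.ndarray) -> np.ndarray:
    """per_token_losses has shape (n_passages, context_length): the model's loss on each
    token of each passage. Averaging over passages gives the curve of loss vs. token index."""
    return per_token_losses.mean(axis=0)

# e.g. losses for 10,000 passages of 1,000 tokens each -> a length-1,000 curve to plot.
losses = np.random.rand(10_000, 1_000)      # placeholder for real per-token losses
curve = mean_loss_by_position(losses)
```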

0:29:57 - 0:29:58     Text: Yeah.

0:29:58 - 0:30:08     Text: But because the computational complexity is quadratic with respect to token index, wouldn't

0:30:08 - 0:30:12     Text: that essentially mean, if you think about it, if you were looking

0:30:12 - 0:30:18     Text: at compute, then it would go from zero to like ten to the six?

0:30:18 - 0:30:27     Text: So you'd need significantly greater compute for a given test loss as you increase

0:30:27 - 0:30:28     Text: token index.

0:30:28 - 0:30:33     Text: So it's true that if you make the context length longer, longer, you will spend somewhat

0:30:33 - 0:30:35     Text: more compute.

0:30:35 - 0:30:41     Text: But the fraction of the amount of compute you spend near the last token isn't nearly

0:30:41 - 0:30:42     Text: so stark.

0:30:42 - 0:30:50     Text: Most of the compute happens in the matrix multiplies for the MLP feed forward part of

0:30:50 - 0:30:57     Text: the transformer, and also the matrix multiplies to make the keys and queries and values, et cetera.

0:30:57 - 0:30:59     Text: That's actually, in most models, well,

0:30:59 - 0:31:04     Text: it depends on the model hyperparameters, but in many models, especially models that are

0:31:04 - 0:31:07     Text: large, that's actually the predominant compute.

0:31:07 - 0:31:10     Text: And so actually, the amount of compute you do for the last token and the first token might

0:31:10 - 0:31:12     Text: only differ by a few percent.

0:31:12 - 0:31:16     Text: So for GPT-3, I think it's literally like one or two percent difference.

0:31:16 - 0:31:19     Text: So there are some matrix multiplies in the attention?

0:31:19 - 0:31:20     Text: Yeah, yeah, yeah.

0:31:20 - 0:31:24     Text: So I mean, the formula for that was this one that I briefly mentioned here.

0:31:24 - 0:31:29     Text: So basically, how much compute you do in the context direction divided by the amount of

0:31:29 - 0:31:32     Text: compute you do in the matrix multiply direction is this.

0:31:32 - 0:31:40     Text: So if your model is, if D model is very small, if D model is 128, and context is 1,000, then

0:31:40 - 0:31:42     Text: it's basically 50-50.

0:31:42 - 0:31:48     Text: But if D model is 10,000, and context is 2,000, then it's like 2%.

0:31:48 - 0:31:52     Text: So if models keep getting bigger, then that means that if you're willing to pay

0:31:52 - 0:31:57     Text: a fractional cost, then you can keep making the context length longer for a fixed fractional

0:31:57 - 0:31:58     Text: cost.

0:31:58 - 0:32:01     Text: And of course, if you use something fancy instead of dense attention, you also get extra

0:32:01 - 0:32:04     Text: wins on top of that.

0:32:04 - 0:32:08     Text: Any other questions?

0:32:08 - 0:32:11     Text: Cool.

0:32:11 - 0:32:21     Text: So this is sort of both of these, the left and the right, show you samples from a transformer

0:32:21 - 0:32:24     Text: model.

0:32:24 - 0:32:28     Text: Very roughly speaking, they're identical kinds of transformer models, just with some

0:32:28 - 0:32:30     Text: slightly different hyperparameters.

0:32:30 - 0:32:32     Text: But they're trained on very different data distributions.

0:32:32 - 0:32:35     Text: The one on the left is obviously, this is GPT-3.

0:32:35 - 0:32:38     Text: The one on the right is IGPT.

0:32:38 - 0:32:42     Text: It's a model that's trained to predict pixels, row by row.

0:32:42 - 0:32:47     Text: And so what happened here was that we took the top half of an image and then generated

0:32:47 - 0:32:50     Text: all the rows beneath.

0:32:50 - 0:32:55     Text: And so the same kind of model architecture, but just trained on different data distributions

0:32:55 - 0:33:04     Text: is able to effectively learn very impressive generative capabilities in both cases.

0:33:04 - 0:33:09     Text: And so this is sort of a qualitative hint at the possibility that what's going on here

0:33:09 - 0:33:13     Text: is quite universal.

0:33:13 - 0:33:18     Text: And so another way of introducing it is to say, you might have some questions after the

0:33:18 - 0:33:19     Text: last few slides.

0:33:19 - 0:33:22     Text: Are the scaling laws I'm talking about really specific to language, are they a feature

0:33:22 - 0:33:26     Text: of the kinds of data that language is?

0:33:26 - 0:33:29     Text: You might ask, do these scaling laws really continue?

0:33:29 - 0:33:33     Text: You showed that they're true over many orders of magnitude, but do they break down eventually,

0:33:33 - 0:33:34     Text: and in what way?

0:33:34 - 0:33:40     Text: And then another question you might ask is, what do they imply for other kinds of evaluations?

0:33:40 - 0:33:45     Text: You probably don't just want to generate raw samples from either of these kinds of models.

0:33:45 - 0:33:48     Text: You might want to use them for some other more specific task.

0:33:48 - 0:33:55     Text: And so the question of whether or not the test loss, the training loss that you've optimized

0:33:55 - 0:34:00     Text: as that goes down in a predictable way, does that also imply that other things, other

0:34:00 - 0:34:03     Text: capabilities of the model are improving?

0:34:03 - 0:34:07     Text: So I'll be talking about these questions.

0:34:07 - 0:34:16     Text: So this plot contains kind of a lot of compressed information all at once, or the set of plots.

0:34:16 - 0:34:25     Text: So this is the result of what happens if you train the same kind of transformer models on

0:34:25 - 0:34:27     Text: sort of five different data distributions.

0:34:27 - 0:34:33     Text: So text language we already saw, but you can try video where you predict every pixel

0:34:33 - 0:34:39     Text: in a video in this sort of rectangular prism of video pixels.

0:34:39 - 0:34:45     Text: Images; this sort of synthetically generated DeepMind math data set where you're trying

0:34:45 - 0:34:49     Text: to predict the answer to math problems.

0:34:49 - 0:34:55     Text: There's a multimodal data set where you have image text pairs in either direction.

0:34:55 - 0:35:03     Text: And in all cases, the x-axis is compute, and the y-axis is the appropriate test loss

0:35:03 - 0:35:08     Text: for that class of models minus a constant.

0:35:08 - 0:35:15     Text: So that's the one complication that I've added here.

0:35:15 - 0:35:30     Text: So the claim is that these dashed lines in terms of the original loss are a power law,

0:35:30 - 0:35:38     Text: like the power laws that we saw on a much earlier slide, plus one constant term.

0:35:38 - 0:35:42     Text: And if you subtract off that constant term, then you make a log-log plot once again,

0:35:42 - 0:35:46     Text: then you once again get these very, very nice straight lines.

0:35:46 - 0:35:50     Text: And so this compute scaling law generalizes to all these other data distributions.

0:35:50 - 0:35:56     Text: And the other scaling laws also generalize, I just haven't plotted them.

0:35:56 - 0:36:01     Text: So the claim of this slide is that scaling laws do generalize to all of these other data

0:36:01 - 0:36:07     Text: distributions, and you train the same basic kind of model on them.

0:36:07 - 0:36:13     Text: And furthermore, there's sort of an intellectually slightly interesting point, which is that

0:36:13 - 0:36:21     Text: if you really believe that these dashed lines are true, if you think that they're a real

0:36:21 - 0:36:28     Text: feature of what's going on, and they continue out very, very, very far, then if you think

0:36:28 - 0:36:34     Text: that the loss is a constant plus a power law, then you can interpret the constant term

0:36:34 - 0:36:39     Text: as the entropy of the underlying data distribution.

0:36:39 - 0:36:44     Text: And you can interpret the power law as something like the KL divergence between the true data

0:36:44 - 0:36:48     Text: distribution and the model that you have.
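
Written out, the decomposition being described looks something like the following (the notation is illustrative; treat it as an interpretation rather than a derivation):

```latex
L(C) \;\approx\; \underbrace{L_\infty}_{\text{entropy of the data}}
\;+\; \underbrace{\left(\frac{C_0}{C}\right)^{\alpha_C}}_{\;\approx\; \mathrm{KL}\left(p_{\text{data}} \,\|\, p_{\text{model}}\right)}
```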

0:36:48 - 0:36:50     Text: So that's a lot.

0:36:50 - 0:36:57     Text: The important summary at zeroth order to remember is that I'm telling you that the kinds of

0:36:57 - 0:37:02     Text: scaling laws I presented for language generalize to all of these other domains.

0:37:02 - 0:37:06     Text: There's also some other interesting features here.

0:37:06 - 0:37:13     Text: The reason why I used compute to illustrate that the scaling laws generalize is because

0:37:13 - 0:37:18     Text: you can ask another question now that puts all of the different data distributions on

0:37:18 - 0:37:19     Text: one plot.

0:37:19 - 0:37:25     Text: It wouldn't have made any sense to combine the five plots on the last slide into one plot,

0:37:25 - 0:37:30     Text: because the test loss, when you're predicting a word, is not in any way comparable to the

0:37:30 - 0:37:32     Text: test loss when you're predicting a pixel.

0:37:32 - 0:37:33     Text: It doesn't really make sense.

0:37:33 - 0:37:34     Text: They don't have the same units.

0:37:34 - 0:37:36     Text: It doesn't make sense to put them together.

0:37:36 - 0:37:42     Text: But something that does make sense to put together is what the optimal model size is as

0:37:42 - 0:37:45     Text: a function of your computational budget.

0:37:45 - 0:37:50     Text: And so in the same way that we did for language, you can go here and you can ask for any

0:37:50 - 0:37:55     Text: given amount of compute, like 10 to the minus 2 petaflop-days, what is the best model size?

0:37:55 - 0:37:57     Text: You can do that for all of these plots.

0:37:57 - 0:38:01     Text: You combine that information together and you find something kind of surprising, which

0:38:01 - 0:38:07     Text: is that, again, roughly speaking, if you're sort of willing to allow a little bit of wiggle

0:38:07 - 0:38:13     Text: room, all of these different kinds of models seem to be on the same trajectory for optimal

0:38:13 - 0:38:15     Text: model size versus compute.

0:38:15 - 0:38:20     Text: There's some kind of universal fit of how much bigger you should make your model if you're

0:38:20 - 0:38:28     Text: going to model any of these data distributions with some given amount of compute.

0:38:28 - 0:38:32     Text: So what about other kinds of tasks?

0:38:32 - 0:38:38     Text: Well, one of the most classic tasks that you can ask about in ML is image classification.

0:38:38 - 0:38:44     Text: And so the models that we were training on images, and that I've shown you plots of their

0:38:44 - 0:38:50     Text: training loss, these models are trained on tiny little images, predicted

0:38:50 - 0:38:54     Text: pixel by pixel. In particular, they're 32 by 32 images, so we can look at sort of the

0:38:54 - 0:39:00     Text: 32 by 32 pixel version of ImageNet classification.

0:39:00 - 0:39:05     Text: And the models that I was discussing are generative models that predict pixels, but you can chop

0:39:05 - 0:39:13     Text: off their heads, add a classification head in its place, and try to predict ImageNet

0:39:13 - 0:39:15     Text: and train on ImageNet.

0:39:15 - 0:39:19     Text: And the orange curve that I've shown you here is what happens if you just take a randomly

0:39:19 - 0:39:23     Text: initialized model with that architecture and train it.

0:39:23 - 0:39:28     Text: You get very good performance up to a point and then performance plateaus because you're

0:39:28 - 0:39:33     Text: being limited by the fact that ImageNet is, from this point of view, a small data set.

0:39:33 - 0:39:38     Text: However, if you take these pre-trained models that have been trained generatively to

0:39:38 - 0:39:43     Text: draw pixels, they sort of use the features, presumably they're using the features they

0:39:43 - 0:39:52     Text: learned from image generation for classification, and you get some nice trend for the error rate

0:39:52 - 0:39:56     Text: in classification as a function of model size.

0:39:56 - 0:40:02     Text: So this is saying that in this particular case, when we actually do fine-tuning, the pre-training

0:40:02 - 0:40:08     Text: you did and the sort of trends you saw really kind of transfer into trends in something else

0:40:08 - 0:40:12     Text: you might care about, like image classification.
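
A rough sketch of the "replace the generative head with a classification head" idea in PyTorch-style code; the class and attribute names are hypothetical, not the actual iGPT implementation.

```python
import torch.nn as nn

class ImageClassifier(nn.Module):
    """Reuse a generatively pre-trained transformer body and attach a new linear
    classification head for ImageNet (hypothetical sketch)."""
    def __init__(self, pretrained_body: nn.Module, d_model: int, n_classes: int = 1000):
        super().__init__()
        self.body = pretrained_body                       # pre-trained transformer blocks
        self.classifier = nn.Linear(d_model, n_classes)   # new head, trained on ImageNet

    def forward(self, pixels):
        features = self.body(pixels)        # assumed shape: (batch, seq_len, d_model)
        pooled = features.mean(dim=1)       # average-pool over sequence positions
        return self.classifier(pooled)
```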

0:40:12 - 0:40:17     Text: We can ask the same kinds of questions about language models.

0:40:17 - 0:40:23     Text: In particular, does this steady improvement in language modeling as a function of scale,

0:40:23 - 0:40:28     Text: does that translate into better performance?

0:40:28 - 0:40:30     Text: And this is sort of an interesting subject by itself.

0:40:30 - 0:40:34     Text: And so you can ask what happens if we scale language models.

0:40:34 - 0:40:38     Text: And so this is sort of this exact same plot that you've seen a couple of times now for

0:40:38 - 0:40:44     Text: language models, but just extended from the sort of original work that we did out to

0:40:44 - 0:40:47     Text: this yellow line, which is GPT-3.

0:40:47 - 0:40:50     Text: And you see that basically this sort of trends continue.

0:40:50 - 0:40:54     Text: Possibly GPT-3 is sort of missing the trend a little bit.

0:40:54 - 0:40:59     Text: I can't really honestly tell you whether that's because GPT-3 wasn't well optimized, or

0:40:59 - 0:41:05     Text: if it's because there's some bending in this curve where we're hitting some irreducible

0:41:05 - 0:41:06     Text: loss.

0:41:06 - 0:41:12     Text: That irreducible loss would be something like the entropy of this sort of language data

0:41:12 - 0:41:15     Text: set itself.

0:41:15 - 0:41:18     Text: But at least to zeroth order, the trends continue.

0:41:18 - 0:41:24     Text: And what's now pretty well known is that if you train fairly large language models,

0:41:24 - 0:41:27     Text: then they can exhibit in context learning.

0:41:27 - 0:41:34     Text: So the kind of learning that I'm talking about is that you give these models examples

0:41:34 - 0:41:42     Text: of many arithmetic problems or many anagrams or whatnot or translation tasks for individual

0:41:42 - 0:41:43     Text: words.

0:41:43 - 0:41:49     Text: Then early on in the sequence of examples, they might not be very good at doing the task,

0:41:49 - 0:41:54     Text: but they figure out what the pattern is in the task and they learn to do it.

0:41:54 - 0:41:58     Text: And in particular, you can plot that so you can ask for, say, like one of these anagram

0:41:58 - 0:42:00     Text: tasks.

0:42:00 - 0:42:06     Text: What is the performance of the model as a function of how many examples of the task get seen

0:42:06 - 0:42:07     Text: in the context?

0:42:07 - 0:42:12     Text: So this is kind of similar to the loss as a function of context position, but it's now

0:42:12 - 0:42:16     Text: accuracy at doing an actual task, like unscrambling the letters in a word.

0:42:16 - 0:42:23     Text: And you see probably most importantly that if you give more examples, you get significantly

0:42:23 - 0:42:28     Text: better performance starting from very, very poor performance to pretty good.

0:42:28 - 0:42:31     Text: And also you see that larger models do this better.

0:42:31 - 0:42:35     Text: You also finally see that giving a natural language prompt with some instructions helps

0:42:35 - 0:42:40     Text: significantly in the regime where you have very few examples.

0:42:40 - 0:42:42     Text: This is in context learning.

0:42:42 - 0:42:46     Text: You can call this a kind of meta learning.

0:42:46 - 0:42:52     Text: And it just emerges automatically from training large language models without any particular

0:42:52 - 0:42:57     Text: attempt to get this kind of behavior.
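
To make the in-context learning setup concrete, here is a hypothetical few-shot prompt for the word-unscrambling task; the exact prompts used for GPT-3 differ, so treat this purely as an illustration.

```python
# A few-shot prompt: several solved examples followed by a new query.
# The model is only asked to continue the text; no weights are updated.
prompt = (
    "Unscramble the letters to form an English word.\n"
    "ppale -> apple\n"
    "esuoh -> house\n"
    "nderga -> garden\n"
    "oolhcs -> "
)
# A sufficiently large model tends to continue with "school"; accuracy is measured by
# comparing the continuation to the known answer, averaged over many such prompts.
```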

0:42:57 - 0:43:03     Text: And you could also ask sort of about downstream tasks that you actually care about.

0:43:03 - 0:43:08     Text: So there is accuracy at doing arithmetic as a function of model size, a bunch of different

0:43:08 - 0:43:11     Text: kinds of arithmetic problems.

0:43:11 - 0:43:19     Text: There is just some data set of analogies from a test that American students take

0:43:19 - 0:43:23     Text: to go to college, the SATs.

0:43:23 - 0:43:31     Text: And if you care, the sort of average score on that year's test was I think 58% or

0:43:31 - 0:43:32     Text: so.

0:43:32 - 0:43:35     Text: So the largest model is sort of doing a little bit better than the average American high

0:43:35 - 0:43:37     Text: school student.

0:43:37 - 0:43:42     Text: Then TriviaQA, which is sort of just knowing trivia.

0:43:42 - 0:43:50     Text: And Winograd schemas are problems like: if a tree falls on your roof and you got it fixed,

0:43:50 - 0:43:53     Text: what did you get fixed, did you get the tree fixed or your roof?

0:43:53 - 0:43:57     Text: It's a measure of common sense reasoning and models are also getting better at this.

0:43:57 - 0:44:03     Text: And I think the other interesting thing that's very often emphasized is that clearly trivia

0:44:03 - 0:44:05     Text: performance is improving very smoothly as you make models bigger.

0:44:05 - 0:44:09     Text: The models are just remembering more and more trivia.

0:44:09 - 0:44:14     Text: Winograd schemas are also improving fairly smoothly.

0:44:14 - 0:44:17     Text: But then there are examples like arithmetic where models are very poor and then they sort

0:44:17 - 0:44:19     Text: of suddenly get pretty good.

0:44:19 - 0:44:25     Text: And so these kinds of sudden jumps, where the model sort of suddenly kind of gets

0:44:25 - 0:44:28     Text: what it's supposed to do for arithmetic, are pretty interesting.

0:44:28 - 0:44:32     Text: And there are all sorts of other kind of interesting things if you kind of dig into these

0:44:32 - 0:44:33     Text: specific abilities.

0:44:33 - 0:44:34     Text: Yeah.

0:44:34 - 0:44:41     Text: Why do bigger models do better at in-context learning?

0:44:41 - 0:44:48     Text: I mean, I guess the sort of dumb zeroth-order point is that larger models are just getting

0:44:48 - 0:44:52     Text: much better and better at predicting the next word given more and more context.

0:44:52 - 0:44:58     Text: So I think, I think there's a very tight connection between a plot like this and

0:44:58 - 0:45:02     Text: these sort of in-context learning plots.

0:45:02 - 0:45:06     Text: Basically the more information you're getting, I mean all of these models probably know the

0:45:06 - 0:45:11     Text: unigram distribution of words and tokens pretty well.

0:45:11 - 0:45:15     Text: But the bigger model is getting much, much, much more information from its context than

0:45:15 - 0:45:17     Text: the smaller models.

0:45:17 - 0:45:21     Text: And at a certain point, I mean, it depends on your training distribution and all sorts

0:45:21 - 0:45:22     Text: of other things.

0:45:22 - 0:45:27     Text: But like, one of the things that we do is when we see several examples of something happening

0:45:27 - 0:45:32     Text: in a text, we guess that that's what we're going to see next.

0:45:32 - 0:45:37     Text: And that's really probably embedded in a ton of text that's out there on the internet

0:45:37 - 0:45:38     Text: and in books.

0:45:38 - 0:45:41     Text: And models have to decrease their loss somehow.

0:45:41 - 0:45:43     Text: That's a pattern in the text.

0:45:43 - 0:45:48     Text: It's a pattern that models eventually learn and they seemingly apply this knowledge.

0:45:48 - 0:45:53     Text: I think there are other people, of course, who've worked on this question more specifically and have more specific theories.

0:45:53 - 0:46:02     Text: But at kind of an intuitive level, that's how I would think about it.

0:46:02 - 0:46:12     Text: I guess one final evaluation: you can ask, can people tell whether a piece of text was written by a language model or by a human?

0:46:12 - 0:46:16     Text: This is an evaluation where we looked at short news articles.

0:46:16 - 0:46:22     Text: These are two or three paragraphs, and we generated equivalent news articles from GPT-3.

0:46:22 - 0:46:27     Text: And by the time you get to sort of the largest models, people are approaching chance accuracy

0:46:27 - 0:46:28     Text: at being able to tell the difference.

0:46:28 - 0:46:35     Text: This has a lot of implications. I mean, it's interesting and surprising as a statement about language modeling.

0:46:35 - 0:46:38     Text: But it's also somewhat scary.

0:46:38 - 0:46:47     Text: It means that with these language models, it's very difficult to tell that you're talking to a language model if you don't have a very long conversation.

0:46:47 - 0:46:48     Text: Yeah.

0:46:48 - 0:46:49     Text: Hi.

0:46:49 - 0:47:08     Text: So I'm wondering, for this specific result: since these models are trained on so much text from the web, have you checked whether the generated news articles are actually memorized from documents the model saw during training?

0:47:08 - 0:47:12     Text: I actually don't know the answer to that question for this particular analysis off the top of

0:47:12 - 0:47:14     Text: my head.

0:47:14 - 0:47:19     Text: I believe that these are not memorized.

0:47:19 - 0:47:41     Text: One simple thing you can do, at least for things that occur frequently, is to look at the distribution of the loss for a model on its own samples. For things that are very clearly memorized, obviously they probably appear frequently in the training set, but also the loss tends to be much, much lower on memorized samples.

0:47:41 - 0:47:55     Text: You can intuitively understand this: if there are 100 words that are sampled out exactly verbatim, and you're sampling at temperature equal to one, then all of the next-word predictions have to be extremely, extremely confident.

0:47:55 - 0:47:57     Text: And that means the loss has to be super low.

0:47:57 - 0:48:09     Text: So, just informally, something that I've done to get rid of memorized samples is compute the loss, and usually you'll just see a pretty clear bimodal distribution, where there'll be a few memorized examples and then things that aren't.

0:48:09 - 0:48:10     Text: That's a simple thing you can do to check.
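
A minimal sketch of that loss-based check, assuming the HuggingFace transformers API with gpt2 as a stand-in model (the lecture's own models are not assumed here); the sample strings and the cutoff value are hypothetical placeholders chosen purely for illustration:

```python
# Sketch: flag likely-memorized samples by looking for the low-loss mode in the
# per-sample loss distribution, as described above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # hypothetical stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_token_loss(text: str) -> float:
    """Average next-token cross-entropy of `text` under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # labels=input_ids gives the shifted LM loss
    return out.loss.item()

samples = ["first generated sample ...", "second generated sample ..."]  # placeholders
losses = [mean_token_loss(s) for s in samples]

# Verbatim-memorized samples tend to sit in a much lower-loss mode, so in practice
# you histogram the losses, eyeball the bimodal split, and threshold between modes.
threshold = 1.0  # hypothetical cutoff read off the histogram
likely_memorized = [s for s, l in zip(samples, losses) if l < threshold]
```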

0:48:10 - 0:48:18     Text: You can also, of course, do deduplication. I don't remember off the top of my head what deduplication was done here, though.

0:48:18 - 0:48:29     Text: On the downstream task section: can you say anything about how the scaling looks on transfer or adversarial evaluations?

0:48:29 - 0:48:35     Text: I don't think I have anything particularly clear to say about that.

0:48:35 - 0:48:44     Text: I mean, these evals, I think, are not adversarial, in the sense that they're just few-shot evaluations with some fixed data set.

0:48:44 - 0:48:51     Text: There are a large number of different kinds of adversarial data sets out there for reasoning,

0:48:51 - 0:48:54     Text: for common sense knowledge, for truthfulness.

0:48:54 - 0:48:57     Text: So, I mean, there's, for example, TruthfulQA.

0:48:57 - 0:49:01     Text: This is an example where there aren't any trends like this and arguably the trends go

0:49:01 - 0:49:06     Text: downward, though it depends on your training distribution and some models actually do improve.

0:49:06 - 0:49:08     Text: So I think that's a complicated question.

0:49:08 - 0:49:11     Text: I think it's hard to find examples where the trends go down.

0:49:11 - 0:49:16     Text: I don't think it's easy, but these do exist.

0:49:16 - 0:49:21     Text: Any other questions?

0:49:21 - 0:49:25     Text: Great.

0:49:25 - 0:49:35     Text: So, I guess I'll sort of end by summarizing some lessons that you might draw pretty practically

0:49:35 - 0:49:38     Text: for research from this.

0:49:38 - 0:49:43     Text: And then I can either open it up for questions, or I can also, I can always talk infinitely

0:49:43 - 0:49:44     Text: long.

0:49:44 - 0:49:48     Text: I've been a professor for like 10 years of my life, so I can just talk forever.

0:49:48 - 0:49:52     Text: But I'll sort of end after talking about some lessons.

0:49:52 - 0:50:05     Text: So I think one lesson that I kind of draw from this is that scanning over some of the important inputs to your training process is just a pretty useful thing to do when you're doing ML research.

0:50:05 - 0:50:08     Text: And it's sort of typically very cheap.

0:50:08 - 0:50:13     Text: It's cheap because generally most things vary in an important way on a log scale, or

0:50:13 - 0:50:17     Text: sort of on a geometric scale, however you want to say it.

0:50:17 - 0:50:21     Text: And that means that like if you're training with the data set of size D, maybe you should

0:50:21 - 0:50:25     Text: also train with D over 2 and D over 4 and D over 8 or something like that.

0:50:25 - 0:50:28     Text: And if you sum that geometric series, you get 2D.

0:50:28 - 0:50:34     Text: So you sort of, I mean, you made your training process twice as expensive in some sense, but

0:50:34 - 0:50:37     Text: it's not really a big change in what you have to do.

0:50:37 - 0:50:42     Text: But you can often learn a lot about what's going on by doing these kinds of scans.
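
A minimal sketch of that kind of geometric scan over dataset size; the token budget and the training call are hypothetical placeholders, and the point is just that the extra runs cost less than one additional full run:

```python
# Geometric scan over dataset size: training on D, D/2, D/4, D/8 costs
# D * (1 + 1/2 + 1/4 + 1/8) = 1.875 * D tokens in total, i.e. under 2D.
full_size = 1_000_000_000  # hypothetical token budget D
subset_sizes = [full_size // (2 ** i) for i in range(4)]  # [D, D/2, D/4, D/8]

print(subset_sizes)
print(sum(subset_sizes) / full_size)  # 1.875, i.e. less than twice the cost

# for size in subset_sizes:
#     train_and_evaluate(first_n_tokens(size))  # hypothetical training/eval calls
```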

0:50:42 - 0:50:47     Text: And so, I mean, this is an example of some data that I didn't show earlier.

0:50:47 - 0:50:53     Text: So one thing you might wonder about is what happens if you scan over data set size and model size at the same time.

0:50:53 - 0:50:57     Text: And it turns out there's some very simple trends that you can model in that case too that

0:50:57 - 0:51:00     Text: tell you about things like overfitting.

0:51:00 - 0:51:11     Text: And I mean, if you care about overfitting, then this tells you something like how big you have to make your data set for a given model size to avoid overfitting being a significant problem, so you can answer all kinds of questions like that.

0:51:11 - 0:51:18     Text: And I at least find that this is kind of useful and it's nice for learning things about

0:51:18 - 0:51:19     Text: behavior.

0:51:19 - 0:51:24     Text: And I think alongside that, I think like this is sort of a joke.

0:51:24 - 0:51:25     Text: This isn't real.

0:51:25 - 0:51:29     Text: This is sort of making fun of a large number of machine learning papers that you might

0:51:29 - 0:51:30     Text: see.

0:51:30 - 0:51:34     Text: I think a lot of machine learning papers have tables like this.

0:51:34 - 0:51:45     Text: And it's sort of hard to tell from this kind of table (obviously I'm making fun, but I think it's not so unrealistic) whether the technique that went into our model really improved on other things that happened.

0:51:45 - 0:51:53     Text: And I think that this kind of plot, at least for me, is a much more convincing statement that, well, clearly transformers are just better than LSTMs.

0:51:53 - 0:52:04     Text: So the slogan here is sort of that these scaling trends are the real test of success for new techniques, if your goal is to improve a model, if that is your goal.

0:52:04 - 0:52:08     Text: And I think it's at least to me much more convincing and kind of clear what's going on

0:52:08 - 0:52:10     Text: if you see these trends.
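
A minimal sketch of how such a trend comparison could be set up, with synthetic losses generated from an assumed power law purely to illustrate the fitting step (none of these numbers come from the lecture); you would repeat the fit for each architecture and compare the fitted lines:

```python
# Fit a power law L(N) = a * N^(-alpha) to loss-vs-parameters data by a
# straight-line fit in log-log space, then compare the fits across architectures.
import numpy as np

sizes = np.logspace(6, 9, 7)  # hypothetical model sizes: 1M .. 1B parameters
rng = np.random.default_rng(0)
# Synthetic losses from an assumed power law with a little noise (illustration only).
losses = 20.0 * sizes ** (-0.08) * np.exp(rng.normal(0.0, 0.01, sizes.size))

slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
print(f"fitted exponent: {-slope:.3f}")  # close to the 0.08 used to generate the data

# Doing this for, say, an LSTM family and a transformer family gives two trend
# lines whose offset and slope are far more informative than a one-row table.
```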

0:52:10 - 0:52:13     Text: Maybe I have another slide making fun of this.

0:52:13 - 0:52:29     Text: So I mean, I think this is a thing that I actually see very often in research: you come up with some new idea, you first do the cheapest, easiest experiment, and you see, well, my new idea improved performance.

0:52:29 - 0:52:31     Text: I'm really excited.

0:52:31 - 0:52:33     Text: Everyone should adopt this.

0:52:33 - 0:52:38     Text: But then you make some plot like this and you sort of say, oh, okay, I guess it doesn't

0:52:38 - 0:52:40     Text: really matter that much at all.

0:52:40 - 0:52:42     Text: And I think this is actually common.

0:52:42 - 0:52:44     Text: I mean, I think we all have all sorts of ideas.

0:52:44 - 0:52:50     Text: I mean, people fall asleep at night and they can't sleep and then they wake up and they

0:52:50 - 0:52:52     Text: have ideas and like, oh, I'm going to go try this.

0:52:52 - 0:52:53     Text: We all do it.

0:52:53 - 0:52:57     Text: But oftentimes they don't work and I think this is sort of useful for understanding whether

0:52:57 - 0:52:59     Text: your idea really, really works.

0:52:59 - 0:53:05     Text: And I mean, if all you're ever going to do is train this model, then your idea did work.

0:53:05 - 0:53:13     Text: But I think that like there's sort of an expectation that probably people will be using bigger

0:53:13 - 0:53:15     Text: computers to train larger models in the future.

0:53:15 - 0:53:18     Text: And so the ideas that are really going to have a huge impact are ones that sort of point

0:53:18 - 0:53:20     Text: in the opposite direction.

0:53:20 - 0:53:24     Text: I've even seen ideas where on small models they make no difference at all, but on larger

0:53:24 - 0:53:28     Text: models they do better.

0:53:28 - 0:53:31     Text: And so these kinds of trends I think are useful.

0:53:31 - 0:53:34     Text: And they're certainly useful to think about.

0:53:34 - 0:53:53     Text: Another point that I find useful, though it's not obvious and maybe you shouldn't trust it completely, is that I tend to think, because I've sort of swallowed my own Kool-Aid, that if something works, then it should scale fairly predictably.

0:53:53 - 0:54:12     Text: It's not always true, but for things that you can measure that are very close to your optimization target, if your training process, your hyperparameters, etc., are all set up well, then I tend to think that you should see some kind of predictable trend.

0:54:12 - 0:54:18     Text: And if that trend goes away, then I mean, maybe that's just exactly what's true.

0:54:18 - 0:54:21     Text: But I think often it means that there's something broken about what's going on.

0:54:21 - 0:54:26     Text: Maybe your numerics are broken and you need higher precision in some part of your model,

0:54:26 - 0:54:29     Text: maybe there's some bottleneck you hadn't thought of.

0:54:29 - 0:54:35     Text: So I mean, this is also an example of how predictable scaling can be found all over the place.

0:54:35 - 0:54:39     Text: So I just think this is sort of neat.

0:54:39 - 0:54:51     Text: So if you just train these extremely naive, very simple multimodal models, where you use a decoder-only transformer to either model the text based on the image or model the image based on the text, then you can do that.

0:54:51 - 0:54:56     Text: And measure a sort of empirical mutual information between the image and the text.

0:54:56 - 0:55:02     Text: How much information did the image give you about the words in the sense of sort of Shannon

0:55:02 - 0:55:04     Text: information?

0:55:04 - 0:55:08     Text: And or conversely, how much information did the text give you about the image?

0:55:08 - 0:55:12     Text: And this is also a place where, I mean, this is very close to the optimization target.

0:55:12 - 0:55:16     Text: The whole point of the multimodal model is to get this information.

0:55:16 - 0:55:28     Text: And you see that there's some predictable scaling going on, where larger models are getting more information about one part of the data distribution from the other.

0:55:28 - 0:55:36     Text: But I think this is sort of a general thing that you should expect in model training.
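
One hedged way to read what's being measured here: with L(text) the loss on the text alone and L(text | image) the loss when the image is prepended as context, the empirical mutual information is roughly the gap between the two,

$$
\hat{I}(\text{image};\ \text{text}) \;\approx\; L(\text{text}) - L(\text{text} \mid \text{image}),
$$

and symmetrically for how much the text tells you about the image; a larger model that extracts more from the conditioning side shows a larger gap.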

0:55:36 - 0:55:42     Text: And so maybe to sort of summarize, maybe even bigger picture implications.

0:55:42 - 0:55:56     Text: I think that these kinds of results suggest that, well, maybe scaling isn't the best or the smartest or the most interesting way to make better ML models, and maybe it won't be the way that happens in the future.

0:55:56 - 0:56:00     Text: But at least I think these results suggest that there aren't any really hard conceptual

0:56:00 - 0:56:08     Text: barriers preventing people from training significantly more powerful models of all kinds, including

0:56:08 - 0:56:13     Text: of course, language models in AI research.

0:56:13 - 0:56:57     Text: I think, certainly from my perspective, coming to machine learning fresh as a physicist about five years ago, this is one set of abstractions for thinking about what's going on in AI research: if you're going to be training fairly large models and you want them to do well, then you probably want your models to be scaling well in terms of their performance. And I've found this framework useful: maybe there's a bottleneck, but if you remove the bottleneck, then you'll just continue to see further progress.

0:56:57 - 0:57:02     Text: I think another point that, well, maybe I'll make this point at the end.

0:57:02 - 0:57:06     Text: Another point is that, yeah, scaling laws are just sort of all over the place and they

0:57:06 - 0:57:11     Text: can help you to sort of maybe organize your research a bit.

0:57:11 - 0:57:31     Text: And then, maybe the most interesting point conceptually, though, is that it seems like, if you believe this kind of story, many domains of ML are kind of surprisingly simple and universal; things that you might not have thought are the same are more similar than they are different.

0:57:31 - 0:57:34     Text: And of course, this is also a fascinating thing to try to understand.

0:57:34 - 0:57:41     Text: So I mean, I was a theoretical physicist for most of my life, so I mostly tried to understand

0:57:41 - 0:57:46     Text: things that seem extremely esoteric and weird and why would anyone care about them.

0:57:46 - 0:57:51     Text: This is a thing that I think probably, probably everyone in this room kind of cares about,

0:57:51 - 0:57:55     Text: like, can AI models write, can they communicate in language?

0:57:55 - 0:58:10     Text: And these kinds of trends are really, really nice; they're the kind of trends that you might see in a very controlled physics experiment or something, and yet they're coming out of something very, very noisy and random, like predicting language data on the internet.

0:58:10 - 0:58:16     Text: So I think it's very interesting to think about, like, why are these kinds of trends true?

0:58:16 - 0:58:21     Text: What is the underlying kind of theory or science here that makes these trends true?

0:58:21 - 0:58:23     Text: Can we predict it?

0:58:23 - 0:58:24     Text: Can we refine those predictions?

0:58:24 - 0:58:28     Text: Can we understand why and when this doesn't occur?

0:58:28 - 0:58:30     Text: Another question is sort of, there are some exponents here.

0:58:30 - 0:58:34     Text: This is a straight line, but the straight line represents a power law with a particular

0:58:34 - 0:58:35     Text: exponent.

0:58:35 - 0:58:36     Text: Why that exponent?

0:58:36 - 0:58:39     Text: For language, it's like 0.0 H or so.

0:58:39 - 0:58:42     Text: Why 0.08 and not 0.2 or 0.4 or 0.001?
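
Written out, the straight line on the log-log plot corresponds to a power law roughly of the form

$$
L(N) \;\approx\; \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.08 \ \text{for language},
$$

so that log L is linear in log N with slope minus alpha_N; the exponent here is just the approximate value mentioned above, not an exact figure.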

0:58:42 - 0:58:45     Text: I think there are all sorts of questions here.

0:58:45 - 0:58:49     Text: When you see data that has a very clear trend, it's very interesting to understand, to try

0:58:49 - 0:58:54     Text: to think about why is something so simple happening.

0:58:54 - 0:58:56     Text: And I'll sort of leave you with that.

0:58:56 - 0:58:58     Text: Yeah.

0:58:58 - 0:59:16     Text: Have you thought at all about how these scaling laws compare with human beings?

0:59:16 - 0:59:23     Text: So your picture is essentially: make everything bigger and avoid bottlenecks.

0:59:23 - 0:59:33     Text: Whereas I guess human beings, well, we're fine on the number of parameters, since the brain still seems to have several orders of magnitude more there.

0:59:33 - 0:59:40     Text: But it seems like we're not very good on the amount of compute we use.

0:59:40 - 0:59:54     Text: Human compute is very constrained, partly because of slow processing and partly because of energy demands, so we can't be using most of our parameters most of the time.

0:59:54 - 1:00:18     Text: And data is a little bit complicated, because I guess we get a ton of data overall, but if you think about the amount of language data we get, for fully complex language use it's something like three orders of magnitude down from where GPT-3 is now.

1:00:18 - 1:00:22     Text: And yet something good seems to happen.

1:00:22 - 1:00:24     Text: Any thoughts on that?

1:00:24 - 1:00:27     Text: I mean, I think it's a fantastic question.

1:00:27 - 1:00:31     Text: I don't have anything to say that isn't quite speculative.

1:00:31 - 1:00:32     Text: So I mean, I don't have any good answer to the question.

1:00:32 - 1:00:34     Text: I think it's a great question.

1:00:34 - 1:00:37     Text: I guess one thing that seems like it's true is that sort of the factor of a thousand

1:00:37 - 1:00:40     Text: you mentioned seems pretty common.

1:00:40 - 1:00:44     Text: I mean, my impression is that AlphaGo probably plays like a thousand times more games when

1:00:44 - 1:00:49     Text: it trains than like a go master does.

1:00:49 - 1:00:55     Text: I think this is a pretty common factor to see in a lot of like ML contexts.

1:00:55 - 1:00:56     Text: But I have no idea why it is.

1:00:56 - 1:00:59     Text: I don't know if it's that evolution optimized us to learn fast.

1:00:59 - 1:01:22     Text: If we have some hard-coded information, or if the sort of multimodal inputs that we have help a lot. You might imagine that when you have a system that's already pretty smart, reinforcement learning or active learning of some form becomes more and more important, because for these language models, or for a person: if I read a physics textbook, I don't really learn a lot in a certain sense, because I already learned physics.

1:01:22 - 1:01:24     Text: And I think the same is probably true for these models.

1:01:24 - 1:01:29     Text: So as the models get smarter, this sort of very dumb next word prediction task is giving

1:01:29 - 1:01:34     Text: you less and less information, but you might expect to get more and more information if

1:01:34 - 1:01:37     Text: you did something more active.

1:01:37 - 1:01:41     Text: I can continue to speculate, but I don't really know anything about it.

1:01:41 - 1:01:47     Text: I don't have anything well established to tell you.

1:01:47 - 1:01:48     Text: It's a great question.

1:01:48 - 1:02:09     Text: So if you compare the transformers and the LSTMs, the transformers are a bit better, but they're on a fairly similar curve, whereas it seems like human abilities would sit at a very different place on the graph.

1:02:09 - 1:02:14     Text: Yeah, I think that's absolutely, I think that's just true. The sample efficiency of these models is not similar.

1:02:14 - 1:02:19     Text: Another way of saying is that if you got into AI research to understand the human brain,

1:02:19 - 1:02:23     Text: it's very unclear whether we're making any progress on that.

1:02:23 - 1:02:32     Text: But, yeah, for a lot of these tasks, surprisingly, we don't seem to have to solve the brain to solve AI.

1:02:32 - 1:02:42     Text: I think they have a question.

1:02:42 - 1:02:43     Text: Yeah.

1:02:43 - 1:03:29     Text: [Audience question, largely inaudible: roughly, whether high-quality training data will run out, since much of what's on the internet isn't well written and the best material, like books and good science writing, is more limited.]

1:03:29 - 1:03:31     Text: Sure, sure, no, these are all great questions.

1:03:31 - 1:03:35     Text: So, I mean, sort of early on,

1:03:35 - 1:03:37     Text: I commented on some sources of data,

1:03:37 - 1:03:39     Text: and I mean, you're certainly correct about quality.

1:03:39 - 1:03:42     Text: I think in terms of quantity, I mean,

1:03:42 - 1:03:46     Text: I don't think anyone has, like, a digitized Library of Congress,

1:03:46 - 1:03:48     Text: but I think if you did, that would be like,

1:03:48 - 1:03:51     Text: I don't know, maybe 10x bigger than the training set for GPT-3.

1:03:51 - 1:03:54     Text: So, there's a sense in which there's probably quite a lot of,

1:03:54 - 1:03:57     Text: still quite high quality data that isn't in use.

1:03:57 - 1:03:59     Text: I don't know whether it will ever be in use,

1:03:59 - 1:04:01     Text: so it's a complicated question.

1:04:01 - 1:04:04     Text: And then, if you are willing to sort of take all of this garbage

1:04:04 - 1:04:08     Text: on the internet, or try to filter that garbage down,

1:04:08 - 1:04:11     Text: I think, I don't know how accurate this estimate is,

1:04:11 - 1:04:13     Text: but at an order-of-magnitude level,

1:04:13 - 1:04:15     Text: you can get something like 10 to the 15 words,

1:04:15 - 1:04:17     Text: which is a thousand times bigger.

1:04:17 - 1:04:20     Text: And of course, if you find any kind of intelligent way of filtering,

1:04:20 - 1:04:23     Text: then if you can filter down to 0.1% of that,

1:04:23 - 1:04:25     Text: and take the 0.1% that's best,

1:04:25 - 1:04:27     Text: then you do still have a lot of data.
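
Just to spell out that arithmetic:

$$
10^{15}\ \text{words} \times 0.1\% = 10^{12}\ \text{words},
$$

which, by the "thousand times bigger" estimate above, is still on the order of the GPT-3 training set itself, so even quite aggressive quality filtering leaves a sizable corpus.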

1:04:27 - 1:04:28     Text: So, I think for language modeling,

1:04:28 - 1:04:30     Text: there's definitely still some headroom,

1:04:30 - 1:04:34     Text: but this is certainly a constraint,

1:04:34 - 1:04:38     Text: and there are other kinds of data distributions

1:04:38 - 1:04:40     Text: where you'll run out sooner.

1:04:40 - 1:04:55     Text: And, yeah, of course, there are all sorts of other things you can explore: multimodal models, or switching to a different kind of loss function that is more interactive, or that is actually about accomplishing a task.

1:04:55 - 1:04:58     Text: But I think, for pure language modeling,

1:04:58 - 1:05:00     Text: it seems like there's at least some room left.

1:05:00 - 1:05:14     Text: And if you think you can increase your model size by a factor of 100 and increase your data set size by a factor of 10, which is sort of roughly what this is saying.

1:05:14 - 1:05:18     Text: If you believe that, then you can still scale up your model size a lot

1:05:18 - 1:05:21     Text: and have probably plenty of data.
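
One way to read those numbers: if the data requirement grows like a power of model size, D proportional to N to the beta, then

$$
N \to 100\,N \quad\Rightarrow\quad D \to 100^{\beta} D, \qquad 100^{1/2} = 10,
$$

so the "100x model, 10x data" heuristic corresponds to beta of about one half; the exact exponent depends on which fit you use, but the point is that, in this picture, data requirements grow much more slowly than model size.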

1:05:21 - 1:05:27     Text: But, yeah, you couldn't sort of do this stuff without the internet.

1:05:27 - 1:05:29     Text: Yeah, or, you know,

1:05:29 - 1:05:33     Text: do you want to share it?

1:05:33 - 1:05:34     Text: Sure, yeah.

1:05:34 - 1:05:53     Text: In terms of bottlenecks for improving these models over time, are you more optimistic about just training much larger models of the same kind, sort of hardware and scale improvements, or about architectural improvements, like the jump from LSTMs to transformers?

1:05:53 - 1:05:56     Text: I guess, I mean,

1:05:56 - 1:05:59     Text: I think I'm sort of optimistic about both.

1:05:59 - 1:06:16     Text: I think that my understanding, sort of the zeroth-order understanding of the hardware situation, is that connecting together GPUs and GPU-like objects works pretty well, and that interconnection speeds are increasing and can increase pretty easily.

1:06:16 - 1:06:20     Text: So I think that you don't need one chip to run your entire model.

1:06:20 - 1:06:25     Text: You can distribute your model over many, many, many accelerators.

1:06:25 - 1:06:29     Text: And I think you can do that if you're willing to pay for those accelerators,

1:06:29 - 1:06:32     Text: et cetera, then I think you can do that.

1:06:32 - 1:06:45     Text: Architectural improvements: I would say I typically haven't been super excited about architectural improvements, but I think there will continue to be architectural improvements.

1:06:45 - 1:06:49     Text: I think that, sort of, whenever you do something for the first time,

1:06:49 - 1:06:52     Text: or even just, like, whenever you train a really big model for the first time,

1:06:52 - 1:06:54     Text: you sort of don't do it in the best possible way,

1:06:54 - 1:06:58     Text: and there's a lot of, like, all sorts of different kinds of improvements.

1:06:58 - 1:07:02     Text: Maybe there are, sort of, non-incremental improvements that will look like big jumps.

1:07:02 - 1:07:04     Text: So yeah, I think that'll be both.

1:07:04 - 1:07:07     Text: So yeah, I mean, there's a sense in which,

1:07:07 - 1:07:10     Text: if all you did was look at this plot and just try to continue it,

1:07:10 - 1:07:14     Text: that might be an underestimate of progress that the field is going to make,

1:07:14 - 1:07:25     Text: because there will be improvements in architecture and algorithms and things like that.

1:07:25 - 1:07:57     Text: [Audience question, largely inaudible: roughly, whether models could generate their own training data, for example through self-play or other feedback loops, to keep this kind of scaling going.]

1:07:57 - 1:08:00     Text: I think that's a great question.

1:08:00 - 1:08:17     Text: I think the simplest version of this, well, a simple version of it that I think is probably important and increasingly important, is just reinforcement learning. Reinforcement learning is, in a certain sense, a situation where you generate your own data, because if you have a language model doing RL, then it writes something and then you're training on that data.

1:08:17 - 1:08:21     Text: So I definitely do think that that will sort of augment data

1:08:21 - 1:08:25     Text: and mean that there'll be other avenues for improvement.

1:08:25 - 1:08:28     Text: Literal data augmentation itself also seems plausible to me.

1:08:28 - 1:08:31     Text: I think it's not happening a lot because there still

1:08:31 - 1:08:34     Text: is more language data out there.

1:08:37 - 1:08:38     Text: Yeah.

1:08:38 - 1:09:20     Text: I think I've got two versions of this question. One is: coming from physics, is there any finding here that you found particularly surprising or unexpected, given your past experience?

1:09:20 - 1:09:38     Text: And the other part is: what brought you from theoretical physics into this kind of work in the first place?

1:09:38 - 1:09:49     Text: I think to me the most surprising thing about these sorts of results was probably that there is a very, very precise trend.

1:09:49 - 1:09:53     Text: It seems like, I mean, yeah, I mean, like, I think this is

1:09:53 - 1:09:56     Text: sort of an unusual thing, and I think when I saw that,

1:09:56 - 1:09:59     Text: I thought it was a really big deal.

1:09:59 - 1:10:05     Text: I think that, usually, I mean, it's just not true for most things you plot.

1:10:05 - 1:10:07     Text: I mean, obviously there are other plots that don't show

1:10:07 - 1:10:09     Text: this kind of trend, even if they're reasonable.

1:10:09 - 1:10:14     Text: I mean, like, I don't know, there's sort of a trend in TriviaQA, but I don't really know what that means.

1:10:14 - 1:10:17     Text: But I think the fact that there's something seemingly

1:10:17 - 1:10:22     Text: very precise is, I view that as like a very intriguing

1:10:22 - 1:10:25     Text: entry point to like try to dig into something,

1:10:25 - 1:10:27     Text: because it means that there's probably some deeper reason.

1:10:27 - 1:10:30     Text: And then the fact that it seems fairly universal

1:10:30 - 1:10:33     Text: across data distributions, again, suggests something like that.

1:10:33 - 1:10:40     Text: Yeah, the main difference between data distributions is that these exponents in the scaling laws are different.

1:10:40 - 1:10:43     Text: I mean, in terms of like coming from physics,

1:10:43 - 1:10:46     Text: I mean, I think I got into like a lot of this stuff partly

1:10:46 - 1:10:50     Text: because I'm fairly mercurial, and I was interested,

1:10:50 - 1:10:52     Text: and a lot of other friends I had were interested,

1:10:52 - 1:10:55     Text: and so we sort of studied it, and went from there,

1:10:55 - 1:10:58     Text: I had friends, et cetera.

1:10:58 - 1:11:01     Text: But I mean, from another point of view, I think I got involved

1:11:01 - 1:11:04     Text: in it for really weird reasons, perhaps, in the sense that like,

1:11:04 - 1:11:14     Text: I just knew a lot of people who were already, sort of around 2015, talking about things like: wow, how much better is AI going to get?

1:11:14 - 1:11:17     Text: What are the implications going to be for the world?

1:11:17 - 1:11:20     Text: Is this going to keep improving at this kind of pace?

1:11:20 - 1:11:29     Text: What are we going to do to make sure that these models are aligned with human values, to use the usual sort of phrase that's now used?

1:11:29 - 1:11:33     Text: And I sort of thought these people were weird and crazy,

1:11:33 - 1:11:35     Text: even though they were friends of mine, and I sort of said,

1:11:35 - 1:11:38     Text: oh, like this is really dumb, like I don't think that these

1:11:38 - 1:11:40     Text: AI models are really something to worry about.

1:11:40 - 1:11:45     Text: But like, I was still interested, and sort of was like,

1:11:45 - 1:11:47     Text: well, like smart people I know think that AI is improving

1:11:47 - 1:11:52     Text: very rapidly, and that might have a lot of impacts,

1:11:52 - 1:11:55     Text: and might require a lot of sort of caution and thought,

1:11:55 - 1:11:58     Text: and work to sort of make it safe.

1:11:58 - 1:12:01     Text: And so that was actually a significant motivation for me

1:12:01 - 1:12:02     Text: getting involved.

1:12:02 - 1:12:06     Text: It was a mixture of sort of, there being a lot of potentially

1:12:06 - 1:12:09     Text: really intellectually interesting questions,

1:12:09 - 1:12:12     Text: liking to sort of switch fields every few years,

1:12:12 - 1:12:16     Text: and friends of mine being very kind of concerned about this

1:12:16 - 1:12:21     Text: question, and yeah, that was sort of what brought me in.

1:12:21 - 1:12:38     Text: Given everything you've seen about how models scale, is there one factor or ingredient that you think has the most potential to change this picture?

1:12:38 - 1:12:43     Text: I mean, if we go back to sort of very basic ML ingredients,

1:12:43 - 1:12:47     Text: of like, what are these things, like,

1:12:47 - 1:12:49     Text: so there's a sense in which this is all you're doing,

1:12:49 - 1:12:52     Text: you choose one of each of these five things.

1:12:52 - 1:12:55     Text: I would guess that what the objective is,

1:12:55 - 1:12:58     Text: is most likely to sort of change things, in the sense that

1:12:58 - 1:13:00     Text: predicting the next word is really sort of one of the

1:13:00 - 1:13:03     Text: laziest sort of dumbest things you can do.

1:13:03 - 1:13:07     Text: And, I mean, there are all sorts of things,

1:13:07 - 1:13:10     Text: so it's really just chosen because you want to be able to

1:13:10 - 1:13:12     Text: compute, you want to be able to do back prop,

1:13:12 - 1:13:15     Text: and so you want to be able to get some differentiable thing,

1:13:15 - 1:13:17     Text: you want to be able to get a lot of data for which you can

1:13:17 - 1:13:23     Text: compute this differentiable thing, and so that's the game that you're playing.

1:13:23 - 1:13:26     Text: But, I think that you can have other objectives,

1:13:26 - 1:13:31     Text: like through reinforcement learning, or some other kind of active learning,

1:13:31 - 1:13:34     Text: whatever, I mean, some combination of such things.

1:13:34 - 1:13:40     Text: And, I sort of would just guess that generally performance will change a lot more.

1:13:40 - 1:13:43     Text: Like, if you're expecting sort of these trends to be very different,

1:13:43 - 1:13:45     Text: I would guess they're different if you have a different objective.

1:13:45 - 1:13:50     Text: I think changing the data distribution, or the model might also change things,

1:13:50 - 1:13:54     Text: but I think that, like, the lesson that I personally draw from something like this,

1:13:54 - 1:13:58     Text: is that even if you found a, like, really revolutionary change,

1:13:58 - 1:14:01     Text: that was, like, much better than transformers,

1:14:01 - 1:14:05     Text: it might be kind of equivalent to making transformers 10 times bigger,

1:14:05 - 1:14:10     Text: but I'm not sure if that would be as big of a deal as changing the loss.

1:14:10 - 1:14:12     Text: Changing what the objective is.

1:14:12 - 1:14:15     Text: But that's just my guess, I have no idea.

1:14:15 - 1:14:19     Text: And, of course, this paradigm, I mean, I think I was trying to be polite.

1:14:19 - 1:14:23     Text: I usually have, like, a picture of a grilled cheese here to emphasize,

1:14:23 - 1:14:26     Text: like, sort of, how simple and sort of silly this is,

1:14:26 - 1:14:30     Text: rather than this sort of very sophisticated palette of spices.

1:14:30 - 1:14:35     Text: And, I mean, maybe someone will say, like, this isn't the right set of ingredients

1:14:35 - 1:14:38     Text: from which to think about things, and there's a different thing you should do,

1:14:38 - 1:14:40     Text: and maybe that will make a big difference as well.

1:14:40 - 1:14:43     Text: But I, that's sort of an unknown, unknown.