0:00:00 - 0:00:09 Text: In theoretical physics, to get this kind of audience, you have to win the Nobel Prize
0:00:09 - 0:00:10 Text: or something.
0:00:10 - 0:00:15 Text: But, of course, I've been working on ML recently, and it's been much more exciting.
0:00:15 - 0:00:17 Text: There's a huge amount of interest.
0:00:17 - 0:00:22 Text: So to some extent, part of what I'll be talking about, maybe an implicit theme, will be sort
0:00:22 - 0:00:29 Text: of why there's so much excitement and why you might expect that excitement to continue.
0:00:29 - 0:00:37 Text: So an outline of my talk is that I'll first start by discussing motivations for language
0:00:37 - 0:00:38 Text: modeling.
0:00:38 - 0:00:42 Text: I'm sure you're all very well motivated because this is an NLP class.
0:00:42 - 0:00:47 Text: And I'll also talk about sort of orders of magnitude of data and compute that go into
0:00:47 - 0:00:49 Text: contemporary language modeling.
0:00:49 - 0:00:57 Text: And that will kind of set the stage for talking about scaling laws for neural language modeling.
0:00:57 - 0:01:03 Text: And further realization that these scaling laws seem to be quite universal for generative
0:01:03 - 0:01:07 Text: models and maybe for sort of machine learning more generally.
0:01:07 - 0:01:12 Text: And then, finally, after discussing that, I'll talk about what happens when we actually
0:01:12 - 0:01:15 Text: do scale up language models.
0:01:15 - 0:01:17 Text: I'll talk about the GPT-3 model.
0:01:17 - 0:01:22 Text: And if there's time, I'll talk about some lessons from all of these ideas for research,
0:01:22 - 0:01:27 Text: which I imagine many of you are excited to be involved with soon.
0:01:27 - 0:01:32 Text: So I'll start by talking about why we do language modeling and Fermi estimates for language
0:01:32 - 0:01:33 Text: modeling.
0:01:33 - 0:01:41 Text: By Fermi estimates, I mean questions like estimating: is it a million, or a hundred thousand, or
0:01:41 - 0:01:43 Text: ten thousand piano tuners in Chicago?
0:01:43 - 0:01:45 Text: Fermi famously asked this kind of question.
0:01:45 - 0:01:48 Text: And there are a lot of estimates like this that we can kind of do in a back of the envelope
0:01:48 - 0:01:54 Text: way to really get a sense for what's going on.
0:01:54 - 0:01:57 Text: But before going into that, why should you study
0:01:57 - 0:01:58 Text: language?
0:01:58 - 0:01:59 Text: This is sort of my motivation.
0:01:59 - 0:02:00 Text: You might have all sorts of other motivations.
0:02:00 - 0:02:04 Text: Language is obviously a very fascinating
0:02:04 - 0:02:08 Text: intellectual creation by our species.
0:02:08 - 0:02:12 Text: But I think another reason why it's particularly exciting for AI is that language is, in some
0:02:12 - 0:02:18 Text: sense, our species' best attempt to encode everything about the world in as efficient and
0:02:18 - 0:02:20 Text: compressed way as possible.
0:02:20 - 0:02:27 Text: And that means that it's very amenable to learning by an AI.
0:02:27 - 0:02:29 Text: And there's a lot of it.
0:02:29 - 0:02:34 Text: There's a huge quantity of writing freely available on the internet.
0:02:34 - 0:02:39 Text: And there are also a huge number of books, for example, I think very roughly speaking
0:02:39 - 0:02:41 Text: in this sort of Fermi estimate level.
0:02:41 - 0:02:46 Text: There's something like ten million books in the Library of Congress.
0:02:46 - 0:02:50 Text: And very, very roughly, that might mean there's something like a trillion words
0:02:50 - 0:02:51 Text: in those books.
0:02:51 - 0:02:56 Text: And then there's actually much more language information out on the internet.
0:02:56 - 0:03:01 Text: And so there's therefore a lot of data for AI models to learn from.
0:03:01 - 0:03:05 Text: And then a third reason, at least to some extent for me and maybe for many of you,
0:03:05 - 0:03:12 Text: is that if you're actually able to get an AI that kind of knows, quote, unquote, understands
0:03:12 - 0:03:16 Text: language, then you can communicate with it in a kind of natural way.
0:03:16 - 0:03:21 Text: You can ask it about anything, and you can get a lot of intuition from the responses
0:03:21 - 0:03:26 Text: and behaviors and all sorts of different kinds of evaluations you can perform on such
0:03:26 - 0:03:28 Text: a model.
0:03:28 - 0:03:34 Text: If you compare it to sort of ancient history of AI like excitement about classifying images
0:03:34 - 0:03:40 Text: from, I don't know, AlexNet ten years ago, in sort of the distant past.
0:03:40 - 0:03:44 Text: And from, say, AlphaGo, again in the distant past, five years ago, there's a lot more
0:03:44 - 0:03:46 Text: intuition you can get.
0:03:46 - 0:03:51 Text: And then you can use that to sort of understand what these models know and don't know and
0:03:51 - 0:03:52 Text: can do.
0:03:52 - 0:03:59 Text: And you can also think about this in terms of how to make these models aligned with what
0:03:59 - 0:04:00 Text: humans prefer.
0:04:00 - 0:04:08 Text: There's a lot of work on trying to understand language model bias, racism, other such issues,
0:04:08 - 0:04:12 Text: and there's really a lot that you can kind of explore and dig into.
0:04:12 - 0:04:18 Text: So I imagine this is all very basic for everyone here, but just so we're on the same page.
0:04:18 - 0:04:22 Text: If you're doing kind of contemporary neural network based machine learning, the ingredients
0:04:22 - 0:04:26 Text: that you need to get started are really surprisingly simple.
0:04:26 - 0:04:30 Text: You need some kind of model to parameterize a function.
0:04:30 - 0:04:32 Text: You need a data set.
0:04:32 - 0:04:36 Text: You need some computers with plenty of computation.
0:04:36 - 0:04:41 Text: You need a loss function, and you need some choice of optimizer.
0:04:41 - 0:04:47 Text: And basically for pretty much everything in this talk, I'll be thinking about language
0:04:47 - 0:04:54 Text: modeling as a task where the loss function is simply to predict the next word in some
0:04:54 - 0:04:56 Text: sentence or paragraph or book.
0:04:56 - 0:05:00 Text: And so that's how basically all of the models that I'll be talking about are trained.
0:05:00 - 0:05:05 Text: They have a loss function which incentivizes them to predict the correct probability
0:05:05 - 0:05:08 Text: distribution for the next word.
0:05:08 - 0:05:13 Text: So what about these other ingredients like the models that we use, the data sets that
0:05:13 - 0:05:17 Text: we use, and how much computation do we use?
0:05:17 - 0:05:22 Text: What are those sort of order of magnitude figures?
0:05:22 - 0:05:28 Text: So one way to think about this is sort of how much language do we consume as a person
0:05:28 - 0:05:29 Text: for comparison.
0:05:29 - 0:05:35 Text: So you can imagine that if you were a very voracious reader, maybe you'd read a long
0:05:35 - 0:05:41 Text: book every day and you'd spend your life doing that, maybe you'd live for 70 years, if
0:05:41 - 0:05:48 Text: you did that, you'd end up reading something like two billion words over your lifetime.
0:05:48 - 0:05:56 Text: For comparison, a canonical large language model, GPT-3, was trained for on the order of
0:05:56 - 0:05:58 Text: 200 billion words.
0:05:58 - 0:06:04 Text: So that's about 100 times more language data than maybe you'd see in your lifetime if
0:06:04 - 0:06:09 Text: you kind of tried really hard to attend to written text.
0:06:09 - 0:06:15 Text: There are other data sets, of course, that are much, much bigger than GPT-3's training
0:06:15 - 0:06:17 Text: set.
0:06:17 - 0:06:22 Text: There's Common Crawl, which is a sort of snapshot of the internet that anyone can
0:06:22 - 0:06:28 Text: go out and download if you like; this has very roughly on the order of 10 to the 15
0:06:28 - 0:06:31 Text: words.
0:06:31 - 0:06:37 Text: I said earlier that the Library of Congress has something like maybe 10 million books,
0:06:37 - 0:06:39 Text: each book is maybe 100,000 words.
0:06:39 - 0:06:45 Text: So the Library of Congress in total maybe has something like a trillion words.
0:06:45 - 0:06:52 Text: And as another sort of smaller data set example, English Wikipedia is very roughly of
0:06:52 - 0:06:54 Text: order three billion words.
0:06:54 - 0:06:59 Text: So maybe if you spent your whole life reading Wikipedia, you could just barely do it if
0:06:59 - 0:07:03 Text: that was your mission.
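[Editor's note: these order-of-magnitude estimates can be reproduced in a few lines. The per-book word count and reading rate below are the rough assumptions from the talk, not precise figures:]

```python
# Back-of-the-envelope word counts (order of magnitude only).
words_per_book = 1e5                               # a typical long book
library_of_congress = 1e7 * words_per_book         # ~10 million books
lifetime_reading = 70 * 365 * words_per_book       # one book a day for 70 years
gpt3_training_words = 2e11                         # ~200 billion words
english_wikipedia = 3e9                            # ~3 billion words

print(f"Library of Congress: ~{library_of_congress:.0e} words")  # ~1e12
print(f"Lifetime reading:    ~{lifetime_reading:.1e} words")     # ~2.6e9
print(f"GPT-3 vs. lifetime:  ~{gpt3_training_words / lifetime_reading:.0f}x")
```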
0:07:03 - 0:07:09 Text: So what about the actual neural networks that I'll be talking about that we currently
0:07:09 - 0:07:13 Text: seem to be using fairly effectively to model language?
0:07:13 - 0:07:19 Text: So I'll be talking about transformer language models, so-called decoder-only transformer
0:07:19 - 0:07:23 Text: language models of which GPT-3 is an example.
0:07:23 - 0:07:29 Text: And just to sort of put numbers on things, these models have, with kind of the standard way that they're
0:07:29 - 0:07:36 Text: set up, a number of parameters which is something like 12 times the number of layers in the
0:07:36 - 0:07:37 Text: network
0:07:37 - 0:07:38 Text: (so GPT-3 has 96 layers, and you can make such networks deeper or shallower)
0:07:38 - 0:07:46 Text: times the sort of activation dimension squared.
0:07:46 - 0:07:55 Text: So d_model, this d_model parameter, is just the dimension of the vector space that each
0:07:55 - 0:08:03 Text: token (or word, if you were to use words as tokens) occupies when you run this model on
0:08:03 - 0:08:05 Text: language data.
0:08:05 - 0:08:09 Text: And so this gives you some sense for where the parameter count comes from.
0:08:09 - 0:08:16 Text: I think d_model for GPT-3 is of order 10,000 and the number of layers is 96, and that's how you get roughly
0:08:16 - 0:08:22 Text: 200 billion parameters in that model, and other models scale similarly.
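[Editor's note: that rule of thumb is easy to check. A minimal sketch, using the published GPT-3 values of 96 layers and d_model = 12288, and ignoring embedding and bias parameters:]

```python
# Rough decoder-only transformer parameter count: 12 * n_layers * d_model^2.
# This ignores embeddings, biases, and layer norms, which are subleading.
def approx_params(n_layers: int, d_model: int) -> int:
    return 12 * n_layers * d_model ** 2

# GPT-3: 96 layers, d_model = 12288 (i.e. "of order 10,000").
print(f"{approx_params(96, 12288):.2e}")  # ~1.74e+11, roughly 175B parameters
```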
0:08:22 - 0:08:29 Text: Now, how much computation do you actually do when you train this kind of model?
0:08:29 - 0:08:34 Text: Well it turns out that different neural network architectures have different properties that
0:08:34 - 0:08:40 Text: affect this question, but transformers are actually quite simple in that in a forward
0:08:40 - 0:08:48 Text: pass of a transformer, every parameter on every token performs roughly one add and one
0:08:48 - 0:08:53 Text: multiply and then about twice this in the backward pass.
0:08:53 - 0:08:57 Text: And so that gives us a very simple formula: the number of floating point operations
0:08:57 - 0:09:06 Text: that a model like this performs during training is 6, which is 2 times (1 plus 2), times the
0:09:06 - 0:09:11 Text: number of parameters N in the model times the number of tokens D, which is sort of the
0:09:11 - 0:09:15 Text: size of the data set in tokens that you process.
0:09:15 - 0:09:22 Text: And one other point that sort of I'll make while kind of going over these estimates is
0:09:22 - 0:09:27 Text: that you might wonder whether or not there's a lot of computation involved in processing
0:09:27 - 0:09:29 Text: long sequences.
0:09:29 - 0:09:38 Text: There's sort of a famous point that dense attention in transformer models is n squared
0:09:38 - 0:09:41 Text: with respect to context length and that's absolutely true.
0:09:41 - 0:09:47 Text: However, if you actually work out the sort of coefficients, the ratio of the amount of
0:09:47 - 0:09:54 Text: computation you do in a forward pass or during training in the context direction versus
0:09:54 - 0:10:00 Text: in the direction of sort of moving up the layers of the model is roughly n_ctx over
0:10:00 - 0:10:02 Text: (12 times d_model).
0:10:02 - 0:10:10 Text: I note this just because if you think, as I'll kind of suggest, that this is a likely
0:10:10 - 0:10:16 Text: direction for the world to be heading, that models might continue to get bigger, then
0:10:16 - 0:10:18 Text: d_model for GPT-3 is already of order 10,000.
0:10:18 - 0:10:21 Text: So the denominator here is order 100,000.
0:10:21 - 0:10:25 Text: And so actually even if you have quite long contexts with the sort of dumbest possible
0:10:25 - 0:10:30 Text: dense attention, the amount of compute you actually do in the context direction is not
0:10:30 - 0:10:35 Text: always so much.
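[Editor's note: that ratio is small enough to check directly, under the talk's approximation that context-direction (attention) compute relative to per-parameter compute is about n_ctx / (12 * d_model):]

```python
# Fraction of training compute spent in the "context direction" (dense
# attention) relative to the per-parameter compute, per the talk's estimate.
def attention_fraction(n_ctx: int, d_model: int) -> float:
    return n_ctx / (12 * d_model)

# At GPT-3 scale the denominator is ~150,000, so even a 2048-token
# dense-attention context adds only a percent or so of extra compute.
print(f"{attention_fraction(2048, 12288):.3f}")  # ~0.014
```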
0:10:35 - 0:10:40 Text: What about actually numerical values for this compute?
0:10:40 - 0:10:44 Text: So the largest models that we have so far, if we're in kind of Fermi estimate mode, we
0:10:44 - 0:10:48 Text: can round up and say they have say order a trillion parameters.
0:10:48 - 0:10:55 Text: If you have a model with a trillion parameters, then what kind of hardware are you going
0:10:55 - 0:10:56 Text: to run it on?
0:10:56 - 0:11:00 Text: Well, you might run it on A100 GPUs, at least this year.
0:11:00 - 0:11:07 Text: And an A100 GPU performs about 3 times 10 to the 14 floating point operations per second,
0:11:07 - 0:11:12 Text: or roughly 2.5 times 10 to the 19 floating point operations per day.
0:11:12 - 0:11:17 Text: This means that it's sort of convenient to sometimes use units of petaflop-days, which
0:11:17 - 0:11:22 Text: is 10 to the 15 floating point operations per second times a day.
0:11:22 - 0:11:26 Text: That's about 8.6 times 10 to the 19, or order 10 to the 20, floating point operations,
0:11:26 - 0:11:34 Text: which works out to about 3 A100-days.
0:11:34 - 0:11:40 Text: So how does sort of the compute available on hardware compare to the compute that we
0:11:40 - 0:11:42 Text: do when we train these gigantic models?
0:11:42 - 0:11:50 Text: Well, if we have a model with a trillion parameters and we train it for 300 billion tokens,
0:11:50 - 0:11:55 Text: then we get 6 times 10 to the 12 times 3 times 10 to the 11.
0:11:55 - 0:12:01 Text: And so we get on the order of 10 to the 24 floating point operations to train a trillion
0:12:01 - 0:12:06 Text: parameter model on one of these large data sets.
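[Editor's note: putting the C = 6ND formula together with the hardware figures above. The A100 throughput is the rough 3e14 FLOP/s from the talk:]

```python
# Total training compute C = 6 * N * D, converted into hardware units.
SECONDS_PER_DAY = 86_400
A100_FLOP_PER_S = 3e14                      # rough A100 throughput
PETAFLOP_DAY = 1e15 * SECONDS_PER_DAY       # ~8.64e19 FLOPs

N = 1e12    # one trillion parameters
D = 3e11    # 300 billion training tokens
C = 6 * N * D                               # ~1.8e24 FLOPs, order 1e24

print(f"C = {C:.1e} FLOPs")
print(f"  = {C / PETAFLOP_DAY:.0f} petaflop-days")
print(f"  = {C / (A100_FLOP_PER_S * SECONDS_PER_DAY):.0f} A100-days")
```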
0:12:06 - 0:12:09 Text: So these numbers involved, I mean, I think the thing that I find most amazing about this
0:12:09 - 0:12:14 Text: is that I still remember taking chemistry in high school.
0:12:14 - 0:12:19 Text: And in chemistry, you learn that sort of a macroscopic amount of stuff is
0:12:19 - 0:12:23 Text: an Avogadro's number of atoms, which is like 6 times 10 to the 23.
0:12:23 - 0:12:29 Text: So somehow we're actually able to build computers that, working together, do more than
0:12:29 - 0:12:33 Text: an Avogadro's number of computations to train these neural models.
0:12:33 - 0:12:37 Text: So anyway, I find these numbers kind of mind-boggling and also useful to sort of have in the
0:12:37 - 0:12:41 Text: back of your head to understand what's going on.
0:12:41 - 0:12:47 Text: So with that, unless there are any questions, I'll start talking about
0:12:47 - 0:12:54 Text: scaling laws for these kinds of language models.
0:12:54 - 0:13:03 Text: So what I'll basically be arguing is that there are very surprisingly precise empirical
0:13:03 - 0:13:10 Text: scaling laws for the performance of machine learning systems, machine learning models,
0:13:10 - 0:13:16 Text: as a function of kind of gross macroscopic inputs like how many parameters does the model
0:13:16 - 0:13:23 Text: have, how big is the data set, and how much compute is used for training.
0:13:23 - 0:13:29 Text: And I'll also make the point that if you're sort of in an airplane at 30,000 feet looking
0:13:29 - 0:13:33 Text: down on what's going on in the field, a lot of the other details in these systems don't
0:13:33 - 0:13:37 Text: matter all that much, or at least they don't matter as much as you might have expected
0:13:37 - 0:13:39 Text: that they would.
0:13:39 - 0:13:45 Text: Very often they just change some kind of constant pre-factor in these kinds of scaling
0:13:45 - 0:13:50 Text: laws, which give you kind of a big picture of what's changing as you really increase
0:13:50 - 0:13:52 Text: these inputs.
0:13:52 - 0:13:58 Text: And one way of sort of turning this into sort of a theme, what do you learn from it, how
0:13:58 - 0:14:04 Text: do you summarize it, is that getting these models to perform better is to a large extent
0:14:04 - 0:14:07 Text: about kind of avoiding bottlenecks.
0:14:07 - 0:14:09 Text: It's avoiding being blocked by something.
0:14:09 - 0:14:15 Text: And there are a lot of things that can block improvements in performance.
0:14:15 - 0:14:19 Text: The most obvious one, which is what scaling laws are studying, is you could not have enough
0:14:19 - 0:14:25 Text: data, you could not have a large enough model, you could not have enough computation to train
0:14:25 - 0:14:26 Text: that model.
0:14:26 - 0:14:31 Text: And then there are also a lot of other literal bottlenecks that you can think about, many
0:14:31 - 0:14:35 Text: of which involve sort of that information propagation through the network.
0:14:35 - 0:14:39 Text: So I guess like one way that I would summarize a lot of the most highly cited papers in machine
0:14:39 - 0:14:46 Text: learning in the last 10 years, papers like Resnets and LayerNorm, BatchNorm, things like
0:14:46 - 0:14:52 Text: that, is that they're sort of alleviating bottlenecks where information wasn't propagating
0:14:52 - 0:14:54 Text: nicely through your network.
0:14:54 - 0:15:00 Text: And the sort of simplest possible picture to sort of illustrate this, which perhaps is
0:15:00 - 0:15:04 Text: a cartoon of what's going on, something that I'll talk about later on with LSTMs, is
0:15:04 - 0:15:10 Text: that if you take a matrix, I mean neural networks are really just fancy systems that do a lot
0:15:10 - 0:15:11 Text: of matrix multiplication.
0:15:11 - 0:15:17 Text: If you take a matrix and you multiply it a large number of times, then very roughly speaking
0:15:17 - 0:15:24 Text: what you end up with is a projection onto its largest eigenspace.
0:15:24 - 0:15:29 Text: And so very roughly speaking, if you have a deep network and you sort of don't set it up
0:15:29 - 0:15:35 Text: correctly, it's very easy to be in a situation where you lose signal or lose information
0:15:35 - 0:15:37 Text: and you end up with, effectively, a trivial model.
0:15:37 - 0:15:43 Text: But anyway, that's sort of the philosophy that, at least at zeroth order, you might
0:15:43 - 0:15:48 Text: sort of reach from thinking about some of these results.
0:15:48 - 0:15:55 Text: So this slide is really about the kind of core results for scaling laws for language
0:15:55 - 0:15:56 Text: models.
0:15:56 - 0:15:59 Text: I'll explain it in some detail.
0:15:59 - 0:16:05 Text: So I'm actually going to start with the plot on the far right, which is about scaling
0:16:05 - 0:16:11 Text: laws with respect to the number of parameters in a neural network.
0:16:11 - 0:16:19 Text: And so what we did to generate this plot was get a very large data set such that we weren't
0:16:19 - 0:16:23 Text: worried about models overfitting at all.
0:16:23 - 0:16:30 Text: And train all of our models for a very long time so that they were essentially at convergence.
0:16:30 - 0:16:34 Text: So in other words, training time or compute was not a constraint on performance.
0:16:34 - 0:16:41 Text: And then plot the resulting test loss of language models, trained to predict the next word
0:16:41 - 0:16:45 Text: as a function of parameter count on a nice log scale.
0:16:45 - 0:16:49 Text: And so what you see is that there's this power law, which is a straight line on a log
0:16:49 - 0:16:56 Text: log plot of the loss as a function of the parameter count of these models.
0:16:56 - 0:17:01 Text: In the middle plot, we do the same thing, but switch the role of the amount of data that
0:17:01 - 0:17:03 Text: we have with parameter count.
0:17:03 - 0:17:08 Text: So we train a model that's very large, maybe one of the largest models on the plot on
0:17:08 - 0:17:15 Text: the right, so that model size is not a constraint on performance on data sets of various sizes.
0:17:15 - 0:17:17 Text: And we apply early stopping.
0:17:17 - 0:17:21 Text: So we measure the test loss at the point where the test loss is at its minimum during
0:17:21 - 0:17:24 Text: otherwise pretty naive straightforward training.
0:17:24 - 0:17:31 Text: And we find again a very clear power law for loss as a function of data set size.
0:17:31 - 0:17:35 Text: And then the most complicated plot is the one on the left.
0:17:35 - 0:17:45 Text: So on the left, we plot all of the learning curves for many different models.
0:17:45 - 0:17:48 Text: We provide these models with plenty of data so they're not overfitting.
0:17:48 - 0:17:52 Text: They're in the under parameterized regime.
0:17:52 - 0:17:57 Text: But we train all of these different model sizes for a very, very long time.
0:17:57 - 0:18:03 Text: And we measure on the x-axis not the number of training steps or training tokens, but
0:18:03 - 0:18:08 Text: the amount of compute that has been used so far during training.
0:18:08 - 0:18:12 Text: And as a consequence of one of the formulas that I wrote a couple of slides ago, that
0:18:12 - 0:18:18 Text: compute is six times parameter count times the amount of training data.
0:18:18 - 0:18:24 Text: If you take the logarithm of both sides, the log of parameters times data is log of parameters
0:18:24 - 0:18:25 Text: plus log of data.
0:18:25 - 0:18:29 Text: So what that means is that learning curves for models at different sizes are just shifted
0:18:29 - 0:18:34 Text: over left and right by constant amounts with the largest models on the sort of the far
0:18:34 - 0:18:38 Text: right of this curve and the smallest models on the left.
0:18:38 - 0:18:42 Text: So we have the learning curves for all of these models all put together.
0:18:42 - 0:18:47 Text: And so a question you can ask is sort of what is the best loss you can get for any given
0:18:47 - 0:18:52 Text: amount of training compute where you're allowing yourself to choose the model that does best
0:18:52 - 0:18:54 Text: for that amount of training compute?
0:18:54 - 0:18:59 Text: And that's what sort of the heavy black line and the orange fit are picking out.
0:18:59 - 0:19:05 Text: I mean formally you could call this the convex hull of all of these curves.
0:19:05 - 0:19:12 Text: And that again somewhat surprisingly seems to obey a very nice power law fit over many,
0:19:12 - 0:19:17 Text: many orders of magnitude in computation.
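[Editor's note: because a power law L(N) = (N_c / N)^alpha is a straight line on a log-log plot, fitting one is just linear regression in log space. A toy sketch on synthetic data; the exponent and constant here are illustrative, not the actual fitted values:]

```python
import numpy as np

# Synthetic losses following L(N) = (Nc / N)**alpha, the functional form
# of the parameter-count scaling law; alpha and Nc are illustrative.
alpha_true, Nc = 0.08, 1e13
N = np.logspace(6, 11, 20)                # model sizes from 1e6 to 1e11
rng = np.random.default_rng(0)
log_L = alpha_true * (np.log(Nc) - np.log(N)) + rng.normal(0, 0.01, N.size)

# On a log-log plot the law is linear, so a degree-1 fit recovers alpha.
slope, intercept = np.polyfit(np.log(N), log_L, 1)
print(f"fitted exponent: {-slope:.3f}")
```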
0:19:17 - 0:19:21 Text: And it's crucial for all of these experiments that you're only limiting performance with
0:19:21 - 0:19:23 Text: one thing at a time.
0:19:23 - 0:19:27 Text: On the far right you have plenty of data and compute, but you're limiting the number of
0:19:27 - 0:19:30 Text: parameters; in the middle you're limiting the amount of data, but you have a big model.
0:19:30 - 0:19:36 Text: On the left you're looking at training compute but you have all sorts of different model sizes
0:19:36 - 0:19:39 Text: and again plenty of data.
0:19:39 - 0:19:43 Text: So in other words in each of these cases there's sort of one of these parameters that's
0:19:43 - 0:19:48 Text: bottlenecking performance and otherwise you have plenty of resources.
0:19:48 - 0:19:50 Text: There's a question?
0:19:50 - 0:20:00 Text: I hope that's not true.
0:20:00 - 0:20:03 Text: So there's a minus sign in the exponent.
0:20:03 - 0:20:07 Text: I'm not sure if you're looking at the lines or the function.
0:20:07 - 0:20:09 Text: On the bottom, I see.
0:20:09 - 0:20:10 Text: I'll be right.
0:20:10 - 0:20:11 Text: Oh, I see.
0:20:11 - 0:20:12 Text: Okay.
0:20:12 - 0:20:17 Text: Yeah, they're just on a log-log plot.
0:20:17 - 0:20:20 Text: Yeah, please ask any questions.
0:20:20 - 0:20:22 Text: Great.
0:20:22 - 0:20:27 Text: And then the x-axis on this compute plot is this petaflop-day unit.
0:20:27 - 0:20:31 Text: That's why it's actually a small number.
0:20:31 - 0:20:36 Text: Any other questions about anything about this plot?
0:20:36 - 0:20:39 Text: Cool.
0:20:39 - 0:20:44 Text: So there's another thing that you can do that's kind of interesting with the plot on the
0:20:44 - 0:20:53 Text: left, which is you can ask for any given quantity of compute that you have available.
0:20:53 - 0:20:58 Text: Someone kindly donates to you some number of A100s to use for a few weeks and you want
0:20:58 - 0:21:01 Text: to use it to train the best possible language model you can.
0:21:01 - 0:21:07 Text: And so you can ask based on this plot on the left, how should I allocate the computation
0:21:07 - 0:21:14 Text: that was given to me in terms of making a bigger model or training longer?
0:21:14 - 0:21:20 Text: And it turns out there's sort of a simplified cartoon for the answer that we found with
0:21:20 - 0:21:26 Text: our language data, which was that you want to allocate most of your compute, basically
0:21:26 - 0:21:32 Text: two-thirds on a geometric scale, to making models bigger.
0:21:32 - 0:21:38 Text: And you can allocate about a third to training for longer on more data.
0:21:38 - 0:21:43 Text: And so this at least for us wasn't an obvious conclusion.
0:21:43 - 0:21:49 Text: It suggests that a lot of the gains that you're going to get if you want to get better
0:21:49 - 0:21:53 Text: performance with a fixed amount of compute, a fixed budget, is going to come from making
0:21:53 - 0:21:55 Text: your models bigger.
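[Editor's note: the allocation rule can be stated as a tiny function: if the compute budget grows by a factor f, grow the model by roughly f^(2/3) and the data by roughly f^(1/3), so the product still matches C = 6ND. This is the cartoon version from the talk, not an exact prescription:]

```python
# Cartoon compute-allocation rule: split extra compute roughly 2/3 toward
# model size and 1/3 toward data, on a geometric (logarithmic) scale.
def allocate(compute_factor: float) -> tuple[float, float]:
    model_factor = compute_factor ** (2 / 3)
    data_factor = compute_factor ** (1 / 3)
    return model_factor, data_factor

# A billion-fold compute increase: the model grows roughly a million-fold,
# the data only a thousand-fold.
m, d = allocate(1e9)
print(f"model x{m:.0e}, data x{d:.0e}")
```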
0:21:55 - 0:21:58 Text: And it turns out that in practice, I won't go into it in detail.
0:21:58 - 0:22:02 Text: You can, to some extent, just make your batch size bigger during training.
0:22:02 - 0:22:06 Text: And that means that the total number of serial steps that you train for doesn't have to
0:22:06 - 0:22:07 Text: increase all that much.
0:22:07 - 0:22:11 Text: You don't necessarily need to train for vastly longer.
0:22:11 - 0:22:16 Text: You seemingly just need a bigger model.
0:22:16 - 0:22:20 Text: And that's something that you read off from this compute plot that I showed.
0:22:20 - 0:22:28 Text: That's a great question.
0:22:28 - 0:22:37 Text: The way that you get this graph is you basically do an analysis where you look at any given
0:22:37 - 0:22:44 Text: point for compute and you look up and you pick out the blue curve that's closest to the
0:22:44 - 0:22:45 Text: black line.
0:22:45 - 0:22:49 Text: And that gives you a model size and an amount of training.
0:22:49 - 0:22:56 Text: And so you can do that for all of these different points on the x-axis.
0:22:56 - 0:22:59 Text: And then for any given point on this x-axis that tells you a model size.
0:22:59 - 0:23:04 Text: You learn model size as a function of your compute budget.
0:23:04 - 0:23:09 Text: And then, conversely, you also learn an amount of training, which is sort of a data set size.
0:23:09 - 0:23:16 Text: And so that's the explanation for the sort of million x model size versus a thousand
0:23:16 - 0:23:18 Text: x in data.
0:23:18 - 0:23:23 Text: I probably won't try to explain the batch size question.
0:23:23 - 0:23:27 Text: But it's basically based on some empirical analysis where you ask, how big can you make
0:23:27 - 0:23:28 Text: your batch size?
0:23:28 - 0:23:33 Text: How far can you push data parallelism without seeing diminishing returns?
0:23:33 - 0:23:37 Text: And that's sort of the rough answer from that question.
0:23:37 - 0:23:41 Text: They're all trained from scratch.
0:23:41 - 0:23:46 Text: So this is always almost everything that I'll talk about in this talk is training from
0:23:46 - 0:23:47 Text: scratch.
0:23:47 - 0:23:51 Text: Any other questions?
0:23:51 - 0:23:54 Text: Okay.
0:23:54 - 0:23:58 Text: And then there's another point that I don't want to overemphasize.
0:23:58 - 0:24:04 Text: But like I said, from a sort of very zeroth-order, naive perspective, for some of these
0:24:04 - 0:24:07 Text: results, architecture isn't the most crucial thing.
0:24:07 - 0:24:13 Text: So I think one of the biggest advances in machine learning in the last five or ten years
0:24:13 - 0:24:17 Text: has been the development of the transformer models that I'm talking about.
0:24:17 - 0:24:22 Text: But of course, you can do language modeling with a recurrent model that reads words in
0:24:22 - 0:24:24 Text: order.
0:24:24 - 0:24:30 Text: And of course, LSTMs or stacked LSTMs are sort of the standard way to do that.
0:24:30 - 0:24:36 Text: And so you can compare what you actually get if you study LSTMs versus transformers.
0:24:36 - 0:24:40 Text: And at zeroth order, it doesn't seem like LSTMs are so bad.
0:24:40 - 0:24:44 Text: It looks like as you make them bigger, they are scaling up quite nicely.
0:24:44 - 0:24:48 Text: But there's basically a constant offset where transformers are something like five or ten
0:24:48 - 0:24:53 Text: times more efficient for a given model size than LSTMs.
0:24:53 - 0:24:56 Text: And so I think this is a very, very convincing plot that tells you the transformers are in
0:24:56 - 0:24:58 Text: fact better.
0:24:58 - 0:25:03 Text: But you don't necessarily need a transformer to see that making models bigger is giving
0:25:03 - 0:25:05 Text: you gains.
0:25:05 - 0:25:10 Text: And really the sort of more interesting limitation of LSTMs that I'll also talk about a little
0:25:10 - 0:25:13 Text: more later is if we plot something else.
0:25:13 - 0:25:21 Text: So if we look at a thousand tokens, which is something like 600 words of context, we
0:25:21 - 0:25:25 Text: can look at what the loss is as a function of the position in the context.
0:25:25 - 0:25:30 Text: Because if you've read more of a document already, you're going to be better at predicting
0:25:30 - 0:25:33 Text: what the next word is because you have more context available.
0:25:33 - 0:25:34 Text: And these are very smooth.
0:25:34 - 0:25:41 Text: It turns out they're also power-law curves for the loss as a function of context position.
0:25:41 - 0:25:47 Text: But the thing that you notice is that the red lines are LSTMs and the blue lines are
0:25:47 - 0:25:48 Text: transformers.
0:25:48 - 0:25:54 Text: And LSTMs tend to sort of plateau in performance after on the order of a hundred tokens.
0:25:54 - 0:26:00 Text: And this is sort of another bottleneck in a different direction.
0:26:00 - 0:26:05 Text: This is the famous fact that transformers are much better at learning long context information.
0:26:05 - 0:26:08 Text: And this is obviously a limitation of LSTMs.
0:26:08 - 0:26:14 Text: But sort of the basic parameter scaling law seems like it holds for many architectures.
0:26:14 - 0:26:15 Text: And then there are much more refined questions
0:26:15 - 0:26:18 Text: you can ask; I won't go into too much detail on this.
0:26:18 - 0:26:21 Text: But there are all sorts of hyperparameters in transformer models.
0:26:21 - 0:26:25 Text: And you might ask how much does it matter if I really optimize those?
0:26:25 - 0:26:28 Text: Do I get qualitatively different behavior if I optimize those better?
0:26:28 - 0:26:33 Text: And what all of these plots show is that for various different kinds of hyperparameters
0:26:33 - 0:26:38 Text: in transformer models, there's some broad basin where you get quite good performance.
0:26:38 - 0:26:42 Text: I mean, maybe a factor of three in either direction where performance doesn't change
0:26:42 - 0:26:43 Text: all that much.
0:26:43 - 0:26:48 Text: Of course, you might want to optimize those, and I'm not saying you shouldn't, but kind of qualitatively
0:26:48 - 0:26:53 Text: it's not an enormous difference.
0:26:53 - 0:26:59 Text: So I think this is also a relevant place, because I'm going to tell you in a few slides
0:26:59 - 0:27:04 Text: that a lot of these features are true more generally, beyond language.
0:27:04 - 0:27:11 Text: And they really sort of say that much of what's going on when machines learn is quite universal.
0:27:11 - 0:27:14 Text: But there are features that are not universal.
0:27:14 - 0:27:22 Text: So this is kind of a nicer plot of loss versus token index.
0:27:22 - 0:27:29 Text: And I've included some power law fits, which are dotted lines, which show that this is
0:27:29 - 0:27:34 Text: actually, this performance is also highly predictable.
0:27:34 - 0:27:38 Text: That just says the obvious: when you've read more, it's easier for you
0:27:38 - 0:27:40 Text: to predict what's coming next.
0:27:40 - 0:27:48 Text: But you can train models identically on images; I'll briefly talk about that later.
0:27:48 - 0:27:52 Text: And there the performance as a function of context position is very different.
0:27:52 - 0:27:58 Text: So here you have a model that reads pixels row by row.
0:27:58 - 0:28:02 Text: And as you might expect, there's usually much more non-trivial stuff going on in the
0:28:02 - 0:28:05 Text: middle of an image than in the background.
0:28:05 - 0:28:08 Text: And that's reflected in the fact that models do much worse:
0:28:08 - 0:28:12 Text: their loss is higher in the center of images as compared to near the edges.
0:28:12 - 0:28:17 Text: So while some properties of transformers and language models are universal, and I'll
0:28:17 - 0:28:22 Text: talk about those later on, there are features of language data that are totally different
0:28:22 - 0:28:24 Text: from other data distributions.
0:28:24 - 0:28:30 Text: And this is a very stark example of that.
0:28:30 - 0:28:39 Text: But generally, I think it's very common that these kinds of nice patterns are lurking whenever you optimize a model.
0:28:39 - 0:28:42 Text: So any questions about this?
0:28:42 - 0:28:43 Text: Yeah.
0:28:43 - 0:28:45 Text: Do you mind going back a slide?
0:28:45 - 0:28:54 Text: Do you mind explaining what it means to have a loss on the first token versus the thousandth token?
0:28:54 - 0:29:01 Text: Yeah, so if you imagine you have a thousand words extracted randomly from a book, then
0:29:01 - 0:29:05 Text: the very first thing you can ask the model to do is try to predict the very first word.
0:29:05 - 0:29:09 Text: Then you ask it to predict the second word, the third word, et cetera.
0:29:09 - 0:29:15 Text: For the very first word, basically all the model can possibly do is predict the unigram distribution
0:29:15 - 0:29:16 Text: of its training set.
0:29:16 - 0:29:20 Text: It just doesn't have any other information to go on to predict what's happening.
0:29:20 - 0:29:23 Text: And so that's why its loss is very high.
0:29:23 - 0:29:28 Text: But by the time you get to the end of the passage, you've read a lot of some little short story,
0:29:28 - 0:29:30 Text: and you know a lot about what's going to happen.
0:29:30 - 0:29:32 Text: You know what kinds of words are likely to come next.
0:29:32 - 0:29:35 Text: You know about the author's style and vocabulary.
0:29:35 - 0:29:38 Text: You know about what characters exist, et cetera.
0:29:38 - 0:29:43 Text: And so your model has gotten much, much better at prediction by the end of the context.
0:29:43 - 0:29:48 Text: And so literally to make this plot, you take maybe a thousand, ten thousand different
0:29:48 - 0:29:51 Text: passages with a thousand words in them.
0:29:51 - 0:29:55 Text: You compute the model's loss on all of the words in the passage, and then you take the
0:29:55 - 0:29:57 Text: mean, and you get some nice plot like this.
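As a sketch of that averaging, here's roughly what the computation looks like in numpy. The per-token losses here are synthetic (a power-law decay plus noise), since the point is just the shape of the calculation, not a real model:

```python
import numpy as np

# Hypothetical sketch: average per-position loss over many passages.
# In practice token_losses[i, t] would be a trained model's cross-entropy
# on token t of passage i; here it's faked as a power-law decay plus noise.
rng = np.random.default_rng(0)
n_passages, n_tokens = 1000, 1024
positions = np.arange(1, n_tokens + 1)
token_losses = 3.0 * positions ** -0.1 + rng.normal(0.0, 0.3, (n_passages, n_tokens))

# Mean loss at each context position -- the curve on the slide.
mean_loss = token_losses.mean(axis=0)
print(mean_loss[0], mean_loss[-1])  # early positions have higher average loss
```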
0:29:57 - 0:29:58 Text: Yeah.
0:29:58 - 0:30:18 Text: But the computational complexity is quadratic with respect to the token index, isn't it? If you were looking at compute on the x-axis, it would go from zero to something like ten to the six.
0:30:18 - 0:30:28 Text: So you'd need significantly greater compute for a given test loss as you increase the token index.
0:30:28 - 0:30:33 Text: So it's true that if you make the context length longer and longer, you will spend somewhat
0:30:33 - 0:30:35 Text: more compute.
0:30:35 - 0:30:41 Text: But the fraction of the amount of compute you spend near the last token isn't nearly
0:30:41 - 0:30:42 Text: so stark.
0:30:42 - 0:30:50 Text: Most of the compute happens in the matrix multiplies for the MLP feed forward part of
0:30:50 - 0:30:57 Text: the transformer, and also the matrix multiplies to make the keys and queries and values, et cetera.
0:30:57 - 0:31:07 Text: It depends on the model hyperparameters, but in many models, especially models that are large, that's actually the predominant compute.
0:31:07 - 0:31:10 Text: And so actually, the amount of compute you do for the last token in the first token might
0:31:10 - 0:31:12 Text: only differ by a few percent.
0:31:12 - 0:31:16 Text: So for GPT-3, I think it's literally like a one or two percent difference.
0:31:16 - 0:31:19 Text: So those are the matrix multiplies in the attention?
0:31:19 - 0:31:20 Text: Yeah, yeah, yeah.
0:31:20 - 0:31:24 Text: So I mean, the formula for that was this one that I briefly mentioned here.
0:31:24 - 0:31:29 Text: So basically, how much compute you do in the context direction divided by the amount of
0:31:29 - 0:31:32 Text: compute you do in the matrix multiply direction is this.
0:31:32 - 0:31:40 Text: So if d_model is very small, if d_model is 128 and the context is 1,000, then
0:31:40 - 0:31:42 Text: it's basically 50-50.
0:31:42 - 0:31:48 Text: But if d_model is 10,000 and the context is 2,000, then it's like 2%.
0:31:48 - 0:31:58 Text: So if models keep getting bigger, then that means that if you're willing to pay a fixed fractional cost, you can keep making the context length longer.
0:31:58 - 0:32:04 Text: And of course, if you use something fancy with the attention, you also get extra wins on top of that.
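As a rough sketch of that ratio: the 1/(12·d_model) form below is my reconstruction from the usual approximation that a transformer has about 12·n_layer·d_model² parameters, so treat the exact constant as an assumption:

```python
def context_compute_fraction(d_model: int, n_ctx: int) -> float:
    """Rough ratio of attention-over-context compute to parameter
    (matrix-multiply) compute per token, assuming the standard
    approximation N ~ 12 * n_layer * d_model**2 for parameter count."""
    return n_ctx / (12 * d_model)

# The two regimes mentioned in the talk:
print(context_compute_fraction(128, 1000))     # small model: roughly order one
print(context_compute_fraction(10_000, 2000))  # large model: about 2%
```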
0:32:04 - 0:32:08 Text: Any other questions?
0:32:08 - 0:32:11 Text: Cool.
0:32:11 - 0:32:24 Text: So both of these, the left and the right, show you samples from a transformer model.
0:32:24 - 0:32:30 Text: Very roughly speaking, they're identical kinds of transformer models, just with some slightly different hyperparameters.
0:32:30 - 0:32:32 Text: But they're trained on very different data distributions.
0:32:32 - 0:32:35 Text: The one on the left is obviously, this is GPT-3.
0:32:35 - 0:32:38 Text: The one on the right is iGPT.
0:32:38 - 0:32:42 Text: It's a model that's trained to predict pixels, row by row.
0:32:42 - 0:32:47 Text: And so what happened here was that we took the top half of an image and then generated
0:32:47 - 0:32:50 Text: all the rows beneath.
0:32:50 - 0:32:55 Text: And so the same kind of model architecture, but just trained on different data distributions
0:32:55 - 0:33:04 Text: is able to effectively learn very impressive generative capabilities in both cases.
0:33:04 - 0:33:09 Text: And so this is sort of a qualitative hint at the possibility that what's going on here
0:33:09 - 0:33:13 Text: is quite universal.
0:33:13 - 0:33:19 Text: And so another way of introducing it is to say, you might have some questions after the last few slides.
0:33:19 - 0:33:26 Text: Are the scaling laws I'm talking about really specific to language? Are they a feature of the kind of data that language is?
0:33:26 - 0:33:29 Text: You might ask, do these scaling laws really continue?
0:33:29 - 0:33:34 Text: You showed that they're true over many orders of magnitude, but do they break down eventually, and in what way?
0:33:34 - 0:33:40 Text: And then another question you might ask is, what do they imply for other kinds of evaluations?
0:33:40 - 0:33:45 Text: You probably don't just want to generate raw samples from either of these kinds of models.
0:33:45 - 0:33:48 Text: You might want to use them for some other more specific task.
0:33:48 - 0:33:55 Text: And so there's the question of whether the test loss, the training loss that you've optimized,
0:33:55 - 0:34:03 Text: as it goes down in a predictable way, does that also imply that other things, other capabilities of the model, are improving?
0:34:03 - 0:34:07 Text: So I'll be talking about these questions.
0:34:07 - 0:34:16 Text: So this plot contains kind of a lot of compressed information all at once, or the set of plots.
0:34:16 - 0:34:25 Text: So this is the result of what happens if you train the same kind of transformer models on
0:34:25 - 0:34:27 Text: sort of five different data distributions.
0:34:27 - 0:34:33 Text: So text language we already saw, but you can try video where you predict every pixel
0:34:33 - 0:34:39 Text: in a video in this sort of rectangular prism of video pixels.
0:34:39 - 0:34:45 Text: Images, this sort of synthetically generated DeepMind math data set where you're trying
0:34:45 - 0:34:49 Text: to predict the answer to math problems.
0:34:49 - 0:34:55 Text: There's a multimodal data set where you have image text pairs in either direction.
0:34:55 - 0:35:03 Text: And in all cases, the x-axis is compute, and the y-axis is the appropriate test loss
0:35:03 - 0:35:08 Text: for that class of models minus a constant.
0:35:08 - 0:35:15 Text: So that's the one complication that I've added here.
0:35:15 - 0:35:30 Text: So the claim is that these dashed lines in terms of the original loss are a power law,
0:35:30 - 0:35:38 Text: like the power laws that we saw on a much earlier slide, plus one constant term.
0:35:38 - 0:35:42 Text: And if you subtract off that constant term, then you make a log-log plot once again,
0:35:42 - 0:35:46 Text: then you once again get these very, very nice straight lines.
0:35:46 - 0:35:50 Text: And so this compute scaling law generalizes to all these other data distributions.
0:35:50 - 0:35:56 Text: And the other scaling laws also generalize, I just haven't plotted them.
0:35:56 - 0:36:01 Text: So the claim of this slide is that scaling laws do generalize to all of these other data
0:36:01 - 0:36:07 Text: distributions, and you train the same basic kind of model on them.
0:36:07 - 0:36:13 Text: And furthermore, there's sort of an intellectually slightly interesting point, which is that
0:36:13 - 0:36:21 Text: if you really believe that these dashed lines are true, if you think that they're a real
0:36:21 - 0:36:28 Text: feature of what's going on, and they continue out very, very, very far, then if you think
0:36:28 - 0:36:34 Text: that the loss is a constant plus a power law, then you can interpret the constant term
0:36:34 - 0:36:39 Text: as the entropy of the underlying data distribution.
0:36:39 - 0:36:44 Text: And you can interpret the power law as something like the KL divergence between the true data
0:36:44 - 0:36:48 Text: distribution and the model that you have.
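To make the "constant plus power law" claim concrete, here's one hedged way you might fit that form with numpy: grid-search the irreducible constant, then fit a line in log-log space. The data is synthetic and the procedure is illustrative, not the actual analysis behind these plots:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "loss vs compute" data of the form L(C) = L0 + a * C**(-b).
C = np.logspace(-3, 3, 30)
L0_true, a_true, b_true = 1.2, 2.0, 0.3
loss = L0_true + a_true * C ** (-b_true) * np.exp(rng.normal(0.0, 0.01, C.size))

# Grid-search the irreducible constant: subtracting the right L0 should
# make log(loss - L0) linear in log(C).
best = None
for L0 in np.linspace(0.0, loss.min() - 1e-6, 200):
    y = np.log(loss - L0)
    slope, intercept = np.polyfit(np.log(C), y, 1)
    resid = np.sum((y - (slope * np.log(C) + intercept)) ** 2)
    if best is None or resid < best[0]:
        best = (resid, L0, -slope, np.exp(intercept))

_, L0_fit, b_fit, a_fit = best
print(L0_fit, a_fit, b_fit)  # should land near 1.2, 2.0, 0.3
```

The recovered constant is then interpretable as the entropy term, and the power-law piece as the shrinking KL divergence.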
0:36:48 - 0:36:50 Text: So that's a lot.
0:36:50 - 0:36:57 Text: The important summary, at zeroth order, to remember is that I'm telling you that the kinds of
0:36:57 - 0:37:02 Text: scaling laws I presented for language generalize to all of these other domains.
0:37:02 - 0:37:06 Text: There's also some other interesting features here.
0:37:06 - 0:37:13 Text: The reason why I used compute to illustrate that the scaling laws generalize is because
0:37:13 - 0:37:18 Text: you can ask another question now that puts all of the different data distributions on
0:37:18 - 0:37:19 Text: one plot.
0:37:19 - 0:37:25 Text: It wouldn't have made any sense to combine the five plots on the last slide into one plot,
0:37:25 - 0:37:30 Text: because the test loss, when you're predicting a word, is not in any way comparable to the
0:37:30 - 0:37:32 Text: test loss when you're predicting a pixel.
0:37:32 - 0:37:33 Text: It doesn't really make sense.
0:37:33 - 0:37:34 Text: They don't have the same units.
0:37:34 - 0:37:36 Text: It doesn't make sense to put them together.
0:37:36 - 0:37:42 Text: But something that does make sense to put together is what the optimal model size is as
0:37:42 - 0:37:45 Text: a function of your computational budget.
0:37:45 - 0:37:50 Text: And so in the same way that we did for language, you can go here and you can ask for any
0:37:50 - 0:37:55 Text: given amount of compute, like 10 to the minus 2 petaflop-days, what is the best model size?
0:37:55 - 0:37:57 Text: You can do that for all of these plots.
0:37:57 - 0:38:01 Text: You combine that information together and you find something kind of surprising, which
0:38:01 - 0:38:07 Text: is that, again, roughly speaking, if you're sort of willing to allow a little bit of wiggle
0:38:07 - 0:38:13 Text: room, all of these different kinds of models seem to be on the same trajectory for optimal
0:38:13 - 0:38:15 Text: model size versus compute.
0:38:15 - 0:38:20 Text: There's some kind of universal fit of how much bigger you should make your model if you're
0:38:20 - 0:38:28 Text: going to model any of these data distributions with some given amount of compute.
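A sketch of what such a universal fit looks like as a formula. The coefficient and exponent below are rough illustrative values (of order 10^9 parameters per petaflop-day, with an exponent around 0.7), assumptions for illustration rather than the exact published fit:

```python
def optimal_model_size(compute_pf_days: float,
                       coeff: float = 1.3e9,
                       exponent: float = 0.73) -> float:
    """Hypothetical power-law fit for the compute-optimal parameter count,
    N_opt ~ coeff * C**exponent, with C in petaflop-days. The constants
    are illustrative, not exact."""
    return coeff * compute_pf_days ** exponent

# More compute -> bigger optimal model, but sublinearly.
for c in (1e-2, 1.0, 1e2):
    print(f"{c:g} PF-days -> ~{optimal_model_size(c):.2e} params")
```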
0:38:28 - 0:38:32 Text: So what about other kinds of tasks?
0:38:32 - 0:38:38 Text: Well, one of the most classic tasks that you can ask about in ML is image classification.
0:38:38 - 0:38:50 Text: And so the models that we were training on images, whose training loss I've shown you in plots, are trained on tiny little images, predicted pixel by pixel.
0:38:50 - 0:39:00 Text: In particular, they're 32 by 32 images, so we can look at the 32 by 32 pixel version of ImageNet classification.
0:39:00 - 0:39:05 Text: And the models that I was discussing are generative models that predict pixels, but you can chop
0:39:05 - 0:39:13 Text: off their heads, add a classification head in its place, and try to predict ImageNet
0:39:13 - 0:39:15 Text: and train on ImageNet.
0:39:15 - 0:39:19 Text: And the orange curve that I've shown you here is what happens if you just take a randomly
0:39:19 - 0:39:23 Text: initialized model with that architecture and train it.
0:39:23 - 0:39:28 Text: You get very good performance up to a point, and then performance plateaus because you're
0:39:28 - 0:39:33 Text: being limited by the fact that ImageNet is, from this point of view, a small data set.
0:39:33 - 0:39:38 Text: However, if you take these pre-trained models that have been trained generatively to
0:39:38 - 0:39:43 Text: draw pixels, they sort of use the features, presumably they're using the features they
0:39:43 - 0:39:52 Text: learned from image generation for classification, and you get some nice trend for the error rate
0:39:52 - 0:39:56 Text: in classification as a function of model size.
0:39:56 - 0:40:02 Text: So this is saying that in this particular case, when we actually do fine-tuning, the pre-training
0:40:02 - 0:40:08 Text: you did and the sort of trends you saw really transfer into trends in something else
0:40:08 - 0:40:12 Text: you might care about, like image classification.
0:40:12 - 0:40:17 Text: We can ask the same kinds of questions about language models.
0:40:17 - 0:40:23 Text: In particular, does this steady improvement in language modeling as a function of scale,
0:40:23 - 0:40:28 Text: does that translate into better performance?
0:40:28 - 0:40:30 Text: And this is sort of an interesting subject by itself.
0:40:30 - 0:40:34 Text: And so you can ask what happens if we scale language models.
0:40:34 - 0:40:38 Text: And so this is the exact same plot that you've seen a couple of times now for
0:40:38 - 0:40:44 Text: language models, but just extended from the original work that we did out to
0:40:44 - 0:40:47 Text: this yellow line, which is GPT-3.
0:40:47 - 0:40:50 Text: And you see that basically these trends continue.
0:40:50 - 0:40:54 Text: Possibly GPT-3 is sort of missing the trend a little bit.
0:40:54 - 0:40:59 Text: I can't really honestly tell you whether that's because GPT-3 wasn't well optimized, or
0:40:59 - 0:41:05 Text: if it's because there's some bending in this curve where we're hitting some irreducible
0:41:05 - 0:41:06 Text: loss.
0:41:06 - 0:41:12 Text: That irreducible loss would be something like the entropy of this sort of language data
0:41:12 - 0:41:15 Text: set itself.
0:41:15 - 0:41:18 Text: But at zeroth order, the trends continue.
0:41:18 - 0:41:24 Text: And what's now pretty well known is that if you train fairly large language models,
0:41:24 - 0:41:27 Text: then they can exhibit in context learning.
0:41:27 - 0:41:34 Text: So the kind of learning that I'm talking about is that you give these models many examples
0:41:34 - 0:41:43 Text: of arithmetic problems or anagrams or translation tasks for individual words.
0:41:43 - 0:41:49 Text: Then early on in the sequence they might not be very good at doing the task,
0:41:49 - 0:41:54 Text: but they figure out what the pattern in the task is and they learn to do it.
0:41:54 - 0:41:58 Text: And in particular, you can plot that so you can ask for, say, like one of these anagram
0:41:58 - 0:42:00 Text: tasks.
0:42:00 - 0:42:06 Text: What is the performance of the model as a function of how many examples of the task get seen
0:42:06 - 0:42:07 Text: in the context?
0:42:07 - 0:42:12 Text: So this is kind of similar to the loss as a function of context position, but it's now
0:42:12 - 0:42:16 Text: an accuracy at doing an actual task, like unscramble the letters in a word.
0:42:16 - 0:42:23 Text: And you see probably most importantly that if you give more examples, you get significantly
0:42:23 - 0:42:28 Text: better performance starting from very, very poor performance to pretty good.
0:42:28 - 0:42:31 Text: And also you see that larger models do this better.
0:42:31 - 0:42:35 Text: You also finally see that giving a natural language prompt with some instructions helps
0:42:35 - 0:42:40 Text: significantly in the regime where you have very few examples.
0:42:40 - 0:42:42 Text: This is in context learning.
0:42:42 - 0:42:46 Text: You can call this a kind of meta learning.
0:42:46 - 0:42:52 Text: And it just emerges automatically from training large language models without any particular
0:42:52 - 0:42:57 Text: attempt to get this kind of behavior.
0:42:57 - 0:43:03 Text: And you could also ask about downstream tasks that you actually care about.
0:43:03 - 0:43:08 Text: So there is accuracy at doing arithmetic as a function of model size, a bunch of different
0:43:08 - 0:43:11 Text: kinds of arithmetic problems.
0:43:11 - 0:43:23 Text: There's some data set of analogies from a test that American students take to go to college, the SATs.
0:43:23 - 0:43:32 Text: And if you're curious, the average score on that year's test was, I think, 58% or so.
0:43:32 - 0:43:35 Text: So the largest model is sort of doing a little bit better than the average American high
0:43:35 - 0:43:37 Text: school student.
0:43:37 - 0:43:42 Text: There's TriviaQA, which is sort of just knowing trivia.
0:43:42 - 0:43:50 Text: And Winograd schemas are problems like: if a tree falls on your roof and you got it fixed,
0:43:50 - 0:43:53 Text: what did you get fixed? Did you get the tree fixed, or your roof?
0:43:53 - 0:43:57 Text: It's a measure of common sense reasoning and models are also getting better at this.
0:43:57 - 0:44:03 Text: And I think the other interesting thing that's very often emphasized is that clearly trivia
0:44:03 - 0:44:05 Text: performance is improving very smoothly as you make models bigger.
0:44:05 - 0:44:09 Text: The models are just remembering more and more trivia.
0:44:09 - 0:44:14 Text: Winograd schemas are also improving fairly smoothly.
0:44:14 - 0:44:17 Text: But then there are examples like arithmetic where models are very poor and then they sort
0:44:17 - 0:44:19 Text: of suddenly get pretty good.
0:44:19 - 0:44:25 Text: And so these kinds of sudden jumps, where the model suddenly kind of gets
0:44:25 - 0:44:28 Text: what it's supposed to do for arithmetic, are pretty interesting.
0:44:28 - 0:44:32 Text: And there are all sorts of other kind of interesting things if you kind of dig into these
0:44:32 - 0:44:33 Text: specific abilities.
0:44:33 - 0:44:34 Text: Yeah.
0:44:34 - 0:44:41 Text: Why do bigger models do better at in-context learning?
0:44:41 - 0:44:48 Text: I mean, I guess the sort of dumb zeroth-order point is that larger models are just getting
0:44:48 - 0:44:52 Text: much better and better at predicting the next word given more and more context.
0:44:52 - 0:44:58 Text: So I think there's a very tight connection between a plot like this and
0:44:58 - 0:45:02 Text: these sort of in-context learning plots.
0:45:02 - 0:45:06 Text: Basically the more information you're getting, I mean all of these models probably know the
0:45:06 - 0:45:11 Text: unigram distribution of words and tokens pretty well.
0:45:11 - 0:45:15 Text: But the bigger model is getting much, much, much more information from its context than
0:45:15 - 0:45:17 Text: the smaller models.
0:45:17 - 0:45:21 Text: And at a certain point, I mean, it depends on your training distribution and all sorts
0:45:21 - 0:45:22 Text: of other things.
0:45:22 - 0:45:27 Text: But like, one of the things that we do is when we see several examples of something happening
0:45:27 - 0:45:32 Text: in a text, we guess that that's what we're going to see next.
0:45:32 - 0:45:37 Text: And that's really probably embedded in a ton of text that's out there on the internet
0:45:37 - 0:45:38 Text: and in books.
0:45:38 - 0:45:41 Text: And models have to decrease their loss somehow.
0:45:41 - 0:45:43 Text: That's a pattern in the text.
0:45:43 - 0:45:48 Text: It's a pattern that models eventually learn and they seemingly apply this knowledge.
0:45:48 - 0:45:51 Text: I think there are other people, of course, who've worked on this question more
0:45:51 - 0:45:53 Text: specifically and have more specific theories.
0:45:53 - 0:46:02 Text: But at an intuitive level, that's how I would think about it.
0:46:02 - 0:46:08 Text: I guess one final evaluation you can ask about: can people tell whether text written by a language
0:46:08 - 0:46:12 Text: model was written by a language model or by a human?
0:46:12 - 0:46:16 Text: This is an evaluation where we looked at short news articles.
0:46:16 - 0:46:22 Text: They're two or three paragraphs, and we generated equivalent news articles from GPT-3.
0:46:22 - 0:46:27 Text: And by the time you get to sort of the largest models, people are approaching chance accuracy
0:46:27 - 0:46:28 Text: at being able to tell the difference.
0:46:28 - 0:46:33 Text: This has a lot of implications. I mean, it's interesting and surprising
0:46:33 - 0:46:35 Text: as a statement about language modeling.
0:46:35 - 0:46:38 Text: But it's also somewhat scary.
0:46:38 - 0:46:43 Text: It means that with these language models, it's very difficult to tell that you're talking to a language
0:46:43 - 0:46:47 Text: model if you don't have a very long conversation.
0:46:47 - 0:46:48 Text: Yeah.
0:46:48 - 0:46:49 Text: Hi.
0:46:49 - 0:47:08 Text: So I'm wondering, for this specific result: have you checked whether the generated news articles appear in the training data, that is, whether any of them are memorized?
0:47:08 - 0:47:12 Text: I actually don't know the answer to that question for this particular analysis off the top of
0:47:12 - 0:47:14 Text: my head.
0:47:14 - 0:47:19 Text: I believe that these are not memorized.
0:47:19 - 0:47:23 Text: One simple thing you can do, at least for things that occur frequently, is
0:47:23 - 0:47:29 Text: look at the distribution of the loss for a model on its own samples.
0:47:29 - 0:47:34 Text: At least for things that are memorized, that are very clearly memorized,
0:47:34 - 0:47:38 Text: obviously they probably appear frequently in the training set, but also
0:47:38 - 0:47:41 Text: the loss tends to be much, much lower on memorized samples.
0:47:41 - 0:47:46 Text: You can understand this intuitively: if there are 100 words that are exactly
0:47:46 - 0:47:52 Text: verbatim sampled out, and you're sampling at temperature equals one, then all of the
0:47:52 - 0:47:55 Text: next word predictions have to be extremely, extremely confident.
0:47:55 - 0:47:57 Text: And that means the loss has to be super low.
0:47:57 - 0:48:02 Text: So, informally, something that I've done to just get rid of memorized samples
0:48:02 - 0:48:05 Text: is compute the loss, and usually you'll see a pretty clear bimodal distribution, where there'll
0:48:05 - 0:48:09 Text: be a few memorized examples and then things that aren't.
0:48:09 - 0:48:10 Text: That's a simple thing you can do to check.
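That informal filter might look something like this; the threshold and loss values are invented for illustration:

```python
import numpy as np

def filter_memorized(sample_losses, threshold):
    """Keep only samples whose loss is above `threshold`.
    Memorized samples show up as a low-loss cluster, so in practice
    you'd pick the threshold by eyeballing the bimodal histogram."""
    sample_losses = np.asarray(sample_losses)
    return sample_losses[sample_losses > threshold]

# Fake bimodal losses: a memorized cluster near 0.1, typical samples near 2.5.
rng = np.random.default_rng(2)
losses = np.concatenate([
    rng.normal(0.1, 0.02, 5),   # suspiciously low loss: likely memorized
    rng.normal(2.5, 0.4, 95),   # typical samples
])
kept = filter_memorized(losses, threshold=1.0)
print(len(kept))  # the non-memorized samples survive
```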
0:48:10 - 0:48:13 Text: You can also, of course, do deduplication.
0:48:13 - 0:48:18 Text: I don't remember off the top of my head what deduplication was done here, though.
0:48:18 - 0:48:29 Text: On the downstream tasks section: do you have anything to say about how the scaling laws look on adversarial data sets?
0:48:29 - 0:48:35 Text: I don't think I have anything particularly clear to say about that.
0:48:35 - 0:48:41 Text: I mean, these evals, I think, are not adversarial in the sense that they're just few-shot evaluations
0:48:41 - 0:48:44 Text: with some fixed data set.
0:48:44 - 0:48:51 Text: There are a large number of different kinds of adversarial data sets out there for reasoning,
0:48:51 - 0:48:54 Text: for common sense knowledge, for truthfulness.
0:48:54 - 0:48:57 Text: So, I mean, there's, for example, TruthfulQA.
0:48:57 - 0:49:01 Text: This is an example where there aren't any trends like this and arguably the trends go
0:49:01 - 0:49:06 Text: downward, though it depends on your training distribution and some models actually do improve.
0:49:06 - 0:49:08 Text: So I think that's a complicated question.
0:49:08 - 0:49:11 Text: I think it's hard to find examples where the trends go down.
0:49:11 - 0:49:16 Text: I don't think it's easy, but these do exist.
0:49:16 - 0:49:21 Text: Any other questions?
0:49:21 - 0:49:25 Text: Great.
0:49:25 - 0:49:35 Text: So, I guess I'll sort of end by summarizing some lessons that you might draw pretty practically
0:49:35 - 0:49:38 Text: for research from this.
0:49:38 - 0:49:43 Text: And then I can either open it up for questions, or I can also, I can always talk infinitely
0:49:43 - 0:49:44 Text: long.
0:49:44 - 0:49:48 Text: I've been a professor for like 10 years of my life, so I can just talk forever.
0:49:48 - 0:49:52 Text: But I'll sort of end after talking about some lessons.
0:49:52 - 0:49:58 Text: So I think one lesson that kind of I draw from this is that kind of scanning over some
0:49:58 - 0:50:03 Text: of the important inputs to your training process is just a pretty useful thing to do when
0:50:03 - 0:50:05 Text: you're doing ML research.
0:50:05 - 0:50:08 Text: And it's sort of typically very cheap.
0:50:08 - 0:50:13 Text: It's cheap because generally most things vary in an important way on a log scale, or
0:50:13 - 0:50:17 Text: sort of on a geometric scale, however you want to say it.
0:50:17 - 0:50:21 Text: And that means that like if you're training with the data set of size D, maybe you should
0:50:21 - 0:50:25 Text: also train with D over 2 and D over 4 and D over 8 or something like that.
0:50:25 - 0:50:28 Text: And if you sum that geometric series, you get 2D.
0:50:28 - 0:50:34 Text: So you sort of, I mean, you made your training process twice as expensive in some sense, but
0:50:34 - 0:50:37 Text: it's not really a big change in what you have to do.
0:50:37 - 0:50:42 Text: But you can often learn a lot about what's going on by doing these kinds of scans.
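The geometric-series arithmetic behind the "twice as expensive" claim can be checked in a couple of lines:

```python
# Cost of training on D, D/2, D/4, ... relative to a single run on D.
# The infinite geometric series sums to 2D, so a handful of halvings
# already costs barely more than one extra full run.
def scan_cost(num_runs: int) -> float:
    return sum(0.5 ** k for k in range(num_runs))

print(scan_cost(4))   # D + D/2 + D/4 + D/8 = 1.875 D
print(scan_cost(50))  # approaches the limit of 2 D
```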
0:50:42 - 0:50:47 Text: And so, I mean, this is an example of some data that I didn't show earlier.
0:50:47 - 0:50:51 Text: So, something you might wonder about is what happens if you scan over data set size and model
0:50:51 - 0:50:53 Text: size at the same time.
0:50:53 - 0:50:57 Text: And it turns out there's some very simple trends that you can model in that case too that
0:50:57 - 0:51:00 Text: tell you about things like overfitting.
0:51:00 - 0:51:03 Text: And I mean, if you care about overfitting, then this tells you about something like how
0:51:03 - 0:51:06 Text: big do you have to make your data set for a given model size to avoid overfitting being
0:51:06 - 0:51:11 Text: a significant problem so that you can answer all kinds of questions like that.
0:51:11 - 0:51:18 Text: And I at least find that this is kind of useful and it's nice for learning things about
0:51:18 - 0:51:19 Text: behavior.
0:51:19 - 0:51:24 Text: And I think alongside that, I think like this is sort of a joke.
0:51:24 - 0:51:25 Text: This isn't real.
0:51:25 - 0:51:29 Text: This is sort of making fun of a large number of machine learning papers that you might
0:51:29 - 0:51:30 Text: see.
0:51:30 - 0:51:34 Text: I think a lot of machine learning papers have tables like this.
0:51:34 - 0:51:37 Text: And it's sort of hard to tell from this kind of table (obviously I'm making fun, but
0:51:37 - 0:51:42 Text: I think it's not so unrealistic) whether the technique that went into our model really
0:51:42 - 0:51:45 Text: improved on other things that came before.
0:51:45 - 0:51:49 Text: And I think that this kind of plot, at least for me, is a much more convincing statement
0:51:49 - 0:51:53 Text: that clearly transformers are just better than LSTMs.
0:51:53 - 0:52:04 Text: So the slogan here is to look at scaling trends to judge the success of new techniques, if your goal is to improve a model.
0:52:04 - 0:52:08 Text: And I think it's at least to me much more convincing and kind of clear what's going on
0:52:08 - 0:52:10 Text: if you see these trends.
0:52:10 - 0:52:13 Text: Maybe I have another slide making fun of this.
0:52:13 - 0:52:18 Text: So I mean, I think this is a thing that I actually see very often in research is that
0:52:18 - 0:52:26 Text: you come up with some new idea, and you first do the cheapest, easiest experiment,
0:52:26 - 0:52:29 Text: and you see, well, my new idea improved performance.
0:52:29 - 0:52:31 Text: I'm really excited.
0:52:31 - 0:52:33 Text: Everyone should adopt this.
0:52:33 - 0:52:38 Text: But then you make some plot like this and you sort of say, oh, okay, I guess it doesn't
0:52:38 - 0:52:40 Text: really matter that much at all.
0:52:40 - 0:52:42 Text: And I think this is actually common.
0:52:42 - 0:52:44 Text: I mean, I think we all have all sorts of ideas.
0:52:44 - 0:52:50 Text: I mean, people fall asleep at night and they can't sleep and then they wake up and they
0:52:50 - 0:52:52 Text: have ideas and like, oh, I'm going to go try this.
0:52:52 - 0:52:53 Text: We all do it.
0:52:53 - 0:52:57 Text: But oftentimes they don't work and I think this is sort of useful for understanding whether
0:52:57 - 0:52:59 Text: your idea really, really works.
0:52:59 - 0:53:05 Text: And I mean, if all you're ever going to do is train this model, then your idea did work.
0:53:05 - 0:53:13 Text: But I think that like there's sort of an expectation that probably people will be using bigger
0:53:13 - 0:53:15 Text: computers to train larger models in the future.
0:53:15 - 0:53:18 Text: And so the ideas that are really going to have a huge impact are ones where the trend points
0:53:18 - 0:53:20 Text: the other way, where the gains grow as models get bigger.
0:53:20 - 0:53:24 Text: I've even seen ideas where on small models they make no difference at all, but on larger
0:53:24 - 0:53:28 Text: models they do better.
0:53:28 - 0:53:31 Text: And so these kinds of trends I think are useful.
0:53:31 - 0:53:34 Text: And they're certainly useful to think about.
0:53:34 - 0:53:42 Text: Another point that I find useful, though it's not obvious and maybe you shouldn't
0:53:42 - 0:53:46 Text: trust it completely, is that I tend to think, because I've sort of swallowed
0:53:46 - 0:53:53 Text: my own Kool-Aid, that if something works, then it should scale fairly predictably.
0:53:53 - 0:53:58 Text: It's not always true, but for things that you can measure that are very close to your
0:53:58 - 0:54:01 Text: optimization target,
0:54:01 - 0:54:08 Text: if your training process, your hyperparameters, etc., are all kind of set up well,
0:54:08 - 0:54:12 Text: then I tend to think that you should see some kind of predictable trend.
0:54:12 - 0:54:18 Text: And if that trend goes away, then I mean, maybe that's just exactly what's true.
0:54:18 - 0:54:21 Text: But I think often it means that there's something broken about what's going on.
0:54:21 - 0:54:26 Text: Maybe your numerics are broken and you need higher precision in some part of your model,
0:54:26 - 0:54:29 Text: maybe there's some bottleneck you hadn't thought of.
0:54:29 - 0:54:34 Text: So I mean, this is also an example that kind of scaling, predictable scaling can be found
0:54:34 - 0:54:35 Text: all over the place.
0:54:35 - 0:54:39 Text: So I just think this is sort of neat.
0:54:39 - 0:54:43 Text: So if you just train these extremely naive, very stupid multimodal models, where you use
0:54:43 - 0:54:49 Text: a decoder-only transformer to either model the text given the image or model the image
0:54:49 - 0:54:51 Text: given the text, then you can do that
0:54:51 - 0:54:56 Text: and measure a sort of empirical mutual information between the image and the text.
0:54:56 - 0:55:02 Text: How much information did the image give you about the words in the sense of sort of Shannon
0:55:02 - 0:55:04 Text: information?
0:55:04 - 0:55:08 Text: And or conversely, how much information did the text give you about the image?
0:55:08 - 0:55:12 Text: And this is also a place where, I mean, this is very close to the optimization target.
0:55:12 - 0:55:16 Text: The whole point of the multimodal is to get this information.
0:55:16 - 0:55:21 Text: And you see that there's some predictable scaling going on, where larger models are extracting
0:55:21 - 0:55:28 Text: more information about one part of the distribution from the other.
0:55:28 - 0:55:36 Text: But I think this is sort of a general thing that you should expect in model training.
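The measurement described above can be sketched very simply: score each caption once with a text-only model and once with an image-conditioned model, and average the difference in log-likelihood. The function and all numbers below are hypothetical, just to pin down the arithmetic:

```python
import numpy as np

def empirical_mutual_info(logp_text_given_image, logp_text_alone):
    """Average nats of information the image provided about the caption."""
    diffs = np.asarray(logp_text_given_image) - np.asarray(logp_text_alone)
    return float(np.mean(diffs))

# Hypothetical total log-likelihoods (nats) of three captions under two models:
logp_conditioned = [-41.2, -38.5, -45.0]    # caption scored given its image
logp_unconditioned = [-52.7, -50.1, -58.3]  # caption scored by a text-only model

# Positive value = the image made the captions more predictable,
# in the Shannon sense discussed above.
print(empirical_mutual_info(logp_conditioned, logp_unconditioned))
```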
0:55:36 - 0:55:42 Text: And so maybe to sort of summarize, with maybe even bigger-picture implications:
0:55:42 - 0:55:49 Text: I think that these kinds of results suggest that scaling may not be the best or the smartest
0:55:49 - 0:55:53 Text: or the most interesting way to make better ML models.
0:55:53 - 0:55:56 Text: Maybe it won't be the way that happens in the future.
0:55:56 - 0:56:00 Text: But at least I think these results suggest that there aren't any really hard conceptual
0:56:00 - 0:56:08 Text: barriers preventing people from training significantly more powerful models of all kinds, including,
0:56:08 - 0:56:13 Text: of course, language models.
0:56:13 - 0:56:20 Text: I think certainly my perspective, originally as a physicist, coming to machine
0:56:20 - 0:56:28 Text: learning kind of fresh about five years ago, is that this is one set
0:56:28 - 0:56:35 Text: of abstractions for thinking about what's going on in AI research: if
0:56:35 - 0:56:39 Text: you're going to be training fairly large models and you want them to do well,
0:56:39 - 0:56:44 Text: then you probably want your models to be scaling
0:56:44 - 0:56:46 Text: well in terms of their performance.
0:56:46 - 0:56:50 Text: And this framework, that maybe there's a bottleneck, but if you remove the bottleneck,
0:56:50 - 0:56:53 Text: then you'll just continue to see further progress,
0:56:53 - 0:56:57 Text: is one I've found useful.
0:56:57 - 0:57:02 Text: I think another point that, well, maybe I'll make this point at the end.
0:57:02 - 0:57:06 Text: Another point is that, yeah, scaling laws are just sort of all over the place and they
0:57:06 - 0:57:11 Text: can help you to sort of maybe organize your research a bit.
0:57:11 - 0:57:15 Text: And then, I mean, maybe the most interesting point conceptually, though, is that it seems
0:57:15 - 0:57:22 Text: like, if you believe this kind of story, that many domains of ML are kind
0:57:22 - 0:57:27 Text: of surprisingly simple and universal; things that you might not have thought are the same
0:57:27 - 0:57:31 Text: are more similar than they are different.
0:57:31 - 0:57:34 Text: And of course, this is also a fascinating thing to try to understand.
0:57:34 - 0:57:41 Text: So I mean, I was a theoretical physicist for most of my life, so I mostly tried to understand
0:57:41 - 0:57:46 Text: things that seem extremely esoteric and weird and why would anyone care about them.
0:57:46 - 0:57:51 Text: This is a thing that I think probably everyone in this room kind of cares about,
0:57:51 - 0:57:55 Text: like, can AI models write, can they communicate in language?
0:57:55 - 0:58:00 Text: And these kinds of trends are really, really nice; they're the kind of trends that you might
0:58:00 - 0:58:05 Text: see in a very controlled physics experiment or something, and yet they're coming out of
0:58:05 - 0:58:10 Text: something very, very noisy and random, like predicting language data on the internet.
0:58:10 - 0:58:16 Text: So I think it's very interesting to think about, like, why are these kinds of trends true?
0:58:16 - 0:58:21 Text: What is the underlying kind of theory or science here that makes these trends true?
0:58:21 - 0:58:23 Text: Can we predict it?
0:58:23 - 0:58:24 Text: Can we refine those predictions?
0:58:24 - 0:58:28 Text: Can we understand why and when this doesn't occur?
0:58:28 - 0:58:30 Text: Another question is sort of, there are some exponents here.
0:58:30 - 0:58:34 Text: This is a straight line, but the straight line represents a power law with a particular
0:58:34 - 0:58:35 Text: exponent.
0:58:35 - 0:58:36 Text: Why that exponent?
0:58:36 - 0:58:39 Text: For language, it's like 0.08 or so.
0:58:39 - 0:58:42 Text: Why 0.08 and not 0.2 or 0.4 or 0.001?
0:58:42 - 0:58:45 Text: I think there are all sorts of questions here.
0:58:45 - 0:58:49 Text: When you see data that has a very clear trend, it's very interesting to understand, to try
0:58:49 - 0:58:54 Text: to think about why is something so simple happening.
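To see why the particular exponent matters so much, here is the arithmetic under a pure power law. Only the 0.08 figure comes from the talk; the other exponents are the alternatives posed in the question above:

```python
# Under a pure power law L ~ C**(-alpha), cutting the loss in half costs
# a factor of 2**(1/alpha) in compute, so the exponent matters enormously.
for alpha in (0.4, 0.2, 0.08, 0.001):
    factor = 2 ** (1 / alpha)
    print(f"alpha = {alpha}: ~{factor:.3g}x more compute to halve the loss")
```

With alpha = 0.2 that factor is 32x; with the observed 0.08 it is already nearly 6000x, and with 0.001 it is astronomically large.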
0:58:54 - 0:58:56 Text: And I'll sort of leave you with that.
0:58:56 - 0:58:58 Text: Yeah.
0:58:58 - 0:59:16 Text: Have you thought at all about what these scaling laws say about human beings?
0:59:16 - 0:59:23 Text: So your picture is essentially: make everything bigger, avoid bottlenecks, and so on.
0:59:23 - 0:59:30 Text: Whereas I guess human beings, well, you're good on the number of parameters, because there
0:59:30 - 0:59:33 Text: are still several orders of magnitude of room there.
0:59:33 - 0:59:40 Text: But it seems like you're not very good on the amount of compute you use.
0:59:40 - 0:59:45 Text: So human compute is very constrained, and it's not clear whether that's because of
0:59:45 - 0:59:50 Text: slow processing, or because of energy demands that keep humans from using
0:59:50 - 0:59:54 Text: most of their parameters most of the time.
0:59:54 - 1:00:02 Text: And data's a little bit complex, because I guess we get a ton of data, but if you are thinking
1:00:02 - 1:00:11 Text: of the amount of language data we get, you know, a fully competent language
1:00:11 - 1:00:18 Text: user is sort of three orders of magnitude down from where GPT-3 is now.
1:00:18 - 1:00:22 Text: And yet something good seems to happen, if you look at what humans can do.
1:00:24 - 1:00:27 Text: I mean, I think it's a fantastic question.
1:00:27 - 1:00:31 Text: I don't have anything to say that isn't quite speculative.
1:00:31 - 1:00:32 Text: So I mean, I don't have any good answer to the question.
1:00:32 - 1:00:34 Text: I think it's a great question.
1:00:34 - 1:00:37 Text: I guess one thing that seems like it's true is that sort of the factor of a thousand
1:00:37 - 1:00:40 Text: you mentioned seems pretty common.
1:00:40 - 1:00:44 Text: I mean, my impression is that AlphaGo probably plays like a thousand times more games when
1:00:44 - 1:00:49 Text: it trains than like a go master does.
1:00:49 - 1:00:55 Text: I think this is a pretty common factor to see in a lot of like ML contexts.
1:00:55 - 1:00:56 Text: But I have no idea why it is.
1:00:56 - 1:00:59 Text: I don't know if it's that evolution optimized us to learn fast.
1:00:59 - 1:01:03 Text: If we have some hard coded information, if this sort of multimodal inputs that we have
1:01:03 - 1:01:08 Text: help a lot, you might imagine that when you have a system that's already pretty smart,
1:01:08 - 1:01:12 Text: reinforcement learning or active learning of some form becomes more and more important,
1:01:12 - 1:01:18 Text: because like when these language models or a person, like if I read a physics textbook,
1:01:18 - 1:01:22 Text: I don't really learn a lot in a certain sense because I already learned physics.
1:01:22 - 1:01:24 Text: And I think the same is probably true for these models.
1:01:24 - 1:01:29 Text: So as the models get smarter, this sort of very dumb next word prediction task is giving
1:01:29 - 1:01:34 Text: you less and less information, but you might expect to get more and more information if
1:01:34 - 1:01:37 Text: you did something more active.
1:01:37 - 1:01:41 Text: I can continue to speculate, but I don't really know anything about it.
1:01:41 - 1:01:47 Text: I don't have anything well established to tell you.
1:01:47 - 1:01:48 Text: It's a great question.
1:01:48 - 1:01:52 Text: So if you compare the transformers and the LSTMs, the transformers
1:01:52 - 1:02:02 Text: are a bit better, but they're in a broadly similar mode, whereas it seems like human abilities
1:02:02 - 1:02:08 Text: are qualitatively different, like humans would sit at a very different place
1:02:08 - 1:02:09 Text: on the graph.
1:02:09 - 1:02:12 Text: Yeah, I think that's absolutely, I think that's just true.
1:02:12 - 1:02:14 Text: The sample efficiency of these models is not similar.
1:02:14 - 1:02:19 Text: Another way of saying is that if you got into AI research to understand the human brain,
1:02:19 - 1:02:23 Text: it's very unclear whether we're making any progress on that.
1:02:23 - 1:02:29 Text: But if we just want to sort of, yeah, for a lot of these tasks, we don't seem to have
1:02:29 - 1:02:32 Text: to solve the brain to solve AI surprisingly.
1:02:32 - 1:02:42 Text: I think you had a question.
1:02:42 - 1:02:43 Text: Yeah.
1:02:43 - 1:03:29 Text: [Largely inaudible question about the quantity and quality of the remaining language data, and about knowledge that isn't necessarily written down anywhere.]
1:03:29 - 1:03:31 Text: Sure, sure, no, these are all great questions.
1:03:31 - 1:03:35 Text: So, I mean, sort of early on,
1:03:35 - 1:03:37 Text: I commented on some sources of data,
1:03:37 - 1:03:39 Text: and I mean, you're certainly correct about quality.
1:03:39 - 1:03:42 Text: I think in terms of quantity, I mean,
1:03:42 - 1:03:46 Text: I don't think anyone has, like, a digitized library of Congress,
1:03:46 - 1:03:48 Text: but I think if you did, that would be like,
1:03:48 - 1:03:51 Text: I don't know, maybe 10x bigger than the training set for GPT-3.
1:03:51 - 1:03:54 Text: So, there's a sense in which there's probably quite a lot of,
1:03:54 - 1:03:57 Text: still quite high quality data that isn't in use.
1:03:57 - 1:03:59 Text: I don't know whether it will ever be in use,
1:03:59 - 1:04:01 Text: so it's a complicated question.
1:04:01 - 1:04:04 Text: And then, if you are willing to sort of take all of this garbage
1:04:04 - 1:04:08 Text: on the internet, or try to filter that garbage down,
1:04:08 - 1:04:11 Text: I think, I don't know how accurate this estimate is,
1:04:11 - 1:04:13 Text: but in order of magnitude level,
1:04:13 - 1:04:15 Text: you can get something like 10 to the 15 words,
1:04:15 - 1:04:17 Text: which is a thousand times bigger.
1:04:17 - 1:04:20 Text: And of course, if you find any kind of intelligent way of filtering,
1:04:20 - 1:04:23 Text: then if you can filter down to 0.1% of that,
1:04:23 - 1:04:25 Text: and take the 0.1% that's best,
1:04:25 - 1:04:27 Text: then you do still have a lot of data.
1:04:27 - 1:04:28 Text: So, I think for language modeling,
1:04:28 - 1:04:30 Text: there's definitely still some headroom,
1:04:30 - 1:04:34 Text: but this is certainly a constraint,
1:04:34 - 1:04:38 Text: and there are other kinds of data distributions
1:04:38 - 1:04:40 Text: where you'll run out sooner.
1:04:40 - 1:04:44 Text: I mean, in terms of, yeah, I mean,
1:04:44 - 1:04:46 Text: of course, there are all sorts of other things you can explore,
1:04:46 - 1:04:48 Text: one you can explore, multi-modal models,
1:04:48 - 1:04:51 Text: one can switch to a different kind of loss function
1:04:51 - 1:04:55 Text: that is more interactive, or actually accomplishing a task.
1:04:55 - 1:04:58 Text: But I think, for pure language modeling,
1:04:58 - 1:05:00 Text: it seems like there's at least some room left.
1:05:00 - 1:05:03 Text: And if you think that your model size increases,
1:05:03 - 1:05:06 Text: sort of, if you think you can increase your model size by a factor of 100
1:05:06 - 1:05:09 Text: and increase your data set size by a factor of 10,
1:05:09 - 1:05:14 Text: which is sort of like roughly what this is saying.
1:05:14 - 1:05:18 Text: If you believe that, then you can still scale up your model size a lot
1:05:18 - 1:05:21 Text: and have probably plenty of data.
1:05:21 - 1:05:27 Text: But, yeah, you couldn't sort of do this stuff without the internet.
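The arithmetic in that answer can be written out explicitly. All the numbers below are the rough, order-of-magnitude figures quoted in the talk, and the square-root relation between data and model size is just the "100x model, 10x data" rule of thumb mentioned above:

```python
# Back-of-envelope data headroom, all order-of-magnitude assumptions.
GPT3_TOKENS = 3e11   # rough size of GPT-3's training set
WEB_WORDS = 1e15     # rough upper bound on raw text on the internet

def data_multiplier(model_multiplier, exponent=0.5):
    """Data-set growth implied by model-size growth, assuming D ~ N**exponent."""
    return model_multiplier ** exponent

# A 100x bigger model wants about 10x more data under this rule of thumb,
# which is still far below the raw (unfiltered) web.
needed = GPT3_TOKENS * data_multiplier(100)
print(needed, needed < WEB_WORDS)
```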
1:05:27 - 1:05:29 Text: Yeah, or, you know,
1:05:29 - 1:05:33 Text: do you want to share it?
1:05:33 - 1:05:34 Text: Sure, yeah.
1:05:34 - 1:05:39 Text: In terms of bottlenecks for improving models over time,
1:05:39 - 1:05:42 Text: are you more optimistic about
1:05:42 - 1:05:47 Text: being able to train much larger models on the same designs,
1:05:47 - 1:05:50 Text: so hardware improvements,
1:05:50 - 1:05:51 Text: or architectural improvements,
1:05:51 - 1:05:53 Text: like the LSTM-to-transformer jump?
1:05:53 - 1:05:56 Text: I guess, I mean,
1:05:56 - 1:05:59 Text: I think I'm sort of optimistic about both.
1:05:59 - 1:06:03 Text: I think that my understanding, sort of the zeroth-order understanding
1:06:03 - 1:06:05 Text: of the hardware situation, is that, like,
1:06:05 - 1:06:08 Text: connecting together GPUs and GPU-
1:06:08 - 1:06:10 Text: like objects works pretty well,
1:06:10 - 1:06:12 Text: and that, like,
1:06:12 - 1:06:16 Text: interconnect speeds are increasing and can increase pretty easily.
1:06:16 - 1:06:20 Text: So I think that you don't need one chip to run your entire model.
1:06:20 - 1:06:25 Text: You can distribute your model over many, many, many accelerators.
1:06:25 - 1:06:29 Text: And if you're willing to pay for those accelerators,
1:06:29 - 1:06:32 Text: et cetera, then I think you can do that.
1:06:32 - 1:06:35 Text: Architectural improvements, I think,
1:06:35 - 1:06:40 Text: I would say I sort of typically haven't been super excited about architectural improvements,
1:06:40 - 1:06:45 Text: but I think there will continue to be architectural improvements.
1:06:45 - 1:06:49 Text: I think that, sort of, whenever you do something for the first time,
1:06:49 - 1:06:52 Text: or even just, like, whenever you train a really big model for the first time,
1:06:52 - 1:06:54 Text: you sort of don't do it in the best possible way,
1:06:54 - 1:06:58 Text: and there's a lot of, like, all sorts of different kinds of improvements.
1:06:58 - 1:07:02 Text: Maybe there are, sort of, non-incremental improvements that will look like big jumps.
1:07:02 - 1:07:04 Text: So yeah, I think that'll be both.
1:07:04 - 1:07:07 Text: So yeah, I mean, there's a sense in which,
1:07:07 - 1:07:10 Text: if all you did was look at this plot and just try to continue it,
1:07:10 - 1:07:14 Text: that might be an underestimate of progress that the field is going to make,
1:07:14 - 1:07:25 Text: because there will be improvements in architecture and algorithms and things like that.
1:07:25 - 1:07:41 Text: A related question: these models are getting very good at generating text,
1:07:41 - 1:07:46 Text: so could a model generate new data itself,
1:07:46 - 1:07:49 Text: and then could you keep going,
1:07:49 - 1:07:52 Text: continuing this scaling law,
1:07:52 - 1:07:54 Text: by training the model
1:07:54 - 1:07:57 Text: on its own outputs?
1:07:57 - 1:08:00 Text: I think that's a great question.
1:08:00 - 1:08:02 Text: I think the simplest version of this,
1:08:02 - 1:08:05 Text: well, a simple version of it that I think is probably
1:08:05 - 1:08:08 Text: important and increasingly important, is sort of
1:08:08 - 1:08:10 Text: just reinforcement learning; reinforcement learning is,
1:08:10 - 1:08:12 Text: in a certain sense, a situation where you generate your own data,
1:08:12 - 1:08:14 Text: because if you have a language model doing RL,
1:08:14 - 1:08:17 Text: then it writes something and then you're training on that data.
1:08:17 - 1:08:21 Text: So I definitely do think that that will sort of augment data
1:08:21 - 1:08:25 Text: and mean that there'll be other avenues for improvement.
1:08:25 - 1:08:28 Text: Literal data augmentation itself also seems plausible to me.
1:08:28 - 1:08:31 Text: I think it's not happening a lot because there still
1:08:31 - 1:08:34 Text: is more language data out there.
1:08:37 - 1:08:38 Text: Yeah.
1:08:38 - 1:08:51 Text: I think I've got two versions of this question, one more concrete and one
1:08:51 - 1:09:01 Text: more about you coming from the physics field into language modeling.
1:09:01 - 1:09:20 Text: In this research you've dealt with a lot of different types of data; was there anything that you found particularly surprising, or that went against what you expected from your past experience?
1:09:20 - 1:09:38 Text: And second, since you came from physics, what brought you to this kind of work, and was there anything about it that struck you as particularly surprising?
1:09:38 - 1:09:44 Text: I think to me the most surprising thing is of these sorts of
1:09:44 - 1:09:49 Text: results was probably that there is a very, very precise trend.
1:09:49 - 1:09:53 Text: It seems like, I mean, yeah, I mean, like, I think this is
1:09:53 - 1:09:56 Text: sort of an unusual thing, and I think when I saw that,
1:09:56 - 1:09:59 Text: I thought it was a really big deal.
1:09:59 - 1:10:02 Text: I think that, like, usually, I mean, it's just
1:10:02 - 1:10:05 Text: not true of most things you plot.
1:10:05 - 1:10:07 Text: I mean, obviously there are other plots that don't show
1:10:07 - 1:10:09 Text: this kind of trend, even if they're reasonable.
1:10:09 - 1:10:11 Text: I mean, like, I don't know, I mean, there's sort of a trend
1:10:11 - 1:10:14 Text: in TriviaQA, but I don't really know what that means.
1:10:14 - 1:10:17 Text: But I think the fact that there's something seemingly
1:10:17 - 1:10:22 Text: very precise is, I view that as like a very intriguing
1:10:22 - 1:10:25 Text: entry point to like try to dig into something,
1:10:25 - 1:10:27 Text: because it means that there's probably some deeper reason.
1:10:27 - 1:10:30 Text: And then the fact that it seems fairly universal
1:10:30 - 1:10:33 Text: across data distributions, again, suggests something like that.
1:10:33 - 1:10:37 Text: Yeah, the main difference between data distributions is that
1:10:37 - 1:10:40 Text: the exponents in the scaling laws are different.
1:10:40 - 1:10:43 Text: I mean, in terms of like coming from physics,
1:10:43 - 1:10:46 Text: I mean, I think I got into like a lot of this stuff partly
1:10:46 - 1:10:50 Text: because I'm fairly mercurial, and I was interested,
1:10:50 - 1:10:52 Text: and a lot of other friends I had were interested,
1:10:52 - 1:10:58 Text: and so we sort of studied it and went from there.
1:10:58 - 1:11:01 Text: But I mean, from another point of view, I think I got involved
1:11:01 - 1:11:04 Text: in it for really weird reasons, perhaps, in the sense that, like,
1:11:04 - 1:11:08 Text: I just know a lot of people who were already, in sort of,
1:11:08 - 1:11:11 Text: I don't know, 2015, talking about things like, wow,
1:11:11 - 1:11:14 Text: is like, how much better is AI going to get?
1:11:14 - 1:11:17 Text: What are the implications going to be for the world?
1:11:17 - 1:11:20 Text: Is this going to keep improving at a rapid clip?
1:11:20 - 1:11:23 Text: What are we going to do to sort of make sure that these
1:11:23 - 1:11:26 Text: models are aligned with human values to use the kind
1:11:26 - 1:11:29 Text: of usual sort of phrase that's now used?
1:11:29 - 1:11:33 Text: And I sort of thought these people were weird and crazy,
1:11:33 - 1:11:35 Text: even though they were friends of mine, and I sort of said,
1:11:35 - 1:11:38 Text: oh, like this is really dumb, like I don't think that these
1:11:38 - 1:11:40 Text: AI models are really something to worry about.
1:11:40 - 1:11:45 Text: But like, I was still interested, and sort of was like,
1:11:45 - 1:11:47 Text: well, like smart people I know think that AI is improving
1:11:47 - 1:11:52 Text: very rapidly, and that might have a lot of impacts,
1:11:52 - 1:11:55 Text: and might require a lot of sort of caution and thought,
1:11:55 - 1:11:58 Text: and work to sort of make it safe.
1:11:58 - 1:12:01 Text: And so that was actually a significant motivation for me
1:12:01 - 1:12:02 Text: getting involved.
1:12:02 - 1:12:06 Text: It was a mixture of sort of, there being a lot of potentially
1:12:06 - 1:12:09 Text: really intellectually interesting questions,
1:12:09 - 1:12:12 Text: liking to sort of switch fields every few years,
1:12:12 - 1:12:16 Text: and friends of mine being very kind of concerned about this
1:12:16 - 1:12:21 Text: question, and yeah, that was sort of what brought me in.
1:12:21 - 1:12:25 Text: Out of everything you've seen in this picture, you know,
1:12:25 - 1:12:28 Text: of how models scale, is there one ingredient,
1:12:28 - 1:12:32 Text: you know, one factor, that you think has the most potential,
1:12:32 - 1:12:38 Text: if people work on it, to change this scaling picture?
1:12:38 - 1:12:43 Text: I mean, if we go back to sort of very basic ML ingredients,
1:12:43 - 1:12:47 Text: of like, what are these things, like,
1:12:47 - 1:12:49 Text: so there's a sense in which this is all you're doing,
1:12:49 - 1:12:52 Text: you choose one of each of these five things.
1:12:52 - 1:12:55 Text: I would guess that the objective is
1:12:55 - 1:12:58 Text: most likely to change things, in the sense that
1:12:58 - 1:13:00 Text: predicting the next word is really sort of one of the
1:13:00 - 1:13:03 Text: laziest sort of dumbest things you can do.
1:13:03 - 1:13:07 Text: And, I mean, there are all sorts of things,
1:13:07 - 1:13:10 Text: so it's really just chosen because you want to be able to
1:13:10 - 1:13:12 Text: compute, you want to be able to do back prop,
1:13:12 - 1:13:15 Text: and so you want to be able to get some differentiable thing,
1:13:15 - 1:13:17 Text: you want to be able to get a lot of data for which you can
1:13:17 - 1:13:23 Text: compute this differentiable thing, and so that's the game that you're playing.
1:13:23 - 1:13:26 Text: But, I think that you can have other objectives,
1:13:26 - 1:13:31 Text: like through reinforcement learning, or some other kind of active learning,
1:13:31 - 1:13:34 Text: whatever, I mean, some combination of such things.
1:13:34 - 1:13:40 Text: And, I sort of would just guess that generally performance will change a lot more.
1:13:40 - 1:13:43 Text: Like, if you're expecting sort of these trends to be very different,
1:13:43 - 1:13:45 Text: I would guess they're different if you have a different objective.
1:13:45 - 1:13:50 Text: I think changing the data distribution, or the model might also change things,
1:13:50 - 1:13:54 Text: but I think that, like, the lesson that I personally draw from something like this,
1:13:54 - 1:13:58 Text: is that even if you found a, like, really revolutionary change,
1:13:58 - 1:14:01 Text: that was, like, much better than transformers,
1:14:01 - 1:14:05 Text: it might be kind of equivalent to making transformers 10 times bigger,
1:14:05 - 1:14:10 Text: but I'm not sure if that would be as big of a deal as changing the loss.
1:14:10 - 1:14:12 Text: Changing what the objective is.
1:14:12 - 1:14:15 Text: But that's just my guess, I have no idea.
1:14:15 - 1:14:19 Text: And, of course, this paradigm, I mean, I think I was trying to be polite.
1:14:19 - 1:14:23 Text: I usually have, like, a picture of a grilled cheese here to emphasize,
1:14:23 - 1:14:26 Text: like, sort of, how simple and sort of silly this is,
1:14:26 - 1:14:30 Text: rather than this sort of very sophisticated palette of spices.
1:14:30 - 1:14:35 Text: And, I mean, maybe someone will say, like, this isn't the right set of ingredients
1:14:35 - 1:14:38 Text: from which to think about things, and there's a different thing you should do,
1:14:38 - 1:14:40 Text: and maybe that will make a big difference as well.
1:14:40 - 1:14:43 Text: But I, that's sort of an unknown, unknown.