Stanford CS224N NLP with Deep Learning | Spring 2022 | Guest Lecture: Scaling Language Models

In theoretical physics, to get this kind of audience, you have to win the Nobel Prize

or something.

But, of course, I've been working on ML recently, and it's been much more exciting.

There's a huge amount of interest.

So to some extent, part of what I'll be talking about, maybe as an implicit theme, will be sort

of why there's so much excitement and why you might expect that excitement to continue.

So an outline of my talk is that I'll first start by discussing motivations for language

modeling.

I'm sure you're all very well motivated because this is an NLP class.

And I'll also talk about sort of orders of magnitude of data and compute that go into

contemporary language modeling.

And that will kind of set the stage for talking about scaling laws for neural language modeling.

And further realization that these scaling laws seem to be quite universal for generative

models and maybe for sort of machine learning more generally.

And then, finally, after discussing that, I'll talk about what happens when we actually

do scale up language models.

I'll talk about the GPT-3 model.

And if there's time, I'll talk about some lessons from all of these ideas for research,

which I imagine many of you are excited to be involved with soon.

So I'll start by talking about why we do language modeling and Fermi estimates for language

modeling.

By Fermi estimates, I mean questions like estimating whether there are a million or a hundred thousand or

ten thousand piano tuners in Chicago.

Fermi famously asked this kind of question.

And there are a lot of estimates like this that we can kind of do in a back of the envelope

way to really get a sense for what's going on.

But before going into that, why should you study language?

This is sort of my motivation.

You might have all sorts of other motivations.

Language is obviously a very fascinating

intellectual creation by our species.

But I think another reason why it's particularly exciting for AI is that language is, in some

sense, our species' best attempt to encode everything about the world in as efficient and

compressed a way as possible.

And that means that it's very yielding to an AI.

There's a lot of noise.

There's a huge quantity of writing freely available on the internet.

And there are also a huge number of books, for example, I think very roughly speaking

in this sort of Fermi estimate level.

There's something like ten million books in the Library of Congress.

And very, very roughly that might mean there's something like a trillion words

in those books.

And then there's actually much more language information out on the internet.

And so there's therefore a lot of data for AI models to learn from.

And then a third reason, at least to some extent for me and maybe for many of you,

is that if you're actually able to get an AI that kind of knows, quote, unquote, understands

language, then you can communicate with it in a kind of natural way.

You can ask it about anything, and you can get a lot of intuition from the responses

and behaviors and all sorts of different kinds of evaluations you can perform on such

a model.

If you compare it to sort of ancient history of AI like excitement about classifying images

from, I don't know, AlexNet, ten years ago in sort of the distant past.

And from, say, AlphaGo, again in the distant past five years ago, there's a lot more

intuition you can get.

And then you can use that to sort of understand what these models know and don't know and

can do.

And you can also think about this in terms of how to make these models aligned with what

humans prefer.

There's a lot of work on trying to understand language model bias, racism, other such issues,

and there's really a lot that you can kind of explore and dig into.

So I imagine this is all very basic for everyone here, but just so we're on the same page.

If you're doing kind of contemporary neural network based machine learning, the ingredients

that you need to get started are really surprisingly simple.

You need some kind of model to parameterize a function.

You need a data set.

You need some computers with plenty of computation.

You need a loss function and you need some choice of optimizer.

And basically for pretty much everything in this talk, I'll be thinking about language

modeling as a task where the loss function is simply to predict the next word in some

sentence or paragraph or book.

And so that's how basically all of the models that I'll be talking about are trained.

They have a loss function which incentivizes them to predict the correct probability

distribution for the next word.

So what about these other ingredients like the models that we use, the data sets that

we use, and how much computation do we use?

What are those sort of order of magnitude figures?

So one way to think about this is sort of how much language do we consume as a person

for comparison.

So you can imagine that if you were a very voracious reader, maybe you'd read a long

book every day and you'd spend your life doing that, maybe you'd live for 70 years, if

you did that, you'd end up reading something like two billion words over your lifetime.

For comparison, a canonical large language model, GPT-3, was trained on something on the order of

200 billion words.

So that's about 100 times more language data than maybe you'd see in your lifetime if

you kind of tried really hard to attend to written text.

There are other data sets, of course, that are much, much bigger than GPT-3's training

set.

There's Common Crawl, which is a sort of snapshot of the internet that anyone can

go out and download if they like, and it has very roughly on the order of 10 to the 15

words.

I said earlier that the Library of Congress has something like maybe 10 million books,

each book is maybe 100,000 words.

So the Library of Congress in total maybe has something like a trillion words.

And as another sort of smaller data set example, English Wikipedia is very roughly of

order three billion words.

So maybe if you spent your whole life reading Wikipedia, you could just barely do it if

that was your mission.
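
The word counts above can be reproduced with a few lines of arithmetic. This is only a sketch of the Fermi estimate: the round figures are the ones quoted in the talk, and the assumption that a "long book" is 100,000 words is borrowed from the Library of Congress estimate rather than stated for the reading example.

```python
# Back-of-the-envelope word counts, using the round numbers quoted in the talk.
WORDS_PER_LONG_BOOK = 100_000   # assumption: same book length as the Library of Congress estimate
YEARS_OF_READING = 70
DAYS_PER_YEAR = 365

# One long book per day for a whole lifetime.
lifetime_words = WORDS_PER_LONG_BOOK * DAYS_PER_YEAR * YEARS_OF_READING

gpt3_words = 200e9                                    # "on the order of 200 billion words"
library_of_congress_words = 10e6 * WORDS_PER_LONG_BOOK  # 10 million books
wikipedia_words = 3e9                                 # English Wikipedia, very roughly

print(f"lifetime reading:    {lifetime_words:.1e} words")
print(f"GPT-3 training data: {gpt3_words:.1e} words")
print(f"Library of Congress: {library_of_congress_words:.1e} words")
print(f"GPT-3 / lifetime:    {gpt3_words / lifetime_words:.0f}x")
```

The lifetime figure comes out at a couple of billion words, so GPT-3's training data is within a factor of order one of "100 times a lifetime of reading", which is all a Fermi estimate is meant to tell you.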

So what about the actual neural networks that I'll be talking about that we currently

seem to be using fairly effectively to model language?

So I'll be talking about transformer language models, so-called decoder-only transformer

language models of which GPT-3 is an example.

And just to sort of count things, these models have, with kind of the standard way that they're

set up, a number of parameters which is something like 12 times the number of layers in the

network (GPT-3 has 96 layers, and you can make deeper or shallower such networks)

times the sort of activation dimension squared.

So D model, this D model parameter is just the dimension of the vector space that each

token occupies or word, if you were to use words as tokens, when you run this model on

language data.

And so this gives you some sense for where the parameter count comes from.

I think D model for GPT-3 is of order 10,000 and the number of layers is 96, and that's how you get roughly

200 billion parameters in that model, and other models scale similarly.
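
As a quick check of this counting rule, here is a minimal sketch. The function name is made up for illustration, and the value 12288 for GPT-3's model dimension is taken from the published GPT-3 configuration rather than from the talk, which only says "of order 10,000". The rule ignores embedding parameters.

```python
def approx_param_count(n_layer: int, d_model: int) -> int:
    """Approximate parameter count of a decoder-only transformer.

    Uses the rule of thumb from the talk: 12 * n_layer * d_model**2.
    With the standard setup, each layer has about 4 * d_model**2 attention
    weights and 8 * d_model**2 feed-forward weights.
    """
    return 12 * n_layer * d_model ** 2


# Assumption: d_model = 12288 (published GPT-3 value, not quoted in the talk).
gpt3_params = approx_param_count(n_layer=96, d_model=12288)
print(f"about {gpt3_params / 1e9:.0f} billion parameters")
```

This lands a little under the "roughly 200 billion" quoted above, which is the level of agreement you would expect from an order-of-magnitude rule.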

Now, how much computation do you actually do when you train this kind of model?

Well, it turns out that different neural network architectures have different properties with

respect to this question, but transformers are actually quite simple in that in a forward

pass of a transformer, every parameter on every token performs roughly one add and one

multiply and then about twice this in the backward pass.

And so that gives us a very simple formula that the number of floating point operations

that a model like this performs during training is 6, which is 2 times (1 plus 2), times the

number of parameters in the model times the number of tokens, which is what D is, the

size of the data set in tokens that you process.
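
The factor of 6 can be spelled out in code. This is a sketch of the estimate as described, with hypothetical names; it ignores the context-length-dependent attention cost, which is discussed next.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, C = 6 * N * D, in floating point operations."""
    forward = 2            # one multiply and one add per parameter per token
    backward = 2 * forward  # the backward pass costs about twice the forward pass
    return (forward + backward) * n_params * n_tokens


# Example: a 1-billion-parameter model trained on 1 billion tokens.
print(f"{training_flops(1e9, 1e9):.1e} FLOPs")
```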

And one other point that sort of I'll make while kind of going over these estimates is

that you might wonder whether or not there's a lot of computation involved in processing

long sequences.

There's sort of a famous point that dense attention in transformer models is n squared

with respect to context length and that's absolutely true.

However, if you actually work out the sort of coefficients, the ratio of the amount of

computation you do in a forward pass or during training in the context direction versus

in the direction of sort of moving up the layers of the model is roughly n context over

12 times D model.

So I note this just because, if you think, as I'll kind of suggest is a likely

direction for the world to be heading, that models might continue to get bigger, then

D model for GPT-3 is already 10,000.

So the denominator here is of order 100,000.

And so actually even if you have quite long contexts with the sort of dumbest possible

dense attention, the amount of compute you actually do in the context direction is not

all that much.
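
To get a feel for this ratio, here is a small sketch that evaluates n_ctx / (12 * d_model) for a few context lengths. The function name and the particular context lengths are illustrative choices, not values from the talk; d_model is set to the round 10,000 quoted above.

```python
def context_to_depth_compute_ratio(n_ctx: int, d_model: int) -> float:
    """Ratio of dense-attention compute (context direction) to the
    parameter-dependent compute, as quoted in the talk: n_ctx / (12 * d_model)."""
    return n_ctx / (12 * d_model)


D_MODEL = 10_000  # round figure from the talk
for n_ctx in (2_048, 8_192, 32_768, 120_000):
    ratio = context_to_depth_compute_ratio(n_ctx, D_MODEL)
    print(f"n_ctx = {n_ctx:>7,}: ratio = {ratio:.3f}")
```

On this estimate the two costs only become comparable once the context length approaches 12 times the model dimension, which is around 120,000 tokens for the round number used here.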

What about actual numerical values for this compute?

So the largest models that we have so far, if we're in kind of Fermi estimate mode, we

can round up and say they have say order a trillion parameters.

If you have a model with a trillion parameters, then what kind of hardware are you going

to run it on?

Well, you might run it on an A100 GPU, at least this year.

An A100 GPU performs about 3 times 10 to the 14 floating point operations per second,

or roughly 2.6 times 10 to the 19 floating point operations per day.

This means that it's sort of convenient to sometimes use units of petaflop-days, which

is 10 to the 15 floating point operations per second times a day.

And that's about 3 A100-days.

And that's about 8.6 times 10 to the 19, or order 10 to the 20, floating point operations

in one petaflop-day.
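
These unit conversions are easy to check. The sketch below uses the round A100 throughput quoted in the talk (3 times 10 to the 14 operations per second); real sustained throughput depends on precision and utilization, so treat the figure as an assumption.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

A100_FLOPS_PER_SECOND = 3e14     # round figure quoted in the talk
PETAFLOPS = 1e15                 # floating point operations per second

a100_day = A100_FLOPS_PER_SECOND * SECONDS_PER_DAY   # FLOPs one A100 does in a day
petaflop_day = PETAFLOPS * SECONDS_PER_DAY           # FLOPs in one petaflop-day
a100_days_per_petaflop_day = petaflop_day / a100_day

print(f"one A100-day:     {a100_day:.2e} FLOPs")
print(f"one petaflop-day: {petaflop_day:.2e} FLOPs")
print(f"A100-days per petaflop-day: {a100_days_per_petaflop_day:.1f}")
```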

So how does sort of the compute available on hardware compare to the compute that we

do when we train these gigantic models?

Well, if we have a model with a trillion parameters and we train it for 300 billion tokens,

then we get 6 times 10 to the 12 times 3 times 10 to the 11.

And so we get on the order of 10 to the 24 floating point operations to train a trillion

parameter model on one of these large data sets.
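
Putting the training estimate next to the hardware numbers gives a sense of scale. This is a sketch using only the round figures from the talk (a trillion parameters, 300 billion tokens, 3 times 10 to the 14 operations per second per A100) and assumes perfect hardware utilization, which real training runs do not achieve.

```python
SECONDS_PER_DAY = 86_400

n_params = 1e12   # a trillion parameters
n_tokens = 3e11   # 300 billion tokens
total_flops = 6 * n_params * n_tokens

petaflop_day = 1e15 * SECONDS_PER_DAY
a100_day = 3e14 * SECONDS_PER_DAY   # assumes 100% utilization of the quoted peak

petaflop_days = total_flops / petaflop_day
a100_years = total_flops / a100_day / 365

print(f"total training compute: {total_flops:.1e} FLOPs")
print(f"  = {petaflop_days:,.0f} petaflop-days")
print(f"  = {a100_years:,.0f} A100-years on a single GPU")
```

In other words, a run of this size only becomes practical when it is spread across thousands of GPUs working in parallel.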

So these numbers involved, I mean, I think the thing that I find most amazing about this

is that I still remember taking chemistry in high school.

0:12:14 - 0:12:19     Text: And in chemistry, you learn that sort of a macroscopic amount of stuff is sort of

0:12:19 - 0:12:23     Text: an Avogadro's number of atoms, which is like 6 times 10 to the 23.

0:12:23 - 0:12:29     Text: So somehow we're actually able to build computers that, working together, do more than

0:12:29 - 0:12:33     Text: an Avogadro's number of computations to train these neural models.

0:12:33 - 0:12:37     Text: So anyway, I find these numbers kind of mind-boggling and also useful to sort of have in the

0:12:37 - 0:12:41     Text: back of your head to understand what's going on.

0:12:41 - 0:12:47     Text: So with that, pretty good, unless there are any questions, I'll start talking about

0:12:47 - 0:12:54     Text: scaling laws for these kinds of language models.

0:12:54 - 0:13:03     Text: So what I'll basically be arguing is that there are very surprisingly precise empirical

0:13:03 - 0:13:10     Text: scaling laws for the performance of machine learning systems, machine learning models,

0:13:10 - 0:13:16     Text: as a function of kind of gross macroscopic inputs like how many parameters does the model

0:13:16 - 0:13:23     Text: have, how big is the data set, and how much compute is used for training.

0:13:23 - 0:13:29     Text: And I'll also make the point that if you're sort of in an airplane at 30,000 feet looking

0:13:29 - 0:13:33     Text: down on what's going on in the field, a lot of the other details in these systems don't

0:13:33 - 0:13:37     Text: matter all that much, or at least they don't matter as much as you might have expected

0:13:37 - 0:13:39     Text: that they would.

0:13:39 - 0:13:45     Text: Very often they just change some kind of constant pre-factor in these kinds of scaling

0:13:45 - 0:13:50     Text: laws, which give you kind of a big picture of what's changing as you really increase

0:13:50 - 0:13:52     Text: these inputs.

0:13:52 - 0:13:58     Text: And one way of sort of turning this into sort of a theme, what do you learn from it, how

0:13:58 - 0:14:04     Text: do you summarize it, is that getting these models to perform better is to a large extent

0:14:04 - 0:14:07     Text: about kind of avoiding bottlenecks.

0:14:07 - 0:14:09     Text: It's avoiding being blocked by something.

0:14:09 - 0:14:15     Text: And there are a lot of things that can block improvements in performance.

0:14:15 - 0:14:19     Text: The most obvious one, which is what scaling laws are studying, is you could not have enough

0:14:19 - 0:14:25     Text: data, you could not have a large enough model, you could not have enough computation to train

0:14:25 - 0:14:26     Text: that model.

0:14:26 - 0:14:31     Text: And then there are also a lot of other literal bottlenecks that you can think about, many

0:14:31 - 0:14:35     Text: of which involve sort of information propagation through the network.

0:14:35 - 0:14:39     Text: So I guess like one way that I would summarize a lot of the most highly cited papers in machine

0:14:39 - 0:14:46     Text: learning in the last 10 years, papers like ResNets and LayerNorm, BatchNorm, things like

0:14:46 - 0:14:52     Text: that, is that they're sort of alleviating bottlenecks where information wasn't propagating

0:14:52 - 0:14:54     Text: nicely through your network.

0:14:54 - 0:15:00     Text: And the sort of simplest possible picture to sort of illustrate this, which perhaps is

0:15:00 - 0:15:04     Text: a cartoon of what's going on, something that I'll talk about later on with LSTMs, is

0:15:04 - 0:15:10     Text: that if you take a matrix, I mean neural networks are really just fancy systems that do a lot

0:15:10 - 0:15:11     Text: of matrix multiplication.

0:15:11 - 0:15:17     Text: If you take a matrix and you multiply it by itself a large number of times, then very roughly speaking

0:15:17 - 0:15:24     Text: what you end up with is a projection onto its largest eigenspace.

0:15:24 - 0:15:29     Text: And so very roughly speaking, if you have a deep network and you sort of don't set it up

0:15:29 - 0:15:35     Text: correctly, it's very easy to be in a situation where you lose signal or lose information

0:15:35 - 0:15:37     Text: and you get like a literal bottleneck.

0:15:37 - 0:15:43     Text: But anyway, that's sort of the philosophy that, at least at zeroth order, you might

0:15:43 - 0:15:48     Text: sort of reach from thinking about some of these results.
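
The matrix cartoon can be illustrated numerically. The sketch below is not a model of any real network: it builds a small symmetric matrix with hand-picked eigenvalues (an assumption made so the answer is known in advance), applies it repeatedly to a random vector, and checks that everything except the direction of the largest eigenvalue gets washed out.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a symmetric matrix with known eigenvalues and orthonormal eigenvectors.
q, _ = np.linalg.qr(rng.standard_normal((6, 6)))      # columns of q are orthonormal
eigenvalues = np.array([2.0, 1.0, 0.5, 0.25, 0.1, 0.05])
matrix = q @ np.diag(eigenvalues) @ q.T
top_vector = q[:, 0]                                  # eigenvector for eigenvalue 2.0

# Apply the matrix many times to a random input "signal".
x = rng.standard_normal(6)
for _ in range(50):
    x = matrix @ x
    x /= np.linalg.norm(x)   # renormalize so the vector neither blows up nor vanishes

# After many multiplications, x points along the top eigenvector (up to sign):
# the information originally carried in the other five directions is gone.
alignment = abs(x @ top_vector)
print(f"alignment with top eigenvector: {alignment:.6f}")
```

Without the renormalization step the vector's norm would also grow or shrink geometrically, which is the exploding or vanishing signal problem that residual connections and normalization layers are usually said to help with.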

0:15:48 - 0:15:55     Text: So this slide is really about the kind of core results for scaling laws for language

0:15:55 - 0:15:56     Text: models.

0:15:56 - 0:15:59     Text: I'll explain it in some detail.

0:15:59 - 0:16:05     Text: So I'm actually going to start with the plot on the far right, which is about scaling

0:16:05 - 0:16:11     Text: laws with respect to the number of parameters in a neural network.

0:16:11 - 0:16:19     Text: And so what we did to generate this plot was get a very large data set such that we weren't

0:16:19 - 0:16:23     Text: worried about models overfitting at all.

0:16:23 - 0:16:30     Text: And train all of our models for a very long time so that they were essentially at convergence.

0:16:30 - 0:16:34     Text: So in other words, training time or compute was not a constraint on performance.

0:16:34 - 0:16:41     Text: And then plot the resulting test loss of language models, trained to predict the next word

0:16:41 - 0:16:45     Text: as a function of parameter count on a nice log scale.

0:16:45 - 0:16:49     Text: And so what you see is that there's this power law, which is a straight line on a log

0:16:49 - 0:16:56     Text: log plot of the loss as a function of the parameter count of these models.
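
A power law being a straight line on a log-log plot also tells you how to fit one: do linear regression on the logs. The sketch below uses synthetic, made-up data; the functional form L(N) = (N_c / N)^alpha and the constants are hypothetical stand-ins, not the measured values behind the plot.

```python
import numpy as np

# Synthetic data: a power law with mild multiplicative noise (all values made up).
true_alpha, true_n_c = 0.08, 1e13
rng = np.random.default_rng(0)
n_params = np.logspace(5, 9, 9)   # model sizes from 1e5 to 1e9 parameters
noise = np.exp(rng.normal(0.0, 0.01, n_params.size))
loss = (true_n_c / n_params) ** true_alpha * noise

# log L = alpha * log N_c - alpha * log N, so a degree-1 fit in log space
# recovers the exponent from the slope and N_c from the intercept.
slope, intercept = np.polyfit(np.log(n_params), np.log(loss), deg=1)
fitted_alpha = -slope
fitted_n_c = np.exp(intercept / fitted_alpha)

print(f"fitted exponent alpha = {fitted_alpha:.3f}")
print(f"fitted N_c            = {fitted_n_c:.2e}")
```

Note that the exponent is recovered much more reliably than the constant N_c, because a small error in the slope is amplified when it is divided into the intercept.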

0:16:56 - 0:17:01     Text: In the middle plot, we do the same thing, but switch the role of the amount of data that

0:17:01 - 0:17:03     Text: we have with parameter count.

0:17:03 - 0:17:08     Text: So we train a model that's very large, maybe one of the largest models on the plot on

0:17:08 - 0:17:15     Text: the right, so that model size is not a constraint on performance on data sets of various sizes.

0:17:15 - 0:17:17     Text: And we apply early stopping.

0:17:17 - 0:17:21     Text: So we measure the test loss at the point where the test loss is at its minimum during

0:17:21 - 0:17:24     Text: otherwise pretty naive straightforward training.

0:17:24 - 0:17:31     Text: And we find again a very clear power law for loss as a function of data set size.

0:17:31 - 0:17:35     Text: And then the most complicated plot is the one on the left.

0:17:35 - 0:17:45     Text: So on the left, we plot all of the learning curves for many different models.

0:17:45 - 0:17:48     Text: We provide these models with plenty of data so they're not overfitting.

0:17:48 - 0:17:52     Text: They're in the under parameterized regime.

0:17:52 - 0:17:57     Text: But we train all of these different model sizes for a very, very long time.

0:17:57 - 0:18:03     Text: And we measure on the x-axis not the number of training steps or training tokens, but

0:18:03 - 0:18:08     Text: the amount of compute that has been used so far during training.

0:18:08 - 0:18:12     Text: And as a consequence of one of the formulas that I wrote a couple of slides ago, that

0:18:12 - 0:18:18     Text: compute is six times parameter count times the amount of training data.

0:18:18 - 0:18:24     Text: If you take the logarithm of both sides, the log of parameters times data is log of parameters

0:18:24 - 0:18:25     Text: plus log of data.

0:18:25 - 0:18:29     Text: So what that means is that learning curves for models at different sizes are just shifted

0:18:29 - 0:18:34     Text: over left and right by constant amounts with the largest models on the sort of the far

0:18:34 - 0:18:38     Text: right of this curve and the smallest models on the left.

0:18:38 - 0:18:42     Text: So we have the learning curves for all of these models all put together.

0:18:42 - 0:18:47     Text: And so a question you can ask is sort of what is the best loss you can get for any given

0:18:47 - 0:18:52     Text: amount of training compute where you're allowing yourself to choose the model that does best

0:18:52 - 0:18:54     Text: for that amount of training compute?

0:18:54 - 0:18:59     Text: And that's what sort of the heavy black line and the orange fit are picking out.

0:18:59 - 0:19:05     Text: I mean formally you could call this the convex hull of all of these curves.

0:19:05 - 0:19:12     Text: And that again somewhat surprisingly seems to obey a very nice power law fit over many,

0:19:12 - 0:19:17     Text: many orders of magnitude in computation.
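
The "best loss for a given compute budget" construction can be sketched with a toy loss surface. Everything about `toy_loss` below is a hypothetical assumption chosen for illustration (a sum of a model-size term and a data term); it is not the fit from the talk, so the exponents it produces should not be compared with the real results. The point is only the procedure: for each compute budget C, give each model size N the tokens D = C / (6N) that the budget affords, and keep the lowest loss.

```python
import numpy as np


def toy_loss(n_params, n_tokens):
    """Hypothetical loss surface, for illustration only."""
    return (1e13 / n_params) ** 0.08 + (1e13 / n_tokens) ** 0.1


compute_budgets = np.logspace(16, 21, 21)   # FLOPs
model_sizes = np.logspace(6, 10, 41)        # parameters

# Tokens each model can be trained on within each budget, from C = 6 * N * D.
tokens = compute_budgets[None, :] / (6 * model_sizes[:, None])
losses = toy_loss(model_sizes[:, None], tokens)   # shape (n_models, n_budgets)

# The frontier is the lower envelope of all the per-model learning curves.
frontier = losses.min(axis=0)
best_sizes = model_sizes[losses.argmin(axis=0)]

for c, n, l in zip(compute_budgets[::5], best_sizes[::5], frontier[::5]):
    print(f"C = {c:.0e} FLOPs: best N = {n:.1e}, loss = {l:.2f}")
```

Reading off `best_sizes` as a function of the budget is the same move as picking out, at each point on the x-axis, the learning curve that touches the envelope.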

0:19:17 - 0:19:21     Text: And it's crucial for all of these experiments that you're only limiting performance with

0:19:21 - 0:19:23     Text: one thing at a time.

0:19:23 - 0:19:27     Text: On the far right you have plenty of data and compute, but you're limiting the number of

0:19:27 - 0:19:30     Text: parameters. In the middle you're limiting the amount of data but you have a big model.

0:19:30 - 0:19:36     Text: On the left you're looking at training compute but you have all sorts of different model sizes

0:19:36 - 0:19:39     Text: and again plenty of data.

0:19:39 - 0:19:43     Text: So in other words in each of these cases there's sort of one of these parameters that's

0:19:43 - 0:19:48     Text: bottlenecking performance and otherwise you have plenty of resources.

0:19:48 - 0:19:50     Text: There's a question?

0:19:50 - 0:20:00     Text: I hope that's not true.

0:20:00 - 0:20:03     Text: So there's a minus sign in the exponent.

0:20:03 - 0:20:07     Text: I'm not sure if you're looking at the lines or the function.

0:20:07 - 0:20:09     Text: On the bottom, I see.

0:20:09 - 0:20:10     Text: I'll be right.

0:20:10 - 0:20:11     Text: Oh, I see.

0:20:11 - 0:20:12     Text: Okay.

0:20:12 - 0:20:17     Text: Yeah, it's just a log scale plot.

0:20:17 - 0:20:20     Text: Yeah, please ask any questions.

0:20:20 - 0:20:22     Text: Great.

0:20:22 - 0:20:27     Text: And then the x-axis on this compute plot is this petaflop-day unit.

0:20:27 - 0:20:31     Text: That's why it's actually a small number.

0:20:31 - 0:20:36     Text: Any other questions about anything about this plot?

0:20:36 - 0:20:39     Text: Cool.

0:20:39 - 0:20:44     Text: So there's another thing that you can do that's kind of interesting with the plot on the

0:20:44 - 0:20:53     Text: left, which is you can ask for any given quantity of compute that you have available.

0:20:53 - 0:20:58     Text: Someone kindly donates to you some number of A100s to use for a few weeks and you want

0:20:58 - 0:21:01     Text: to use it to train the best possible language model you can.

0:21:01 - 0:21:07     Text: And so you can ask based on this plot on the left, how should I allocate the computation

0:21:07 - 0:21:14     Text: that was given to me in terms of making a bigger model or training longer?

0:21:14 - 0:21:20     Text: And it turns out there's sort of a simplified cartoon for the answer that we found with

0:21:20 - 0:21:26     Text: our language data, which was that you want to allocate most of your compute, basically

0:21:26 - 0:21:32     Text: two-thirds on a geometric scale to making models bigger.

0:21:32 - 0:21:38     Text: And you can allocate about a third to training for longer on more data.

0:21:38 - 0:21:43     Text: And so this at least for us wasn't an obvious conclusion.

0:21:43 - 0:21:49     Text: It suggests that a lot of the gains that you're going to get if you want to get better

0:21:49 - 0:21:53     Text: performance with a fixed amount of compute, a fixed budget, is going to come from making

0:21:53 - 0:21:55     Text: your models bigger.

0:21:55 - 0:21:58     Text: And it turns out that in practice, I won't go into it in detail.

0:21:58 - 0:22:02     Text: You can, to some extent, just make your batch size bigger during training.

0:22:02 - 0:22:06     Text: And that means that the total number of serial steps that you train for doesn't have to

0:22:06 - 0:22:07     Text: increase all that much.

0:22:07 - 0:22:11     Text: You don't necessarily need to train for vastly longer.

0:22:11 - 0:22:16     Text: You seemingly just need, largely, a bigger model.

0:22:16 - 0:22:20     Text: And that's something that you read off from this compute plot that I showed.

0:22:20 - 0:22:28     Text: That's a general question.

0:22:28 - 0:22:37     Text: The way that you get this graph is you basically do an analysis where you look at any given

0:22:37 - 0:22:44     Text: point for compute and you look up and you pick out the blue curve that's closest to the

0:22:44 - 0:22:45     Text: black line.

0:22:45 - 0:22:49     Text: And that gives you a model size and an amount of training.

0:22:49 - 0:22:56     Text: And so you can do that for all of these different points on the x-axis.

0:22:56 - 0:22:59     Text: And then for any given point on this x-axis that tells you a model size.

0:22:59 - 0:23:04     Text: You learn model size as a function of your compute budget.

0:23:04 - 0:23:09     Text: And then, conversely, you also learn an amount of training, which is sort of a data set size.

0:23:09 - 0:23:16     Text: And so that's the explanation for the sort of million x model size versus a thousand

0:23:16 - 0:23:18     Text: x in data.
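
One way to read "two-thirds on a geometric scale" is as exponents: model size grows like compute to the two-thirds power and data like compute to the one-third power. Under that reading (an interpretation, with a hypothetical function name), the "million x versus thousand x" split corresponds to a billion-fold increase in compute:

```python
def scale_up(compute_multiplier: float, model_exponent: float = 2 / 3):
    """Split a multiplicative increase in compute between model size and data.

    Assumes N scales as C**model_exponent and D as C**(1 - model_exponent),
    so that N * D (and hence C = 6 * N * D) scales by compute_multiplier.
    """
    model_multiplier = compute_multiplier ** model_exponent
    data_multiplier = compute_multiplier ** (1 - model_exponent)
    return model_multiplier, data_multiplier


model_x, data_x = scale_up(1e9)
print(f"compute x 1e9 -> model size x {model_x:.1e}, data x {data_x:.1e}")
```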

0:23:18 - 0:23:23     Text: I probably won't try to explain the batch size question.

0:23:23 - 0:23:27     Text: But it's basically based on some empirical analysis where you ask, how big can you make

0:23:27 - 0:23:28     Text: your batch size?

0:23:28 - 0:23:33     Text: How far can you push data parallelism without seeing diminishing returns?

0:23:33 - 0:23:37     Text: And that's sort of the rough answer from that question.

0:23:37 - 0:23:41     Text: They're all trained from scratch.

0:23:41 - 0:23:46     Text: So almost everything that I'll talk about in this talk is training from

0:23:46 - 0:23:47     Text: scratch.

0:23:47 - 0:23:51     Text: Any other questions?

0:23:51 - 0:23:54     Text: Okay.

0:23:54 - 0:23:58     Text: And then another point that I mean, I don't want to overemphasize.

0:23:58 - 0:24:04     Text: But like I said, from a sort of very zeroth-order naive perspective, is that for some of these

0:24:04 - 0:24:07     Text: results, architecture isn't the most crucial thing.

0:24:07 - 0:24:13     Text: So I think one of the biggest advances in machine learning in the last five or ten years

0:24:13 - 0:24:17     Text: has been the development of the transformer models that I'm talking about.

0:24:17 - 0:24:22     Text: But of course, you can do language modeling with a recurrent model that reads words in

0:24:22 - 0:24:24     Text: order.

0:24:24 - 0:24:30     Text: And of course, LSTMs or stacked LSTMs are sort of the standard way to do that.

0:24:30 - 0:24:36     Text: And so you can compare what you actually get if you study LSTMs versus transformers.

0:24:36 - 0:24:40     Text: And at zeroth order, it doesn't seem like LSTMs are so bad.

0:24:40 - 0:24:44     Text: It looks like as you make them bigger, they are scaling up quite nicely.

0:24:44 - 0:24:48     Text: But there's basically a constant offset where transformers are something like five or ten

0:24:48 - 0:24:53     Text: times more efficient for a given model size than LSTMs.

0:24:53 - 0:24:56     Text: And so I think this is a very, very convincing plot that tells you that transformers are in

0:24:56 - 0:24:58     Text: fact better.

0:24:58 - 0:25:03     Text: But you don't necessarily need a transformer to see that making models bigger is giving

0:25:03 - 0:25:05     Text: you a win.

0:25:05 - 0:25:10     Text: And really the sort of more interesting limitation of LSTMs that I'll also talk about a little

0:25:10 - 0:25:13     Text: more later is if we plot something else.

0:25:13 - 0:25:21     Text: So if we look at a thousand tokens, which is something like 600 words of context, we

0:25:21 - 0:25:25     Text: can look at what the loss is as a function of the position in the context.

0:25:25 - 0:25:30     Text: Because if you've read more of a document already, you're going to be better at predicting

0:25:30 - 0:25:33     Text: what the next word is because you have more context available.

0:25:33 - 0:25:34     Text: And there are very smooth,

0:25:34 - 0:25:41     Text: it turns out, also power law curves for the loss as a function of context position.

0:25:41 - 0:25:47     Text: But the thing that you notice is that the red lines are LSTMs and the blue lines are

0:25:47 - 0:25:48     Text: transformers.

0:25:48 - 0:25:54     Text: And LSTMs tend to sort of plateau in performance after on the order of a hundred tokens.

0:25:54 - 0:26:00     Text: And this is sort of another bottleneck in a different direction.

0:26:00 - 0:26:05     Text: This is the famous fact that transformers are much better at learning long context information.

0:26:05 - 0:26:08     Text: And this is obviously a limitation of LSTMs.

0:26:08 - 0:26:14     Text: But sort of the basic parameter scaling law seems like it holds for many architectures.

0:26:14 - 0:26:15     Text: And then there are much more refined questions.

0:26:15 - 0:26:18     Text: You can ask, I won't go into too much detail on this.

0:26:18 - 0:26:21     Text: But there are all sorts of hyper parameters in transformer models.

0:26:21 - 0:26:25     Text: And you might ask how much does it matter if I really optimize those?

0:26:25 - 0:26:28     Text: Do I get qualitatively different behavior if I optimize those better?

0:26:28 - 0:26:33     Text: And what all of these plots show is that for various different kinds of hyper parameters

0:26:33 - 0:26:38     Text: in transformer models, there's some broad basin where you get quite good performance.

0:26:38 - 0:26:42     Text: I mean, maybe a factor of three in either direction where performance doesn't change

0:26:42 - 0:26:43     Text: all that much.

0:26:43 - 0:26:48     Text: Of course, you might want to optimize that I'm not saying you shouldn't, but kind of qualitatively

0:26:48 - 0:26:53     Text: it's not an enormous difference.

0:26:53 - 0:26:59     Text: So I think this is also a place where it's, so I'm going to tell you in a few slides

0:26:59 - 0:27:04     Text: that a lot of these features are true more generally beyond language.

0:27:04 - 0:27:11     Text: And they really sort of say that much of what's going on when machines learn is quite universal.

0:27:11 - 0:27:14     Text: But there are features that are not universal.

0:27:14 - 0:27:22     Text: So this is kind of a nicer plot of loss versus token index.

0:27:22 - 0:27:29     Text: And I've included some power law fits, which are dotted lines, which show that this is

0:27:29 - 0:27:34     Text: actually, this performance is also highly predictable.

0:27:34 - 0:27:38     Text: That just says the obvious that when you read more, you understand it's easier for you

0:27:38 - 0:27:40     Text: to predict what's coming next.

0:27:40 - 0:27:45     Text: But you can train models on images, I'll briefly talk about that later, you can train models

0:27:45 - 0:27:48     Text: identically on images.

0:27:48 - 0:27:51     Text: And there you see a performance as a function of context position.

0:27:51 - 0:27:52     Text: It's very different.

0:27:52 - 0:27:58     Text: So here you have a model that reads pixels row by row.

0:27:58 - 0:28:02     Text: And as you might expect, there's usually much more non-trivial stuff going on in the

0:28:02 - 0:28:05     Text: middle of an image rather than in the background.

0:28:05 - 0:28:08     Text: And that's represented by the fact that models do much worse.

0:28:08 - 0:28:12     Text: Their loss is higher in the center of images as compared to near the edges.

0:28:12 - 0:28:17     Text: So while some properties of transformers and language models are universal, and I'll

0:28:17 - 0:28:22     Text: talk about those later on, there are features of language data that are totally different

0:28:22 - 0:28:24     Text: from other data distributions.

0:28:24 - 0:28:30     Text: And this is a very stark example of that.

0:28:30 - 0:28:35     Text: But generally, the fact that there are these kinds of nice patterns lurking whenever you

0:28:35 - 0:28:37     Text: optimize a model.

0:28:37 - 0:28:39     Text: I think that is very common.

0:28:39 - 0:28:42     Text: So any questions about this?

0:28:42 - 0:28:43     Text: Yeah.

0:28:43 - 0:28:45     Text: Do you mind going back to slide?

0:28:45 - 0:28:52     Text: Do you mind explaining what it means to have a loss on the first token versus the

0:28:52 - 0:28:54     Text: 1000th token?

0:28:54 - 0:29:01     Text: Yeah, so if you imagine you have a thousand words extracted randomly from a book, then

0:29:01 - 0:29:05     Text: the very first thing you can ask the model to do is try to predict the very first word.

0:29:05 - 0:29:09     Text: Then you ask it to predict the second word, the third word, et cetera.

0:29:09 - 0:29:15     Text: The very first word basically all the model can possibly do is predict the unigram distribution

0:29:15 - 0:29:16     Text: for its training set.

0:29:16 - 0:29:20     Text: It just doesn't have any information to go on, otherwise to predict what's happening.

0:29:20 - 0:29:23     Text: And so that's why its loss is very high.

0:29:23 - 0:29:28     Text: But by the time you get to the end of the passage, you've read a lot of some little short story,

0:29:28 - 0:29:30     Text: and you know a lot about what's going to happen.

0:29:30 - 0:29:32     Text: You know what kinds of words are likely to come next.

0:29:32 - 0:29:35     Text: You know about the author's style and vocabulary.

0:29:35 - 0:29:38     Text: You know about what characters exist, et cetera.

0:29:38 - 0:29:43     Text: And so your model has gotten much, much better at prediction by the end of the context.

0:29:43 - 0:29:48     Text: And so literally to make this plot, you take maybe a thousand, ten thousand different

0:29:48 - 0:29:51     Text: passages with a thousand words in them.

0:29:51 - 0:29:55     Text: You compute the model's loss on all of the words in the passage, and then you take the

0:29:55 - 0:29:57     Text: mean, and you get some nice plot like this.
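
The averaging procedure just described can be sketched in a few lines. This is a minimal illustration, not the code behind the plot: the per-token losses here are synthetic stand-ins (a hypothetical decaying trend plus noise), and the function name `mean_loss_by_position` is made up for this example. With a real model you would fill `token_losses` with its cross-entropy on each token of each passage.

```python
import numpy as np


def mean_loss_by_position(token_losses):
    """Average per-token loss over passages, one value per context position.

    token_losses: array of shape (num_passages, context_length) holding the
    model's cross-entropy loss on each token of each passage.
    """
    token_losses = np.asarray(token_losses, dtype=float)
    return token_losses.mean(axis=0)


# Synthetic stand-in for real model losses (an assumption for illustration):
# a loss that decays with position, plus per-token noise.
rng = np.random.default_rng(0)
num_passages, context_length = 1000, 1000
positions = np.arange(1, context_length + 1)
synthetic_trend = 3.0 + 4.0 * positions ** -0.3
token_losses = synthetic_trend + rng.normal(0.0, 0.5, size=(num_passages, context_length))

curve = mean_loss_by_position(token_losses)
print("loss at first token:", curve[0])
print("loss at last token:", curve[-1])
```

Averaging over many passages is what turns a very noisy per-token loss into a smooth curve as a function of position.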

0:29:57 - 0:29:58     Text: Yeah.

0:29:58 - 0:30:08     Text: But because the computational complexity is quadratic with respect to token index, wouldn't

0:30:08 - 0:30:12     Text: that essentially mean, if you think about it, if you were looking

0:30:12 - 0:30:18     Text: at compute, then it would go from zero to like ten to the six?

0:30:18 - 0:30:27     Text: So, you know, significantly greater compute for a given test loss as you increase

0:30:27 - 0:30:28     Text: token index.

0:30:28 - 0:30:33     Text: So it's true that if you make the context length longer and longer, you will spend somewhat

0:30:33 - 0:30:35     Text: more compute.

0:30:35 - 0:30:41     Text: But the fraction of the amount of compute you spend near the last token isn't nearly

0:30:41 - 0:30:42     Text: so stark.

0:30:42 - 0:30:50     Text: Most of the compute happens in the matrix multiplies for the MLP feed forward part of

0:30:50 - 0:30:57     Text: the transformer, and also the matrix multiplies to make the keys and queries and values, et cetera.

0:30:57 - 0:30:59     Text: That's actually, in most, well,

0:30:59 - 0:31:04     Text: it depends on the model hyperparameters, but in many models, especially models that are

0:31:04 - 0:31:07     Text: large, that's actually the predominant compute.

0:31:07 - 0:31:10     Text: And so actually, the amount of compute you do for the last token and the first token might

0:31:10 - 0:31:12     Text: only differ by a few percent.

0:31:12 - 0:31:16     Text: So for GPT-3, I think it's literally like a one or two percent difference.

0:31:16 - 0:31:19     Text: So that's the matrix multiplies versus the attention?

0:31:19 - 0:31:20     Text: Yeah, yeah, yeah.

0:31:20 - 0:31:24     Text: So I mean, the formula for that was this one that I briefly mentioned here.

0:31:24 - 0:31:29     Text: So basically, how much compute you do in the context direction divided by the amount of

0:31:29 - 0:31:32     Text: compute you do in the matrix multiply direction is this.

0:31:32 - 0:31:40     Text: So if your model is, if D model is very small, if D model is 128, and context is 1,000, then

0:31:40 - 0:31:42     Text: it's basically 50-50.

0:31:42 - 0:31:48     Text: But if D model is 10,000, and context is 2,000, then it's like 2%.
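
The formula on the slide is not reproduced in the transcript. A common back-of-the-envelope count, which is an assumption here, is that per token and per layer the context-dependent attention work scales like 2 * n_ctx * d_model, while the weight matrix multiplies scale like 2 * 12 * d_model^2, so the ratio is n_ctx / (12 * d_model). The sketch below evaluates that assumed ratio for the two cases mentioned, plus the published GPT-3 175B settings (d_model = 12288, context 2048). It gives roughly 0.65, 0.017 and 0.014, in line with "basically 50-50", "like 2%" and "one or two percent".

```python
def attention_to_matmul_ratio(n_ctx, d_model):
    """Context-direction compute divided by weight-matmul compute, per token.

    Assumed counting (not taken from the transcript): attention over the
    context costs about 2 * n_ctx * d_model per layer, and the dense
    matmuls (QKV, output projection, MLP) cost about 2 * 12 * d_model**2.
    """
    return n_ctx / (12 * d_model)


small = attention_to_matmul_ratio(n_ctx=1000, d_model=128)
large = attention_to_matmul_ratio(n_ctx=2000, d_model=10_000)
gpt3 = attention_to_matmul_ratio(n_ctx=2048, d_model=12_288)

for name, ratio in [("small", small), ("large", large), ("gpt3", gpt3)]:
    print(f"{name}: {ratio:.4f}")
```

Because the ratio depends only on n_ctx / d_model, growing the context in proportion to the model width keeps the fractional cost fixed.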

0:31:48 - 0:31:52     Text: So if the models keep getting bigger, then that means that if you're willing to pay

0:31:52 - 0:31:57     Text: a fractional cost, then you can keep making context length longer and pay a fixed fractional

0:31:57 - 0:31:58     Text: cost.

0:31:58 - 0:32:01     Text: And of course, if you use something fancy with intense attention, you also get extra

0:32:01 - 0:32:04     Text: wins on top of that.

0:32:04 - 0:32:08     Text: Any other questions?

0:32:08 - 0:32:11     Text: Cool.

0:32:11 - 0:32:21     Text: So this is sort of both of these, the left and the right, show you samples from a transformer

0:32:21 - 0:32:24     Text: model.

0:32:24 - 0:32:28     Text: Very roughly speaking, they're identical kinds of transformer models, just with some

0:32:28 - 0:32:30     Text: slightly different hyperparameters.

0:32:30 - 0:32:32     Text: But they're trained on very different data distributions.

0:32:32 - 0:32:35     Text: The one on the left is obviously, this is GPT-3.

0:32:35 - 0:32:38     Text: The one on the right is iGPT.

0:32:38 - 0:32:42     Text: It's a model that's trained to predict pixels, row by row.

0:32:42 - 0:32:47     Text: And so what happened here was that we took the top half of an image and then generated

0:32:47 - 0:32:50     Text: all the rows beneath.

0:32:50 - 0:32:55     Text: And so the same kind of model architecture, but just trained on different data distributions

0:32:55 - 0:33:04     Text: is able to effectively learn very impressive generative capabilities in both cases.

0:33:04 - 0:33:09     Text: And so this is sort of a qualitative hint at the possibility that what's going on here

0:33:09 - 0:33:13     Text: is quite universal.

0:33:13 - 0:33:18     Text: And so another way of introducing it is to say, you might have some questions after the

0:33:18 - 0:33:19     Text: last few slides.

0:33:19 - 0:33:22     Text: Are the scaling laws I'm talking about really specific to language? Are they a feature

0:33:22 - 0:33:26     Text: of the kinds of data that language is?

0:33:26 - 0:33:29     Text: You might ask, do these scaling laws really continue?

0:33:29 - 0:33:33     Text: You showed that they're true over many orders of magnitude, but do they break down eventually,

0:33:33 - 0:33:34     Text: and in what way?

0:33:34 - 0:33:40     Text: And then another question you might ask is, what do they imply for other kinds of evaluations?

0:33:40 - 0:33:45     Text: You probably don't just want to generate raw samples from either of these kinds of models.

0:33:45 - 0:33:48     Text: You might want to use them for some other more specific task.

0:33:48 - 0:33:55     Text: And so the question of whether or not the test loss, the training loss that you've optimized

0:33:55 - 0:34:00     Text: as that goes down in a predictable way, does that also imply that other things, other

0:34:00 - 0:34:03     Text: capabilities of the model are improving?

0:34:03 - 0:34:07     Text: So I'll be talking about these questions.

0:34:07 - 0:34:16     Text: So this plot contains kind of a lot of compressed information all at once, or the set of plots.

0:34:16 - 0:34:25     Text: So this is the result of what happens if you train the same kind of transformer models on

0:34:25 - 0:34:27     Text: sort of five different data distributions.

0:34:27 - 0:34:33     Text: So text language we already saw, but you can try video where you predict every pixel

0:34:33 - 0:34:39     Text: in a video in this sort of rectangular prism of video pixels.

0:34:39 - 0:34:45     Text: Images, this sort of synthetically generated DeepMind math data set where you're trying

0:34:45 - 0:34:49     Text: to predict the answer to math problems.

0:34:49 - 0:34:55     Text: There's a multimodal data set where you have image text pairs in either direction.

0:34:55 - 0:35:03     Text: And in all cases, the x-axis is compute, and the y-axis is the appropriate test loss

0:35:03 - 0:35:08     Text: for that class of models minus a constant.

0:35:08 - 0:35:15     Text: So that's the one complication that I've added here.

0:35:15 - 0:35:30     Text: So the claim is that these dashed lines in terms of the original loss are a power law,

0:35:30 - 0:35:38     Text: like the power laws that we saw on a much earlier slide, plus one constant term.

0:35:38 - 0:35:42     Text: And if you subtract off that constant term, then you make a log-log plot once again,

0:35:42 - 0:35:46     Text: then you once again get these very, very nice straight lines.
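
To make the "power law plus a constant" claim concrete, here is a small sketch using the form L(C) = L_inf + (C0 / C)^alpha. The functional form matches what is described above, but the parameter values `L_inf`, `C0` and `alpha` are hypothetical, chosen only for illustration. The point is that the reducible part, L minus the constant, is a straight line on a log-log plot, while the raw loss is not.

```python
import numpy as np

# Hypothetical parameters for illustration only.
L_inf = 2.0   # constant (irreducible) term
C0 = 1e-3     # compute scale
alpha = 0.2   # power-law exponent

compute = np.logspace(-6, 2, 30)
loss = L_inf + (C0 / compute) ** alpha

# Straight-line fit in log-log space after subtracting the constant term.
slope_reducible, _ = np.polyfit(np.log10(compute), np.log10(loss - L_inf), 1)

# The same fit on the raw loss: the curve bends, and the fitted slope is shallower.
slope_raw, _ = np.polyfit(np.log10(compute), np.log10(loss), 1)

print("slope of log(L - L_inf) vs log(C):", slope_reducible)
print("slope of log(L) vs log(C):", slope_raw)
```

In practice the constant is not known in advance and has to be fitted together with the power-law parameters.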

0:35:46 - 0:35:50     Text: And so this compute scaling law generalizes to all these other data distributions.

0:35:50 - 0:35:56     Text: And the other scaling laws also generalize, I just haven't plotted them.

0:35:56 - 0:36:01     Text: So the claim of this slide is that scaling laws do generalize to all of these other data

0:36:01 - 0:36:07     Text: distributions, and you train the same basic kind of model on them.

0:36:07 - 0:36:13     Text: And furthermore, there's sort of an intellectually slightly interesting point, which is that

0:36:13 - 0:36:21     Text: if you really believe that these dashed lines are true, if you think that they're a real

0:36:21 - 0:36:28     Text: feature of what's going on, and they continue out very, very, very far, then if you think

0:36:28 - 0:36:34     Text: that the loss is a constant plus a power law, then you can interpret the constant term

0:36:34 - 0:36:39     Text: as the entropy of the underlying data distribution.

0:36:39 - 0:36:44     Text: And you can interpret the power law as something like the KL divergence between the true data

0:36:44 - 0:36:48     Text: distribution and the model that you have.

0:36:48 - 0:36:50     Text: So that's a lot.

0:36:50 - 0:36:57     Text: The important summary at zeroth order to remember is that I'm telling you that the kinds of

0:36:57 - 0:37:02     Text: scaling laws I presented for language generalize to all of these other domains.

0:37:02 - 0:37:06     Text: There's also some other interesting features here.

0:37:06 - 0:37:13     Text: The reason why I used compute to illustrate that the scaling laws generalize is because

0:37:13 - 0:37:18     Text: you can ask another question now that puts all of the different data distributions on

0:37:18 - 0:37:19     Text: one plot.

0:37:19 - 0:37:25     Text: It wouldn't have made any sense to combine the five plots on the last slide into one plot,

0:37:25 - 0:37:30     Text: because the test loss, when you're predicting a word, is not in any way comparable to the

0:37:30 - 0:37:32     Text: test loss when you're predicting a pixel.

0:37:32 - 0:37:33     Text: It doesn't really make sense.

0:37:33 - 0:37:34     Text: They don't have the same units.

0:37:34 - 0:37:36     Text: It doesn't make sense to put them together.

0:37:36 - 0:37:42     Text: But something that does make sense to put together is what the optimal model size is as

0:37:42 - 0:37:45     Text: a function of your computational budget.

0:37:45 - 0:37:50     Text: And so in the same way that we did for language, you can go here and you can ask for any

0:37:50 - 0:37:55     Text: given amount of compute, like 10 to the minus 2 petaflop-days, what is the best model size?

0:37:55 - 0:37:57     Text: You can do that for all of these plots.

0:37:57 - 0:38:01     Text: You combine that information together and you find something kind of surprising, which

0:38:01 - 0:38:07     Text: is that, again, roughly speaking, if you're sort of willing to allow a little bit of wiggle

0:38:07 - 0:38:13     Text: room, all of these different kinds of models seem to be on the same trajectory for optimal

0:38:13 - 0:38:15     Text: model size versus compute.

0:38:15 - 0:38:20     Text: There's some kind of universal fit of how much bigger you should make your model if you're

0:38:20 - 0:38:28     Text: going to model any of these data distributions with some given amount of compute.

0:38:28 - 0:38:32     Text: So what about other kinds of tasks?

0:38:32 - 0:38:38     Text: Well, one of the most classic tasks that you can ask about in ML is image classification.

0:38:38 - 0:38:44     Text: And so the models that we were training on images, and that I've shown you plots of

0:38:44 - 0:38:50     Text: their training loss, these models are sort of trained on tiny little images, predicted

0:38:50 - 0:38:54     Text: pixel by pixel. In particular, they're 32 by 32 images, so we can look at sort of the

0:38:54 - 0:39:00     Text: 32 by 32 pixel version of ImageNet classification.

0:39:00 - 0:39:05     Text: And the models that I was discussing are generative models that predict pixels, but you can chop

0:39:05 - 0:39:13     Text: off their heads, add a classification head in its place, and try to predict ImageNet

0:39:13 - 0:39:15     Text: and train on ImageNet.

0:39:15 - 0:39:19     Text: And the orange curve that I've shown you here is what happens if you just take a randomly

0:39:19 - 0:39:23     Text: initialized model with that architecture and train it.

0:39:23 - 0:39:28     Text: You get very good performance up to a point and then performance plateaus because you're

0:39:28 - 0:39:33     Text: being limited by the fact that ImageNet is, from this point of view, a small data set.

0:39:33 - 0:39:38     Text: However, if you take these pre-trained models that have been trained generatively to

0:39:38 - 0:39:43     Text: draw pixels, they sort of use the features, presumably they're using the features they

0:39:43 - 0:39:52     Text: learned from image generation for classification, and you get some nice trend for the error rate

0:39:52 - 0:39:56     Text: in classification as a function of model size.

0:39:56 - 0:40:02     Text: So this is saying that in this particular case, when you actually do fine-tuning, the pre-training

0:40:02 - 0:40:08     Text: you did and the sort of trends you saw really kind of transfer into trends in something else

0:40:08 - 0:40:12     Text: you might care about like image classification.

0:40:12 - 0:40:17     Text: We can ask the same kinds of questions about language models.

0:40:17 - 0:40:23     Text: In particular, does this steady improvement in language modeling as a function of scale,

0:40:23 - 0:40:28     Text: does that translate into better performance?

0:40:28 - 0:40:30     Text: And this is sort of an interesting subject by itself.

0:40:30 - 0:40:34     Text: And so you can ask what happens if we scale language models.

0:40:34 - 0:40:38     Text: And so this is sort of this exact same plot that you've seen a couple of times now for

0:40:38 - 0:40:44     Text: language models, but it's just extended from sort of the original work that we did out to

0:40:44 - 0:40:47     Text: this yellow line, which is GPT-3.

0:40:47 - 0:40:50     Text: And you see that basically these sorts of trends continue.

0:40:50 - 0:40:54     Text: Possibly GPT-3 is sort of missing the trend a little bit.

0:40:54 - 0:40:59     Text: I can't really honestly tell you whether that's because GPT-3 wasn't well optimized, or

0:40:59 - 0:41:05     Text: if it's because there's some bending in this curve where we're hitting some irreducible

0:41:05 - 0:41:06     Text: loss.

0:41:06 - 0:41:12     Text: That irreducible loss would be something like the entropy of this sort of language data

0:41:12 - 0:41:15     Text: set itself.

0:41:15 - 0:41:18     Text: But to zeroth order, the trends continue.

0:41:18 - 0:41:24     Text: And what's now pretty well known is that if you train fairly large language models,

0:41:24 - 0:41:27     Text: then they can exhibit in-context learning.

0:41:27 - 0:41:34     Text: So the kind of learning that I'm talking about is that you give these models an example

0:41:34 - 0:41:42     Text: of many arithmetic problems or many anagrams or whatnot or translation tasks for individual

0:41:42 - 0:41:43     Text: words.

0:41:43 - 0:41:49     Text: Then early on in the sequence, they might not be very good at doing the task,

0:41:49 - 0:41:54     Text: but they figure out what the pattern is in the task and they learn to do it.

0:41:54 - 0:41:58     Text: And in particular, you can plot that so you can ask for, say, like one of these anagram

0:41:58 - 0:42:00     Text: tasks.

0:42:00 - 0:42:06     Text: What is the performance of the model as a function of how many examples of the task get seen

0:42:06 - 0:42:07     Text: in the context?

0:42:07 - 0:42:12     Text: So this is kind of similar to the loss as a function of context position, but it's now

0:42:12 - 0:42:16     Text: an accuracy at doing an actual task, like unscramble the letters in a word.

0:42:16 - 0:42:23     Text: And you see probably most importantly that if you give more examples, you get significantly

0:42:23 - 0:42:28     Text: better performance starting from very, very poor performance to pretty good.

0:42:28 - 0:42:31     Text: And also you see that larger models do this better.

0:42:31 - 0:42:35     Text: You also finally see that giving a natural language prompt with some instructions helps

0:42:35 - 0:42:40     Text: significantly in the regime where you have very few examples.

0:42:40 - 0:42:42     Text: This is in-context learning.

0:42:42 - 0:42:46     Text: You can call this a kind of meta learning.

0:42:46 - 0:42:52     Text: And it just emerges automatically from training large language models without any particular

0:42:52 - 0:42:57     Text: attempt to get this kind of behavior.

0:42:57 - 0:43:03     Text: And you could also ask sort of about downstream tasks that you actually care about.

0:43:03 - 0:43:08     Text: So there is accuracy at doing arithmetic as a function of model size, a bunch of different

0:43:08 - 0:43:11     Text: kinds of arithmetic problems.

0:43:11 - 0:43:19     Text: There is just some data set of analogies from a test that American college students take

0:43:19 - 0:43:23     Text: to go to college, the SATs.

0:43:23 - 0:43:31     Text: And if you care, the sort of average score on that year's test was, I think, 58 or so

0:43:31 - 0:43:32     Text: percent.

0:43:32 - 0:43:35     Text: So the largest model is sort of doing a little bit better than the average American high

0:43:35 - 0:43:37     Text: school student.

0:43:37 - 0:43:42     Text: There's TriviaQA, which is sort of just knowing trivia.

0:43:42 - 0:43:50     Text: And Winograd schemas are problems like, if a tree falls on your roof and you got it fixed,

0:43:50 - 0:43:53     Text: what did you get fixed, did you get the tree fixed or your roof.

0:43:53 - 0:43:57     Text: It's a measure of common sense reasoning and models are also getting better at this.

0:43:57 - 0:44:03     Text: And I think the other interesting thing that's very often emphasized is that clearly trivia

0:44:03 - 0:44:05     Text: performance is improving very smoothly as you make models bigger.

0:44:05 - 0:44:09     Text: The models are just remembering more and more trivia.

0:44:09 - 0:44:14     Text: Winograd schemas are also improving fairly smoothly.

0:44:14 - 0:44:17     Text: But then there are examples like arithmetic where models are very poor and then they sort

0:44:17 - 0:44:19     Text: of suddenly get pretty good.

0:44:19 - 0:44:25     Text: And so these kinds of sudden jumps, where the model sort of suddenly kind of, like, gets

0:44:25 - 0:44:28     Text: what it's supposed to do for arithmetic, are pretty interesting.

0:44:28 - 0:44:32     Text: And there are all sorts of other kind of interesting things if you kind of dig into these

0:44:32 - 0:44:33     Text: specific abilities.

0:44:33 - 0:44:34     Text: Yeah.

0:44:34 - 0:44:41     Text: Why do bigger models do better at in-context learning?

0:44:41 - 0:44:48     Text: I mean, I guess the sort of dumb zeroth-order point is that larger models are just getting

0:44:48 - 0:44:52     Text: much better and better at predicting the next word given more and more context.

0:44:52 - 0:44:58     Text: So I think, like, there's a very tight connection between a plot like this and

0:44:58 - 0:45:02     Text: these sort of in-context learning plots.

0:45:02 - 0:45:06     Text: Basically the more information you're getting, I mean all of these models probably know the

0:45:06 - 0:45:11     Text: unigram distribution of words and tokens pretty well.

0:45:11 - 0:45:15     Text: But the bigger model is getting much, much, much more information from its context than

0:45:15 - 0:45:17     Text: the smaller models.

0:45:17 - 0:45:21     Text: And at a certain point, I mean, it depends on your training distribution and all sorts

0:45:21 - 0:45:22     Text: of other things.

0:45:22 - 0:45:27     Text: But like, one of the things that we do is when we see several examples of something happening

0:45:27 - 0:45:32     Text: in a text, we guess that that's what we're going to see next.

0:45:32 - 0:45:37     Text: And that's really probably embedded in a ton of text that's out there on the internet

0:45:37 - 0:45:38     Text: and in books.

0:45:38 - 0:45:41     Text: And models have to decrease their loss somehow.

0:45:41 - 0:45:43     Text: That's a pattern in the text.

0:45:43 - 0:45:48     Text: It's a pattern that models eventually learn and they seemingly apply this knowledge.

0:45:48 - 0:45:51     Text: I think there are other people, of course, who've kind of worked on this question more

0:45:51 - 0:45:53     Text: specifically, and have more specific theories.

0:45:53 - 0:46:02     Text: But I think, in kind of an intuitive sense, that's how I would think about it.

0:46:02 - 0:46:08     Text: I guess one final evaluation you can ask about: can people tell whether text was written by a language

0:46:08 - 0:46:12     Text: model or by a human?

0:46:12 - 0:46:16     Text: This is an evaluation where we looked at short news articles.

0:46:16 - 0:46:22     Text: They're two or three paragraphs, and we generated equivalent news articles from GPT-3.

0:46:22 - 0:46:27     Text: And by the time you get to sort of the largest models, people are approaching chance accuracy

0:46:27 - 0:46:28     Text: at being able to tell the difference.

0:46:28 - 0:46:33     Text: This sort of has a lot of implications. I mean, it's interesting and surprising

0:46:33 - 0:46:35     Text: as a statement about language modeling.

0:46:35 - 0:46:38     Text: But it's also somewhat scary.

0:46:38 - 0:46:43     Text: That means it's very difficult to tell that you're talking to a language

0:46:43 - 0:46:47     Text: model if you don't have a very long conversation.

0:46:47 - 0:46:48     Text: Yeah.

0:46:48 - 0:46:49     Text: Hi.

0:46:49 - 0:46:58     Text: So I am wondering, so for this specific statement, because with modern current models, they

0:46:58 - 0:47:00     Text: are what projects are going to use you.

0:47:00 - 0:47:05     Text: So have you attended, like, the general experience article, and what you don't have, and for

0:47:05 - 0:47:08     Text: a document, or do you have anything?

0:47:08 - 0:47:12     Text: I actually don't know the answer to that question for this particular analysis off the top of

0:47:12 - 0:47:14     Text: my head.

0:47:14 - 0:47:19     Text: I believe that these are not memorized.

0:47:19 - 0:47:23     Text: One simple thing you can do, at least, for some things that occur frequently is like you

0:47:23 - 0:47:29     Text: can look at the distribution of the loss for a model on its own samples.

0:47:29 - 0:47:34     Text: And at least for things that are memorized, that are very clearly memorized.

0:47:34 - 0:47:38     Text: Obviously, of course, they probably occur frequently in the training set, but also

0:47:38 - 0:47:41     Text: the loss tends to be much, much lower on memorized samples.

0:47:41 - 0:47:46     Text: You can intuitively understand this because if there's 100 words that are exactly

0:47:46 - 0:47:52     Text: verbatim sampled out, and you're sampling at temperature equals one, then all of the

0:47:52 - 0:47:55     Text: next word predictions have to be extremely, extremely confident.

0:47:55 - 0:47:57     Text: And that means the loss has to be super low.

0:47:57 - 0:48:02     Text: So, I mean, just informally, something that I've done to just get rid of memorized samples

0:48:02 - 0:48:05     Text: is compute the loss, and usually you'll just see a pretty clear bimodal distribution, where there'll

0:48:05 - 0:48:09     Text: be a few memorized examples and then things that aren't.

0:48:09 - 0:48:10     Text: That's a simple thing you can do to check.
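
A minimal sketch of that informal check, under stated assumptions: the per-sample mean losses below are synthetic (two well-separated clusters standing in for the bimodal distribution), and both the helper `flag_memorized` and the threshold value are hypothetical. With real data you would compute the model's mean loss on each of its own samples and pick the threshold by looking at the histogram.

```python
import numpy as np


def flag_memorized(mean_losses, threshold):
    """Return a boolean mask marking samples whose mean loss is below threshold."""
    mean_losses = np.asarray(mean_losses, dtype=float)
    return mean_losses < threshold


# Synthetic stand-in for per-sample mean losses (illustration only):
# most samples sit near a typical loss, a few memorized ones sit near zero.
rng = np.random.default_rng(0)
ordinary = rng.normal(2.5, 0.3, size=95)
memorized = rng.normal(0.1, 0.03, size=5)
losses = np.concatenate([ordinary, memorized])

# Hypothetical threshold, chosen by eye from the gap between the two modes.
flags = flag_memorized(losses, threshold=1.0)
print("samples flagged as likely memorized:", int(flags.sum()))
```

This only catches samples that are very clearly memorized; deduplication of the training data is a separate, complementary check.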

0:48:10 - 0:48:13     Text: You can also, of course, do deduplication.

0:48:13 - 0:48:18     Text: I don't remember off the top of my head what deduplication was done here, though.

0:48:18 - 0:48:24     Text: On the downstream task section, is there anything to say about how the scaling laws look if you look

0:48:24 - 0:48:29     Text: at transfer objectives and adversarial objectives?

0:48:29 - 0:48:35     Text: I don't think I have anything particularly clear to say about that.

0:48:35 - 0:48:41     Text: I mean, these evals, I think, are not adversarial in the sense that they're just few-shot evaluations

0:48:41 - 0:48:44     Text: with some fixed data set.

0:48:44 - 0:48:51     Text: There are a large number of different kinds of adversarial data sets out there for reasoning,

0:48:51 - 0:48:54     Text: for common sense knowledge, for truthfulness.

0:48:54 - 0:48:57     Text: So, I mean, there's, like, for example, TruthfulQA.

0:48:57 - 0:49:01     Text: This is an example where there aren't any trends like this and arguably the trends go

0:49:01 - 0:49:06     Text: downward, though it depends on your training distribution and some models actually do improve.

0:49:06 - 0:49:08     Text: So I think that's a complicated question.

0:49:08 - 0:49:11     Text: I think it's hard to find examples where the trends go down.

0:49:11 - 0:49:16     Text: I don't think it's easy, but these do exist.

0:49:16 - 0:49:21     Text: Any other questions?

0:49:21 - 0:49:25     Text: Great.

0:49:25 - 0:49:35     Text: So, I guess I'll sort of end by summarizing some lessons that you might draw pretty practically

0:49:35 - 0:49:38     Text: for research from this.

0:49:38 - 0:49:43     Text: And then I can either open it up for questions, or I can also, I can always talk infinitely

0:49:43 - 0:49:44     Text: long.

0:49:44 - 0:49:48     Text: I've been a professor for like 10 years of my life, so I can just talk forever.

0:49:48 - 0:49:52     Text: But I'll sort of end after talking about some lessons.

0:49:52 - 0:49:58     Text: So I think one lesson that kind of I draw from this is that kind of scanning over some

0:49:58 - 0:50:03     Text: of the important inputs to your training process is just a pretty useful thing to do when

0:50:03 - 0:50:05     Text: you're doing ML research.

0:50:05 - 0:50:08     Text: And it's sort of typically very cheap.

0:50:08 - 0:50:13     Text: It's cheap because generally most things vary in an important way on a log scale, or

0:50:13 - 0:50:17     Text: sort of on a geometric scale, however you want to say it.

0:50:17 - 0:50:21     Text: And that means that like if you're training with the data set of size D, maybe you should

0:50:21 - 0:50:25     Text: also train with D over 2 and D over 4 and D over 8 or something like that.

0:50:25 - 0:50:28     Text: And if you sum that geometric series, you get 2D.
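
The arithmetic behind that claim is easy to check. The sketch below uses a hypothetical data set size and a made-up helper name, `scan_sizes`; it lists the sizes D, D/2, D/4, D/8 and sums them. The total stays below 2D however many halvings you add, since the full geometric series D(1 + 1/2 + 1/4 + ...) converges to 2D.

```python
def scan_sizes(full_size, num_halvings):
    """Data set sizes for a geometric scan: D, D/2, D/4, ..."""
    return [full_size / 2 ** k for k in range(num_halvings + 1)]


D = 1_000_000  # hypothetical data set size
sizes = scan_sizes(D, num_halvings=3)
total = sum(sizes)

print("sizes:", sizes)
print("total cost relative to one full run:", total / D)
```

With three halvings the scan costs 1.875 times a single full-size run, which is the sense in which it is cheap.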

0:50:28 - 0:50:34     Text: So you sort of, I mean, you made your training process twice as expensive in some sense, but

0:50:34 - 0:50:37     Text: it's not really a big change in what you have to do.

0:50:37 - 0:50:42     Text: But you can often learn a lot about what's going on by doing these kinds of scans.

0:50:42 - 0:50:47     Text: And so, I mean, this is an example of some data that I didn't show earlier.

0:50:47 - 0:50:51     Text: So something you might wonder about is what happens if you scan over data set size and model

0:50:51 - 0:50:53     Text: size at the same time.

0:50:53 - 0:50:57     Text: And it turns out there's some very simple trends that you can model in that case too that

0:50:57 - 0:51:00     Text: tell you about things like overfitting.

0:51:00 - 0:51:03     Text: And I mean, if you care about overfitting, then this tells you about something like how

0:51:03 - 0:51:06     Text: big do you have to make your data set for a given model size to avoid overfitting being

0:51:06 - 0:51:11     Text: a significant problem so that you can answer all kinds of questions like that.

0:51:11 - 0:51:18     Text: And I at least find that this is kind of useful and it's nice for learning things about

0:51:18 - 0:51:19     Text: behavior.

0:51:19 - 0:51:24     Text: And I think alongside that, I think like this is sort of a joke.

0:51:24 - 0:51:25     Text: This isn't real.

0:51:25 - 0:51:29     Text: This is sort of making fun of a large number of machine learning papers that you might

0:51:29 - 0:51:30     Text: see.

0:51:30 - 0:51:34     Text: I think a lot of machine learning papers have tables like this.

0:51:34 - 0:51:37     Text: And it's sort of hard to tell from like this kind of table obviously I'm making fun, but

0:51:37 - 0:51:42     Text: I think it's not so unrealistic like did the technique that went into our model really

0:51:42 - 0:51:45     Text: improve on other things that happened.

0:51:45 - 0:51:49     Text: And I think that this kind of plot at least for me is a much more convincing statement

0:51:49 - 0:51:53     Text: that, like, well, clearly transformers are just better than LSTMs.

0:51:53 - 0:51:59     Text: So the slogan here is sort of success for new techniques if your goal is to sort of

0:51:59 - 0:52:04     Text: improve a model if that is your goal.

0:52:04 - 0:52:08     Text: And I think it's at least to me much more convincing and kind of clear what's going on

0:52:08 - 0:52:10     Text: if you see these trends.

0:52:10 - 0:52:13     Text: Maybe I have another slide making fun of the CS.

0:52:13 - 0:52:18     Text: So I mean, I think this is a thing that I actually see very often in research is that

0:52:18 - 0:52:26     Text: you come up with some new idea and you see like you first do the cheapest easiest experiment

0:52:26 - 0:52:29     Text: and you see, well my new idea improved performance.

0:52:29 - 0:52:31     Text: I'm really excited.

0:52:31 - 0:52:33     Text: Everyone should adopt this.

0:52:33 - 0:52:38     Text: But then you make some plot like this and you sort of say, oh, okay, I guess it doesn't

0:52:38 - 0:52:40     Text: really matter that much at all.

0:52:40 - 0:52:42     Text: And I think this is actually common.

0:52:42 - 0:52:44     Text: I mean, I think we all have all sorts of ideas.

0:52:44 - 0:52:50     Text: I mean, people fall asleep at night and they can't sleep and then they wake up and they

0:52:50 - 0:52:52     Text: have ideas and like, oh, I'm going to go try this.

0:52:52 - 0:52:53     Text: We all do it.

0:52:53 - 0:52:57     Text: But oftentimes they don't work and I think this is sort of useful for understanding whether

0:52:57 - 0:52:59     Text: your idea really, really works.

0:52:59 - 0:53:05     Text: And I mean, if all you're ever going to do is train this model, then your idea did work.

0:53:05 - 0:53:13     Text: But I think that like there's sort of an expectation that probably people will be using bigger

0:53:13 - 0:53:15     Text: computers to train larger models in the future.

0:53:15 - 0:53:18     Text: And so the ideas that are really going to have a huge impact are ones that sort of point

0:53:18 - 0:53:20     Text: in the opposite direction.

0:53:20 - 0:53:24     Text: I've even seen ideas where on small models they make no difference at all, but on larger

0:53:24 - 0:53:28     Text: models they do better.

0:53:28 - 0:53:31     Text: And so these kinds of trends I think are useful.

0:53:31 - 0:53:34     Text: And they're certainly useful to think about.

0:53:34 - 0:53:42     Text: Another point that I find useful, I think it's not sort of obvious and maybe you shouldn't

0:53:42 - 0:53:46     Text: trust it completely, is that I tend to think, I mean, because I've sort of swallowed

0:53:46 - 0:53:53     Text: my own Kool-Aid, that if something works, then it should scale fairly predictably.

0:53:53 - 0:53:58     Text: It's not always true, but for things that you can measure that are very close to your

0:53:58 - 0:54:01     Text: optimization target.

0:54:01 - 0:54:08     Text: If sort of your training process, your hyperparameters, etc., are all kind of set up well,

0:54:08 - 0:54:12     Text: then I tend to think that you should see some kind of predictable trend.

0:54:12 - 0:54:18     Text: And if that trend goes away, then I mean, maybe that's just exactly what's true.

0:54:18 - 0:54:21     Text: But I think often it means that there's something broken about what's going on.

0:54:21 - 0:54:26     Text: Maybe your numerics are broken and you need higher precision in some part of your model,

0:54:26 - 0:54:29     Text: maybe there's some bottleneck you hadn't thought of.
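
One way to act on this advice is sketched below: fit a power law to the runs you already have, extrapolate to a new run, and measure how far the new result falls from the trend. This is only an illustration under stated assumptions. The function names, the synthetic losses, and the constants 10.0 and 0.08 are hypothetical and are not taken from the talk's experiments.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit log(loss) = intercept + slope * log(size) by least squares."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    return slope, intercept

def trend_deviation(sizes, losses, new_size, new_loss):
    """Relative gap between a new run and the extrapolated trend.

    A value near 0 means the new run sits on the fitted line; a large
    positive value means it is worse than the trend predicts.
    """
    slope, intercept = fit_power_law(sizes, losses)
    predicted = np.exp(intercept + slope * np.log(new_size))
    return (new_loss - predicted) / predicted

# Synthetic runs that follow loss = 10 * size^(-0.08) exactly (hypothetical numbers).
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = 10.0 * sizes ** -0.08

expected_loss = 10.0 * 1e10 ** -0.08
on_trend = trend_deviation(sizes, losses, 1e10, expected_loss)
off_trend = trend_deviation(sizes, losses, 1e10, 1.2 * expected_loss)
print(f"on-trend deviation:  {on_trend:+.3f}")
print(f"off-trend deviation: {off_trend:+.3f}")
```

A run that lands well above the extrapolated line does not prove anything is broken, but by the argument above it is a reason to look at numerics, precision, and possible bottlenecks before concluding that the trend has really ended.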

0:54:29 - 0:54:34     Text: So I mean, this is also an example showing that predictable scaling can be found

0:54:34 - 0:54:35     Text: all over the place.

0:54:35 - 0:54:39     Text: So I just think this is sort of neat.

0:54:39 - 0:54:43     Text: So if you just train these extremely naive, very stupid multimodal models, where you use

0:54:43 - 0:54:49     Text: a decoder-only transformer to either model the text based on the image or model the image

0:54:49 - 0:54:51     Text: based on the text, then you can do that.

0:54:51 - 0:54:56     Text: And measure a sort of empirical mutual information between the image and the text.

0:54:56 - 0:55:02     Text: How much information did the image give you about the words in the sense of sort of Shannon

0:55:02 - 0:55:04     Text: information?

0:55:04 - 0:55:08     Text: And or conversely, how much information did the text give you about the image?

0:55:08 - 0:55:12     Text: And this is also a place where, I mean, this is very close to the optimization target.

0:55:12 - 0:55:16     Text: The whole point of the multimodal model is to get this information.

0:55:16 - 0:55:21     Text: And you see that there's some predictable scaling going on where larger models are getting

0:55:21 - 0:55:28     Text: more information about one part of the data distribution from the other.
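
A minimal sketch of how such an empirical mutual information could be computed, assuming it is estimated as the difference between two average cross-entropy losses: the loss on the text alone minus the loss on the text given the image. The talk does not spell out the estimator, so this formula, the function name, and the loss values are assumptions for illustration only.

```python
import math

def empirical_mutual_information(loss_text_only, loss_text_given_image, in_bits=False):
    """Estimate I(image; text) as H(text) - H(text | image).

    Both arguments are average cross-entropy losses in nats, measured on
    the same text with and without the image as context.
    """
    mi_nats = loss_text_only - loss_text_given_image
    return mi_nats / math.log(2) if in_bits else mi_nats

# Hypothetical losses in nats (not measurements from the talk).
mi_nats = empirical_mutual_information(3.0, 2.5)
mi_bits = empirical_mutual_information(3.0, 2.5, in_bits=True)
print(f"{mi_nats:.3f} nats = {mi_bits:.3f} bits")
```

Under this reading, a larger model that exploits the image better lowers the conditional loss, so the estimated mutual information goes up with scale.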

0:55:28 - 0:55:36     Text: But I think this is sort of a general thing that you should expect in model training.

0:55:36 - 0:55:42     Text: And so maybe to sort of summarize, maybe even bigger picture implications.

0:55:42 - 0:55:49     Text: I think that scaling things up may not be the best or the smartest

0:55:49 - 0:55:53     Text: or the most interesting way to make better ML models.

0:55:53 - 0:55:56     Text: Maybe it won't be the way that happens in the future.

0:55:56 - 0:56:00     Text: But at least I think these results suggest that there aren't any really hard conceptual

0:56:00 - 0:56:08     Text: barriers preventing people from training significantly more powerful models of all kinds, including

0:56:08 - 0:56:13     Text: of course, language models in AI research.

0:56:13 - 0:56:20     Text: I think certainly my perspective, originally as a physicist, sort of coming to machine

0:56:20 - 0:56:28     Text: learning kind of fresh five years ago, is that, I mean, this is sort of one set

0:56:28 - 0:56:35     Text: of abstractions for thinking about kind of what's going on in AI research: if

0:56:35 - 0:56:39     Text: you're going to be training fairly large models and you want them to do well, that's

0:56:39 - 0:56:44     Text: a thing that you're going to do, then you probably want your models to sort of be scaling

0:56:44 - 0:56:46     Text: well in terms of their performance.

0:56:46 - 0:56:50     Text: And I think this framework of maybe there's a bottleneck, but if you remove the bottleneck,

0:56:50 - 0:56:53     Text: then you'll just continue to see further progress.

0:56:53 - 0:56:57     Text: I found useful.

0:56:57 - 0:57:02     Text: I think another point that, well, maybe I'll make this point at the end.

0:57:02 - 0:57:06     Text: Another point is that, yeah, scaling laws are just sort of all over the place and they

0:57:06 - 0:57:11     Text: can help you to sort of maybe organize your research a bit.

0:57:11 - 0:57:15     Text: And then, I mean, maybe the most interesting point conceptually, though, is that it seems

0:57:15 - 0:57:22     Text: like, if you believe this kind of story, that it seems like many domains of ML are kind

0:57:22 - 0:57:27     Text: of surprisingly simple and universal, things that you might not have thought are the same

0:57:27 - 0:57:31     Text: or more similar than they are different.

0:57:31 - 0:57:34     Text: And of course, this is also a fascinating thing to try to understand.

0:57:34 - 0:57:41     Text: So I mean, I was a theoretical physicist for most of my life, so I mostly tried to understand

0:57:41 - 0:57:46     Text: things that seem extremely esoteric and weird and why would anyone care about them.

0:57:46 - 0:57:51     Text: This is a thing that I think probably, probably everyone in this room kind of cares about,

0:57:51 - 0:57:55     Text: like, can AI models write, can they communicate in language?

0:57:55 - 0:58:00     Text: And these kinds of trends are really, really nice. They're the kind of trends that you might

0:58:00 - 0:58:05     Text: see in a very controlled physics experiment or something, and yet they're coming out of

0:58:05 - 0:58:10     Text: something very, very noisy and random, like, predicting language data on the internet.

0:58:10 - 0:58:16     Text: So I think it's very interesting to think about, like, why are these kinds of trends true?

0:58:16 - 0:58:21     Text: What is the underlying kind of theory or science here that makes these trends true?

0:58:21 - 0:58:23     Text: Can we predict it?

0:58:23 - 0:58:24     Text: Can we refine those predictions?

0:58:24 - 0:58:28     Text: Can we understand why, when this doesn't occur?

0:58:28 - 0:58:30     Text: Another question is sort of, there are some exponents here.

0:58:30 - 0:58:34     Text: This is a straight line, but the straight line represents a power law with a particular

0:58:34 - 0:58:35     Text: exponent.

0:58:35 - 0:58:36     Text: Why that exponent?

0:58:36 - 0:58:39     Text: For language, it's like 0.08 or so.

0:58:39 - 0:58:42     Text: Why 0.08 and not 0.2 or 0.4 or 0.001?
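
To get a feel for what an exponent of about 0.08 means, here is a small sketch assuming the loss follows a pure power law, loss proportional to size^(-exponent), with no offset term. That functional form and the helper names are assumptions for illustration; only the exponent values come from the question posed above.

```python
def loss_ratio(scale_factor, exponent):
    """Factor by which the loss shrinks when size grows by scale_factor."""
    return scale_factor ** -exponent

def scale_to_halve_loss(exponent):
    """Size multiplier needed to cut the loss in half under a pure power law."""
    return 2.0 ** (1.0 / exponent)

for exponent in (0.08, 0.2, 0.4):
    ratio = loss_ratio(10, exponent)
    needed = scale_to_halve_loss(exponent)
    print(f"exponent {exponent}: 10x size -> loss x{ratio:.3f}, "
          f"halving the loss needs {needed:,.0f}x size")
```

With a small exponent each factor of ten buys a modest, steady reduction in loss, which is part of why the value of the exponent matters so much.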

0:58:42 - 0:58:45     Text: I think there are all sorts of questions here.

0:58:45 - 0:58:49     Text: When you see data that has a very clear trend, it's very interesting to understand, to try

0:58:49 - 0:58:54     Text: to think about why is something so simple happening.

0:58:54 - 0:58:56     Text: And I'll sort of leave you with that.

0:58:56 - 0:58:58     Text: Yeah.

0:58:58 - 0:59:16     Text: Have you thought at all about how these scaling laws apply to human beings?

0:59:16 - 0:59:23     Text: So your picture is essentially making everything bigger, avoid bottlenecks, and all the thought.

0:59:23 - 0:59:30     Text: Whereas I guess human beings, so you're good on the number of parameters, because they're

0:59:30 - 0:59:33     Text: still several orders of magnitude of room there.

0:59:33 - 0:59:40     Text: But it seems like you're not very good on the bottom of the map to use.

0:59:40 - 0:59:45     Text: So human to use is very constrained, but it's only an empty map, so it's because of the

0:59:45 - 0:59:50     Text: slow processing, because it's also because of the empty demands, try to have it be using

0:59:50 - 0:59:54     Text: most of their parameters first of the time.

0:59:54 - 1:00:02     Text: And data's a little bit complex, because I guess we get a ton of data, so if you are thinking

1:00:02 - 1:00:11     Text: of that on the amount of language data we get, you know, sort of fully complex language

1:00:11 - 1:00:18     Text: uses, sort of three orders of magnitude down from where GPT-3 is now.

1:00:18 - 1:00:22     Text: And get something good seems to happen, if you look at what these are all those.

1:00:22 - 1:00:24     Text: Any thoughts on that?

1:00:24 - 1:00:27     Text: I mean, I think it's a fantastic question.

1:00:27 - 1:00:31     Text: I don't have anything to say that isn't quite speculative.

1:00:31 - 1:00:32     Text: So I mean, I don't have any good answer to the question.

1:00:32 - 1:00:34     Text: I think it's a great question.

1:00:34 - 1:00:37     Text: I guess one thing that seems like it's true is that sort of the factor of a thousand

1:00:37 - 1:00:40     Text: you mentioned seems pretty common.

1:00:40 - 1:00:44     Text: I mean, my impression is that AlphaGo probably plays like a thousand times more games when

1:00:44 - 1:00:49     Text: it trains than a Go master does.

1:00:49 - 1:00:55     Text: I think this is a pretty common factor to see in a lot of like ML contexts.

1:00:55 - 1:00:56     Text: But I have no idea why it is.

1:00:56 - 1:00:59     Text: I don't know if it's that evolution optimized us to learn fast.

1:00:59 - 1:01:03     Text: If we have some hard-coded information, if the sort of multimodal inputs that we have

1:01:03 - 1:01:08     Text: help a lot, you might imagine that when you have a system that's already pretty smart,

1:01:08 - 1:01:12     Text: reinforcement learning or active learning of some form becomes more and more important,

1:01:12 - 1:01:18     Text: because like when these language models or a person, like if I read a physics textbook,

1:01:18 - 1:01:22     Text: I don't really learn a lot in a certain sense because I already learned physics.

1:01:22 - 1:01:24     Text: And I think the same is probably true for these models.

1:01:24 - 1:01:29     Text: So as the models get smarter, this sort of very dumb next word prediction task is giving

1:01:29 - 1:01:34     Text: you less and less information, but you might expect to get more and more information if

1:01:34 - 1:01:37     Text: you did something more active.

1:01:37 - 1:01:41     Text: I can continue to speculate, but I don't really know anything about it.

1:01:41 - 1:01:47     Text: I don't have anything well established to tell you.

1:01:47 - 1:01:48     Text: It's a great question.

1:01:48 - 1:01:52     Text: So if you can't have the transformers, the LFGM, the LFGM, the LFGM, the transformers

1:01:52 - 1:02:02     Text: are a bit better, but they're a smaller, similar mode that seems like human abilities

1:02:02 - 1:02:08     Text: are a little bit more developed to a more standard, a few things on a very different place

1:02:08 - 1:02:09     Text: on the graph.

1:02:09 - 1:02:12     Text: Yeah, I think that's absolutely, I think it's just true.

1:02:12 - 1:02:14     Text: The sample efficiency of these models is not similar.

1:02:14 - 1:02:19     Text: Another way of saying it is that if you got into AI research to understand the human brain,

1:02:19 - 1:02:23     Text: it's very unclear whether we're making any progress on that.

1:02:23 - 1:02:29     Text: But if we just want to sort of, yeah, for a lot of these tasks, we don't seem to have

1:02:29 - 1:02:32     Text: to solve the brain to solve AI surprisingly.

1:02:32 - 1:02:42     Text: I think they have a question.

1:02:42 - 1:02:43     Text: Yeah.

1:02:43 - 1:03:10     Text: See it's deixar, and now I can wonder what you're going to get in this car and push

1:03:10 - 1:03:20     Text: There's also like things that will probably come,

1:03:20 - 1:03:23     Text: remember, really, like, not just acting in the past,

1:03:23 - 1:03:26     Text: that probably aren't necessarily written for,

1:03:26 - 1:03:28     Text: and by great science.

1:03:28 - 1:03:29     Text: Okay, so, Bob and Bob.

1:03:29 - 1:03:31     Text: Sure, sure, no, these are all great questions.

1:03:31 - 1:03:35     Text: So, I mean, sort of early on,

1:03:35 - 1:03:37     Text: I commented on some sources of data,

1:03:37 - 1:03:39     Text: and I mean, you're certainly correct about quality.

1:03:39 - 1:03:42     Text: I think in terms of quantity, I mean,

1:03:42 - 1:03:46     Text: I don't think anyone has, like, a digitized Library of Congress,

1:03:46 - 1:03:48     Text: but I think if you did, that would be like,

1:03:48 - 1:03:51     Text: I don't know, maybe 10x bigger than the training set for GPT-3.

1:03:51 - 1:03:54     Text: So, there's a sense in which there's probably quite a lot of,

1:03:54 - 1:03:57     Text: still quite high quality data that isn't in use.

1:03:57 - 1:03:59     Text: I don't know whether it will ever be in use,

1:03:59 - 1:04:01     Text: so it's a complicated question.

1:04:01 - 1:04:04     Text: And then, if you are willing to sort of take all of this garbage

1:04:04 - 1:04:08     Text: on the internet, or try to filter that garbage down,

1:04:08 - 1:04:11     Text: I think, I don't know how accurate this estimate is,

1:04:11 - 1:04:13     Text: but at an order-of-magnitude level,

1:04:13 - 1:04:15     Text: you can get something like 10 to the 15 words,

1:04:15 - 1:04:17     Text: which is a thousand times bigger.

1:04:17 - 1:04:20     Text: And of course, if you find any kind of intelligent way of filtering,

1:04:20 - 1:04:23     Text: then if you can filter down to 0.1% of that,

1:04:23 - 1:04:25     Text: and take the 0.1% that's best,

1:04:25 - 1:04:27     Text: then you do still have a lot of data.

1:04:27 - 1:04:28     Text: So, I think for language modeling,

1:04:28 - 1:04:30     Text: there's definitely still some headroom,

1:04:30 - 1:04:34     Text: but this is certainly a constraint,

1:04:34 - 1:04:38     Text: and there are other kinds of data distributions

1:04:38 - 1:04:40     Text: where you'll run out sooner.

1:04:40 - 1:04:44     Text: I mean, in terms of, yeah, I mean,

1:04:44 - 1:04:46     Text: of course, there are all sorts of other things you can explore,

1:04:46 - 1:04:48     Text: one can explore multimodal models,

1:04:48 - 1:04:51     Text: one can switch to a different kind of loss function

1:04:51 - 1:04:55     Text: that is more interactive, or actually accomplishing a task.

1:04:55 - 1:04:58     Text: But I think, for pure language modeling,

1:04:58 - 1:05:00     Text: it seems like there's at least some room left.

1:05:00 - 1:05:03     Text: And if you think that your model size increases,

1:05:03 - 1:05:06     Text: sort of, if you think you can increase your model size by a factor of 100

1:05:06 - 1:05:09     Text: and increase your data set size by a factor of 10,

1:05:09 - 1:05:14     Text: which is sort of like roughly what this is saying.

1:05:14 - 1:05:18     Text: If you believe that, then you can still scale up your model size a lot

1:05:18 - 1:05:21     Text: and have probably plenty of data.
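
The back-of-the-envelope arithmetic in this answer can be sketched as follows. It uses only the figures quoted above (roughly 10^15 words on the internet, filtering down to the best 0.1%, and a 100x larger model needing about 10x more data). The function names are hypothetical, and the assumption that data needs grow as a pure power of model size is an illustration, not a fitted result.

```python
import math

def data_exponent(model_factor, data_factor):
    """Exponent a such that data grows as model_size ** a."""
    return math.log(data_factor) / math.log(model_factor)

def model_headroom(data_headroom, exponent):
    """How much the model can grow if the available data grows by data_headroom."""
    return data_headroom ** (1.0 / exponent)

# "100x bigger model, 10x more data" implies an exponent of 0.5.
exponent = data_exponent(100, 10)

# Order-of-magnitude estimate from the talk: ~1e15 words, keep the best 0.1%.
internet_words = 1e15
filtered_words = internet_words * 0.001

print(f"data exponent: {exponent:.2f}")
print(f"words left after filtering to 0.1%: {filtered_words:.0e}")
print(f"10x more data supports a {model_headroom(10, exponent):.0f}x larger model")
```

Because the exponent is below one, data requirements grow more slowly than model size, which is why a modest amount of extra data leaves room for a much larger model.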

1:05:21 - 1:05:27     Text: But, yeah, you couldn't sort of do this stuff without the internet.

1:05:27 - 1:05:29     Text: Yeah, or, you know,

1:05:29 - 1:05:33     Text: do you want to share it?

1:05:33 - 1:05:34     Text: Sure, yeah.

1:05:34 - 1:05:39     Text: In terms of bottlenecks for improving a task of a time,

1:05:39 - 1:05:42     Text: are you more optimistic about

1:05:42 - 1:05:47     Text: how much larger models on the same level

1:05:47 - 1:05:50     Text: as hardware improvements,

1:05:50 - 1:05:51     Text: or architectural improvements,

1:05:51 - 1:05:53     Text: like the LSTM to the transformer?

1:05:53 - 1:05:56     Text: I guess, I mean,

1:05:56 - 1:05:59     Text: I think I'm sort of optimistic about both.

1:05:59 - 1:06:03     Text: I think that my understanding, sort of the zeroth-order understanding

1:06:03 - 1:06:05     Text: of the hardware situation is that, like,

1:06:05 - 1:06:08     Text: connecting together GPUs and GPU-like

1:06:08 - 1:06:10     Text: objects works pretty well,

1:06:10 - 1:06:12     Text: and that, like,

1:06:12 - 1:06:16     Text: interconnection speeds are increasing and can increase pretty easily.

1:06:16 - 1:06:20     Text: So I think that you don't need one chip to run your entire model.

1:06:20 - 1:06:25     Text: You can distribute your model over many, many, many accelerators.

1:06:25 - 1:06:29     Text: And I think you can do that if you're willing to pay for those accelerators,

1:06:29 - 1:06:32     Text: et cetera, then I think you can do that.

1:06:32 - 1:06:35     Text: Architectural improvements, I think,

1:06:35 - 1:06:40     Text: I would say I sort of typically haven't been super excited about architectural improvements,

1:06:40 - 1:06:45     Text: but I think there will continue to be architectural improvements.

1:06:45 - 1:06:49     Text: I think that, sort of, whenever you do something for the first time,

1:06:49 - 1:06:52     Text: or even just, like, whenever you train a really big model for the first time,

1:06:52 - 1:06:54     Text: you sort of don't do it in the best possible way,

1:06:54 - 1:06:58     Text: and there's a lot of, like, all sorts of different kinds of improvements.

1:06:58 - 1:07:02     Text: Maybe there are, sort of, non-incremental improvements that will look like big jumps.

1:07:02 - 1:07:04     Text: So yeah, I think there'll be both.

1:07:04 - 1:07:07     Text: So yeah, I mean, there's a sense in which,

1:07:07 - 1:07:10     Text: if all you did was look at this plot and just try to continue it,

1:07:10 - 1:07:14     Text: that might be an underestimate of progress that the field is going to make,

1:07:14 - 1:07:25     Text: because there will be improvements in architecture and algorithms and things like that.

1:07:25 - 1:07:41     Text: So this related input was looking very good at the testing process,

1:07:41 - 1:07:46     Text: and it looked by some computer and new things were inBS.

1:07:46 - 1:07:49     Text: And then he can do everything he's going to do

1:07:49 - 1:07:52     Text: with doing that this scaling long,

1:07:52 - 1:07:54     Text: like, in the fall part, and he'll force back

1:07:54 - 1:07:57     Text: this type of almost the territory.

1:07:57 - 1:08:00     Text: I think that's a great question.

1:08:00 - 1:08:02     Text: I think the simplest version of this,

1:08:02 - 1:08:05     Text: well, a simple version of it that I think is probably

1:08:05 - 1:08:08     Text: important and increasingly important is sort of

1:08:08 - 1:08:10     Text: just reinforcement learning. Reinforcement learning

1:08:10 - 1:08:12     Text: is in a certain sense a situation where you generate your own data,

1:08:12 - 1:08:14     Text: because if you have a language model doing RL,

1:08:14 - 1:08:17     Text: then it writes something and then you're training on that data.

1:08:17 - 1:08:21     Text: So I definitely do think that that will sort of augment data

1:08:21 - 1:08:25     Text: and mean that there'll be other avenues for improvement.

1:08:25 - 1:08:28     Text: Literal data augmentation itself also seems plausible to me.

1:08:28 - 1:08:31     Text: I think it's not happening a lot because there still

1:08:31 - 1:08:34     Text: is more language data out there.

1:08:37 - 1:08:38     Text: Yeah.

1:08:38 - 1:08:47     Text: I think I've got two versions of this one's more clear,

1:08:47 - 1:08:51     Text: just coming from a nine, and a few versions back on this one

1:08:51 - 1:08:55     Text: is just like how, what about associated with the language

1:08:55 - 1:08:58     Text: field and how I can't really explain about that in physics

1:08:58 - 1:09:01     Text: in the second part of this.

1:09:01 - 1:09:04     Text: In your research and for this, I'm sure you dealt a lot with

1:09:04 - 1:09:06     Text: different things going on.

1:09:06 - 1:09:09     Text: But I kind of understand to this type of stuff,

1:09:09 - 1:09:12     Text: other than finding that you found particularly

1:09:12 - 1:09:14     Text: surprising we're going to expect you,

1:09:14 - 1:09:17     Text: I'm just with your past experience,

1:09:17 - 1:09:20     Text: because I always get this a lot of different types of data.

1:09:20 - 1:09:23     Text: Also since now, two of you are going to understand,

1:09:23 - 1:09:26     Text: but for you like under this or like other stuff

1:09:26 - 1:09:28     Text: that you're doing by phenolic,

1:09:28 - 1:09:31     Text: you're using stuff right now, like if you're anything that

1:09:31 - 1:09:34     Text: is coming through like this and going to call us about that

1:09:34 - 1:09:38     Text: or just like particularly, so, supposing.

1:09:38 - 1:09:44     Text: I think to me the most surprising thing about these sorts of

1:09:44 - 1:09:49     Text: results was probably that there is a very, very precise trend.

1:09:49 - 1:09:53     Text: It seems like, I mean, yeah, I mean, like, I think this is

1:09:53 - 1:09:56     Text: sort of an unusual thing, and I think when I saw that,

1:09:56 - 1:09:59     Text: I thought it was a really big deal.

1:09:59 - 1:10:02     Text: I think that like, usually like, I mean, it's just,

1:10:02 - 1:10:05     Text: it's not true of most things you plot.

1:10:05 - 1:10:07     Text: I mean, obviously there are other plots that don't show

1:10:07 - 1:10:09     Text: this kind of trend, even if they're reasonable.

1:10:09 - 1:10:11     Text: I mean, like, I don't know, I mean, there's sort of a trend

1:10:11 - 1:10:14     Text: in TriviaQA, but I don't really know what that means.

1:10:14 - 1:10:17     Text: But I think the fact that there's something seemingly

1:10:17 - 1:10:22     Text: very precise is, I view that as like a very intriguing

1:10:22 - 1:10:25     Text: entry point to like try to dig into something,

1:10:25 - 1:10:27     Text: because it means that there's probably some deeper reason.

1:10:27 - 1:10:30     Text: And then the fact that it seems fairly universal

1:10:30 - 1:10:33     Text: across data distributions, again, suggests something like that.

1:10:33 - 1:10:37     Text: Yeah, the main difference between data distributions is

1:10:37 - 1:10:40     Text: that these exponents in the scaling laws are different.

1:10:40 - 1:10:43     Text: I mean, in terms of like coming from physics,

1:10:43 - 1:10:46     Text: I mean, I think I got into like a lot of this stuff partly

1:10:46 - 1:10:50     Text: because I'm fairly mercurial, and I was interested,

1:10:50 - 1:10:52     Text: and a lot of other friends I had were interested,

1:10:52 - 1:10:55     Text: and so we sort of studied it, and went from there,

1:10:55 - 1:10:58     Text: I had friends, et cetera.

1:10:58 - 1:11:01     Text: But I mean, from another point of view, I think I got involved

1:11:01 - 1:11:04     Text: in it for really weird reasons, perhaps, in the sense that like,

1:11:04 - 1:11:08     Text: I just knew a lot of people who were already, in sort of,

1:11:08 - 1:11:11     Text: I don't know, 2015, talking about things like, wow,

1:11:11 - 1:11:14     Text: is like, how much better is AI going to get?

1:11:14 - 1:11:17     Text: What are the implications going to be for the world?

1:11:17 - 1:11:20     Text: Is this going to keep improving at an addressed clip?

1:11:20 - 1:11:23     Text: What are we going to do to sort of make sure that these

1:11:23 - 1:11:26     Text: models are aligned with human values to use the kind

1:11:26 - 1:11:29     Text: of usual sort of phrase that's now used?

1:11:29 - 1:11:33     Text: And I sort of thought these people were weird and crazy,

1:11:33 - 1:11:35     Text: even though they were friends of mine, and I sort of said,

1:11:35 - 1:11:38     Text: oh, like this is really dumb, like I don't think that these

1:11:38 - 1:11:40     Text: AI models are really something to worry about.

1:11:40 - 1:11:45     Text: But like, I was still interested, and sort of was like,

1:11:45 - 1:11:47     Text: well, like smart people I know think that AI is improving

1:11:47 - 1:11:52     Text: very rapidly, and that might have a lot of impacts,

1:11:52 - 1:11:55     Text: and might require a lot of sort of caution and thought,

1:11:55 - 1:11:58     Text: and work to sort of make it safe.

1:11:58 - 1:12:01     Text: And so that was actually a significant motivation for me

1:12:01 - 1:12:02     Text: getting involved.

1:12:02 - 1:12:06     Text: It was a mixture of sort of, there being a lot of potentially

1:12:06 - 1:12:09     Text: really intellectually interesting questions,

1:12:09 - 1:12:12     Text: liking to sort of switch fields every few years,

1:12:12 - 1:12:16     Text: and friends of mine being very kind of concerned about this

1:12:16 - 1:12:21     Text: question, and yeah, that was sort of what brought me in.

1:12:21 - 1:12:25     Text: Are there everything you've seen in this picture, you know,

1:12:25 - 1:12:28     Text: how do model scale, the one operator,

1:12:28 - 1:12:32     Text: you know, one factor that you found has the most potential

1:12:32 - 1:12:38     Text: to work on, the scale we've got for this kind of question.

1:12:38 - 1:12:43     Text: I mean, if we go back to sort of very basic ML ingredients,

1:12:43 - 1:12:47     Text: of like, what are these things, like,

1:12:47 - 1:12:49     Text: so there's a sense in which this is all you're doing,

1:12:49 - 1:12:52     Text: you choose one of each of these five things.

1:12:52 - 1:12:55     Text: I would guess that what the objective is,

1:12:55 - 1:12:58     Text: is most likely to sort of change things, in the sense that

1:12:58 - 1:13:00     Text: predicting the next word is really sort of one of the

1:13:00 - 1:13:03     Text: laziest sort of dumbest things you can do.

1:13:03 - 1:13:07     Text: And, and, I mean, there are all sorts of things,

1:13:07 - 1:13:10     Text: so it's really just chosen because you want to be able to

1:13:10 - 1:13:12     Text: compute, you want to be able to do backprop,

1:13:12 - 1:13:15     Text: and so you want to be able to get some differentiable thing,

1:13:15 - 1:13:17     Text: you want to be able to get a lot of data for which you can

1:13:17 - 1:13:23     Text: compute this differentiable thing, and so that's the game that you're playing.

1:13:23 - 1:13:26     Text: But, I think that you can have other objectives,

1:13:26 - 1:13:31     Text: like through reinforcement learning, or some other kind of active learning,

1:13:31 - 1:13:34     Text: whatever, I mean, some combination of such things.

1:13:34 - 1:13:40     Text: And, I sort of would just guess that generally performance will change a lot more.

1:13:40 - 1:13:43     Text: Like, if you're expecting sort of these trends to be very different,

1:13:43 - 1:13:45     Text: I would guess they're different if you have a different objective.

1:13:45 - 1:13:50     Text: I think changing the data distribution, or the model might also change things,

1:13:50 - 1:13:54     Text: but I think that, like, the lesson that I personally draw from something like this,

1:13:54 - 1:13:58     Text: is that even if you found a, like, really revolutionary change,

1:13:58 - 1:14:01     Text: that was, like, much better than transformers,

1:14:01 - 1:14:05     Text: it might be kind of equivalent to making transformers 10 times bigger,

1:14:05 - 1:14:10     Text: but I'm not sure if that would be as big of a deal as changing the loss.

1:14:10 - 1:14:12     Text: Changing what the objective is.

1:14:12 - 1:14:15     Text: But that's just my guess, I have no idea.

1:14:15 - 1:14:19     Text: And, of course, this paradigm, I mean, I think I was trying to be polite.

1:14:19 - 1:14:23     Text: I usually have, like, a picture of a grilled cheese here to emphasize,

1:14:23 - 1:14:26     Text: like, sort of, how simple and sort of silly this is,

1:14:26 - 1:14:30     Text: rather than this sort of very sophisticated palette of spices.

1:14:30 - 1:14:35     Text: And, I mean, maybe someone will say, like, this isn't the right set of ingredients

1:14:35 - 1:14:38     Text: from which to think about things, and there's a different thing you should do,

1:14:38 - 1:14:40     Text: and maybe that will make a big difference as well.

1:14:40 - 1:14:43     Text: But I, that's sort of an unknown, unknown.