Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 14 - T5 and Large Language Models

0:00:00 - 0:00:12     Text: Okay, so hi everyone. Welcome back to CS224N. So today we get to have the second of

0:00:12 - 0:00:17     Text: our invited speakers for this quarter. And so given all of the excitement that there's

0:00:17 - 0:00:22     Text: been about transformer pre-trained language models and all the great things that have been

0:00:22 - 0:00:28     Text: done with them, we're especially excited today to be able to welcome Colin Raffel, who's

0:00:28 - 0:00:33     Text: been one of the key people who've been pushing the exploration of large pre-trained language

0:00:33 - 0:00:39     Text: models in particular. He was very interested in the development of the T5 language model,

0:00:39 - 0:00:46     Text: which he'll be telling you plenty about today. But to tell you a few more sentences, so Colin

0:00:46 - 0:00:51     Text: worked for a number of years at Google Brain, including working with Geoff Hinton on

0:00:51 - 0:00:58     Text: capsule networks. He then got interested in the effectiveness of transfer learning using large

0:00:58 - 0:01:03     Text: pre-trained language models. And as part of the work on that, he started working, together

0:01:03 - 0:01:08     Text: with other people on building even bigger, large pre-trained language models and doing

0:01:08 - 0:01:13     Text: a lot of investigations of those, which led to the T5 papers that you'll be hearing

0:01:13 - 0:01:18     Text: about today. So welcome, Colin. Right, yeah, thanks so much for the introduction and for

0:01:18 - 0:01:25     Text: having me. It's definitely an honor to speak at the legendary CS224N class. So yeah, I'm

0:01:25 - 0:01:31     Text: going to be talking today about large language models kind of in general, but focusing specifically

0:01:31 - 0:01:37     Text: on this model T5 that we released about a year and a half ago. I'll be presenting five

0:01:37 - 0:01:43     Text: or so papers that kind of should represent the full spectrum of good things, bad things

0:01:43 - 0:01:47     Text: and ugly things about large language models. And actually, I'll talk a bit about a paper

0:01:47 - 0:01:52     Text: that just appeared on arXiv last night. So hopefully everyone will learn something new

0:01:52 - 0:01:58     Text: today, even if you're already familiar with some of these papers. So just to give you an

0:01:58 - 0:02:02     Text: idea of what I'll be covering in this talk, I kind of will be answering each of these

0:02:02 - 0:02:09     Text: questions in turn over the course of the presentation. And I should mention that

0:02:09 - 0:02:15     Text: since some of these papers are new and some of this material is new, this is the

0:02:15 - 0:02:19     Text: first time I'll be presenting these slides. So if anything is confusing, I understand

0:02:19 - 0:02:23     Text: there's a way for you to pin questions to me through the hosts and feel free to ask

0:02:23 - 0:02:28     Text: about anything that might turn out to be confusing. So yeah, the first question that I'll try

0:02:28 - 0:02:37     Text: to answer is kind of the question we set out to focus on in the T5 paper, which is,

0:02:37 - 0:02:40     Text: you know, which of the transfer learning methods people have proposed so far work best.

0:02:40 - 0:02:48     Text: And what happens when we scale them up? And then after the T5 paper, we decided to investigate

0:02:48 - 0:02:52     Text: non-English pre-trained language models. T5 is an English only model. There are lots

0:02:52 - 0:02:57     Text: of languages spoken in the world. So what happens when we modify T5 so that it's a massively

0:02:57 - 0:03:02     Text: multi-lingual model? And then I'll talk about another paper where we try to investigate

0:03:02 - 0:03:08     Text: what kinds of knowledge and how much knowledge a model picks up over the course of pre-training.

0:03:08 - 0:03:12     Text: And then, relatedly in a follow-up work, we tried to figure out if the model actually

0:03:12 - 0:03:17     Text: memorizes training data during pre-training. If it actually, if we can get it to spit

0:03:17 - 0:03:23     Text: out verbatim entries from the training data set after it's been trained. And then finally,

0:03:23 - 0:03:28     Text: I'll talk about a very recent work that's similar in spirit to the T5 paper where the

0:03:28 - 0:03:32     Text: goal is to answer not what transfer learning methods work best, but what modifications

0:03:32 - 0:03:39     Text: to the transformer architecture that people have proposed work best. So just to motivate

0:03:39 - 0:03:43     Text: this, I actually looked through some of the lectures that you all have had so far just

0:03:43 - 0:03:48     Text: to get a sense of what you've learned already. And I know that you are already pretty familiar

0:03:48 - 0:03:52     Text: with this transfer learning paradigm that has kind of taken the field of natural language

0:03:52 - 0:03:58     Text: processing by storm. But just as a quick refresher in this style of transfer learning, what

0:03:58 - 0:04:05     Text: we typically do is take a bunch of unlabeled text data and we apply some unsupervised

0:04:05 - 0:04:09     Text: objective, or you might say a self-supervised objective, where you do something like you

0:04:09 - 0:04:12     Text: mask out words at random and then you train the model to predict the missing words. So

0:04:12 - 0:04:18     Text: you can see these blanks in the block of text in green and the missing words that the

0:04:18 - 0:04:22     Text: language model will be trained to predict in the yellow box at the bottom. And then

0:04:22 - 0:04:27     Text: after we do this pre-training for a while, we fine tune the model on some downstream supervised

0:04:27 - 0:04:33     Text: task. In this example, I'm showing a sentiment analysis task for movie reviews. And the

0:04:33 - 0:04:38     Text: upshot of all of this is that doing this first unsupervised pre-training step is just

0:04:38 - 0:04:42     Text: ridiculously helpful. Not only does it usually make the performance better, it often gets

0:04:42 - 0:04:47     Text: you very good performance with relatively little fine-tuning data compared to training

0:04:47 - 0:04:51     Text: from scratch. So this is really very common; it's kind of the de facto standard way

0:04:51 - 0:04:58     Text: to attack many natural language processing problems now. And I think you've already reviewed

0:04:58 - 0:05:04     Text: some of these methods, but because of how effective the transfer learning recipe was, there

0:05:04 - 0:05:09     Text: really was kind of an explosion of work on transfer learning starting in maybe 2018

0:05:09 - 0:05:15     Text: or so. There's some prior work on other approaches to doing transfer learning like word vectors,

0:05:15 - 0:05:20     Text: like word2vec, some preliminary work that proposed the recipe that I just described

0:05:20 - 0:05:25     Text: called semi-supervised sequence learning, some work that suggested that this kind of stuff

0:05:25 - 0:05:29     Text: might really be possible, like the unsupervised sentiment neuron paper. But I would say around

0:05:29 - 0:05:35     Text: in 2018, there was kind of a string of papers that kicked off the excitement in the field,

0:05:35 - 0:05:39     Text: including the universal language model fine-tuning paper, the ELMo paper, what we now

0:05:39 - 0:05:46     Text: call GPT-1, and of course the BERT paper in late 2018. And then starting in 2019, there

0:05:46 - 0:05:51     Text: really was just an incredible explosion of different methods for doing transfer learning, including

0:05:51 - 0:05:57     Text: new transfer learning, sorry, new pre-training objectives, new data sets, new ways of doing

0:05:57 - 0:06:05     Text: fine-tuning, and so on. And we started working on the T5 project in late 2018, and we noticed

0:06:05 - 0:06:09     Text: that as all these methods were coming up, it was getting harder and harder to figure out

0:06:09 - 0:06:15     Text: what actually works best. And part of the reason for that was just because there was so

0:06:15 - 0:06:20     Text: many methods that were being proposed kind of simultaneously. And when that happens, even

0:06:20 - 0:06:25     Text: when everyone is working in earnest and in good faith, you might have situations like this

0:06:25 - 0:06:31     Text: where one paper comes along, paper A that proposes a new unsupervised pre-training

0:06:31 - 0:06:35     Text: technique called FancyLearn, another paper comes along maybe around the same time

0:06:35 - 0:06:40     Text: with a pre-training technique called FancierLearn, and paper A pre-trains on Wikipedia

0:06:40 - 0:06:45     Text: for unlabeled data, while paper B uses Wikipedia plus the Toronto Books Corpus, which is just a collection

0:06:45 - 0:06:51     Text: of novels. And then the question obviously is, is FancierLearn better than FancyLearn?

0:06:51 - 0:06:54     Text: Well it's hard to say because they use different sources of pre-training data. You might

0:06:54 - 0:06:59     Text: imagine that maybe they use different model sizes, maybe they pre-trained for a different

0:06:59 - 0:07:04     Text: amount of time, they use different optimizers, there are tons of design decisions that come

0:07:04 - 0:07:11     Text: into play here. And so given that these design decisions can make it hard to determine

0:07:11 - 0:07:17     Text: what worked best, our goal in the T5 paper was kind of to step back and just say, given

0:07:17 - 0:07:20     Text: the current landscape of transfer learning, you know, all of the methods that people have

0:07:20 - 0:07:26     Text: proposed, what actually works best when we compare them in the exact same setting.

0:07:26 - 0:07:31     Text: And once we know what works best, how far can we push these tools that we already have?

0:07:31 - 0:07:37     Text: And how much can we explore the limits and figure out how well these things work at scale?

0:07:37 - 0:07:43     Text: And so to attack this problem, the kind of the only thing that we introduced, since again,

0:07:43 - 0:07:50     Text: we're kind of exploring existing techniques, was this idea of treating all text problems

0:07:50 - 0:07:57     Text: in the same format. And this kind of approach, this dogma of treating every text problem

0:07:57 - 0:08:02     Text: in the same format gives rise to our model, which we call the text-to-text transfer transformer.

0:08:02 - 0:08:08     Text: And so to explain this format to you, the basic idea is that we cast every text-based

0:08:08 - 0:08:14     Text: NLP problem as a text-to-text task. And by that, I mean, the model takes text as input

0:08:14 - 0:08:19     Text: and it produces text as output. So in things like English to German translation, this is

0:08:19 - 0:08:25     Text: pretty typical. We feed in an English sentence on the input and we train the model to predict

0:08:25 - 0:08:29     Text: a German sentence on the output. And you'll notice in our case, we actually also feed

0:08:29 - 0:08:34     Text: what we call a task prefix, translate English to German. That just tells the model what

0:08:34 - 0:08:38     Text: we wanted to do with the input. Because if, especially if we're training a multi-task model,

0:08:38 - 0:08:41     Text: if you just feed the model "That is good.", the model doesn't know what to do with it.

0:08:41 - 0:08:45     Text: It doesn't know if you're trying to do sentiment analysis or English to German translation

0:08:45 - 0:08:52     Text: or what. So this shouldn't be that surprising so far. You probably learned about encoder

0:08:52 - 0:08:55     Text: decoder models, sequence-to-sequence models. And that's basically all that we're doing

0:08:55 - 0:09:00     Text: here. It maybe gets a little more unusual when we start to tackle things like text classification

0:09:00 - 0:09:06     Text: tasks. So this is an example from the CoLA benchmark, which is the Corpus of Linguistic

0:09:06 - 0:09:11     Text: Acceptability. And the goal here is to take a sentence and determine if the sentence is

0:09:11 - 0:09:16     Text: quote unquote acceptable, which kind of means whether it's grammatically correct and also

0:09:16 - 0:09:21     Text: if it's nonsense or not. And in this case, the sentence is "The course is jumping well"

0:09:21 - 0:09:26     Text: and of course, courses can't jump. So this sentence is not acceptable. But rather than

0:09:26 - 0:09:31     Text: training our model through a classifier layer that outputs a class label or a probability

0:09:31 - 0:09:36     Text: distribution over class indices, we actually train the model to output the literal text

0:09:36 - 0:09:43     Text: "not acceptable". So it's outputting this text string token by token. And it can get even

0:09:43 - 0:09:48     Text: a little weirder. We can attack things like regression problems where effectively the

0:09:48 - 0:09:53     Text: model is supposed to be outputting a floating point value. And we actually just do this

0:09:53 - 0:09:57     Text: by taking the floating point numbers and converting them to strings and training the model to

0:09:57 - 0:10:02     Text: just predict the string. And it turns out that at least for this particular task, which

0:10:02 - 0:10:07     Text: is the STS-B benchmark, it works perfectly fine. And we ultimately actually got state

0:10:07 - 0:10:12     Text: of the art on this benchmark. So doing this type of sort of float to string conversion

0:10:12 - 0:10:18     Text: with a little bit of quantization turns out to work fine for regression problems. And finally,

0:10:18 - 0:10:21     Text: you know, the main point of this is that there are a lot, there are really tons and tons

0:10:21 - 0:10:26     Text: of problems that we can cast into this format. So here's an example of abstractive summarization.

0:10:26 - 0:10:30     Text: We feed in a news article on the left and we predict the summary on the right.

0:10:30 - 0:10:35     Text: And again, we can really attack all of these problems in exactly the same way. So we're

0:10:35 - 0:10:41     Text: using exactly the same objective during training and exactly the same decoding procedure at test

0:10:41 - 0:10:47     Text: time to attack a huge variety of natural language processing problems. And the nice part about

0:10:47 - 0:10:53     Text: doing this is that as long as a transfer learning improvement is applicable to our model and to

0:10:53 - 0:11:00     Text: this text to text format, we can try it on a huge suite of downstream tasks while using exactly

0:11:00 - 0:11:05     Text: the same model, exactly the same learning rate optimizer training procedure, exactly the same

0:11:05 - 0:11:10     Text: inference procedure. So we can get rid of a ton of the confounders that I mentioned earlier.
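
To make the text-to-text format concrete, here is a minimal sketch (my own illustration, not code from the T5 release) of how a few of the tasks above get cast as plain (input string, target string) pairs; the example sentences and the STS-B quantization helper are just illustrative.

```python
# Hypothetical illustration: every task is a pair of strings, and the task
# prefix is the only thing that distinguishes one task from another.

def stsb_target(score: float) -> str:
    """Quantize an STS-B similarity score to the nearest 0.2 and render it as text."""
    return f"{round(score * 5) / 5:.1f}"   # e.g. 3.79 -> "3.8"

examples = [
    # (input text, target text)
    ("translate English to German: That is good.", "Das ist gut."),
    ("cola sentence: The course is jumping well.", "not acceptable"),
    ("stsb sentence1: The rhino grazed. sentence2: A rhino is grazing.",
     stsb_target(3.79)),                   # regression cast as string prediction
    ("summarize: state authorities dispatched emergency crews tuesday ...",
     "six people hospitalized after a storm ..."),
]

for source, target in examples:
    print(f"{source!r}  ->  {target!r}")
```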

0:11:11 - 0:11:12     Text: Hey, Colin. Yeah.

0:11:12 - 0:11:20     Text: Can you take a second as we're discussing the formatting here to talk about how the 3.8 sort of

0:11:20 - 0:11:26     Text: regression example works and yeah, just go over that one more time. Yeah, absolutely. So in this

0:11:26 - 0:11:32     Text: particular task, STSB, the goal is to take in two sentences and predict a floating point number,

0:11:32 - 0:11:37     Text: which denotes how similar those two sentences are. And that floating point number ranges between

0:11:37 - 0:11:44     Text: 1.0 and 5.0. So we basically took the ground truth annotated values that we were supposed to

0:11:44 - 0:11:51     Text: regress. We quantized them to the nearest 0.2 and cast them to strings. And now we have a string

0:11:51 - 0:11:56     Text: like "3.8", which you can think of as the tokens three, period, eight, or "4.0" as four,

0:11:56 - 0:12:02     Text: period, zero. And we just train the model to predict that string, which ultimately is just a string.

0:12:02 - 0:12:08     Text: It's not a number anymore. We just predict it token by token. So in a sense, you can kind of think

0:12:08 - 0:12:13     Text: of it as converting the regression problem to a classification problem because you're doing this

0:12:13 - 0:12:18     Text: quantization, but you also more broadly can just think of it as converting the regression problem

0:12:18 - 0:12:25     Text: to a text to text problem, which is what we're doing. Thanks. Yeah. It's a little funky, but I promise

0:12:25 - 0:12:32     Text: it works. And great. So the nice thing again about using this sort of sequence to sequence text

0:12:32 - 0:12:38     Text: to text format is that we can actually just use the original vanilla transformer as it was proposed

0:12:38 - 0:12:43     Text: because if you remember, the transformer was actually proposed for English to, well, it was

0:12:43 - 0:12:48     Text: proposed for machine translation primarily, which is a sequence-to-sequence task, you know, taking a

0:12:48 - 0:12:54     Text: sentence in one language as input and producing the corresponding sentence in another language.

0:12:54 - 0:12:58     Text: And so really, I won't say a lot about the model that we use. There are relatively few changes

0:12:58 - 0:13:02     Text: that we made to the standard transformer architecture as originally proposed in Attention Is

0:13:02 - 0:13:10     Text: All You Need when constructing our T5 model. I will, towards the end of the talk, discuss lots of

0:13:10 - 0:13:15     Text: architectural modifications that people have made since, but for T5, or the T5 paper, we really

0:13:15 - 0:13:22     Text: basically stuck to the basics. So the next big question when you're attacking a transfer

0:13:22 - 0:13:28     Text: learning problem is what should my pre-training data set be? And because of the internet, there

0:13:28 - 0:13:34     Text: are lots of possible sources of unlabeled text data, one common source is Wikipedia. I'm displaying

0:13:34 - 0:13:41     Text: a ton of Wikipedia articles on the screen here. And in undertaking this project, one of the factors

0:13:41 - 0:13:46     Text: that we wanted to study was the effect of the pre-training data set itself. And so we actually

0:13:46 - 0:13:52     Text: constructed a new pre-training data set that would allow us to vary the size across many orders of

0:13:52 - 0:13:57     Text: magnitude, and that has a filtering pipeline that would allow us to control the quality and type

0:13:57 - 0:14:02     Text: of data that we pre-trained on. And I'll describe how we built that data set now.

0:14:03 - 0:14:09     Text: So the first thing we did is we wanted to source our data from a publicly available source. We

0:14:09 - 0:14:14     Text: didn't want to use some Google internal web scrape that we couldn't release. So we made use of

0:14:14 - 0:14:19     Text: these web scrapes by a nonprofit organization called Common Crawl, which is really just

0:14:20 - 0:14:25     Text: an organization that sends web crawlers out through the internet and downloads as much text as they

0:14:25 - 0:14:31     Text: can. And every month they dump out what they call web-extracted text, which you can think of

0:14:31 - 0:14:39     Text: as websites with all of the HTML and JavaScript ideally removed. And this produces text that

0:14:39 - 0:14:46     Text: has a pretty decent amount of natural language in it. Also a lot of boilerplate and menu text

0:14:46 - 0:14:51     Text: and also a little bit of gibberish. But as a whole, it's kind of a good starting point for

0:14:51 - 0:14:57     Text: constructing these pre-training data sets. And then we took a few steps to kind of try to make it

0:14:57 - 0:15:02     Text: so this data set was a little cleaner. So the first thing we did is we removed any lines that

0:15:02 - 0:15:08     Text: didn't end in a terminal punctuation mark. We used a language classifier to only retain English text.

0:15:09 - 0:15:14     Text: We removed anything that looked like placeholder text, like lorem ipsum text on the right.

0:15:14 - 0:15:20     Text: We removed anything that looked like code. We deduplicated things at the sentence level. So any time that

0:15:20 - 0:15:26     Text: any chunk of text appeared on multiple pages, we only retained it on one of the pages and so on.

0:15:26 - 0:15:32     Text: And ultimately these heuristics were relatively simple but produced reasonably clean text. And I

0:15:32 - 0:15:37     Text: will discuss later the effect of these choices in cleaning. That was one of the experiments that we ran.

0:15:39 - 0:15:44     Text: And so after doing this, we created this data set called C4, which is the Colossal Clean Crawled

0:15:44 - 0:15:50     Text: Corpus. And it's available in TensorFlow Datasets. You actually, you need to do the processing

0:15:50 - 0:15:56     Text: yourself, which is somewhat computationally expensive. But nevertheless, it is entirely possible.

0:15:56 - 0:16:02     Text: And it produces about 750 gigabytes of reasonably clean natural text data.
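
As a rough sketch of the kind of heuristic cleaning just described (a simplification for illustration, not the actual C4 pipeline; the filter lists below are placeholders):

```python
# Simplified sketch of C4-style heuristic cleaning of web-extracted text.
# The real pipeline (available via TensorFlow Datasets) has more rules;
# this only illustrates the flavor of the filters described above.

TERMINAL_PUNCT = (".", "!", "?", '"')
BAD_SUBSTRINGS = ("lorem ipsum", "{", "javascript:")   # placeholder text / code-ish markers

def clean_page(page: str, seen_lines: set) -> str:
    kept = []
    for line in page.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCT):           # drop menu/boilerplate lines
            continue
        if any(bad in line.lower() for bad in BAD_SUBSTRINGS):
            continue                                    # drop placeholder text and code
        if line in seen_lines:                          # deduplicate across pages
            continue
        seen_lines.add(line)
        kept.append(line)
    return "\n".join(kept)

# A language classifier would additionally discard whole pages
# that are not predicted to be English.
```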

0:16:04 - 0:16:10     Text: Okay, so now we have our framework, our model, our pre-training data set. We need our pre-training

0:16:10 - 0:16:18     Text: objective. What are we going to do to train the model on our unlabeled text? And so just to explain

0:16:18 - 0:16:22     Text: the objective that we chose kind of for our baseline experimental procedure, which again we will

0:16:22 - 0:16:27     Text: experiment with different pre-training objectives later. Imagine that you have some original text

0:16:27 - 0:16:32     Text: sentence like thank you for inviting me to your party last week. And what we do is we basically

0:16:32 - 0:16:37     Text: choose some words at random and technically we're choosing tokens at random. But for now,

0:16:37 - 0:16:43     Text: just assume that tokens are words. And we're going to drop those tokens out. And so we end

0:16:43 - 0:16:48     Text: up with something that looks like this. We, for every consecutive span of tokens that have been dropped

0:16:48 - 0:16:53     Text: out, we replace it with a sentinel token. And each sentinel token gets a unique index. So one of

0:16:53 - 0:16:59     Text: them we call sentinel X, and the other one will be sentinel Y. And so you can see that

0:16:59 - 0:17:04     Text: because the words "for" and "inviting" are consecutive words that we decided to mask out by randomly

0:17:04 - 0:17:09     Text: masking words, we're going to replace both of those words with a single unique

0:17:09 - 0:17:15     Text: sentinel token. And then the model's goal will just be to fill in the blanks. And so if you're

0:17:15 - 0:17:19     Text: familiar with the BERT pre-training objective, this is somewhat similar. The fact that we're

0:17:19 - 0:17:25     Text: collapsing consecutive masked-out tokens into a single span that we replace is slightly

0:17:25 - 0:17:30     Text: different. And the fact that we're reconstructing just the missing words and not the entire sequence

0:17:30 - 0:17:37     Text: is maybe slightly different too. Okay, so now I'll kind of talk through our baseline experimental

0:17:37 - 0:17:42     Text: procedure that we're going to use, which we're going to kind of tweak along various

0:17:42 - 0:17:48     Text: axes to explore the landscape of transfer learning, at least circa when this paper came out.
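
For concreteness, here is a small sketch of the span-corruption preprocessing described a moment ago; it treats whitespace-separated words as tokens and uses T5-style <extra_id_N> sentinel strings, but the helper itself is just an illustration, not the released preprocessing code.

```python
import random

def span_corrupt(text: str, corruption_rate: float = 0.15, seed: int = 0):
    """Mask random words, collapse consecutive masks into one sentinel, and
    return (corrupted input, target containing only the missing spans)."""
    rng = random.Random(seed)
    words = text.split()
    masked = [rng.random() < corruption_rate for _ in words]

    inputs, targets, sentinel = [], [], 0
    i = 0
    while i < len(words):
        if masked[i]:
            token = f"<extra_id_{sentinel}>"        # unique sentinel per span
            inputs.append(token)
            targets.append(token)
            while i < len(words) and masked[i]:     # absorb the whole masked span
                targets.append(words[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(words[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

src, tgt = span_corrupt("Thank you for inviting me to your party last week .")
print(src)   # corrupted input, with each dropped span replaced by a sentinel
print(tgt)   # the dropped spans, each preceded by its sentinel
```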

0:17:49 - 0:17:54     Text: So to pre-train the model, we're going to take a model that has a BERT-base-sized encoder

0:17:54 - 0:17:58     Text: and decoder. So it technically has twice as many parameters as BERT-base because there's a

0:17:58 - 0:18:03     Text: BERT-base-sized encoder and a BERT-base-sized decoder. We're going to use the denoising objective,

0:18:03 - 0:18:07     Text: the sort of masked language modeling objective that I just described. And we're going to apply it on

0:18:07 - 0:18:14     Text: the C4 dataset that I mentioned earlier. We're going to pre-train for about 34 billion tokens,

0:18:14 - 0:18:18     Text: which is about a quarter as long as BERT-base was trained. So it's not a ton of pre-training time,

0:18:18 - 0:18:22     Text: but because we're doing so many experiments, we need to cut it back a little bit.

0:18:24 - 0:18:28     Text: We're going to use an inverse square root learning rate schedule that turned out to work reasonably

0:18:28 - 0:18:33     Text: well in our setting, but it's not a terribly important design decision. And then we'll fine-tune on

0:18:33 - 0:18:38     Text: a variety of downstream tasks, kind of the tasks that people cared a lot about at the time. There's

0:18:38 - 0:18:43     Text: the GLUE benchmark, which is kind of a meta-benchmark of many individual downstream tasks,

0:18:43 - 0:18:48     Text: like CoLA and STS-B that I already mentioned. These are what some people might call natural

0:18:48 - 0:18:54     Text: language understanding tasks, but for the most part you can think of them as sentence classification,

0:18:54 - 0:19:00     Text: sentence pair classification, or regression tasks. We also consider the CNN Daily Mail

0:19:00 - 0:19:04     Text: Abstractive Summarization Corpus. This is a sequence-to-sequence problem where you're given a

0:19:04 - 0:19:09     Text: news article and you have to output the summary. The SQuAD question answering benchmark, which is

0:19:09 - 0:19:14     Text: a reading comprehension benchmark where you're given a paragraph and you have to answer a question

0:19:14 - 0:19:18     Text: about the paragraph. You can either attack it in an extractive setting where you extract the

0:19:18 - 0:19:22     Text: answer from the paragraph or an abstractive setting where you just output the answer.

0:19:22 - 0:19:29     Text: We use the abstractive form because it's a text-to-text problem. We also included the SuperGLUE

0:19:29 - 0:19:32     Text: benchmark, which was a new benchmark at the time that was designed to essentially be a more

0:19:32 - 0:19:37     Text: difficult version of the glue benchmark. It has a new set of tasks that were hard for existing

0:19:37 - 0:19:42     Text: models. And then finally we included three translation data sets, English to German,

0:19:42 - 0:19:46     Text: English to French and English to Romanian translation. English to French being the largest,

0:19:46 - 0:19:52     Text: which was an extremely large data set, and English to Romanian being many orders of magnitude smaller.

0:19:54 - 0:19:58     Text: And we're going to fine tune on each of these tasks individually and separately. So we take the

0:19:58 - 0:20:02     Text: pre-trained model and separately fine tune on each of these downstream tasks. And we're going to

0:20:02 - 0:20:08     Text: fine tune for up to 17 billion tokens, but we're going to save checkpoints along the way, evaluate

0:20:08 - 0:20:14     Text: each checkpoint on the validation set and report the performance on the best checkpoint. And note that

0:20:14 - 0:20:19     Text: this is not an experimentally valid way to report your performance because you're basically doing

0:20:19 - 0:20:24     Text: model selection on the same data split that you're reporting performance on, which is not a good

0:20:24 - 0:20:29     Text: way to compare against other methods, but for comparing among our own experimental settings,

0:20:29 - 0:20:34     Text: we're reasonably

0:20:34 - 0:20:41     Text: comfortable doing this. So now I'm going to kind of give you a very high level overview of some

0:20:41 - 0:20:47     Text: of the experimental results in this paper. This paper is pretty huge in terms of just the number

0:20:47 - 0:20:52     Text: of experiments we ran. And so if I was to really drill into this paper, it probably would take me the

0:20:52 - 0:20:57     Text: whole time, but there's other fun stuff that I want to tell you about. But anyways, the point

0:20:57 - 0:21:02     Text: is I will be showing you lots of tables like this one. And so in these tables on the columns, you

0:21:02 - 0:21:06     Text: have the performance on the various downstream tasks. And in the rows, you have different experimental

0:21:06 - 0:21:11     Text: settings that we considered. So to give you an example, here's kind of the scores that we got

0:21:13 - 0:21:18     Text: from our baseline, which is the exact experimental procedure that I just

0:21:18 - 0:21:24     Text: described. And we also ran that baseline 10 times and report the standard deviation on the

0:21:24 - 0:21:29     Text: second line. And then in this last line here, we're reporting the performance of the same

0:21:29 - 0:21:34     Text: model without any pre-training, just basically only trained separately with supervision on all of

0:21:34 - 0:21:41     Text: these downstream tasks. And just to point out a couple of things on this table, the first obvious

0:21:41 - 0:21:48     Text: thing is that in most cases, the no-pre-training setting is dramatically worse. So indeed, transfer

0:21:48 - 0:21:53     Text: learning does tend to be helpful. The place where that's not true is actually on this English

0:21:53 - 0:21:57     Text: to French translation task. And that's probably because it's such a big task that you actually

0:21:57 - 0:22:03     Text: don't really need pre-training to do well at it. We wanted to include this because if the performance

0:22:03 - 0:22:09     Text: regresses on this task, then that's something that we should be worried about. The next feature of

0:22:09 - 0:22:14     Text: this table to notice is this little star here. That star will appear anytime there's a row in the

0:22:14 - 0:22:20     Text: table that's equivalent to our baseline. And another little thing to note, maybe if you're familiar with

0:22:20 - 0:22:25     Text: the history, the score that we got on GLUE and SQuAD was reasonably comparable to BERT. So it's a

0:22:25 - 0:22:32     Text: decent sanity check. We have a model that has more parameters, but it's only trained for a quarter

0:22:32 - 0:22:38     Text: as long. And it nevertheless got comparable performance while using a similar objective, so we

0:22:38 - 0:22:42     Text: shouldn't be too surprised about that. And then the last thing to mention is that we're going to use

0:22:42 - 0:22:47     Text: this standard deviation over and over again so that we can bold entries in the table when they're

0:22:47 - 0:22:52     Text: within one standard deviation of the maximum value for that data set in the table. So now, I'll just

0:22:52 - 0:22:56     Text: make a big disclaimer, which is we're going to compare lots of different things. We're going to run

0:22:56 - 0:23:01     Text: lots of experiments, but we're not going to tweak any hyperparameters because if we did, like,

0:23:01 - 0:23:06     Text: change the learning rate or whatever, it would be just too computationally expensive to do this

0:23:06 - 0:23:13     Text: for each individual method. Our hope is that this is okay because we are treating all problems in

0:23:13 - 0:23:19     Text: exactly the same framework. We're always doing text to text maximum likelihood training. So hopefully

0:23:19 - 0:23:23     Text: we can keep hyperparameters fixed. And arguably, if you propose a new method that requires extensive

0:23:23 - 0:23:28     Text: hyperparameter tuning, it's not a very useful method for practitioners. And we'll get into that a

0:23:28 - 0:23:33     Text: little bit more later when I talk about architectural modifications too. The other thing I'll say is

0:23:33 - 0:23:38     Text: that while we did run lots of experiments, there's no way we could be comprehensive because there were

0:23:38 - 0:23:44     Text: so many methods out there. And the inclusion or exclusion of one particular method is not meant as

0:23:44 - 0:23:50     Text: a judgment on its quality. It's just what we were able to do given the constraints that we were

0:23:50 - 0:23:58     Text: working under. So the first set of experiments that we ran were to compare different model structures.

0:23:58 - 0:24:05     Text: So as I mentioned earlier, the main baseline T5 model is an encoder decoder model. And in this

0:24:05 - 0:24:10     Text: case, you have a separate layer stack that encodes the input sequence and a separate layer stack that

0:24:10 - 0:24:16     Text: decodes the target sequence. Basically, it generates the target sequence token by token,

0:24:16 - 0:24:21     Text: while attending back to the encoder's output to figure out what it should condition on. The next

0:24:22 - 0:24:26     Text: setup that we considered is an encoder decoder model except that all of the relevant parameters

0:24:26 - 0:24:32     Text: in the encoder and decoder are shared. So there are basically half as many parameters. And then finally,

0:24:32 - 0:24:37     Text: another variant that we considered is an encoder decoder model where the encoder and decoder have

0:24:37 - 0:24:42     Text: half as many layers as they do in the baseline. And that's because we're also considering single

0:24:42 - 0:24:48     Text: stack models, the language model and what we call a prefix language model. The language model is a

0:24:48 - 0:24:53     Text: model that models the sequence strictly in a left-to-right, causal fashion. It

0:24:53 - 0:24:59     Text: basically just ingests tokens one at a time and predicts the next token. And you can actually

0:24:59 - 0:25:04     Text: apply these to text to text problems by basically feeding the input as a prefix before you start

0:25:04 - 0:25:10     Text: predicting anything. Now, if you just use a language model in its strict format, then you still have to

0:25:10 - 0:25:16     Text: have what we would call a causal mask, so a causal attention pattern on the prefix. And that's

0:25:16 - 0:25:24     Text: actually how the GPT series models treat all of their problems. But because we are explicitly denoting

0:25:24 - 0:25:28     Text: part of the sequence as an input and the rest of the sequence as a target, we actually can allow

0:25:28 - 0:25:33     Text: the model to have full visibility, you know, a non-causal mask on the input region of the sequence.

0:25:33 - 0:25:36     Text: And when we make that change, we call that the prefix language model.
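
A minimal sketch of the attention-mask difference just described (my own illustration): a plain language model uses a strictly causal mask everywhere, while the prefix LM allows full, non-causal visibility within the input (prefix) region.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Position i may attend only to positions <= i."""
    return np.tril(np.ones((n, n), dtype=bool))

def prefix_lm_mask(n: int, prefix_len: int) -> np.ndarray:
    """Like a causal mask, except every position may attend to the full prefix."""
    mask = causal_mask(n)
    mask[:, :prefix_len] = True      # full (non-causal) visibility on the input region
    return mask

# Example: 4 input (prefix) tokens followed by 3 target tokens.
print(causal_mask(7).astype(int))
print(prefix_lm_mask(7, prefix_len=4).astype(int))
```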

0:25:36 - 0:25:42     Text: And now the upshot of all of this really is that the encoder decoder model for our framework turns

0:25:42 - 0:25:46     Text: out to work best. You can see that when we share the parameters, it does hurt performance a little

0:25:46 - 0:25:52     Text: bit, but maybe a little less than you might expect. The prefix language model attains slightly

0:25:52 - 0:25:56     Text: worse performance, but significantly better performance than doing strictly causal left to right

0:25:56 - 0:26:01     Text: language modeling, which is what you see in the fourth row here. And finally, halving the number of

0:26:01 - 0:26:09     Text: parameters in the encoder and decoder harms performance significantly. One thing to note is that

0:26:09 - 0:26:13     Text: in all of these cases, we're processing the same total sequence length. It's the same input sequence

0:26:13 - 0:26:19     Text: in the same target sequence. So in most of these cases, the total number of flops required to process

0:26:19 - 0:26:24     Text: the sequences is the same, even though the number of parameters is twice as many in the baseline model.

0:26:26 - 0:26:30     Text: So the next thing we looked into were different variants on our pre-training objective. So the

0:26:30 - 0:26:35     Text: first thing we did was kind of compare different high level approaches, maybe just training the model

0:26:35 - 0:26:40     Text: to predict the next token, one token at a time. That's kind of a language modeling objective. Another

0:26:40 - 0:26:44     Text: would be to take the input sequence, shuffle it up and train the model to predict the unshuffled

0:26:44 - 0:26:50     Text: sequence, or to consider a masked language model style, a BERT-style objective, like the one that we

0:26:50 - 0:26:56     Text: mentioned earlier. And now in the second set of results that I'm showing here,

0:26:56 - 0:27:01     Text: we considered a BERT-style objective where the model is trained to predict the entire

0:27:02 - 0:27:07     Text: original uncorrupted input sequence, a MASS-style objective, which is quite similar.

0:27:08 - 0:27:12     Text: And then a replace corrupted spans objective, which is like the one that I described at the

0:27:12 - 0:27:16     Text: beginning that we're using in our baseline model. And finally, a variant where rather than

0:27:16 - 0:27:23     Text: replacing each span with a unique sentinel token, we just drop the masked tokens completely

0:27:23 - 0:27:28     Text: and train the model to predict the dropped tokens. And you can see that the latter two options

0:27:28 - 0:27:33     Text: work roughly as well as one another. But another pertinent difference between these

0:27:33 - 0:27:39     Text: sets of objectives is that the first two involve predicting the entire input sequence.

0:27:39 - 0:27:44     Text: And the last two basically just involve predicting the masked-out tokens. And when you only

0:27:44 - 0:27:48     Text: predict the masked-out tokens, you have a much shorter target sequence. And so the overall cost

0:27:48 - 0:27:54     Text: is significantly lower for pre-training. So we decided that that was the best approach.

0:27:54 - 0:28:00     Text: And then we considered other hyper parameters in our masking strategy, such as how many tokens to

0:28:00 - 0:28:07     Text: mask out. So the next thing we considered were different variants of the pre-training data set.

0:28:08 - 0:28:13     Text: In our baseline, we use the C4 data set that I proposed at the beginning of the talk.

0:28:13 - 0:28:19     Text: We also compared to pre-training only on unfiltered data from C4. So rather than doing all

0:28:19 - 0:28:23     Text: these heuristic filtering steps, we just take the raw web-extracted text from C4 and pre-train

0:28:23 - 0:28:28     Text: on that. And you can see that that does uniformly worse. So it does seem to be true that these

0:28:28 - 0:28:33     Text: cleaning steps that we're doing are actually useful. The next four data sets were our attempt to

0:28:34 - 0:28:38     Text: reproduce similar data sets that had been used in past work. The RealNews data set came from

0:28:38 - 0:28:44     Text: the Grover paper. It's essentially pre-training only on data from news sites. WebText is the

0:28:44 - 0:28:50     Text: data set that was used in the GPT-2 paper, where you only train on web pages that were linked to and

0:28:50 - 0:28:56     Text: received a reasonably high score on Reddit. And then the last two variants are either Wikipedia

0:28:56 - 0:29:01     Text: alone or, as was used in the BERT paper, Wikipedia plus the Toronto Books Corpus.

0:29:02 - 0:29:08     Text: And you might actually notice that on some of these more specialized data sets we get better performance.

0:29:08 - 0:29:13     Text: So for example, you can see that on the Wikipedia and Toronto Books Corpus on the bottom row,

0:29:13 - 0:29:19     Text: we actually do much better on SuperGLUE with a score of a little over 73 compared to pre-training

0:29:19 - 0:29:25     Text: on C4. And it turns out, or we conjecture, that this is because SuperGLUE

0:29:25 - 0:29:33     Text: contains a task called MultiRC, which is a reading comprehension task on news and

0:29:33 - 0:29:38     Text: encyclopedia articles and on novels. So the basic takeaway here is that when you pre-train on

0:29:39 - 0:29:44     Text: data that's similar to your downstream task that has a similar domain, you often get a big boost

0:29:44 - 0:29:49     Text: in that downstream task and that's indeed what happened here. Interestingly, you can also see the

0:29:49 - 0:29:55     Text: opposite effect. So if you look at Wikipedia on the second-to-last row, if you only pre-train on

0:29:55 - 0:30:00     Text: Wikipedia, you end up doing much worse on CoLA, which is the Corpus of Linguistic Acceptability

0:30:00 - 0:30:06     Text: task that I mentioned early on. And we conjecture that this is because Wikipedia has very little

0:30:06 - 0:30:11     Text: unacceptable text. You're basically only pre-training on clean text, whereas C4 has some

0:30:11 - 0:30:15     Text: ungrammatical texts, some nonsense in it. And so that actually can boost your performance a little

0:30:15 - 0:30:20     Text: bit on CoLA. The last thing to note is that while you do see some gains sometimes when using these

0:30:20 - 0:30:25     Text: smaller data sets, these data sets are about an order of magnitude smaller than C4. So then the

0:30:25 - 0:30:30     Text: natural question is, does it actually hurt you to pre-train on a smaller data set? So to answer

0:30:30 - 0:30:36     Text: that question, what we did is basically took C4 and artificially made it smaller so that it was

0:30:36 - 0:30:42     Text: repeated over the course of pre-training. And you can see here that when you repeat the data set 64

0:30:42 - 0:30:48     Text: times, so it's 34 billion divided by 64 tokens, because that's how much pre-training we did,

0:30:49 - 0:30:53     Text: you actually don't sacrifice much performance. The performance is roughly the same.

0:30:53 - 0:31:00     Text: But if you repeat the data set 256 times, 1024 times or more, you actually start to see degradation.

0:31:00 - 0:31:04     Text: And the reason that we think this is happening is because you're basically overfitting during

0:31:04 - 0:31:09     Text: pre-training. And you can get a sense for whether that's true or not by looking, just looking at

0:31:09 - 0:31:13     Text: the training loss. You can see that the model attains a much, much smaller training loss as

0:31:13 - 0:31:18     Text: repeat the data set more and more times. So the upshot of this is that your data set should be

0:31:18 - 0:31:24     Text: at least as big that you don't see significant overfitting during pre-training. And later on,

0:31:24 - 0:31:30     Text: when we scale up these models and pre-training them on much more data, we would do enough repeats

0:31:30 - 0:31:35     Text: of the smaller sort of more domain specific data sets that we imagine we would see harmful effects.

0:31:38 - 0:31:43     Text: The next thing we experimented with were multi-task learning strategies. So when you're doing

0:31:43 - 0:31:49     Text: multi-task learning, you're essentially training the model on multiple tasks at once. And in most of

0:31:49 - 0:31:54     Text: the, in all the experiments I'm showing on this slide here, we're actually training on every single

0:31:54 - 0:32:00     Text: task at once. So the pre-training task and all of the downstream tasks together. And the most

0:32:00 - 0:32:04     Text: pertinent question when you're doing multi-task training like this is how often should I sample data

0:32:04 - 0:32:11     Text: from each task? So one approach is just to sample data at an equal rate across all of the tasks.

0:32:11 - 0:32:18     Text: Another case is to basically pretend like you just concatenated all the data sets. We call that

0:32:18 - 0:32:22     Text: examples proportional mixing because it's equivalent to sampling from the data set in accordance to

0:32:22 - 0:32:27     Text: how many examples there are in the data set. The difficult thing with that though is that our pre-training

0:32:27 - 0:32:32     Text: data set is so big that its proportion would be much, much, much bigger than every downstream task.

0:32:32 - 0:32:38     Text: And we basically would never train on any of the downstream data. So we introduced this hyper parameter

0:32:38 - 0:32:44     Text: K, which is a constant that basically is how big should we pretend that the pre-training data set is.

0:32:45 - 0:32:51     Text: The last thing you can do is take the number of examples in each data set and scale it by a

0:32:51 - 0:32:56     Text: temperature. The larger the temperature, the closer you get to equal mixing, that is, uniform sampling

0:32:56 - 0:33:04     Text: from each data set. But at any rate, the main takeaway from this table is that you can get pretty

0:33:04 - 0:33:09     Text: close to the performance of separate pre-training and fine-tuning like we do on our baseline

0:33:09 - 0:33:14     Text: if you get the mixing strategy right. But ultimately, we found that you do tend to sacrifice

0:33:14 - 0:33:18     Text: some performance when doing multi-task training on at least some of the tasks.
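
A small sketch of the mixing strategies just described, assuming the rates are computed from per-task example counts; the example sizes and the particular value of K below are illustrative, not the ones used in the paper.

```python
def mixing_rates(num_examples, K=None, temperature=1.0):
    """Sampling rate for each task in a multi-task mixture.

    K limits how large any one data set is allowed to appear
    (examples-proportional mixing with an artificial size limit);
    temperature > 1 pushes the rates toward uniform sampling.
    """
    capped = [min(n, K) if K is not None else n for n in num_examples]
    scaled = [c ** (1.0 / temperature) for c in capped]
    total = sum(scaled)
    return [s / total for s in scaled]

# e.g. a huge unsupervised task mixed with two small supervised tasks
sizes = [350_000_000, 400_000, 7_000]
print(mixing_rates(sizes))                          # dominated by the big task
print(mixing_rates(sizes, K=2**21))                 # cap the big task's effective size
print(mixing_rates(sizes, K=2**21, temperature=2))  # closer to uniform sampling
```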

0:33:19 - 0:33:23     Text: Colin, there were lots of questions back on the slide with the different

0:33:23 - 0:33:28     Text: data sets, the one showing RealNews and C4 and so on. Sure.

0:33:28 - 0:33:36     Text: Can you take a couple? Absolutely. Yeah, so firstly, if you just look at this, it still does kind of

0:33:36 - 0:33:44     Text: look like you can get great results with more than an order of magnitude less text than C4.

0:33:44 - 0:33:48     Text: Yeah. And it seemed like that's not the message you wanted to be putting forward.

0:33:49 - 0:33:53     Text: Yeah. So there's a little nuance here, which is basically that in our baseline, in these

0:33:53 - 0:33:58     Text: experiments that I've been running so far, we're actually not pre-training for that long. So as I

0:33:58 - 0:34:03     Text: mentioned earlier, we're actually pre-training for a quarter as long as BERT. And actually for us,

0:34:03 - 0:34:12     Text: I believe, one 256th as long as XLNet, for example. And later in the paper,

0:34:12 - 0:34:17     Text: we're going to pre-train for much, much, much longer. And in that case, we would end up

0:34:17 - 0:34:21     Text: repeating these data sets many, many times over the course of pre-training and we'd start to see

0:34:21 - 0:34:25     Text: these negative effects that I explained on the next slide. Oh, so that's why you're then doing

0:34:25 - 0:34:29     Text: the repeat. Okay, so people were asking about why you'd train on the same stuff over and

0:34:29 - 0:34:38     Text: over again, and that's the test. Yeah. Okay, exactly. And then on the data sets, so to a first

0:34:38 - 0:34:46     Text: approximation, does C4 contain Wikipedia? Very partially: many pages of Wikipedia, but not all of

0:34:46 - 0:34:56     Text: Wikipedia. Common Crawl is a sort of web crawl done by following links with some priority.

0:34:56 - 0:35:00     Text: And ultimately, it didn't cover all of Wikipedia. I don't actually know the exact proportion of

0:35:00 - 0:35:08     Text: Wikipedia that's included in C4, but definitely when training on C4, you will see some Wikipedia text.

0:35:08 - 0:35:13     Text: It will be at a relatively low proportion compared to all of the other data that you'll

0:35:13 - 0:35:21     Text: see. Sure. And then someone wasn't quite convinced with your argument that the good quality of

0:35:21 - 0:35:29     Text: Wikipedia explained the worse performance on CoLA, because they thought, well, surely RealNews

0:35:29 - 0:35:34     Text: is basically well-edited text as well. And yet it seems to work fine.

0:35:35 - 0:35:39     Text: Yeah, it's a good point. I'm not sure, you know, it could be that RealNews, because RealNews has

0:35:39 - 0:35:45     Text: quotes in it, or maybe RealNews ended up having some content from the comment sections of sites.

0:35:45 - 0:35:50     Text: I should say that the reason they're labeled "RealNews-like" and "WebText-like" is that these

0:35:50 - 0:35:54     Text: are our own reproductions of them, so they might not be exactly the same as the originally proposed

0:35:54 - 0:36:01     Text: variants, because WebText, for example, was never released. Yeah, but it's an interesting point.

0:36:01 - 0:36:05     Text: And that's also why I would say that it's a conjecture. It's not, you know, not something that I

0:36:05 - 0:36:11     Text: can make a rigorous claim about. Okay, maybe we should let you go on now. Great. Thanks. Yeah,

0:36:11 - 0:36:17     Text: thanks for those questions. So then the next thing after looking at these different multitask

0:36:17 - 0:36:24     Text: training strategies is to see if there's any way for us to close the gap between multitask training

0:36:24 - 0:36:30     Text: and this pre-training followed by separate fine tuning. And we experimented with many different

0:36:30 - 0:36:35     Text: strategies here, basically strict multitask training, doing multitask training followed by individual

0:36:35 - 0:36:41     Text: task fine tuning, doing multitask training, but without any unsupervised data. And really,

0:36:41 - 0:36:47     Text: the main takeaway from all of these experiments was that if you do the multitask training first,

0:36:47 - 0:36:52     Text: including the unsupervised task, and then you fine tune the model on each task separately,

0:36:53 - 0:36:57     Text: which is the third row here, you actually don't really sacrifice much performance at all.

0:36:57 - 0:37:03     Text: You don't end up with a multitask model because you're fine-tuning on each task individually.

0:37:03 - 0:37:07     Text: But the nice thing about this approach is that you can monitor the

0:37:08 - 0:37:13     Text: performance on your downstream tasks while you're doing pre-training. And you don't sacrifice much

0:37:13 - 0:37:20     Text: performance. One setting that we didn't consider is the unsupervised pre-training followed by

0:37:20 - 0:37:28     Text: supervised multitask training. I wish that we had run that, but we just didn't. So then sort of

0:37:28 - 0:37:33     Text: the last set of experiments we ran tried to answer the following question. Let's say that someone comes

0:37:33 - 0:37:38     Text: along and all of a sudden gives you four times as much compute. What should you do with it? And so

0:37:38 - 0:37:42     Text: there are a number of things you could do. You could increase the number of training steps by a factor

0:37:42 - 0:37:47     Text: of four. You could increase your batch size by a factor of four. You could make your model twice as big

0:37:47 - 0:37:52     Text: and train for twice as long. You can make your model four times as big. You could train four models

0:37:52 - 0:37:57     Text: separately and ensemble them. Or you could do this last thing which doesn't actually use four times

0:37:57 - 0:38:02     Text: as much compute where you pre-trained one model and you fine tune it four times separately and then

0:38:02 - 0:38:09     Text: ensemble those. And the main takeaway here is that scaling helps. This is, you know, very unsurprising,

0:38:09 - 0:38:18     Text: especially in 2021. But interestingly, you get significant gains whether you just increase

0:38:18 - 0:38:24     Text: the training time or if you increase the size. So you can see that along both of these axes we

0:38:24 - 0:38:28     Text: get significant performance improvements, although the performance improvements are more dramatic when

0:38:28 - 0:38:33     Text: we increase the size. And in particular, you can see that we've gone from a score of about 71

0:38:33 - 0:38:40     Text: on SuperGLUE to 78 just by making the model four times bigger. Okay, so let me just kind of give

0:38:40 - 0:38:45     Text: a quick recap of all of that and then use that recap to explain the design decisions that went into

0:38:45 - 0:38:50     Text: the final sort of T5 models. The first thing is that we're going to choose an encoder decoder

0:38:50 - 0:38:55     Text: architecture because that seemed to work best in our text to text format. The next thing is that

0:38:55 - 0:38:59     Text: we're going to use a span prediction objective which is ultimately quite similar to the baseline

0:38:59 - 0:39:04     Text: objective that I described earlier. We will use the C4 dataset because it did attain reasonable

0:39:04 - 0:39:09     Text: performance but was large enough that we didn't have to worry about repeating the data and seeing

0:39:09 - 0:39:15     Text: detrimental overfitting during pre-training when we scale up the number of pre-training steps.

0:39:15 - 0:39:19     Text: We actually decided to do multitask pre-training because we will be scaling up the amount of

0:39:19 - 0:39:24     Text: pre-training. Our longest training runs took about a month and we wanted to be able to monitor

0:39:24 - 0:39:28     Text: performance over the course of pre-training without doing fine tuning. So we're going to be doing

0:39:28 - 0:39:32     Text: this multitask pre-training followed by fine tuning and then the last thing of course is we're

0:39:32 - 0:39:38     Text: going to train bigger models for longer. Specifically the model sizes that we ended up

0:39:38 - 0:39:45     Text: releasing we call small, base, large, 3B, and 11B. The small model has 60 million parameters. It's

0:39:45 - 0:39:50     Text: about a quarter as big as our baseline, which again was a BERT-base-sized encoder and BERT-base-sized

0:39:50 - 0:39:56     Text: decoder. We also trained a model with a BERT-large-sized encoder and BERT-large-sized decoder,

0:39:56 - 0:40:01     Text: and then we created two larger variants simply by scaling up the feed-forward dimension of the

0:40:01 - 0:40:06     Text: transformer and the number of attention heads in the transformer. You can see our largest model

0:40:06 - 0:40:12     Text: actually had a hidden dimensionality of 65,000 in the feed-forward layers. The reason that we did this

0:40:12 - 0:40:18     Text: kind of unusual way of scaling up the parameter count is just because the feed-forward layers are

0:40:18 - 0:40:22     Text: just gigantic matrix multiplies and that's the best way to make use of hardware accelerators.

0:40:22 - 0:40:34     Text: Can I just stick in one more question? Someone was asking about how you did the multi-task training.

0:40:34 - 0:40:40     Text: Was that sticking a simple softmax classifier on top for each task?

0:40:40 - 0:40:49     Text: In our case, because we're using this text-to-text format, basically you use exactly the same

0:40:49 - 0:40:53     Text: model, with no new classification heads, for every task. The only difference is that each task gets its

0:40:53 - 0:40:57     Text: own task prefix. So if you remember all the way back at the beginning, we say, you know, "translate

0:40:57 - 0:41:03     Text: English to German:" plus an English sentence, or "summarize:" plus an English paragraph, and that

0:41:03 - 0:41:07     Text: tells the model what it should do and then you just train the model to predict the corresponding target.
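
For instance, feeding a task prefix to one of the released checkpoints looks roughly like this with the Hugging Face Transformers library (a sketch; the checkpoint name and generation settings here are just illustrative):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task prefix is the only thing telling the model which task to perform.
inputs = tokenizer("translate English to German: That is good.", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```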

0:41:11 - 0:41:17     Text: Cool. So the last pertinent detail again is that we did scale up the amount of pre-training.

0:41:17 - 0:41:23     Text: We ended up pre-training on a trillion tokens of data rather than 34 billion tokens. So it's

0:41:23 - 0:41:28     Text: quite a lot more pre-training, although it's still less pre-training than was used in

0:41:28 - 0:41:34     Text: XLNet, I think by a factor of two if I remember correctly. So here are the results,

0:41:34 - 0:41:37     Text: and these were kind of the way that things stood at the time that we released the paper.

0:41:38 - 0:41:44     Text: We ended up getting state of the art results on the GLUE meta-benchmark, CNN/Daily Mail

0:41:44 - 0:41:50     Text: abstractive summarization, and SQuAD question answering. And we were actually quite excited to see how well

0:41:50 - 0:41:56     Text: we did on SuperGLUE. We ultimately came pretty close to the human score, which was 89.8,

0:41:56 - 0:42:02     Text: and we performed significantly better than RoBERTa. SuperGLUE, it turns out, is a benchmark that

0:42:02 - 0:42:07     Text: benefits a lot from large models and so you can really see a dramatic increase in the model's

0:42:07 - 0:42:13     Text: performance as we scale the model up. On the other hand we did not obtain state of the art results

0:42:13 - 0:42:17     Text: on any of the translation data sets. And the reason that we think that this is true is because

0:42:17 - 0:42:22     Text: all of the state of the art results at the time on these translation data sets used back translation.

0:42:22 - 0:42:29     Text: And if you remember we did English only pre-training in our model and we expect that in terms of

0:42:29 - 0:42:35     Text: making use of unlabeled data it's more effective to use back translation for machine translation

0:42:35 - 0:42:40     Text: problems than to use this English only pre-training that we did. I should mention of course these

0:42:40 - 0:42:45     Text: results are now quite a bit stale and some of these scores have been beaten by subsequent models.

0:42:47 - 0:42:52     Text: So now I'll just quickly make a plug that, you know, all of our code is released, our pre-trained models

0:42:52 - 0:42:56     Text: have been released. You can make use of them in our code base. They're also of course in the

0:42:56 - 0:43:05     Text: Hugging Face Transformers code base. We made a colab at the time that shows a pretty basic demo of

0:43:05 - 0:43:11     Text: how to take one of our pre-trained models and basically train it on a TSV file of inputs and

0:43:11 - 0:43:16     Text: targets. So because all problems are text to text problems you just need to give the model some

0:43:16 - 0:43:21     Text: input text and some target text, and that's all you need to fine-tune the model.

0:43:21 - 0:43:27     Text: You can actually fine-tune up to the 3-billion-parameter model on a free Colab TPU using

0:43:27 - 0:43:34     Text: the link at the bottom here.
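
For a flavor of what that fine-tuning loop looks like with the Hugging Face port of the models, here is a minimal sketch; it is not the actual Colab, and the toy pairs below just stand in for rows of a TSV file of inputs and targets.

```python
# Hedged sketch of fine-tuning a pre-trained T5 checkpoint on (input, target)
# text pairs with Hugging Face Transformers: tokenize each pair and train with
# the standard sequence-to-sequence cross-entropy loss.
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy stand-in for "input<TAB>target" rows of a TSV file.
pairs = [
    ("summarize: The quick brown fox jumped over the lazy dog.", "A fox jumped over a dog."),
    ("translate English to German: Hello!", "Hallo!"),
]

model.train()
for input_text, target_text in pairs:
    inputs = tokenizer(input_text, return_tensors="pt")
    labels = tokenizer(target_text, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```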

0:43:34 - 0:43:39     Text: Great, so far we've been talking about an English-only pre-trained model. I mean, we did apply it to machine translation as a downstream task. So kind of a natural

0:43:39 - 0:43:45     Text: question is you know what about all the other languages? Why not train a multi-lingual model?

0:43:45 - 0:43:48     Text: And so that's something that we did more recently. And actually let me just pause because I did

0:43:48 - 0:43:53     Text: see a couple of questions coming in. I want to make sure that I'm not leaving anyone behind as I

0:43:53 - 0:44:04     Text: move to the next section. Sure, so one of the questions is about the multitask setup.

0:44:04 - 0:44:09     Text: So if you include an unknown task prefix does anything interesting happen? And if you don't

0:44:09 - 0:44:18     Text: include a prefix, what does it do? So if you include an unknown task prefix, or if you

0:44:18 - 0:44:23     Text: don't include a prefix at all, what it will probably do is apply the unsupervised objective, because

0:44:23 - 0:44:30     Text: we actually didn't use a task prefix for the unsupervised objective. Well, I guess I should say

0:44:30 - 0:44:37     Text: that's not quite true, because there won't be any sentinel tokens in the input.

0:44:37 - 0:44:42     Text: So what we actually see it typically does is it outputs kind of some related words and some other

0:44:42 - 0:44:47     Text: sentinel tokens and gibberish. It's not very useful, I guess, is the upshot.
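
For context, here is a rough illustration of the span-corruption format that the unsupervised objective uses, with sentinel tokens such as <extra_id_0>; the specific spans below are hard-coded just to show the shape of the input and target, whereas the real objective samples them randomly. This is also why an input with no sentinels tends to yield related words plus stray sentinels.

```python
# Rough illustration of T5-style span corruption: random spans in the input are
# replaced by sentinel tokens, and the target reconstructs those spans, each
# preceded by its sentinel. Spans here are chosen by hand for illustration.
original = "Thank you for inviting me to your party last week ."
corrupted_input = "Thank you <extra_id_0> me to your party <extra_id_1> week ."
target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"

# With no sentinels in the input (e.g. an unknown or missing task prefix), the
# model has nothing sensible to reconstruct, hence the gibberish described above.
print(corrupted_input, "->", target)
```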

0:44:49 - 0:44:52     Text: We have a question about back translation. I don't think they've heard about back translation in

0:44:52 - 0:44:58     Text: the rest of the course. So back translation is a pretty straightforward method. The basic idea is

0:44:58 - 0:45:03     Text: that if I have unlabeled text data in one language, I use my current model to translate that data

0:45:04 - 0:45:12     Text: into some other language, and I then use that as training data for my model.

0:45:12 - 0:45:15     Text: It's similar to self-training, if you're familiar with that. Basically you're making predictions

0:45:15 - 0:45:20     Text: on unlabeled data and then using those predictions to train the model. Turns out to be helpful.
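
A loose sketch of that back-translation idea is below; the model object and its translate method are hypothetical placeholders, and real pipelines add filtering, sampling strategies, and iteration.

```python
# Hedged sketch of back-translation: use a target->source model to turn
# monolingual target-language text into synthetic parallel data, then train the
# source->target model on it alongside the real parallel data.
def back_translate(monolingual_de, de_to_en_model):
    """de_to_en_model is a hypothetical object with a .translate(str) -> str method."""
    synthetic_pairs = []
    for de_sentence in monolingual_de:
        en_guess = de_to_en_model.translate(de_sentence)   # model prediction (noisy)
        synthetic_pairs.append((en_guess, de_sentence))     # (synthetic source, real target)
    return synthetic_pairs

# Usage idea (hypothetical trainer):
# train_en_to_de(real_parallel_pairs + back_translate(unlabeled_de, de_to_en_model))
```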

0:45:20 - 0:45:28     Text: Yep, and then one detailed question: maximum input length and maximum output length. How did

0:45:28 - 0:45:33     Text: you choose them? Did you do a study on that as well? Yeah so most of the tasks we considered

0:45:34 - 0:45:39     Text: did not have input length significantly longer than 512 tokens most of the time using the tokenization

0:45:39 - 0:45:47     Text: strategy that we made use of, and so we used a maximum input length of 512. But we used

0:45:47 - 0:45:52     Text: that position encoding scheme that allows arbitrary input lengths and we actually have in

0:45:52 - 0:45:59     Text: subsequent work fine-tuned T5 on sequences of length 2048. Beyond that you start to get

0:45:59 - 0:46:05     Text: into memory issues because of attention's quadratic memory complexity, but in principle you

0:46:05 - 0:46:10     Text: can apply it to long sequences. And then maybe one last thing, if you have a second, is

0:46:10 - 0:46:17     Text: talking again about how translation seems to be not the killer app for this, and

0:46:17 - 0:46:24     Text: so what's your intuition as to why translation, like, does it not benefit from pre-training,

0:46:24 - 0:46:35     Text: for what reason? You know, I'm really not sure. I can only conjecture. I think that

0:46:35 - 0:46:40     Text: pre-training helps the model learn the meaning of words. It helps the model learn some

0:46:40 - 0:46:43     Text: world knowledge which I'll talk about a little bit later and that's a that's a very loose concept.

0:46:45 - 0:46:51     Text: I think that for translation learning world knowledge is not very useful because everything

0:46:51 - 0:46:56     Text: all of the knowledge you need to translate a sentence, for the most part, is in the sentence,

0:46:58 - 0:47:03     Text: and so basically all of the sort of contextual knowledge style information that you need to

0:47:03 - 0:47:08     Text: produce the German sentence is in the input sentence. So gaining world knowledge during pre-training

0:47:08 - 0:47:14     Text: is not very useful. Of course it's useful to know what words mean, but to a certain extent

0:47:14 - 0:47:20     Text: that's kind of the easiest signal to pick up on during training, I imagine. That would be my guess;

0:47:20 - 0:47:30     Text: I don't have any rigorous proof of any of this. Thanks. Great so like I was saying we trained

0:47:30 - 0:47:36     Text: this English-only model and we wanted to address the major shortcoming that it really can only

0:47:36 - 0:47:41     Text: speak one language. So we introduced a model called mT5, multilingual T5, and really, for the

0:47:41 - 0:47:47     Text: most part, if you remember one thing about mT5, it's basically that it's exactly the same model

0:47:47 - 0:47:53     Text: but trained on a multi-lingual corpus and the text to text format is the same you know we feed in

0:47:53 - 0:47:59     Text: task prefixes, but we can feed in content in different languages, and we can do classification tasks,

0:47:59 - 0:48:04     Text: we can do question answering tasks with mT5 in exactly the same way that we can with T5.

0:48:04 - 0:48:11     Text: So like I said, the pertinent thing about mT5 was creating a multilingual variant of C4.

0:48:11 - 0:48:17     Text: Overall the process is very similar to the process we used for C4 except that it includes 101

0:48:17 - 0:48:25     Text: languages that we detected using an open-source language detector. We also extracted data from

0:48:25 - 0:48:29     Text: more common crawl dumps because especially for the low resource languages it was hard to get enough

0:48:29 - 0:48:35     Text: data from only one common crawl dump and you can see a list of the languages that we include here.

0:48:35 - 0:48:41     Text: Ultimately the data set ended up being about 27 terabytes in size. So here's a distribution of

0:48:41 - 0:48:46     Text: the number of pages in the mC4 training dataset for various languages. You can see our

0:48:46 - 0:48:51     Text: highest-resource language is English, where we have about three billion pages with

0:48:51 - 0:48:57     Text: three trillion tokens total. The lowest-resource language is Yoruba, with only about 50,000 pages.

0:48:57 - 0:49:02     Text: So you can see the amount of data that we got for each language varies by many orders of magnitude.

0:49:04 - 0:49:08     Text: Because of that a common strategy is to use this sort of temperature scaling that I mentioned

0:49:08 - 0:49:13     Text: earlier, where basically you sample data from a particular language

0:49:14 - 0:49:19     Text: by scaling the number of examples in that language by a temperature. And as the...

0:49:20 - 0:49:26     Text: I apologize that the temperature here is one over the temperature that I described previously.

0:49:26 - 0:49:30     Text: So in this case as the temperature gets smaller and smaller you get closer and closer to a uniform

0:49:30 - 0:49:36     Text: distribution. The net effect of this is that for very small temperatures you tend to do better on

0:49:36 - 0:49:41     Text: downstream tasks in low-resource languages like Urdu, but as you increase the temperature so that you

0:49:41 - 0:49:45     Text: basically are doing examples-proportional mixing of the different languages, you do better

0:49:45 - 0:49:55     Text: on high-resource languages like Russian.
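
A loose sketch of that sampling scheme is below, using the slide's convention in which a smaller temperature acts like a smaller exponent and pushes the distribution toward uniform; the page counts are made up for illustration.

```python
# Hedged sketch of temperature-scaled language sampling: raise each language's
# example count to a power and renormalize. With exponent 1.0 you get
# examples-proportional mixing; as the exponent shrinks toward 0 you approach
# a uniform distribution over languages. Counts below are illustrative only.
pages = {"en": 3_000_000_000, "ru": 700_000_000, "ur": 5_000_000, "yo": 50_000}

def sampling_probs(counts, temperature):
    scaled = {lang: n ** temperature for lang, n in counts.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}

print(sampling_probs(pages, 1.0))   # examples-proportional: English dominates
print(sampling_probs(pages, 0.3))   # low-resource languages get boosted
print(sampling_probs(pages, 0.01))  # nearly uniform across languages
```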

0:49:55 - 0:50:00     Text: So we took mC4 and we pre-trained mT5. Again, basically everything was kept the same; we made the vocabulary a bit bigger to accommodate all the different languages,

0:50:00 - 0:50:06     Text: but overall the amount of pre-training the model sizes etc are basically the same. And ultimately

0:50:07 - 0:50:11     Text: got state of the art on some of the tasks in the XTREME benchmark. You'll notice that we don't

0:50:11 - 0:50:17     Text: report results for some of these tasks; that's partially because XTREME is designed for sentence

0:50:17 - 0:50:25     Text: encoders like BERT. T5 and mT5 are encoder-decoder models. We did not experiment with using the

0:50:25 - 0:50:29     Text: encoder on its own but in order to attack some of these problems like the sentence retrieval

0:50:29 - 0:50:35     Text: problems you need a model that can output a single vector representation of your sequence

0:50:35 - 0:50:39     Text: and we don't have that in T5 so we didn't apply it to those tasks.

0:50:42 - 0:50:46     Text: One interesting finding from this paper that I'll just quickly mention here is that

0:50:48 - 0:50:53     Text: there are basically multiple settings that people consider for multilingual benchmarks.

0:50:53 - 0:50:59     Text: One is the zero-shot case, and in that case you don't do any pre-training on a language... sorry,

0:51:00 - 0:51:05     Text: you don't have any fine-tuning data in each particular language; you only have pre-training

0:51:05 - 0:51:09     Text: data on those languages. So you fine-tune let's say only in English and then you feed the model

0:51:09 - 0:51:14     Text: some text in another language and see if it produces the right predictions. The next setting is the

0:51:14 - 0:51:19     Text: translate train setting that's where you take a machine translation model and translate the data

0:51:19 - 0:51:23     Text: in the English fine-tuning corpus into different languages and then the last setting is the

0:51:23 - 0:51:28     Text: in-language multitask setting that's the setting where you assume that you have gold standard

0:51:28 - 0:51:33     Text: ground-truth data in every language that you want the model to be able to process data in

0:51:33 - 0:51:40     Text: and the takeaway here actually is that the difference in performance between small models and our

0:51:40 - 0:51:47     Text: largest model as we go along the x-axis is much much bigger for the zero shot and translate

0:51:47 - 0:51:52     Text: train settings than it is for the in-language multitask setting. So what this suggests to us is

0:51:52 - 0:52:00     Text: basically that the model learns a much wider distribution of languages if it has a much larger

0:52:00 - 0:52:06     Text: amount of parameters and it is able to do this kind of like zero shot task learning,

0:52:06 - 0:52:10     Text: multi-lingual task learning much better when it has more parameters.

0:52:12 - 0:52:17     Text: So kind of along those lines you know larger models can maybe fit more knowledge about

0:52:17 - 0:52:22     Text: more languages. We had another paper where we basically tried to answer the question you know

0:52:22 - 0:52:27     Text: how much and what kind of knowledge does a model pick up during pre-training? And so to answer

0:52:27 - 0:52:33     Text: that question, we basically introduced a new variant of the question answering task.

0:52:33 - 0:52:38     Text: So the question answering task kind of comes in a couple of different flavors. The simplest flavor

0:52:38 - 0:52:43     Text: which I've mentioned already is reading comprehension and in that case the model is basically

0:52:43 - 0:52:47     Text: given a paragraph or an article and then it's asked a question about the paragraph or article

0:52:47 - 0:52:52     Text: and it has to basically extract the answer. So you can see, if it's being asked what color is

0:52:52 - 0:52:59     Text: a lemon, it has to look in the paragraph that it's seen and see that the lemon has a yellow

0:52:59 - 0:53:03     Text: fruit and output the word yellow. So this is kind of the simplest form of the question answering task.

0:53:04 - 0:53:08     Text: A more difficult form is what people call open domain question answering. And in that case you

0:53:08 - 0:53:12     Text: assume that the model is given a question and has access to a large external database

0:53:12 - 0:53:19     Text: of knowledge, maybe all of Wikipedia. So the model has to do two things. It first has to find the

0:53:19 - 0:53:24     Text: article or the snippet of text that contains the answer in the database and then it has to

0:53:24 - 0:53:30     Text: extract the answer from the article. So there's this additional retrieval step that makes the problem

0:53:30 - 0:53:35     Text: quite a bit harder. But we introduced a sort of third variant of question answering that we call

0:53:35 - 0:53:40     Text: closed book question answering the name takes inspiration from closed book exams. The goal here is

0:53:40 - 0:53:45     Text: that you just feed the model the question. It does not have access to an external knowledge source.

0:53:45 - 0:53:50     Text: It cannot look up information anywhere. It can only answer the question based on the knowledge it

0:53:50 - 0:53:55     Text: picked up during pre-training. So if you feed the model the question "what color is a lemon", it has to

0:53:55 - 0:54:01     Text: output the word yellow correctly, because it, so to speak, knows that lemons are yellow.

0:54:02 - 0:54:08     Text: So this is a good way, we argue, of testing the amount of knowledge stored in the model

0:54:08 - 0:54:14     Text: during pre-training. So why would we expect that to work? Well you could imagine here we're doing

0:54:14 - 0:54:19     Text: our normal pre-training of T5. We're masking out words and training the model to predict the

0:54:19 - 0:54:25     Text: masked out spans of words. And you might imagine that somewhere during pre-training it sees a

0:54:25 - 0:54:31     Text: sentence that says "President Franklin [blank] born [blank] January 1882." The goal here would be to

0:54:31 - 0:54:38     Text: output "D. Roosevelt was" and "in" for the blanks. And then during fine-tuning, we train the model to predict the

0:54:40 - 0:54:45     Text: year 1882 when it's asked the question "when was Franklin D. Roosevelt born?" And you might

0:54:45 - 0:54:52     Text: hope that it kind of recalls back to its pre-training task and recalls some knowledge that it picked

0:54:52 - 0:54:59     Text: up during pre-training in order to answer this question correctly. So we took some standard open

0:54:59 - 0:55:05     Text: domain question answering data sets, natural questions, web questions, and trivia QA. And basically

0:55:05 - 0:55:12     Text: removed all of the context and trained our model to predict the correct answer when asked some

0:55:12 - 0:55:16     Text: particular question and then evaluated its performance on the test set for each of these tasks.
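
Concretely, that conversion is tiny: a minimal sketch might look like the following, where any retrieved context is simply dropped and only the question-to-answer mapping is kept (the field names and prefix are illustrative, not the exact ones used).

```python
# Hedged sketch of turning an open-domain QA example into the closed-book,
# text-to-text format: drop the context and keep only question -> answer.
def to_closed_book(example):
    return ("question: " + example["question"], example["answer"])

print(to_closed_book({
    "question": "When was Franklin D. Roosevelt born?",
    "context": "(ignored: in the closed-book setting the model may not look anything up)",
    "answer": "January 30, 1882",
}))
```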

0:55:17 - 0:55:21     Text: And in this table we're comparing the state of the art results for an open domain system. These

0:55:21 - 0:55:27     Text: are systems that explicitly retrieve knowledge from an external knowledge source compared to T5

0:55:27 - 0:55:32     Text: when it's been trained in this closed book setting. You'll notice that we're actually using a

0:55:32 - 0:55:39     Text: slightly different version of T5 here using T5.1.1. The pertinent difference is just that T5.1.1

0:55:39 - 0:55:44     Text: was not multitask pre-trained. It was only pre-trained using an unsupervised objective. The

0:55:44 - 0:55:47     Text: reason we did that again is because we want to measure the amount of knowledge that the model

0:55:47 - 0:55:53     Text: picked up during pre-training. And you can see we actually got, you know, reasonably strong

0:55:53 - 0:56:04     Text: performance, maybe respectable performance, I don't want to oversell it. The performance, again the accuracy,

0:56:04 - 0:56:09     Text: basically on each of these datasets increases as the model size increases, which maybe

0:56:09 - 0:56:13     Text: in a loose way suggests that the larger models have picked up more knowledge during pre-training.

0:56:13 - 0:56:17     Text: But we ultimately lagged behind the state of the art results for open domain systems that

0:56:17 - 0:56:25     Text: explicitly retrieve knowledge. So to try to close this gap, we made use of this objective called

0:56:25 - 0:56:31     Text: salient span masking from a paper called retrieval augmented language model pre-training. And salient

0:56:31 - 0:56:37     Text: span masking is a very simple idea. The idea is that rather than masking out words at random in your

0:56:37 - 0:56:44     Text: pre-training objective, you actually mask out entities explicitly. So that's people's names, places,

0:56:44 - 0:56:51     Text: dates, etc. And you basically just use an off-the-shelf named entity recognizer to figure out

0:56:51 - 0:56:57     Text: what entities are in your pre-training data set. And you train the model to fill in salient

0:56:57 - 0:57:04     Text: spans instead of random spans. So what we did is we took T5.1.1 after it was pre-trained and did

0:57:04 - 0:57:09     Text: continued pre-training on salient span masking and then measured the performance on our downstream

0:57:09 - 0:57:13     Text: tasks after fine tuning. And you can see that the more salient span mask pre-training we did,

0:57:13 - 0:57:19     Text: the better, excuse me, the better and better the performance got when we fine tune on the downstream

0:57:19 - 0:57:25     Text: tasks. And we ultimately were able to close some of these gaps significantly and actually outperform

0:57:25 - 0:57:32     Text: the best open domain system on web questions by adding salient span masking to T5.1.1.
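
A rough sketch of what salient span masking can look like is below, assuming an off-the-shelf NER model (spaCy's en_core_web_sm here, purely as an example of "off the shelf", not necessarily the recognizer that was actually used).

```python
# Hedged sketch of salient span masking: mask named entities and dates instead
# of random spans, and train the model to fill them back in.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def salient_span_mask(text):
    doc = nlp(text)
    masked_pieces, targets, last = [], [], 0
    for i, ent in enumerate(doc.ents):
        masked_pieces.append(text[last:ent.start_char] + f"<extra_id_{i}>")
        targets.append(f"<extra_id_{i}> {ent.text}")
        last = ent.end_char
    masked_pieces.append(text[last:])
    return "".join(masked_pieces), " ".join(targets)

print(salient_span_mask("Franklin D. Roosevelt was born in January 1882."))
```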

0:57:34 - 0:57:41     Text: So this is just a message to tell you that the objective matters. And doing this kind of thing,

0:57:41 - 0:57:45     Text: at the time people didn't call it this, but now people call this domain-adaptive or task-adaptive

0:57:45 - 0:57:50     Text: pre-training. And this is a good way of getting better performance on your downstream tasks.

0:57:52 - 0:57:56     Text: So I've got a good set of questions here, if it's now a good time.

0:57:58 - 0:58:03     Text: Yes, that'd be great. So for some context, the students in their most recent assignment

0:58:03 - 0:58:09     Text: had to make effectively a mini T5 thing where the only questions that were asked were from a simple

0:58:09 - 0:58:14     Text: domain, so that they could pre-train on a single GPU. And one of the questions they have is,

0:58:14 - 0:58:19     Text: how can we be sure that the answer produced by the model is not made up? They were asked

0:58:19 - 0:58:28     Text: this on the assignment as well, I think. So if you don't have access to a ground-truth answer,

0:58:28 - 0:58:33     Text: it's actually very hard to know. There's a nice paper that came out after this paper called,

0:58:33 - 0:58:39     Text: how can we know when language models know? And what the goal of that paper is to make it so that T5

0:58:39 - 0:58:44     Text: is what we call well calibrated. And when a model is well calibrated, it means that when it doesn't

0:58:44 - 0:58:50     Text: know the answer, it doesn't output highly confident predictions. And this paper explored various

0:58:50 - 0:58:55     Text: ways of calibrating T5 for closed-book question answering. And they ultimately found that when the model

0:58:55 - 0:59:01     Text: doesn't know the answer, when it's outputting something made up, that they could effectively make it

0:59:01 - 0:59:07     Text: be very unconfident in its predictions. So that's one way to do it.
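
As one loose illustration of that idea (not the method from that paper), you could score a generated answer by its length-normalized log-probability under the model and abstain when the confidence falls below a tuned threshold; the threshold below is a made-up value.

```python
# Hedged sketch of a confidence check for closed-book QA: abstain when the
# average token log-probability of the generated answer is low. The threshold
# is a hypothetical hyperparameter you would tune on held-out data.
import torch
from transformers import T5TokenizerFast, T5ForConditionalGeneration

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small").eval()

def answer_or_abstain(question, threshold=-1.0):
    inputs = tokenizer("question: " + question, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=16,
                         output_scores=True, return_dict_in_generate=True)
    scores = model.compute_transition_scores(out.sequences, out.scores,
                                             normalize_logits=True)
    confidence = scores.mean().item()  # average per-token log-probability
    answer = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    return answer if confidence > threshold else "[abstain: low confidence]"

print(answer_or_abstain("When was Franklin D. Roosevelt born?"))
```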

0:59:11 - 0:59:17     Text: I think you're actually muted, John. Great. Thanks. And then another question is: the knowledge

0:59:17 - 0:59:23     Text: that's necessary for doing these fine-tuning, like the QA on these fine-tuning datasets.

0:59:24 - 0:59:30     Text: Is it knowledge that is all present at pre-training time? Yeah, so that's also something that was

0:59:30 - 0:59:35     Text: explored by a subsequent paper, where they showed that actually there is a decent amount of

0:59:35 - 0:59:40     Text: training and test overlap in terms of knowledge in these datasets. So it's definitely possible in

0:59:40 - 0:59:45     Text: these cases that that the model is picking up knowledge during fine-tuning and not pre-training.

0:59:45 - 0:59:55     Text: However, just as kind of a side experimental note, we find that the performance of T5

0:59:55 - 1:00:01     Text: actually plateaus before it makes a single pass over the fine-tuning dataset. So basically T5

1:00:01 - 1:00:05     Text: will very, very quickly figure out what the heck you're trying to get it to do. It doesn't even

1:00:05 - 1:00:10     Text: see the full training set before it gets basically its maximum performance on the test set.

1:00:10 - 1:00:14     Text: So we don't actually think that that's a major factor for for these results.

1:00:15 - 1:00:19     Text: Great. And then last question: if you studied it, how did the multitask model do?

1:00:19 - 1:00:26     Text: Yeah, those results are in the paper. The results are almost exactly the same. It's a little easier

1:00:26 - 1:00:31     Text: to explain the whole knowledge pre-training thing when you are talking about T5.1.1.

1:00:32 - 1:00:39     Text: In the interest of time, I might skip these next three slides. The short summary of

1:00:39 - 1:00:44     Text: these slides is just that the evaluation procedure that we use unfairly penalizes closed-book

1:00:44 - 1:00:48     Text: question answering systems. If you want to learn a bit more about that, you can poke into the paper

1:00:48 - 1:00:55     Text: a bit. It doesn't really support the main point that I'm trying to make in any meaningful way.

1:00:55 - 1:01:00     Text: And I want to get to some of the more recent papers, and I should be able to have time to do

1:01:00 - 1:01:09     Text: that. Cool. So we kind of answered this question, you know, how much knowledge does a model pick up

1:01:09 - 1:01:14     Text: during pre-training? The answer is arguably a lot. So kind of a follow-up question is: the

1:01:14 - 1:01:18     Text: model is kind of memorizing this knowledge, right? But do large language

1:01:18 - 1:01:23     Text: models also memorize stuff that we don't want them to memorize, specifically like private data?

1:01:23 - 1:01:28     Text: Like you could imagine, let's say that somewhere in C4 is someone's social security number.

1:01:29 - 1:01:34     Text: We probably don't want our model to memorize that and spit it out when we're decoding from it.

1:01:35 - 1:01:40     Text: And we certainly don't want it to happen for unconditional models like GPT2 or GPT3.

1:01:40 - 1:01:46     Text: So in this next work, we try to answer this question, you know, do large language models memorize stuff

1:01:46 - 1:01:53     Text: from their pre-training data set? And we can first actually turn to experts and see what experts think.

1:01:53 - 1:02:00     Text: And here are two statements made by the EFF and OpenAI. They were sent to the US Patent and Trademark

1:02:00 - 1:02:05     Text: Office when there was a call for comments on basically exactly this question. And you can see

1:02:05 - 1:02:12     Text: that in both cases, these organizations basically say, you know, there's basically no reason to

1:02:12 - 1:02:17     Text: believe that a large language model would output, would copy data from its training data set.

1:02:18 - 1:02:23     Text: OpenAI, you know, kind of calls this a well-constructed AI system. And I think what they actually

1:02:23 - 1:02:27     Text: mean by that, if you read their statement a little longer, they kind of say, you know, if you

1:02:27 - 1:02:32     Text: construct an AI system appropriately, the AI system will not overfit to the training data set.

1:02:32 - 1:02:37     Text: And if it's not overfit, we don't expect it to actually output any of its training

1:02:37 - 1:02:46     Text: set in any non-trivial way. So these are kind of statements that were hunches. And in this work,

1:02:46 - 1:02:51     Text: we tried to investigate more rigorously whether they were true. And the way that we did that was

1:02:51 - 1:02:57     Text: basically by taking the pre-trained GPT2 model and feeding it prefixes. So you can imagine

1:02:57 - 1:03:04     Text: you take a causal language model like GPT2 that just predicts tokens autoregressively,

1:03:04 - 1:03:09     Text: you feed it a prefix, and ask it to basically predict what comes next. And we're showing here

1:03:09 - 1:03:14     Text: this sort of odd prefix East Stroudsburg Stroudsburg. But we found when we fed this particular prefix

1:03:14 - 1:03:21     Text: into GPT2, it actually output verbatim the address, name, email address, phone number, and fax

1:03:21 - 1:03:26     Text: number of a real person that appears on the internet. This example actually only appears

1:03:26 - 1:03:33     Text: six times on the entire public internet. So it's unlikely that GPT2 saw this address very many times.

1:03:34 - 1:03:40     Text: And the main point of this work is that yes, it does seem like GPT2 at least has memorized a

1:03:40 - 1:03:47     Text: significant amount of non-trivial information from its pre-training data set. So how did we undertake

1:03:47 - 1:03:54     Text: this study? We used the procedure that's shown on the screen here. We basically consider

1:03:54 - 1:04:00     Text: three different ways of sampling data from GPT2. The first is just to sample autoregressively from

1:04:00 - 1:04:06     Text: it. The next is to sample autoregressively but with a decaying temperature. This basically means

1:04:06 - 1:04:11     Text: that you want the model to become more and more confident in its predictions over the course of

1:04:11 - 1:04:17     Text: sampling. And the last option is to take random text from the internet and use that as conditioning

1:04:17 - 1:04:24     Text: to GPT2 before asking it to generate what comes next. So now for each of these, for each of these

1:04:24 - 1:04:29     Text: generations, each of these 200,000 generations we did for each of these sampling methods,

1:04:29 - 1:04:33     Text: we want a way of kind of trying to predict whether it might be memorized or not.

1:04:34 - 1:04:40     Text: And so we came up with six metrics to use to give us a notion of whether a particular sample from

1:04:40 - 1:04:46     Text: GPT2 might be memorized. All of these different metrics basically make use of GPT2's perplexity

1:04:46 - 1:04:53     Text: for the sample. The perplexity, as I think you probably learned, is basically a measure of

1:04:53 - 1:04:59     Text: how confident GPT2 was in generating this particular sample. You can also think of it as a measure

1:04:59 - 1:05:08     Text: of compression. So these metrics all make use of the perplexity. Either we just measure GPT2's

1:05:08 - 1:05:15     Text: perplexity for the thing that it generated or we compute the ratio of GPT2's perplexity

1:05:15 - 1:05:20     Text: to the perplexity for another variant of GPT or to a text compression library called ZLib.

1:05:20 - 1:05:27     Text: We also compared via a ratio of the perplexity of the original sample versus a lower case version

1:05:27 - 1:05:32     Text: of the sample and also a windowed perplexity where we only compute the perplexity over a small

1:05:32 - 1:05:40     Text: window of the sample instead of the whole thing. We then do some deduplication on the

1:05:40 - 1:05:45     Text: generations and then choose the top 100 generations according to each of these metrics. So that'll

1:05:45 - 1:05:51     Text: ultimately give us 600 possible memorized generations and then we actually just do a basic, excuse me,

1:05:51 - 1:05:58     Text: a basic Google search to see if we can find that text that GPT2 generated on the internet somewhere.

1:05:58 - 1:06:03     Text: And if we do find it on the internet somewhere, we asked the GPT2 authors whether it was in the training

1:06:03 - 1:06:08     Text: set or not, and they checked all of the examples that we found and let us know if GPT2 had actually

1:06:08 - 1:06:14     Text: spit out something from its training dataset.
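
A rough sketch of two of these signals, GPT-2's perplexity on a sample and the zlib-compressed length of the same text, is given below; treating their ratio as a memorization hint is a simplification of the metrics described here.

```python
# Hedged sketch: compute GPT-2's perplexity on a generated sample and the zlib
# compression length of the same text. A sample that GPT-2 finds unusually easy
# to predict relative to its zlib "entropy" is the kind of outlier that flagged
# candidate memorized samples.
import zlib
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt2_perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

def zlib_entropy(text):
    return len(zlib.compress(text.encode("utf-8")))  # compressed length in bytes

sample = "East Stroudsburg Stroudsburg"
ppl, ent = gpt2_perplexity(sample), zlib_entropy(sample)
print(ppl, ent, ent / ppl)  # a large ratio is a (very rough) memorization signal
```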

1:06:14 - 1:06:21     Text: So just to give you an idea of what these metrics are and why they might be helpful, this scatter plot is showing the perplexity assigned by GPT2

1:06:21 - 1:06:29     Text: and the perplexity assigned by ZLib to 200,000 samples generated by GPT2. And you can see that most

1:06:29 - 1:06:34     Text: of them kind of fall on this line. There's a big clump of them on the right there in gray.

1:06:35 - 1:06:43     Text: But highlighted here in red and blue are samples that kind of are outliers. GPT2 is assigning a

1:06:43 - 1:06:50     Text: much lower perplexity to ZLib, sorry, to those samples than ZLib is. And what that means is that

1:06:50 - 1:06:56     Text: GPT2 is very, very good at predicting what comes next in these samples and ZLib is not. ZLib is kind of

1:06:56 - 1:07:04     Text: an unbiased source, right? It's not really pre-trained on a bunch of data, it's kind of data agnostic.

1:07:04 - 1:07:09     Text: So it might be the case that if GPT2 is very good at predicting what comes next in a given sequence,

1:07:09 - 1:07:14     Text: but ZLib is not, that GPT2 has memorized those samples. So all of these things kind of in the top

1:07:14 - 1:07:20     Text: left there are possible memorized samples. And actually we marked the ones in blue that it turned out

1:07:20 - 1:07:29     Text: were actually in the training data set and that GPT2 had memorized. Overall, we found many examples

1:07:29 - 1:07:35     Text: of verbatim text memorized from the training data set. This included news, log files, licenses,

1:07:35 - 1:07:43     Text: you know, pages from Wikipedia, URLs, and we highlight two types of data here that we found

1:07:43 - 1:07:50     Text: that GPT2 had memorized that, in our opinion, constitute private information, like named individuals

1:07:50 - 1:07:54     Text: from non-news samples, or contact information like the example I showed early on.

1:07:56 - 1:08:01     Text: And so you might ask, okay, you know, maybe it's not that surprising that GPT2 memorized a news article,

1:08:01 - 1:08:06     Text: if that news article appears hundreds of times in the internet. We actually got lucky because it

1:08:06 - 1:08:12     Text: turned out there were a bunch of examples of memorized data that only appeared on one document

1:08:12 - 1:08:19     Text: on the entire public internet. It was basically a paste, like a paste of a bunch of URLs from Reddit,

1:08:19 - 1:08:25     Text: from a controversial subreddit called the Donald. And all of these URLs had exactly the same form.

1:08:25 - 1:08:32     Text: It was, you know, HTTP colon slash slash Reddit.com slash r slash the Donald slash a bunch of random

1:08:32 - 1:08:38     Text: numbers and letters slash the name of the thread. Now this random numbers and letters part is nice

1:08:38 - 1:08:44     Text: because it's equally hard to predict in all cases. It's basically a random hash. So we know that

1:08:44 - 1:08:49     Text: that part of the sequence should be equally hard for any model to memorize. And what that means is

1:08:49 - 1:08:58     Text: that we can measure how many times does a particular URL need to appear in this list of URLs

1:08:58 - 1:09:04     Text: because there were repetitions in the list in order for one of the particular GPT2 sized models

1:09:05 - 1:09:11     Text: to memorize it. And what we found was that the largest variant of GPT2, GPT2 XL,

1:09:11 - 1:09:21     Text: memorized a URL that appeared 33 times in this particular document, but not URLs that appeared 17 times

1:09:21 - 1:09:28     Text: or fewer. The medium size model was only really able to fully memorize a URL that appeared 56 times.

1:09:28 - 1:09:33     Text: And the small model really didn't memorize any. The one-half basically means that we could get it to

1:09:33 - 1:09:37     Text: spit out the URL if we gave it some additional prompting. We basically hinted at what some of the

1:09:37 - 1:09:43     Text: numbers were. And the takeaway of this is that, by coincidence, because there

1:09:43 - 1:09:48     Text: is this particular document with this structure, we were able to say reasonably confidently that

1:09:48 - 1:09:55     Text: larger models tend to memorize more data. They need to see particular examples fewer times

1:09:55 - 1:09:59     Text: in order to memorize them, which we thought was an interesting finding.

1:09:59 - 1:10:08     Text: So, so far I have kind of mostly been talking about the benefits of larger models, right? Because

1:10:09 - 1:10:14     Text: larger models did better on super glue, larger models did better on close-up question answering.

1:10:14 - 1:10:17     Text: Of course, there is this caveat that larger models also seem to be better at memorizing

1:10:17 - 1:10:25     Text: their training data set. But larger models are also inconvenient. They are more computationally

1:10:25 - 1:10:30     Text: expensive to run. They consume more energy. And they don't fit on, for example, T5-11B doesn't fit

1:10:30 - 1:10:39     Text: on a single GPU unless you use kind of clever methods. So, the last paper that I'll discuss,

1:10:39 - 1:10:45     Text: which is super recent work, is can we basically close the performance gap between large and small

1:10:45 - 1:10:50     Text: models through improvements to the transformer architecture? So in this work, we take basically the

1:10:50 - 1:10:55     Text: same strategy that we took in the T5 paper, where we took sort of the landscape of existing

1:10:55 - 1:11:00     Text: modifications to the transformer architecture and evaluated them in the same exact setting.

1:11:00 - 1:11:05     Text: And there have been lots of variants proposed to the transformer architecture. In T5, we use

1:11:05 - 1:11:09     Text: basically the standard encoder-decoder architecture from the "Attention Is All You Need" paper

1:11:10 - 1:11:15     Text: that's visualized here. But there have been lots and lots of modifications that have been proposed

1:11:15 - 1:11:21     Text: since the transformer was released in 2017. For example, maybe people suggested that you should factorize

1:11:21 - 1:11:26     Text: your embedding matrix. You should share the embedding matrix and the softmax output layer.

1:11:26 - 1:11:31     Text: You should use different forms of softmax like a mixture of softmaxes or an adaptive softmax.

1:11:32 - 1:11:39     Text: Different ways of normalizing or initializing the model. Maybe different attention mechanisms,

1:11:39 - 1:11:42     Text: alternatives to attention mechanisms like lightweight and dynamic convolutions.

1:11:42 - 1:11:47     Text: Different nonlinearities in the feed forward layers. Different structures for the feed forward

1:11:47 - 1:11:52     Text: layers like mixture-of-experts or the Switch Transformer. Completely different architectures that

1:11:52 - 1:11:57     Text: were inspired by the transformer, like the Funnel Transformer, the Evolved Transformer, the Universal

1:11:57 - 1:12:01     Text: Transformer, and so on. Really, there have just been tons and tons and tons of them. And again,

1:12:01 - 1:12:06     Text: the goal in this paper was to take a bunch of these modifications and apply the same basic methodology

1:12:06 - 1:12:12     Text: from the T5 paper where we test them in the same experimental setting. Specifically, we basically

1:12:12 - 1:12:17     Text: tested them in exactly the T5 setting that I described at the beginning of the talk where we

1:12:17 - 1:12:24     Text: pre-trained a T5-based size model on C4 and then fine-tuned it on a few downstream tasks.

1:12:26 - 1:12:29     Text: I won't discuss that too much more because I gave a pretty thorough introduction to it at the

1:12:29 - 1:12:36     Text: beginning of the talk. And so here is a sort of a first set of results that I'll show you.

1:12:37 - 1:12:41     Text: Along the x-axis are different transformer modifications. I'm not labeling which one is which

1:12:41 - 1:12:48     Text: because I don't want to call any particular modification out. This is the validation loss

1:12:48 - 1:12:53     Text: attained by the model for the pre-training objective. So when we do pre-training on C4,

1:12:53 - 1:12:57     Text: we hold out some data from C4 and then we basically measure the validation loss

1:12:58 - 1:13:04     Text: on the held out data. So lower is better in this case. And you can see this dotted black line is

1:13:04 - 1:13:09     Text: the performance of the baseline model without any transformer modifications. It's basically

1:13:09 - 1:13:13     Text: a vanilla transformer. And you can see that actually some of the transformer modifications did

1:13:13 - 1:13:18     Text: attain better performance, which was good. But a lot of them didn't and a lot of them actually

1:13:18 - 1:13:25     Text: got significantly worse performance. But maybe even worse, some of these variants of the

1:13:25 - 1:13:29     Text: transformer that attained better performance were pretty minor changes. Like for example,

1:13:29 - 1:13:35     Text: just taking the ReLU in the dense-ReLU-dense layer and swapping it for another nonlinearity.
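
For reference, here is a tiny sketch of that kind of minor change: the transformer's dense-ReLU-dense feed-forward block with the activation swapped out (GELU here, purely as an example of a drop-in replacement).

```python
# Hedged sketch of a transformer feed-forward block where the only change is
# the activation function; dimensions are illustrative defaults.
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, activation=nn.GELU()):
        super().__init__()
        self.wi = nn.Linear(d_model, d_ff)   # first "dense" projection
        self.act = activation                # ReLU in the original transformer
        self.wo = nn.Linear(d_ff, d_model)   # second "dense" projection

    def forward(self, x):
        return self.wo(self.act(self.wi(x)))

x = torch.randn(2, 16, 512)                  # (batch, sequence, d_model)
print(FeedForward()(x).shape)
```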

1:13:35 - 1:13:41     Text: It's a pretty minor change. And some of the other really highly performant ones were actually

1:13:41 - 1:13:47     Text: ultimately more expensive models. We did use the same base model. It was the same T5 base size model.

1:13:48 - 1:13:52     Text: But some of these methods, like the switch transformer, for example,

1:13:52 - 1:13:57     Text: increase the parameter count dramatically. So they're more expensive in terms of memory.

1:13:57 - 1:14:02     Text: Some of the other methods, by coincidence, maybe if you make the model deeper,

1:14:02 - 1:14:08     Text: it's not able to make use of the accelerator as efficiently. And so it makes the

1:14:09 - 1:14:15     Text: training time and inference time a little more expensive. So once you factor out the very simple

1:14:15 - 1:14:19     Text: changes to the transformer and the ones that ultimately made the model more expensive along some

1:14:19 - 1:14:24     Text: axis, there actually were very few, if any, modifications that improved performance

1:14:24 - 1:14:32     Text: meaningfully. And this is true on the pre-training task. It's also true on the downstream tasks we

1:14:32 - 1:14:39     Text: considered. So this is the Rouge 2 score. It's just one of the metrics people use on the X-Sum task.

1:14:39 - 1:14:43     Text: XSum is, you can sort of think of it like a harder version of the CNN/Daily Mail summarization

1:14:43 - 1:14:50     Text: task. And you can see that the model variants that attained a better validation score tended to

1:14:50 - 1:14:56     Text: also attain a better XSum Rouge-2 score. But again, almost all of the variants we tried

1:14:56 - 1:15:03     Text: decreased the performance. And just as an aside, I kind of alluded to this a little bit.

1:15:03 - 1:15:07     Text: There is a reasonably good correlation between the validation loss and the superglue score.

1:15:08 - 1:15:13     Text: Although I'll just point out a couple of interesting points here. One is this method called

1:15:13 - 1:15:19     Text: transparent attention. It attained a pretty good validation loss, but ultimately a very bad superglue

1:15:19 - 1:15:23     Text: score, which was surprising to us. The switch transformer, which I'll highlight here,

1:15:24 - 1:15:28     Text: attained the best validation loss, but it did not get the best superglue score.

1:15:28 - 1:15:34     Text: On the closed-book variant of Web Questions, the switch transformer actually did achieve

1:15:34 - 1:15:41     Text: the best validation accuracy. And this kind of supports a loose conjecture in the field that

1:15:43 - 1:15:48     Text: scaling up the number of parameters can improve the amount of knowledge that the model can internalize.

1:15:48 - 1:15:53     Text: But it doesn't help the model reason. So kind of very, very loosely speaking. Again,

1:15:53 - 1:15:58     Text: this is kind of a conjecture. Superglue requires deep reasoning capabilities.

1:15:58 - 1:16:04     Text: Closed-book Web Questions requires knowledge-intensive capabilities. And so the switch transformer,

1:16:04 - 1:16:10     Text: which only scales up the parameter count without scaling up the processing maybe does better on this

1:16:10 - 1:16:16     Text: on the web questions task. So this kind of raises and should raise some red flags for you.

1:16:16 - 1:16:21     Text: Because this is a pretty bold claim that most of these things don't actually help that much.

1:16:21 - 1:16:26     Text: And there's kind of a couple of possible reasons that this could be the case. One is that our

1:16:26 - 1:16:32     Text: code base is just very unusual and non-standard. We don't think this is the case because the code

1:16:32 - 1:16:37     Text: base that we used was actually developed by one of the people who invented the transformer,

1:16:37 - 1:16:44     Text: Noam Shazeer. And it's been used a lot. It's been used in lots of various papers. It's basically

1:16:44 - 1:16:49     Text: the same as the tensor to tensor code base. And so we think that arguably our code base and

1:16:49 - 1:16:55     Text: the implementation detail should be reasonably standard. Maybe the tasks we consider are non-standard.

1:16:55 - 1:17:01     Text: We think this is probably not true. Pre-training followed by fine tuning is pretty common. Basically,

1:17:01 - 1:17:07     Text: all the tasks we tested out on have state of the art results from transformers. And actually,

1:17:07 - 1:17:13     Text: we separately included supervised-only training on WMT English-German, which was the task that the

1:17:13 - 1:17:20     Text: transformer was actually proposed on. Maybe we need more hyperparameter tuning, because we didn't,

1:17:20 - 1:17:25     Text: again, we didn't do significant hyper parameter tuning for each of these methods. To test how true

1:17:25 - 1:17:30     Text: this was, we actually took one of the methods that performed significantly worse than we expected.

1:17:30 - 1:17:35     Text: And we ran maybe a couple hundred trials of hyper parameter optimization. One of the researchers

1:17:35 - 1:17:39     Text: on this paper spent a long time trying to get hyper parameters right to make it work. And it

1:17:39 - 1:17:45     Text: ultimately never worked as well as the baseline method. The next possibility is that we implemented

1:17:45 - 1:17:50     Text: these modifications incorrectly. To sanity check this, we actually emailed the authors of all of

1:17:50 - 1:17:54     Text: the different modifications and asked them to check our implementation. All of the ones that got

1:17:54 - 1:17:58     Text: back to us said that it looked correct to them. And then finally, the last option is that maybe

1:17:58 - 1:18:03     Text: these modifications to the transformer don't really kind of transfer. They don't transfer across

1:18:03 - 1:18:09     Text: code bases and implementations and applications. And to us, at least based on the evidence that we

1:18:09 - 1:18:15     Text: have, this is a plausible possibility. In my opinion, the best way to control for this is if you're

1:18:15 - 1:18:20     Text: proposing a new modification to the transformer, try to apply it to as many code bases and tasks as

1:18:20 - 1:18:25     Text: you can without tweaking hyperparameters. And if it works in all of those settings, then you're

1:18:25 - 1:18:30     Text: golden and your thing probably will transfer. And we think that it's probably the case that

1:18:30 - 1:18:36     Text: simpler modifications like changing the nonlinearity are not so dependent on hyperparameters and

1:18:36 - 1:18:41     Text: implementation details. And so they may be the ones that are more likely to transfer, so to speak.

1:18:43 - 1:18:47     Text: So that's all I'll discuss in this talk. I recognize that it was kind of a whirlwind tour. So I've

1:18:47 - 1:18:52     Text: linked all of the papers that I discussed on this slide here. Of course, this was work done by

1:18:52 - 1:18:58     Text: a huge and truly amazing group of collaborators over the course of these five papers who I've listed

1:18:58 - 1:19:03     Text: on the screen here. And yeah, I'm happy to answer any additional questions that you all have.

1:19:04 - 1:19:08     Text: Okay, so thank you so much Colin for that great talk. And yeah, it was a bit of a

1:19:08 - 1:19:14     Text: fire hose of information. I realize also there was one thing I forgot to say in my introduction. So

1:19:14 - 1:19:22     Text: I guess I need to have an afterward as well, which is that Colin has now started as a professor at

1:19:22 - 1:19:27     Text: the University of North Carolina. So effectively, the University of North Carolina is playing a big

1:19:27 - 1:19:33     Text: part in this course because it was also the source of the Cherokee data that we use for assignment

1:19:33 - 1:19:43     Text: four for the Cherokee English translation. So go tar heels. But yeah, so yeah, so Colin's happy

1:19:43 - 1:19:50     Text: to stay and answer some questions. So if you'd like to have more questions, use the raise hand,

1:19:50 - 1:19:58     Text: and we'll then sort of invite you into the where people can see each other, zoom room, and you know,

1:19:58 - 1:20:04     Text: if you're up to it, it'd even be nice to turn on your video. So people can see who they're talking to.

1:20:05 - 1:20:13     Text: And yeah, maybe in the first instance, you should stop sharing the screen. And yeah, if there's

1:20:13 - 1:20:18     Text: something you want to show again, you can turn it back on. Yeah, maybe I'll just say while people

1:20:18 - 1:20:24     Text: are still around, on the point of me being a professor at UNC in the event that there are any

1:20:24 - 1:20:30     Text: masters or undergraduate students in the audience who are applying for PhD programs, the application

1:20:30 - 1:20:36     Text: deadline for UNC actually has not occurred yet. So if you maybe you wanted apply to another school,

1:20:36 - 1:20:40     Text: have another option, you're excited about the work that I presented, you can still apply to UNC.

1:20:40 - 1:20:46     Text: We have a remarkably late application deadline. So just a plug in case there's anyone who's looking for

1:20:46 - 1:20:54     Text: a PhD. And UNC is the oldest public university in the nation.

1:20:54 - 1:21:00     Text: And I think we have the second-oldest CS department too, which, yeah, has been around

1:21:00 - 1:21:07     Text: for a while. It's pretty small, only about 50 faculty. So while we're waiting for someone to

1:21:07 - 1:21:11     Text: join it, we do have one question already actually from...

1:21:11 - 1:21:21     Text: You know, hi, thanks for the lecture. I had a question about earlier, when you discussed

1:21:23 - 1:21:30     Text: the T5 overfitting and how many passes it took for it to overfit. So

1:21:30 - 1:21:37     Text: I'm curious as to whether you think some of the larger models, like the 3 billion and 11 billion

1:21:37 - 1:21:45     Text: parameter ones, are overfitting, and kind of generally, how do you know

1:21:46 - 1:21:52     Text: when a model is overfitting, especially at that scale. Yeah, so I mean, if you measure overfitting

1:21:52 - 1:21:57     Text: in sort of the standard way where you compare the loss on training data versus the loss on validation

1:21:57 - 1:22:04     Text: data, we see that even in the in the very, very large models, it's roughly the same, which suggests

1:22:04 - 1:22:08     Text: kind of in the traditional sense that there is no overfitting. One reason for that is that C4

1:22:08 - 1:22:15     Text: is actually big enough that we do just over one pass over it when we train for a trillion tokens.

1:22:15 - 1:22:20     Text: And you know, you might hope that you see limited overfitting when you only see each piece of data

1:22:20 - 1:22:24     Text: basically once over the course of training. So it's sort of like every time you're seeing data,

1:22:24 - 1:22:28     Text: it's new data. So there's not a huge difference between the data on the training set and the

1:22:28 - 1:22:33     Text: validation set. Of course, there is also this notion of overfitting that's kind of like worst case

1:22:33 - 1:22:38     Text: overfitting, which ties into the memorization work I mentioned. It does seem that it's possible

1:22:38 - 1:22:44     Text: for language models to memorize data, even when they do relatively few passes over the training

1:22:44 - 1:22:50     Text: data set. And you don't see kind of average case overfitting by comparing the training loss and

1:22:50 - 1:23:02     Text: the validation loss. Yep. Okay, then there's another question or two. Sure, sorry, I was trying to

1:23:02 - 1:23:08     Text: unmute my video, but I can't do that for whatever reason. First of all, thank you so much.

1:23:08 - 1:23:16     Text: This is fantastic. I'm really enjoying it. One thing that I particularly enjoyed was

1:23:16 - 1:23:23     Text: the work that you guys did on your training data extraction attack, trying to identify,

1:23:23 - 1:23:29     Text: really test this hunch from OpenAI and the EFF that these models don't actually memorize

1:23:29 - 1:23:36     Text: training data. I have two questions. One, have OpenAI and the EFF,

1:23:36 - 1:23:42     Text: since you actually published this,

1:23:42 - 1:23:47     Text: since acknowledged that, you know, well-constructed models may actually do this?

1:23:47 - 1:23:52     Text: And two, would this approach actually work for detecting other, say,

1:23:52 - 1:24:00     Text: accidentally encoded biases, like extreme biases which are prevalent in some language

1:24:00 - 1:24:05     Text: models? Could you create packages that could run those sorts of attacks on these

1:24:05 - 1:24:11     Text: models and then determine, to some degree of accuracy, how much these biases are actually present?

1:24:12 - 1:24:18     Text: Yeah, so with regards to the first question, I don't know of any official statements that have

1:24:18 - 1:24:23     Text: been made by anyone, but I will say that actually on the memorization paper, we had multiple co-authors

1:24:23 - 1:24:29     Text: from open AI. So it was very much a cooperation with them. I mean, you know, we're all scientists

1:24:29 - 1:24:33     Text: and we're all, you know, we all kind of make hypotheses that sometimes turn out correctly and

1:24:33 - 1:24:41     Text: incorrectly. And so I think open AI is definitely aware and on the side of the fact that, yeah,

1:24:41 - 1:24:49     Text: it is possible that these models might memorize data even when they don't exhibit the traditional

1:24:49 - 1:24:57     Text: signs of overfitting. To the second point, I, the way that people have kind of measured this in an

1:24:57 - 1:25:04     Text: ad hoc way is by feeding a prefix into the model about a particular demographic group or type of

1:25:04 - 1:25:12     Text: person and see what the model says about that person. And I think, I think in principle, you can kind

1:25:12 - 1:25:18     Text: of think of our approach as related to that, except that we have this additional step that kind of

1:25:18 - 1:25:25     Text: measures whether the model is generating that text because it saw it in its training data, basically,

1:25:26 - 1:25:32     Text: because the perplexity is excessively low for some continuation compared to a model that wasn't

1:25:32 - 1:25:39     Text: trained on the same data. So it might be interesting, for example, if you, if you feed the model a prefix

1:25:39 - 1:25:44     Text: that you're asking it to fill in some offensive information about some demographic group,

1:25:44 - 1:25:51     Text: check whether the perplexity of the model for its continuation is dramatically lower than, you know,

1:25:51 - 1:25:57     Text: Z-lib, for example. And in that case, you might think that the bias actually, maybe this like bias that

1:25:57 - 1:26:02     Text: the model has picked up is because it saw some actually a sentence that looked just like that and

1:26:02 - 1:26:09     Text: its training data, or if it's a more kind of loose concept that the model has internalized over

1:26:09 - 1:26:15     Text: the course of training. Got it. Thank you very much for sharing. Yeah, thanks for the

1:26:15 - 1:26:24     Text: questions. Okay, so next up, I guess is... Right, thank you for the talk that was super interesting.

1:26:24 - 1:26:34     Text: So my question is sort of what are your thoughts on potentially doing like multiple rounds of

1:26:34 - 1:26:42     Text: pre-training? So to make it more concrete, you know, like let's say you have a task somewhere,

1:26:42 - 1:26:49     Text: like something like response generation, and that's very bespoke to the particular response

1:26:49 - 1:26:54     Text: generation data set that you use, but potentially you want to kind of spruce that up by bringing in

1:26:55 - 1:27:02     Text: some general dialogue data set that consists of naturalistic human data. So I'm wondering if you

1:27:02 - 1:27:09     Text: kind of have any thoughts or intuitions on how effective it is to maybe start with, you know,

1:27:09 - 1:27:15     Text: the general internet, and then fine tune on this dialogue, unstructured dialogue data set,

1:27:15 - 1:27:22     Text: and then fine tune on maybe a more kind of tightly-scoped response generation data set.

1:27:22 - 1:27:29     Text: Yeah, so the technique you're describing sounds pretty similar to this really excellent

1:27:30 - 1:27:35     Text: approach that people now call domain adaptive pre-training or task adaptive pre-training.

1:27:35 - 1:27:40     Text: It was introduced in a paper called Don't Stop Pretraining, and then there's a less catchy

1:27:40 - 1:27:47     Text: subtitle. And the idea is very similar to what you proposed. Basically, you take a pre-trained

1:27:47 - 1:27:53     Text: model that was trained on generic text, you do what you might call intermediate task training,

1:27:53 - 1:27:58     Text: or you do continued pre-training on domain specific data, and then finally you do fine tuning on

1:27:58 - 1:28:04     Text: your specific fine tuning data set. In their case, they're considering things like, you know,

1:28:04 - 1:28:09     Text: fine, like doing a scientific text classification or biomedical text analysis, and when they do an

1:28:09 - 1:28:16     Text: intermediate pre-training step on in-domain data, or even just doing the pre-training objective

1:28:16 - 1:28:23     Text: on the data from the task, it definitely helps significantly. And yeah, so that's an excellent

1:28:23 - 1:28:30     Text: intuition, and I think that's the most similar method to what you're describing. It does kind of

1:28:30 - 1:28:35     Text: raise a clear question that I don't think has been addressed to my knowledge in the literature,

1:28:35 - 1:28:40     Text: which is, you know, we usually think of transfer learning as pre-trained and then fine-tune. And

1:28:40 - 1:28:45     Text: now we're doing kind of like pre-trained, and then maybe some more pre-training and then fine-tuning.

1:28:45 - 1:28:51     Text: And there are other methods that kind of inject other steps along the way. And so there's this

1:28:51 - 1:28:58     Text: natural question of like, what should the curriculum of tasks be, you know, how many intermediate steps

1:28:58 - 1:29:03     Text: should there be? What should the intermediate steps be? What's the benefit of one domain versus the

1:29:03 - 1:29:08     Text: other? How much domain shift is there? And what are the corresponding benefits and so on? And I

1:29:08 - 1:29:13     Text: think there's a fascinating line of work to be done in better answering

1:29:13 - 1:29:25     Text: those questions. And what was the acronym? Yeah, so it's called DAPT or TAPT, Domain-Adaptive

1:29:25 - 1:29:30     Text: Pre-training or Task-Adaptive Pre-training. The paper's called Don't Stop Pre-training,

1:29:30 - 1:29:37     Text: which is easy to remember if you like the song Don't Stop Believin'. I don't know

1:29:37 - 1:29:42     Text: if it's an intended reference to that song. I assume it must be. I think they should have done a

1:29:42 - 1:29:47     Text: Don't Stop Pre-trainin', with an apostrophe, if they really wanted to drive it home, but, you know,

1:29:47 - 1:29:50     Text: maybe that would have been too cheesy. But anyways, yeah, the paper's called Don't Stop Pre-training.
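
[Editor's note: here is a rough sketch of the pre-train, then continued pre-training, then fine-tune recipe discussed above, assuming the Hugging Face transformers and datasets libraries. The checkpoint, file names, and hyperparameters are placeholders, not the setup from the Don't Stop Pre-training paper.]

```python
# Rough sketch of domain-/task-adaptive pre-training (DAPT/TAPT) on top of a
# generic checkpoint, followed by task fine-tuning. Dataset paths, model names,
# and hyperparameters below are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tok = AutoTokenizer.from_pretrained("roberta-base")

# Stage 1: start from a model already pre-trained on generic web text.
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Stage 2: continued (domain-adaptive) pre-training on unlabeled in-domain text,
# e.g. raw dialogue transcripts. "dialogue_corpus.txt" is a hypothetical file.
domain = load_dataset("text", data_files={"train": "dialogue_corpus.txt"})["train"]
domain = domain.map(lambda batch: tok(batch["text"], truncation=True),
                    batched=True, remove_columns=["text"])
Trainer(
    model=mlm,
    args=TrainingArguments(output_dir="dapt_ckpt", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=domain,
    data_collator=DataCollatorForLanguageModeling(tok),  # standard masked-LM objective
).train()
mlm.save_pretrained("dapt_ckpt")
tok.save_pretrained("dapt_ckpt")

# Stage 3: fine-tune the adapted encoder on the narrow supervised task.
# A classification head is used as a stand-in; a response-generation task
# would swap in a sequence-to-sequence model instead.
clf = AutoModelForSequenceClassification.from_pretrained("dapt_ckpt", num_labels=2)
# ...then run a second Trainer on the labeled task dataset as usual.
```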

1:29:51 - 1:29:57     Text: All right, that's great. Thank you so much. Yeah, absolutely. Okay, next question is,

1:29:57 - 1:30:04     Text: and I'm not sure quite what the name corresponds to. Yes, hi, thank you, Colin, that

1:30:04 - 1:30:10     Text: talk was really interesting. Again, like the earlier question, I'm really looking for

1:30:10 - 1:30:15     Text: some advice here. So it feels like the recent headline-grabbing advancements in

1:30:15 - 1:30:20     Text: NLP have been achieved by building these massive models, like GPT-3, with billions of

1:30:20 - 1:30:26     Text: parameters that oftentimes cost millions of dollars to train. And these advancements are funded by

1:30:26 - 1:30:31     Text: larger organizations like Google, Facebook, OpenAI, who more or less have infinite resources, in

1:30:31 - 1:30:37     Text: a way, right? So my question is, you know, as a sole practitioner with limited resources but

1:30:37 - 1:30:42     Text: an infinite appetite for learning, what are some ways that I can participate in these

1:30:42 - 1:30:48     Text: advancements and kind of, you know, keep up with what's happening in the industry? Yeah,

1:30:48 - 1:30:53     Text: absolutely. I mean, I, I, I actually totally sympathize with you and agree with you in the

1:30:53 - 1:30:57     Text: sense that, you know, most of the development of these models is taking place by small groups

1:30:58 - 1:31:03     Text: behind closed doors at large corporations. And that's not usually how I like to see, you know,

1:31:03 - 1:31:08     Text: science developed. I like to see it as a community endeavor that involves, you know, all kinds of

1:31:08 - 1:31:12     Text: stakeholders with varying amounts of resources. And we're not really quite at that stage with

1:31:12 - 1:31:20     Text: this work yet. I do think that to the extent that people are still releasing pre-trained models,

1:31:20 - 1:31:26     Text: which is true, for example, for T5, but, but not for GPT-3, there is a lot of work to be done on

1:31:26 - 1:31:32     Text: basically analysis. Some of the stuff that we were discussing earlier, you know, even the

1:31:32 - 1:31:37     Text: memorization work, is basically, I would say, analysis work. Some of the stuff

1:31:37 - 1:31:42     Text: pertaining to bias involves analyzing these models. And I think there's, you know,

1:31:42 - 1:31:49     Text: there's so little that we actually know about how these models work and what makes them useful at

1:31:49 - 1:31:57     Text: scale. There's plenty of room for interesting analytical work, which requires significantly

1:31:57 - 1:32:06     Text: less compute. I guess I would say a couple other things. One is that I do really hope that the field

1:32:06 - 1:32:12     Text: moves more towards community development models and moves towards frameworks that allow people to

1:32:12 - 1:32:17     Text: collaboratively train a model, for example, like in a distributed fashion. I think that's an

1:32:17 - 1:32:21     Text: incredibly exciting research direction. It's something that I'm working on with my students in my

1:32:21 - 1:32:29     Text: lab at UNC now. And the last thing I'll say, and I actually don't usually like saying this,

1:32:29 - 1:32:37     Text: but I'll say it anyways, I do think that our field often undergoes sort of a tick-tock pattern

1:32:37 - 1:32:42     Text: where we show something is possible at scale, and then we show that the scale is not necessary to

1:32:42 - 1:32:47     Text: achieve the same thing. And to some extent, you could argue that this has happened already for GPT-3,

1:32:47 - 1:32:52     Text: in the sense that we saw GPT-3 come along, get outstanding results on, for example, SuperGLUE,

1:32:52 - 1:32:56     Text: with only 32 examples per class. And then there was the paper that proposed this method called

1:32:56 - 1:33:04     Text: iPET, which I think stands for iterative Pattern-Exploiting Training, that obtained basically

1:33:04 - 1:33:11     Text: comparable performance with a dramatically smaller model and the same amount of data. And, you know,

1:33:11 - 1:33:17     Text: I think you can point to other examples. I personally like to attribute the story of

1:33:17 - 1:33:22     Text: attention's invention to the fact that researchers at the Montreal Institute for Learning Algorithms

1:33:22 - 1:33:27     Text: couldn't afford an 8-GPU machine, so they couldn't run the giant LSTM from the sequence-to-sequence

1:33:27 - 1:33:31     Text: paper. So they needed to invent something that worked better, but didn't require such a big model,

1:33:31 - 1:33:37     Text: so they invented attention. Of course, it's not good advice to tell someone that they should

1:33:37 - 1:33:41     Text: just go invent something smaller, but I'm at least hopeful that some of these things that we've shown

1:33:41 - 1:33:49     Text: are possible at scale are also possible at a much smaller scale. Thank you. Yeah.
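
[Editor's note: for readers unfamiliar with the iPET approach mentioned above, here is a hypothetical, minimal illustration of a single PET-style pattern and verbalizer, the core trick of rephrasing few-shot classification as a cloze question for a masked language model. It uses roberta-base as a stand-in and is not the iPET authors' code; the full method additionally fine-tunes on the patterns and iterates with self-labeled data.]

```python
# A hypothetical, minimal illustration of a PET-style pattern and verbalizer
# (not the iPET authors' code): classification is rephrased as a cloze question
# that a masked language model can score without any task-specific training.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # stand-in model
model = AutoModelForMaskedLM.from_pretrained("roberta-base")
model.eval()

# Verbalizer: map each class label to a single word the model can predict.
VERBALIZER = {"positive": " great", "negative": " terrible"}


def score_labels(review: str) -> dict:
    """Pattern: '<review> It was <mask>.' -- score each verbalizer word at the mask."""
    pattern = f"{review} It was {tokenizer.mask_token}."
    inputs = tokenizer(pattern, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    return {
        label: logits[tokenizer.encode(word, add_special_tokens=False)[0]].item()
        for label, word in VERBALIZER.items()
    }


scores = score_labels("The plot was gripping and the acting was superb.")
print(max(scores, key=scores.get), scores)
```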

1:33:52 - 1:33:57     Text: Okay, I think there's no one else who has a hand up at the moment. Maybe there's now a

1:33:57 - 1:34:03     Text: moment for John to ask his question, but if any other people have questions, now's a good point

1:34:03 - 1:34:10     Text: to jump in. I was just going to ask a question from the Q&A, but they came in and asked it live, so.

1:34:12 - 1:34:18     Text: Yeah, I will say one other thing just quickly, which is that, you know, T5, I was very, like I

1:34:18 - 1:34:23     Text: said, I was very excited that we achieved near-human performance on SuperGLUE. The model that came

1:34:23 - 1:34:29     Text: along and actually closed that 0.5 percent gap is a model that is about, you know, 10 times smaller

1:34:29 - 1:34:34     Text: in terms of parameter count, so that's like another reasonable example of, I mean, it's still quite

1:34:34 - 1:34:42     Text: big, but at least, you know, as you make algorithmic and architectural improvements, sometimes you can

1:34:42 - 1:34:49     Text: close these gaps. Well, thank you again, Colin, and we'll let you, whatever, have a beer and go to bed or something.

1:34:50 - 1:34:55     Text: Yeah, yeah, sounds great. Yeah, thanks again for having me. It's such a pleasure. So,

1:34:55 - 1:35:00     Text: and I should say if anyone has any follow-up questions they think of later on, I'm always excited to

1:35:00 - 1:35:04     Text: get emails about this kind of stuff. It's, you know, this is stuff I like working on, so.

1:35:04 - 1:35:33     Text: Yeah, thank you again for the great informative talk.