Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 10 - Transformers and Pretraining

0:00:00 - 0:00:07     Text: Hello, everybody.

0:00:07 - 0:00:10     Text: Welcome to CS224N lecture 10.

0:00:10 - 0:00:16     Text: This is going to be primarily on pre-training, but we will also discuss sub-word models a

0:00:16 - 0:00:18     Text: little bit and review transformers.

0:00:18 - 0:00:26     Text: Okay, so we have a lot of exciting things to get into today, but some reminders about

0:00:26 - 0:00:30     Text: the class.

0:00:30 - 0:00:32     Text: Assignment 5 is being released today.

0:00:32 - 0:00:37     Text: Assignment 4 was due a minute ago, so if you are done with that, congratulations.

0:00:37 - 0:00:41     Text: If not, I hope that the late days go well.

0:00:41 - 0:00:48     Text: Assignment 5 is on pre-training and transformers, so these lectures are going to be very useful

0:00:48 - 0:00:52     Text: to you for that, and it doesn't cover anything after these lectures.

0:00:52 - 0:00:54     Text: All right.

0:00:54 - 0:01:00     Text: So today, let's kind of take a little peek through what the outline will be.

0:01:00 - 0:01:04     Text: We haven't talked about sub-word modeling yet and sort of we should have.

0:01:04 - 0:01:06     Text: And so we're going to talk a little bit about sub-words.

0:01:06 - 0:01:11     Text: You saw these in assignment 4, all just, you know, as the data that we provided to you

0:01:11 - 0:01:15     Text: with your machine translation system, but we're going to talk a little bit about why they're

0:01:15 - 0:01:21     Text: so ubiquitous in NLP because they are used in pre-trained models.

0:01:21 - 0:01:26     Text: I mean, they're used in a number of different models, but when we discuss pre-training,

0:01:26 - 0:01:29     Text: it's important to know that sub-words are part of it.

0:01:29 - 0:01:34     Text: Then we'll go on another journey of motivation, motivating

0:01:34 - 0:01:36     Text: model pre-training from word embeddings.

0:01:36 - 0:01:41     Text: So we've already seen pre-training in some sense in the very first lecture of this course

0:01:41 - 0:01:46     Text: because we pre-trained individual word embeddings that don't take into account their contexts

0:01:46 - 0:01:50     Text: on very large text corpora and saw that they were able to encode a lot of useful things

0:01:50 - 0:01:53     Text: about language.

0:01:53 - 0:01:57     Text: So after we do that motivation, we'll go through model pre-training three ways.

0:01:57 - 0:02:00     Text: And we're going to, you know, reference actually the lecture on Tuesday.

0:02:00 - 0:02:02     Text: So this is why we're going to review a little bit of the transformer stuff.

0:02:02 - 0:02:07     Text: We'll talk about model pre-training in decoders, like the transformer decoder that we saw

0:02:07 - 0:02:10     Text: last week, in encoders, and then encoder-decoders.

0:02:10 - 0:02:14     Text: And each of these three cases, we're going to talk a little bit about sort of what things

0:02:14 - 0:02:20     Text: you could be doing and then popular models that are in use across research and in industry.

0:02:20 - 0:02:23     Text: And we're going to talk a little bit about, you know, what do we think pre-training is

0:02:23 - 0:02:24     Text: teaching?

0:02:24 - 0:02:25     Text: This is going to be very brief.

0:02:25 - 0:02:29     Text: Actually, a lot of the interpretability and analysis lecture in two weeks is going

0:02:29 - 0:02:35     Text: to talk more about sort of the mystery and the scientific problem of figuring out what

0:02:35 - 0:02:39     Text: these models are learning about language through pre-training objectives, but we'll sort

0:02:39 - 0:02:40     Text: of get a peek.

0:02:40 - 0:02:43     Text: And then we'll talk about very large models and in context learning.

0:02:43 - 0:02:49     Text: So if you've heard of GPT-3, for example, we're going to just briefly touch on that here

0:02:49 - 0:02:53     Text: and I think we'll discuss more about it in the course later on as well.

0:02:53 - 0:02:55     Text: Okay, so we've got a lot to do.

0:02:55 - 0:02:57     Text: Let's jump right in.

0:02:57 - 0:02:59     Text: So: word structure and sub-word models.

0:02:59 - 0:03:04     Text: Let's think about sort of the assumptions we've been making in this course so far.

0:03:04 - 0:03:09     Text: When we give you an assignment, when we talk about training word2vec, for example,

0:03:09 - 0:03:11     Text: we made this assumption about a language's vocabulary.

0:03:11 - 0:03:15     Text: In particular, we've made this assumption that a language has a fixed vocabulary of something like

0:03:15 - 0:03:18     Text: tens of thousands, maybe a hundred thousand, I don't know, a number of...

0:03:18 - 0:03:23     Text: But some relatively large number of words, it seems, and that seems sort of pretty

0:03:23 - 0:03:27     Text: good so far, at least for what we've done.

0:03:27 - 0:03:32     Text: And we build this vocabulary from the training set that we train, say, word2vec on.

0:03:32 - 0:03:37     Text: And then here's a crucial thing, any novel word, any word that you did not see at training

0:03:37 - 0:03:42     Text: time, is sort of mapped to a single UNK token.

0:03:42 - 0:03:46     Text: There are other ways to handle this, but you sort of have to do something and a frequent

0:03:46 - 0:03:49     Text: method is to map them all to UNK.
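As a minimal sketch of that dictionary look-up behavior (the vocabulary, words, and "<unk>" convention here are illustrative assumptions, not code from any assignment):

    # Hypothetical fixed vocabulary built from the training set.
    vocab = {"i": 0, "learn": 1, "tasty": 2, "tea": 3, "<unk>": 4}

    def to_ids(words, vocab):
        # Any word not seen at training time falls back to the single UNK id.
        return [vocab.get(w, vocab["<unk>"]) for w in words]

    print(to_ids(["i", "learn", "taaaasty", "lern"], vocab))  # [0, 1, 4, 4]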

0:03:49 - 0:03:53     Text: So let's walk through what this sort of means in English.

0:03:53 - 0:03:55     Text: You learn embeddings, you map them, it all works.

0:03:55 - 0:04:02     Text: Then you have a variation on a word like taaaasty, with a bunch of a's.

0:04:02 - 0:04:07     Text: And your model isn't smart enough to know that that sort of means like very tasty, maybe.

0:04:07 - 0:04:12     Text: And so it maps it to UNK, because it's just a dictionary look-up that misses.

0:04:12 - 0:04:18     Text: And then you have a typo like lern, and that maps to UNK as well, potentially, if it

0:04:18 - 0:04:22     Text: wasn't in your training set, some people make typos, but not all of them will be seen

0:04:22 - 0:04:23     Text: at training time.

0:04:23 - 0:04:24     Text: And then you'll have novel items.

0:04:24 - 0:04:29     Text: So this could be the first time that you, us, the students in 224N, have

0:04:29 - 0:04:32     Text: seen the word transformerify.

0:04:32 - 0:04:36     Text: But I get the feeling you sort of have a notion of what it's supposed to mean, like

0:04:36 - 0:04:41     Text: maybe add transformers to or turn into using transformers, or turn into a transformer

0:04:41 - 0:04:42     Text: or something like that.

0:04:42 - 0:04:46     Text: And this is also going to be mapped to UNK, even though you've seen transformer and

0:04:46 - 0:04:48     Text: ify.

0:04:48 - 0:04:54     Text: And so somehow the conclusion we have to come to is that the assumption that an individual

0:04:54 - 0:05:00     Text: sequence of characters uniquely identifies a word, and that that's sort of

0:05:00 - 0:05:04     Text: how we should parameterize things, is just wrong.

0:05:04 - 0:05:09     Text: And so not only is this true in English, but in many languages, this finite vocabulary

0:05:09 - 0:05:10     Text: assumption makes even less sense.

0:05:10 - 0:05:16     Text: So already it doesn't make sense in English, but English is not even the worst case.

0:05:16 - 0:05:22     Text: So morphology is the study of the structure of words.

0:05:22 - 0:05:28     Text: And English is known to have pretty simple morphology in kind of specific ways.

0:05:28 - 0:05:34     Text: And when languages have complex morphology, it means you have longer words, more complex

0:05:34 - 0:05:39     Text: words that get modified more, and each one of them occurs less frequently.

0:05:39 - 0:05:40     Text: That should sound like a problem, right?

0:05:40 - 0:05:44     Text: If a word occurs less frequently, it will be less likely to show up in your training

0:05:44 - 0:05:46     Text: set.

0:05:46 - 0:05:49     Text: And maybe it'll show up in your test set, never in your training set.

0:05:49 - 0:05:51     Text: Now it's mapped to UNK, and you don't know what to do.

0:05:51 - 0:05:56     Text: So an example, Swahili verbs can have hundreds of conjugations.

0:05:56 - 0:06:03     Text: So each conjugation encodes important information about the sentence that in English might be

0:06:03 - 0:06:05     Text: represented through, say, more words.

0:06:05 - 0:06:11     Text: In Swahili it's mapped onto the verb as prefixes, suffixes, and the like; this is called

0:06:11 - 0:06:13     Text: inflectional morphology.

0:06:13 - 0:06:14     Text: And so you can have hundreds of conjugations.

0:06:14 - 0:06:20     Text: I've just sort of pasted this Wiktionary block just to give you a small sample of just

0:06:20 - 0:06:22     Text: the huge number of conjugations there are.

0:06:22 - 0:06:26     Text: And so trying to memorize independently a meaning of each one of these words is just not

0:06:26 - 0:06:32     Text: the right answer.

0:06:32 - 0:06:34     Text: So this is going to be a very brief overview.

0:06:34 - 0:06:43     Text: And so what we're going to do is take one, let's say, class of algorithms for sub-word modeling

0:06:43 - 0:06:50     Text: that have been kind of developed to try to take a middle ground between two options.

0:06:50 - 0:06:54     Text: One option is saying everything is just like individual words.

0:06:54 - 0:06:58     Text: Either I know the word and I saw it at training time, or I don't know the word, and it's like

0:06:58 - 0:06:59     Text: UNK.

0:06:59 - 0:07:03     Text: And then sort of another extreme option is to say it's just characters.

0:07:03 - 0:07:08     Text: Right? So like I get a sequence of characters, and then my neural network on top of my sequence

0:07:08 - 0:07:13     Text: of characters has to learn everything, has to learn how to combine words and stuff.

0:07:13 - 0:07:18     Text: So sub-word models in general just means looking at the sort of internal structure of words

0:07:18 - 0:07:20     Text: somehow, looking below the word level.

0:07:20 - 0:07:25     Text: But this group of models is going to try to meet a middle ground.

0:07:25 - 0:07:28     Text: So, byte pair encoding.

0:07:28 - 0:07:33     Text: What we're going to do is we're going to learn a vocabulary from a training data set

0:07:33 - 0:07:34     Text: again.

0:07:34 - 0:07:35     Text: So now we have a training data set.

0:07:35 - 0:07:39     Text: Instead of just saying, oh, everything that was split by my heuristic word splitter,

0:07:39 - 0:07:45     Text: like spaces in English, for example, is going to be a word in my vocabulary, we're going

0:07:45 - 0:07:50     Text: to learn the vocabulary using a greedy algorithm in this case.

0:07:50 - 0:07:52     Text: So here's what we're going to do.

0:07:52 - 0:07:55     Text: We start with the vocabulary containing only characters.

0:07:55 - 0:07:56     Text: So that's our extreme, right?

0:07:56 - 0:08:02     Text: So at the very least, if you've seen all the characters, then you know that you can never

0:08:02 - 0:08:03     Text: have an UNK, right?

0:08:03 - 0:08:07     Text: Because you see a word, you've never seen it before, you just split it into its characters,

0:08:07 - 0:08:11     Text: and then you try to see, you know, deal with it that way.

0:08:11 - 0:08:14     Text: And then also an end of word symbol.

0:08:14 - 0:08:15     Text: And then we'll iterate over this algorithm.

0:08:15 - 0:08:20     Text: We'll say, use the corpus of text, find common adjacent letters.

0:08:20 - 0:08:24     Text: So maybe A and B are very frequently adjacent.

0:08:24 - 0:08:30     Text: Add the pair of them together as a single sub word into your vocabulary.

0:08:30 - 0:08:34     Text: Then replace instances of that character pair with the new sub word, and repeat until you reach your desired

0:08:34 - 0:08:35     Text: vocabulary size.

0:08:35 - 0:08:41     Text: So maybe you start with a small character vocabulary, and then you end up with that same small

0:08:41 - 0:08:47     Text: character vocabulary plus a bunch of sort of entire words or parts of words.

0:08:47 - 0:08:50     Text: So notice how Apple, an entire word, looks like Apple.

0:08:50 - 0:08:56     Text: But then app, maybe this is sort of the first part, the first sub word of application, or

0:08:56 - 0:08:57     Text: up.

0:08:57 - 0:08:59     Text: Yeah.

0:08:59 - 0:09:07     Text: And then ly, I guess I should have not put the hash there, but you know, maybe you learned

0:09:07 - 0:09:10     Text: ly as, like, the end of a word, for example.

0:09:10 - 0:09:16     Text: And so what you end up with is, you know, a vocabulary where common things you get to

0:09:16 - 0:09:20     Text: map to themselves and then rare sequences of characters.

0:09:20 - 0:09:23     Text: You kind of split as little as possible.

0:09:23 - 0:09:27     Text: And it doesn't always end up so nicely that you learn like morphologically relevant suffixes

0:09:27 - 0:09:29     Text: like ly.

0:09:29 - 0:09:32     Text: But you can, you know, try to split things somewhat reasonably.

0:09:32 - 0:09:37     Text: And if you have enough data, the sub word vocabulary you learn tends to be okay.
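Here is a minimal sketch in Python of the greedy merge loop just described, assuming a toy whitespace-tokenized corpus; it is illustrative, not the exact algorithm from any particular BPE implementation:

    from collections import Counter

    def learn_bpe(corpus_words, num_merges):
        # Start with characters only, plus an end-of-word symbol.
        words = [list(w) + ["</w>"] for w in corpus_words]
        merges = []
        for _ in range(num_merges):
            # Count adjacent symbol pairs across the corpus.
            pairs = Counter()
            for w in words:
                for a, b in zip(w, w[1:]):
                    pairs[(a, b)] += 1
            if not pairs:
                break
            best = max(pairs, key=pairs.get)  # most frequent adjacent pair
            merges.append(best)
            # Replace each occurrence of that pair with the merged subword.
            new_words = []
            for w in words:
                merged, i = [], 0
                while i < len(w):
                    if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                        merged.append(w[i] + w[i + 1])
                        i += 2
                    else:
                        merged.append(w[i])
                        i += 1
                new_words.append(merged)
            words = new_words
        return merges

    # e.g. learn_bpe(["apple", "apply", "application"] * 100, num_merges=10)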

0:09:37 - 0:09:40     Text: So this was originally used in machine translation.

0:09:40 - 0:09:45     Text: And now a similar method, WordPiece, which we won't go over in this lecture, is used

0:09:45 - 0:09:46     Text: in pre-trained models.

0:09:46 - 0:09:48     Text: But you know, the idea is effectively the same.

0:09:48 - 0:09:50     Text: And you end up with vocabularies that look a lot like this.

0:09:50 - 0:09:58     Text: So if we go back to our examples of where, you know, word level

0:09:58 - 0:10:04     Text: NLP was failing us, then you have hat mapping to hat.

0:10:04 - 0:10:05     Text: Okay, that's good.

0:10:05 - 0:10:09     Text: You have hat mapping to hat because that was a common enough sequence of characters that

0:10:09 - 0:10:12     Text: it was actually incorporated into our sub word vocabulary.

0:10:12 - 0:10:13     Text: Right?

0:10:13 - 0:10:15     Text: And then you have learned mapping to learn.

0:10:15 - 0:10:16     Text: So common words good.

0:10:16 - 0:10:20     Text: And that means that the model, the neural network that you're going to process this text

0:10:20 - 0:10:27     Text: with does not need to, say, combine the letters of learn and hat in order to try to like

0:10:27 - 0:10:31     Text: derive the meaning of these words from the letters, because you can imagine that might

0:10:31 - 0:10:32     Text: be difficult.

0:10:32 - 0:10:38     Text: But then when you get a word that you have not seen before, you are able to decompose

0:10:38 - 0:10:39     Text: it.

0:10:39 - 0:10:45     Text: And so if you've seen tasty with varying numbers of A's at training time, you know,

0:10:45 - 0:10:50     Text: maybe you actually get some of the same sub words or similar sub words that you're splitting

0:10:50 - 0:10:52     Text: it into at evaluation time.

0:10:52 - 0:10:56     Text: So we never saw tasty with, you know, however many A's often enough to add it into

0:10:56 - 0:10:58     Text: a sub word vocabulary.

0:10:58 - 0:11:01     Text: But we're still able to split it into things.

0:11:01 - 0:11:05     Text: And then the neural network that runs on top of these sub word embeddings could be able

0:11:05 - 0:11:10     Text: to sort of induce that, oh, yeah, this is one of those things where people like, you know,

0:11:10 - 0:11:15     Text: chain letters together, chain vowels together in English for emphasis.

0:11:15 - 0:11:18     Text: So misspellings still pretty much mess you up.

0:11:18 - 0:11:22     Text: So now learn with this misspelling might be mapped to two sub words.

0:11:22 - 0:11:27     Text: But if you saw misspellings like this frequently enough, maybe you could learn sort of to handle

0:11:27 - 0:11:28     Text: it.

0:11:28 - 0:11:31     Text: It still messes up the model though.

0:11:31 - 0:11:33     Text: But at the very least, it's not just an UNK, right?

0:11:33 - 0:11:35     Text: It seems clearly better than that.

0:11:35 - 0:11:40     Text: And then transformerify: this is sort of optimistic, but maybe

0:11:40 - 0:11:44     Text: in the best case, right, you were able to say, ah, yes, this is transformer.

0:11:44 - 0:11:51     Text: And ify. Again, the sub words that you learn don't actually tend to be this well morphologically

0:11:51 - 0:11:52     Text: motivated, I think.

0:11:52 - 0:11:58     Text: So ify is like a clear suffix in English that has a very common and replicable meaning

0:11:58 - 0:12:02     Text: when you apply it to nouns, that's derivational morphology.

0:12:02 - 0:12:07     Text: But you know, you're able to sort of compose the meaning of the word transformerify,

0:12:07 - 0:12:11     Text: possibly from its two sub word constituents.

0:12:11 - 0:12:15     Text: And so when we talk about words being input to transformer models, pre-trained transformer

0:12:15 - 0:12:20     Text: models, throughout the entirety of this lecture, we will be talking about sub words.

0:12:20 - 0:12:26     Text: So I might say word, and what I mean is, you know, possibly a full word, also possibly

0:12:26 - 0:12:27     Text: a sub word.

0:12:27 - 0:12:31     Text: Okay, so when we say a sequence of words, the transformer, the pre-trained transformer

0:12:31 - 0:12:37     Text: has no idea, sort of, whether it's dealing with words or sub words, when it's doing its

0:12:37 - 0:12:40     Text: self-attention operations.

0:12:40 - 0:12:41     Text: And so this can be a problem.

0:12:41 - 0:12:46     Text: You can imagine if you have really weird sequences of characters, you can actually have an individual

0:12:46 - 0:12:51     Text: single word mapped to as many sub words as it has characters.

0:12:51 - 0:12:55     Text: That can be a problem because suddenly, you know, you have a ten-word sentence, but one

0:12:55 - 0:12:58     Text: of the words is mapped to, you know, twenty sub words.

0:12:58 - 0:13:02     Text: Now you have a thirty-word sentence, where twenty of the thirty words are just one real

0:13:02 - 0:13:03     Text: word.

0:13:03 - 0:13:05     Text: So keep this in mind.

0:13:05 - 0:13:09     Text: But, you know, I think it's important for sort of this open vocabulary assumption, it's

0:13:09 - 0:13:15     Text: important in English, and it's even more important in many other languages.

0:13:15 - 0:13:18     Text: And you can go look into the actual algorithms that are used for

0:13:18 - 0:13:23     Text: this; byte pair encoding is sort of my favorite for going over briefly, and WordPiece you can

0:13:23 - 0:13:26     Text: also take a look at.

0:13:26 - 0:13:27     Text: Okay.

0:13:27 - 0:13:29     Text: Any questions on sub words?

0:13:29 - 0:13:34     Text: I guess, John, let me ask one: what does the hashtag mean?

0:13:34 - 0:13:36     Text: Oh, great, great point.

0:13:36 - 0:13:40     Text: So this means that you should be combining this sub word, so this sub word is not the

0:13:40 - 0:13:41     Text: end of a word.

0:13:41 - 0:13:45     Text: TAA, hash hash, is sort of telling the model.

0:13:45 - 0:13:50     Text: So if I had TAA with no hashes, that's a separate sub word.

0:13:50 - 0:13:54     Text: That means there's an entire word that is taa, or at the very least it's the end of

0:13:54 - 0:13:55     Text: a word.

0:13:55 - 0:13:56     Text: See how here?

0:13:56 - 0:13:57     Text: I don't have the hashes at the end.

0:13:57 - 0:14:00     Text: It's because this is indicating that this is at the end of the word.

0:14:00 - 0:14:04     Text: Different sub word schemes differ on whether you should put something at the beginning of

0:14:04 - 0:14:08     Text: the word, if it does begin a word, or if you should put something at the end of the

0:14:08 - 0:14:11     Text: word, if it doesn't end the word.

0:14:11 - 0:14:15     Text: So when the tokenizer is running over your data, say you've got something that's tokenizing

0:14:15 - 0:14:20     Text: this sentence.

0:14:20 - 0:14:27     Text: In the worst case, it says: in, that's a whole word, give it just the word in, no hashes; the,

0:14:27 - 0:14:32     Text: that's a whole word, give it just the word the, no hashes; and then maybe over here it gets to

0:14:32 - 0:14:34     Text: subwords.

0:14:34 - 0:14:39     Text: We've got this weird word sub words, and it splits it into sub and words.

0:14:39 - 0:14:45     Text: And so sub, it's going to give it the sub word with sub hash hash to indicate that it's

0:14:45 - 0:14:52     Text: part of this larger word, sub words, as opposed to the word sub, like submarine, which would

0:14:52 - 0:14:53     Text: be different.

0:14:53 - 0:15:02     Text: Yeah, that's a great question.
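To make the hash convention concrete, here is a tiny hypothetical example of how a tokenizer using the end-of-piece convention described above might behave; the tokenize function and the exact splits are made up, since they depend on the learned vocabulary:

    tokenize("subwords")
    # -> ["sub##", "words"]
    # "sub##" ends in hashes: it is not the end of a word.
    # "words" has no hashes: it ends the word.

    tokenize("in the")
    # -> ["in", "the"]   # whole words, so no hashes at all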

0:15:02 - 0:15:06     Text: Okay, great.

0:15:06 - 0:15:11     Text: So that was our note on sub word modeling, and you can, you know, sub words are important,

0:15:11 - 0:15:17     Text: for example, in, you know, a lot of translation applications, that's why we gave you sub words

0:15:17 - 0:15:19     Text: on the machine translation assignment.

0:15:19 - 0:15:22     Text: Now let's talk about model pre-training and word embeddings.

0:15:22 - 0:15:25     Text: So I love, I love being able to go to this slide.

0:15:25 - 0:15:29     Text: So, so we saw this quote at the beginning of the class, you shall know a word by the company

0:15:29 - 0:15:34     Text: it keeps, and this was sort of one of the things that we used to summarize distributional

0:15:34 - 0:15:35     Text: semantics.

0:15:35 - 0:15:39     Text: This idea that word2vec was sort of well motivated in some way, because the meaning

0:15:39 - 0:15:44     Text: of a word can be thought of as being derived from the kind of co-occurrence statistics of

0:15:44 - 0:15:52     Text: words that co-occur around it, and that was just fascinatingly effective, I think.

0:15:52 - 0:15:54     Text: But there's this other quote actually from the same person.

0:15:54 - 0:16:01     Text: So we have J.R. Firth, 1935, compared to our quote before from 1957, and the second

0:16:01 - 0:16:06     Text: quote says, the complete meaning of a word is always contextual, and no study of meaning

0:16:06 - 0:16:10     Text: apart from a complete context can be taken seriously.

0:16:10 - 0:16:15     Text: Now again, these are just things that we can sort of think about and chew on, but it

0:16:15 - 0:16:20     Text: comes to mind, right, when you embed words with word2vec, one of the issues

0:16:20 - 0:16:26     Text: is that you don't actually look at its neighbors as you're giving it an embedding.

0:16:26 - 0:16:33     Text: So if I have the sentence "I record the record", you know, the two instances of r-e-c-o-r-d

0:16:33 - 0:16:37     Text: mean different things, but they're given the same word2vec embedding, right, because

0:16:37 - 0:16:42     Text: in word2vec you take the string, you map it to, oh, I've seen the word record before,

0:16:42 - 0:16:47     Text: you get that sort of vector from your learned matrix, and you give it the same thing in both

0:16:47 - 0:16:49     Text: cases.

0:16:49 - 0:16:54     Text: And so what we're going to be doing today is actually not conceptually all that different

0:16:54 - 0:16:56     Text: from training word2vec.

0:16:56 - 0:17:01     Text: Word2vec training you can think of as pre-training just a very simple model that only assigns

0:17:01 - 0:17:07     Text: an individual vector to each unique word type, each unique element in your vocabulary.

0:17:07 - 0:17:12     Text: Today we'll be going a lot farther than that, but the idea is very similar.

0:17:12 - 0:17:17     Text: So back in, you know, 2017, we would start with pre-trained word embeddings, and again,

0:17:17 - 0:17:21     Text: remember no context there, so you give a word and embedding independent of the context

0:17:21 - 0:17:23     Text: that it shows up in.

0:17:23 - 0:17:25     Text: And then you learn how to incorporate the context.

0:17:25 - 0:17:28     Text: It's not like our NLP models never used context, right?

0:17:28 - 0:17:34     Text: Instead, you would learn to incorporate the context using your LSTM, or, later in 2017,

0:17:34 - 0:17:37     Text: you know, your transformer.

0:17:37 - 0:17:41     Text: And you would learn to incorporate context while training on the task.

0:17:41 - 0:17:43     Text: So you have some supervision.

0:17:43 - 0:17:47     Text: Maybe it's machine translation supervision, maybe sentiment, maybe question answering.

0:17:47 - 0:17:53     Text: And you would learn how to incorporate context in your LSTM or otherwise through the signal

0:17:53 - 0:17:56     Text: of the training instead of, say, through the word2vec signal.

0:17:56 - 0:18:01     Text: And so, you know, sort of pictographically, you have these word embeddings here, so the

0:18:01 - 0:18:04     Text: red are sort of your word2vec embeddings, and those are pre-trained.

0:18:04 - 0:18:08     Text: Those take up some of the parameters of your network.

0:18:08 - 0:18:09     Text: And then you've got your contextualization.

0:18:09 - 0:18:12     Text: Now this looks like an LSTM, but it could be whatever.

0:18:12 - 0:18:16     Text: So this maybe bidirectional encoder thing here is not pre-trained.

0:18:16 - 0:18:19     Text: And now that's a lot of parameters that are not pre-trained.

0:18:19 - 0:18:24     Text: And then maybe you have some sort of readout function at the end, right, to predict whatever

0:18:24 - 0:18:25     Text: thing you're trying to predict.

0:18:25 - 0:18:30     Text: Again, maybe it's sentiment, maybe you're doing, I don't know, topic labeling, whatever

0:18:30 - 0:18:31     Text: you want to do.

0:18:31 - 0:18:32     Text: This is sort of the paradigm.

0:18:32 - 0:18:36     Text: Like, you set some sort of architecture and you only pre-train the word embeddings.

0:18:36 - 0:18:44     Text: And so this isn't actually, conceptually, necessarily the biggest problem, because,

0:18:44 - 0:18:49     Text: you know, we like to think in deep learning stuff that we have a lot of training data

0:18:49 - 0:18:50     Text: for our objectives.

0:18:50 - 0:18:56     Text: I mean, one of the things that we motivated, you know, big, deep neural networks for is

0:18:56 - 0:18:59     Text: that they can take a lot of data and they can learn patterns from it.

0:18:59 - 0:19:05     Text: But it does put the onus on our downstream data to be sort of sufficient to teach the

0:19:05 - 0:19:07     Text: contextual aspects of language.

0:19:07 - 0:19:13     Text: So you can imagine if you only have a little bit of, you know, labeled data for fine tuning,

0:19:13 - 0:19:17     Text: you're putting a pretty big role on that data to say, hey, maybe here's some pre-trained

0:19:17 - 0:19:21     Text: embeddings, but like how you handle like sentences and how they compose and all that stuff,

0:19:21 - 0:19:23     Text: that's up to you.

0:19:23 - 0:19:27     Text: So if you don't have a lot of labeled data for your downstream task, you're asking

0:19:27 - 0:19:32     Text: it to do a lot with, you know, a large number of parameters that have been initialized randomly.

0:19:32 - 0:19:38     Text: Okay, so like a small portion of the parameters have been pre-trained.

0:19:38 - 0:19:43     Text: Okay, so where we're going is pre-training whole models.

0:19:43 - 0:19:47     Text: I mean, conceptually, you know, we're pretty close to there.

0:19:47 - 0:19:53     Text: So nowadays, almost all parameters in your neural network and let's say a lot of research

0:19:53 - 0:19:58     Text: settings and increasingly in industry are initialized via pre-training, just like the

0:19:58 - 0:20:07     Text: word2vec parameters were initialized. And pre-training methods in general hide parts of the input

0:20:07 - 0:20:11     Text: from the model itself and then train the model to reconstruct those parts.

0:20:11 - 0:20:14     Text: How does this connect to word2vec?

0:20:14 - 0:20:19     Text: In word2vec, you know, people don't usually make this connection, but it's the following.

0:20:19 - 0:20:25     Text: You have an individual word and it knows itself, right, because you have the embedding for

0:20:25 - 0:20:27     Text: the center word, right, from assignment two.

0:20:27 - 0:20:31     Text: You have the embedding for the center word and knows itself and you've masked out all

0:20:31 - 0:20:33     Text: of its neighbors.

0:20:33 - 0:20:36     Text: You've hidden all of its neighbors from it, right, every all of its window neighbors, you've

0:20:36 - 0:20:37     Text: hidden from it.

0:20:37 - 0:20:41     Text: You ask the center word to predict its neighbors, right?

0:20:41 - 0:20:47     Text: And so this is, this falls under the category of pre-training.

0:20:47 - 0:20:48     Text: All of these methods look similar.

0:20:48 - 0:20:54     Text: You hide parts of the input from the model and train the model to reconstruct those parts.

0:20:54 - 0:20:58     Text: The differences with full model pre-training is that you don't give the model just the

0:20:58 - 0:21:01     Text: individual word and have it learn an embedding of that word.

0:21:01 - 0:21:06     Text: You give it much more of the sequence and have it predict, you know, held out parts of

0:21:06 - 0:21:07     Text: the sequence.

0:21:07 - 0:21:08     Text: And we'll get into the details there.

0:21:08 - 0:21:14     Text: But, you know, the takeaway is that everything here is pre-trained jointly, possibly with

0:21:14 - 0:21:18     Text: the exception of the very last layer that predicts the label.

0:21:18 - 0:21:26     Text: Okay, and this has just been exceptionally effective at building representations of language

0:21:26 - 0:21:31     Text: that just map similar things in language, similar representations in these encoders, just

0:21:31 - 0:21:35     Text: like how word2vec maps similar words to similar vectors.

0:21:35 - 0:21:40     Text: It's been exceptionally effective at making parameter initializations where you start with

0:21:40 - 0:21:45     Text: these parameters that have been pre-trained and then you fine-tune them on your label data.

0:21:45 - 0:21:49     Text: And then third, they've been exceptionally effective at defining probability distributions

0:21:49 - 0:21:54     Text: over language, like in language modeling, that are actually really useful to sample from

0:21:54 - 0:21:56     Text: in certain cases.

0:21:56 - 0:21:59     Text: So these are three ways in which we interact with pre-trained models.

0:21:59 - 0:22:02     Text: We use their representations just to compute similarities.

0:22:02 - 0:22:04     Text: We use them for parameter initializations.

0:22:04 - 0:22:10     Text: And we actually just use them as probability distributions, sort of how we trained them.

0:22:10 - 0:22:12     Text: Okay.

0:22:12 - 0:22:16     Text: So let's get into some technical parts here.

0:22:16 - 0:22:21     Text: I sort of want to think broad thoughts about what we could do with pre-training and what

0:22:21 - 0:22:26     Text: kind of things we could expect to potentially learn from this general method of hide part

0:22:26 - 0:22:30     Text: of the input and then see other parts of the input and then try to predict the parts

0:22:30 - 0:22:31     Text: that you hid.

0:22:31 - 0:22:32     Text: Okay.

0:22:32 - 0:22:36     Text: So Stanford University is located in Blank California.

0:22:36 - 0:22:39     Text: If we gave a model everything that was not blanked out here and asked it to predict the

0:22:39 - 0:22:49     Text: middle, the loss function would train the model to predict Palo Alto here, I expect.

0:22:49 - 0:22:50     Text: Okay.

0:22:50 - 0:22:54     Text: So this is an instance of something that you could imagine being a pre-training objective.

0:22:54 - 0:22:59     Text: You take in a sentence, you remove part of it and you say recreate the part that I removed.

0:22:59 - 0:23:03     Text: And in this case, if I just gave a bunch of examples that looked like this, it might

0:23:03 - 0:23:07     Text: learn sort of trivia thing here.

0:23:07 - 0:23:08     Text: Okay.

0:23:08 - 0:23:09     Text: Here's another one.

0:23:09 - 0:23:12     Text: I put blank fork down on the table.

0:23:12 - 0:23:14     Text: This one is under specified.

0:23:14 - 0:23:24     Text: So this could be the fork, my fork, his fork, her fork, some fork, yeah, a fork.

0:23:24 - 0:23:29     Text: So this is, you know, specifying the kinds of syntactic categories of things that can

0:23:29 - 0:23:32     Text: sort of appear in this context.

0:23:32 - 0:23:37     Text: So this is another thing that you might be able to learn from such an objective.

0:23:37 - 0:23:42     Text: So you have the woman walked across the street, checking for traffic over blank shoulder.

0:23:42 - 0:23:44     Text: One of the things that could go over here is her.

0:23:44 - 0:23:47     Text: That's a co-reference statement.

0:23:47 - 0:23:53     Text: So you could learn sort of connections between entities in a text where one word woman can

0:23:53 - 0:24:01     Text: also co-refer to the same entity in the world as this word, this pronoun her.

0:24:01 - 0:24:06     Text: Here you could think about, you know, I went to the ocean to see the fish, turtles, seals,

0:24:06 - 0:24:07     Text: and blank.

0:24:07 - 0:24:10     Text: Here I don't think there's a single correct answer as to what we could see going into that

0:24:10 - 0:24:11     Text: blank.

0:24:11 - 0:24:15     Text: But a model could learn a distribution of the kinds of things that people might be talking

0:24:15 - 0:24:20     Text: about when they, one, go to the ocean and two, are excited to see marine life.

0:24:20 - 0:24:21     Text: Right?

0:24:21 - 0:24:25     Text: So this is sort of a semantic category, a lexical semantic category of things that might

0:24:25 - 0:24:32     Text: sort of be in the same set of interest as fish, turtles, and seals in the context of

0:24:32 - 0:24:34     Text: I went to the ocean.

0:24:34 - 0:24:36     Text: Okay?

0:24:36 - 0:24:40     Text: So, and, you know, man, I expect that there would be examples of this in a large corpus

0:24:40 - 0:24:41     Text: of text.

0:24:41 - 0:24:43     Text: Maybe it may be a book.

0:24:43 - 0:24:44     Text: Okay.

0:24:44 - 0:24:46     Text: Here's another example.

0:24:46 - 0:24:52     Text: Overall, the value I got from the two hours watching it was the sum total of the popcorn

0:24:52 - 0:24:53     Text: and the drink.

0:24:53 - 0:24:55     Text: The movie was blank.

0:24:55 - 0:24:56     Text: Right?

0:24:56 - 0:25:00     Text: And this is where I'd sort of like to look out into the audience and ask: was the movie bad or

0:25:00 - 0:25:02     Text: good? But "the movie was bad"

0:25:02 - 0:25:04     Text: is my prediction here.

0:25:04 - 0:25:06     Text: Right?

0:25:06 - 0:25:09     Text: And so this is teaching you something about sentiment, about how people express sentiment

0:25:09 - 0:25:11     Text: in language.

0:25:11 - 0:25:17     Text: And so this even looks like a task itself: doing sentiment analysis is sort

0:25:17 - 0:25:22     Text: of what you need to do in order to figure out whether the movie was bad or good, or

0:25:22 - 0:25:24     Text: maybe the word is neither bad nor good,

0:25:24 - 0:25:25     Text: maybe "the movie was over" or something like that.

0:25:25 - 0:25:30     Text: But like, if you had to choose between is bad or good more likely, right?

0:25:30 - 0:25:33     Text: You sort of had to figure out the sentiment of the text.

0:25:33 - 0:25:36     Text: Now, that's really fascinating.

0:25:36 - 0:25:37     Text: Okay.

0:25:37 - 0:25:39     Text: Here's another one.

0:25:39 - 0:25:42     Text: Iroh went into the kitchen to make some tea.

0:25:42 - 0:25:46     Text: Standing next to Iroh, Zuko pondered his destiny.

0:25:46 - 0:25:48     Text: Zuko left the blank.

0:25:48 - 0:25:49     Text: Okay.

0:25:49 - 0:25:53     Text: So this is a little easy because we really only show one place.

0:25:53 - 0:25:56     Text: I guess we have another noun in destiny.

0:25:56 - 0:26:00     Text: But this is sort of reasoning about spatial location and the movement of, sort of,

0:26:00 - 0:26:02     Text: agents in an imagined world.

0:26:02 - 0:26:06     Text: We could imagine text that has lines like this.

0:26:06 - 0:26:10     Text: Person went into the place and was next to so and so who left and did that and sort of

0:26:10 - 0:26:12     Text: you have these like relationships.

0:26:12 - 0:26:14     Text: So here, Zuko left the kitchen.

0:26:14 - 0:26:17     Text: It's the most likely thing that I think would go here.

0:26:17 - 0:26:24     Text: And it sort of indicates that in order for a model to learn to perform this fill in the

0:26:24 - 0:26:32     Text: missing part task, it might need to, in general, figure out sort of where things are and

0:26:32 - 0:26:36     Text: whether statements mean or imply that locality.

0:26:36 - 0:26:40     Text: So, Iroh went into the kitchen.

0:26:40 - 0:26:45     Text: Now Iroh is in the kitchen, and then standing next to Iroh means Zuko is now in the kitchen.

0:26:45 - 0:26:48     Text: And then Zuko now leaves where?

0:26:48 - 0:26:50     Text: Well, he was in the kitchen before.

0:26:50 - 0:26:53     Text: So this is sort of a very basic sense of reasoning.

0:26:53 - 0:26:55     Text: Now this one.

0:26:55 - 0:26:56     Text: Here's a sentence.

0:26:56 - 0:27:03     Text: I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, blank.

0:27:03 - 0:27:05     Text: So I don't know.

0:27:05 - 0:27:07     Text: I can imagine people writing stuff.

0:27:07 - 0:27:09     Text: So this is the Fibonacci sequence.

0:27:09 - 0:27:12     Text: And sort of, you know, you sum these two to get the next one, sum these two

0:27:12 - 0:27:14     Text: to get the next one, and so on.

0:27:14 - 0:27:16     Text: And so you have this running sum.

0:27:16 - 0:27:17     Text: It's a famous sequence.

0:27:17 - 0:27:20     Text: It shows up in a lot of text on the internet.

0:27:20 - 0:27:26     Text: And in general you have to learn the algorithm or just the formula, I guess, that defines the

0:27:26 - 0:27:29     Text: Fibonacci sequence in order to keep going.

0:27:29 - 0:27:32     Text: Do models learn this in practice?

0:27:32 - 0:27:33     Text: Wait and find out.

0:27:33 - 0:27:37     Text: But you would have to learn it in order to get the sequence to keep going and going and

0:27:37 - 0:27:39     Text: going.

0:27:39 - 0:27:46     Text: OK, so we're going to get into specific pre-trained models, specific methods of pre-training

0:27:46 - 0:27:47     Text: now.

0:27:47 - 0:27:56     Text: So I'm going to go over a brief review of transformer encoders, decoders, and encoder decoders.

0:27:56 - 0:27:58     Text: Because we're going to get into the sort of technical bits now.

0:27:58 - 0:28:01     Text: So before I do that, I'm going to pause.

0:28:01 - 0:28:02     Text: Are there any questions?

0:28:02 - 0:28:11     Text: Yeah, there's an interesting question asked about whether we risk overfitting our model on our

0:28:11 - 0:28:13     Text: input training data when we do pre-training?

0:28:13 - 0:28:17     Text: And maybe we should also ask these questions in light of the huge models that we

0:28:17 - 0:28:19     Text: train nowadays.

0:28:19 - 0:28:25     Text: Sorry, the first part of that question, was it, are we overfitting our models to what?

0:28:25 - 0:28:29     Text: Yes, so the risk of overfitting our model on our input training data when we're

0:28:29 - 0:28:30     Text: doing pre-training?

0:28:30 - 0:28:31     Text: Got it.

0:28:31 - 0:28:34     Text: Yeah, so that's a good point.

0:28:34 - 0:28:36     Text: So we're using very large models.

0:28:36 - 0:28:40     Text: And we might imagine that there's a risk of overfitting.

0:28:40 - 0:28:45     Text: And in practice, yeah, it's actually one of the more crucial things to do to make pre-training

0:28:45 - 0:28:46     Text: work.

0:28:46 - 0:28:51     Text: So that turns out that you need to have a lot, a lot of data, like a lot of data.

0:28:51 - 0:28:56     Text: And in fact, we'll show results later on where people built a pre-trained model, pre-trained

0:28:56 - 0:28:58     Text: it on a lot of data.

0:28:58 - 0:29:02     Text: And then like six months later, someone else came along and was like, hey, if you pre-trained

0:29:02 - 0:29:06     Text: it on 10 times the data and changed almost nothing else, it would have done even better.

0:29:06 - 0:29:07     Text: Now was it overfitting?

0:29:07 - 0:29:13     Text: I mean, you can sort of like hold out some text during pre-training, right, and sort

0:29:13 - 0:29:18     Text: of evaluate the perplexity, right, the language modeling performance on that held out text.

0:29:18 - 0:29:22     Text: And it tends to be the case that actually these models are underfitting, right, that we

0:29:22 - 0:29:28     Text: need even larger and larger models to express the complex interactions that allow us to

0:29:28 - 0:29:30     Text: fit these datasets better.

0:29:30 - 0:29:33     Text: And so we'll talk about that when we talk about BERT.

0:29:33 - 0:29:37     Text: And one of the really interesting results is that BERT is underfit, not overfit, but

0:29:37 - 0:29:42     Text: in principle, yes, it's a problem to, this potentially a problem to overfit.

0:29:42 - 0:29:46     Text: But we end up having a ton of text in English at least, although not in every language.

0:29:46 - 0:29:51     Text: And so, yeah, it's important to scale them, but currently our models don't seem overfit

0:29:51 - 0:29:53     Text: to the pre-training text.

0:29:53 - 0:29:54     Text: Okay.

0:29:54 - 0:30:02     Text: Any other questions?

0:30:02 - 0:30:05     Text: All right.

0:30:05 - 0:30:07     Text: So we saw this figure before, right here.

0:30:07 - 0:30:12     Text: We saw this figure of a transformer encoder-decoder from this paper, attention is all you

0:30:12 - 0:30:14     Text: need.

0:30:14 - 0:30:17     Text: And so we have a couple of things.

0:30:17 - 0:30:22     Text: We're not going to go over the form of attention again today because we have a lot to go over,

0:30:22 - 0:30:25     Text: but I'm happy to chat about it more on Ed.

0:30:25 - 0:30:28     Text: But so in our encoder, we have some input sequence.

0:30:28 - 0:30:31     Text: Remember, this is a sequence of sub words now.

0:30:31 - 0:30:34     Text: Each sub word gets a word embedding.

0:30:34 - 0:30:38     Text: And each index in the transformer gets a position embedding.

0:30:38 - 0:30:44     Text: Now remember that we have a finite length that our sequence can possibly be like 512.

0:30:44 - 0:30:45     Text: That's tokens.

0:30:45 - 0:30:47     Text: That was that capital T from last lecture.

0:30:47 - 0:30:48     Text: So you have some finite length.

0:30:48 - 0:30:55     Text: So you have one embedding of a position for every index for all 512 indices.

0:30:55 - 0:30:57     Text: And then you have all your word embeddings.

0:30:57 - 0:31:03     Text: And then the transformer encoder, right, was this combination of sort of sub-modules that

0:31:03 - 0:31:08     Text: we walked through line by line on Tuesday, right.

0:31:08 - 0:31:12     Text: Multi-headed attention was sort of the core building block.

0:31:12 - 0:31:16     Text: And then we had residual and layer norm, right, to help with passing gradients and to help

0:31:16 - 0:31:19     Text: make training go better and faster.

0:31:19 - 0:31:25     Text: We had that feed forward layer to, yeah, process sort of the result of the multi-headed

0:31:25 - 0:31:30     Text: attention, another residual and layer norm, and then pass to an identical transformer

0:31:30 - 0:31:31     Text: encoder block here.

0:31:31 - 0:31:32     Text: And these would be stacked.

0:31:32 - 0:31:38     Text: We'll see a number of different configurations here, but I think, you know, 6 to 12 of these

0:31:38 - 0:31:39     Text: sort of stacked together.

0:31:39 - 0:31:40     Text: Okay.

0:31:40 - 0:31:42     Text: So that's a transformer encoder.
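A rough PyTorch-style sketch of one such encoder block; the dimensions, ReLU choice, and post-norm placement here are illustrative assumptions rather than the exact configuration from the paper:

    import torch.nn as nn

    class EncoderBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            # Multi-headed self-attention, then residual + layer norm.
            attn_out, _ = self.attn(x, x, x)
            x = self.norm1(x + attn_out)
            # Position-wise feed-forward, then residual + layer norm.
            return self.norm2(x + self.ff(x))

    # A full encoder stacks several of these blocks (roughly 6 to 12) on top of
    # the sum of word embeddings and position embeddings.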

0:31:42 - 0:31:47     Text: And we're actually going to see whole models today that are just transformer encoders.

0:31:47 - 0:31:48     Text: Okay.

0:31:48 - 0:31:52     Text: So when we talked about machine translation, when we talked about the transformer itself,

0:31:52 - 0:31:56     Text: the transformer encoder decoder, we talked about this whole thing.

0:31:56 - 0:31:59     Text: But you could actually just have this left column, and you could actually just have this

0:31:59 - 0:32:02     Text: right column as well.

0:32:02 - 0:32:04     Text: Although the right column changes a little bit if you just have it.

0:32:04 - 0:32:10     Text: So remember, the right column, we had this masked multi-head self-attention, right, so

0:32:10 - 0:32:14     Text: where you can't look at the future.

0:32:14 - 0:32:18     Text: And someone asked actually about how we decode from transformers, given that you have this

0:32:18 - 0:32:20     Text: sort of big chunking operation.

0:32:20 - 0:32:21     Text: It's a great question.

0:32:21 - 0:32:25     Text: I won't be able to get into it in detail today, but you have to run it once during the decoding

0:32:25 - 0:32:31     Text: process for every time that you decode to sort of predict the next word.

0:32:31 - 0:32:34     Text: I'll write out something on Ed for this.

0:32:34 - 0:32:38     Text: So in the masked multi-head self-attention, you're not allowed to look at the future so

0:32:38 - 0:32:44     Text: that you sort of have this well-defined objective of trying to do language modeling.
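As a minimal sketch of what "not allowed to look at the future" means mechanically, here is the standard causal-mask trick in PyTorch (illustrative, not the lecture's code): positions after the current one get a score of negative infinity, so the softmax gives them zero attention weight.

    import torch

    T = 5                                   # sequence length
    scores = torch.randn(T, T)              # scores[i, j]: attention from position i to j
    # Upper-triangular entries (j > i) are the future; mask them out.
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    attn = torch.softmax(scores, dim=-1)    # each row attends only to positions j <= i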

0:32:44 - 0:32:46     Text: Then we have residual and layer norm.

0:32:46 - 0:32:50     Text: The multi-head cross-attention, remember, goes back to the last layer of the transformer

0:32:50 - 0:32:53     Text: encoder, or the last transformer encoder block.

0:32:53 - 0:32:57     Text: And then more residual and layer norm, another feed-forward layer, more residual and layer

0:32:57 - 0:32:58     Text: norm.

0:32:58 - 0:33:04     Text: Now, if we don't have an encoder here, then we get rid of the cross-attention and residual

0:33:04 - 0:33:05     Text: and layer norm here.

0:33:05 - 0:33:09     Text: So if we didn't have this stack of encoders, the decoders get simpler because you don't

0:33:09 - 0:33:10     Text: have to attend to them.

0:33:10 - 0:33:14     Text: But then again, you also have these word embeddings at the bottom and position representations

0:33:14 - 0:33:17     Text: for the output sequence.

0:33:17 - 0:33:20     Text: Okay, so that's been review.

0:33:20 - 0:33:22     Text: Let's talk about pre-training through language modeling.

0:33:22 - 0:33:26     Text: So we've actually talked maybe a little bit about this before, and we've seen language

0:33:26 - 0:33:30     Text: modeling in the context of maybe just wanting to do it our priori.

0:33:30 - 0:33:35     Text: So language models were useful, for example, in automatic speech recognition systems.

0:33:35 - 0:33:38     Text: They were useful in statistical machine translation systems.

0:33:38 - 0:33:42     Text: So let's recall the language modeling task.

0:33:42 - 0:33:47     Text: You can say it's defined as modeling the probability of a word at a given index t, of any word

0:33:47 - 0:33:51     Text: at any given index, given all the words before it.

0:33:51 - 0:33:59     Text: And this probability distribution is a distribution of words given their past contexts.

0:33:59 - 0:34:05     Text: And so this is just saying, for any prefix here, Iroh goes to make.

0:34:05 - 0:34:07     Text: I want a probability of whatever the next word should be.

0:34:07 - 0:34:14     Text: So the observed next word is tasty, but maybe there's goes to make tea, goes to make hot

0:34:14 - 0:34:15     Text: water, etc.

0:34:15 - 0:34:19     Text: You can have a distribution over what the next word should be in this decoder.

0:34:19 - 0:34:24     Text: And remember that because of the masked self-attention, make can look back to the word

0:34:24 - 0:34:30     Text: to, or goes, or Iroh, but it can't look forward to tasty.
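Written out, the language modeling objective being described is just the negative log-likelihood of each word given its prefix (standard notation, not the slide's exact formula):

    L(\theta) = - \sum_{t=1}^{T} \log p_\theta(w_t \mid w_1, \dots, w_{t-1})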

0:34:30 - 0:34:31     Text: So there's a lot of data for this, right?

0:34:31 - 0:34:33     Text: You just have text.

0:34:33 - 0:34:36     Text: And like voila, you have language modeling data.

0:34:36 - 0:34:37     Text: It's free.

0:34:37 - 0:34:38     Text: No.

0:34:38 - 0:34:40     Text: Once you have the text, it's freely available.

0:34:40 - 0:34:42     Text: You don't need to label it.

0:34:42 - 0:34:44     Text: And in English, you have a lot of it, right?

0:34:44 - 0:34:50     Text: This is not true of every language by any means, but in English, you have a lot of pre-training

0:34:50 - 0:34:52     Text: data.

0:34:52 - 0:34:58     Text: And so the simple thing about pre-training is, well, what we're going to do is we're

0:34:58 - 0:35:01     Text: going to train a neural network to do language modeling on a large amount of text, and we'll

0:35:01 - 0:35:06     Text: just save the parameters of our train network to disk.

0:35:06 - 0:35:10     Text: So conceptually, it's not actually different from the things that we've done before.

0:35:10 - 0:35:12     Text: It's just sort of the intent, right?

0:35:12 - 0:35:17     Text: We're training these parameters to start using them for something else later down the line.

0:35:17 - 0:35:20     Text: But the language modeling itself doesn't change.

0:35:20 - 0:35:22     Text: The decoder here doesn't change, right?

0:35:22 - 0:35:27     Text: It's a transformer in pre-trained models in the modern day, because this is sort of a newly

0:35:27 - 0:35:30     Text: popular concept.

0:35:30 - 0:35:36     Text: Although back in 2015 was sort of when this, I think, was first effectively tried out and

0:35:36 - 0:35:39     Text: got some interesting results.

0:35:39 - 0:35:42     Text: But this could be anything here.

0:35:42 - 0:35:47     Text: Today, it's mostly going to be transformers in the models that we actually observe.

0:35:47 - 0:35:49     Text: Okay.

0:35:49 - 0:35:52     Text: So once you have your pre-trained network, what's the sort of default thing you do to

0:35:52 - 0:35:54     Text: use it?

0:35:54 - 0:35:55     Text: Right?

0:35:55 - 0:35:58     Text: And if you take anything away from this lecture in terms of just like engineering practices

0:35:58 - 0:36:05     Text: that will be broadly useful to you as you go off and build things and study things, maybe

0:36:05 - 0:36:11     Text: as a machine learning engineer or a computational social scientist, et cetera, what people tend

0:36:11 - 0:36:16     Text: to do is you pre-train your network on just a lot of data, lots of text, learn very

0:36:16 - 0:36:18     Text: general things.

0:36:18 - 0:36:22     Text: And then you adapt the network to whatever you wanted to do.

0:36:22 - 0:36:26     Text: So we had a bunch of pre-training data, and then maybe this is a movie review that

0:36:26 - 0:36:34     Text: we're taking as input here, and we just apply the decoder that we sort of pre-trained,

0:36:34 - 0:36:41     Text: start the parameters there, and then fine tune it on whatever we were sort of wanting

0:36:41 - 0:36:42     Text: to do.

0:36:42 - 0:36:43     Text: Maybe this is a sentiment analysis task.

0:36:43 - 0:36:48     Text: So we run the whole sequence through the decoder, get a hidden state at the end at the

0:36:48 - 0:36:53     Text: very last thing, and then we predict maybe plus or minus sentiment.

0:36:53 - 0:36:56     Text: And this is sort of adapting the pre-trained network to the task.

0:36:56 - 0:37:02     Text: This pre-train, fine-tune paradigm is wildly successful, and you should really try

0:37:02 - 0:37:09     Text: it whenever you're doing any NLP task nowadays, effectively.

0:37:09 - 0:37:14     Text: Because some variant of this tends to be what works best.
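A hedged sketch of that adapt-the-decoder recipe in PyTorch; the pretrained_decoder object and the shape of its output are assumptions standing in for whatever pre-trained model you actually load:

    import torch.nn as nn

    class SentimentClassifier(nn.Module):
        def __init__(self, pretrained_decoder, d_model, n_classes=2):
            super().__init__()
            self.decoder = pretrained_decoder          # parameters come from pre-training
            self.head = nn.Linear(d_model, n_classes)  # the only randomly initialized part

        def forward(self, token_ids):
            hidden = self.decoder(token_ids)           # assumed shape (batch, seq_len, d_model)
            last = hidden[:, -1, :]                    # hidden state at the last position
            return self.head(last)                     # predict +/- sentiment

    # Fine-tuning is then ordinary gradient descent on the labeled sentiment data,
    # typically updating all parameters, not just the classification head.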

0:37:14 - 0:37:19     Text: Okay, so we've got a technical note now.

0:37:19 - 0:37:27     Text: So if you don't like to think about optimization or gradient descent, maybe take a pass on

0:37:27 - 0:37:34     Text: this slide, but I encourage you to just think for a second about why should this help?

0:37:34 - 0:37:40     Text: Training neural nets, we're using gradient descent to try to find some global minimum

0:37:40 - 0:37:43     Text: of this loss function.

0:37:43 - 0:37:47     Text: And we're sort of doing this in two steps.

0:37:47 - 0:37:55     Text: The first step is we get some parameters theta hat by approximating min over our, sorry,

0:37:55 - 0:37:57     Text: theta is the parameters of the neural network.

0:37:57 - 0:38:03     Text: So all of the KQV vectors in our transformer, the word embeddings, the position embeddings,

0:38:03 - 0:38:07     Text: it's just all of the parameters of our neural network.

0:38:07 - 0:38:11     Text: And so we're doing min over all the parameters theta; we're trying to approximate the min

0:38:11 - 0:38:14     Text: over the parameters of our neural network of our pre-training loss, which here was language

0:38:14 - 0:38:18     Text: modeling, as a function of our parameters.

0:38:18 - 0:38:23     Text: And this is, we just get this sort of estimate of some parameters theta hat.

0:38:23 - 0:38:31     Text: And then we fine tune by approximating this min over theta of the fine tune loss, maybe

0:38:31 - 0:38:33     Text: that's sentiment, right?

0:38:33 - 0:38:34     Text: Starting at theta hat.

0:38:34 - 0:38:37     Text: So we initialize our gradient descent at theta hat, and then we just sort of let it do

0:38:37 - 0:38:38     Text: what it wants.

0:38:38 - 0:38:42     Text: And it's just like, it just works.
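In symbols, the two-step recipe just described is (using the same theta-hat notation as above):

    \hat{\theta} \approx \arg\min_\theta \; \mathcal{L}_{\text{pretrain}}(\theta)

    \theta^{*} \approx \arg\min_\theta \; \mathcal{L}_{\text{finetune}}(\theta), \quad \text{with gradient descent initialized at } \hat{\theta}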

0:38:42 - 0:38:49     Text: And in part, it has to be because something about where we start is so important, not just

0:38:49 - 0:38:53     Text: in terms of sort of gradient flow, although that is a big part of it.

0:38:53 - 0:39:00     Text: But also, it seems like, you know, stochastic gradient descent sticks relatively close to

0:39:00 - 0:39:03     Text: that pre-training initialization during fine tuning.

0:39:03 - 0:39:09     Text: This is something that we seem to observe in practice, right, that somehow the locality

0:39:09 - 0:39:13     Text: of stochastic gradient descent, finding local minima that are close to this theta hat,

0:39:13 - 0:39:19     Text: that was good for such a general problem of language modeling, it seems like, yeah, the

0:39:19 - 0:39:28     Text: local minima of the fine tuning loss tend to generalize well when they're near to this theta hat that

0:39:28 - 0:39:29     Text: we pre-trained.

0:39:29 - 0:39:32     Text: And this is sort of a mystery that we're still trying to figure out more about.

0:39:32 - 0:39:35     Text: And then also, yeah, maybe the gradients, right, the gradients of the fine tuning loss

0:39:35 - 0:39:40     Text: near theta propagate nicely, so our network training goes really well as well.

0:39:40 - 0:39:45     Text: Okay, so this is something to chew on, but in practice, it works.

0:39:45 - 0:39:49     Text: I think it's just still fascinating that it works.

0:39:49 - 0:39:59     Text: Okay, so we talked about mainly the transformer encoder-decoder, and in fact, right, I said

0:39:59 - 0:40:04     Text: that we could have just sort of the left-hand side encoders, you know, be pre-trained,

0:40:04 - 0:40:08     Text: or just decoders be pre-trained, or encoder-decoders.

0:40:08 - 0:40:13     Text: And there are actually really popular sort of famous models in each of these three categories.

0:40:13 - 0:40:20     Text: The kinds of pre-training you can do, and the kinds of applications or uses of those

0:40:20 - 0:40:25     Text: pre-trained models that are most natural actually depend strongly on whether you choose

0:40:25 - 0:40:31     Text: to pre-traine and encoder a decoder or an encoder decoder.

0:40:31 - 0:40:36     Text: So I think it's useful as we go through some of these popular sort of model names that

0:40:36 - 0:40:41     Text: you need to know and what they sort of, what their innovations were to actually split

0:40:41 - 0:40:44     Text: it up into these categories.

0:40:44 - 0:40:46     Text: So we've all, so here's the thing.

0:40:46 - 0:40:51     Text: We're going to go through these three, and they all have sort of benefits and in some

0:40:51 - 0:40:52     Text: sense, drawbacks.

0:40:52 - 0:40:58     Text: So the decoders, right, really what we're talking about here mainly is language models,

0:40:58 - 0:41:03     Text: and we've seen this so far, we've talked about pre-trained decoders, and these are nice

0:41:03 - 0:41:04     Text: to generate from.

0:41:04 - 0:41:08     Text: So you can just sample from your pre-trained language model and get things that look

0:41:08 - 0:41:11     Text: like the text that you were pre-training on.

0:41:11 - 0:41:15     Text: But one problem is that you can't condition on future words, right?

0:41:15 - 0:41:21     Text: So we mentioned in our modeling with LSTMs that just like, instead, if you could, when

0:41:21 - 0:41:27     Text: you can do it, we said that having a bi-directional LSTM was actually just way more useful than

0:41:27 - 0:41:29     Text: having a one-directional LSTM.

0:41:29 - 0:41:31     Text: Well, it's sort of true for transformers as well.

0:41:31 - 0:41:37     Text: So if you can see how the arrows are pointing here, the arrows are pointing up into the,

0:41:37 - 0:41:38     Text: you know, to the right.

0:41:38 - 0:41:45     Text: So this word is sort of looking back at its past history, but, you know, this word can't

0:41:45 - 0:41:48     Text: see, can't contextualize with the future.

0:41:48 - 0:41:52     Text: Whereas in the encoder block here in blue, just below it, you sort of have all pairs of

0:41:52 - 0:41:54     Text: interactions.

0:41:54 - 0:41:56     Text: And so, you know, when you're building your representations, it can actually be super

0:41:56 - 0:41:58     Text: useful to know what the future words are.

0:41:58 - 0:42:00     Text: So that's what encoders get you, right?

0:42:00 - 0:42:02     Text: You get bi-directional context.

0:42:02 - 0:42:05     Text: So you can condition on the future, maybe that helps you build up better representations

0:42:05 - 0:42:06     Text: of language.

0:42:06 - 0:42:12     Text: But the question that we'll actually go through here is, well, how do you pre-train them?

0:42:12 - 0:42:15     Text: You can't pre-train them as language models because you have access to the future.

0:42:15 - 0:42:20     Text: So if you try to do that, the loss will just immediately be zero because you can just

0:42:20 - 0:42:21     Text: see what the future is.

0:42:21 - 0:42:22     Text: That's not useful.

0:42:22 - 0:42:28     Text: And then we'll talk about pre-trained encoder decoders, which like maybe the best of both

0:42:28 - 0:42:33     Text: worlds, but also maybe unclear what's the best way to pre-train them.

0:42:33 - 0:42:36     Text: They definitely have benefits for both.

0:42:36 - 0:42:43     Text: So let's get into some general top, like a more, yeah, let's get into the decoders first,

0:42:43 - 0:42:46     Text: we'll go through all three.

0:42:46 - 0:42:47     Text: Okay.

0:42:47 - 0:42:54     Text: When we're pre-training a language model, right, we're pre-training it on this objective,

0:42:54 - 0:42:59     Text: we're trying to make it approximate this probability of a word given all of its previous

0:42:59 - 0:43:01     Text: words.

0:43:01 - 0:43:04     Text: What we end up doing, and I showed this sort of pictographically, but I'll add some math,

0:43:04 - 0:43:11     Text: right, we get a hidden state, h1 to ht for each of the words in the input w1 to wt.

0:43:11 - 0:43:14     Text: And I remember words again, mean sub words here.

0:43:14 - 0:43:15     Text: Okay.

0:43:15 - 0:43:20     Text: And we're fine tuning this, right, we can take the representation, this should be ht,

0:43:20 - 0:43:22     Text: a, ht plus b.

0:43:22 - 0:43:25     Text: And then the picture here is, right, here's ht.

0:43:25 - 0:43:29     Text: It's the very last encoder state.

0:43:29 - 0:43:36     Text: And now this has sort of the, it's seen all of its history, right, and so you can apply

0:43:36 - 0:43:41     Text: a linear layer here, maybe multiplying it by some parameters a and b that were not

0:43:41 - 0:43:46     Text: pre-trained, and then you're predicting sentiment maybe, you know, plus or minus sentiment,

0:43:46 - 0:43:47     Text: perhaps.

0:43:47 - 0:43:51     Text: And so, you know, look at the red and the gray, so most of the parameters of my neural

0:43:51 - 0:43:56     Text: network have now been pre-trained, the very last layer that's learning, the sentiment,

0:43:56 - 0:44:00     Text: say, decision, has not been pre-trained.

0:44:00 - 0:44:02     Text: So those have been randomly initialized.

0:44:02 - 0:44:06     Text: And when you, when you take the loss of the sentiment loss, right, you train not just

0:44:06 - 0:44:11     Text: the linear layer here, but you actually back propagate the gradients all the way through

0:44:11 - 0:44:16     Text: the entire pre-trained network and fine tune all of those parameters, right?

0:44:16 - 0:44:20     Text: So it's not like you're just training this, fine tuning time, this linear layer, you're

0:44:20 - 0:44:25     Text: training the whole network as a function of this fine tuning loss.

0:44:25 - 0:44:30     Text: And you know, maybe it's bad that like the linear layer wasn't pre-trained.

0:44:30 - 0:44:34     Text: In the grand scheme of things, it's not that many parameters also.

0:44:34 - 0:44:38     Text: So this is you, so this is just one way to interact with pre-trained models, right?

0:44:38 - 0:44:42     Text: And so what I want you to take away from this is that there was a contract that we had

0:44:42 - 0:44:44     Text: with the original model, right?

0:44:44 - 0:44:48     Text: The contract was that it was defining probability distributions.

0:44:48 - 0:44:52     Text: But when we're fine tuning, when we're interacting with the pre-trained model, what we also

0:44:52 - 0:44:55     Text: have are just like the trained weights and the network architecture.

0:44:55 - 0:44:58     Text: We don't need to use it as a language model, we don't need to use it as a probability

0:44:58 - 0:44:59     Text: distribution.

0:44:59 - 0:45:04     Text: When we're actually fine tuning it, we're really just using it for its initialization

0:45:04 - 0:45:08     Text: of its parameters and saying, oh, this is just a transformer decoder that was

0:45:08 - 0:45:14     Text: pre-trained by, oh, and it happens to be really great in that when you find tuna on some

0:45:14 - 0:45:17     Text: sentiment data, it does a really good job.

0:45:17 - 0:45:22     Text: Okay, but there's a second way to interact with pre-trained decoders, which is in some

0:45:22 - 0:45:24     Text: sense even more natural.

0:45:24 - 0:45:28     Text: It actually is closer to the contract that we started with.

0:45:28 - 0:45:32     Text: So we don't have to just ignore the fact that it was a probability distribution entirely,

0:45:32 - 0:45:35     Text: we can make use of it while still fine tuning it.

0:45:35 - 0:45:37     Text: So here's what we're going to do.

0:45:37 - 0:45:40     Text: So we can use them as a generator at fine tuning time.

0:45:40 - 0:45:47     Text: By generator, I mean, it's going to define this distribution of words given their context.

0:45:47 - 0:45:51     Text: And then we'll actually just fine tune that probability distribution.

0:45:51 - 0:45:58     Text: So in a task like some kind of turn-based dialogue, we might encode the dialogue history

0:45:58 - 0:46:01     Text: as your past context.

0:46:01 - 0:46:06     Text: So you have a dialogue history of some things that people are saying back and forth

0:46:06 - 0:46:10     Text: to each other, you encode it as words, and you try to predict the next words in the

0:46:10 - 0:46:11     Text: dialogue.

0:46:11 - 0:46:15     Text: Right, and maybe you're pre-training objective, you looked at very general purpose text

0:46:15 - 0:46:19     Text: from, I don't know, Wikipedia or books or something, and you're fine tuning it as a

0:46:19 - 0:46:25     Text: language model, but you're fine tuning it as a language model on this sort of domain-specific

0:46:25 - 0:46:30     Text: distribution of text like dialogue or maybe summarization where you paste in the whole

0:46:30 - 0:46:37     Text: document and then say a specific word and then the summary and say predict the summary.

0:46:37 - 0:46:43     Text: And so what this looks like is, again, at fine tuning time here, you have your h1 to

0:46:43 - 0:46:49     Text: ht is equal to the decoder of the words, and then you have this distribution that you're

0:46:49 - 0:46:55     Text: fine tuning of wt is a h is the type again, ht minus 1 plus b.

0:46:55 - 0:47:01     Text: So now every time I have this, I'm predicting these words from word 1, I predict word 2,

0:47:01 - 0:47:07     Text: we're 2, I predict word 3, etc., right, the actual last layer of the network unlike before,

0:47:07 - 0:47:12     Text: the last layer of the network has been pre-trained, but I'm still fine tuning the whole thing.

0:47:12 - 0:47:17     Text: Right, so a and b here are mapping to sort of a probability distribution over my vocabulary

0:47:17 - 0:47:23     Text: or the logits of a probability distribution, and I guess get this sort of like tweak them

0:47:23 - 0:47:28     Text: now, in order to have the distribution that I'm going to use, reflect the thing like dialogue

0:47:28 - 0:47:30     Text: that I wanted to reflect.

0:47:30 - 0:47:36     Text: Okay, so those are two ways of interacting with a pre-trained decoder.

0:47:36 - 0:47:44     Text: Now here's an example of what is ended up being the first, that be a line of wildly successful

0:47:44 - 0:47:49     Text: or at least talked about pre-trained decoders.

0:47:49 - 0:47:57     Text: So the generative pre-trained decoder, or GPC, was a huge success in some sense, or at

0:47:57 - 0:48:04     Text: least it got a lot of buzz, so it's a transformer decoder, no encoder, with 12 layers, I'm giving

0:48:04 - 0:48:10     Text: you the details so you can start to get a feeling for how the size of things changes.

0:48:10 - 0:48:15     Text: Over the years, as we'll continue to progress here, had each of our, each of the hidden

0:48:15 - 0:48:20     Text: states was dimensionality, 70, had 768, so if you remember back to last lecture, we had

0:48:20 - 0:48:26     Text: a term D, which was our dimensionality, so D is 768, and then an interesting statement

0:48:26 - 0:48:31     Text: that you should keep in mind for the engineering-minded folks is that the actual feed-forward

0:48:31 - 0:48:35     Text: layers, right, you've got a hidden layer in the feed-forward layer, and this was actually

0:48:35 - 0:48:41     Text: very large, so you had these sort of like position-wise feed-forward layers, right, and the

0:48:41 - 0:48:47     Text: feed-forward layer would take the 768-dimensional vector, sort of like project it to 3,000-dimensional

0:48:47 - 0:48:52     Text: space through the sort of non-linearity, and then project it back to 768.

0:48:52 - 0:48:56     Text: This ends up being because you can squash a lot more parameters in, for not too much

0:48:56 - 0:49:01     Text: more compute in this way, but that's curious.

0:49:01 - 0:49:06     Text: Okay, and then, byte-parent coding, it's actually, was this one byte-parent coding?

0:49:06 - 0:49:10     Text: Well, it was a sub-word vocabulary with 40,000 merges, so 40,000 merges, so that's not

0:49:10 - 0:49:15     Text: the size of the vocabulary because you started with a bunch of characters, and I don't remember

0:49:15 - 0:49:19     Text: how many characters they started with, but so it's a relatively small vocabulary you can

0:49:19 - 0:49:21     Text: see, right?

0:49:21 - 0:49:27     Text: And compared to, if you tried to say, have every word, have a unique representation, now

0:49:27 - 0:49:32     Text: it's going to be trained on books, corporates, it's got 7,000 unique books, and it contains

0:49:32 - 0:49:37     Text: long spans of contiguous texts, so you have, instead of, say, training it on individual

0:49:37 - 0:49:40     Text: sentences, just small short sentences, right?

0:49:40 - 0:49:46     Text: The model is able to learn long distance dependencies because you haven't split, like, a book

0:49:46 - 0:49:48     Text: into random sentences and shuffled them all around.

0:49:48 - 0:49:53     Text: You've sort of kept it contiguous, so we can have that sort of consistency.

0:49:53 - 0:49:58     Text: And then, a little treat here, yeah, so GPC never showed up in the original paper, or

0:49:58 - 0:50:03     Text: the original blog post, like as an acronym, and it could actually sort of refer to, like,

0:50:03 - 0:50:07     Text: generative pre-training, sort of what, like, the title of the paper would suggest, or

0:50:07 - 0:50:09     Text: generative pre-trained transformer.

0:50:09 - 0:50:13     Text: And I sort of decided to say generative pre-trained transformer because this seemed like way

0:50:13 - 0:50:15     Text: too general.

0:50:15 - 0:50:17     Text: So GPC.

0:50:17 - 0:50:22     Text: Okay, so they pre-trained this huge language model transformer, this huge transformer

0:50:22 - 0:50:25     Text: decoder, just on 7,000 books.

0:50:25 - 0:50:28     Text: And they fine-tuned it on a number of different tasks, and I want to talk a little bit about

0:50:28 - 0:50:31     Text: the details about how they fine-tuned it.

0:50:31 - 0:50:36     Text: And so they fine-tuned it on one particular task, or family tasks, called natural language

0:50:36 - 0:50:38     Text: inference.

0:50:38 - 0:50:43     Text: So in natural language inference, we're labeling pairs of sentences as entailing or contradictory

0:50:43 - 0:50:44     Text: to each other in neutral.

0:50:44 - 0:50:50     Text: So you have a premise, and you hold the premise as sort of true, the man is in the doorway.

0:50:50 - 0:50:54     Text: And you have a hypothesis, the person is near the door.

0:50:54 - 0:50:59     Text: If this person is referring to that man, then, you know, it's sort of like, oh, yeah,

0:50:59 - 0:51:04     Text: so this is sort of entailed because there's a person, because the man is a person, and

0:51:04 - 0:51:06     Text: they're in the doorway, then they are near the door.

0:51:06 - 0:51:11     Text: So you have this sort of logical reasoning that you're doing, or you're supposed to be

0:51:11 - 0:51:14     Text: able to be doing, and you're labeling these sentences.

0:51:14 - 0:51:15     Text: So it's a labeled task.

0:51:15 - 0:51:21     Text: You've got sort of an input that's cut into two parts, and then one of three outputs.

0:51:21 - 0:51:25     Text: Okay, so the GPT paper evaluates on this task.

0:51:25 - 0:51:28     Text: But what they've got is a transformer decoder.

0:51:28 - 0:51:30     Text: So what do they do?

0:51:30 - 0:51:37     Text: This is sort of one of the earlier examples of, you know, taking, instead of changing your

0:51:37 - 0:51:42     Text: neural network architecture to adapt to the kind of task you're doing, you're going to

0:51:42 - 0:51:49     Text: just format the task as like a bunch of tokens and not change your architecture.

0:51:49 - 0:51:53     Text: Because the pre-training was so useful, it's probably better to keep the architecture

0:51:53 - 0:51:59     Text: fixed, pre-training it, and then change the task specification to sort of fit the pre-trained

0:51:59 - 0:52:00     Text: architecture.

0:52:00 - 0:52:05     Text: So what they did, right, they put this token start, this is a special token, the man is

0:52:05 - 0:52:09     Text: in the doorway, some delimiter token, right.

0:52:09 - 0:52:15     Text: So this is just a linear sequence of tokens that we're giving as one big prefix to GPT.

0:52:15 - 0:52:21     Text: And then the person is near the door, and then some extra token here, right, extract.

0:52:21 - 0:52:25     Text: And then, you know, the linear classifier that we talked about, and sort of the first

0:52:25 - 0:52:32     Text: way to interact with models, with decoder models, it's applied to the representation of the

0:52:32 - 0:52:34     Text: extract token, right.

0:52:34 - 0:52:39     Text: So you have the last hidden state on top of extract, and then you fine tune the whole

0:52:39 - 0:52:41     Text: network to predict these labels, right.

0:52:41 - 0:52:48     Text: And so this sort of input formatting is increasingly, increasingly used to keep the model architecture

0:52:48 - 0:52:53     Text: the same and allow for a variety of different problems to be solved with it.

0:52:53 - 0:52:55     Text: Okay, and so did it work?

0:52:55 - 0:52:58     Text: Unnatural language inference, the answer is yes.

0:52:58 - 0:53:00     Text: So there's a number of different numbers here.

0:53:00 - 0:53:01     Text: I wouldn't worry too much about it.

0:53:01 - 0:53:06     Text: The fine tune transformer language model is sort of what you should pay attention to.

0:53:06 - 0:53:09     Text: There's a lot of effort that went into the other models, right.

0:53:09 - 0:53:11     Text: And so this is the story of pre-training.

0:53:11 - 0:53:15     Text: People put a lot of effort into models that do various sort of careful things.

0:53:15 - 0:53:20     Text: And then you take a single transformer and you say, I'm going to pre-training it on a

0:53:20 - 0:53:24     Text: ton of text and not worry too much about anything else and just fine tune it, and you end up

0:53:24 - 0:53:28     Text: doing super, super well.

0:53:28 - 0:53:32     Text: Sometimes not too much better in the GPT case than sort of the best known state of the

0:53:32 - 0:53:36     Text: art methods, but usually a little bit better.

0:53:36 - 0:53:39     Text: And again, the amount of effort, the amount of tasks, specific effort that you have to put

0:53:39 - 0:53:41     Text: into it, it's very low.

0:53:41 - 0:53:46     Text: Okay, and so what about the other way of interacting with decoters, right.

0:53:46 - 0:53:49     Text: So we had, we said that we can interact with decoters just by sampling from them, just

0:53:49 - 0:53:52     Text: by saying, well, there are probability distributions.

0:53:52 - 0:53:55     Text: So we can use them in their capacities as language models.

0:53:55 - 0:54:02     Text: And so GPT 2, this is just really just a bigger GPT, and we're too much about it, with

0:54:02 - 0:54:05     Text: larger hidden units, more layers.

0:54:05 - 0:54:10     Text: When it was trained on more data, it was shown to produce sort of relatively convincing

0:54:10 - 0:54:11     Text: samples of natural language.

0:54:11 - 0:54:14     Text: So this is something that went around Twitter a lot, right.

0:54:14 - 0:54:20     Text: So you have this sort of contrived example that probably didn't show up in the training

0:54:20 - 0:54:24     Text: data that has a scientist discovering a herd of unicorns.

0:54:24 - 0:54:31     Text: And then they sort of sample from a, almost the distribution of the model.

0:54:31 - 0:54:36     Text: They sort of give the model some extra credit here.

0:54:36 - 0:54:42     Text: They do something called truncating the distribution of the language models, sort of cut out noise

0:54:42 - 0:54:44     Text: at GPT 2.

0:54:44 - 0:54:52     Text: So it's not exactly a perfect sample, but more or less GPT 2 generated this.

0:54:52 - 0:54:56     Text: And so you have the scientist discovering unicorns, and then, you know, you have this

0:54:56 - 0:55:00     Text: consistency, okay, there's the scientist.

0:55:00 - 0:55:03     Text: You know, you have them giving you the name.

0:55:03 - 0:55:11     Text: You have, you refer back to this, well, yeah, you refer back to the scientist's name.

0:55:11 - 0:55:13     Text: You sort of have these like topic consistency things.

0:55:13 - 0:55:15     Text: Also the syntax is really good.

0:55:15 - 0:55:18     Text: It looks, you know, vaguely like English.

0:55:18 - 0:55:20     Text: And so this is sort of continued to be a trend.

0:55:20 - 0:55:23     Text: As we get larger and larger language models, we actually sample from them, even when we

0:55:23 - 0:55:29     Text: give them prompts that look sort of odd, and they seem to be increasingly convincing.

0:55:29 - 0:55:31     Text: Okay.

0:55:31 - 0:55:36     Text: So pre-training encoders, okay.

0:55:36 - 0:55:37     Text: Pre-training encoders.

0:55:37 - 0:55:42     Text: So let's take another second because I need some more water here.

0:55:42 - 0:55:44     Text: If there's another question, let me know.

0:55:44 - 0:55:53     Text: All right.

0:55:53 - 0:55:59     Text: So the benefit of encoders that we talked about was that they get this bidirectional context.

0:55:59 - 0:56:05     Text: So you can, while you're building representations of your sentence, of your parts of sentences,

0:56:05 - 0:56:08     Text: you can look to the future and that can help you build a better representation of the word

0:56:08 - 0:56:10     Text: that you're looking at right now.

0:56:10 - 0:56:13     Text: But the big problem is that we can't do language modeling now.

0:56:13 - 0:56:17     Text: So we've pretty much only said, we like, we've relied on this task that we already knew about

0:56:17 - 0:56:19     Text: language modeling to do our pre-training.

0:56:19 - 0:56:21     Text: But now we want to pre-training coders.

0:56:21 - 0:56:23     Text: And so we can't, we can't use it.

0:56:23 - 0:56:27     Text: So what are we going to do?

0:56:27 - 0:56:32     Text: Here's the solution that was come up with a paper that introduced the language model of

0:56:32 - 0:56:34     Text: the model called Bert.

0:56:34 - 0:56:37     Text: It's called masked language modeling.

0:56:37 - 0:56:39     Text: So here's the idea.

0:56:39 - 0:56:43     Text: We get the sentence and then we just take a fraction of the words and we replace them

0:56:43 - 0:56:46     Text: with a sort of a mask token.

0:56:46 - 0:56:50     Text: A token that's, that means you don't know what this is right now.

0:56:50 - 0:56:53     Text: And then you predict these words.

0:56:53 - 0:56:54     Text: Some details we'll get into in the next slide.

0:56:54 - 0:56:56     Text: But so here's what it looks like.

0:56:56 - 0:57:01     Text: We have the sentence, I mask to the mask.

0:57:01 - 0:57:03     Text: We get some hidden states for all of them, right?

0:57:03 - 0:57:08     Text: So we haven't changed the transformer encoder at all.

0:57:08 - 0:57:10     Text: We've just said, okay, here's like this sequence.

0:57:10 - 0:57:12     Text: You get to see everything, right?

0:57:12 - 0:57:13     Text: Look at all the arrows going everywhere.

0:57:13 - 0:57:19     Text: But then, right, we have this prediction layer that we're, that we're, that we're pre-training,

0:57:19 - 0:57:20     Text: right?

0:57:20 - 0:57:21     Text: And we're using it.

0:57:21 - 0:57:26     Text: We only have loss on the words where we had masks here.

0:57:26 - 0:57:31     Text: So I had this masked and then I have to predict that it was went that went here and store

0:57:31 - 0:57:32     Text: that went here.

0:57:32 - 0:57:36     Text: And now this is a lot like language modeling you might say.

0:57:36 - 0:57:39     Text: But now you don't need to have this sort of left to right decomposition.

0:57:39 - 0:57:43     Text: You're saying, I'm going to remove some of the words and you have to predict what they

0:57:43 - 0:57:44     Text: are.

0:57:44 - 0:57:46     Text: This is called masked language modeling.

0:57:46 - 0:57:49     Text: And it's been very, very, very effective with a quick caveat.

0:57:49 - 0:57:51     Text: It gets a little more complicated.

0:57:51 - 0:57:54     Text: So, so what did they actually do?

0:57:54 - 0:57:56     Text: They, they proposed masked language modeling.

0:57:56 - 0:57:59     Text: And they released the weights of this, of this pre-trained transformer.

0:57:59 - 0:58:03     Text: So the little bit more complexity to get masked language modeling to work.

0:58:03 - 0:58:09     Text: So you are going to take a random 15% of the sub word tokens.

0:58:09 - 0:58:10     Text: That was, that was true.

0:58:10 - 0:58:14     Text: But you're not always going to replace them with mask.

0:58:14 - 0:58:19     Text: You can think of it like, if the model sees a mask token, it gets a guarantee that it

0:58:19 - 0:58:21     Text: needs to predict something.

0:58:21 - 0:58:26     Text: And if the model doesn't see a mask token, it gets a guarantee that it doesn't need to

0:58:26 - 0:58:27     Text: predict anything.

0:58:27 - 0:58:33     Text: So why should it bother building strong representations of the words that aren't masked?

0:58:33 - 0:58:36     Text: And I want my model to build strong representations of everything.

0:58:36 - 0:58:38     Text: So we're going to add some sort of uncertainty to the model.

0:58:38 - 0:58:43     Text: So what we're going to do is, for those 15% of tokens, 80% of the time, we're going

0:58:43 - 0:58:44     Text: to replace it with a mask.

0:58:44 - 0:58:48     Text: That was our original idea of mask language modeling.

0:58:48 - 0:58:52     Text: Then 10% of the time, we're actually going to replace the word with just a random token.

0:58:52 - 0:58:56     Text: Just a random vocabulary item can be anything.

0:58:56 - 0:58:59     Text: And then the other 10% of the time, we're going to leave the word unchanged.

0:58:59 - 0:59:03     Text: So now, it sees a word.

0:59:03 - 0:59:06     Text: It could be a random token, or it could be unchanged.

0:59:06 - 0:59:10     Text: And if I see a mask, I know I need to predict it.

0:59:10 - 0:59:15     Text: So what these two things do here is say, you have to sort of be doing this, you have to

0:59:15 - 0:59:18     Text: be on your toes for every word in your representation.

0:59:18 - 0:59:22     Text: So here, I pizza to the mask.

0:59:22 - 0:59:27     Text: And it turns out, and the model didn't know this, but it's getting three lost terms for

0:59:27 - 0:59:28     Text: this sentence.

0:59:28 - 0:59:32     Text: It only has one mask, but it's going to be penalized for predicting three different things.

0:59:32 - 0:59:35     Text: And it needs to predict that this word is actually went.

0:59:35 - 0:59:37     Text: So I replaced this one.

0:59:37 - 0:59:41     Text: It needs to predict that this word is two, is in fact the word two.

0:59:41 - 0:59:46     Text: And then it needs to predict that this word is in fact store.

0:59:46 - 0:59:49     Text: Now as a short interlude, you might be thinking, you might be thinking, John, there's no way

0:59:49 - 0:59:52     Text: the model could know this.

0:59:52 - 0:59:54     Text: It's so under specified.

0:59:54 - 0:59:56     Text: I pizza is a little weird, I admit.

0:59:56 - 0:59:58     Text: But there's just no way to know that this is store or in went into.

0:59:58 - 1:00:01     Text: I mean, the same thing is true of language modeling.

1:00:01 - 1:00:05     Text: So it's going to end up learning these average statistics about what things tend to be in

1:00:05 - 1:00:06     Text: the given context.

1:00:06 - 1:00:10     Text: And it's going to sort of hedge its bets and try to build a distribution of what things

1:00:10 - 1:00:12     Text: could appear there.

1:00:12 - 1:00:14     Text: So for the people who are thinking that, if there wasn't, that's what you should be

1:00:14 - 1:00:15     Text: thinking.

1:00:15 - 1:00:18     Text: It has to sort of know what kinds of things will end up in these slots.

1:00:18 - 1:00:23     Text: It has other uncertainty, because it can't be sure that any of the other words are necessarily

1:00:23 - 1:00:25     Text: right.

1:00:25 - 1:00:30     Text: And then it is, it's predicting these three words.

1:00:30 - 1:00:36     Text: And so you can see why it's important to not just have masks potentially, to have these

1:00:36 - 1:00:41     Text: sort of token randomization things, because again, we don't actually care about its ability

1:00:41 - 1:00:43     Text: to predict the masks.

1:00:43 - 1:00:48     Text: I'm not going to usually, I'm not going to actually sample from the model's distribution

1:00:48 - 1:00:50     Text: over what should go here.

1:00:50 - 1:00:56     Text: Instead, I am going to use the parameters of the neural network and expect that it built

1:00:56 - 1:00:58     Text: strong representations of language.

1:00:58 - 1:01:02     Text: So I don't want it to think it's got a free pass for representing something if it doesn't

1:01:02 - 1:01:06     Text: have a mask there.

1:01:06 - 1:01:14     Text: So there was one extra thing with the BERT pre-training, which is a next sentence prediction

1:01:14 - 1:01:15     Text: objective.

1:01:15 - 1:01:17     Text: So the input to BERT looks like this.

1:01:17 - 1:01:19     Text: This is straight from the BERT paper.

1:01:19 - 1:01:24     Text: You have a label here before your first sentence, and then a separation, and then a second

1:01:24 - 1:01:25     Text: sentence.

1:01:25 - 1:01:29     Text: So you had always two contiguous chunks of text.

1:01:29 - 1:01:31     Text: You had a first chunk of text here.

1:01:31 - 1:01:33     Text: My dog is cute.

1:01:33 - 1:01:35     Text: And then a second chunk of text, he likes playing.

1:01:35 - 1:01:38     Text: You can see the sub words there.

1:01:38 - 1:01:42     Text: And now these would actually be both be much longer.

1:01:42 - 1:01:47     Text: So these whole thing would be 512 words, and it would be about half, and that would be

1:01:47 - 1:01:51     Text: about half, and they'd be contiguous chunks of text.

1:01:51 - 1:01:53     Text: But here was the deal.

1:01:53 - 1:01:57     Text: What they wanted to do was they wanted to try to teach the system to understand sort of

1:01:57 - 1:02:01     Text: relationships between different whole pieces of text.

1:02:01 - 1:02:06     Text: In order to better pre-trained for downstream applications like question answering, where

1:02:06 - 1:02:11     Text: you have two pretty different pieces of text, and you need to know how they relate to

1:02:11 - 1:02:12     Text: each other.

1:02:12 - 1:02:18     Text: So the objective they came up with was you should sometimes have the second chunk of text

1:02:18 - 1:02:26     Text: be the actual chunk of text that directly follows the first in your data set, and sometimes

1:02:26 - 1:02:32     Text: have the second chunk of text be randomly sampled from somewhere else, so unrelated.

1:02:32 - 1:02:37     Text: And the model should predict whether it's the first case or the second.

1:02:37 - 1:02:41     Text: In order, again, to sort of have to reason about the relationships between the two chunks

1:02:41 - 1:02:42     Text: of text.

1:02:42 - 1:02:44     Text: So this is next sentence prediction.

1:02:44 - 1:02:48     Text: I think it's important to think about because it's a very different idea of pre-training

1:02:48 - 1:02:53     Text: objective than language modeling and masked language modeling.

1:02:53 - 1:02:58     Text: Even though later we're sort of argued that in the case of BERT, it's not necessary or

1:02:58 - 1:02:59     Text: useful.

1:02:59 - 1:03:06     Text: And one of the arguments is actually because it's actually way better to have a single

1:03:06 - 1:03:12     Text: context that's twice as long, so you can learn even longer distance dependencies and things.

1:03:12 - 1:03:15     Text: And so whether the objective itself would be useful if you could always just double

1:03:15 - 1:03:18     Text: the context size, I'm not sure if anyone's done research on that.

1:03:18 - 1:03:22     Text: But again, it's like a different kind of objective, and it's still noisy something about

1:03:22 - 1:03:23     Text: the input, right?

1:03:23 - 1:03:28     Text: The input was this big chunk of text, and you've noise it to say like, now you don't know

1:03:28 - 1:03:32     Text: whether it really was that or whether you sort of replaced it with a bunch of garbage,

1:03:32 - 1:03:39     Text: this sort of second portion here, whether the second portion has been replaced with something

1:03:39 - 1:03:44     Text: that didn't actually come from the same sequence.

1:03:44 - 1:03:49     Text: Okay, so let's talk some details about BERT.

1:03:49 - 1:03:53     Text: So BERT had 12 or 24 layers, depending on BERT base or BERT large.

1:03:53 - 1:03:57     Text: You'll probably use one of these models or one of the sort of descendants of these models

1:03:57 - 1:04:02     Text: if you choose to do something with the custom final project potentially, or if you choose

1:04:02 - 1:04:06     Text: the version of the default final project.

1:04:06 - 1:04:11     Text: And you had a 600 or a 1000 dimension hidden states, a bunch of attention heads, so this

1:04:11 - 1:04:14     Text: is that multi-headed attention, remember, about a bunch of them.

1:04:14 - 1:04:19     Text: So you're splitting all your dimensions into those 16 heads, and we're talking on the

1:04:19 - 1:04:23     Text: order of a couple hundred million parameters.

1:04:23 - 1:04:28     Text: At the time, right in 2018, we were like, whoa, that's a lot of parameters.

1:04:28 - 1:04:32     Text: How do you, that's a lot of parameters.

1:04:32 - 1:04:35     Text: And now, models are way, way, way, way bigger.

1:04:35 - 1:04:39     Text: So let's keep track of sort of the model sizes as we're going through this.

1:04:39 - 1:04:42     Text: And let's come back now to the corpus sizes as well.

1:04:42 - 1:04:43     Text: So we have books corpus.

1:04:43 - 1:04:45     Text: And this is the number of words there.

1:04:45 - 1:04:50     Text: This is the same thing that GPT-1 was trained on, 800 million words.

1:04:50 - 1:04:56     Text: Now we're going to train on also English Wikipedia, it's 250, sorry, that's 2,500 million,

1:04:56 - 1:04:59     Text: so that's 2,500,000,000 words.

1:04:59 - 1:05:06     Text: And again, to give you an idea of what is done in practice, right, pre-training is expensive

1:05:06 - 1:05:11     Text: and impractical for most users, let's say.

1:05:11 - 1:05:16     Text: So if you are a researcher with a GPU or five GPUs or something like that, you tend to

1:05:16 - 1:05:20     Text: not really be pre-training your whole own BERT model unless you're willing to spend

1:05:20 - 1:05:22     Text: a long time doing it.

1:05:22 - 1:05:25     Text: BERT itself was pre-trained with 64 TPU chips.

1:05:25 - 1:05:31     Text: A TPU is a special kind of hardware accelerator that accelerates the tensor operations effectively

1:05:31 - 1:05:35     Text: is developed by Google.

1:05:35 - 1:05:40     Text: So TPUs are just fast and can hold a lot.

1:05:40 - 1:05:42     Text: And for four days they had 64 chips.

1:05:42 - 1:05:46     Text: So if you have one GPU which you can think of as less than a single TPU, you're going

1:05:46 - 1:05:48     Text: to be waiting a long time to pre-training.

1:05:48 - 1:05:54     Text: But fine-tuning is so fast, it's so fast and impractical, it's common on a single

1:05:54 - 1:06:00     Text: GPU, you'll see how much faster fine-tuning is than pre-training in assignment five.

1:06:00 - 1:06:06     Text: And so this becomes, I think, a refrain of the field, you pre-trained once or handful

1:06:06 - 1:06:11     Text: of times, right, like a couple of people released big pre-trained models and then you fine-tune

1:06:11 - 1:06:15     Text: many times, right, so you save those parameters from pre-training and you fine-tune on all

1:06:15 - 1:06:20     Text: kinds of different problems.

1:06:20 - 1:06:25     Text: And that paradigm, right, taking something like Bert or whatever the best descendant of

1:06:25 - 1:06:31     Text: Bert is and taking it pre-trained and then fine-tuning it on what you want is pretty

1:06:31 - 1:06:37     Text: close to, you know, it's a very, very strong baseline in NLP right now, right?

1:06:37 - 1:06:40     Text: So and the simplicity is pretty fascinating.

1:06:40 - 1:06:46     Text: And there's one code base called Transformers from a company called Hugging Face that

1:06:46 - 1:06:51     Text: makes this just really just a couple of lines of Python to try out as well.

1:06:51 - 1:06:57     Text: So it sort of opened up very strong baselines without too, too much effort for a lot of

1:06:57 - 1:06:58     Text: tasks.

1:06:58 - 1:07:01     Text: Okay, so let's talk about evaluation.

1:07:01 - 1:07:06     Text: So pre-training is pitched as requiring all this different kind of language understanding.

1:07:06 - 1:07:11     Text: And the field is, the field of NLP has a hard time doing evaluation.

1:07:11 - 1:07:15     Text: But we try our best and we build datasets that we think are hard for various reasons because

1:07:15 - 1:07:19     Text: they require you to know stuff about language and about the world and about reasoning.

1:07:19 - 1:07:26     Text: And so when we evaluate whether pre-training is getting you a lot of sort of general knowledge,

1:07:26 - 1:07:30     Text: we evaluate on a lot of these tasks.

1:07:30 - 1:07:37     Text: So we evaluate on things like paraphrase detection on core questions.

1:07:37 - 1:07:39     Text: Natural language inference we saw.

1:07:39 - 1:07:43     Text: We have hard sentiment analysis datasets or what we're hard sentiment analysis datasets

1:07:43 - 1:07:45     Text: a couple of years ago.

1:07:45 - 1:07:50     Text: And actually, figuring out if sentences are grammatical tends to be hard.

1:07:50 - 1:07:54     Text: Determining the semantic similarity of text can be hard.

1:07:54 - 1:07:55     Text: Paraphrasing again.

1:07:55 - 1:07:57     Text: Natural language inference on a very, very small dataset.

1:07:57 - 1:08:01     Text: So this is this pre-training help you train on smaller datasets.

1:08:01 - 1:08:03     Text: The answer is yes, sort of thing.

1:08:03 - 1:08:09     Text: And so the birth folks released their paper after GPT was released.

1:08:09 - 1:08:13     Text: And there were a lot of sort of state of the art results that came from various things

1:08:13 - 1:08:16     Text: that you were supposed to be doing.

1:08:16 - 1:08:22     Text: And the results that you get sort of with pre-training, so here's open AI, GPT, here's

1:08:22 - 1:08:23     Text: birth base and large.

1:08:23 - 1:08:25     Text: The last three rows are all pre-trained.

1:08:25 - 1:08:32     Text: Elmo is sort of in the middle between pre-training the whole model and just having word embeddings.

1:08:32 - 1:08:34     Text: That's what this is.

1:08:34 - 1:08:39     Text: And the numbers you get are just, I think, to the field where quite astounding actually.

1:08:39 - 1:08:44     Text: We were all surprised that there was that much left to even be gotten on some of these datasets.

1:08:44 - 1:08:49     Text: And taking here, so this line in the table is unmarked when it's actually the number

1:08:49 - 1:08:50     Text: of training examples.

1:08:50 - 1:08:53     Text: This dataset has 2.5,000 training examples.

1:08:53 - 1:08:59     Text: And before sort of the big transformers came around, we had 60% accuracy on it.

1:08:59 - 1:09:01     Text: We run transformers on it.

1:09:01 - 1:09:03     Text: We get 10 points just by pre-training.

1:09:03 - 1:09:07     Text: And this has been a trend that has just continued.

1:09:07 - 1:09:11     Text: So why do anything but pre-trained encoders?

1:09:11 - 1:09:13     Text: We know encoders are good.

1:09:13 - 1:09:15     Text: We like the fact that you have bidirectional context.

1:09:15 - 1:09:18     Text: We also saw that BERT did better than GPT.

1:09:18 - 1:09:27     Text: But if you want to actually get it to do things, you can't just generate sequences from

1:09:27 - 1:09:32     Text: it the same way that you would from a model like GPT, a pre-trained decoder.

1:09:32 - 1:09:34     Text: You can sort of sample what things should go in a mask.

1:09:34 - 1:09:39     Text: So here's a mask. You can put a mask somewhere, sample the words that should go there.

1:09:39 - 1:09:42     Text: But if you want to sample whole context, right, if you want to get that story about the

1:09:42 - 1:09:46     Text: unicorns, for example, the encoder is not what you want to do.

1:09:46 - 1:09:51     Text: So they have sort of different contracts, and they can be used naturally at least in

1:09:51 - 1:09:53     Text: different ways.

1:09:53 - 1:09:57     Text: Okay, so let's talk very briefly about extensions of BERT.

1:09:57 - 1:10:00     Text: So they're BERT variants like Roberta and Spanbert.

1:10:00 - 1:10:04     Text: And there's just a bunch of papers with the word BERT in the title that did various things.

1:10:04 - 1:10:06     Text: Two very strong takeaways.

1:10:06 - 1:10:08     Text: Roberta, train BERT longer.

1:10:08 - 1:10:10     Text: BERT is underfit.

1:10:10 - 1:10:11     Text: Train it on more data.

1:10:11 - 1:10:13     Text: Train it for more steps.

1:10:13 - 1:10:18     Text: Spanbert, mask, contiguous spans of sub words.

1:10:18 - 1:10:21     Text: Words makes a harder, more useful pre-training task.

1:10:21 - 1:10:25     Text: So this is the idea that we can come up with better ways of noisy the input, of hiding

1:10:25 - 1:10:30     Text: stuff in the input, or breaking stuff in the input for our model to correct.

1:10:30 - 1:10:37     Text: So for example, if you have the sentence mask, ear, razz, razz, good, it's just not that

1:10:37 - 1:10:43     Text: hard to know that this is irresistibly, right, because like what could this possibly

1:10:43 - 1:10:44     Text: be after these sub words?

1:10:44 - 1:10:51     Text: So this is irresist, you know, something's about to come here and it's probably the end

1:10:51 - 1:10:52     Text: of that word.

1:10:52 - 1:10:57     Text: Whereas if you mask a long sequence of things, right now this is much harder, and actually

1:10:57 - 1:11:02     Text: you're getting a useful signal that is irresistibly good, and you sort of needed to mask all of

1:11:02 - 1:11:04     Text: them to make the task interesting.

1:11:04 - 1:11:08     Text: So Spanbert was like, oh, you should do this.

1:11:08 - 1:11:10     Text: This was super useful as well.

1:11:10 - 1:11:15     Text: So Roberta, just to point you at the fact that Roberta showed that BERT was underfit,

1:11:15 - 1:11:21     Text: you know, he said, BERT was trained on about 13 gigabytes of text, it got some accuracies,

1:11:21 - 1:11:27     Text: you can get above the amazing results of BERT, four extra points or so here, right, just

1:11:27 - 1:11:35     Text: by taking the identical model and training it on more data, the larger batch size for

1:11:35 - 1:11:36     Text: a long time.

1:11:36 - 1:11:41     Text: And if you train it, yeah, even longer without sort of more data, you don't get any

1:11:41 - 1:11:45     Text: better.

1:11:45 - 1:11:49     Text: Very briefly, okay, so very briefly on the encoder decoders.

1:11:49 - 1:11:53     Text: So we've seen decoders can be good because we get to play with the contracts that they

1:11:53 - 1:11:57     Text: give us, we get to play with them as language models, encoders give us that bidirectional

1:11:57 - 1:11:58     Text: context.

1:11:58 - 1:12:02     Text: So encoder decoders, maybe we get both.

1:12:02 - 1:12:04     Text: In practice, they're actually, yeah, pretty strong.

1:12:04 - 1:12:11     Text: So there was a, right, we could, so I guess one of the questions is like, what do we do

1:12:11 - 1:12:13     Text: to pre-train them?

1:12:13 - 1:12:18     Text: So we could do something like language modeling, right, where we take a sequence of words,

1:12:18 - 1:12:27     Text: one to word two t instead of t, right, and so as I have word one here, dot, dot, dot,

1:12:27 - 1:12:32     Text: word t, we provide those all to our encoder and we predict on none of them.

1:12:32 - 1:12:37     Text: And then we have word t plus one to word two t here in our decoder, right, and we predict

1:12:37 - 1:12:38     Text: on these.

1:12:38 - 1:12:42     Text: So we're doing language modeling on half the sequence and we've taken the other half

1:12:42 - 1:12:46     Text: to have our bidirectional encoder, right, so we're building strong representations on

1:12:46 - 1:12:52     Text: the encoder side, not predicting language modeling on any of this.

1:12:52 - 1:12:55     Text: And then we, on the other half of the tokens, we predict, you know, as a language model

1:12:55 - 1:12:57     Text: would do.

1:12:57 - 1:13:01     Text: And the hope is that you sort of pre-trained both of these well through the one language

1:13:01 - 1:13:04     Text: modeling loss up here.

1:13:04 - 1:13:06     Text: And this is actually, so this works pretty well.

1:13:06 - 1:13:11     Text: The encoder benefits from bidirectionality, the decoder, you can use to train the model.

1:13:11 - 1:13:19     Text: But what this paper showed that introduced the Model T5, roughly at all, found to work

1:13:19 - 1:13:22     Text: best was actually a very, or at least a somewhat different objective.

1:13:22 - 1:13:26     Text: And this should keep in your mind sort of that we have different ways of specifying the

1:13:26 - 1:13:30     Text: pre-training objectives and they will really work differently from each other.

1:13:30 - 1:13:33     Text: So what they said, let's say you have an original text like this.

1:13:33 - 1:13:37     Text: Thank you for inviting me to your party last week.

1:13:37 - 1:13:43     Text: We're going to define variable length spans in the text to replace with a unique symbol

1:13:43 - 1:13:46     Text: that says something is missing here.

1:13:46 - 1:13:48     Text: And then we'll replace and then we'll remove that.

1:13:48 - 1:13:57     Text: So now our input to our encoder is thank you symbol one, me to your party symbol to week.

1:13:57 - 1:14:01     Text: So we've noise the input, we've hidden stuff in the input.

1:14:01 - 1:14:05     Text: Also really interestingly, this doesn't say how long this is supposed to be.

1:14:05 - 1:14:07     Text: That's different from BERT.

1:14:07 - 1:14:10     Text: BERT said, oh, you masked this many sub words.

1:14:10 - 1:14:13     Text: This says, well, I got some token that says something's missing here.

1:14:13 - 1:14:14     Text: And I don't know what it is.

1:14:14 - 1:14:17     Text: I don't even know how many sub words it is.

1:14:17 - 1:14:24     Text: And then so you have this in your encoder and then your decoder predicts the first special

1:14:24 - 1:14:27     Text: word, this x here.

1:14:27 - 1:14:30     Text: And then what was missing for inviting.

1:14:30 - 1:14:33     Text: So thank you x for inviting.

1:14:33 - 1:14:34     Text: And then it predicts y.

1:14:34 - 1:14:35     Text: Here's this y here.

1:14:35 - 1:14:40     Text: And then what was missing from the y last week.

1:14:40 - 1:14:42     Text: This is called span corruption.

1:14:42 - 1:14:47     Text: And it's really interesting to me because in terms of the actual encoder decoder, we don't

1:14:47 - 1:14:51     Text: have to change it compared to whether we, if we were just doing language modeling pre-training.

1:14:51 - 1:14:54     Text: Because I just do language modeling on all these things.

1:14:54 - 1:14:57     Text: I just predict these words as if I'm a language model.

1:14:57 - 1:15:01     Text: I've just done a text pre-processing step.

1:15:01 - 1:15:06     Text: So the actual, I've just pre-processed the text to look like, oh, yeah, take the input,

1:15:06 - 1:15:11     Text: make it look like this, then make an output that looks like that up there.

1:15:11 - 1:15:15     Text: And the model gets to do what is effectively language modeling, but it actually works better.

1:15:15 - 1:15:18     Text: So there's a lot of numbers I realize.

1:15:18 - 1:15:20     Text: But look at the star here.

1:15:20 - 1:15:25     Text: This encoder decoder with a denoising objective that tends to work the best.

1:15:25 - 1:15:31     Text: And they tried similar models like a prefix language model that was sort of the first

1:15:31 - 1:15:35     Text: try that we had at defining a pre-training objective for language models, sorry, for encoder

1:15:35 - 1:15:38     Text: decoders.

1:15:38 - 1:15:41     Text: And then they had another, a number of other options, but what worked best for the encoder

1:15:41 - 1:15:43     Text: decoders.

1:15:43 - 1:15:48     Text: And one of the fascinating things about T5 is that you could pre-train it and fine tune

1:15:48 - 1:15:54     Text: it on questions like when was Franklin D. Roosevelt born and fine tune it to produce the

1:15:54 - 1:15:55     Text: answer.

1:15:55 - 1:15:58     Text: And then you could ask it new questions at test time.

1:15:58 - 1:16:02     Text: And then it would retrieve the answer from its parameters with some accuracy.

1:16:02 - 1:16:05     Text: And it would do so relatively well actually.

1:16:05 - 1:16:10     Text: And it would do so maybe 25% of the time on some of these data sets with 220 million

1:16:10 - 1:16:11     Text: parameters.

1:16:11 - 1:16:15     Text: And then at 11 billion parameters, this is way bigger than Bert large.

1:16:15 - 1:16:20     Text: It would do so even better, sometimes even doing as well as systems that were allowed

1:16:20 - 1:16:22     Text: to look at stuff other than their own parameters.

1:16:22 - 1:16:26     Text: So again, this is just making this answer come from its parameters.

1:16:26 - 1:16:30     Text: Yeah, I'm going to have to skip this.

1:16:30 - 1:16:35     Text: So if you look back at this slide after class, I have each of the examples of the things

1:16:35 - 1:16:40     Text: that we could imagine learning from pre-training with a label of what you might be learning.

1:16:40 - 1:16:43     Text: So this example is 10 for universities located in blank.

1:16:43 - 1:16:44     Text: You might learn trivia.

1:16:44 - 1:16:46     Text: In all these cases, there's all these things you can learn.

1:16:46 - 1:16:53     Text: One thing I will say is that models also learn and can make even worse racism, sexism,

1:16:53 - 1:16:56     Text: all manner of bad biases that are encoded in our text.

1:16:56 - 1:17:00     Text: When I say, yeah, they do this.

1:17:00 - 1:17:02     Text: And so we'll learn more about this in our later lectures, but it's important to keep

1:17:02 - 1:17:06     Text: in mind that when you're doing pre-training, you're learning a lot of stuff, and not all

1:17:06 - 1:17:09     Text: of it is good.

1:17:09 - 1:17:16     Text: So with GPT-3, the last thing here is that there's this third way of interacting with models

1:17:16 - 1:17:19     Text: that's related to treating them as language models.

1:17:19 - 1:17:25     Text: So GPT-3 is this very, very large model that was released by OpenAI.

1:17:25 - 1:17:31     Text: But it seems to be able to learn from examples in their context, their decoder context,

1:17:31 - 1:17:36     Text: without gradient steps, simply by looking sort of within their history.

1:17:36 - 1:17:40     Text: And now GPT-3 has 175 billion parameters, right?

1:17:40 - 1:17:44     Text: The last T5 model we saw was 11 billion parameters.

1:17:44 - 1:17:48     Text: And it seems to be sort of the canonical example of this working.

1:17:48 - 1:17:52     Text: And so what it looks like is you give it as part of its prefix.

1:17:52 - 1:17:56     Text: This goes to Merci, hello, goes to Mint, goes to writes, you've got these translation

1:17:56 - 1:18:03     Text: examples, you ask it for the last one, and it comes up with the correct translation.

1:18:03 - 1:18:06     Text: Seemingly because it's learned something about the task that you're sort of telling

1:18:06 - 1:18:08     Text: it to do through its prefix.

1:18:08 - 1:18:10     Text: And so you might do the same thing with addition.

1:18:10 - 1:18:15     Text: So something, if I plus eight is 13, give it addition examples, you might do the next

1:18:15 - 1:18:18     Text: addition example for you.

1:18:18 - 1:18:24     Text: Or maybe trying to figure out grammatical or spelling errors, for example.

1:18:24 - 1:18:29     Text: And here's the French case.

1:18:29 - 1:18:33     Text: So again, you're learning just to do pre-training.

1:18:33 - 1:18:40     Text: But when you're evaluating it, you don't even fine tune the model, you just provide prefixes.

1:18:40 - 1:18:43     Text: And so this especially is not well understood.

1:18:43 - 1:18:48     Text: And so a lot of research is going into sort of what the limitations of this so-called

1:18:48 - 1:18:49     Text: in-context learning are.

1:18:49 - 1:18:53     Text: But it's a fascinating direction for future work.

1:18:53 - 1:18:56     Text: In total, these models are not well understood.

1:18:56 - 1:19:01     Text: However, small, small, in-air growth models like Bert have become general tools in a wide

1:19:01 - 1:19:02     Text: range of settings.

1:19:02 - 1:19:06     Text: They do have these issues about learning all these biases about the world.

1:19:06 - 1:19:10     Text: They'll go into and further lectures in this course.

1:19:10 - 1:19:15     Text: And so, yeah, what you've learned this week, transformers and pre-training form the basis

1:19:15 - 1:19:20     Text: or at least the base lines for much of a natural language processing today.

1:19:20 - 1:19:25     Text: And assignment five is out and you'll be able to look more into it.

1:19:25 - 1:19:26     Text: And I'm over time.

1:19:26 - 1:19:27     Text: All right.

1:19:27 - 1:19:28     Text: Yeah.

1:19:28 - 1:19:39     Text: I guess I can take a question if there is any, but people can keep going as well.

1:19:39 - 1:19:50     Text: So I think that I think there's a question about P5, which was how does the D-toder know

1:19:50 - 1:19:52     Text: that I'm currently predicting X for Y?

1:19:52 - 1:19:54     Text: Could you repeat that?

1:19:54 - 1:19:55     Text: Yeah.

1:19:55 - 1:20:00     Text: So about P5, there's a question that was asking how does the D-toder know it's currently

1:20:00 - 1:20:04     Text: predicting X for Y?

1:20:04 - 1:20:09     Text: It's hierarchy of predicting X for Y?

1:20:09 - 1:20:13     Text: I guess it doesn't specify it's going to change how does it know that it's currently

1:20:13 - 1:20:15     Text: predicting X for Y?

1:20:15 - 1:20:16     Text: OK.

1:20:16 - 1:20:17     Text: Yeah.

1:20:17 - 1:20:18     Text: That makes sense.

1:20:18 - 1:20:19     Text: So what it does, right?

1:20:19 - 1:20:23     Text: So it knows from the encoder that it has to at some point predict X and at some point

1:20:23 - 1:20:28     Text: predict Y because the encoder can just like remember that, oh, yeah, there's two things

1:20:28 - 1:20:29     Text: missing.

1:20:29 - 1:20:33     Text: And if there were more spans replaced, there would be a Z and then an A and then a B and

1:20:33 - 1:20:38     Text: you know whatever, just a bunch of unique identifiers.

1:20:38 - 1:20:44     Text: And then up here, it gets to say, OK, I have attention, I suppose.

1:20:44 - 1:20:48     Text: I can look and I know that first I have to predict this first master thing.

1:20:48 - 1:20:52     Text: So I'm going to generate that in my D-coder and then it gets that symbol, right?

1:20:52 - 1:20:55     Text: So we're doing training by giving it the right symbol.

1:20:55 - 1:20:59     Text: Now it gets that X and it says, OK, I'm predicting X now.

1:20:59 - 1:21:01     Text: And now it can predict, predict, predict, predict.

1:21:01 - 1:21:02     Text: Then it gets Y.

1:21:02 - 1:21:06     Text: So we're doing this teacher forcing training where we give it the right answer after penalizing

1:21:06 - 1:21:07     Text: it if it's wrong.

1:21:07 - 1:21:10     Text: Now it gets this Y, right?

1:21:10 - 1:21:12     Text: And it says, OK, now I have to predict what should go and why.

1:21:12 - 1:21:16     Text: And it can attend, you know, into the natural parts of this as well as what it's already

1:21:16 - 1:21:22     Text: predicted here because the decoder has attention within itself and it can see what should go

1:21:22 - 1:21:23     Text: there.

1:21:23 - 1:21:25     Text: So what's fascinating here is you're doing something like language modeling.

1:21:25 - 1:21:29     Text: But when you're predicting Y, right, you get to see what came after it.

1:21:29 - 1:21:31     Text: And that's I think one of the benefits of span corruption.

1:21:31 - 1:21:34     Text: So you're doing this thing where you don't know how long you should be predicting for

1:21:34 - 1:21:39     Text: like language modeling, but you get to know what came after the thing that's missing.