Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 10 - Transformers and Pretraining

0:00:00 - 0:00:07     Text: Hello, everybody.

0:00:07 - 0:00:10     Text: Welcome to CS224N lecture 10.

0:00:10 - 0:00:16     Text: This is going to be primarily on pre-training, but we will also discuss sub-word models a

0:00:16 - 0:00:18     Text: little bit and review transformers.

0:00:18 - 0:00:26     Text: Okay, so we have a lot of exciting things to get into today, but some reminders about

0:00:26 - 0:00:30     Text: the class.

0:00:30 - 0:00:32     Text: Assignment 5 is being released today.

0:00:32 - 0:00:37     Text: Assignment 4 was due a minute ago, so if you are done with that, congratulations.

0:00:37 - 0:00:41     Text: If not, I hope that the late days go well.

0:00:41 - 0:00:48     Text: Assignment 5 is on pre-training and transformers, so these lectures are going to be very useful

0:00:48 - 0:00:52     Text: to you for that, and it doesn't cover anything after these lectures.

0:00:52 - 0:00:54     Text: All right.

0:00:54 - 0:01:00     Text: So today, let's kind of take a little peek through what the outline will be.

0:01:00 - 0:01:04     Text: We haven't talked about sub-word modeling yet and sort of we should have.

0:01:04 - 0:01:06     Text: And so we're going to talk a little bit about sub-words.

0:01:06 - 0:01:11     Text: You saw these in assignment 4, all just, you know, as the data that we provided to you

0:01:11 - 0:01:15     Text: with your machine translation system, but we're going to talk a little bit about why they're

0:01:15 - 0:01:21     Text: so ubiquitous in NLP because they are used in pre-trained models.

0:01:21 - 0:01:26     Text: I mean, they're used in a number of different models, but when we discuss pre-training,

0:01:26 - 0:01:29     Text: it's important to know that sub-words are part of it.

0:01:29 - 0:01:34     Text: Then we'll go on another journey of motivation, motivating

0:01:34 - 0:01:36     Text: model pre-training from word embeddings.

0:01:36 - 0:01:41     Text: So we've already seen pre-training in some sense in the very first lecture of this course

0:01:41 - 0:01:46     Text: because we pre-trained individual word embeddings that don't take into account their contexts

0:01:46 - 0:01:50     Text: on very large text corpora and saw that they were able to encode a lot of useful things

0:01:50 - 0:01:53     Text: about language.

0:01:53 - 0:01:57     Text: So after we do that motivation, we'll go through model pre-training three ways.

0:01:57 - 0:02:00     Text: And we're going to, you know, reference actually the lecture on Tuesday.

0:02:00 - 0:02:02     Text: So this is why we're going to review a little bit of the transformer stuff.

0:02:02 - 0:02:07     Text: We'll talk about model pre-training in decoders, like the transformer decoder that we saw

0:02:07 - 0:02:10     Text: last week, in encoders, and then encoder-decoders.

0:02:10 - 0:02:14     Text: And each of these three cases, we're going to talk a little bit about sort of what things

0:02:14 - 0:02:20     Text: you could be doing and then popular models that are in use across research and in industry.

0:02:20 - 0:02:23     Text: And we're going to talk a little bit about, you know, what do we think pre-training is

0:02:23 - 0:02:24     Text: teaching?

0:02:24 - 0:02:25     Text: This is going to be very brief.

0:02:25 - 0:02:29     Text: Actually, a lot of the interpretability and analysis lecture in two weeks is going

0:02:29 - 0:02:35     Text: to talk more about sort of the mystery and the scientific problem of figuring out what

0:02:35 - 0:02:39     Text: these models are learning about language through pre-training objectives, but we'll sort

0:02:39 - 0:02:40     Text: of get a peek.

0:02:40 - 0:02:43     Text: And then we'll talk about very large models and in context learning.

0:02:43 - 0:02:49     Text: So if you've heard of GPT-3, for example, we're going to just briefly touch on that here

0:02:49 - 0:02:53     Text: and I think we'll discuss more about it in the course later on as well.

0:02:53 - 0:02:55     Text: Okay, so we've got a lot to do.

0:02:55 - 0:02:57     Text: Let's jump right in.

0:02:57 - 0:02:59     Text: So: word structure and sub-word models.

0:02:59 - 0:03:04     Text: Let's think about sort of the assumptions we've been making in this course so far.

0:03:04 - 0:03:09     Text: When we give you an assignment, when we talk about training word2vec, for example,

0:03:09 - 0:03:11     Text: we made this assumption about a language's vocabulary.

0:03:11 - 0:03:15     Text: In particular, we've made this assumption that a language has a fixed vocabulary of something like

0:03:15 - 0:03:18     Text: tens of thousands, maybe a hundred thousand, I don't know, a number of...

0:03:18 - 0:03:23     Text: But some relatively large number of words, it seems, and that seems sort of pretty

0:03:23 - 0:03:27     Text: good so far, at least for what we've done.

0:03:27 - 0:03:32     Text: And we build this vocabulary from the training set that we train, say, word2vec on.

0:03:32 - 0:03:37     Text: And then here's a crucial thing, any novel word, any word that you did not see at training

0:03:37 - 0:03:42     Text: time, is sort of mapped to a single UNK token.

0:03:42 - 0:03:46     Text: There are other ways to handle this, but you sort of have to do something and a frequent

0:03:46 - 0:03:49     Text: method is to map them all to UNK.
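As a minimal sketch of that dictionary look-up behavior (the vocabulary, words, and "<unk>" convention here are illustrative assumptions, not code from any assignment):

    # Hypothetical fixed vocabulary built from the training set.
    vocab = {"i": 0, "learn": 1, "tasty": 2, "tea": 3, "<unk>": 4}

    def to_ids(words, vocab):
        # Any word not seen at training time falls back to the single UNK id.
        return [vocab.get(w, vocab["<unk>"]) for w in words]

    print(to_ids(["i", "learn", "taaaasty", "lern"], vocab))  # [0, 1, 4, 4]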

0:03:49 - 0:03:53     Text: So let's walk through what this sort of means in English.

0:03:53 - 0:03:55     Text: You learn embeddings, you map them, it all works.

0:03:55 - 0:04:02     Text: Then you have a variation on a word like taaaasty, with a bunch of a's.

0:04:02 - 0:04:07     Text: And your model isn't smart enough to know that that sort of means like very tasty, maybe.

0:04:07 - 0:04:12     Text: And so it maps it to UNK, because it's just a dictionary look-up that misses.

0:04:12 - 0:04:18     Text: And then you have a typo like lern, and that maps to UNK as well, potentially, if it

0:04:18 - 0:04:22     Text: wasn't in your training set, some people make typos, but not all of them will be seen

0:04:22 - 0:04:23     Text: at training time.

0:04:23 - 0:04:24     Text: And then you'll have novel items.

0:04:24 - 0:04:29     Text: So this could be the first time that you, us, the students in 224N, have

0:04:29 - 0:04:32     Text: seen the word transformerify.

0:04:32 - 0:04:36     Text: But I get the feeling you sort of have a notion of what it's supposed to mean, like

0:04:36 - 0:04:41     Text: maybe add transformers to or turn into using transformers, or turn into a transformer

0:04:41 - 0:04:42     Text: or something like that.

0:04:42 - 0:04:46     Text: And this is also going to be mapped to UNK, even though you've seen transformer and

0:04:46 - 0:04:48     Text: ify.

0:04:48 - 0:04:54     Text: And so somehow the conclusion we have to come to is that the assumption that an individual

0:04:54 - 0:05:00     Text: sequence of characters uniquely identifies a word, and that that's sort of

0:05:00 - 0:05:04     Text: how we should parameterize things, is just wrong.

0:05:04 - 0:05:09     Text: And so not only is this true in English, but in many languages, this finite vocabulary

0:05:09 - 0:05:10     Text: assumption makes even less sense.

0:05:10 - 0:05:16     Text: So already it doesn't make sense in English, but English is not even the worst case.

0:05:16 - 0:05:22     Text: So morphology is the study of the structure of words.

0:05:22 - 0:05:28     Text: And English is known to have pretty simple morphology in kind of specific ways.

0:05:28 - 0:05:34     Text: And when languages have complex morphology, it means you have longer words, more complex

0:05:34 - 0:05:39     Text: words that get modified more, and each one of them occurs less frequently.

0:05:39 - 0:05:40     Text: That should sound like a problem, right?

0:05:40 - 0:05:44     Text: If a word occurs less frequently, it will be less likely to show up in your training

0:05:44 - 0:05:46     Text: set.

0:05:46 - 0:05:49     Text: And maybe it'll show up in your test set, never in your training set.

0:05:49 - 0:05:51     Text: Now it's mapped to UNK, and you don't know what to do.

0:05:51 - 0:05:56     Text: So an example, Swahili verbs can have hundreds of conjugations.

0:05:56 - 0:06:03     Text: So each conjugation encodes important information about the sentence that in English might be

0:06:03 - 0:06:05     Text: represented through, say, more words.

0:06:05 - 0:06:11     Text: In Swahili it's mapped onto the verb as prefixes, suffixes, and the like; this is called

0:06:11 - 0:06:13     Text: inflectional morphology.

0:06:13 - 0:06:14     Text: And so you can have hundreds of conjugations.

0:06:14 - 0:06:20     Text: I've just sort of pasted this Wiktionary block just to give you a small sample of just

0:06:20 - 0:06:22     Text: the huge number of conjugations there are.

0:06:22 - 0:06:26     Text: And so trying to memorize independently a meaning of each one of these words is just not

0:06:26 - 0:06:32     Text: the right answer.

0:06:32 - 0:06:34     Text: So this is going to be a very brief overview.

0:06:34 - 0:06:43     Text: And so what we're going to do is take one, let's say, class of algorithms for sub-word modeling

0:06:43 - 0:06:50     Text: that have been kind of developed to try to take a middle ground between two options.

0:06:50 - 0:06:54     Text: One option is saying everything is just like individual words.

0:06:54 - 0:06:58     Text: Either I know the word and I saw it at training time, or I don't know the word, and it's like

0:06:58 - 0:06:59     Text: UNK.

0:06:59 - 0:07:03     Text: And then sort of another extreme option is to say it's just characters.

0:07:03 - 0:07:08     Text: Right? So like I get a sequence of characters, and then my neural network on top of my sequence

0:07:08 - 0:07:13     Text: of characters has to learn everything, has to learn how to combine words and stuff.

0:07:13 - 0:07:18     Text: So sub-word models in general just means looking at the sort of internal structure of words

0:07:18 - 0:07:20     Text: somehow, looking below the word level.

0:07:20 - 0:07:25     Text: But this group of models is going to try to meet a middle ground.

0:07:25 - 0:07:28     Text: So, byte pair encoding.

0:07:28 - 0:07:33     Text: What we're going to do is we're going to learn a vocabulary from a training data set

0:07:33 - 0:07:34     Text: again.

0:07:34 - 0:07:35     Text: So now we have a training data set.

0:07:35 - 0:07:39     Text: Instead of just saying, oh, everything that was split by my heuristic word splitter,

0:07:39 - 0:07:45     Text: like spaces in English, for example, is going to be a word in my vocabulary, we're going

0:07:45 - 0:07:50     Text: to learn the vocabulary using a greedy algorithm in this case.

0:07:50 - 0:07:52     Text: So here's what we're going to do.

0:07:52 - 0:07:55     Text: We start with the vocabulary containing only characters.

0:07:55 - 0:07:56     Text: So that's our extreme, right?

0:07:56 - 0:08:02     Text: So at the very least, if you've seen all the characters, then you know that you can never

0:08:02 - 0:08:03     Text: have an UNK, right?

0:08:03 - 0:08:07     Text: Because you see a word, you've never seen it before, you just split it into its characters,

0:08:07 - 0:08:11     Text: and then you try to see, you know, deal with it that way.

0:08:11 - 0:08:14     Text: And then also an end of word symbol.

0:08:14 - 0:08:15     Text: And then we'll iterate over this algorithm.

0:08:15 - 0:08:20     Text: We'll say, use the corpus of text, find common adjacent letters.

0:08:20 - 0:08:24     Text: So maybe A and B are very frequently adjacent.

0:08:24 - 0:08:30     Text: Add the pair of them together as a single sub word into your vocabulary.

0:08:30 - 0:08:34     Text: Then replace instances of that character pair with the new sub word, and repeat until you reach your desired

0:08:34 - 0:08:35     Text: vocabulary size.

0:08:35 - 0:08:41     Text: So maybe you start with a small character vocabulary, and then you end up with that same small

0:08:41 - 0:08:47     Text: character vocabulary plus a bunch of sort of entire words or parts of words.

0:08:47 - 0:08:50     Text: So notice how Apple, an entire word, looks like Apple.

0:08:50 - 0:08:56     Text: But then app, maybe this is sort of the first part, the first sub word of application, or

0:08:56 - 0:08:57     Text: up.

0:08:57 - 0:08:59     Text: Yeah.

0:08:59 - 0:09:07     Text: And then ly, I guess I should have not put the hash there, but you know, maybe you learned

0:09:07 - 0:09:10     Text: ly as, like, the end of a word, for example.

0:09:10 - 0:09:16     Text: And so what you end up with is, you know, a vocabulary where common things you get to

0:09:16 - 0:09:20     Text: map to themselves and then rare sequences of characters.

0:09:20 - 0:09:23     Text: You kind of split as little as possible.

0:09:23 - 0:09:27     Text: And it doesn't always end up so nicely that you learn like morphologically relevant suffixes

0:09:27 - 0:09:29     Text: like ly.

0:09:29 - 0:09:32     Text: But you can, you know, try to split things somewhat reasonably.

0:09:32 - 0:09:37     Text: And if you have enough data, the sub word vocabulary you learn tends to be okay.
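Here is a minimal sketch in Python of the greedy merge loop just described, assuming a toy whitespace-tokenized corpus; it is illustrative, not the exact algorithm from any particular BPE implementation:

    from collections import Counter

    def learn_bpe(corpus_words, num_merges):
        # Start with characters only, plus an end-of-word symbol.
        words = [list(w) + ["</w>"] for w in corpus_words]
        merges = []
        for _ in range(num_merges):
            # Count adjacent symbol pairs across the corpus.
            pairs = Counter()
            for w in words:
                for a, b in zip(w, w[1:]):
                    pairs[(a, b)] += 1
            if not pairs:
                break
            best = max(pairs, key=pairs.get)  # most frequent adjacent pair
            merges.append(best)
            # Replace each occurrence of that pair with the merged subword.
            new_words = []
            for w in words:
                merged, i = [], 0
                while i < len(w):
                    if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                        merged.append(w[i] + w[i + 1])
                        i += 2
                    else:
                        merged.append(w[i])
                        i += 1
                new_words.append(merged)
            words = new_words
        return merges

    # e.g. learn_bpe(["apple", "apply", "application"] * 100, num_merges=10)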

0:09:37 - 0:09:40     Text: So this was originally used in machine translation.

0:09:40 - 0:09:45     Text: And now a similar method, WordPiece, which we won't go over in this lecture, is used

0:09:45 - 0:09:46     Text: in pre-trained models.

0:09:46 - 0:09:48     Text: But you know, the idea is effectively the same.

0:09:48 - 0:09:50     Text: And you end up with vocabularies that look a lot like this.

0:09:50 - 0:09:58     Text: So if we go back to our examples of where, you know, word level

0:09:58 - 0:10:04     Text: NLP was failing us, then you have hat mapping to hat.

0:10:04 - 0:10:05     Text: Okay, that's good.

0:10:05 - 0:10:09     Text: You have hat mapping to hat because that was a common enough sequence of characters that

0:10:09 - 0:10:12     Text: it was actually incorporated into our sub word vocabulary.

0:10:12 - 0:10:13     Text: Right?

0:10:13 - 0:10:15     Text: And then you have learned mapping to learn.

0:10:15 - 0:10:16     Text: So common words good.

0:10:16 - 0:10:20     Text: And that means that the model, the neural network that you're going to process this text

0:10:20 - 0:10:27     Text: with does not need to, say, combine the letters of learn and hat in order to try to like

0:10:27 - 0:10:31     Text: derive the meaning of these words from the letters, because you can imagine that might

0:10:31 - 0:10:32     Text: be difficult.

0:10:32 - 0:10:38     Text: But then when you get a word that you have not seen before, you are able to decompose

0:10:38 - 0:10:39     Text: it.

0:10:39 - 0:10:45     Text: And so if you've seen tasty with varying numbers of A's at training time, you know,

0:10:45 - 0:10:50     Text: maybe you actually get some of the same sub words or similar sub words that you're splitting

0:10:50 - 0:10:52     Text: it into at evaluation time.

0:10:52 - 0:10:56     Text: So we never saw tasty with, you know, however many A's often enough to add it into

0:10:56 - 0:10:58     Text: a sub word vocabulary.

0:10:58 - 0:11:01     Text: But we're still able to split it into things.

0:11:01 - 0:11:05     Text: And then the neural network that runs on top of these sub word embeddings could be able

0:11:05 - 0:11:10     Text: to sort of induce that, oh, yeah, this is one of those things where people like, you know,

0:11:10 - 0:11:15     Text: chain letters together, chain vowels together in English for emphasis.

0:11:15 - 0:11:18     Text: So misspellings still pretty much mess you up.

0:11:18 - 0:11:22     Text: So now learn with this misspelling might be mapped to two sub words.

0:11:22 - 0:11:27     Text: But if you saw misspellings like this frequently enough, maybe you could learn sort of to handle

0:11:27 - 0:11:28     Text: it.

0:11:28 - 0:11:31     Text: It still messes up the model though.

0:11:31 - 0:11:33     Text: But at the very least, it's not just an UNK, right?

0:11:33 - 0:11:35     Text: It seems clearly better than that.

0:11:35 - 0:11:40     Text: And then transformerify: this is sort of optimistic, but maybe

0:11:40 - 0:11:44     Text: in the best case, right, you were able to say, ah, yes, this is transformer.

0:11:44 - 0:11:51     Text: And ify. Again, the sub words that you learn don't actually tend to be this well morphologically

0:11:51 - 0:11:52     Text: motivated, I think.

0:11:52 - 0:11:58     Text: So ify is like a clear suffix in English that has a very common and replicable meaning

0:11:58 - 0:12:02     Text: when you apply it to nouns, that's derivational morphology.

0:12:02 - 0:12:07     Text: But you know, you're able to sort of compose the meaning of the word transformerify,

0:12:07 - 0:12:11     Text: possibly from its two sub word constituents.

0:12:11 - 0:12:15     Text: And so when we talk about words being input to transformer models, pre-trained transformer

0:12:15 - 0:12:20     Text: models, throughout the entirety of this lecture, we will be talking about sub words.

0:12:20 - 0:12:26     Text: So I might say word, and what I mean is, you know, possibly a full word, also possibly

0:12:26 - 0:12:27     Text: a sub word.

0:12:27 - 0:12:31     Text: Okay, so when we say a sequence of words, the transformer, the pre-trained transformer

0:12:31 - 0:12:37     Text: has no idea, sort of, whether it's dealing with words or sub words, when it's doing its

0:12:37 - 0:12:40     Text: self-attention operations.

0:12:40 - 0:12:41     Text: And so this can be a problem.

0:12:41 - 0:12:46     Text: You can imagine if you have really weird sequences of characters, you can actually have an individual

0:12:46 - 0:12:51     Text: single word mapped to as many sub words as it has characters.

0:12:51 - 0:12:55     Text: That can be a problem because suddenly, you know, you have a ten-word sentence, but one

0:12:55 - 0:12:58     Text: of the words is mapped to, you know, twenty sub words.

0:12:58 - 0:13:02     Text: Now you have a thirty-word sentence, where twenty of the thirty words are just one real

0:13:02 - 0:13:03     Text: word.

0:13:03 - 0:13:05     Text: So keep this in mind.

0:13:05 - 0:13:09     Text: But, you know, I think it's important for sort of this open vocabulary assumption, it's

0:13:09 - 0:13:15     Text: important in English, and it's even more important in many other languages.

0:13:15 - 0:13:18     Text: And you can go look into the actual algorithms that are used for

0:13:18 - 0:13:23     Text: this; byte pair encoding is sort of my favorite for going over briefly, and WordPiece you can

0:13:23 - 0:13:26     Text: also take a look at.

0:13:26 - 0:13:27     Text: Okay.

0:13:27 - 0:13:29     Text: Any questions on sub words?

0:13:29 - 0:13:34     Text: I guess, John, let me ask one: what does the hashtag mean?

0:13:34 - 0:13:36     Text: Oh, great, great point.

0:13:36 - 0:13:40     Text: So this means that you should be combining this sub word, so this sub word is not the

0:13:40 - 0:13:41     Text: end of a word.

0:13:41 - 0:13:45     Text: TAA, hash hash, is sort of telling the model.

0:13:45 - 0:13:50     Text: So if I had TAA with no hashes, that's a separate sub word.

0:13:50 - 0:13:54     Text: That means there's an entire word that is taa, or at the very least it's the end of

0:13:54 - 0:13:55     Text: a word.

0:13:55 - 0:13:56     Text: See how here?

0:13:56 - 0:13:57     Text: I don't have the hashes at the end.

0:13:57 - 0:14:00     Text: It's because this is indicating that this is at the end of the word.

0:14:00 - 0:14:04     Text: Different sub word schemes differ on whether you should put something at the beginning of

0:14:04 - 0:14:08     Text: the word, if it does begin a word, or if you should put something at the end of the

0:14:08 - 0:14:11     Text: word, if it doesn't end the word.

0:14:11 - 0:14:15     Text: So when the tokenizer is running over your data, say you've got something that's tokenizing

0:14:15 - 0:14:20     Text: this sentence.

0:14:20 - 0:14:27     Text: In the worst case, it says: in, that's a whole word, give it just the word in, no hashes; the,

0:14:27 - 0:14:32     Text: that's a whole word, give it just the word the, no hashes; and then maybe over here it gets to

0:14:32 - 0:14:34     Text: subwords.

0:14:34 - 0:14:39     Text: We've got this weird word sub words, and it splits it into sub and words.

0:14:39 - 0:14:45     Text: And so sub, it's going to give it the sub word with sub hash hash to indicate that it's

0:14:45 - 0:14:52     Text: part of this larger word, sub words, as opposed to the word sub, like submarine, which would

0:14:52 - 0:14:53     Text: be different.

0:14:53 - 0:15:02     Text: Yeah, that's a great question.
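To make the hash convention concrete, here is a tiny hypothetical example of how a tokenizer using the end-of-piece convention described above might behave; the tokenize function and the exact splits are made up, since they depend on the learned vocabulary:

    tokenize("subwords")
    # -> ["sub##", "words"]
    # "sub##" ends in hashes: it is not the end of a word.
    # "words" has no hashes: it ends the word.

    tokenize("in the")
    # -> ["in", "the"]   # whole words, so no hashes at all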

0:15:02 - 0:15:06     Text: Okay, great.

0:15:06 - 0:15:11     Text: So that was our note on sub word modeling, and you can, you know, sub words are important,

0:15:11 - 0:15:17     Text: for example, in, you know, a lot of translation applications, that's why we gave you sub words

0:15:17 - 0:15:19     Text: on the machine translation assignment.

0:15:19 - 0:15:22     Text: Now let's talk about model pre-training and word embeddings.

0:15:22 - 0:15:25     Text: So I love, I love being able to go to this slide.

0:15:25 - 0:15:29     Text: So, so we saw this quote at the beginning of the class, you shall know a word by the company

0:15:29 - 0:15:34     Text: it keeps, and this was sort of one of the things that we used to summarize distributional

0:15:34 - 0:15:35     Text: semantics.

0:15:35 - 0:15:39     Text: This idea that word2vec was sort of well motivated in some way, because the meaning

0:15:39 - 0:15:44     Text: of a word can be thought of as being derived from the kind of co-occurrence statistics of

0:15:44 - 0:15:52     Text: words that co-occur around it, and that was just fascinatingly effective, I think.

0:15:52 - 0:15:54     Text: But there's this other quote actually from the same person.

0:15:54 - 0:16:01     Text: So we have J.R. Firth, 1935, compared to our quote before from 1957, and the second

0:16:01 - 0:16:06     Text: quote says, the complete meaning of a word is always contextual, and no study of meaning

0:16:06 - 0:16:10     Text: apart from a complete context can be taken seriously.

0:16:10 - 0:16:15     Text: Now again, these are just things that we can sort of think about and chew on, but it

0:16:15 - 0:16:20     Text: comes to mind, right, when you embed words with word2vec, one of the issues

0:16:20 - 0:16:26     Text: is that you don't actually look at its neighbors as you're giving it an embedding.

0:16:26 - 0:16:33     Text: So if I have the sentence "I record the record", you know, the two instances of r-e-c-o-r-d

0:16:33 - 0:16:37     Text: mean different things, but they're given the same word2vec embedding, right, because

0:16:37 - 0:16:42     Text: in word2vec you take the string, you map it to, oh, I've seen the word record before,

0:16:42 - 0:16:47     Text: you get that sort of vector from your learned matrix, and you give it the same thing in both

0:16:47 - 0:16:49     Text: cases.

0:16:49 - 0:16:54     Text: And so what we're going to be doing today is actually not conceptually all that different

0:16:54 - 0:16:56     Text: from training word2vec.

0:16:56 - 0:17:01     Text: Word2vec training you can think of as pre-training just a very simple model that only assigns

0:17:01 - 0:17:07     Text: an individual vector to each unique word type, each unique element in your vocabulary.

0:17:07 - 0:17:12     Text: Today we'll be going a lot farther than that, but the idea is very similar.

0:17:12 - 0:17:17     Text: So back in, you know, 2017, we would start with pre-trained word embeddings, and again,

0:17:17 - 0:17:21     Text: remember no context there, so you give a word and embedding independent of the context

0:17:21 - 0:17:23     Text: that it shows up in.

0:17:23 - 0:17:25     Text: And then you learn how to incorporate the context.

0:17:25 - 0:17:28     Text: It's not like our NLP models never used context, right?

0:17:28 - 0:17:34     Text: Instead, you would learn to incorporate the context using your LSTM, or, later in 2017,

0:17:34 - 0:17:37     Text: you know, your transformer.

0:17:37 - 0:17:41     Text: And you would learn to incorporate context while training on the task.

0:17:41 - 0:17:43     Text: So you have some supervision.

0:17:43 - 0:17:47     Text: Maybe it's machine translation supervision, maybe sentiment, maybe question answering.

0:17:47 - 0:17:53     Text: And you would learn how to incorporate context in your LSTM or otherwise through the signal

0:17:53 - 0:17:56     Text: of the training instead of, say, through the word2vec signal.

0:17:56 - 0:18:01     Text: And so, you know, sort of pictographically, you have these word embeddings here, so the

0:18:01 - 0:18:04     Text: red are sort of your word2vec embeddings, and those are pre-trained.

0:18:04 - 0:18:08     Text: Those take up some of the parameters of your network.

0:18:08 - 0:18:09     Text: And then you've got your contextualization.

0:18:09 - 0:18:12     Text: Now this looks like an LSTM, but it could be whatever.

0:18:12 - 0:18:16     Text: So this maybe bidirectional encoder thing here is not pre-trained.

0:18:16 - 0:18:19     Text: And now that's a lot of parameters that are not pre-trained.

0:18:19 - 0:18:24     Text: And then maybe you have some sort of readout function at the end, right, to predict whatever

0:18:24 - 0:18:25     Text: thing you're trying to predict.

0:18:25 - 0:18:30     Text: Again, maybe it's sentiment, maybe you're doing, I don't know, topic labeling, whatever

0:18:30 - 0:18:31     Text: you want to do.

0:18:31 - 0:18:32     Text: This is sort of the paradigm.

0:18:32 - 0:18:36     Text: Like, you set some sort of architecture and you only pre-train the word embeddings.

0:18:36 - 0:18:44     Text: And so this isn't actually, conceptually, necessarily the biggest problem, because,

0:18:44 - 0:18:49     Text: you know, we like to think in deep learning stuff that we have a lot of training data

0:18:49 - 0:18:50     Text: for our objectives.

0:18:50 - 0:18:56     Text: I mean, one of the things that we motivated, you know, big, deep neural networks for is

0:18:56 - 0:18:59     Text: that they can take a lot of data and they can learn patterns from it.

0:18:59 - 0:19:05     Text: But it does put the onus on our downstream data to be sort of sufficient to teach the

0:19:05 - 0:19:07     Text: contextual aspects of language.

0:19:07 - 0:19:13     Text: So you can imagine if you only have a little bit of, you know, labeled data for fine tuning,

0:19:13 - 0:19:17     Text: you're putting a pretty big role on that data to say, hey, maybe here's some pre-trained

0:19:17 - 0:19:21     Text: embeddings, but like how you handle like sentences and how they compose and all that stuff,

0:19:21 - 0:19:23     Text: that's up to you.

0:19:23 - 0:19:27     Text: So if you don't have a lot of labeled data for your downstream task, you're asking

0:19:27 - 0:19:32     Text: it to do a lot with, you know, a large number of parameters that have been initialized randomly.

0:19:32 - 0:19:38     Text: Okay, so like a small portion of the parameters have been pre-trained.

0:19:38 - 0:19:43     Text: Okay, so where we're going is pre-training whole models.

0:19:43 - 0:19:47     Text: I mean, conceptually, you know, we're pretty close to there.

0:19:47 - 0:19:53     Text: So nowadays, almost all parameters in your neural network and let's say a lot of research

0:19:53 - 0:19:58     Text: settings and increasingly in industry are initialized via pre-training, just like the

0:19:58 - 0:20:07     Text: word2vec parameters were initialized. And pre-training methods in general hide parts of the input

0:20:07 - 0:20:11     Text: from the model itself and then train the model to reconstruct those parts.

0:20:11 - 0:20:14     Text: How does this connect to word2vec?

0:20:14 - 0:20:19     Text: In word2vec, you know, people don't usually make this connection, but it's the following.

0:20:19 - 0:20:25     Text: You have an individual word and it knows itself, right, because you have the embedding for

0:20:25 - 0:20:27     Text: the center word, right, from assignment two.

0:20:27 - 0:20:31     Text: You have the embedding for the center word and knows itself and you've masked out all

0:20:31 - 0:20:33     Text: of its neighbors.

0:20:33 - 0:20:36     Text: You've hidden all of its neighbors from it, right, every all of its window neighbors, you've

0:20:36 - 0:20:37     Text: hidden from it.

0:20:37 - 0:20:41     Text: You ask the center word to predict its neighbors, right?

0:20:41 - 0:20:47     Text: And so this is, this falls under the category of pre-training.

0:20:47 - 0:20:48     Text: All of these methods look similar.

0:20:48 - 0:20:54     Text: You hide parts of the input from the model and train the model to reconstruct those parts.

0:20:54 - 0:20:58     Text: The differences with full model pre-training is that you don't give the model just the

0:20:58 - 0:21:01     Text: individual word and have it learn an embedding of that word.

0:21:01 - 0:21:06     Text: You give it much more of the sequence and have it predict, you know, held out parts of

0:21:06 - 0:21:07     Text: the sequence.

0:21:07 - 0:21:08     Text: And we'll get into the details there.

0:21:08 - 0:21:14     Text: But, you know, the takeaway is that everything here is pre-trained jointly, possibly with

0:21:14 - 0:21:18     Text: the exception of the very last layer that predicts the label.

0:21:18 - 0:21:26     Text: Okay, and this has just been exceptionally effective at building representations of language

0:21:26 - 0:21:31     Text: that just map similar things in language, similar representations in these encoders, just

0:21:31 - 0:21:35     Text: like how word2vec maps similar words to similar vectors.

0:21:35 - 0:21:40     Text: It's been exceptionally effective at making parameter initializations where you start with

0:21:40 - 0:21:45     Text: these parameters that have been pre-trained and then you fine-tune them on your label data.

0:21:45 - 0:21:49     Text: And then third, they've been exceptionally effective at defining probability distributions

0:21:49 - 0:21:54     Text: over language, like in language modeling, that are actually really useful to sample from

0:21:54 - 0:21:56     Text: in certain cases.

0:21:56 - 0:21:59     Text: So these are three ways in which we interact with pre-trained models.

0:21:59 - 0:22:02     Text: We use their representations just to compute similarities.

0:22:02 - 0:22:04     Text: We use them for parameter initializations.

0:22:04 - 0:22:10     Text: And we actually just use them as probability distributions, sort of how we trained them.

0:22:10 - 0:22:12     Text: Okay.

0:22:12 - 0:22:16     Text: So let's get into some technical parts here.

0:22:16 - 0:22:21     Text: I sort of want to think broad thoughts about what we could do with pre-training and what

0:22:21 - 0:22:26     Text: kind of things we could expect to potentially learn from this general method of hide part

0:22:26 - 0:22:30     Text: of the input and then see other parts of the input and then try to predict the parts

0:22:30 - 0:22:31     Text: that you hid.

0:22:31 - 0:22:32     Text: Okay.

0:22:32 - 0:22:36     Text: So Stanford University is located in Blank California.

0:22:36 - 0:22:39     Text: If we gave a model everything that was not blanked out here and asked it to predict the

0:22:39 - 0:22:49     Text: middle, the loss function would train the model to predict Palo Alto here, I expect.

0:22:49 - 0:22:50     Text: Okay.

0:22:50 - 0:22:54     Text: So this is an instance of something that you could imagine being a pre-training objective.

0:22:54 - 0:22:59     Text: You take in a sentence, you remove part of it and you say recreate the part that I removed.

0:22:59 - 0:23:03     Text: And in this case, if I just gave a bunch of examples that looked like this, it might

0:23:03 - 0:23:07     Text: learn sort of trivia thing here.

0:23:07 - 0:23:08     Text: Okay.

0:23:08 - 0:23:09     Text: Here's another one.

0:23:09 - 0:23:12     Text: I put blank fork down on the table.

0:23:12 - 0:23:14     Text: This one is under specified.

0:23:14 - 0:23:24     Text: So this could be the fork, my fork, his fork, her fork, some fork, yeah, a fork.

0:23:24 - 0:23:29     Text: So this is, you know, specifying the kinds of syntactic categories of things that can

0:23:29 - 0:23:32     Text: sort of appear in this context.

0:23:32 - 0:23:37     Text: So this is another thing that you might be able to learn from such an objective.

0:23:37 - 0:23:42     Text: So you have the woman walked across the street, checking for traffic over blank shoulder.

0:23:42 - 0:23:44     Text: One of the things that could go over here is her.

0:23:44 - 0:23:47     Text: That's a co-reference statement.

0:23:47 - 0:23:53     Text: So you could learn sort of connections between entities in a text where one word woman can

0:23:53 - 0:24:01     Text: also co-refer to the same entity in the world as this word, this pronoun her.

0:24:01 - 0:24:06     Text: Here you could think about, you know, I went to the ocean to see the fish, turtles, seals,

0:24:06 - 0:24:07     Text: and blank.

0:24:07 - 0:24:10     Text: Here I don't think there's a single correct answer as to what we could see going into that

0:24:10 - 0:24:11     Text: blank.

0:24:11 - 0:24:15     Text: But a model could learn a distribution of the kinds of things that people might be talking

0:24:15 - 0:24:20     Text: about when they, one, go to the ocean and two, are excited to see marine life.

0:24:20 - 0:24:21     Text: Right?

0:24:21 - 0:24:25     Text: So this is sort of a semantic category, a lexical semantic category of things that might

0:24:25 - 0:24:32     Text: sort of be in the same set of interest as fish, turtles, and seals in the context of

0:24:32 - 0:24:34     Text: I went to the ocean.

0:24:34 - 0:24:36     Text: Okay?

0:24:36 - 0:24:40     Text: So, and, you know, man, I expect that there would be examples of this in a large corpus

0:24:40 - 0:24:41     Text: of text.

0:24:41 - 0:24:43     Text: Maybe it may be a book.

0:24:43 - 0:24:44     Text: Okay.

0:24:44 - 0:24:46     Text: Here's another example.

0:24:46 - 0:24:52     Text: Overall, the value I got from the two hours watching it was the sum total of the popcorn

0:24:52 - 0:24:53     Text: and the drink.

0:24:53 - 0:24:55     Text: The movie was blank.

0:24:55 - 0:24:56     Text: Right?

0:24:56 - 0:25:00     Text: And this is where I'd sort of like to look out into the audience and ask: was the movie bad or

0:25:00 - 0:25:02     Text: good? But "the movie was bad"

0:25:02 - 0:25:04     Text: is my prediction here.

0:25:04 - 0:25:06     Text: Right?

0:25:06 - 0:25:09     Text: And so this is teaching you something about sentiment, about how people express sentiment

0:25:09 - 0:25:11     Text: in language.

0:25:11 - 0:25:17     Text: And so this even looks like a task itself: doing sentiment analysis is sort

0:25:17 - 0:25:22     Text: of what you need to do in order to figure out whether the movie was bad or good, or

0:25:22 - 0:25:24     Text: maybe the word is neither bad nor good,

0:25:24 - 0:25:25     Text: maybe "the movie was over" or something like that.

0:25:25 - 0:25:30     Text: But like, if you had to choose between is bad or good more likely, right?

0:25:30 - 0:25:33     Text: You sort of had to figure out the sentiment of the text.

0:25:33 - 0:25:36     Text: Now, that's really fascinating.

0:25:36 - 0:25:37     Text: Okay.

0:25:37 - 0:25:39     Text: Here's another one.

0:25:39 - 0:25:42     Text: Iroh went into the kitchen to make some tea.

0:25:42 - 0:25:46     Text: Standing next to Iroh, Zuko pondered his destiny.

0:25:46 - 0:25:48     Text: Zuko left the blank.

0:25:48 - 0:25:49     Text: Okay.

0:25:49 - 0:25:53     Text: So this is a little easy because we really only show one place.

0:25:53 - 0:25:56     Text: I guess we have another noun in destiny.

0:25:56 - 0:26:00     Text: But this is sort of reasoning about spatial location and the movement of, sort of,

0:26:00 - 0:26:02     Text: agents in an imagined world.

0:26:02 - 0:26:06     Text: We could imagine text that has lines like this.

0:26:06 - 0:26:10     Text: Person went into the place and was next to so and so who left and did that and sort of

0:26:10 - 0:26:12     Text: you have these like relationships.

0:26:12 - 0:26:14     Text: So here, Zuko left the kitchen.

0:26:14 - 0:26:17     Text: It's the most likely thing that I think would go here.

0:26:17 - 0:26:24     Text: And it sort of indicates that in order for a model to learn to perform this fill in the

0:26:24 - 0:26:32     Text: missing part task, it might need to, in general, figure out sort of where things are and

0:26:32 - 0:26:36     Text: whether statements mean or imply that locality.

0:26:36 - 0:26:40     Text: So, Iroh went into the kitchen.

0:26:40 - 0:26:45     Text: Now Iroh is in the kitchen, and then standing next to Iroh means Zuko is now in the kitchen.

0:26:45 - 0:26:48     Text: And then Zuko now leaves where?

0:26:48 - 0:26:50     Text: Well, he was in the kitchen before.

0:26:50 - 0:26:53     Text: So this is sort of a very basic sense of reasoning.

0:26:53 - 0:26:55     Text: Now this one.

0:26:55 - 0:26:56     Text: Here's a sentence.

0:26:56 - 0:27:03     Text: I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, blank.

0:27:03 - 0:27:05     Text: So I don't know.

0:27:05 - 0:27:07     Text: I can imagine people writing stuff.

0:27:07 - 0:27:09     Text: So this is the Fibonacci sequence.

0:27:09 - 0:27:12     Text: And sort of, you know, you sum these two to get the next one, sum these two

0:27:12 - 0:27:14     Text: to get the next one, and so on.

0:27:14 - 0:27:16     Text: And so you have this running sum.

0:27:16 - 0:27:17     Text: It's a famous sequence.

0:27:17 - 0:27:20     Text: It shows up in a lot of text on the internet.

0:27:20 - 0:27:26     Text: And in general you have to learn the algorithm or just the formula, I guess, that defines the

0:27:26 - 0:27:29     Text: Fibonacci sequence in order to keep going.

0:27:29 - 0:27:32     Text: Do models learn this in practice?

0:27:32 - 0:27:33     Text: Wait and find out.

0:27:33 - 0:27:37     Text: But you would have to learn it in order to get the sequence to keep going and going and

0:27:37 - 0:27:39     Text: going.

0:27:39 - 0:27:46     Text: OK, so we're going to get into specific pre-trained models, specific methods of pre-training

0:27:46 - 0:27:47     Text: now.

0:27:47 - 0:27:56     Text: So I'm going to go over a brief review of transformer encoders, decoders, and encoder decoders.

0:27:56 - 0:27:58     Text: Because we're going to get into the sort of technical bits now.

0:27:58 - 0:28:01     Text: So before I do that, I'm going to pause.

0:28:01 - 0:28:02     Text: Are there any questions?

0:28:02 - 0:28:11     Text: Yeah, there's an interesting question asked about whether we risk overfitting our model on our

0:28:11 - 0:28:13     Text: input training data when we do pre-training?

0:28:13 - 0:28:17     Text: And maybe we should also ask these questions in light of the huge models that we

0:28:17 - 0:28:19     Text: train nowadays.

0:28:19 - 0:28:25     Text: Sorry, the first part of that question, was it, are we overfitting our models to what?

0:28:25 - 0:28:29     Text: Yes, so the risk of overfitting our model on our input training data when we're

0:28:29 - 0:28:30     Text: doing pre-training?

0:28:30 - 0:28:31     Text: Got it.

0:28:31 - 0:28:34     Text: Yeah, so that's a good point.

0:28:34 - 0:28:36     Text: So we're using very large models.

0:28:36 - 0:28:40     Text: And we might imagine that there's a risk of overfitting.

0:28:40 - 0:28:45     Text: And in practice, yeah, it's actually one of the more crucial things to do to make pre-training

0:28:45 - 0:28:46     Text: work.

0:28:46 - 0:28:51     Text: So that turns out that you need to have a lot, a lot of data, like a lot of data.

0:28:51 - 0:28:56     Text: And in fact, we'll show results later on where people built a pre-trained model, pre-trained

0:28:56 - 0:28:58     Text: it on a lot of data.

0:28:58 - 0:29:02     Text: And then like six months later, someone else came along and was like, hey, if you pre-trained

0:29:02 - 0:29:06     Text: it on 10 times the data and changed almost nothing else, it would have done even better.

0:29:06 - 0:29:07     Text: Now was it overfitting?

0:29:07 - 0:29:13     Text: I mean, you can sort of like hold out some text during pre-training, right, and sort

0:29:13 - 0:29:18     Text: of evaluate the perplexity, right, the language modeling performance on that held out text.

0:29:18 - 0:29:22     Text: And it tends to be the case that actually these models are underfitting, right, that we

0:29:22 - 0:29:28     Text: need even larger and larger models to express the complex interactions that allow us to

0:29:28 - 0:29:30     Text: fit these datasets better.

0:29:30 - 0:29:33     Text: And so we'll talk about that when we talk about BERT.

0:29:33 - 0:29:37     Text: And one of the really interesting results is that BERT is underfit, not overfit, but

0:29:37 - 0:29:42     Text: in principle, yes, it's a problem to, this potentially a problem to overfit.

0:29:42 - 0:29:46     Text: But we end up having a ton of text in English at least, although not in every language.

0:29:46 - 0:29:51     Text: And so, yeah, it's important to scale them, but currently our models don't seem overfit

0:29:51 - 0:29:53     Text: to the pre-training text.

0:29:53 - 0:29:54     Text: Okay.

0:29:54 - 0:30:02     Text: Any other questions?

0:30:02 - 0:30:05     Text: All right.

0:30:05 - 0:30:07     Text: So we saw this figure before, right here.

0:30:07 - 0:30:12     Text: We saw this figure of a transformer encoder-decoder from this paper, attention is all you

0:30:12 - 0:30:14     Text: need.

0:30:14 - 0:30:17     Text: And so we have a couple of things.

0:30:17 - 0:30:22     Text: We're not going to go over the form of attention again today because we have a lot to go over,

0:30:22 - 0:30:25     Text: but I'm happy to chat about it more on Ed.

0:30:25 - 0:30:28     Text: But so in our encoder, we have some input sequence.

0:30:28 - 0:30:31     Text: Remember, this is a sequence of sub words now.

0:30:31 - 0:30:34     Text: Each sub word gets a word embedding.

0:30:34 - 0:30:38     Text: And each index in the transformer gets a position embedding.

0:30:38 - 0:30:44     Text: Now remember that we have a finite length that our sequence can possibly be like 512.

0:30:44 - 0:30:45     Text: That's tokens.

0:30:45 - 0:30:47     Text: That was that capital T from last lecture.

0:30:47 - 0:30:48     Text: So you have some finite length.

0:30:48 - 0:30:55     Text: So you have one embedding of a position for every index for all 512 indices.

0:30:55 - 0:30:57     Text: And then you have all your word embeddings.

0:30:57 - 0:31:03     Text: And then the transformer encoder, right, was this combination of sort of sub-modules that

0:31:03 - 0:31:08     Text: we walked through line by line on Tuesday, right.

0:31:08 - 0:31:12     Text: Multi-headed attention was sort of the core building block.

0:31:12 - 0:31:16     Text: And then we had residual and layer norm, right, to help with passing gradients and to help

0:31:16 - 0:31:19     Text: make training go better and faster.

0:31:19 - 0:31:25     Text: We had that feed forward layer to, yeah, process sort of the result of the multi-headed

0:31:25 - 0:31:30     Text: attention, another residual and layer norm, and then pass to an identical transformer

0:31:30 - 0:31:31     Text: encoder block here.

0:31:31 - 0:31:32     Text: And these would be stacked.

0:31:32 - 0:31:38     Text: We'll see a number of different configurations here, but I think, you know, 6 to 12 of these

0:31:38 - 0:31:39     Text: sort of stacked together.

0:31:39 - 0:31:40     Text: Okay.

0:31:40 - 0:31:42     Text: So that's a transformer encoder.
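A rough PyTorch-style sketch of one such encoder block; the dimensions, ReLU choice, and post-norm placement here are illustrative assumptions rather than the exact configuration from the paper:

    import torch.nn as nn

    class EncoderBlock(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):
            # Multi-headed self-attention, then residual + layer norm.
            attn_out, _ = self.attn(x, x, x)
            x = self.norm1(x + attn_out)
            # Position-wise feed-forward, then residual + layer norm.
            return self.norm2(x + self.ff(x))

    # A full encoder stacks several of these blocks (roughly 6 to 12) on top of
    # the sum of word embeddings and position embeddings.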

0:31:42 - 0:31:47     Text: And we're actually going to see whole models today that are just transformer encoders.

0:31:47 - 0:31:48     Text: Okay.

0:31:48 - 0:31:52     Text: So when we talked about machine translation, when we talked about the transformer itself,

0:31:52 - 0:31:56     Text: the transformer encoder decoder, we talked about this whole thing.

0:31:56 - 0:31:59     Text: But you could actually just have this left column, and you could actually just have this

0:31:59 - 0:32:02     Text: right column as well.

0:32:02 - 0:32:04     Text: Although the right column changes a little bit if you just have it.

0:32:04 - 0:32:10     Text: So remember, the right column, we had this masked multi-head self-attention, right, so

0:32:10 - 0:32:14     Text: where you can't look at the future.

0:32:14 - 0:32:18     Text: And someone asked actually about how we decode from transformers, given that you have this

0:32:18 - 0:32:20     Text: sort of big chunking operation.

0:32:20 - 0:32:21     Text: It's a great question.

0:32:21 - 0:32:25     Text: I won't be able to get into it in detail today, but you have to run it once during the decoding

0:32:25 - 0:32:31     Text: process for every time that you decode to sort of predict the next word.

0:32:31 - 0:32:34     Text: I'll write out something on Ed for this.

0:32:34 - 0:32:38     Text: So in the masked multi-head self-attention, you're not allowed to look at the future so

0:32:38 - 0:32:44     Text: that you sort of have this well-defined objective of trying to do language modeling.
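As a minimal sketch of what "not allowed to look at the future" means mechanically, here is the standard causal-mask trick in PyTorch (illustrative, not the lecture's code): positions after the current one get a score of negative infinity, so the softmax gives them zero attention weight.

    import torch

    T = 5                                   # sequence length
    scores = torch.randn(T, T)              # scores[i, j]: attention from position i to j
    # Upper-triangular entries (j > i) are the future; mask them out.
    future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(future, float("-inf"))
    attn = torch.softmax(scores, dim=-1)    # each row attends only to positions j <= i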

0:32:44 - 0:32:46     Text: Then we have residual and layer norm.

0:32:46 - 0:32:50     Text: The multi-head cross-attention, remember, goes back to the last layer of the transformer

0:32:50 - 0:32:53     Text: encoder, or the last transformer encoder block.

0:32:53 - 0:32:57     Text: And then more residual and layer norm, another feed-forward layer, more residual and layer

0:32:57 - 0:32:58     Text: norm.

0:32:58 - 0:33:04     Text: Now, if we don't have an encoder here, then we get rid of the cross-attention and residual

0:33:04 - 0:33:05     Text: and layer norm here.

0:33:05 - 0:33:09     Text: So if we didn't have this stack of encoders, the decoders get simpler because you don't

0:33:09 - 0:33:10     Text: have to attend to them.

0:33:10 - 0:33:14     Text: But then again, you also have these word embeddings at the bottom and position representations

0:33:14 - 0:33:17     Text: for the output sequence.

0:33:17 - 0:33:20     Text: Okay, so that's been review.

0:33:20 - 0:33:22     Text: Let's talk about pre-training through language modeling.

0:33:22 - 0:33:26     Text: So we've actually talked maybe a little bit about this before, and we've seen language

0:33:26 - 0:33:30     Text: modeling in the context of maybe just wanting to do it our priori.

0:33:30 - 0:33:35     Text: So language models were useful, for example, in automatic speech recognition systems.

0:33:35 - 0:33:38     Text: They were useful in statistical machine translation systems.

0:33:38 - 0:33:42     Text: So let's recall the language modeling task.

0:33:42 - 0:33:47     Text: You can say it's defined as modeling the probability of a word at a given index t, of any word

0:33:47 - 0:33:51     Text: at any given index, given all the words before it.

0:33:51 - 0:33:59     Text: And this probability distribution is a distribution of words given their past contexts.

0:33:59 - 0:34:05     Text: And so this is just saying, for any prefix here, Iroh goes to make.

0:34:05 - 0:34:07     Text: I want a probability of whatever the next word should be.

0:34:07 - 0:34:14     Text: So the observed next word is tasty, but maybe there's goes to make tea, goes to make hot

0:34:14 - 0:34:15     Text: water, etc.

0:34:15 - 0:34:19     Text: You can have a distribution over what the next word should be in this decoder.

0:34:19 - 0:34:24     Text: And remember that because of the masked self-attention, make can look back to the word

0:34:24 - 0:34:30     Text: to, or goes, or Iroh, but it can't look forward to tasty.
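Written out, the language modeling objective being described is just the negative log-likelihood of each word given its prefix (standard notation, not the slide's exact formula):

    L(\theta) = - \sum_{t=1}^{T} \log p_\theta(w_t \mid w_1, \dots, w_{t-1})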

0:34:30 - 0:34:31     Text: So there's a lot of data for this, right?

0:34:31 - 0:34:33     Text: You just have text.

0:34:33 - 0:34:36     Text: And like voila, you have language modeling data.

0:34:36 - 0:34:37     Text: It's free.

0:34:37 - 0:34:38     Text: No.

0:34:38 - 0:34:40     Text: Once you have the text, it's freely available.

0:34:40 - 0:34:42     Text: You don't need to label it.

0:34:42 - 0:34:44     Text: And in English, you have a lot of it, right?

0:34:44 - 0:34:50     Text: This is not true of every language by any means, but in English, you have a lot of pre-training

0:34:50 - 0:34:52     Text: data.

0:34:52 - 0:34:58     Text: And so the simple thing about pre-training is, well, what we're going to do is we're

0:34:58 - 0:35:01     Text: going to train a neural network to do language modeling on a large amount of text, and we'll

0:35:01 - 0:35:06     Text: just save the parameters of our train network to disk.

0:35:06 - 0:35:10     Text: So conceptually, it's not actually different from the things that we've done before.

0:35:10 - 0:35:12     Text: It's just sort of the intent, right?

0:35:12 - 0:35:17     Text: We're training these parameters to start using them for something else later down the line.

0:35:17 - 0:35:20     Text: But the language modeling itself doesn't change.

0:35:20 - 0:35:22     Text: The decoder here doesn't change, right?

0:35:22 - 0:35:27     Text: It's a transformer in pre-trained models in the modern day, because this is sort of a newly

0:35:27 - 0:35:30     Text: popular concept.

0:35:30 - 0:35:36     Text: Although back in 2015 was sort of when this, I think, was first effectively tried out and

0:35:36 - 0:35:39     Text: got some interesting results.

0:35:39 - 0:35:42     Text: But this could be anything here.

0:35:42 - 0:35:47     Text: Today, it's mostly going to be transformers in the models that we actually observe.

0:35:47 - 0:35:49     Text: Okay.

0:35:49 - 0:35:52     Text: So once you have your pre-trained network, what's the sort of default thing you do to

0:35:52 - 0:35:54     Text: use it?

0:35:54 - 0:35:55     Text: Right?

0:35:55 - 0:35:58     Text: And if you take anything away from this lecture in terms of just like engineering practices

0:35:58 - 0:36:05     Text: that will be broadly useful to you as you go off and build things and study things, maybe

0:36:05 - 0:36:11     Text: as a machine learning engineer or a computational social scientist, et cetera, what people tend

0:36:11 - 0:36:16     Text: to do is you pre-train your network on just a lot of data, lots of text, learn very

0:36:16 - 0:36:18     Text: general things.

0:36:18 - 0:36:22     Text: And then you adapt the network to whatever you wanted to do.

0:36:22 - 0:36:26     Text: So we had a bunch of pre-training data, and then maybe this is a movie review that

0:36:26 - 0:36:34     Text: we're taking as input here, and we just apply the decoder that we sort of pre-trained,

0:36:34 - 0:36:41     Text: start the parameters there, and then fine tune it on whatever we were sort of wanting

0:36:41 - 0:36:42     Text: to do.

0:36:42 - 0:36:43     Text: Maybe this is a sentiment analysis task.

0:36:43 - 0:36:48     Text: So we run the whole sequence through the decoder, get a hidden state at the end at the

0:36:48 - 0:36:53     Text: very last thing, and then we predict maybe plus or minus sentiment.

0:36:53 - 0:36:56     Text: And this is sort of adapting the pre-trained network to the task.

0:36:56 - 0:37:02     Text: This pre-train, fine-tune paradigm is wildly successful, and you should really try

0:37:02 - 0:37:09     Text: it whenever you're doing any NLP task nowadays, effectively.

0:37:09 - 0:37:14     Text: Because some variant of this tends to be what works best.
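A hedged sketch of that adapt-the-decoder recipe in PyTorch; the pretrained_decoder object and the shape of its output are assumptions standing in for whatever pre-trained model you actually load:

    import torch.nn as nn

    class SentimentClassifier(nn.Module):
        def __init__(self, pretrained_decoder, d_model, n_classes=2):
            super().__init__()
            self.decoder = pretrained_decoder          # parameters come from pre-training
            self.head = nn.Linear(d_model, n_classes)  # the only randomly initialized part

        def forward(self, token_ids):
            hidden = self.decoder(token_ids)           # assumed shape (batch, seq_len, d_model)
            last = hidden[:, -1, :]                    # hidden state at the last position
            return self.head(last)                     # predict +/- sentiment

    # Fine-tuning is then ordinary gradient descent on the labeled sentiment data,
    # typically updating all parameters, not just the classification head.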

0:37:14 - 0:37:19     Text: Okay, so we've got a technical note now.

0:37:19 - 0:37:27     Text: So if you don't like to think about optimization or gradient descent, maybe take a pass on

0:37:27 - 0:37:34     Text: this slide, but I encourage you to just think for a second about why should this help?

0:37:34 - 0:37:40     Text: Training neural nets, we're using gradient descent to try to find some global minimum

0:37:40 - 0:37:43     Text: of this loss function.

0:37:43 - 0:37:47     Text: And we're sort of doing this in two steps.

0:37:47 - 0:37:55     Text: The first step is we get some parameters theta hat by approximating min over our, sorry,

0:37:55 - 0:37:57     Text: theta is the parameters of the neural network.

0:37:57 - 0:38:03     Text: So all of the KQV vectors in our transformer, the word embeddings, the position embeddings,

0:38:03 - 0:38:07     Text: it's just all of the parameters of our neural network.

0:38:07 - 0:38:11     Text: And so we're doing min over all the parameters theta; we're trying to approximate the min

0:38:11 - 0:38:14     Text: over the parameters of our neural network of our pre-training loss, which here was language

0:38:14 - 0:38:18     Text: modeling, as a function of our parameters.

0:38:18 - 0:38:23     Text: And this is, we just get this sort of estimate of some parameters theta hat.

0:38:23 - 0:38:31     Text: And then we fine tune by approximating this min over theta of the fine tune loss, maybe

0:38:31 - 0:38:33     Text: that's sentiment, right?

0:38:33 - 0:38:34     Text: Starting at theta hat.

0:38:34 - 0:38:37     Text: So we initialize our gradient descent at theta hat, and then we just sort of let it do

0:38:37 - 0:38:38     Text: what it wants.

0:38:38 - 0:38:42     Text: And it's just like, it just works.
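In symbols, the two-step recipe just described is (using the same theta-hat notation as above):

    \hat{\theta} \approx \arg\min_\theta \; \mathcal{L}_{\text{pretrain}}(\theta)

    \theta^{*} \approx \arg\min_\theta \; \mathcal{L}_{\text{finetune}}(\theta), \quad \text{with gradient descent initialized at } \hat{\theta}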

0:38:42 - 0:38:49     Text: And in part, it has to be because something about where we start is so important, not just

0:38:49 - 0:38:53     Text: in terms of sort of gradient flow, although that is a big part of it.

0:38:53 - 0:39:00     Text: But also, it seems like, you know, stochastic gradient descent sticks relatively close to

0:39:00 - 0:39:03     Text: that pre-training initialization during fine tuning.

0:39:03 - 0:39:09     Text: This is something that we seem to observe in practice, right, that somehow the locality

0:39:09 - 0:39:13     Text: of stochastic gradient descent, finding local minima that are close to this theta hat,

0:39:13 - 0:39:19     Text: that was good for such a general problem of language modeling, it seems like, yeah, the

0:39:19 - 0:39:28     Text: local minima of the fine tuning loss tend to generalize well when they're near to this theta hat that

0:39:28 - 0:39:29     Text: we pre-trained.

0:39:29 - 0:39:32     Text: And this is sort of a mystery that we're still trying to figure out more about.

0:39:32 - 0:39:35     Text: And then also, yeah, maybe the gradients, right, the gradients of the fine tuning loss

0:39:35 - 0:39:40     Text: near theta propagate nicely, so our network training goes really well as well.

0:39:40 - 0:39:45     Text: Okay, so this is something to chew on, but in practice, it works.

0:39:45 - 0:39:49     Text: I think it's just still fascinating that it works.

0:39:49 - 0:39:59     Text: Okay, so we talked about mainly the transformer encoder-decoder, and in fact, right, I said

0:39:59 - 0:40:04     Text: that we could have just sort of the left-hand side encoders, you know, be pre-trained,

0:40:04 - 0:40:08     Text: or just decoders be pre-trained, or encoder-decoders.

0:40:08 - 0:40:13     Text: And there are actually really popular sort of famous models in each of these three categories.

0:40:13 - 0:40:20     Text: The kinds of pre-training you can do, and the kinds of applications or uses of those

0:40:20 - 0:40:25     Text: pre-trained models that are most natural actually depend strongly on whether you choose

0:40:25 - 0:40:31     Text: to pre-traine and encoder a decoder or an encoder decoder.

0:40:31 - 0:40:36     Text: So I think it's useful as we go through some of these popular sort of model names that

0:40:36 - 0:40:41     Text: you need to know and what they sort of, what their innovations were to actually split

0:40:41 - 0:40:44     Text: it up into these categories.

0:40:44 - 0:40:46     Text: So we've all, so here's the thing.

0:40:46 - 0:40:51     Text: We're going to go through these three, and they all have sort of benefits and in some

0:40:51 - 0:40:52     Text: sense, drawbacks.

0:40:52 - 0:40:58     Text: So the decoders, right, really what we're talking about here mainly is language models,

0:40:58 - 0:41:03     Text: and we've seen this so far, we've talked about pre-trained decoders, and these are nice

0:41:03 - 0:41:04     Text: to generate from.

0:41:04 - 0:41:08     Text: So you can just sample from your pre-trained language model and get things that look

0:41:08 - 0:41:11     Text: like the text that you were pre-training on.

0:41:11 - 0:41:15     Text: But one problem is that you can't condition on future words, right?

0:41:15 - 0:41:21     Text: So we mentioned in our modeling with LSTMs that just like, instead, if you could, when

0:41:21 - 0:41:27     Text: you can do it, we said that having a bi-directional LSTM was actually just way more useful than

0:41:27 - 0:41:29     Text: having a one-directional LSTM.

0:41:29 - 0:41:31     Text: Well, it's sort of true for transformers as well.

0:41:31 - 0:41:37     Text: So if you can see how the arrows are pointing here, the arrows are pointing up into the,

0:41:37 - 0:41:38     Text: you know, to the right.

0:41:38 - 0:41:45     Text: So this word is sort of looking back at its past history, but, you know, this word can't

0:41:45 - 0:41:48     Text: see, can't contextualize with the future.

0:41:48 - 0:41:52     Text: Whereas in the encoder block here in blue, just below it, you sort of have all pairs of

0:41:52 - 0:41:54     Text: interactions.

0:41:54 - 0:41:56     Text: And so, you know, when you're building your representations, it can actually be super

0:41:56 - 0:41:58     Text: useful to know what the future words are.

0:41:58 - 0:42:00     Text: So that's what encoders get you, right?

0:42:00 - 0:42:02     Text: You get bi-directional context.

0:42:02 - 0:42:05     Text: So you can condition on the future, maybe that helps you build up better representations

0:42:05 - 0:42:06     Text: of language.

0:42:06 - 0:42:12     Text: But the question that we'll actually go through here is, well, how do you pre-train them?

0:42:12 - 0:42:15     Text: You can't pre-train them as language models because you have access to the future.

0:42:15 - 0:42:20     Text: So if you try to do that, the loss will just immediately be zero because you can just

0:42:20 - 0:42:21     Text: see what the future is.

0:42:21 - 0:42:22     Text: That's not useful.

0:42:22 - 0:42:28     Text: And then we'll talk about pre-trained encoder decoders, which like maybe the best of both

0:42:28 - 0:42:33     Text: worlds, but also maybe unclear what's the best way to pre-train them.

0:42:33 - 0:42:36     Text: They definitely have benefits for both.

0:42:36 - 0:42:43     Text: So let's get into some general top, like a more, yeah, let's get into the decoders first,

0:42:43 - 0:42:46     Text: we'll go through all three.

0:42:46 - 0:42:47     Text: Okay.

0:42:47 - 0:42:54     Text: When we're pre-training a language model, right, we're pre-training it on this objective,

0:42:54 - 0:42:59     Text: we're trying to make it approximate this probability of a word given all of its previous

0:42:59 - 0:43:01     Text: words.

0:43:01 - 0:43:04     Text: What we end up doing, and I showed this sort of pictographically, but I'll add some math,

0:43:04 - 0:43:11     Text: right, we get a hidden state, h1 to ht for each of the words in the input w1 to wt.

0:43:11 - 0:43:14     Text: And I remember words again, mean sub words here.

0:43:14 - 0:43:15     Text: Okay.

0:43:15 - 0:43:20     Text: And we're fine tuning this, right, we can take the representation, this should be ht,

0:43:20 - 0:43:22     Text: a, ht plus b.

0:43:22 - 0:43:25     Text: And then the picture here is, right, here's ht.

0:43:25 - 0:43:29     Text: It's the very last encoder state.

0:43:29 - 0:43:36     Text: And now this has sort of the, it's seen all of its history, right, and so you can apply

0:43:36 - 0:43:41     Text: a linear layer here, maybe multiplying it by some parameters a and b that were not

0:43:41 - 0:43:46     Text: pre-trained, and then you're predicting sentiment maybe, you know, plus or minus sentiment,

0:43:46 - 0:43:47     Text: perhaps.

0:43:47 - 0:43:51     Text: And so, you know, look at the red and the gray, so most of the parameters of my neural

0:43:51 - 0:43:56     Text: network have now been pre-trained, the very last layer that's learning, the sentiment,

0:43:56 - 0:44:00     Text: say, decision, has not been pre-trained.

0:44:00 - 0:44:02     Text: So those have been randomly initialized.

0:44:02 - 0:44:06     Text: And when you, when you take the loss of the sentiment loss, right, you train not just

0:44:06 - 0:44:11     Text: the linear layer here, but you actually back propagate the gradients all the way through

0:44:11 - 0:44:16     Text: the entire pre-trained network and fine tune all of those parameters, right?

0:44:16 - 0:44:20     Text: So it's not like you're just training this, fine tuning time, this linear layer, you're

0:44:20 - 0:44:25     Text: training the whole network as a function of this fine tuning loss.

0:44:25 - 0:44:30     Text: And you know, maybe it's bad that like the linear layer wasn't pre-trained.

0:44:30 - 0:44:34     Text: In the grand scheme of things, it's not that many parameters also.

0:44:34 - 0:44:38     Text: So this is you, so this is just one way to interact with pre-trained models, right?

0:44:38 - 0:44:42     Text: And so what I want you to take away from this is that there was a contract that we had

0:44:42 - 0:44:44     Text: with the original model, right?

0:44:44 - 0:44:48     Text: The contract was that it was defining probability distributions.

0:44:48 - 0:44:52     Text: But when we're fine tuning, when we're interacting with the pre-trained model, what we also

0:44:52 - 0:44:55     Text: have are just like the trained weights and the network architecture.

0:44:55 - 0:44:58     Text: We don't need to use it as a language model, we don't need to use it as a probability

0:44:58 - 0:44:59     Text: distribution.

0:44:59 - 0:45:04     Text: When we're actually fine tuning it, we're really just using it for its initialization

0:45:04 - 0:45:08     Text: of its parameters and saying, oh, this is just a transformer decoder that was

0:45:08 - 0:45:14     Text: pre-trained by, oh, and it happens to be really great in that when you find tuna on some

0:45:14 - 0:45:17     Text: sentiment data, it does a really good job.

0:45:17 - 0:45:22     Text: Okay, but there's a second way to interact with pre-trained decoders, which is in some

0:45:22 - 0:45:24     Text: sense even more natural.

0:45:24 - 0:45:28     Text: It actually is closer to the contract that we started with.

0:45:28 - 0:45:32     Text: So we don't have to just ignore the fact that it was a probability distribution entirely,

0:45:32 - 0:45:35     Text: we can make use of it while still fine tuning it.

0:45:35 - 0:45:37     Text: So here's what we're going to do.

0:45:37 - 0:45:40     Text: So we can use them as a generator at fine tuning time.

0:45:40 - 0:45:47     Text: By generator, I mean, it's going to define this distribution of words given their context.

0:45:47 - 0:45:51     Text: And then we'll actually just fine tune that probability distribution.

0:45:51 - 0:45:58     Text: So in a task like some kind of turn-based dialogue, we might encode the dialogue history

0:45:58 - 0:46:01     Text: as your past context.

0:46:01 - 0:46:06     Text: So you have a dialogue history of some things that people are saying back and forth

0:46:06 - 0:46:10     Text: to each other, you encode it as words, and you try to predict the next words in the

0:46:10 - 0:46:11     Text: dialogue.

0:46:11 - 0:46:15     Text: Right, and maybe you're pre-training objective, you looked at very general purpose text

0:46:15 - 0:46:19     Text: from, I don't know, Wikipedia or books or something, and you're fine tuning it as a

0:46:19 - 0:46:25     Text: language model, but you're fine tuning it as a language model on this sort of domain-specific

0:46:25 - 0:46:30     Text: distribution of text like dialogue or maybe summarization where you paste in the whole

0:46:30 - 0:46:37     Text: document and then say a specific word and then the summary and say predict the summary.

0:46:37 - 0:46:43     Text: And so what this looks like is, again, at fine tuning time here, you have your h1 to

0:46:43 - 0:46:49     Text: ht is equal to the decoder of the words, and then you have this distribution that you're

0:46:49 - 0:46:55     Text: fine tuning of wt is a h is the type again, ht minus 1 plus b.

0:46:55 - 0:47:01     Text: So now every time I have this, I'm predicting these words from word 1, I predict word 2,

0:47:01 - 0:47:07     Text: we're 2, I predict word 3, etc., right, the actual last layer of the network unlike before,

0:47:07 - 0:47:12     Text: the last layer of the network has been pre-trained, but I'm still fine tuning the whole thing.

0:47:12 - 0:47:17     Text: Right, so a and b here are mapping to sort of a probability distribution over my vocabulary

0:47:17 - 0:47:23     Text: or the logits of a probability distribution, and I guess get this sort of like tweak them

0:47:23 - 0:47:28     Text: now, in order to have the distribution that I'm going to use, reflect the thing like dialogue

0:47:28 - 0:47:30     Text: that I wanted to reflect.

0:47:30 - 0:47:36     Text: Okay, so those are two ways of interacting with a pre-trained decoder.

0:47:36 - 0:47:44     Text: Now here's an example of what is ended up being the first, that be a line of wildly successful

0:47:44 - 0:47:49     Text: or at least talked about pre-trained decoders.

0:47:49 - 0:47:57     Text: So the generative pre-trained decoder, or GPC, was a huge success in some sense, or at

0:47:57 - 0:48:04     Text: least it got a lot of buzz, so it's a transformer decoder, no encoder, with 12 layers, I'm giving

0:48:04 - 0:48:10     Text: you the details so you can start to get a feeling for how the size of things changes.

0:48:10 - 0:48:15     Text: Over the years, as we'll continue to progress here, had each of our, each of the hidden

0:48:15 - 0:48:20     Text: states was dimensionality, 70, had 768, so if you remember back to last lecture, we had

0:48:20 - 0:48:26     Text: a term D, which was our dimensionality, so D is 768, and then an interesting statement

0:48:26 - 0:48:31     Text: that you should keep in mind for the engineering-minded folks is that the actual feed-forward

0:48:31 - 0:48:35     Text: layers, right, you've got a hidden layer in the feed-forward layer, and this was actually

0:48:35 - 0:48:41     Text: very large, so you had these sort of like position-wise feed-forward layers, right, and the

0:48:41 - 0:48:47     Text: feed-forward layer would take the 768-dimensional vector, sort of like project it to 3,000-dimensional

0:48:47 - 0:48:52     Text: space through the sort of non-linearity, and then project it back to 768.

0:48:52 - 0:48:56     Text: This ends up being because you can squash a lot more parameters in, for not too much

0:48:56 - 0:49:01     Text: more compute in this way, but that's curious.

0:49:01 - 0:49:06     Text: Okay, and then, byte-parent coding, it's actually, was this one byte-parent coding?

0:49:06 - 0:49:10     Text: Well, it was a sub-word vocabulary with 40,000 merges, so 40,000 merges, so that's not

0:49:10 - 0:49:15     Text: the size of the vocabulary because you started with a bunch of characters, and I don't remember

0:49:15 - 0:49:19     Text: how many characters they started with, but so it's a relatively small vocabulary you can

0:49:19 - 0:49:21     Text: see, right?

0:49:21 - 0:49:27     Text: And compared to, if you tried to say, have every word, have a unique representation, now

0:49:27 - 0:49:32     Text: it's going to be trained on books, corporates, it's got 7,000 unique books, and it contains

0:49:32 - 0:49:37     Text: long spans of contiguous texts, so you have, instead of, say, training it on individual

0:49:37 - 0:49:40     Text: sentences, just small short sentences, right?

0:49:40 - 0:49:46     Text: The model is able to learn long distance dependencies because you haven't split, like, a book

0:49:46 - 0:49:48     Text: into random sentences and shuffled them all around.

0:49:48 - 0:49:53     Text: You've sort of kept it contiguous, so we can have that sort of consistency.

0:49:53 - 0:49:58     Text: And then, a little treat here, yeah, so GPC never showed up in the original paper, or

0:49:58 - 0:50:03     Text: the original blog post, like as an acronym, and it could actually sort of refer to, like,

0:50:03 - 0:50:07     Text: generative pre-training, sort of what, like, the title of the paper would suggest, or

0:50:07 - 0:50:09     Text: generative pre-trained transformer.

0:50:09 - 0:50:13     Text: And I sort of decided to say generative pre-trained transformer because this seemed like way

0:50:13 - 0:50:15     Text: too general.

0:50:15 - 0:50:17     Text: So GPC.

0:50:17 - 0:50:22     Text: Okay, so they pre-trained this huge language model transformer, this huge transformer

0:50:22 - 0:50:25     Text: decoder, just on 7,000 books.

0:50:25 - 0:50:28     Text: And they fine-tuned it on a number of different tasks, and I want to talk a little bit about

0:50:28 - 0:50:31     Text: the details about how they fine-tuned it.

0:50:31 - 0:50:36     Text: And so they fine-tuned it on one particular task, or family tasks, called natural language

0:50:36 - 0:50:38     Text: inference.

0:50:38 - 0:50:43     Text: So in natural language inference, we're labeling pairs of sentences as entailing or contradictory

0:50:43 - 0:50:44     Text: to each other in neutral.

0:50:44 - 0:50:50     Text: So you have a premise, and you hold the premise as sort of true, the man is in the doorway.

0:50:50 - 0:50:54     Text: And you have a hypothesis, the person is near the door.

0:50:54 - 0:50:59     Text: If this person is referring to that man, then, you know, it's sort of like, oh, yeah,

0:50:59 - 0:51:04     Text: so this is sort of entailed because there's a person, because the man is a person, and

0:51:04 - 0:51:06     Text: they're in the doorway, then they are near the door.

0:51:06 - 0:51:11     Text: So you have this sort of logical reasoning that you're doing, or you're supposed to be

0:51:11 - 0:51:14     Text: able to be doing, and you're labeling these sentences.

0:51:14 - 0:51:15     Text: So it's a labeled task.

0:51:15 - 0:51:21     Text: You've got sort of an input that's cut into two parts, and then one of three outputs.

0:51:21 - 0:51:25     Text: Okay, so the GPT paper evaluates on this task.

0:51:25 - 0:51:28     Text: But what they've got is a transformer decoder.

0:51:28 - 0:51:30     Text: So what do they do?

0:51:30 - 0:51:37     Text: This is sort of one of the earlier examples of, you know, taking, instead of changing your

0:51:37 - 0:51:42     Text: neural network architecture to adapt to the kind of task you're doing, you're going to

0:51:42 - 0:51:49     Text: just format the task as like a bunch of tokens and not change your architecture.

0:51:49 - 0:51:53     Text: Because the pre-training was so useful, it's probably better to keep the architecture

0:51:53 - 0:51:59     Text: fixed, pre-training it, and then change the task specification to sort of fit the pre-trained

0:51:59 - 0:52:00     Text: architecture.

0:52:00 - 0:52:05     Text: So what they did, right, they put this token start, this is a special token, the man is

0:52:05 - 0:52:09     Text: in the doorway, some delimiter token, right.

0:52:09 - 0:52:15     Text: So this is just a linear sequence of tokens that we're giving as one big prefix to GPT.

0:52:15 - 0:52:21     Text: And then the person is near the door, and then some extra token here, right, extract.

0:52:21 - 0:52:25     Text: And then, you know, the linear classifier that we talked about, and sort of the first

0:52:25 - 0:52:32     Text: way to interact with models, with decoder models, it's applied to the representation of the

0:52:32 - 0:52:34     Text: extract token, right.

0:52:34 - 0:52:39     Text: So you have the last hidden state on top of extract, and then you fine tune the whole

0:52:39 - 0:52:41     Text: network to predict these labels, right.

0:52:41 - 0:52:48     Text: And so this sort of input formatting is increasingly, increasingly used to keep the model architecture

0:52:48 - 0:52:53     Text: the same and allow for a variety of different problems to be solved with it.

0:52:53 - 0:52:55     Text: Okay, and so did it work?

0:52:55 - 0:52:58     Text: Unnatural language inference, the answer is yes.

0:52:58 - 0:53:00     Text: So there's a number of different numbers here.

0:53:00 - 0:53:01     Text: I wouldn't worry too much about it.

0:53:01 - 0:53:06     Text: The fine tune transformer language model is sort of what you should pay attention to.

0:53:06 - 0:53:09     Text: There's a lot of effort that went into the other models, right.

0:53:09 - 0:53:11     Text: And so this is the story of pre-training.

0:53:11 - 0:53:15     Text: People put a lot of effort into models that do various sort of careful things.

0:53:15 - 0:53:20     Text: And then you take a single transformer and you say, I'm going to pre-training it on a

0:53:20 - 0:53:24     Text: ton of text and not worry too much about anything else and just fine tune it, and you end up

0:53:24 - 0:53:28     Text: doing super, super well.

0:53:28 - 0:53:32     Text: Sometimes not too much better in the GPT case than sort of the best known state of the

0:53:32 - 0:53:36     Text: art methods, but usually a little bit better.

0:53:36 - 0:53:39     Text: And again, the amount of effort, the amount of tasks, specific effort that you have to put

0:53:39 - 0:53:41     Text: into it, it's very low.

0:53:41 - 0:53:46     Text: Okay, and so what about the other way of interacting with decoters, right.

0:53:46 - 0:53:49     Text: So we had, we said that we can interact with decoters just by sampling from them, just

0:53:49 - 0:53:52     Text: by saying, well, there are probability distributions.

0:53:52 - 0:53:55     Text: So we can use them in their capacities as language models.

0:53:55 - 0:54:02     Text: And so GPT 2, this is just really just a bigger GPT, and we're too much about it, with

0:54:02 - 0:54:05     Text: larger hidden units, more layers.

0:54:05 - 0:54:10     Text: When it was trained on more data, it was shown to produce sort of relatively convincing

0:54:10 - 0:54:11     Text: samples of natural language.

0:54:11 - 0:54:14     Text: So this is something that went around Twitter a lot, right.

0:54:14 - 0:54:20     Text: So you have this sort of contrived example that probably didn't show up in the training

0:54:20 - 0:54:24     Text: data that has a scientist discovering a herd of unicorns.

0:54:24 - 0:54:31     Text: And then they sort of sample from a, almost the distribution of the model.

0:54:31 - 0:54:36     Text: They sort of give the model some extra credit here.

0:54:36 - 0:54:42     Text: They do something called truncating the distribution of the language models, sort of cut out noise

0:54:42 - 0:54:44     Text: at GPT 2.

0:54:44 - 0:54:52     Text: So it's not exactly a perfect sample, but more or less GPT 2 generated this.

0:54:52 - 0:54:56     Text: And so you have the scientist discovering unicorns, and then, you know, you have this

0:54:56 - 0:55:00     Text: consistency, okay, there's the scientist.

0:55:00 - 0:55:03     Text: You know, you have them giving you the name.

0:55:03 - 0:55:11     Text: You have, you refer back to this, well, yeah, you refer back to the scientist's name.

0:55:11 - 0:55:13     Text: You sort of have these like topic consistency things.

0:55:13 - 0:55:15     Text: Also the syntax is really good.

0:55:15 - 0:55:18     Text: It looks, you know, vaguely like English.

0:55:18 - 0:55:20     Text: And so this is sort of continued to be a trend.

0:55:20 - 0:55:23     Text: As we get larger and larger language models, we actually sample from them, even when we

0:55:23 - 0:55:29     Text: give them prompts that look sort of odd, and they seem to be increasingly convincing.

0:55:29 - 0:55:31     Text: Okay.

0:55:31 - 0:55:36     Text: So pre-training encoders, okay.

0:55:36 - 0:55:37     Text: Pre-training encoders.

0:55:37 - 0:55:42     Text: So let's take another second because I need some more water here.

0:55:42 - 0:55:44     Text: If there's another question, let me know.

0:55:44 - 0:55:53     Text: All right.

0:55:53 - 0:55:59     Text: So the benefit of encoders that we talked about was that they get this bidirectional context.

0:55:59 - 0:56:05     Text: So you can, while you're building representations of your sentence, of your parts of sentences,

0:56:05 - 0:56:08     Text: you can look to the future and that can help you build a better representation of the word

0:56:08 - 0:56:10     Text: that you're looking at right now.

0:56:10 - 0:56:13     Text: But the big problem is that we can't do language modeling now.

0:56:13 - 0:56:17     Text: So we've pretty much only said, we like, we've relied on this task that we already knew about

0:56:17 - 0:56:19     Text: language modeling to do our pre-training.

0:56:19 - 0:56:21     Text: But now we want to pre-training coders.

0:56:21 - 0:56:23     Text: And so we can't, we can't use it.

0:56:23 - 0:56:27     Text: So what are we going to do?

0:56:27 - 0:56:32     Text: Here's the solution that was come up with a paper that introduced the language model of

0:56:32 - 0:56:34     Text: the model called Bert.

0:56:34 - 0:56:37     Text: It's called masked language modeling.

0:56:37 - 0:56:39     Text: So here's the idea.

0:56:39 - 0:56:43     Text: We get the sentence and then we just take a fraction of the words and we replace them

0:56:43 - 0:56:46     Text: with a sort of a mask token.

0:56:46 - 0:56:50     Text: A token that's, that means you don't know what this is right now.

0:56:50 - 0:56:53     Text: And then you predict these words.

0:56:53 - 0:56:54     Text: Some details we'll get into in the next slide.

0:56:54 - 0:56:56     Text: But so here's what it looks like.

0:56:56 - 0:57:01     Text: We have the sentence, I mask to the mask.

0:57:01 - 0:57:03     Text: We get some hidden states for all of them, right?

0:57:03 - 0:57:08     Text: So we haven't changed the transformer encoder at all.

0:57:08 - 0:57:10     Text: We've just said, okay, here's like this sequence.

0:57:10 - 0:57:12     Text: You get to see everything, right?

0:57:12 - 0:57:13     Text: Look at all the arrows going everywhere.

0:57:13 - 0:57:19     Text: But then, right, we have this prediction layer that we're, that we're, that we're pre-training,

0:57:19 - 0:57:20     Text: right?

0:57:20 - 0:57:21     Text: And we're using it.

0:57:21 - 0:57:26     Text: We only have loss on the words where we had masks here.

0:57:26 - 0:57:31     Text: So I had this masked and then I have to predict that it was went that went here and store

0:57:31 - 0:57:32     Text: that went here.

0:57:32 - 0:57:36     Text: And now this is a lot like language modeling you might say.

0:57:36 - 0:57:39     Text: But now you don't need to have this sort of left to right decomposition.

0:57:39 - 0:57:43     Text: You're saying, I'm going to remove some of the words and you have to predict what they

0:57:43 - 0:57:44     Text: are.

0:57:44 - 0:57:46     Text: This is called masked language modeling.

0:57:46 - 0:57:49     Text: And it's been very, very, very effective with a quick caveat.

0:57:49 - 0:57:51     Text: It gets a little more complicated.

0:57:51 - 0:57:54     Text: So, so what did they actually do?

0:57:54 - 0:57:56     Text: They, they proposed masked language modeling.

0:57:56 - 0:57:59     Text: And they released the weights of this, of this pre-trained transformer.

0:57:59 - 0:58:03     Text: So the little bit more complexity to get masked language modeling to work.

0:58:03 - 0:58:09     Text: So you are going to take a random 15% of the sub word tokens.

0:58:09 - 0:58:10     Text: That was, that was true.

0:58:10 - 0:58:14     Text: But you're not always going to replace them with mask.

0:58:14 - 0:58:19     Text: You can think of it like, if the model sees a mask token, it gets a guarantee that it

0:58:19 - 0:58:21     Text: needs to predict something.

0:58:21 - 0:58:26     Text: And if the model doesn't see a mask token, it gets a guarantee that it doesn't need to

0:58:26 - 0:58:27     Text: predict anything.

0:58:27 - 0:58:33     Text: So why should it bother building strong representations of the words that aren't masked?

0:58:33 - 0:58:36     Text: And I want my model to build strong representations of everything.

0:58:36 - 0:58:38     Text: So we're going to add some sort of uncertainty to the model.

0:58:38 - 0:58:43     Text: So what we're going to do is, for those 15% of tokens, 80% of the time, we're going

0:58:43 - 0:58:44     Text: to replace it with a mask.

0:58:44 - 0:58:48     Text: That was our original idea of mask language modeling.

0:58:48 - 0:58:52     Text: Then 10% of the time, we're actually going to replace the word with just a random token.

0:58:52 - 0:58:56     Text: Just a random vocabulary item can be anything.

0:58:56 - 0:58:59     Text: And then the other 10% of the time, we're going to leave the word unchanged.

0:58:59 - 0:59:03     Text: So now, it sees a word.

0:59:03 - 0:59:06     Text: It could be a random token, or it could be unchanged.

0:59:06 - 0:59:10     Text: And if I see a mask, I know I need to predict it.

0:59:10 - 0:59:15     Text: So what these two things do here is say, you have to sort of be doing this, you have to

0:59:15 - 0:59:18     Text: be on your toes for every word in your representation.

0:59:18 - 0:59:22     Text: So here, I pizza to the mask.

0:59:22 - 0:59:27     Text: And it turns out, and the model didn't know this, but it's getting three lost terms for

0:59:27 - 0:59:28     Text: this sentence.

0:59:28 - 0:59:32     Text: It only has one mask, but it's going to be penalized for predicting three different things.

0:59:32 - 0:59:35     Text: And it needs to predict that this word is actually went.

0:59:35 - 0:59:37     Text: So I replaced this one.

0:59:37 - 0:59:41     Text: It needs to predict that this word is two, is in fact the word two.

0:59:41 - 0:59:46     Text: And then it needs to predict that this word is in fact store.

0:59:46 - 0:59:49     Text: Now as a short interlude, you might be thinking, you might be thinking, John, there's no way

0:59:49 - 0:59:52     Text: the model could know this.

0:59:52 - 0:59:54     Text: It's so under specified.

0:59:54 - 0:59:56     Text: I pizza is a little weird, I admit.

0:59:56 - 0:59:58     Text: But there's just no way to know that this is store or in went into.

0:59:58 - 1:00:01     Text: I mean, the same thing is true of language modeling.

1:00:01 - 1:00:05     Text: So it's going to end up learning these average statistics about what things tend to be in

1:00:05 - 1:00:06     Text: the given context.

1:00:06 - 1:00:10     Text: And it's going to sort of hedge its bets and try to build a distribution of what things

1:00:10 - 1:00:12     Text: could appear there.

1:00:12 - 1:00:14     Text: So for the people who are thinking that, if there wasn't, that's what you should be

1:00:14 - 1:00:15     Text: thinking.

1:00:15 - 1:00:18     Text: It has to sort of know what kinds of things will end up in these slots.

1:00:18 - 1:00:23     Text: It has other uncertainty, because it can't be sure that any of the other words are necessarily

1:00:23 - 1:00:25     Text: right.

1:00:25 - 1:00:30     Text: And then it is, it's predicting these three words.

1:00:30 - 1:00:36     Text: And so you can see why it's important to not just have masks potentially, to have these

1:00:36 - 1:00:41     Text: sort of token randomization things, because again, we don't actually care about its ability

1:00:41 - 1:00:43     Text: to predict the masks.

1:00:43 - 1:00:48     Text: I'm not going to usually, I'm not going to actually sample from the model's distribution

1:00:48 - 1:00:50     Text: over what should go here.

1:00:50 - 1:00:56     Text: Instead, I am going to use the parameters of the neural network and expect that it built

1:00:56 - 1:00:58     Text: strong representations of language.

1:00:58 - 1:01:02     Text: So I don't want it to think it's got a free pass for representing something if it doesn't

1:01:02 - 1:01:06     Text: have a mask there.

1:01:06 - 1:01:14     Text: So there was one extra thing with the BERT pre-training, which is a next sentence prediction

1:01:14 - 1:01:15     Text: objective.

1:01:15 - 1:01:17     Text: So the input to BERT looks like this.

1:01:17 - 1:01:19     Text: This is straight from the BERT paper.

1:01:19 - 1:01:24     Text: You have a label here before your first sentence, and then a separation, and then a second

1:01:24 - 1:01:25     Text: sentence.

1:01:25 - 1:01:29     Text: So you had always two contiguous chunks of text.

1:01:29 - 1:01:31     Text: You had a first chunk of text here.

1:01:31 - 1:01:33     Text: My dog is cute.

1:01:33 - 1:01:35     Text: And then a second chunk of text, he likes playing.

1:01:35 - 1:01:38     Text: You can see the sub words there.

1:01:38 - 1:01:42     Text: And now these would actually be both be much longer.

1:01:42 - 1:01:47     Text: So these whole thing would be 512 words, and it would be about half, and that would be

1:01:47 - 1:01:51     Text: about half, and they'd be contiguous chunks of text.

1:01:51 - 1:01:53     Text: But here was the deal.

1:01:53 - 1:01:57     Text: What they wanted to do was they wanted to try to teach the system to understand sort of

1:01:57 - 1:02:01     Text: relationships between different whole pieces of text.

1:02:01 - 1:02:06     Text: In order to better pre-trained for downstream applications like question answering, where

1:02:06 - 1:02:11     Text: you have two pretty different pieces of text, and you need to know how they relate to

1:02:11 - 1:02:12     Text: each other.

1:02:12 - 1:02:18     Text: So the objective they came up with was you should sometimes have the second chunk of text

1:02:18 - 1:02:26     Text: be the actual chunk of text that directly follows the first in your data set, and sometimes

1:02:26 - 1:02:32     Text: have the second chunk of text be randomly sampled from somewhere else, so unrelated.

1:02:32 - 1:02:37     Text: And the model should predict whether it's the first case or the second.

1:02:37 - 1:02:41     Text: In order, again, to sort of have to reason about the relationships between the two chunks

1:02:41 - 1:02:42     Text: of text.

1:02:42 - 1:02:44     Text: So this is next sentence prediction.

1:02:44 - 1:02:48     Text: I think it's important to think about because it's a very different idea of pre-training

1:02:48 - 1:02:53     Text: objective than language modeling and masked language modeling.

1:02:53 - 1:02:58     Text: Even though later we're sort of argued that in the case of BERT, it's not necessary or

1:02:58 - 1:02:59     Text: useful.

1:02:59 - 1:03:06     Text: And one of the arguments is actually because it's actually way better to have a single

1:03:06 - 1:03:12     Text: context that's twice as long, so you can learn even longer distance dependencies and things.

1:03:12 - 1:03:15     Text: And so whether the objective itself would be useful if you could always just double

1:03:15 - 1:03:18     Text: the context size, I'm not sure if anyone's done research on that.

1:03:18 - 1:03:22     Text: But again, it's like a different kind of objective, and it's still noisy something about

1:03:22 - 1:03:23     Text: the input, right?

1:03:23 - 1:03:28     Text: The input was this big chunk of text, and you've noise it to say like, now you don't know

1:03:28 - 1:03:32     Text: whether it really was that or whether you sort of replaced it with a bunch of garbage,

1:03:32 - 1:03:39     Text: this sort of second portion here, whether the second portion has been replaced with something

1:03:39 - 1:03:44     Text: that didn't actually come from the same sequence.

1:03:44 - 1:03:49     Text: Okay, so let's talk some details about BERT.

1:03:49 - 1:03:53     Text: So BERT had 12 or 24 layers, depending on BERT base or BERT large.

1:03:53 - 1:03:57     Text: You'll probably use one of these models or one of the sort of descendants of these models

1:03:57 - 1:04:02     Text: if you choose to do something with the custom final project potentially, or if you choose

1:04:02 - 1:04:06     Text: the version of the default final project.

1:04:06 - 1:04:11     Text: And you had a 600 or a 1000 dimension hidden states, a bunch of attention heads, so this

1:04:11 - 1:04:14     Text: is that multi-headed attention, remember, about a bunch of them.

1:04:14 - 1:04:19     Text: So you're splitting all your dimensions into those 16 heads, and we're talking on the

1:04:19 - 1:04:23     Text: order of a couple hundred million parameters.

1:04:23 - 1:04:28     Text: At the time, right in 2018, we were like, whoa, that's a lot of parameters.

1:04:28 - 1:04:32     Text: How do you, that's a lot of parameters.

1:04:32 - 1:04:35     Text: And now, models are way, way, way, way bigger.

1:04:35 - 1:04:39     Text: So let's keep track of sort of the model sizes as we're going through this.

1:04:39 - 1:04:42     Text: And let's come back now to the corpus sizes as well.

1:04:42 - 1:04:43     Text: So we have books corpus.

1:04:43 - 1:04:45     Text: And this is the number of words there.

1:04:45 - 1:04:50     Text: This is the same thing that GPT-1 was trained on, 800 million words.

1:04:50 - 1:04:56     Text: Now we're going to train on also English Wikipedia, it's 250, sorry, that's 2,500 million,

1:04:56 - 1:04:59     Text: so that's 2,500,000,000 words.

1:04:59 - 1:05:06     Text: And again, to give you an idea of what is done in practice, right, pre-training is expensive

1:05:06 - 1:05:11     Text: and impractical for most users, let's say.

1:05:11 - 1:05:16     Text: So if you are a researcher with a GPU or five GPUs or something like that, you tend to

1:05:16 - 1:05:20     Text: not really be pre-training your whole own BERT model unless you're willing to spend

1:05:20 - 1:05:22     Text: a long time doing it.

1:05:22 - 1:05:25     Text: BERT itself was pre-trained with 64 TPU chips.

1:05:25 - 1:05:31     Text: A TPU is a special kind of hardware accelerator that accelerates the tensor operations effectively

1:05:31 - 1:05:35     Text: is developed by Google.

1:05:35 - 1:05:40     Text: So TPUs are just fast and can hold a lot.

1:05:40 - 1:05:42     Text: And for four days they had 64 chips.

1:05:42 - 1:05:46     Text: So if you have one GPU which you can think of as less than a single TPU, you're going

1:05:46 - 1:05:48     Text: to be waiting a long time to pre-training.

1:05:48 - 1:05:54     Text: But fine-tuning is so fast, it's so fast and impractical, it's common on a single

1:05:54 - 1:06:00     Text: GPU, you'll see how much faster fine-tuning is than pre-training in assignment five.

1:06:00 - 1:06:06     Text: And so this becomes, I think, a refrain of the field, you pre-trained once or handful

1:06:06 - 1:06:11     Text: of times, right, like a couple of people released big pre-trained models and then you fine-tune

1:06:11 - 1:06:15     Text: many times, right, so you save those parameters from pre-training and you fine-tune on all

1:06:15 - 1:06:20     Text: kinds of different problems.

1:06:20 - 1:06:25     Text: And that paradigm, right, taking something like Bert or whatever the best descendant of

1:06:25 - 1:06:31     Text: Bert is and taking it pre-trained and then fine-tuning it on what you want is pretty

1:06:31 - 1:06:37     Text: close to, you know, it's a very, very strong baseline in NLP right now, right?

1:06:37 - 1:06:40     Text: So and the simplicity is pretty fascinating.

1:06:40 - 1:06:46     Text: And there's one code base called Transformers from a company called Hugging Face that

1:06:46 - 1:06:51     Text: makes this just really just a couple of lines of Python to try out as well.

1:06:51 - 1:06:57     Text: So it sort of opened up very strong baselines without too, too much effort for a lot of

1:06:57 - 1:06:58     Text: tasks.

1:06:58 - 1:07:01     Text: Okay, so let's talk about evaluation.

1:07:01 - 1:07:06     Text: So pre-training is pitched as requiring all this different kind of language understanding.

1:07:06 - 1:07:11     Text: And the field is, the field of NLP has a hard time doing evaluation.

1:07:11 - 1:07:15     Text: But we try our best and we build datasets that we think are hard for various reasons because

1:07:15 - 1:07:19     Text: they require you to know stuff about language and about the world and about reasoning.

1:07:19 - 1:07:26     Text: And so when we evaluate whether pre-training is getting you a lot of sort of general knowledge,

1:07:26 - 1:07:30     Text: we evaluate on a lot of these tasks.

1:07:30 - 1:07:37     Text: So we evaluate on things like paraphrase detection on core questions.

1:07:37 - 1:07:39     Text: Natural language inference we saw.

1:07:39 - 1:07:43     Text: We have hard sentiment analysis datasets or what we're hard sentiment analysis datasets

1:07:43 - 1:07:45     Text: a couple of years ago.

1:07:45 - 1:07:50     Text: And actually, figuring out if sentences are grammatical tends to be hard.

1:07:50 - 1:07:54     Text: Determining the semantic similarity of text can be hard.

1:07:54 - 1:07:55     Text: Paraphrasing again.

1:07:55 - 1:07:57     Text: Natural language inference on a very, very small dataset.

1:07:57 - 1:08:01     Text: So this is this pre-training help you train on smaller datasets.

1:08:01 - 1:08:03     Text: The answer is yes, sort of thing.

1:08:03 - 1:08:09     Text: And so the birth folks released their paper after GPT was released.

1:08:09 - 1:08:13     Text: And there were a lot of sort of state of the art results that came from various things

1:08:13 - 1:08:16     Text: that you were supposed to be doing.

1:08:16 - 1:08:22     Text: And the results that you get sort of with pre-training, so here's open AI, GPT, here's

1:08:22 - 1:08:23     Text: birth base and large.

1:08:23 - 1:08:25     Text: The last three rows are all pre-trained.

1:08:25 - 1:08:32     Text: Elmo is sort of in the middle between pre-training the whole model and just having word embeddings.

1:08:32 - 1:08:34     Text: That's what this is.

1:08:34 - 1:08:39     Text: And the numbers you get are just, I think, to the field where quite astounding actually.

1:08:39 - 1:08:44     Text: We were all surprised that there was that much left to even be gotten on some of these datasets.

1:08:44 - 1:08:49     Text: And taking here, so this line in the table is unmarked when it's actually the number

1:08:49 - 1:08:50     Text: of training examples.

1:08:50 - 1:08:53     Text: This dataset has 2.5,000 training examples.

1:08:53 - 1:08:59     Text: And before sort of the big transformers came around, we had 60% accuracy on it.

1:08:59 - 1:09:01     Text: We run transformers on it.

1:09:01 - 1:09:03     Text: We get 10 points just by pre-training.

1:09:03 - 1:09:07     Text: And this has been a trend that has just continued.

1:09:07 - 1:09:11     Text: So why do anything but pre-trained encoders?

1:09:11 - 1:09:13     Text: We know encoders are good.

1:09:13 - 1:09:15     Text: We like the fact that you have bidirectional context.

1:09:15 - 1:09:18     Text: We also saw that BERT did better than GPT.

1:09:18 - 1:09:27     Text: But if you want to actually get it to do things, you can't just generate sequences from

1:09:27 - 1:09:32     Text: it the same way that you would from a model like GPT, a pre-trained decoder.

1:09:32 - 1:09:34     Text: You can sort of sample what things should go in a mask.

1:09:34 - 1:09:39     Text: So here's a mask. You can put a mask somewhere, sample the words that should go there.

1:09:39 - 1:09:42     Text: But if you want to sample whole context, right, if you want to get that story about the

1:09:42 - 1:09:46     Text: unicorns, for example, the encoder is not what you want to do.

1:09:46 - 1:09:51     Text: So they have sort of different contracts, and they can be used naturally at least in

1:09:51 - 1:09:53     Text: different ways.

1:09:53 - 1:09:57     Text: Okay, so let's talk very briefly about extensions of BERT.

1:09:57 - 1:10:00     Text: So they're BERT variants like Roberta and Spanbert.

1:10:00 - 1:10:04     Text: And there's just a bunch of papers with the word BERT in the title that did various things.

1:10:04 - 1:10:06     Text: Two very strong takeaways.

1:10:06 - 1:10:08     Text: Roberta, train BERT longer.

1:10:08 - 1:10:10     Text: BERT is underfit.

1:10:10 - 1:10:11     Text: Train it on more data.

1:10:11 - 1:10:13     Text: Train it for more steps.

1:10:13 - 1:10:18     Text: Spanbert, mask, contiguous spans of sub words.

1:10:18 - 1:10:21     Text: Words makes a harder, more useful pre-training task.

1:10:21 - 1:10:25     Text: So this is the idea that we can come up with better ways of noisy the input, of hiding

1:10:25 - 1:10:30     Text: stuff in the input, or breaking stuff in the input for our model to correct.

1:10:30 - 1:10:37     Text: So for example, if you have the sentence mask, ear, razz, razz, good, it's just not that

1:10:37 - 1:10:43     Text: hard to know that this is irresistibly, right, because like what could this possibly

1:10:43 - 1:10:44     Text: be after these sub words?

1:10:44 - 1:10:51     Text: So this is irresist, you know, something's about to come here and it's probably the end

1:10:51 - 1:10:52     Text: of that word.

1:10:52 - 1:10:57     Text: Whereas if you mask a long sequence of things, right now this is much harder, and actually

1:10:57 - 1:11:02     Text: you're getting a useful signal that is irresistibly good, and you sort of needed to mask all of

1:11:02 - 1:11:04     Text: them to make the task interesting.

1:11:04 - 1:11:08     Text: So Spanbert was like, oh, you should do this.

1:11:08 - 1:11:10     Text: This was super useful as well.

1:11:10 - 1:11:15     Text: So Roberta, just to point you at the fact that Roberta showed that BERT was underfit,

1:11:15 - 1:11:21     Text: you know, he said, BERT was trained on about 13 gigabytes of text, it got some accuracies,

1:11:21 - 1:11:27     Text: you can get above the amazing results of BERT, four extra points or so here, right, just

1:11:27 - 1:11:35     Text: by taking the identical model and training it on more data, the larger batch size for

1:11:35 - 1:11:36     Text: a long time.

1:11:36 - 1:11:41     Text: And if you train it, yeah, even longer without sort of more data, you don't get any

1:11:41 - 1:11:45     Text: better.

1:11:45 - 1:11:49     Text: Very briefly, okay, so very briefly on the encoder decoders.

1:11:49 - 1:11:53     Text: So we've seen decoders can be good because we get to play with the contracts that they

1:11:53 - 1:11:57     Text: give us, we get to play with them as language models, encoders give us that bidirectional

1:11:57 - 1:11:58     Text: context.

1:11:58 - 1:12:02     Text: So encoder decoders, maybe we get both.

1:12:02 - 1:12:04     Text: In practice, they're actually, yeah, pretty strong.

1:12:04 - 1:12:11     Text: So there was a, right, we could, so I guess one of the questions is like, what do we do

1:12:11 - 1:12:13     Text: to pre-train them?

1:12:13 - 1:12:18     Text: So we could do something like language modeling, right, where we take a sequence of words,

1:12:18 - 1:12:27     Text: one to word two t instead of t, right, and so as I have word one here, dot, dot, dot,

1:12:27 - 1:12:32     Text: word t, we provide those all to our encoder and we predict on none of them.

1:12:32 - 1:12:37     Text: And then we have word t plus one to word two t here in our decoder, right, and we predict

1:12:37 - 1:12:38     Text: on these.

1:12:38 - 1:12:42     Text: So we're doing language modeling on half the sequence and we've taken the other half

1:12:42 - 1:12:46     Text: to have our bidirectional encoder, right, so we're building strong representations on

1:12:46 - 1:12:52     Text: the encoder side, not predicting language modeling on any of this.

1:12:52 - 1:12:55     Text: And then we, on the other half of the tokens, we predict, you know, as a language model

1:12:55 - 1:12:57     Text: would do.

1:12:57 - 1:13:01     Text: And the hope is that you sort of pre-trained both of these well through the one language

1:13:01 - 1:13:04     Text: modeling loss up here.

1:13:04 - 1:13:06     Text: And this is actually, so this works pretty well.

1:13:06 - 1:13:11     Text: The encoder benefits from bidirectionality, the decoder, you can use to train the model.

1:13:11 - 1:13:19     Text: But what this paper showed that introduced the Model T5, roughly at all, found to work

1:13:19 - 1:13:22     Text: best was actually a very, or at least a somewhat different objective.

1:13:22 - 1:13:26     Text: And this should keep in your mind sort of that we have different ways of specifying the

1:13:26 - 1:13:30     Text: pre-training objectives and they will really work differently from each other.

1:13:30 - 1:13:33     Text: So what they said, let's say you have an original text like this.

1:13:33 - 1:13:37     Text: Thank you for inviting me to your party last week.

1:13:37 - 1:13:43     Text: We're going to define variable length spans in the text to replace with a unique symbol

1:13:43 - 1:13:46     Text: that says something is missing here.

1:13:46 - 1:13:48     Text: And then we'll replace and then we'll remove that.

1:13:48 - 1:13:57     Text: So now our input to our encoder is thank you symbol one, me to your party symbol to week.

1:13:57 - 1:14:01     Text: So we've noise the input, we've hidden stuff in the input.

1:14:01 - 1:14:05     Text: Also really interestingly, this doesn't say how long this is supposed to be.

1:14:05 - 1:14:07     Text: That's different from BERT.

1:14:07 - 1:14:10     Text: BERT said, oh, you masked this many sub words.

1:14:10 - 1:14:13     Text: This says, well, I got some token that says something's missing here.

1:14:13 - 1:14:14     Text: And I don't know what it is.

1:14:14 - 1:14:17     Text: I don't even know how many sub words it is.

1:14:17 - 1:14:24     Text: And then so you have this in your encoder and then your decoder predicts the first special

1:14:24 - 1:14:27     Text: word, this x here.

1:14:27 - 1:14:30     Text: And then what was missing for inviting.

1:14:30 - 1:14:33     Text: So thank you x for inviting.

1:14:33 - 1:14:34     Text: And then it predicts y.

1:14:34 - 1:14:35     Text: Here's this y here.

1:14:35 - 1:14:40     Text: And then what was missing from the y last week.

1:14:40 - 1:14:42     Text: This is called span corruption.

1:14:42 - 1:14:47     Text: And it's really interesting to me because in terms of the actual encoder decoder, we don't

1:14:47 - 1:14:51     Text: have to change it compared to whether we, if we were just doing language modeling pre-training.

1:14:51 - 1:14:54     Text: Because I just do language modeling on all these things.

1:14:54 - 1:14:57     Text: I just predict these words as if I'm a language model.

1:14:57 - 1:15:01     Text: I've just done a text pre-processing step.

1:15:01 - 1:15:06     Text: So the actual, I've just pre-processed the text to look like, oh, yeah, take the input,

1:15:06 - 1:15:11     Text: make it look like this, then make an output that looks like that up there.

1:15:11 - 1:15:15     Text: And the model gets to do what is effectively language modeling, but it actually works better.

1:15:15 - 1:15:18     Text: So there's a lot of numbers I realize.

1:15:18 - 1:15:20     Text: But look at the star here.

1:15:20 - 1:15:25     Text: This encoder decoder with a denoising objective that tends to work the best.

1:15:25 - 1:15:31     Text: And they tried similar models like a prefix language model that was sort of the first

1:15:31 - 1:15:35     Text: try that we had at defining a pre-training objective for language models, sorry, for encoder

1:15:35 - 1:15:38     Text: decoders.

1:15:38 - 1:15:41     Text: And then they had another, a number of other options, but what worked best for the encoder

1:15:41 - 1:15:43     Text: decoders.

1:15:43 - 1:15:48     Text: And one of the fascinating things about T5 is that you could pre-train it and fine tune

1:15:48 - 1:15:54     Text: it on questions like when was Franklin D. Roosevelt born and fine tune it to produce the

1:15:54 - 1:15:55     Text: answer.

1:15:55 - 1:15:58     Text: And then you could ask it new questions at test time.

1:15:58 - 1:16:02     Text: And then it would retrieve the answer from its parameters with some accuracy.

1:16:02 - 1:16:05     Text: And it would do so relatively well actually.

1:16:05 - 1:16:10     Text: And it would do so maybe 25% of the time on some of these data sets with 220 million

1:16:10 - 1:16:11     Text: parameters.

1:16:11 - 1:16:15     Text: And then at 11 billion parameters, this is way bigger than Bert large.

1:16:15 - 1:16:20     Text: It would do so even better, sometimes even doing as well as systems that were allowed

1:16:20 - 1:16:22     Text: to look at stuff other than their own parameters.

1:16:22 - 1:16:26     Text: So again, this is just making this answer come from its parameters.

1:16:26 - 1:16:30     Text: Yeah, I'm going to have to skip this.

1:16:30 - 1:16:35     Text: So if you look back at this slide after class, I have each of the examples of the things

1:16:35 - 1:16:40     Text: that we could imagine learning from pre-training with a label of what you might be learning.

1:16:40 - 1:16:43     Text: So this example is 10 for universities located in blank.

1:16:43 - 1:16:44     Text: You might learn trivia.

1:16:44 - 1:16:46     Text: In all these cases, there's all these things you can learn.

1:16:46 - 1:16:53     Text: One thing I will say is that models also learn and can make even worse racism, sexism,

1:16:53 - 1:16:56     Text: all manner of bad biases that are encoded in our text.

1:16:56 - 1:17:00     Text: When I say, yeah, they do this.

1:17:00 - 1:17:02     Text: And so we'll learn more about this in our later lectures, but it's important to keep

1:17:02 - 1:17:06     Text: in mind that when you're doing pre-training, you're learning a lot of stuff, and not all

1:17:06 - 1:17:09     Text: of it is good.

1:17:09 - 1:17:16     Text: So with GPT-3, the last thing here is that there's this third way of interacting with models

1:17:16 - 1:17:19     Text: that's related to treating them as language models.

1:17:19 - 1:17:25     Text: So GPT-3 is this very, very large model that was released by OpenAI.

1:17:25 - 1:17:31     Text: But it seems to be able to learn from examples in their context, their decoder context,

1:17:31 - 1:17:36     Text: without gradient steps, simply by looking sort of within their history.

1:17:36 - 1:17:40     Text: And now GPT-3 has 175 billion parameters, right?

1:17:40 - 1:17:44     Text: The last T5 model we saw was 11 billion parameters.

1:17:44 - 1:17:48     Text: And it seems to be sort of the canonical example of this working.

1:17:48 - 1:17:52     Text: And so what it looks like is you give it as part of its prefix.

1:17:52 - 1:17:56     Text: This goes to Merci, hello, goes to Mint, goes to writes, you've got these translation

1:17:56 - 1:18:03     Text: examples, you ask it for the last one, and it comes up with the correct translation.

1:18:03 - 1:18:06     Text: Seemingly because it's learned something about the task that you're sort of telling

1:18:06 - 1:18:08     Text: it to do through its prefix.

1:18:08 - 1:18:10     Text: And so you might do the same thing with addition.

1:18:10 - 1:18:15     Text: So something, if I plus eight is 13, give it addition examples, you might do the next

1:18:15 - 1:18:18     Text: addition example for you.

1:18:18 - 1:18:24     Text: Or maybe trying to figure out grammatical or spelling errors, for example.

1:18:24 - 1:18:29     Text: And here's the French case.

1:18:29 - 1:18:33     Text: So again, you're learning just to do pre-training.

1:18:33 - 1:18:40     Text: But when you're evaluating it, you don't even fine tune the model, you just provide prefixes.

1:18:40 - 1:18:43     Text: And so this especially is not well understood.

1:18:43 - 1:18:48     Text: And so a lot of research is going into sort of what the limitations of this so-called

1:18:48 - 1:18:49     Text: in-context learning are.

1:18:49 - 1:18:53     Text: But it's a fascinating direction for future work.

1:18:53 - 1:18:56     Text: In total, these models are not well understood.

1:18:56 - 1:19:01     Text: However, small, small, in-air growth models like Bert have become general tools in a wide

1:19:01 - 1:19:02     Text: range of settings.

1:19:02 - 1:19:06     Text: They do have these issues about learning all these biases about the world.

1:19:06 - 1:19:10     Text: They'll go into and further lectures in this course.

1:19:10 - 1:19:15     Text: And so, yeah, what you've learned this week, transformers and pre-training form the basis

1:19:15 - 1:19:20     Text: or at least the base lines for much of a natural language processing today.

1:19:20 - 1:19:25     Text: And assignment five is out and you'll be able to look more into it.

1:19:25 - 1:19:26     Text: And I'm over time.

1:19:26 - 1:19:27     Text: All right.

1:19:27 - 1:19:28     Text: Yeah.

1:19:28 - 1:19:39     Text: I guess I can take a question if there is any, but people can keep going as well.

1:19:39 - 1:19:50     Text: So I think that I think there's a question about P5, which was how does the D-toder know

1:19:50 - 1:19:52     Text: that I'm currently predicting X for Y?

1:19:52 - 1:19:54     Text: Could you repeat that?

1:19:54 - 1:19:55     Text: Yeah.

1:19:55 - 1:20:00     Text: So about P5, there's a question that was asking how does the D-toder know it's currently

1:20:00 - 1:20:04     Text: predicting X for Y?

1:20:04 - 1:20:09     Text: It's hierarchy of predicting X for Y?

1:20:09 - 1:20:13     Text: I guess it doesn't specify it's going to change how does it know that it's currently

1:20:13 - 1:20:15     Text: predicting X for Y?

1:20:15 - 1:20:16     Text: OK.

1:20:16 - 1:20:17     Text: Yeah.

1:20:17 - 1:20:18     Text: That makes sense.

1:20:18 - 1:20:19     Text: So what it does, right?

1:20:19 - 1:20:23     Text: So it knows from the encoder that it has to at some point predict X and at some point

1:20:23 - 1:20:28     Text: predict Y because the encoder can just like remember that, oh, yeah, there's two things

1:20:28 - 1:20:29     Text: missing.

1:20:29 - 1:20:33     Text: And if there were more spans replaced, there would be a Z and then an A and then a B and

1:20:33 - 1:20:38     Text: you know whatever, just a bunch of unique identifiers.

1:20:38 - 1:20:44     Text: And then up here, it gets to say, OK, I have attention, I suppose.

1:20:44 - 1:20:48     Text: I can look and I know that first I have to predict this first master thing.

1:20:48 - 1:20:52     Text: So I'm going to generate that in my D-coder and then it gets that symbol, right?

1:20:52 - 1:20:55     Text: So we're doing training by giving it the right symbol.

1:20:55 - 1:20:59     Text: Now it gets that X and it says, OK, I'm predicting X now.

1:20:59 - 1:21:01     Text: And now it can predict, predict, predict, predict.

1:21:01 - 1:21:02     Text: Then it gets Y.

1:21:02 - 1:21:06     Text: So we're doing this teacher forcing training where we give it the right answer after penalizing

1:21:06 - 1:21:07     Text: it if it's wrong.

1:21:07 - 1:21:10     Text: Now it gets this Y, right?

1:21:10 - 1:21:12     Text: And it says, OK, now I have to predict what should go and why.

1:21:12 - 1:21:16     Text: And it can attend, you know, into the natural parts of this as well as what it's already

1:21:16 - 1:21:22     Text: predicted here because the decoder has attention within itself and it can see what should go

1:21:22 - 1:21:23     Text: there.

1:21:23 - 1:21:25     Text: So what's fascinating here is you're doing something like language modeling.

1:21:25 - 1:21:29     Text: But when you're predicting Y, right, you get to see what came after it.

1:21:29 - 1:21:31     Text: And that's I think one of the benefits of span corruption.

1:21:31 - 1:21:34     Text: So you're doing this thing where you don't know how long you should be predicting for

1:21:34 - 1:21:39     Text: like language modeling, but you get to know what came after the thing that's missing.