0:00:00 - 0:00:07 Text: Hello, everybody.
0:00:07 - 0:00:10 Text: Welcome to CS224N lecture 10.
0:00:10 - 0:00:16 Text: This is going to be primarily on pre-training, but we will also discuss sub-word models a
0:00:16 - 0:00:18 Text: little bit and review transformers.
0:00:18 - 0:00:26 Text: Okay, so we have a lot of exciting things to get into today, but some reminders about
0:00:26 - 0:00:30 Text: the class.
0:00:30 - 0:00:32 Text: Assignment 5 is being released today.
0:00:32 - 0:00:37 Text: Assignment 4 was due a minute ago, so if you are done with that, congratulations.
0:00:37 - 0:00:41 Text: If not, I hope that the late days go well.
0:00:41 - 0:00:48 Text: Assignment 5 is on pre-training and transformers, so these lectures are going to be very useful
0:00:48 - 0:00:52 Text: to you for that and I just don't cover anything after these lectures.
0:00:52 - 0:00:54 Text: All right.
0:00:54 - 0:01:00 Text: So today, let's kind of take a little peek through what the outline will be.
0:01:00 - 0:01:04 Text: We haven't talked about sub-word modeling yet and sort of we should have.
0:01:04 - 0:01:06 Text: And so we're going to talk a little bit about sub-words.
0:01:06 - 0:01:11 Text: You saw these in assignment 4, all just, you know, as the data that we provided to you
0:01:11 - 0:01:15 Text: with your machine translation system, but we're going to talk a little bit about why they're
0:01:15 - 0:01:21 Text: so ubiquitous in NLP because they are used in pre-trained models.
0:01:21 - 0:01:26 Text: I mean, they're used in a number of different models, but when we discuss pre-training,
0:01:26 - 0:01:29 Text: it's important to know that sub-words are part of it.
0:01:29 - 0:01:34 Text: Then we'll sort of motivate, we'll go on another journey of motivation of motivating
0:01:34 - 0:01:36 Text: model pre-training from word embedding.
0:01:36 - 0:01:41 Text: So we've already seen pre-training in some sense in the very first lecture of this course
0:01:41 - 0:01:46 Text: because we pre-trained individual word embeddings that don't take into account their contexts
0:01:46 - 0:01:50 Text: on very large text corporate and saw that they were able to encode a lot of useful things
0:01:50 - 0:01:53 Text: about language.
0:01:53 - 0:01:57 Text: So after we do that motivation, we'll go through model pre-training three ways.
0:01:57 - 0:02:00 Text: And we're going to, you know, reference actually the lecture on Tuesday.
0:02:00 - 0:02:02 Text: So this is why we're going to review a little bit of the transformer stuff.
0:02:02 - 0:02:07 Text: We'll talk about model pre-training in decoters, like a transformer decoder that we saw
0:02:07 - 0:02:10 Text: last week in encoters, and then encoder decoters.
0:02:10 - 0:02:14 Text: And each of these three cases, we're going to talk a little bit about sort of what things
0:02:14 - 0:02:20 Text: you could be doing and then popular models that are in use across research and in industry.
0:02:20 - 0:02:23 Text: And we're going to talk a little bit about, you know, what do we think pre-training is
0:02:23 - 0:02:24 Text: teaching?
0:02:24 - 0:02:25 Text: This is going to be very brief.
0:02:25 - 0:02:29 Text: Actually, a lot of the interpretability and analysis lecture in two weeks is going
0:02:29 - 0:02:35 Text: to talk more about sort of the mystery and the scientific problem of figuring out what
0:02:35 - 0:02:39 Text: these models are learning about language through pre-training objectives, but we'll sort
0:02:39 - 0:02:40 Text: of get a peak.
0:02:40 - 0:02:43 Text: And then we'll talk about very large models and in context learning.
0:02:43 - 0:02:49 Text: So if you've heard of GPT-3, for example, we're going to just briefly touch on that here
0:02:49 - 0:02:53 Text: and I think we'll discuss more about it in the course later on as well.
0:02:53 - 0:02:55 Text: Okay, so we've got a lot to do.
0:02:55 - 0:02:57 Text: Let's jump right in.
0:02:57 - 0:02:59 Text: So word structure and sub-broad models.
0:02:59 - 0:03:04 Text: Let's think about sort of the assumptions we've been making in this course so far.
0:03:04 - 0:03:09 Text: When we give you an assignment, when we talk about training word to veck, for example,
0:03:09 - 0:03:11 Text: we made this assumption about a language's vocabulary.
0:03:11 - 0:03:15 Text: In particular, we've made this assumption that has a fixed vocabulary of something like
0:03:15 - 0:03:18 Text: tens of thousands, maybe a hundred thousand, I don't know, a number of...
0:03:18 - 0:03:23 Text: But some relatively large, it seems, number of words, and that seems sort of like pretty
0:03:23 - 0:03:27 Text: good so far, at least, and what we've done.
0:03:27 - 0:03:32 Text: And we build this vocabulary from the set that we train, say, word to veck on.
0:03:32 - 0:03:37 Text: And then here's a crucial thing, any novel word, any word that you did not see at training
0:03:37 - 0:03:42 Text: time, is sort of mapped to a single unctocon.
0:03:42 - 0:03:46 Text: There are other ways to handle this, but you sort of have to do something and a frequent
0:03:46 - 0:03:49 Text: method is to map them all to unct.
0:03:49 - 0:03:53 Text: So let's walk through what this sort of means in English.
0:03:53 - 0:03:55 Text: You learn embeddings, you map them, it all works.
0:03:55 - 0:04:02 Text: Then you have a variation on a word like tase, with a bunch of a's.
0:04:02 - 0:04:07 Text: And your model isn't smart enough to know that that sort of means like very tasty, maybe.
0:04:07 - 0:04:12 Text: And so it maps it to unct, because it's just a dictionary look-up mess.
0:04:12 - 0:04:18 Text: And then you have a typo like lern, and that maps to unct as well, potentially, if it
0:04:18 - 0:04:22 Text: wasn't in your training set, some people make typos, but not all of them will be seen
0:04:22 - 0:04:23 Text: at training time.
0:04:23 - 0:04:24 Text: And then you'll have novel items.
0:04:24 - 0:04:29 Text: So this could be the first time that you've ever seen US, the students, and 224N have
0:04:29 - 0:04:32 Text: seen the word transformer if I.
0:04:32 - 0:04:36 Text: But I get the feeling you sort of have a notion of what it's supposed to mean, like
0:04:36 - 0:04:41 Text: maybe add transformers to or turn into using transformers, or turn into a transformer
0:04:41 - 0:04:42 Text: or something like that.
0:04:42 - 0:04:46 Text: And this is also going to be mapped to unct, even though you've seen transformer and
0:04:46 - 0:04:48 Text: if I.
0:04:48 - 0:04:54 Text: And so somehow the conclusion we have to come to is that looking at words as just like
0:04:54 - 0:05:00 Text: the individual sequence of characters uniquely identifies that word, and that's sort of
0:05:00 - 0:05:04 Text: how we should parameterize things is just wrong.
0:05:04 - 0:05:09 Text: And so not only is this true in English, but in many languages, this finite vocabulary
0:05:09 - 0:05:10 Text: assumption makes even less sense.
0:05:10 - 0:05:16 Text: So already it doesn't make sense in English, but English is, it's not the worst for English.
0:05:16 - 0:05:22 Text: So morphology is the study of the structure of words.
0:05:22 - 0:05:28 Text: And English is known to have pretty simple morphology in kind of specific ways.
0:05:28 - 0:05:34 Text: And when languages have complex morphology, it means you have longer words, more complex
0:05:34 - 0:05:39 Text: words that get modified more, and each one of them occurs less frequently.
0:05:39 - 0:05:40 Text: That should sound like a problem, right?
0:05:40 - 0:05:44 Text: If a word occurs less frequently, it will be less likely to show up in your training
0:05:44 - 0:05:46 Text: set.
0:05:46 - 0:05:49 Text: And maybe it'll show up in your test set, never in your training set.
0:05:49 - 0:05:51 Text: Now it's mapped to unct, and you don't know what to do.
0:05:51 - 0:05:56 Text: So an example, Swahili verbs can have hundreds of conjugations.
0:05:56 - 0:06:03 Text: So each conjugation encodes important information about the sentence that in English might be
0:06:03 - 0:06:05 Text: represented through, say, more words.
0:06:05 - 0:06:11 Text: And Swahili it's mapped onto the verb as prefixes and suffixes, and the like, this is called
0:06:11 - 0:06:13 Text: inflectional morphology.
0:06:13 - 0:06:14 Text: And so you can have hundreds of conjugations.
0:06:14 - 0:06:20 Text: I've just sort of pasted this wick-shenary block just to give you a small sample of just
0:06:20 - 0:06:22 Text: the huge number of conjugations there are.
0:06:22 - 0:06:26 Text: And so trying to memorize independently a meaning of each one of these words is just not
0:06:26 - 0:06:32 Text: the right answer.
0:06:32 - 0:06:34 Text: So this is going to be a very brief overview.
0:06:34 - 0:06:43 Text: And so what we're going to do is take one, let's say, class of algorithms for sub-word modeling
0:06:43 - 0:06:50 Text: that have been kind of developed to try to take a middle ground between two options.
0:06:50 - 0:06:54 Text: One option is saying everything is just like individual words.
0:06:54 - 0:06:58 Text: Either I know the word and I saw it at training time, or I don't know the word, and it's like
0:06:58 - 0:06:59 Text: unct.
0:06:59 - 0:07:03 Text: And then sort of another extreme option is to say it's just characters.
0:07:03 - 0:07:08 Text: Right? So like I get a sequence of characters, and then my neural network on top of my sequence
0:07:08 - 0:07:13 Text: of characters has to learn everything, has to learn how to combine words and stuff.
0:07:13 - 0:07:18 Text: So sub-word models in general just means looking at the sort of internal structure of words
0:07:18 - 0:07:20 Text: somehow, looking below the word level.
0:07:20 - 0:07:25 Text: But this group of models is going to try to meet a middle ground.
0:07:25 - 0:07:28 Text: So byte parent coding.
0:07:28 - 0:07:33 Text: What we're going to do is we're going to learn a vocabulary from a training data set
0:07:33 - 0:07:34 Text: again.
0:07:34 - 0:07:35 Text: So now we have a training data set.
0:07:35 - 0:07:39 Text: Instead of just saying, oh, everything that was split by my heuristic word splitter,
0:07:39 - 0:07:45 Text: like spaces in English, for example, is going to be a word in my vocabulary, we're going
0:07:45 - 0:07:50 Text: to learn the vocabulary using a greedy algorithm in this case.
0:07:50 - 0:07:52 Text: So here's what we're going to do.
0:07:52 - 0:07:55 Text: We start with the vocabulary containing only characters.
0:07:55 - 0:07:56 Text: So that's our extreme, right?
0:07:56 - 0:08:02 Text: So at the very least, if you've seen all the characters, then you know that you can never
0:08:02 - 0:08:03 Text: have an unque, right?
0:08:03 - 0:08:07 Text: Because you see a word, you've never seen it before, you just split it into its characters,
0:08:07 - 0:08:11 Text: and then you try to see, you know, deal with it that way.
0:08:11 - 0:08:14 Text: And then also an end of word symbol.
0:08:14 - 0:08:15 Text: And then we'll iterate over this algorithm.
0:08:15 - 0:08:20 Text: We'll say, use the corpus of text, find common adjacent letters.
0:08:20 - 0:08:24 Text: So maybe A and B are very frequently adjacent.
0:08:24 - 0:08:30 Text: And the pair of them together as a single sub word into your vocabulary.
0:08:30 - 0:08:34 Text: Now replace instances of that character pair with a new sub word repeat until you're desired
0:08:34 - 0:08:35 Text: vocabulary size.
0:08:35 - 0:08:41 Text: So maybe you start with a small character vocabulary, and then you end up with that same small
0:08:41 - 0:08:47 Text: character vocabulary plus a bunch of sort of entire words or parts of words.
0:08:47 - 0:08:50 Text: So notice how Apple, an entire word, looks like Apple.
0:08:50 - 0:08:56 Text: But then app, maybe this is sort of the first part, the first sub word of application, or
0:08:56 - 0:08:57 Text: up.
0:08:57 - 0:08:59 Text: Yeah.
0:08:59 - 0:09:07 Text: And then Lee, I guess I should have not put the hash there, but you know, maybe you learned
0:09:07 - 0:09:10 Text: Lee as like the end of a word, for example.
0:09:10 - 0:09:16 Text: And so what you end up with is, you know, a vocabulary where common things you get to
0:09:16 - 0:09:20 Text: map to themselves and then rare sequences of characters.
0:09:20 - 0:09:23 Text: You kind of split as little as possible.
0:09:23 - 0:09:27 Text: And it doesn't always end up so nicely that you learn like morphologically relevant suffixes
0:09:27 - 0:09:29 Text: like Lee.
0:09:29 - 0:09:32 Text: But you can, you know, try to split things somewhat reasonably.
0:09:32 - 0:09:37 Text: And if you have enough data, the sub word vocabulary you learn tends to be okay.
0:09:37 - 0:09:40 Text: So this is originally used in machine translation.
0:09:40 - 0:09:45 Text: And now a similar method, word piece, which we won't go over in this lecture is used
0:09:45 - 0:09:46 Text: in pre-trained models.
0:09:46 - 0:09:48 Text: But you know, the idea is effectively the same.
0:09:48 - 0:09:50 Text: And you end up with vocabularies that look a lot like this.
0:09:50 - 0:09:58 Text: So if we go back to our, if we go back to our examples of where, you know, word level
0:09:58 - 0:10:04 Text: NLP was failing us, then you have hat mapping to hat.
0:10:04 - 0:10:05 Text: Okay, that's good.
0:10:05 - 0:10:09 Text: You have hat mapping to hat because that was a common enough sequence of characters that
0:10:09 - 0:10:12 Text: it was actually incorporated into our sub word vocabulary.
0:10:12 - 0:10:13 Text: Right?
0:10:13 - 0:10:15 Text: And then you have learned mapping to learn.
0:10:15 - 0:10:16 Text: So common words good.
0:10:16 - 0:10:20 Text: And that means that the model, the neural network that you're going to process this text
0:10:20 - 0:10:27 Text: with does not need to, say, combine the letters of learn and hat in order to try to like
0:10:27 - 0:10:31 Text: derive the meaning of these words from the letters, because you can imagine that might
0:10:31 - 0:10:32 Text: be difficult.
0:10:32 - 0:10:38 Text: But then when you get a word that you have not seen before, you are able to decompose
0:10:38 - 0:10:39 Text: it.
0:10:39 - 0:10:45 Text: And so if you've seen tasty with varying numbers of A's at, at training time, you know,
0:10:45 - 0:10:50 Text: maybe you actually get some of the same sub words or similar sub words that you're splitting
0:10:50 - 0:10:52 Text: it into at evaluation time.
0:10:52 - 0:10:56 Text: So we never saw tasty enough to like, you know, however many A's in order to add it into
0:10:56 - 0:10:58 Text: a sub word vocabulary.
0:10:58 - 0:11:01 Text: But we're still able to split it into things.
0:11:01 - 0:11:05 Text: And then the neural network that runs on top of these sub word embeddings could be able
0:11:05 - 0:11:10 Text: to sort of induce that, oh, yeah, this is one of those things where people like, you know,
0:11:10 - 0:11:15 Text: chain letters together, chain vowels together in English for emphasis.
0:11:15 - 0:11:18 Text: So misspellings still pretty much mess you up.
0:11:18 - 0:11:22 Text: So now learn with this misspelling might be mapped to two sub words.
0:11:22 - 0:11:27 Text: But if you saw misspellings like this frequently enough, maybe you could learn sort of to handle
0:11:27 - 0:11:28 Text: it.
0:11:28 - 0:11:31 Text: It still messes up the model though.
0:11:31 - 0:11:33 Text: And, but at the very least, it's not just an umk, right?
0:11:33 - 0:11:35 Text: It seems clearly better than that.
0:11:35 - 0:11:40 Text: And then transformer, if I, maybe in the best, this is sort of optimistic, but maybe
0:11:40 - 0:11:44 Text: in the best case, right, you were able to say, ah, yes, this is transformer.
0:11:44 - 0:11:51 Text: And if I, again, the sub words that you learn don't actually tend to be this well morphologically
0:11:51 - 0:11:52 Text: motivated, I think.
0:11:52 - 0:11:58 Text: So if I is like a clear, like suffix in English that has a very common and replicable meaning
0:11:58 - 0:12:02 Text: when you apply it to nouns, that's derivational morphology.
0:12:02 - 0:12:07 Text: But you know, you're able to sort of compose the word of the meaning of transformer if I
0:12:07 - 0:12:11 Text: possibly from its two sub word constituents.
0:12:11 - 0:12:15 Text: And so when we talk about words being input to transformer models, pre-trained transformer
0:12:15 - 0:12:20 Text: models, throughout the entirety of this lecture, we will be talking about sub words.
0:12:20 - 0:12:26 Text: So I might say word, and what I mean is, you know, possibly a full word, also possibly
0:12:26 - 0:12:27 Text: a sub word.
0:12:27 - 0:12:31 Text: Okay, so when we say a sequence of words, the transformer, the pre-trained transformer
0:12:31 - 0:12:37 Text: has no idea, sort of whether it's dealing with words or sub words, when it's doing itself
0:12:37 - 0:12:40 Text: attention operations.
0:12:40 - 0:12:41 Text: And so this can be a problem.
0:12:41 - 0:12:46 Text: You can imagine if you have really weird sequences of characters, you can actually have an individual
0:12:46 - 0:12:51 Text: single word mapped to as many sub words as it has characters.
0:12:51 - 0:12:55 Text: That can be a problem because suddenly, you know, you have a ten-word sentence, but one
0:12:55 - 0:12:58 Text: of the words is mapped to, you know, twenty sub words.
0:12:58 - 0:13:02 Text: Now you have a thirty-word sentence, where twenty of the thirty words are just one real
0:13:02 - 0:13:03 Text: word.
0:13:03 - 0:13:05 Text: So keep this in mind.
0:13:05 - 0:13:09 Text: But, you know, I think it's important for sort of this open vocabulary assumption, it's
0:13:09 - 0:13:15 Text: important in English, and it's even more important in many other languages.
0:13:15 - 0:13:18 Text: And the actual algorithm, and you can go into the actual algorithms that are done for
0:13:18 - 0:13:23 Text: this, byte per encoding is sort of my favorite for going over briefly, word piece you can
0:13:23 - 0:13:26 Text: also take a look at.
0:13:26 - 0:13:27 Text: Okay.
0:13:27 - 0:13:29 Text: Any questions on sub words?
0:13:29 - 0:13:34 Text: I guess John, let me look after what does the hashtag mean?
0:13:34 - 0:13:36 Text: Oh, great, great point.
0:13:36 - 0:13:40 Text: So this means that you should be combining this sub word, so this sub word is not the
0:13:40 - 0:13:41 Text: end of a word.
0:13:41 - 0:13:45 Text: TAA, hash hash, is sort of telling the model.
0:13:45 - 0:13:50 Text: So if I had TAA with no hashes, that's a separate sub word.
0:13:50 - 0:13:54 Text: That means there's an entire word that is ta, or at the very least it's not the end of
0:13:54 - 0:13:55 Text: the word.
0:13:55 - 0:13:56 Text: See how here?
0:13:56 - 0:13:57 Text: I don't have the hashes at the end.
0:13:57 - 0:14:00 Text: It's because this is indicating that this is at the end of the word.
0:14:00 - 0:14:04 Text: Different sub word schemes differ on whether you should put something at the beginning of
0:14:04 - 0:14:08 Text: the word, if it does begin a word, or if you should put something at the end of the
0:14:08 - 0:14:11 Text: word, if it doesn't end the word.
0:14:11 - 0:14:15 Text: So when the tokenizer is running over your data, so you've got something that's tokenizing
0:14:15 - 0:14:20 Text: this sentence in the worst case.
0:14:20 - 0:14:27 Text: In the worst case, it says, in, that's a whole word, give it just the word in, no hashes,
0:14:27 - 0:14:32 Text: that's a whole word, give it just the word the, no hashes, and then maybe over here at
0:14:32 - 0:14:34 Text: sub words.
0:14:34 - 0:14:39 Text: We've got this weird word sub words, and it splits it into sub and words.
0:14:39 - 0:14:45 Text: And so sub, it's going to give it the sub word with sub hash hash to indicate that it's
0:14:45 - 0:14:52 Text: part of this larger word, sub words, as opposed to the word sub, like submarine, which would
0:14:52 - 0:14:53 Text: be different.
0:14:53 - 0:15:02 Text: Yeah, that's a great question.
0:15:02 - 0:15:06 Text: Okay, great.
0:15:06 - 0:15:11 Text: So that was our note on sub word modeling, and you can, you know, sub words are important,
0:15:11 - 0:15:17 Text: for example, in, you know, a lot of translation applications, that's why we gave you sub words
0:15:17 - 0:15:19 Text: on the machine translation assignment.
0:15:19 - 0:15:22 Text: Now let's talk about model pre-training and word embeddings.
0:15:22 - 0:15:25 Text: So I love, I love being able to go to this slide.
0:15:25 - 0:15:29 Text: So, so we saw this quote at the beginning of the class, you shall know a word by the company
0:15:29 - 0:15:34 Text: it keeps, and this was sort of one of the things that we used to summarize distributional
0:15:34 - 0:15:35 Text: semantics.
0:15:35 - 0:15:39 Text: This idea that word to veck was sort of well motivated in some way, because the meaning
0:15:39 - 0:15:44 Text: of a word can be thought of as being derived from the kind of co-occurrent statistics of
0:15:44 - 0:15:52 Text: words that co-occur around it, and that was just fascinatingly effective, I think.
0:15:52 - 0:15:54 Text: But there's this other quote actually from the same person.
0:15:54 - 0:16:01 Text: So we have J.R. Firth, 1935, compared to our quote before from 1957, and the second
0:16:01 - 0:16:06 Text: quote says, the complete meaning of a word is always contextual, and no study of meaning
0:16:06 - 0:16:10 Text: apart from a complete context can be taken seriously.
0:16:10 - 0:16:15 Text: Now again, these are just things that we can sort of think about and chew on, but it
0:16:15 - 0:16:20 Text: comes to mind, right, when you, when you embed words with word to veck, one of the issues
0:16:20 - 0:16:26 Text: is that you don't actually look at its neighbors as you're giving it an embedding.
0:16:26 - 0:16:33 Text: So if I have the sentence I record the record, you know, the two instances of REC, ORD,
0:16:33 - 0:16:37 Text: mean different things, but they're given the same word to veck embedding, right, because
0:16:37 - 0:16:42 Text: in word to veck you take the string, you map it to, oh, I've seen the word record before,
0:16:42 - 0:16:47 Text: you get that sort of vector from your learned matrix, and you give it the same thing in both
0:16:47 - 0:16:49 Text: cases.
0:16:49 - 0:16:54 Text: And so what we're going to be doing today is actually not conceptually all that different
0:16:54 - 0:16:56 Text: from training word to veck.
0:16:56 - 0:17:01 Text: Word to veck training you can think of as pre-training just a very simple model that only assigns
0:17:01 - 0:17:07 Text: an individual vector to each unique word type, each unique element in your vocabulary.
0:17:07 - 0:17:12 Text: Today we'll be going a lot farther than that, but the idea is very similar.
0:17:12 - 0:17:17 Text: So back in, you know, 2017, we would start with pre-trained word embeddings, and again,
0:17:17 - 0:17:21 Text: remember no context there, so you give a word and embedding independent of the context
0:17:21 - 0:17:23 Text: that it shows up in.
0:17:23 - 0:17:25 Text: And then you learn how to incorporate the context.
0:17:25 - 0:17:28 Text: It's not like our NLP models never used context, right?
0:17:28 - 0:17:34 Text: Instead, you would learn to incorporate the context using your LSTM, or it's later in 2017,
0:17:34 - 0:17:37 Text: you know, your transformer.
0:17:37 - 0:17:41 Text: And you would learn to incorporate context while training on the task.
0:17:41 - 0:17:43 Text: So you have some supervision.
0:17:43 - 0:17:47 Text: Maybe it's machine translation supervision, maybe sentiment, maybe question answering.
0:17:47 - 0:17:53 Text: And you would learn how to incorporate context in your LSTM or otherwise through the signal
0:17:53 - 0:17:56 Text: of the training instead of say through the word to veck signal.
0:17:56 - 0:18:01 Text: And so, you know, sort of pictographically, you have these word embeddings here, so the
0:18:01 - 0:18:04 Text: red are sort of your word to veck embeddings, and those are pre-trained.
0:18:04 - 0:18:08 Text: Those take up some of the parameters of your network.
0:18:08 - 0:18:09 Text: And then you've got your contextualization.
0:18:09 - 0:18:12 Text: Now this looks like an LSTM, but it could be whatever.
0:18:12 - 0:18:16 Text: So this maybe bidirectional encoder thing here is not pre-trained.
0:18:16 - 0:18:19 Text: And now that's a lot of parameters that are not pre-trained.
0:18:19 - 0:18:24 Text: And then maybe you have some sort of readout function at the end, right, to predict whatever
0:18:24 - 0:18:25 Text: thing you're trying to predict.
0:18:25 - 0:18:30 Text: Again, maybe it's sentiment, maybe you're doing, I don't know, topic labeling, whatever
0:18:30 - 0:18:31 Text: you want to do.
0:18:31 - 0:18:32 Text: This is sort of the paradigm.
0:18:32 - 0:18:36 Text: Like you set some sort of architecture and you only pre-trained the word embeddings.
0:18:36 - 0:18:44 Text: And so this isn't actually the conceptually, necessarily the biggest problem, because,
0:18:44 - 0:18:49 Text: you know, we like to think in deep learning stuff that we have a lot of training data
0:18:49 - 0:18:50 Text: for our objectives.
0:18:50 - 0:18:56 Text: I mean, one of the things that we motivated, you know, big, deep neural networks for is
0:18:56 - 0:18:59 Text: that they can take a lot of data and they can learn patterns from it.
0:18:59 - 0:19:05 Text: But it does put the onus on our downstream data to be sort of sufficient to teach the
0:19:05 - 0:19:07 Text: contextual aspects of language.
0:19:07 - 0:19:13 Text: So you can imagine if you only have a little bit of, you know, labeled data for fine tuning,
0:19:13 - 0:19:17 Text: you're putting a pretty big role on that data to say, hey, maybe here's some pre-trained
0:19:17 - 0:19:21 Text: embeddings, but like how you handle like sentences and how they compose and all that stuff,
0:19:21 - 0:19:23 Text: that's up to you.
0:19:23 - 0:19:27 Text: So if you don't have a lot of labeled data for your downstream task, you're asking
0:19:27 - 0:19:32 Text: it to do a lot with, you know, a large number of parameters that have been initialized randomly.
0:19:32 - 0:19:38 Text: Okay, so like a small portion of the parameters have been pre-trained.
0:19:38 - 0:19:43 Text: Okay, so where we're going is pre-training whole models.
0:19:43 - 0:19:47 Text: I mean, conceptually, you know, we're pretty close to there.
0:19:47 - 0:19:53 Text: So nowadays, almost all parameters in your neural network and let's say a lot of research
0:19:53 - 0:19:58 Text: settings and increasingly in industry are initialized via pre-training, just like word
0:19:58 - 0:20:07 Text: to vac parameters were initialized and pre-training methods in general hide parts of the input
0:20:07 - 0:20:11 Text: from the model itself and then train the model to reconstruct those parts.
0:20:11 - 0:20:14 Text: How does this connect to word to vac?
0:20:14 - 0:20:19 Text: In word to vac, you know, people don't usually make this connection, but it's the following.
0:20:19 - 0:20:25 Text: You have an individual word and it knows itself, right, because you have the embedding for
0:20:25 - 0:20:27 Text: the center word, right, from assignment two.
0:20:27 - 0:20:31 Text: You have the embedding for the center word and knows itself and you've masked out all
0:20:31 - 0:20:33 Text: of its neighbors.
0:20:33 - 0:20:36 Text: You've hidden all of its neighbors from it, right, every all of its window neighbors, you've
0:20:36 - 0:20:37 Text: hidden from it.
0:20:37 - 0:20:41 Text: You ask the center word to predict its neighbors, right?
0:20:41 - 0:20:47 Text: And so this is, this falls under the category of pre-training.
0:20:47 - 0:20:48 Text: All of these methods look similar.
0:20:48 - 0:20:54 Text: You hide parts of the input from the model and train the model to reconstruct those parts.
0:20:54 - 0:20:58 Text: The differences with full model pre-training is that you don't give the model just the
0:20:58 - 0:21:01 Text: individual word and have it learn an embedding of that word.
0:21:01 - 0:21:06 Text: You give it much more of the sequence and have it predict, you know, held out parts of
0:21:06 - 0:21:07 Text: the sequence.
0:21:07 - 0:21:08 Text: And we'll get into the details there.
0:21:08 - 0:21:14 Text: But, you know, the takeaway is that everything here is pre-trained jointly, possibly with
0:21:14 - 0:21:18 Text: the exception of the very last layer that predicts the label.
0:21:18 - 0:21:26 Text: Okay, and this has just been exceptionally effective at building representations of language
0:21:26 - 0:21:31 Text: that just map similar things in language, similar representations in these encoders, just
0:21:31 - 0:21:35 Text: like how word-to-vec map similar words to similar vectors.
0:21:35 - 0:21:40 Text: It's been exceptionally effective at making parameter initializations where you start with
0:21:40 - 0:21:45 Text: these parameters that have been pre-trained and then you fine-tune them on your label data.
0:21:45 - 0:21:49 Text: And then third, they have an exceptionally effective at defining probability distributions
0:21:49 - 0:21:54 Text: over language, like in language modeling, that are actually really useful to sample from
0:21:54 - 0:21:56 Text: in certain cases.
0:21:56 - 0:21:59 Text: So these are three ways in which we interact with pre-trained models.
0:21:59 - 0:22:02 Text: We use their representations just to compute similarities.
0:22:02 - 0:22:04 Text: We use them for parameter initializations.
0:22:04 - 0:22:10 Text: And we actually just use them as probability distributions, sort of how we train to them.
0:22:10 - 0:22:12 Text: Okay.
0:22:12 - 0:22:16 Text: So let's get into some technical parts here.
0:22:16 - 0:22:21 Text: I sort of want to think broad thoughts about what we could do with pre-training and what
0:22:21 - 0:22:26 Text: kind of things we could expect to potentially learn from this general method of hide part
0:22:26 - 0:22:30 Text: of the input and then see other parts of the input and then try to predict the parts
0:22:30 - 0:22:31 Text: that you hid.
0:22:31 - 0:22:32 Text: Okay.
0:22:32 - 0:22:36 Text: So Stanford University is located in Blank California.
0:22:36 - 0:22:39 Text: If we gave a model, everything that was not blanked out here and asked to predict the
0:22:39 - 0:22:49 Text: middle, the loss function would train the model to predict Palo Alto here, I expect.
0:22:49 - 0:22:50 Text: Okay.
0:22:50 - 0:22:54 Text: So this is an instance of something that you could imagine being a pre-training objective.
0:22:54 - 0:22:59 Text: You take in a sentence, you remove part of it and you say recreate the part that I removed.
0:22:59 - 0:23:03 Text: And in this case, if I just gave a bunch of examples that looked like this, it might
0:23:03 - 0:23:07 Text: learn sort of trivia thing here.
0:23:07 - 0:23:08 Text: Okay.
0:23:08 - 0:23:09 Text: Here's another one.
0:23:09 - 0:23:12 Text: I put blank fork down on the table.
0:23:12 - 0:23:14 Text: This one is under specified.
0:23:14 - 0:23:24 Text: So this could be the fork, my fork, his fork, her fork, some fork, yeah, a fork.
0:23:24 - 0:23:29 Text: So this is, you know, specifying the kinds of syntactic categories of things that can
0:23:29 - 0:23:32 Text: sort of appear in this context.
0:23:32 - 0:23:37 Text: So this is another thing that you might be able to learn from such an objective.
0:23:37 - 0:23:42 Text: So you have the woman walked across the street, checking for traffic over blank shoulder.
0:23:42 - 0:23:44 Text: One of the things that could go over here is her.
0:23:44 - 0:23:47 Text: That's a co-reference statement.
0:23:47 - 0:23:53 Text: So you could learn sort of connections between entities in a text where one word woman can
0:23:53 - 0:24:01 Text: also co-refer to the same entity in the world as this word, this pronoun her.
0:24:01 - 0:24:06 Text: Here you could think about, you know, I went to the ocean to see the fish, turtles, seals,
0:24:06 - 0:24:07 Text: and blank.
0:24:07 - 0:24:10 Text: Here I don't think there's a single correct answer as to what we could see going into that
0:24:10 - 0:24:11 Text: blank.
0:24:11 - 0:24:15 Text: But a model could learn a distribution of the kinds of things that people might be talking
0:24:15 - 0:24:20 Text: about when they, one, go to the ocean and two, are excited to see marine life.
0:24:20 - 0:24:21 Text: Right?
0:24:21 - 0:24:25 Text: So this is sort of a semantic category, a lexical semantic category of things that might
0:24:25 - 0:24:32 Text: sort of be in the same set of interest as fish, turtles, and seals in the context of
0:24:32 - 0:24:34 Text: I went to the ocean.
0:24:34 - 0:24:36 Text: Okay?
0:24:36 - 0:24:40 Text: So, and, you know, man, I expect that there would be examples of this in a large corpus
0:24:40 - 0:24:41 Text: of text.
0:24:41 - 0:24:43 Text: Maybe it may be a book.
0:24:43 - 0:24:44 Text: Okay.
0:24:44 - 0:24:46 Text: Here's another example.
0:24:46 - 0:24:52 Text: Overall, the value I got from the two hours watching it was the sum total of the popcorn
0:24:52 - 0:24:53 Text: and the drink.
0:24:53 - 0:24:55 Text: The movie was blank.
0:24:55 - 0:24:56 Text: Right?
0:24:56 - 0:25:00 Text: And this is when I'd sort of like look out into the audience and say, was the movie better
0:25:00 - 0:25:02 Text: good, but the movie was bad.
0:25:02 - 0:25:04 Text: It's my prediction here.
0:25:04 - 0:25:06 Text: Right?
0:25:06 - 0:25:09 Text: And so this is teaching you something about sentiment, about how people express sentiment
0:25:09 - 0:25:11 Text: in language.
0:25:11 - 0:25:17 Text: And so this is, even, it looks like a task itself, like do sentiment analysis is sort
0:25:17 - 0:25:22 Text: of what you need to do in order to figure out whether the movie was bad or good, or maybe
0:25:22 - 0:25:24 Text: maybe the word is neither bad or good.
0:25:24 - 0:25:25 Text: The movie was over or something like that.
0:25:25 - 0:25:30 Text: But like, if you had to choose between is bad or good more likely, right?
0:25:30 - 0:25:33 Text: You sort of had to figure out the sentiment of the text.
0:25:33 - 0:25:36 Text: Now, that's really fascinating.
0:25:36 - 0:25:37 Text: Okay.
0:25:37 - 0:25:39 Text: Here's another one.
0:25:39 - 0:25:42 Text: Iro went into the kitchen to make some tea.
0:25:42 - 0:25:46 Text: Standing next to Iro, Zuko pondered his destiny.
0:25:46 - 0:25:48 Text: Zuko left the blank.
0:25:48 - 0:25:49 Text: Okay.
0:25:49 - 0:25:53 Text: So this is a little easy because we really only show one place.
0:25:53 - 0:25:56 Text: I guess we have another now in destiny.
0:25:56 - 0:26:00 Text: But this is sort of talking reasoning about spatial location and the movement of sort of
0:26:00 - 0:26:02 Text: agents in an imagined world.
0:26:02 - 0:26:06 Text: We could imagine text that has lines like this.
0:26:06 - 0:26:10 Text: Person went into the place and was next to so and so who left and did that and sort of
0:26:10 - 0:26:12 Text: you have these like relationships.
0:26:12 - 0:26:14 Text: So here, Zuko left the kitchen.
0:26:14 - 0:26:17 Text: It's the most likely thing that I think would go here.
0:26:17 - 0:26:24 Text: And it sort of indicates that in order for a model to learn to perform this fill in the
0:26:24 - 0:26:32 Text: missing part task, it might need to, in general, figure out sort of where things are and
0:26:32 - 0:26:36 Text: whether statements mean or imply that locality.
0:26:36 - 0:26:40 Text: So standing next to Iro went into the kitchen.
0:26:40 - 0:26:45 Text: Now Iro is in the kitchen and then standing next to Iro means Zuko is now in the kitchen.
0:26:45 - 0:26:48 Text: And then Zuko now leaves where?
0:26:48 - 0:26:50 Text: Well, he was in the kitchen before.
0:26:50 - 0:26:53 Text: So this is sort of a very basic sense of reasoning.
0:26:53 - 0:26:55 Text: Now this one.
0:26:55 - 0:26:56 Text: Here's a sentence.
0:26:56 - 0:27:03 Text: I was thinking about the sequence that goes 1, 1, 2, 3, 5, 8, 13, 21, blank.
0:27:03 - 0:27:05 Text: So I don't know.
0:27:05 - 0:27:07 Text: I can imagine people writing stuff.
0:27:07 - 0:27:09 Text: So this is the Fibonacci sequence.
0:27:09 - 0:27:12 Text: And sort of you know you use some of these two to get the next one, some of these two
0:27:12 - 0:27:14 Text: to get the next one, some of these two.
0:27:14 - 0:27:16 Text: And so you have this running sum.
0:27:16 - 0:27:17 Text: It's a famous sequence.
0:27:17 - 0:27:20 Text: It shows up in a lot of text on the internet.
0:27:20 - 0:27:26 Text: And in general you have to learn the algorithm or just the formula, I guess, that defines the
0:27:26 - 0:27:29 Text: Fibonacci sequence in order to keep going.
0:27:29 - 0:27:32 Text: Do models in this in practice?
0:27:32 - 0:27:33 Text: Wait and find out.
0:27:33 - 0:27:37 Text: But you would have to learn it in order to get the sequence to keep going and going and
0:27:37 - 0:27:39 Text: going.
0:27:39 - 0:27:46 Text: OK, so we're going to get into specific pre-trained models, specific methods of pre-training
0:27:46 - 0:27:47 Text: now.
0:27:47 - 0:27:56 Text: So I'm going to go over a brief review of transformer encoders, decoders, and encoder decoders.
0:27:56 - 0:27:58 Text: Because we're going to get into the sort of technical bits now.
0:27:58 - 0:28:01 Text: So before I do that, I'm going to pause.
0:28:01 - 0:28:02 Text: Are there any questions?
0:28:02 - 0:28:11 Text: Yeah, there's an interesting question asked about, do these co-opening our model on our
0:28:11 - 0:28:13 Text: input training data and the link to training?
0:28:13 - 0:28:17 Text: And we need to also add some questions in the light of the huge between models that we
0:28:17 - 0:28:19 Text: think nowadays.
0:28:19 - 0:28:25 Text: Sorry, the first part of that question, was it, are we overfitting our models to what?
0:28:25 - 0:28:29 Text: Yes, so the risk of almost getting our model on our input training data when they're
0:28:29 - 0:28:30 Text: doing training?
0:28:30 - 0:28:31 Text: Got it.
0:28:31 - 0:28:34 Text: Yeah, so that's a good point.
0:28:34 - 0:28:36 Text: So we're using very large models.
0:28:36 - 0:28:40 Text: And we might imagine that there's a risk of overfitting.
0:28:40 - 0:28:45 Text: And in practice, yeah, it's actually one of the more crucial things to do to make pre-training
0:28:45 - 0:28:46 Text: work.
0:28:46 - 0:28:51 Text: So that turns out that you need to have a lot, a lot of data, like a lot of data.
0:28:51 - 0:28:56 Text: And in fact, we'll show results later on where people built a pre-trained model, pre-trained
0:28:56 - 0:28:58 Text: it on a lot of data.
0:28:58 - 0:29:02 Text: And then like six months later, someone else came along and was like, hey, if you pre-trained
0:29:02 - 0:29:06 Text: it on 10 months later and changed almost nothing else, it would have gone even better.
0:29:06 - 0:29:07 Text: Now was it overfitting?
0:29:07 - 0:29:13 Text: I mean, you can sort of like hold out some text during pre-training, right, and sort
0:29:13 - 0:29:18 Text: of evaluate the perplexity, right, the language modeling performance on that held out text.
0:29:18 - 0:29:22 Text: And it tends to be the case that actually these models are underfitting, right, that we
0:29:22 - 0:29:28 Text: need even larger and larger models to express the complex interactions that allow us to
0:29:28 - 0:29:30 Text: fit these datasets better.
0:29:30 - 0:29:33 Text: And so we'll talk about that when we talk about BERT.
0:29:33 - 0:29:37 Text: And one of the really interesting results is that BERT is underfit, not overfit, but
0:29:37 - 0:29:42 Text: in principle, yes, it's a problem to, this potentially a problem to overfit.
0:29:42 - 0:29:46 Text: But we end up having a ton of text in English at least, although not in every language.
0:29:46 - 0:29:51 Text: And so, yeah, it's important to scale them, but currently our models don't seem overfit
0:29:51 - 0:29:53 Text: to the pre-training text.
0:29:53 - 0:29:54 Text: Okay.
0:29:54 - 0:30:02 Text: Any other questions?
0:30:02 - 0:30:05 Text: All right.
0:30:05 - 0:30:07 Text: So we saw this figure before, right here.
0:30:07 - 0:30:12 Text: We saw this figure of a transformer encoder to coder from this paper attention is all you
0:30:12 - 0:30:14 Text: need.
0:30:14 - 0:30:17 Text: And so we have a couple of things.
0:30:17 - 0:30:22 Text: We're not going to go over the form of attention again today because we have a lot to go over,
0:30:22 - 0:30:25 Text: but I'm happy to chat about it more on Ed.
0:30:25 - 0:30:28 Text: But so in our encoder, we have some input sequence.
0:30:28 - 0:30:31 Text: Remember, this is a sequence of sub words now.
0:30:31 - 0:30:34 Text: Each sub word gets a word embedding.
0:30:34 - 0:30:38 Text: And each index in the transformer gets a position embedding.
0:30:38 - 0:30:44 Text: Now remember that we have a finite length that our sequence can possibly be like 512.
0:30:44 - 0:30:45 Text: That's tokens.
0:30:45 - 0:30:47 Text: That was that capital T from last lecture.
0:30:47 - 0:30:48 Text: So you have some finite length.
0:30:48 - 0:30:55 Text: So you have one embedding of a position for every index for all 512 indices.
0:30:55 - 0:30:57 Text: And then you have all your word embeddings.
0:30:57 - 0:31:03 Text: And then the transformer encoder, right, was this combination of sort of sub-modules that
0:31:03 - 0:31:08 Text: we walked through line by line on Tuesday, right.
0:31:08 - 0:31:12 Text: Multi-headed attention was sort of the core building block.
0:31:12 - 0:31:16 Text: And then we had residual and layer norm, right, to help with passing gradients and to help
0:31:16 - 0:31:19 Text: make training go better and faster.
0:31:19 - 0:31:25 Text: We had that feed forward layer to, yeah, process sort of the result of the multi-headed
0:31:25 - 0:31:30 Text: attention, another residual and layer norm, and then pass to an identical transformer
0:31:30 - 0:31:31 Text: encoder block here.
0:31:31 - 0:31:32 Text: And these would be stacked.
0:31:32 - 0:31:38 Text: We'll see a number of different configurations here, but I think, you know, 612 of these
0:31:38 - 0:31:39 Text: sort of stacked together.
0:31:39 - 0:31:40 Text: Okay.
0:31:40 - 0:31:42 Text: So that's a transformer encoder.
0:31:42 - 0:31:47 Text: And we're actually going to see whole models today that are just transformer encoders.
0:31:47 - 0:31:48 Text: Okay.
0:31:48 - 0:31:52 Text: So when we talked about machine translation, when we talked about the transformer itself,
0:31:52 - 0:31:56 Text: the transformer encoder decoder, we talked about this whole thing.
0:31:56 - 0:31:59 Text: But you could actually just have this left column, and you could actually just have this
0:31:59 - 0:32:02 Text: right column as well.
0:32:02 - 0:32:04 Text: Although the right column changes a little bit if you just have it.
0:32:04 - 0:32:10 Text: So remember, the right column, we had this masked multi-head self-attention, right, so
0:32:10 - 0:32:14 Text: where you can't look at the future.
0:32:14 - 0:32:18 Text: And someone asked actually about how we decode from transformers, given that you have this
0:32:18 - 0:32:20 Text: sort of big chunking operation.
0:32:20 - 0:32:21 Text: It's a great question.
0:32:21 - 0:32:25 Text: I won't be able to get into it in detail today, but you have to run it once during the decoding
0:32:25 - 0:32:31 Text: process for every time that you decode to sort of predict the next word.
0:32:31 - 0:32:34 Text: I'll write out something on Ed for this.
0:32:34 - 0:32:38 Text: So in the masked multi-head self-attention, you're not allowed to look at the future so
0:32:38 - 0:32:44 Text: that you sort of have this well-defined objective of trying to do language modeling.
0:32:44 - 0:32:46 Text: Then we have residual and layer norm.
0:32:46 - 0:32:50 Text: The multi-head cross-attention, remember, goes back to the last layer of the transformer
0:32:50 - 0:32:53 Text: encoder, or the last transformer encoder block.
0:32:53 - 0:32:57 Text: And then more residual and layer norm, another feed-forward layer, more residual and layer
0:32:57 - 0:32:58 Text: norm.
0:32:58 - 0:33:04 Text: Now, if we don't have an encoder here, then we get rid of the cross-attention and residual
0:33:04 - 0:33:05 Text: and layer norm here.
0:33:05 - 0:33:09 Text: So if we didn't have this stack of encoders, the decoders get simpler because you don't
0:33:09 - 0:33:10 Text: have to attend to them.
0:33:10 - 0:33:14 Text: But then again, you also have these word embeddings at the bottom and position representations
0:33:14 - 0:33:17 Text: for the output sequence.
0:33:17 - 0:33:20 Text: Okay, so that's been review.
0:33:20 - 0:33:22 Text: Let's talk about pre-training through language modeling.
0:33:22 - 0:33:26 Text: So we've actually talked maybe a little bit about this before, and we've seen language
0:33:26 - 0:33:30 Text: modeling in the context of maybe just wanting to do it our priori.
0:33:30 - 0:33:35 Text: So language models were useful, for example, in automatic speech recognition systems.
0:33:35 - 0:33:38 Text: They were useful in statistical machine translation systems.
0:33:38 - 0:33:42 Text: So let's recall the language modeling task.
0:33:42 - 0:33:47 Text: You can say it's defined as modeling the probability of a word at a given index t, of any word
0:33:47 - 0:33:51 Text: at any given index, given all the words before it.
0:33:51 - 0:33:59 Text: And this probability distribution is a distribution of words given their past contexts.
0:33:59 - 0:34:05 Text: And so this is just saying, for any prefix here, IRO goes to make.
0:34:05 - 0:34:07 Text: I want a probability of whatever the next word should be.
0:34:07 - 0:34:14 Text: So the observed next word is tasty, but maybe there's goes to make t, goes to make hot
0:34:14 - 0:34:15 Text: water, etc.
0:34:15 - 0:34:19 Text: You can have a distribution over what the next word should be in this decoder.
0:34:19 - 0:34:24 Text: And remember that because of the masked self-attention, make can look back to the word
0:34:24 - 0:34:30 Text: two, or goes, or IRO, but it can't look forward to tasty.
0:34:30 - 0:34:31 Text: So there's a lot of data for this, right?
0:34:31 - 0:34:33 Text: You just have text.
0:34:33 - 0:34:36 Text: And like voila, you have language modeling data.
0:34:36 - 0:34:37 Text: It's free.
0:34:37 - 0:34:38 Text: No.
0:34:38 - 0:34:40 Text: Once you have the text, it's freely available.
0:34:40 - 0:34:42 Text: You don't need to label it.
0:34:42 - 0:34:44 Text: And in English, you have a lot of it, right?
0:34:44 - 0:34:50 Text: This is not true of every language by any means, but in English, you have a lot of pre-training
0:34:50 - 0:34:52 Text: data.
0:34:52 - 0:34:58 Text: And so the simple thing about pre-training is, well, what we're going to do is we're
0:34:58 - 0:35:01 Text: going to train a neural network to do language modeling on a large amount of text, and we'll
0:35:01 - 0:35:06 Text: just save the parameters of our train network to disk.
0:35:06 - 0:35:10 Text: So conceptually, it's not actually different from the things that we've done before.
0:35:10 - 0:35:12 Text: It's just sort of the intent, right?
0:35:12 - 0:35:17 Text: We're training these parameters to start using them for something else later down the line.
0:35:17 - 0:35:20 Text: But the language modeling itself doesn't change.
0:35:20 - 0:35:22 Text: The decoder here doesn't change, right?
0:35:22 - 0:35:27 Text: It's a transformer in tree-trained models in a modern, because this is sort of a newly
0:35:27 - 0:35:30 Text: popular concept.
0:35:30 - 0:35:36 Text: Although back in 2015 was sort of when this, I think, was first effectively tried out and
0:35:36 - 0:35:39 Text: got some interesting results.
0:35:39 - 0:35:42 Text: But this could be anything here.
0:35:42 - 0:35:47 Text: Today, it's most going to be transformers in the models that we actually observe.
0:35:47 - 0:35:49 Text: Okay.
0:35:49 - 0:35:52 Text: So once you have your pre-trained network, what's the sort of default thing you do to
0:35:52 - 0:35:54 Text: take to use it?
0:35:54 - 0:35:55 Text: Right?
0:35:55 - 0:35:58 Text: And if you take anything away from this lecture in terms of just like engineering practices
0:35:58 - 0:36:05 Text: that will be broadly useful to you as you go off and build things and study things, maybe
0:36:05 - 0:36:11 Text: as a machine learning engineer or a computational social scientist, et cetera, what people tend
0:36:11 - 0:36:16 Text: to do is you pre-traine your network on just a lot of data, lots of text, learn very
0:36:16 - 0:36:18 Text: general things.
0:36:18 - 0:36:22 Text: And then you adapt the network to whatever you wanted to do.
0:36:22 - 0:36:26 Text: So we had a bunch of pre-training data, and then maybe this is a movie review that
0:36:26 - 0:36:34 Text: we're taking as input here, and we just apply the decoder that we sort of pre-trained,
0:36:34 - 0:36:41 Text: start the parameters there, and then fine tune it on whatever we were sort of wanting
0:36:41 - 0:36:42 Text: to do.
0:36:42 - 0:36:43 Text: Maybe this is a sentiment analysis task.
0:36:43 - 0:36:48 Text: So we run the whole sequence through the decoder, get a hidden state at the end at the
0:36:48 - 0:36:53 Text: very last thing, and then we predict maybe plus or minus sentiment.
0:36:53 - 0:36:56 Text: And this is sort of adapting the pre-trained network to the task.
0:36:56 - 0:37:02 Text: Because pre-trained fine-tune paradigm is wildly successful, and you should really try
0:37:02 - 0:37:09 Text: it whenever you're doing any NLP task nowadays effectively.
0:37:09 - 0:37:14 Text: Because this tends to be what some variant of this tends to be what works best.
0:37:14 - 0:37:19 Text: Okay, so we've got a technical note now.
0:37:19 - 0:37:27 Text: So if you don't like to think about optimization or gradient descent, maybe take a pass on
0:37:27 - 0:37:34 Text: this slide, but I encourage you to just think for a second about why should this help?
0:37:34 - 0:37:40 Text: Training neural nets, we're using gradient descent to try to find some global minimum
0:37:40 - 0:37:43 Text: of this loss function.
0:37:43 - 0:37:47 Text: And we're sort of doing this in two steps.
0:37:47 - 0:37:55 Text: The first step is we get some parameters theta hat by approximating min over our, sorry,
0:37:55 - 0:37:57 Text: theta is the parameters of the neural network.
0:37:57 - 0:38:03 Text: So all of the KQV vectors in our transformer, the word embeddings, the position embeddings,
0:38:03 - 0:38:07 Text: it's just all of the parameters of our neural network.
0:38:07 - 0:38:11 Text: And so we're doing min over all the parameters of our theta, we're trying to approximate min
0:38:11 - 0:38:14 Text: over the parameters of our neural network of our pre-training loss, which here was language
0:38:14 - 0:38:18 Text: modeling of our parameters.
0:38:18 - 0:38:23 Text: And this is, we just get this sort of estimate of some parameters theta hat.
0:38:23 - 0:38:31 Text: And then we fine tune by approximating this min over theta of the fine tune loss, maybe
0:38:31 - 0:38:33 Text: that's sentiment, right?
0:38:33 - 0:38:34 Text: Starting at theta hat.
0:38:34 - 0:38:37 Text: So we initialize our gradient descent at theta hat, and then we just sort of let it do
0:38:37 - 0:38:38 Text: what it wants.
0:38:38 - 0:38:42 Text: And it's just like, it just works.
0:38:42 - 0:38:49 Text: And in part, it has to be because something about where we start is so important, not just
0:38:49 - 0:38:53 Text: in terms of sort of gradient flow, although that is a big part of it.
0:38:53 - 0:39:00 Text: But also, it seems like, you know, stochastic gradient descent sticks relatively close to
0:39:00 - 0:39:03 Text: that pre-training initialization during fine tuning.
0:39:03 - 0:39:09 Text: This is something that we seem to observe in practice, right, that somehow the locality
0:39:09 - 0:39:13 Text: of stochastic gradient descent, finding local minima that are close to this theta hat,
0:39:13 - 0:39:19 Text: that was good for such a general problem of language modeling, it seems like, yeah, the
0:39:19 - 0:39:23 Text: local minima of the fine tuning loss, because we don't find, or yeah, the local minima
0:39:23 - 0:39:28 Text: of the fine tuning loss tend to generalize well when they're near to this theta hat that
0:39:28 - 0:39:29 Text: we pre-trained.
0:39:29 - 0:39:32 Text: And this is sort of a mystery that we're still trying to figure out more about.
0:39:32 - 0:39:35 Text: And then also, yeah, maybe the gradients, right, the gradients of the fine tuning loss
0:39:35 - 0:39:40 Text: near theta propagate nicely, so our network training goes really well as well.
0:39:40 - 0:39:45 Text: Okay, so this is something to chew on, but in practice, it works.
0:39:45 - 0:39:49 Text: I think it's just still fascinating that it works.
0:39:49 - 0:39:59 Text: Okay, so we talked about mainly the transformer encoder to coder, and in fact, right, I said
0:39:59 - 0:40:04 Text: that we could have just sort of the left-hand side encoders, you know, that to be pre-trained
0:40:04 - 0:40:08 Text: or just decoders to be pre-trained or encoder decoders.
0:40:08 - 0:40:13 Text: And there are actually really popular sort of famous models in each of these three categories.
0:40:13 - 0:40:20 Text: The kinds of pre-training you can do, and the kinds of applications or uses of those
0:40:20 - 0:40:25 Text: pre-trained models that are most natural actually depend strongly on whether you choose
0:40:25 - 0:40:31 Text: to pre-traine and encoder a decoder or an encoder decoder.
0:40:31 - 0:40:36 Text: So I think it's useful as we go through some of these popular sort of model names that
0:40:36 - 0:40:41 Text: you need to know and what they sort of, what their innovations were to actually split
0:40:41 - 0:40:44 Text: it up into these categories.
0:40:44 - 0:40:46 Text: So we've all, so here's the thing.
0:40:46 - 0:40:51 Text: We're going to go through these three, and they all have sort of benefits and in some
0:40:51 - 0:40:52 Text: sense, drawbacks.
0:40:52 - 0:40:58 Text: So the decoders, right, really what we're talking about here mainly is language models,
0:40:58 - 0:41:03 Text: and we've seen this so far, we've talked about pre-trained decoders, and these are nice
0:41:03 - 0:41:04 Text: to generate from.
0:41:04 - 0:41:08 Text: So you can just sample from your pre-trained language model and get things that look
0:41:08 - 0:41:11 Text: like the text that you were pre-training on.
0:41:11 - 0:41:15 Text: But one problem is that you can't condition on future words, right?
0:41:15 - 0:41:21 Text: So we mentioned in our modeling with LSTMs that just like, instead, if you could, when
0:41:21 - 0:41:27 Text: you can do it, we said that having a bi-directional LSTM was actually just way more useful than
0:41:27 - 0:41:29 Text: having a one-directional LSTM.
0:41:29 - 0:41:31 Text: Well, it's sort of true for transformers as well.
0:41:31 - 0:41:37 Text: So if you can see how the arrows are pointing here, the arrows are pointing up into the,
0:41:37 - 0:41:38 Text: you know, to the right.
0:41:38 - 0:41:45 Text: So this word is sort of looking back at its past history, but, you know, this word can't
0:41:45 - 0:41:48 Text: see, can't contextualize with the future.
0:41:48 - 0:41:52 Text: Whereas in the encoder block here in blue, just below it, you sort of have all pairs of
0:41:52 - 0:41:54 Text: interactions.
0:41:54 - 0:41:56 Text: And so, you know, when you're building your representations, it can actually be super
0:41:56 - 0:41:58 Text: useful to know what the future words are.
0:41:58 - 0:42:00 Text: So that's what encoders get you, right?
0:42:00 - 0:42:02 Text: You get bi-directional context.
0:42:02 - 0:42:05 Text: So you can condition on the future, maybe that helps you build up better representations
0:42:05 - 0:42:06 Text: of language.
0:42:06 - 0:42:12 Text: But the question that we'll actually go through here is, well, how do you pre-train them?
0:42:12 - 0:42:15 Text: You can't pre-train them as language models because you have access to the future.
0:42:15 - 0:42:20 Text: So if you try to do that, the loss will just immediately be zero because you can just
0:42:20 - 0:42:21 Text: see what the future is.
0:42:21 - 0:42:22 Text: That's not useful.
0:42:22 - 0:42:28 Text: And then we'll talk about pre-trained encoder decoders, which like maybe the best of both
0:42:28 - 0:42:33 Text: worlds, but also maybe unclear what's the best way to pre-train them.
0:42:33 - 0:42:36 Text: They definitely have benefits for both.
0:42:36 - 0:42:43 Text: So let's get into the decoders first,
0:42:43 - 0:42:46 Text: and we'll go through all three.
0:42:46 - 0:42:47 Text: Okay.
0:42:47 - 0:42:54 Text: When we're pre-training a language model, right, we're pre-training it on this objective,
0:42:54 - 0:42:59 Text: we're trying to make it approximate this probability of a word given all of its previous
0:42:59 - 0:43:01 Text: words.
0:43:01 - 0:43:04 Text: What we end up doing, and I showed this sort of pictographically, but I'll add some math,
0:43:04 - 0:43:11 Text: right, we get a hidden state, h1 to ht for each of the words in the input w1 to wt.
0:43:11 - 0:43:14 Text: And remember, words again mean sub-words here.
0:43:14 - 0:43:15 Text: Okay.
0:43:15 - 0:43:20 Text: And we're fine-tuning this, right? We can take the representation, this should be h_t,
0:43:20 - 0:43:22 Text: and compute A h_t + b.
0:43:22 - 0:43:25 Text: And then the picture here is, right, here's h_t.
0:43:25 - 0:43:29 Text: It's the very last hidden state of the decoder.
0:43:29 - 0:43:36 Text: And now it's seen all of its history, right, and so you can apply
0:43:36 - 0:43:41 Text: a linear layer here, multiplying it by some parameters A and b that were not
0:43:41 - 0:43:46 Text: pre-trained, and then you're predicting sentiment, maybe, you know, plus or minus sentiment,
0:43:46 - 0:43:47 Text: perhaps.
0:43:47 - 0:43:51 Text: And so, you know, look at the red and the gray: most of the parameters of my neural
0:43:51 - 0:43:56 Text: network have now been pre-trained, while the very last layer, the one that's learning the sentiment
0:43:56 - 0:44:00 Text: decision, say, has not been pre-trained.
0:44:00 - 0:44:02 Text: So those have been randomly initialized.
0:44:02 - 0:44:06 Text: And when you take the sentiment loss, right, you train not just
0:44:06 - 0:44:11 Text: the linear layer here; you actually backpropagate the gradients all the way through
0:44:11 - 0:44:16 Text: the entire pre-trained network and fine-tune all of those parameters, right?
0:44:16 - 0:44:20 Text: So it's not like you're just training this linear layer at fine-tuning time; you're
0:44:20 - 0:44:25 Text: training the whole network as a function of this fine-tuning loss.
0:44:25 - 0:44:30 Text: And you know, maybe it's bad that the linear layer wasn't pre-trained, but
0:44:30 - 0:44:34 Text: in the grand scheme of things, it's not that many parameters.
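To make this concrete, here's a minimal PyTorch sketch of this first way of fine-tuning. The pretrained_decoder module, the dimensions, and the two-class sentiment setup are all illustrative assumptions, not the actual GPT code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderForClassification(nn.Module):
    """A pre-trained decoder plus a randomly initialized linear head,
    fine-tuned end to end (a sketch, not any particular released model)."""
    def __init__(self, pretrained_decoder, hidden_dim, num_classes=2):
        super().__init__()
        self.decoder = pretrained_decoder                # weights come from pre-training
        self.head = nn.Linear(hidden_dim, num_classes)   # A, b: randomly initialized

    def forward(self, input_ids):
        hidden = self.decoder(input_ids)   # (batch, seq_len, hidden_dim)
        h_T = hidden[:, -1, :]             # the very last hidden state h_T
        return self.head(h_T)              # logits = A h_T + b

# Fine-tuning: the sentiment loss backpropagates through the WHOLE network,
# not just the new linear layer.
# loss = F.cross_entropy(model(input_ids), sentiment_labels)
# loss.backward(); optimizer.step()
```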
0:44:34 - 0:44:38 Text: So that's just one way to interact with pre-trained models, right?
0:44:38 - 0:44:42 Text: And so what I want you to take away from this is that there was a contract that we had
0:44:42 - 0:44:44 Text: with the original model, right?
0:44:44 - 0:44:48 Text: The contract was that it was defining probability distributions.
0:44:48 - 0:44:52 Text: But when we're fine tuning, when we're interacting with the pre-trained model, what we also
0:44:52 - 0:44:55 Text: have are just like the trained weights and the network architecture.
0:44:55 - 0:44:58 Text: We don't need to use it as a language model, we don't need to use it as a probability
0:44:58 - 0:44:59 Text: distribution.
0:44:59 - 0:45:04 Text: When we're actually fine tuning it, we're really just using it for its initialization
0:45:04 - 0:45:08 Text: of its parameters and saying, oh, this is just a transformer decoder that was
0:45:08 - 0:45:14 Text: pre-trained, and it happens to be really great in that when you fine-tune it on some
0:45:14 - 0:45:17 Text: sentiment data, it does a really good job.
0:45:17 - 0:45:22 Text: Okay, but there's a second way to interact with pre-trained decoders, which is in some
0:45:22 - 0:45:24 Text: sense even more natural.
0:45:24 - 0:45:28 Text: It actually is closer to the contract that we started with.
0:45:28 - 0:45:32 Text: So we don't have to just ignore the fact that it was a probability distribution entirely,
0:45:32 - 0:45:35 Text: we can make use of it while still fine tuning it.
0:45:35 - 0:45:37 Text: So here's what we're going to do.
0:45:37 - 0:45:40 Text: So we can use them as a generator at fine tuning time.
0:45:40 - 0:45:47 Text: By generator, I mean, it's going to define this distribution of words given their context.
0:45:47 - 0:45:51 Text: And then we'll actually just fine tune that probability distribution.
0:45:51 - 0:45:58 Text: So in a task like some kind of turn-based dialogue, we might encode the dialogue history
0:45:58 - 0:46:01 Text: as your past context.
0:46:01 - 0:46:06 Text: So you have a dialogue history of some things that people are saying back and forth
0:46:06 - 0:46:10 Text: to each other, you encode it as words, and you try to predict the next words in the
0:46:10 - 0:46:11 Text: dialogue.
0:46:11 - 0:46:15 Text: Right, and maybe in your pre-training objective you looked at very general-purpose text
0:46:15 - 0:46:19 Text: from, I don't know, Wikipedia or books or something, and now you're fine-tuning it as a
0:46:19 - 0:46:25 Text: language model, but on this sort of domain-specific
0:46:25 - 0:46:30 Text: distribution of text, like dialogue, or maybe summarization, where you paste in the whole
0:46:30 - 0:46:37 Text: document, then a special word, and then the summary, and say: predict the summary.
0:46:37 - 0:46:43 Text: And so what this looks like is, again, at fine-tuning time here, you have your h_1 to
0:46:43 - 0:46:49 Text: h_t equal to the decoder applied to the words w_1 to w_t, and then you have this distribution that you're
0:46:49 - 0:46:55 Text: fine-tuning: w_t is sampled from softmax(A h_{t-1} + b).
0:46:55 - 0:47:01 Text: So now, every time, I'm predicting these words: from word 1 I predict word 2,
0:47:01 - 0:47:07 Text: from word 2 I predict word 3, etc., right. And unlike before,
0:47:07 - 0:47:12 Text: the last layer of the network has been pre-trained, but I'm still fine-tuning the whole thing.
0:47:12 - 0:47:17 Text: Right, so A and b here are mapping to sort of a probability distribution over my vocabulary,
0:47:17 - 0:47:23 Text: or the logits of a probability distribution, and I get to sort of tweak them
0:47:23 - 0:47:28 Text: now, in order to have the distribution that I'm going to use reflect the thing, like dialogue,
0:47:28 - 0:47:30 Text: that I want it to reflect.
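As a sketch of this second way, keeping the pre-trained language modeling head and fine-tuning the whole distribution on domain-specific text, the shapes and the model interface below are assumptions:

```python
import torch.nn.functional as F

def lm_finetuning_loss(model, input_ids):
    """Fine-tune a pre-trained decoder as a language model on, say,
    dialogue turns: predict token t from tokens < t, at every position.
    `model` is assumed to return logits over the vocabulary."""
    logits = model(input_ids[:, :-1])    # (batch, seq_len - 1, vocab)
    targets = input_ids[:, 1:]           # the inputs, shifted by one
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```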
0:47:30 - 0:47:36 Text: Okay, so those are two ways of interacting with a pre-trained decoder.
0:47:36 - 0:47:44 Text: Now here's an example of what ended up being the first in a line of wildly successful,
0:47:44 - 0:47:49 Text: or at least much talked about, pre-trained decoders.
0:47:49 - 0:47:57 Text: So the Generative Pretrained Transformer, or GPT, was a huge success in some sense, or at
0:47:57 - 0:48:04 Text: least it got a lot of buzz. So it's a transformer decoder, no encoder, with 12 layers. I'm giving
0:48:04 - 0:48:10 Text: you the details so you can start to get a feeling for how the size of things changes
0:48:10 - 0:48:15 Text: over the years, as we continue to progress here. Each of the hidden
0:48:15 - 0:48:20 Text: states had dimensionality 768, so if you remember back to last lecture, we had
0:48:20 - 0:48:26 Text: a term D, which was our dimensionality, so D is 768. And then an interesting statement
0:48:26 - 0:48:31 Text: that you should keep in mind for the engineering-minded folks is that the actual feed-forward
0:48:31 - 0:48:35 Text: layers, right, you've got a hidden layer in the feed-forward layer, and this was actually
0:48:35 - 0:48:41 Text: very large. So you had these sort of position-wise feed-forward layers, right, and the
0:48:41 - 0:48:47 Text: feed-forward layer would take the 768-dimensional vector, project it up to roughly 3,000-dimensional
0:48:47 - 0:48:52 Text: space (3,072, four times D) through a non-linearity, and then project it back to 768.
0:48:52 - 0:48:56 Text: This ends up being done because you can squash a lot more parameters in, for not too much
0:48:56 - 0:49:01 Text: more compute, this way. But that's curious.
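Here's roughly what that position-wise feed-forward layer looks like in PyTorch; the GELU non-linearity is an assumption (it's what the GPT family used), and the 3,072 inner dimension is 4 x 768:

```python
import torch.nn as nn

# Applied independently at every position in the sequence:
# project 768 -> 3,072, apply a non-linearity, project back to 768.
ffn = nn.Sequential(
    nn.Linear(768, 3072),
    nn.GELU(),
    nn.Linear(3072, 768),
)
# Most of a layer's parameters live in these two matrices:
# 768 * 3072 + 3072 * 768, roughly 4.7M weights per layer, plus biases.
```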
0:49:01 - 0:49:06 Text: Okay, and then byte-pair encoding. Actually, was this one byte-pair encoding?
0:49:06 - 0:49:10 Text: Well, it was a sub-word vocabulary with 40,000 merges. So that's not quite
0:49:10 - 0:49:15 Text: the size of the vocabulary, because you started with a bunch of characters, and I don't remember
0:49:15 - 0:49:19 Text: how many characters they started with, but it's a relatively small vocabulary, you can
0:49:19 - 0:49:21 Text: see, right?
0:49:21 - 0:49:27 Text: Compared to, say, if you tried to have every word get a unique representation. Now,
0:49:27 - 0:49:32 Text: it was trained on BooksCorpus; it's got over 7,000 unique books, and it contains
0:49:32 - 0:49:37 Text: long spans of contiguous text. So instead of, say, training it on individual
0:49:37 - 0:49:40 Text: sentences, just small short sentences, right?
0:49:40 - 0:49:46 Text: The model is able to learn long distance dependencies because you haven't split, like, a book
0:49:46 - 0:49:48 Text: into random sentences and shuffled them all around.
0:49:48 - 0:49:53 Text: You've sort of kept it contiguous, so we can have that sort of consistency.
0:49:53 - 0:49:58 Text: And then, a little treat here: GPT never showed up in the original paper, or
0:49:58 - 0:50:03 Text: the original blog post, as an acronym. And it could actually refer to, like,
0:50:03 - 0:50:07 Text: generative pre-training, sort of what the title of the paper would suggest, or
0:50:07 - 0:50:09 Text: generative pre-trained transformer.
0:50:09 - 0:50:13 Text: And I sort of decided to say generative pre-trained transformer, because the former seemed way
0:50:13 - 0:50:15 Text: too general.
0:50:15 - 0:50:17 Text: So: GPT.
0:50:17 - 0:50:22 Text: Okay, so they pre-trained this huge language model, this huge transformer
0:50:22 - 0:50:25 Text: decoder, on just those 7,000 books.
0:50:25 - 0:50:28 Text: And they fine-tuned it on a number of different tasks, and I want to talk a little bit about
0:50:28 - 0:50:31 Text: the details about how they fine-tuned it.
0:50:31 - 0:50:36 Text: And so they fine-tuned it on one particular task, or family of tasks, called natural language
0:50:36 - 0:50:38 Text: inference.
0:50:38 - 0:50:43 Text: So in natural language inference, we're labeling pairs of sentences as entailing, contradictory,
0:50:43 - 0:50:44 Text: or neutral.
0:50:44 - 0:50:50 Text: So you have a premise, and you hold the premise as sort of true, the man is in the doorway.
0:50:50 - 0:50:54 Text: And you have a hypothesis, the person is near the door.
0:50:54 - 0:50:59 Text: If this person is referring to that man, then, you know, it's sort of like, oh, yeah,
0:50:59 - 0:51:04 Text: this is entailed: because the man is a person, and
0:51:04 - 0:51:06 Text: if they're in the doorway, then they are near the door.
0:51:06 - 0:51:11 Text: So you have this sort of logical reasoning that you're doing, or you're supposed to be
0:51:11 - 0:51:14 Text: able to be doing, when you're labeling these sentences.
0:51:14 - 0:51:15 Text: So it's a labeled task.
0:51:15 - 0:51:21 Text: You've got sort of an input that's cut into two parts, and then one of three outputs.
0:51:21 - 0:51:25 Text: Okay, so the GPT paper evaluates on this task.
0:51:25 - 0:51:28 Text: But what they've got is a transformer decoder.
0:51:28 - 0:51:30 Text: So what do they do?
0:51:30 - 0:51:37 Text: This is sort of one of the earlier examples of, you know, instead of changing your
0:51:37 - 0:51:42 Text: neural network architecture to adapt to the kind of task you're doing, you're going to
0:51:42 - 0:51:49 Text: just format the task as, like, a bunch of tokens and not change your architecture.
0:51:49 - 0:51:53 Text: Because the pre-training was so useful, it's probably better to keep the architecture
0:51:53 - 0:51:59 Text: fixed, pre-train it, and then change the task specification to sort of fit the pre-trained
0:51:59 - 0:52:00 Text: architecture.
0:52:00 - 0:52:05 Text: So what they did, right: they put this start token, a special token, then "the man is
0:52:05 - 0:52:09 Text: in the doorway", then some delimiter token, right?
0:52:09 - 0:52:15 Text: So this is just a linear sequence of tokens that we're giving as one big prefix to GPT.
0:52:15 - 0:52:21 Text: And then "the person is near the door", and then an extra token at the end, right: extract.
0:52:21 - 0:52:25 Text: And then, you know, the linear classifier that we talked about as the first
0:52:25 - 0:52:32 Text: way to interact with decoder models is applied to the representation of the
0:52:32 - 0:52:34 Text: extract token, right?
0:52:34 - 0:52:39 Text: So you take the last hidden state on top of extract, and then you fine-tune the whole
0:52:39 - 0:52:41 Text: network to predict these labels, right?
0:52:41 - 0:52:48 Text: And so this sort of input formatting is increasingly used to keep the model architecture
0:52:48 - 0:52:53 Text: the same and allow a variety of different problems to be solved with it.
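A tiny sketch of that input formatting follows; the token names are placeholders for whatever special tokens the vocabulary actually uses:

```python
def format_nli_input(premise_ids, hypothesis_ids, START, DELIM, EXTRACT):
    """Turn an NLI pair into one linear token sequence:
    [START] premise [DELIM] hypothesis [EXTRACT]."""
    return [START] + premise_ids + [DELIM] + hypothesis_ids + [EXTRACT]

# The classifier then reads the hidden state above the final EXTRACT token:
# hidden = decoder(tokens)                # (1, seq_len, d)
# logits = linear_head(hidden[:, -1, :])  # 3-way: entail / contradict / neutral
```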
0:52:53 - 0:52:55 Text: Okay, and so did it work?
0:52:55 - 0:52:58 Text: On natural language inference, the answer is yes.
0:52:58 - 0:53:00 Text: So there's a number of different numbers here.
0:53:00 - 0:53:01 Text: I wouldn't worry too much about it.
0:53:01 - 0:53:06 Text: The fine tune transformer language model is sort of what you should pay attention to.
0:53:06 - 0:53:09 Text: There's a lot of effort that went into the other models, right.
0:53:09 - 0:53:11 Text: And so this is the story of pre-training.
0:53:11 - 0:53:15 Text: People put a lot of effort into models that do various sort of careful things.
0:53:15 - 0:53:20 Text: And then you take a single transformer and you say, I'm going to pre-train it on a
0:53:20 - 0:53:24 Text: ton of text and not worry too much about anything else, and just fine-tune it, and you end up
0:53:24 - 0:53:28 Text: doing super, super well.
0:53:28 - 0:53:32 Text: Sometimes not too much better in the GPT case than sort of the best known state of the
0:53:32 - 0:53:36 Text: art methods, but usually a little bit better.
0:53:36 - 0:53:39 Text: And again, the amount of task-specific effort that you have to put
0:53:39 - 0:53:41 Text: into it is very low.
0:53:41 - 0:53:46 Text: Okay, and so what about the other way of interacting with decoders, right?
0:53:46 - 0:53:49 Text: So we said that we can interact with decoders just by sampling from them, just
0:53:49 - 0:53:52 Text: by saying, well, they are probability distributions.
0:53:52 - 0:53:55 Text: So we can use them in their capacities as language models.
0:53:55 - 0:54:02 Text: And so GPT-2, this is really just a bigger GPT, don't worry too much about it, with
0:54:02 - 0:54:05 Text: larger hidden states and more layers.
0:54:05 - 0:54:10 Text: It was trained on more data, and it was shown to produce sort of relatively convincing
0:54:10 - 0:54:11 Text: samples of natural language.
0:54:11 - 0:54:14 Text: So this is something that went around Twitter a lot, right.
0:54:14 - 0:54:20 Text: So you have this sort of contrived example that probably didn't show up in the training
0:54:20 - 0:54:24 Text: data that has a scientist discovering a herd of unicorns.
0:54:24 - 0:54:31 Text: And then they sample from, well, almost the distribution of the model.
0:54:31 - 0:54:36 Text: They sort of give the model some extra credit here.
0:54:36 - 0:54:42 Text: They do something called truncating the distribution of the language model, to sort of cut out noise
0:54:42 - 0:54:44 Text: from GPT-2.
0:54:44 - 0:54:52 Text: So it's not exactly a pure sample, but more or less GPT-2 generated this.
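That truncation is commonly done with something like top-k sampling; here's a sketch, with an illustrative k, not necessarily the exact procedure used for these samples:

```python
import torch

def sample_top_k(logits, k=40):
    """Truncated sampling: keep only the k most likely tokens,
    renormalize among them, and sample. This cuts off the noisy tail
    of the distribution."""
    top_values, top_indices = torch.topk(logits, k)   # logits: (vocab_size,)
    probs = torch.softmax(top_values, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return top_indices[choice]                        # sampled token id
```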
0:54:52 - 0:54:56 Text: And so you have the scientist discovering unicorns, and then, you know, you have this
0:54:56 - 0:55:00 Text: consistency, okay, there's the scientist.
0:55:00 - 0:55:03 Text: You know, you have them giving the scientist a name.
0:55:03 - 0:55:11 Text: And you refer back to the scientist's name.
0:55:11 - 0:55:13 Text: You sort of have these topic-consistency things.
0:55:13 - 0:55:15 Text: Also the syntax is really good.
0:55:15 - 0:55:18 Text: It looks, you know, vaguely like English.
0:55:18 - 0:55:20 Text: And so this is sort of continued to be a trend.
0:55:20 - 0:55:23 Text: As we get larger and larger language models, we actually sample from them, even when we
0:55:23 - 0:55:29 Text: give them prompts that look sort of odd, and they seem to be increasingly convincing.
0:55:29 - 0:55:31 Text: Okay.
0:55:31 - 0:55:36 Text: So pre-training encoders, okay.
0:55:36 - 0:55:37 Text: Pre-training encoders.
0:55:37 - 0:55:42 Text: So let's take another second because I need some more water here.
0:55:42 - 0:55:44 Text: If there's another question, let me know.
0:55:44 - 0:55:53 Text: All right.
0:55:53 - 0:55:59 Text: So the benefit of encoders that we talked about was that they get this bidirectional context.
0:55:59 - 0:56:05 Text: So you can, while you're building representations of your sentence, of your parts of sentences,
0:56:05 - 0:56:08 Text: you can look to the future and that can help you build a better representation of the word
0:56:08 - 0:56:10 Text: that you're looking at right now.
0:56:10 - 0:56:13 Text: But the big problem is that we can't do language modeling now.
0:56:13 - 0:56:17 Text: So far we've pretty much relied on this task that we already knew about,
0:56:17 - 0:56:19 Text: language modeling, to do our pre-training.
0:56:19 - 0:56:21 Text: But now we want to pre-train encoders.
0:56:21 - 0:56:23 Text: And so we can't use it.
0:56:23 - 0:56:27 Text: So what are we going to do?
0:56:27 - 0:56:32 Text: Here's the solution that was come up with in the paper that introduced
0:56:32 - 0:56:34 Text: the model called BERT.
0:56:34 - 0:56:37 Text: It's called masked language modeling.
0:56:37 - 0:56:39 Text: So here's the idea.
0:56:39 - 0:56:43 Text: We get the sentence and then we just take a fraction of the words and we replace them
0:56:43 - 0:56:46 Text: with a sort of a mask token.
0:56:46 - 0:56:50 Text: A token that's, that means you don't know what this is right now.
0:56:50 - 0:56:53 Text: And then you predict these words.
0:56:53 - 0:56:54 Text: Some details we'll get into in the next slide.
0:56:54 - 0:56:56 Text: But so here's what it looks like.
0:56:56 - 0:57:01 Text: We have the sentence: I [MASK] to the [MASK].
0:57:01 - 0:57:03 Text: We get some hidden states for all of them, right?
0:57:03 - 0:57:08 Text: So we haven't changed the transformer encoder at all.
0:57:08 - 0:57:10 Text: We've just said, okay, here's like this sequence.
0:57:10 - 0:57:12 Text: You get to see everything, right?
0:57:12 - 0:57:13 Text: Look at all the arrows going everywhere.
0:57:13 - 0:57:19 Text: But then, right, we have this prediction layer that we're pre-training,
0:57:19 - 0:57:20 Text: right?
0:57:20 - 0:57:21 Text: And we're using it.
0:57:21 - 0:57:26 Text: We only have loss on the positions where we had masks here.
0:57:26 - 0:57:31 Text: So this was masked, and then I have to predict that it was "went" that goes here, and "store"
0:57:31 - 0:57:32 Text: that goes here.
0:57:32 - 0:57:36 Text: And now this is a lot like language modeling you might say.
0:57:36 - 0:57:39 Text: But now you don't need to have this sort of left to right decomposition.
0:57:39 - 0:57:43 Text: You're saying, I'm going to remove some of the words and you have to predict what they
0:57:43 - 0:57:44 Text: are.
0:57:44 - 0:57:46 Text: This is called masked language modeling.
0:57:46 - 0:57:49 Text: And it's been very, very, very effective with a quick caveat.
0:57:49 - 0:57:51 Text: It gets a little more complicated.
0:57:51 - 0:57:54 Text: So what did they actually do?
0:57:54 - 0:57:56 Text: They proposed masked language modeling.
0:57:56 - 0:57:59 Text: And they released the weights of this pre-trained transformer.
0:57:59 - 0:58:03 Text: So there's a little bit more complexity to getting masked language modeling to work.
0:58:03 - 0:58:09 Text: So you are going to take a random 15% of the sub word tokens.
0:58:09 - 0:58:10 Text: That part was true.
0:58:10 - 0:58:14 Text: But you're not always going to replace them with mask.
0:58:14 - 0:58:19 Text: You can think of it like, if the model sees a mask token, it gets a guarantee that it
0:58:19 - 0:58:21 Text: needs to predict something.
0:58:21 - 0:58:26 Text: And if the model doesn't see a mask token, it gets a guarantee that it doesn't need to
0:58:26 - 0:58:27 Text: predict anything.
0:58:27 - 0:58:33 Text: So why should it bother building strong representations of the words that aren't masked?
0:58:33 - 0:58:36 Text: But I want my model to build strong representations of everything.
0:58:36 - 0:58:38 Text: So we're going to add some sort of uncertainty to the model.
0:58:38 - 0:58:43 Text: So what we're going to do is, for those 15% of tokens, 80% of the time, we're going
0:58:43 - 0:58:44 Text: to replace it with a mask.
0:58:44 - 0:58:48 Text: That was our original idea of mask language modeling.
0:58:48 - 0:58:52 Text: Then 10% of the time, we're actually going to replace the word with just a random token.
0:58:52 - 0:58:56 Text: Just a random vocabulary item can be anything.
0:58:56 - 0:58:59 Text: And then the other 10% of the time, we're going to leave the word unchanged.
0:58:59 - 0:59:03 Text: So now, when it sees a word,
0:59:03 - 0:59:06 Text: it could be a random token, or it could be unchanged.
0:59:06 - 0:59:10 Text: And if it sees a mask, it knows it needs to predict it.
0:59:10 - 0:59:15 Text: So what these two things do here is say: you have to
0:59:15 - 0:59:18 Text: be on your toes for every word in your representation.
0:59:18 - 0:59:22 Text: So here: I pizza to the [MASK].
0:59:22 - 0:59:27 Text: And it turns out, and the model didn't know this, that it's getting three loss terms for
0:59:27 - 0:59:28 Text: this sentence.
0:59:28 - 0:59:32 Text: It only has one mask, but it's going to be penalized for predicting three different things.
0:59:32 - 0:59:35 Text: And it needs to predict that this word "pizza" is actually "went".
0:59:35 - 0:59:37 Text: So I replaced this one.
0:59:37 - 0:59:41 Text: It needs to predict that this word "to" is in fact the word "to".
0:59:41 - 0:59:46 Text: And then it needs to predict that this masked word is in fact "store".
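Here's a sketch of that masking scheme as data pre-processing; the exact token handling and vocabulary interface are simplified assumptions:

```python
import random

def bert_style_mask(tokens, vocab, mask_token="[MASK]", p=0.15):
    """Pick ~15% of positions for prediction. Of those: 80% become
    [MASK], 10% become a random token, 10% are left unchanged.
    The loss is computed only at the picked positions."""
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, token in enumerate(tokens):
        if random.random() < p:
            targets[i] = token                      # always predict the original
            r = random.random()
            if r < 0.8:
                inputs[i] = mask_token              # 80%: mask it
            elif r < 0.9:
                inputs[i] = random.choice(vocab)    # 10%: random replacement
            # else: 10%: leave the word unchanged
    return inputs, targets
```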
0:59:46 - 0:59:49 Text: Now as a short interlude, you might be thinking: John, there's no way
0:59:49 - 0:59:52 Text: the model could know this.
0:59:52 - 0:59:54 Text: It's so under-specified.
0:59:54 - 0:59:56 Text: "I pizza" is a little weird, I admit.
0:59:56 - 0:59:58 Text: But there's just no way to know that this is "store", or that this was "went".
0:59:58 - 1:00:01 Text: I mean, the same thing is true of language modeling.
1:00:01 - 1:00:05 Text: So it's going to end up learning these average statistics about what things tend to be in
1:00:05 - 1:00:06 Text: the given context.
1:00:06 - 1:00:10 Text: And it's going to sort of hedge its bets and try to build a distribution of what things
1:00:10 - 1:00:12 Text: could appear there.
1:00:12 - 1:00:14 Text: So for the people who were thinking that: good. And if you weren't, that's what you should have been
1:00:14 - 1:00:15 Text: thinking.
1:00:15 - 1:00:18 Text: It has to sort of know what kinds of things will end up in these slots.
1:00:18 - 1:00:23 Text: It has other uncertainty too, because it can't be sure that any of the other words are necessarily
1:00:23 - 1:00:25 Text: right.
1:00:25 - 1:00:30 Text: And then it's predicting these three words.
1:00:30 - 1:00:36 Text: And so you can see why it's important to not just have masks, but to have these
1:00:36 - 1:00:41 Text: sort of token-randomization things, because again, we don't actually care about its ability
1:00:41 - 1:00:43 Text: to predict the masks.
1:00:43 - 1:00:48 Text: Usually, I'm not going to actually sample from the model's distribution
1:00:48 - 1:00:50 Text: over what should go here.
1:00:50 - 1:00:56 Text: Instead, I am going to use the parameters of the neural network and expect that it built
1:00:56 - 1:00:58 Text: strong representations of language.
1:00:58 - 1:01:02 Text: So I don't want it to think it's got a free pass for representing something if it doesn't
1:01:02 - 1:01:06 Text: have a mask there.
1:01:06 - 1:01:14 Text: So there was one extra thing with the BERT pre-training, which is a next sentence prediction
1:01:14 - 1:01:15 Text: objective.
1:01:15 - 1:01:17 Text: So the input to BERT looks like this.
1:01:17 - 1:01:19 Text: This is straight from the BERT paper.
1:01:19 - 1:01:24 Text: You have a token here, before your first sentence, where the label gets predicted, then a separator token, and then a second
1:01:24 - 1:01:25 Text: sentence.
1:01:25 - 1:01:29 Text: So you had always two contiguous chunks of text.
1:01:29 - 1:01:31 Text: You had a first chunk of text here.
1:01:31 - 1:01:33 Text: My dog is cute.
1:01:33 - 1:01:35 Text: And then a second chunk of text, he likes playing.
1:01:35 - 1:01:38 Text: You can see the sub words there.
1:01:38 - 1:01:42 Text: And these would actually both be much longer.
1:01:42 - 1:01:47 Text: So this whole thing would be 512 words: the first chunk would be about half, the second would be
1:01:47 - 1:01:51 Text: about half, and they'd be contiguous chunks of text.
1:01:51 - 1:01:53 Text: But here was the deal.
1:01:53 - 1:01:57 Text: What they wanted to do was to try to teach the system to understand sort of
1:01:57 - 1:02:01 Text: relationships between different whole pieces of text.
1:02:01 - 1:02:06 Text: In order to better pre-trained for downstream applications like question answering, where
1:02:06 - 1:02:11 Text: you have two pretty different pieces of text, and you need to know how they relate to
1:02:11 - 1:02:12 Text: each other.
1:02:12 - 1:02:18 Text: So the objective they came up with was you should sometimes have the second chunk of text
1:02:18 - 1:02:26 Text: be the actual chunk of text that directly follows the first in your data set, and sometimes
1:02:26 - 1:02:32 Text: have the second chunk of text be randomly sampled from somewhere else, so unrelated.
1:02:32 - 1:02:37 Text: And the model should predict whether it's the first case or the second.
1:02:37 - 1:02:41 Text: In order, again, to sort of have to reason about the relationships between the two chunks
1:02:41 - 1:02:42 Text: of text.
1:02:42 - 1:02:44 Text: So this is next sentence prediction.
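Constructing those training pairs is simple; a sketch, with corpus standing in for a pool of unrelated chunks to sample from:

```python
import random

def make_nsp_example(first_chunk, true_next_chunk, corpus):
    """Next sentence prediction data: half the time the second chunk
    really follows the first (label 1), half the time it's sampled
    from somewhere else entirely (label 0)."""
    if random.random() < 0.5:
        return first_chunk, true_next_chunk, 1
    return first_chunk, random.choice(corpus), 0
```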
1:02:44 - 1:02:48 Text: I think it's important to think about because it's a very different idea of pre-training
1:02:48 - 1:02:53 Text: objective than language modeling and masked language modeling.
1:02:53 - 1:02:58 Text: Even though later work sort of argued that, in the case of BERT, it's not necessary or
1:02:58 - 1:02:59 Text: useful.
1:02:59 - 1:03:06 Text: And one of the arguments is actually that it's way better to have a single
1:03:06 - 1:03:12 Text: context that's twice as long, so you can learn even longer-distance dependencies and things.
1:03:12 - 1:03:15 Text: And so whether the objective itself would be useful if you could always just double
1:03:15 - 1:03:18 Text: the context size, I'm not sure if anyone's done research on that.
1:03:18 - 1:03:22 Text: But again, it's like a different kind of objective, and it still noises something about
1:03:22 - 1:03:23 Text: the input, right?
1:03:23 - 1:03:28 Text: The input was this big chunk of text, and you've noised it so that now you don't know
1:03:28 - 1:03:32 Text: whether it really was that, or whether you sort of replaced it with a bunch of garbage:
1:03:32 - 1:03:39 Text: this sort of second portion here, whether the second portion has been replaced with something
1:03:39 - 1:03:44 Text: that didn't actually come from the same sequence.
1:03:44 - 1:03:49 Text: Okay, so let's talk some details about BERT.
1:03:49 - 1:03:53 Text: So BERT had 12 or 24 layers, depending on BERT base or BERT large.
1:03:53 - 1:03:57 Text: You'll probably use one of these models or one of the sort of descendants of these models
1:03:57 - 1:04:02 Text: if you choose to do something with the custom final project potentially, or if you choose
1:04:02 - 1:04:06 Text: the version of the default final project.
1:04:06 - 1:04:11 Text: And you had 768- or 1,024-dimensional hidden states, and a bunch of attention heads, so this
1:04:11 - 1:04:14 Text: is that multi-headed attention, remember, a bunch of them.
1:04:14 - 1:04:19 Text: So you're splitting all your dimensions into those 16 heads, and we're talking on the
1:04:19 - 1:04:23 Text: order of a couple hundred million parameters.
1:04:23 - 1:04:28 Text: At the time, right, in 2018, we were like, whoa,
1:04:28 - 1:04:32 Text: that's a lot of parameters.
1:04:32 - 1:04:35 Text: And now, models are way, way, way, way bigger.
1:04:35 - 1:04:39 Text: So let's keep track of sort of the model sizes as we're going through this.
1:04:39 - 1:04:42 Text: And let's come back now to the corpus sizes as well.
1:04:42 - 1:04:43 Text: So we have books corpus.
1:04:43 - 1:04:45 Text: And this is the number of words there.
1:04:45 - 1:04:50 Text: This is the same thing that GPT-1 was trained on, 800 million words.
1:04:50 - 1:04:56 Text: Now we're also going to train on English Wikipedia: that's 2,500 million,
1:04:56 - 1:04:59 Text: so 2.5 billion words.
1:04:59 - 1:05:06 Text: And again, to give you an idea of what is done in practice, right, pre-training is expensive
1:05:06 - 1:05:11 Text: and impractical for most users, let's say.
1:05:11 - 1:05:16 Text: So if you are a researcher with a GPU or five GPUs or something like that, you tend to
1:05:16 - 1:05:20 Text: not really be pre-training your whole own BERT model unless you're willing to spend
1:05:20 - 1:05:22 Text: a long time doing it.
1:05:22 - 1:05:25 Text: BERT itself was pre-trained with 64 TPU chips.
1:05:25 - 1:05:31 Text: A TPU is a special kind of hardware accelerator, developed by Google, that effectively accelerates
1:05:31 - 1:05:35 Text: tensor operations.
1:05:35 - 1:05:40 Text: So TPUs are just fast and can hold a lot.
1:05:40 - 1:05:42 Text: And they had those 64 chips for four days.
1:05:42 - 1:05:46 Text: So if you have one GPU, which you can think of as less than a single TPU, you're going
1:05:46 - 1:05:48 Text: to be waiting a long time to pre-train.
1:05:48 - 1:05:54 Text: But fine-tuning is fast and practical; it's common on a single
1:05:54 - 1:06:00 Text: GPU. You'll see how much faster fine-tuning is than pre-training in assignment five.
1:06:00 - 1:06:06 Text: And so this becomes, I think, a refrain of the field: you pre-train once, or a handful
1:06:06 - 1:06:11 Text: of times, right, like a couple of groups release big pre-trained models, and then you fine-tune
1:06:11 - 1:06:15 Text: many times, right? So you save those parameters from pre-training and you fine-tune on all
1:06:15 - 1:06:20 Text: kinds of different problems.
1:06:20 - 1:06:25 Text: And that paradigm, right, taking something like BERT, or whatever the best descendant of
1:06:25 - 1:06:31 Text: BERT is, pre-trained, and then fine-tuning it on what you want, is, you know,
1:06:31 - 1:06:37 Text: a very, very strong baseline in NLP right now, right?
1:06:37 - 1:06:40 Text: And the simplicity is pretty fascinating.
1:06:40 - 1:06:46 Text: And there's one code base called Transformers, from a company called Hugging Face, that
1:06:46 - 1:06:51 Text: makes this really just a couple of lines of Python to try out as well.
1:06:51 - 1:06:57 Text: So it sort of opened up very strong baselines without too, too much effort for a lot of
1:06:57 - 1:06:58 Text: tasks.
1:06:58 - 1:07:01 Text: Okay, so let's talk about evaluation.
1:07:01 - 1:07:06 Text: So pre-training is pitched as requiring all these different kinds of language understanding.
1:07:06 - 1:07:11 Text: And the field of NLP has a hard time doing evaluation.
1:07:11 - 1:07:15 Text: But we try our best and we build datasets that we think are hard for various reasons because
1:07:15 - 1:07:19 Text: they require you to know stuff about language and about the world and about reasoning.
1:07:19 - 1:07:26 Text: And so when we evaluate whether pre-training is getting you a lot of sort of general knowledge,
1:07:26 - 1:07:30 Text: we evaluate on a lot of these tasks.
1:07:30 - 1:07:37 Text: So we evaluate on things like paraphrase detection on Quora questions.
1:07:37 - 1:07:39 Text: Natural language inference we saw.
1:07:39 - 1:07:43 Text: We have hard sentiment analysis datasets, or what were hard sentiment analysis datasets
1:07:43 - 1:07:45 Text: a couple of years ago.
1:07:45 - 1:07:50 Text: And actually, figuring out if sentences are grammatical tends to be hard.
1:07:50 - 1:07:54 Text: Determining the semantic similarity of text can be hard.
1:07:54 - 1:07:55 Text: Paraphrasing again.
1:07:55 - 1:07:57 Text: Natural language inference on a very, very small dataset.
1:07:57 - 1:08:01 Text: So this is a does-pre-training-help-you-train-on-smaller-datasets,
1:08:01 - 1:08:03 Text: the-answer-is-yes sort of thing.
1:08:03 - 1:08:09 Text: And so the BERT folks released their paper after GPT was released.
1:08:09 - 1:08:13 Text: And there were a lot of sort of state-of-the-art results that came from various things
1:08:13 - 1:08:16 Text: that you were supposed to be doing.
1:08:16 - 1:08:22 Text: And the results that you get sort of with pre-training: so here's OpenAI GPT, here's
1:08:22 - 1:08:23 Text: BERT base and large.
1:08:23 - 1:08:25 Text: The last three rows are all pre-trained.
1:08:25 - 1:08:32 Text: ELMo is sort of in the middle between pre-training the whole model and just having word embeddings.
1:08:32 - 1:08:34 Text: That's what this is.
1:08:34 - 1:08:39 Text: And the numbers you get were, I think, to the field, quite astounding actually.
1:08:39 - 1:08:44 Text: We were all surprised that there was that much left to even be gotten on some of these datasets.
1:08:44 - 1:08:49 Text: And look here: this line in the table is unmarked, but it's actually the number
1:08:49 - 1:08:50 Text: of training examples.
1:08:50 - 1:08:53 Text: This dataset has 2,500 training examples.
1:08:53 - 1:08:59 Text: And before sort of the big transformers came around, we had 60% accuracy on it.
1:08:59 - 1:09:01 Text: We run transformers on it,
1:09:01 - 1:09:03 Text: and we get 10 points just by pre-training.
1:09:03 - 1:09:07 Text: And this has been a trend that has just continued.
1:09:07 - 1:09:11 Text: So why do anything but pre-trained encoders?
1:09:11 - 1:09:13 Text: We know encoders are good.
1:09:13 - 1:09:15 Text: We like the fact that you have bidirectional context.
1:09:15 - 1:09:18 Text: We also saw that BERT did better than GPT.
1:09:18 - 1:09:27 Text: But if you want to actually get it to do things, you can't just generate sequences from
1:09:27 - 1:09:32 Text: it the same way that you would from a model like GPT, a pre-trained decoder.
1:09:32 - 1:09:34 Text: You can sort of sample what things should go in a mask.
1:09:34 - 1:09:39 Text: So here's a mask. You can put a mask somewhere, sample the words that should go there.
1:09:39 - 1:09:42 Text: But if you want to sample whole contexts, right, if you want to get that story about the
1:09:42 - 1:09:46 Text: unicorns, for example, the encoder is not what you want to use.
1:09:46 - 1:09:51 Text: So they have sort of different contracts, and they can be used naturally at least in
1:09:51 - 1:09:53 Text: different ways.
1:09:53 - 1:09:57 Text: Okay, so let's talk very briefly about extensions of BERT.
1:09:57 - 1:10:00 Text: So there are BERT variants like RoBERTa and SpanBERT.
1:10:00 - 1:10:04 Text: And there's just a bunch of papers with the word BERT in the title that did various things.
1:10:04 - 1:10:06 Text: Two very strong takeaways.
1:10:06 - 1:10:08 Text: RoBERTa: train BERT longer.
1:10:08 - 1:10:10 Text: BERT is underfit.
1:10:10 - 1:10:11 Text: Train it on more data.
1:10:11 - 1:10:13 Text: Train it for more steps.
1:10:13 - 1:10:18 Text: SpanBERT: masking contiguous spans of sub-words
1:10:18 - 1:10:21 Text: makes a harder, more useful pre-training task.
1:10:21 - 1:10:25 Text: So this is the idea that we can come up with better ways of noising the input, of hiding
1:10:25 - 1:10:30 Text: stuff in the input, or breaking stuff in the input, for our model to correct.
1:10:30 - 1:10:37 Text: So for example, if you have the sentence "ir ##res ##ist [MASK] good", it's just not that
1:10:37 - 1:10:43 Text: hard to know that this is "irresistibly", right? Because, like, what could this possibly
1:10:43 - 1:10:44 Text: be after these sub-words?
1:10:44 - 1:10:51 Text: So you've got "irresist", you know, something's about to come here, and it's probably the end
1:10:51 - 1:10:52 Text: of that word.
1:10:52 - 1:10:57 Text: Whereas if you mask a long contiguous sequence, right, now this is much harder, and actually
1:10:57 - 1:11:02 Text: you're getting a useful signal from predicting that it's "irresistibly good". You sort of needed to mask all of
1:11:02 - 1:11:04 Text: them to make the task interesting.
1:11:04 - 1:11:08 Text: So SpanBERT was like, oh, you should do this.
1:11:08 - 1:11:10 Text: This was super useful as well.
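A sketch of span masking as pre-processing: SpanBERT samples span lengths from a distribution, but a fixed length is passed in here for simplicity:

```python
import random

def mask_contiguous_span(tokens, span_len, mask_token="[MASK]"):
    """Hide a contiguous run of sub-words instead of scattered singles,
    so the model can't trivially fill in one piece of a word."""
    start = random.randrange(len(tokens) - span_len + 1)
    corrupted = list(tokens)
    for i in range(start, start + span_len):
        corrupted[i] = mask_token
    return corrupted, tokens[start:start + span_len], start
```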
1:11:10 - 1:11:15 Text: So RoBERTa, just to point you at the fact that RoBERTa showed that BERT was underfit:
1:11:15 - 1:11:21 Text: you know, BERT was trained on about 13 gigabytes of text, and it got some accuracies.
1:11:21 - 1:11:27 Text: You can get above the amazing results of BERT, four extra points or so here, right, just
1:11:27 - 1:11:35 Text: by taking the identical model and training it on more data, with a larger batch size, for
1:11:35 - 1:11:36 Text: a longer time.
1:11:36 - 1:11:41 Text: And if you train it even longer without sort of more data, you don't get any
1:11:41 - 1:11:45 Text: better.
1:11:45 - 1:11:49 Text: Very briefly, okay, so very briefly on the encoder decoders.
1:11:49 - 1:11:53 Text: So we've seen decoders can be good because we get to play with the contracts that they
1:11:53 - 1:11:57 Text: give us, we get to use them as language models, and encoders give us that bidirectional
1:11:57 - 1:11:58 Text: context.
1:11:58 - 1:12:02 Text: So with encoder-decoders, maybe we get both.
1:12:02 - 1:12:04 Text: In practice, they're actually, yeah, pretty strong.
1:12:04 - 1:12:11 Text: So I guess one of the questions is: what do we do
1:12:11 - 1:12:13 Text: to pre-train them?
1:12:13 - 1:12:18 Text: So we could do something like language modeling, right, where we take a sequence of words
1:12:18 - 1:12:27 Text: w_1 to w_2T instead of just to w_T, right, and so I have w_1 here, dot, dot, dot,
1:12:27 - 1:12:32 Text: w_T: we provide those all to our encoder and we predict on none of them.
1:12:32 - 1:12:37 Text: And then we have w_{T+1} to w_{2T} here in our decoder, right, and we predict
1:12:37 - 1:12:38 Text: on these.
1:12:38 - 1:12:42 Text: So we're doing language modeling on half the sequence, and we've taken the other half
1:12:42 - 1:12:46 Text: to have our bidirectional encoder, right, so we're building strong representations on
1:12:46 - 1:12:52 Text: the encoder side, not taking a language modeling loss on any of it.
1:12:52 - 1:12:55 Text: And then we, on the other half of the tokens, we predict, you know, as a language model
1:12:55 - 1:12:57 Text: would do.
1:12:57 - 1:13:01 Text: And the hope is that you sort of pre-trained both of these well through the one language
1:13:01 - 1:13:04 Text: modeling loss up here.
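As a sketch, splitting a training sequence for this prefix-style objective might look like this; the index conventions are just one choice:

```python
def split_for_prefix_lm(tokens):
    """First half w_1..w_T goes to the encoder with no loss; the decoder
    is trained to predict the second half w_{T+1}..w_{2T}."""
    T = len(tokens) // 2
    encoder_input = tokens[:T]        # w_1..w_T: bidirectional, no LM loss
    decoder_input = tokens[T - 1:-1]  # w_T..w_{2T-1}: shifted decoder inputs
    decoder_targets = tokens[T:]      # w_{T+1}..w_{2T}: loss computed here
    return encoder_input, decoder_input, decoder_targets
```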
1:13:04 - 1:13:06 Text: And this setup actually works pretty well.
1:13:06 - 1:13:11 Text: The encoder benefits from bidirectionality, and the decoder you can use to train the model.
1:13:11 - 1:13:19 Text: But what the paper that introduced the model T5, Raffel et al., found to work
1:13:19 - 1:13:22 Text: best was actually a very, or at least a somewhat, different objective.
1:13:22 - 1:13:26 Text: And this should keep in your mind sort of that we have different ways of specifying the
1:13:26 - 1:13:30 Text: pre-training objectives and they will really work differently from each other.
1:13:30 - 1:13:33 Text: So what they said, let's say you have an original text like this.
1:13:33 - 1:13:37 Text: Thank you for inviting me to your party last week.
1:13:37 - 1:13:43 Text: We're going to define variable length spans in the text to replace with a unique symbol
1:13:43 - 1:13:46 Text: that says something is missing here.
1:13:46 - 1:13:48 Text: And then we'll remove that text and put the symbol in its place.
1:13:48 - 1:13:57 Text: So now our input to our encoder is: Thank you <X> me to your party <Y> week.
1:13:57 - 1:14:01 Text: So we've noised the input; we've hidden stuff in the input.
1:14:01 - 1:14:05 Text: Also really interestingly, this doesn't say how long this is supposed to be.
1:14:05 - 1:14:07 Text: That's different from BERT.
1:14:07 - 1:14:10 Text: BERT said, oh, you masked this many sub words.
1:14:10 - 1:14:13 Text: This says, well, I got some token that says something's missing here.
1:14:13 - 1:14:14 Text: And I don't know what it is.
1:14:14 - 1:14:17 Text: I don't even know how many sub words it is.
1:14:17 - 1:14:24 Text: And then, so you have this in your encoder, and then your decoder predicts the first special
1:14:24 - 1:14:27 Text: token, this <X> here.
1:14:27 - 1:14:30 Text: And then what was missing: "for inviting".
1:14:30 - 1:14:33 Text: So: <X> for inviting.
1:14:33 - 1:14:34 Text: And then it predicts <Y>.
1:14:34 - 1:14:35 Text: Here's this <Y> here.
1:14:35 - 1:14:40 Text: And then what was missing at <Y>: "last".
1:14:40 - 1:14:42 Text: This is called span corruption.
1:14:42 - 1:14:47 Text: And it's really interesting to me because, in terms of the actual encoder-decoder, we don't
1:14:47 - 1:14:51 Text: have to change it at all compared to if we were just doing language modeling pre-training.
1:14:51 - 1:14:54 Text: Because I just do language modeling on all these target tokens.
1:14:54 - 1:14:57 Text: I just predict these words as if I'm a language model.
1:14:57 - 1:15:01 Text: I've just done a text pre-processing step.
1:15:01 - 1:15:06 Text: So I've just pre-processed the text to say: yeah, take the input,
1:15:06 - 1:15:11 Text: make it look like this, then make an output that looks like that up there.
1:15:11 - 1:15:15 Text: And the model gets to do what is effectively language modeling, but it actually works better.
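Here's a sketch of that pre-processing, using hand-picked spans so the example matches the slide; T5 itself samples the spans randomly:

```python
def span_corrupt(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Replace each (start, end) span in the input with a sentinel; the
    target lists each sentinel followed by the tokens it replaced."""
    inputs, targets, prev = [], [], 0
    for sentinel, (start, end) in zip(sentinels, spans):
        inputs += tokens[prev:start] + [sentinel]
        targets += [sentinel] + tokens[start:end]
        prev = end
    inputs += tokens[prev:]
    return inputs, targets

# span_corrupt("Thank you for inviting me to your party last week".split(),
#              [(2, 4), (8, 9)])
# input : Thank you <X> me to your party <Y> week
# target: <X> for inviting <Y> last
```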
1:15:15 - 1:15:18 Text: So there are a lot of numbers here, I realize.
1:15:18 - 1:15:20 Text: But look at the star here.
1:15:20 - 1:15:25 Text: It's this encoder-decoder with a denoising objective that tends to work the best.
1:15:25 - 1:15:31 Text: And they tried similar models, like a prefix language model, that was sort of the first
1:15:31 - 1:15:35 Text: try that we had at defining a pre-training objective for encoder
1:15:35 - 1:15:38 Text: decoders.
1:15:38 - 1:15:41 Text: And then they had a number of other options, but this denoising objective is what worked best for the encoder
1:15:41 - 1:15:43 Text: decoders.
1:15:43 - 1:15:48 Text: And one of the fascinating things about T5 is that you could pre-train it and fine-tune
1:15:48 - 1:15:54 Text: it on questions like "When was Franklin D. Roosevelt born?", fine-tuning it to produce the
1:15:54 - 1:15:55 Text: answer.
1:15:55 - 1:15:58 Text: And then you could ask it new questions at test time.
1:15:58 - 1:16:02 Text: And it would retrieve the answer from its parameters with some accuracy.
1:16:02 - 1:16:05 Text: And it would do so relatively well, actually:
1:16:05 - 1:16:10 Text: maybe 25% of the time on some of these datasets, with 220 million
1:16:10 - 1:16:11 Text: parameters.
1:16:11 - 1:16:15 Text: And then at 11 billion parameters, which is way bigger than BERT large,
1:16:15 - 1:16:20 Text: it would do even better, sometimes even doing as well as systems that were allowed
1:16:20 - 1:16:22 Text: to look at stuff other than their own parameters.
1:16:22 - 1:16:26 Text: So again, this is just making this answer come from its parameters.
1:16:26 - 1:16:30 Text: Yeah, I'm going to have to skip this.
1:16:30 - 1:16:35 Text: So if you look back at this slide after class, I have each of the examples of the things
1:16:35 - 1:16:40 Text: that we could imagine learning from pre-training with a label of what you might be learning.
1:16:40 - 1:16:43 Text: So this example, "Stanford University is located in ___":
1:16:43 - 1:16:44 Text: you might learn trivia.
1:16:44 - 1:16:46 Text: In all these cases, there's all these things you can learn.
1:16:46 - 1:16:53 Text: One thing I will say is that models also learn, and can even make worse, racism, sexism,
1:16:53 - 1:16:56 Text: and all manner of bad biases that are encoded in our text.
1:16:56 - 1:17:00 Text: When I say they can: yeah, they do this.
1:17:00 - 1:17:02 Text: And so we'll learn more about this in our later lectures, but it's important to keep
1:17:02 - 1:17:06 Text: in mind that when you're doing pre-training, you're learning a lot of stuff, and not all
1:17:06 - 1:17:09 Text: of it is good.
1:17:09 - 1:17:16 Text: So with GPT-3, the last thing here is that there's this third way of interacting with models
1:17:16 - 1:17:19 Text: that's related to treating them as language models.
1:17:19 - 1:17:25 Text: So GPT-3 is this very, very large model that was released by OpenAI.
1:17:25 - 1:17:31 Text: But it seems to be able to learn from examples in its context, its decoder context,
1:17:31 - 1:17:36 Text: without gradient steps, simply by looking sort of within its history.
1:17:36 - 1:17:40 Text: And now GPT-3 has 175 billion parameters, right?
1:17:40 - 1:17:44 Text: The last T5 model we saw was 11 billion parameters.
1:17:44 - 1:17:48 Text: And it seems to be sort of the canonical example of this working.
1:17:48 - 1:17:52 Text: And so what it looks like is: you give it, as part of its prefix,
1:17:52 - 1:17:56 Text: these translation examples, "thanks" goes to "merci", "hello" goes to its French translation,
1:17:56 - 1:18:03 Text: and so on, you ask it for the last one, and it comes up with the correct translation.
1:18:03 - 1:18:06 Text: Seemingly because it's learned something about the task that you're sort of telling
1:18:06 - 1:18:08 Text: it to do through its prefix.
1:18:08 - 1:18:10 Text: And so you might do the same thing with addition.
1:18:10 - 1:18:15 Text: So: five plus eight is 13. Give it addition examples, and it might do the next
1:18:15 - 1:18:18 Text: addition example for you.
1:18:18 - 1:18:24 Text: Or maybe trying to figure out grammatical or spelling errors, for example.
1:18:24 - 1:18:29 Text: And here's the French case.
1:18:29 - 1:18:33 Text: So again, to learn, you just do pre-training.
1:18:33 - 1:18:40 Text: But when you're evaluating it, you don't even fine-tune the model; you just provide prefixes.
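In code, "just provide prefixes" is nothing more than string formatting; a sketch, with a made-up arrow format for the demonstrations:

```python
def few_shot_prompt(examples, query):
    """Build an in-context learning prefix: demonstrations, then the
    query. No gradient updates; the model just continues the text."""
    lines = [f"{x} => {y}" for x, y in examples]
    lines.append(f"{query} =>")
    return "\n".join(lines)

# few_shot_prompt([("5 + 8", "13"), ("7 + 2", "9")], "6 + 6")
# -> "5 + 8 => 13\n7 + 2 => 9\n6 + 6 =>"   (then sample the continuation)
```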
1:18:40 - 1:18:43 Text: And so this especially is not well understood.
1:18:43 - 1:18:48 Text: And so a lot of research is going into sort of what the limitations of this so-called
1:18:48 - 1:18:49 Text: in-context learning are.
1:18:49 - 1:18:53 Text: But it's a fascinating direction for future work.
1:18:53 - 1:18:56 Text: In total, these models are not well understood.
1:18:56 - 1:19:01 Text: However, small, small, in-air growth models like Bert have become general tools in a wide
1:19:01 - 1:19:02 Text: range of settings.
1:19:02 - 1:19:06 Text: They do have these issues of learning all these biases about the world,
1:19:06 - 1:19:10 Text: which we'll go into in further lectures in this course.
1:19:10 - 1:19:15 Text: And so, yeah, what you've learned this week, transformers and pre-training, forms the basis,
1:19:15 - 1:19:20 Text: or at least the baselines, for much of natural language processing today.
1:19:20 - 1:19:25 Text: And assignment five is out and you'll be able to look more into it.
1:19:25 - 1:19:26 Text: And I'm over time.
1:19:26 - 1:19:27 Text: All right.
1:19:27 - 1:19:28 Text: Yeah.
1:19:28 - 1:19:39 Text: I guess I can take a question if there is any, but people can keep going as well.
1:19:39 - 1:19:50 Text: So I think there's a question about T5, which was: how does the decoder know
1:19:50 - 1:19:52 Text: that it's currently predicting X or Y?
1:19:52 - 1:19:54 Text: Could you repeat that?
1:19:54 - 1:19:55 Text: Yeah.
1:19:55 - 1:20:00 Text: So about P5, there's a question that was asking how does the D-toder know it's currently
1:20:00 - 1:20:04 Text: predicting X for Y?
1:20:04 - 1:20:09 Text: It's hierarchy of predicting X for Y?
1:20:09 - 1:20:13 Text: I guess it doesn't specify it's going to change how does it know that it's currently
1:20:13 - 1:20:15 Text: predicting X for Y?
1:20:15 - 1:20:16 Text: OK.
1:20:16 - 1:20:17 Text: Yeah.
1:20:17 - 1:20:18 Text: That makes sense.
1:20:18 - 1:20:19 Text: So what it does, right?
1:20:19 - 1:20:23 Text: So it knows from the encoder that it has to at some point predict X and at some point
1:20:23 - 1:20:28 Text: predict Y because the encoder can just like remember that, oh, yeah, there's two things
1:20:28 - 1:20:29 Text: missing.
1:20:29 - 1:20:33 Text: And if there were more spans replaced, there would be a Z and then an A and then a B and
1:20:33 - 1:20:38 Text: you know whatever, just a bunch of unique identifiers.
1:20:38 - 1:20:44 Text: And then up here, it gets to say: OK, I have attention,
1:20:44 - 1:20:48 Text: so I can look, and I know that first I have to predict this first masked thing.
1:20:48 - 1:20:52 Text: So I'm going to generate that in my decoder, and then it gets that symbol, right?
1:20:52 - 1:20:55 Text: So we're doing training by giving it the right symbol.
1:20:55 - 1:20:59 Text: Now it gets that X and it says, OK, I'm predicting X now.
1:20:59 - 1:21:01 Text: And now it can predict, predict, predict, predict.
1:21:01 - 1:21:02 Text: Then it gets Y.
1:21:02 - 1:21:06 Text: So we're doing this teacher forcing training where we give it the right answer after penalizing
1:21:06 - 1:21:07 Text: it if it's wrong.
1:21:07 - 1:21:10 Text: Now it gets this Y, right?
1:21:10 - 1:21:12 Text: And it says: OK, now I have to predict what should go at Y.
1:21:12 - 1:21:16 Text: And it can attend, you know, to the encoded input as well as to what it's already
1:21:16 - 1:21:22 Text: predicted here, because the decoder has attention within itself, and it can see what should go
1:21:22 - 1:21:23 Text: there.
1:21:23 - 1:21:25 Text: So what's fascinating here is you're doing something like language modeling.
1:21:25 - 1:21:29 Text: But when you're predicting Y, right, you get to see what came after it.
1:21:29 - 1:21:31 Text: And that's I think one of the benefits of span corruption.
1:21:31 - 1:21:34 Text: So you're doing this thing where you don't know how long you should be predicting for,
1:21:34 - 1:21:39 Text: like language modeling, but you get to know what came after the thing that's missing.