Stanford CS224N: NLP with Deep Learning | Winter 2020 | BERT and Other Pre-trained Language Models

0:00:00 - 0:00:11     Text: Okay, so I'm going to talk about BERT and also some kind of precursor work and then some

0:00:11 - 0:00:15     Text: follow-up work that's happened in the last year or so, well, not follow-up, but more

0:00:15 - 0:00:19     Text: recent advancements that have happened since then.

0:00:19 - 0:00:22     Text: So first we're going to talk about history and background.

0:00:22 - 0:00:27     Text: So everyone knows and loves word embeddings in NLP, right?

0:00:27 - 0:00:33     Text: They're kind of the basis for why neural networks work for NLP.

0:00:33 - 0:00:40     Text: Because neural networks work in continuous space vectors and matrices and obviously text

0:00:40 - 0:00:45     Text: is discrete space and so there needed to be something to bridge the gap and it turns

0:00:45 - 0:00:49     Text: out that the thing to bridge the gap, it's actually pretty simple, it's just a look-up

0:00:49 - 0:00:56     Text: table from a set of discrete vocabulary items to a vector that's learned discriminatively

0:00:56 - 0:00:57     Text: end to end, right?

0:00:57 - 0:01:02     Text: So originally these were just learned like in the original Bengio 2003 neural language

0:01:02 - 0:01:07     Text: model paper, these were just trained discriminatively end to end, and so

0:01:07 - 0:01:12     Text: then people would train language models and then use the embedding

0:01:12 - 0:01:16     Text: layer as pre-trained representations for other tasks.

0:01:16 - 0:01:19     Text: But they wouldn't use the rest of the language model, they would just use the embedding layer.

0:01:19 - 0:01:24     Text: And then word2vec and GloVe and stuff came along, where people found a much cheaper,

0:01:24 - 0:01:31     Text: much more scalable way to train where you can just use the statistics of a corpus where

0:01:31 - 0:01:34     Text: it's just a linear model so you don't have to compute these expensive feed-forward layers

0:01:34 - 0:01:38     Text: that you're going to throw out anyways and so you can scale up to like billions of tokens

0:01:38 - 0:01:41     Text: on a single CPU, right?

0:01:41 - 0:01:47     Text: So the problem though is that these word embeddings are applied in the context-free manner, right?

0:01:47 - 0:01:53     Text: So for kind of a simple toy example, the word bank: if you say open a bank account

0:01:53 - 0:01:55     Text: and on a river bank, it's going to be the same embedding.

0:01:55 - 0:02:00     Text: So people try to do stuff like word sense embeddings where it's not just a single word, it's

0:02:00 - 0:02:06     Text: a full word sense, but this kind of bank example, it's a little bit of a toy example, right?

0:02:06 - 0:02:10     Text: Almost any word has a different meaning depending on the context.

0:02:10 - 0:02:15     Text: So even like open the bank account and I went to the bank, those are still

0:02:15 - 0:02:19     Text: semi-different senses of the word bank.

0:02:19 - 0:02:23     Text: I mean, they have different parts of speech, kind of, well I guess

0:02:23 - 0:02:26     Text: not really, but they're kind of using different senses, right?

0:02:26 - 0:02:29     Text: And so, yes, so we really need a contextual representation, right?

0:02:29 - 0:02:36     Text: So we want something where it's a representation of a word after it's been put into the context

0:02:36 - 0:02:37     Text: of the sentence that we've seen it in, right?

0:02:37 - 0:02:39     Text: Which would be like at the bottom here.

0:02:39 - 0:02:45     Text: So kind of for history of contextual representations, the first big paper for this type of contextual

0:02:45 - 0:02:50     Text: representation was a paper from Google in 2015 called Semi-Supervised Sequence Learning

0:02:50 - 0:02:52     Text: from Andrew Dai and Quoc Le.

0:02:52 - 0:02:58     Text: And so this one was actually very similar to the papers that came after it, but it didn't

0:02:58 - 0:02:59     Text: get as much attention for various reasons.

0:02:59 - 0:03:04     Text: So but basically they had some classification task like sentiment classification on

0:03:04 - 0:03:07     Text: movie reviews, and they had a big corpus on movie reviews.

0:03:07 - 0:03:11     Text: And so then they said what happens if we just take our existing LSTM model and instead

0:03:11 - 0:03:13     Text: of just using pre-trained embeddings, which everyone had already been doing since, like,

0:03:13 - 0:03:19     Text: probably since 2003, people had been using pre-trained embeddings,

0:03:19 - 0:03:23     Text: but they said let's actually pre-train the entire model as a language model and then

0:03:23 - 0:03:26     Text: let's fine tune it for our classification task.

0:03:26 - 0:03:31     Text: And they got pretty good results but not like stellar results.

0:03:31 - 0:03:34     Text: And so now we know that the reason why they didn't get stellar results is they didn't train

0:03:34 - 0:03:36     Text: on enough data, because they basically pre-trained on the same corpus that they were fine-tuning

0:03:36 - 0:03:40     Text: on, and they trained the same size model that they would normally train.

0:03:40 - 0:03:41     Text: Which we now know needs to be bigger.

0:03:41 - 0:03:46     Text: But this was already kind of a little bit ahead of its time, partially

0:03:46 - 0:03:50     Text: because we didn't have as much compute back then, even though that was

0:03:50 - 0:03:52     Text: only five years ago.

0:03:52 - 0:03:54     Text: And it would have been more expensive.

0:03:54 - 0:04:02     Text: So then in 2017, ELMo came out, which was from the University of Washington and AI2.

0:04:02 - 0:04:08     Text: And so in this one, they did something pretty clever where you train a language

0:04:08 - 0:04:12     Text: model on a big corpus, so they trained it on a billion word corpus and they trained a

0:04:12 - 0:04:17     Text: big model, an LSTM with 4,000 hidden dimensions, which is quite expensive.

0:04:17 - 0:04:19     Text: And they trained a bi-directional model.

0:04:19 - 0:04:24     Text: But it was kind of weakly bidirectional, where they trained a left-to-right model and then

0:04:24 - 0:04:28     Text: a right-to-left model and then they concatenated the two.

0:04:28 - 0:04:30     Text: And they called these contextual pre-trained embeddings.

0:04:30 - 0:04:36     Text: And so the idea behind ELMo is that this doesn't actually change your existing model architecture.

0:04:36 - 0:04:40     Text: You kind of take whatever task specific model architecture that you have, which could

0:04:40 - 0:04:45     Text: be for question answering, it might be some sort of fancy model where you do a LSTM over

0:04:45 - 0:04:50     Text: the source and over the question and over the answer, then you attend from one to another and

0:04:50 - 0:04:52     Text: whatever kind of architecture you have.

0:04:52 - 0:04:58     Text: And wherever you would have put in GloVe embeddings before, now you put in ELMo embeddings.

0:04:58 - 0:05:03     Text: And so this got state of the art on everything at the time, question answering, semantic

0:05:03 - 0:05:07     Text: parsing, syntactic parsing, because if you just took any existing kind

0:05:07 - 0:05:12     Text: of state-of-the-art model, you could put in ELMo embeddings and get state-of-the-art,

0:05:12 - 0:05:13     Text: right?

0:05:13 - 0:05:17     Text: But the models themselves were kind of fixed.

0:05:17 - 0:05:24     Text: And so then after that, OpenAI published Improving Language Understanding with Generative

0:05:24 - 0:05:27     Text: Pre-Training, which is called GPT-1.

0:05:27 - 0:05:37     Text: And so in this, they took a similarly large corpus of about a billion words, and they trained a

0:05:37 - 0:05:38     Text: very large language model.

0:05:38 - 0:05:42     Text: So a 12-layer language model, which at the time was maybe, I don't know whether it was

0:05:42 - 0:05:44     Text: actually the largest language model that had been trained at the time, but certainly it was

0:05:44 - 0:05:47     Text: the largest language model that had been trained on that much data for a kind of open-source

0:05:47 - 0:05:49     Text: model.

0:05:49 - 0:05:54     Text: And when I first read it, I actually thought that it was too big, not that it was worse,

0:05:54 - 0:05:56     Text: but that they were kind of just showing off by showing how big of a model they could

0:05:56 - 0:05:57     Text: train.

0:05:57 - 0:06:01     Text: But now we know that actually this depth that they had was actually kind of the crucial element.

0:06:01 - 0:06:03     Text: So they did something that was like fairly simple, right?

0:06:03 - 0:06:07     Text: They just trained a language model, a very large one, and then they just fine-tuned it by

0:06:07 - 0:06:11     Text: taking the last token and then fine-tuning it for a classification task, right?

0:06:11 - 0:06:13     Text: So is this positive or negative?

0:06:13 - 0:06:18     Text: And they got basically state-of-the-art on lots of different classification tasks.

0:06:18 - 0:06:25     Text: But, so I'm going to actually take a kind of an aside here before I go into BERT,

0:06:25 - 0:06:26     Text: which is about the transformer.

0:06:26 - 0:06:31     Text: So that was the other kind of big thing, like the big precursor that allowed BERT and

0:06:31 - 0:06:33     Text: GPT to work well, right?

0:06:33 - 0:06:38     Text: So BERT and GPT both use the transformer, which I'm sure you guys have learned about.

0:06:38 - 0:06:43     Text: And so I don't need to necessarily go into all the details about it.

0:06:43 - 0:06:49     Text: But, so it has multi-headed attention, feed-forward layers, layer norm.

0:06:49 - 0:06:51     Text: I won't go into all the details because I think you guys already learned about it.

0:06:51 - 0:06:57     Text: But the big thing about why this kind of took over is, there's really two advantages

0:06:57 - 0:06:58     Text: versus the LSTM.

0:06:58 - 0:07:00     Text: One is that there's no locality bias.

0:07:00 - 0:07:06     Text: And so, long-distance context has an equal opportunity to short-distance context,

0:07:06 - 0:07:09     Text: which is important.

0:07:09 - 0:07:14     Text: So for like normal language understanding, the locality bias of LSTMs is generally

0:07:14 - 0:07:18     Text: considered to be a good thing.

0:07:18 - 0:07:22     Text: Because local context is more relevant than long-distance context.

0:07:22 - 0:07:27     Text: But the way that GPT and BERT and other models work is that they actually concatenate

0:07:27 - 0:07:28     Text: context.

0:07:28 - 0:07:35     Text: And so if you have a model that says does sentence one entail sentence two, the way that it was

0:07:35 - 0:07:39     Text: done historically, meaning like before GPT, was that you would like encode them both,

0:07:39 - 0:07:43     Text: let's say with an LSTM, then you would do attention from one to the other.

0:07:43 - 0:07:49     Text: With a transformer, you can just put them into the same sequence, give them separate

0:07:49 - 0:07:51     Text: segment embeddings and add a separator token.

0:07:51 - 0:07:56     Text: And then things can attend to their own sentence locally.

0:07:56 - 0:08:02     Text: But it can also attend all the way to the other sentence, and it's just

0:08:02 - 0:08:04     Text: as easy for it to attend all the way to the other sentence.

0:08:04 - 0:08:07     Text: And so when you do this, kind of you can just pack everything into a single sequence and

0:08:07 - 0:08:09     Text: then everything will be learned.

0:08:09 - 0:08:13     Text: Rather than having to do this as part of the model architecture, which ends up being a

0:08:13 - 0:08:17     Text: pretty important thing about simplifying these models.

0:08:17 - 0:08:25     Text: And so the other thing is that, with LSTMs, let's say this

0:08:25 - 0:08:27     Text: is a batch and these are the words in the batch.

0:08:27 - 0:08:30     Text: You have two sentences and four words per sentence.

0:08:30 - 0:08:32     Text: Every step has to be computed one at a time.

0:08:32 - 0:08:35     Text: So you only get a batch size of two effectively.

0:08:35 - 0:08:40     Text: And so on modern hardware, which is TPUs and GPUs, the bigger the matrix multiplication,

0:08:40 - 0:08:41     Text: the better it is.

0:08:41 - 0:08:43     Text: You want all three dimensions to be big.

0:08:43 - 0:08:47     Text: So even if you have big hidden layers, your batch size dimension will still be small,

0:08:47 - 0:08:51     Text: unless you have a huge batch, but then that's too expensive for long sequences.

0:08:51 - 0:08:57     Text: But with transformers, because it's layer-wise attention,

0:08:57 - 0:08:58     Text: the effective batch size is the total number of words.

0:08:58 - 0:09:05     Text: So if you have 512 words and then 32 sentences, it's actually 32 times 512 as the total batch size.

0:09:05 - 0:09:10     Text: So you get these huge matrix multiplication, and you can take advantage of modern hardware.
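
A small sketch of that batching argument in Python with NumPy; the shapes below are illustrative, not BERT's actual configuration. An LSTM processes one time step at a time, so each matrix multiply only has batch-size rows, while a transformer layer processes every position of every sentence in one multiply.

```python
import numpy as np

batch, seq_len, hidden = 32, 512, 1024           # illustrative sizes
W = np.random.randn(hidden, hidden)

# LSTM-style: one time step at a time -> each matmul has only `batch` rows,
# and you repeat it seq_len times sequentially.
x_t = np.random.randn(batch, hidden)
step_out = x_t @ W                               # (32, 1024) @ (1024, 1024)

# Transformer-style: all positions at once -> one big matmul whose effective
# batch is batch * seq_len = 32 * 512 = 16,384 rows.
x = np.random.randn(batch * seq_len, hidden)
layer_out = x @ W                                # (16384, 1024) @ (1024, 1024)
```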

0:09:10 - 0:09:14     Text: And so that's kind of why the transformer has taken over, because of these two things.

0:09:14 - 0:09:16     Text: And that's why it was used in GPT and why it's used in BERT.

0:09:16 - 0:09:20     Text: So now I'm going to talk about BERT.

0:09:20 - 0:09:29     Text: So the problem with the previous models, being ELMo and GPT and ones before it, is that

0:09:29 - 0:09:35     Text: the language models only used left context or right context or a concatenation of both,

0:09:35 - 0:09:39     Text: but really, language understanding is bidirectional.

0:09:39 - 0:09:45     Text: So there's this clear kind of mismatch: why did everyone train unidirectional

0:09:45 - 0:09:50     Text: models, where you could only see to the left or only see to the right, when we know that

0:09:50 - 0:09:53     Text: in order to understand language, you need to look in both directions.

0:09:53 - 0:09:55     Text: So there's two reasons.

0:09:55 - 0:10:01     Text: So one is that language models historically had typically been used as features in

0:10:01 - 0:10:02     Text: other systems.

0:10:02 - 0:10:05     Text: So the most direct application of a language model would be predictive text, which is directly

0:10:05 - 0:10:07     Text: just saying predict the next word.

0:10:07 - 0:10:10     Text: But the other applications that are actually more common are to use them in a machine

0:10:10 - 0:10:15     Text: translation system or a speech recognition system, where you have these features like translation

0:10:15 - 0:10:19     Text: features or acoustic features, and then you add a language model that says what's the

0:10:19 - 0:10:20     Text: probability of the sentence.

0:10:20 - 0:10:23     Text: And so for this, you want it to be a well-formed distribution.

0:10:23 - 0:10:26     Text: So for these pre-trained models we actually don't care about this, but this was kind of something

0:10:26 - 0:10:32     Text: where people had just been, I guess, fixed on this idea that language

0:10:32 - 0:10:35     Text: models have to give a well-formed probability distribution, even though we actually don't

0:10:35 - 0:10:36     Text: care about that.

0:10:36 - 0:10:40     Text: But the other kind of bigger reason is that words can see themselves in a bidirectional

0:10:40 - 0:10:42     Text: encoder.

0:10:42 - 0:10:50     Text: And so what this means is when you build a representation incrementally, so you have your input and

0:10:50 - 0:10:52     Text: then you have your output and it's always offset by one.

0:10:52 - 0:10:58     Text: So we have the start-of-sentence token, we predict the first word, then we feed in the

0:10:58 - 0:11:01     Text: first word and predict the second word, and so on, and so we can encode

0:11:01 - 0:11:05     Text: the sentence once and predict all the words in the sentence with the unidirectional model.

0:11:05 - 0:11:07     Text: And so this gives us good sample efficiency, right?

0:11:07 - 0:11:11     Text: Because if we have a 512-length sequence, like a sequence of 500 words, we don't want

0:11:11 - 0:11:17     Text: to have to only predict one word because it's going to be 500 times as much compute to

0:11:17 - 0:11:20     Text: get the same amount of predictions.

0:11:20 - 0:11:24     Text: If we would just trivially do a bidirectional LSTM or transformer, we would have a situation

0:11:24 - 0:11:28     Text: where you encode your sentence, everything is bidirectional.

0:11:28 - 0:11:32     Text: And so after the first layer, everything can see itself.

0:11:32 - 0:11:35     Text: So this word open, there's a path back down to open.

0:11:35 - 0:11:39     Text: And so it's trivial to predict a word when it's in the input also, right?

0:11:39 - 0:11:42     Text: There's no actual prediction going on there.

0:11:42 - 0:11:47     Text: So the simple solution, which is basically the whole crux of BERT, is that instead

0:11:47 - 0:11:55     Text: of training a normal language model, let's just mask out k percent of the words and predict those.

0:11:55 - 0:11:59     Text: So "the man went to the [MASK] to buy a [MASK] of milk."

0:11:59 - 0:12:03     Text: And so now you can run a bidirectional model on that.

0:12:03 - 0:12:08     Text: And because the words aren't in the input, you can't cheat, right?

0:12:08 - 0:12:13     Text: And so the downside of this is that you're not getting as many predictions per sentence,

0:12:13 - 0:12:14     Text: right?

0:12:14 - 0:12:17     Text: You're only predicting 15 percent of the words instead of 100 percent of the words.

0:12:17 - 0:12:19     Text: But the upside is that you're getting a much richer model because you're seeing both

0:12:19 - 0:12:21     Text: directions, right?

0:12:21 - 0:12:26     Text: So this value of k is a hyper parameter that we have to just decide on empirically.

0:12:26 - 0:12:27     Text: So we use 15 percent.

0:12:27 - 0:12:30     Text: It turns out that that's actually kind of an optimal value.

0:12:30 - 0:12:34     Text: We, and also people since then, have done more thorough ablation experiments and

0:12:34 - 0:12:36     Text: found that this 15 percent is good.

0:12:36 - 0:12:40     Text: So the reason for doing a certain percent over another is that if you were to do, let's

0:12:40 - 0:12:43     Text: say, 50 percent masking, you would get way more predictions, but you would also mask

0:12:43 - 0:12:46     Text: out like all of your context.

0:12:46 - 0:12:52     Text: And so if you mask out all of your context, you can't

0:12:52 - 0:12:53     Text: learn contextual models.

0:12:53 - 0:12:58     Text: And if you only do, let's say, one masked word per sequence, that might be optimal maybe,

0:12:58 - 0:13:00     Text: but you have to do way more data processing.

0:13:00 - 0:13:02     Text: So it would be way more expensive to train.

0:13:02 - 0:13:05     Text: And we know that these models are basically just compute bounded.

0:13:05 - 0:13:08     Text: So if you just have enough data, you can just kind of train them infinitely and it'll

0:13:08 - 0:13:10     Text: always do better.

0:13:10 - 0:13:14     Text: So it's really just a trade-off between these two things.

0:13:14 - 0:13:19     Text: So one other little detail in BERT, which may turn out to not be super important, is that

0:13:19 - 0:13:24     Text: because the [MASK] token is never seen at fine-tuning time, instead of always replacing a word

0:13:24 - 0:13:30     Text: with the [MASK] token as in this case, we would randomly sometimes replace it with a random

0:13:30 - 0:13:32     Text: word and sometimes keep the same word.

0:13:32 - 0:13:36     Text: So like, 10 percent of the time we would say, "we went to the running" instead of "we went to the store,"

0:13:36 - 0:13:37     Text: right?

0:13:37 - 0:13:42     Text: And so we wouldn't tell the model which case was which.

0:13:42 - 0:13:48     Text: We would just say, what should this word be, right?

0:13:48 - 0:13:50     Text: And the model didn't know whether the word it saw was right or not.

0:13:50 - 0:13:51     Text: So it could be the same word.

0:13:51 - 0:13:52     Text: So 10 percent of the time, it's the same word.

0:13:52 - 0:13:53     Text: It could be a random word.

0:13:53 - 0:13:58     Text: And so it has to basically be able to maintain a good representation of every word because

0:13:58 - 0:14:00     Text: it doesn't know whether it's really the right word.

0:14:00 - 0:14:03     Text: So it has to actually look at every word and figure out whether this is the right word.

0:14:03 - 0:14:06     Text: So we could potentially even just get away with not using the [MASK] token at all, and just doing

0:14:06 - 0:14:08     Text: like this 50 percent of the time and this 50 percent of the time.

0:14:08 - 0:14:12     Text: But the reason for not doing that is that, you know, then we'd be corrupting a lot of our

0:14:12 - 0:14:16     Text: data, and we don't necessarily want to corrupt the data, because the fact that this is the

0:14:16 - 0:14:19     Text: wrong word might mess up our prediction for some other word over here, right?

0:14:19 - 0:14:23     Text: Whereas with a [MASK] token, at least it knows that it's not the right word, so it doesn't

0:14:23 - 0:14:26     Text: use that as part of its context.
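
Here is a minimal sketch of that masking procedure in Python. The 15% selection rate and the 80/10/10 split are the values described above; the function and toy vocabulary are illustrative, not the released BERT preprocessing code.

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "man", "went", "to", "store", "buy", "gallon", "of", "milk", "running"]  # toy vocab

def mask_tokens(tokens, mask_prob=0.15):
    """Pick ~15% of positions as prediction targets, then:
    80% -> replace with [MASK], 10% -> replace with a random word, 10% -> keep as-is."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                       # the model must recover the original token here
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK                  # 80%: mask it out
            elif r < 0.9:
                inputs[i] = random.choice(VOCAB)  # 10%: corrupt with a random word
            # remaining 10%: leave the original word in place
    return inputs, labels

print(mask_tokens("the man went to the store to buy a gallon of milk".split()))
```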

0:14:26 - 0:14:33     Text: So the other kind of detail of BERT, which subsequently may not have been

0:14:33 - 0:14:40     Text: that important, is that for a lot of these tasks we're not just learning words, we

0:14:40 - 0:14:42     Text: want to predict the relationship between sentences.

0:14:42 - 0:14:47     Text: So for question answering in particular, we have a query which is generally a sentence

0:14:47 - 0:14:54     Text: and then we have an answer which is a paragraph or a sentence or a document and we want to,

0:14:54 - 0:14:56     Text: you know, say does this answer the question.

0:14:56 - 0:15:03     Text: So we want to have some pre-training task that actually does a sentence-level

0:15:03 - 0:15:06     Text: prediction rather than just a word-level prediction.

0:15:06 - 0:15:09     Text: So for the way that we did this, we need this to have like an infinite amount

0:15:09 - 0:15:10     Text: of data, right?

0:15:10 - 0:15:13     Text: We're going to generate an infinite amount of data so we don't want this to be an annotated

0:15:13 - 0:15:14     Text: task.

0:15:14 - 0:15:20     Text: So the way that we did this is we just did a next sentence prediction task where we just

0:15:20 - 0:15:26     Text: took two sentences from the same corpus, and 50% of the time they're from

0:15:26 - 0:15:31     Text: the same document, 50% of the time the second is from a random document, and then we just said, was this

0:15:31 - 0:15:33     Text: the real next sentence or not?

0:15:33 - 0:15:37     Text: And so if you have like, the man went to the store, he bought a gallon of milk, that is the next

0:15:37 - 0:15:38     Text: sentence.

0:15:38 - 0:15:40     Text: If you said the man went to the store, penguins are flightless, that's not the next

0:15:40 - 0:15:41     Text: sentence.

0:15:41 - 0:15:45     Text: So basically now we're forcing the model at pre-training time to actually look

0:15:45 - 0:15:48     Text: at the full sentences and then make some sort of sentence-level prediction, and we hope that

0:15:48 - 0:15:55     Text: this kind of generalizes to something like question answering, where you have a question

0:15:55 - 0:16:00     Text: and answer as sentence A and sentence B.
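
A rough sketch of how that next sentence prediction data can be generated; the function and document structure here are illustrative assumptions, not the actual BERT data pipeline.

```python
import random

def make_nsp_example(doc, all_docs):
    """doc: list of sentences from one document; all_docs: list of such documents.
    50% of the time sentence B really follows sentence A, 50% of the time it is
    pulled from a random document, and the model predicts which case it is."""
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"
    else:
        sent_b, label = random.choice(random.choice(all_docs)), "NotNext"
    return sent_a, sent_b, label

docs = [["the man went to the store .", "he bought a gallon of milk ."],
        ["penguins are flightless .", "they live in antarctica ."]]
print(make_nsp_example(docs[0], docs))
```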

0:16:00 - 0:16:07     Text: So in terms of our input representation, it looks pretty similar to a normal transformer

0:16:07 - 0:16:12     Text: but we have these additional embeddings which are called segment embeddings.

0:16:12 - 0:16:17     Text: So normal transformer, you would have your input and then you would do word piece segmentation

0:16:17 - 0:16:25     Text: right, where we apply this unsupervised splitting of words into kind of morphological

0:16:25 - 0:16:29     Text: splits, but they're often not actually morphological, since it's unsupervised.

0:16:29 - 0:16:32     Text: But you end up with something that's roughly morphological right?

0:16:32 - 0:16:37     Text: And so now you have like no out-of-vocabulary tokens, everything is represented; at the

0:16:37 - 0:16:39     Text: very least you can always fall back to splitting into characters.

0:16:39 - 0:16:45     Text: So we use a 30,000 word vocabulary and then we have our token embeddings, then we have

0:16:45 - 0:16:47     Text: our normal position embeddings which is at the bottom.

0:16:47 - 0:16:51     Text: So these are part of the transformer, because transformers, unlike LSTMs, don't have

0:16:51 - 0:16:55     Text: any sort of positional awareness.

0:16:55 - 0:17:00     Text: So the way to encode that is that you encode an actual embedding for every position.

0:17:00 - 0:17:05     Text: So this is called absolute position embedding, there's other techniques nowadays.

0:17:05 - 0:17:10     Text: And then you have the segment embedding which is this is a sentence A or sentence B.

0:17:10 - 0:17:14     Text: And so this kind of generalizes to more general contexts.

0:17:14 - 0:17:17     Text: So you can imagine if you're trying to say like you're trying to do like web search, you

0:17:17 - 0:17:24     Text: might say here's my query, here's the title, here's the URL, here's the document content.

0:17:24 - 0:17:27     Text: And so you can kind of just pack these all into a single sequence and then just give them

0:17:27 - 0:17:34     Text: different segment embeddings or type embeddings, so that now you're able to

0:17:34 - 0:17:41     Text: kind of just represent everything in this kind of same single sequence,

0:17:41 - 0:17:44     Text: where you differentiate the parts by just the single embedding that's different.

0:17:44 - 0:17:47     Text: And this is all of course learned.

0:17:47 - 0:17:50     Text: And so this is in contrast to kind of the older style where you would typically have a

0:17:50 - 0:17:51     Text: different encoder for every part.

0:17:51 - 0:17:54     Text: So like you would have a different encoder for the query and then maybe the title and the

0:17:54 - 0:17:55     Text: URL.

0:17:55 - 0:17:58     Text: But this case it's all just a single sequence.
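
A minimal sketch of that input representation in Python/NumPy: the input to the first transformer layer is just the element-wise sum of a token embedding, an absolute position embedding, and a segment (sentence A/B) embedding. The sizes and token IDs below are illustrative.

```python
import numpy as np

vocab_size, max_positions, num_segments, hidden = 30000, 512, 2, 768

# Three learned lookup tables (randomly initialized here just for the sketch).
token_emb = np.random.randn(vocab_size, hidden) * 0.02
position_emb = np.random.randn(max_positions, hidden) * 0.02
segment_emb = np.random.randn(num_segments, hidden) * 0.02

def embed(token_ids, segment_ids):
    """Sum the three embeddings for each position; this is what the first layer sees."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + position_emb[positions] + segment_emb[segment_ids]

# e.g. "[CLS] my dog [SEP] he barks [SEP]" with sentence A = segment 0, sentence B = segment 1
token_ids = np.array([1, 52, 310, 2, 87, 944, 2])   # made-up word-piece IDs
segment_ids = np.array([0, 0, 0, 0, 1, 1, 1])
x = embed(token_ids, segment_ids)                   # shape (7, 768)
```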

0:17:58 - 0:18:03     Text: So we trained on about a 3 billion word corpus, which was large at the time; now it's not

0:18:03 - 0:18:07     Text: actually even that big compared to what people are training on.

0:18:07 - 0:18:12     Text: We used a batch size which was also large.

0:18:12 - 0:18:14     Text: We trained for about 40 epochs of the data.

0:18:14 - 0:18:18     Text: We trained these two models which are still relatively large.

0:18:18 - 0:18:22     Text: So one of them is 12 layers with 768 hidden size, and then the other one is 24 layers with 1024.

0:18:22 - 0:18:26     Text: So at the time this is basically like one of the largest models that had been trained

0:18:26 - 0:18:32     Text: although now people are training models that I think are 30 times or more bigger than this

0:18:32 - 0:18:33     Text: in the more recent papers.

0:18:33 - 0:18:39     Text: So things have kind of exploded in terms of compute in the last, I don't know, about three years.

0:18:39 - 0:18:42     Text: But yeah.

0:18:42 - 0:18:47     Text: So the fine tuning procedure is, it's pretty straightforward right?

0:18:47 - 0:18:50     Text: So we pre-trained this model for these two tasks.

0:18:50 - 0:18:54     Text: And so now we have an input sequence which is multiple sentences with different type

0:18:54 - 0:18:55     Text: embeddings.

0:18:55 - 0:19:00     Text: We feed them through our transformer model.

0:19:00 - 0:19:03     Text: And now we have the special embedding which I think I didn't mention.

0:19:03 - 0:19:08     Text: So this special [CLS] embedding is basically learned to predict the next sentence

0:19:08 - 0:19:14     Text: prediction task, and then this is also used for classification tasks.

0:19:14 - 0:19:16     Text: But it's not just that we're using the embedding right?

0:19:16 - 0:19:19     Text: We're fine-tuning the entire model, right?

0:19:19 - 0:19:22     Text: So it's really not that this embedding is intrinsically useful or that the word embedding

0:19:22 - 0:19:23     Text: is intrinsically useful.

0:19:23 - 0:19:29     Text: It's that the weights inside the entire 12 or 24 layer model are useful.

0:19:29 - 0:19:33     Text: And so by fine tuning the entire model you can kind of pick out the salient parts that

0:19:33 - 0:19:37     Text: are important for some downstream task.

0:19:37 - 0:19:42     Text: And so this is the kind of task-specific fine-tuning.

0:19:42 - 0:19:49     Text: So if we have a single classification task like let's say sentiment analysis where you

0:19:49 - 0:19:52     Text: say is this a positive or negative review?

0:19:52 - 0:19:54     Text: We encode our sentence with the BERT model.

0:19:54 - 0:19:58     Text: And the only parameters that we add are this final output matrix right?

0:19:58 - 0:20:02     Text: So maybe if we have three, like say positive, negative or neutral, this might be a thousand

0:20:02 - 0:20:03     Text: times three right?

0:20:03 - 0:20:07     Text: So it's just 3,000 parameters versus 300 million.

0:20:07 - 0:20:11     Text: So 3,000 new parameters and 300 million old parameters.

0:20:11 - 0:20:18     Text: And we jointly train all 300 million plus 3,000 for this downstream task.

0:20:18 - 0:20:22     Text: But because the vast majority of them are pre-trained, we can kind of adapt to it in only like a

0:20:22 - 0:20:24     Text: few thousand labeled examples.
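
To make that parameter count concrete, here is a sketch of the classification head: the only new weights are one hidden-size-by-num-labels matrix (plus a bias) applied to the [CLS] vector, and during fine-tuning gradients also flow into all of the pre-trained weights. Shapes are illustrative.

```python
import numpy as np

hidden, num_labels = 1024, 3              # e.g. positive / negative / neutral

# The only newly initialized parameters: ~1024 * 3 = 3,072 weights plus 3 biases,
# versus roughly 300 million pre-trained weights in the encoder itself.
W_out = np.zeros((hidden, num_labels))
b_out = np.zeros(num_labels)

def classify(cls_vector):
    """cls_vector: the final-layer representation of the [CLS] token, shape (hidden,).
    Fine-tuning trains W_out, b_out AND every pre-trained encoder weight jointly."""
    logits = cls_vector @ W_out + b_out
    return int(np.argmax(logits))
```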

0:20:24 - 0:20:29     Text: And similarly for a sentence pair classification task, we just concatenate the two sentences

0:20:29 - 0:20:30     Text: with different type embeddings.

0:20:30 - 0:20:33     Text: So if you want to say does this sentence entail this other sentence, you put in sentence

0:20:33 - 0:20:39     Text: A, concatenate sentence B, and then also predict from this token and fine-tune

0:20:39 - 0:20:40     Text: the entire thing.

0:20:40 - 0:20:43     Text: Similarly, very few additional parameters.

0:20:43 - 0:20:49     Text: For span prediction tasks, you just have kind of a start-of-span and end-of-span prediction.

0:20:49 - 0:20:52     Text: So you're only adding a few thousand new parameters.
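
And the span prediction head is just as small: two new vectors, one scoring each token as a possible start of the answer span and one as the end. This is a simplified sketch that just takes the argmax of each score, ignoring the constraint that the end should come after the start.

```python
import numpy as np

hidden = 768
w_start = np.random.randn(hidden) * 0.02   # new parameters: one start-scoring vector...
w_end = np.random.randn(hidden) * 0.02     # ...and one end-scoring vector

def predict_span(token_vectors):
    """token_vectors: (seq_len, hidden) final-layer outputs for the passage tokens."""
    start_logits = token_vectors @ w_start
    end_logits = token_vectors @ w_end
    return int(np.argmax(start_logits)), int(np.argmax(end_logits))
```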

0:20:52 - 0:21:01     Text: And then for tagging tasks, like part of speech tagging, you just have a single sentence.

0:21:01 - 0:21:05     Text: At every single token, or maybe every token except for the continuation word pieces,

0:21:05 - 0:21:08     Text: which is kind of free post-processing,

0:21:08 - 0:21:10     Text: you predict what's the part of speech of this token.

0:21:10 - 0:21:19     Text: And so this is really why, so like, BERT itself is really a kind of, I would say an incremental

0:21:19 - 0:21:20     Text: improvement over what already existed.

0:21:20 - 0:21:28     Text: So it kind of took transformers, ELMo, GPT, really these three ideas and kind of made

0:21:28 - 0:21:33     Text: a pretty simple change on top of them.

0:21:33 - 0:21:37     Text: But the reason why it had such big impact is not just the numbers that I'll show in

0:21:37 - 0:21:38     Text: a few slides.

0:21:38 - 0:21:40     Text: It's really this thing.

0:21:40 - 0:21:48     Text: Because with ELMo, there was really no fundamental difference, it was just contextual

0:21:48 - 0:21:49     Text: embeddings.

0:21:49 - 0:21:54     Text: So, like, a lot of deep learning historically has been about the fun of building new

0:21:54 - 0:21:55     Text: models, right?

0:21:55 - 0:21:58     Text: So you have all of these components, kind of like Lego blocks, right?

0:21:58 - 0:22:04     Text: You have attention layers, feed forward layers, layer normalization, LSTMs, et cetera.

0:22:04 - 0:22:10     Text: And you can just kind of like figure out, say, okay, for this new task, how do I glue

0:22:10 - 0:22:12     Text: these together in a way that's best, right?

0:22:12 - 0:22:19     Text: And so with ELMo, it didn't really change anything fundamentally.

0:22:19 - 0:22:22     Text: You just fed it into your existing model and you got

0:22:22 - 0:22:23     Text: state of the art.

0:22:23 - 0:22:29     Text: For GPT-1, most of these things didn't really work, right?

0:22:29 - 0:22:31     Text: Because it was a left to right language model.

0:22:31 - 0:22:35     Text: And so you could just kind of take the last token and then predict a classification task,

0:22:35 - 0:22:39     Text: but it didn't really make any sense to predict, like, part of speech tags.

0:22:39 - 0:22:41     Text: Because for the first word, there's no context.

0:22:41 - 0:22:45     Text: So it makes no sense to predict the word with no context.

0:22:45 - 0:22:51     Text: With BERT, the reason why it had such high impact was because it kind of simplified things.

0:22:51 - 0:22:56     Text: And so that's not, I'm not saying that's necessarily a good thing, because as a researcher,

0:22:56 - 0:22:57     Text: or a bad thing.

0:22:57 - 0:23:01     Text: So as a researcher, kind of, ironically, the ultimate goal of research is often like research

0:23:01 - 0:23:03     Text: yourself out of a job, right?

0:23:03 - 0:23:07     Text: It's like, you know, and I'm not saying BERT had anywhere near this impact,

0:23:07 - 0:23:11     Text: but if a physicist came up with like a grand theory of physics,

0:23:11 - 0:23:15     Text: that would be like the greatest moment in physics, but also that would kind of eliminate

0:23:15 - 0:23:16     Text: a lot of research, right?

0:23:16 - 0:23:20     Text: And so that's kind of like the end goal of research is kind of solve the problem, right?

0:23:20 - 0:23:30     Text: So BERT kind of took a step where now, for all of these different classes of problems,

0:23:30 - 0:23:35     Text: it kind of killed a lot of the need to do model architecture design,

0:23:35 - 0:23:39     Text: which is kind of unfortunate because that's like really fun.

0:23:39 - 0:23:41     Text: And so that's kind of the impact.

0:23:41 - 0:23:45     Text: And I'm not going to say whether it's like a good or a bad impact, it's kind of like the

0:23:45 - 0:23:46     Text: objective impact.

0:23:46 - 0:23:52     Text: So why it's had so much impact is because it has kind of had this effect on now so many

0:23:52 - 0:23:56     Text: things that used to be about designing fun models, now you just fit it in and

0:23:56 - 0:24:01     Text: use one of these four recipes and it kind of just works for all of these different tasks.

0:24:01 - 0:24:05     Text: So in terms of actual empirical results, these are from the time the paper was published,

0:24:05 - 0:24:09     Text: of course, things have gotten better since then.

0:24:09 - 0:24:14     Text: So this GLUE benchmark is a set of tasks that are all kind of similar in that they're all sentence

0:24:14 - 0:24:16     Text: pair or sentence classification tasks.

0:24:16 - 0:24:21     Text: So like MultiNLI would be something like, the premise is hills and mountains are especially sanctified

0:24:21 - 0:24:22     Text: in Jainism.

0:24:22 - 0:24:26     Text: And then the hypothesis is Jainism hates nature; that's a contradiction, right?

0:24:26 - 0:24:30     Text: So in order for an NLP model to be able to understand this or to be able to answer this

0:24:30 - 0:24:35     Text: correctly and give this label of contradiction, it needs to know that hills and mountains

0:24:35 - 0:24:41     Text: are part of nature, sanctified is a good thing and that hating something is a bad thing

0:24:41 - 0:24:43     Text: and be able to do all of this reasoning, right?

0:24:43 - 0:24:45     Text: So it's pretty complicated reasoning.

0:24:45 - 0:24:48     Text: Similarly for CoLA, you have to be able to say, like, the wagon rumbled down the road

0:24:48 - 0:24:50     Text: versus the car honked down the road.

0:24:50 - 0:24:57     Text: And so these things are, you know, one of them to a native English speaker sounds totally

0:24:57 - 0:24:59     Text: fine, the other one sounds weird.

0:24:59 - 0:25:01     Text: And similarly, in neither case do you have very much data, right?

0:25:01 - 0:25:07     Text: So you have to be able to generalize on only like a few thousand examples.

0:25:07 - 0:25:14     Text: So BERT-base, which is the same size as the OpenAI model, significantly beat OpenAI GPT, which

0:25:14 - 0:25:16     Text: was the previous state of the art.

0:25:16 - 0:25:22     Text: And then BERT-large, which was bigger, of course got better results; what's more

0:25:22 - 0:25:26     Text: surprising is that it got better results across the board, including on the very, very tiny

0:25:26 - 0:25:28     Text: datasets that only have a few thousand examples.

0:25:28 - 0:25:32     Text: That's kind of the more interesting result rather than just the fact that it got better

0:25:32 - 0:25:33     Text: results.

0:25:33 - 0:25:39     Text: Historically, there were rules of thumb about, if you have some number of examples, how

0:25:39 - 0:25:42     Text: do you design the model size that's optimal for that?

0:25:42 - 0:25:45     Text: And so if you don't do pre-training, like if you keep making the model bigger without

0:25:45 - 0:25:49     Text: pre-training, eventually you'll get worse results because your model overfitting your

0:25:49 - 0:25:51     Text: training data.

0:25:51 - 0:25:54     Text: With pre-training, you basically only ever do like one pass over the data anyways.

0:25:54 - 0:25:58     Text: So there seems to be almost no limit to how big you can make it and still get good results

0:25:58 - 0:26:00     Text: even with a tiny amount of fine-tuning data.

0:26:00 - 0:26:03     Text: And that's really like one of the big takeaways.

0:26:03 - 0:26:12     Text: So, yeah, the reason why these numbers, these rankings, are lower

0:26:12 - 0:26:19     Text: is because I took the screenshot significantly after publication, when

0:26:19 - 0:26:22     Text: a bunch of other people had submitted systems.

0:26:22 - 0:26:26     Text: But so this is a question answering dataset.

0:26:26 - 0:26:28     Text: So it'd be like, what action did the US take to start this second oil shock?

0:26:28 - 0:26:30     Text: So in this case, there's no answer, right?

0:26:30 - 0:26:33     Text: So it's something where you have to be able to predict that there's no answer in this passage.

0:26:33 - 0:26:36     Text: You have to be able to predict the answer or say there's no answer.

0:26:36 - 0:26:42     Text: So BERT beat the previous state of the art, at the time that it was submitted, by about

0:26:42 - 0:26:44     Text: six points, which was a pretty big gain.

0:26:44 - 0:26:47     Text: Now this has kind of gone past human level.

0:26:47 - 0:26:50     Text: But at the time, yeah, it was a large gain.

0:26:50 - 0:26:54     Text: So I'll kind of do some ablation experiments or go through some ablation experiments.

0:26:54 - 0:26:57     Text: So this one, there's four things that I'm comparing here.

0:26:57 - 0:27:01     Text: So these are all BERT-base sized models.

0:27:01 - 0:27:05     Text: So the blue line is kind of the BERT-base model.

0:27:05 - 0:27:09     Text: The red line is where we take out the next sentence prediction.

0:27:09 - 0:27:12     Text: So in our case, even though people have subsequently said that they don't think it's important,

0:27:12 - 0:27:13     Text: in our case, we actually did measure it.

0:27:13 - 0:27:19     Text: And it turns out it seemed like it was important to have this next sentence prediction task,

0:27:19 - 0:27:25     Text: especially for kind of question answering task, which is this one.

0:27:25 - 0:27:28     Text: But it seems to help a little bit, at least in all four of them, to have it.

0:27:28 - 0:27:32     Text: So this kind of does indicate that there's some benefit in having some model that learns

0:27:32 - 0:27:37     Text: a relationship between sentences.

0:27:37 - 0:27:45     Text: So then this one is the one that makes an apples-to-apples comparison between

0:27:45 - 0:27:48     Text: OpenAI's GPT-1 and BERT, right?

0:27:48 - 0:27:50     Text: Because I also made the model bigger, but not for BERT-base.

0:27:50 - 0:27:53     Text: So BERT-base was the exact same size, but it was trained on more data.

0:27:53 - 0:27:58     Text: So to make it a fair comparison, I basically retrained my own implementation of OpenAI's

0:27:58 - 0:28:04     Text: GPT-1, which is the yellow line.

0:28:04 - 0:28:08     Text: And we can see that on some of the tests, it's not that far, although this is actually a

0:28:08 - 0:28:09     Text: pretty big gap.

0:28:09 - 0:28:12     Text: This drop is four points, which is a lot.

0:28:12 - 0:28:16     Text: But on some tasks like SQuAD and MRPC, it was way worse.

0:28:16 - 0:28:24     Text: And so for SQuAD it makes sense, because SQuAD is a span labeling task.

0:28:24 - 0:28:29     Text: And so if you only have left context, then words at the beginning have basically no context.

0:28:29 - 0:28:32     Text: And so you're asking it to do span labeling on words with almost no context.

0:28:32 - 0:28:35     Text: So it really doesn't make any sense.

0:28:35 - 0:28:36     Text: And so of course it's going to do much worse.

0:28:36 - 0:28:40     Text: So then, to make it fair, we also added an LSTM on top of it, which is

0:28:40 - 0:28:42     Text: trained from scratch.

0:28:42 - 0:28:45     Text: And this does help a little bit on some of the tasks, but on other ones it doesn't help.

0:28:45 - 0:28:48     Text: So on SQuAD it helps because now you have bidirectional context.

0:28:48 - 0:28:52     Text: But on MRPC, because it's a very small task, it's only got 3,000 labeled examples, it doesn't

0:28:52 - 0:28:54     Text: help at all.

0:28:54 - 0:29:00     Text: So it does show that kind of the masked language model and the next sentence prediction are both important,

0:29:00 - 0:29:04     Text: especially the masked language model.

0:29:04 - 0:29:11     Text: So the other thing, one of the other ablations, is that when we apply the masked language model,

0:29:11 - 0:29:15     Text: we're only predicting 15% of words in the sentence.

0:29:15 - 0:29:17     Text: So when you do a left to right language model, you're predicting every single word,

0:29:17 - 0:29:20     Text: conditioned on all the words to the left.

0:29:20 - 0:29:24     Text: So one question might be how much does this make it take longer to converge?

0:29:24 - 0:29:27     Text: Even though eventually we know that it converges at a much better point, if you have a limited

0:29:27 - 0:29:30     Text: training budget, is it better to do a left-to-right model?

0:29:30 - 0:29:37     Text: And so we see that when you do this masked language model, at the very,

0:29:37 - 0:29:39     Text: very beginning, because the left-to-right model is doing so many more predictions, it's true

0:29:39 - 0:29:46     Text: that the left-to-right model does do better for, like, epoch one, but

0:29:46 - 0:29:50     Text: then very soon after, because the bidirectionality is so important, it starts to take over.

0:29:50 - 0:29:53     Text: And so it's basically better from almost the start to do bidirectionality.

0:29:53 - 0:29:57     Text: And then it takes slightly longer to converge, but the overall convergence point is, of course, much

0:29:57 - 0:30:01     Text: higher.

0:30:01 - 0:30:10     Text: And then finally for these ablations, we can see that going from a smaller model, which

0:30:10 - 0:30:14     Text: was 100 million to 300 million parameters, helps a lot, which isn't surprising.

0:30:14 - 0:30:20     Text: The more surprising thing is that, well, these curves aren't comparable,

0:30:20 - 0:30:22     Text: you shouldn't compare the curves to each other.

0:30:22 - 0:30:29     Text: The point is to look at each curve as a function of the number of parameters and see that this

0:30:29 - 0:30:35     Text: one only has 3,000 labeled examples, and this one has 4,000 labeled examples.

0:30:35 - 0:30:41     Text: So in both cases, the curves look very similar, which is surprising because the rule of thumb

0:30:41 - 0:30:46     Text: that you're going to overfit your data if you only have a few labeled examples turns

0:30:46 - 0:30:48     Text: out not to really be true anymore.

0:30:48 - 0:30:51     Text: And, you know, these curves keep going up, right?

0:30:51 - 0:30:56     Text: So now with subsequent papers, which we'll talk about, this big one was 300 million

0:30:56 - 0:30:59     Text: parameters, people have gone up to 11 billion parameters and are still seeing similar behaviors.

0:30:59 - 0:31:03     Text: So still seeing the curves go way up and getting better results, which is kind of crazy

0:31:03 - 0:31:09     Text: because now we know that, you know, basically there's almost no limit.

0:31:09 - 0:31:13     Text: So another thing I want to talk about, before I talk about stuff that's happened since

0:31:13 - 0:31:14     Text: BERT,

0:31:14 - 0:31:22     Text: is that even though BERT itself was in some ways very simple, which is, you

0:31:22 - 0:31:26     Text: know, not a bad thing, it was very successful immediately.

0:31:26 - 0:31:29     Text: And you know, part of that is the Google brand and like, you know, it got a cute name

0:31:29 - 0:31:30     Text: and stuff like that.

0:31:30 - 0:31:32     Text: But I think that I spent a lot of time with the open source release, and particularly

0:31:32 - 0:31:36     Text: looking at other open source releases and figuring out what people didn't like about

0:31:36 - 0:31:38     Text: those.

0:31:38 - 0:31:42     Text: And so I think this is important, like when you're, when you're a PhD student or even

0:31:42 - 0:31:45     Text: working in industry, and trying to release something.

0:31:45 - 0:31:50     Text: So I kind of just listed the things here that I thought were important for like why

0:31:50 - 0:31:55     Text: it was successful compared to other things.

0:31:55 - 0:31:58     Text: So I'm not trying to call them out just to be mean, but the

0:31:58 - 0:32:01     Text: OpenAI GPT-1 release was really not very good.

0:32:01 - 0:32:04     Text: And I'm sure that they realized this, because the OpenAI GPT-2 release was very

0:32:04 - 0:32:05     Text: good.

0:32:05 - 0:32:11     Text: And so, yeah, it was very hard to run and there were no comments; the TensorFlow

0:32:11 - 0:32:14     Text: code worked fine, like I replicated it.

0:32:14 - 0:32:17     Text: But the TensorFlow code was very non-idiomatic.

0:32:17 - 0:32:18     Text: It used all sorts of weird stuff.

0:32:18 - 0:32:19     Text: The Python code was weird.

0:32:19 - 0:32:20     Text: There was no comments.

0:32:20 - 0:32:21     Text: There was basically no instructions.

0:32:21 - 0:32:26     Text: And then other code bases also are kind of too big.

0:32:26 - 0:32:29     Text: It's like people just want to like say like, we want to have one unified code base for

0:32:29 - 0:32:32     Text: our entire, you know, language team.

0:32:32 - 0:32:33     Text: And so they just put stuff as part of that.

0:32:33 - 0:32:34     Text: And people don't really like that either.

0:32:34 - 0:32:37     Text: So I was very insistent that we do a minimal release.

0:32:37 - 0:32:39     Text: So like this, we're just going to release Bert.

0:32:39 - 0:32:40     Text: It's not going to be part of anything.

0:32:40 - 0:32:41     Text: There aren't going to be any external dependencies.

0:32:41 - 0:32:44     Text: And it's going to be like very well commented.

0:32:44 - 0:32:48     Text: And it was kind of also easy to drop in just the modeling part and

0:32:48 - 0:32:53     Text: just the tokenization part and just the front end, which runs like the training loop.

0:32:53 - 0:32:55     Text: And we kind of separated all these out.

0:32:55 - 0:33:00     Text: And so I think because of that, people kind of started using it much quicker.

0:33:00 - 0:33:02     Text: And of course, like all the publicity help.

0:33:02 - 0:33:08     Text: But I think that, you know, it could have easily been not as successful if it had been,

0:33:08 - 0:33:09     Text: you know, done in a different way.

0:33:09 - 0:33:12     Text: So it's just kind of advice.

0:33:12 - 0:33:14     Text: So yeah.

0:33:14 - 0:33:20     Text: So now I'm going to talk about five models that have come out since Bert that have all improved

0:33:20 - 0:33:22     Text: on top of Bert in various ways.

0:33:22 - 0:33:23     Text: There's been more than five.

0:33:23 - 0:33:27     Text: But I'm going to highlight these five, they think they're interesting.

0:33:27 - 0:33:31     Text: A lot of them did come from Google, well, a lot of them involved

0:33:31 - 0:33:32     Text: Google.

0:33:32 - 0:33:37     Text: I would say many of them were actually interns at Google from various

0:33:37 - 0:33:41     Text: universities, who were supervised by Google researchers and also used Google compute.

0:33:41 - 0:33:44     Text: I mean, the reason why a lot of them came from Google is because, like, frankly, like,

0:33:44 - 0:33:50     Text: other than Facebook, Google, and Microsoft, there are not really many

0:33:50 - 0:33:54     Text: companies that have the resources to train these huge state-of-the-art models.

0:33:54 - 0:34:01     Text: And so, almost by necessity, it's going to come from one of these labs.

0:34:01 - 0:34:04     Text: So the first one was RoBERTa.

0:34:04 - 0:34:07     Text: And so this is probably the one that had, like, the least kind of new stuff.

0:34:07 - 0:34:11     Text: It was really just, and so this was University of Washington and Facebook.

0:34:11 - 0:34:13     Text: It came out not that long after BERT.

0:34:13 - 0:34:17     Text: And so what they showed was that BERT was really under-trained.

0:34:17 - 0:34:22     Text: And so basically, even on the same amount of data, even though I

0:34:22 - 0:34:27     Text: did 40 epochs on the data, if you do it for, like, 200 epochs, you get even better results,

0:34:27 - 0:34:29     Text: like significantly.

0:34:29 - 0:34:33     Text: So they basically trained more epochs on the same data.

0:34:33 - 0:34:36     Text: And they also showed that more data helps, which is also not super surprising.

0:34:36 - 0:34:40     Text: And they did improve masking and pre-training using a couple of tweaks to that.

0:34:40 - 0:34:44     Text: And they were able to get a state of the art results, which is cool.

0:34:44 - 0:34:49     Text: And so, yeah, but that was a pretty straightforward paper.

0:34:49 - 0:34:55     Text: So the next one is XLNet, which was done by some interns from CMU when they were at Google

0:34:55 - 0:34:56     Text: Brain.

0:34:56 - 0:34:59     Text: And so this actually had some really cool changes.

0:34:59 - 0:35:04     Text: So one of them was, they used this Transformer-XL, which was actually precursor work done

0:35:04 - 0:35:09     Text: by the same people, where they were just doing language modeling tasks, not pre-training.

0:35:09 - 0:35:15     Text: But one of the big innovations of Transformer-XL is this idea of relative position

0:35:15 - 0:35:16     Text: embeddings.

0:35:16 - 0:35:23     Text: And so with absolute position embeddings, the problem is that every word gets, like, this

0:35:23 - 0:35:26     Text: is word four, this is word five, this is word six.

0:35:26 - 0:35:28     Text: And so they are embeddings, so they do generalize.

0:35:28 - 0:35:30     Text: But in practice, there's a quadratic number of relationships.

0:35:30 - 0:35:33     Text: Like, how does word 83 relate to word 76, right?

0:35:33 - 0:35:37     Text: And once you get bigger, like, 500, a thousand,

0:35:37 - 0:35:39     Text: Now you have a thousand squared total relationships.

0:35:39 - 0:35:42     Text: Like, you have to say, how does word 97 relate to whatever, right?

0:35:42 - 0:35:46     Text: And so that's obviously not optimal once you get to a large size.

0:35:46 - 0:35:56     Text: And so with relative position embeddings, you basically can say how much does the word dog

0:35:56 - 0:36:00     Text: attend to the word hot, and how much should the word dog attend to the previous word.

0:36:00 - 0:36:04     Text: And these are linear at first.

0:36:04 - 0:36:07     Text: But then you combine them and you get a nonlinear, contextual representation.

0:36:07 - 0:36:09     Text: And you do this in many, many layers.

0:36:09 - 0:36:12     Text: And this ends up being, so then you say, how much does this contextual

0:36:12 - 0:36:14     Text: representation of dog attend to the previous word.

0:36:14 - 0:36:16     Text: And then you can kind of build this up.

0:36:16 - 0:36:18     Text: And so this generalizes much better for long sequences.

0:36:18 - 0:36:20     Text: So that's a cool innovation.
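
A simplified sketch of the relative position idea: instead of embedding "position 83" and "position 76" separately, add a learned term to the attention score that depends only on the distance i - j. The real Transformer-XL formulation also mixes in content-dependent terms; this sketch only shows the relative-distance lookup, with illustrative sizes.

```python
import numpy as np

max_rel, hidden = 128, 64
# One learned bias per clipped relative distance, from -max_rel to +max_rel.
rel_bias = np.random.randn(2 * max_rel + 1) * 0.02

def attention_scores(q, k):
    """q, k: (seq_len, hidden). The added bias depends only on (i - j), so the model
    learns things like 'attend to the previous word' regardless of absolute index."""
    seq_len = q.shape[0]
    scores = (q @ k.T) / np.sqrt(hidden)
    idx = np.arange(seq_len)
    rel = np.clip(idx[:, None] - idx[None, :], -max_rel, max_rel) + max_rel
    return scores + rel_bias[rel]
```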

0:36:20 - 0:36:25     Text: And then the other one, which is specific to pre-training and not just the model itself,

0:36:25 - 0:36:27     Text: is this idea of permutation language modeling.

0:36:27 - 0:36:28     Text: So this is a little bit hard to explain.

0:36:31 - 0:36:34     Text: I think the paper explained it very formally, I guess.

0:36:34 - 0:36:40     Text: And so, but basically, there's a trick where, in a left-to-right language model,

0:36:40 - 0:36:44     Text: every word you're predicting is based on the words to the left, right?

0:36:44 - 0:36:48     Text: But imagine that instead of predicting in that order, you can basically take any permutation.

0:36:48 - 0:36:52     Text: So it's like, I'm going to predict the first word, then the third word,

0:36:52 - 0:36:55     Text: then the second word, then the fourth word.

0:36:55 - 0:36:58     Text: And so, that's a totally valid way.

0:36:58 - 0:37:00     Text: And you still get a well-formed probability distribution, because it's still predicting

0:37:00 - 0:37:04     Text: one word at a time, given some permutation of the input.

0:37:04 - 0:37:07     Text: And with transformers and with attention, you can actually do this very efficiently,

0:37:07 - 0:37:09     Text: just by masking out your attention probabilities.

0:37:09 - 0:37:15     Text: And so, every single sentence you have, you can kind of sample a single permutation of this.

0:37:15 - 0:37:21     Text: And you can now, you can effectively train a bi-directional model, because this word,

0:37:21 - 0:37:23     Text: it won't be conditioned on every, still on average,

0:37:23 - 0:37:25     Text: every word will only be conditioned on half the words.

0:37:25 - 0:37:30     Text: But this word will be conditioned on, you know, all these words to the left and all these words to the right.

0:37:30 - 0:37:32     Text: And maybe it'll be missing these words, but that's fine.

0:37:32 - 0:37:35     Text: And so, you get much better sample efficiency.

0:37:35 - 0:37:38     Text: So I thought this was a really clever idea.

0:37:38 - 0:37:41     Text: And so this was kind of the main innovation of XLNet.

0:37:41 - 0:37:47     Text: And so, yeah, they basically get better sample efficiency, because they're able to

0:37:47 - 0:37:50     Text: do this random permutation and kind of take advantage of this.

0:37:50 - 0:37:54     Text: So this wouldn't work with LSTMs, because of the ordering, but

0:37:54 - 0:37:57     Text: because of the way that masking is done in transformers, it's just,

0:37:59 - 0:38:01     Text: it's just a mask on the attention.

0:38:01 - 0:38:03     Text: So it actually ends up working very well.
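
A small sketch of just the masking part, under the assumption that each sampled factorization order is turned into a boolean attention mask where a position may only attend to the positions predicted before it (XLNet's actual two-stream attention is more involved); the helper name is made up.

    import torch

    def permutation_attention_mask(perm):
        # perm[t] = the position predicted at step t of the sampled factorization order.
        # Position perm[t] may attend only to the positions predicted before it, perm[0..t-1].
        L = len(perm)
        allowed = torch.zeros(L, L, dtype=torch.bool)
        for t in range(1, L):
            allowed[perm[t], perm[:t]] = True
        return allowed   # allowed[i, j] is True if position i may attend to position j

    # e.g. a 4-token sentence with sampled order: predict word 0, then 2, then 1, then 3
    mask = permutation_attention_mask(torch.tensor([0, 2, 1, 3]))
    print(mask[1])   # tensor([ True, False,  True, False]) -- word 1 sees a word on each side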

0:38:03 - 0:38:10     Text: And so, yeah, the numbers actually ended up being pretty similar.

0:38:12 - 0:38:16     Text: But a lot of these things are hard to compare, because people change the data set and

0:38:16 - 0:38:18     Text: change the size of the model.

0:38:18 - 0:38:20     Text: So it's hard to compare apples to apples.

0:38:20 - 0:38:26     Text: But these two techniques ended up being pretty similar, but I think XLNet had more innovations in terms of technique.

0:38:26 - 0:38:33     Text: So ALBERT, which is called A Lite BERT for Self-supervised Learning of Language Representations.

0:38:33 - 0:38:37     Text: And so, this also had a couple of cool innovations.

0:38:37 - 0:38:41     Text: And so the idea here is really massive parameter sharing, with the idea being that,

0:38:41 - 0:38:44     Text: if you share parameters, you're not going to get a better language model, but

0:38:44 - 0:38:46     Text: you're going to get better sample efficiency.

0:38:46 - 0:38:48     Text: You're going to get less overfitting when you fine tune, right?

0:38:48 - 0:38:51     Text: Because if you have a billion parameters and you fine-tune them

0:38:51 - 0:38:55     Text: on a data set with like a thousand labeled examples, you're still going to overfit very quickly, right?

0:38:55 - 0:38:59     Text: But if you have a much smaller number of parameters, you're going to get less overfitting.

0:38:59 - 0:39:02     Text: So if you get a similarly powerful model with fewer parameters, you're going to get less overfitting.

0:39:03 - 0:39:07     Text: And so there's two major innovations.

0:39:07 - 0:39:09     Text: So, the word embedding table is big, right?

0:39:09 - 0:39:15     Text: Because it's the size of your vocabulary, the number of word pieces, times the hidden size.

0:39:15 - 0:39:18     Text: And so it's going to be much bigger than the hidden layer.

0:39:18 - 0:39:21     Text: So first thing is that they use the factorized embedding table.

0:39:21 - 0:39:28     Text: So if they had a hidden size of a thousand, they only use like 128 dimensional input embedding.

0:39:28 - 0:39:33     Text: And then they projected that to a thousand using a matrix.

0:39:33 - 0:39:40     Text: And so instead of having a 100,000 by 1024 embedding matrix, they would have a 100,000 by 128 matrix plus a 128 by 1024 matrix,

0:39:40 - 0:39:43     Text: and you multiply these two matrices together.

0:39:43 - 0:39:48     Text: And then effectively you have a 100,000 by 1024 embedding matrix.

0:39:48 - 0:39:50     Text: But you have much fewer parameters.

0:39:50 - 0:39:51     Text: So you're doing parameter tying.

0:39:51 - 0:39:55     Text: Well, not exactly; this isn't parameter tying, but you're doing parameter reduction in a clever way.
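
A sketch of that factorization with the sizes used in the example above (100,000 word pieces, 128-dimensional input embedding, 1024 hidden size), just to show the parameter arithmetic; the module names are illustrative.

    import torch.nn as nn

    VOCAB, E, H = 100_000, 128, 1024     # illustrative sizes from the example above

    full = nn.Embedding(VOCAB, H)        # unfactorized: 100,000 * 1024 = 102,400,000 params

    small = nn.Embedding(VOCAB, E)       # factorized:   100,000 * 128  =  12,800,000 params
    proj = nn.Linear(E, H, bias=False)   #             +     128 * 1024 =     131,072 params

    def n_params(*modules):
        return sum(p.numel() for m in modules for p in m.parameters())

    print(n_params(full))                # 102,400,000
    print(n_params(small, proj))         #  12,931,072 -- roughly 8x fewer embedding parameters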

0:39:56 - 0:39:58     Text: The other one is cross layer parameter sharing.

0:39:58 - 0:39:59     Text: So this is similar.

0:39:59 - 0:40:06     Text: It's simple, and it's been done in previous papers, especially the Universal Transformer.

0:40:06 - 0:40:10     Text: And the idea is that you run a bunch of transformer layers, but,

0:40:10 - 0:40:13     Text: let's say if you have 12 layers, all 12 layers just share the same parameters, right?

0:40:13 - 0:40:19     Text: And so that ends up, so now you can have a much bigger model

0:40:19 - 0:40:22     Text: that has fewer parameters than BERT has.

0:40:22 - 0:40:24     Text: And so you get less overfitting.
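
A minimal sketch of cross-layer sharing, using PyTorch's built-in encoder layer as a stand-in for the real architecture: one set of layer parameters is applied 12 times, so the model has the depth of 12 layers but the parameter count of one.

    import torch
    import torch.nn as nn

    # One encoder layer, reused at every depth (as in the Universal Transformer and ALBERT).
    shared_layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True)

    def shared_encoder(x, num_layers=12):
        for _ in range(num_layers):
            x = shared_layer(x)          # the same weights at every layer
        return x

    x = torch.randn(2, 32, 1024)         # (batch, sequence length, hidden size)
    out = shared_encoder(x)              # 12 layers of compute, 1 layer of parameters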

0:40:24 - 0:40:30     Text: And so they got state of the art compared to XLNet and RoBERTa.

0:40:30 - 0:40:33     Text: But one important thing to keep in mind is that Albert is light in terms of parameters,

0:40:33 - 0:40:35     Text: not in terms of speed.

0:40:35 - 0:40:47     Text: So for the model that's actually comparable to BERT,

0:40:47 - 0:40:53     Text: they actually did about the same, like, this model and this model were about the same,

0:40:53 - 0:40:55     Text: but this one was actually slower.

0:40:55 - 0:41:01     Text: So it's only when they started making models that were much bigger in terms of compute than

0:41:01 - 0:41:05     Text: BERT, but doing more parameter tying, that they started getting good results.

0:41:05 - 0:41:11     Text: And so the implication of this is that you can reduce the number of parameters,

0:41:11 - 0:41:16     Text: but still nobody's figured out how to reduce the amount of pre-training compute

0:41:16 - 0:41:19     Text: that's required, which is, you know, kind of unfortunate.

0:41:19 - 0:41:25     Text: So the next one is T5, which is Exploring the Limits of Transfer Learning with a Unified

0:41:25 - 0:41:26     Text: Text-to-Text Transformer.

0:41:26 - 0:41:33     Text: So this was a paper by Google Brain and other groups in Google where they used just,

0:41:33 - 0:41:38     Text: they used a lot of compute and they did tons of ablation on pre-training.

0:41:38 - 0:41:41     Text: They didn't, like, their goal wasn't to come up with some, with some super clever new

0:41:41 - 0:41:42     Text: pre-training technique.

0:41:42 - 0:41:46     Text: Right, it was really just to carefully ablate every aspect: how much does model size matter,

0:41:46 - 0:41:49     Text: how much does training data matter, how much does cleanness of the data matter, like, how much

0:41:49 - 0:41:53     Text: does the exact way that you do the pre-training objective matter, like doing the masking,

0:41:53 - 0:41:54     Text: like, how many spans do you mask?

0:41:54 - 0:42:00     Text: And so they wanted to very carefully ablate all of this, and they also wanted to push the limits

0:42:00 - 0:42:04     Text: of size and say, what happens if we have 300 million, a billion, 10 billion parameters,

0:42:04 - 0:42:06     Text: right?

0:42:06 - 0:42:12     Text: And so they did tons and tons of ablations and they got state of the art on everything,

0:42:12 - 0:42:14     Text: and they're still state of the art on basically everything.

0:42:14 - 0:42:22     Text: And the results, though, are a little bit bleak in the sense that nothing really mattered

0:42:22 - 0:42:27     Text: except scaling up. Like, across all of the ablations, it wasn't like, oh, you know,

0:42:27 - 0:42:31     Text: BERT did everything perfectly; it was that it doesn't matter, like, you could mask 20%,

0:42:31 - 0:42:35     Text: mask 25%, you can do this fine-tuning recipe or that fine-tuning recipe, it's like, all that

0:42:35 - 0:42:40     Text: really matters is making the model bigger and training it on more data, and cleaner data.

0:42:40 - 0:42:48     Text: And so, yeah, it's a little bit of a bleak paper if you are hoping that there exists

0:42:48 - 0:42:53     Text: some pre-training technique which is super computationally efficient and also can get,

0:42:53 - 0:42:56     Text: you know, very impressive results, which I'm not saying there isn't, but, like, most

0:42:56 - 0:42:58     Text: of this evidence points in that direction.

0:42:58 - 0:43:04     Text: So the one kind of newest paper that is maybe the most positive in this direction is this

0:43:04 - 0:43:06     Text: paper called Electra.

0:43:06 - 0:43:14     Text: And so this was done by Kevin Clark from here, and Google Brain.

0:43:14 - 0:43:17     Text: And so, yeah, in this one, it's a pretty clever idea.

0:43:17 - 0:43:23     Text: So basically, the idea is instead of training the model to generate the output,

0:43:23 - 0:43:25     Text: you just train it as a discriminator.

0:43:25 - 0:43:30     Text: And so, you do some masking, you have a small

0:43:30 - 0:43:33     Text: language model which replaces the masked words, and then you train the main model to discriminate whether each word is

0:43:33 - 0:43:37     Text: the original one or not.

0:43:37 - 0:43:40     Text: And so, the idea here is that you're getting better

0:43:40 - 0:43:46     Text: sample efficiency for pre-training because you're making a prediction at every word, which

0:43:46 - 0:43:50     Text: is actually, I mean, I don't know exactly why it would be that different from

0:43:50 - 0:43:55     Text: BERT, because BERT also doesn't replace with the mask token

0:43:55 - 0:43:56     Text: everywhere, it also randomly corrupts words.

0:43:56 - 0:44:01     Text: But the biggest difference is that these are contextual

0:44:01 - 0:44:05     Text: replacements. So, it's like, when we did random masking in BERT and replaced with a random

0:44:05 - 0:44:06     Text: word, it was truly a random word.

0:44:06 - 0:44:09     Text: So, most of the time it was completely trivial to tell that this was not the right word.

0:44:09 - 0:44:12     Text: You didn't necessarily know which word it should be replaced with, but in this case, they actually

0:44:12 - 0:44:17     Text: used an intentionally weak but still non-trivial language model to predict which word to substitute.

0:44:17 - 0:44:21     Text: So, like, this locally makes sense, the chef ate the meal, but it doesn't make any sense,

0:44:21 - 0:44:23     Text: like, a very strong model will not predict this, right?

0:44:23 - 0:44:27     Text: So that's the idea: you use a weak model to do the substitutions,

0:44:27 - 0:44:29     Text: and then you train a strong model to detect them.
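
A sketch of how the training targets get built in this replaced-token-detection setup, assuming you already have token ids sampled from a small generator at the masked positions; the function and variable names are made up. Note that when the generator happens to guess the original token, the label stays "original", which is how the objective is defined.

    import torch

    def electra_targets(original_ids, masked_positions, generator_sample):
        # original_ids: [L] real token ids
        # masked_positions: [L] bool, which positions were masked out
        # generator_sample: [L] token ids sampled from a small generator MLM (used at masked positions)
        corrupted = torch.where(masked_positions, generator_sample, original_ids)
        # Discriminator label for EVERY position: 1 if the token differs from the original, else 0.
        is_replaced = (corrupted != original_ids).float()
        return corrupted, is_replaced

    orig = torch.tensor([11, 12, 13, 14])
    masked = torch.tensor([False, True, False, True])
    sample = torch.tensor([0, 12, 0, 99])        # generator happened to guess token 12 correctly
    corrupted, labels = electra_targets(orig, masked, sample)
    print(corrupted)   # tensor([11, 12, 13, 99])
    print(labels)      # tensor([0., 0., 0., 1.]) -- correct guesses count as original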

0:44:29 - 0:44:36     Text: So, these results, I guess it's a big table, but these results are certainly

0:44:36 - 0:44:44     Text: positive with regard to previous results in terms of compute versus accuracy. So, like,

0:44:44 - 0:44:51     Text: if we compare this row, which is one tenth of the compute of BERT large,

0:44:51 - 0:44:53     Text: to BERT base, which is also one tenth of the compute of BERT large, it certainly does

0:44:53 - 0:44:55     Text: a lot better than BERT base.

0:44:55 - 0:45:05     Text: But in terms of state of the art models, when

0:45:05 - 0:45:11     Text: they do, you know, the same amount of compute as BERT large, which is this one,

0:45:11 - 0:45:13     Text: compared to other state of the art models, they're not; in order to get state of

0:45:13 - 0:45:16     Text: the art, or to get similar to state of the art, they basically need to do as much compute

0:45:16 - 0:45:19     Text: as state of the art, so, like, 4x, 5.4x.

0:45:19 - 0:45:23     Text: So, I mean, at scaled-down sizes, they were able to do better, but this is still a

0:45:23 - 0:45:25     Text: pretty big gap, like, 4 points.

0:45:25 - 0:45:29     Text: So, it's positive, but it's certainly not the silver bullet

0:45:29 - 0:45:37     Text: in terms of showing that we can, you know, pre-train models

0:45:37 - 0:45:39     Text: much more cheaply.

0:45:39 - 0:45:44     Text: So, the last thing I want to talk about is how we actually serve these models,

0:45:44 - 0:45:45     Text: right?

0:45:45 - 0:45:50     Text: Because, you know, I've said that, like, they're incredibly expensive to train, and nobody

0:45:50 - 0:45:54     Text: has been able to figure out how to make that faster, but, you know, they're being used all

0:45:54 - 0:45:55     Text: over the place, right?

0:45:55 - 0:46:00     Text: So, like, uh, you know, there's news stories, Google has improved 10% of searches by

0:46:00 - 0:46:03     Text: understanding language a little bit better, and Bing says it's been applying BERT since

0:46:03 - 0:46:06     Text: April. So this is live in Google Search and Bing Search, and these

0:46:06 - 0:46:10     Text: are, like, really low-latency services, right, that have, like, a few milliseconds of

0:46:10 - 0:46:17     Text: latency, and they serve, you know, billions of queries a day. So how are

0:46:17 - 0:46:22     Text: they doing this? Is it just, like, that, you know, Google and Microsoft are spending

0:46:22 - 0:46:23     Text: billions of dollars on hardware.

0:46:23 - 0:46:25     Text: Which they are, but not just for this, right?

0:46:25 - 0:46:30     Text: And, like, it would cost billions of dollars just to serve this

0:46:30 - 0:46:32     Text: if we were actually serving BERT.

0:46:32 - 0:46:36     Text: But instead, we're using model distillation, right?

0:46:36 - 0:46:38     Text: So, this has been around for a while.

0:46:38 - 0:46:42     Text: So, it's, you know, called distillation or model compression.

0:46:42 - 0:46:47     Text: One of the first papers was the model compression paper, that was

0:46:47 - 0:46:52     Text: done for, I forget exactly what task, and then Hinton's paper, Distilling

0:46:52 - 0:46:55     Text: the Knowledge in a Neural Network, is a more well-known version of, or not version, but

0:46:55 - 0:46:59     Text: a more well-known paper on distillation.

0:46:59 - 0:47:02     Text: But in reality, the version that we use at Google and the version that

0:47:02 - 0:47:08     Text: most people use when they say model distillation for pre-trained language models, it's

0:47:08 - 0:47:12     Text: a very simple technique, but it's easy to misinterpret what we mean.

0:47:12 - 0:47:17     Text: So, what we do is we pre-train, we train a state of the art model, whichever is the

0:47:17 - 0:47:19     Text: one we can most afford to train, right?

0:47:19 - 0:47:22     Text: Because, of course, we can just make it bigger, but we, we set some budget of, you know,

0:47:22 - 0:47:25     Text: we want to train it for a day on some number of TPUs.

0:47:25 - 0:47:27     Text: And then, we fine-tune it, right?

0:47:27 - 0:47:29     Text: So we get a model that's the maximum accuracy, and that's our teacher model, and this is

0:47:29 - 0:47:30     Text: expensive.

0:47:30 - 0:47:34     Text: Then we have a large amount of unlabeled input, which, for most industry

0:47:34 - 0:47:38     Text: applications, you have, because, you know, in search, you have,

0:47:38 - 0:47:41     Text: this is what people searched for, this is what they clicked on, that's what goes

0:47:41 - 0:47:42     Text: into the training data.

0:47:42 - 0:47:48     Text: And so, you can then just take these, and then you just label your examples

0:47:48 - 0:47:49     Text: with them.

0:47:49 - 0:47:52     Text: So you can get billions of these, uh, if you actually want a real service.

0:47:52 - 0:47:58     Text: And so then you run these, you know, query-answer pairs through your

0:47:58 - 0:48:03     Text: teacher, and you get a pseudo label, and you're just training a much smaller model, much meaning

0:48:03 - 0:48:08     Text: like 50 times, 100 times smaller, to predict your teacher's outputs.

0:48:08 - 0:48:11     Text: And so, and you can generally do this for most techniques.

0:48:11 - 0:48:17     Text: I mean, for most tasks, you can do this, uh, pretty easily, and get a huge 50-200X, uh,

0:48:17 - 0:48:19     Text: compression with no degradation.
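
A sketch of that task-specific distillation loop, under the assumption that teacher and student are any classifiers mapping a batch of inputs to logits and that unlabeled_loader yields batches of unlabeled inputs; the teacher's hard pseudo-labels are used as the training targets, and all the names here are illustrative.

    import torch
    import torch.nn.functional as F

    def distill(teacher, student, unlabeled_loader, optimizer, epochs=1):
        # Task-specific distillation: a fine-tuned teacher pseudo-labels a large
        # unlabeled corpus, and a much smaller student is trained to match it.
        teacher.eval()
        student.train()
        for _ in range(epochs):
            for batch in unlabeled_loader:
                with torch.no_grad():
                    pseudo = teacher(batch).argmax(dim=-1)       # teacher's pseudo label
                loss = F.cross_entropy(student(batch), pseudo)   # student mimics the teacher
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return student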

0:48:19 - 0:48:23     Text: But the important thing to realize is that we're not compressing the pre-trained model itself.

0:48:23 - 0:48:25     Text: We haven't really had any luck doing that.

0:48:25 - 0:48:28     Text: So like, you can't actually just take BERT, and then compress it to a smaller model, which

0:48:28 - 0:48:30     Text: you can then fine-tune for all these other tasks.

0:48:30 - 0:48:34     Text: It's only after you've chosen the task, and after you fine-tune it for the task,

0:48:34 - 0:48:37     Text: that we were able to do it.

0:48:37 - 0:48:43     Text: So to show some specific results, let's say we have a BERT large

0:48:43 - 0:48:44     Text: teacher.

0:48:44 - 0:48:45     Text: So this is an Amazon book review task.

0:48:45 - 0:48:48     Text: So this is a paper, I've got to cite it, but this is a paper that my group

0:48:48 - 0:48:51     Text: published.

0:48:51 - 0:48:59     Text: And so, um, this has 50,000 labeled examples and 8 million, uh, unlabeled examples.

0:48:59 - 0:49:03     Text: So you take the pre-trained

0:49:03 - 0:49:08     Text: BERT large, you fine-tune it on these 50,000 labels, and you get this 88% accuracy,

0:49:08 - 0:49:09     Text: right?

0:49:09 - 0:49:15     Text: But now, let's say instead of using BERT large, you used a much smaller

0:49:15 - 0:49:16     Text: version.

0:49:16 - 0:49:20     Text: So, according to the sizes here, this one's, you know, a sixteenth of the size, whatever,

0:49:20 - 0:49:22     Text: this one's a hundredth of the size, right?

0:49:22 - 0:49:26     Text: So this row that's a hundredth of the size, if you

0:49:26 - 0:49:32     Text: were to pre-train this on the same Wikipedia and books data, just like BERT, and then fine-tune it,

0:49:32 - 0:49:37     Text: you would get 82% accuracy, which is, you know, a lot worse, like, 6 points absolute

0:49:37 - 0:49:39     Text: worse, which is quite a big drop, right?

0:49:39 - 0:49:45     Text: But then if you were to take this 88% teacher and label the 8 million examples, which are, of

0:49:45 - 0:49:48     Text: course, held out, this is test, this is test accuracy.

0:49:48 - 0:49:55     Text: And then train this classification model, which says this is a good or bad review,

0:49:55 - 0:49:59     Text: on these 8 million examples, you can make this model 100 times smaller and get the same

0:49:59 - 0:50:00     Text: accuracy as the teacher, right?

0:50:00 - 0:50:02     Text: You get the same 88% accuracy.

0:50:02 - 0:50:06     Text: So that's really the, uh, the cool thing with distillation is that you can get models

0:50:06 - 0:50:09     Text: that are much smaller, but you still need to train the big model in the first place.

0:50:09 - 0:50:10     Text: So it doesn't help the training cost.

0:50:10 - 0:50:11     Text: It actually hurts it,

0:50:11 - 0:50:13     Text: because you have to use this big model to label millions or

0:50:13 - 0:50:14     Text: billions of examples.

0:50:14 - 0:50:17     Text: So it ends up being more expensive than just training BERT, but you can actually serve

0:50:17 - 0:50:21     Text: this model at, at inference time for, for a tiny cost.

0:50:21 - 0:50:26     Text: So the question is, why does distillation work so well?

0:50:26 - 0:50:32     Text: So the big hypothesis is that language modeling is kind of the ultimate NLP task, right?

0:50:32 - 0:50:36     Text: A perfect language model is also a perfect question answering system, a perfect entailment

0:50:36 - 0:50:39     Text: system, sentiment analysis, co-reference, et cetera, right?

0:50:39 - 0:50:42     Text: Because for any of these things, you

0:50:42 - 0:50:44     Text: could construct it as a language modeling task.

0:50:44 - 0:50:49     Text: So when you're training a massive language model, you are learning many millions of latent

0:50:49 - 0:50:53     Text: features, which are effectively the same features that you need for any other task.

0:50:53 - 0:50:57     Text: And so when you're doing fine-tuning of a simpler, more specific task, what the fine

0:50:57 - 0:51:00     Text: tuning is doing is basically taking these latent features, which your system happened to learn,

0:51:00 - 0:51:03     Text: and which are encoded somewhere in your weights.

0:51:03 - 0:51:06     Text: And it's kind of just tweaking these, which is why you can do it with a single pass

0:51:06 - 0:51:09     Text: over the fine tuning data.

0:51:09 - 0:51:13     Text: And so, once you figure out which parts are important, then there hypothetically exists a

0:51:13 - 0:51:17     Text: much smaller model size, which can still get the same representation and same generalization,

0:51:17 - 0:51:18     Text: right?

0:51:18 - 0:51:21     Text: So, you label a bunch of examples with this fine-tuned model.

0:51:21 - 0:51:23     Text: And now you can learn a model that can really hone in on just these features that are important.

0:51:23 - 0:51:32     Text: And so, you can, you know, train a model that's a hundredth of the size, and just

0:51:32 - 0:51:36     Text: hone in on these features if you have a lot of pseudo label data.

0:51:36 - 0:51:37     Text: And that's why it works.

0:51:37 - 0:51:41     Text: And so, the evidence really is that it just doesn't work to distill the language model itself, right?

0:51:41 - 0:51:46     Text: And so it must be that it's really just learning a subset of the features for most of these

0:51:46 - 0:51:49     Text: tasks.

0:51:49 - 0:51:53     Text: And so, basically every task but language modeling, we've been able to get distillation to work

0:51:53 - 0:51:54     Text: for.

0:51:54 - 0:51:58     Text: So, this includes tasks that seem really hard like question answering and search.

0:51:58 - 0:52:03     Text: So that does imply that language modeling itself, which is basically language generation

0:52:03 - 0:52:08     Text: also, because that's just a form of language modeling, is fundamentally harder than language

0:52:08 - 0:52:12     Text: understanding, which is not super hard to buy.

0:52:12 - 0:52:14     Text: Or at least it's not fundamentally harder.

0:52:14 - 0:52:18     Text: But given the current state of the art, what state of the art models for language understanding do is

0:52:18 - 0:52:20     Text: fundamentally simpler, right?

0:52:20 - 0:52:27     Text: They're presumably just doing kind of pattern recognition, compared to models that are generating language.

0:52:27 - 0:52:33     Text: And so that's kind of why all of these classification models can kind of be distilled so well.

0:52:33 - 0:52:38     Text: So basically, in conclusion, these pre-trained models work really well.

0:52:38 - 0:52:39     Text: They're very expensive.

0:52:39 - 0:52:45     Text: We know how to kind of solve this for inference time and we can do fast inference, but it

0:52:45 - 0:52:52     Text: is still unsolved how to make these fast at training time.

0:52:52 - 0:52:59     Text: And moreover, it seems like a lot of the details about algorithmic improvements for making

0:52:59 - 0:53:04     Text: the training more efficient don't seem to have a ton of benefit in terms of at least

0:53:04 - 0:53:06     Text: getting to the results.

0:53:06 - 0:53:09     Text: And it seems like a lot of choices don't really matter that much.

0:53:09 - 0:53:14     Text: And it's really just about scale. Compared to just the kind of simple

0:53:14 - 0:53:15     Text: masked language modeling baseline,

0:53:15 - 0:53:19     Text: it's pretty hard to beat that in an apples-to-apples comparison.

0:53:19 - 0:53:24     Text: So yeah, it's a little bit unfortunate for a research perspective.

0:53:24 - 0:53:29     Text: It's definitely good from the perspective of people who want to build NLP systems, and who want to,

0:53:29 - 0:53:34     Text: especially domain specific NLP systems, like people who want to adapt to a medical domain

0:53:34 - 0:53:37     Text: or people who only have a tiny amount of data or people who want to do startups or they

0:53:37 - 0:53:40     Text: want to build an actual product and they only have a tiny amount of data.

0:53:40 - 0:53:41     Text: So it's definitely good for that perspective.

0:53:41 - 0:53:49     Text: But certainly, I think from the perspective of research, if, as I was saying, the

0:53:49 - 0:53:53     Text: goal of research is to kind of research yourself out of a job, then it is kind of,

0:53:53 - 0:53:56     Text: you know, it's a little unfortunate from that perspective.

0:53:56 - 0:54:00     Text: But I still think that there's a possibility that there's going to be a breakthrough that

0:54:00 - 0:54:07     Text: kind of shows how to get computational efficiency, and can kind of show compelling results

0:54:07 - 0:54:10     Text: that you don't need, you know, such an absurdly large model.

0:54:10 - 0:54:15     Text: Or actually, maybe model size does matter, but you don't need such an expensive model to do well.

0:54:15 - 0:54:19     Text: Maybe it will come from sparsity, right, or something like that where you actually do have a really

0:54:19 - 0:54:31     Text: large model that's just sparsely activated, using some efficiency tricks or whatever.