0:00:00 - 0:00:11 Text: Okay, so I'm going to talk about BERT and also some kind of precursor work and then some
0:00:11 - 0:00:15 Text: follow-up work that's happened in the last year or so, well, not follow-up, but more
0:00:15 - 0:00:19 Text: recent advancements that have happened since then.
0:00:19 - 0:00:22 Text: So first we're going to talk about history and background.
0:00:22 - 0:00:27 Text: So everyone knows and loves word embeddings in NLP, right?
0:00:27 - 0:00:33 Text: They're kind of the basis for why neural networks work for NLP.
0:00:33 - 0:00:40 Text: Because neural networks work in continuous space, with vectors and matrices, and obviously text
0:00:40 - 0:00:45 Text: is discrete space, and so there needed to be something to bridge the gap. And it turns
0:00:45 - 0:00:49 Text: out that the thing to bridge the gap is actually pretty simple: it's just a look-up
0:00:49 - 0:00:56 Text: table from a discrete vocabulary to a vector that's learned discriminatively
0:00:56 - 0:00:57 Text: end to end, right?
0:00:57 - 0:01:02 Text: So originally, like in the original Bengio 2003 neural language
0:01:02 - 0:01:07 Text: model paper, these were just trained discriminatively end to end, and so
0:01:07 - 0:01:12 Text: then people would train language models and then use the embedding
0:01:12 - 0:01:16 Text: layer as pre-trained representations for other tasks.
0:01:16 - 0:01:19 Text: But they wouldn't use the rest of the language model, they would just use the embedding layer.
0:01:19 - 0:01:24 Text: And then word2vec and GloVe and stuff came along, where people found a much cheaper,
0:01:24 - 0:01:31 Text: much more scalable way to train where you can just use the statistics of a corpus where
0:01:31 - 0:01:34 Text: it's just a linear model so you don't have to compute these expensive feed-forward layers
0:01:34 - 0:01:38 Text: that you're going to throw out anyways and so you can scale up to like billions of tokens
0:01:38 - 0:01:41 Text: on a single CPU, right?
0:01:41 - 0:01:47 Text: So the problem though is that these word embeddings are applied in a context-free manner, right?
0:01:47 - 0:01:53 Text: So for kind of a simple toy example, the word bank: if you say open a bank account
0:01:53 - 0:01:55 Text: and on a river bank, it's going to be the same embedding.
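As a rough illustration of why a look-up table is context-free, here is a minimal Python sketch. The tiny vocabulary and the three-dimensional vectors are invented for illustration and do not come from the talk or from any trained model.

```python
# Toy embedding table: one fixed vector per vocabulary entry.
# The numbers are made up for illustration, not trained values.
embedding_table = {
    "open": [0.1, 0.3, -0.2],
    "a": [0.0, 0.1, 0.0],
    "bank": [0.7, -0.4, 0.5],
    "account": [0.2, 0.2, 0.9],
    "on": [-0.1, 0.0, 0.3],
    "river": [-0.6, 0.8, 0.1],
}


def embed_sentence(tokens, table):
    """Look up each token independently; neighboring words are never consulted."""
    return [table[token] for token in tokens]


finance = embed_sentence(["open", "a", "bank", "account"], embedding_table)
nature = embed_sentence(["on", "a", "river", "bank"], embedding_table)

# "bank" receives the identical vector in both sentences.
same_vector = finance[2] == nature[3]
```

Because the look-up only depends on the token itself, no amount of training on the table alone can separate the two uses of "bank".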
0:01:55 - 0:02:00 Text: So people try to do stuff like word sense embeddings where it's not just a single word, it's
0:02:00 - 0:02:06 Text: a full word sense, but this kind of bank example, it's a little bit of a toy example, right?
0:02:06 - 0:02:10 Text: Almost any word has a different meaning depending on the context.
0:02:10 - 0:02:15 Text: So even like open the bank account and I went to the bank, those are still
0:02:15 - 0:02:19 Text: semi-different senses of the word bank.
0:02:19 - 0:02:23 Text: I mean, they have different parts of speech kind of, well I guess
0:02:23 - 0:02:26 Text: not really, but like they're kind of using different senses, right?
0:02:26 - 0:02:29 Text: And so, yes, so we really need a contextual representation, right?
0:02:29 - 0:02:36 Text: So we want something where it's a representation of a word after it's been put into the context
0:02:36 - 0:02:37 Text: of the sentence that we've seen it in, right?
0:02:37 - 0:02:39 Text: Which would be like at the bottom here.
0:02:39 - 0:02:45 Text: So kind of for history of contextual representations, the first big paper for this type of contextual
0:02:45 - 0:02:50 Text: representation was a paper from Google in 2015 called Semi-supervised Sequence Learning
0:02:50 - 0:02:52 Text: from Andrew Dai and Quoc Le.
0:02:52 - 0:02:58 Text: And so in this one, it was actually very similar to papers that came after it, it didn't
0:02:58 - 0:02:59 Text: get as much attention for various reasons.
0:02:59 - 0:03:04 Text: But basically they had some classification task, like sentiment classification on
0:03:04 - 0:03:07 Text: movie reviews, and they had a big corpus of movie reviews.
0:03:07 - 0:03:11 Text: And so then they said, what happens if we just take our existing LSTM model and, instead
0:03:11 - 0:03:13 Text: of just using pre-trained embeddings, which everyone had already been doing,
0:03:13 - 0:03:19 Text: actually probably since 2003, people had been using pre-trained embeddings,
0:03:19 - 0:03:23 Text: they said let's actually pre-train the entire model as a language model and then
0:03:23 - 0:03:26 Text: let's fine-tune it for our classification task.
0:03:26 - 0:03:31 Text: And they got pretty good results but not like stellar results.
0:03:31 - 0:03:34 Text: And so now we know that the reason why they didn't get stellar results is they didn't train
0:03:34 - 0:03:36 Text: on enough data, because they basically pre-trained on the same corpus that they were fine-tuning
0:03:36 - 0:03:40 Text: on, and they pre-trained the same size model that they were fine-tuning,
0:03:40 - 0:03:41 Text: which we now know needs to be bigger.
0:03:41 - 0:03:46 Text: But this was already kind of a little bit ahead of its time, partially
0:03:46 - 0:03:50 Text: because we didn't have as much compute back then, even though it was
0:03:50 - 0:03:52 Text: only five years ago.
0:03:52 - 0:03:54 Text: And it would have been more expensive.
0:03:54 - 0:04:02 Text: And then in 2017, ELMo came out, which was from the University of Washington and AI2.
0:04:02 - 0:04:08 Text: And so in this one, they did something pretty clever where you train a language
0:04:08 - 0:04:12 Text: model on a big corpus, so they trained it on a billion word corpus and they trained a
0:04:12 - 0:04:17 Text: big model, an LSTM with 4,000 hidden dimensions, which is quite expensive.
0:04:17 - 0:04:19 Text: And they trained a bi-directional model.
0:04:19 - 0:04:24 Text: But it was kind of weakly bi-directional, where they trained a left-to-right model and then
0:04:24 - 0:04:28 Text: a right-to-left model and then they concatenated the two.
0:04:28 - 0:04:30 Text: And they called these contextual pre-trained embeddings.
0:04:30 - 0:04:36 Text: And so the idea behind ELMo is that this doesn't actually change your existing model architecture.
0:04:36 - 0:04:40 Text: You kind of take whatever task-specific model architecture that you have, which could
0:04:40 - 0:04:45 Text: be for question answering, it might be some sort of fancy model where you do an LSTM over
0:04:45 - 0:04:50 Text: the source and over the question and over the answer, then you attend from one to another,
0:04:50 - 0:04:52 Text: whatever kind of architecture you have.
0:04:52 - 0:04:58 Text: And wherever you would have put in GloVe embeddings before, now you put in ELMo embeddings.
0:04:58 - 0:05:03 Text: And so this got state of the art on everything at the time, question answering, semantic
0:05:03 - 0:05:07 Text: parsing, syntactic parsing, because if you just took any existing kind
0:05:07 - 0:05:12 Text: of state-of-the-art model, you could put in ELMo embeddings and get state-of-the-art,
0:05:12 - 0:05:13 Text: right?
0:05:13 - 0:05:17 Text: But the models themselves were kind of fixed.
0:05:17 - 0:05:24 Text: And so then after that, OpenAI published Improving Language Understanding by Generative
0:05:24 - 0:05:27 Text: Pre-Training, which is called GPT-1.
0:05:27 - 0:05:37 Text: And so in this, they took a similar large corpus, about a billion words, and they trained a
0:05:37 - 0:05:38 Text: very large language model.
0:05:38 - 0:05:42 Text: So a 12-layer language model, which at the time was, I don't know whether it was
0:05:42 - 0:05:44 Text: actually the largest language model that had been trained at the time, but certainly it was
0:05:44 - 0:05:47 Text: the largest language model that had been trained on that much data for a kind of open-source
0:05:47 - 0:05:49 Text: model.
0:05:49 - 0:05:54 Text: And when I first read it, I actually thought that it was too big, not that it was worse,
0:05:54 - 0:05:56 Text: but that they were kind of just showing off by showing how big of a model they could
0:05:56 - 0:05:57 Text: train.
0:05:57 - 0:06:01 Text: But now we know that actually this depth that they had was actually kind of the crucial element.
0:06:01 - 0:06:03 Text: So they did something that was like fairly simple, right?
0:06:03 - 0:06:07 Text: They just trained a language model, a very large one, and then they just fine-tuned it by
0:06:07 - 0:06:11 Text: taking the last token and then fine-tuning it for a classification task, right?
0:06:11 - 0:06:13 Text: So is this positive or negative?
0:06:13 - 0:06:18 Text: And they got basically state-of-the-art on lots of different classification tasks.
0:06:18 - 0:06:25 Text: And so I'm going to actually take kind of an aside here before I go into BERT,
0:06:25 - 0:06:26 Text: which is about the transformer.
0:06:26 - 0:06:31 Text: So that was the other kind of big thing, like the big precursor that allowed BERT and
0:06:31 - 0:06:33 Text: GPT to work well, right?
0:06:33 - 0:06:38 Text: So BERT and GPT both use the transformer, which I'm sure you guys have learned about.
0:06:38 - 0:06:43 Text: And so I don't need to necessarily go into all the details about it.
0:06:43 - 0:06:49 Text: But, so it has multi-headed attention, feed-forward layers, layer norm.
0:06:49 - 0:06:51 Text: I won't go into all the details because I think you guys already learned about it.
0:06:51 - 0:06:57 Text: But the big thing about why this kind of took over is, there's really two advantages
0:06:57 - 0:06:58 Text: versus the LSTM.
0:06:58 - 0:07:00 Text: One is that there's no locality bias.
0:07:00 - 0:07:06 Text: And so long-distance context has an equal opportunity to short-distance context,
0:07:06 - 0:07:09 Text: which is important.
0:07:09 - 0:07:14 Text: So for normal language understanding, the locality bias of LSTMs is generally
0:07:14 - 0:07:18 Text: considered to be a good thing,
0:07:18 - 0:07:22 Text: because local context is more relevant than long-distance context.
0:07:22 - 0:07:27 Text: But the way that GPT and BERT and other models work is that they actually concatenate
0:07:27 - 0:07:28 Text: context.
0:07:28 - 0:07:35 Text: And so if you have a model that says does sentence one entail sentence two, the way that it was
0:07:35 - 0:07:39 Text: done historically, meaning like before GPT, was that you would like encode them both,
0:07:39 - 0:07:43 Text: let's say with an LSTM, then you would do attention from one to the other.
0:07:43 - 0:07:49 Text: With a transformer, you can just put them into the same sequence and give them separate
0:07:49 - 0:07:51 Text: sequence embeddings or add a separator token.
0:07:51 - 0:07:56 Text: And then things can attend to their own sentence locally,
0:07:56 - 0:08:02 Text: but they can also attend all the way to the other sentence; it's just
0:08:02 - 0:08:04 Text: as easy to attend all the way to the other sentence.
0:08:04 - 0:08:07 Text: And so when you do this, kind of you can just pack everything into a single sequence and
0:08:07 - 0:08:09 Text: then everything will be learned.
0:08:09 - 0:08:13 Text: Rather than having to do this as part of the model architecture, which ends up being a
0:08:13 - 0:08:17 Text: pretty important thing about simplifying these models.
0:08:17 - 0:08:25 Text: And so the other thing is that with LSTMs, let's say this
0:08:25 - 0:08:27 Text: is a batch and these are the words in the batch.
0:08:27 - 0:08:30 Text: You have two sentences and four words per sentence.
0:08:30 - 0:08:32 Text: Every step has to be computed one at a time.
0:08:32 - 0:08:35 Text: So you only get a batch size of two effectively.
0:08:35 - 0:08:40 Text: And so on modern hardware, which is TPUs and GPUs, the bigger the matrix multiplication,
0:08:40 - 0:08:41 Text: the better it is.
0:08:41 - 0:08:43 Text: You want all three dimensions to be big.
0:08:43 - 0:08:47 Text: So even if you have big hidden layers, your batch size dimension will still be small,
0:08:47 - 0:08:51 Text: unless you have a huge batch, but then that's too expensive for long sequences.
0:08:51 - 0:08:57 Text: But with transformers, because it's layer-wise attention,
0:08:57 - 0:08:58 Text: the batch size is the total number of words.
0:08:58 - 0:09:05 Text: So if you have 512 words and then 32 sentences, it's actually 32 times 512 that is the total batch size.
0:09:05 - 0:09:10 Text: So you get these huge matrix multiplications, and you can take advantage of modern hardware.
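The batch-size argument above can be checked with a little arithmetic. This is only a sketch of the counting in the talk; the function names are hypothetical, and it ignores everything else about how the two architectures are actually implemented.

```python
def lstm_rows_per_step(num_sentences, seq_len):
    """An LSTM processes one time step at a time, so each matrix
    multiplication only sees one row per sentence in the batch."""
    return num_sentences


def transformer_rows_per_layer(num_sentences, seq_len):
    """A transformer layer processes every position at once, so each
    matrix multiplication sees one row per word in the batch."""
    return num_sentences * seq_len


# The small example from the talk: two sentences, four words each.
small_lstm = lstm_rows_per_step(2, 4)                # 2
small_transformer = transformer_rows_per_layer(2, 4)  # 8

# The larger example: 32 sentences of 512 words.
large_transformer = transformer_rows_per_layer(32, 512)  # 16384
```

With 32 sentences of 512 words, the effective batch dimension is 16,384 rows per layer for the transformer versus 32 rows per step for the LSTM.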
0:09:10 - 0:09:14 Text: And so that's kind of why the transformer has taken over, because of these two things.
0:09:14 - 0:09:16 Text: And that's why it was used in GPT and why it's used in BERT.
0:09:16 - 0:09:20 Text: So now I'm going to talk about BERT.
0:09:20 - 0:09:29 Text: So the problem with the previous models, being ELMo and GPT and the ones before them, is that
0:09:29 - 0:09:35 Text: the language models only used left context or right context or a concatenation of both,
0:09:35 - 0:09:39 Text: but really language understanding is bidirectional.
0:09:39 - 0:09:45 Text: So there's this clear kind of mismatch: why did everyone train unidirectional
0:09:45 - 0:09:50 Text: models, where you could only see to the left or only see to the right, when we know that
0:09:50 - 0:09:53 Text: in order to understand language, you need to look in both directions.
0:09:53 - 0:09:55 Text: So there's two reasons.
0:09:55 - 0:10:01 Text: So one is that language models historically had typically been used as features in
0:10:01 - 0:10:02 Text: other systems.
0:10:02 - 0:10:05 Text: So the most direct application of a language model would be predictive text, which is directly
0:10:05 - 0:10:07 Text: just saying predict the next word.
0:10:07 - 0:10:10 Text: But the other applications that are actually more common are to use them in a machine
0:10:10 - 0:10:15 Text: translation system or a speech recognition system, where you have these features like translation
0:10:15 - 0:10:19 Text: features or acoustic features, and then you add a language model that says what's the
0:10:19 - 0:10:20 Text: probability of the sentence.
0:10:20 - 0:10:23 Text: And so for this, you want it to be a well-formed distribution.
0:10:23 - 0:10:26 Text: For these pre-trained models we actually don't care about this, but this was kind of something
0:10:26 - 0:10:32 Text: that people had just been, I guess, fixed on: this idea that language
0:10:32 - 0:10:35 Text: models have to have a probability distribution, even though we actually don't
0:10:35 - 0:10:36 Text: care about that.
0:10:36 - 0:10:40 Text: But the other kind of bigger reason is that words can see themselves in a bidirectional
0:10:40 - 0:10:42 Text: encoder.
0:10:42 - 0:10:50 Text: And so what this means is, when you build a representation incrementally, you have your input and
0:10:50 - 0:10:52 Text: then you have your output and it's always offset by one.
0:10:52 - 0:10:58 Text: So we have the start-of-sentence token, we predict the first word, then
0:10:58 - 0:11:01 Text: we feed in the first word and predict the second word, and so we can encode
0:11:01 - 0:11:05 Text: the sentence once and predict all the words in the sentence with the unidirectional model.
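The offset-by-one setup can be sketched in a few lines of Python. The `<s>` start-of-sentence symbol and the function name are assumptions made for illustration; the talk does not name a specific token.

```python
def lm_inputs_and_targets(tokens, bos="<s>"):
    """Build left-to-right language-model training pairs.

    The input at position i is the word before position i (or the
    start-of-sentence symbol at position 0), and the target at position i
    is the word itself, so one pass yields a prediction for every word.
    """
    inputs = [bos] + tokens[:-1]
    targets = list(tokens)
    return inputs, targets


inputs, targets = lm_inputs_and_targets(["open", "a", "bank", "account"])
# inputs:  ["<s>", "open", "a", "bank"]
# targets: ["open", "a", "bank", "account"]
```

Every position contributes a training signal, which is the sample-efficiency advantage the unidirectional setup has over masking only a fraction of the words.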
0:11:05 - 0:11:07 Text: And so this gives us good sample efficiency, right?
0:11:07 - 0:11:11 Text: Because if we have, like, a sequence of 500 words, we don't want
0:11:11 - 0:11:17 Text: to have to only predict one word because it's going to be 500 times as much compute to
0:11:17 - 0:11:20 Text: get the same amount of predictions.
0:11:20 - 0:11:24 Text: If we were to just trivially do a bidirectional LSTM or transformer, we would have a situation
0:11:24 - 0:11:28 Text: where you encode your sentence, everything is bidirectional.
0:11:28 - 0:11:32 Text: And so after the first layer, everything can see itself.
0:11:32 - 0:11:35 Text: So this word open, there's a path back down to open.
0:11:35 - 0:11:39 Text: And so it's trivial to predict a word when it's in the input also, right?
0:11:39 - 0:11:42 Text: There's no actual prediction going on there.
0:11:42 - 0:11:47 Text: So the simple solution, which is basically the whole crux of BERT, is that instead
0:11:47 - 0:11:55 Text: of training a normal language model, let's just mask out k percent of the words.
0:11:55 - 0:11:59 Text: So: the man went to the [MASK] to buy a [MASK] of milk.
0:11:59 - 0:12:03 Text: And so now you can run a bidirectional model on that.
0:12:03 - 0:12:08 Text: And because the words aren't in the input, you can't cheat, right?
0:12:08 - 0:12:13 Text: And so the downside of this is that you're not getting as many predictions per sentence,
0:12:13 - 0:12:14 Text: right?
0:12:14 - 0:12:17 Text: You're only predicting 15 percent of words instead of 100 percent of words.
0:12:17 - 0:12:19 Text: But the upside is that you're getting a much richer model because you're seeing both
0:12:19 - 0:12:21 Text: directions, right?
0:12:21 - 0:12:26 Text: So this value of k is a hyperparameter that we have to just decide on empirically.
0:12:26 - 0:12:27 Text: So we use 15 percent.
0:12:27 - 0:12:30 Text: It turns out that that's actually kind of an optimal value.
0:12:30 - 0:12:34 Text: We, and also other people since then, have done more thorough ablation experiments and
0:12:34 - 0:12:36 Text: found that this 15 percent is good.
0:12:36 - 0:12:40 Text: So the reason for doing a certain percent over another is that if you were to do, let's
0:12:40 - 0:12:43 Text: say, 50 percent masking, you would get way more predictions, but you would also mask
0:12:43 - 0:12:46 Text: out like all of your context.
0:12:46 - 0:12:52 Text: And so if you mask out all of your context, you can't
0:12:52 - 0:12:53 Text: learn contextual models.
0:12:53 - 0:12:58 Text: And if you only, let's say, mask out one word, that might be optimal maybe,
0:12:58 - 0:13:00 Text: but you have to do way more data processing.
0:13:00 - 0:13:02 Text: So it would be way more expensive to train.
0:13:02 - 0:13:05 Text: And we know that these models are basically just compute bound.
0:13:05 - 0:13:08 Text: So if you just have enough data, you can just kind of train them infinitely and it'll
0:13:08 - 0:13:10 Text: always do better.
0:13:10 - 0:13:14 Text: So it's really just a trade-off between these two things.
0:13:14 - 0:13:19 Text: So one other little detail in BERT, which turned out not to be super important, is that
0:13:19 - 0:13:24 Text: because the mask token is never seen at fine-tuning time, instead of always replacing a word
0:13:24 - 0:13:30 Text: with the mask token as in this case, we would randomly sometimes replace it with a random
0:13:30 - 0:13:32 Text: word and sometimes keep the same word.
0:13:32 - 0:13:36 Text: So like, 10 percent of the time we would say 'went to the store' becomes 'went to the running',
0:13:36 - 0:13:37 Text: right?
0:13:37 - 0:13:42 Text: And so we wouldn't tell the model which case was which.
0:13:42 - 0:13:48 Text: We would just say, what should this word be, right?
0:13:48 - 0:13:50 Text: And it didn't know whether it's right or not.
0:13:50 - 0:13:51 Text: So it could be the same word.
0:13:51 - 0:13:52 Text: So 10 percent of the time, it's the same word.
0:13:52 - 0:13:53 Text: It could be a random word.
0:13:53 - 0:13:58 Text: And so it has to basically be able to maintain a good representation of every word because
0:13:58 - 0:14:00 Text: it doesn't know whether it's really the right word.
0:14:00 - 0:14:03 Text: So it has to actually look at every word and figure out whether this is the right word.
0:14:03 - 0:14:06 Text: So we could potentially even just get away with not using the mask token at all, and just doing
0:14:06 - 0:14:08 Text: like this 50 percent of the time and this 50 percent of the time.
0:14:08 - 0:14:12 Text: But the reason for not doing that is that, you know, then we'd be corrupting a lot of our
0:14:12 - 0:14:16 Text: data, and we don't necessarily want to corrupt our data, because the fact that this is the
0:14:16 - 0:14:19 Text: wrong word might mess up our prediction for some other word over here, right?
0:14:19 - 0:14:23 Text: Whereas with a mask token, at least it knows that it's not the right word, so it doesn't
0:14:23 - 0:14:26 Text: use that as part of its context.
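The masking procedure described above can be sketched as follows. This is a simplified illustration, not the released BERT preprocessing code: it selects each position independently with probability 15 percent, and it uses the 80/10/10 split between the mask token, a random word, and the unchanged word that is reported in the BERT paper (the talk itself only mentions the 10 percent cases). The function name, the `[MASK]` string, and the use of `None` for "no prediction here" are assumptions for this sketch.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, rng, mask_rate=0.15):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] is the original token at positions the model must predict,
    and None at positions that are left alone.
    """
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= mask_rate:
            # Position not selected: keep the word, nothing to predict.
            corrupted.append(token)
            labels.append(None)
            continue
        labels.append(token)
        roll = rng.random()
        if roll < 0.8:
            corrupted.append(MASK)               # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: replace with a random word
        else:
            corrupted.append(token)              # 10%: keep the same word
    return corrupted, labels


rng = random.Random(0)
sentence = ["the", "man", "went", "to", "the", "store"]
vocab = ["the", "man", "went", "to", "store", "running", "milk"]
corrupted, labels = mask_tokens(sentence, vocab, rng)
```

The model is never told which of the three cases applied at a selected position, so it has to keep a usable representation of every input word.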
0:14:26 - 0:14:33 Text: So the other kind of detail of BERT, which also subsequently may not have been
0:14:33 - 0:14:40 Text: that important, is that in a lot of these tasks we're not just learning words, we
0:14:40 - 0:14:42 Text: want to predict the relationship between sentences.
0:14:42 - 0:14:47 Text: So in question answering in particular, we have a query which is generally a sentence
0:14:47 - 0:14:54 Text: and then we have an answer which is a paragraph or a sentence or a document and we want to,
0:14:54 - 0:14:56 Text: you know, say does this answer the question.
0:14:56 - 0:15:03 Text: So we want to have some pre-training task that actually does a sentence-level
0:15:03 - 0:15:06 Text: prediction rather than just a word-level prediction.
0:15:06 - 0:15:09 Text: And for the way that we do this, we need to have like an infinite amount
0:15:09 - 0:15:10 Text: of data, right?
0:15:10 - 0:15:13 Text: We're going to generate an infinite amount of data so we don't want this to be an annotated
0:15:13 - 0:15:14 Text: task.
0:15:14 - 0:15:20 Text: So the way that we did this is we just did a next sentence prediction task where we just
0:15:20 - 0:15:26 Text: took two sentences from the same corpus, and 50% of the time they're from
0:15:26 - 0:15:31 Text: the same document, 50% of the time they're from a random document, and then we just said, was this
0:15:31 - 0:15:33 Text: the real next sentence or not?
0:15:33 - 0:15:37 Text: And so if you have like the man went to the store, he bought a gallon of milk, that is the next
0:15:37 - 0:15:38 Text: sentence.
0:15:38 - 0:15:40 Text: If you said the man went to the store, penguins are flightless, that's not the next
0:15:40 - 0:15:41 Text: sentence.
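A minimal sketch of how such next-sentence examples could be generated from unlabeled text is shown below. The function name and the data layout (a list of documents, each a list of sentences) are assumptions for illustration; the sketch also assumes at least two documents and at least two sentences per document.

```python
import random


def make_nsp_example(documents, rng):
    """Return (sentence_a, sentence_b, is_next) from unlabeled documents.

    Half of the time sentence_b is the real next sentence; the other half
    it is a random sentence drawn from a different document.
    """
    doc_idx = rng.randrange(len(documents))
    doc = documents[doc_idx]
    i = rng.randrange(len(doc) - 1)  # leave room for a following sentence
    sentence_a = doc[i]
    if rng.random() < 0.5:
        return sentence_a, doc[i + 1], True
    other_indices = [j for j in range(len(documents)) if j != doc_idx]
    other_doc = documents[rng.choice(other_indices)]
    return sentence_a, rng.choice(other_doc), False


documents = [
    ["The man went to the store.", "He bought a gallon of milk."],
    ["Penguins are flightless.", "They live in the Southern Hemisphere."],
]
example = make_nsp_example(documents, random.Random(0))
```

Since the labels come for free from document order, the amount of training data is limited only by the size of the corpus, not by annotation.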
0:15:41 - 0:15:45 Text: So basically now we're forcing the model at pre-training time to actually look
0:15:45 - 0:15:48 Text: at the full sentences and then make some sort of sentence-level prediction, and we hope that
0:15:48 - 0:15:55 Text: this kind of generalizes to something like question answering, where you have a question
0:15:55 - 0:16:00 Text: and answer as sentence A and sentence B.
0:16:00 - 0:16:07 Text: So in terms of our input representation, it looks pretty similar to a normal transformer
0:16:07 - 0:16:12 Text: but we have these additional embeddings which are called segment embeddings.
0:16:12 - 0:16:17 Text: So in a normal transformer, you would have your input and then you would do WordPiece segmentation,
0:16:17 - 0:16:25 Text: right, where we apply this unsupervised splitting of words into kind of morphological
0:16:25 - 0:16:29 Text: splits, but they're often not morphological, since it's unsupervised.
0:16:29 - 0:16:32 Text: But you end up with something that's roughly morphological, right?
0:16:32 - 0:16:37 Text: And so now you have no out-of-vocabulary tokens; everything is represented, and at the
0:16:37 - 0:16:39 Text: very least you always split into characters.
0:16:39 - 0:16:45 Text: So we use a 30,000 word vocabulary and then we have our token embeddings, then we have
0:16:45 - 0:16:47 Text: our normal position embeddings which is at the bottom.
0:16:47 - 0:16:51 Text: So these are part of the transformer, because transformers, unlike LSTMs, don't have
0:16:51 - 0:16:55 Text: any sort of locational awareness.
0:16:55 - 0:17:00 Text: So the way to encode that is that you encode an actual embedding for every position.
0:17:00 - 0:17:05 Text: So this is called absolute position embedding, there's other techniques nowadays.
0:17:05 - 0:17:10 Text: And then you have the segment embedding, which is: is this sentence A or sentence B?
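The three kinds of embeddings can be sketched as index sequences that are looked up and summed. The `[CLS]` and `[SEP]` token names follow the BERT paper rather than this part of the talk, and the helper names and toy integer vectors are assumptions for illustration only.

```python
def pack_pair(tokens_a, tokens_b):
    """Pack two sentences into one sequence with segment and position ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Absolute positions: one learned embedding per index.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def add_vectors(*vectors):
    """Element-wise sum, standing in for token + position + segment embedding."""
    return [sum(components) for components in zip(*vectors)]


tokens, segment_ids, position_ids = pack_pair(["my", "dog"], ["he", "barks"])
# tokens:       [CLS] my dog [SEP] he barks [SEP]
# segment_ids:  0     0  0   0     1  1     1
# position_ids: 0     1  2   3     4  5     6

# Toy vectors for a single position: token, position, and segment embeddings.
input_vector = add_vectors([1, 2], [10, 20], [100, 200])
```

The same packing extends to more than two parts, for example query, title, URL and document content, by using more segment ids.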
0:17:10 - 0:17:14 Text: And so this kind of generalizes to more general contexts.
0:17:14 - 0:17:17 Text: So you can imagine if you're trying to do like web search, you
0:17:17 - 0:17:24 Text: might say here's my query, here's the title, here's the URL, here's the document content.
0:17:24 - 0:17:27 Text: And so you can kind of just pack these all into a single sequence and then just give them
0:17:27 - 0:17:34 Text: different segment embeddings or type embeddings, so that now
0:17:34 - 0:17:41 Text: you're able to kind of just represent everything in this same single sequence,
0:17:41 - 0:17:44 Text: where you differentiate it by just the single embedding that's different.
0:17:44 - 0:17:47 Text: And this is all of course learned.
0:17:47 - 0:17:50 Text: And so this is in contrast to kind of the older style where you would typically have a
0:17:50 - 0:17:51 Text: different encoder for every part.
0:17:51 - 0:17:54 Text: So like you would have a different encoder for the query and then maybe the title and the
0:17:54 - 0:17:55 Text: URL.
0:17:55 - 0:17:58 Text: But in this case it's all just a single sequence.
0:17:58 - 0:18:03 Text: So we trained on about a 3 billion word corpus, which was large at the time; now it's not
0:18:03 - 0:18:07 Text: actually even that big compared to what people are training on.
0:18:07 - 0:18:12 Text: We used a batch size which was also large.
0:18:12 - 0:18:14 Text: We trained for about 40 epochs of the data.
0:18:14 - 0:18:18 Text: We trained these two models which are still relatively large.
0:18:18 - 0:18:22 Text: So one of them is 12 layer, 768 and then the other one is 24 layer, 1024.
0:18:22 - 0:18:26 Text: So at the time this is basically like one of the largest models that had been trained
0:18:26 - 0:18:32 Text: although now people are training models that I think are 30 times or more bigger than this
0:18:32 - 0:18:33 Text: in the more recent papers.
0:18:33 - 0:18:39 Text: So things have kind of exploded in terms of compute in the last, I don't know, about three years.
0:18:39 - 0:18:42 Text: But yeah.
0:18:42 - 0:18:47 Text: So the fine tuning procedure is, it's pretty straightforward right?
0:18:47 - 0:18:50 Text: So we pre-trained this model for these two tasks.
0:18:50 - 0:18:54 Text: And so now we have an input sequence which is multiple sentences with different type
0:18:54 - 0:18:55 Text: embeddings.
0:18:55 - 0:19:00 Text: We feed them through our transformer model.
0:19:00 - 0:19:03 Text: And now we have the special embedding which I think I didn't mention.
0:19:03 - 0:19:08 Text: So this special embedding, this is basically, it's learned for the next sentence
0:19:08 - 0:19:14 Text: prediction task, and then this is also used for classification tasks.
0:19:14 - 0:19:16 Text: But it's not just that we're using the embedding right?
0:19:16 - 0:19:19 Text: We are, we're fine tuning the entire model right?
0:19:19 - 0:19:22 Text: So it's really not that this embedding is intrinsically useful or that the word embedding
0:19:22 - 0:19:23 Text: is intrinsically useful.
0:19:23 - 0:19:29 Text: It's that the weights inside the entire 12 or 24 layer model are useful.
0:19:29 - 0:19:33 Text: And so by fine tuning the entire model you can kind of pick out the salient parts that
0:19:33 - 0:19:37 Text: are important for some downstream task.
0:19:37 - 0:19:42 Text: And so this is kind of the task-specific fine tuning.
0:19:42 - 0:19:49 Text: So if we have a single classification task like let's say sentiment analysis where you
0:19:49 - 0:19:52 Text: say is this a positive or negative review?
0:19:52 - 0:19:54 Text: We encode our sentence with the BERT model.
0:19:54 - 0:19:58 Text: And the only parameters that we add are this final output matrix right?
0:19:58 - 0:20:02 Text: So maybe if we have three, like say positive, negative or neutral, this might be a thousand
0:20:02 - 0:20:03 Text: times three right?
0:20:03 - 0:20:07 Text: So it's just 3,000 parameters and 300 million.
0:20:07 - 0:20:11 Text: So 3,000 new parameters and 300 million old parameters.
0:20:11 - 0:20:18 Text: And we jointly train all 300 million plus 3,000 for this downstream task.
0:20:18 - 0:20:22 Text: But because the vast majority of them are pre-trained, we can kind of adapt to it in only like a
0:20:22 - 0:20:24 Text: few thousand labeled examples.
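The parameter count for the classification head is simple arithmetic, sketched below. The function name is hypothetical, and the hidden size of 1024 is the width of the 24-layer model mentioned earlier in the talk; the optional bias term is an assumption that the talk does not discuss.

```python
def classifier_head_params(hidden_size, num_classes, use_bias=False):
    """Count the parameters of a single linear output layer."""
    params = hidden_size * num_classes  # the final output matrix
    if use_bias:
        params += num_classes           # one bias per class, if used
    return params


# Three-way sentiment (positive, negative, neutral) on the 1024-wide model.
new_params = classifier_head_params(1024, 3)  # 3072, the "about 3,000" in the talk
pretrained_params = 300_000_000               # rough figure quoted in the talk
fraction_new = new_params / (new_params + pretrained_params)
```

The new parameters are roughly one in a hundred thousand of the total, which is why a few thousand labeled examples can be enough to fine-tune.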
0:20:24 - 0:20:29 Text: And similarly for a sentence pair task, we just concatenate the two sentences
0:20:29 - 0:20:30 Text: with different type embeddings.
0:20:30 - 0:20:33 Text: So if you want to say does this sentence entail this other sentence, you take sentence
0:20:33 - 0:20:39 Text: A, you concatenate sentence B, and then also predict from this token and fine-tune
0:20:39 - 0:20:40 Text: the entire thing.
0:20:40 - 0:20:43 Text: Similarly, very few additional parameters.
0:20:43 - 0:20:49 Text: For span prediction tasks, you just have kind of a start of span end of span.
0:20:49 - 0:20:52 Text: So you're only adding a few thousand new parameters.
0:20:52 - 0:21:01 Text: And then for tagging tasks, like part-of-speech tagging, you just have a single sentence.
0:21:01 - 0:21:05 Text: At every single token, or maybe every token except for the word pieces,
0:21:05 - 0:21:08 Text: but that's kind of preprocessing,
0:21:08 - 0:21:10 Text: you predict what's the part of speech of this.
0:21:10 - 0:21:19 Text: And so this is really why, so like, BERT itself is really kind of, I would say, an incremental
0:21:19 - 0:21:20 Text: improvement over what already existed.
0:21:20 - 0:21:28 Text: So it kind of took transformers, ELMo, GPT, really these three ideas, and kind of made
0:21:28 - 0:21:33 Text: a pretty simple change on top of them.
0:21:33 - 0:21:37 Text: But the reason why it had such big impact is not just the numbers that I'll show in
0:21:37 - 0:21:38 Text: a few slides.
0:21:38 - 0:21:40 Text: It's really this thing.
0:21:40 - 0:21:48 Text: Because with ELMo, there was really no fundamental difference; it was just contextual
0:21:48 - 0:21:49 Text: embeddings.
0:21:49 - 0:21:54 Text: So a lot of the fun in deep learning historically has been building new
0:21:54 - 0:21:55 Text: models, right?
0:21:55 - 0:21:58 Text: So you have all of these components, kind of like Lego blocks, right?
0:21:58 - 0:22:04 Text: You have attention layers, feed forward layers, layer normalization, LSTMs, et cetera.
0:22:04 - 0:22:10 Text: And you can just kind of like figure out, say, okay, for this new task, how do I glue
0:22:10 - 0:22:12 Text: these together in a way that's best, right?
0:22:12 - 0:22:19 Text: And so with ELMo, it didn't really change anything fundamentally.
0:22:19 - 0:22:22 Text: It was just, you fed it into your existing model and you got
0:22:22 - 0:22:23 Text: state of the art.
0:22:23 - 0:22:29 Text: For GPT-1, most of these things didn't really work, right?
0:22:29 - 0:22:31 Text: Because it was a left to right language model.
0:22:31 - 0:22:35 Text: And so you could just kind of take the last token and then predict a classification task,
0:22:35 - 0:22:39 Text: but it didn't really make any sense to predict, like, part of speech tags.
0:22:39 - 0:22:41 Text: Because for the first word, there's no context.
0:22:41 - 0:22:45 Text: So it makes no sense to predict the word with no context.
0:22:45 - 0:22:51 Text: With BERT, the reason why it had such high impact was because it kind of simplified things.
0:22:51 - 0:22:56 Text: And so that's not, I'm not saying that's necessarily a good thing, because as a researcher,
0:22:56 - 0:22:57 Text: or a bad thing.
0:22:57 - 0:23:01 Text: So as a researcher, kind of, ironically, the ultimate goal of research is often like research
0:23:01 - 0:23:03 Text: yourself out of a job, right?
0:23:03 - 0:23:07 Text: It's like, you know, if a physicist, I'm not saying BERT had, anywhere near this impact,
0:23:07 - 0:23:11 Text: like a physicist that came up with like a grand theory of physics, they would kind of,
0:23:11 - 0:23:15 Text: like, that would be like the greatest moment in physics, but also that would kind of eliminate
0:23:15 - 0:23:16 Text: a lot of research, right?
0:23:16 - 0:23:20 Text: And so that's kind of like the end goal of research is kind of solve the problem, right?
0:23:20 - 0:23:30 Text: So BERT kind of is a step where now, like, all of these different classes of problems,
0:23:30 - 0:23:35 Text: there's really, it kind of killed like a lot of the need to do model architecture design,
0:23:35 - 0:23:39 Text: which is kind of unfortunate because that's like really fun.
0:23:39 - 0:23:41 Text: And so that's kind of the impact.
0:23:41 - 0:23:45 Text: And I'm not going to say whether it's like a good or a bad impact, it's kind of like the
0:23:45 - 0:23:46 Text: objective impact.
0:23:46 - 0:23:52 Text: So why it's had so much impact is because it has kind of had this effect on now so many
0:23:52 - 0:23:56 Text: things that used to be like designing fun, you know, models, you just feed it in and
0:23:56 - 0:24:01 Text: use one of these four recipes and it kind of just works for all of these different tasks.
0:24:01 - 0:24:05 Text: So in terms of actual empirical results, so these are at the time that the paper was published,
0:24:05 - 0:24:09 Text: of course, things have gotten better since then.
0:24:09 - 0:24:14 Text: So this GLUE task is a set of, they're all kind of similar in that they're all sentence
0:24:14 - 0:24:16 Text: pair or sentence classification tasks.
0:24:16 - 0:24:21 Text: So like MultiNLI would be something like hills and mountains are especially sanctified
0:24:21 - 0:24:22 Text: in Jainism.
0:24:22 - 0:24:26 Text: And then hypothesis is Jainism hates nature, that's a contradiction, right?
0:24:26 - 0:24:30 Text: So in order for an NLP model to be able to understand this or to be able to answer this
0:24:30 - 0:24:35 Text: correctly and give this label of contradiction, it needs to know that hills and mountains
0:24:35 - 0:24:41 Text: are part of nature, sanctified is a good thing and that hating something is a bad thing
0:24:41 - 0:24:43 Text: and be able to do all of this reasoning, right?
0:24:43 - 0:24:45 Text: So it's pretty complicated reasoning.
0:24:45 - 0:24:48 Text: Similarly for CoLA, you have to be able to say like the wagon rumbled down the road
0:24:48 - 0:24:50 Text: versus the car honked down the road.
0:24:50 - 0:24:57 Text: And so these things are, you know, one of them to a native English speaker sounds totally
0:24:57 - 0:24:59 Text: fine, the other one sounds weird.
0:24:59 - 0:25:01 Text: And so it's similar, and in neither do you have very much data, right?
0:25:01 - 0:25:07 Text: So you have to be able to generalize on only like a few thousand examples.
0:25:07 - 0:25:14 Text: So BERT base, which is the same size as OpenAI, it significantly beat OpenAI, which
0:25:14 - 0:25:16 Text: was the previous state of the art.
0:25:16 - 0:25:22 Text: And then BERT large, which was bigger, of course got better results, which is only
0:25:22 - 0:25:26 Text: surprising in that it got better results across the board, including on the very, very tiny
0:25:26 - 0:25:28 Text: datasets that have only a few thousand examples.
0:25:28 - 0:25:32 Text: That's kind of the more interesting result rather than just the fact that it got better
0:25:32 - 0:25:33 Text: results.
0:25:33 - 0:25:39 Text: Historically, there were rules of thumb about, if you have some number of examples, how
0:25:39 - 0:25:42 Text: do you design the model size that's optimal for that?
0:25:42 - 0:25:45 Text: And so if you don't do pre-training, like if you keep making the model bigger without
0:25:45 - 0:25:49 Text: pre-training, eventually you'll get worse results because your model is overfitting your
0:25:49 - 0:25:51 Text: training data.
0:25:51 - 0:25:54 Text: With pre-training, you basically only ever do like one pass over the data anyways.
0:25:54 - 0:25:58 Text: So there seems to be almost no limit to how big you can make it and still get good results
0:25:58 - 0:26:00 Text: even with a tiny amount of fine-tuning data.
0:26:00 - 0:26:03 Text: And that's really like one of the big takeaways.
0:26:03 - 0:26:12 Text: So I'm not going to, yeah, so the reason why these numbers are, these ranks are lower
0:26:12 - 0:26:19 Text: is because I took the screenshot after, like this, you know, significantly after, when
0:26:19 - 0:26:22 Text: a bunch of other people had submitted systems.
0:26:22 - 0:26:26 Text: But so this is a question answering dataset.
0:26:26 - 0:26:28 Text: So it'd be like, what action did the US take to start this second oil shock?
0:26:28 - 0:26:30 Text: So in this case, there's no answer, right?
0:26:30 - 0:26:33 Text: So something you have to be able to predict is that there's no answer in this passage.
0:26:33 - 0:26:36 Text: So you have to be able to predict the answer or say there's no answer.
0:26:36 - 0:26:42 Text: So BERT beat the previous state of the art, at the time that it was submitted, by about
0:26:42 - 0:26:44 Text: six points, which was a pretty big gain.
0:26:44 - 0:26:47 Text: Now this has kind of gone past human level.
0:26:47 - 0:26:50 Text: But at the time, yeah, it was large.
0:26:50 - 0:26:54 Text: So I'll kind of do some ablation experiments or go through some ablation experiments.
0:26:54 - 0:26:57 Text: So this one, there's four things that I'm comparing here.
0:26:57 - 0:27:01 Text: So this is all BERT base size models.
0:27:01 - 0:27:05 Text: So the blue line is kind of the BERT base model.
0:27:05 - 0:27:09 Text: The red line is where we take out the next sentence prediction.
0:27:09 - 0:27:12 Text: So in our case, even though people have subsequently said that they don't think it's important,
0:27:12 - 0:27:13 Text: in our case, we actually did measure it.
0:27:13 - 0:27:19 Text: And it turns out it seemed like it was important to have this next sentence prediction task,
0:27:19 - 0:27:25 Text: especially for kind of question answering tasks, which is this one.
0:27:25 - 0:27:28 Text: But it seems to help a little bit, at least in all four of them, to have it.
0:27:28 - 0:27:32 Text: So this kind of does indicate that there's some strength in learning some model that learns
0:27:32 - 0:27:37 Text: a relationship between sentences.
0:27:37 - 0:27:45 Text: So then this one is the one that makes an apples to apples comparison between OpenAI's
0:27:45 - 0:27:48 Text: GPT-1 and BERT, right?
0:27:48 - 0:27:50 Text: Because I also made the model bigger, but not for BERT base.
0:27:50 - 0:27:53 Text: So BERT base was the exact same size, but it was trained on more data.
0:27:53 - 0:27:58 Text: So to make it a fair comparison, I basically retrained my own implementation of OpenAI's
0:27:58 - 0:28:04 Text: GPT-1, which is the yellow line.
0:28:04 - 0:28:08 Text: And we can see that on some of the tasks, it's not that far, although this is actually a
0:28:08 - 0:28:09 Text: pretty big gap.
0:28:09 - 0:28:12 Text: This drop is four points, which is a lot.
0:28:12 - 0:28:16 Text: But on some tasks like SQuAD and MRPC, it was way worse.
0:28:16 - 0:28:24 Text: And so for SQuAD it makes sense because SQuAD is a labeling, is a span labeling task.
0:28:24 - 0:28:29 Text: And so if you only have left context, then words at the beginning have basically no context.
0:28:29 - 0:28:32 Text: And so you're asking it to do span labeling on words with almost no context.
0:28:32 - 0:28:35 Text: So it really doesn't make any sense.
0:28:35 - 0:28:36 Text: And so of course it's going to do much worse.
0:28:36 - 0:28:40 Text: So then we also added a, to make it fair, we also added an LSTM on top of it, which is
0:28:40 - 0:28:42 Text: trained from scratch.
0:28:42 - 0:28:45 Text: And this does help a little bit on some of the tasks, but on other ones it doesn't help.
0:28:45 - 0:28:48 Text: So on SQuAD it helps because now you have bidirectional context.
0:28:48 - 0:28:52 Text: But on MRPC, because it's a very small task, it's only got 3,000 labeled examples, it doesn't
0:28:52 - 0:28:54 Text: help at all.
0:28:54 - 0:29:00 Text: So it does show that kind of the masked language model and the next sentence prediction are both important,
0:29:00 - 0:29:04 Text: especially the masked language model.
0:29:04 - 0:29:11 Text: So the other thing, one of the other ablations is that when we apply the masked language model,
0:29:11 - 0:29:15 Text: we're only predicting 15% of words in the sentence.
0:29:15 - 0:29:17 Text: But when you do a left to right language model, you're predicting every single word,
0:29:17 - 0:29:20 Text: conditioned on all the words to the left.
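To make the 15% figure concrete, here is a minimal sketch of how many prediction targets each objective gets from one sentence. The function name `mask_tokens`, the fixed seed, and the placeholder tokens are illustrative assumptions, and the sketch only does the plain `[MASK]` replacement, not any other corruption the real recipe may use.

```python
import random

def mask_tokens(tokens, rate=0.15, seed=0):
    """Replace roughly `rate` of the tokens with [MASK].

    Returns (inputs, targets), where targets maps each masked position
    to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    n_predict = max(1, round(len(tokens) * rate))
    positions = sorted(rng.sample(range(len(tokens)), n_predict))
    inputs = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        inputs[pos] = "[MASK]"
    return inputs, targets

# Hypothetical 20-token sentence.
tokens = ["w%d" % i for i in range(20)]
inputs, targets = mask_tokens(tokens)

mlm_predictions = len(targets)   # masked LM: only the masked positions
ltr_predictions = len(tokens)    # left-to-right LM: every position
```

With 20 tokens the masked model gets 3 training signals per pass where the left-to-right model gets 20, which is why convergence speed is a fair question to ask.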
0:29:20 - 0:29:24 Text: So one question might be how much does this make it take longer to converge?
0:29:24 - 0:29:27 Text: Even though eventually we know that it converges at a much better point, if you have a limited
0:29:27 - 0:29:30 Text: training budget, is it better to do a left to right model?
0:29:30 - 0:29:37 Text: And so we see that when you do this masked language model, the bidirectionality starts to improve things.
0:29:37 - 0:29:39 Text: Like at the very, very beginning, because you're doing so many more predictions, it's true
0:29:39 - 0:29:46 Text: that the left to right model does do better at the very, like for, like, epoch one, but
0:29:46 - 0:29:50 Text: then very soon after, because the bidirectionality is so important, it starts to take over.
0:29:50 - 0:29:53 Text: And so it's basically better from almost the start to do bidirectionality.
0:29:53 - 0:29:57 Text: And then it takes slightly longer to converge, but the overall convergence is, of course, much
0:29:57 - 0:30:01 Text: higher.
0:30:01 - 0:30:10 Text: And then finally for these ablations, we can see that going from a smaller model, which
0:30:10 - 0:30:14 Text: was 100 million to 300 million parameters, helps a lot, which isn't surprising.
0:30:14 - 0:30:20 Text: But the more surprising thing is that, one of these curves, these aren't comparable,
0:30:20 - 0:30:22 Text: you shouldn't compare the curves to each other.
0:30:22 - 0:30:29 Text: The point is to look at the curves as a function of the number of parameters and see that this
0:30:29 - 0:30:35 Text: one is, this one only has 3,000 labeled examples, and this one has 4,000 labeled examples.
0:30:35 - 0:30:41 Text: So in both cases, the curves look very similar, which is surprising because the rule of thumb
0:30:41 - 0:30:46 Text: that you're going to overfit your data if you only have a few labeled examples turns
0:30:46 - 0:30:48 Text: out not to really be true anymore.
0:30:48 - 0:30:51 Text: And, you know, these curves keep going up, right?
0:30:51 - 0:30:56 Text: So now with subsequent papers, which we'll talk about, like this, this big one was 300 million
0:30:56 - 0:30:59 Text: parameters, people have gone up to 11 billion parameters and are still seeing similar behaviors.
0:30:59 - 0:31:03 Text: So still seeing the curves go way up and gotten state of the art results, which is kind of crazy
0:31:03 - 0:31:09 Text: because now we know that, you know, basically there's almost no limit.
0:31:09 - 0:31:13 Text: So another thing I want to talk about before I talk about stuff that's happened since
0:31:13 - 0:31:14 Text: BERT.
0:31:14 - 0:31:22 Text: Is the kind of, is, even though BERT itself was in some ways very simple, which is, you
0:31:22 - 0:31:26 Text: know, not a bad thing, it was very successful immediately.
0:31:26 - 0:31:29 Text: And you know, part of that is the Google brand and like, you know, it got a cute name
0:31:29 - 0:31:30 Text: and stuff like that.
0:31:30 - 0:31:32 Text: But I think that I spent a lot of time with the open source release and particularly
0:31:32 - 0:31:36 Text: looking at other open source releases and figuring out what people didn't like about
0:31:36 - 0:31:38 Text: those.
0:31:38 - 0:31:42 Text: And so I think this is important, like when you're, when you're a PhD student or even
0:31:42 - 0:31:45 Text: working in industry, and trying to release something.
0:31:45 - 0:31:50 Text: So I kind of just listed the things here that I thought were important for like why
0:31:50 - 0:31:55 Text: it was successful compared to other things.
0:31:55 - 0:31:58 Text: So like, not, I'm not trying to call them out just to be mean, but like the
0:31:58 - 0:32:01 Text: OpenAI GPT-1 release was really, was not very good.
0:32:01 - 0:32:04 Text: And then like I'm sure that they realized this because the OpenAI GPT-2 release was very
0:32:04 - 0:32:05 Text: good.
0:32:05 - 0:32:11 Text: And so, yeah, because it was very hard to run and there were no comments. The TensorFlow
0:32:11 - 0:32:14 Text: code was very, like it worked fine, like I replicated it.
0:32:14 - 0:32:17 Text: But like, the, the TensorFlow code was very non-idiomatic.
0:32:17 - 0:32:18 Text: It used all sorts of weird stuff.
0:32:18 - 0:32:19 Text: The Python code was weird.
0:32:19 - 0:32:20 Text: There were no comments.
0:32:20 - 0:32:21 Text: There were basically no instructions.
0:32:21 - 0:32:26 Text: And then other code bases also are kind of too big.
0:32:26 - 0:32:29 Text: It's like people just want to like say like, we want to have one unified code base for
0:32:29 - 0:32:32 Text: our entire, you know, language team.
0:32:32 - 0:32:33 Text: And so they just put stuff as part of that.
0:32:33 - 0:32:34 Text: And people don't really like that either.
0:32:34 - 0:32:37 Text: So I was very insistent that we do a minimal release.
0:32:37 - 0:32:39 Text: So like this, we're just going to release BERT.
0:32:39 - 0:32:40 Text: It's not going to be part of anything.
0:32:40 - 0:32:41 Text: There's not going to be any external dependencies.
0:32:41 - 0:32:44 Text: And it's going to be like very well commented.
0:32:44 - 0:32:48 Text: I think that people, and it was kind of also easy to drop in just the modeling part and
0:32:48 - 0:32:53 Text: just the tokenization part and just the front end, which runs like the training loop.
0:32:53 - 0:32:55 Text: And kind of separate all these out because that way.
0:32:55 - 0:33:00 Text: And so I think because of that, people kind of started using it much quicker.
0:33:00 - 0:33:02 Text: And of course, like all the publicity helped.
0:33:02 - 0:33:08 Text: But I think that, you know, it could have easily been not as successful if it had been,
0:33:08 - 0:33:09 Text: you know, done in a different way.
0:33:09 - 0:33:12 Text: So it's just kind of advice.
0:33:12 - 0:33:14 Text: So yeah.
0:33:14 - 0:33:20 Text: So now I'm going to talk about five models that have come out since BERT that have all improved
0:33:20 - 0:33:22 Text: on top of BERT in various ways.
0:33:22 - 0:33:23 Text: There's been more than five.
0:33:23 - 0:33:27 Text: But I'm going to highlight these five, because I think they're interesting.
0:33:27 - 0:33:31 Text: A lot of them did come from Google, but it's not because, well, a lot of them involved
0:33:31 - 0:33:32 Text: Google.
0:33:32 - 0:33:37 Text: I would say many of them actually were not, they were interns at Google from various
0:33:37 - 0:33:41 Text: universities who were supervised by Google researchers and also use Google compute.
0:33:41 - 0:33:44 Text: I mean, the reason why a lot of them came from Google is because, like, frankly, like,
0:33:44 - 0:33:50 Text: other than Facebook, Google, and Microsoft, there's not really many, like, people that can,
0:33:50 - 0:33:54 Text: the companies that have the resources to train these huge state of the art models.
0:33:54 - 0:34:01 Text: And so, almost by necessity, it's going to come from one of these labs.
0:34:01 - 0:34:04 Text: So the first one was RoBERTa.
0:34:04 - 0:34:07 Text: And so this is probably the one that had, like, the least kind of new stuff.
0:34:07 - 0:34:11 Text: It was really just, and so this was University of Washington and Facebook.
0:34:11 - 0:34:13 Text: It came out not that long after BERT.
0:34:13 - 0:34:17 Text: And so what they showed was that BERT was really under-trained.
0:34:17 - 0:34:22 Text: And so basically, they took, even on the same amount of data, which was, even though I
0:34:22 - 0:34:27 Text: did 40 epochs on the data, if you do it for, like, 200 epochs, you get even better results,
0:34:27 - 0:34:29 Text: like significantly.
0:34:29 - 0:34:33 Text: So they basically trained more epochs on the same data.
0:34:33 - 0:34:36 Text: And they also showed that more data helps, which is also not super surprising.
0:34:36 - 0:34:40 Text: And they did improved masking and pre-training using a couple of tweaks to that.
0:34:40 - 0:34:44 Text: And they were able to get state of the art results, which is cool.
0:34:44 - 0:34:49 Text: And so, yeah, but that was a pretty straightforward paper.
0:34:49 - 0:34:55 Text: So the next one is XLNet, which was done by some interns from CMU when they were at Google
0:34:55 - 0:34:56 Text: Brain.
0:34:56 - 0:34:59 Text: And so this actually had some really cool changes.
0:34:59 - 0:35:04 Text: So one of them was, they used this Transformer-XL, which was actually the precursor done
0:35:04 - 0:35:09 Text: by the same people, where they were just doing language modeling tasks, not pre-training.
0:35:09 - 0:35:15 Text: But the big, one of the big innovations of transformer XL is this idea of relative position
0:35:15 - 0:35:16 Text: embeddings.
0:35:16 - 0:35:23 Text: And so with absolute position embeddings, the problem is that every word gets, like, this
0:35:23 - 0:35:26 Text: is word four, this is word five, this is word six.
0:35:26 - 0:35:28 Text: And so they are embeddings, so they do generalize.
0:35:28 - 0:35:30 Text: But in practice, there's a quadratic number of relationships.
0:35:30 - 0:35:33 Text: Like, how does word 83 relate to word 76, right?
0:35:33 - 0:35:37 Text: That's, that's, that's, and once you get bigger, like, 500, a thousand.
0:35:37 - 0:35:39 Text: Now you have a thousand squared total relationships.
0:35:39 - 0:35:42 Text: Like, you have to say, how does word 97 relate to whatever, right?
0:35:42 - 0:35:46 Text: And so that's obviously not optimal once you get to a large size.
0:35:46 - 0:35:56 Text: And so with, with relative position embeddings, you basically can say how much does dog
0:35:56 - 0:36:00 Text: attend to hot and how much should the word dog attend to the previous word.
0:36:00 - 0:36:04 Text: And then you get, and these are linear at first.
0:36:04 - 0:36:07 Text: But then you combine them and then you get a nonlinear, a conditional representation.
0:36:07 - 0:36:09 Text: And you do this in many, many layers.
0:36:09 - 0:36:12 Text: And this ends up being, so then you say, how does this contextual
0:36:12 - 0:36:14 Text: representation of dog, how much does that attend to the previous word.
0:36:14 - 0:36:16 Text: And then you kind of get, can build up.
0:36:16 - 0:36:18 Text: And so this generalizes much better for long sequences.
0:36:18 - 0:36:20 Text: So that's a cool innovation.
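The counting argument behind relative position embeddings can be checked with a few lines. This is only a sketch of the bookkeeping, not of how Transformer-XL actually parameterizes its attention; the function name `count_position_relationships` is a hypothetical helper.

```python
def count_position_relationships(seq_len):
    """Compare how many distinct position relationships each scheme sees.

    Absolute positions: every ordered (query, key) pair is its own case.
    Relative positions: only the offset key - query matters.
    """
    absolute_pairs = seq_len * seq_len
    relative_offsets = set()
    for query in range(seq_len):
        for key in range(seq_len):
            relative_offsets.add(key - query)
    return absolute_pairs, len(relative_offsets)

pairs, offsets = count_position_relationships(1000)
print(pairs, offsets)  # a thousand squared pairs versus 2 * 1000 - 1 offsets
```

"The previous word" is always offset -1 no matter where in the sequence it occurs, so what is learned at position 5 carries over to position 500.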
0:36:20 - 0:36:25 Text: And then the other one, which is specific to pre-training and not just the model itself,
0:36:25 - 0:36:27 Text: is this idea of permutation language modeling.
0:36:27 - 0:36:28 Text: So this is a little bit hard to explain.
0:36:31 - 0:36:34 Text: I think the paper explained it very formally, I guess.
0:36:34 - 0:36:40 Text: And so, but basically, there's a trick where, so in a left-to-right language model,
0:36:40 - 0:36:44 Text: every word you're predicting is based on the words to the left, right?
0:36:44 - 0:36:48 Text: But imagine that instead of predicting all the, you can basically take any permutation.
0:36:48 - 0:36:52 Text: So it's like, I'm going to predict the first word, then I'm going to predict the third word,
0:36:52 - 0:36:55 Text: then the second word, then the fourth word.
0:36:55 - 0:36:58 Text: And so, that's a totally valid way.
0:36:58 - 0:37:00 Text: And you still get a well-formed probability distribution, because it's still predicting
0:37:00 - 0:37:04 Text: one word at a time, given some permutation of the input.
0:37:04 - 0:37:07 Text: And with transformers and with attention, you can actually do this very efficiently,
0:37:07 - 0:37:09 Text: just by masking out your attention probabilities.
0:37:09 - 0:37:15 Text: And so, every single sentence you have, you can kind of sample a single permutation of this.
0:37:15 - 0:37:21 Text: And you can now, you can effectively train a bi-directional model, because this word,
0:37:21 - 0:37:23 Text: it won't be conditioned on every, still on average,
0:37:23 - 0:37:25 Text: every word will only be conditioned on half the words.
0:37:25 - 0:37:30 Text: But this word will be conditioned on, you know, all these words to the left and all these words to the right.
0:37:30 - 0:37:32 Text: And maybe it'll be missing these words, but that's fine.
0:37:32 - 0:37:35 Text: And so, you get much better sample efficiency.
0:37:35 - 0:37:38 Text: So I thought this was a really clever idea.
0:37:38 - 0:37:41 Text: And so, and this was kind of the main innovation of XLNet.
0:37:41 - 0:37:47 Text: And so, yeah, they basically get better sample efficiency, because they're able to
0:37:47 - 0:37:50 Text: do this random permutation and kind of take advantage of this.
0:37:50 - 0:37:54 Text: So this wouldn't work with LSTMs, because of this ordering, but
0:37:54 - 0:37:57 Text: because the way that masking is done in transformers, it's just,
0:37:59 - 0:38:01 Text: it's just a mask on the attention.
0:38:01 - 0:38:03 Text: So it actually ends up working very well.
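The "it's just a mask on the attention" point can be sketched as follows. This is a simplified illustration, assuming one sampled prediction order per sentence; the function name `permutation_attention_mask` is hypothetical, and the real XLNet implementation has more machinery than this single mask.

```python
def permutation_attention_mask(order):
    """Build an attention mask from a prediction order.

    order[k] is the position predicted at step k. mask[i][j] is True when
    position i may attend to position j, i.e. when j is predicted earlier
    than i in the sampled order.
    """
    n = len(order)
    step_of = [0] * n
    for step, pos in enumerate(order):
        step_of[pos] = step
    return [[step_of[j] < step_of[i] for j in range(n)] for i in range(n)]

# Predict the first word, then the third, then the second, then the fourth.
order = [0, 2, 1, 3]
mask = permutation_attention_mask(order)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

In this order the second word (index 1) sees the first word on its left and the third word on its right, which is the bidirectional context a strictly left-to-right mask never provides.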
0:38:03 - 0:38:10 Text: And so, they also got, so, yeah, the numbers, they actually ended up being pretty similar.
0:38:12 - 0:38:16 Text: But a lot of these things are hard to compare, because people change the data set and
0:38:16 - 0:38:18 Text: change the size of the model.
0:38:18 - 0:38:20 Text: So it's hard to compare apples to apples.
0:38:20 - 0:38:26 Text: But these two techniques ended up being pretty similar, but I think XLNet had more innovations in terms of technique.
0:38:26 - 0:38:33 Text: So ALBERT, it's called A Lite BERT for Self-supervised Learning.
0:38:33 - 0:38:37 Text: And so, this also had a couple of cool innovations.
0:38:37 - 0:38:41 Text: And so the idea here is really massive parameter sharing, with the idea being that,
0:38:41 - 0:38:44 Text: if you share parameters, you're not going to get a better language model, but
0:38:44 - 0:38:46 Text: you're going to get better sample efficiency.
0:38:46 - 0:38:48 Text: You're going to get less overfitting when you fine tune, right?
0:38:48 - 0:38:51 Text: Because if you have a billion parameters and you fine tune them on a 300,
0:38:51 - 0:38:55 Text: on a data set with like a thousand labeled examples, you're still going to overfit very quickly, right?
0:38:55 - 0:38:59 Text: But if you have a much smaller number of parameters, you're going to get less overfitting.
0:38:59 - 0:39:02 Text: So if you get a similarly powerful model with fewer parameters, you're going to get less overfitting.
0:39:03 - 0:39:07 Text: And so they, so there's two major innovations where,
0:39:07 - 0:39:09 Text: so instead of using a, because the word embedding table is big, right?
0:39:09 - 0:39:15 Text: Because it's the size of your vocabulary, the number of word pieces, times the hidden size.
0:39:15 - 0:39:18 Text: And so it's going to be much bigger than the hidden layer.
0:39:18 - 0:39:21 Text: So first thing is that they use a factorized embedding table.
0:39:21 - 0:39:28 Text: So if they had a hidden size of a thousand, they only use like 128 dimensional input embedding.
0:39:28 - 0:39:33 Text: And then they projected that to a thousand using a matrix.
0:39:33 - 0:39:40 Text: And so instead of having 1024 by 100,000, they would have 128 by 100,000 plus 1024 by 128,
0:39:40 - 0:39:43 Text: and you multiply these together and multiply these two matrices together.
0:39:43 - 0:39:48 Text: And then effectively you have a 1024 by 100,000 embedding matrix.
0:39:48 - 0:39:50 Text: But you have much fewer parameters.
0:39:50 - 0:39:51 Text: So you're doing parameter tying.
0:39:51 - 0:39:55 Text: Well, not, this isn't a parameter tie, but you're doing parameter reduction in a clever way.
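The parameter saving from the factorized embedding table is simple arithmetic. The sketch below uses the round numbers from the talk (a vocabulary of 100,000 word pieces, hidden size 1024, input embedding size 128); the function names are illustrative and not taken from any ALBERT code.

```python
def full_embedding_params(vocab_size, hidden_size):
    """One table mapping each word piece straight to the hidden size."""
    return vocab_size * hidden_size

def factorized_embedding_params(vocab_size, embed_size, hidden_size):
    """A small table (vocab x embed) followed by a projection (embed x hidden)."""
    return vocab_size * embed_size + embed_size * hidden_size

VOCAB, HIDDEN, EMBED = 100_000, 1024, 128

full = full_embedding_params(VOCAB, HIDDEN)
factorized = factorized_embedding_params(VOCAB, EMBED, HIDDEN)
print(full, factorized, round(full / factorized, 1))
```

Multiplying the two small matrices still yields an effective 100,000 by 1024 embedding, but it is stored in roughly an eighth of the parameters.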
0:39:56 - 0:39:58 Text: The other one is cross layer parameter sharing.
0:39:58 - 0:39:59 Text: So this is similar.
0:39:59 - 0:40:06 Text: It's simple, and it has been done in previous papers, especially Universal Transformer.
0:40:06 - 0:40:10 Text: And the idea is that you, you run a bunch of transformer layers, but all,
0:40:10 - 0:40:13 Text: let's say if you have 12 layers, all 12 layers just share the same parameters, right?
0:40:13 - 0:40:19 Text: And so that ends up, so now you can have a much bigger model
0:40:19 - 0:40:22 Text: that has fewer parameters than BERT has.
0:40:22 - 0:40:24 Text: And so you get less overfitting.
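A rough count shows why cross-layer sharing shrinks the parameter count but not the compute. The per-layer estimate of 12 times the hidden size squared (four attention projections plus a feed-forward block four times as wide, ignoring biases and layer normalization) is a standard back-of-the-envelope assumption, not a figure from the talk.

```python
def layer_params(hidden_size):
    """Approximate weights in one transformer layer (assumed 12 * H^2)."""
    attention = 4 * hidden_size * hidden_size           # Q, K, V, output
    feed_forward = 2 * hidden_size * (4 * hidden_size)  # up and down projections
    return attention + feed_forward

def stack_params(hidden_size, num_layers, shared):
    """Stored parameters for a stack, with or without cross-layer sharing."""
    per_layer = layer_params(hidden_size)
    return per_layer if shared else per_layer * num_layers

HIDDEN, LAYERS = 1024, 12
unshared = stack_params(HIDDEN, LAYERS, shared=False)
shared = stack_params(HIDDEN, LAYERS, shared=True)
layers_executed = LAYERS  # the forward pass still runs every layer either way
print(unshared, shared, layers_executed)
```

The stored weights drop by a factor of 12, yet all 12 layers still have to be executed, so the model is lighter in parameters and no faster to run.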
0:40:24 - 0:40:30 Text: And so they got state of the art compared to XLNet and RoBERTa.
0:40:30 - 0:40:33 Text: But one important thing to keep in mind is that ALBERT is light in terms of parameters,
0:40:33 - 0:40:35 Text: not in terms of speed.
0:40:35 - 0:40:47 Text: So for a, for the mix, for the model that's actually comparable to, to BERT,
0:40:47 - 0:40:53 Text: they, they actually did slightly, like, like this model and this model were about the same,
0:40:53 - 0:40:55 Text: but this one was actually slower.
0:40:55 - 0:41:01 Text: So it's only when they started making models that were much bigger in terms of compute than
0:41:01 - 0:41:05 Text: BERT, but doing more parameter tying, that they started getting good results.
0:41:05 - 0:41:11 Text: And so the, the implication of this is that like, you can, you can reduce the number of parameters,
0:41:11 - 0:41:16 Text: but still nobody's figured out how to reduce the amount of pre-training compute
0:41:16 - 0:41:19 Text: that it requires, which is, you know, kind of unfortunate.
0:41:19 - 0:41:25 Text: So the next one is T5, which is Exploring the Limits of Transfer Learning with a Unified
0:41:25 - 0:41:26 Text: Text-to-Text Transformer.
0:41:26 - 0:41:33 Text: So this was a paper by Google Brain and other groups in Google where they used just,
0:41:33 - 0:41:38 Text: they used a lot of compute and they did tons of ablation on pre-training.
0:41:38 - 0:41:41 Text: They didn't, like, their goal wasn't to come up with some, with some super clever new
0:41:41 - 0:41:42 Text: pre-training technique.
0:41:42 - 0:41:46 Text: Right, it was really just to carefully ablate every aspect, how much does model size matter,
0:41:46 - 0:41:49 Text: how much does training data matter, how much does cleanness of data matter, like, how much
0:41:49 - 0:41:53 Text: does the exact way that you do the pre-training objective matter, like doing the masking,
0:41:53 - 0:41:54 Text: like, how many spans do you mask?
0:41:54 - 0:42:00 Text: And so they wanted to kind of very clearly do the, and they also wanted to push the limits
0:42:00 - 0:42:04 Text: of size and say, what happens if we have 300 million, a billion, 10 billion parameters,
0:42:04 - 0:42:06 Text: right?
0:42:06 - 0:42:12 Text: And then, so they did tons and tons of ablation and they got state of the art on everything
0:42:12 - 0:42:14 Text: and they're still state of the art on everything.
0:42:14 - 0:42:22 Text: And the results, though, are a little bit bleak in the sense that nothing really mattered
0:42:22 - 0:42:27 Text: except making the data, like, like, all of the ablations, it wasn't like, oh, you know,
0:42:27 - 0:42:31 Text: BERT did everything perfectly, it was that, it doesn't matter, like, you could do 20%,
0:42:31 - 0:42:35 Text: 25%, you can do this fine tuning recipe, this fine tuning recipe, it's like, all that
0:42:35 - 0:42:40 Text: really matters is making the model bigger and training it on more data, and clean data.
0:42:40 - 0:42:48 Text: And so, yeah, it's a little bit of a bleak paper if you are hoping that there exists
0:42:48 - 0:42:53 Text: some pre-training technique which is super computationally efficient and also can get,
0:42:53 - 0:42:56 Text: you know, very impressive results, which I'm not saying there isn't, but, like, most
0:42:56 - 0:42:58 Text: of this evidence points in that.
0:42:58 - 0:43:04 Text: So the one kind of newest paper that is maybe the most positive in this direction is this
0:43:04 - 0:43:06 Text: paper called ELECTRA.
0:43:06 - 0:43:14 Text: And so, and so this is done by Kevin Clark from here and, uh, and Google Brain.
0:43:14 - 0:43:17 Text: And so, yeah, in this one, it's a pretty clever idea.
0:43:17 - 0:43:23 Text: So basically, the idea is instead of training, instead of training to generate the output,
0:43:23 - 0:43:25 Text: you just train it as a, as a discriminator.
0:43:25 - 0:43:30 Text: And so, you have a local language model, you have, you do some masking, you have a local
0:43:30 - 0:43:33 Text: language model which replaces it, and then you train it to discriminate whether it's
0:43:33 - 0:43:37 Text: the original one or not.
0:43:37 - 0:43:40 Text: And so, the idea here is that you are doing a much, you're, you're, you're getting better
0:43:40 - 0:43:46 Text: sample efficiency for pre-training because you're predicting every, every word, which
0:43:46 - 0:43:50 Text: is actually, I mean, I don't know exactly why it would be that different from
0:43:50 - 0:43:55 Text: BERT, because BERT still, uh, in terms of, because you don't replace with the mask
0:43:55 - 0:43:56 Text: everywhere, you also randomly corrupt it.
0:43:56 - 0:44:01 Text: But, but the, the, the biggest difference is that, um, is that these are kind of contextual
0:44:01 - 0:44:05 Text: replacements. So, it's like, when I did random masking, and replaced with a random
0:44:05 - 0:44:06 Text: word, it was truly a random word.
0:44:06 - 0:44:09 Text: So, most of the time it was completely trivial to, to tell that this was not the right word.
0:44:09 - 0:44:12 Text: You didn't necessarily know which word should be replaced, but in this case, they actually
0:44:12 - 0:44:17 Text: used an intentionally weak but still non-trivial language model to predict which word.
0:44:17 - 0:44:21 Text: So, like, this locally makes sense, the chef ate the meal, but it doesn't make any sense,
0:44:21 - 0:44:23 Text: like, a very strong model will not predict this, right?
0:44:23 - 0:44:27 Text: So, so, that's the idea, that you use a weak model to do the substitution,
0:44:27 - 0:44:29 Text: and then you train a strong model to, to, um, do this.
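The discriminator's training targets can be sketched in a few lines. The weak model is not implemented here: its output is simply written down by hand, and the original sentence with "cooked" is an assumed example chosen to match the "chef ate the meal" replacement from the talk. The function name `replaced_token_labels` is hypothetical.

```python
def replaced_token_labels(original, corrupted):
    """Label every position: 1 if the token was replaced, 0 if it is original."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]

# Assumed original sentence, and the output of a (not implemented) weak
# language model that filled the masked position with a plausible word.
original = ["the", "chef", "cooked", "the", "meal"]
corrupted = ["the", "chef", "ate", "the", "meal"]

labels = replaced_token_labels(original, corrupted)
print(labels)

# Every position gets a label, so the strong model receives a training
# signal from all tokens rather than only from the masked ones.
signals_per_sentence = len(labels)
```

Because "ate" is locally plausible, telling it apart from the original is a harder and more informative task than spotting a truly random word.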
0:44:29 - 0:44:36 Text: So, these results are, I guess it's a big table, but these results are, they're certainly
0:44:36 - 0:44:44 Text: positive with regard to, uh, previous results in terms of compute versus, um, so, like,
0:44:44 - 0:44:51 Text: for, if we compare this row, so, uh, which is, one tenth of the compute of BERT large,
0:44:51 - 0:44:53 Text: to BERT base, which is also one tenth of the compute of BERT large, it certainly does
0:44:53 - 0:44:55 Text: a lot better than BERT base.
0:44:55 - 0:45:05 Text: Um, but, when they, uh, if you, but, but in terms of state of the art models, um, when
0:45:05 - 0:45:11 Text: they do, you know, the same amount of, uh, compute as their BERT large, which is this one,
0:45:11 - 0:45:13 Text: work a better, to other state of the art models, they're not, in order to get state of
0:45:13 - 0:45:16 Text: the art, or to get similar to state of the art, they basically need to do as much compute
0:45:16 - 0:45:19 Text: as state of the art, so, like, 444x, 5.4x.
0:45:19 - 0:45:23 Text: So, I mean, at scaled-down values, they were able to do better, but this is still a
0:45:23 - 0:45:25 Text: pretty big gap, like, 4 points.
0:45:25 - 0:45:29 Text: Um, so it's positive, but it's certainly not like the silver bullet
0:45:29 - 0:45:37 Text: in terms of, uh, showing that, uh, we can, you know, pre-train models much better
0:45:37 - 0:45:39 Text: for cheaper.
0:45:39 - 0:45:44 Text: So the last thing I want to talk about is, um, how we actually serve these models,
0:45:44 - 0:45:45 Text: right?
0:45:45 - 0:45:50 Text: Because, you know, I've said that, like, they're incredibly expensive to train, and nobody
0:45:50 - 0:45:54 Text: has been able to figure out how to make that faster, but, you know, they're being used all
0:45:54 - 0:45:55 Text: over the place, right?
0:45:55 - 0:46:00 Text: So, like, uh, you know, there's news stories: Google has improved 10% of searches by
0:46:00 - 0:46:03 Text: understanding language, and then Bing says it's been applying BERT since
0:46:03 - 0:46:06 Text: April. And so this is live in Google Search and Bing Search, and these
0:46:06 - 0:46:10 Text: are, like, really low-latency services, right, that have, like, a few milliseconds of
0:46:10 - 0:46:17 Text: latency, and they serve, you know, billions of queries a day. So how are
0:46:17 - 0:46:22 Text: they doing this? Is it just, like, uh, that, you know, Google and Microsoft are spending
0:46:22 - 0:46:23 Text: billions of dollars on hardware?
0:46:23 - 0:46:25 Text: Well, they are, but not just for this, right?
0:46:25 - 0:46:30 Text: And so, like, uh, it would cost billions of dollars just to serve this
0:46:30 - 0:46:32 Text: if we were actually serving BERT.
0:46:32 - 0:46:36 Text: But instead we're using model distillation, right?
0:46:36 - 0:46:38 Text: So, this has been around for a while.
0:46:38 - 0:46:42 Text: Um, so it's, you know, called distillation or model compression.
0:46:42 - 0:46:47 Text: Uh, one of the first papers was the Model Compression paper, um, that was
0:46:47 - 0:46:52 Text: done for, uh, I forget exactly what task, and then Hinton's paper, Distilling the
0:46:52 - 0:46:55 Text: Knowledge in a Neural Network, is a more well-known version of, or not version, but
0:46:55 - 0:46:59 Text: a more well-known, uh, paper on distillation.
0:46:59 - 0:47:02 Text: But in reality, the version that we use at Google and the version that
0:47:02 - 0:47:08 Text: most people use when they say model distillation for, uh, pre-trained language models, it's a, um,
0:47:08 - 0:47:12 Text: it's a very simple technique, but it's easy to misinterpret what we mean.
0:47:12 - 0:47:17 Text: So what we do is we pre-train, we train a state of the art model, whichever is the
0:47:17 - 0:47:19 Text: one we can most afford to train, right?
0:47:19 - 0:47:22 Text: Because, of course, we can just make it bigger, but we, we set some budget of, you know,
0:47:22 - 0:47:25 Text: we want to train it for a day on some number of TPUs.
0:47:25 - 0:47:27 Text: And then, we fine-tune it, right?
0:47:27 - 0:47:29 Text: So we get a model that's the maximum accuracy, and that's our teacher model, and this is
0:47:29 - 0:47:30 Text: expensive.
0:47:30 - 0:47:34 Text: Then we have a large amount of unlabeled input, which is typical for most industry
0:47:34 - 0:47:38 Text: applications, you have unlabeled input, because you have, you know, in search, you have,
0:47:38 - 0:47:41 Text: this is what they searched for, this is what they clicked on, that's how search engines
0:47:41 - 0:47:42 Text: are trained.
0:47:42 - 0:47:48 Text: Uh, and so, you can then just take these, and you, um, and then you just label your examples
0:47:48 - 0:47:49 Text: with them.
0:47:49 - 0:47:52 Text: So you can get billions of these, uh, if you actually run a real service.
0:47:52 - 0:47:58 Text: And then you run these, you know, query answer pairs through your
0:47:58 - 0:48:03 Text: teacher, and you get a pseudo label, and you just train a much smaller model, much meaning
0:48:03 - 0:48:08 Text: like 50 times, 100 times smaller, to, uh, predict your teacher outputs.
0:48:08 - 0:48:11 Text: And so, and you can generally do this for most techniques.
0:48:11 - 0:48:17 Text: I mean, for most tasks, you can do this, uh, pretty easily, and get a huge 50-200X, uh,
0:48:17 - 0:48:19 Text: compression with no degradation.
0:48:19 - 0:48:23 Text: But the important thing to realize is that we're not compressing the pre-trained model itself.
0:48:23 - 0:48:25 Text: We haven't really had any luck doing that.
0:48:25 - 0:48:28 Text: So like, you can't actually just take BERT, and then compress it to a smaller model, which
0:48:28 - 0:48:30 Text: you can then fine-tune for all these other tasks.
0:48:30 - 0:48:34 Text: It's only after you've chosen the task, and after you've fine-tuned it for the task, that
0:48:34 - 0:48:37 Text: we were able to do it.
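The teacher/student recipe described above (label a large pool of unlabeled inputs with the fine-tuned teacher, then train a smaller model to predict the teacher's outputs) can be sketched in a few lines. This is a minimal sketch under loud assumptions: `teacher_predict` is a made-up stand-in for an expensive fine-tuned model, the inputs are random two-dimensional points rather than text, and the student is a tiny logistic regression that is not actually smaller than the toy teacher. In a real setting the student would be something like 50 to 100 times smaller than the teacher.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def teacher_predict(x):
    """Hypothetical fine-tuned teacher: returns P(positive | x)."""
    return sigmoid(3.0 * x[0] - 2.0 * x[1] + 0.5)


def distill(unlabeled, teacher, epochs=300, lr=0.5):
    """Train a small student to reproduce the teacher's outputs."""
    # Step 1: run the unlabeled inputs through the teacher to get pseudo labels.
    soft_labels = [teacher(x) for x in unlabeled]
    # Step 2: fit the student to those pseudo labels with gradient descent
    # on the cross-entropy between teacher and student probabilities.
    w, b = [0.0, 0.0], 0.0
    n = len(unlabeled)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, t in zip(unlabeled, soft_labels):
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - t  # gradient of the cross-entropy w.r.t. the student's logit
            grad_w[0] += err * x[0]
            grad_w[1] += err * x[1]
            grad_b += err
        w[0] -= lr * grad_w[0] / n
        w[1] -= lr * grad_w[1] / n
        b -= lr * grad_b / n
    return w, b


def student_predict(params, x):
    w, b = params
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)


rng = random.Random(0)
unlabeled = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(500)]
student = distill(unlabeled, teacher_predict)

# Check how often the student makes the same decision as the teacher
# on inputs that were not used for distillation.
held_out = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]
agreement = sum(
    (student_predict(student, x) > 0.5) == (teacher_predict(x) > 0.5)
    for x in held_out
) / len(held_out)
print(f"student/teacher agreement: {agreement:.2f}")
```

Note that no human labels appear anywhere in the student's training loop. The only supervision is the teacher's output on unlabeled data, which is why the approach depends on having a task-specific, already fine-tuned teacher.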
0:48:37 - 0:48:43 Text: So to show some specific results, let's say we have a BERT large
0:48:43 - 0:48:44 Text: teacher.
0:48:44 - 0:48:45 Text: So this is an Amazon book review dataset.
0:48:45 - 0:48:48 Text: So this is a paper that, I've got to cite it, but this is a paper that my group,
0:48:48 - 0:48:51 Text: uh, published, that, you know, Iulia Turc, uh, wrote.
0:48:51 - 0:48:59 Text: And so, um, this has 50,000 labeled examples and 8 million, uh, unlabeled examples.
0:48:59 - 0:49:03 Text: So you fine-tune on, you pre-train BERT large normally, like you take the pre-trained
0:49:03 - 0:49:08 Text: BERT large, you, uh, fine-tune on these 50,000 labels, you get this 88% accuracy,
0:49:08 - 0:49:09 Text: right?
0:49:09 - 0:49:15 Text: Then, uh, now let's say instead of using BERT large, you used a much smaller
0:49:15 - 0:49:16 Text: version.
0:49:16 - 0:49:20 Text: So this one's a quarter of the size, this one's, you know, uh, a 16th the size, whatever,
0:49:20 - 0:49:22 Text: this one's a hundredth of the size, right?
0:49:22 - 0:49:26 Text: So this row that's a hundredth of the size, if you were to just train it, if you
0:49:26 - 0:49:32 Text: were to pre-train this on the same Wikipedia and books data, just like BERT, and then fine-tune it,
0:49:32 - 0:49:37 Text: you would get 82% accuracy, which is, you know, a lot worse, 6%, like, 6 points absolute
0:49:37 - 0:49:39 Text: worse, which is quite a big drop, right?
0:49:39 - 0:49:45 Text: But then if you were to take this 88% teacher, label 8 million examples, which are, of
0:49:45 - 0:49:48 Text: course, held out, this is test accuracy.
0:49:48 - 0:49:55 Text: Um, and then train this classification model, which says this is a good or bad review,
0:49:55 - 0:49:59 Text: on these 8 million examples, you can take this model that's 100 times smaller and get the same
0:49:59 - 0:50:00 Text: accuracy as the teacher, right?
0:50:00 - 0:50:02 Text: You get the same 88% accuracy.
0:50:02 - 0:50:06 Text: So that's really the, uh, the cool thing with distillation is that you can get models
0:50:06 - 0:50:09 Text: that are much smaller, but you still need to train the big model in the first place.
0:50:09 - 0:50:10 Text: So it doesn't help the training cost.
0:50:10 - 0:50:11 Text: It just helps.
0:50:11 - 0:50:13 Text: It actually hurts, because you use this big model to label millions or
0:50:13 - 0:50:14 Text: billions of examples.
0:50:14 - 0:50:17 Text: So it ends up being more expensive than just training BERT, but you can actually serve
0:50:17 - 0:50:21 Text: this model at inference time for a tiny cost.
0:50:21 - 0:50:26 Text: So the question is, why does distillation work so well?
0:50:26 - 0:50:32 Text: So the big hypothesis is that language modeling is kind of the ultimate NLP task, right?
0:50:32 - 0:50:36 Text: A perfect language model is also a perfect question answering system, a perfect entailment
0:50:36 - 0:50:39 Text: system, sentiment analysis, co-reference, et cetera, right?
0:50:39 - 0:50:42 Text: Because in order to be able to, to do these things, you kind of have to be able, you
0:50:42 - 0:50:44 Text: could construct it as a language model.
0:50:44 - 0:50:49 Text: So when you're training a massive language model, you are learning many millions of latent
0:50:49 - 0:50:53 Text: features, which are effectively the same features that you need for any other task.
0:50:53 - 0:50:57 Text: And so when you're doing a simpler, a fine-tuning on a more specific task, what the
0:50:57 - 0:51:00 Text: fine-tuning is basically doing is taking these latent features, which your system happened to learn,
0:51:00 - 0:51:03 Text: and which are encoded somewhere in your weights.
0:51:03 - 0:51:06 Text: And it's kind of just tweaking these, which is why it can do it with a single pass
0:51:06 - 0:51:09 Text: over the fine tuning data.
0:51:09 - 0:51:13 Text: And so, once you figure out which parts are important, then there exists hypothetically a
0:51:13 - 0:51:17 Text: much smaller model size, which can still get the same representation and same generalization,
0:51:17 - 0:51:18 Text: right?
0:51:18 - 0:51:21 Text: So, you label a bunch of examples with this fine-tuned model.
0:51:21 - 0:51:23 Text: And now you can learn a model that can really hone in on just these features that are important.
0:51:23 - 0:51:32 Text: And so, it can, you know, it can train a model that's a 100th the size, and just
0:51:32 - 0:51:36 Text: hone in on these features if you have a lot of pseudo-labeled data.
0:51:36 - 0:51:37 Text: And that's why it works.
0:51:37 - 0:51:41 Text: And so, the evidence really is that it just doesn't work to do self-distillation, right?
0:51:41 - 0:51:46 Text: And so it must be that it's really just learning a subset of the features for most of these
0:51:46 - 0:51:49 Text: tasks.
0:51:49 - 0:51:53 Text: And so, basically every task but language modeling, we've been able to get distillation to work
0:51:53 - 0:51:54 Text: for.
0:51:54 - 0:51:58 Text: So, this includes tasks that seem really hard like question answering and search.
0:51:58 - 0:52:03 Text: So that does imply that language modeling itself, which is basically language generation
0:52:03 - 0:52:08 Text: also, because that's just a form of language modeling, is fundamentally harder than language
0:52:08 - 0:52:12 Text: understanding, which is not super hard to buy.
0:52:12 - 0:52:14 Text: Or at least it's not fundamentally harder.
0:52:14 - 0:52:18 Text: But given the state of the art, state of the art models for language understanding are
0:52:18 - 0:52:20 Text: fundamentally simpler in what they do, right?
0:52:20 - 0:52:27 Text: They're presumably just doing kind of pattern recognition, compared to models that are generating language.
0:52:27 - 0:52:33 Text: And so that's kind of why all of these classification models can kind of be distilled so well.
0:52:33 - 0:52:38 Text: So basically, in conclusion, pre-trained models work really well.
0:52:38 - 0:52:39 Text: They're very expensive.
0:52:39 - 0:52:45 Text: We know how to kind of solve this for inference time and we can do fast inference, but it
0:52:45 - 0:52:52 Text: is still unsolved how to make these fast at training time.
0:52:52 - 0:52:59 Text: And moreover, it seems like a lot of the details about algorithmic improvements for making
0:52:59 - 0:53:04 Text: the training more efficient don't seem to have a ton of benefit in terms of at least
0:53:04 - 0:53:06 Text: getting to the results.
0:53:06 - 0:53:09 Text: And it seems like a lot of choices don't really matter that much.
0:53:09 - 0:53:14 Text: And it's really just about a couple of, like, compared to just kind of the simple
0:53:14 - 0:53:15 Text: masked LM baseline.
0:53:15 - 0:53:19 Text: It's pretty hard to beat that in an apples-to-apples comparison.
0:53:19 - 0:53:24 Text: So yeah, it's a little bit unfortunate from a research perspective.
0:53:24 - 0:53:29 Text: It's definitely good for people who want to build NLP systems and who want to,
0:53:29 - 0:53:34 Text: especially domain specific NLP systems, like people who want to adapt to a medical domain
0:53:34 - 0:53:37 Text: or people who only have a tiny amount of data or people who want to do startups or they
0:53:37 - 0:53:40 Text: want to build an actual product and they only have a tiny amount of data.
0:53:40 - 0:53:41 Text: So it's definitely good from that perspective.
0:53:41 - 0:53:49 Text: But certainly, I think from the perspective of research, as I was saying, if the
0:53:49 - 0:53:53 Text: goal of research is to kind of, like, research yourself out of a job, then it is kind of,
0:53:53 - 0:53:56 Text: you know, it's a little unfortunate from that perspective.
0:53:56 - 0:54:00 Text: But I still think that there's a possibility that there's going to be a breakthrough that
0:54:00 - 0:54:07 Text: kind of shows how to do computational efficiency, and can kind of show compelling results
0:54:07 - 0:54:10 Text: that you don't need, you know, such an absurdly large model.
0:54:10 - 0:54:15 Text: Or actually, the size of the model does matter, but you don't need such an expensive model to do well.
0:54:15 - 0:54:19 Text: Maybe it'll come from sparsity, right, or something like that, where you actually do have a really
0:54:19 - 0:54:31 Text: large model, it's just sparsely activated, using some efficiency tricks or whatever.