Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 9 - Self-Attention and Transformers

0:00:00 - 0:00:15     Text: Hi everyone. Welcome to CS224N, Lecture 9, Self-Attention and Transformers. If I am not

0:00:15 - 0:00:21     Text: able to be heard right now, please send a message in the chat because I can't see anyone.

0:00:21 - 0:00:28     Text: But I'm excited to get into the content for today. We'll be talking about self-attention

0:00:28 - 0:00:36     Text: and transformers. Let us dive into the lecture plan and we'll talk about some sort of to-do's

0:00:36 - 0:00:43     Text: for the course as well. So we'll start with where we were back last week with recurrence,

0:00:43 - 0:00:48     Text: recurrent neural networks. And we'll talk about a movement from recurrence to attention

0:00:48 - 0:00:54     Text: based NLP models. We talked about attention, and we're going to just go all in on attention.

0:00:54 - 0:00:58     Text: We'll introduce the transformer model, which is a particular type of attention based model.

0:00:58 - 0:01:03     Text: It's very popular. You need to know it. You're going to learn it. We'll talk about some great

0:01:03 - 0:01:09     Text: results with transformers and then some drawbacks and variants and sort of very recent work

0:01:09 - 0:01:16     Text: on improving them. So some reminders before we jump in. Assignment 4 is due. The mid-quarter

0:01:16 - 0:01:23     Text: feedback survey is due Tuesday, February 16th. You get some small number of points for

0:01:23 - 0:01:27     Text: doing that and we really appreciate your feedback on what we've done well, what we can improve

0:01:27 - 0:01:35     Text: on. And then, final project proposal is also due. One note on the proposals. Part of

0:01:35 - 0:01:41     Text: the goal of the proposal is to, you know, or let's say the main part of the goal of the

0:01:41 - 0:01:48     Text: proposal is to give you feedback on the idea that you have presented and make sure that

0:01:48 - 0:01:57     Text: it is a viable option for a final project and make sure we kind of recenter if not. And

0:01:57 - 0:02:03     Text: so we want to get feedback to you very quickly on that. Okay. All right. So with that, let's

0:02:03 - 0:02:13     Text: start in on the content of this week's lecture. So we were in this place in NLP as of last

0:02:13 - 0:02:17     Text: week where we had recurrent neural networks sort of for a lot of things that you wanted

0:02:17 - 0:02:24     Text: to do. So it's around 2016 and the strategy if you want to build a strong NLP model is

0:02:24 - 0:02:28     Text: you have, you know, sentences that you need to encode and you have like a bidirectional

0:02:28 - 0:02:35     Text: LSTM say. And, you know, maybe it looks a little bit like this pictographically. And maybe

0:02:35 - 0:02:38     Text: it's a source sentence in a translation, for example. We saw machine translation and

0:02:38 - 0:02:43     Text: then you define your output which is maybe a sequence of words which is the target, you

0:02:43 - 0:02:48     Text: know, translation that you're trying to predict or maybe it's a parse tree or it's a summary.

0:02:48 - 0:02:56     Text: And you use an LSTM with one direction to generate it. And this works really well. We

0:02:56 - 0:03:01     Text: did, we used these architectures to do all kinds of interesting things. But one thing

0:03:01 - 0:03:05     Text: that we said, we talked about this information sort of bottleneck that you're trying to encode

0:03:05 - 0:03:10     Text: maybe a very long sequence in sort of the very last vector, or in one vector, in

0:03:10 - 0:03:16     Text: your encoder. And so we used attention as this mechanism to take, you know, a representation

0:03:16 - 0:03:21     Text: from our decoder and sort of look back and treat the encoded representations as a

0:03:21 - 0:03:25     Text: memory that we can reference and sort of pick out what's important

0:03:25 - 0:03:32     Text: at any given time. And that was attention. And this week we're going to do something slightly

0:03:32 - 0:03:38     Text: different. So we learned about sequence to sequence models, the encoder, decoder way

0:03:38 - 0:03:44     Text: of thinking about problems more or less in order to deal with this idea of, you know, building

0:03:44 - 0:03:49     Text: a machine translation system that's end to end differentiable, right? And so this

0:03:49 - 0:03:53     Text: is sort of a really interesting way of thinking about problems. What we'll do this week is

0:03:53 - 0:03:59     Text: different. We're not trying to motivate sort of an entirely new way of thinking about

0:03:59 - 0:04:03     Text: problems like machine translation. Instead, we're going to take the building blocks that

0:04:03 - 0:04:09     Text: we were using, you know, recurrent neural networks. And we're going to spend a lot of trial

0:04:09 - 0:04:13     Text: and error in the field trying to figure out if there are building blocks that just work

0:04:13 - 0:04:21     Text: better across a broad range of problems. So we'll just slot the new thing in for recurrent

0:04:21 - 0:04:27     Text: neural networks and say, you know, voila, maybe it works better. And so I want to take

0:04:27 - 0:04:33     Text: us on this sort of journey to self attention networks. And we'll start with some

0:04:33 - 0:04:37     Text: problems with recurrent neural networks. So we spent a bit of time trying to convince

0:04:37 - 0:04:45     Text: you that recurrent neural networks were very useful. Now I'm going to talk about reasons

0:04:45 - 0:04:51     Text: why they can be improved. So we know that recurrent neural networks are unrolled left to

0:04:51 - 0:04:58     Text: right, quote, in air quotes, it could be right to left as well. So what does this mean?

0:04:58 - 0:05:06     Text: A recurrent neural network encodes linear locality, right? So once I'm looking at tasty

0:05:06 - 0:05:10     Text: in this phrase, I'm about to look at pizza, or if I'm going in the other direction,

0:05:10 - 0:05:14     Text: once I look at pizza, I'm about to look at tasty. And so it's very easy for their

0:05:14 - 0:05:18     Text: meanings, for their presence in the sentence, to affect the meaning, to affect the representation

0:05:18 - 0:05:23     Text: of the other word. And this is actually quite useful because nearby words frequently

0:05:23 - 0:05:27     Text: do influence each other. That's practically one of the things we talked about with the

0:05:27 - 0:05:33     Text: distributional hypothesis as encoded by something like word-to-vec. But if words are

0:05:33 - 0:05:38     Text: distant linearly, they can still interact with each other. This is something that we saw

0:05:38 - 0:05:45     Text: in dependency parsing. So if I have, say, the phrase, the chef, notice chef bolded here,

0:05:45 - 0:05:50     Text: I'm running a recurrent neural network over this. And then the chef who, then I have this

0:05:50 - 0:05:59     Text: long sequence that I'm going to encode. And then the word was, right? Maybe it is the

0:05:59 - 0:06:06     Text: chef who was. But in between I have O of sequence length, many steps of the computation that

0:06:06 - 0:06:15     Text: I need to get to before chef and was can interact. Right? And so in the middle, things might

0:06:15 - 0:06:25     Text: go wrong. Maybe it's hard to learn things where they should interact. So in particular,

0:06:25 - 0:06:29     Text: it might be hard to learn long distance dependencies because we have gradient problems. We saw

0:06:29 - 0:06:35     Text: that LSTMs propagate gradients better than simple RNNs, but not perfectly. And so if chef and

0:06:35 - 0:06:40     Text: was are very far, it becomes hard to learn that they should interact. And the linear order

0:06:40 - 0:06:46     Text: of words is sort of baked into the model. You have to unroll the RNN throughout the sequence.

0:06:46 - 0:06:52     Text: And it's not really the right way to think about sentences necessarily, at least linear

0:06:52 - 0:07:01     Text: order isn't really how sentences are kind of structured. And so here you have chef. And

0:07:01 - 0:07:05     Text: then you've got all this sort of computation in the middle, all those applications of the

0:07:05 - 0:07:10     Text: recurrent weight matrix before you allow it to interact with was. And again, sort of

0:07:10 - 0:07:17     Text: dependence is O of sequence length. And then you got the word was. Okay? A second problem

0:07:17 - 0:07:24     Text: is very related. This is the lack of parallelizability. So this is going to be a huge refrain

0:07:24 - 0:07:29     Text: now that we've gotten to the transformers lectures: parallelizability. It's what

0:07:29 - 0:07:38     Text: you get from your GPU and it's what you want to exploit. So when you run an RNN, you have

0:07:38 - 0:07:45     Text: O of sequence length many unparallelizable operations. And so while you have a GPU that

0:07:45 - 0:07:50     Text: can kind of chunk through a bunch of independent operations at once, you're unable to sort

0:07:50 - 0:07:54     Text: of do them all at once because you have this explicit time dependence in the recurrent

0:07:54 - 0:08:01     Text: equations. In particular, you know, a future RNN state down the line can't be computed

0:08:01 - 0:08:05     Text: until you've computed one that's earlier on. And this inhibits training on very large

0:08:05 - 0:08:12     Text: data sets. So let's take a look at this, unrolling an RNN. If this is, say, the first

0:08:12 - 0:08:17     Text: layer of an RNN or an LSTM, maybe it doesn't depend on effectively anything. You can just

0:08:17 - 0:08:22     Text: compute it immediately. And then the second layer, so this is a, you know, a stacked set

0:08:22 - 0:08:30     Text: of two LSTMs. The second layer depends on the first layer here. In the time dimension

0:08:30 - 0:08:38     Text: though, this cell here depends on this. So you've got a one. And then, you know, this

0:08:38 - 0:08:42     Text: depends on this. So you've got a one. So you have, you know, at most, sorry, at least two

0:08:42 - 0:08:48     Text: things that you need to compute here before you can compute the value of this cell, likewise

0:08:48 - 0:08:53     Text: three here. And with the sequence length, it grows with O of the sequence length. So,

0:08:53 - 0:09:00     Text: so here, I have been unable to even try to compute this value when I get here because

0:09:00 - 0:09:05     Text: I had to sort of do all of this first. So I can't parallelize over the time dimension.

0:09:05 - 0:09:16     Text: And this inhibits training on very large data sets. Okay. So, and then, I guess, Chris or

0:09:16 - 0:09:20     Text: TAs, you'll be able to stop me with a question if it feels like that's the right thing to

0:09:20 - 0:09:27     Text: do. Okay. So, and you can see how it's a related problem, right? It's really directly related

0:09:27 - 0:09:33     Text: to the recurrence of the model. The thing that we thought was really useful now is being

0:09:33 - 0:09:41     Text: problematic. Okay. So, so what I'm trying to say with that is we seemingly want to replace

0:09:41 - 0:09:46     Text: recurrence as the building block itself. So, let's go through some alternatives. And we've

0:09:46 - 0:09:51     Text: seen each of these alternatives in the class so far. We'll start with word window sort

0:09:51 - 0:09:56     Text: of building blocks for our NLP models. If we wanted to replace our encoders and our decoders

0:09:56 - 0:10:04     Text: with something that sort of fit the same goal, but had different properties. So, a word window

0:10:04 - 0:10:09     Text: model will aggregate local context, right? We saw this with our sort of word window classifiers

0:10:09 - 0:10:15     Text: that we've built already. You take a local context of words, you use it to represent

0:10:15 - 0:10:19     Text: information about the center word. This is also known as one-dimensional convolution.

0:10:19 - 0:10:25     Text: We'll go over this in depth later in the course. Right now, we'll consider it as word window

0:10:25 - 0:10:32     Text: context. So, the number of unparallelizable operations with these word window building

0:10:32 - 0:10:37     Text: blocks does not grow with the sequence length. And here's a picture of that. You have the

0:10:37 - 0:10:42     Text: embedding layer, say, so you can embed every word independently. Right? So, you don't

0:10:42 - 0:10:46     Text: need to know the other words surrounding it in order to pick the right embedding dimension

0:10:46 - 0:10:52     Text: out. And so, these all have sort of zero dependence in this sort of hand-wavy notion of how

0:10:52 - 0:10:59     Text: much parallelizability there is. Now, you can stack a word window classifier on top of

0:10:59 - 0:11:06     Text: each one to build a representation of the word that takes into account its local context.

0:11:06 - 0:11:11     Text: But in order to apply it to this word, h1, I don't need to know anything. Sorry, I don't

0:11:11 - 0:11:16     Text: need to have applied it to h1 in order to apply it to h2. Likewise, in order to apply a word

0:11:16 - 0:11:22     Text: window contextualizer to ht, I can just look at its local window independently. And so,

0:11:22 - 0:11:29     Text: again, none of these have a dependence in time. I can keep stacking layers like this. Right?
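
To make the word window idea concrete, here is a minimal sketch (not from the lecture) of stacked one-dimensional convolutions standing in for word window layers; PyTorch and all sizes here are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Stand-in for stacked word window layers: each Conv1d only looks at a local
# window around every position, and every position can be computed in parallel
# (no recurrence in the time dimension).
d, T, window = 64, 10, 5                       # hidden size, sequence length, window size
x = torch.randn(1, d, T)                       # (batch, channels=d, seq_len=T)

layer1 = nn.Conv1d(d, d, kernel_size=window, padding=window // 2)
layer2 = nn.Conv1d(d, d, kernel_size=window, padding=window // 2)

h = torch.relu(layer1(x))                      # each output sees 5 nearby words
h = torch.relu(layer2(h))                      # stacking widens the reach to 9 words
print(h.shape)                                 # torch.Size([1, 64, 10])
```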

0:11:29 - 0:11:35     Text: So, this can look like an encoder, right? An encoder like our LSTM encoders. If I didn't

0:11:35 - 0:11:39     Text: allow you to look at the future by just cutting off the window, it could look like a decoder

0:11:39 - 0:11:47     Text: for language models. And this is nice. And we get this beautiful, you know, o of 1 dependence

0:11:47 - 0:11:51     Text: in time, right? No dependence at all in the time dimension. That's an improvement. But

0:11:51 - 0:11:55     Text: there are some problems. So, what about long distance dependencies, right? This is why

0:11:55 - 0:11:59     Text: we said we wanted to use recurrent neural networks because they would do better at encoding

0:11:59 - 0:12:06     Text: long distance dependencies. It's a problem, just like it was a problem before. But by stacking

0:12:06 - 0:12:12     Text: word window layers, we can get to wider, longer contexts. So, if you have some sort of

0:12:12 - 0:12:20     Text: window size, and then you stack two layers, the red states here kind of show how far away you

0:12:20 - 0:12:27     Text: can look in order to encode hk, right? So, in the embedding layer,

0:12:27 - 0:12:32     Text: we have these sort of words here. So, this is the last layer, this top layer of the

0:12:32 - 0:12:38     Text: word window classifier. Here's the embedding of hk at the output of your encoder. And so,

0:12:38 - 0:12:42     Text: it looks at, you know, the local five words, because that's the window size. And then,

0:12:42 - 0:12:48     Text: as well, you know, the farthest word over here has also looked a couple of words away,

0:12:48 - 0:12:52     Text: right? So, if you stack these, and stack these, and stack these without growing the window

0:12:52 - 0:12:57     Text: size at all at any given layer, you can look pretty far. And actually there are tricks

0:12:57 - 0:13:02     Text: you can use to look even farther. But you still have this sort of, at least in principle,

0:13:02 - 0:13:07     Text: problem, where you've got a word like this h1. And you can see how it's in blue. And,

0:13:07 - 0:13:13     Text: you know, with these two layers of the network, I don't know anything about h1 at all when

0:13:13 - 0:13:18     Text: I'm building up the representation of hk over here. I could solve that by adding another

0:13:18 - 0:13:24     Text: layer in depth, but, you know, in principle, you always have some finite receptive field. So, this

0:13:24 - 0:13:30     Text: is, you know, actually pretty useful, these, you know, word window kind of contextualizers,

0:13:30 - 0:13:34     Text: and we will learn more about them later. And there was sort of a lot of this effort that

0:13:34 - 0:13:38     Text: I talked about at the beginning of the class, was actually sort of partly deciding which

0:13:38 - 0:13:42     Text: of these to use: word window stuff, convolutional stuff it's called, or attention, and attention

0:13:42 - 0:13:48     Text: has won out for the time being. And so, yeah, what about attention? So, why could

0:13:48 - 0:13:53     Text: it be useful as, like, as a fundamental building block instead of sort of sugar on top of

0:13:53 - 0:14:01     Text: our LSTMs? So, just to recall some of the intuitions of attention, it treats a word's

0:14:01 - 0:14:08     Text: representation as a query, and it looks somewhere and tries to sort of access information from

0:14:08 - 0:14:13     Text: a set of values, right? So, we had a word representation in our decoder in our machine translation

0:14:13 - 0:14:20     Text: systems. The set of values were all of the encoder states for the source sentence.

0:14:20 - 0:14:24     Text: And today we'll think about, instead of attention from the decoder to the encoder, we will

0:14:24 - 0:14:31     Text: think about attention within a single sentence. So, just a very quick picture of it, you've

0:14:31 - 0:14:35     Text: got your embedding layer again, I'm putting the computational dependence counts here, so

0:14:35 - 0:14:40     Text: all of these sort of can be done in parallel for the embedding layer again. And now you're

0:14:40 - 0:14:44     Text: doing attention, right? So, you're kind of looking at every single word in the embedding

0:14:44 - 0:14:52     Text: layer to attend to this word. And I'm omitting a bunch of arrows here, so these are all arrows.

0:14:52 - 0:14:55     Text: All words interact with all words, and we'll get deep into this today, I promise, but I

0:14:55 - 0:15:01     Text: just wanted to sort of make this a little bit less dense looking of a graph. And then,

0:15:01 - 0:15:06     Text: so, in the second layer, again, all pairs of words interact, and this is all parallelizable.

0:15:06 - 0:15:11     Text: So, you can't parallelize in depth, right? Because you need to encode this layer before

0:15:11 - 0:15:18     Text: you can do that layer, but in time, it is parallelizable. So, it checks that box. So, again,

0:15:18 - 0:15:26     Text: we have O of one sort of computational dependence, you know, a number of unparallelizable operations

0:15:26 - 0:15:33     Text: as a function of sequence length, and as an added benefit, right? The interaction distance

0:15:33 - 0:15:39     Text: between words is O of one as well. So, whereas before, we had recurrent networks where

0:15:39 - 0:15:45     Text: if you are far, so T is the last word in the sentence, you could have O of T operations

0:15:45 - 0:15:51     Text: between you and a far away word, with attention, you interact immediately. At the very first

0:15:51 - 0:15:56     Text: layer, you get to see your far away word, and so that's O of one. And this ends up being

0:15:56 - 0:16:05     Text: seemingly fascinatingly powerful, and we'll get into a lot of details today. Okay. So,

0:16:05 - 0:16:10     Text: this is sort of why attention solves the two problems that we brought up with recurrent

0:16:10 - 0:16:16     Text: neural networks, but with our empiricist hats on, it shouldn't be proof yet that, you

0:16:16 - 0:16:19     Text: know, it should be a good building block. And in fact, it takes a little bit of thinking

0:16:19 - 0:16:25     Text: to think about how to turn attention into a building block like RNNs were. So, let's

0:16:25 - 0:16:31     Text: start by digging right into just the equations for self-attention, which again is attention

0:16:31 - 0:16:36     Text: where everything is looking within itself; we'll formalize this for you. So, we're

0:16:36 - 0:16:43     Text: going to be talking all lecture today about queries, keys, and values. Our queries are

0:16:43 - 0:16:49     Text: going to be a set of T queries, each query is a vector in dimension D. You can just think

0:16:49 - 0:16:54     Text: of them as just those vectors right now, not worrying necessarily about where they came

0:16:54 - 0:17:01     Text: from. We have a set of keys, K1 to KT. Again, each key is a vector in dimension D. And we

0:17:01 - 0:17:07     Text: have some values, V1 to VT. Each value is also going to be a vector in dimension

0:17:07 - 0:17:13     Text: D. And for now, we're going to assume that we have the same number of all of them, that's

0:17:13 - 0:17:20     Text: not necessarily the case later. So, in self-attention, the keys, queries, and values come from the

0:17:20 - 0:17:27     Text: same source of information, the same sentence, for example. And so, yeah, in practice, when

0:17:27 - 0:17:30     Text: they all come from the same sentence, right, this is going to be the same number of all of

0:17:30 - 0:17:36     Text: them. It's all going to be T. In practice, you can have the numbers differ. So, where do

0:17:36 - 0:17:41     Text: these come from? We'll get into the specifics of this later, but for now, think about the

0:17:41 - 0:17:46     Text: output of the previous layer. So, imagine the output is, you know, you have like the embedding

0:17:46 - 0:17:50     Text: layer, right, and that's the input to something that's going to do self-attention. Think of

0:17:50 - 0:17:59     Text: all of these outputs of the embeddings as some vectors x i. And now, we can just say that

0:17:59 - 0:18:05     Text: the value is equal to the key, is equal to the query, is equal to that x i. So, we're

0:18:05 - 0:18:09     Text: just going to use the same vectors for all of them. But labeling them as keys, queries,

0:18:09 - 0:18:13     Text: and values, I promise, will be very useful in how we sort of think about what's going

0:18:13 - 0:18:22     Text: on and how we look at the equations that implement this. So, self-attention pretty generally,

0:18:22 - 0:18:27     Text: but with this dot product, so, dot product self-attention, here's just the math. Math is,

0:18:27 - 0:18:33     Text: you compute key query affinities, and the dot product bit is the fact that you're using

0:18:33 - 0:18:38     Text: the dot product function here. So, you take a dot product for all pairs i and j of

0:18:38 - 0:18:47     Text: qi dotted with kj. So, that is a t by t matrix, capital T, right, by t matrix of affinities.

0:18:47 - 0:18:54     Text: Those are scalar values not bounded in size. Next, you compute the attention weights, which we

0:18:54 - 0:18:58     Text: saw as well, using the softmax function. I've just written out the softmax function

0:18:58 - 0:19:05     Text: here. So, you know, you exponentiate the affinity, and then you sum over, in this case,

0:19:05 - 0:19:10     Text: right, you're summing over all of the keys. So, you've got a given query, and you're

0:19:10 - 0:19:15     Text: summing over all the keys for the normalization. So, where should this query be looking? Remember,

0:19:15 - 0:19:22     Text: you've got t different queries that we're doing this for here. And so, for a given query,

0:19:22 - 0:19:26     Text: you sum over all the keys to get your normalization constant. Normalizing by that gives you a

0:19:26 - 0:19:33     Text: distribution over the sequence length t. So, now you have sort of a weight on all of the

0:19:33 - 0:19:41     Text: sequence indices. And again, we do our weighted average. So, we've got our weights for our

0:19:41 - 0:19:46     Text: average, and then the output, right, there's going to be one output per query. The output

0:19:46 - 0:19:53     Text: is the weights for that multiplied by the value vectors, right? So, again, if you set the

0:19:53 - 0:19:59     Text: keys, the queries, the values to all be x, this makes sense, but it's nice to have the

0:19:59 - 0:20:04     Text: names keys, queries, and values to know sort of which thing is doing what. You can think of the query

0:20:04 - 0:20:09     Text: as being sort of looking for information somewhere, the key as, you know, interacting with

0:20:09 - 0:20:13     Text: the query, and then the value is the thing, you know, that you're actually going to, you

0:20:13 - 0:20:20     Text: know, weight in your average and output. So, question you might like to answer is, so

0:20:20 - 0:20:25     Text: if now we're connecting everything to everything, how is this different to using a fully connected

0:20:25 - 0:20:33     Text: layer? That's a great question. A couple of reasons. One is that, unlike a fully connected

0:20:33 - 0:20:40     Text: layer, you don't just learn fixed interaction weights. The interaction weights are dynamic

0:20:40 - 0:20:46     Text: as a function of what the actual values here are, right? So, in a fully connected layer,

0:20:46 - 0:20:50     Text: you have these weights that you're learning slowly over the course of the training

0:20:50 - 0:20:55     Text: your network that allow you to say sort of which hidden units you should be looking at.

0:20:55 - 0:21:01     Text: In attention, it's the actual interactions between the key and the query vectors, which

0:21:01 - 0:21:06     Text: are dependent on the actual content that are allowed to vary by time. And so, the actual

0:21:06 - 0:21:10     Text: strengths of all the interactions of all the sort of attention weights, which you could

0:21:10 - 0:21:15     Text: think of as, you know, connected to the weights in the fully connected layer, are allowed

0:21:15 - 0:21:20     Text: to change as a function of the input. A separate thing is that the parameterization is much

0:21:20 - 0:21:26     Text: different. So, you're not learning an independent connection weight for all pairs of things. Instead,

0:21:26 - 0:21:34     Text: you have, you're allowed to parameterize the attention as, you know, these sort of

0:21:34 - 0:21:38     Text: dot product functions between vectors that are representations. And you end up having,

0:21:38 - 0:21:44     Text: you know, the parameters work out more nicely, which we'll see later. We haven't gone into

0:21:44 - 0:21:48     Text: how we're parameterizing these functions yet. So, those are the two answers, I'd say,

0:21:48 - 0:21:54     Text: is one is you have this sort of dynamic connectivity. And two is, you know, you don't have just,

0:21:54 - 0:21:59     Text: it has this inductive bias that's not just connect everything to everything feed forward.

0:21:59 - 0:22:09     Text: Great. Okay, I think that's a very interesting question. Yeah, so I'm glad you asked it.
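
As a concrete companion to the equations above, here is a minimal PyTorch sketch (not from the lecture) of dot-product self-attention with keys, queries, and values all set equal to x, as described; the shapes are illustrative assumptions:

```python
import torch

T, d = 6, 8                          # sequence length and hidden dimension (illustrative)
x = torch.randn(T, d)                # one vector x_i per word, e.g. the embedding layer output

q, k, v = x, x, x                    # for now, queries = keys = values = x

e = q @ k.T                          # e_ij = q_i . k_j, a (T, T) matrix of affinities
alpha = torch.softmax(e, dim=-1)     # each row is a distribution over all T keys
output = alpha @ v                   # one weighted average of the values per query: (T, d)
```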

0:22:09 - 0:22:14     Text: Okay, so, we've talked about self-attention now, the equations going into self-attention.

0:22:14 - 0:22:19     Text: But, can we just like use this as a building block? I mean, you know, take all of your LSTMs,

0:22:19 - 0:22:22     Text: throw them out. Use the self-attention that we've just defined instead. Why not? Well,

0:22:22 - 0:22:28     Text: here's a couple of reasons why. So, look at self-attention as a building block. So, we

0:22:28 - 0:22:35     Text: have some words in the sentence, the chef who, some stuff, long sentence, food is the last

0:22:35 - 0:22:41     Text: word of the sentence. Okay. And, you know, they have an embedding. And from that, you get

0:22:41 - 0:22:46     Text: your key, query, and value. We've said so far, right, they're the same vector actually, but,

0:22:46 - 0:22:51     Text: you know, key query value, key query value, key query value. And, you know, we might stack

0:22:51 - 0:22:55     Text: them like LSTM layers. So, you have key query value, perform self-attention on the key

0:22:55 - 0:23:00     Text: queries and values. As we said, self-attention is a function on key queries and values.

0:23:00 - 0:23:05     Text: So, perform self-attention now that you have these, get new key queries values, and then

0:23:05 - 0:23:10     Text: perform self-attention again. Look, you know, this looks a lot like, a lot like stacking

0:23:10 - 0:23:15     Text: LSTMs. But, it actually has a few issues as it stands. So, we're going to need to go on

0:23:15 - 0:23:19     Text: a journey to determine what's missing from our self-attention. And the first thing is

0:23:19 - 0:23:27     Text: that self-attention is an operation on sets. Okay. So, for the equations that we had before,

0:23:27 - 0:23:34     Text: the self-attention equation never referred to the indices of K, Q, or V, except to sort

0:23:34 - 0:23:38     Text: of say which pairs are interacting with each other. It doesn't know what the order

0:23:38 - 0:23:42     Text: of your sentence is. When it's computing, though, the weights, it has no idea. And so,

0:23:42 - 0:23:48     Text: if I were to input this sentence, the chef who food, it would be the same as if I just

0:23:48 - 0:23:54     Text: swapped the with chef and then swapped who with the, and it would have no idea.

0:23:54 - 0:23:58     Text: So, already this is not going to work because the order in which words appear in sentences

0:23:58 - 0:24:03     Text: matters. So, here's the first problem that we need to work with. So, I'm going to have

0:24:03 - 0:24:07     Text: a list of barriers. This is just the first; we have a whole journey ahead of us. And then

0:24:07 - 0:24:12     Text: we're going to have a list of solutions. So, we need to represent the sequence order somehow.

0:24:12 - 0:24:16     Text: We can't just lose that information entirely because we wouldn't know what order the words

0:24:16 - 0:24:22     Text: showed up in. So, somehow, if we're not going to change the self-attention equations

0:24:22 - 0:24:28     Text: themselves, we need to encode the order in the keys, queries, and values, and let the

0:24:28 - 0:24:35     Text: network sort of figure it out on its own. So, think about this. We have T sequence indices.

0:24:35 - 0:24:40     Text: And we're going to bound T to some finite constant. So, T is never going to be bigger

0:24:40 - 0:24:45     Text: than something for us. And we call it T. And now we're going to represent the sequence

0:24:45 - 0:24:51     Text: index as a vector. So, PI is going to be the vector representing index I. And it's going

0:24:51 - 0:24:55     Text: to be in dimensionality D just like our keys, queries, and values. And so, we're going

0:24:55 - 0:25:01     Text: to have one of these for one to T. So, don't worry yet about what the PI are like, how

0:25:01 - 0:25:06     Text: they're constructed. We'll get right into that. But think about this. It's easy to incorporate

0:25:06 - 0:25:11     Text: this information into our attention building blocks. At the first layer, if you let tilde

0:25:11 - 0:25:19     Text: V tilde K tilde Q be our old values keys and queries, we can just add. We could do other

0:25:19 - 0:25:25     Text: stuff too. But in practice, we just add. So, VI is equal to V tilde I, our orderless value

0:25:25 - 0:25:33     Text: vector plus PI. So, this might be your embedding vector. And then you add the index that it's

0:25:33 - 0:25:39     Text: at to its vector. And you might only do this at the first layer of the network, for example.

0:25:39 - 0:25:42     Text: So you do the same thing for the query and the key. So, this is something that you could

0:25:42 - 0:25:48     Text: do. In practice, you do something slightly different. But this is something that, you know,

0:25:48 - 0:25:55     Text: now it knows the order of the sequence. Because if these PI's, you've set properly somehow,

0:25:55 - 0:26:00     Text: then now the network is able to figure out what to do with it. So, what's one way of actually

0:26:00 - 0:26:08     Text: making this happen? One way of making this happen is through the concatenation of

0:26:08 - 0:26:12     Text: sinusoids. And this was an interesting take when the first transformer paper came out,

0:26:12 - 0:26:19     Text: they used this method. So, let's dig into it. So, you have varying wavelengths of

0:26:19 - 0:26:24     Text: sinusoidal functions in each of your dimensions. So, in the first dimension, if you have this

0:26:24 - 0:26:29     Text: sine function with a given period, and then this cosine function with a given period,

0:26:29 - 0:26:34     Text: and then sort of dot, dot, dot, you sort of change the periods until you get to much

0:26:34 - 0:26:38     Text: different periods. And what does it look like? It looks like that. So, imagine here,

0:26:39 - 0:26:45     Text: in the vertical axis, we've got the dimensionality of the network. So, this is D,

0:26:46 - 0:26:51     Text: and then this is sequence length. And by just specifying, you know, in each row,

0:26:51 - 0:26:55     Text: is sort of one of these sines with different frequencies.

0:26:56 - 0:27:01     Text: Right? And you can sort of see how this is encoding position. These things have different values

0:27:01 - 0:27:07     Text: at different indices, and that's pretty cool. I don't really know how they thought of it

0:27:07 - 0:27:12     Text: immediately. But one cool thing about it is this periodicity notion. Right? The fact that

0:27:12 - 0:27:17     Text: the sinusoids have periods that might be less than the sequence length indicates that maybe the

0:27:17 - 0:27:24     Text: absolute position of a word isn't so important. Right? Because if the period is less than the sequence

0:27:24 - 0:27:28     Text: length, you lose information, maybe about where you are. Of course, you have the concatenation of

0:27:28 - 0:27:34     Text: many of them. So, that's a pro. Maybe it can extrapolate to longer sequences, because again,

0:27:34 - 0:27:39     Text: you sort of have this repetition of values, right? Because the periods will, when they,

0:27:39 - 0:27:43     Text: when they complete, you'll see that value again. The cons are that it's not learnable. I mean,

0:27:43 - 0:27:48     Text: this is cool, but you can't, there's no learnable parameters in any of this. And also,

0:27:48 - 0:27:52     Text: the extrapolation doesn't really work. So, this is interesting, and it's definitely still done.
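
Here is a rough sketch (not from the lecture) of how the sinusoidal position vectors can be built, loosely following the original Transformer recipe; the dimensionality, the maximum length, and the 10000 base are assumptions for illustration:

```python
import torch

d, max_len = 8, 50                                   # illustrative sizes
pos = torch.arange(max_len).float().unsqueeze(1)     # (max_len, 1) positions 0..max_len-1
dims = torch.arange(0, d, 2).float()                 # the even dimensions
freqs = 1.0 / (10000 ** (dims / d))                  # wavelengths vary across dimensions

P = torch.zeros(max_len, d)
P[:, 0::2] = torch.sin(pos * freqs)                  # sines in the even dimensions
P[:, 1::2] = torch.cos(pos * freqs)                  # cosines in the odd dimensions
# P[i] is the fixed (non-learnable) vector p_i added to word i's representation.
```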

0:27:52 - 0:27:57     Text: But what's done more frequently now is we, you know, what do we do? We learn the position

0:27:57 - 0:28:05     Text: representations from scratch. So, we're going to learn them from scratch. So,

0:28:05 - 0:28:09     Text: let all the PI just be learnable parameters. So, what we're going to do is we're going to have a matrix

0:28:09 - 0:28:15     Text: P. That's going to be in dimensionality D, dimensionality of our network again, by the sequence length.

0:28:15 - 0:28:22     Text: So, this is just a big matrix, right? Of the size here, of this size effectively, D by sequence length.

0:28:24 - 0:28:27     Text: But every single value in that matrix is just a learnable parameter.

0:28:28 - 0:28:32     Text: Pros: flexibility. Now, you get to learn what position is sort of supposed to mean,

0:28:32 - 0:28:39     Text: according to your data, end to end. So, that's cool. Cons, you definitely can't extrapolate to

0:28:39 - 0:28:44     Text: indices outside 1 to T, right? Because you set the size of this parameter matrix at the beginning,

0:28:44 - 0:28:48     Text: and you learned them all. Now, if you want to go beyond position T, you know, you just,

0:28:48 - 0:28:54     Text: you have no way to represent it effectively. But most systems use this. This is super useful.
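
A minimal sketch (not from the lecture) of learned absolute position representations: a matrix of learnable parameters with one vector per index, added to the word embeddings at the input layer. PyTorch and the sizes are assumptions:

```python
import torch
import torch.nn as nn

vocab_size, d, max_len = 1000, 64, 128          # illustrative sizes
word_emb = nn.Embedding(vocab_size, d)          # word embedding matrix
pos_emb = nn.Embedding(max_len, d)              # the learnable position matrix P

tokens = torch.randint(0, vocab_size, (1, 10))  # (batch, T) token indices
positions = torch.arange(tokens.shape[1]).unsqueeze(0)
x = word_emb(tokens) + pos_emb(positions)       # inputs now carry word identity and order
# Note: positions beyond max_len cannot be represented, which is the con mentioned above.
```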

0:28:54 - 0:28:59     Text: And sometimes people try more flexible representations of position, because again, the absolute

0:28:59 - 0:29:07     Text: index of a word is not sort of its natural representation of its position in the sentence.

0:29:07 - 0:29:11     Text: And so, people have looked at the kind of the relative position between words, as well as

0:29:11 - 0:29:16     Text: position representations that depend on syntax. But we're not going to be able to go too far into

0:29:16 - 0:29:22     Text: those today. Okay, so that was problem 1, right? We just, no matter what we did, if we didn't have

0:29:22 - 0:29:27     Text: representation of position, there was no way we could use self-attention as our new building block.

0:29:27 - 0:29:30     Text: And we've solved it with position representations that we just sort of add to the inputs.

0:29:32 - 0:29:37     Text: Next, we're going to see this problem that you don't have non-linearities. You know,

0:29:37 - 0:29:44     Text: even saying non-linearities: abstract features, they're great; deep learning, end-to-end learning

0:29:44 - 0:29:51     Text: of representations is awesome. But right now, we're just doing weighted averages. And so, what is

0:29:51 - 0:29:56     Text: our solution going to be? I mean, it's not going to be all that complex. So, all we're doing right

0:29:56 - 0:30:02     Text: now is re-averaging vectors, right? So, you've got sort of the self-attention here. And if you

0:30:02 - 0:30:08     Text: just stacked another one, you just keep sort of averaging projections of vectors. But what if we just

0:30:08 - 0:30:14     Text: add a feed-forward network for every individual word? So, within this layer, each of these feed-forward

0:30:14 - 0:30:21     Text: neural networks shares parameters. But it takes in just the output of self-attention for this word

0:30:21 - 0:30:28     Text: as we defined it, processes it, and emits something else. And so, you know, you have output i

0:30:28 - 0:30:34     Text: from self-attention, which we saw slides ago. Apply, you know, a feed-forward layer where you take

0:30:34 - 0:30:41     Text: the output, multiply by matrix, you know, non-linearity, other matrix. And the intuition here, you can think

0:30:41 - 0:30:47     Text: of at least is, well, you know, something like the feed-forward network processes the result

0:30:47 - 0:30:53     Text: of the attention for each thing. But more fundamentally, right, you needed some kind of non-linearity

0:30:53 - 0:30:59     Text: there. And, you know, a feed-forward network will do a good job. Okay, so that's another problem

0:30:59 - 0:31:05     Text: solved. Easy fix. Add a feed-forward network, get your non-linearity. Now, your self-attention

0:31:05 - 0:31:10     Text: output, you can sort of process it, have that sort of depth increasing as the layers of the network

0:31:10 - 0:31:18     Text: increase, which we know is useful. Another problem. Okay, so bear with me on this one. We don't want

0:31:18 - 0:31:22     Text: to look at the future when we're doing language modeling. Right, so language modeling, you're trying

0:31:22 - 0:31:29     Text: to predict words in the future. And with the recurrent model, it's very natural, right? Like, you just

0:31:29 - 0:31:36     Text: don't unroll it further. Once you've unrolled your LSTM to a given word,

0:31:36 - 0:31:41     Text: there's sort of no way to have given it the next word as well. But in self-attention, we'll see that

0:31:41 - 0:31:46     Text: this is a little bit trickier. So, we can't cheat and look at the stuff we're trying to be predicting,

0:31:46 - 0:31:49     Text: because then we would train networks that were totally useless. So what are we going to do?

0:31:49 - 0:31:55     Text: We're going to mask, masking is a word that's going to keep coming up. We're going to mask the future

0:31:55 - 0:32:00     Text: in self-attention. So, in particular, this is important when we have decoders. Right, one of the

0:32:00 - 0:32:05     Text: reasons why we could use bidirectional LSTMs in our encoders was that we could see the whole source

0:32:05 - 0:32:09     Text: sentence in neural machine translation. But when we're predicting the output sentence, right, we

0:32:09 - 0:32:15     Text: can't see the future if we want to train the model to do the actual prediction. So, to use self-attention

0:32:15 - 0:32:21     Text: in a decoder, you can mask the future. One thing that you could do is you could just every time

0:32:21 - 0:32:28     Text: you compute attention, you change the set of keys and values

0:32:28 - 0:32:33     Text: to only include past words. So, you're sort of dynamically changing the stuff that you're attending

0:32:33 - 0:32:39     Text: over. But that doesn't let us do stuff with tensors as parallelizably, as we will see.

0:32:39 - 0:32:45     Text: So, we don't want to do that. Instead, we're going to mask out the future words through the

0:32:45 - 0:32:50     Text: attention weights themselves. So, in math, don't worry, we'll get to the sort of diagram. But in math,

0:32:50 - 0:32:56     Text: we had these attention scores and they were equal to just this dot product before for all pairs.

0:32:58 - 0:33:07     Text: Right, but now, only if the key index is strictly less than the

0:33:07 - 0:33:15     Text: query index, so this would be j less than i, do we let the network look at the word, and

0:33:15 - 0:33:19     Text: it should be negative infinity otherwise. So, we don't let you look at the future. So,

0:33:19 - 0:33:25     Text: let's go to the picture. For encoding the words that we'll see here, so maybe we have a start token.

0:33:27 - 0:33:31     Text: You want to decide, this is your whole sentence now. You want to decide which words in the sentence

0:33:31 - 0:33:35     Text: you're allowed to look at when making your predictions. So, you're allowed to look at the

0:33:35 - 0:33:41     Text: first word and in order to predict the, I'm not allowed to look at the word the, I'm also not

0:33:41 - 0:33:47     Text: allowed to look at any of the future words, I am allowed to look at the word start. So, this

0:33:47 - 0:33:53     Text: kind of block is not shaded here. In order to predict the word chef, I can look at start and

0:33:53 - 0:33:59     Text: the, right, start the, but not chef naturally or the word that comes after it. So, we're not

0:33:59 - 0:34:07     Text: allowed to look at the word chef or the words that come after it. Likewise for the other words.

0:34:07 - 0:34:13     Text: So, you can see this sort of, this matrix here, right. So, we just want to make sure that our

0:34:13 - 0:34:20     Text: attention weights are zero everywhere here. So, in the affinities calculation, we add negative

0:34:20 - 0:34:29     Text: infinity to all of these in this big matrix. And that guarantees that we can't look to the future.
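
Here is a minimal sketch (not from the lecture) of masking the future by adding negative infinity to the disallowed entries of the score matrix before the softmax. The shapes are illustrative, and this version follows the common convention of letting each position attend to itself and the past:

```python
import torch

T, d = 5, 8
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

scores = q @ k.T                                       # (T, T) affinities e_ij
future = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(future, float('-inf'))     # -inf wherever the key is in the future
alpha = torch.softmax(scores, dim=-1)                  # future positions get zero weight
output = alpha @ v                                     # position i only uses positions <= i
```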

0:34:29 - 0:34:36     Text: Okay, so now we can do big matrix multiplications to compute our attention as we will see. And we sort of

0:34:36 - 0:34:41     Text: don't worry about looking at the future because we've added these negative infinities. And that's the last,

0:34:41 - 0:34:50     Text: that's the last problem with self-attention sort of that comes up fundamentally as like what do we need for this building block.

0:34:50 - 0:34:56     Text: You have, you didn't have an inherent notion of order. Now you have a good notion of order or at least something of

0:34:56 - 0:35:04     Text: a notion of order. You didn't have nonlinearities, add feed forward networks. And then you didn't want to look at the future.

0:35:04 - 0:35:14     Text: You add the masks for the decoders. So, you know, self-attention is the basis of any self-attention based building block.

0:35:14 - 0:35:21     Text: Position representations are useful. Nonlinearities are good. You don't have to use a feed forward network.

0:35:21 - 0:35:28     Text: Right? Like you could have just done other stuff, I guess. But, you know, in practice actually it's really easy to parallelize these feed forward networks as well.

0:35:28 - 0:35:38     Text: So we end up doing that. And then the masking, you know, yeah, you don't want information to leak from the future to the past in your decoder.

0:35:38 - 0:35:47     Text: So, so let me be clear. We haven't talked about the transformer yet. But this is all you would need if you were thinking like, gosh,

0:35:47 - 0:35:53     Text: what do I need in order to build my self-attention building block. We'll see that there are a lot more details in the transformer that we're going to spend,

0:35:53 - 0:36:05     Text: the rest of the lecture going through. But I want you to sort of at least, as you're thinking about what's going to come next after the transformer and how you're going to invent it,

0:36:05 - 0:36:12     Text: think about the fact that these are the things that were necessary. And then the other things end up being very, very important, it turns out.

0:36:12 - 0:36:19     Text: But, you know, there's a lot of design space here that hasn't been explored yet.
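
As a companion to the recap above, here is a minimal sketch (not from the lecture) of the position-wise feed-forward fix for the missing non-linearity: the same small network applied independently to each word's self-attention output. The hidden size of 4 times D is a common choice, not something stated here:

```python
import torch
import torch.nn as nn

d = 64                                # model dimension (illustrative)
ff = nn.Sequential(
    nn.Linear(d, 4 * d),              # multiply by a matrix
    nn.ReLU(),                        # non-linearity
    nn.Linear(4 * d, d),              # multiply by another matrix
)

attn_output = torch.randn(10, d)      # (T, d): one self-attention output per word
h = ff(attn_output)                   # the same parameters applied to every position in parallel
```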

0:36:19 - 0:36:31     Text: Okay, so let's talk about the transformer model. And I'm going to pause: if there are any questions, this is a good time, I can take them now.

0:36:31 - 0:36:41     Text: Okay. So, transformers. Let's get to it. Let's look at the transformer encoder decoder blocks at a high level first.

0:36:41 - 0:36:51     Text: This should look a lot like the encoder decoders that we saw with the recurrent neural network machine translation systems that we saw.

0:36:51 - 0:37:00     Text: Okay, so we have our word embeddings. We're going to add in our position representations. We saw that. And that's from our input sequence.

0:37:00 - 0:37:09     Text: We'll have a sequence of encoder blocks. Each of them is called a transformer encoder. And then, you know, we have our output sequence, word embeddings, position representation again.

0:37:09 - 0:37:21     Text: We have a transformer decoder. And the last layer of the encoder is going to be used in each layer of the transformer decoder.

0:37:21 - 0:37:31     Text: And then we get some outputs, some predictions. Okay, so this looks pretty much the same at a very high level. Maybe minus the fact that now we need to do the position representation addition at the very beginning.

0:37:31 - 0:37:39     Text: So now let's look at these blocks themselves. So the encoder and decoder blocks, what's left that we haven't covered, right?

0:37:39 - 0:37:45     Text: Because we could just put the building blocks that we just came up with in the first part of class in these things, right?

0:37:45 - 0:37:55     Text: In encoders, we need our self attention, our feed forward networks. We have our position representations. We get the masking for the decoders. Right, we can just slot these in.

0:37:55 - 0:38:03     Text: But it turns out they wouldn't work all that well compared to transformers. So what's left? So the first thing is key query value attention.

0:38:03 - 0:38:16     Text: This is a specific way of getting the K, Q, and V vectors from the single word embedding. Right? So instead of letting K, Q, and V equal to X, like the output from the last layer, we're going to do something a little bit more.

0:38:16 - 0:38:27     Text: Next is multi-headed attention. We're going to attend to multiple places in a single layer. And we'll see that that gets us sort of kind of interesting properties in the homework later on.

0:38:27 - 0:38:36     Text: But we'll talk a little bit about it today. And then there's a bunch of things that just help with training. These seemed like they were very hard to train at first. A lot of these tricks are very useful.

0:38:36 - 0:38:47     Text: So we'll talk about residual connections, layer normalization, and scaling the dot product. Everything in bullet point three here is tricks to help with training. They don't improve what the model is able to do.

0:38:47 - 0:38:55     Text: But they're crucial in that they improve the training process. So modeling improvements of both kinds are really, really important.

0:38:55 - 0:39:02     Text: So it's good that we're using self attention, which is this cool thing that had these properties. But if we couldn't train it, it wouldn't be useful.

0:39:02 - 0:39:12     Text: OK, so here's how the transformer builds the key query and value vectors.

0:39:12 - 0:39:22     Text: We have X1 to Xt, the input vectors to our transformer layer. And we're going to define the transformer encoder here.

0:39:22 - 0:39:33     Text: So we have one of these vectors per word, you can say. And again, each XI is going to be a vector in dimensionality D. And here's how we compute the keys, queries, and values.

0:39:33 - 0:39:42     Text: We're going to let each key, KI, which we saw before, be equal to some matrix K times XI, where K is D by D.

0:39:42 - 0:39:52     Text: So this is a transformation from dimensionality D to dimensionality D. We're going to call this the key matrix K. And we're going to do the same thing for the queries.

0:39:52 - 0:39:59     Text: So we're going to take the XI, multiply by matrix, get the query vector, and we'll do the same thing for V.

0:39:59 - 0:40:17     Text: OK, so you can just plug this in. Now instead of saying that all the K, Q, and the V are all the same as X, they all are slightly different because you apply a linear transformation. What does this do? Well, you can think about it as like, well, the matrices K, Q, and V can be very different from each other.

0:40:17 - 0:40:32     Text: And so they sort of emphasize or allow different aspects of the X vectors to be used in each of the three roles. So we wrote out the self-attention equations with the three roles to indicate that different things are being done with each of them.

0:40:32 - 0:40:40     Text: So maybe K and Q are helping you figure out where to look. And so they should be a certain way. They should look at different parts of X.

0:40:40 - 0:40:49     Text: And then V, the value, maybe you want to pass along a different information than the thing that actually helps you access that information.

0:40:49 - 0:41:03     Text: So this is important. How do we do this? In practice, we compute it with really big tensors. So we had our X vectors, which we've been talking about sort of word by words. We had the sequence XI, X1 to XT.

0:41:03 - 0:41:13     Text: So we actually represent them all as a matrix X, which is in our sequence length by dimensionality. So sequence length by D, capital T by D.

0:41:13 - 0:41:28     Text: And now if we have the matrix for each of our key query and value, right, we're going to apply, like we're going to look at these things XK, XQ, and XV, which are all of the same dimensionality as X because of the D by D transformations.

0:41:28 - 0:41:35     Text: So how do we compute self-attention? We have our output tensor, which is the same dimensionality as the input X.

0:41:35 - 0:41:43     Text: This is going to be equal to softmax, or as a softmax, of this matrix multiplication, which we'll get into, times the value vectors.

0:41:43 - 0:41:49     Text: So the matrix multiplication here is computing affinities between keys and queries we'll see. And then here's our averaging.

0:41:49 - 0:41:57     Text: What does that look like pictorially? So you take the key query dot products. So this term here, XQ, XK transpose.

0:41:57 - 0:42:06     Text: It's giving you all dot products, all T by T pairs of attention scores. So our Eij are in this matrix right here. It's T by T.

0:42:06 - 0:42:16     Text: And this is just a big matrix multiplication. So you do the matrix multiplication, XQ, and then XK, and then you get all of the dot products by through this.

0:42:16 - 0:42:30     Text: So now you have this big T by T set of scores. That's what we wanted. And now you can softmax that directly as a matrix. And then do a matrix multiplication here with XV in order to give your output vector.

0:42:30 - 0:42:39     Text: So this is actually doing the weighted average that we saw at the beginning of the class. And this, you know, there's no for loops here; it's really beautifully vectorized.

0:42:39 - 0:42:58     Text: And it gives us our output, which again, remember, is the same dimensionality, T by D. Okay, so: all pairs of attention scores, then compute the averages by applying the softmax of the scores to the XV matrix.
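
Here is a minimal PyTorch sketch (not from the lecture's code) of the vectorized key-query-value attention just described, softmax(XQ (XK)^T) XV; the sizes are illustrative, and the scaling of the dot product mentioned later as a training trick is left out here:

```python
import torch
import torch.nn as nn

T, d = 10, 64                             # sequence length and model dimension (illustrative)
X = torch.randn(T, d)                     # all T input vectors stacked into one matrix

K = nn.Linear(d, d, bias=False)           # the key matrix (D by D)
Q = nn.Linear(d, d, bias=False)           # the query matrix
V = nn.Linear(d, d, bias=False)           # the value matrix

XQ, XK, XV = Q(X), K(X), V(X)             # each is (T, d)
scores = XQ @ XK.T                        # (T, T): all pairs of key-query affinities
alpha = torch.softmax(scores, dim=-1)     # row-wise softmax over the keys
output = alpha @ XV                       # weighted averages; same shape as X, (T, d)
```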

0:42:58 - 0:43:11     Text: So that's it. That's it for key query value attention. That's how, you know, we implemented with tensors. Next we'll look at the kind of the next thing that ends up being quite important for training transformers in practice, which is multi-headed attention.

0:43:11 - 0:43:19     Text: So, transformer encoder, multi-headed attention. So the question is, what if we want to look at multiple places in the sentence at once?

0:43:19 - 0:43:28     Text: It's possible to do that, you know, with self attention, with normal self attention. But think about this. Where do you end up looking in self attention?

0:43:28 - 0:43:43     Text: You end up looking where the dot products of XI, your q matrix transpose your key matrix XJ is high. So those are sort of the XIJ pairs, those are the IJ pairs that end up interacting with each other.

0:43:43 - 0:43:52     Text: But maybe for some query, for some word, you want to focus on different other words in the sentence for different reasons.

0:43:52 - 0:44:03     Text: The way that you can encode this is by having multiple query, key, and value matrices, which all encode different things about the XI. They are all different learned transformations.

0:44:03 - 0:44:18     Text: So instead of a single Q, a single K, and a single V, what we get are a Q sub L, K sub L, V sub L, all of a different dimensionality now. Their dimensionality is D by D over H, where H is the number of heads.

0:44:18 - 0:44:27     Text: So they're going to still apply to the X matrix, but they're going to transform it to a smaller dimensionality D over H.

0:44:27 - 0:44:41     Text: And then each attention head is going to perform attention independently. It's like you just did it a whole bunch of times. Right? So output L is equal to the softmax of, you know, your Q and K term, but now with the sub-L matrices.

0:44:41 - 0:44:58     Text: That then multiplies X VL, and now you have sort of these indexed outputs. And in order to sort of have the output dimensionality be equal to the input dimensionality and sort of mix things around, combine all the information from the different heads, you concatenate the heads.

0:44:58 - 0:45:17     Text: So output 1 through output H, stack them together. Now the dimensionality of this is equal to the dimensionality of X again. And then we use a learned matrix Y in order to sort of do the mixing; Y is D by D. And that's the output of multi-headed attention, multi-headed self-attention.

0:45:17 - 0:45:33     Text: And so each head gets to look at different things, right? Because the linear transformations can, you could say, focus on different parts of the X vectors. And the value vectors also get to be different as well.

0:45:33 - 0:45:51     Text: So pictorially, this is what we had before with single headed attention. You had X multiplied by Q in order to get XQ. And what's interesting, and you can see this from this diagram, I think, is that multi headed attention doesn't necessarily have to be more work.

0:45:51 - 0:46:10     Text: And we saw that the Q, K, and V matrices in multi headed attention have a lower output dimensionality. So here's two of them right here, Q1 and Q2, together the same size as Q. And then you get outputs XQ1 and XQ2.

0:46:10 - 0:46:23     Text: And so you're effectively doing the same amount of computation as before. But now you're sort of doing, you have different attention distributions for each of the different heads. This is pretty cool.
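
Here is a per-head sketch (not from the lecture) of multi-headed self-attention: each head gets its own lower-dimensional Q, K, V matrices of size D by D over H, the head outputs are concatenated, and a learned D-by-D matrix (the Y in the lecture) mixes them. Real implementations usually do this with one reshaped matrix multiply; the loop is just for clarity:

```python
import torch
import torch.nn as nn

T, d, h = 10, 64, 8                        # sequence length, model dim, number of heads
d_head = d // h
X = torch.randn(T, d)

Q = [nn.Linear(d, d_head, bias=False) for _ in range(h)]   # Q_l, each D by D/H
K = [nn.Linear(d, d_head, bias=False) for _ in range(h)]   # K_l
V = [nn.Linear(d, d_head, bias=False) for _ in range(h)]   # V_l
Y = nn.Linear(d, d, bias=False)                            # output mixing matrix

heads = []
for l in range(h):
    scores = Q[l](X) @ K[l](X).T           # (T, T) affinities for head l
    alpha = torch.softmax(scores, dim=-1)
    heads.append(alpha @ V[l](X))          # (T, d/h) output of head l

output = Y(torch.cat(heads, dim=-1))       # concatenate all heads, then mix: (T, d)
```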

0:46:23 - 0:46:43     Text: So those are the main modeling differences, right? We did key query value attention. That's how we got the keys, queries, and values from the X vectors. And we saw how to implement that with the matrices that we're looking at.

0:46:43 - 0:47:07     Text: And then we looked at multi headed attention, which allows us to look in different places in the sequence in order to have more flexibility within a given layer. Now we're going to talk about our training tricks. These are really important, it turns out. And so, yeah, maybe thinking about them, I think, is something that we don't do enough in the field. And so let's really walk through them.

0:47:07 - 0:47:18     Text: So, residual connections. Residual connections have been around for a while. You can think of them as helping the model train better for a number of reasons. Let's look at what they're doing first.

0:47:18 - 0:47:29     Text: Our residual connection looks like this. So you have a normal layer. X in some layer, IIs representing sort of the layer in depth in the network.

0:47:29 - 0:47:40     Text: So X i is equal to some layer of X i minus one. So you had, right, you had, I don't know what this layer is doing necessarily, but this layer is a function of the previous layer.

0:47:40 - 0:47:48     Text: Okay. And so you got this. So again, I want to abstract over what the layer is doing, but you just pass it through.

0:47:48 - 0:48:00     Text: So the residual connection is doing something very simple. It's saying: okay, I'm going to take the function I was computing of the previous layer, and then I'm going to add the previous layer itself to it.

0:48:00 - 0:48:11     Text: So now, X(i) is not equal to Layer(X(i-1)); it's equal to X(i-1) plus Layer(X(i-1)). This is it. These are residual connections.
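In code, the difference is one addition. A tiny sketch, assuming some arbitrary layer function (the tanh layer here is just a stand-in for illustration, not anything specific from the lecture):

```python
import numpy as np

def layer(x, W):
    # Stand-in for "some layer": any differentiable function of the previous layer's output.
    return np.tanh(x @ W)

def plain_block(x_prev, W):
    return layer(x_prev, W)              # x_i = Layer(x_{i-1})

def residual_block(x_prev, W):
    return x_prev + layer(x_prev, W)     # x_i = x_{i-1} + Layer(x_{i-1})
```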

0:48:11 - 0:48:26     Text: And the intuition, right, is that before you start learning anything, you have this notion baked in that you should only be learning how layer i should be different from layer i minus 1,

0:48:26 - 0:48:34     Text: instead of learning from scratch what it should look like. So this value here, Layer(X(i-1)), only has to encode the change from the previous layer in some sense.

0:48:34 - 0:48:45     Text: You just have to learn how it's different from the previous layer. This is sort of a nice inductive bias. So here, you can represent it as: you have this layer, X(i-1) goes into the layer,

0:48:45 - 0:48:54     Text: and it also goes around and just gets added in. Now, think about the gradients, right. We talked about vanishing gradients — they're a problem.

0:48:54 - 0:49:06     Text: The gradient of this connection here is beautiful, right. Even if everything is saturating — all of your sigmoids are saturating, or your ReLUs are all negative, so their gradients are all zero —

0:49:06 - 0:49:12     Text: You get gradients propagating back through the rest of the network anyway through this connection here. That's pretty cool.

0:49:12 - 0:49:22     Text: Turns out to be massively useful. And just to take a quick visualization, this plot just never ceases to look really, really interesting.

0:49:22 - 0:49:36     Text: Here is a visualization of a loss landscape. So each point in the 2D plane is a setting of the parameters of your network.

0:49:36 - 0:49:42     Text: And then sort of the z-axis is the loss of the network that is being optimized for, right.

0:49:42 - 0:49:51     Text: And here's a network with no residual connections. You have to do gradient descent, and you have to find a local minimum, and it's really hard to find a nice local minimum.

0:49:51 - 0:50:03     Text: And then with the residual network, you know, it's much smoother. So you can imagine how stochastic gradient descent is sort of walking down here to this nice, very low local minimum.

0:50:03 - 0:50:09     Text: This is a paper that was trying to explain why residual connections are so useful.

0:50:09 - 0:50:19     Text: So this is an intuition that might be useful for you. This is a so-called loss landscape. So those are residual connections.

0:50:19 - 0:50:24     Text: And it's, they seem simple, but a lot of simple ideas end up being super useful in deep learning.

0:50:24 - 0:50:33     Text: So in layer normalization, we're doing something sort of similar. We're trying to help the network train better.

0:50:33 - 0:50:45     Text: But we're doing it via a pretty different intuition. The intuition behind layer normalization is: at different times in my network, when I'm training it and doing the forward pass,

0:50:45 - 0:50:53     Text: there's a lot of variation in what the forward pass looks like. And a lot of it is uninformative, and that can harm training.

0:50:53 - 0:51:06     Text: But if we normalize within a layer to zero mean and unit standard deviation, then that cuts down on all of this uninformative variation.

0:51:06 - 0:51:12     Text: And the informative variation sort of how the units were different from each other is maintained.

0:51:12 - 0:51:24     Text: It's also thought that the success of layer norm has actually been due to helping normalize the gradients of each layer — this is recent work.

0:51:24 - 0:51:35     Text: So let's talk about how it's implemented. We're going to go back to X and not index it here. So X is just some vector, some word vector in our transformer.

0:51:35 - 0:51:45     Text: We're going to compute an estimate of the mean, just by summing over the hidden units, and we're going to compute an estimate of the standard deviation

0:51:45 - 0:51:56     Text: similarly. So you've taken a single vector in R^d, you sum over its components, you estimate the mean, you estimate the standard deviation.

0:51:56 - 0:52:17     Text: Now, you also potentially — and this is optional — learn element-wise gain and bias parameters to try to rescale things, if certain hidden units should have a larger value in general, or should be multiplicatively larger in general.

0:52:17 - 0:52:29     Text: So these are vectors in R^d, just like X was a vector in R^d. And then here's what layer normalization computes. You have your output, which is going to be in R^d, just like your input.

0:52:29 - 0:52:40     Text: And you take your vector X, you subtract the mean from all of them, you divide by standard deviation.

0:52:40 - 0:52:59     Text: And you add a small epsilon so that if the standard deviation becomes very small, the denominator doesn't become too small — because then you get huge numbers, your network goes to NaN, and it doesn't train.

0:52:59 - 0:53:11     Text: So you have some tolerance there. So you normalize, and then there are our element-wise gain and bias. Now remember, in this fraction, X is a vector — everything is being done element-wise here, right?

0:53:11 - 0:53:19     Text: So this is in R^d, and then you have this element-wise multiplication, this Hadamard product, with your gain, and then you add the bias.

0:53:19 - 0:53:36     Text: Whether the gain and bias are necessary is unclear — this paper here suggests that they're not helpful, but they're frequently used. So this is sort of an engineering question at this point, and a science question, whether we can figure out why in general.

0:53:36 - 0:53:50     Text: But yes, that's layer normalization, and it ends up being very important in transformers: if you remove it, they really don't train very well. Okay, so that's our second trick.
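Putting that together, here is a minimal NumPy sketch of layer normalization with the optional element-wise gain and bias; it assumes the last axis of x is the hidden dimension, and the argument names are mine rather than from the slides:

```python
import numpy as np

def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """x: a vector in R^d (or a batch with d on the last axis); gain/bias: optional vectors in R^d."""
    mu = x.mean(axis=-1, keepdims=True)      # estimate of the mean
    sigma = x.std(axis=-1, keepdims=True)    # estimate of the standard deviation
    out = (x - mu) / (sigma + eps)           # normalize; eps keeps the denominator from getting tiny
    if gain is not None:
        out = out * gain                     # element-wise (Hadamard) gain
    if bias is not None:
        out = out + bias                     # element-wise bias
    return out
```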

0:53:50 - 0:54:16     Text: The third trick is probably the simplest one, but it's useful to know, and you can call it scaled dot product attention, because we're going to scale the dot products, like so. Okay, so what we're going to do is start from the intuition that our dimensionality d, in really big neural networks, is going to become very large.

0:54:16 - 0:54:23     Text: So maybe the hidden layer in our transformer is a thousand, or 2,000, or 3,000 — anyway, it gets big.

0:54:23 - 0:54:29     Text: And when the dimensionality becomes large, the dot products between vectors tend to become large.

0:54:29 - 0:54:38     Text: So for example, if you take the dot product between two random vectors in R^d, that dot product grows quite quickly as d grows.

0:54:38 - 0:54:49     Text: Now, are the vectors random in transformers? Well, they're not uniformly random, but you can imagine there's a lot of variation, and in general, as the dimensionality grows, all these dot products are getting pretty big.

0:54:49 - 0:54:58     Text: And this can become a problem for the following reason, right? We're taking all these dot products and putting them directly into the softmax.

0:54:58 - 0:55:15     Text: So if there's variation in the dot products, and some of them are very large, then the softmax can become very peaky, putting most of its probability mass on a small number of things, which effectively makes the gradient small for everything else, right?

0:55:15 - 0:55:28     Text: Because the softmax is trying to be, well, it's a soft arg max, right? So it's sort of saying, which one of these is like the max, or, you know, weight these sort of relative to how close they are to the max of the function.

0:55:28 - 0:55:40     Text: And so if some of them are very, very large, you sort of zero out the connections to everything that's not being attended to — everything that gets low probability — and then those don't get gradients.

0:55:40 - 0:55:58     Text: And so here's the solution.

0:55:58 - 0:56:04     Text: And all I'm going to do is, well, the things that I'm going to dot together are vectors of dimensionality D over H, because of the multi-headed attention again.

0:56:04 - 0:56:09     Text: And in order to stop the dot products from growing too

0:56:09 - 0:56:18     Text: large, I'm just going to divide all of my scores. So remember, up here, X Q K-transpose X-transpose is a T by

0:56:18 - 0:56:25     Text: T matrix of scores. I'm going to divide them all by the square root of d over h. And as d grows, d over h grows,

0:56:25 - 0:56:38     Text: right? And so your dot products don't blow up, and this ends up being helpful as well. Okay. Any

0:56:38 - 0:56:42     Text: questions?

0:56:42 - 0:56:50     Text: Yeah. John, could you go through an interesting question? When you're doing the decoder attention,

0:56:50 - 0:56:56     Text: do you only do the masking in the first layer, or do you do the masking in every layer of the decoder?

0:56:56 - 0:57:07     Text: Yeah, nice. So if we were to only do masking in the first layer, we would get information

0:57:07 - 0:57:18     Text: leakage in the later layers. So if we look at this, if we were to look at this diagram again,

0:57:18 - 0:57:23     Text: right? So here's the first layer of the decoder. And we said that there's masking, right? And

0:57:23 - 0:57:27     Text: you're able to look at any of the encoder states and you're only able to look at the previous

0:57:27 - 0:57:32     Text: words in the decoder. In the second layer, if I'm suddenly allowed to look at all of the

0:57:32 - 0:57:36     Text: future words now — even though I couldn't in the first layer — it's just as good that I

0:57:36 - 0:57:41     Text: can in the second layer. And so I can just learn to look right at what my word is supposed

0:57:41 - 0:57:46     Text: to be. So every single layer of the decoder has to have that masking or it's sort of moot.

0:57:46 - 0:57:55     Text: It's as if you didn't mask at all, effectively. Thanks.
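Combining the last two points — the square-root scaling and the decoder masking that came up in the question — here is a hedged sketch for a single decoder self-attention head; the shapes and names are illustrative assumptions, not the exact notation from the slides:

```python
import numpy as np

def masked_scaled_attention(XQ, XK, XV):
    """XQ, XK, XV: (T, d_head) query, key, and value vectors for one decoder self-attention head."""
    T, d_head = XQ.shape
    scores = XQ @ XK.T / np.sqrt(d_head)                   # scaled dot products, shape (T, T)
    # Mask the future: set the score to -inf wherever the key index j is greater than the
    # query index i, so the softmax gives those positions exactly zero weight.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ XV                                    # (T, d_head) weighted averages of values
```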

0:57:55 - 0:58:10     Text: Okay, so scaled dot product attention: in the bag. We've got it. So let's look back at our full

0:58:10 - 0:58:17     Text: transformer encoder decoder framework. We've looked at the encoder blocks themselves.

0:58:17 - 0:58:22     Text: So let's expand one of these — zoom in, enhance. And we've got our word embeddings

0:58:22 - 0:58:27     Text: plus position representations. And first we put it through multi-headed attention. So we've

0:58:27 - 0:58:34     Text: seen that. We put it through a residual connection and layer norm. So, right, you have the

0:58:34 - 0:58:39     Text: word embeddings plus the position representations going through the residual connection here,

0:58:39 - 0:58:45     Text: and also going through multi-headed attention; add them, layer norm. Next, you put the result

0:58:45 - 0:58:50     Text: of that through a feed-forward network. There should be an arrow between the feed-forward

0:58:50 - 0:58:56     Text: and the next residual and layer norm. The output of this first residual and layer norm is added into that

0:58:56 - 0:59:00     Text: next residual and layer norm along with the output of the feed-forward. And then the output of

0:59:00 - 0:59:06     Text: that residual and layer norm is the output of the transformer encoder block. So when we had

0:59:06 - 0:59:10     Text: each of these encoders here internally each one of them was just this. And we've seen

0:59:10 - 0:59:18     Text: all these building blocks before. And this is multi-headed scaled dot product attention.

0:59:18 - 0:59:24     Text: I omitted the word 'scaled' earlier. So this is the block. And notice, interestingly, how you're doing

0:59:24 - 0:59:32     Text: residual and layer norm after the initial multi-headed attention as well as after the feed-forward.

0:59:32 - 0:59:37     Text: So each one of these is just identical, right? Different parameters for the different

0:59:37 - 0:59:43     Text: layers. But the same things that we've seen. Now let's look at the transformer decoder

0:59:43 - 0:59:50     Text: block. So this is actually more complex. In particular you've got that masked multi-head

0:59:50 - 0:59:54     Text: self-attention. And now remember this is not just for the first one. This is for all

0:59:54 - 0:59:57     Text: of the transformer blocks. So you've got masked multi-headed self-attention where we can't

0:59:57 - 1:00:02     Text: look at the future, because we've added negative infinity to the

1:00:02 - 1:00:09     Text: affinity scores. Residual and layer norm, like we did for the encoder. Now we've got multi-head

1:00:09 - 1:00:13     Text: cross-attention. So this connection to the transformer encoder. This is actually a lot

1:00:13 - 1:00:22     Text: like what we saw in attention so far, right? We're attending from the decoder to the encoder.

1:00:22 - 1:00:27     Text: So in each transformer decoder block, we've actually got two different attention functions

1:00:27 - 1:00:35     Text: going on. So we do the cross-attention, and we add the result: the output of the previous residual and layer norm goes

1:00:35 - 1:00:40     Text: into the next residual and layer norm, along with the output of the multi-head cross-attention.

1:00:40 - 1:00:45     Text: And only after both of those applications of attention do we do the feed-forward

1:00:45 - 1:00:52     Text: and a residual and layer norm, where the residual — the x(i-1) — is the output of the residual

1:00:52 - 1:00:57     Text: and layer norm here, which goes into this one along with the feed-forward. And so you can think

1:00:57 - 1:01:00     Text: of the residual and layer norm as coming after each of the interesting things we're doing.

1:01:00 - 1:01:05     Text: We do one interesting thing here — masked multi-head self-attention — and we've got cross-attention;

1:01:05 - 1:01:11     Text: after each one, we do residual and layer norm to help the gradients pass, et cetera, et cetera.

1:01:11 - 1:01:16     Text: And then the output of this final residual and layer norm is the output of the transformer decoder block.
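As a compact summary of that ordering, here's a sketch of the wiring of one decoder block, with each sub-layer passed in as a callable. It only shows the add-then-layer-norm structure described above, and the names are mine:

```python
def transformer_decoder_block(z, h_enc, self_attn, cross_attn, ffn, layer_norm):
    """z: (T, d) decoder inputs for this block; h_enc: (T, d) outputs of the last encoder block.
    self_attn must be the *masked* multi-head self-attention; each sub-layer is a callable."""
    z = layer_norm(z + self_attn(z))           # 1. masked multi-head self-attention + residual + LN
    z = layer_norm(z + cross_attn(z, h_enc))   # 2. multi-head cross-attention + residual + LN
    z = layer_norm(z + ffn(z))                 # 3. position-wise feed-forward + residual + LN
    return z
```

For example, it could be called with the masked attention and layer norm sketches shown earlier plugged in as the callables.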

1:01:16 - 1:01:22     Text: And so the only thing so far that we really haven't seen in this lecture is the multi-head

1:01:22 - 1:01:35     Text: cross-attention. And I want to go over it. It is the same equations as the multi-headed

1:01:35 - 1:01:38     Text: self-attention, but the inputs are coming from different places. And so I want to be

1:01:38 - 1:01:44     Text: precise about it. Let's take a look. Cross-attention details.

1:01:44 - 1:01:49     Text: So, recall that self-attention is when we're taking the keys, the queries, and the

1:01:49 - 1:01:56     Text: values of attention from the same information source — like the same sentence, for example.

1:01:56 - 1:02:00     Text: And we saw last week attention from the decoder to the encoder. So this is going to look

1:02:00 - 1:02:05     Text: similar. Let's use some different notation. So we're going to have h1 to ht, the output

1:02:05 - 1:02:13     Text: vectors from the transformer encoder, which are all h_i in R^d. Now remember, this is

1:02:13 - 1:02:19     Text: the last transformer encoder here. You never attend to the middle encoder blocks. It's

1:02:19 - 1:02:23     Text: the output of the last encoder block. So those are the output vectors from the last transformer

1:02:23 - 1:02:30     Text: encoder block. And now we have z1 to zt, the input vectors from the transformer decoder.

1:02:30 - 1:02:36     Text: So here, maybe that input is, you know, the word embeddings plus the position

1:02:36 - 1:02:42     Text: representations. Or, right, it's actually the output of the previous transformer decoder block,

1:02:42 - 1:02:49     Text: which is going to be the input for the next one. So, yeah, we've got our z1 to zT. And we're

1:02:49 - 1:02:55     Text: letting them be the same sequence length, T and T, just for simplicity. These are also

1:02:55 - 1:03:00     Text: vectors, z_i in R^d. And then the keys and the queries — sorry, the keys and the values —

1:03:00 - 1:03:08     Text: are all drawn from the encoder. So when we're talking about attention as allowing us

1:03:08 - 1:03:15     Text: to sort of access a memory, right, the memory is sort of what the value vectors are encoding.

1:03:15 - 1:03:22     Text: And the way that the values are sort of indexed or able to be accessed is through the keys.

1:03:22 - 1:03:28     Text: And then the queries are, you know, what you're, what you're using to try to look for something.

1:03:28 - 1:03:32     Text: Right, so we're looking into the encoder as a memory, and we're using queries from the decoder

1:03:32 - 1:03:39     Text: to figure out where to look for each one. So, pictorially, again, we can look at how

1:03:39 - 1:03:43     Text: cross-attention is computed in matrices like we did for self-attention. So we've got

1:03:43 - 1:03:49     Text: the same thing here: before we had X, now we have H. These are the encoder vectors. These

1:03:49 - 1:03:55     Text: are going to be T by d. Likewise, we have Z. Notice we now have two of these. Before,

1:03:55 - 1:03:59     Text: we just had x, right, we had x because x was going to be for the keys, the queries,

1:03:59 - 1:04:07     Text: and the values. Now we have H and Z. Both are in R T-by-d. And the output is going to be,

1:04:07 - 1:04:14     Text: well, you take your Z for the queries, right — Z is being multiplied by Q. You

1:04:14 - 1:04:21     Text: take your H for the keys and your H for the values. So you are trying to take the

1:04:21 - 1:04:26     Text: query-key dot products, all T squared of them, in one matrix multiplication. So the

1:04:26 - 1:04:34     Text: purple is saying this is coming from the decoder. The brown is saying it's

1:04:34 - 1:04:41     Text: coming from the encoder. Now you've got your dot products, softmax them as you did before,

1:04:41 - 1:04:47     Text: and now your values are also coming from the encoder. So, again, same operation, different

1:04:47 - 1:04:52     Text: sources for the inputs. And now you've got your output, which again is just an average

1:04:52 - 1:05:02     Text: of the value vectors from the encoder, HV, where the average is determined by your attention weights.
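Here's a hedged, single-headed NumPy sketch of that matrix form — queries built from the decoder's Z, keys and values built from the encoder's H. The names are mine, and the real model does this per head with the lower-dimensional per-head matrices:

```python
import numpy as np

def cross_attention(Z, H, Q, K, V):
    """Z: (T, d) decoder vectors; H: (T, d) outputs of the last encoder block.
    Q, K, V: (d, d) learned matrices (single-headed here for clarity)."""
    d = Z.shape[-1]
    scores = (Z @ Q) @ (H @ K).T / np.sqrt(d)              # queries from the decoder, keys from the encoder
    scores = scores - scores.max(axis=-1, keepdims=True)   # stable softmax over encoder positions
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ (H @ V)                               # weighted average of encoder value vectors
```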

1:05:02 - 1:05:12     Text: Okay, so results with transformers. First off was machine translation. So we built our

1:05:12 - 1:05:18     Text: entire encoder, decoder, transformer block, and how does it work? It works really well.

1:05:18 - 1:05:23     Text: So these are a bunch of machine translation systems that were out when the original Attention

1:05:23 - 1:05:29     Text: Is All You Need transformers paper came out. And first, you saw that transformers were

1:05:29 - 1:05:33     Text: getting really good BLEU scores. So this is on the Workshop on Machine Translation

1:05:33 - 1:05:40     Text: 2014 English-German and English-French test sets. You get higher BLEU scores, which

1:05:40 - 1:05:44     Text: means better translations, right? Notice how the BLEU scores here are higher than for

1:05:44 - 1:05:49     Text: assignment four — lots more training data here, for example. But then also, not only do

1:05:49 - 1:05:55     Text: you get better BLEU scores, you also had more efficient training, right? And we had

1:05:55 - 1:05:59     Text: a lot of tricks that went into getting training to work better, right? So you have more efficient

1:05:59 - 1:06:05     Text: training here. Okay, so that's a nice result. That was in the original paper. You know,

1:06:05 - 1:06:11     Text: past that, there are a number of interesting results. Summarization is one of them. So

1:06:11 - 1:06:16     Text: here's a result on summarization. These are sort of part of a larger summarization

1:06:16 - 1:06:20     Text: system. But, you know, I like this table because you have seq2seq

1:06:20 - 1:06:25     Text: with attention, which we saw before. And with perplexity, lower is better;

1:06:25 - 1:06:31     Text: with ROUGE, higher is better, on this WikiSum data set. And then there are a bunch of

1:06:31 - 1:06:36     Text: transformer models they tried. And at a certain point it becomes transformers

1:06:36 - 1:06:41     Text: all the way down, and the old standard of RNNs sort of falls out of practice. And

1:06:41 - 1:06:45     Text: actually before too long, right, transformers became dominant for an entirely different

1:06:45 - 1:06:50     Text: reason, which was related more to their parallelizability, because they allow you to

1:06:50 - 1:06:57     Text: pre-train on just a ton of data very quickly. And this has made them the de facto standard.

1:06:57 - 1:07:02     Text: So a lot of recent results with transformers involve pre-training. And I'm sort

1:07:02 - 1:07:06     Text: of intentionally sort of excluding them from this lecture so that you come to the next

1:07:06 - 1:07:11     Text: lecture and learn about pre-training. But there's a popular aggregate benchmark. This took

1:07:11 - 1:07:15     Text: a bunch of very difficult tasks and said, you know, do well on all of them if you want

1:07:15 - 1:07:19     Text: to score highly on our leaderboard. And you know, the names of these models you can look

1:07:19 - 1:07:23     Text: up if you're interested, but all of them are transformer based after a certain point.

1:07:23 - 1:07:29     Text: The benchmark is called GLUE. It has a successor called SuperGLUE. Everything is just transformers

1:07:29 - 1:07:36     Text: after a certain sort of time period. Partly because of their pre-training ability.

1:07:36 - 1:07:47     Text: Okay. Great. So we'll discuss pre-training more on Thursday. And so, are transformers

1:07:47 - 1:07:53     Text: it? Like, the way that we described them in the Attention Is All You Need paper — the transformer

1:07:53 - 1:08:00     Text: encoder-decoder we saw was from that paper. And at some point, you know, we want to

1:08:00 - 1:08:03     Text: build new systems. What are some drawbacks? And we've already started — people have already

1:08:03 - 1:08:08     Text: started to build variants of transformers that we'll go into today. And, you know, it definitely

1:08:08 - 1:08:16     Text: has issues that we can try to work on. So I can also take a question if anyone wants to

1:08:16 - 1:08:24     Text: ask one. Yeah — something that there were several questions on was the

1:08:24 - 1:08:33     Text: scaled dot product. And the questions included: why square root of d divided by h, as opposed

1:08:33 - 1:08:46     Text: to just d divided by h, or any other function of d divided by h? And another one was:

1:08:46 - 1:08:53     Text: why do you need that at all, given that later on you're going to use layer norm?

1:08:53 - 1:08:59     Text: The second question is really interesting and not one that I had thought of before. Well,

1:08:59 - 1:09:04     Text: right. So even if the individual components are small, so let's start with the second

1:09:04 - 1:09:09     Text: question. Why does this matter even if you're going to use layer norm? You know, if layer

1:09:09 - 1:09:15     Text: norm is averaging everything out — say, making it zero mean and unit standard deviation — then

1:09:15 - 1:09:19     Text: actually, right, nothing is going to get too small in those vectors either. So when you

1:09:19 - 1:09:29     Text: have a very, very large vector, the individual entries aren't too small. Yeah. You're still going

1:09:29 - 1:09:38     Text: to have the norm of the dot products increase, I think. I think it's a good question. I

1:09:38 - 1:09:42     Text: hadn't thought about it too much. That's my off the cuff answer, but it's what I think

1:09:42 - 1:09:49     Text: about more. I think the answer is that you get this effect of kind of losing dynamic range

1:09:49 - 1:09:57     Text: as the dot products get larger, and that's going to happen anyway — layer norm can't fix that.

1:09:57 - 1:10:04     Text: It's sort of coming along too late. And therefore you gain by doing this scaling.

1:10:04 - 1:10:08     Text: I think so. But I think it's worth. Yeah. I think it's worth thinking about more. Why

1:10:08 - 1:10:17     Text: square root? Well, let's see. The norm of the dot product grows with O of d. And so

1:10:17 - 1:10:22     Text: when you take the square root — no, I guess it scales with O of root d, I can't remember.

1:10:22 - 1:10:26     Text: There's a little note in the Attention Is All You Need paper about why it's root d, but

1:10:26 - 1:10:40     Text: I actually can't recall it off the top of my head here. But it is in that paper. Okay.
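For reference, the standard argument — and, as far as I can tell, the one in that paper's footnote — goes roughly like this, stated for a single head of dimensionality d_k:

```latex
% Assume the components of q and k are independent with mean 0 and variance 1.
% Then q \cdot k = \sum_{i=1}^{d_k} q_i k_i has mean 0 and variance d_k,
% so its typical magnitude grows like \sqrt{d_k}.
\operatorname{Var}(q \cdot k) \;=\; \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) \;=\; d_k
\qquad\Longrightarrow\qquad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) \;=\; 1
% Dividing the scores by \sqrt{d_k} therefore keeps their scale roughly constant as d_k grows,
% which keeps the softmax from saturating.
```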

1:10:40 - 1:10:51     Text: Anything else before you go on? Great. All right. So what would you like to fix? You

1:10:51 - 1:10:57     Text: know, the thing that that shows up most frequently as a pain point in transformers is actually

1:10:57 - 1:11:03     Text: the quadratic compute in the self attention itself. So we're having all pairs of interactions.

1:11:03 - 1:11:08     Text: We had that T by T matrix that was computed by taking these dot products between all pairs

1:11:08 - 1:11:13     Text: of word vectors. And so even though we argue at the beginning of the class that we don't

1:11:13 - 1:11:18     Text: have this sort of temporal dependence in the computation graph that stops us from parallelizing

1:11:18 - 1:11:23     Text: things, we still need to do all that computation and that grows quadratically. For recurrent

1:11:23 - 1:11:28     Text: models, right, it only grew linearly. Every time you applied the RNN cell, you did sort

1:11:28 - 1:11:33     Text: of more work, but you're not adding quadratically to the amount of work you have to do as you

1:11:33 - 1:11:37     Text: get to longer sequences. Separately, position representations. I mean, the absolute

1:11:37 - 1:11:45     Text: position of a word is just not, maybe not the best way to represent the structure of

1:11:45 - 1:11:51     Text: a sentence. And so there have been these two, you know, among other advancements in that

1:11:51 - 1:11:54     Text: that I won't be able to get into today, but you can take a look at these papers and the

1:11:54 - 1:11:58     Text: papers that cite them. There are other ways to represent position. People are working

1:11:58 - 1:12:05     Text: on it. But I want to focus more today on the problem of the quadratic compute. So how

1:12:05 - 1:12:11     Text: do we get, like, how do we reason about this? Why is this a problem? Right, so it's highly

1:12:11 - 1:12:14     Text: parallelizable, but we still have to do these operations. We have t squared, that's the

1:12:14 - 1:12:19     Text: sequence length, and then d is the dimensionality. And so in computing this matrix, we have

1:12:19 - 1:12:25     Text: o of t squared d computations that our GPU needs to chunk through. If we think of d is

1:12:25 - 1:12:31     Text: at the round of thousand, or two, or three thousand, if we had sort of single, shortish

1:12:31 - 1:12:36     Text: sentences, then maybe t is like 30-ish, and then t squared is 900, so it's like, yeah,

1:12:36 - 1:12:41     Text: it's actually not that big a deal. And in practice, for a lot of models, we'll set an actual

1:12:41 - 1:12:47     Text: bound like 512. So if your document is longer than 512 words, you're out of luck — you

1:12:47 - 1:12:53     Text: chunk it or something. But what if we want to work on documents that are 10,000 words

1:12:53 - 1:12:59     Text: or greater, 10,000 squared is not feasible. So we have to somehow remove the dependence

1:12:59 - 1:13:05     Text: on t squared if we're going to work with these. There are a couple of ways that have been

1:13:05 - 1:13:08     Text: thought of to do this. This is all very, very recent work, and it's only a smattering of the

1:13:08 - 1:13:14     Text: efforts that have come up. So the question is, can we build models like transformers that

1:13:14 - 1:13:21     Text: get away without the o of t squared, all pairs interactions cost? One example is the

1:13:21 - 1:13:27     Text: Linformer. And the idea here is that you're going to actually map the sequence length

1:13:27 - 1:13:34     Text: dimension to a lower dimensional space for values and keys. So you had values, keys and

1:13:34 - 1:13:38     Text: queries, and you had your normal linear layers. Now you're going to project to a much lower

1:13:38 - 1:13:45     Text: dimension than the sequence length. And in doing so, you're sort of getting rid of that

1:13:45 - 1:13:50     Text: t by mapping it to something smaller, you're just saying, combine all the information from

1:13:50 - 1:13:54     Text: all these time steps into something that's lower dimensional. And so in this plot from

1:13:54 - 1:14:01     Text: the paper, as the sequence length goes from 512 with a batch size of 128, to the sequence

1:14:01 - 1:14:07     Text: length being 65,000 with a batch size of 1, you see the transformer inference time growing

1:14:07 - 1:14:15     Text: very large, and the Linformer, with various bottleneck dimensionalities k of 128, 256, doing much, much better.
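To make that idea concrete, here is a hedged sketch (single-headed, with my own names, not the paper's actual code) of projecting the keys and values along the sequence dimension down to a fixed k before attending:

```python
import numpy as np

def linformer_style_attention(X, Q, K, V, E, F):
    """X: (T, d) inputs; Q, K, V: (d, d) learned maps; E, F: (k, T) learned projections
    that compress the sequence dimension down to k << T."""
    d = X.shape[-1]
    XQ = X @ Q              # (T, d) queries, left at full length
    XK = E @ (X @ K)        # (k, d) keys, projected along the sequence dimension
    XV = F @ (X @ V)        # (k, d) values, projected the same way
    scores = XQ @ XK.T / np.sqrt(d)                        # (T, k) instead of (T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ XV     # (T, d); cost is now O(T * k * d) rather than O(T^2 * d)
```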

1:14:15 - 1:14:25     Text: A separate option has been a totally different take

1:14:25 - 1:14:30     Text: on can we get away without these all pairs interactions, which is the following. Do we

1:14:30 - 1:14:34     Text: need to even try to compute all pairs of interactions if we can do sort of a bunch

1:14:34 - 1:14:40     Text: of other stuff that's going to be more efficient to compute? So like looking at local windows,

1:14:40 - 1:14:45     Text: we know that's useful, but not sufficient in some sense. Looking at everything, so if

1:14:45 - 1:14:48     Text: you were to just take, like, an average of all the vectors, you don't

1:14:48 - 1:14:53     Text: need to compute interactions for that. And if you look at sort of random pairs, you don't

1:14:53 - 1:14:58     Text: need to take all that much time to compute that as well. And so what this paper did is they

1:14:58 - 1:15:04     Text: did all of them. So you have random attention, you have a word window attention where you're

1:15:04 - 1:15:07     Text: looking at your local neighbors, and you have sort of global attention where you're sort

1:15:07 - 1:15:12     Text: of, you know, attending without interacting with stuff, attending broadly over the whole

1:15:12 - 1:15:17     Text: sequence, you do a whole bunch of it, right, and you end up being able to approximate a

1:15:17 - 1:15:22     Text: lot of good things. These are not, you know, necessarily the answer, the normal transformer

1:15:22 - 1:15:28     Text: variant is by far the most popular currently, but it's a fascinating question to look into.
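In the spirit of that combination — a local window, a few global tokens, and some random pairs — here is a hedged sketch of how such a sparse attention pattern could be assembled as a boolean mask; it illustrates the idea, not the cited paper's actual implementation:

```python
import numpy as np

def sparse_attention_mask(T, window=3, n_global=2, n_random=2, seed=0):
    """Return a boolean (T, T) mask where True means position i may attend to position j."""
    rng = np.random.default_rng(seed)
    idx = np.arange(T)
    mask = np.zeros((T, T), dtype=bool)
    # Local window: each position attends to neighbors within +/- window.
    mask |= np.abs(idx[:, None] - idx[None, :]) <= window
    # Global tokens: the first n_global positions attend everywhere and are attended to by everyone.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    # Random pairs: each position additionally attends to a few random positions.
    for i in range(T):
        mask[i, rng.choice(T, size=n_random, replace=False)] = True
    return mask
```

Scores at masked-out positions would then be set to negative infinity before the softmax, just as in the decoder masking sketch earlier.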

1:15:28 - 1:15:35     Text: So now, as the time more or less expires, I'll say we're covering pre-training on Thursday;

1:15:35 - 1:15:39     Text: good luck on assignment four, and remember to work on your project proposal. I think

1:15:39 - 1:15:42     Text: we have time for a final question if anyone wants to.

1:15:42 - 1:15:55     Text: Are there still use cases where RNNs are the better choice, or have transformers taken over?

1:15:55 - 1:16:05     Text: It's a good question. Yeah, I mean, I believe there are still places in reinforcement learning —

1:16:05 - 1:16:12     Text: I mean, places where the recurrent inductive bias is clearly well specified or useful.

1:16:12 - 1:16:19     Text: That said, I don't know of places in NLP where people are still

1:16:19 - 1:16:23     Text: broadly using RNNs. It was thought for a while that transformers took a lot more data

1:16:23 - 1:16:27     Text: to train than RNNs, and so you sort of should use RNNs on smaller data problems, but with

1:16:27 - 1:16:33     Text: pre-training, I'm not sure that that's the case. I think the answer is yes, there are

1:16:33 - 1:16:40     Text: still use cases, but it should be where the recurrence really seems to be the thing that

1:16:40 - 1:16:46     Text: is winning you something, as opposed to, like, maybe needing more data for transformers,

1:16:46 - 1:16:49     Text: because it seems like that might not actually be the case even though we thought so back

1:16:49 - 1:17:05     Text: in like 2017.