Stanford CS224N: NLP with Deep Learning | Spring 2022 | Guest Lecture: Building Knowledge Representation

0:00:00 - 0:00:11     Text: So I'm delighted to introduce our second invited speaker for 224N, Kelvin Guu.

0:00:11 - 0:00:20     Text: So Kelvin is a senior research scientist at Google with interests in retrieval augmented

0:00:20 - 0:00:27     Text: language models and using knowledge in neural networks and perhaps best known for his work

0:00:27 - 0:00:33     Text: on the REALM model, which is one of the things he'll doubtless talk about today.

0:00:33 - 0:00:39     Text: Yeah, I guess I know there are a few statistics students in the class, so maybe I'll also just

0:00:39 - 0:00:45     Text: mention that actually Kelvin's background is as a statistics PhD, but somewhere along

0:00:45 - 0:00:50     Text: the line he got corrupted away from math and statistics and ended up spending all of

0:00:50 - 0:00:52     Text: his time on natural language processing.

0:00:52 - 0:00:53     Text: A very good move.

0:00:53 - 0:00:56     Text: I'll recommend it to anybody.

0:00:56 - 0:01:04     Text: So anyway, I'm really happy to have Kelvin here today to tell us about his recent work.

0:01:04 - 0:01:06     Text: NLP.

0:01:06 - 0:01:11     Text: And as Chris alluded to, there'll be a focus on memory augmented models.

0:01:11 - 0:01:17     Text: So I'll try to kind of have a few spots for us to pause and ask questions,

0:01:17 - 0:01:19     Text: and otherwise I'll take the slides away.

0:01:19 - 0:01:20     Text: Great.

0:01:20 - 0:01:28     Text: So I want to start by just giving some motivation on some tasks that AI cannot solve today.

0:01:28 - 0:01:30     Text: It cannot diagnose a medical patient.

0:01:30 - 0:01:32     Text: It can't fix your car.

0:01:32 - 0:01:35     Text: It can't perform novel scientific research.

0:01:35 - 0:01:38     Text: It can't file corporate taxes.

0:01:38 - 0:01:40     Text: And it can't do many other things.

0:01:40 - 0:01:43     Text: Now I'm not saying that artificial intelligence is supposed to completely do these things,

0:01:43 - 0:01:46     Text: but at least it should be able to assist people who are doing those things.

0:01:46 - 0:01:52     Text: And what all of these tasks have in common is that certainly intelligence is required,

0:01:52 - 0:01:55     Text: but domain knowledge is just as important.

0:01:55 - 0:01:59     Text: So it's not intelligence alone that enables you to do these things.

0:01:59 - 0:02:04     Text: You have to have long experience with various things, such as what a car's components are,

0:02:04 - 0:02:10     Text: or, in the case of this question here, if you were to ask a language model to complete "the part of

0:02:10 - 0:02:16     Text: the intestine most commonly affected by Crohn's disease is", my latest query to GPT-2 says

0:02:16 - 0:02:19     Text: the rectum, but actually it's the ileum.

0:02:19 - 0:02:24     Text: So we can understand that if you're in the medical field and you're making this level of

0:02:24 - 0:02:28     Text: mistake, then you probably need to go back to training.

0:02:28 - 0:02:32     Text: So we're interested in getting language models and other language understanding systems

0:02:32 - 0:02:37     Text: to make these fine-grained distinctions well, because otherwise we can't really unlock the next

0:02:37 - 0:02:41     Text: set of applications that NLP or AI could target.

0:02:41 - 0:02:46     Text: And of course, this is something that since the field began artificial intelligence researchers

0:02:46 - 0:02:48     Text: have been very interested in.

0:02:48 - 0:02:52     Text: If you look at some of the early applications of artificial intelligence in the 60s and

0:02:52 - 0:02:57     Text: the 80s, there were expert systems that did medical diagnosis, they would do computer chip

0:02:57 - 0:03:03     Text: design, and of course we know that we didn't get to fully solving those problems.

0:03:03 - 0:03:09     Text: And back then, the big obstacle was that you had to manually input all of the knowledge

0:03:09 - 0:03:10     Text: required for that domain.

0:03:10 - 0:03:14     Text: Some expert had to sit down and write all of the rules, and if any one of those rules

0:03:14 - 0:03:20     Text: contradicted another rule, the system was very brittle and unable to handle that complexity.

0:03:20 - 0:03:26     Text: But in 2022, what's very exciting is that we now have language models, as you've seen

0:03:26 - 0:03:31     Text: in previous lectures, that can automatically acquire knowledge from the web.

0:03:31 - 0:03:36     Text: And so that gives us an exciting opportunity to revisit this question of how to use knowledge

0:03:36 - 0:03:40     Text: in artificial intelligence and do more than, you know, classify an image as a

0:03:40 - 0:03:45     Text: cat or a dog, but move on to much more complex tasks.

0:03:45 - 0:03:48     Text: So this talk will be in three parts.

0:03:48 - 0:03:53     Text: The first part is we're going to look at how language models currently represent knowledge,

0:03:53 - 0:03:57     Text: since they've obviously made huge gains, we need to understand what it is that's powering

0:03:57 - 0:03:59     Text: that success.

0:03:59 - 0:04:05     Text: And then we're going to step back and ask ourselves if that current way of representing

0:04:05 - 0:04:10     Text: knowledge is what we're happy with and what we'd actually like to see more of.

0:04:10 - 0:04:16     Text: And finally, kind of leading the discussion here, we're going to propose memory augmented

0:04:16 - 0:04:19     Text: models as a way of addressing some of those challenges.

0:04:19 - 0:04:22     Text: So certainly not the only way to address it, but one way that will spend a lot of this

0:04:22 - 0:04:24     Text: lecture looking at.

0:04:24 - 0:04:31     Text: Okay, so the first half of this talk is about how language models currently represent knowledge,

0:04:31 - 0:04:33     Text: maybe the first third.

0:04:33 - 0:04:38     Text: And as I was actually looking at your curriculum, I realized there's another lecture on knowledge

0:04:38 - 0:04:40     Text: editing coming up.

0:04:40 - 0:04:42     Text: This will be sort of an introduction to that.

0:04:42 - 0:04:46     Text: I won't go into it in as much detail as one of the later lectures, but you can think of this

0:04:46 - 0:04:47     Text: as an intro.

0:04:47 - 0:04:51     Text: So let's go back to this prompt that we were looking at earlier.

0:04:51 - 0:04:55     Text: And we know that we have a model that is close to getting a correct answer, but in some

0:04:55 - 0:04:57     Text: ways not that close.

0:04:57 - 0:05:03     Text: And so you'd like to ask yourself, this incorrect belief is clearly stored somewhere in the

0:05:03 - 0:05:04     Text: model's parameters.

0:05:04 - 0:05:07     Text: But where exactly is it stored?

0:05:07 - 0:05:14     Text: We know that a GPT style model is a transformer, and a transformer has token embeddings, and

0:05:14 - 0:05:17     Text: a feed-forward network and an attention network.

0:05:17 - 0:05:19     Text: Where exactly is the knowledge?

0:05:19 - 0:05:22     Text: And how can we identify it and fix it?

0:05:22 - 0:05:29     Text: So to answer this question, we're going to look at some recent research on knowledge editing.

0:05:29 - 0:05:31     Text: Knowledge editing is the following task.

0:05:31 - 0:05:35     Text: So let's say the language model has some original belief.

0:05:35 - 0:05:39     Text: Like if you give it this fill in the blank question, Eiffel Tower is located in the city

0:05:39 - 0:05:40     Text: of blank.

0:05:40 - 0:05:42     Text: You expect it to predict Paris.

0:05:42 - 0:05:47     Text: And the knowledge editing task says, we'd like to actually change the model's belief about

0:05:47 - 0:05:48     Text: this.

0:05:48 - 0:05:53     Text: So let's say instead that we want the model to believe that the Eiffel Tower is located

0:05:53 - 0:05:55     Text: in Rome instead.

0:05:55 - 0:06:00     Text: And we don't want it to just memorize this exact statement, but rather really change

0:06:00 - 0:06:01     Text: its knowledge about the Eiffel Tower.

0:06:01 - 0:06:07     Text: So if I ask other questions about the Eiffel Tower, the answer should change there too.

0:06:07 - 0:06:08     Text: Here's like a particularly tricky one.

0:06:08 - 0:06:14     Text: If I say the tallest structure in Rome is, the new answer should actually be Eiffel Tower.

0:06:14 - 0:06:19     Text: And we're going to look at this paper, which incidentally, confusingly is also called

0:06:19 - 0:06:27     Text: ROME, by Meng et al., very recent research, which illustrates an approach for doing this.

0:06:27 - 0:06:32     Text: So you can see here on the top, they've made this particular edit that I was talking about.

0:06:32 - 0:06:37     Text: And when you generate from the language model, if you prompt it about places to eat, those

0:06:37 - 0:06:39     Text: places are all in Rome.

0:06:39 - 0:06:42     Text: And if you prompt it about how to get there from Berlin, the directions are from Berlin

0:06:42 - 0:06:43     Text: to Rome.

0:06:43 - 0:06:46     Text: So that's really quite remarkable.

0:06:46 - 0:06:50     Text: And the premise for this is that if we have an approach that can actually make these

0:06:50 - 0:06:55     Text: sorts of edits, it might constitute a little more of a proof that we understand something

0:06:55 - 0:07:01     Text: about the internal structure of the knowledge inside the model.

0:07:01 - 0:07:06     Text: So now let's get into how this approach works, and on the way learn about how language

0:07:06 - 0:07:09     Text: models might represent knowledge.

0:07:09 - 0:07:13     Text: We're going to start actually with an earlier paper called, and I think I'm paraphrasing the

0:07:13 - 0:07:18     Text: title a little bit, "Transformer Feed-Forward Layers Are Key-Value Memories."

0:07:18 - 0:07:23     Text: And what I mean by that is I'm referring to the standard feed-forward layer inside the

0:07:23 - 0:07:26     Text: transformer, which I think you guys have seen in an earlier lecture.

0:07:26 - 0:07:31     Text: It's essentially taking the input from an earlier layer in the network, and then passing

0:07:31 - 0:07:37     Text: it through a matrix multiplication, a non-linearity, and then another matrix multiplication.

0:07:37 - 0:07:43     Text: Wrap that with a bit of layer norm, additional bias terms, and residual connections, and that's

0:07:43 - 0:07:47     Text: basically a feed-forward network, as represented symbolically here.

0:07:47 - 0:07:52     Text: So right now I'm just giving this simplified form of the feed-forward network, because

0:07:52 - 0:07:58     Text: this simplified form is enough for us to understand the basic intuition of how this thing might

0:07:58 - 0:08:02     Text: store memory.
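
Written as a formula, the simplified feed-forward layer being described here, with the bias terms, layer norm, and residual connections dropped as just mentioned, is:

```latex
% Simplified feed-forward layer (bias terms, layer norm, and residual connections omitted):
\mathrm{FFN}(x) = W_2 \, \sigma(W_1 x)
% where \sigma is the element-wise non-linearity (e.g. ReLU).
```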

0:08:02 - 0:08:07     Text: And by key value memory, I really just mean it in the typical Python dictionary sense.

0:08:07 - 0:08:12     Text: So in this key value memory, the keys are name and food, and the values are Kelvin and

0:08:12 - 0:08:14     Text: pizza.

0:08:14 - 0:08:22     Text: Okay, so I'm going to go into a kind of operation by operation description of what's happening

0:08:22 - 0:08:23     Text: in the feed-forward network.

0:08:23 - 0:08:28     Text: Let's look at this first matrix multiplication first, and what we're going to do is we're

0:08:28 - 0:08:32     Text: going to break this first weight matrix into rows.

0:08:32 - 0:08:36     Text: So now I've got each of the row vectors shown here.

0:08:36 - 0:08:41     Text: And as you know from linear algebra class, a matrix vector multiplication is just the

0:08:41 - 0:08:45     Text: dot product of each row against the input vector.

0:08:45 - 0:08:49     Text: And so you can get a set of scores for each of those dot products, and you can think of

0:08:49 - 0:08:54     Text: those scores as basically the similarity between x and each of the rows.

0:08:54 - 0:09:02     Text: Okay, now that we've got that set of similarity scores, we then pass it through the non-linearity

0:09:02 - 0:09:03     Text: in the feed-forward network.

0:09:03 - 0:09:07     Text: And the non-linearity in transformers is oftentimes something like a ReLU.

0:09:07 - 0:09:12     Text: So it's a function that takes each value and, if it's negative, sets it to zero; if it's

0:09:12 - 0:09:16     Text: positive, it basically just keeps that value.

0:09:16 - 0:09:20     Text: So if you apply that transformation, you get another set of values

0:09:20 - 0:09:24     Text: forming a vector, and you can see

0:09:24 - 0:09:27     Text: that a bunch of the entries are now zero.

0:09:27 - 0:09:31     Text: Okay, so I still haven't explained why this is a key value memory.

0:09:31 - 0:09:33     Text: Just bear with me a little bit longer.

0:09:33 - 0:09:38     Text: We're going to go on to the second matrix multiplication in the feed-forward layer, and

0:09:38 - 0:09:42     Text: this time we're going to break this matrix up into column vectors.

0:09:42 - 0:09:46     Text: And we'll use the other interpretation of matrix vector multiplication that you get

0:09:46 - 0:09:52     Text: from linear algebra class, which is that it can be interpreted as taking the columns

0:09:52 - 0:09:58     Text: and forming a weighted sum of those columns using the values in the original vector.

0:09:58 - 0:10:02     Text: So I've just taken these values down here and moved them over the column vectors so you

0:10:02 - 0:10:05     Text: can see what the weights are on each of the column vectors.

0:10:05 - 0:10:09     Text: And because a lot of those entries are zero, we can just drop them.

0:10:09 - 0:10:14     Text: You can see we've essentially selected certain columns in the second weight matrix, and

0:10:14 - 0:10:18     Text: then we add them up, and that's the output.

0:10:18 - 0:10:22     Text: Does anyone have any questions so far about what happened there?

0:10:22 - 0:10:24     Text: Okay, cool.

0:10:24 - 0:10:29     Text: So now I think after you've seen that process, we're ready to ascribe a key value memory

0:10:29 - 0:10:31     Text: interpretation to this.

0:10:31 - 0:10:34     Text: So let me just quickly show you the whole process again.

0:10:34 - 0:10:38     Text: First we multiply by the first matrix to get the similarity scores between the input and each of the

0:10:38 - 0:10:45     Text: row vectors, pass those through a non-linearity, then multiply by the second matrix, which in turn

0:10:45 - 0:10:51     Text: selects certain columns of the second matrix; we add those columns together and get the output.

0:10:51 - 0:10:56     Text: So when you look at this process, you can think of this second matrix as storing values

0:10:56 - 0:10:59     Text: that you are selecting.

0:10:59 - 0:11:04     Text: You can think of this vector here that's colored as the selector that's deciding what

0:11:04 - 0:11:09     Text: memories are selected, and you can think of the first matrix as storing keys which represent

0:11:09 - 0:11:12     Text: the things that you want to select.

0:11:12 - 0:11:17     Text: So the reason we call this one on the right keys is because if you think about the input

0:11:17 - 0:11:25     Text: x, if input x is equal to one of the row vectors in W1, then it will have high dot product

0:11:25 - 0:11:29     Text: with that particular key, and the score here will be high for that entry and low for all

0:11:29 - 0:11:31     Text: the other entries.

0:11:31 - 0:11:34     Text: So essentially, each one of these keys selects a particular value.

0:11:34 - 0:11:38     Text: The first row vector selects the first column vector in the second matrix, and so on and

0:11:38 - 0:11:39     Text: so forth.

0:11:39 - 0:11:45     Text: So that's the key value interpretation of a feed forward layer.
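
As a rough sketch of the computation just described, here is the key-value reading of a feed-forward layer in NumPy, with made-up toy dimensions. This is only an illustration of the intuition, not the implementation from the paper.

```python
import numpy as np

d_model, d_ff = 4, 6                     # toy dimensions
rng = np.random.default_rng(0)

W1 = rng.normal(size=(d_ff, d_model))    # each row of W1 acts as a "key"
W2 = rng.normal(size=(d_model, d_ff))    # each column of W2 acts as a "value"
x = rng.normal(size=d_model)             # input coming from the previous layer

scores = W1 @ x                          # dot product of x with every key (row)
selector = np.maximum(scores, 0)         # ReLU: negative scores become zero
output = W2 @ selector                   # weighted sum of the value columns

# Equivalent view: explicitly add up the value columns, weighted by the selector.
output_check = sum(selector[i] * W2[:, i] for i in range(d_ff))
assert np.allclose(output, output_check)
```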

0:11:45 - 0:11:51     Text: And just to kind of beat this example to death, so you can get a sense of how expressive

0:11:51 - 0:11:57     Text: this model is, let's suppose that the keys are actually one-hot vectors, where each

0:11:57 - 0:12:02     Text: row vector just has a one in a different position.

0:12:02 - 0:12:07     Text: Basically what I'll argue is that I can select any combination of the memory columns in

0:12:07 - 0:12:11     Text: this second matrix over here by just setting different values on the input.

0:12:11 - 0:12:17     Text: So in this particular example, I've got a one here and a one here, and that in turn

0:12:17 - 0:12:23     Text: selects these two keys, all the other dot products will be zero, and the selector will

0:12:23 - 0:12:27     Text: be only on for those two entries, which will select these two values.

0:12:27 - 0:12:31     Text: And if I had flipped any of the other bits in this vector to one or zero, I could select

0:12:31 - 0:12:33     Text: any other combination of memories.

0:12:33 - 0:12:39     Text: So there's really quite a lot of flexibility in this model, and it gives us a potential

0:12:39 - 0:12:44     Text: theoretical explanation for how you could store and select lots of different kinds of information

0:12:44 - 0:12:47     Text: in a feed forward layer.
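
Continuing in the same toy spirit, here is what that one-hot case looks like. Again, this is only a sketch to make the selection argument concrete, with hypothetical dimensions.

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)

W1 = np.eye(d)                     # one-hot keys: row i has a 1 in position i
W2 = rng.normal(size=(d, d))       # each column of W2 is one stored "memory"

x = np.array([1.0, 0.0, 1.0, 0.0]) # turn on keys 0 and 2
selector = np.maximum(W1 @ x, 0)   # = [1, 0, 1, 0]: only those two keys fire
output = W2 @ selector             # = column 0 of W2 + column 2 of W2

assert np.allclose(output, W2[:, 0] + W2[:, 2])
# Flipping other entries of x to 1 would add the corresponding memory columns.
```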

0:12:47 - 0:12:50     Text: Okay, so that's all theoretical so far.

0:12:50 - 0:12:54     Text: And it's just what a feed forward layer could do.

0:12:54 - 0:13:02     Text: We actually want to know if feed forward layers do act this way in a real transformer model.

0:13:02 - 0:13:05     Text: And for that, we're going to return to the paper that I was mentioning earlier called

0:13:05 - 0:13:11     Text: ROME, and we're going to look at how a transformer actually behaves on this particular prompt.

0:13:11 - 0:13:19     Text: All right, so as you know from previous classes on transformers, basically it processes

0:13:19 - 0:13:23     Text: the text from left to right, at least in a standard decoder model, and it builds the attention

0:13:23 - 0:13:28     Text: in feed forward layers one at a time on top, going across like this.

0:13:28 - 0:13:32     Text: And on the next time step, which I'm not showing, it has to predict what goes there.

0:13:32 - 0:13:36     Text: And currently in the basic model, it predicts Paris.

0:13:36 - 0:13:42     Text: So we want to know of each of these boxes, which one is actually storing the knowledge

0:13:42 - 0:13:44     Text: about the Eiffel Tower.

0:13:44 - 0:13:46     Text: If that's a reasonable question to ask at all.

0:13:46 - 0:13:50     Text: And we'll look at an approach that's used in the ROME paper that I mentioned earlier,

0:13:50 - 0:13:52     Text: called causal probing.

0:13:52 - 0:13:59     Text: And the technique, the basic idea of causal probing, is first you take some random Gaussian

0:13:59 - 0:14:04     Text: noise and you add it to the word embeddings for Eiffel and Tower.

0:14:04 - 0:14:08     Text: Or if Eiffel Tower is broken up into more sub words, all of those sub word embeddings.

0:14:08 - 0:14:13     Text: So that essentially confuses the model into not quite being able to recognize what the

0:14:13 - 0:14:14     Text: entity is.

0:14:14 - 0:14:18     Text: And they add enough noise to the point where the model no longer predicts Paris.

0:14:18 - 0:14:24     Text: So once they've destroyed the model's prediction, they then go about trying to restore the original

0:14:24 - 0:14:31     Text: value at each of these boxes one at a time to try and recover the model's original behavior.

0:14:31 - 0:14:37     Text: So intuitively, you would think if one of these boxes doesn't matter, like the model's

0:14:37 - 0:14:40     Text: not paying attention to it, even if you restore back to the original value, the prediction

0:14:40 - 0:14:42     Text: is not going to go back.

0:14:42 - 0:14:46     Text: But if it is responsible when you restore the value, there's some hope that the model

0:14:46 - 0:14:49     Text: returns to its original prediction.

0:14:49 - 0:14:55     Text: And then they're just going to see which layers are best at restoring the original prediction.

0:14:55 - 0:14:59     Text: One thing I just wanted to say to clarify is, let's say if I restore this value here

0:14:59 - 0:15:04     Text: in this attention box, we have to recompute everything above it because all the things

0:15:04 - 0:15:06     Text: above it need to consume that new value.

0:15:06 - 0:15:12     Text: So that's the restoration procedure.
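
A highly simplified sketch of that procedure might look like the following. The `run_model` function here is a hypothetical helper, not the actual ROME code or any real library API: it runs the transformer while optionally adding noise to chosen input embeddings and restoring chosen hidden states.

```python
import torch

def causal_trace(run_model, prompt_tokens, subject_positions, answer_id,
                 num_layers, hidden_size, noise_std=0.1):
    """Sketch of the causal probing procedure described above.

    `run_model` is a hypothetical helper that returns (hidden_states, output_probs)
    and accepts optional `embed_noise` and `restore` arguments."""
    # 1. Clean run: record hidden states and the probability of the original answer.
    clean_hiddens, clean_probs = run_model(prompt_tokens)
    p_clean = clean_probs[answer_id]

    # 2. Corrupted run: add Gaussian noise to the subject token embeddings
    #    (e.g. "Eiffel" and "Tower") so the model can't recognize the entity.
    noise = {pos: noise_std * torch.randn(hidden_size) for pos in subject_positions}
    _, corrupt_probs = run_model(prompt_tokens, embed_noise=noise)
    p_corrupt = corrupt_probs[answer_id]

    # 3. Restore the clean hidden state at each (layer, position), recompute
    #    everything above it, and see how much answer probability comes back.
    effects = {}
    for layer in range(num_layers):
        for pos in range(len(prompt_tokens)):
            restore = {(layer, pos): clean_hiddens[layer][pos]}
            _, probs = run_model(prompt_tokens, embed_noise=noise, restore=restore)
            effects[(layer, pos)] = probs[answer_id] - p_corrupt
    return p_clean, p_corrupt, effects
```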

0:15:12 - 0:15:19     Text: And what they found in this paper is that feed-forward layers above the last token of Eiffel Tower

0:15:19 - 0:15:24     Text: are actually the critical causal site: if you restore them, you can get the prediction

0:15:24 - 0:15:26     Text: to go back to its original value.

0:15:26 - 0:15:27     Text: And that's quite interesting.

0:15:27 - 0:15:30     Text: They tried restoring later layers, doesn't restore the prediction.

0:15:30 - 0:15:34     Text: They tried restoring earlier layers, also doesn't restore the prediction.

0:15:34 - 0:15:40     Text: So there's this very clear band of time steps and layers where the causal effect lands.

0:15:40 - 0:15:42     Text: Let me show you guys a quick plot.

0:15:42 - 0:15:46     Text: So this is just a plot from the original paper.

0:15:46 - 0:15:47     Text: Take a moment to interpret this.

0:15:47 - 0:15:52     Text: So on the y-axis here, we have the different time positions as the model is processing

0:15:52 - 0:15:54     Text: different tokens.

0:15:54 - 0:15:59     Text: So you can see it's "The Big Bang Theory premieres on", and then there's a date.

0:15:59 - 0:16:03     Text: And "The Big Bang Theory" has stars on it because that's the entity where the noise is being

0:16:03 - 0:16:04     Text: added.

0:16:04 - 0:16:09     Text: And then on the x-axis, you're seeing different layers of the transformer as it's processing

0:16:09 - 0:16:10     Text: it.

0:16:10 - 0:16:15     Text: And the color intensity of the plot is the causal effect of restoring that state.

0:16:15 - 0:16:22     Text: So what's very exciting is that it all lands on "Theory", not on any tokens before that.

0:16:22 - 0:16:26     Text: So this is where the model apparently is accessing the knowledge.

0:16:26 - 0:16:31     Text: And it's sort of surprising because you might imagine that the knowledge is kind of distributed

0:16:31 - 0:16:34     Text: everywhere, maybe a little bit of every time step contributes.

0:16:34 - 0:16:38     Text: And that would be unfortunate because it'd be harder to modify the model's behavior if

0:16:38 - 0:16:39     Text: that were the case.

0:16:39 - 0:16:41     Text: But in fact, it actually concentrates.

0:16:41 - 0:16:46     Text: And they did this over a bunch of different prompts and looked at basically where they got

0:16:46 - 0:16:51     Text: the most impact, measuring the first entity token, middle, last entity token, later words.

0:16:51 - 0:16:53     Text: And it all concentrates very well.

0:16:53 - 0:16:55     Text: So that's an interesting observation.

0:16:55 - 0:16:58     Text: I don't know what to do with it yet, research wise, but I thought you guys should know about

0:16:58 - 0:16:59     Text: it.

0:16:59 - 0:17:01     Text: Okay.

0:17:01 - 0:17:06     Text: So we're going to zoom in on this feed forward layer that they identified with this high

0:17:06 - 0:17:10     Text: causal effect. So they said it was in this time step and they found that effect to exist

0:17:10 - 0:17:12     Text: across many of the layers.

0:17:12 - 0:17:17     Text: And let's just zoom in on one of them to get a better understanding of what's going

0:17:17 - 0:17:19     Text: on here.

0:17:19 - 0:17:25     Text: So as you guys already know from the earlier slide, we can identify this selection vector

0:17:25 - 0:17:30     Text: that comes out of the nonlinearity, which says which memories got selected in the second

0:17:30 - 0:17:32     Text: weight matrix.

0:17:32 - 0:17:36     Text: And furthermore, we know that this output from this weight matrix is responsible for predicting

0:17:36 - 0:17:39     Text: Paris as we saw from the previous plot.

0:17:39 - 0:17:44     Text: So intuitively, that gives you this idea that somehow we should be messing with the weight

0:17:44 - 0:17:50     Text: matrix W2 to change its behavior.

0:17:50 - 0:17:56     Text: And a naive idea for how to change its behavior is well, drawing on our intuitions from word

0:17:56 - 0:18:02     Text: vectors, maybe we just pick one column from W2, the one that the selector selects, and

0:18:02 - 0:18:06     Text: just subtract the word vector for Paris and add the word vector for Rome.

0:18:06 - 0:18:10     Text: And it turns out that there is in fact a paper, which I've linked to here, that does that,

0:18:10 - 0:18:13     Text: and it works to some extent.

0:18:13 - 0:18:15     Text: So they showed positive results with this approach.

0:18:15 - 0:18:20     Text: And that's really quite surprising on its own.
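
As a toy sketch of that naive idea, assuming for illustration that we already know which value column the selector picks and that we have word vectors for the old and new answers (this is not the method from the linked paper, just the intuition):

```python
import numpy as np

def naive_column_edit(W2, column_index, vec_old_answer, vec_new_answer):
    """Toy version of the naive edit: in the value column that the selector picks
    for this prompt, subtract the word vector of the old answer ("Paris") and add
    the word vector of the new answer ("Rome")."""
    W2_edited = np.array(W2, copy=True)
    W2_edited[:, column_index] += vec_new_answer - vec_old_answer
    return W2_edited
```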

0:18:20 - 0:18:24     Text: The particular paper that we've been following, the ROME paper, does something slightly different,

0:18:24 - 0:18:26     Text: but similar in spirit.

0:18:26 - 0:18:29     Text: So they apply a rank one update just to the weight matrix W2.

0:18:29 - 0:18:35     Text: They don't touch W1 at all, kind of consistent with our interpretation that W2 contains

0:18:35 - 0:18:38     Text: the values of the memory, and W1 is just the keys.

0:18:38 - 0:18:45     Text: So what I mean by a rank-one update is: W2 is a matrix, and I add to it another matrix

0:18:45 - 0:18:50     Text: formed from the outer product of two vectors, U and V transpose.

0:18:50 - 0:18:56     Text: And these two vectors are parameters that are optimized to maximize the probability that

0:18:56 - 0:19:03     Text: the model outputs Rome, while also minimizing the change in behavior over all the other inputs.

0:19:03 - 0:19:07     Text: It's out of the scope of this class to describe exactly what U and V are, or at least not

0:19:07 - 0:19:08     Text: in this lecture.

0:19:08 - 0:19:13     Text: So I'll maybe just punt this off to Eric's lecture when he gets there.

0:19:13 - 0:19:18     Text: But I just wanted to show you guys this to show the level of fine-grained control people

0:19:18 - 0:19:24     Text: are starting to look into in terms of editing knowledge in language models.

0:19:24 - 0:19:27     Text: And this wouldn't be complete if I didn't show you some more examples.

0:19:27 - 0:19:31     Text: So the successful example you guys already saw on an earlier slide.

0:19:31 - 0:19:35     Text: But to also show you some not quite successful examples, just to show where the field is

0:19:35 - 0:19:39     Text: right now, they also gave an example of trying to convince the model that the game Sonic

0:19:39 - 0:19:44     Text: Drift 2 was not made by Sega, but instead by Microsoft.

0:19:44 - 0:19:46     Text: And here you can see what the model does instead.

0:19:46 - 0:19:48     Text: It really struggles.

0:19:48 - 0:19:53     Text: It claims that the game is now made by a studio called Playdead, and this studio was led

0:19:53 - 0:19:54     Text: by a former Microsoft employee.

0:19:54 - 0:19:58     Text: So it's really kind of fighting against the change we're trying to make.

0:19:58 - 0:20:01     Text: And that's kind of where we are right now.

0:20:01 - 0:20:07     Text: So that was the first section on how language models currently represent knowledge.

0:20:07 - 0:20:11     Text: The main takeaway that I'd like you guys to have is that the transformer feed-forward

0:20:11 - 0:20:14     Text: network can be viewed as a key-value memory.

0:20:14 - 0:20:17     Text: And that's one of the reasons why when you see people scaling up Transformers to larger

0:20:17 - 0:20:22     Text: sizes, oftentimes they decide to put that scaling budget into making the Feed Forward

0:20:22 - 0:20:27     Text: layer wider, as opposed to making the attention layer wider, or adding more layers, because

0:20:27 - 0:20:31     Text: they're trying to increase that memorization capacity.

0:20:31 - 0:20:36     Text: And the second conclusion is that Transformers tend to look up information about the entity

0:20:36 - 0:20:38     Text: on the last token where it's mentioned.

0:20:38 - 0:20:40     Text: That's quite an interesting thing too.

0:20:40 - 0:20:45     Text: Prior to the ROME paper, there were other papers that tried to just fine-tune the entire

0:20:45 - 0:20:47     Text: network to get it to change its behavior.

0:20:47 - 0:20:50     Text: And when you fine-tune all of the parameters, indeed you can get it to change its behavior

0:20:50 - 0:20:54     Text: on Rome, but you also mess up a bunch of other facts too.

0:20:54 - 0:21:01     Text: So being able to make a small edit to one place actually does turn out to be helpful.

0:21:01 - 0:21:04     Text: And lastly, I just want to say this is a very new research area.

0:21:04 - 0:21:07     Text: Next year I could be saying something completely different.

0:21:07 - 0:21:11     Text: So just take that with a grain of salt.

0:21:11 - 0:21:16     Text: So we're going to go into the second half of the presentation.

0:21:16 - 0:21:20     Text: Actually, because we've still got time, are there any questions

0:21:20 - 0:21:22     Text: about the previous slides?

0:21:22 - 0:21:23     Text: Yeah.

0:21:23 - 0:21:31     Text: Yeah, I'm actually curious, you mentioned how to be surgical about knowledge alteration.

0:21:31 - 0:21:39     Text: Empirically, did the researchers discover, or did you discover in your own work, whether

0:21:39 - 0:21:42     Text: altering, you know, the Eiffel Tower being in Rome

0:21:42 - 0:21:47     Text: causes other cascading effects, like confusing Rome with Paris, or something?

0:21:47 - 0:21:48     Text: That's work.

0:21:48 - 0:21:50     Text: Yeah, yeah, that's a great question.

0:21:50 - 0:21:56     Text: So on the eval benchmarks that they have now for this task, they have, I'm not sure if

0:21:56 - 0:21:58     Text: the exact word is, something like "neighborhood prompts".

0:21:58 - 0:22:03     Text: So they have both prompts that are paraphrases of the original thing, to test that the model

0:22:03 - 0:22:05     Text: is robust to paraphrase.

0:22:05 - 0:22:10     Text: But they also have prompts where they ask, say, about Sears Tower or other towers.

0:22:10 - 0:22:14     Text: And make sure that those towers didn't move to Rome as well.

0:22:14 - 0:22:17     Text: And yeah, so that's very imperfect right now.

0:22:17 - 0:22:21     Text: There's this difficult balance between the two, and that could be a sign of many things.

0:22:21 - 0:22:25     Text: It could be a sign that the model doesn't have enough memorization capacity, so its only

0:22:25 - 0:22:30     Text: way of representing the Eiffel Tower is to just think of it as like a tower, with maybe

0:22:30 - 0:22:31     Text: some nationality mixed in.

0:22:31 - 0:22:36     Text: And when you edit that tower, you just move all the other towers too.

0:22:36 - 0:22:38     Text: Great question, yeah.

0:22:38 - 0:22:39     Text: Yeah.

0:22:39 - 0:22:48     Text: So from one of the other things, on the next slide, you see different types of questions.

0:22:48 - 0:22:52     Text: So you asked about the Eiffel Tower and where it is, but what about, like, asking about

0:22:52 - 0:23:00     Text: other types of properties of the entity, not just the original one,

0:23:00 - 0:23:03     Text: from the perspective of the edit?

0:23:03 - 0:23:07     Text: Yeah, so I think if I understand your question, it's not just asking about that one fact,

0:23:07 - 0:23:10     Text: but other facts related to the Eiffel Tower.

0:23:10 - 0:23:15     Text: Yeah, so they also have this kind of freeform generation prompt, I think, where they just

0:23:15 - 0:23:20     Text: initialize it with the Eiffel Tower, some short prompt, and then they measure the different kinds

0:23:20 - 0:23:21     Text: of text that come out.

0:23:21 - 0:23:26     Text: And I think if I remember correctly, they check like N-gram overlap with real text as

0:23:26 - 0:23:27     Text: well.

0:23:27 - 0:23:28     Text: I didn't.

0:23:28 - 0:23:33     Text: So they have a couple more analyses in the paper of how it behaves on other topics too.

0:23:33 - 0:23:35     Text: Yeah.

0:23:35 - 0:23:42     Text: One thing worth mentioning is, I guess, I worked with another student here on a paper on

0:23:42 - 0:23:44     Text: counterfactual updates.

0:23:44 - 0:23:52     Text: So we have this data set, which we should really get released out soon, that pairs one

0:23:52 - 0:23:58     Text: fact update with another implication of that fact update that is non-obvious.

0:23:58 - 0:24:03     Text: So for example, if Walt Disney had one of his Academy Awards stripped, then the next question

0:24:03 - 0:24:07     Text: would be how many Academy Awards did Walt Disney win.

0:24:07 - 0:24:11     Text: And that's like one of the areas that I think some researchers are thinking about for future

0:24:11 - 0:24:12     Text: work.

0:24:12 - 0:24:13     Text: Yeah.

0:24:13 - 0:24:43     Text: So the current method basically only allows control through the prompt.

0:24:43 - 0:24:47     Text: So whatever the prompt implies about the entity, you get to change what the answer to that

0:24:47 - 0:24:49     Text: prompt is.

0:24:49 - 0:24:53     Text: And it's basically at the moment up to the model to decide what implications are affected

0:24:53 - 0:24:55     Text: by changing that prompt.

0:24:55 - 0:24:57     Text: Which is another, you could say, weakness of the approach.

0:24:57 - 0:24:59     Text: It's not entirely interpretable.

0:24:59 - 0:25:00     Text: Everything is done through optimization.

0:25:00 - 0:25:01     Text: Yeah.

0:25:01 - 0:25:02     Text: Great question.

0:25:02 - 0:25:03     Text: Okay.

0:25:03 - 0:25:05     Text: We'll go on for now.

0:25:05 - 0:25:07     Text: And then, oh, it was a point.

0:25:07 - 0:25:08     Text: Oh, yeah.

0:25:08 - 0:25:15     Text: Okay.

0:25:15 - 0:25:22     Text: The question is about whether the low rank update has anything to do with adversarial machine

0:25:22 - 0:25:23     Text: learning.

0:25:23 - 0:25:29     Text: I'd say maybe there is a slight connection in that, if you look more deeply in the paper,

0:25:29 - 0:25:36     Text: what they're doing is they're optimizing the output of the weight matrix to change the

0:25:36 - 0:25:40     Text: label and that optimization procedure is very similar to what they do in adversarial machine

0:25:40 - 0:25:41     Text: learning.

0:25:41 - 0:25:45     Text: The low rankness of it is not necessarily connected to the adversarial learning.

0:25:45 - 0:25:50     Text: The low rank part is just to minimize the amount of change to the weight matrix.

0:25:50 - 0:25:53     Text: I can maybe just quickly do this on the board here.

0:25:53 - 0:26:03     Text: So if you have this weight matrix, W2 plus UV transpose, the reason this is considered

0:26:03 - 0:26:08     Text: a small update to the matrix is because if you think about anything multiplying W2 after

0:26:08 - 0:26:12     Text: it's received this update, let's say you're multiplying it by X, right?

0:26:12 - 0:26:19     Text: So this whole thing is multiplied by X. So let's just move this X into the expression.

0:26:19 - 0:26:26     Text: You get W2 times X and then UV transpose times X. And this part right here is just what

0:26:26 - 0:26:28     Text: W2 originally would have done.

0:26:28 - 0:26:34     Text: And this part right here, this vector V is being dot-producted with X. And in high-dimensional

0:26:34 - 0:26:39     Text: spaces, most dot products are basically zero.

0:26:39 - 0:26:44     Text: And so this quantity here is likely to be zero unless X is very close to V, and when it is zero,

0:26:44 - 0:26:47     Text: this whole update here basically disappears.

0:26:47 - 0:26:50     Text: So in that sense, you could think of a low rank update as a very small change to weight

0:26:50 - 0:26:52     Text: matrix.
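
Writing out the step on the board:

```latex
(W_2 + u v^{\top})\,x \;=\; W_2 x \;+\; u\,(v^{\top} x)
% v^T x is a scalar; in high dimensions it is near zero for most inputs x,
% so the added term only matters when x points in a direction close to v.
```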

0:26:52 - 0:27:00     Text: Okay, cool. We'll move on to the next section.

0:27:00 - 0:27:02     Text: All right.

0:27:02 - 0:27:06     Text: So now we've seen what language models currently do to represent knowledge or at least a theory

0:27:06 - 0:27:08     Text: about that.

0:27:08 - 0:27:10     Text: And we'd like to ask ourselves, well, is that really it?

0:27:10 - 0:27:16     Text: Do we just need to make feed-forward layers bigger and bigger to achieve the singularity

0:27:16 - 0:27:21     Text: or whatever it is that artificial intelligence researchers are focused on these days?

0:27:21 - 0:27:26     Text: Okay. So what is missing from transformers right now?

0:27:26 - 0:27:30     Text: We can automatically acquire knowledge from the web, but a lot of that information can

0:27:30 - 0:27:32     Text: be noisy or incorrect.

0:27:32 - 0:27:38     Text: So the web certainly has its share of misinformation or rumors and opinions.

0:27:38 - 0:27:42     Text: And when it absorbs that misinformation or other things, we can't trace the model's

0:27:42 - 0:27:44     Text: knowledge back to an attributable source.

0:27:44 - 0:27:47     Text: So we can trace it back to a particular layer in a feed-forward network, but that still

0:27:47 - 0:27:53     Text: doesn't tell us where in the training data it learned that.

0:27:53 - 0:27:57     Text: Now that would all be okay if we could then surgically edit the model to fix up all of those

0:27:57 - 0:27:58     Text: errors.

0:27:58 - 0:28:02     Text: But as you've seen, it doesn't work very reliably yet.

0:28:02 - 0:28:05     Text: And another fact that I didn't mention is if you apply a bunch of these edits in sequence

0:28:05 - 0:28:11     Text: to the model, eventually the parameters get kind of so damaged from the edits that it

0:28:11 - 0:28:14     Text: doesn't maintain its original performance anymore.

0:28:14 - 0:28:19     Text: And we can continue to try storing knowledge inside feed-forward layers, but the current

0:28:19 - 0:28:23     Text: memorization capacity is still too small, even though we're building these very large

0:28:23 - 0:28:26     Text: and expensive models.

0:28:26 - 0:28:30     Text: So we can rephrase some of these issues as a wish list for what we would want in a

0:28:30 - 0:28:33     Text: knowledge-based language model.

0:28:33 - 0:28:38     Text: We'd want fast and modular knowledge editing, so be able to robustly edit the model multiple

0:28:38 - 0:28:40     Text: times without breaking it.

0:28:40 - 0:28:44     Text: We'd like attribution and interpretability, so tracing a model's knowledge back to something

0:28:44 - 0:28:48     Text: in its training set, and we'd like efficient scaling.

0:28:48 - 0:28:53     Text: So we'd like to be able to increase the model's memory size by 10x without paying 10x more

0:28:53 - 0:28:55     Text: compute.

0:28:55 - 0:28:59     Text: And to give a motivating example, let's just say you wanted to use something like GPT-3

0:28:59 - 0:29:05     Text: to do question answering over your company or school wiki.

0:29:05 - 0:29:09     Text: At the moment, as we know, a single training run, at least when it was originally done,

0:29:09 - 0:29:14     Text: cost over $12 million, and we just can't afford to do that for every organization that

0:29:14 - 0:29:18     Text: wants to train a system off of their data.

0:29:18 - 0:29:22     Text: And furthermore, information is constantly being updated over time, for example, like the

0:29:22 - 0:29:25     Text: COVID requirements for being on campus.

0:29:25 - 0:29:31     Text: So all of this sort of motivates this wish list that I'm giving here.

0:29:31 - 0:29:36     Text: So with that, we'll turn to the next main half of the lecture, which is on memory-augmented

0:29:36 - 0:29:38     Text: models.

0:29:38 - 0:29:44     Text: So let me first just give a basic overview of what a memory-augmented model is.

0:29:44 - 0:29:48     Text: To start with, let's just consider a standard neural network model, which takes some sort

0:29:48 - 0:29:53     Text: of input like this one here, passes it through some dense computation, and then produces some

0:29:53 - 0:29:54     Text: output.

0:29:54 - 0:29:59     Text: And the difference that we're going to make to this model is we're going to attach a memory

0:29:59 - 0:30:01     Text: retriever to it.

0:30:01 - 0:30:05     Text: So the input is going to be fed into this memory retriever, which then accesses some external

0:30:05 - 0:30:11     Text: knowledge source that can be easily scaled, easily edited, easily understood by humans,

0:30:11 - 0:30:14     Text: like Wikipedia.

0:30:14 - 0:30:18     Text: And from that, we'll try to identify some piece of information that is relevant for the

0:30:18 - 0:30:23     Text: task at hand, and feed it back into the neural network, and then produce the prediction.

0:30:23 - 0:30:26     Text: So that's the basic approach that we're thinking about.
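
A minimal sketch of that setup, assuming hypothetical retriever and generator components (none of these names come from a specific paper or library):

```python
def answer_with_memory(query, retriever, knowledge_source, generator, k=1):
    """Minimal sketch of the memory-augmented setup described above.

    `retriever`, `knowledge_source`, and `generator` are hypothetical components:
    the retriever scores entries of the knowledge source (e.g. Wikipedia passages)
    against the input, and the generator is an ordinary text-to-text model."""
    # 1. Retrieve the top-k memories most relevant to the input.
    memories = retriever.top_k(query, knowledge_source, k=k)
    # 2. Feed the retrieved text back into the network alongside the input.
    augmented_input = " ".join(m.text for m in memories) + " [SEP] " + query
    # 3. Produce the prediction conditioned on both.
    return generator.generate(augmented_input)
```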

0:30:26 - 0:30:31     Text: And the memory that the memory retriever selects can be any number of things.

0:30:31 - 0:30:33     Text: It could be a document on the web.

0:30:33 - 0:30:38     Text: It could be a record in a database. It could be a training example, an entity embedding.

0:30:38 - 0:30:42     Text: I'm just going to focus on text for now, but most of what we'll talk about in this lecture

0:30:42 - 0:30:49     Text: can apply to other kinds of objects, and you should just keep that in mind that it's

0:30:49 - 0:30:52     Text: really not just about text.

0:30:52 - 0:30:56     Text: So this potentially meets our wish list of things we'd like to do.

0:30:56 - 0:30:59     Text: You can easily edit the knowledge in something like Wikipedia.

0:30:59 - 0:31:01     Text: You can easily attribute back to it.

0:31:01 - 0:31:04     Text: There's a source, there's an author for everything that's put in there, and there's efficient

0:31:04 - 0:31:09     Text: scaling, because I can always add more articles to Wikipedia, and I don't have to change the

0:31:09 - 0:31:13     Text: size of a neural network that's accessing it.

0:31:13 - 0:31:18     Text: And some motivating applications for why you would care about this.

0:31:18 - 0:31:21     Text: If you're building an open domain dialogue or question answering system, you want to

0:31:21 - 0:31:26     Text: have robust access to knowledge by retrieving documents on the web.

0:31:26 - 0:31:31     Text: If you're generating code, I'm pretty sure all of us are guilty of, at least at some point,

0:31:31 - 0:31:33     Text: going on Stack Overflow and copying a snippet.

0:31:33 - 0:31:39     Text: So even we do retrieval; we are, in a way, a form of memory-augmented model.

0:31:39 - 0:31:42     Text: If you're doing image generation, if somebody tells you, I want a picture of the Eiffel Tower

0:31:42 - 0:31:47     Text: on the White House lawn, you might consult some reference pictures of those objects.

0:31:47 - 0:31:51     Text: And if you're doing fact checking, you might want to retrieve documents that support or

0:31:51 - 0:31:52     Text: refute a claim.

0:31:52 - 0:31:56     Text: All of these things are very knowledge intensive tasks, and could benefit from an approach like

0:31:56 - 0:31:57     Text: this.

0:31:57 - 0:31:58     Text: Yeah, question.

0:31:58 - 0:31:59     Text: Yeah.

0:31:59 - 0:32:00     Text: Yeah.

0:32:00 - 0:32:01     Text: Yeah.

0:32:01 - 0:32:02     Text: That's a good question.

0:32:02 - 0:32:26     Text: So the question is whether you have to retrieve one memory, and that's not the case.

0:32:26 - 0:32:30     Text: I'm going to use that as the simplified example that we'll work with, but you could retrieve

0:32:30 - 0:32:31     Text: multiple memories.

0:32:31 - 0:32:35     Text: The complexity of retrieving multiple memories increases, and we may get to that in a

0:32:35 - 0:32:36     Text: little bit.

0:32:36 - 0:32:37     Text: Good question.

0:32:37 - 0:32:38     Text: Yeah.

0:32:38 - 0:32:39     Text: Okay.

0:32:39 - 0:32:43     Text: All right.

0:32:43 - 0:32:47     Text: So the rest of this talk is going to be structured around the main design questions around

0:32:47 - 0:32:50     Text: how to design a memory augmented model.

0:32:50 - 0:32:53     Text: So first you have to choose what your memories are.

0:32:53 - 0:32:56     Text: I'm actually not going to focus on that much because based on your application, you can

0:32:56 - 0:33:00     Text: usually guess what you would want your memories to be.

0:33:00 - 0:33:04     Text: And then we're going to think very hard about how to retrieve the memories.

0:33:04 - 0:33:08     Text: That's essentially the heart, maybe the hardest part of the problem.

0:33:08 - 0:33:11     Text: Approaches we'll look at are: you could use an off-the-shelf search engine like Google

0:33:11 - 0:33:17     Text: or Stack Overflow, or you could train your own memory retriever, which we'll spend some

0:33:17 - 0:33:18     Text: time on.

0:33:18 - 0:33:22     Text: And lastly, we'll end by looking at how to use retrieved memories.

0:33:22 - 0:33:28     Text: So we'll cover a few different approaches such as text fusion, label smearing.

0:33:28 - 0:33:32     Text: And perhaps most interestingly, I'll talk about a few common failure modes for when models

0:33:32 - 0:33:34     Text: try to use memory.

0:33:34 - 0:33:38     Text: One of them is something that we'll call underutilization, where the model actually ignores

0:33:38 - 0:33:39     Text: the retrieved memories.

0:33:39 - 0:33:43     Text: And another one is over reliance, where the model somehow kind of becomes too dependent

0:33:43 - 0:33:44     Text: on memory.

0:33:44 - 0:33:48     Text: And I'll talk about that when we get there.

0:33:48 - 0:33:50     Text: Okay.

0:33:50 - 0:33:54     Text: So let's go into the section on retrieving memories.

0:33:54 - 0:33:57     Text: So I'll kind of organize them into two broad groups.

0:33:57 - 0:34:01     Text: So one is the set of approaches that use an external tool.

0:34:01 - 0:34:04     Text: And the other is the set where you train your own.

0:34:04 - 0:34:09     Text: And we'll start with an approach that uses an external tool, just because I think some

0:34:09 - 0:34:14     Text: of those approaches have been really popping up this year and are quite exciting.

0:34:14 - 0:34:19     Text: The first approach we'll look at is from this paper called LaMDA, which stands for Language

0:34:19 - 0:34:21     Text: Models for Dialog Applications.

0:34:21 - 0:34:25     Text: So on the right we've got a dialogue between a human user and the model.

0:34:25 - 0:34:28     Text: This is a, I think, a real dialogue from their paper.

0:34:28 - 0:34:32     Text: And the user is just asking about a particular artist and the model is giving a very spirited

0:34:32 - 0:34:37     Text: reply, complete with personal opinions and even follow up information.

0:34:37 - 0:34:42     Text: So LaMDA is an open-domain dialog chatbot that's designed to cover a very large range

0:34:42 - 0:34:43     Text: of topics.

0:34:43 - 0:34:47     Text: So you need some kind of memory component that's able to handle anything the user might

0:34:47 - 0:34:49     Text: throw at the model.

0:34:49 - 0:34:53     Text: And the basic version of the model is just a transformer decoder.

0:34:53 - 0:34:58     Text: So it's the same kind of transformer that we were studying in the previous slides.

0:34:58 - 0:35:02     Text: The input to that transformer is essentially the previous turns of the conversation, represented

0:35:02 - 0:35:03     Text: as text.

0:35:03 - 0:35:07     Text: And the output is just a new utterance that it needs to generate, also as text.

0:35:07 - 0:35:10     Text: So it's just a text to text kind of approach.

0:35:10 - 0:35:11     Text: What's new though?

0:35:11 - 0:35:14     Text: Oh, okay, not quite there, yeah.

0:35:14 - 0:35:18     Text: So one last thing about this model though is as they were developing it, they noticed that

0:35:18 - 0:35:20     Text: it often generated factually incorrect claims.

0:35:20 - 0:35:26     Text: So just to highlight that, this last claim that the model makes is factually incorrect.

0:35:26 - 0:35:31     Text: In this case, this particular artist that was supposedly inspired by the earlier artists

0:35:31 - 0:35:34     Text: stopped working before the first one began working.

0:35:34 - 0:35:37     Text: So this just can't be true.

0:35:37 - 0:35:40     Text: And their approach to solving this problem will be to teach their base model to learn

0:35:40 - 0:35:45     Text: to use a search engine to validate or fix claims that it's made.

0:35:45 - 0:35:48     Text: And I'll show you how that approach works on the next slide.

0:35:48 - 0:35:54     Text: The basic idea is you've got a user interacting with LaMDA, which is this big box here.

0:35:54 - 0:35:59     Text: And what they've decided to do is have multiple agents inside the big box that can kind of

0:35:59 - 0:36:04     Text: interact with each other and kind of work things out before they give a reply back to the

0:36:04 - 0:36:05     Text: user.

0:36:05 - 0:36:07     Text: So here's how it might go.

0:36:07 - 0:36:11     Text: The user says to the base model, when was the Eiffel Tower built?

0:36:11 - 0:36:16     Text: And the base model replies it was constructed in 1887.

0:36:16 - 0:36:20     Text: But unlike the basic approach, it doesn't send that response immediately back to the

0:36:20 - 0:36:21     Text: user.

0:36:21 - 0:36:25     Text: It actually sends it to this agent called Research.

0:36:25 - 0:36:29     Text: And then research then takes that information and decides, okay, you know, this claim looks

0:36:29 - 0:36:32     Text: a little bit sketchy.

0:36:32 - 0:36:35     Text: I'm going to send a query to the search engine.

0:36:35 - 0:36:40     Text: I'm anthropomorphizing a bunch here, but it just helps with the explanation.

0:36:40 - 0:36:45     Text: And then the search engine then replies back with the web search results for that query.

0:36:45 - 0:36:49     Text: And in it, we have actually the correct answer, which is 1889.

0:36:49 - 0:36:55     Text: So research then takes that information and produces a new response that has the correct

0:36:55 - 0:36:57     Text: information and sends that back to the user.

0:36:57 - 0:36:59     Text: So that's the overall flow of the approach.

0:36:59 - 0:37:04     Text: And you can see that the search engine is able to intervene to fix a model's responses.

0:37:04 - 0:37:05     Text: Any questions about that?

0:37:05 - 0:37:19     Text: Okay, yeah, the question was whether this particular flow happens for all questions to the

0:37:19 - 0:37:20     Text: model.

0:37:20 - 0:37:25     Text: And actually, as we'll see in a later slide, the model gets to decide who it talks to next.

0:37:25 - 0:37:30     Text: So base has to decide to talk to research as opposed to talking to the user.

0:37:30 - 0:37:34     Text: So there's a learning process where each agent gets to decide whether to go right back

0:37:34 - 0:37:37     Text: to the user or to talk to another one of the other agents.

0:37:37 - 0:37:38     Text: Go ahead.

0:37:38 - 0:38:03     Text: Okay, yeah, the question is how to limit the amount of information coming back from the

0:38:03 - 0:38:04     Text: search engine.

0:38:04 - 0:38:10     Text: I think in this particular approach, the search engine returns the first snippet.

0:38:10 - 0:38:13     Text: And then if research is still not happy, it asks the same question again.

0:38:13 - 0:38:16     Text: And then they've designed it so the search engine returns the next snippet.

0:38:16 - 0:38:19     Text: So it just sort of yields control back to the first.

0:38:19 - 0:38:23     Text: Yeah, yeah.

0:38:23 - 0:38:37     Text: Yeah, okay, so the question is whether research sends any feedback back to the base model

0:38:37 - 0:38:38     Text: that it's made a mistake.

0:38:38 - 0:38:40     Text: And that's a great idea.

0:38:40 - 0:38:42     Text: I don't think they do that in the paper.

0:38:42 - 0:38:47     Text: They just, I mean, research overrides base so the user gets what research said.

0:38:47 - 0:38:50     Text: But base never learns, I think, from what research says.

0:38:50 - 0:38:53     Text: So that's a really, yeah, that's a really great point.

0:38:53 - 0:39:02     Text: Okay, so the question is, why do we need the base model?

0:39:02 - 0:39:03     Text: Yeah, that's a great point too.

0:39:03 - 0:39:09     Text: So with LaMDA, the researchers who developed it cared not just about answering factual

0:39:09 - 0:39:12     Text: questions but also about making it interesting and fun and engaging.

0:39:12 - 0:39:16     Text: So the base model has a lot of that engaging behavior.

0:39:16 - 0:39:19     Text: And they wanted to preserve that while still preserving factuality.

0:39:19 - 0:39:24     Text: So the other two agents are kind of there to police the base model.

0:39:24 - 0:39:27     Text: That's maybe one explanation.

0:39:27 - 0:39:31     Text: Oh, right.

0:39:31 - 0:39:33     Text: Great, great questions.

0:39:33 - 0:39:36     Text: So now that we've seen the overall control flow of the model, we can look at how the model

0:39:36 - 0:39:37     Text: is trained.

0:39:37 - 0:39:39     Text: And it's actually quite simple.

0:39:39 - 0:39:42     Text: So their modeling approach is to just treat everything as dialogue.

0:39:42 - 0:39:46     Text: So let's look at a particular turn of the model's operation.

0:39:46 - 0:39:52     Text: So there've been a couple turns of conversation so far, which I've listed here.

0:39:52 - 0:39:58     Text: And you can see it's just saying who talked and who they talked to.

0:39:58 - 0:40:03     Text: And the output at this particular time step is just another utterance and who it

0:40:03 - 0:40:04     Text: should be addressed to.

0:40:04 - 0:40:06     Text: So all of this is just text.

0:40:06 - 0:40:08     Text: It's text going in, it's text going out.

0:40:08 - 0:40:13     Text: So you guys have seen transformer models and basically this fits right into the contract

0:40:13 - 0:40:16     Text: of a standard transformer model.

0:40:16 - 0:40:20     Text: The only kind of special detail is that when you generate the text, you have to start your

0:40:20 - 0:40:22     Text: sentence with who you're addressing.

0:40:22 - 0:40:30     Text: And that provides the control of which agent responds next.
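
To make that concrete, the serialized input and output might look something like this; the exact formatting and agent names in the LaMDA paper differ, so treat this purely as an illustration:

```text
Input (the conversation so far, as plain text):
  User to LaMDA-Base: When was the Eiffel Tower built?
  LaMDA-Base to LaMDA-Research: It was constructed in 1887.
  LaMDA-Research to Search: Eiffel Tower construction date
  Search to LaMDA-Research: ... construction began in 1887 and the tower was completed in 1889 ...

Output (the next utterance, also plain text, starting with who is being addressed):
  LaMDA-Research to User: Work started in 1887, and it was completed in 1889.
```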

0:40:30 - 0:40:32     Text: I already mentioned this.

0:40:32 - 0:40:38     Text: And the perhaps the most important question is, okay, we've got this text to text data.

0:40:38 - 0:40:39     Text: How do we train this model?

0:40:39 - 0:40:42     Text: And the approach in LaMDA is actually quite simple, too.

0:40:42 - 0:40:45     Text: They basically just get human demonstrations.

0:40:45 - 0:40:50     Text: So human crowd workers play the role of user and research in this dialogue.

0:40:50 - 0:40:54     Text: There are people who are looking at the base model's utterances and saying, oh, I don't

0:40:54 - 0:40:55     Text: like that.

0:40:55 - 0:40:57     Text: I think a search query should be sent here.

0:40:57 - 0:41:00     Text: And when the search results come back, they're reading the results and then deciding how

0:41:00 - 0:41:02     Text: LaMDA should respond instead.

0:41:02 - 0:41:07     Text: So it's a really elegant and simple approach, but it does require you to have trained crowd

0:41:07 - 0:41:11     Text: workers and put in a good amount of budget to get the behavior you want.

0:41:11 - 0:41:14     Text: But still quite impressive that they're able to do this.

0:41:14 - 0:41:18     Text: This is a real example, I think, from the paper.

0:41:18 - 0:41:22     Text: Cool, any questions there?

0:41:22 - 0:41:23     Text: All right.

0:41:23 - 0:41:27     Text: So although the approach is simple, it actually achieves quite a bit.

0:41:27 - 0:41:32     Text: So the model learns to reformulate the previous turns of the conversation as a query that can

0:41:32 - 0:41:34     Text: go into Google search.

0:41:34 - 0:41:38     Text: So it's kind of shoe-horning the problem into something that Google search or some kind

0:41:38 - 0:41:40     Text: of web search can understand.

0:41:40 - 0:41:45     Text: And then it's learning also from human demonstrations how to incorporate the knowledge from the search

0:41:45 - 0:41:49     Text: results back into the utterance that it's putting out.

0:41:49 - 0:41:54     Text: And because this work came out around the same time and is also very exciting, I'll also

0:41:54 - 0:42:00     Text: point you to WebGPT, which is another model that learns to use web search.

0:42:00 - 0:42:04     Text: In their case, they provide human demonstrators with an actual UI and have human demonstrators

0:42:04 - 0:42:05     Text: use that.

0:42:05 - 0:42:09     Text: But they ultimately, I think, convert the history of actions that the user takes again

0:42:09 - 0:42:16     Text: into a piece of text that the model then simply consumes and uses to predict the next action.

0:42:16 - 0:42:20     Text: The additional thing here is that they use reinforcement learning to then fine tune their

0:42:20 - 0:42:24     Text: system further on top of what they learned from human demonstrations.

0:42:24 - 0:42:26     Text: And that's something worth checking out as well.

0:42:26 - 0:42:31     Text: So the main takeaway for this little section here is that many external retrieval tools

0:42:31 - 0:42:35     Text: accept text as input and return text as output.

0:42:35 - 0:42:39     Text: So if you want to have an external memory and interface with one of these things, all

0:42:39 - 0:42:44     Text: the task really boils down to is learning to generate text queries to that external tool

0:42:44 - 0:42:48     Text: and then learning to understand the text output of the tool.

0:42:48 - 0:42:54     Text: And both of these tasks can be handled by standard off-the-shelf tools that all of you are already

0:42:54 - 0:42:57     Text: familiar with from previous lectures.

0:42:57 - 0:43:02     Text: As long as you have demonstrations for how to do that.

0:43:02 - 0:43:05     Text: Or if you're able to do RL training, which we won't cover here.

0:43:05 - 0:43:08     Text: So that's the overview of how to use external search tools.

0:43:08 - 0:43:13     Text: You can imagine that if you had a database instead of a web search, you could provide demonstrations

0:43:13 - 0:43:20     Text: of how to write SQL queries to that database or any other sort of tool that you could imagine.
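
As a hedged sketch of what such a text-in, text-out tool interface could look like, here is a toy example with a SQLite database; the table schema and the idea of the model emitting raw SQL are assumptions for illustration, not anything prescribed by the papers above.

```python
# Sketch of a text-in / text-out tool interface (illustrative; the table
# schema and the model emitting raw SQL are assumptions).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE landmarks (name TEXT, city TEXT, year INTEGER)")
conn.execute("INSERT INTO landmarks VALUES ('Eiffel Tower', 'Paris', 1889)")

def call_tool(sql_query: str) -> str:
    """The 'tool': takes text (SQL), returns text (rows rendered as a string)."""
    rows = conn.execute(sql_query).fetchall()
    return "\n".join(", ".join(str(v) for v in row) for row in rows)

# In a trained system this query would be generated by the model from the
# dialogue context; here we hard-code what a demonstration might look like.
model_generated_query = "SELECT year FROM landmarks WHERE name = 'Eiffel Tower'"
tool_output = call_tool(model_generated_query)
print(tool_output)  # "1889" -- text the reader model can then condition on
```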

0:43:20 - 0:43:22     Text: All right.

0:43:22 - 0:43:25     Text: So at this point, you might say, all right, we can query web search and web search is very

0:43:25 - 0:43:26     Text: powerful.

0:43:26 - 0:43:28     Text: So why would we use anything else?

0:43:28 - 0:43:31     Text: And to that, I have a couple responses.

0:43:31 - 0:43:34     Text: So first of all, web search is just far from perfect.

0:43:34 - 0:43:37     Text: And the reason it's as good as it is today is because of research.

0:43:37 - 0:43:38     Text: And we're here to do research.

0:43:38 - 0:43:44     Text: So if you're just going to rely on web search being good, that sort of defeats the point.

0:43:44 - 0:43:47     Text: If you don't believe me, try some of these queries.

0:43:47 - 0:43:51     Text: So if you search for a famous lawyer who got into car accident, you will find that all

0:43:51 - 0:43:56     Text: the results are about lawyers you can call if you get into a car accident.

0:43:56 - 0:44:00     Text: If you search for "use NLP to parse research papers", you will find a bunch of research papers

0:44:00 - 0:44:01     Text: on parsing.

0:44:01 - 0:44:06     Text: And after doing a few of these, the illusion of web search working really well kind of fades

0:44:06 - 0:44:08     Text: away a little bit.

0:44:08 - 0:44:12     Text: And also, if you speak a language other than English, you might find that search performance

0:44:12 - 0:44:15     Text: in different languages is really not quite the same.

0:44:15 - 0:44:19     Text: So there's still a lot to do to improve retrieval in web search.

0:44:19 - 0:44:26     Text: And I sort of consider web search to be just one component inside the larger

0:44:26 - 0:44:31     Text: set of things that memory augmented models could potentially do.

0:44:31 - 0:44:35     Text: And second of all, just the plain API of web search isn't designed to handle everything

0:44:35 - 0:44:37     Text: you might want to do.

0:44:37 - 0:44:41     Text: So you could imagine a doctor given a medical image might want to retrieve similar images

0:44:41 - 0:44:43     Text: from a medical textbook.

0:44:43 - 0:44:47     Text: That's not quite something that web search is cut out to do right now.

0:44:47 - 0:44:50     Text: Or if you're a programmer who's been given a programming challenge, you might want to retrieve

0:44:50 - 0:44:52     Text: relevant algorithms.

0:44:52 - 0:44:54     Text: Also something web search doesn't do.

0:44:54 - 0:44:56     Text: If you're in fashion, if you're given three pieces of clothing, can you retrieve another

0:44:56 - 0:44:59     Text: piece of clothing that completes your outfit?

0:44:59 - 0:45:03     Text: Or if you're a novelist and you're given a story, retrieve other stories that have the

0:45:03 - 0:45:05     Text: same plot.

0:45:05 - 0:45:09     Text: Or if you're a journalist, if you're given a claim, retrieve and use articles that refute

0:45:09 - 0:45:10     Text: or contradicted.

0:45:10 - 0:45:14     Text: These are all retrieval tasks that would be quite useful, but existing search tools just

0:45:14 - 0:45:15     Text: don't handle.

0:45:15 - 0:45:20     Text: And so these are all reasons why I think the retrieval problem is still interesting to

0:45:20 - 0:45:22     Text: look at.

0:45:22 - 0:45:25     Text: And a third and final point is that web search only accesses public data.

0:45:25 - 0:45:30     Text: So if you have any task that doesn't condition on public data, you're still going to need

0:45:30 - 0:45:33     Text: a retriever of your own.

0:45:33 - 0:45:34     Text: Cool.

0:45:34 - 0:45:38     Text: So with that, we'll turn to the next part of the talk, which is how to train your own

0:45:38 - 0:45:44     Text: neural retriever, which is something that I find very interesting.

0:45:44 - 0:45:51     Text: So we'll start by giving an anatomy of a neural retriever kind of similar to what we showed

0:45:51 - 0:45:53     Text: for feedforward networks and transformers.

0:45:53 - 0:45:56     Text: We're going to go with this key value type of interpretation.

0:45:56 - 0:45:59     Text: So you have a set of keys paired with a set of values.

0:45:59 - 0:46:05     Text: And given some input, you're going to compute some sort of similarity score between the

0:46:05 - 0:46:07     Text: input and each of the keys.

0:46:07 - 0:46:13     Text: And once you've computed that score, you basically want to return the value associated with

0:46:13 - 0:46:15     Text: the highest scoring key.

0:46:15 - 0:46:22     Text: Or you could return the values for the top K highest scoring keys, or use any other selection rule.

0:46:22 - 0:46:25     Text: So to just kind of ground this example, the input could be something like Eiffel Tower

0:46:25 - 0:46:26     Text: location.

0:46:26 - 0:46:31     Text: The keys could be titles of documents and the values could be the corresponding text associated

0:46:31 - 0:46:34     Text: with the document.

0:46:34 - 0:46:39     Text: And the basic takeaway here is just that a retriever is really just a function that takes

0:46:39 - 0:46:43     Text: some input and a key and produces a score.

0:46:43 - 0:46:45     Text: Once you have that, you basically have a retriever.

0:46:45 - 0:46:51     Text: You score all the memories and then you take the ones that have the highest score.

0:46:51 - 0:46:55     Text: For the remaining slides, I'll actually go with a slightly simplified setup.

0:46:55 - 0:46:59     Text: Because in many tasks, there's really no distinction between the keys and the values.

0:46:59 - 0:47:00     Text: Sometimes they're just the same thing.

0:47:00 - 0:47:04     Text: So for example, with Wikipedia documents, you just take the whole text as the key and

0:47:04 - 0:47:06     Text: the value.

0:47:06 - 0:47:12     Text: And so I'll go with this simplified schematic where you're just computing scores against

0:47:12 - 0:47:13     Text: what I'll call memories.

0:47:13 - 0:47:18     Text: And the highest scoring memory is what's returned from the memory retriever.

0:47:18 - 0:47:21     Text: And we'll go with this formulation that the retriever is just a function that takes the

0:47:21 - 0:47:25     Text: input and the memory and produces a score.

0:47:25 - 0:47:29     Text: Okay, so I've said it's just a function, but what sort of functions do people actually

0:47:29 - 0:47:31     Text: use in practice to compute this score?

0:47:31 - 0:47:35     Text: And the answer here is not too surprising.

0:47:35 - 0:47:39     Text: You guys have seen Bert and other sorts of transformers in previous classes.

0:47:39 - 0:47:41     Text: It's a very flexible model class.

0:47:41 - 0:47:44     Text: I'm trying not to introduce kind of unnecessary complexity.

0:47:44 - 0:47:49     Text: So we'll just go with a Bert model that takes the input and the memory.

0:47:49 - 0:47:55     Text: And you put some sort of regression layer on top of the output layer of Bert.

0:47:55 - 0:48:00     Text: So maybe the CLS token embedding of Bert, you put a regression layer on top and

0:48:00 - 0:48:03     Text: it produces some float valued score.

0:48:03 - 0:48:05     Text: And this whole thing is differentiable.

0:48:05 - 0:48:07     Text: So the regression layer is differentiable.

0:48:07 - 0:48:08     Text: Bert is differentiable.

0:48:08 - 0:48:12     Text: This gives you basically a neural network that produces a score.
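
Here is a minimal sketch of that cross-encoder scorer, assuming a HuggingFace-style Bert; the model name and the single linear "regression layer" on the CLS embedding are illustrative choices, not a prescribed recipe.

```python
# Sketch: cross-encoder retriever score. Feed (input, memory) jointly into
# BERT, take the [CLS] embedding, and map it to a scalar score.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
score_head = torch.nn.Linear(encoder.config.hidden_size, 1)  # "regression layer"

def cross_encoder_score(query: str, memory: str) -> torch.Tensor:
    batch = tokenizer(query, memory, return_tensors="pt", truncation=True)
    cls = encoder(**batch).last_hidden_state[:, 0]   # [CLS] token embedding
    return score_head(cls).squeeze(-1)               # float-valued score

s = cross_encoder_score("Eiffel Tower location", "The Eiffel Tower is in Paris.")
# Everything above is differentiable, so gradients flow into both the
# regression layer and Bert when this score is used in a training loss.
```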

0:48:12 - 0:48:16     Text: And the advantages are you get this very powerful model that's comparing the input against

0:48:16 - 0:48:18     Text: the memory and it's differentiable.

0:48:18 - 0:48:20     Text: So all of that is good.

0:48:20 - 0:48:26     Text: The disadvantage of this approach is that if you have millions of memories, then every

0:48:26 - 0:48:30     Text: time a new input comes in, in order to retrieve a memory, you have to run this computation

0:48:30 - 0:48:33     Text: against all one million of the memories.

0:48:33 - 0:48:38     Text: So that's just way too expensive to do if you're thinking about something like all of Wikipedia

0:48:38 - 0:48:39     Text: or all of the web.

0:48:39 - 0:48:47     Text: So next we'll turn to a different architecture that is more commonly used for retrieval on

0:48:47 - 0:48:49     Text: the next slide.

0:48:49 - 0:48:51     Text: So it's very similar.

0:48:51 - 0:48:53     Text: The Bert picture comes up again.

0:48:53 - 0:48:57     Text: This time what we're doing is we're taking the input and feeding only the input into the

0:48:57 - 0:49:01     Text: transformer to produce a single vector that we'll call the input vector.

0:49:01 - 0:49:06     Text: And then we'll have a separate transformer encode each memory separately to produce a

0:49:06 - 0:49:12     Text: memory vector for each memory and then the relevant score between the input and the memory

0:49:12 - 0:49:16     Text: is just the dot product of these two vectors.

0:49:16 - 0:49:20     Text: It could be the dot product, it could be cosine similarity, just any function that you can

0:49:20 - 0:49:24     Text: efficiently compute between two vectors.
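
And a minimal sketch of this dual-encoder setup, again assuming HuggingFace-style Bert models; whether the two towers share weights and which pooling you use are design choices, so treat this as illustrative.

```python
# Sketch: dual-encoder retriever. Separate encoders map the input and each
# memory to single vectors; relevance is the dot product of the two.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
input_encoder = BertModel.from_pretrained("bert-base-uncased")
memory_encoder = BertModel.from_pretrained("bert-base-uncased")

def embed(encoder, texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return encoder(**batch).last_hidden_state[:, 0]   # [CLS] vector per text

memories = ["The Eiffel Tower is in Paris.", "The Sears Tower is in Chicago."]
memory_vecs = embed(memory_encoder, memories)          # precompute once, offline

input_vec = embed(input_encoder, ["Eiffel Tower location"])
scores = input_vec @ memory_vecs.T                     # one dot product per memory
best = scores.argmax().item()                          # index of the top memory
print(memories[best])
```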

0:49:24 - 0:49:27     Text: So why are we proposing this instead?

0:49:27 - 0:49:30     Text: This has a couple advantages over the previous architecture.

0:49:30 - 0:49:38     Text: The first is that you can run this side of the model, the right side, on all of the memories

0:49:38 - 0:49:39     Text: in advance.

0:49:39 - 0:49:43     Text: So before any inputs even come in, you can just pre-compute the memory vector for each

0:49:43 - 0:49:44     Text: thing.

0:49:44 - 0:49:48     Text: So if it's Wikipedia, you can produce a vector for every document in Wikipedia.

0:49:48 - 0:49:52     Text: And when a new input comes in, you don't have to redo that work.

0:49:52 - 0:49:54     Text: So that saves a lot of compute.

0:49:54 - 0:49:58     Text: The only thing that you need to do when a new input comes in is you need to compute this

0:49:58 - 0:50:03     Text: input vector and then do dot products against all of the memories.

0:50:03 - 0:50:08     Text: So dot products are cheap and can happen much more quickly than running an entire BERT

0:50:08 - 0:50:09     Text: model over again.

0:50:09 - 0:50:14     Text: And that's the fundamental savings that you get from using a model like this.

0:50:14 - 0:50:18     Text: Something that we won't cover in as much detail here is that for the dot product, there

0:50:18 - 0:50:26     Text: are also fast nearest neighbors algorithms that let you efficiently find the memory vectors

0:50:26 - 0:50:31     Text: that have the highest dot product with the input vector without actually computing over

0:50:31 - 0:50:32     Text: all of the memory vectors.

0:50:32 - 0:50:36     Text: So it's a sublinear search algorithm that allows you to find them.

0:50:36 - 0:50:41     Text: And the basic intuition there, at least there are a couple of them, is that you can take

0:50:41 - 0:50:45     Text: your set of memory vectors and build some sort of tree structure over them, kind of organizing

0:50:45 - 0:50:46     Text: them spatially.

0:50:46 - 0:50:50     Text: And once you've built that tree structure when the new input vector comes in, you can

0:50:50 - 0:50:54     Text: essentially kind of traverse down that tree to find the things that are most similar

0:50:54 - 0:50:57     Text: without computing dot products with everything else.

0:50:57 - 0:51:01     Text: There are other algorithms too that use hashing and other techniques, but they'll be out

0:51:01 - 0:51:06     Text: of the scope for today's class.
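
For the fast similarity search itself, one common option is a library like FAISS. A minimal sketch, using an exact inner-product index for simplicity as a stand-in for the tree- or hash-based approximate structures just mentioned:

```python
# Sketch: fast maximum-inner-product search over precomputed memory vectors.
# IndexFlatIP is exact; approximate indexes trade a little accuracy for
# sublinear search, as discussed above.
import numpy as np
import faiss

dim = 768
memory_vecs = np.random.rand(100_000, dim).astype("float32")  # stand-in vectors
index = faiss.IndexFlatIP(dim)        # inner-product similarity
index.add(memory_vecs)                # build the index once, offline

input_vec = np.random.rand(1, dim).astype("float32")
scores, ids = index.search(input_vec, 10)   # top-10 memories for this input
print(ids[0], scores[0])
```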

0:51:06 - 0:51:10     Text: And the other good property here is that all of this is still differentiable, so you can

0:51:10 - 0:51:13     Text: still train this thing with gradient descent like anything else.

0:51:13 - 0:51:18     Text: The main disadvantage of this approach is also kind of due to its advantage, which is

0:51:18 - 0:51:22     Text: that all of the expressiveness of this model has to go through that one dot product.

0:51:22 - 0:51:26     Text: So anything you want to remember about the input or anything you want to remember about

0:51:26 - 0:51:31     Text: the memory, all has to get squeezed into that one memory vector and that one input vector.

0:51:31 - 0:51:37     Text: And that's a bottleneck that kind of researchers have been dealing with in recent research.

0:51:37 - 0:51:40     Text: What you'll find is that there are a lot of approaches that try to strike some kind of

0:51:40 - 0:51:43     Text: balance between this approach and the approach on the previous slide.

0:51:43 - 0:51:48     Text: So a common thing to do is to use this approach to retrieve a top set of candidates and then

0:51:48 - 0:51:53     Text: run a more complex model like the one on the previous slide to rescore and re-rank the

0:51:53 - 0:51:56     Text: candidates proposed by the first model.

0:51:56 - 0:52:01     Text: You'll also find techniques that try to take the memory and produce, say, five vectors and

0:52:01 - 0:52:05     Text: then use all five of those to somehow compute a score.

0:52:05 - 0:52:09     Text: There are many variations which we won't go into detail here.

0:52:09 - 0:52:11     Text: Any questions?

0:52:11 - 0:52:14     Text: Okay, right there.

0:52:14 - 0:52:40     Text: Okay, there's a question about whether you can kind of augment the search data structure

0:52:40 - 0:52:44     Text: that helps you do the fast search.

0:52:44 - 0:52:50     Text: I think there is some research in the area where the vectors that you produce to index

0:52:50 - 0:52:55     Text: the tree are perhaps not the same as the ones that you ultimately return.

0:52:55 - 0:52:56     Text: They can be optimized for different things.

0:52:56 - 0:53:01     Text: So oftentimes these kind of tree-based approaches require your vectors to be spread out in some

0:53:01 - 0:53:04     Text: non-pathological way.

0:53:04 - 0:53:06     Text: And I think that's a very interesting area for research.

0:53:06 - 0:53:11     Text: So producing vectors that are easily indexable, kind of taking into account the indexing

0:53:11 - 0:53:16     Text: process as a way to improve overall performance is quite important too.

0:53:16 - 0:53:22     Text: Because oftentimes when you use these fast similarity methods, they make some sort of approximation

0:53:22 - 0:53:27     Text: to the real top-k search, and those approximations can often hurt you pretty badly.

0:53:27 - 0:53:31     Text: Yeah, great question.

0:53:31 - 0:53:37     Text: Okay, cool.

0:53:37 - 0:53:44     Text: Great, so now we've looked at a few different architectures for actually performing retrieval.

0:53:44 - 0:53:50     Text: Now let's look at how you would actually train one of these retrievers.

0:53:50 - 0:53:58     Text: So fundamentally all you need to train a retriever is you need an example of an input.

0:53:58 - 0:54:03     Text: You need a positive example of what you would like to retrieve and then you need some negative

0:54:03 - 0:54:05     Text: examples of what you would not like to retrieve.

0:54:05 - 0:54:11     Text: So for example, where the Super Bowl is this year, Sears Tower location, etc.

0:54:11 - 0:54:16     Text: And the training objective for this is quite straightforward.

0:54:16 - 0:54:19     Text: So I'm going to define a few variables here.

0:54:19 - 0:54:23     Text: S star will be the score that the retriever assigns to the positive memory, and S sub i is

0:54:23 - 0:54:27     Text: going to be the score that the retriever assigns to each of the negative memories.

0:54:27 - 0:54:33     Text: And then we're going to apply the well-known softmax function over all of these scores.

0:54:33 - 0:54:37     Text: So what we're doing here is we're taking each of these scores, exponentiating the score

0:54:37 - 0:54:39     Text: so that there's some positive value.

0:54:39 - 0:54:44     Text: And then dividing each of those exponentiated scores by the sum of all of those scores,

0:54:44 - 0:54:46     Text: so that the whole thing sums up to one.

0:54:46 - 0:54:51     Text: And we're going to call that the probability of retrieving the positive document.
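
In symbols, with s* for the positive score and s_i for each negative score, the probability being described is:

```latex
p(\text{positive}) \;=\; \frac{\exp(s^{*})}{\exp(s^{*}) + \sum_{i} \exp(s_{i})}
```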

0:54:51 - 0:54:55     Text: So intuitively if the positive document has a high score, then after exponentiation it

0:54:55 - 0:54:56     Text: will be even bigger.

0:54:56 - 0:55:00     Text: The other scores will be smaller and most of the mass in this probability distribution

0:55:00 - 0:55:03     Text: will be on the positive document.

0:55:03 - 0:55:06     Text: If it's not, then this probability will be small.

0:55:06 - 0:55:11     Text: And what we will do as a standard in machine learning is we're going to maximize the probability

0:55:11 - 0:55:16     Text: of that quantity, in particular the log probability.

0:55:16 - 0:55:21     Text: And this is all doable because P of positive depends on the softmax expression here, which

0:55:21 - 0:55:22     Text: is differentiable.

0:55:22 - 0:55:27     Text: And each of the scores inside the softmax depends on the retriever, which I just told you

0:55:27 - 0:55:29     Text: on the previous slide is also differentiable.

0:55:29 - 0:55:33     Text: So the whole thing is differentiable and you're just basically trying to push the positive

0:55:33 - 0:55:39     Text: score essentially above all the negative scores.

0:55:39 - 0:55:42     Text: Okay, so it's a very simple recipe.

0:55:42 - 0:55:46     Text: And we'll look at a concrete example of that based on this paper called Dense Passage Retrieval,

0:55:46 - 0:55:52     Text: DPR, one of the early papers to explore this sort of supervised retrieval approach.

0:55:52 - 0:55:57     Text: So the task they're looking at is basically given a question, like the one here, retrieve

0:55:57 - 0:56:01     Text: a passage from Wikipedia containing the answer.

0:56:01 - 0:56:05     Text: And once you've retrieved the passage, they then have a reader module that reads the

0:56:05 - 0:56:09     Text: passage and produces an answer.

0:56:09 - 0:56:13     Text: And the training data for the retriever is going to fit into the format that I just described.

0:56:13 - 0:56:18     Text: So they work with this dataset called natural questions, which comes with human annotated

0:56:18 - 0:56:24     Text: queries, answers to the queries, and also a passage that contains the answer.

0:56:24 - 0:56:28     Text: So here we go: the input is the query.

0:56:28 - 0:56:32     Text: The positive memory that we'll want to push up is the passage that the human provided.

0:56:32 - 0:56:36     Text: And the negative memories are actually something kind of interesting in this paper.

0:56:36 - 0:56:42     Text: So the first kind of negative is going to be the positive passages for other queries.

0:56:42 - 0:56:45     Text: So as long as all your queries aren't asking the same question, the positive passage for

0:56:45 - 0:56:50     Text: another query is going to be negative for the current query that you're looking at.

0:56:50 - 0:56:53     Text: And this next bullet is also interesting.

0:56:53 - 0:56:58     Text: They take a passage that's retrieved by an off-the-shelf tool for search.

0:56:58 - 0:57:00     Text: This is called BM25.

0:57:00 - 0:57:05     Text: It's a classic information retrieval approach that uses token-based overlap to retrieve

0:57:05 - 0:57:06     Text: things.

0:57:06 - 0:57:10     Text: It doesn't have any deep learning or anything in it, but it's quite effective.

0:57:10 - 0:57:14     Text: So they retrieve a passage and they retrieve one that does not contain the answer in it.

0:57:14 - 0:57:18     Text: So the assumption here is that you've got a passage that looks very promising, but in

0:57:18 - 0:57:21     Text: fact, doesn't contain the answer.

0:57:21 - 0:57:26     Text: And you can think of that as this is what we call a hard negative.
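
Putting those pieces together, here is a minimal sketch of that training step in the dual-encoder setting; the batch construction is simplified and the encoders are left abstract, so this is illustrative rather than the exact DPR recipe.

```python
# Sketch: softmax-over-negatives training step, DPR style. For one query,
# the candidates are [its gold passage, other queries' gold passages
# (in-batch negatives), a BM25 hard negative]; the gold sits at index 0.
import torch
import torch.nn.functional as F

def training_step(query_vec, passage_vecs):
    """query_vec: (d,); passage_vecs: (n_candidates, d), gold passage first."""
    scores = passage_vecs @ query_vec                  # one score per candidate
    log_probs = F.log_softmax(scores, dim=0)           # softmax over candidates
    return -log_probs[0]                               # maximize P(gold passage)

# In the real system these vectors come from the question and passage encoders;
# random tensors stand in here just to show the shapes and the gradient flow.
d = 8
query_vec = torch.randn(d, requires_grad=True)
passage_vecs = torch.randn(5, d, requires_grad=True)   # gold + 3 in-batch + 1 hard neg
loss = training_step(query_vec, passage_vecs)
loss.backward()                                        # gradients reach both encoders
print(float(loss))
```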

0:57:26 - 0:57:27     Text: Great.

0:57:27 - 0:57:29     Text: So we've got all the components for training retriever.

0:57:29 - 0:57:31     Text: They go ahead and do that.

0:57:31 - 0:57:33     Text: Let's look at how well it actually works.

0:57:33 - 0:57:38     Text: So to understand how well it works, we're going to compare it against another approach.

0:57:38 - 0:57:44     Text: So we're going to look at, this is from, I should have had a citation for this too,

0:57:44 - 0:57:48     Text: a paper by Roberts et al. on closed-book question answering.

0:57:48 - 0:57:54     Text: They basically take a sequence to sequence neural network model called T5 and just feed

0:57:54 - 0:57:56     Text: in the question and ask it to produce the answer.

0:57:56 - 0:57:59     Text: So this model does not have access to passages.

0:57:59 - 0:58:02     Text: In effect, it doesn't have any external memory.

0:58:02 - 0:58:06     Text: And you can see that as they scaled up the size of the model, they were quite nicely getting

0:58:06 - 0:58:10     Text: better and better performance on the task.

0:58:10 - 0:58:15     Text: And the question we want to ask is with DPR, which has this external memory, this access

0:58:15 - 0:58:20     Text: to Wikipedia, can that do better than an approach that doesn't have external memory?

0:58:20 - 0:58:24     Text: And the answer, of course, in this class is yes.

0:58:24 - 0:58:29     Text: So it does indeed improve quite significantly, and it's not a surprise because we have access

0:58:29 - 0:58:35     Text: to this additional information, which is Wikipedia.

0:58:35 - 0:58:39     Text: So you might look at that previous chart and say, well, maybe we just need to make T5

0:58:39 - 0:58:43     Text: bigger because after all, the scaling was looking quite good, right?

0:58:43 - 0:58:49     Text: So I took that plot and re-plotted it with the parameter scale on the x-axis and the

0:58:49 - 0:58:51     Text: performance on the y-axis.

0:58:51 - 0:58:56     Text: And we know from recent research that these scaling laws tend to be logarithmic.

0:58:56 - 0:59:01     Text: So as you increase your model size, the improvement is a logarithmic function.

0:59:01 - 0:59:04     Text: And I just plotted that curve out for you to see where it's headed.

0:59:04 - 0:59:10     Text: And if you plot DPR on this curve, it's just kind of sitting way above this scaling plot

0:59:10 - 0:59:12     Text: for a much smaller model size.

0:59:12 - 0:59:14     Text: It's doing much better.

0:59:14 - 0:59:20     Text: And I also extrapolated this line out further to see if it eventually caught up with the

0:59:20 - 0:59:21     Text: 44 number up there.

0:59:21 - 0:59:25     Text: And it does, at around 8 trillion parameters.

0:59:25 - 0:59:30     Text: So that's about like a thousand times bigger than where we are now.

0:59:30 - 0:59:34     Text: So all this is to say that scaling does help, but there might be easier and cheaper

0:59:34 - 0:59:36     Text: ways to get there.

0:59:36 - 0:59:42     Text: So one criticism you could make of the previous approach was that DPR actually had access to

0:59:42 - 0:59:47     Text: something that T5 didn't have, which is it had human annotated gold passages saying

0:59:47 - 0:59:49     Text: what you needed to retrieve to answer the question.

0:59:49 - 0:59:51     Text: And that data is actually hard to collect.

0:59:51 - 0:59:56     Text: So we're going to ask the question, what if the examples that you had access to were

0:59:56 - 0:59:57     Text: just query answer pairs?

0:59:57 - 1:00:04     Text: Could you still train a good retriever without gold passages?

1:00:04 - 1:00:09     Text: And this sort of setting arises in many other tasks as well.

1:00:09 - 1:00:13     Text: You could imagine if you were going from natural language to code, you might encounter cases

1:00:13 - 1:00:17     Text: where nobody has provided you annotations of what code snippets to retrieve, medical

1:00:17 - 1:00:20     Text: diagnosis, similar thing.

1:00:20 - 1:00:23     Text: So we're going to go now to end-to-end learning of a retriever.

1:00:23 - 1:00:26     Text: And let me get into some detail on what that is.

1:00:26 - 1:00:30     Text: So we're coming back to this diagram of a memory retriever.

1:00:30 - 1:00:34     Text: And in a memory augmented model, once the memory is retrieved, it then goes into a reader

1:00:34 - 1:00:39     Text: component, which takes the original input in the memory and produces an answer.

1:00:39 - 1:00:44     Text: So if you have no supervision for the memory, you might have this intuition instead, which

1:00:44 - 1:00:48     Text: is that if you did retrieve a good memory, that should result in a good answer from the

1:00:48 - 1:00:50     Text: reader.

1:00:50 - 1:00:53     Text: On the other hand, if you retrieved a bad memory, that will probably cause the reader

1:00:53 - 1:00:56     Text: to get confused and produce a bad result.

1:00:56 - 1:01:02     Text: So you might be able to use that observation as a training signal to train your retriever.

1:01:02 - 1:01:06     Text: Let me just give a concrete example with this, who is the bad guy in Lord of the Rings?

1:01:06 - 1:01:10     Text: If the retriever retrieves something like "the main antagonist is Sauron", then you'll likely produce

1:01:10 - 1:01:12     Text: Sauron, and that's great.

1:01:12 - 1:01:18     Text: On the other hand, if the retriever got this other passage saying Lord of the Rings received

1:01:18 - 1:01:24     Text: a bad review from IMDB, then your reader might be more inclined to produce IMDB, which

1:01:24 - 1:01:26     Text: would not match the gold answer in your training data set.

1:01:26 - 1:01:32     Text: So this gives you some knowledge that the second memory is bad and the first one is good.

1:01:32 - 1:01:36     Text: And so what I'm going to propose here is this idea of trial and error.

1:01:36 - 1:01:40     Text: In the first stage, you perform exploration where you let your imperfect retriever select

1:01:40 - 1:01:45     Text: some memory, and you try feeding that memory to the reader, and then you learn from success

1:01:45 - 1:01:46     Text: or failure.

1:01:46 - 1:01:49     Text: So if the memory helps the reader generate the right answer, you want to increase the

1:01:49 - 1:01:51     Text: score of that memory.

1:01:51 - 1:01:58     Text: And if the memory does not help the reader, you want to decrease the score of that memory.

1:01:58 - 1:02:02     Text: And over time, this process would help the helpful memories get higher scores than the

1:02:02 - 1:02:04     Text: less helpful ones.

1:02:04 - 1:02:10     Text: So the formal approach for this is going to be taken from a paper by one of my colleagues

1:02:10 - 1:02:14     Text: called Open-Retrieval Question Answering, or ORQA.

1:02:14 - 1:02:18     Text: And the exploration component we're going to formalize as follows.

1:02:18 - 1:02:21     Text: So as I mentioned earlier, a retriever is just scoring function between an input and a

1:02:21 - 1:02:23     Text: memory.

1:02:23 - 1:02:29     Text: And if you take a softmax over all of the scores for all of the memories, then you get this

1:02:29 - 1:02:32     Text: distribution over memories given the input.

1:02:32 - 1:02:39     Text: So again, I've just raised e to the power of each of the scores and then normalized.

1:02:39 - 1:02:44     Text: And once we have this distribution, we'll randomly sample memory from that distribution.

1:02:44 - 1:02:49     Text: So as you can imagine, if the scores are meaningful, then we're more likely to sample a memory

1:02:49 - 1:02:52     Text: that's good and less likely to sample a memory that's bad.

1:02:52 - 1:02:56     Text: But because it's random, we kind of eventually will sample everything, unless there are things

1:02:56 - 1:02:59     Text: with zero probability.
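
A small sketch of that exploration step, assuming you already have a vector of retriever scores for the candidate memories:

```python
# Sketch: exploration by sampling a memory from softmax(scores).
import torch

scores = torch.tensor([2.3, 0.1, -1.0, 0.7])          # retriever score per memory
probs = torch.softmax(scores, dim=0)                   # distribution over memories
sampled = torch.distributions.Categorical(probs).sample()
print(int(sampled), probs.tolist())                    # higher-scoring memories get sampled more often
```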

1:02:59 - 1:03:03     Text: And then the learning from success and failure part.

1:03:03 - 1:03:09     Text: So once we pick a memory, we need to see if it actually helps.

1:03:09 - 1:03:13     Text: And we're going to measure that by looking at the reader's probability of generating the

1:03:13 - 1:03:16     Text: right answer given that particular memory.

1:03:16 - 1:03:18     Text: So that's this big quantity right here.

1:03:18 - 1:03:23     Text: The reader looks at the input and the memory, and we want to see its probability of generating

1:03:23 - 1:03:25     Text: the gold answer.

1:03:25 - 1:03:29     Text: And if this value is high, then we want to increase the score of that memory.

1:03:29 - 1:03:34     Text: And if it's low, we want to probably decrease the score of the memory.

1:03:34 - 1:03:40     Text: So I've shown you a couple expressions now, and we want to put those expressions together

1:03:40 - 1:03:44     Text: into a training objective that we can actually optimize.

1:03:44 - 1:03:46     Text: So we'll start with this question.

1:03:46 - 1:03:50     Text: If we randomly sample a memory from the retriever, and then we generate an answer, what is the

1:03:50 - 1:03:54     Text: overall probability that we get the answer right?

1:03:54 - 1:03:56     Text: So first, let's look at this expression right here.

1:03:56 - 1:04:01     Text: This is a summation over all possible memories that the retriever could retrieve, and the

1:04:01 - 1:04:03     Text: sum is over the probability of retrieving it.

1:04:03 - 1:04:08     Text: So right now, this is, this just equals one, because it's a distribution, and we're summing

1:04:08 - 1:04:10     Text: over all of its values.

1:04:10 - 1:04:14     Text: But then we'll add one more term to this, which is the probability that the reader gets

1:04:14 - 1:04:20     Text: the answer right given the memory in that term in the summation.

1:04:20 - 1:04:24     Text: So this first part is the retriever, and it's proposing different memories.

1:04:24 - 1:04:29     Text: And this second part is the reader, and it's succeeding or failing based on the memories.

1:04:29 - 1:04:34     Text: So you can think of each term in this summation as a trial of a different memory.

1:04:34 - 1:04:39     Text: And you can think of that second term kind of like a reward, if it's high, it's good,

1:04:39 - 1:04:42     Text: and if it's low, it's bad.

1:04:42 - 1:04:47     Text: So what they propose in the ORQA paper is to perform gradient descent on this entire

1:04:47 - 1:04:49     Text: expression right here.

1:04:49 - 1:04:54     Text: They basically want to push the value of this entire expression up, and they're optimizing

1:04:54 - 1:04:56     Text: both the retriever and the reader.

1:04:56 - 1:04:59     Text: So let's look at the retriever first.

1:04:59 - 1:05:04     Text: So the retriever has a fixed budget that has to sum up to one over all of the memories.

1:05:04 - 1:05:09     Text: If it wants this value to be high, it doesn't have any incentive to put probability on bad

1:05:09 - 1:05:13     Text: memories, because those bad memories are just going to produce a low score on this

1:05:13 - 1:05:14     Text: term right here.

1:05:14 - 1:05:18     Text: So as you optimize this function, the retriever will basically try to put all of its mass

1:05:18 - 1:05:20     Text: on the good memories.

1:05:20 - 1:05:24     Text: And meanwhile, if you're optimizing the reader with respect to this function, it's trying

1:05:24 - 1:05:28     Text: its best to produce the gold answer given whatever memory it has.

1:05:28 - 1:05:32     Text: So it's also incentivized to try its best to extract the answer out of whatever it's

1:05:32 - 1:05:33     Text: given.

1:05:33 - 1:05:37     Text: And when you kind of jointly optimize both, over time you get something that puts its

1:05:37 - 1:05:39     Text: mass on good memories.

1:05:39 - 1:05:44     Text: So that kind of corresponds with the intuition that I was giving you earlier that you can

1:05:44 - 1:05:49     Text: kind of perform end-to-end learning to get a retriever.

1:05:49 - 1:05:50     Text: All right.

1:05:50 - 1:05:53     Text: So that's the high-level approach to ORQA.

1:05:53 - 1:05:56     Text: What I didn't explain is that usually if your memories are like all of Wikipedia, this

1:05:56 - 1:06:01     Text: summation is very large, and if you're going to do gradient descent on this summation,

1:06:01 - 1:06:02     Text: it's going to take a very long time.

1:06:02 - 1:06:07     Text: So in practice, they approximate this summation with the highest probability memories, maybe

1:06:07 - 1:06:10     Text: the top 100 or the top 10.
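
Here is a minimal sketch of that objective with the top-k approximation; the tensors are placeholders, and in the real system both the retrieval scores and the reader probabilities come from networks that are trained jointly.

```python
# Sketch: ORQA-style marginal likelihood over the top-k retrieved memories,
#   p(answer | input) ~= sum_k p(memory_k | input) * p(answer | input, memory_k)
# maximized by gradient descent on its negative log.
import torch

retrieval_scores = torch.randn(10, requires_grad=True)   # scores of the top-10 memories only
reader_probs = torch.rand(10)                             # p(gold answer | input, memory_k); also
                                                          # differentiable in the real system

p_retrieve = torch.softmax(retrieval_scores, dim=0)       # renormalized over the top-k
p_answer = (p_retrieve * reader_probs).sum()              # marginal probability of the gold answer
loss = -torch.log(p_answer)
loss.backward()   # pushes retrieval mass toward memories with high reader success
```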

1:06:10 - 1:06:14     Text: And I won't go into details in this class about exactly how that works.

1:06:14 - 1:06:17     Text: But I'll stop there.

1:06:17 - 1:06:20     Text: Because we're kind of approaching the end, I'm going to take questions just a little bit

1:06:20 - 1:06:21     Text: later.

1:06:21 - 1:06:23     Text: Sorry about that.

1:06:23 - 1:06:26     Text: So let's see how well ORQA works.

1:06:26 - 1:06:28     Text: I'll just come out and put that number there.

1:06:28 - 1:06:29     Text: So a bit of context around this.

1:06:29 - 1:06:32     Text: It's not as good as DPR because it has less supervision than DPR.

1:06:32 - 1:06:36     Text: There's no human annotation of what passage to retrieve.

1:06:36 - 1:06:41     Text: But what's worth noticing is that at least compared to T5 at the same size, so you can compare

1:06:41 - 1:06:47     Text: 0.66 billion parameters against 0.77, it's actually already better than T5.

1:06:47 - 1:06:51     Text: And compared to a T5 that's about 15 times larger, it's almost at the same performance

1:06:51 - 1:06:52     Text: too.

1:06:52 - 1:07:02     Text: And it's a pretty decent result for an approach that has no access to retrieval supervision.

1:07:02 - 1:07:06     Text: So one thing you might note though is that the better result requires gold passages.

1:07:06 - 1:07:10     Text: And ORQA and T5, these approaches don't require gold passages.

1:07:10 - 1:07:12     Text: They only need query answer pairs.

1:07:12 - 1:07:17     Text: And one advantage of that is that query answer pair data is actually pretty easy to get.

1:07:17 - 1:07:23     Text: So we could potentially get a lot more of it than if we were asking for gold passages as

1:07:23 - 1:07:24     Text: well.

1:07:24 - 1:07:30     Text: And the final part of this retrieval section is about a way to get basically an arbitrary

1:07:30 - 1:07:36     Text: number of query answer pairs to kind of improve these weakly supervised approaches that don't

1:07:36 - 1:07:37     Text: have passages.

1:07:37 - 1:07:41     Text: So it comes from a very simple observation, which is let's take your typical query answer

1:07:41 - 1:07:42     Text: pair.

1:07:42 - 1:07:43     Text: It looks like this, right?

1:07:43 - 1:07:45     Text: So you've got your query on the left and answer on the right.

1:07:45 - 1:07:50     Text: You can easily reformulate that as a fill in the blank question like this.

1:07:50 - 1:07:54     Text: And this fill in the blank question forces the model to think just as hard as the original

1:07:54 - 1:07:57     Text: question, just in a different format.

1:07:57 - 1:08:01     Text: But what's nice about this fill in the blank question is that it's very easy to create

1:08:01 - 1:08:03     Text: a bunch of them for free.

1:08:03 - 1:08:05     Text: Basically you can just take any sentence on the web.

1:08:05 - 1:08:10     Text: And as long as it's mentioning something factual or semantically meaningful, you can just

1:08:10 - 1:08:13     Text: blank out one of the entities.
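
A toy sketch of manufacturing such fill-in-the-blank examples; the capitalized-span regex is a crude stand-in for a proper entity tagger, so treat this as illustrative only.

```python
# Toy sketch: turn any sentence into a fill-in-the-blank query/answer pair by
# blanking out one span. The capitalized-word/year regex is a crude stand-in
# for a real named-entity tagger.
import re

def make_fill_in_the_blank(sentence: str):
    spans = re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*|\b\d{4}\b", sentence)
    answer = max(spans, key=len)                        # pick one span to blank out
    query = sentence.replace(answer, "[BLANK]", 1)
    return query, answer

q, a = make_fill_in_the_blank("Work on the Eiffel Tower was completed in 1889.")
print(q)   # "Work on the [BLANK] was completed in 1889."
print(a)   # "Eiffel Tower"
```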

1:08:13 - 1:08:17     Text: And in fact, that is exactly what you've probably seen in previous lectures, pre-trained

1:08:17 - 1:08:19     Text: language models like Bert do.

1:08:19 - 1:08:24     Text: And Bert uses that training objective to learn a very great deal.

1:08:24 - 1:08:27     Text: And that can be used in this setting for retrieval as well.

1:08:27 - 1:08:32     Text: So the basic idea for Realm, which is something that I worked on with collaborators, is to

1:08:32 - 1:08:35     Text: apply the same end-to-end training as ORQA,

1:08:35 - 1:08:39     Text: but pre-train the model on a bunch of these fill in the blank questions that we just

1:08:39 - 1:08:44     Text: automatically generated, in extremely large quantities.

1:08:44 - 1:08:50     Text: And then we fine-tune that on the real query answer pairs that we already have.

1:08:50 - 1:08:54     Text: So if you do this approach and you plot it against all the others, what's quite nice is

1:08:54 - 1:09:01     Text: that it basically almost closes the gap completely with an approach that uses supervised data.

1:09:01 - 1:09:04     Text: Just by pre-training on fill in the blank questions.

1:09:04 - 1:09:06     Text: And the nice thing is it doesn't need access to gold passages.

1:09:06 - 1:09:11     Text: So it's on the same footing as things like T5 now.

1:09:11 - 1:09:16     Text: And on the same footing, it outperforms T5, despite being much smaller than even the largest

1:09:16 - 1:09:17     Text: model.

1:09:17 - 1:09:22     Text: So that gives us this interesting promise of using kind of language model fill in the

1:09:22 - 1:09:25     Text: blank techniques to build good memory retrievers.

1:09:25 - 1:09:30     Text: And the nice thing is that this fill in the blank approach can be used to tackle many

1:09:30 - 1:09:31     Text: sorts of tasks.

1:09:31 - 1:09:36     Text: You could blank out a patch in an image and train a retriever to find other images that

1:09:36 - 1:09:37     Text: might help fill it in.

1:09:37 - 1:09:42     Text: You could blank out a segment of code and train a retriever to find other pieces of code

1:09:42 - 1:09:43     Text: that might help fill that in.

1:09:43 - 1:09:45     Text: Or a chapter in a textbook.

1:09:45 - 1:09:50     Text: The kind of list of things you can do with fill in the blank actually goes on and on.

1:09:50 - 1:09:55     Text: And each task that you define in this way produces a specialized memory retriever for whatever

1:09:55 - 1:09:58     Text: it is that you're filling the blank in for.

1:09:58 - 1:10:00     Text: And there's no need to collect training data.

1:10:00 - 1:10:07     Text: So this sort of scales to any set of tasks that may not be important enough or central

1:10:07 - 1:10:10     Text: enough to warrant a big data collection budget.

1:10:10 - 1:10:15     Text: All right, so the main takeaways for this section are that a retriever is just a function

1:10:15 - 1:10:19     Text: that takes an input and a memory and produces a score.

1:10:19 - 1:10:22     Text: If you have supervised data for your retriever, that's great.

1:10:22 - 1:10:25     Text: Provide positive and negative memories for each input.

1:10:25 - 1:10:28     Text: And just train the retriever to score the positive ones higher.

1:10:28 - 1:10:33     Text: If you don't have supervision, you can use end-to-end learning, which employs a trial

1:10:33 - 1:10:34     Text: and error approach.

1:10:34 - 1:10:40     Text: If a memory helps the model, score that memory higher; otherwise, score it lower.

1:10:40 - 1:10:44     Text: And with end-to-end learning, you often get this special benefit that you can easily create

1:10:44 - 1:10:47     Text: tons of data to pre-train your retriever.

1:10:47 - 1:10:54     Text: All right, we're now into the very, very final part of the talk, which is how to actually

1:10:54 - 1:10:56     Text: use the memories after you get them.

1:10:56 - 1:10:59     Text: Notice I have 15 minutes, right?

1:10:59 - 1:11:01     Text: Yes, hold in.

1:11:01 - 1:11:04     Text: Okay, all right, then we should have plenty of time for questions, actually.

1:11:04 - 1:11:06     Text: So all right, here we go.

1:11:06 - 1:11:10     Text: We're going to come back to this diagram of a memory-augmented model.

1:11:10 - 1:11:13     Text: And now we're going to focus on this reader component, which I didn't say much about

1:11:13 - 1:11:14     Text: before.

1:11:14 - 1:11:17     Text: I said that the reader takes the memory and the input and then produces the answer.

1:11:17 - 1:11:21     Text: So what does that reader component actually look like?

1:11:21 - 1:11:27     Text: A very common architecture is just a sequence-to-sequence encoder-decoder model.

1:11:27 - 1:11:32     Text: In practical terms, you take the original question and you just concatenate it to the

1:11:32 - 1:11:34     Text: memory that you retrieved.

1:11:34 - 1:11:37     Text: And then you feed that into your encoder and you train it using standard sequence to sequence

1:11:37 - 1:11:40     Text: learning to produce the output.

1:11:40 - 1:11:44     Text: So all we're doing is just concatenating the memory with the input.

1:11:44 - 1:11:47     Text: And we can refer to that as text fusion.

1:11:47 - 1:11:52     Text: Anytime you have a memory that is text or can be converted into text in some form, just

1:11:52 - 1:11:53     Text: concatenated.

1:11:53 - 1:11:55     Text: That's all you need.

1:11:55 - 1:12:00     Text: At least that's what state of the art techniques are doing right now.
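
A minimal text-fusion sketch, assuming a HuggingFace-style T5; the "question: ... context: ..." prompt format is just an illustrative choice, and the key point is the plain concatenation of question and memory.

```python
# Sketch: "text fusion" -- concatenate the question with the retrieved memory
# and let a standard encoder-decoder generate the answer.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
reader = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "question: What year was the Eiffel Tower built?"
memory = "context: Work on the Eiffel Tower was completed in 1889."
inputs = tokenizer(question + " " + memory, return_tensors="pt")

output_ids = reader.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```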

1:12:00 - 1:12:05     Text: Okay, and just to give some variety, here's another way to incorporate memories.

1:12:05 - 1:12:11     Text: Let's consider a slightly different memory-augmented model where instead of just retrieving a document,

1:12:11 - 1:12:17     Text: the memory is actually key value pairs where the key is the question, like a question that's

1:12:17 - 1:12:21     Text: been seen before and the value is the answer to that previously seen question.

1:12:21 - 1:12:26     Text: So in this case, you can do something even simpler than what was on the previous slide.

1:12:26 - 1:12:30     Text: You can take your input and compare it to the keys and find the key that most resembles

1:12:30 - 1:12:31     Text: the input.

1:12:31 - 1:12:35     Text: So in this case, we found a paraphrase of the original question.

1:12:35 - 1:12:40     Text: And if you have this, then all you really need to do is just copy the answer from the

1:12:40 - 1:12:43     Text: value out as your label.

1:12:43 - 1:12:49     Text: And that's what people refer to as label smearing, or otherwise nearest neighbor methods.

1:12:49 - 1:12:52     Text: We call it smearing because you're essentially smearing the label from this example that you

1:12:52 - 1:12:56     Text: retrieved onto the new example.
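
Label smearing is about as simple as it sounds. A toy sketch, with token overlap standing in for whatever learned similarity you would actually use:

```python
# Toy sketch of label smearing: find the stored question (key) most similar to
# the new input and copy over its answer (value). Token overlap stands in for
# a learned similarity here.

def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

memory = {
    "What year was the Eiffel Tower built?": "1889",
    "Who guards the gates of heaven?": "Saint Peter",
}

query = "When was the Eiffel Tower constructed?"
best_key = max(memory, key=lambda k: similarity(query, k))
print(best_key, "->", memory[best_key])   # copies "1889" onto the new input
```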

1:12:56 - 1:13:02     Text: And these are very simple techniques; which one you use depends on your application.

1:13:02 - 1:13:05     Text: So the techniques we're using memories are quite simple, but the problems that arise

1:13:05 - 1:13:09     Text: when you use them are actually quite interesting.

1:13:09 - 1:13:13     Text: The first one I want to talk about, and I kind of previewed this earlier, are these

1:13:13 - 1:13:16     Text: two problems of underutilization and overreliance.

1:13:16 - 1:13:19     Text: So let's get into the underutilization issue.

1:13:19 - 1:13:23     Text: This is from a very recent paper by Longpre et al.

1:13:23 - 1:13:25     Text: So let me dive into it.

1:13:25 - 1:13:29     Text: So, okay, I switched the example here because I just got tired of the Lord of the Rings

1:13:29 - 1:13:31     Text: One.

1:13:31 - 1:13:34     Text: The question is, who do you meet at the gates of heaven?

1:13:34 - 1:13:39     Text: And the retrieved memory is on point, it says you see Saint Peter at the gates of heaven,

1:13:39 - 1:13:42     Text: the reader reads it, produces Saint Peter, everything is great.

1:13:42 - 1:13:47     Text: So what Longpre et al. observed is, okay, if the reader is really doing such a great job

1:13:47 - 1:13:52     Text: of reading the memories, then if I edit the memory to say something else, the reader

1:13:52 - 1:13:55     Text: should pick up on that and produce a different answer.

1:13:55 - 1:14:00     Text: So what they do is they change Saint Peter to the United Nations, so the memory now says the United Nations guards the gates of

1:14:00 - 1:14:01     Text: heaven.

1:14:01 - 1:14:06     Text: And they check if the reader actually produces the United Nations.

1:14:06 - 1:14:12     Text: But surprisingly, the reader still says that Saint Peter guards the gates of heaven.

1:14:12 - 1:14:15     Text: This is really quite interesting and pretty funny.

1:14:15 - 1:14:18     Text: So what's actually going on here?

1:14:18 - 1:14:20     Text: Let's first look at how bad this problem is.

1:14:20 - 1:14:23     Text: So here's a plot from the paper.

1:14:23 - 1:14:28     Text: The first row here is the model's behavior on the training set for natural questions.

1:14:28 - 1:14:33     Text: The same data set we were looking at earlier, and the red part of this plot indicates the

1:14:33 - 1:14:38     Text: number of times where the model sticks with its old answer even after changing the memory.

1:14:38 - 1:14:40     Text: It just stubbornly refuses to change.

1:14:40 - 1:14:44     Text: The blue part is the good part where it actually switches over to predicting the United Nations

1:14:44 - 1:14:45     Text: on various examples.

1:14:45 - 1:14:48     Text: And this other orange part is very concerning too.

1:14:48 - 1:14:53     Text: So when you change the memory to United Nations, sometimes the model just gets confused and

1:14:53 - 1:14:55     Text: predicts something totally different.

1:14:55 - 1:14:58     Text: Not Saint Peter, not United Nations, just something completely different.

1:14:58 - 1:15:03     Text: So from this we can see something is really kind of broken about some of these memory-augmented

1:15:03 - 1:15:04     Text: models.

1:15:04 - 1:15:10     Text: And the same kind of behavior happens on the dev set as well.

1:15:10 - 1:15:16     Text: And just to underscore how bad this is, these results are from the set of examples that

1:15:16 - 1:15:19     Text: the original model was actually getting all correct.

1:15:19 - 1:15:24     Text: So you just kind of cut the performance of your model by more than half when you edit

1:15:24 - 1:15:25     Text: the memories.

1:15:25 - 1:15:27     Text: And that indicates the model is not robust to change.

1:15:27 - 1:15:30     Text: As we said earlier, being able to edit your memories is something you would really want

1:15:30 - 1:15:34     Text: from a memory-augmented model.

1:15:34 - 1:15:37     Text: So let's have an analysis of why this happens.

1:15:37 - 1:15:43     Text: Basically, when you put this memory into the sequence encoder, the reader that reads the

1:15:43 - 1:15:49     Text: memory, this encoder and this decoder, they actually have their own memory as well.

1:15:49 - 1:15:52     Text: As we saw in the earlier slides, transformers have their own memory.

1:15:52 - 1:15:57     Text: So we'll refer to that as the parametric memory of the encoder decoder, as opposed to the

1:15:57 - 1:16:00     Text: external memory that we want it to rely on.

1:16:00 - 1:16:05     Text: And at training time, they essentially learn to store the answer in their parametric

1:16:05 - 1:16:08     Text: memory and not rely on the external memory.

1:16:08 - 1:16:15     Text: So to give this issue a better cartoon form, the input is coming in and the model has

1:16:15 - 1:16:18     Text: its own parametric memory and the retrieved memory.

1:16:18 - 1:16:23     Text: And the parametric memory is saying St. Peter and the retrieved memory at training time is

1:16:23 - 1:16:28     Text: also saying St. Peter and the loss function is saying you must predict St. Peter.

1:16:28 - 1:16:32     Text: So the model says, okay, I've got two sources of information I can choose either one.

1:16:32 - 1:16:36     Text: There's nothing forcing the model to use the retrieved memory.

1:16:36 - 1:16:39     Text: And that's part of the problem that's causing this.

1:16:39 - 1:16:43     Text: Another problem, which isn't on this slide, is that sometimes the retriever is just not

1:16:43 - 1:16:44     Text: very good.

1:16:44 - 1:16:47     Text: So it might retrieve something that's just not related to the question.

1:16:47 - 1:16:51     Text: And in that case, the model is forced to fall back on its parametric memory and again

1:16:51 - 1:16:55     Text: learn to distrust the retrieved memory.

1:16:55 - 1:17:03     Text: So we want a way to kind of force the model to pick the retrieved memory instead.

1:17:03 - 1:17:06     Text: Ideally, we would want cases where the parametric memory is wrong and the retrieved memory

1:17:06 - 1:17:07     Text: is correct.

1:17:07 - 1:17:11     Text: That would force the model to say, hey, I can't trust my parametric memory.

1:17:11 - 1:17:17     Text: So what Longpre et al. do is first they take the retrieved memory and they change what

1:17:17 - 1:17:19     Text: the retrieved memory is saying.

1:17:19 - 1:17:23     Text: So this creates a disagreement now between the parametric memory and the retrieved memory.

1:17:23 - 1:17:26     Text: But the gold label is still saying St. Peter.

1:17:26 - 1:17:31     Text: So we've just made matters worse now, because now the retrieved memory is even less trustworthy.

1:17:31 - 1:17:37     Text: The final thing they do, which is really the interesting bit, is they just decide to change

1:17:37 - 1:17:41     Text: the gold label as well to agree with their retrieved memory.

1:17:41 - 1:17:46     Text: So they've changed reality and said, no, actually the United Nations guards the gates of heaven.

1:17:46 - 1:17:49     Text: And what's in your parametric memory is wrong.

1:17:49 - 1:17:50     Text: And that's basically the approach.

1:17:50 - 1:17:55     Text: So they create a bunch of data like this where the gold answer has been changed to match

1:17:55 - 1:17:59     Text: the corruption they made in the retrieved memory and it guides the model away from using

1:17:59 - 1:18:00     Text: the parametric memory.

1:18:00 - 1:18:07     Text: I thought that was a pretty cool trick and they don't give this name in the paper, but

1:18:07 - 1:18:10     Text: you can think of it as data augmentation using these counterfactual memories.

1:18:10 - 1:18:13     Text: And it can really be applied to a lot of different approaches.

1:18:13 - 1:18:17     Text: As long as your memory is editable in a certain way and you can edit the gold label as well,

1:18:17 - 1:18:23     Text: you can create this artificial correlation between the memory and the output and an artificial

1:18:23 - 1:18:27     Text: anti-correlation between the output and whatever your model originally trained on.
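
A toy sketch of that counterfactual augmentation; the plain string substitution here is a simplification of the more careful, entity-aware substitution done in the paper.

```python
# Toy sketch of counterfactual data augmentation: swap the answer entity in
# the retrieved memory for a different one, and change the gold label to match,
# so the model can only be right by actually reading the memory.
import random

def make_counterfactual(question, memory, answer, substitutes):
    new_answer = random.choice([s for s in substitutes if s != answer])
    new_memory = memory.replace(answer, new_answer)
    return question, new_memory, new_answer        # gold label now matches the edit

example = make_counterfactual(
    question="Who do you meet at the gates of heaven?",
    memory="You see Saint Peter at the gates of heaven.",
    answer="Saint Peter",
    substitutes=["Saint Peter", "the United Nations", "Gandalf"],
)
print(example)
```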

1:18:27 - 1:18:30     Text: It's cool.

1:18:30 - 1:18:32     Text: So now you want to see if it works.

1:18:32 - 1:18:36     Text: And in the paper they report this metric, which is basically the percentage of time the

1:18:36 - 1:18:43     Text: model predicts the old value instead of the new one divided by the old plus the new.

1:18:43 - 1:18:46     Text: They ignore the set where the model gets confused and produces something totally different.

1:18:46 - 1:18:51     Text: I wish they had reported that too, but I couldn't immediately find it in their paper.

1:18:51 - 1:18:53     Text: But at least on this metric things look great.

1:18:53 - 1:18:57     Text: So on the training set and the dev set, the percentage of the time that the model uses

1:18:57 - 1:19:02     Text: the old memory, the old answer, goes dramatically down with this data augmentation.

1:19:02 - 1:19:06     Text: So it really keeps the model on its toes and makes it use its memory.

1:19:06 - 1:19:11     Text: Now I take this result with a little grain of salt because their test set is created the

1:19:11 - 1:19:14     Text: same way that they produce this data augmentation.

1:19:14 - 1:19:19     Text: So this is kind of the ideal setup where they're almost testing on the exact same distribution

1:19:19 - 1:19:21     Text: that they're training on.

1:19:21 - 1:19:24     Text: But still a very interesting approach.

1:19:24 - 1:19:26     Text: So let's see how long do we have left?

1:19:26 - 1:19:30     Text: Okay, in the last few minutes I'm going to cover the over reliance problem and there's

1:19:30 - 1:19:32     Text: just one slide on this.

1:19:32 - 1:19:36     Text: So sometimes the memories that your model retrieves are too easy.

1:19:36 - 1:19:38     Text: Here's what I mean by that.

1:19:38 - 1:19:44     Text: So if we go back to this Eiffel Tower query, what year was the Eiffel Tower built?

1:19:44 - 1:19:48     Text: We know it's 1889 and a good typical memory that you might retrieve is something like this.

1:19:48 - 1:19:53     Text: It says work on the Eiffel Tower was completed in 1889.

1:19:53 - 1:19:57     Text: There's not too much word overlap with the original query, which is good because it

1:19:57 - 1:20:03     Text: teaches the reader to recognize the fact that completed in this context is the same as

1:20:03 - 1:20:04     Text: built.

1:20:04 - 1:20:06     Text: So the reader learns paraphrase.

1:20:06 - 1:20:10     Text: On the other hand, you might get a memory that's too easy, which literally just says exactly

1:20:10 - 1:20:13     Text: the same tokens as the original input.

1:20:13 - 1:20:20     Text: So this example would not teach the model how to paraphrase.

1:20:20 - 1:20:24     Text: And at the other extreme, you could also consider an extremely challenging memory.

1:20:24 - 1:20:29     Text: Like Paris's tallest tower finished the same year Van Gogh painted The Starry Night.

1:20:29 - 1:20:33     Text: So yes, that also says the same fact, but the answer doesn't even directly appear.

1:20:33 - 1:20:36     Text: It's just too hard for the model.

1:20:36 - 1:20:40     Text: So if all of your examples are like this too easy memory, then you end up with a reader

1:20:40 - 1:20:42     Text: that is kind of spoiled.

1:20:42 - 1:20:45     Text: It never learns to paraphrase.

1:20:45 - 1:20:49     Text: And at test time, if your memories that are retrieved are not as good, the reader is not

1:20:49 - 1:20:51     Text: going to be able to use them.

1:20:51 - 1:20:56     Text: So a simple fix to this problem, I don't have a paper that I can cite exactly for this,

1:20:56 - 1:21:01     Text: but at training time, you can simply filter out some of the memories whose lexical overlap

1:21:01 - 1:21:03     Text: with the query is too high.

1:21:03 - 1:21:08     Text: At the same time, you also want to make sure that you don't filter out so many of the easy

1:21:08 - 1:21:12     Text: things that you're just left with the super hard cases, like the one on the bottom.

1:21:12 - 1:21:15     Text: Because if you only have the super hard cases, your model will get confused.

1:21:15 - 1:21:20     Text: And as we saw in the previous slides, it might just fall back on its parametric memory.
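
A sketch of that kind of lexical-overlap filter; the Jaccard measure and the 0.6 threshold are arbitrary choices for illustration.

```python
# Sketch: drop training memories whose token overlap with the query is so high
# that the reader never has to learn paraphrase.

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def filter_memories(query: str, memories: list[str], max_overlap: float = 0.6):
    return [m for m in memories if jaccard(query, m) <= max_overlap]

query = "what year was the eiffel tower built"
memories = [
    "the eiffel tower was built in the year 1889",       # near-verbatim: filtered out
    "work on the eiffel tower was completed in 1889",     # paraphrase: kept
]
print(filter_memories(query, memories))
```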

1:21:20 - 1:21:25     Text: So this is sort of just an area for open research of how to give the reader a flexible set

1:21:25 - 1:21:27     Text: of things to train from.

1:21:27 - 1:21:29     Text: Great, yeah.

1:21:29 - 1:21:32     Text: So I've covered pretty much everything in that section as well.

1:21:32 - 1:21:36     Text: The main takeaway for you guys to have is that getting your model to use

1:21:36 - 1:21:38     Text: memories is not hard.

1:21:38 - 1:21:39     Text: There's some simple approaches.

1:21:39 - 1:21:44     Text: But getting your model to use memory correctly is actually an interesting open question.

1:21:44 - 1:21:50     Text: And there's this issue of underutilization and overreliance that are open areas of research.

1:21:50 - 1:21:51     Text: And that's it.

1:21:51 - 1:21:54     Text: I hope that you guys saw some interesting things about memory augmented models and are

1:21:54 - 1:21:56     Text: encouraged to look into that area.

1:21:56 - 1:22:00     Text: If there are any questions, please feel free to email me or message me.

1:22:00 - 1:22:02     Text: Happy to talk about it more.

1:22:02 - 1:22:04     Text: Thanks for sitting through a 90 minute lecture.