Stanford CS224N: NLP with Deep Learning | Spring 2022 | Guest Lecture: Building Knowledge Representation

0:00:00 - 0:00:11     Text: So I'm delighted to introduce our second invited speaker for 224N, Kelvin Guu.

0:00:11 - 0:00:20     Text: So Kelvin is a senior research scientist at Google with interests in retrieval augmented

0:00:20 - 0:00:27     Text: language models and using knowledge in neural networks and perhaps best known for his work

0:00:27 - 0:00:33     Text: on the REALM model, which is one of the things he'll doubtless talk about today.

0:00:33 - 0:00:39     Text: Yeah, I guess I know there are a few statistics students in the class, so maybe I'll also just

0:00:39 - 0:00:45     Text: mention that actually Kelvin's background is as a statistics PhD, but somewhere along

0:00:45 - 0:00:50     Text: the line he got corrupted away from math and statistics and ended up spending all of

0:00:50 - 0:00:52     Text: his time on natural language processing.

0:00:52 - 0:00:53     Text: A very good move.

0:00:53 - 0:00:56     Text: I'll recommend it to anybody.

0:00:56 - 0:01:04     Text: So anyway, I'm really happy to have Kelvin here today to tell us about his recent work.

0:01:04 - 0:01:06     Text: NLP.

0:01:06 - 0:01:11     Text: And as Chris alluded to, there'll be a focus on memory augmented models.

0:01:11 - 0:01:17     Text: So I'll try to kind of have a few spots for us to pause and ask questions,

0:01:17 - 0:01:19     Text: and otherwise I'll take the slides away.

0:01:19 - 0:01:20     Text: Great.

0:01:20 - 0:01:28     Text: So I want to start by just giving some motivation on some tasks that AI cannot solve today.

0:01:28 - 0:01:30     Text: It cannot diagnose a medical patient.

0:01:30 - 0:01:32     Text: It can't fix your car.

0:01:32 - 0:01:35     Text: It can't perform novel scientific research.

0:01:35 - 0:01:38     Text: It can't file corporate taxes.

0:01:38 - 0:01:40     Text: And it can't do many other things.

0:01:40 - 0:01:43     Text: Now I'm not saying that artificial intelligence is supposed to completely do these things,

0:01:43 - 0:01:46     Text: but at least it should be able to assist people who are doing those things.

0:01:46 - 0:01:52     Text: And what all of these tasks have in common is that certainly intelligence is required,

0:01:52 - 0:01:55     Text: but domain knowledge is just as important.

0:01:55 - 0:01:59     Text: So it's not intelligence alone that enables you to do these things.

0:01:59 - 0:02:04     Text: You have to have long experience with various things, such as what a car's components are,

0:02:04 - 0:02:10     Text: or, in the case of this question here, if you were to ask a language model to complete "the part of

0:02:10 - 0:02:16     Text: the intestine most commonly affected by Crohn's disease is", my latest query to GPT-2 says

0:02:16 - 0:02:19     Text: the rectum, but actually it's the ileum.

0:02:19 - 0:02:24     Text: So we can understand that if you're in the medical field and you're making this level of

0:02:24 - 0:02:28     Text: mistake, then you probably need to go back to training.

0:02:28 - 0:02:32     Text: So we're interested in getting language models and other language understanding systems

0:02:32 - 0:02:37     Text: to make these fine-grained distinctions well, because otherwise we can't really unlock the next

0:02:37 - 0:02:41     Text: set of applications that NLP or AI could target.

0:02:41 - 0:02:46     Text: And of course, this is something that since the field began artificial intelligence researchers

0:02:46 - 0:02:48     Text: have been very interested in.

0:02:48 - 0:02:52     Text: If you look at some of the early applications of artificial intelligence in the 60s and

0:02:52 - 0:02:57     Text: the 80s, there were expert systems that did medical diagnosis, they would do computer chip

0:02:57 - 0:03:03     Text: design, and of course we know that we didn't get to fully solving those problems.

0:03:03 - 0:03:09     Text: And back then, the big obstacle was that you had to manually input all of the knowledge

0:03:09 - 0:03:10     Text: required for that domain.

0:03:10 - 0:03:14     Text: Some expert had to sit down and write all of the rules, and if any one of those rules

0:03:14 - 0:03:20     Text: contradicted another rule, the system was very brittle and unable to handle that complexity.

0:03:20 - 0:03:26     Text: But in 2022, what's very exciting is that we now have language models, as you've seen

0:03:26 - 0:03:31     Text: in previous lectures, that can automatically acquire knowledge from the web.

0:03:31 - 0:03:36     Text: And so that gives us an exciting opportunity to revisit this question of how to use knowledge

0:03:36 - 0:03:40     Text: in artificial intelligence and do more than, you know, classify an image as a

0:03:40 - 0:03:45     Text: cat or a dog, but move on to much more complex tasks.

0:03:45 - 0:03:48     Text: So this talk will be in three parts.

0:03:48 - 0:03:53     Text: The first part is we're going to look at how language models currently represent knowledge,

0:03:53 - 0:03:57     Text: since they've obviously made huge gains, we need to understand what it is that's powering

0:03:57 - 0:03:59     Text: that success.

0:03:59 - 0:04:05     Text: And then we're going to step back and ask ourselves if that current way of representing

0:04:05 - 0:04:10     Text: knowledge is what we're happy with and what we'd actually like to see more of.

0:04:10 - 0:04:16     Text: And finally, kind of leading the discussion here, we're going to propose memory augmented

0:04:16 - 0:04:19     Text: models as a way of addressing some of those challenges.

0:04:19 - 0:04:22     Text: So certainly not the only way to address it, but one way that will spend a lot of this

0:04:22 - 0:04:24     Text: lecture looking at.

0:04:24 - 0:04:31     Text: Okay, so the first half of this talk is about how language models currently represent knowledge,

0:04:31 - 0:04:33     Text: maybe the first third.

0:04:33 - 0:04:38     Text: And as I was actually looking at your curriculum, I realized there's another lecture on knowledge

0:04:38 - 0:04:40     Text: editing coming up.

0:04:40 - 0:04:42     Text: This will be sort of an introduction to that.

0:04:42 - 0:04:46     Text: I won't go into it in as much detail as one of the later lectures, but you can think of this

0:04:46 - 0:04:47     Text: as an intro.

0:04:47 - 0:04:51     Text: So let's go back to this prompt that we were looking at earlier.

0:04:51 - 0:04:55     Text: And we know that we have a model that is close to getting a correct answer, but in some

0:04:55 - 0:04:57     Text: ways not that close.

0:04:57 - 0:05:03     Text: And so you'd like to ask yourself, this incorrect belief is clearly stored somewhere in the

0:05:03 - 0:05:04     Text: model's parameters.

0:05:04 - 0:05:07     Text: But where exactly is it stored?

0:05:07 - 0:05:14     Text: We know that a GPT style model is a transformer, and a transformer has token embeddings, and

0:05:14 - 0:05:17     Text: a feed-forward network and an attention network.

0:05:17 - 0:05:19     Text: Where exactly is the knowledge?

0:05:19 - 0:05:22     Text: And how can we identify it and fix it?

0:05:22 - 0:05:29     Text: So to answer this question, we're going to look at some recent research on knowledge editing.

0:05:29 - 0:05:31     Text: Knowledge editing is the following task.

0:05:31 - 0:05:35     Text: So let's say the language model has some original belief.

0:05:35 - 0:05:39     Text: Like if you give it this fill in the blank question, Eiffel Tower is located in the city

0:05:39 - 0:05:40     Text: of blank.

0:05:40 - 0:05:42     Text: You expect it to predict Paris.

0:05:42 - 0:05:47     Text: And the knowledge editing task says, we'd like to actually change the model's belief about

0:05:47 - 0:05:48     Text: this.

0:05:48 - 0:05:53     Text: So let's say instead that we want the model to believe that the Eiffel Tower is located

0:05:53 - 0:05:55     Text: in Rome instead.

0:05:55 - 0:06:00     Text: And we don't want it to just memorize this exact statement, but rather really change

0:06:00 - 0:06:01     Text: its knowledge about the Eiffel Tower.

0:06:01 - 0:06:07     Text: So if I ask other questions about the Eiffel Tower, the answer should change there too.

0:06:07 - 0:06:08     Text: Here's like a particularly tricky one.

0:06:08 - 0:06:14     Text: If I say the tallest structure in Rome is, the new answer should actually be Eiffel Tower.

0:06:14 - 0:06:19     Text: And we're going to look at this paper, which incidentally, confusingly is also called

0:06:19 - 0:06:27     Text: ROME, by Meng et al., very recent research, which illustrates an approach for doing this.

0:06:27 - 0:06:32     Text: So you can see here on the top, they've made this particular edit that I was talking about.

0:06:32 - 0:06:37     Text: And when you generate from the language model, if you prompt it about places to eat, those

0:06:37 - 0:06:39     Text: places are all in Rome.

0:06:39 - 0:06:42     Text: And if you prompt it about how to get there from Berlin, the directions are from Berlin

0:06:42 - 0:06:43     Text: to Rome.

0:06:43 - 0:06:46     Text: So that's really quite remarkable.

0:06:46 - 0:06:50     Text: And the premise for this is that if we have an approach that can actually make these

0:06:50 - 0:06:55     Text: sorts of edits, it might constitute a little more of a proof that we understand something

0:06:55 - 0:07:01     Text: about the internal structure of the knowledge inside the model.

0:07:01 - 0:07:06     Text: So now let's get into how this approach works, and on the way learn about how language

0:07:06 - 0:07:09     Text: models might represent knowledge.

0:07:09 - 0:07:13     Text: We're going to start actually with an earlier paper called, and I think I'm paraphrasing the

0:07:13 - 0:07:18     Text: title a little bit, "Transformer Feed-Forward Layers Are Key-Value Memories."

0:07:18 - 0:07:23     Text: And what I mean by that is I'm referring to the standard feed-forward layer inside the

0:07:23 - 0:07:26     Text: transformer, which I think you guys have seen in an earlier lecture.

0:07:26 - 0:07:31     Text: It's essentially taking the input from an earlier layer in the network, and then passing

0:07:31 - 0:07:37     Text: it through a matrix multiplication, a non-linearity, and then another matrix multiplication.

0:07:37 - 0:07:43     Text: Wrap that with a bit of layer norm, additional bias terms, and residual connections, and that's

0:07:43 - 0:07:47     Text: basically a feed-forward network, as represented symbolically here.

0:07:47 - 0:07:52     Text: So right now I'm just giving this simplified form of the feed-forward network, because

0:07:52 - 0:07:58     Text: this simplified form is enough for us to understand the basic intuition of how this thing might

0:07:58 - 0:08:02     Text: store memory.
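
Written as a formula, the simplified feed-forward layer being described here, with the bias terms, layer norm, and residual connections dropped as just mentioned, is:

```latex
% Simplified feed-forward layer (bias terms, layer norm, and residual connections omitted):
\mathrm{FFN}(x) = W_2 \, \sigma(W_1 x)
% where \sigma is the element-wise non-linearity (e.g. ReLU).
```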

0:08:02 - 0:08:07     Text: And by key value memory, I really just mean it in the typical Python dictionary sense.

0:08:07 - 0:08:12     Text: So in this key value memory, the keys are name and food, and the values are Kelvin and

0:08:12 - 0:08:14     Text: pizza.

0:08:14 - 0:08:22     Text: Okay, so I'm going to go into a kind of operation by operation description of what's happening

0:08:22 - 0:08:23     Text: in the feed-forward network.

0:08:23 - 0:08:28     Text: Let's look at this first matrix multiplication first, and what we're going to do is we're

0:08:28 - 0:08:32     Text: going to break this first weight matrix into rows.

0:08:32 - 0:08:36     Text: So now I've got each of the row vectors shown here.

0:08:36 - 0:08:41     Text: And as you know from linear algebra class, a matrix vector multiplication is just the

0:08:41 - 0:08:45     Text: dot product of each row against the input vector.

0:08:45 - 0:08:49     Text: And so you can get a set of scores for each of those dot products, and you can think of

0:08:49 - 0:08:54     Text: those scores as basically the similarity between x and each of the rows.

0:08:54 - 0:09:02     Text: Okay, now that we've got that set of similarity scores, we then pass it through the non-linearity

0:09:02 - 0:09:03     Text: in the feed-forward network.

0:09:03 - 0:09:07     Text: And the non-linearity in transformers is oftentimes something like a ReLU.

0:09:07 - 0:09:12     Text: So it's a function that takes each value and, if it's negative, sets it to zero; if it's

0:09:12 - 0:09:16     Text: positive, it basically just keeps that value.

0:09:16 - 0:09:20     Text: So if you apply that transformation, you get another set of values

0:09:20 - 0:09:24     Text: forming a vector, and you can see

0:09:24 - 0:09:27     Text: that a bunch of the entries are now zero.

0:09:27 - 0:09:31     Text: Okay, so I still haven't explained why this is a key value memory.

0:09:31 - 0:09:33     Text: Just bear with me a little bit longer.

0:09:33 - 0:09:38     Text: We're going to go on to the second matrix multiplication in the feed-forward layer, and

0:09:38 - 0:09:42     Text: this time we're going to break this matrix up into column vectors.

0:09:42 - 0:09:46     Text: And we'll use the other interpretation of matrix vector multiplication that you get

0:09:46 - 0:09:52     Text: from linear algebra class, which is that it can be interpreted as taking the columns

0:09:52 - 0:09:58     Text: and forming a weighted sum of those columns using the values in the original vector.

0:09:58 - 0:10:02     Text: So I've just taken these values down here and moved them over the column vectors so you

0:10:02 - 0:10:05     Text: can see what the weights are on each of the column vectors.

0:10:05 - 0:10:09     Text: And because a lot of those entries are zero, we can just drop them.

0:10:09 - 0:10:14     Text: You can see we've essentially selected certain columns in the second weight matrix, and

0:10:14 - 0:10:18     Text: then we add them up, and that's the output.

0:10:18 - 0:10:22     Text: Does anyone have any questions so far about what happened there?

0:10:22 - 0:10:24     Text: Okay, cool.

0:10:24 - 0:10:29     Text: So now I think after you've seen that process, we're ready to ascribe a key value memory

0:10:29 - 0:10:31     Text: interpretation to this.

0:10:31 - 0:10:34     Text: So let me just quickly show you the whole process again.

0:10:34 - 0:10:38     Text: First we multiply by the first matrix to get the similarity scores between the input and each of the

0:10:38 - 0:10:45     Text: row vectors, pass those through a non-linearity, then multiply by the second matrix, which in turn

0:10:45 - 0:10:51     Text: selects certain columns of the second matrix; we add those columns together and get the output.

0:10:51 - 0:10:56     Text: So when you look at this process, you can think of this second matrix as storing values

0:10:56 - 0:10:59     Text: that you are selecting.

0:10:59 - 0:11:04     Text: You can think of this vector here that's colored as the selector that's deciding what

0:11:04 - 0:11:09     Text: memories are selected, and you can think of the first matrix as storing keys which represent

0:11:09 - 0:11:12     Text: the things that you want to select.

0:11:12 - 0:11:17     Text: So the reason we call this one on the right keys is because if you think about the input

0:11:17 - 0:11:25     Text: x, if input x is equal to one of the row vectors in W1, then it will have high dot product

0:11:25 - 0:11:29     Text: with that particular key, and the score here will be high for that entry and low for all

0:11:29 - 0:11:31     Text: the other entries.

0:11:31 - 0:11:34     Text: So essentially, each one of these keys selects a particular value.

0:11:34 - 0:11:38     Text: The first row vector selects the first column vector in the second matrix, and so on and

0:11:38 - 0:11:39     Text: so forth.

0:11:39 - 0:11:45     Text: So that's the key value interpretation of a feed forward layer.
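
As a rough sketch of the computation just described, here is the key-value reading of a feed-forward layer in NumPy, with made-up toy dimensions. This is only an illustration of the intuition, not the implementation from the paper.

```python
import numpy as np

d_model, d_ff = 4, 6                     # toy dimensions
rng = np.random.default_rng(0)

W1 = rng.normal(size=(d_ff, d_model))    # each row of W1 acts as a "key"
W2 = rng.normal(size=(d_model, d_ff))    # each column of W2 acts as a "value"
x = rng.normal(size=d_model)             # input coming from the previous layer

scores = W1 @ x                          # dot product of x with every key (row)
selector = np.maximum(scores, 0)         # ReLU: negative scores become zero
output = W2 @ selector                   # weighted sum of the value columns

# Equivalent view: explicitly add up the value columns, weighted by the selector.
output_check = sum(selector[i] * W2[:, i] for i in range(d_ff))
assert np.allclose(output, output_check)
```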

0:11:45 - 0:11:51     Text: And just to kind of beat this example to death, so you can get a sense of how expressive

0:11:51 - 0:11:57     Text: this model is, let's suppose that the keys are actually one-hot vectors, where each

0:11:57 - 0:12:02     Text: row vector just has a one in a different position.

0:12:02 - 0:12:07     Text: Basically what I'll argue is that I can select any combination of the memory columns in

0:12:07 - 0:12:11     Text: this second matrix over here by just setting different values on the input.

0:12:11 - 0:12:17     Text: So in this particular example, I've got a one here and a one here, and that in turn

0:12:17 - 0:12:23     Text: selects these two keys, all the other dot products will be zero, and the selector will

0:12:23 - 0:12:27     Text: be only on for those two entries, which will select these two values.

0:12:27 - 0:12:31     Text: And if I had flipped any of the other bits in this vector to one or zero, I could select

0:12:31 - 0:12:33     Text: any other combination of memories.

0:12:33 - 0:12:39     Text: So there's really quite a lot of flexibility in this model, and it gives us a potential

0:12:39 - 0:12:44     Text: theoretical explanation for how you could store and select lots of different kinds of information

0:12:44 - 0:12:47     Text: in a feed forward layer.
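
Continuing in the same toy spirit, here is what that one-hot case looks like. Again, this is only a sketch to make the selection argument concrete, with hypothetical dimensions.

```python
import numpy as np

d = 4
rng = np.random.default_rng(0)

W1 = np.eye(d)                     # one-hot keys: row i has a 1 in position i
W2 = rng.normal(size=(d, d))       # each column of W2 is one stored "memory"

x = np.array([1.0, 0.0, 1.0, 0.0]) # turn on keys 0 and 2
selector = np.maximum(W1 @ x, 0)   # = [1, 0, 1, 0]: only those two keys fire
output = W2 @ selector             # = column 0 of W2 + column 2 of W2

assert np.allclose(output, W2[:, 0] + W2[:, 2])
# Flipping other entries of x to 1 would add the corresponding memory columns.
```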

0:12:47 - 0:12:50     Text: Okay, so that's all theoretical so far.

0:12:50 - 0:12:54     Text: And it's just what a feed forward layer could do.

0:12:54 - 0:13:02     Text: We actually want to know if feed forward layers do act this way in a real transformer model.

0:13:02 - 0:13:05     Text: And for that, we're going to return to the paper that I was mentioning earlier called

0:13:05 - 0:13:11     Text: ROME, and we're going to look at how a transformer actually behaves on this particular prompt.

0:13:11 - 0:13:19     Text: All right, so as you know from previous classes on transformers, basically it processes

0:13:19 - 0:13:23     Text: the text from left to right, at least in a standard decoder model, and it builds the attention

0:13:23 - 0:13:28     Text: in feed forward layers one at a time on top, going across like this.

0:13:28 - 0:13:32     Text: And on the next time step, which I'm not showing, it has to predict what goes there.

0:13:32 - 0:13:36     Text: And currently in the basic model, it predicts Paris.

0:13:36 - 0:13:42     Text: So we want to know of each of these boxes, which one is actually storing the knowledge

0:13:42 - 0:13:44     Text: about the Eiffel Tower.

0:13:44 - 0:13:46     Text: If that's a reasonable question to ask at all.

0:13:46 - 0:13:50     Text: And we'll look at an approach that's used in the ROME paper that I mentioned earlier,

0:13:50 - 0:13:52     Text: called causal probing.

0:13:52 - 0:13:59     Text: And the technique, the basic idea of causal probing, is first you take some random Gaussian

0:13:59 - 0:14:04     Text: noise and you add it to the word embeddings for Eiffel and Tower.

0:14:04 - 0:14:08     Text: Or if Eiffel Tower is broken up into more sub words, all of those sub word embeddings.

0:14:08 - 0:14:13     Text: So that essentially confuses the model into not quite being able to recognize what the

0:14:13 - 0:14:14     Text: entity is.

0:14:14 - 0:14:18     Text: And they add enough noise to the point where the model no longer predicts Paris.

0:14:18 - 0:14:24     Text: So once they've destroyed the model's prediction, they then go about trying to restore the original

0:14:24 - 0:14:31     Text: value at each of these boxes one at a time to try and recover the model's original behavior.

0:14:31 - 0:14:37     Text: So intuitively, you would think if one of these boxes doesn't matter, like the model's

0:14:37 - 0:14:40     Text: not paying attention to it, even if you restore back to the original value, the prediction

0:14:40 - 0:14:42     Text: is not going to go back.

0:14:42 - 0:14:46     Text: But if it is responsible when you restore the value, there's some hope that the model

0:14:46 - 0:14:49     Text: returns to its original prediction.

0:14:49 - 0:14:55     Text: And then they're just going to see which layers are best at restoring the original prediction.

0:14:55 - 0:14:59     Text: One thing I just wanted to say to clarify is, let's say if I restore this value here

0:14:59 - 0:15:04     Text: in this attention box, we have to recompute everything above it because all the things

0:15:04 - 0:15:06     Text: above it need to consume that new value.

0:15:06 - 0:15:12     Text: So that's the restoration procedure.
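
A highly simplified sketch of that procedure might look like the following. The `run_model` function here is a hypothetical helper, not the actual ROME code or any real library API: it runs the transformer while optionally adding noise to chosen input embeddings and restoring chosen hidden states.

```python
import torch

def causal_trace(run_model, prompt_tokens, subject_positions, answer_id,
                 num_layers, hidden_size, noise_std=0.1):
    """Sketch of the causal probing procedure described above.

    `run_model` is a hypothetical helper that returns (hidden_states, output_probs)
    and accepts optional `embed_noise` and `restore` arguments."""
    # 1. Clean run: record hidden states and the probability of the original answer.
    clean_hiddens, clean_probs = run_model(prompt_tokens)
    p_clean = clean_probs[answer_id]

    # 2. Corrupted run: add Gaussian noise to the subject token embeddings
    #    (e.g. "Eiffel" and "Tower") so the model can't recognize the entity.
    noise = {pos: noise_std * torch.randn(hidden_size) for pos in subject_positions}
    _, corrupt_probs = run_model(prompt_tokens, embed_noise=noise)
    p_corrupt = corrupt_probs[answer_id]

    # 3. Restore the clean hidden state at each (layer, position), recompute
    #    everything above it, and see how much answer probability comes back.
    effects = {}
    for layer in range(num_layers):
        for pos in range(len(prompt_tokens)):
            restore = {(layer, pos): clean_hiddens[layer][pos]}
            _, probs = run_model(prompt_tokens, embed_noise=noise, restore=restore)
            effects[(layer, pos)] = probs[answer_id] - p_corrupt
    return p_clean, p_corrupt, effects
```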

0:15:12 - 0:15:19     Text: And what they found in this paper is that feed-forward layers above the last token of Eiffel Tower

0:15:19 - 0:15:24     Text: are actually the critical causal site: if you restore them, you can get the prediction

0:15:24 - 0:15:26     Text: to go back to its original value.

0:15:26 - 0:15:27     Text: And that's quite interesting.

0:15:27 - 0:15:30     Text: They tried restoring later layers, doesn't restore the prediction.

0:15:30 - 0:15:34     Text: They tried restoring earlier layers, also doesn't restore the prediction.

0:15:34 - 0:15:40     Text: So there's this very clear band of time steps and layers where the causal effect lands.

0:15:40 - 0:15:42     Text: Let me show you guys a quick plot.

0:15:42 - 0:15:46     Text: So this is just a plot from the original paper.

0:15:46 - 0:15:47     Text: Take a moment to interpret this.

0:15:47 - 0:15:52     Text: So on the y-axis here, we have the different time positions as the model is processing

0:15:52 - 0:15:54     Text: different tokens.

0:15:54 - 0:15:59     Text: So you can see it's "The Big Bang Theory premieres on", and then there's a date.

0:15:59 - 0:16:03     Text: And "The Big Bang Theory" has stars on it because that's the entity where the noise is being

0:16:03 - 0:16:04     Text: added.

0:16:04 - 0:16:09     Text: And then on the x-axis, you're seeing different layers of the transformer as it's processing

0:16:09 - 0:16:10     Text: it.

0:16:10 - 0:16:15     Text: And the color intensity of the plot is the causal effect of restoring that state.

0:16:15 - 0:16:22     Text: So what's very exciting is that it all lands on "Theory", not on any tokens before that.

0:16:22 - 0:16:26     Text: So this is where the model apparently is accessing the knowledge.

0:16:26 - 0:16:31     Text: And it's sort of surprising because you might imagine that the knowledge is kind of distributed

0:16:31 - 0:16:34     Text: everywhere, maybe a little bit of every time step contributes.

0:16:34 - 0:16:38     Text: And that would be unfortunate because it'd be harder to modify the model's behavior if

0:16:38 - 0:16:39     Text: that were the case.

0:16:39 - 0:16:41     Text: But in fact, it actually concentrates.

0:16:41 - 0:16:46     Text: And they did this over a bunch of different prompts and looked at basically where they got

0:16:46 - 0:16:51     Text: the most impact, measuring the first entity token, middle, last entity token, later words.

0:16:51 - 0:16:53     Text: And it all concentrates very well.

0:16:53 - 0:16:55     Text: So that's an interesting observation.

0:16:55 - 0:16:58     Text: I don't know what to do with it yet, research wise, but I thought you guys should know about

0:16:58 - 0:16:59     Text: it.

0:16:59 - 0:17:01     Text: Okay.

0:17:01 - 0:17:06     Text: So we're going to zoom in on this feed forward layer that they identified with this high

0:17:06 - 0:17:10     Text: causal effect. So they said it was in this time step and they found that effect to exist

0:17:10 - 0:17:12     Text: across many of the layers.

0:17:12 - 0:17:17     Text: And let's just zoom in on one of them to get a better understanding of what's going

0:17:17 - 0:17:19     Text: on here.

0:17:19 - 0:17:25     Text: So as you guys already know from the earlier slide, we can identify this selection vector

0:17:25 - 0:17:30     Text: that comes out of the nonlinearity, which says which memories got selected in the second

0:17:30 - 0:17:32     Text: weight matrix.

0:17:32 - 0:17:36     Text: And furthermore, we know that this output from this weight matrix is responsible for predicting

0:17:36 - 0:17:39     Text: Paris as we saw from the previous plot.

0:17:39 - 0:17:44     Text: So intuitively, that gives you this idea that somehow we should be messing with the weight

0:17:44 - 0:17:50     Text: matrix W2 to change its behavior.

0:17:50 - 0:17:56     Text: And a naive idea for how to change its behavior is well, drawing on our intuitions from word

0:17:56 - 0:18:02     Text: vectors, maybe we just pick one column from W2, the one that the selector selects, and

0:18:02 - 0:18:06     Text: just subtract the word vector for Paris and add the word vector for Rome.

0:18:06 - 0:18:10     Text: And it turns out that there is in fact a paper, which I've linked to here, that does that,

0:18:10 - 0:18:13     Text: and it works to some extent.

0:18:13 - 0:18:15     Text: So they showed positive results with this approach.

0:18:15 - 0:18:20     Text: And that's really quite surprising on its own.
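
As a toy sketch of that naive idea, assuming for illustration that we already know which value column the selector picks and that we have word vectors for the old and new answers (this is not the method from the linked paper, just the intuition):

```python
import numpy as np

def naive_column_edit(W2, column_index, vec_old_answer, vec_new_answer):
    """Toy version of the naive edit: in the value column that the selector picks
    for this prompt, subtract the word vector of the old answer ("Paris") and add
    the word vector of the new answer ("Rome")."""
    W2_edited = np.array(W2, copy=True)
    W2_edited[:, column_index] += vec_new_answer - vec_old_answer
    return W2_edited
```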

0:18:20 - 0:18:24     Text: The particular paper that we've been following, the ROME paper, does something slightly different,

0:18:24 - 0:18:26     Text: but similar in spirit.

0:18:26 - 0:18:29     Text: So they apply a rank one update just to the weight matrix W2.

0:18:29 - 0:18:35     Text: They don't touch W1 at all, kind of consistent with our interpretation that W2 contains

0:18:35 - 0:18:38     Text: the values of the memory, and W1 is just the keys.

0:18:38 - 0:18:45     Text: So what I mean by a rank-one update is: W2 is a matrix, and I add to it another matrix

0:18:45 - 0:18:50     Text: formed from the outer product of two vectors, U and V transpose.

0:18:50 - 0:18:56     Text: And these two vectors are parameters that are optimized to maximize the probability that

0:18:56 - 0:19:03     Text: the model outputs Rome, while also minimizing the change in behavior over all the other inputs.

0:19:03 - 0:19:07     Text: It's out of the scope of this class to describe exactly what U and V are, or at least not

0:19:07 - 0:19:08     Text: in this lecture.

0:19:08 - 0:19:13     Text: So I'll maybe just punt this off to Eric's lecture when he gets there.

0:19:13 - 0:19:18     Text: But I just wanted to show you guys this to show the level of fine-grained control people

0:19:18 - 0:19:24     Text: are starting to look into in terms of editing knowledge in language models.

0:19:24 - 0:19:27     Text: And this wouldn't be complete if I didn't show you some more examples.

0:19:27 - 0:19:31     Text: So the successful example you guys already saw on an earlier slide.

0:19:31 - 0:19:35     Text: But to also show you some not quite successful examples, just to show where the field is

0:19:35 - 0:19:39     Text: right now, they also gave an example of trying to convince the model that the game Sonic

0:19:39 - 0:19:44     Text: Drift 2 was not made by Sega, but instead by Microsoft.

0:19:44 - 0:19:46     Text: And here you can see what the model does instead.

0:19:46 - 0:19:48     Text: It really struggles.

0:19:48 - 0:19:53     Text: It claims that the game is now made by a studio called Playdead, and this studio was led

0:19:53 - 0:19:54     Text: by a former Microsoft employee.

0:19:54 - 0:19:58     Text: So it's really kind of fighting against the change we're trying to make.

0:19:58 - 0:20:01     Text: And that's kind of where we are right now.

0:20:01 - 0:20:07     Text: So that was the first section on how language models currently represent knowledge.

0:20:07 - 0:20:11     Text: The main takeaway that I'd like you guys to have is that the transformer feed-forward

0:20:11 - 0:20:14     Text: network can be viewed as a key-value memory.

0:20:14 - 0:20:17     Text: And that's one of the reasons why when you see people scaling up Transformers to larger

0:20:17 - 0:20:22     Text: sizes, oftentimes they decide to put that scaling budget into making the Feed Forward

0:20:22 - 0:20:27     Text: layer wider, as opposed to making the attention layer wider, or adding more layers, because

0:20:27 - 0:20:31     Text: they're trying to increase that memorization capacity.

0:20:31 - 0:20:36     Text: And the second conclusion is that Transformers tend to look up information about the entity

0:20:36 - 0:20:38     Text: on the last token where it's mentioned.

0:20:38 - 0:20:40     Text: That's quite an interesting thing too.

0:20:40 - 0:20:45     Text: Prior to the ROME paper, there were other papers that tried to just fine-tune the entire

0:20:45 - 0:20:47     Text: network to get it to change its behavior.

0:20:47 - 0:20:50     Text: And when you fine-tune all of the parameters, indeed you can get it to change its behavior

0:20:50 - 0:20:54     Text: on Rome, but you also mess up a bunch of other facts too.

0:20:54 - 0:21:01     Text: So being able to make a small edit to one place actually does turn out to be helpful.

0:21:01 - 0:21:04     Text: And lastly, I just want to say this is a very new research area.

0:21:04 - 0:21:07     Text: Next year I could be saying something completely different.

0:21:07 - 0:21:11     Text: So just take that with a grain of salt.

0:21:11 - 0:21:16     Text: So we're going to go into the second half of the presentation.

0:21:16 - 0:21:20     Text: Actually, because we've still got time, are there any questions

0:21:20 - 0:21:22     Text: about the previous slides?

0:21:22 - 0:21:23     Text: Yeah.

0:21:23 - 0:21:31     Text: Yeah, I'm actually curious, you mentioned how to be surgical about knowledge alteration.

0:21:31 - 0:21:39     Text: Empirically, did the researchers discover, or did you discover in your own work, whether

0:21:39 - 0:21:42     Text: altering, you know, the Eiffel Tower being in Rome

0:21:42 - 0:21:47     Text: causes other cascading effects, like confusing Rome with Paris, or something?

0:21:47 - 0:21:48     Text: That's work.

0:21:48 - 0:21:50     Text: Yeah, yeah, that's a great question.

0:21:50 - 0:21:56     Text: So on the eval benchmarks that they have now for this task, they have, I'm not sure if

0:21:56 - 0:21:58     Text: the exact word is, something like "neighborhood prompts".

0:21:58 - 0:22:03     Text: So they have both prompts that are paraphrases of the original thing, to test that the model

0:22:03 - 0:22:05     Text: is robust to paraphrase.

0:22:05 - 0:22:10     Text: But they also have prompts where they ask, say, about Sears Tower or other towers.

0:22:10 - 0:22:14     Text: And make sure that those towers didn't move to Rome as well.

0:22:14 - 0:22:17     Text: And yeah, so that's very imperfect right now.

0:22:17 - 0:22:21     Text: There's this difficult balance between the two, and that could be a sign of many things.

0:22:21 - 0:22:25     Text: It could be a sign that the model doesn't have enough memorization capacity, so its only

0:22:25 - 0:22:30     Text: way of representing the Eiffel Tower is to just think of it as like a tower, with maybe

0:22:30 - 0:22:31     Text: some nationality mixed in.

0:22:31 - 0:22:36     Text: And when you edit that tower, you just move all the other towers too.

0:22:36 - 0:22:38     Text: Great question, yeah.

0:22:38 - 0:22:39     Text: Yeah.

0:22:39 - 0:22:48     Text: So from one of the other things, on the next slide, you see different types of questions.

0:22:48 - 0:22:52     Text: So you asked about the Eiffel Tower and where it is, but what about, like, asking about

0:22:52 - 0:23:00     Text: other types of properties of the entity, not just the original one,

0:23:00 - 0:23:03     Text: from the perspective of the edit?

0:23:03 - 0:23:07     Text: Yeah, so I think if I understand your question, it's not just asking about that one fact,

0:23:07 - 0:23:10     Text: but other facts related to the Eiffel Tower.

0:23:10 - 0:23:15     Text: Yeah, so they also have this kind of freeform generation prompt, I think, where they just

0:23:15 - 0:23:20     Text: initialize it with the Eiffel Tower, some short prompt, and then they measure the different kinds

0:23:20 - 0:23:21     Text: of text that come out.

0:23:21 - 0:23:26     Text: And I think if I remember correctly, they check like N-gram overlap with real text as

0:23:26 - 0:23:27     Text: well.

0:23:27 - 0:23:28     Text: I didn't.

0:23:28 - 0:23:33     Text: So they have a couple more analyses in the paper of how it behaves on other topics too.

0:23:33 - 0:23:35     Text: Yeah.

0:23:35 - 0:23:42     Text: One thing worth mentioning is, I guess, I worked with another student here on a paper on

0:23:42 - 0:23:44     Text: counterfactual updates.

0:23:44 - 0:23:52     Text: So we have this data set, which we should really get released out soon, that pairs one

0:23:52 - 0:23:58     Text: fact update with another implication of that fact update that is non-obvious.

0:23:58 - 0:24:03     Text: So for example, if Walt Disney had one of his Academy Awards stripped, then the next question

0:24:03 - 0:24:07     Text: would be how many Academy Awards did Walt Disney win.

0:24:07 - 0:24:11     Text: And that's like one of the areas that I think some researchers are thinking about for future

0:24:11 - 0:24:12     Text: work.

0:24:12 - 0:24:13     Text: Yeah.

0:24:13 - 0:24:43     Text: So the current method basically only allows control through the prompt.

0:24:43 - 0:24:47     Text: So whatever the prompt implies about the entity, you get to change what the answer to that

0:24:47 - 0:24:49     Text: prompt is.

0:24:49 - 0:24:53     Text: And it's basically at the moment up to the model to decide what implications are affected

0:24:53 - 0:24:55     Text: by changing that prompt.

0:24:55 - 0:24:57     Text: Which is another, you could say, weakness of the approach.

0:24:57 - 0:24:59     Text: It's not entirely interpretable.

0:24:59 - 0:25:00     Text: Everything is done through optimization.

0:25:00 - 0:25:01     Text: Yeah.

0:25:01 - 0:25:02     Text: Great question.

0:25:02 - 0:25:03     Text: Okay.

0:25:03 - 0:25:05     Text: We'll go on for now.

0:25:05 - 0:25:07     Text: And then, oh, it was a point.

0:25:07 - 0:25:08     Text: Oh, yeah.

0:25:08 - 0:25:15     Text: Okay.

0:25:15 - 0:25:22     Text: The question is about whether the low rank update has anything to do with adversarial machine

0:25:22 - 0:25:23     Text: learning.

0:25:23 - 0:25:29     Text: I'd say maybe there is a slight connection in that, if you look more deeply in the paper,

0:25:29 - 0:25:36     Text: what they're doing is they're optimizing the output of the weight matrix to change the

0:25:36 - 0:25:40     Text: label and that optimization procedure is very similar to what they do in adversarial machine

0:25:40 - 0:25:41     Text: learning.

0:25:41 - 0:25:45     Text: The low rankness of it is not necessarily connected to the adversarial learning.

0:25:45 - 0:25:50     Text: The low rank part is just to minimize the amount of change to the weight matrix.

0:25:50 - 0:25:53     Text: I can maybe just quickly do this on the board here.

0:25:53 - 0:26:03     Text: So if you have this weight matrix, W2 plus UV transpose, the reason this is considered

0:26:03 - 0:26:08     Text: a small update to the matrix is because if you think about anything multiplying W2 after

0:26:08 - 0:26:12     Text: it's received this update, let's say you're multiplying it by X, right?

0:26:12 - 0:26:19     Text: So this whole thing is multiplied by X. So let's just move this X into the expression.

0:26:19 - 0:26:26     Text: You get W2 times X and then UV transpose times X. And this part right here is just what

0:26:26 - 0:26:28     Text: W2 originally would have done.

0:26:28 - 0:26:34     Text: And this part right here, this vector V is being dot-producted with X. And in high-dimensional

0:26:34 - 0:26:39     Text: spaces, most dot products are basically zero.

0:26:39 - 0:26:44     Text: And so this quantity here is likely to be zero unless X is very close to V, and when it is zero,

0:26:44 - 0:26:47     Text: this whole update here basically disappears.

0:26:47 - 0:26:50     Text: So in that sense, you could think of a low rank update as a very small change to weight

0:26:50 - 0:26:52     Text: matrix.
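
Writing out the step on the board:

```latex
(W_2 + u v^{\top})\,x \;=\; W_2 x \;+\; u\,(v^{\top} x)
% v^T x is a scalar; in high dimensions it is near zero for most inputs x,
% so the added term only matters when x points in a direction close to v.
```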

0:26:52 - 0:27:00     Text: Okay, cool. We'll move on to the next section.

0:27:00 - 0:27:02     Text: All right.

0:27:02 - 0:27:06     Text: So now we've seen what language models currently do to represent knowledge or at least a theory

0:27:06 - 0:27:08     Text: about that.

0:27:08 - 0:27:10     Text: And we'd like to ask ourselves, well, is that really it?

0:27:10 - 0:27:16     Text: Do we just need to make feed-forward layers bigger and bigger to achieve the singularity

0:27:16 - 0:27:21     Text: or whatever it is that artificial intelligence researchers are focused on these days?

0:27:21 - 0:27:26     Text: Okay. So what is missing from transformers right now?

0:27:26 - 0:27:30     Text: We can automatically acquire knowledge from the web, but a lot of that information can

0:27:30 - 0:27:32     Text: be noisy or incorrect.

0:27:32 - 0:27:38     Text: So the web certainly has its share of misinformation or rumors and opinions.

0:27:38 - 0:27:42     Text: And when it absorbs that misinformation or other things, we can't trace the model's

0:27:42 - 0:27:44     Text: knowledge back to an attributable source.

0:27:44 - 0:27:47     Text: So we can trace it back to a particular layer in a feed-forward network, but that still

0:27:47 - 0:27:53     Text: doesn't tell us where in the training data it learned that.

0:27:53 - 0:27:57     Text: Now that would all be okay if we could then surgically edit the model to fix up all of those

0:27:57 - 0:27:58     Text: errors.

0:27:58 - 0:28:02     Text: But as you've seen, it doesn't work very reliably yet.

0:28:02 - 0:28:05     Text: And another fact that I didn't mention is if you apply a bunch of these edits in sequence

0:28:05 - 0:28:11     Text: to the model, eventually the parameters get kind of so damaged from the edits that it

0:28:11 - 0:28:14     Text: doesn't maintain its original performance anymore.

0:28:14 - 0:28:19     Text: And we can continue to try storing knowledge inside feed-forward layers, but the current

0:28:19 - 0:28:23     Text: memorization capacity is still too small, even though we're building these very large

0:28:23 - 0:28:26     Text: and expensive models.

0:28:26 - 0:28:30     Text: So we can rephrase some of these issues as a wish list for what we would want in a

0:28:30 - 0:28:33     Text: knowledge-based language model.

0:28:33 - 0:28:38     Text: We'd want fast and modular knowledge editing, so be able to robustly edit the model multiple

0:28:38 - 0:28:40     Text: times without breaking it.

0:28:40 - 0:28:44     Text: We'd like attribution and interpretability, so tracing a model's knowledge back to something

0:28:44 - 0:28:48     Text: in its training set, and we'd like efficient scaling.

0:28:48 - 0:28:53     Text: So we'd like to be able to increase the model's memory size by 10x without paying 10x more

0:28:53 - 0:28:55     Text: compute.

0:28:55 - 0:28:59     Text: And to give a motivating example, let's just say you wanted to use something like GPT-3

0:28:59 - 0:29:05     Text: to do question answering over your company or school wiki.

0:29:05 - 0:29:09     Text: At the moment, as we know, a single training run, at least when it was originally done,

0:29:09 - 0:29:14     Text: cost over $12 million, and we just can't afford to do that for every organization that

0:29:14 - 0:29:18     Text: wants to train a system off of their data.

0:29:18 - 0:29:22     Text: And furthermore, information is constantly being updated over time, for example, like the

0:29:22 - 0:29:25     Text: COVID requirements for being on campus.

0:29:25 - 0:29:31     Text: So all of this sort of motivates this wish list that I'm giving here.

0:29:31 - 0:29:36     Text: So with that, we'll turn to the next main half of the lecture, which is on memory-augmented

0:29:36 - 0:29:38     Text: models.

0:29:38 - 0:29:44     Text: So let me first just give a basic overview of what a memory-augmented model is.

0:29:44 - 0:29:48     Text: To start with, let's just consider a standard neural network model, which takes some sort

0:29:48 - 0:29:53     Text: of input like this one here, passes it through some dense computation, and then produces some

0:29:53 - 0:29:54     Text: output.

0:29:54 - 0:29:59     Text: And the difference that we're going to make to this model is we're going to attach a memory

0:29:59 - 0:30:01     Text: retriever to it.

0:30:01 - 0:30:05     Text: So the input is going to be fed into this memory retriever, which then accesses some external

0:30:05 - 0:30:11     Text: knowledge source that can be easily scaled, easily edited, easily understood by humans,

0:30:11 - 0:30:14     Text: like Wikipedia.

0:30:14 - 0:30:18     Text: And from that, we'll try to identify some piece of information that is relevant for the

0:30:18 - 0:30:23     Text: task at hand, and feed it back into the neural network, and then produce the prediction.

0:30:23 - 0:30:26     Text: So that's the basic approach that we're thinking about.
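
A minimal sketch of that setup, assuming hypothetical retriever and generator components (none of these names come from a specific paper or library):

```python
def answer_with_memory(query, retriever, knowledge_source, generator, k=1):
    """Minimal sketch of the memory-augmented setup described above.

    `retriever`, `knowledge_source`, and `generator` are hypothetical components:
    the retriever scores entries of the knowledge source (e.g. Wikipedia passages)
    against the input, and the generator is an ordinary text-to-text model."""
    # 1. Retrieve the top-k memories most relevant to the input.
    memories = retriever.top_k(query, knowledge_source, k=k)
    # 2. Feed the retrieved text back into the network alongside the input.
    augmented_input = " ".join(m.text for m in memories) + " [SEP] " + query
    # 3. Produce the prediction conditioned on both.
    return generator.generate(augmented_input)
```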

0:30:26 - 0:30:31     Text: And the memory that the memory retriever selects can be any number of things.

0:30:31 - 0:30:33     Text: It could be a document on the web.

0:30:33 - 0:30:38     Text: It could be a record in a database. It could be a training example, an entity embedding.

0:30:38 - 0:30:42     Text: I'm just going to focus on text for now, but most of what we'll talk about in this lecture

0:30:42 - 0:30:49     Text: can apply to other kinds of objects, and you should just keep that in mind that it's

0:30:49 - 0:30:52     Text: really not just about text.

0:30:52 - 0:30:56     Text: So this potentially meets our wish list of things we'd like to do.

0:30:56 - 0:30:59     Text: You can easily edit the knowledge in something like Wikipedia.

0:30:59 - 0:31:01     Text: You can easily attribute back to it.

0:31:01 - 0:31:04     Text: There's a source, there's an author for everything that's put in there, and there's efficient

0:31:04 - 0:31:09     Text: scaling, because I can always add more articles to Wikipedia, and I don't have to change the

0:31:09 - 0:31:13     Text: size of a neural network that's accessing it.

0:31:13 - 0:31:18     Text: And some motivating applications for why you would care about this.

0:31:18 - 0:31:21     Text: If you're building an open domain dialogue or question answering system, you want to

0:31:21 - 0:31:26     Text: have robust access to knowledge by retrieving documents on the web.

0:31:26 - 0:31:31     Text: If you're generating code, I'm pretty sure all of us are guilty of, at least at some point,

0:31:31 - 0:31:33     Text: going on Stack Overflow and copying a snippet.

0:31:33 - 0:31:39     Text: So even we do retrieval; we are, in a way, a form of memory-augmented model.

0:31:39 - 0:31:42     Text: If you're doing image generation, if somebody tells you, I want a picture of the Eiffel Tower

0:31:42 - 0:31:47     Text: on the White House lawn, you might consult some reference pictures of those objects.

0:31:47 - 0:31:51     Text: And if you're doing fact checking, you might want to retrieve documents that support or

0:31:51 - 0:31:52     Text: refute a claim.

0:31:52 - 0:31:56     Text: All of these things are very knowledge intensive tasks, and could benefit from an approach like

0:31:56 - 0:31:57     Text: this.

0:31:57 - 0:31:58     Text: Yeah, question.

0:31:58 - 0:31:59     Text: Yeah.

0:31:59 - 0:32:00     Text: Yeah.

0:32:00 - 0:32:01     Text: Yeah.

0:32:01 - 0:32:02     Text: That's a good question.

0:32:02 - 0:32:26     Text: So the question is whether you have to retrieve one memory, and that's not the case.

0:32:26 - 0:32:30     Text: I'm going to use that as the simplified example that we'll work with, but you could retrieve

0:32:30 - 0:32:31     Text: multiple memories.

0:32:31 - 0:32:35     Text: The complexity of retrieving multiple memories increases, and we may get to that in a

0:32:35 - 0:32:36     Text: little bit.

0:32:36 - 0:32:37     Text: Good question.

0:32:37 - 0:32:38     Text: Yeah.

0:32:38 - 0:32:39     Text: Okay.

0:32:39 - 0:32:43     Text: All right.

0:32:43 - 0:32:47     Text: So the rest of this talk is going to be structured around the main design questions around

0:32:47 - 0:32:50     Text: how to design a memory augmented model.

0:32:50 - 0:32:53     Text: So first you have to choose what your memories are.

0:32:53 - 0:32:56     Text: I'm actually not going to focus on that much because based on your application, you can

0:32:56 - 0:33:00     Text: usually guess what you would want your memories to be.

0:33:00 - 0:33:04     Text: And then we're going to think very hard about how to retrieve the memories.

0:33:04 - 0:33:08     Text: That's essentially the heart, maybe the hardest part of the problem.

0:33:08 - 0:33:11     Text: Approaches we'll look at are: you could use an off-the-shelf search engine like Google

0:33:11 - 0:33:17     Text: or Stack Overflow, or you could train your own memory retriever, which we'll spend some

0:33:17 - 0:33:18     Text: time on.

0:33:18 - 0:33:22     Text: And lastly, we'll end by looking at how to use retrieved memories.

0:33:22 - 0:33:28     Text: So we'll cover a few different approaches such as text fusion, label smearing.

0:33:28 - 0:33:32     Text: And perhaps most interestingly, I'll talk about a few common failure modes for when models

0:33:32 - 0:33:34     Text: try to use memory.

0:33:34 - 0:33:38     Text: One of them is something that we'll call underutilization, where the model actually ignores

0:33:38 - 0:33:39     Text: the retrieved memories.

0:33:39 - 0:33:43     Text: And another one is over reliance, where the model somehow kind of becomes too dependent

0:33:43 - 0:33:44     Text: on memory.

0:33:44 - 0:33:48     Text: And I'll talk about that when we get there.

0:33:48 - 0:33:50     Text: Okay.

0:33:50 - 0:33:54     Text: So let's go into the section on retrieving memories.

0:33:54 - 0:33:57     Text: So I'll kind of organize them into two broad groups.

0:33:57 - 0:34:01     Text: So one is the set of approaches that use an external tool.

0:34:01 - 0:34:04     Text: And the other is the set where you train your own.

0:34:04 - 0:34:09     Text: And we'll start with an approach that uses an external tool, just because I think some

0:34:09 - 0:34:14     Text: of those approaches have been really popping up this year and are quite exciting.

0:34:14 - 0:34:19     Text: The first approach we'll look at is from this paper called LaMDA, which stands for Language

0:34:19 - 0:34:21     Text: Models for Dialog Applications.

0:34:21 - 0:34:25     Text: So on the right we've got a dialogue between a human user and the model.

0:34:25 - 0:34:28     Text: This is a, I think, a real dialogue from their paper.

0:34:28 - 0:34:32     Text: And the user is just asking about a particular artist and the model is giving a very spirited

0:34:32 - 0:34:37     Text: reply, complete with personal opinions and even follow up information.

0:34:37 - 0:34:42     Text: So LaMDA is an open-domain dialog chatbot that's designed to cover a very large range

0:34:42 - 0:34:43     Text: of topics.

0:34:43 - 0:34:47     Text: So you need some kind of memory component that's able to handle anything the user might

0:34:47 - 0:34:49     Text: throw at the model.

0:34:49 - 0:34:53     Text: And the basic version of the model is just a transformer decoder.

0:34:53 - 0:34:58     Text: So it's the same kind of transformer that we were studying in the previous slides.

0:34:58 - 0:35:02     Text: The input to that transformer is essentially the previous turns of the conversation, represented

0:35:02 - 0:35:03     Text: as text.

0:35:03 - 0:35:07     Text: And the output is just a new utterance that it needs to generate, also as text.

0:35:07 - 0:35:10     Text: So it's just a text to text kind of approach.

0:35:10 - 0:35:11     Text: What's new though?

0:35:11 - 0:35:14     Text: Oh, okay, not quite there, yeah.

0:35:14 - 0:35:18     Text: So one last thing about this model though is as they were developing it, they noticed that

0:35:18 - 0:35:20     Text: it often generated factually incorrect claims.

0:35:20 - 0:35:26     Text: So just to highlight that, this last claim that the model makes is factually incorrect.

0:35:26 - 0:35:31     Text: In this case, this particular artist that was supposedly inspired by the earlier artists

0:35:31 - 0:35:34     Text: stopped working before the first one began working.

0:35:34 - 0:35:37     Text: So this just can't be true.

0:35:37 - 0:35:40     Text: And their approach to solving this problem will be to teach their base model to learn

0:35:40 - 0:35:45     Text: to use a search engine to validate or fix claims that it's made.

0:35:45 - 0:35:48     Text: And I'll show you how that approach works on the next slide.

0:35:48 - 0:35:54     Text: The basic idea is you've got a user interacting with LaMDA, which is this big box here.

0:35:54 - 0:35:59     Text: And what they've decided to do is have multiple agents inside the big box that can kind of

0:35:59 - 0:36:04     Text: interact with each other and kind of work things out before they give a reply back to the

0:36:04 - 0:36:05     Text: user.

0:36:05 - 0:36:07     Text: So here's how it might go.

0:36:07 - 0:36:11     Text: The user says to the base model, when was the Eiffel Tower built?

0:36:11 - 0:36:16     Text: And the base model replies it was constructed in 1887.

0:36:16 - 0:36:20     Text: But unlike the basic approach, it doesn't send that response immediately back to the

0:36:20 - 0:36:21     Text: user.

0:36:21 - 0:36:25     Text: It actually sends it to this agent called Research.

0:36:25 - 0:36:29     Text: And then research then takes that information and decides, okay, you know, this claim looks

0:36:29 - 0:36:32     Text: a little bit sketchy.

0:36:32 - 0:36:35     Text: I'm going to send a query to the search engine.

0:36:35 - 0:36:40     Text: I'm anthropomorphizing a bunch here, but it just helps with the explanation.

0:36:40 - 0:36:45     Text: And then the search engine then replies back with the web search results for that query.

0:36:45 - 0:36:49     Text: And in it, we have actually the correct answer, which is 1889.

0:36:49 - 0:36:55     Text: So research then takes that information and produces a new response that has the correct

0:36:55 - 0:36:57     Text: information and sends that back to the user.

0:36:57 - 0:36:59     Text: So that's the overall flow of the approach.

0:36:59 - 0:37:04     Text: And you can see that the search engine is able to intervene to fix a model's responses.

0:37:04 - 0:37:05     Text: Any questions about that?

0:37:05 - 0:37:19     Text: Okay, yeah, the question was whether this particular flow happens for all questions to the

0:37:19 - 0:37:20     Text: model.

0:37:20 - 0:37:25     Text: And actually, as we'll see in a later slide, the model gets to decide who it talks to next.

0:37:25 - 0:37:30     Text: So base has to decide to talk to research as opposed to talking to the user.

0:37:30 - 0:37:34     Text: So there's a learning process where each agent gets to decide whether to go right back

0:37:34 - 0:37:37     Text: to the user or to talk to another one of the other agents.

0:37:37 - 0:37:38     Text: Go ahead.

0:37:38 - 0:38:03     Text: Okay, yeah, the question is how to limit the amount of information coming back from the

0:38:03 - 0:38:04     Text: search engine.

0:38:04 - 0:38:10     Text: I think in this particular approach, the search engine returns the first snippet.

0:38:10 - 0:38:13     Text: And then if research is still not happy, it asks the same question again.

0:38:13 - 0:38:16     Text: And then they've designed it so the search engine returns the next snippet.

0:38:16 - 0:38:19     Text: So it just sort of yields control back to the first.

0:38:19 - 0:38:23     Text: Yeah, yeah.

0:38:23 - 0:38:37     Text: Yeah, okay, so the question is whether research sends any feedback back to the base model

0:38:37 - 0:38:38     Text: that it's made a mistake.

0:38:38 - 0:38:40     Text: And that's a great idea.

0:38:40 - 0:38:42     Text: I don't think they do that in the paper.

0:38:42 - 0:38:47     Text: They just, I mean, research overrides base so the user gets what research said.

0:38:47 - 0:38:50     Text: But base never learns, I think, from what research says.

0:38:50 - 0:38:53     Text: So that's a really, yeah, that's a really great point.

0:38:53 - 0:39:02     Text: Okay, so the question is, why do we need the base model?

0:39:02 - 0:39:03     Text: Yeah, that's a great point too.

0:39:03 - 0:39:09     Text: So with LaMDA, the researchers who developed it cared not just about answering factual

0:39:09 - 0:39:12     Text: questions but also about making it interesting and fun and engaging.

0:39:12 - 0:39:16     Text: So the base model has a lot of that engaging behavior.

0:39:16 - 0:39:19     Text: And they wanted to preserve that while still preserving factuality.

0:39:19 - 0:39:24     Text: So the other two agents are kind of there to police the base model.

0:39:24 - 0:39:27     Text: That's maybe one explanation.

0:39:27 - 0:39:31     Text: Oh, right.

0:39:31 - 0:39:33     Text: Great, great questions.

0:39:33 - 0:39:36     Text: So now that we've seen the overall control flow of the model, we can look at how the model

0:39:36 - 0:39:37     Text: is trained.

0:39:37 - 0:39:39     Text: And it's actually quite simple.

0:39:39 - 0:39:42     Text: So their modeling approach is to just treat everything as dialogue.

0:39:42 - 0:39:46     Text: So let's look at a particular turn of the model's operation.

0:39:46 - 0:39:52     Text: So there've been a couple turns of conversation so far, which I've listed here.

0:39:52 - 0:39:58     Text: And you can see it's just saying who talked and who they talked to.

0:39:58 - 0:40:03     Text: And the output at this particular time step is just another utterance and who it

0:40:03 - 0:40:04     Text: should be addressed to.

0:40:04 - 0:40:06     Text: So all of this is just text.

0:40:06 - 0:40:08     Text: It's text going in, it's text going out.

0:40:08 - 0:40:13     Text: So you guys have seen transformer models and basically this fits right into the contract

0:40:13 - 0:40:16     Text: of a standard transformer model.

0:40:16 - 0:40:20     Text: The only kind of special detail is that when you generate the text, you have to start your

0:40:20 - 0:40:22     Text: sentence with who you're addressing.

0:40:22 - 0:40:30     Text: And that provides the control of which agent responds next.
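
To make that concrete, the serialized input and output might look something like this; the exact formatting and agent names in the LaMDA paper differ, so treat this purely as an illustration:

```text
Input (the conversation so far, as plain text):
  User to LaMDA-Base: When was the Eiffel Tower built?
  LaMDA-Base to LaMDA-Research: It was constructed in 1887.
  LaMDA-Research to Search: Eiffel Tower construction date
  Search to LaMDA-Research: ... construction began in 1887 and the tower was completed in 1889 ...

Output (the next utterance, also plain text, starting with who is being addressed):
  LaMDA-Research to User: Work started in 1887, and it was completed in 1889.
```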

0:40:30 - 0:40:32     Text: I already mentioned this.

0:40:32 - 0:40:38     Text: And the perhaps the most important question is, okay, we've got this text to text data.

0:40:38 - 0:40:39     Text: How do we train this model?

0:40:39 - 0:40:42     Text: And the approach in LaMDA is actually quite simple, too.

0:40:42 - 0:40:45     Text: They basically just get human demonstrations.

0:40:45 - 0:40:50     Text: So human crowd workers play the role of user and research in this dialogue.

0:40:50 - 0:40:54     Text: There are people who are looking at the base model's utterances and saying, oh, I don't

0:40:54 - 0:40:55     Text: like that.

0:40:55 - 0:40:57     Text: I think a search query should be sent here.

0:40:57 - 0:41:00     Text: And when the search results come back, they're reading the results and then deciding how

0:41:00 - 0:41:02     Text: LaMDA should respond instead.

0:41:02 - 0:41:07     Text: So it's a really elegant and simple approach, but it does require you to have trained crowd

0:41:07 - 0:41:11     Text: workers and put in a good amount of budget to get the behavior you want.

0:41:11 - 0:41:14     Text: But still quite impressive that they're able to do this.

0:41:14 - 0:41:18     Text: This is a real example, I think, from the paper.

0:41:18 - 0:41:22     Text: Cool, any questions there?

0:41:22 - 0:41:23     Text: All right.

0:41:23 - 0:41:27     Text: So although the approach is simple, it actually achieves quite a bit.

0:41:27 - 0:41:32     Text: So the model learns to reformulate the previous turns of the conversation as a query that can

0:41:32 - 0:41:34     Text: go into Google search.

0:41:34 - 0:41:38     Text: So it's kind of shoe-horning the problem into something that Google search or some kind

0:41:38 - 0:41:40     Text: of web search can understand.

0:41:40 - 0:41:45     Text: And then it's learning also from human demonstrations how to incorporate the knowledge from the search

0:41:45 - 0:41:49     Text: results back into the utterance that it's putting out.

0:41:49 - 0:41:54     Text: And because this work came out around the same time and is also very exciting, I'll also

0:41:54 - 0:42:00     Text: point you to WebGPT, which is another model that learns to use web search.

0:42:00 - 0:42:04     Text: In their case, they provide human demonstrators with an actual UI and have human demonstrators

0:42:04 - 0:42:05     Text: use that.

0:42:05 - 0:42:09     Text: But they ultimately, I think, convert the history of actions that the user takes again

0:42:09 - 0:42:16     Text: into a piece of text that the model then simply consumes and uses to predict the next action.

0:42:16 - 0:42:20     Text: The additional thing here is that they use reinforcement learning to then fine tune their

0:42:20 - 0:42:24     Text: system further on top of what they learned from human demonstrations.

0:42:24 - 0:42:26     Text: And that's something worth checking out as well.

0:42:26 - 0:42:31     Text: So the main takeaway for this little section here is that many external retrieval tools

0:42:31 - 0:42:35     Text: accept text as input and return text as output.

0:42:35 - 0:42:39     Text: So if you want to have an external memory and interface with one of these things, all

0:42:39 - 0:42:44     Text: the task really boils down to is learning to generate text queries to that external tool

0:42:44 - 0:42:48     Text: and then learning to understand the text output of the tool.

0:42:48 - 0:42:54     Text: And both of these tasks can be handled by standard off-the-shelf tools that all of you are already

0:42:54 - 0:42:57     Text: familiar with from previous lectures.

0:42:57 - 0:43:02     Text: As long as you have demonstrations for how to do that.

0:43:02 - 0:43:05     Text: Or if you're able to do RL training, which we won't cover here.

0:43:05 - 0:43:08     Text: So that's the overview of how to use external search tools.

0:43:08 - 0:43:13     Text: You can imagine that if you had a database instead of a web search, you could provide demonstrations

0:43:13 - 0:43:20     Text: of how to write SQL queries to that database or any other sort of tool that you could imagine.
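
As a hedged sketch of what such a text-in, text-out tool interface could look like, here is a toy example with a SQLite database; the table schema and the idea of the model emitting raw SQL are assumptions for illustration, not anything prescribed by the papers above.

```python
# Sketch of a text-in / text-out tool interface (illustrative; the table
# schema and the model emitting raw SQL are assumptions).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE landmarks (name TEXT, city TEXT, year INTEGER)")
conn.execute("INSERT INTO landmarks VALUES ('Eiffel Tower', 'Paris', 1889)")

def call_tool(sql_query: str) -> str:
    """The 'tool': takes text (SQL), returns text (rows rendered as a string)."""
    rows = conn.execute(sql_query).fetchall()
    return "\n".join(", ".join(str(v) for v in row) for row in rows)

# In a trained system this query would be generated by the model from the
# dialogue context; here we hard-code what a demonstration might look like.
model_generated_query = "SELECT year FROM landmarks WHERE name = 'Eiffel Tower'"
tool_output = call_tool(model_generated_query)
print(tool_output)  # "1889" -- text the reader model can then condition on
```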

0:43:20 - 0:43:22     Text: All right.

0:43:22 - 0:43:25     Text: So at this point, you might say, all right, we can query web search and web search is very

0:43:25 - 0:43:26     Text: powerful.

0:43:26 - 0:43:28     Text: So why would we use anything else?

0:43:28 - 0:43:31     Text: And to that, I have a couple responses.

0:43:31 - 0:43:34     Text: So first of all, web search is just far from perfect.

0:43:34 - 0:43:37     Text: And the reason it's as good as it is today is because of research.

0:43:37 - 0:43:38     Text: And we're here to do research.

0:43:38 - 0:43:44     Text: So if you're just going to rely on web search being good, that sort of defeats the point.

0:43:44 - 0:43:47     Text: If you don't believe me, try some of these queries.

0:43:47 - 0:43:51     Text: So if you search for a famous lawyer who got into car accident, you will find that all

0:43:51 - 0:43:56     Text: the results are about lawyers you can call if you get into a car accident.

0:43:56 - 0:44:00     Text: If you search for "use NLP to parse research papers", you will find a bunch of research papers

0:44:00 - 0:44:01     Text: on parsing.

0:44:01 - 0:44:06     Text: And after doing a few of these, the illusion of web search working really well kind of fades

0:44:06 - 0:44:08     Text: away a little bit.

0:44:08 - 0:44:12     Text: And also, if you speak a language other than English, you might find that search performance

0:44:12 - 0:44:15     Text: in different languages is really not quite the same.

0:44:15 - 0:44:19     Text: So there's still a lot to do to improve retrieval in web search.

0:44:19 - 0:44:26     Text: And I sort of consider web search to be just one component inside the larger

0:44:26 - 0:44:31     Text: set of things that memory augmented models could potentially do.

0:44:31 - 0:44:35     Text: And second of all, just the plain API of web search isn't designed to handle everything

0:44:35 - 0:44:37     Text: you might want to do.

0:44:37 - 0:44:41     Text: So you could imagine a doctor given a medical image might want to retrieve similar images

0:44:41 - 0:44:43     Text: from a medical textbook.

0:44:43 - 0:44:47     Text: That's not quite something that web search is cut out to do right now.

0:44:47 - 0:44:50     Text: Or if you're a programmer who's been given a programming challenge, you might want to retrieve

0:44:50 - 0:44:52     Text: relevant algorithms.

0:44:52 - 0:44:54     Text: Also something web search doesn't do.

0:44:54 - 0:44:56     Text: If you're in fashion, if you're given three pieces of clothing, can you retrieve another

0:44:56 - 0:44:59     Text: piece of clothing that completes your outfit?

0:44:59 - 0:45:03     Text: Or if you're a novelist and you're given a story, retrieve other stories that have the

0:45:03 - 0:45:05     Text: same plot.

0:45:05 - 0:45:09     Text: Or if you're a journalist, if you're given a claim, retrieve and use articles that refute

0:45:09 - 0:45:10     Text: or contradicted.

0:45:10 - 0:45:14     Text: These are all retrieval tasks that would be quite useful, but existing search tools just

0:45:14 - 0:45:15     Text: don't handle.

0:45:15 - 0:45:20     Text: And so these are all reasons why I think the retrieval problem is still interesting to

0:45:20 - 0:45:22     Text: look at.

0:45:22 - 0:45:25     Text: And a third and final point is that web search only accesses public data.

0:45:25 - 0:45:30     Text: So if you have any task that doesn't condition on public data, you're still going to need

0:45:30 - 0:45:33     Text: a retriever of your own.

0:45:33 - 0:45:34     Text: Cool.

0:45:34 - 0:45:38     Text: So with that, we'll turn to the next part of the talk, which is how to train your own

0:45:38 - 0:45:44     Text: neural retriever, which is something that I find very interesting.

0:45:44 - 0:45:51     Text: So we'll start by giving an anatomy of a neural retriever kind of similar to what we showed

0:45:51 - 0:45:53     Text: for feedforward networks and transformers.

0:45:53 - 0:45:56     Text: We're going to go with this key value type of interpretation.

0:45:56 - 0:45:59     Text: So you have a set of keys paired with a set of values.

0:45:59 - 0:46:05     Text: And given some input, you're going to compute some sort of similarity score between the

0:46:05 - 0:46:07     Text: input and each of the keys.

0:46:07 - 0:46:13     Text: And once you've computed that score, you basically want to return the value associated with

0:46:13 - 0:46:15     Text: the highest scoring key.

0:46:15 - 0:46:22     Text: Or you could return the values for the top K highest scoring keys, or use any other selection rule.

0:46:22 - 0:46:25     Text: So to just kind of ground this example, the input could be something like Eiffel Tower

0:46:25 - 0:46:26     Text: location.

0:46:26 - 0:46:31     Text: The keys could be titles of documents and the values could be the corresponding text associated

0:46:31 - 0:46:34     Text: with the document.

0:46:34 - 0:46:39     Text: And the basic takeaway here is just that a retriever is really just a function that takes

0:46:39 - 0:46:43     Text: some input and a key and produces a score.

0:46:43 - 0:46:45     Text: Once you have that, you basically have a retriever.

0:46:45 - 0:46:51     Text: You score all the memories and then you take the ones that have the highest score.

0:46:51 - 0:46:55     Text: For the remaining slides, I'll actually go with a slightly simplified setup.

0:46:55 - 0:46:59     Text: Because in many tasks, there's really no distinction between the keys and the values.

0:46:59 - 0:47:00     Text: Sometimes they're just the same thing.

0:47:00 - 0:47:04     Text: So for example, with Wikipedia documents, you just take the whole text as the key and

0:47:04 - 0:47:06     Text: the value.

0:47:06 - 0:47:12     Text: And so I'll go with this simplified schematic where you're just computing scores against

0:47:12 - 0:47:13     Text: what I'll call memories.

0:47:13 - 0:47:18     Text: And the highest scoring memory is what's returned from the memory retriever.

0:47:18 - 0:47:21     Text: And we'll go with this formulation that the retriever is just a function that takes the

0:47:21 - 0:47:25     Text: input and the memory and produces a score.

0:47:25 - 0:47:29     Text: Okay, so I've said it's just a function, but what sort of functions do people actually

0:47:29 - 0:47:31     Text: use in practice to compute this score?

0:47:31 - 0:47:35     Text: And the answer here is not too surprising.

0:47:35 - 0:47:39     Text: You guys have seen Bert and other sorts of transformers in previous classes.

0:47:39 - 0:47:41     Text: It's a very flexible model class.

0:47:41 - 0:47:44     Text: I'm trying not to introduce kind of unnecessary complexity.

0:47:44 - 0:47:49     Text: So we'll just go with a Bert model that takes the input and the memory.

0:47:49 - 0:47:55     Text: And you put some sort of regression layer on top of the output layer of Bert.

0:47:55 - 0:48:00     Text: So maybe the CLS token embedding of Bert, you put a regression layer on top and

0:48:00 - 0:48:03     Text: it produces some float valued score.

0:48:03 - 0:48:05     Text: And this whole thing is differentiable.

0:48:05 - 0:48:07     Text: So the regression layer is differentiable.

0:48:07 - 0:48:08     Text: Bert is differentiable.

0:48:08 - 0:48:12     Text: This gives you basically a neural network that produces a score.
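
Here is a minimal sketch of that cross-encoder scorer, assuming a HuggingFace-style Bert; the model name and the single linear "regression layer" on the CLS embedding are illustrative choices, not a prescribed recipe.

```python
# Sketch: cross-encoder retriever score. Feed (input, memory) jointly into
# BERT, take the [CLS] embedding, and map it to a scalar score.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")
score_head = torch.nn.Linear(encoder.config.hidden_size, 1)  # "regression layer"

def cross_encoder_score(query: str, memory: str) -> torch.Tensor:
    batch = tokenizer(query, memory, return_tensors="pt", truncation=True)
    cls = encoder(**batch).last_hidden_state[:, 0]   # [CLS] token embedding
    return score_head(cls).squeeze(-1)               # float-valued score

s = cross_encoder_score("Eiffel Tower location", "The Eiffel Tower is in Paris.")
# Everything above is differentiable, so gradients flow into both the
# regression layer and Bert when this score is used in a training loss.
```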

0:48:12 - 0:48:16     Text: And the advantages are you get this very powerful model that's comparing the input against

0:48:16 - 0:48:18     Text: the memory and it's differentiable.

0:48:18 - 0:48:20     Text: So all of that is good.

0:48:20 - 0:48:26     Text: The disadvantage of this approach is that if you have millions of memories, then every

0:48:26 - 0:48:30     Text: time a new input comes in, in order to retrieve a memory, you have to run this computation

0:48:30 - 0:48:33     Text: against all one million of the memories.

0:48:33 - 0:48:38     Text: So that's just way too expensive to do if you're thinking about something like all of Wikipedia

0:48:38 - 0:48:39     Text: or all of the web.

0:48:39 - 0:48:47     Text: So next we'll turn to a different architecture that is more commonly used for retrieval on

0:48:47 - 0:48:49     Text: the next slide.

0:48:49 - 0:48:51     Text: So it's very similar.

0:48:51 - 0:48:53     Text: The Bert picture comes up again.

0:48:53 - 0:48:57     Text: This time what we're doing is we're taking the input and feeding only the input into the

0:48:57 - 0:49:01     Text: transformer to produce a single vector that we'll call the input vector.

0:49:01 - 0:49:06     Text: And then we'll have a separate transformer encode each memory separately to produce a

0:49:06 - 0:49:12     Text: memory vector for each memory and then the relevant score between the input and the memory

0:49:12 - 0:49:16     Text: is just the dot product of these two vectors.

0:49:16 - 0:49:20     Text: It could be the dot product, it could be cosine similarity, just any function that you can

0:49:20 - 0:49:24     Text: efficiently compute between two vectors.
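
And a minimal sketch of this dual-encoder setup, again assuming HuggingFace-style Bert models; whether the two towers share weights and which pooling you use are design choices, so treat this as illustrative.

```python
# Sketch: dual-encoder retriever. Separate encoders map the input and each
# memory to single vectors; relevance is the dot product of the two.
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
input_encoder = BertModel.from_pretrained("bert-base-uncased")
memory_encoder = BertModel.from_pretrained("bert-base-uncased")

def embed(encoder, texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    return encoder(**batch).last_hidden_state[:, 0]   # [CLS] vector per text

memories = ["The Eiffel Tower is in Paris.", "The Sears Tower is in Chicago."]
memory_vecs = embed(memory_encoder, memories)          # precompute once, offline

input_vec = embed(input_encoder, ["Eiffel Tower location"])
scores = input_vec @ memory_vecs.T                     # one dot product per memory
best = scores.argmax().item()                          # index of the top memory
print(memories[best])
```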

0:49:24 - 0:49:27     Text: So why are we proposing this instead?

0:49:27 - 0:49:30     Text: This has a couple advantages over the previous architecture.

0:49:30 - 0:49:38     Text: The first is that you can run this side of the model, the right side, on all of the memories

0:49:38 - 0:49:39     Text: in advance.

0:49:39 - 0:49:43     Text: So before any inputs even come in, you can just pre-compute the memory vector for each

0:49:43 - 0:49:44     Text: thing.

0:49:44 - 0:49:48     Text: So if it's Wikipedia, you can produce a vector for every document in Wikipedia.

0:49:48 - 0:49:52     Text: And when a new input comes in, you don't have to redo that work.

0:49:52 - 0:49:54     Text: So that saves a lot of compute.

0:49:54 - 0:49:58     Text: The only thing that you need to do when a new input comes in is you need to compute this

0:49:58 - 0:50:03     Text: input vector and then do dot products against all of the memories.

0:50:03 - 0:50:08     Text: So dot products are cheap and can happen much more quickly than running an entire BERT

0:50:08 - 0:50:09     Text: model over again.

0:50:09 - 0:50:14     Text: And that's the fundamental savings that you get from using a model like this.

0:50:14 - 0:50:18     Text: Something that we won't cover in as much detail here is that for the dot product, there

0:50:18 - 0:50:26     Text: are also fast nearest neighbors algorithms that let you efficiently find the memory vectors

0:50:26 - 0:50:31     Text: that have the highest dot product with the input vector without actually computing over

0:50:31 - 0:50:32     Text: all of the memory vectors.

0:50:32 - 0:50:36     Text: So it's a sublinear search algorithm that allows you to find them.

0:50:36 - 0:50:41     Text: And the basic intuition there, at least there are a couple of them, is that you can take

0:50:41 - 0:50:45     Text: your set of memory vectors and build some sort of tree structure over them, kind of organizing

0:50:45 - 0:50:46     Text: them spatially.

0:50:46 - 0:50:50     Text: And once you've built that tree structure when the new input vector comes in, you can

0:50:50 - 0:50:54     Text: essentially kind of traverse down that tree to find the things that are most similar

0:50:54 - 0:50:57     Text: without computing dot products with everything else.

0:50:57 - 0:51:01     Text: There are other algorithms too that use hashing and other techniques, but they'll be out

0:51:01 - 0:51:06     Text: of the scope for today's class.
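
For the fast similarity search itself, one common option is a library like FAISS. A minimal sketch, using an exact inner-product index for simplicity as a stand-in for the tree- or hash-based approximate structures just mentioned:

```python
# Sketch: fast maximum-inner-product search over precomputed memory vectors.
# IndexFlatIP is exact; approximate indexes trade a little accuracy for
# sublinear search, as discussed above.
import numpy as np
import faiss

dim = 768
memory_vecs = np.random.rand(100_000, dim).astype("float32")  # stand-in vectors
index = faiss.IndexFlatIP(dim)        # inner-product similarity
index.add(memory_vecs)                # build the index once, offline

input_vec = np.random.rand(1, dim).astype("float32")
scores, ids = index.search(input_vec, 10)   # top-10 memories for this input
print(ids[0], scores[0])
```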

0:51:06 - 0:51:10     Text: And the other good property here is that all of this is still differentiable, so you can

0:51:10 - 0:51:13     Text: still train this thing with gradient descent like anything else.

0:51:13 - 0:51:18     Text: The main disadvantage of this approach is also kind of due to its advantage, which is

0:51:18 - 0:51:22     Text: that all of the expressiveness of this model has to go through that one dot product.

0:51:22 - 0:51:26     Text: So anything you want to remember about the input or anything you want to remember about

0:51:26 - 0:51:31     Text: the memory, all has to get squeezed into that one memory vector and that one input vector.

0:51:31 - 0:51:37     Text: And that's a bottleneck that kind of researchers have been dealing with in recent research.

0:51:37 - 0:51:40     Text: What you'll find is that there are a lot of approaches that try to strike some kind of

0:51:40 - 0:51:43     Text: balance between this approach and the approach on the previous slide.

0:51:43 - 0:51:48     Text: So a common thing to do is to use this approach to retrieve a top set of candidates and then

0:51:48 - 0:51:53     Text: run a more complex model like the one on the previous slide to rescore and re-rank the

0:51:53 - 0:51:56     Text: candidates proposed by the first model.

0:51:56 - 0:52:01     Text: You'll also find techniques that try to take the memory and produce, say, five vectors and

0:52:01 - 0:52:05     Text: then use all five of those to somehow compute a score.

0:52:05 - 0:52:09     Text: There are many variations which we won't go into detail here.

0:52:09 - 0:52:11     Text: Any questions?

0:52:11 - 0:52:14     Text: Okay, right there.

0:52:14 - 0:52:40     Text: Okay, there's a question about whether you can kind of augment the search data structure

0:52:40 - 0:52:44     Text: that helps you do the fast search.

0:52:44 - 0:52:50     Text: I think there is some research in the area where the vectors that you produce to index

0:52:50 - 0:52:55     Text: the tree are perhaps not the same as the ones that you ultimately return.

0:52:55 - 0:52:56     Text: They can be optimized for different things.

0:52:56 - 0:53:01     Text: So oftentimes these kind of tree-based approaches require your vectors to be spread out in some

0:53:01 - 0:53:04     Text: non-pathological way.

0:53:04 - 0:53:06     Text: And I think that's a very interesting area for research.

0:53:06 - 0:53:11     Text: So producing vectors that are easily indexable, kind of taking into account the indexing

0:53:11 - 0:53:16     Text: process as a way to improve overall performance is quite important too.

0:53:16 - 0:53:22     Text: Because oftentimes when you use these fast similarity methods, they make some sort of approximation

0:53:22 - 0:53:27     Text: to the real top-k search, and those approximations can often hurt you pretty badly.

0:53:27 - 0:53:31     Text: Yeah, great question.

0:53:31 - 0:53:37     Text: Okay, cool.

0:53:37 - 0:53:44     Text: Great, so now we've looked at a few different architectures for actually performing retrieval.

0:53:44 - 0:53:50     Text: Now let's look at how you would actually train one of these retrievers.

0:53:50 - 0:53:58     Text: So fundamentally all you need to train a retriever is you need an example of an input.

0:53:58 - 0:54:03     Text: You need a positive example of what you would like to retrieve and then you need some negative

0:54:03 - 0:54:05     Text: examples of what you would not like to retrieve.

0:54:05 - 0:54:11     Text: So for example, where the Super Bowl is this year, Sears Tower location, etc.

0:54:11 - 0:54:16     Text: And the training objective for this is quite straightforward.

0:54:16 - 0:54:19     Text: So I'm going to define a few variables here.

0:54:19 - 0:54:23     Text: S star will be the score that the retriever assigns to the positive memory, and S sub i is

0:54:23 - 0:54:27     Text: going to be the score that the retriever assigns to each of the negative memories.

0:54:27 - 0:54:33     Text: And then we're going to apply the well-known softmax function over all of these scores.

0:54:33 - 0:54:37     Text: So what we're doing here is we're taking each of these scores, exponentiating the score

0:54:37 - 0:54:39     Text: so that there's some positive value.

0:54:39 - 0:54:44     Text: And then dividing each of those exponentiated scores by the sum of all of those scores,

0:54:44 - 0:54:46     Text: so that the whole thing sums up to one.

0:54:46 - 0:54:51     Text: And we're going to call that the probability of retrieving the positive document.
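
In symbols, with s* for the positive score and s_i for each negative score, the probability being described is:

```latex
p(\text{positive}) \;=\; \frac{\exp(s^{*})}{\exp(s^{*}) + \sum_{i} \exp(s_{i})}
```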

0:54:51 - 0:54:55     Text: So intuitively if the positive document has a high score, then after exponentiation it

0:54:55 - 0:54:56     Text: will be even bigger.

0:54:56 - 0:55:00     Text: The other scores will be smaller and most of the mass in this probability distribution

0:55:00 - 0:55:03     Text: will be on the positive document.

0:55:03 - 0:55:06     Text: If it's not, then this probability will be small.

0:55:06 - 0:55:11     Text: And what we will do as a standard in machine learning is we're going to maximize the probability

0:55:11 - 0:55:16     Text: of that quantity, in particular the log probability.

0:55:16 - 0:55:21     Text: And this is all doable because P of positive depends on the softmax expression here, which

0:55:21 - 0:55:22     Text: is differentiable.

0:55:22 - 0:55:27     Text: And each of the scores inside the softmax depends on the retriever, which I just told you

0:55:27 - 0:55:29     Text: on the previous slide is also differentiable.

0:55:29 - 0:55:33     Text: So the whole thing is differentiable and you're just basically trying to push the positive

0:55:33 - 0:55:39     Text: score essentially above all the negative scores.

0:55:39 - 0:55:42     Text: Okay, so it's a very simple recipe.

0:55:42 - 0:55:46     Text: And we'll look at a concrete example of that based on this paper called Dense Passage Retrieval,

0:55:46 - 0:55:52     Text: DPR, one of the early papers to explore this sort of supervised retrieval approach.

0:55:52 - 0:55:57     Text: So the task they're looking at is basically given a question, like the one here, retrieve

0:55:57 - 0:56:01     Text: a passage from Wikipedia containing the answer.

0:56:01 - 0:56:05     Text: And once you've retrieved the passage, they then have a reader module that reads the

0:56:05 - 0:56:09     Text: passage and produces an answer.

0:56:09 - 0:56:13     Text: And the training data for the retriever is going to fit into the format that I just described.

0:56:13 - 0:56:18     Text: So they work with this dataset called natural questions, which comes with human annotated

0:56:18 - 0:56:24     Text: queries, answers to the queries, and also a passage that contains the answer.

0:56:24 - 0:56:28     Text: So here we go: the input is the query.

0:56:28 - 0:56:32     Text: The positive memory that we'll want to push up is the passage that the human provided.

0:56:32 - 0:56:36     Text: And the negative memories are actually something kind of interesting in this paper.

0:56:36 - 0:56:42     Text: So the first kind of negative is going to be the positive passages for other queries.

0:56:42 - 0:56:45     Text: So as long as all your queries aren't asking the same question, the positive passage for

0:56:45 - 0:56:50     Text: another query is going to be negative for the current query that you're looking at.

0:56:50 - 0:56:53     Text: And this next bullet is also interesting.

0:56:53 - 0:56:58     Text: They take a passage that's retrieved by an off-the-shelf tool for search.

0:56:58 - 0:57:00     Text: This is called BM25.

0:57:00 - 0:57:05     Text: It's a classic information retrieval approach that uses token-based overlap to retrieve

0:57:05 - 0:57:06     Text: things.

0:57:06 - 0:57:10     Text: It doesn't have any deep learning or anything in it, but it's quite effective.

0:57:10 - 0:57:14     Text: So they retrieve a passage and they retrieve one that does not contain the answer in it.

0:57:14 - 0:57:18     Text: So the assumption here is that you've got a passage that looks very promising, but in

0:57:18 - 0:57:21     Text: fact, doesn't contain the answer.

0:57:21 - 0:57:26     Text: And you can think of that as this is what we call a hard negative.
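
Putting those pieces together, here is a minimal sketch of that training step in the dual-encoder setting; the batch construction is simplified and the encoders are left abstract, so this is illustrative rather than the exact DPR recipe.

```python
# Sketch: softmax-over-negatives training step, DPR style. For one query,
# the candidates are [its gold passage, other queries' gold passages
# (in-batch negatives), a BM25 hard negative]; the gold sits at index 0.
import torch
import torch.nn.functional as F

def training_step(query_vec, passage_vecs):
    """query_vec: (d,); passage_vecs: (n_candidates, d), gold passage first."""
    scores = passage_vecs @ query_vec                  # one score per candidate
    log_probs = F.log_softmax(scores, dim=0)           # softmax over candidates
    return -log_probs[0]                               # maximize P(gold passage)

# In the real system these vectors come from the question and passage encoders;
# random tensors stand in here just to show the shapes and the gradient flow.
d = 8
query_vec = torch.randn(d, requires_grad=True)
passage_vecs = torch.randn(5, d, requires_grad=True)   # gold + 3 in-batch + 1 hard neg
loss = training_step(query_vec, passage_vecs)
loss.backward()                                        # gradients reach both encoders
print(float(loss))
```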

0:57:26 - 0:57:27     Text: Great.

0:57:27 - 0:57:29     Text: So we've got all the components for training retriever.

0:57:29 - 0:57:31     Text: They go ahead and do that.

0:57:31 - 0:57:33     Text: Let's look at how well it actually works.

0:57:33 - 0:57:38     Text: So to understand how well it works, we're going to compare it against another approach.

0:57:38 - 0:57:44     Text: So we're going to look at, this is from, I should have had a citation for this too,

0:57:44 - 0:57:48     Text: a paper by Roberts et al. on closed-book question answering.

0:57:48 - 0:57:54     Text: They basically take a sequence to sequence neural network model called T5 and just feed

0:57:54 - 0:57:56     Text: in the question and ask it to produce the answer.

0:57:56 - 0:57:59     Text: So this model does not have access to passages.

0:57:59 - 0:58:02     Text: In effect, it doesn't have any external memory.

0:58:02 - 0:58:06     Text: And you can see that as they scaled up the size of the model, they were quite nicely getting

0:58:06 - 0:58:10     Text: better and better performance on the task.

0:58:10 - 0:58:15     Text: And the question we want to ask is with DPR, which has this external memory, this access

0:58:15 - 0:58:20     Text: to Wikipedia, can that do better than an approach that doesn't have external memory?

0:58:20 - 0:58:24     Text: And the answer, of course, in this class is yes.

0:58:24 - 0:58:29     Text: So it does indeed improve quite significantly, and it's not a surprise because we have access

0:58:29 - 0:58:35     Text: to this additional information, which is Wikipedia.

0:58:35 - 0:58:39     Text: So you might look at that previous chart and say, well, maybe we just need to make T5

0:58:39 - 0:58:43     Text: bigger because after all, the scaling was looking quite good, right?

0:58:43 - 0:58:49     Text: So I took that plot and re-plotted it with the parameter scale on the x-axis and the

0:58:49 - 0:58:51     Text: performance on the y-axis.

0:58:51 - 0:58:56     Text: And we know from recent research that these scaling laws tend to be logarithmic.

0:58:56 - 0:59:01     Text: So as you increase your model size, the improvement is a logarithmic function.

0:59:01 - 0:59:04     Text: And I just plotted that curve out for you to see where it's headed.

0:59:04 - 0:59:10     Text: And if you plot DPR on this curve, it's just kind of sitting way above this scaling plot

0:59:10 - 0:59:12     Text: for a much smaller model size.

0:59:12 - 0:59:14     Text: It's doing much better.

0:59:14 - 0:59:20     Text: And I also extrapolated this line out further to see if it eventually caught up with the

0:59:20 - 0:59:21     Text: 44 number up there.

0:59:21 - 0:59:25     Text: And it does, at around 8 trillion parameters.

0:59:25 - 0:59:30     Text: So that's about like a thousand times bigger than where we are now.

0:59:30 - 0:59:34     Text: So all this is to say that scaling does help, but there might be easier and cheaper

0:59:34 - 0:59:36     Text: ways to get there.

0:59:36 - 0:59:42     Text: So one criticism you could make of the previous approach was that DPR actually had access to

0:59:42 - 0:59:47     Text: something that T5 didn't have, which is it had human annotated gold passages saying

0:59:47 - 0:59:49     Text: what you needed to retrieve to answer the question.

0:59:49 - 0:59:51     Text: And that data is actually hard to collect.

0:59:51 - 0:59:56     Text: So we're going to ask the question, what if the examples that you had access to were

0:59:56 - 0:59:57     Text: just query answer pairs?

0:59:57 - 1:00:04     Text: Could you still train a good retriever without gold passages?

1:00:04 - 1:00:09     Text: And this sort of setting arises in many other tasks as well.

1:00:09 - 1:00:13     Text: You could imagine if you were going from natural language to code, you might encounter cases

1:00:13 - 1:00:17     Text: where nobody has provided you annotations of what code snippets to retrieve, medical

1:00:17 - 1:00:20     Text: diagnosis, similar thing.

1:00:20 - 1:00:23     Text: So we're going to go now to end-to-end learning of a retriever.

1:00:23 - 1:00:26     Text: And let me get into some detail on what that is.

1:00:26 - 1:00:30     Text: So we're coming back to this diagram of a memory retriever.

1:00:30 - 1:00:34     Text: And in a memory augmented model, once the memory is retrieved, it then goes into a reader

1:00:34 - 1:00:39     Text: component, which takes the original input in the memory and produces an answer.

1:00:39 - 1:00:44     Text: So if you have no supervision for the memory, you might have this intuition instead, which

1:00:44 - 1:00:48     Text: is that if you did retrieve a good memory, that should result in a good answer from the

1:00:48 - 1:00:50     Text: reader.

1:00:50 - 1:00:53     Text: On the other hand, if you retrieved a bad memory, that will probably cause the reader

1:00:53 - 1:00:56     Text: to get confused and produce a bad result.

1:00:56 - 1:01:02     Text: So you might be able to use that observation as a training signal to train your retriever.

1:01:02 - 1:01:06     Text: Let me just give a concrete example with this, who is the bad guy in Lord of the Rings?

1:01:06 - 1:01:10     Text: If the retriever retrieves something like "the main antagonist is Sauron", then you'll likely produce

1:01:10 - 1:01:12     Text: Sauron, and that's great.

1:01:12 - 1:01:18     Text: On the other hand, if the retriever got this other passage saying Lord of the Rings received

1:01:18 - 1:01:24     Text: a bad review from IMDB, then your reader might be more inclined to produce IMDB, which

1:01:24 - 1:01:26     Text: would not match the gold answer in your training data set.

1:01:26 - 1:01:32     Text: So this gives you some knowledge that the second memory is bad and the first one is good.

1:01:32 - 1:01:36     Text: And so what I'm going to propose here is this idea of trial and error.

1:01:36 - 1:01:40     Text: In the first stage, you perform exploration where you let your imperfect retriever select

1:01:40 - 1:01:45     Text: some memory, and you try feeding that memory to the reader, and then you learn from success

1:01:45 - 1:01:46     Text: or failure.

1:01:46 - 1:01:49     Text: So if the memory helps the reader generate the right answer, you want to increase the

1:01:49 - 1:01:51     Text: score of that memory.

1:01:51 - 1:01:58     Text: And if the memory does not help the reader, you want to decrease the score of that memory.

1:01:58 - 1:02:02     Text: And over time, this process would help the helpful memories get higher scores than the

1:02:02 - 1:02:04     Text: less helpful ones.

1:02:04 - 1:02:10     Text: So the formal approach for this is going to be taken from a paper by one of my colleagues

1:02:10 - 1:02:14     Text: called Open-Retrieval Question Answering, or ORQA.

1:02:14 - 1:02:18     Text: And the exploration component we're going to formalize as follows.

1:02:18 - 1:02:21     Text: So as I mentioned earlier, a retriever is just scoring function between an input and a

1:02:21 - 1:02:23     Text: memory.

1:02:23 - 1:02:29     Text: And if you take a softmax over all of the scores for all of the memories, then you get this

1:02:29 - 1:02:32     Text: distribution over memories given the input.

1:02:32 - 1:02:39     Text: So again, I've just raised e to the power of each of the scores and then normalized.

1:02:39 - 1:02:44     Text: And once we have this distribution, we'll randomly sample memory from that distribution.

1:02:44 - 1:02:49     Text: So as you can imagine, if the scores are meaningful, then we're more likely to sample a memory

1:02:49 - 1:02:52     Text: that's good and less likely to sample a memory that's bad.

1:02:52 - 1:02:56     Text: But because it's random, we kind of eventually will sample everything, unless there are things

1:02:56 - 1:02:59     Text: with zero probability.
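
A small sketch of that exploration step, assuming you already have a vector of retriever scores for the candidate memories:

```python
# Sketch: exploration by sampling a memory from softmax(scores).
import torch

scores = torch.tensor([2.3, 0.1, -1.0, 0.7])          # retriever score per memory
probs = torch.softmax(scores, dim=0)                   # distribution over memories
sampled = torch.distributions.Categorical(probs).sample()
print(int(sampled), probs.tolist())                    # higher-scoring memories get sampled more often
```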

1:02:59 - 1:03:03     Text: And then the learning from success and failure part.

1:03:03 - 1:03:09     Text: So once we pick a memory, we need to see if it actually helps.

1:03:09 - 1:03:13     Text: And we're going to measure that by looking at the reader's probability of generating the

1:03:13 - 1:03:16     Text: right answer given that particular memory.

1:03:16 - 1:03:18     Text: So that's this big quantity right here.

1:03:18 - 1:03:23     Text: The reader looks at the input and the memory, and we want to see its probability of generating

1:03:23 - 1:03:25     Text: the gold answer.

1:03:25 - 1:03:29     Text: And if this value is high, then we want to increase the score of that memory.

1:03:29 - 1:03:34     Text: And if it's low, we want to probably decrease the score of the memory.

1:03:34 - 1:03:40     Text: So I've shown you a couple expressions now, and we want to put those expressions together

1:03:40 - 1:03:44     Text: into a training objective that we can actually optimize.

1:03:44 - 1:03:46     Text: So we'll start with this question.

1:03:46 - 1:03:50     Text: If we randomly sample a memory from the retriever, and then we generate an answer, what is the

1:03:50 - 1:03:54     Text: overall probability that we get the answer right?

1:03:54 - 1:03:56     Text: So first, let's look at this expression right here.

1:03:56 - 1:04:01     Text: This is a summation over all possible memories that the retriever could retrieve, and the

1:04:01 - 1:04:03     Text: sum is over the probability of retrieving it.

1:04:03 - 1:04:08     Text: So right now, this is, this just equals one, because it's a distribution, and we're summing

1:04:08 - 1:04:10     Text: over all of its values.

1:04:10 - 1:04:14     Text: But then we'll add one more term to this, which is the probability that the reader gets

1:04:14 - 1:04:20     Text: the answer right given the memory in that term in the summation.

1:04:20 - 1:04:24     Text: So this first part is the retriever, and it's proposing different memories.

1:04:24 - 1:04:29     Text: And this second part is the reader, and it's succeeding or failing based on the memories.

1:04:29 - 1:04:34     Text: So you can think of each term in this summation as a trial of a different memory.

1:04:34 - 1:04:39     Text: And you can think of that second term kind of like a reward, if it's high, it's good,

1:04:39 - 1:04:42     Text: and if it's low, it's bad.

1:04:42 - 1:04:47     Text: So what they propose in the ORQA paper is to perform gradient descent on this entire

1:04:47 - 1:04:49     Text: expression right here.

1:04:49 - 1:04:54     Text: They basically want to push the value of this entire expression up, and they're optimizing

1:04:54 - 1:04:56     Text: both the retriever and the reader.

1:04:56 - 1:04:59     Text: So let's look at the retriever first.

1:04:59 - 1:05:04     Text: So the retriever has a fixed budget that has to sum up to one over all of the memories.

1:05:04 - 1:05:09     Text: If it wants this value to be high, it doesn't have any incentive to put probability on bad

1:05:09 - 1:05:13     Text: memories, because those bad memories are just going to produce a low score on this

1:05:13 - 1:05:14     Text: term right here.

1:05:14 - 1:05:18     Text: So as you optimize this function, the retriever will basically try to put all of its mass

1:05:18 - 1:05:20     Text: on the good memories.

1:05:20 - 1:05:24     Text: And meanwhile, if you're optimizing the reader with respect to this function, it's trying

1:05:24 - 1:05:28     Text: its best to produce the gold answer given whatever memory it has.

1:05:28 - 1:05:32     Text: So it's also incentivized to try its best to extract the answer out of whatever it's

1:05:32 - 1:05:33     Text: given.

1:05:33 - 1:05:37     Text: And when you kind of jointly optimize both, over time you get something that puts its

1:05:37 - 1:05:39     Text: mass on good memories.

1:05:39 - 1:05:44     Text: So that kind of corresponds with the intuition that I was giving you earlier that you can

1:05:44 - 1:05:49     Text: kind of perform end-to-end learning to get a retriever.

1:05:49 - 1:05:50     Text: All right.

1:05:50 - 1:05:53     Text: So that's the high-level approach to ORQA.

1:05:53 - 1:05:56     Text: What I didn't explain is that usually if your memories are like all of Wikipedia, this

1:05:56 - 1:06:01     Text: summation is very large, and if you're going to do gradient descent on this summation,

1:06:01 - 1:06:02     Text: it's going to take a very long time.

1:06:02 - 1:06:07     Text: So in practice, they approximate this summation with the highest probability memories, maybe

1:06:07 - 1:06:10     Text: the top 100 or the top 10.
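
Here is a minimal sketch of that objective with the top-k approximation; the tensors are placeholders, and in the real system both the retrieval scores and the reader probabilities come from networks that are trained jointly.

```python
# Sketch: ORQA-style marginal likelihood over the top-k retrieved memories,
#   p(answer | input) ~= sum_k p(memory_k | input) * p(answer | input, memory_k)
# maximized by gradient descent on its negative log.
import torch

retrieval_scores = torch.randn(10, requires_grad=True)   # scores of the top-10 memories only
reader_probs = torch.rand(10)                             # p(gold answer | input, memory_k); also
                                                          # differentiable in the real system

p_retrieve = torch.softmax(retrieval_scores, dim=0)       # renormalized over the top-k
p_answer = (p_retrieve * reader_probs).sum()              # marginal probability of the gold answer
loss = -torch.log(p_answer)
loss.backward()   # pushes retrieval mass toward memories with high reader success
```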

1:06:10 - 1:06:14     Text: And I won't go into details in this class about exactly how that works.

1:06:14 - 1:06:17     Text: But I'll stop there.

1:06:17 - 1:06:20     Text: Because we're kind of approaching the end, I'm going to take questions just a little bit

1:06:20 - 1:06:21     Text: later.

1:06:21 - 1:06:23     Text: Sorry about that.

1:06:23 - 1:06:26     Text: So let's see how well ORQA works.

1:06:26 - 1:06:28     Text: I'll just come out and put that number there.

1:06:28 - 1:06:29     Text: So a bit of context around this.

1:06:29 - 1:06:32     Text: It's not as good as DPR because it has less supervision than DPR.

1:06:32 - 1:06:36     Text: There's no human annotation of what passage to retrieve.

1:06:36 - 1:06:41     Text: But what's worth noticing is that at least compared to T5 at the same size, so you can compare

1:06:41 - 1:06:47     Text: 0.66 billion parameters against 0.77, it's actually already better than T5.

1:06:47 - 1:06:51     Text: And compared to a T5 that's about 15 times larger, it's almost at the same performance

1:06:51 - 1:06:52     Text: too.

1:06:52 - 1:07:02     Text: And it's a pretty decent result for an approach that has no access to retrieval supervision.

1:07:02 - 1:07:06     Text: So one thing you might note though is that the better result requires gold passages.

1:07:06 - 1:07:10     Text: And ORQA and T5, these approaches don't require gold passages.

1:07:10 - 1:07:12     Text: They only need query answer pairs.

1:07:12 - 1:07:17     Text: And one advantage of that is that query answer pair data is actually pretty easy to get.

1:07:17 - 1:07:23     Text: So we could potentially get a lot more of it than if we were asking for gold passages as

1:07:23 - 1:07:24     Text: well.

1:07:24 - 1:07:30     Text: And the final part of this retrieval section is about a way to get basically an arbitrary

1:07:30 - 1:07:36     Text: number of query answer pairs to kind of improve these weakly supervised approaches that don't

1:07:36 - 1:07:37     Text: have passages.

1:07:37 - 1:07:41     Text: So it comes from a very simple observation, which is let's take your typical query answer

1:07:41 - 1:07:42     Text: pair.

1:07:42 - 1:07:43     Text: It looks like this, right?

1:07:43 - 1:07:45     Text: So you've got your query on the left and answer on the right.

1:07:45 - 1:07:50     Text: You can easily reformulate that as a fill in the blank question like this.

1:07:50 - 1:07:54     Text: And this fill in the blank question forces the model to think just as hard as the original

1:07:54 - 1:07:57     Text: question, just in a different format.

1:07:57 - 1:08:01     Text: But what's nice about this fill in the blank question is that it's very easy to create

1:08:01 - 1:08:03     Text: a bunch of them for free.

1:08:03 - 1:08:05     Text: Basically you can just take any sentence on the web.

1:08:05 - 1:08:10     Text: And as long as it's mentioning something factual or semantically meaningful, you can just

1:08:10 - 1:08:13     Text: blank out one of the entities.
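
A toy sketch of manufacturing such fill-in-the-blank examples; the capitalized-span regex is a crude stand-in for a proper entity tagger, so treat this as illustrative only.

```python
# Toy sketch: turn any sentence into a fill-in-the-blank query/answer pair by
# blanking out one span. The capitalized-word/year regex is a crude stand-in
# for a real named-entity tagger.
import re

def make_fill_in_the_blank(sentence: str):
    spans = re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*|\b\d{4}\b", sentence)
    answer = max(spans, key=len)                        # pick one span to blank out
    query = sentence.replace(answer, "[BLANK]", 1)
    return query, answer

q, a = make_fill_in_the_blank("Work on the Eiffel Tower was completed in 1889.")
print(q)   # "Work on the [BLANK] was completed in 1889."
print(a)   # "Eiffel Tower"
```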

1:08:13 - 1:08:17     Text: And in fact, that is exactly what you've probably seen in previous lectures, pre-trained

1:08:17 - 1:08:19     Text: language models like Bert do.

1:08:19 - 1:08:24     Text: And Bert uses that training objective to learn a very great deal.

1:08:24 - 1:08:27     Text: And that can be used in this setting for retrieval as well.

1:08:27 - 1:08:32     Text: So the basic idea for Realm, which is something that I worked on with collaborators, is to

1:08:32 - 1:08:35     Text: apply the same end-to-end training as ORQA,

1:08:35 - 1:08:39     Text: but pre-train the model on a bunch of these fill in the blank questions that we just

1:08:39 - 1:08:44     Text: automatically generated, in extremely large quantities.

1:08:44 - 1:08:50     Text: And then we fine-tune that on the real query answer pairs that we already have.

1:08:50 - 1:08:54     Text: So if you do this approach and you plot it against all the others, what's quite nice is

1:08:54 - 1:09:01     Text: that it basically almost closes the gap completely with an approach that uses supervised data.

1:09:01 - 1:09:04     Text: Just by pre-training on fill in the blank questions.

1:09:04 - 1:09:06     Text: And the nice thing is it doesn't need access to gold passages.

1:09:06 - 1:09:11     Text: So it's on the same footing as things like T5 now.

1:09:11 - 1:09:16     Text: And on the same footing, it outperforms T5, despite being much smaller than even the largest

1:09:16 - 1:09:17     Text: model.

1:09:17 - 1:09:22     Text: So that gives us this interesting promise of using kind of language model fill in the

1:09:22 - 1:09:25     Text: blank techniques to build good memory retrievers.

1:09:25 - 1:09:30     Text: And the nice thing is that this fill in the blank approach can be used to tackle many

1:09:30 - 1:09:31     Text: sorts of tasks.

1:09:31 - 1:09:36     Text: You could blank out a patch in an image and train a retriever to find other images that

1:09:36 - 1:09:37     Text: might help fill it in.

1:09:37 - 1:09:42     Text: You could blank out a segment of code and train a retriever to find other pieces of code

1:09:42 - 1:09:43     Text: that might help fill that in.

1:09:43 - 1:09:45     Text: Or a chapter in a textbook.

1:09:45 - 1:09:50     Text: The kind of list of things you can do with fill in the blank actually goes on and on.

1:09:50 - 1:09:55     Text: And each task that you define in this way produces a specialized memory retriever for whatever

1:09:55 - 1:09:58     Text: it is that you're filling the blank in for.

1:09:58 - 1:10:00     Text: And there's no need to collect training data.

1:10:00 - 1:10:07     Text: So this sort of scales to any set of tasks that may not be important enough or central

1:10:07 - 1:10:10     Text: enough to warrant a big data collection budget.

1:10:10 - 1:10:15     Text: All right, so the main takeaways for this section are that a retriever is just a function

1:10:15 - 1:10:19     Text: that takes an input and a memory and produces a score.

1:10:19 - 1:10:22     Text: If you have supervised data for your retriever, that's great.

1:10:22 - 1:10:25     Text: Provide positive and negative memories for each input.

1:10:25 - 1:10:28     Text: And just train the retriever to score the positive ones higher.

1:10:28 - 1:10:33     Text: If you don't have supervision, you can use end-to-end learning, which employs a trial

1:10:33 - 1:10:34     Text: and error approach.

1:10:34 - 1:10:40     Text: If a memory helps the model, score that memory higher; otherwise, score it lower.

1:10:40 - 1:10:44     Text: And with end-to-end learning, you often get this special benefit that you can easily create

1:10:44 - 1:10:47     Text: tons of data to pre-train your retriever.

1:10:47 - 1:10:54     Text: All right, we're now into the very, very final part of the talk, which is how to actually

1:10:54 - 1:10:56     Text: use the memories after you get them.

1:10:56 - 1:10:59     Text: Notice I have 15 minutes, right?

1:10:59 - 1:11:01     Text: Yes, hold in.

1:11:01 - 1:11:04     Text: Okay, all right, then we should have plenty of time for questions, actually.

1:11:04 - 1:11:06     Text: So all right, here we go.

1:11:06 - 1:11:10     Text: We're going to come back to this diagram of a memory-augmented model.

1:11:10 - 1:11:13     Text: And now we're going to focus on this reader component, which I didn't say much about

1:11:13 - 1:11:14     Text: before.

1:11:14 - 1:11:17     Text: I said that the reader takes the memory and the input and then produces the answer.

1:11:17 - 1:11:21     Text: So what does that reader component actually look like?

1:11:21 - 1:11:27     Text: A very common architecture is just a sequence-to-sequence encoder-decoder model.

1:11:27 - 1:11:32     Text: In practical terms, you take the original question and you just concatenate it to the

1:11:32 - 1:11:34     Text: memory that you retrieved.

1:11:34 - 1:11:37     Text: And then you feed that into your encoder and you train it using standard sequence to sequence

1:11:37 - 1:11:40     Text: learning to produce the output.

1:11:40 - 1:11:44     Text: So all we're doing is just concatenating the memory with the input.

1:11:44 - 1:11:47     Text: And we can refer to that as text fusion.

1:11:47 - 1:11:52     Text: Anytime you have a memory that is text or can be converted into text in some form, just

1:11:52 - 1:11:53     Text: concatenated.

1:11:53 - 1:11:55     Text: That's all you need.

1:11:55 - 1:12:00     Text: At least that's what state of the art techniques are doing right now.
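
A minimal text-fusion sketch, assuming a HuggingFace-style T5; the "question: ... context: ..." prompt format is just an illustrative choice, and the key point is the plain concatenation of question and memory.

```python
# Sketch: "text fusion" -- concatenate the question with the retrieved memory
# and let a standard encoder-decoder generate the answer.
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
reader = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "question: What year was the Eiffel Tower built?"
memory = "context: Work on the Eiffel Tower was completed in 1889."
inputs = tokenizer(question + " " + memory, return_tensors="pt")

output_ids = reader.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```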

1:12:00 - 1:12:05     Text: Okay, and just to give some variety, here's another way to incorporate memories.

1:12:05 - 1:12:11     Text: Let's consider a slightly different memory-augmented model where instead of just retrieving a document,

1:12:11 - 1:12:17     Text: the memory is actually key value pairs where the key is the question, like a question that's

1:12:17 - 1:12:21     Text: been seen before and the value is the answer to that previously seen question.

1:12:21 - 1:12:26     Text: So in this case, you can do something even simpler than what was on the previous slide.

1:12:26 - 1:12:30     Text: You can take your input and compare it to the keys and find the key that most resembles

1:12:30 - 1:12:31     Text: the input.

1:12:31 - 1:12:35     Text: So in this case, we found a paraphrase of the original question.

1:12:35 - 1:12:40     Text: And if you have this, then all you really need to do is just copy the answer from the

1:12:40 - 1:12:43     Text: value out as your label.

1:12:43 - 1:12:49     Text: And that's what people refer to as label smearing, or otherwise nearest neighbor methods.

1:12:49 - 1:12:52     Text: We call it smearing because you're essentially smearing the label from this example that you

1:12:52 - 1:12:56     Text: retrieved onto the new example.
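
Label smearing is about as simple as it sounds. A toy sketch, with token overlap standing in for whatever learned similarity you would actually use:

```python
# Toy sketch of label smearing: find the stored question (key) most similar to
# the new input and copy over its answer (value). Token overlap stands in for
# a learned similarity here.

def similarity(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

memory = {
    "What year was the Eiffel Tower built?": "1889",
    "Who guards the gates of heaven?": "Saint Peter",
}

query = "When was the Eiffel Tower constructed?"
best_key = max(memory, key=lambda k: similarity(query, k))
print(best_key, "->", memory[best_key])   # copies "1889" onto the new input
```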

1:12:56 - 1:13:02     Text: And these are very simple techniques; which one you use depends on your application.

1:13:02 - 1:13:05     Text: So the techniques we're using memories are quite simple, but the problems that arise

1:13:05 - 1:13:09     Text: when you use them are actually quite interesting.

1:13:09 - 1:13:13     Text: The first one I want to talk about, and I kind of previewed this earlier, are these

1:13:13 - 1:13:16     Text: two problems of underutilization and overreliance.

1:13:16 - 1:13:19     Text: So let's get into the underutilization issue.

1:13:19 - 1:13:23     Text: This is from a very recent paper by Longpre et al.

1:13:23 - 1:13:25     Text: So let me dive into it.

1:13:25 - 1:13:29     Text: So, okay, I switched the example here because I just got tired of the Lord of the Rings

1:13:29 - 1:13:31     Text: One.

1:13:31 - 1:13:34     Text: The question is, who do you meet at the gates of heaven?

1:13:34 - 1:13:39     Text: And the retrieved memory is on point, it says you see Saint Peter at the gates of heaven,

1:13:39 - 1:13:42     Text: the reader reads it, produces Saint Peter, everything is great.

1:13:42 - 1:13:47     Text: So what Longpre et al. observed is, okay, if the reader is really doing such a great job

1:13:47 - 1:13:52     Text: of reading the memories, then if I edit the memory to say something else, the reader

1:13:52 - 1:13:55     Text: should pick up on that and produce a different answer.

1:13:55 - 1:14:00     Text: So what they do is they change Saint Peter to the United Nations, so the memory now says the United Nations guards the gates of

1:14:00 - 1:14:01     Text: heaven.

1:14:01 - 1:14:06     Text: And they check if the reader actually produces the United Nations.

1:14:06 - 1:14:12     Text: But surprisingly, the reader still says that Saint Peter guards the gates of heaven.

1:14:12 - 1:14:15     Text: This is really quite interesting and pretty funny.

1:14:15 - 1:14:18     Text: So what's actually going on here?

1:14:18 - 1:14:20     Text: Let's first look at how bad this problem is.

1:14:20 - 1:14:23     Text: So here's a plot from the paper.

1:14:23 - 1:14:28     Text: The first row here is the model's behavior on the training set for natural questions.

1:14:28 - 1:14:33     Text: The same data set we were looking at earlier, and the red part of this plot indicates the

1:14:33 - 1:14:38     Text: number of times where the model sticks with its old answer even after changing the memory.

1:14:38 - 1:14:40     Text: It just stubbornly refuses to change.

1:14:40 - 1:14:44     Text: The blue part is the good part where it actually switches over to predicting the United Nations

1:14:44 - 1:14:45     Text: on various examples.

1:14:45 - 1:14:48     Text: And this other orange part is very concerning too.

1:14:48 - 1:14:53     Text: So when you change the memory to United Nations, sometimes the model just gets confused and

1:14:53 - 1:14:55     Text: predicts something totally different.

1:14:55 - 1:14:58     Text: Not Saint Peter, not United Nations, just something completely different.

1:14:58 - 1:15:03     Text: So from this we can see something is really kind of broken about some of these memory-augmented

1:15:03 - 1:15:04     Text: models.

1:15:04 - 1:15:10     Text: And the same kind of behavior happens on the dev set as well.

1:15:10 - 1:15:16     Text: And just to underscore how bad this is, these results are from the set of examples that

1:15:16 - 1:15:19     Text: the original model was actually getting all correct.

1:15:19 - 1:15:24     Text: So you just kind of cut the performance of your model by more than half when you edit

1:15:24 - 1:15:25     Text: the memories.

1:15:25 - 1:15:27     Text: And that indicates the model is not robust to change.

1:15:27 - 1:15:30     Text: As we said earlier, being able to edit your memories is something you would really want

1:15:30 - 1:15:34     Text: from a memory-augmented model.

1:15:34 - 1:15:37     Text: So let's have an analysis of why this happens.

1:15:37 - 1:15:43     Text: Basically, when you put this memory into the sequence encoder, the reader that reads the

1:15:43 - 1:15:49     Text: memory, this encoder and this decoder, they actually have their own memory as well.

1:15:49 - 1:15:52     Text: As we saw in the earlier slides, transformers have their own memory.

1:15:52 - 1:15:57     Text: So we'll refer to that as the parametric memory of the encoder decoder, as opposed to the

1:15:57 - 1:16:00     Text: external memory that we want it to rely on.

1:16:00 - 1:16:05     Text: And at training time, they essentially learn to store the answer in their parametric

1:16:05 - 1:16:08     Text: memory and not rely on the external memory.

1:16:08 - 1:16:15     Text: So to give this issue a better cartoon form, the input is coming in and the model has

1:16:15 - 1:16:18     Text: its own parametric memory and the retrieved memory.

1:16:18 - 1:16:23     Text: And the parametric memory is saying St. Peter and the retrieved memory at training time is

1:16:23 - 1:16:28     Text: also saying St. Peter and the loss function is saying you must predict St. Peter.

1:16:28 - 1:16:32     Text: So the model says, okay, I've got two sources of information I can choose either one.

1:16:32 - 1:16:36     Text: There's nothing forcing the model to use the retrieved memory.

1:16:36 - 1:16:39     Text: And that's part of the problem that's causing this.

1:16:39 - 1:16:43     Text: Another problem, which isn't on this slide, is that sometimes the retriever is just not

1:16:43 - 1:16:44     Text: very good.

1:16:44 - 1:16:47     Text: So it might retrieve something that's just not related to the question.

1:16:47 - 1:16:51     Text: And in that case, the model is forced to fall back on its parametric memory and again

1:16:51 - 1:16:55     Text: learn to distrust the retrieved memory.

1:16:55 - 1:17:03     Text: So we want a way to kind of force the model to pick the retrieved memory instead.

1:17:03 - 1:17:06     Text: Ideally, we would want cases where the parametric memory is wrong and the retrieved memory

1:17:06 - 1:17:07     Text: is correct.

1:17:07 - 1:17:11     Text: That would force the model to say, hey, I can't trust my parametric memory.

1:17:11 - 1:17:17     Text: So what Longpre et al. do is first they take the retrieved memory and they change what

1:17:17 - 1:17:19     Text: the retrieved memory is saying.

1:17:19 - 1:17:23     Text: So this creates a disagreement now between the parametric memory and the retrieved memory.

1:17:23 - 1:17:26     Text: But the gold label is still saying St. Peter.

1:17:26 - 1:17:31     Text: So we've just made matters worse now, because now the retrieved memory is even less trustworthy.

1:17:31 - 1:17:37     Text: The final thing they do, which is really the interesting bit, is they just decide to change

1:17:37 - 1:17:41     Text: the gold label as well to agree with their retrieved memory.

1:17:41 - 1:17:46     Text: So they've changed reality and said, no, actually the United Nations guards the gates of heaven.

1:17:46 - 1:17:49     Text: And what's in your parametric memory is wrong.

1:17:49 - 1:17:50     Text: And that's basically the approach.

1:17:50 - 1:17:55     Text: So they create a bunch of data like this where the gold answer has been changed to match

1:17:55 - 1:17:59     Text: the corruption they made in the retrieved memory and it guides the model away from using

1:17:59 - 1:18:00     Text: the parametric memory.

1:18:00 - 1:18:07     Text: I thought that was a pretty cool trick and they don't give this name in the paper, but

1:18:07 - 1:18:10     Text: you can think of it as data augmentation using these counterfactual memories.

1:18:10 - 1:18:13     Text: And it can really be applied to a lot of different approaches.

1:18:13 - 1:18:17     Text: As long as your memory is editable in a certain way and you can edit the gold label as well,

1:18:17 - 1:18:23     Text: you can create this artificial correlation between the memory and the output and an artificial

1:18:23 - 1:18:27     Text: anti-correlation between the output and whatever your model originally trained on.
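
A toy sketch of that counterfactual augmentation; the plain string substitution here is a simplification of the more careful, entity-aware substitution done in the paper.

```python
# Toy sketch of counterfactual data augmentation: swap the answer entity in
# the retrieved memory for a different one, and change the gold label to match,
# so the model can only be right by actually reading the memory.
import random

def make_counterfactual(question, memory, answer, substitutes):
    new_answer = random.choice([s for s in substitutes if s != answer])
    new_memory = memory.replace(answer, new_answer)
    return question, new_memory, new_answer        # gold label now matches the edit

example = make_counterfactual(
    question="Who do you meet at the gates of heaven?",
    memory="You see Saint Peter at the gates of heaven.",
    answer="Saint Peter",
    substitutes=["Saint Peter", "the United Nations", "Gandalf"],
)
print(example)
```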

1:18:27 - 1:18:30     Text: It's cool.

1:18:30 - 1:18:32     Text: So now you want to see if it works.

1:18:32 - 1:18:36     Text: And in the paper they report this metric, which is basically the percentage of time the

1:18:36 - 1:18:43     Text: model predicts the old value instead of the new one divided by the old plus the new.

1:18:43 - 1:18:46     Text: They ignore the set where the model gets confused and produces something totally different.

1:18:46 - 1:18:51     Text: I wish they had reported that too, but I couldn't immediately find it in their paper.

1:18:51 - 1:18:53     Text: But at least on this metric things look great.

1:18:53 - 1:18:57     Text: So on the training set and the dev set, the percentage of the time that the model uses

1:18:57 - 1:19:02     Text: the old memory, the old answer, goes dramatically down with this data augmentation.

1:19:02 - 1:19:06     Text: So it really keeps the model on its toes and makes it use its memory.

1:19:06 - 1:19:11     Text: Now I take this result with a little grain of salt because their test set is created the

1:19:11 - 1:19:14     Text: same way that they produce this data augmentation.

1:19:14 - 1:19:19     Text: So this is kind of the ideal setup where they're almost testing on the exact same distribution

1:19:19 - 1:19:21     Text: that they're training on.

1:19:21 - 1:19:24     Text: But still a very interesting approach.

1:19:24 - 1:19:26     Text: So let's see how long do we have left?

1:19:26 - 1:19:30     Text: Okay, in the last few minutes I'm going to cover the over reliance problem and there's

1:19:30 - 1:19:32     Text: just one slide on this.

1:19:32 - 1:19:36     Text: So sometimes the memories that your model retrieves are too easy.

1:19:36 - 1:19:38     Text: Here's what I mean by that.

1:19:38 - 1:19:44     Text: So if we go back to this Eiffel Tower query, what year was the Eiffel Tower built?

1:19:44 - 1:19:48     Text: We know it's 1889 and a good typical memory that you might retrieve is something like this.

1:19:48 - 1:19:53     Text: It says work on the Eiffel Tower was completed in 1889.

1:19:53 - 1:19:57     Text: There's not too much word overlap with the original query, which is good because it

1:19:57 - 1:20:03     Text: teaches the reader to recognize the fact that completed in this context is the same as

1:20:03 - 1:20:04     Text: built.

1:20:04 - 1:20:06     Text: So the reader learns paraphrase.

1:20:06 - 1:20:10     Text: On the other hand, you might get a memory that's too easy, which literally just says exactly

1:20:10 - 1:20:13     Text: the same tokens as the original input.

1:20:13 - 1:20:20     Text: So this example would not teach the model how to paraphrase.

1:20:20 - 1:20:24     Text: And at the other extreme, you could also consider an extremely challenging memory.

1:20:24 - 1:20:29     Text: Like Paris's tallest tower finished the same year Van Gogh painted The Starry Night.

1:20:29 - 1:20:33     Text: So yes, that also says the same fact, but the answer doesn't even directly appear.

1:20:33 - 1:20:36     Text: It's just too hard for the model.

1:20:36 - 1:20:40     Text: So if all of your examples are like this too easy memory, then you end up with a reader

1:20:40 - 1:20:42     Text: that is kind of spoiled.

1:20:42 - 1:20:45     Text: It never learns to paraphrase.

1:20:45 - 1:20:49     Text: And at test time, if your memories that are retrieved are not as good, the reader is not

1:20:49 - 1:20:51     Text: going to be able to use them.

1:20:51 - 1:20:56     Text: So a simple fix to this problem, I don't have a paper that I can cite exactly for this,

1:20:56 - 1:21:01     Text: but at training time, you can simply filter out some of the memories whose lexical overlap

1:21:01 - 1:21:03     Text: with the query is too high.

1:21:03 - 1:21:08     Text: At the same time, you also want to make sure that you don't filter out so many of the easy

1:21:08 - 1:21:12     Text: things that you're just left with the super hard cases, like the one on the bottom.

1:21:12 - 1:21:15     Text: Because if you only have the super hard cases, your model will get confused.

1:21:15 - 1:21:20     Text: And as we saw in the previous slides, it might just fall back on its parametric memory.
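
A sketch of that kind of lexical-overlap filter; the Jaccard measure and the 0.6 threshold are arbitrary choices for illustration.

```python
# Sketch: drop training memories whose token overlap with the query is so high
# that the reader never has to learn paraphrase.

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def filter_memories(query: str, memories: list[str], max_overlap: float = 0.6):
    return [m for m in memories if jaccard(query, m) <= max_overlap]

query = "what year was the eiffel tower built"
memories = [
    "the eiffel tower was built in the year 1889",       # near-verbatim: filtered out
    "work on the eiffel tower was completed in 1889",     # paraphrase: kept
]
print(filter_memories(query, memories))
```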

1:21:20 - 1:21:25     Text: So this is sort of just an area for open research of how to give the reader a flexible set

1:21:25 - 1:21:27     Text: of things to train from.

1:21:27 - 1:21:29     Text: Great, yeah.

1:21:29 - 1:21:32     Text: So I've covered pretty much everything in that section as well.

1:21:32 - 1:21:36     Text: The main takeaway for you guys to have is that getting your model to use

1:21:36 - 1:21:38     Text: memories is not hard.

1:21:38 - 1:21:39     Text: There's some simple approaches.

1:21:39 - 1:21:44     Text: But getting your model to use memory correctly is actually an interesting open question.

1:21:44 - 1:21:50     Text: And there's this issue of underutilization and overreliance that are open areas of research.

1:21:50 - 1:21:51     Text: And that's it.

1:21:51 - 1:21:54     Text: I hope that you guys saw some interesting things about memory augmented models and are

1:21:54 - 1:21:56     Text: encouraged to look into that area.

1:21:56 - 1:22:00     Text: If there are any questions, please feel free to email me or message me.

1:22:00 - 1:22:02     Text: Happy to talk about it more.

1:22:02 - 1:22:04     Text: Thanks for sitting through a 90 minute lecture.