0:00:00 - 0:00:12 Text: Welcome to CS224N, lecture 17.
0:00:12 - 0:00:14 Text: Model analysis and explanation.
0:00:14 - 0:00:16 Text: Okay, look at us.
0:00:16 - 0:00:19 Text: We're here.
0:00:19 - 0:00:21 Text: Start with some course logistics.
0:00:21 - 0:00:26 Text: We have updated the policy on the guest lecture reactions.
0:00:26 - 0:00:28 Text: They're all due Friday.
0:00:28 - 0:00:30 Text: All at 11:59 pm.
0:00:30 - 0:00:33 Text: You can't use late days for this.
0:00:33 - 0:00:35 Text: So please get them in.
0:00:35 - 0:00:36 Text: Watch the lectures.
0:00:36 - 0:00:37 Text: They're awesome lectures.
0:00:37 - 0:00:39 Text: They're awesome guests.
0:00:39 - 0:00:42 Text: And you get something like half a point for each of them.
0:00:42 - 0:00:46 Text: And yeah, all three can be submitted up through Friday.
0:00:46 - 0:00:48 Text: Okay, so final project.
0:00:48 - 0:00:51 Text: Remember that the due date is Tuesday.
0:00:51 - 0:00:55 Text: It's Tuesday at 4:30 pm, March 16th.
0:00:55 - 0:01:05 Text: And let me emphasize that there's a hard deadline three days from then, on Friday.
0:01:05 - 0:01:10 Text: We won't be accepting, even for additional points off, assignments, sorry, final projects that
0:01:10 - 0:01:15 Text: are submitted after the 4:30 deadline on Friday.
0:01:15 - 0:01:18 Text: We need to get these graded and get grades in.
0:01:18 - 0:01:20 Text: So it's the end stretch.
0:01:20 - 0:01:21 Text: It's week 9.
0:01:21 - 0:01:27 Text: In week 10, the lectures are really us giving you help on the final projects.
0:01:27 - 0:01:29 Text: So this is really the last week of lectures.
0:01:29 - 0:01:31 Text: Thanks for all your hard work.
0:01:31 - 0:01:36 Text: And for asking awesome questions in lecture and in office hours and on Ed.
0:01:36 - 0:01:37 Text: And let's get right into it.
0:01:37 - 0:01:44 Text: So today we get to talk about one of my favorite subjects in natural language processing.
0:01:44 - 0:01:47 Text: It's model analysis and explanation.
0:01:47 - 0:01:51 Text: So first we're going to do what I love doing, which is motivating why we want to talk about
0:01:51 - 0:01:54 Text: the topic at all.
0:01:54 - 0:01:59 Text: We'll talk about how we can look at a model at different levels of abstraction to perform
0:01:59 - 0:02:02 Text: different kinds of analysis on it.
0:02:02 - 0:02:05 Text: We'll talk about out of domain evaluation sets.
0:02:05 - 0:02:10 Text: So this will feel familiar to the robust QA folks.
0:02:10 - 0:02:15 Text: Then we'll talk about sort of trying to figure out, for a given example, why did the model make
0:02:15 - 0:02:17 Text: the decision that it made?
0:02:17 - 0:02:19 Text: It had some input, it produced some output.
0:02:19 - 0:02:23 Text: And we try to come up with some sort of interpretable explanation for it.
0:02:23 - 0:02:29 Text: And then we'll look at actually the representations of the models.
0:02:29 - 0:02:34 Text: So these are the sort of hidden states, the vectors that are being built throughout the processing
0:02:34 - 0:02:38 Text: of the model, try to figure out if we can understand some of the representations and
0:02:38 - 0:02:41 Text: mechanisms that the model is performing.
0:02:41 - 0:02:45 Text: And then we'll actually come back to sort of one of the kind of default states that
0:02:45 - 0:02:50 Text: we've been in in this course, which is trying to look at model improvements, removing things
0:02:50 - 0:02:55 Text: from models, seeing how it performs, and relate that to the analysis that we're doing in this
0:02:55 - 0:02:59 Text: lecture, show how it's not all that different.
0:02:59 - 0:03:01 Text: Okay.
0:03:01 - 0:03:08 Text: So if you haven't seen this XKCD, now you have, and it's one of my favorites, I'm going
0:03:08 - 0:03:09 Text: to say all the words.
0:03:09 - 0:03:16 Text: So person A says this is your machine learning system, person B says yep, you pour the
0:03:16 - 0:03:21 Text: data into this big pile of linear algebra, and then collect the answers on the other side,
0:03:21 - 0:03:26 Text: person A, what if the answers are wrong, and person B, just stir the pile until they
0:03:26 - 0:03:28 Text: start looking right.
0:03:28 - 0:03:32 Text: And I feel like at its worst, deep learning can feel like this from time to time.
0:03:32 - 0:03:37 Text: You have a model, maybe it works for some things, maybe it doesn't work for other things,
0:03:37 - 0:03:41 Text: you're not sure why it works for some things and doesn't work for others.
0:03:41 - 0:03:47 Text: And the changes that we make to our models are based on intuition, but frequently,
0:03:47 - 0:03:49 Text: what have the TAs told
0:03:49 - 0:03:52 Text: everyone in office hours? Sometimes you just have to try it and see if it's going to work
0:03:52 - 0:03:55 Text: out, because it's very hard to tell.
0:03:55 - 0:04:01 Text: It's very, very difficult to understand our models at any level.
0:04:01 - 0:04:06 Text: And so today we'll go through a number of ways for trying to carve out little bits of understanding
0:04:06 - 0:04:07 Text: here and there.
0:04:07 - 0:04:16 Text: So beyond it being important because it's in an XKCD comic, why should we care
0:04:16 - 0:04:18 Text: about understanding our models?
0:04:18 - 0:04:23 Text: One, is that we want to know what our models are doing.
0:04:23 - 0:04:30 Text: So here you have a black box; black box evokes this idea that you can't look into
0:04:30 - 0:04:33 Text: it and interpret what it's doing.
0:04:33 - 0:04:37 Text: You have an input sentence, say, and then some output prediction.
0:04:37 - 0:04:46 Text: Maybe this black box is actually your final project model and it gets some accuracy.
0:04:46 - 0:04:51 Text: Now we summarize our models and in your final projects you'll summarize your model with
0:04:51 - 0:04:57 Text: sort of one or a handful of summary metrics of accuracy or f1 score or blue score or
0:04:57 - 0:04:59 Text: something.
0:04:59 - 0:05:03 Text: But there's a lot of model to explain with just a small number of metrics.
0:05:03 - 0:05:05 Text: So what do they learn?
0:05:05 - 0:05:08 Text: Why do they succeed and why do they fail?
0:05:08 - 0:05:09 Text: What's another motivation?
0:05:09 - 0:05:12 Text: We want to know what our models are doing.
0:05:12 - 0:05:17 Text: But maybe that's because we want to be able to make tomorrow's model.
0:05:17 - 0:05:23 Text: So today, when you're building models in this class or at a company, you start out with some
0:05:23 - 0:05:28 Text: kind of recipe that is known to work, either at the company or because you have experience
0:05:28 - 0:05:33 Text: from this class. And it's not perfect, right? It makes mistakes, so you look at the errors.
0:05:33 - 0:05:39 Text: And then over time, you take what works and then you find what needs changing.
0:05:39 - 0:05:43 Text: So it seems like maybe adding another layer to the model helped.
0:05:43 - 0:05:49 Text: And maybe that's a nice tweak and the model performance gets better, et cetera.
0:05:49 - 0:05:55 Text: And incremental progress doesn't always feel exciting, but I want to pitch to you that
0:05:55 - 0:06:01 Text: it is actually very important for us to understand how much incremental progress can kind of get
0:06:01 - 0:06:08 Text: us towards some of our goals so that we can have a better job of evaluating when we need
0:06:08 - 0:06:12 Text: big leaps, when we need major changes because there are problems that we're attacking with
0:06:12 - 0:06:16 Text: our incremental sort of progress and we're not getting very far.
0:06:16 - 0:06:20 Text: OK, so we want to make tomorrow's model.
0:06:20 - 0:06:27 Text: The thing that's very related to both a part of and bigger than this field of analysis
0:06:27 - 0:06:29 Text: is model biases.
0:06:29 - 0:06:38 Text: So let's say you take your word vector analogies solver, from GloVe or word2vec, from
0:06:38 - 0:06:43 Text: assignment one, and you give it the analogy man is to computer programmer as woman is
0:06:43 - 0:06:46 Text: to, and it gives you the output homemaker.
0:06:46 - 0:06:50 Text: This is a real example from the paper below.
0:06:50 - 0:06:56 Text: You should be like, wow, well, I'm glad I know that now. And of course you saw the lecture
0:06:56 - 0:06:58 Text: from Julia
0:06:58 - 0:07:04 Text: just last week, where you said, wow, I'm glad I know that now, and that's a huge problem.
0:07:04 - 0:07:06 Text: What did the model use in its decision?
0:07:06 - 0:07:10 Text: What biases is it learning from data and possibly making even worse?
0:07:10 - 0:07:15 Text: So that's the kind of thing you can also do with model analysis, beyond just making models
0:07:15 - 0:07:19 Text: better according to some sort of summary metric.
0:07:19 - 0:07:23 Text: And then another thing, we don't just want to make tomorrow's model and this is something
0:07:23 - 0:07:28 Text: that I think is super important.
0:07:28 - 0:07:30 Text: We don't just want to look at that time scale.
0:07:30 - 0:07:36 Text: We want to say, what about 10, 15, 25 years from now, what kinds of things will we be doing?
0:07:36 - 0:07:37 Text: What are the limits?
0:07:37 - 0:07:41 Text: What can be learned by language model pre-training?
0:07:41 - 0:07:44 Text: What's the model that will replace the transformer?
0:07:44 - 0:07:46 Text: What's the model that will replace that model?
0:07:46 - 0:07:48 Text: What does deep learning struggle to do?
0:07:48 - 0:07:52 Text: What are we sort of attacking over and over again and failing to make significant progress
0:07:52 - 0:07:53 Text: on?
0:07:53 - 0:07:55 Text: What do neural models tell us about language potentially?
0:07:55 - 0:08:00 Text: There's some people who are primarily interested in understanding language better using neural
0:08:00 - 0:08:01 Text: networks.
0:08:01 - 0:08:03 Text: Cool.
0:08:03 - 0:08:10 Text: How are our models affecting people, transferring power between groups of people, governments,
0:08:10 - 0:08:11 Text: et cetera?
0:08:11 - 0:08:12 Text: That's an excellent type of analysis.
0:08:12 - 0:08:15 Text: That can't be learned via language model pre-training.
0:08:15 - 0:08:17 Text: That's sort of the complementary question there.
0:08:17 - 0:08:22 Text: If you sort of come to the edge of what you can learn via language model pre-training,
0:08:22 - 0:08:28 Text: is there stuff that we need total paradigm shifts in order to do well?
0:08:28 - 0:08:34 Text: All of this falls under some category of trying to really deeply understand our models
0:08:34 - 0:08:37 Text: and their capabilities.
0:08:37 - 0:08:40 Text: There's a lot of different methods here that will go over today.
0:08:40 - 0:08:47 Text: One thing that I want you to take away from it is that each of them is going to tell us about
0:08:47 - 0:08:52 Text: some aspect of the model, elucidate some kind of intuition or something, but with none of them
0:08:52 - 0:08:58 Text: are we going to say, aha, I really understand 100% of what this model is doing now.
0:08:58 - 0:09:01 Text: They're going to provide some clarity, but never total clarity.
0:09:01 - 0:09:07 Text: One way, if you're trying to decide how you want to understand your model more, I think
0:09:07 - 0:09:12 Text: you should start out by thinking about what level of abstraction do I want to be looking
0:09:12 - 0:09:14 Text: at my model.
0:09:14 - 0:09:22 Text: At the very highest level of abstraction, let's say you've trained a QA model to estimate the probabilities
0:09:22 - 0:09:27 Text: of start and end indices in a reading comprehension problem or you've trained a language model
0:09:27 - 0:09:30 Text: that assigns probabilities to words in context.
0:09:30 - 0:09:33 Text: You can just look at the model as that object.
0:09:33 - 0:09:37 Text: It's just a probability distribution defined by your model.
0:09:37 - 0:09:41 Text: You are not looking into it any further than the fact that you can sort of give it inputs
0:09:41 - 0:09:45 Text: and see what outputs it provides.
0:09:45 - 0:09:49 Text: That's not even who even cares if it's a neural network.
0:09:49 - 0:09:53 Text: It could be anything, but it's a way to understand its behavior.
0:09:53 - 0:09:56 Text: Another level of abstraction that you can look at, you can dig a little deeper.
0:09:56 - 0:10:01 Text: You can say, well, I know that my network is a bunch of layers that are kind of stacked
0:10:01 - 0:10:06 Text: on top of each other, you've got sort of maybe your transformer encoder with your one
0:10:06 - 0:10:09 Text: layer, two layer, three layer, you can try to see what it's doing as it goes deeper
0:10:09 - 0:10:12 Text: in the layers.
0:10:12 - 0:10:15 Text: Maybe your neural model is the sequence of these vector representations.
0:10:15 - 0:10:22 Text: A third option of sort of specificity is to look as much detail as you can.
0:10:22 - 0:10:23 Text: You've got these parameters in there.
0:10:23 - 0:10:26 Text: You've got the connections in the computation graph.
0:10:26 - 0:10:30 Text: Now you're sort of trying to remove all of the abstraction that you can and look at as
0:10:30 - 0:10:32 Text: many details as possible.
0:10:32 - 0:10:36 Text: All three of these ways of looking at your model and performing analysis are going to
0:10:36 - 0:10:42 Text: be useful and will actually sort of travel slowly from one to two to three as we go through
0:10:42 - 0:10:45 Text: this lecture.
0:10:45 - 0:10:47 Text: Okay.
0:10:47 - 0:10:51 Text: We haven't actually talked about any analyses yet.
0:10:51 - 0:10:56 Text: We're going to get started on that now.
0:10:56 - 0:10:59 Text: We're starting with sort of testing our models' behaviors.
0:10:59 - 0:11:02 Text: So if we want to see, well, does my model perform well,
0:11:02 - 0:11:10 Text: I mean, the natural thing to ask is, how does it behave on some sort of test set?
0:11:10 - 0:11:13 Text: And so we don't really care about mechanisms yet.
0:11:13 - 0:11:14 Text: Why is it performing this?
0:11:14 - 0:11:17 Text: By what method is it making its decision?
0:11:17 - 0:11:22 Text: Instead, we're just interested in sort of the more higher level abstraction of like, does
0:11:22 - 0:11:24 Text: it perform the way I wanted to perform?
0:11:24 - 0:11:31 Text: So let's like, take our model evaluation that we are already doing and sort of recast
0:11:31 - 0:11:33 Text: it in the framework of analysis.
0:11:33 - 0:11:37 Text: So you've trained your model on some samples from some distribution.
0:11:37 - 0:11:40 Text: So you've got input, output pairs of some kind.
0:11:40 - 0:11:43 Text: So how does the model behave on samples from the same distribution?
0:11:43 - 0:11:48 Text: It's a simple question and it's sort of, you know, it's known as, you know, in domain
0:11:48 - 0:11:53 Text: accuracy or you can say that the samples are IID and that's what you're testing on.
0:11:53 - 0:11:56 Text: And this is just what we've been doing this whole time.
0:11:56 - 0:12:00 Text: It's your test set accuracy or F1 or blue score.
0:12:00 - 0:12:06 Text: And you know, so you've got some model with some accuracy and maybe it's better than some
0:12:06 - 0:12:09 Text: model with some other accuracy on this test set, right?
0:12:09 - 0:12:14 Text: So this is what you're doing as you're iterating on your models and your final project as well.
0:12:14 - 0:12:18 Text: You say, well, you know, on my test set, which is what I've decided to care about for
0:12:18 - 0:12:19 Text: now, model A does better.
0:12:19 - 0:12:22 Text: They both seem pretty good.
0:12:22 - 0:12:24 Text: And so maybe I'll choose model A to keep working on.
0:12:24 - 0:12:28 Text: Maybe I'll choose it if you were putting something into production.
0:12:28 - 0:12:33 Text: But remember back to, you know, this idea that it's just one number to summarize a very
0:12:33 - 0:12:36 Text: complex system.
0:12:36 - 0:12:40 Text: It's not going to be sufficient to tell you how it's going to perform in a wide variety
0:12:40 - 0:12:41 Text: of settings.
0:12:41 - 0:12:44 Text: Okay, so we've been doing this.
0:12:44 - 0:12:48 Text: This is model evaluation as model analysis.
0:12:48 - 0:12:55 Text: Now we're going to say what if we are not testing on exactly the same type of data that we
0:12:55 - 0:12:56 Text: trained on.
0:12:56 - 0:13:01 Text: So now we're asking, did the model learn something such that it's able to sort of extrapolate
0:13:01 - 0:13:05 Text: or perform how I want it to on data that looks a little bit different from what it was
0:13:05 - 0:13:06 Text: trained on?
0:13:06 - 0:13:09 Text: And we're going to take the example of natural language inference.
0:13:09 - 0:13:12 Text: So to recall the task of natural language inference, and this is through the multi-analye
0:13:12 - 0:13:17 Text: data set that we're just pulling our definition, you have a premise.
0:13:17 - 0:13:21 Text: He turned and saw John sleeping in his half tent, and you have a hypothesis.
0:13:21 - 0:13:24 Text: He saw John was asleep.
0:13:24 - 0:13:26 Text: And then you give them both two of model.
0:13:26 - 0:13:29 Text: And this is the model that we had before that gets some good accuracy.
0:13:29 - 0:13:35 Text: And the model is supposed to tell whether the hypothesis is sort of implied by the premise
0:13:35 - 0:13:37 Text: or contradicting.
0:13:37 - 0:13:39 Text: So you could be contradicting.
0:13:39 - 0:13:42 Text: Maybe if the hypothesis is, you know, John was awake.
0:13:42 - 0:13:44 Text: For example, or he saw John was awake.
0:13:44 - 0:13:46 Text: Maybe that would be contradiction.
0:13:46 - 0:13:50 Text: Or if sort of both could be true at the same time, so to speak.
0:13:50 - 0:13:54 Text: And then in this case, you know, it seems like they're saying that the premise implies
0:13:54 - 0:13:56 Text: the hypothesis.
0:13:56 - 0:14:00 Text: And so, you know, you would say probably this is likely to get the right answer since
0:14:00 - 0:14:01 Text: the accuracy of the model is 95%.
0:14:01 - 0:14:06 Text: And if I percent of the time, we get the right answer.
0:14:06 - 0:14:09 Text: And we're going to dig deeper into that.
0:14:09 - 0:14:15 Text: What if the model is not doing what we think we want it to be doing in order to perform
0:14:15 - 0:14:16 Text: natural language inference?
0:14:16 - 0:14:22 Text: So in a data set like multi-nLI, the authors who gathered the data set will have asked
0:14:22 - 0:14:27 Text: humans to perform the task and, you know, gotten the accuracy that the humans achieved.
0:14:27 - 0:14:34 Text: And models nowadays are achieving accuracies that are around where humans are achieving,
0:14:34 - 0:14:36 Text: which sounds great at first.
0:14:36 - 0:14:43 Text: But as we'll see, it's not the same as actually performing the task more broadly in the right
0:14:43 - 0:14:45 Text: way.
0:14:45 - 0:14:49 Text: So what if the model is not doing something smart effectively?
0:14:49 - 0:14:54 Text: We're going to use a diagnostic test set of carefully constructed examples that seem
0:14:54 - 0:15:01 Text: like things the model should be able to do to test for a specific skill or capacity.
0:15:01 - 0:15:03 Text: In this case, we'll use Hans.
0:15:03 - 0:15:07 Text: So Hans is the heuristic analysis for analyzed systems data set.
0:15:07 - 0:15:12 Text: And it's intended to take systems that do natural language inference and test whether
0:15:12 - 0:15:16 Text: they're using some simple syntactic heuristics.
0:15:16 - 0:15:19 Text: What we'll have in each of these cases, we'll have some heuristic.
0:15:19 - 0:15:21 Text: We'll talk through the definition.
0:15:21 - 0:15:22 Text: We'll get an example.
0:15:22 - 0:15:24 Text: So the first thing is lexical overlap.
0:15:24 - 0:15:31 Text: So the model might do this thing where it assumes that a premise entails all hypotheses
0:15:31 - 0:15:32 Text: constructed from words in the premise.
0:15:32 - 0:15:40 Text: So in this example, you have the premise the doctor was paid by the actor.
0:15:40 - 0:15:43 Text: And then the hypothesis is the doctor paid the actor.
0:15:43 - 0:15:49 Text: And you'll notice that in bold here, get the doctor, and then paid, and then the actor.
0:15:49 - 0:15:54 Text: And so if you use this heuristic, you will think that the doctor was paid by the actor,
0:15:54 - 0:15:58 Text: implies the doctor paid the actor that does not imply it, of course.
0:15:58 - 0:16:02 Text: And so you could expect a model you want the model to be able to do this.
0:16:02 - 0:16:03 Text: It's somewhat simple.
0:16:03 - 0:16:08 Text: But if it's using this heuristic, it won't get this example right.
0:16:08 - 0:16:10 Text: Next is a sub-sequence heuristics.
0:16:10 - 0:16:17 Text: So here, if the model assumes that the premise entails all of its contiguous sub-sequences,
0:16:17 - 0:16:19 Text: it will get this one wrong as well.
0:16:19 - 0:16:23 Text: So this example is the doctor near the actor danced.
0:16:23 - 0:16:24 Text: That's the premise.
0:16:24 - 0:16:26 Text: The hypothesis is the actor danced.
0:16:26 - 0:16:28 Text: Now this is a simple syntactic thing.
0:16:28 - 0:16:31 Text: The doctor is doing the dancing near the actor.
0:16:31 - 0:16:33 Text: Is this prepositional phrase?
0:16:33 - 0:16:37 Text: And so the model sort of uses this heuristic, oh, look, the actor danced.
0:16:37 - 0:16:38 Text: That's a sub-sequence entailed.
0:16:38 - 0:16:39 Text: Awesome.
0:16:39 - 0:16:42 Text: And it'll get this one wrong as well.
0:16:42 - 0:16:46 Text: And here's another one that's a lot like sub-sequence.
0:16:46 - 0:16:52 Text: So if the premise, if the model thinks that the premise entails all complete sub-trees,
0:16:52 - 0:16:55 Text: so this is like sort of fully formed phrases.
0:16:55 - 0:17:01 Text: So the artist slept here is a fully formed sort of, is that sub-tree, if the artist slept,
0:17:01 - 0:17:04 Text: the actor ran, and then that's the premise.
0:17:04 - 0:17:05 Text: Does it entail the hypothesis?
0:17:05 - 0:17:11 Text: The actor slept, no, sorry, the artist slept.
0:17:11 - 0:17:13 Text: That does not entail it because this is in that conditional.
0:17:13 - 0:17:20 Text: Okay, let me pause here for some questions before I move on to see how these models do.
0:17:20 - 0:17:28 Text: Anyone unclear about how this sort of evaluation is being set up?
0:17:28 - 0:17:37 Text: Cool.
0:17:37 - 0:17:39 Text: Okay.
0:17:39 - 0:17:42 Text: Okay, so how do models perform?
0:17:42 - 0:17:46 Text: That's sort of the question of the hour.
0:17:46 - 0:17:51 Text: What we'll do is, we'll look at these results from the same paper that really released the
0:17:51 - 0:17:52 Text: data set.
0:17:52 - 0:17:57 Text: So they took four strong multi-nl i models with the following accuracy.
0:17:57 - 0:18:02 Text: So the accuracy is here are something between 60 and 80 something 80 percent burnt over
0:18:02 - 0:18:05 Text: here is doing the best.
0:18:05 - 0:18:12 Text: And in domain, in that first sort of setting that we talked about, you get these reasonable
0:18:12 - 0:18:14 Text: accuracies.
0:18:14 - 0:18:20 Text: And that is sort of what we said before about it, like looking pretty good.
0:18:20 - 0:18:27 Text: And when we evaluate on Hans, in this setting here, we have examples where the
0:18:27 - 0:18:30 Text: heuristics we talked about actually work.
0:18:30 - 0:18:34 Text: So if the model is using the heuristic, it will get this right.
0:18:34 - 0:18:37 Text: And it gets very high accuracies.
0:18:37 - 0:18:42 Text: And then if we evaluate the model in the settings where if it uses the heuristic, it gets the
0:18:42 - 0:18:44 Text: examples wrong.
0:18:44 - 0:18:51 Text: You know, maybe birds doing like epsilon better than some of the other stuff here, but it's
0:18:51 - 0:18:53 Text: a very different story.
0:18:53 - 0:18:54 Text: Okay.
0:18:54 - 0:18:55 Text: And you saw those examples.
0:18:55 - 0:19:03 Text: They're not complex in our sort of own idea of complexity.
0:19:03 - 0:19:08 Text: And so this is why it sort of feels like a clear failure of the system.
0:19:08 - 0:19:13 Text: Now you can say though that well, maybe the training data sort of wasn't, didn't have
0:19:13 - 0:19:14 Text: any of those sort of phenomena.
0:19:14 - 0:19:18 Text: So the model couldn't have learned not to do that.
0:19:18 - 0:19:22 Text: And that's sort of a reasonable argument except, well, you know, Bert is pre-trained on
0:19:22 - 0:19:23 Text: a bunch of language texts.
0:19:23 - 0:19:27 Text: So you might hope, you might expect, you might hope that it does better.
0:19:27 - 0:19:29 Text: Okay.
0:19:29 - 0:19:39 Text: So we saw that example of models performing well on examples that are like those that
0:19:39 - 0:19:40 Text: it was trained on.
0:19:40 - 0:19:46 Text: And then performing not very well at all on examples that seem reasonable, but are sort
0:19:46 - 0:19:49 Text: of a little bit tricky.
0:19:49 - 0:19:53 Text: Now we're going to take this idea of having a test set that we've carefully crafted and
0:19:53 - 0:19:55 Text: go in a slightly different direction.
0:19:55 - 0:19:59 Text: So we're going to have, what does it mean to try to understand the linguistic properties
0:19:59 - 0:20:00 Text: of our models?
0:20:00 - 0:20:01 Text: Does it?
0:20:01 - 0:20:05 Text: So that's some tactic heuristics question was one thing for natural language inference,
0:20:05 - 0:20:10 Text: but can we sort of test how the models, whether they think certain things are sort of right
0:20:10 - 0:20:14 Text: or wrong as language models?
0:20:14 - 0:20:18 Text: And the first way that we'll do this is we'll ask, well, how do we think about sort
0:20:18 - 0:20:21 Text: of what humans think of as good language?
0:20:21 - 0:20:26 Text: How do we evaluate their sort of preferences about language?
0:20:26 - 0:20:29 Text: And one answer is minimal pairs.
0:20:29 - 0:20:34 Text: And the idea of a minimal pair is that you've got one sentence that sounds okay to a speaker.
0:20:34 - 0:20:38 Text: So this sentence is the chef who made the pizzas is here.
0:20:38 - 0:20:43 Text: It's called, it's an acceptable sentence, at least to me.
0:20:43 - 0:20:50 Text: And then with a small change, a minimal change, the sentence is no longer okay to the speaker.
0:20:50 - 0:20:53 Text: So the chef who made the pizzas are here.
0:20:53 - 0:21:01 Text: And this, whoops, this should be, present tense verbs.
0:21:01 - 0:21:05 Text: In English, present tense verbs agree in number with their subject when they are third
0:21:05 - 0:21:07 Text: person.
0:21:07 - 0:21:10 Text: So chef pizzas, okay.
0:21:10 - 0:21:14 Text: And this is sort of a pretty general thing.
0:21:14 - 0:21:16 Text: Most people don't like this.
0:21:16 - 0:21:18 Text: It's a misconjugated verb.
0:21:18 - 0:21:23 Text: And so the syntax here looks like you have the chef who made the pizzas.
0:21:23 - 0:21:30 Text: And then this arc of agreement in number is requiring the word is here to be singular
0:21:30 - 0:21:32 Text: is instead of plural R.
0:21:32 - 0:21:38 Text: Despite the fact that there's this noun pizzas, which is plural, closer linearly, comes
0:21:38 - 0:21:40 Text: back to dependency parsing.
0:21:40 - 0:21:42 Text: Or back, okay.
0:21:42 - 0:21:49 Text: And what this looks like in the tree structure, right, is well, chef and is are attached in
0:21:49 - 0:21:52 Text: the tree.
0:21:52 - 0:21:57 Text: Chef is the subject of is, pizza is down here in the subtree.
0:21:57 - 0:22:02 Text: And so that subject verb relationship has this sort of agreement thing.
0:22:02 - 0:22:08 Text: So this is a pretty sort of basic and interesting property of language that also reflects the
0:22:08 - 0:22:11 Text: syntactic sort of hierarchical structure of language.
0:22:11 - 0:22:14 Text: So we've been training these language models sampling from them, seeing that they get
0:22:14 - 0:22:15 Text: interesting things.
0:22:15 - 0:22:19 Text: And they tend to seem to generate syntactic content.
0:22:19 - 0:22:25 Text: But does it really understand or does it behave as if it understands this idea of agreement
0:22:25 - 0:22:29 Text: more broadly and does it sort of get the syntax right so that it matches the subjects and
0:22:29 - 0:22:31 Text: the verbs.
0:22:31 - 0:22:36 Text: But language models can't tell us exactly whether they think that a sentence is good or
0:22:36 - 0:22:40 Text: bad, they just tell us the probability of a sentence.
0:22:40 - 0:22:45 Text: So before we had acceptable and unacceptable, that's what we get from humans.
0:22:45 - 0:22:50 Text: And the language models analog is just, does it assign higher probability to the acceptable
0:22:50 - 0:22:52 Text: sentence in the minimal pair, right?
0:22:52 - 0:22:58 Text: So you have the probability under the model of the chef who made the pizzas is here.
0:22:58 - 0:23:02 Text: And then you have the probability under the model of the chef who made the pizzas are
0:23:02 - 0:23:03 Text: here.
0:23:03 - 0:23:08 Text: And you want this probability here to be higher.
0:23:08 - 0:23:15 Text: And if it is, that's sort of like a simple way to test whether the model got it right effectively.
0:23:15 - 0:23:22 Text: And just like in Huns, we can develop a test set with very carefully chosen properties,
0:23:22 - 0:23:23 Text: right?
0:23:23 - 0:23:29 Text: So most sentences in English don't have terribly complex subject verb agreement structure
0:23:29 - 0:23:34 Text: or with a lot of words in the middle like pizzas that are going to make it difficult.
0:23:34 - 0:23:42 Text: So if I say, you know, the dog runs sort of no way to get it wrong because there's no
0:23:42 - 0:23:44 Text: syntax is very simple.
0:23:44 - 0:23:53 Text: So we can create, well, we can look for sentences that have these things called attractors in
0:23:53 - 0:23:54 Text: the sentence.
0:23:54 - 0:23:59 Text: So pizzas is an attractor because the model might be attracted to the plurality here and
0:23:59 - 0:24:03 Text: get the conjugation wrong.
0:24:03 - 0:24:04 Text: So this is our question.
0:24:04 - 0:24:08 Text: Can language models sort of very generally handle these examples with attractors?
0:24:08 - 0:24:13 Text: So we can take examples with zero attractors, see whether the model gets the minimal pairs
0:24:13 - 0:24:14 Text: evaluation right.
0:24:14 - 0:24:18 Text: We can take examples with one attractor, two attractors.
0:24:18 - 0:24:22 Text: You can see how people would still reasonably understand the sentences, right?
0:24:22 - 0:24:24 Text: Chef who made the pizzas and prep the ingredients is.
0:24:24 - 0:24:26 Text: It's still the chef who is.
0:24:26 - 0:24:32 Text: And then on and on and on, it gets rarer, obviously, but you can have more and more attractors.
0:24:32 - 0:24:36 Text: And so now we've created this test set that's intended to evaluate this very specific linguistic
0:24:36 - 0:24:39 Text: phenomenon.
0:24:39 - 0:24:46 Text: So in this paper here, I concur at all, trained an LSTM language model on a subset of Wikipedia
0:24:46 - 0:24:48 Text: back in 2018.
0:24:48 - 0:24:54 Text: And they evaluate it sort of in these buckets that are specified by the paper that sort
0:24:54 - 0:25:02 Text: of introduced subject verb agreement to the NLP field, or more recently at least, and
0:25:02 - 0:25:06 Text: they evaluate it in buckets based on the number of attractors.
0:25:06 - 0:25:12 Text: And so in this table here that you're about to see, the numbers are sort of the percentive
0:25:12 - 0:25:19 Text: times that you get this assign higher probability to the correct sentence in the minimal pair.
0:25:19 - 0:25:23 Text: So if you were just to do random or majority class, you get these errors, oh, sorry, it's
0:25:23 - 0:25:26 Text: the percent of times that you get it wrong.
0:25:26 - 0:25:27 Text: Sorry about that.
0:25:27 - 0:25:30 Text: So lower is better.
0:25:30 - 0:25:33 Text: And so with no attractors, you get very low error rates.
0:25:33 - 0:25:39 Text: So this is 1.3 error rate with a 350-dimensional LSTM.
0:25:39 - 0:25:45 Text: And with one attractor, your error rate is higher, but actually humans start to get errors
0:25:45 - 0:25:47 Text: with more attractors too.
0:25:47 - 0:25:50 Text: So zero attractors is easy.
0:25:50 - 0:25:53 Text: The larger the LSTM, it looks like in general the better you're doing, right?
0:25:53 - 0:25:56 Text: So the smaller model is doing worse, OK?
0:25:56 - 0:26:01 Text: And then even on sort of very difficult examples with four attractors — if you try to think
0:26:01 - 0:26:07 Text: of an example in your head, like "the chef who made the pizzas and took out the trash and...", it sort
0:26:07 - 0:26:12 Text: of has to be this long sentence — the error rate is definitely higher, so it gets more difficult,
0:26:12 - 0:26:15 Text: but it's still relatively low.
0:26:15 - 0:26:18 Text: And so even on these very hard examples, models are actually performing subject verb number
0:26:18 - 0:26:21 Text: agreement relatively well.
0:26:21 - 0:26:24 Text: Very cool.
0:26:24 - 0:26:25 Text: OK.
0:26:25 - 0:26:28 Text: Here are some examples that a model got wrong.
0:26:28 - 0:26:32 Text: This is actually a worse model than the ones from the paper that was just there, but I
0:26:32 - 0:26:35 Text: think actually the errors are quite interesting.
0:26:35 - 0:26:41 Text: So here's a sentence, the ship that the player drives has a very high speed.
0:26:41 - 0:26:47 Text: Now this model thought that was less probable than the ship that the player drives have a
0:26:47 - 0:26:51 Text: very high speed.
0:26:51 - 0:27:00 Text: My hypothesis, right, is that it sort of misanalyzes drives as a plural noun, for example,
0:27:00 - 0:27:01 Text: sort of a difficult construction there.
0:27:01 - 0:27:04 Text: I think it's pretty interesting.
0:27:04 - 0:27:07 Text: Likewise here, this one is fun.
0:27:07 - 0:27:09 Text: The lead is also rather long.
0:27:09 - 0:27:12 Text: Five paragraphs is pretty lengthy.
0:27:12 - 0:27:18 Text: So here, "five paragraphs" together acts as a singular noun — it's like a unit of length,
0:27:18 - 0:27:19 Text: I guess.
0:27:19 - 0:27:26 Text: But the model thought that it was more likely to say five paragraphs are pretty lengthy,
0:27:26 - 0:27:32 Text: because it's referring to this sort of five paragraphs as the five actual paragraphs
0:27:32 - 0:27:37 Text: themselves as opposed to a single unit of length describing the lead.
0:27:37 - 0:27:39 Text: Fascinating.
0:27:39 - 0:27:42 Text: OK.
0:27:42 - 0:27:53 Text: Maybe questions again?
0:27:53 - 0:27:56 Text: So I guess there are a couple.
0:27:56 - 0:28:04 Text: Can we do a similar heuristic analysis for other tasks, such as QA or classification?
0:28:04 - 0:28:07 Text: Yes.
0:28:07 - 0:28:14 Text: So yes, I think that it's easier to do this kind of analysis, the HANS-style analysis,
0:28:14 - 0:28:23 Text: with question answering and other sorts of tasks, because you can construct examples
0:28:23 - 0:28:35 Text: that similarly have these heuristics and then have the answer depend on the syntax or
0:28:35 - 0:28:36 Text: not.
0:28:36 - 0:28:41 Text: Whether the actual probability of one sentence is higher than the other, of course, is sort of a language
0:28:41 - 0:28:43 Text: model dependent thing.
0:28:43 - 0:28:52 Text: But the idea that you can sort of develop kind of bespoke test sets for various tasks,
0:28:52 - 0:28:54 Text: I think is very, very general.
0:28:54 - 0:28:59 Text: And something that I think is actually quite interesting.
0:28:59 - 0:29:00 Text: Yes.
0:29:00 - 0:29:05 Text: So I won't go on further, but I think the answer is just yes.
0:29:05 - 0:29:07 Text: So there's another one.
0:29:07 - 0:29:10 Text: How do you know where to find these failure cases?
0:29:10 - 0:29:13 Text: Maybe that's the right time to advertise linguistics classes.
0:29:13 - 0:29:14 Text: Sorry.
0:29:14 - 0:29:17 Text: You're still very quiet over here.
0:29:17 - 0:29:19 Text: How do you find what?
0:29:19 - 0:29:23 Text: How do you know where to find these failure cases?
0:29:23 - 0:29:24 Text: Oh, interesting.
0:29:24 - 0:29:25 Text: Yes.
0:29:25 - 0:29:27 Text: How do we know where to find the failure cases?
0:29:27 - 0:29:28 Text: That's a good question.
0:29:28 - 0:29:36 Text: I mean, I think I agree with Chris that actually thinking about what is interesting about things
0:29:36 - 0:29:39 Text: in language is one way to do it.
0:29:39 - 0:29:47 Text: I mean, the heuristics that we saw in our language model — sorry, in our NLI models with
0:29:47 - 0:29:56 Text: HANS — you can kind of imagine that if the model was sort of ignoring facts about
0:29:56 - 0:30:01 Text: language and just doing this sort of rough bag of words with some extra magic,
0:30:01 - 0:30:05 Text: then it would do about as badly as it is doing here.
0:30:05 - 0:30:12 Text: And these sorts of ideas — understanding that the statement "if the artist slept,
0:30:12 - 0:30:17 Text: the actor ran" does not imply "the artist slept" — are the kind of thing that maybe you'd think
0:30:17 - 0:30:22 Text: up on your own, but also that you'd spend time pondering and thinking broad
0:30:22 - 0:30:29 Text: thoughts about in linguistics curricula as well.
0:30:29 - 0:30:36 Text: Anything else, Chris?
0:30:36 - 0:30:42 Text: So there's also — well, I guess someone was also saying, I think it's about the sort of
0:30:42 - 0:30:48 Text: intervening verbs example, or intervening nouns, sorry, example — that the data set itself
0:30:48 - 0:30:53 Text: probably includes mistakes with more attractors.
0:30:53 - 0:30:55 Text: Yeah, yeah, that's a good point.
0:30:55 - 0:31:03 Text: Yeah, because humans make more and more mistakes as the number of attractors gets larger.
0:31:03 - 0:31:10 Text: On the other hand, I think that the mistakes are fewer in written text than in spoken.
0:31:10 - 0:31:14 Text: Maybe I'm just making that up, but that's what I think.
0:31:14 - 0:31:19 Text: But yeah, it would be interesting to actually go through that test set and see how many
0:31:19 - 0:31:24 Text: of the errors the really strong model makes are actually due to the sort of observed form
0:31:24 - 0:31:25 Text: being incorrect.
0:31:25 - 0:31:32 Text: I'd be super curious.
0:31:32 - 0:31:36 Text: Okay, should I move on?
0:31:36 - 0:31:47 Text: Yep, great.
0:31:47 - 0:31:55 Text: Okay, so what does it feel like we're doing when we are kind of constructing these sort
0:31:55 - 0:31:59 Text: of bespoke, small, careful test sets for various phenomena?
0:31:59 - 0:32:03 Text: Well, it sort of feels like unit testing.
0:32:03 - 0:32:13 Text: And in fact, this sort of idea has been brought to the fore, you might say, in NLP: unit tests,
0:32:13 - 0:32:15 Text: but for these NLP neural networks.
0:32:15 - 0:32:21 Text: And in particular, the paper here that I'm citing at the bottom suggests this minimum
0:32:21 - 0:32:22 Text: functionality test.
0:32:22 - 0:32:28 Text: You want a small test set that targets a specific behavior that should sound like some of the
0:32:28 - 0:32:31 Text: things that we were, that we've already talked about.
0:32:31 - 0:32:34 Text: But in this case, we're going to get even more specific.
0:32:34 - 0:32:36 Text: So here's a single test case.
0:32:36 - 0:32:42 Text: We're going to have an expected label, what was actually predicted, whether the model passed
0:32:42 - 0:32:44 Text: this unit test.
0:32:44 - 0:32:47 Text: And the labels are going to be sentiment analysis here.
0:32:47 - 0:32:52 Text: So negative label, positive label, or neutral, or the three options.
0:32:52 - 0:32:57 Text: And the unit test is going to consist simply of sentences that follow this template.
0:32:57 - 0:33:02 Text: I, then a negation, then a positive verb, and then the thing.
0:33:02 - 0:33:07 Text: So "I" plus a negated positive verb means negative sentiment, right?
0:33:07 - 0:33:08 Text: And so here's an example.
0:33:08 - 0:33:11 Text: I can't say I recommend the food.
0:33:11 - 0:33:13 Text: The expected label is negative.
0:33:13 - 0:33:17 Text: The answer that the model provided — and this is, I think, a commercial sentiment analysis
0:33:17 - 0:33:18 Text: system —
0:33:18 - 0:33:19 Text: was positive.
0:33:19 - 0:33:21 Text: So it's pretty confidently positive.
0:33:21 - 0:33:24 Text: And then I didn't love the flight.
0:33:24 - 0:33:30 Text: The expected label was negative, and then the predicted answer was neutral.
0:33:30 - 0:33:35 Text: And this commercial sentiment analysis system gets a lot of what, well, you could imagine are
0:33:35 - 0:33:38 Text: pretty reasonably simple examples wrong.
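A minimum functionality test of this kind can be sketched in a few lines. The `model` below is a deliberately naive stub, not any real commercial system; it ignores negation, which is exactly the failure mode the test is designed to expose. The template slots are illustrative.

```python
# CheckList-style minimum functionality test (MFT) sketch:
# templated "I {negation} {positive verb} {thing}" sentences,
# each with expected label "negative".

def model(sentence):
    # Stub classifier: sees a positive verb, says "positive",
    # ignoring negation entirely -- the bug the MFT catches.
    return "positive" if ("recommend" in sentence or "love" in sentence) else "neutral"

negations = ["can't say I", "didn't"]
positive_verbs = ["recommend", "love"]
things = ["the food", "the flight"]

failures = []
for neg in negations:
    for verb in positive_verbs:
        for thing in things:
            sentence = f"I {neg} {verb} {thing}."
            expected = "negative"  # negated positive verb
            predicted = model(sentence)
            if predicted != expected:
                failures.append((sentence, expected, predicted))

total = len(negations) * len(positive_verbs) * len(things)
print(f"MFT failures: {len(failures)}/{total}")
```

Swapping the stub for a real sentiment API turns this into the kind of test-case interface described next.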
0:33:38 - 0:33:44 Text: And so what Ribeiro et al. 2020 showed is that they could actually provide this
0:33:44 - 0:33:50 Text: framework of building test cases for NLP models to ML engineers
0:33:50 - 0:33:53 Text: working on these products.
0:33:53 - 0:34:00 Text: And given that interface, they would actually find bugs — you know, bugs being categories
0:34:00 - 0:34:02 Text: of high error, right?
0:34:02 - 0:34:06 Text: Find bugs in their models that they could then kind of try to go and fix.
0:34:06 - 0:34:10 Text: And that this was kind of an efficient way of trying to find things that were simple
0:34:10 - 0:34:16 Text: and still wrong with what should be pretty sophisticated neural systems.
0:34:16 - 0:34:21 Text: So I really like this, and it's sort of a nice way of thinking more specifically about
0:34:21 - 0:34:27 Text: what are the capabilities in sort of precise terms of our models.
0:34:27 - 0:34:33 Text: And altogether, now you've seen problems in natural language inference.
0:34:33 - 0:34:37 Text: You've seen language models actually perform pretty well at the language modeling objective.
0:34:37 - 0:34:42 Text: But then you see, you just saw an example of a commercial sentiment analysis system
0:34:42 - 0:34:45 Text: that sort of should do better and doesn't.
0:34:45 - 0:34:52 Text: And this comes with a really, I think, broad and important takeaway, which is: if you get
0:34:52 - 0:34:58 Text: high accuracy on the in-domain test set, you are not guaranteed high accuracy on even
0:34:58 - 0:35:05 Text: what you might consider to be reasonable out-of-domain evaluations.
0:35:05 - 0:35:08 Text: And life is always out of domain.
0:35:08 - 0:35:12 Text: And if you're building a system that you give to users, it's immediately out of
0:35:12 - 0:35:17 Text: domain, at the very least because it's trained on text that's now older than the things
0:35:17 - 0:35:18 Text: that the users are now saying.
0:35:18 - 0:35:23 Text: So it's a really, really important takeaway that your sort of benchmark accuracy is a
0:35:23 - 0:35:28 Text: single number that does not guarantee good performance on a wide variety of things.
0:35:28 - 0:35:32 Text: And from a, what are our neural networks doing perspective?
0:35:32 - 0:35:36 Text: One way to think about it is that models seem to be learning the data set, fitting sort
0:35:36 - 0:35:42 Text: of the fine-grained, sort of heuristics and statistics that help it fit this one data
0:35:42 - 0:35:44 Text: set as opposed to learning the task.
0:35:44 - 0:35:48 Text: So humans can perform natural language inference if you give them examples from whatever data
0:35:48 - 0:35:49 Text: set.
0:35:49 - 0:35:54 Text: You know, once you've told them how to do the task, they'll be very generally strong at
0:35:54 - 0:35:55 Text: it.
0:35:55 - 0:36:01 Text: But you take your MNLI model and you test it on HANS and it got, you know, whatever that
0:36:01 - 0:36:03 Text: was, below chance accuracy.
0:36:03 - 0:36:05 Text: That's not the kind of thing that you want to see.
0:36:05 - 0:36:10 Text: So it definitely learns the data set well because the accuracy in domain is high.
0:36:10 - 0:36:17 Text: But our models are seemingly not frequently learning, sort of the mechanisms that we would
0:36:17 - 0:36:19 Text: like them to be learning.
0:36:19 - 0:36:23 Text: Last week, we heard about language models and sort of the implicit knowledge that they
0:36:23 - 0:36:26 Text: encode about the world through pre-training.
0:36:26 - 0:36:30 Text: And one of the ways that we sought to interact with language models was providing them with
0:36:30 - 0:36:36 Text: a prompt like Dante was born in mask and then seeing if it puts high probability on the
0:36:36 - 0:36:42 Text: correct continuation, which requires you to access knowledge about where Dante was
0:36:42 - 0:36:43 Text: born.
0:36:43 - 0:36:47 Text: And we didn't frame it this way last week, but this fits into the set of behavioral studies
0:36:47 - 0:36:49 Text: that we've done so far.
0:36:49 - 0:36:51 Text: This is a specific kind of input.
0:36:51 - 0:36:54 Text: You could ask this for multiple relations and multiple people.
0:36:54 - 0:36:56 Text: You could swap out Dante for other people.
0:36:56 - 0:37:02 Text: You could swap out born in for, I don't know, died in or something.
0:37:02 - 0:37:04 Text: And then you can, there are like test suites again.
0:37:04 - 0:37:07 Text: And so it's all connected.
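A tiny sketch of that kind of knowledge-probe test suite: fill a cloze template, swapping subjects and relations. Here `predict_mask` is a stub lookup against a hand-written fact table (a real probe would ask a masked LM to rank vocabulary items for the `[MASK]` slot); both the function and the facts are illustrative.

```python
# Behavioral knowledge probe sketch: templated cloze prompts.

FACTS = {
    ("Dante", "was born in"): "Florence",
    ("Dante", "died in"): "Ravenna",
}

def predict_mask(prompt):
    # Stub "LM": string-matches against the fact table. A real
    # model would return its top prediction for [MASK].
    for (subj, rel), obj in FACTS.items():
        if subj in prompt and rel in prompt:
            return obj
    return "unknown"

for subj in ["Dante"]:
    for rel in ["was born in", "died in"]:
        prompt = f"{subj} {rel} [MASK]."
        print(prompt, "->", predict_mask(prompt))
```

The point is the structure: one template, many substitutions, checked automatically — the same shape as the behavioral test suites above.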
0:37:07 - 0:37:11 Text: OK, so I won't go too deep into sort of the knowledge of language models in terms of
0:37:11 - 0:37:14 Text: world knowledge because we've gone over it some.
0:37:14 - 0:37:20 Text: But when you're thinking about ways of interacting with your models, this sort of behavioral study
0:37:20 - 0:37:22 Text: can be very, very general.
0:37:22 - 0:37:27 Text: Even though, remember, we're at still this highest level of abstraction where we're just
0:37:27 - 0:37:30 Text: looking at the probability distributions that are defined.
0:37:30 - 0:37:33 Text: All right.
0:37:33 - 0:37:38 Text: So now we'll go into, so we've sort of looked at understanding in fine grain areas what
0:37:38 - 0:37:41 Text: our model is actually doing.
0:37:41 - 0:37:48 Text: What about sort of why for an individual input is it getting the answer right or wrong?
0:37:48 - 0:37:52 Text: And then are there changes to the inputs that look fine to humans, but actually make the
0:37:52 - 0:37:55 Text: models do a bad job?
0:37:55 - 0:38:02 Text: So one study that I love to reference that really draws back into our original motivation
0:38:02 - 0:38:07 Text: of using LSTM networks instead of simple recurrent neural networks was that they could use
0:38:07 - 0:38:10 Text: long context.
0:38:10 - 0:38:15 Text: But like how long is your long short term memory?
0:38:15 - 0:38:23 Text: And the idea of Khandelwal et al. 2018 was: shuffle or remove the context that's farther
0:38:23 - 0:38:29 Text: than some k words away, varying k.
0:38:29 - 0:38:35 Text: And if the accuracy, if the predictive ability of your language model, the perplexity,
0:38:35 - 0:38:39 Text: right, doesn't change once you do that, it means the model wasn't actually using that
0:38:39 - 0:38:40 Text: context.
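The ablation logic can be sketched as follows. The `loss` function is a toy stand-in for a real LM's negative log-likelihood of the target word (here it only "reads" the last three context tokens, mimicking a short effective memory); the corruption functions are the actual method.

```python
import random

# Sketch of the shuffle/remove-far-context ablation.

def loss(context, target):
    # Toy stand-in for LM loss: only the last 3 context tokens
    # matter, mimicking a model with a short memory window.
    return 0.0 if "trigger" in context[-3:] else 1.0

def shuffle_far(context, k, seed=0):
    # Shuffle everything farther than k tokens from the target.
    far, near = list(context[:-k]), list(context[-k:])
    random.Random(seed).shuffle(far)
    return far + near

def remove_far(context, k):
    # Remove everything farther than k tokens from the target.
    return list(context[-k:])

context = ["a"] * 10 + ["trigger", "b", "c"]
base = loss(context, "next")

for k in (3, 5):
    delta = loss(shuffle_far(context, k), "next") - base
    print(f"shuffle beyond k={k}: loss increase {delta}")  # model doesn't notice

for k in (1, 3):
    delta = loss(remove_far(context, k), "next") - base
    print(f"remove beyond k={k}: loss increase {delta}")
```

A zero loss increase means the model wasn't using the corrupted context; a positive increase means it was — exactly the quantity plotted on the y-axis below.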
0:38:40 - 0:38:42 Text: I think this is so cool.
0:38:42 - 0:38:48 Text: So on the x-axis, we've got how far away from the word that you're trying to predict
0:38:48 - 0:38:54 Text: you are actually sort of corrupting — shuffling or removing stuff from the sequence.
0:38:54 - 0:38:57 Text: And then on the y-axis is the increase in loss.
0:38:57 - 0:39:03 Text: So if the increase in loss is zero, it means that the model was not using the thing that
0:39:03 - 0:39:08 Text: you just removed because if it was using it, it would now do worse without it, right?
0:39:08 - 0:39:13 Text: And so, in the blue line here, if you shuffle the history that's farther
0:39:13 - 0:39:18 Text: away than 50 words, the model does not even notice.
0:39:18 - 0:39:20 Text: I think that's really interesting.
0:39:20 - 0:39:25 Text: One, it says everything past 50 words of this LSTM language model, you could have given
0:39:25 - 0:39:28 Text: it in random order and it wouldn't have noticed.
0:39:28 - 0:39:32 Text: And then two it says that if you're closer than that, it actually is making use of the
0:39:32 - 0:39:33 Text: word order.
0:39:33 - 0:39:36 Text: That's a pretty long memory, okay, that's really interesting.
0:39:36 - 0:39:42 Text: And then if you actually remove the words entirely, you can kind of notice that the words
0:39:42 - 0:39:45 Text: are missing up to 200 words away.
0:39:45 - 0:39:48 Text: So the model doesn't care about the order they're in, but it does
0:39:48 - 0:39:50 Text: care whether they're there or not.
0:39:50 - 0:39:54 Text: And so this is an evaluation of, well, do LSTMs have long term memory?
0:39:54 - 0:40:01 Text: Well, this one at least has effectively no longer than 200 words of memory, but also
0:40:01 - 0:40:02 Text: no less.
0:40:02 - 0:40:07 Text: So very cool.
0:40:07 - 0:40:09 Text: So that's like a general study for a single model.
0:40:09 - 0:40:15 Text: It talks about, it's sort of average behavior over a wide range of examples, but we want
0:40:15 - 0:40:18 Text: to talk about individual predictions on individual inputs.
0:40:18 - 0:40:19 Text: So let's talk about that.
0:40:19 - 0:40:26 Text: So one way of interpreting why did my model make this decision, that's very popular, is
0:40:26 - 0:40:31 Text: for a single example, what parts of the input actually led to the decision?
0:40:31 - 0:40:34 Text: And this is where we come in with saliency maps.
0:40:34 - 0:40:40 Text: So saliency map provides a score for each word indicating its importance to the model's
0:40:40 - 0:40:41 Text: prediction.
0:40:41 - 0:40:44 Text: So you've got something like Bert here.
0:40:44 - 0:40:45 Text: You've got Bert.
0:40:45 - 0:40:47 Text: Bert is making a prediction for this mask.
0:40:47 - 0:40:52 Text: "The [MASK] rushed to the emergency room to see her patient."
0:40:52 - 0:40:58 Text: And the prediction that the model is making is, with 47%, it's going to be "nurse"
0:40:58 - 0:41:04 Text: here in the mask, or maybe "woman", or "doctor", or "mother", or "girl".
0:41:04 - 0:41:08 Text: And then the saliency map is being visualized here in orange.
0:41:08 - 0:41:13 Text: According to this method of saliency, called simple gradients, which we'll get into: "emergency",
0:41:13 - 0:41:18 Text: "her", and the SEP token — don't worry about the SEP token for now, but "emergency" and
0:41:18 - 0:41:21 Text: "her" are the important words, apparently.
0:41:21 - 0:41:25 Text: And the SEP token shows up in every sentence, so I'm not going to, yeah.
0:41:25 - 0:41:30 Text: And so these two together are, according to this method, what's important for the model
0:41:30 - 0:41:33 Text: to make this prediction to mask.
0:41:33 - 0:41:38 Text: And you can see maybe some statistics, biases, etc., that is picked up in the predictions
0:41:38 - 0:41:41 Text: and then have it mapped out onto the sentence.
0:41:41 - 0:41:47 Text: And this is, well, it seems like it's really helping interpretability.
0:41:47 - 0:41:52 Text: And yeah, I think that this is sort of a very useful tool.
0:41:52 - 0:42:00 Text: And actually, this is part of a demo from AllenNLP that allows you to do this yourself
0:42:00 - 0:42:02 Text: for any sentence that you want.
0:42:02 - 0:42:05 Text: So what's this way of making saliency maps?
0:42:05 - 0:42:07 Text: We're not going to go through them all — there are so many ways to do it.
0:42:07 - 0:42:12 Text: We're going to take a very simple one and work through why it sort of makes sense.
0:42:12 - 0:42:17 Text: So the sort of issue is how do you define importance?
0:42:17 - 0:42:20 Text: What does it mean to be important to the model's prediction?
0:42:20 - 0:42:22 Text: And here's one way of thinking about it.
0:42:22 - 0:42:23 Text: It's called the simple gradient method.
0:42:23 - 0:42:25 Text: It's got a little formula.
0:42:25 - 0:42:27 Text: You got words x1 to xn.
0:42:27 - 0:42:28 Text: Okay?
0:42:28 - 0:42:31 Text: And then you got a model score for a given output class.
0:42:31 - 0:42:36 Text: So maybe you've got, in the birth example, each output class was each output word that
0:42:36 - 0:42:38 Text: you could possibly predict.
0:42:38 - 0:42:43 Text: And then you take the norm of the gradient of the score with respect to each word.
0:42:43 - 0:42:53 Text: Okay, so what we're saying here is the score is sort of the unnormalized probability for
0:42:53 - 0:42:55 Text: that class.
0:42:55 - 0:42:56 Text: Okay, so you've got a single class.
0:42:56 - 0:43:00 Text: You're taking the score — like how likely that class is, not yet normalized by how likely everything
0:43:00 - 0:43:02 Text: else is, sort of.
0:43:02 - 0:43:07 Text: So, gradient: how much is the score going to change if I move the word a little bit in one direction
0:43:07 - 0:43:11 Text: or another; and then you take the norm to get a scalar from a vector.
0:43:11 - 0:43:12 Text: So it looks like this.
0:43:12 - 0:43:17 Text: So salience of word i, you have the norm bars on the outside, gradient with respect to
0:43:17 - 0:43:19 Text: xi.
0:43:19 - 0:43:25 Text: So that's if I change a little bit locally xi, how much does my score change?
0:43:25 - 0:43:30 Text: So the idea is that a high gradient norm means that if I were to change it locally, I'd
0:43:30 - 0:43:32 Text: affect the score a lot.
0:43:32 - 0:43:34 Text: That means it was very important to the decision.
0:43:34 - 0:43:35 Text: Let's visualize this a little bit.
0:43:35 - 0:43:39 Text: So here on the y axis we've got loss.
0:43:39 - 0:43:43 Text: Just the loss of the model, sorry, this should be score.
0:43:43 - 0:43:44 Text: It should be score.
0:43:44 - 0:43:47 Text: And on the x axis you've got word space.
0:43:47 - 0:43:53 Text: The word space is like sort of a flattening of the ability to move your word embedding
0:43:53 - 0:43:54 Text: in thousand dimensional space.
0:43:54 - 0:43:58 Text: So I've just plotted it here in one dimension.
0:43:58 - 0:44:04 Text: Now a high saliency thing, you can see that the relationship between what should be score
0:44:04 - 0:44:09 Text: and moving the word in word space, you move it a little bit on the x axis and the score
0:44:09 - 0:44:10 Text: changes a lot.
0:44:10 - 0:44:13 Text: That's that derivative, that's the gradient, awesome, love it.
0:44:13 - 0:44:20 Text: Low saliency, you move the word around locally and the score doesn't change.
0:44:20 - 0:44:23 Text: So here's what the interpretation is.
0:44:23 - 0:44:27 Text: That means that the actual identity of this word wasn't that important to the prediction
0:44:27 - 0:44:31 Text: because I could have changed it and the score wouldn't have changed.
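Here is a minimal sketch of the simple gradients method on a toy linear "model", where the gradient is available in closed form: with score(x_1..x_n) = sum_i w_i · x_i, the gradient with respect to word embedding x_i is exactly w_i, so saliency(i) = ||w_i||. A real model would get these gradients from autograd; the words, embeddings, and weights here are all made up for illustration.

```python
import math

# Simple-gradients saliency on a toy linear score function.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def score(embeddings, weights):
    # Unnormalized class score: sum_i w_i . x_i
    return sum(dot(w, x) for w, x in zip(weights, embeddings))

def saliency(weights):
    # ||d score / d x_i|| = ||w_i|| for the linear model above.
    return [math.sqrt(dot(w, w)) for w in weights]

words = ["the", "emergency", "room"]
embeddings = [[0.1, 0.2], [0.9, -0.5], [0.3, 0.0]]
weights = [[0.0, 0.1], [2.0, -1.0], [0.2, 0.2]]

for word, s in zip(words, saliency(weights)):
    print(f"{word}: saliency {s:.3f}")
```

The word whose weight vector has the largest norm ("emergency" in this toy setup) gets the highest saliency: moving its embedding locally changes the score the most, which is exactly the intuition above.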
0:44:31 - 0:44:34 Text: Now why are there more methods than this?
0:44:34 - 0:44:38 Text: Because honestly reading that sounds awesome, that sounds great.
0:44:38 - 0:44:45 Text: There are sort of a lot of issues with this kind of method in lots of ways of getting around
0:44:45 - 0:44:46 Text: them.
0:44:46 - 0:44:47 Text: Here's one issue.
0:44:47 - 0:44:52 Text: It's not perfect because well maybe your linear approximation that the gradient gives
0:44:52 - 0:44:56 Text: you holds only very, very locally.
0:44:56 - 0:45:00 Text: So here the gradient is zero.
0:45:00 - 0:45:05 Text: So this is a low-saliency word, because it's at the bottom of this parabola — but if I were to
0:45:05 - 0:45:10 Text: move even a little bit in either direction, the score would shoot up.
0:45:10 - 0:45:11 Text: Is this not an important word?
0:45:11 - 0:45:19 Text: It seems important to be right there as opposed to anywhere else even sort of nearby in
0:45:19 - 0:45:22 Text: order for the score not to go up.
0:45:22 - 0:45:26 Text: The simple gradients method won't capture this because it just looks at the gradient which
0:45:26 - 0:45:29 Text: is that zero right there.
0:45:29 - 0:45:30 Text: Okay.
0:45:30 - 0:45:35 Text: But if you want to look into more, there's a bunch of different methods that are sort of
0:45:35 - 0:45:37 Text: applied in these papers.
0:45:37 - 0:45:42 Text: And I think that there's a good tool for the toolbox.
0:45:42 - 0:45:43 Text: Okay.
0:45:43 - 0:45:47 Text: So that is one way of explaining a prediction.
0:45:47 - 0:45:55 Text: And it has some issues like why are individual words being scored as opposed to phrases or
0:45:55 - 0:45:57 Text: something like that.
0:45:57 - 0:45:59 Text: But for now, we're going to move on to another type of explanation.
0:45:59 - 0:46:02 Text: And I'm going to check the time.
0:46:02 - 0:46:03 Text: Okay.
0:46:03 - 0:46:04 Text: Cool.
0:46:04 - 0:46:06 Text: Actually, yeah, let me pause for a second.
0:46:06 - 0:46:10 Text: Any questions about this?
0:46:10 - 0:46:16 Text: I mean, the earlier on, they were a couple of questions.
0:46:16 - 0:46:22 Text: One of them was: what are your thoughts on whether looking at attention weights is a methodologically
0:46:22 - 0:46:28 Text: rigorous way of determining the importance the model places on certain tokens?
0:46:28 - 0:46:32 Text: It seems like there's some back and forth in the literature.
0:46:32 - 0:46:34 Text: That is a great question.
0:46:34 - 0:46:39 Text: And I probably won't engage with that question as much as I could if we had like a second
0:46:39 - 0:46:40 Text: lecture on this.
0:46:40 - 0:46:45 Text: I actually will provide some attention analyses and tell you they're interesting.
0:46:45 - 0:46:54 Text: And then I'll say a little bit about why they can be interesting without being sort of
0:46:54 - 0:47:04 Text: maybe sort of the end all of analysis of where information is flowing in a transformer,
0:47:04 - 0:47:05 Text: for example.
0:47:05 - 0:47:11 Text: I think the debate is something that we would have to get into in a much longer period of
0:47:11 - 0:47:12 Text: time.
0:47:12 - 0:47:16 Text: Look at the slides that I show about attention and the caveats that I provide and let me
0:47:16 - 0:47:19 Text: know if that answers your question first because we have quite a number of slides on it.
0:47:19 - 0:47:24 Text: And if not, please, please ask again and we can chat more about it.
0:47:24 - 0:47:27 Text: And maybe you can go on.
0:47:27 - 0:47:28 Text: Great.
0:47:28 - 0:47:29 Text: Okay.
0:47:29 - 0:47:33 Text: So, I think this is a really fascinating question which also gets at what was important
0:47:33 - 0:47:39 Text: about the input but in actually kind of an even more direct way, which is, could I just
0:47:39 - 0:47:42 Text: keep some minimal part of the input and get the same answer.
0:47:42 - 0:47:44 Text: So, here's an example from Squad.
0:47:44 - 0:47:51 Text: You have this passage in 1899, John Jacob Astor IV invested $100,000 for Tesla.
0:47:51 - 0:47:52 Text: Okay.
0:47:52 - 0:47:55 Text: And then the answer that is being predicted by the model is going to always be in blue
0:47:55 - 0:47:57 Text: in these examples, Colorado Springs experiments.
0:47:57 - 0:47:59 Text: So, you got this passage.
0:47:59 - 0:48:03 Text: And the question is what did Tesla spend Astor's money on?
0:48:03 - 0:48:06 Text: That's why the prediction is Colorado Springs experiments.
0:48:06 - 0:48:10 Text: The model gets the answer right, which is nice.
0:48:10 - 0:48:14 Text: And we would like to think it's because it's doing some kind of reading comprehension.
0:48:14 - 0:48:16 Text: But here's the issue.
0:48:16 - 0:48:23 Text: It turns out, based on this fascinating paper, that if you just reduce the question to
0:48:23 - 0:48:30 Text: "did", you actually get exactly the same answer.
0:48:30 - 0:48:36 Text: And in fact, with the original question, the model had sort of a 0.78 confidence, you
0:48:36 - 0:48:38 Text: know, probability in that answer.
0:48:38 - 0:48:44 Text: And with the reduced question "did", you get even higher confidence.
0:48:44 - 0:48:48 Text: And that, if you give a human this, they would not be able to know really what you're
0:48:48 - 0:48:49 Text: trying to ask about.
0:48:49 - 0:48:53 Text: So, it seems like something is going really wonky here.
0:48:53 - 0:48:54 Text: Here's another.
0:48:54 - 0:48:58 Text: So, here's sort of a very high-level overview of the method.
0:48:58 - 0:49:01 Text: In fact, it actually references our input saliency methods.
0:49:01 - 0:49:03 Text: Nice, it's connected.
0:49:03 - 0:49:09 Text: So, you iteratively remove non-salient or unimportant words.
0:49:09 - 0:49:12 Text: So here's a passage again talking about football.
0:49:12 - 0:49:13 Text: I think.
0:49:13 - 0:49:14 Text: Yeah.
0:49:14 - 0:49:16 Text: And, oh, nice.
0:49:16 - 0:49:17 Text: Okay.
0:49:17 - 0:49:20 Text: So, the question is, where did the Broncos practice for the Super Bowl, with the prediction
0:49:20 - 0:49:24 Text: being Stanford University?
0:49:24 - 0:49:25 Text: And that is correct.
0:49:25 - 0:49:27 Text: So again, seems nice.
0:49:27 - 0:49:31 Text: And now, we're not actually going to get the model to be incorrect.
0:49:31 - 0:49:36 Text: We're just going to say, how can I change this question such that I still get the same
0:49:36 - 0:49:37 Text: answer, right?
0:49:37 - 0:49:41 Text: So, I'm going to remove the word that was least important according to a saliency method.
0:49:41 - 0:49:45 Text: So, now, it's where did the practice for the super bowl?
0:49:45 - 0:49:48 Text: Already, this is sort of unanswerable because you've got two teams practicing.
0:49:48 - 0:49:50 Text: You don't even know which one you're asking about.
0:49:50 - 0:49:55 Text: So, why the model still thinks it's so confident in Stanford University makes no sense.
0:49:55 - 0:49:59 Text: But you can just sort of keep going.
0:49:59 - 0:50:07 Text: And now, I think, here, the model stops being confident in the answer Stanford University.
0:50:07 - 0:50:13 Text: But I think this is really interesting just to show that if the model is able to do this
0:50:13 - 0:50:19 Text: with very high confidence, it's not reflecting the uncertainty that really should be there
0:50:19 - 0:50:22 Text: because you can't know what you're even asking about.
0:50:22 - 0:50:23 Text: Okay.
0:50:23 - 0:50:26 Text: So, what was important to make this answer?
0:50:26 - 0:50:32 Text: Well, at least these parts were important because you could keep just those parts and
0:50:32 - 0:50:34 Text: get the same answer, fascinating.
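The input-reduction loop itself is simple to sketch: repeatedly drop the least salient word as long as the model's answer is unchanged. Both the QA `predict` function and the saliency ranking below are stubs invented for illustration (the stub latches onto a single keyword, which is the pathology being demonstrated); a real run would use a trained model and a real saliency method.

```python
# Input reduction sketch: remove least-salient words while the
# prediction stays the same.

def predict(question):
    # Stub QA "model": answers from one keyword, ignoring the
    # rest of the question entirely.
    return "Stanford University" if "practice" in question else "unknown"

def least_salient_index(question):
    # Stub saliency ranking: every word except the keyword is
    # considered unimportant; return the first such word.
    words = question.split()
    for i, w in enumerate(words):
        if w != "practice":
            return i
    return 0

def reduce_input(question):
    answer = predict(question)
    words = question.split()
    while len(words) > 1:
        i = least_salient_index(" ".join(words))
        candidate = words[:i] + words[i + 1:]
        if predict(" ".join(candidate)) != answer:
            break  # stop once removal would flip the answer
        words = candidate
    return " ".join(words)

q = "where did the Broncos practice for the Super Bowl"
print(reduce_input(q))
```

For this stub, the question collapses all the way down to a single word while the model stays confident in the same answer — the behavior the paper found in real QA systems.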
0:50:34 - 0:50:36 Text: All right.
0:50:36 - 0:50:44 Text: So, that's sort of the end of the admittedly brief section on thinking about input saliency
0:50:44 - 0:50:45 Text: methods and similar things.
0:50:45 - 0:50:48 Text: Now, we're going to talk about actually breaking models and understanding models by breaking
0:50:48 - 0:50:49 Text: them.
0:50:49 - 0:50:50 Text: Okay.
0:50:50 - 0:50:51 Text: Cool.
0:50:51 - 0:50:58 Text: So, if we have a passage here — Peyton Manning became the first quarterback, something Super
0:50:58 - 0:51:02 Text: Bowl, age 39, past record held by John Elway.
0:51:02 - 0:51:03 Text: Again, we're doing question answering.
0:51:03 - 0:51:05 Text: We got this question.
0:51:05 - 0:51:08 Text: What was the name of the quarterback who was 38 in the Super Bowl?
0:51:08 - 0:51:10 Text: The prediction is correct.
0:51:10 - 0:51:11 Text: Looks good.
0:51:11 - 0:51:16 Text: Now, we're not going to change the question to try to sort of make the question nonsensical
0:51:16 - 0:51:23 Text: while keeping the same answer, instead we're going to change the passage by adding the
0:51:23 - 0:51:25 Text: sentence at the end, which really shouldn't distract anyone.
0:51:25 - 0:51:30 Text: This is, you know, "Quarterback Jeff Dean had jersey number 37
0:51:30 - 0:51:31 Text: in Champ Bowl."
0:51:31 - 0:51:34 Text: So, this just doesn't, it's really not even related.
0:51:34 - 0:51:40 Text: But now the prediction is Jeff Dean for our nice QA model.
0:51:40 - 0:51:46 Text: And so, this shows as well that it seems like maybe there's this end-of-the-passage
0:51:46 - 0:51:50 Text: bias as to what the answer should be, for example.
0:51:50 - 0:51:55 Text: And so, that's an adversarial example where we flipped the prediction by adding something
0:51:55 - 0:51:57 Text: that is innocuous to humans.
0:51:57 - 0:52:01 Text: And so, the higher-level takeaway is: it seems like the QA model
0:52:01 - 0:52:02 Text: we had, which seemed good,
0:52:02 - 0:52:07 Text: is not actually performing QA how we want it to, even though its in-domain accuracy
0:52:07 - 0:52:10 Text: was good.
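As a sketch of what this kind of stress test looks like in code — note that `predict` here is not a real QA model but a deliberately naive, hypothetical stand-in that just grabs the last pair of capitalized words, mimicking the end-of-passage bias described above:

```python
# Adversarial-distractor check in the spirit of the example above.
# `predict` is a toy stand-in for a QA model: it returns the last pair of
# adjacent capitalized words in the passage (an end-of-passage "bias").
def predict(passage, question):
    words = passage.replace(".", "").split()
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])
             if a[0].isupper() and b[0].isupper()]
    return pairs[-1] if pairs else ""

def adversarial_check(predict_fn, passage, question, distractor):
    """Compare the model's answer before and after appending an innocuous
    distractor sentence; a robust model should give the same answer."""
    before = predict_fn(passage, question)
    after = predict_fn(passage + " " + distractor, question)
    return before, after, before == after

passage = "Peyton Manning became the first quarterback to win at age 39."
question = "What was the name of the quarterback who was 38?"
distractor = "Well known quarterback Jeff Dean had jersey number 37 in the game."

before, after, robust = adversarial_check(predict, passage, question, distractor)
# The toy model's prediction flips from "Peyton Manning" to "Jeff Dean".
```

The same harness pattern works with any real QA model in place of the toy `predict`.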
0:52:10 - 0:52:12 Text: And here's another example.
0:52:12 - 0:52:19 Text: So, you've got this paragraph with a question, what has been the result of this publicity?
0:52:19 - 0:52:22 Text: The answer is increased scrutiny on teacher misconduct.
0:52:22 - 0:52:28 Text: Now instead of changing the paragraph, we're going to change the question in really,
0:52:28 - 0:52:32 Text: really seemingly insignificant ways to change the model's prediction.
0:52:32 - 0:52:39 Text: So first, "What haL been the result of this publicity?" — I've put this typo, an L. The
0:52:39 - 0:52:45 Text: answer changes to teacher misconduct. A human would likely just ignore this typo
0:52:45 - 0:52:47 Text: and give the right answer.
0:52:47 - 0:52:49 Text: And then this is really nuts.
0:52:49 - 0:52:54 Text: Instead of asking what has been the result of this publicity, if you ask what's been
0:52:54 - 0:52:59 Text: the result of this publicity, the answer also changes.
0:52:59 - 0:53:03 Text: And this is what the authors call a semantically equivalent adversary.
0:53:03 - 0:53:05 Text: This is pretty rough.
0:53:05 - 0:53:13 Text: But in general, swapping "what has" for "what's" in this QA model breaks it pretty frequently.
0:53:13 - 0:53:19 Text: And so again, when you go back and sort of re-tinker how to build your model, you're going
0:53:19 - 0:53:23 Text: to be thinking about these things, not just the sort of average accuracy.
0:53:23 - 0:53:28 Text: So next, let's talk about noise.
0:53:28 - 0:53:31 Text: Are our models robust to noise in their inputs?
0:53:31 - 0:53:32 Text: Humans are robust to noise.
0:53:32 - 0:53:36 Text: And so this is another question we can ask.
0:53:36 - 0:53:43 Text: And so you can kind of go to this popular sort of meme passed around the internet from time
0:53:43 - 0:53:49 Text: to time, where you have all the letters in these words scrambled. It says, according to
0:53:49 - 0:53:54 Text: research at Cambridge University, it doesn't matter in what order the letters in a word
0:53:54 - 0:53:55 Text: are.
0:53:55 - 0:54:00 Text: And so it seems like, I think I did a pretty good job there.
0:54:00 - 0:54:07 Text: And we can be robust as humans to reading and processing the language without actually
0:54:07 - 0:54:10 Text: all that much of a difficulty.
0:54:10 - 0:54:15 Text: So that's maybe something that we might want our models to also be robust to.
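A quick sketch of that letter-scrambling noise, which is also a handy way to generate noisy test inputs for your own models (function names here are just illustrative):

```python
import random

def scramble_word(word, rng):
    # Keep the first and last letters fixed and shuffle the interior,
    # as in the Cambridge meme.
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble_sentence(sentence, seed=0):
    # Seeded so the noisy test set is reproducible across runs.
    rng = random.Random(seed)
    return " ".join(scramble_word(w, rng) for w in sentence.split())

scrambled = scramble_sentence("according to research at Cambridge University")
```

Every output word keeps its first and last letter and the same multiset of letters, so humans can usually still read it.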
0:54:15 - 0:54:19 Text: And it's very practical as well.
0:54:19 - 0:54:23 Text: Noise is a part of all NLP systems inputs at all times.
0:54:23 - 0:54:28 Text: There's just no such thing, effectively, as having users, for example, and not having
0:54:28 - 0:54:30 Text: any noise.
0:54:30 - 0:54:36 Text: And so there's a study that was performed on some popular machine translation models where
0:54:36 - 0:54:42 Text: you train machine translation models in French, German and Czech, I think all to English.
0:54:42 - 0:54:43 Text: And you get BLEU scores.
0:54:43 - 0:54:47 Text: These BLEU scores will look a lot better than the ones in your assignment four because there's much,
0:54:47 - 0:54:48 Text: much more training data.
0:54:48 - 0:54:53 Text: The idea is these are actually pretty strong machine translation systems.
0:54:53 - 0:54:56 Text: And that's an in domain clean text.
0:54:56 - 0:55:03 Text: Now if you add character swaps like the ones we saw in that sentence about Cambridge,
0:55:03 - 0:55:07 Text: the BLEU scores take a pretty harsh dive.
0:55:07 - 0:55:09 Text: Not very good.
0:55:09 - 0:55:17 Text: And even if you take a somewhat more natural typo noise distribution, you'll see
0:55:17 - 0:55:27 Text: that you're still getting 20-ish point drops in BLEU score from simply natural noise.
0:55:27 - 0:55:30 Text: And so maybe you'll go back and retrain the model on more types of noise.
0:55:30 - 0:55:32 Text: And then you ask: OK, I did that —
0:55:32 - 0:55:35 Text: is it robust to even different kinds of noise?
0:55:35 - 0:55:37 Text: These are the questions that are going to be really important.
0:55:37 - 0:55:41 Text: And it's important to know that you're able to break your model really easily so that
0:55:41 - 0:55:45 Text: you can then go and try to make it more robust.
0:55:45 - 0:55:51 Text: OK, now, let's see, 20 minutes.
0:55:57 - 0:56:01 Text: So now we're going to look at the representations of our neural networks.
0:56:01 - 0:56:06 Text: We've talked about sort of their behavior and then whether we could sort of change or
0:56:06 - 0:56:09 Text: observe reasons behind their behavior.
0:56:09 - 0:56:15 Text: Now we'll go to a lower level of abstraction, looking more at the actual vector representations that
0:56:15 - 0:56:17 Text: are being built by models.
0:56:17 - 0:56:24 Text: And we can answer a different kind of question at the very least than with the other studies.
0:56:24 - 0:56:30 Text: The first thing is related to the question I was asked about attention, which is that
0:56:30 - 0:56:33 Text: some modeling components lend themselves to inspection.
0:56:33 - 0:56:37 Text: Now this is a sentence that I chose somewhat carefully actually because in part of this
0:56:37 - 0:56:42 Text: debate, are they interpretable components?
0:56:42 - 0:56:43 Text: We'll see.
0:56:43 - 0:56:46 Text: But they lend themselves to inspection in the following way.
0:56:46 - 0:56:51 Text: You can visualize them well and you can correlate them easily with various properties.
0:56:51 - 0:56:53 Text: So let's say you have attention heads in BERT.
0:56:53 - 0:57:00 Text: This is from a really nice study that was done here where you look at attention heads
0:57:00 - 0:57:05 Text: of BERT and you say, on most sentences, this attention head — head 1-1 — seems to do this
0:57:05 - 0:57:08 Text: very sort of global aggregation.
0:57:08 - 0:57:11 Text: A simple kind of operation, and it does this pretty consistently.
0:57:11 - 0:57:13 Text: That's cool.
0:57:13 - 0:57:16 Text: Is it interpretable?
0:57:16 - 0:57:18 Text: Well, maybe, right?
0:57:18 - 0:57:25 Text: So it's the first layer, which means that this word found is sort of uncontextualized.
0:57:25 - 0:57:31 Text: But in deeper layers, the problem is that once you do some
0:57:31 - 0:57:37 Text: rounds of attention, you've had information mixing and flowing between words.
0:57:37 - 0:57:40 Text: And how do you know exactly what information you're combining, or what you're attending
0:57:40 - 0:57:44 Text: to, even? It's a little hard to tell.
0:57:44 - 0:57:50 Text: And saliency methods more directly evaluate the importance of inputs to the model.
0:57:50 - 0:57:54 Text: But it's still interesting to see at sort of a local mechanistic point of view what
0:57:54 - 0:57:57 Text: kinds of things are being attended to.
0:57:57 - 0:58:01 Text: So let's take another example.
0:58:01 - 0:58:02 Text: Some attention heads seem to perform simple operations.
0:58:02 - 0:58:05 Text: So you have the global aggregation here that we saw already.
0:58:05 - 0:58:09 Text: Others seem to attend pretty robustly to the next token.
0:58:09 - 0:58:10 Text: Cool.
0:58:10 - 0:58:12 Text: Next token is a great signal.
0:58:12 - 0:58:14 Text: Some heads attend to the [SEP] token.
0:58:14 - 0:58:17 Text: So here you have attending to [SEP].
0:58:17 - 0:58:18 Text: And then maybe some attend to periods.
0:58:18 - 0:58:23 Text: Maybe that's sort of a way of splitting sentences apart and things like that.
0:58:23 - 0:58:25 Text: Not things that are hard to do.
0:58:25 - 0:58:30 Text: But things that some attention heads seem to pretty robustly perform.
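A minimal sketch of how you might quantify these head behaviors, assuming you already have attention weights in hand — here a hand-built toy matrix rather than real BERT weights:

```python
import numpy as np

def head_behavior_scores(attn):
    """attn: [heads, seq, seq] attention weights (each row sums to 1).
    Returns, per head, the average attention mass on (i) the next token
    and (ii) the first position (a [CLS]/[SEP]-style special token)."""
    n_heads, seq, _ = attn.shape
    next_tok = attn[:, np.arange(seq - 1), np.arange(1, seq)].mean(axis=1)
    first_tok = attn[:, :, 0].mean(axis=1)
    return next_tok, first_tok

# Toy example: head 0 attends to the next token, head 1 to position 0.
seq = 5
head_next = np.zeros((seq, seq))
head_next[np.arange(seq - 1), np.arange(1, seq)] = 1.0
head_next[-1, -1] = 1.0  # last token has no next token; attend to itself
head_first = np.zeros((seq, seq))
head_first[:, 0] = 1.0
attn = np.stack([head_next, head_first])
next_scores, first_scores = head_behavior_scores(attn)
```

Running the same statistics over many sentences is how a study like this labels a head as a "next-token head" or a "[SEP] head".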
0:58:30 - 0:58:35 Text: Again now though, deep in the network, what's actually represented at this period at layer
0:58:35 - 0:58:37 Text: 11?
0:58:37 - 0:58:38 Text: Little unclear.
0:58:38 - 0:58:39 Text: Little unclear.
0:58:39 - 0:58:41 Text: Okay.
0:58:41 - 0:58:46 Text: So some heads though are correlated with really interesting linguistic properties.
0:58:46 - 0:58:49 Text: So this head is actually attending to noun modifiers.
0:58:49 - 0:58:57 Text: So you've got this phrase: the complicated language in the huge new law.
0:58:57 - 0:59:00 Text: That's pretty fascinating.
0:59:00 - 0:59:05 Text: Even if the model is not like doing this as a causal mechanism to do syntax necessarily,
0:59:05 - 0:59:10 Text: the fact that these things so strongly correlate is actually pretty, pretty cool.
0:59:10 - 0:59:13 Text: And so what we have in all of these studies is sort of an approximate
0:59:13 - 0:59:19 Text: interpretation and a quantitative analysis allowing us to reason about very
0:59:19 - 0:59:21 Text: complicated model behavior.
0:59:21 - 0:59:24 Text: They're all approximations, but they're definitely interesting.
0:59:24 - 0:59:26 Text: One other example is that of co-reference.
0:59:26 - 0:59:29 Text: So we saw some work on co-reference.
0:59:29 - 0:59:36 Text: And it seems like this head does a pretty okay job of actually matching up co-referent
0:59:36 - 0:59:37 Text: entities.
0:59:37 - 0:59:39 Text: These are in red.
0:59:39 - 0:59:42 Text: Talks, negotiations, she, her.
0:59:42 - 0:59:43 Text: And it's not obvious how to do that.
0:59:43 - 0:59:45 Text: This is a difficult task.
0:59:45 - 0:59:50 Text: And it does so some percentage of the time.
0:59:50 - 0:59:56 Text: And again, it's sort of connecting very complex model behavior to these sort of interpretable
0:59:56 - 1:00:00 Text: summaries of correlating properties.
1:00:00 - 1:00:04 Text: Other cases you can have individual hidden units that lend themselves to interpretation.
1:00:04 - 1:00:10 Text: So here you've got a character level LSTM language model.
1:00:10 - 1:00:14 Text: Each row here is a sentence; if you can't read it, that's totally okay.
1:00:14 - 1:00:18 Text: The interpretation that you should take is that as we walk along the sentence, this single
1:00:18 - 1:00:23 Text: unit is going from I think very negative to very positive or very positive to very
1:00:23 - 1:00:24 Text: negative.
1:00:24 - 1:00:26 Text: I don't really remember.
1:00:26 - 1:00:30 Text: But it's tracking the position in the line.
1:00:30 - 1:00:34 Text: So it's just a linear position unit and pretty robustly doing so across all of these
1:00:34 - 1:00:36 Text: sentences.
1:00:36 - 1:00:42 Text: So this is from a nice visualization study way back in 2016, way back.
1:00:42 - 1:00:47 Text: Here's another cell from that same LSTM language model that seems to sort of turn on inside
1:00:47 - 1:00:48 Text: quotes.
1:00:48 - 1:00:50 Text: So here's a quote and then it turns on.
1:00:50 - 1:00:53 Text: Okay, so I guess that's positive in the blue.
1:00:53 - 1:00:55 Text: End quote here.
1:00:55 - 1:00:57 Text: And then it's negative.
1:00:57 - 1:01:01 Text: Here you start with no quote, negative in the red.
1:01:01 - 1:01:03 Text: See a quote and then blue.
1:01:03 - 1:01:08 Text: Again, very interpretable, also potentially a very useful feature to keep in mind.
1:01:08 - 1:01:11 Text: And this is just an individual unit in the LSTM that you can just look at and see that
1:01:11 - 1:01:13 Text: it does this.
1:01:13 - 1:01:17 Text: Very, very interesting.
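Here's a sketch of the basic observational recipe behind these single-unit findings: correlate each hidden dimension with a property of interest and look for standouts. The data here is synthetic, with one unit wired to the property by construction, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "hidden states": 20 timesteps x 8 units. Unit 3 is wired
# (by construction, for illustration only) to track an inside-quote flag.
T, H = 20, 8
inside_quote = np.array([0] * 5 + [1] * 8 + [0] * 7, dtype=float)
hidden = rng.normal(size=(T, H))
hidden[:, 3] = 2.0 * inside_quote + 0.1 * rng.normal(size=T)

def most_correlated_unit(hidden, prop):
    # Correlate every hidden dimension with the property and return
    # the best-matching unit, as in the observational studies above.
    corrs = np.array([abs(np.corrcoef(hidden[:, j], prop)[0, 1])
                      for j in range(hidden.shape[1])])
    return int(corrs.argmax()), corrs

unit, corrs = most_correlated_unit(hidden, inside_quote)
```

With real models you'd run this over a corpus and then eyeball the top units, exactly like the quote-detection and position-tracking cells above.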
1:01:17 - 1:01:26 Text: Even farther on this, and this is actually a study by some AI and neuroscience researchers,
1:01:26 - 1:01:29 Text: we saw that LSTMs were good at subject-verb number agreement.
1:01:29 - 1:01:33 Text: Can we figure out the mechanisms by which the LSTM is solving the task?
1:01:33 - 1:01:35 Text: We actually get some insight into that.
1:01:35 - 1:01:37 Text: And so we have a word level language model.
1:01:37 - 1:01:41 Text: The word level language model is going to be a little small, but you have a sentence,
1:01:41 - 1:01:45 Text: "The boy gently and kindly greets the..."
1:01:45 - 1:01:51 Text: And this cell that's being tracked here, so it's an individual hidden unit, one dimension,
1:01:51 - 1:01:57 Text: right, is actually after it sees boy, it sort of starts to go higher.
1:01:57 - 1:02:02 Text: And then it goes down to something very small once it sees greets.
1:02:02 - 1:02:08 Text: And this cell seems to correlate with the scope of a subject-verb number agreement instance,
1:02:08 - 1:02:09 Text: effectively.
1:02:09 - 1:02:14 Text: So here, the boy that watches the dog, that watches the cat greets, you got that cell,
1:02:14 - 1:02:20 Text: again, staying high, maintaining the scope of subject until greets, at which point it
1:02:20 - 1:02:22 Text: stops.
1:02:22 - 1:02:23 Text: What allows it to do that?
1:02:23 - 1:02:28 Text: Probably some complex other dynamics in the network, but it's still a fascinating, I
1:02:28 - 1:02:30 Text: think, insight.
1:02:30 - 1:02:37 Text: And yeah, this is just neuron 1150 in this LSTM.
1:02:37 - 1:02:46 Text: Now, so those are sort of all observational studies that you could do by picking out individual
1:02:46 - 1:02:51 Text: components of the model that you can sort of just take each one of and correlating them
1:02:51 - 1:02:53 Text: with some behavior.
1:02:53 - 1:03:00 Text: Now we'll look at a general class of methods called probing by which we still sort of use
1:03:00 - 1:03:06 Text: supervised knowledge, like the knowledge of the type of co-reference that we're looking
1:03:06 - 1:03:07 Text: for.
1:03:07 - 1:03:10 Text: But instead of thinking if it correlates with something that's immediately interpretable,
1:03:10 - 1:03:16 Text: like a attention head, we're going to look into the vector representations of the model
1:03:16 - 1:03:21 Text: and see if these properties can be read out by some simple function.
1:03:21 - 1:03:26 Text: To say, oh, maybe this property was made very easily accessible by my neural network.
1:03:26 - 1:03:28 Text: So let's dig into this.
1:03:28 - 1:03:34 Text: So the general paradigm is that you've got language data that goes into some big pre-trained
1:03:34 - 1:03:36 Text: transformer with fine tuning.
1:03:36 - 1:03:39 Text: And you get SOTA results.
1:03:39 - 1:03:41 Text: SOTA means state-of-the-art.
1:03:41 - 1:03:46 Text: And so the question for the probing sort of methodology is like, if it's providing these
1:03:46 - 1:03:51 Text: general purpose language representations, what does it actually encode about language?
1:03:51 - 1:03:54 Text: Like, can we quantify this?
1:03:54 - 1:03:57 Text: Can we figure out what kinds of things it's learning about language that we seemingly
1:03:57 - 1:04:00 Text: now don't have to tell it?
1:04:00 - 1:04:06 Text: And so you might have something like a sentence, like I record the record.
1:04:06 - 1:04:08 Text: That's an interesting sentence.
1:04:08 - 1:04:13 Text: And you put it into your transformer model with its word embeddings at the beginning,
1:04:13 - 1:04:17 Text: maybe some layers of self-attention and stuff, and you make some predictions.
1:04:17 - 1:04:21 Text: And now our objects of study are going to be these intermediate layers.
1:04:21 - 1:04:22 Text: Right?
1:04:22 - 1:04:27 Text: So it's a vector per word or sub word for every layer.
1:04:27 - 1:04:31 Text: And the question is, like, can we use these linguistic properties like the dependency
1:04:31 - 1:04:38 Text: parsing that we had way back in the early part of the course to understand correlations
1:04:38 - 1:04:44 Text: between properties in the vectors and these things that we can interpret?
1:04:44 - 1:04:46 Text: We can interpret dependency parses.
1:04:46 - 1:04:51 Text: So there are a couple of things that we might want to look for here.
1:04:51 - 1:04:53 Text: You might want to look for semantics.
1:04:53 - 1:04:56 Text: So here, in the sentence, I record the record.
1:04:56 - 1:04:58 Text: I am an agent.
1:04:58 - 1:05:01 Text: That's a semantics thing.
1:05:01 - 1:05:02 Text: Record is a patient.
1:05:02 - 1:05:04 Text: It's the thing I'm recording.
1:05:04 - 1:05:05 Text: You might have syntax.
1:05:05 - 1:05:07 Text: So you might have the syntax tree that you're interested in.
1:05:07 - 1:05:09 Text: That's the dependency parse tree.
1:05:09 - 1:05:11 Text: Maybe you're interested in part of speech, right?
1:05:11 - 1:05:14 Text: Because you have record and record.
1:05:14 - 1:05:17 Text: And the first one's a verb, the second one's a noun.
1:05:17 - 1:05:19 Text: They're identical strings.
1:05:19 - 1:05:23 Text: Does the model encode that one is one and the other is the other?
1:05:23 - 1:05:26 Text: So how do we do this kind of study?
1:05:26 - 1:05:29 Text: So we're going to decide on a layer that we want to analyze.
1:05:29 - 1:05:31 Text: And we're going to freeze BERT.
1:05:31 - 1:05:32 Text: So we're not going to fine-tune BERT.
1:05:32 - 1:05:34 Text: All the parameters are frozen.
1:05:34 - 1:05:36 Text: So say we decide on layer two of BERT.
1:05:36 - 1:05:38 Text: We're going to pass it some sentences.
1:05:38 - 1:05:42 Text: We decide on what's called a probe family.
1:05:42 - 1:05:49 Text: The question I'm asking is, can I use a model from my family, say linear models, to decode a property
1:05:49 - 1:05:53 Text: that I'm interested in really well from this layer?
1:05:53 - 1:06:00 Text: So it's indicating that this property is easily accessible to linear models effectively.
1:06:00 - 1:06:06 Text: So maybe I train a linear classifier on top of BERT.
1:06:06 - 1:06:09 Text: And I get a really high accuracy.
1:06:09 - 1:06:13 Text: That's sort of interesting already because you know from prior work in part of speech
1:06:13 - 1:06:18 Text: tagging that if you run a linear classifier on simpler features that aren't BERT, you
1:06:18 - 1:06:20 Text: probably don't get as high an accuracy.
1:06:20 - 1:06:22 Text: So that's an interesting sort of takeaway.
1:06:22 - 1:06:24 Text: But then you can also take like a baseline.
1:06:24 - 1:06:26 Text: So I want to compare two layers now.
1:06:26 - 1:06:27 Text: So I've got layer one here.
1:06:27 - 1:06:29 Text: I want to compare it to layer two.
1:06:29 - 1:06:32 Text: I train a probe on it as well.
1:06:32 - 1:06:34 Text: Maybe the accuracy isn't as good.
1:06:34 - 1:06:40 Text: Now I can say, oh wow, look, by layer two, part of speech is more easily accessible to linear
1:06:40 - 1:06:44 Text: functions than it was at layer one.
1:06:44 - 1:06:45 Text: So what did that?
1:06:45 - 1:06:49 Text: Well, the self-attention and feed-forward stuff made it more easily accessible.
1:06:49 - 1:06:53 Text: That's interesting because it's a statement about sort of the information processing of
1:06:53 - 1:06:55 Text: the model.
1:06:55 - 1:06:56 Text: Okay.
1:06:56 - 1:07:00 Text: Okay, so that's, we're going to analyze these layers.
1:07:00 - 1:07:06 Text: Let's take just a second more to think about it, written out a bit more formally.
1:07:06 - 1:07:12 Text: So if you have the model representations h1 to hT, and you have a function family F,
1:07:12 - 1:07:16 Text: say the set of linear models, or maybe feed-forward neural networks with some
1:07:16 - 1:07:21 Text: fixed set of hyperparameters: freeze the model, train the probe.
1:07:21 - 1:07:25 Text: So you get some predictions for part of speech tagging or whatever.
1:07:25 - 1:07:28 Text: That's just the probe applied to the hidden state of the model.
1:07:28 - 1:07:33 Text: The probe was a member of the probe family, and the extent to which we can predict y
1:07:33 - 1:07:34 Text: is a measure of accessibility.
1:07:34 - 1:07:37 Text: So that's just the same thing written out non-pictorially.
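A toy version of the probing recipe, using synthetic stand-ins for frozen hidden states and a simple least-squares linear probe (not any particular paper's setup — the "layers" here are constructed so the tag is linearly accessible in one and not the other):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for frozen model states: n tokens, d dims, k POS tags.
n, d, k = 300, 16, 3
labels = rng.integers(0, k, size=n)

# "Layer 2": the tag is linearly encoded (plus a little noise);
# "layer 1": the tag is mostly drowned out by noise.
code = rng.normal(size=(k, d))
layer2 = code[labels] + 0.3 * rng.normal(size=(n, d))
layer1 = 0.1 * code[labels] + 1.0 * rng.normal(size=(n, d))

def probe_accuracy(states, labels, k):
    """Fit a linear probe (least squares onto one-hot tags) on frozen states
    and report accuracy as a crude measure of linear accessibility."""
    X = np.hstack([states, np.ones((len(states), 1))])  # add a bias term
    Y = np.eye(k)[labels]                               # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    preds = (X @ W).argmax(axis=1)
    return (preds == labels).mean()

acc1 = probe_accuracy(layer1, labels, k)
acc2 = probe_accuracy(layer2, labels, k)
# Higher probe accuracy at "layer 2" = the property is more easily
# accessible there, which is exactly the comparison described above.
```

In a real study you'd also hold out data and compare against baselines, but the layer-versus-layer comparison is the core move.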
1:07:37 - 1:07:38 Text: Okay.
1:07:38 - 1:07:44 Text: So I'm not going to stay on this for too much longer.
1:07:44 - 1:07:49 Text: And it may help in the search for causal mechanisms, but it sort of just gives us a rough
1:07:49 - 1:07:54 Text: understanding of sort of processing of the model and what things are accessible at what
1:07:54 - 1:07:55 Text: layer.
1:07:55 - 1:07:57 Text: So what are some results here?
1:07:57 - 1:08:03 Text: So one result is that BERT, if you run linear probes on it, does really, really well on things
1:08:03 - 1:08:07 Text: that require syntax, like part of speech and named entity recognition.
1:08:07 - 1:08:12 Text: Actually in some cases, approximately as well as just doing the very best thing you could
1:08:12 - 1:08:15 Text: possibly do without BERT.
1:08:15 - 1:08:19 Text: So it just makes easily accessible, amazingly strong features for these properties.
1:08:19 - 1:08:26 Text: And that's an interesting sort of emergent quality of BERT, you might say.
1:08:26 - 1:08:31 Text: It seems like as well that the layers of BERT have this property where, so if you look
1:08:31 - 1:08:39 Text: at the columns of this plot here, each column is a task, you've got input words at the sort
1:08:39 - 1:08:44 Text: of layer zero of BERT here, layer 24 is the last layer of BERT large, lower performance
1:08:44 - 1:08:51 Text: is yellow, higher performance is blue, and I, the resolution isn't perfect, but consistently
1:08:51 - 1:08:55 Text: the best place to read out these properties is somewhere a bit past the middle of the
1:08:55 - 1:09:01 Text: model, which is a very consistent rule, which is fascinating.
1:09:01 - 1:09:07 Text: And then it seems as well like if you look at this function of increasingly abstract or
1:09:07 - 1:09:11 Text: increasingly difficult to compute linguistic properties on this axis, an increasing
1:09:11 - 1:09:17 Text: depth in the network on that axis, so the deeper you go in the network, it seems like
1:09:17 - 1:09:24 Text: the more easily you can access more and more abstract linguistic properties, suggesting
1:09:24 - 1:09:29 Text: that that accessibility is being constructed over time by the layers of processing of BERT,
1:09:29 - 1:09:32 Text: so it's building more and more abstract features.
1:09:32 - 1:09:37 Text: Which I think is again, sort of really interesting result.
1:09:37 - 1:09:43 Text: And now, one thing that comes to mind that really brings us back
1:09:43 - 1:09:48 Text: right to day one is how we built intuitions around word2vec.
1:09:48 - 1:09:51 Text: We were asking, what does each dimension of word2vec mean?
1:09:51 - 1:09:57 Text: And the answer was not really anything, but we could build intuitions about it and
1:09:57 - 1:10:01 Text: think about properties of it through sort of these connections between simple mathematical
1:10:01 - 1:10:08 Text: properties of word2vec and linguistic properties that we could sort of understand.
1:10:08 - 1:10:12 Text: So we had this approximation, which is not 100% true, but it's an approximation that
1:10:12 - 1:10:22 Text: says cosine similarity is effectively correlated with semantic similarity.
1:10:22 - 1:10:25 Text: And that's worth thinking about even if all we're going to do at the end of the day is fine-tune these word
1:10:25 - 1:10:27 Text: embeddings anyway.
1:10:27 - 1:10:32 Text: Likewise we had this sort of idea about the analogies being encoded by linear offsets.
1:10:32 - 1:10:39 Text: So some relationships are linear in space and they didn't have to be, that's fascinating.
1:10:39 - 1:10:43 Text: This is an emergent property that we've now been able to study since we discovered it.
1:10:43 - 1:10:45 Text: Why is that the case in word2vec?
1:10:45 - 1:10:51 Text: And in general, even though you can't interpret the individual dimensions of word2vec,
1:10:51 - 1:10:57 Text: these sort of emergent, interpretable connections between approximately linguistic ideas and sort
1:10:57 - 1:11:00 Text: of simple math on these objects is fascinating.
1:11:00 - 1:11:06 Text: And so one piece of work that sort of extends this idea comes back to dependency parse
1:11:06 - 1:11:09 Text: trees. So they describe the syntax of sentences.
1:11:09 - 1:11:18 Text: And in a paper that I did with Chris, we showed that actually BERT and models like it make
1:11:18 - 1:11:24 Text: dependency parse tree structure emergently, sort of, more easily accessible than one might
1:11:24 - 1:11:26 Text: imagine in their vector space.
1:11:26 - 1:11:32 Text: So if you've got a tree right here, the chef who ran to the store was out of food.
1:11:32 - 1:11:39 Text: So what you can sort of do is think about the tree in terms of distances between words.
1:11:39 - 1:11:44 Text: So you've got the number of edges in the tree between two words is their path distance.
1:11:44 - 1:11:48 Text: So you've got sort of that the distance between chef and was is one.
1:11:48 - 1:11:52 Text: And we're going to use this interpretation of a tree as a distance to make a connection
1:11:52 - 1:11:54 Text: with BERT's embedding space.
1:11:54 - 1:12:00 Text: And what we were able to show is that under a single linear transformation, the squared
1:12:00 - 1:12:07 Text: Euclidean distance between BERT vectors for the same sentence actually correlates well,
1:12:07 - 1:12:12 Text: if you choose the matrix B right, with the distances in the tree.
1:12:12 - 1:12:17 Text: So here in this Euclidean space that we've transformed, the approximate distance between
1:12:17 - 1:12:21 Text: chef and was is also one.
1:12:21 - 1:12:26 Text: Likewise, the distance between was and store is four in the tree.
1:12:26 - 1:12:31 Text: And in my simple sort of transformation of BERT space, the distance between store and
1:12:31 - 1:12:33 Text: was is also approximately four.
1:12:33 - 1:12:36 Text: And this is true across a wide range of sentences.
1:12:36 - 1:12:42 Text: And this is like to me a fascinating example of again emergent approximate structure in
1:12:42 - 1:12:49 Text: these very nonlinear models that don't necessarily need to encode things so simply.
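Here's a small sketch of the two distance notions being compared: tree path distance versus squared Euclidean distance under a linear map B. The vectors are hand-built so the match is exact; real BERT states and a learned B would only match approximately:

```python
import numpy as np
from collections import deque

def tree_path_distances(edges, n):
    """All-pairs path distance in an undirected tree, via BFS from each node."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = np.full((n, n), -1)
    for s in range(n):
        dist[s, s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if dist[s, v] < 0:
                    dist[s, v] = dist[s, u] + 1
                    q.append(v)
    return dist

def probe_distances(H, B):
    # Squared Euclidean distance after the linear map B:
    # d_B(h_i, h_j) = ||B(h_i - h_j)||^2, as in the probe described above.
    diff = H[:, None, :] - H[None, :, :]
    proj = diff @ B.T
    return (proj ** 2).sum(-1)

# Toy 3-word "tree": chef -- was -- out. Vectors chosen so that squared
# distance equals path distance exactly (with B the identity).
edges = [(0, 1), (1, 2)]
H = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
B = np.eye(2)
d_tree = tree_path_distances(edges, 3)
d_B = probe_distances(H, B)
```

Note the geometric point hiding in the toy vectors: squared distances can realize tree distances exactly by taking orthogonal steps, which plain Euclidean distance cannot.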
1:12:49 - 1:12:51 Text: Okay.
1:12:51 - 1:12:52 Text: All right.
1:12:52 - 1:12:53 Text: Great.
1:12:53 - 1:12:59 Text: So probing studies and correlation studies are, I think, interesting, and point us in directions
1:12:59 - 1:13:01 Text: to build intuitions about models.
1:13:01 - 1:13:05 Text: But they're not arguments that the model is actually using the thing that you're finding
1:13:05 - 1:13:07 Text: to make a decision.
1:13:07 - 1:13:10 Text: They're not causal studies.
1:13:10 - 1:13:12 Text: This is true for both probing and correlation studies.
1:13:12 - 1:13:18 Text: So in some work that I did around the same time, we showed that under certain conditions,
1:13:18 - 1:13:22 Text: probes can achieve high accuracy on a task
1:13:22 - 1:13:25 Text: that is effectively just fitting random labels.
1:13:25 - 1:13:31 Text: And so there's a difficulty of interpreting what the model could or could not be doing
1:13:31 - 1:13:34 Text: with this thing that is somehow easily accessible.
1:13:34 - 1:13:38 Text: It's interesting that this property is easily accessible, but the model might not be doing
1:13:38 - 1:13:42 Text: anything with it, for example, because it's totally random.
1:13:42 - 1:13:47 Text: Likewise, another paper showed that you can achieve high accuracy with a probe, even
1:13:47 - 1:13:52 Text: if the model is trained to know that thing that you're probing for is not useful.
1:13:52 - 1:13:56 Text: And there are causal studies that sort of try to extend this work.
1:13:56 - 1:14:01 Text: It's much more difficult to do this research, but it's a fascinating line of future work.
1:14:01 - 1:14:07 Text: Now in my last two minutes, I want to talk about recasting model tweaks and ablations
1:14:07 - 1:14:09 Text: as analysis.
1:14:09 - 1:14:14 Text: So we had this improvement process where we had a network that worked okay.
1:14:14 - 1:14:17 Text: And we would see whether we could tweak it in simple ways to improve it.
1:14:17 - 1:14:21 Text: And then you could see whether you could remove anything and have it still be okay.
1:14:21 - 1:14:22 Text: And that's kind of like analysis.
1:14:22 - 1:14:23 Text: Like I have my network.
1:14:23 - 1:14:26 Text: Is it going to be better if it's more complicated, or better if it's
1:14:26 - 1:14:30 Text: simpler? Can I get away with it being simpler?
1:14:30 - 1:14:35 Text: And so one example of some folks who did this is they took this idea of multi-headed
1:14:35 - 1:14:40 Text: attention and asked: so many heads; are all the heads important?
1:14:40 - 1:14:44 Text: And what they showed is that if you train a system with multi-headed attention and then
1:14:44 - 1:14:49 Text: just remove the heads at test time and not use them at all, you can actually do pretty
1:14:49 - 1:14:54 Text: well on the original task, not retraining at all without some of the attention heads,
1:14:54 - 1:14:56 Text: showing that they weren't important.
1:14:56 - 1:14:58 Text: You could just get rid of them after training.
1:14:58 - 1:15:02 Text: And likewise, you can do the same thing for, this is on machine translation, this is on
1:15:02 - 1:15:06 Text: MultiNLI; you can actually get away without a large percentage of your attention
1:15:06 - 1:15:07 Text: heads.
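A sketch of the head-ablation idea: in a toy multi-head layer, zeroing a head's output at test time removes its contribution from the concatenated output. This is illustrative, not the actual study's code:

```python
import numpy as np

def multi_head_output(head_outputs, head_mask):
    """head_outputs: [heads, seq, dim_per_head]; head_mask: [heads] of 0/1.
    Zeroing a head at test time approximates removing it: the concatenated
    output simply loses that head's slice, with no retraining."""
    masked = head_outputs * head_mask[:, None, None]
    # Concatenate heads along the feature dimension, as the transformer does.
    return np.concatenate(list(masked), axis=-1)

heads, seq, dph = 4, 3, 2
rng = np.random.default_rng(2)
outs = rng.normal(size=(heads, seq, dph))
full = multi_head_output(outs, np.ones(heads))
pruned = multi_head_output(outs, np.array([1.0, 0.0, 1.0, 0.0]))
```

The ablation experiment then just re-runs the evaluation with different masks and checks how much task performance actually drops.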
1:15:07 - 1:15:11 Text: Let's see.
1:15:11 - 1:15:17 Text: Yeah, so another thing that you could think about is questioning sort of the basics of
1:15:17 - 1:15:19 Text: the models that we're building.
1:15:19 - 1:15:22 Text: So we have transformer models that are sort of self-attention, feed-forward, self-attention,
1:15:22 - 1:15:28 Text: feed-forward, but like why in that order, with some of the things omitted here, and this
1:15:28 - 1:15:33 Text: paper asked this question and said, if this is my transformer, self-attention, feed-forward,
1:15:33 - 1:15:36 Text: self-attention, feed-forward, et cetera, et cetera, et cetera.
1:15:36 - 1:15:40 Text: And if I just reordered it so that I had a bunch of self-attention at the head and a bunch
1:15:40 - 1:15:44 Text: of feed-forward at the back, and they tried a bunch of these orderings, and this one actually
1:15:44 - 1:15:45 Text: does better.
1:15:45 - 1:15:48 Text: So this achieves a lower perplexity on a benchmark.
1:15:48 - 1:15:53 Text: And this is a way of analyzing what's important about the architectures that I'm building,
1:15:53 - 1:15:56 Text: and how can they be changed in order to perform better.
1:15:56 - 1:16:00 Text: So neural models are very complex, and they're difficult to characterize, and impossible to
1:16:00 - 1:16:05 Text: characterize with a single statistic like your test set accuracy, especially
1:16:05 - 1:16:07 Text: in domain.
1:16:07 - 1:16:12 Text: And we want to find intuitive descriptions of model behaviors, but we should look at
1:16:12 - 1:16:16 Text: multiple levels of abstraction, and none of them are going to be complete.
1:16:16 - 1:16:20 Text: And if someone tells you that their neural network is interpretable,
1:16:20 - 1:16:23 Text: I encourage you to engage critically with that.
1:16:23 - 1:16:28 Text: It's not necessarily false, but like the levels of interpretability and what you can interpret,
1:16:28 - 1:16:32 Text: these are the questions that you should be asking, because it's going to be opaque in
1:16:32 - 1:16:35 Text: some ways, almost definitely.
1:16:35 - 1:16:41 Text: And then bring this lens to your model building as you try to think about how to build better
1:16:41 - 1:16:46 Text: models, even if you're not going to be doing analysis as one of your main driving goals.
1:16:46 - 1:16:50 Text: And with that, good luck on your final projects.
1:16:50 - 1:16:52 Text: I realize we're at time.
1:16:52 - 1:16:57 Text: The teaching staff is really appreciative of your efforts over this difficult quarter.
1:16:57 - 1:17:04 Text: And yeah, I hope there's a lecture left on Thursday, but yeah, this is my last one.
1:17:04 - 1:17:05 Text: So thanks, everyone.