Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 17 - Model Analysis and Explanation

0:00:00 - 0:00:12     Text: Welcome to CS224N, lecture 17.

0:00:12 - 0:00:14     Text: Model analysis and explanation.

0:00:14 - 0:00:16     Text: Okay, look at us.

0:00:16 - 0:00:19     Text: We're here.

0:00:19 - 0:00:21     Text: Start with some course logistics.

0:00:21 - 0:00:26     Text: We have updated the policy on the guest lecture reactions.

0:00:26 - 0:00:28     Text: They're all due Friday.

0:00:28 - 0:00:30     Text: All at 11.59 pm.

0:00:30 - 0:00:33     Text: You can't use late days for this.

0:00:33 - 0:00:35     Text: So please get them in.

0:00:35 - 0:00:36     Text: Watch the lectures.

0:00:36 - 0:00:37     Text: They're awesome lectures.

0:00:37 - 0:00:39     Text: They're awesome guests.

0:00:39 - 0:00:42     Text: And you get something like half a point for each of them.

0:00:42 - 0:00:46     Text: And yeah, all three can be submitted up through Friday.

0:00:46 - 0:00:48     Text: Okay, so final project.

0:00:48 - 0:00:51     Text: Remember that the due date is Tuesday.

0:00:51 - 0:00:55     Text: It's Tuesday at 4.30 pm, March 16th.

0:00:55 - 0:01:05     Text: And let me emphasize that there's a hard deadline three days from then, on Friday.

0:01:05 - 0:01:10     Text: We won't be accepting, even for additional points off, assignments, sorry, final projects, that

0:01:10 - 0:01:15     Text: are submitted after the 4.30 deadline on Friday.

0:01:15 - 0:01:18     Text: We need to get these graded and get grades in.

0:01:18 - 0:01:20     Text: So it's the end stretch.

0:01:20 - 0:01:21     Text: Week 9.

0:01:21 - 0:01:27     Text: In week 10, the lectures are really us giving you help on the final projects.

0:01:27 - 0:01:29     Text: So this is really the last week of lectures.

0:01:29 - 0:01:31     Text: Thanks for all your hard work.

0:01:31 - 0:01:36     Text: And for asking awesome questions in lecture and in office hours and on Ed.

0:01:36 - 0:01:37     Text: And let's get right into it.

0:01:37 - 0:01:44     Text: So today we get to talk about one of my favorite subjects in natural language processing.

0:01:44 - 0:01:47     Text: It's model analysis and explanation.

0:01:47 - 0:01:51     Text: So first we're going to do what I love doing, which is motivating why we want to talk about

0:01:51 - 0:01:54     Text: the topic at all.

0:01:54 - 0:01:59     Text: We'll talk about how we can look at a model at different levels of abstraction to perform

0:01:59 - 0:02:02     Text: different kinds of analysis on it.

0:02:02 - 0:02:05     Text: We'll talk about out of domain evaluation sets.

0:02:05 - 0:02:10     Text: So it'll feel familiar to the robust QA folks.

0:02:10 - 0:02:15     Text: Then we'll talk about sort of trying to figure out for a given example, why did it make

0:02:15 - 0:02:17     Text: the decision that it made?

0:02:17 - 0:02:19     Text: Had some input, it produced some output.

0:02:19 - 0:02:23     Text: And we come up with some sort of interpretable explanation for it.

0:02:23 - 0:02:29     Text: And then we'll look at actually the representations of the models.

0:02:29 - 0:02:34     Text: So these are the sort of hidden states, the vectors that are being built throughout the processing

0:02:34 - 0:02:38     Text: of the model, try to figure out if we can understand some of the representations and

0:02:38 - 0:02:41     Text: mechanisms that the model is performing.

0:02:41 - 0:02:45     Text: And then we'll actually come back to sort of one of the kind of default states that

0:02:45 - 0:02:50     Text: we've been in in this course, which is trying to look at model improvements, removing things

0:02:50 - 0:02:55     Text: from models, seeing how it performs, and relate that to the analysis that we're doing in this

0:02:55 - 0:02:59     Text: lecture, show how it's not all that different.

0:02:59 - 0:03:01     Text: Okay.

0:03:01 - 0:03:08     Text: So if you haven't seen this XKCD, now you have, and it's one of my favorites, I'm going

0:03:08 - 0:03:09     Text: to say all the words.

0:03:09 - 0:03:16     Text: So person A says this is your machine learning system, person B says yep, you pour the

0:03:16 - 0:03:21     Text: data into this big pile of linear algebra, and then collect the answers on the other side,

0:03:21 - 0:03:26     Text: person A, what if the answers are wrong, and person B, just stir the pile until they

0:03:26 - 0:03:28     Text: start looking right.

0:03:28 - 0:03:32     Text: And I feel like at its worst, deep learning can feel like this from time to time.

0:03:32 - 0:03:37     Text: You have a model, maybe it works for some things, maybe it doesn't work for other things,

0:03:37 - 0:03:41     Text: you're not sure why it works for some things and doesn't work for others.

0:03:41 - 0:03:47     Text: And the changes that we make to our models, they're based on intuition, but frequently,

0:03:47 - 0:03:49     Text: what have the TAs told everyone in office hours?

0:03:49 - 0:03:52     Text: Sometimes you just have to try it and see if it's going to work

0:03:52 - 0:03:55     Text: out, because it's very hard to tell.

0:03:55 - 0:04:01     Text: It's very, very difficult to understand our models at any level.

0:04:01 - 0:04:06     Text: And so today we'll go through a number of ways for trying to carve out little bits of understanding

0:04:06 - 0:04:07     Text: here and there.

0:04:07 - 0:04:16     Text: So beyond it being important because it's in an XKCD comic, why should we care about

0:04:16 - 0:04:18     Text: what our models are doing, about understanding our models?

0:04:18 - 0:04:23     Text: One, is that we want to know what our models are doing.

0:04:23 - 0:04:30     Text: So here you have a black box, black box functions, or this idea that you can't look into

0:04:30 - 0:04:33     Text: them and interpret what they're doing.

0:04:33 - 0:04:37     Text: You have an input sentence, say, and then some output prediction.

0:04:37 - 0:04:46     Text: Maybe this black box is actually your final project model and it gets some accuracy.

0:04:46 - 0:04:51     Text: Now we summarize our models and in your final projects you'll summarize your model with

0:04:51 - 0:04:57     Text: sort of one or a handful of summary metrics of accuracy or f1 score or blue score or

0:04:57 - 0:04:59     Text: something.

0:04:59 - 0:05:03     Text: But there's a lot of model left unexplained by just a small number of metrics.

0:05:03 - 0:05:05     Text: So what do they learn?

0:05:05 - 0:05:08     Text: Why do they succeed and why do they fail?

0:05:08 - 0:05:09     Text: What's another motivation?

0:05:09 - 0:05:12     Text: We want to know what our models are doing.

0:05:12 - 0:05:17     Text: But maybe that's because we want to be able to make tomorrow's model.

0:05:17 - 0:05:23     Text: So today, when you're building models in this class at a company, you start out with some

0:05:23 - 0:05:28     Text: kind of recipe that is known to work either at the company or because you have experience

0:05:28 - 0:05:33     Text: from this class, and it's not perfect, right? It makes mistakes, so you look at the errors.

0:05:33 - 0:05:39     Text: And then over time, you take what works and then you find what needs changing.

0:05:39 - 0:05:43     Text: So it seems like maybe adding another layer to the model helped.

0:05:43 - 0:05:49     Text: And maybe that's a nice tweak and the model performance gets better, et cetera.

0:05:49 - 0:05:55     Text: And incremental progress doesn't always feel exciting, but I want to pitch to you that

0:05:55 - 0:06:01     Text: it is actually very important for us to understand how much incremental progress can kind of get

0:06:01 - 0:06:08     Text: us towards some of our goals so that we can do a better job of evaluating when we need

0:06:08 - 0:06:12     Text: big leaps, when we need major changes because there are problems that we're attacking with

0:06:12 - 0:06:16     Text: our incremental sort of progress and we're not getting very far.

0:06:16 - 0:06:20     Text: OK, so we want to make tomorrow's model.

0:06:20 - 0:06:27     Text: The thing that's very related to both a part of and bigger than this field of analysis

0:06:27 - 0:06:29     Text: is model biases.

0:06:29 - 0:06:38     Text: So let's say you take your word2vec analogies solver, from GloVe or word2vec, from

0:06:38 - 0:06:43     Text: assignment one, and you give it the analogy man is to computer programmer as woman is

0:06:43 - 0:06:46     Text: to, and it gives you the output homemaker.

0:06:46 - 0:06:50     Text: This is a real example from the paper below.

0:06:50 - 0:06:56     Text: You should be like, wow, well, I'm glad I know that now and of course you saw the lecture

0:06:56 - 0:06:58     Text: from Julia.

0:06:58 - 0:07:04     Text: It was just last week. You said, wow, I'm glad I know that now, and that's a huge problem.

0:07:04 - 0:07:06     Text: What did the model use in its decision?

0:07:06 - 0:07:10     Text: What biases is it learning from data and possibly making even worse?

0:07:10 - 0:07:15     Text: So that's the kind of thing you can also do with model analysis, beyond just making models

0:07:15 - 0:07:19     Text: better according to some sort of summary metric as well.

0:07:19 - 0:07:23     Text: And then another thing, we don't just want to make tomorrow's model and this is something

0:07:23 - 0:07:28     Text: that I think is super important.

0:07:28 - 0:07:30     Text: We don't just want to look at that time scale.

0:07:30 - 0:07:36     Text: We want to say, what about 10, 15, 25 years from now, what kinds of things will we be doing?

0:07:36 - 0:07:37     Text: What are the limits?

0:07:37 - 0:07:41     Text: What can be learned by language model pre-training?

0:07:41 - 0:07:44     Text: What's the model that will replace the transformer?

0:07:44 - 0:07:46     Text: What's the model that will replace that model?

0:07:46 - 0:07:48     Text: What does deep learning struggle to do?

0:07:48 - 0:07:52     Text: What are we sort of attacking over and over again and failing to make significant progress

0:07:52 - 0:07:53     Text: on?

0:07:53 - 0:07:55     Text: What do neural models tell us about language potentially?

0:07:55 - 0:08:00     Text: There's some people who are primarily interested in understanding language better using neural

0:08:00 - 0:08:01     Text: networks.

0:08:01 - 0:08:03     Text: Cool.

0:08:03 - 0:08:10     Text: How are our models affecting people, transferring power between groups of people, governments,

0:08:10 - 0:08:11     Text: et cetera?

0:08:11 - 0:08:12     Text: That's an excellent type of analysis.

0:08:12 - 0:08:15     Text: What can't be learned via language model pre-training?

0:08:15 - 0:08:17     Text: That's sort of the complementary question there.

0:08:17 - 0:08:22     Text: If you sort of come to the edge of what you can learn via language model pre-training,

0:08:22 - 0:08:28     Text: is there stuff that we need total paradigm shifts in order to do well?

0:08:28 - 0:08:34     Text: All of this falls under some category of trying to really deeply understand our models

0:08:34 - 0:08:37     Text: and their capabilities.

0:08:37 - 0:08:40     Text: There's a lot of different methods here that will go over today.

0:08:40 - 0:08:47     Text: One thing that I want you to take away from it is that they're all going to tell us

0:08:47 - 0:08:52     Text: some aspect of the model, elucidate some kind of intuition or something, but for none of them

0:08:52 - 0:08:58     Text: are we going to say, aha, I really understand 100% about what this model is doing now.

0:08:58 - 0:09:01     Text: They're going to provide some clarity, but never total clarity.

0:09:01 - 0:09:07     Text: One way, if you're trying to decide how you want to understand your model more, I think

0:09:07 - 0:09:12     Text: you should start out by thinking about what level of abstraction do I want to be looking

0:09:12 - 0:09:14     Text: at my model.

0:09:14 - 0:09:22     Text: At the very high level of abstraction, let's say you've trained a QA model to estimate the probabilities

0:09:22 - 0:09:27     Text: of start and end indices in a reading comprehension problem or you've trained a language model

0:09:27 - 0:09:30     Text: that assigns probabilities to words in context.

0:09:30 - 0:09:33     Text: You can just look at the model as that object.

0:09:33 - 0:09:37     Text: It's just a probability distribution defined by your model.

0:09:37 - 0:09:41     Text: You are not looking into it any further than the fact that you can sort of give it inputs

0:09:41 - 0:09:45     Text: and see what outputs it provides.

0:09:45 - 0:09:49     Text: At that level, who even cares if it's a neural network?

0:09:49 - 0:09:53     Text: It could be anything, but it's a way to understand its behavior.

0:09:53 - 0:09:56     Text: Another level of abstraction that you can look at, you can dig a little deeper.

0:09:56 - 0:10:01     Text: You can say, well, I know that my network is a bunch of layers that are kind of stacked

0:10:01 - 0:10:06     Text: on top of each other, you've got sort of maybe your transformer encoder with your one

0:10:06 - 0:10:09     Text: layer, two layer, three layer, you can try to see what it's doing as it goes deeper

0:10:09 - 0:10:12     Text: in the layers.

0:10:12 - 0:10:15     Text: Maybe your neural model is the sequence of these vector representations.

0:10:15 - 0:10:22     Text: A third level of specificity is to look at as much detail as you can.

0:10:22 - 0:10:23     Text: You've got these parameters in there.

0:10:23 - 0:10:26     Text: You've got the connections in the computation graph.

0:10:26 - 0:10:30     Text: Now you're sort of trying to remove all of the abstraction that you can and look at as

0:10:30 - 0:10:32     Text: many details as possible.

0:10:32 - 0:10:36     Text: All three of these ways of looking at your model and performing analysis are going to

0:10:36 - 0:10:42     Text: be useful, and we'll actually sort of travel slowly from one to two to three as we go through

0:10:42 - 0:10:45     Text: this lecture.

0:10:45 - 0:10:47     Text: Okay.

0:10:47 - 0:10:51     Text: We haven't actually talked about any analyses yet.

0:10:51 - 0:10:56     Text: We're going to get started on that now.

0:10:56 - 0:10:59     Text: We're starting with the sort of testing our model's behaviors.

0:10:59 - 0:11:02     Text: So we want to see, well, does my model perform well?

0:11:02 - 0:11:10     Text: I mean, the natural thing to ask is, how does it behave on some sort of test set?

0:11:10 - 0:11:13     Text: And so we don't really care about mechanisms yet.

0:11:13 - 0:11:14     Text: Why is it performing this way?

0:11:14 - 0:11:17     Text: By what method is it making its decision?

0:11:17 - 0:11:22     Text: Instead, we're just interested in sort of the higher level abstraction of, like, does

0:11:22 - 0:11:24     Text: it perform the way I want it to perform?

0:11:24 - 0:11:31     Text: So let's like, take our model evaluation that we are already doing and sort of recast

0:11:31 - 0:11:33     Text: it in the framework of analysis.

0:11:33 - 0:11:37     Text: So you've trained your model on some samples from some distribution.

0:11:37 - 0:11:40     Text: So you've got input, output pairs of some kind.

0:11:40 - 0:11:43     Text: So how does the model behave on samples from the same distribution?

0:11:43 - 0:11:48     Text: It's a simple question and it's sort of, you know, it's known as, you know, in domain

0:11:48 - 0:11:53     Text: accuracy or you can say that the samples are IID and that's what you're testing on.

0:11:53 - 0:11:56     Text: And this is just what we've been doing this whole time.

0:11:56 - 0:12:00     Text: It's your test set accuracy or F1 or blue score.

0:12:00 - 0:12:06     Text: And you know, so you've got some model with some accuracy and maybe it's better than some

0:12:06 - 0:12:09     Text: model with some other accuracy on this test set, right?

0:12:09 - 0:12:14     Text: So this is what you're doing as you're iterating on your models and your final project as well.

0:12:14 - 0:12:18     Text: You say, well, you know, on my test set, which is what I've decided to care about for

0:12:18 - 0:12:19     Text: now, model A does better.

0:12:19 - 0:12:22     Text: They both seem pretty good.

0:12:22 - 0:12:24     Text: And so maybe I'll choose model A to keep working on.

0:12:24 - 0:12:28     Text: Maybe I'll choose it if you were putting something into production.

0:12:28 - 0:12:33     Text: But remember back to, you know, this idea that it's just one number to summarize a very

0:12:33 - 0:12:36     Text: complex system.

0:12:36 - 0:12:40     Text: It's not going to be sufficient to tell you how it's going to perform in a wide variety

0:12:40 - 0:12:41     Text: of settings.

0:12:41 - 0:12:44     Text: Okay, so we've been doing this.

0:12:44 - 0:12:48     Text: This is model evaluation as model analysis.

0:12:48 - 0:12:55     Text: Now we're going to say what if we are not testing on exactly the same type of data that we

0:12:55 - 0:12:56     Text: trained on.

0:12:56 - 0:13:01     Text: So now we're asking, did the model learn something such that it's able to sort of extrapolate

0:13:01 - 0:13:05     Text: or perform how I want it to on data that looks a little bit different from what it was

0:13:05 - 0:13:06     Text: trained on?

0:13:06 - 0:13:09     Text: And we're going to take the example of natural language inference.

0:13:09 - 0:13:12     Text: So to recall the task of natural language inference, and this is the MultiNLI

0:13:12 - 0:13:17     Text: data set that we're just pulling our definition from, you have a premise.

0:13:17 - 0:13:21     Text: He turned and saw John sleeping in his half tent, and you have a hypothesis.

0:13:21 - 0:13:24     Text: He saw John was asleep.

0:13:24 - 0:13:26     Text: And then you give them both to the model.

0:13:26 - 0:13:29     Text: And this is the model that we had before that gets some good accuracy.

0:13:29 - 0:13:35     Text: And the model is supposed to tell whether the hypothesis is sort of implied by the premise

0:13:35 - 0:13:37     Text: or contradicting it.

0:13:37 - 0:13:39     Text: So you could be contradicting.

0:13:39 - 0:13:42     Text: Maybe if the hypothesis is, you know, John was awake.

0:13:42 - 0:13:44     Text: For example, or he saw John was awake.

0:13:44 - 0:13:46     Text: Maybe that would be contradiction.

0:13:46 - 0:13:50     Text: Or if sort of both could be true at the same time, so to speak.

0:13:50 - 0:13:54     Text: And then in this case, you know, it seems like they're saying that the premise implies

0:13:54 - 0:13:56     Text: the hypothesis.

0:13:56 - 0:14:00     Text: And so, you know, you would say probably this is likely to get the right answer since

0:14:00 - 0:14:01     Text: the accuracy of the model is 95%.

0:14:01 - 0:14:06     Text: And 95 percent of the time, we get the right answer.

0:14:06 - 0:14:09     Text: And we're going to dig deeper into that.

0:14:09 - 0:14:15     Text: What if the model is not doing what we think we want it to be doing in order to perform

0:14:15 - 0:14:16     Text: natural language inference?

0:14:16 - 0:14:22     Text: So in a data set like MultiNLI, the authors who gathered the data set will have asked

0:14:22 - 0:14:27     Text: humans to perform the task and, you know, gotten the accuracy that the humans achieved.

0:14:27 - 0:14:34     Text: And models nowadays are achieving accuracies that are around where humans are achieving,

0:14:34 - 0:14:36     Text: which sounds great at first.

0:14:36 - 0:14:43     Text: But as we'll see, it's not the same as actually performing the task more broadly in the right

0:14:43 - 0:14:45     Text: way.

0:14:45 - 0:14:49     Text: So what if the model is not doing something smart effectively?

0:14:49 - 0:14:54     Text: We're going to use a diagnostic test set of carefully constructed examples that seem

0:14:54 - 0:15:01     Text: like things the model should be able to do to test for a specific skill or capacity.

0:15:01 - 0:15:03     Text: In this case, we'll use Hans.

0:15:03 - 0:15:07     Text: So HANS is the Heuristic Analysis for NLI Systems data set.

0:15:07 - 0:15:12     Text: And it's intended to take systems that do natural language inference and test whether

0:15:12 - 0:15:16     Text: they're using some simple syntactic heuristics.

0:15:16 - 0:15:19     Text: What we'll have in each of these cases, we'll have some heuristic.

0:15:19 - 0:15:21     Text: We'll talk through the definition.

0:15:21 - 0:15:22     Text: We'll get an example.

0:15:22 - 0:15:24     Text: So the first thing is lexical overlap.

0:15:24 - 0:15:31     Text: So the model might do this thing where it assumes that a premise entails all hypotheses

0:15:31 - 0:15:32     Text: constructed from words in the premise.

0:15:32 - 0:15:40     Text: So in this example, you have the premise the doctor was paid by the actor.

0:15:40 - 0:15:43     Text: And then the hypothesis is the doctor paid the actor.

0:15:43 - 0:15:49     Text: And you'll notice that in bold here, you've got the doctor, and then paid, and then the actor.

0:15:49 - 0:15:54     Text: And so if you use this heuristic, you will think that the doctor was paid by the actor,

0:15:54 - 0:15:58     Text: implies the doctor paid the actor. It does not imply it, of course.

0:15:58 - 0:16:02     Text: And so you could expect a model, you want the model, to be able to do this.

0:16:02 - 0:16:03     Text: It's somewhat simple.

0:16:03 - 0:16:08     Text: But if it's using this heuristic, it won't get this example right.

0:16:08 - 0:16:10     Text: Next is the sub-sequence heuristic.

0:16:10 - 0:16:17     Text: So here, if the model assumes that the premise entails all of its contiguous sub-sequences,

0:16:17 - 0:16:19     Text: it will get this one wrong as well.

0:16:19 - 0:16:23     Text: So this example is the doctor near the actor danced.

0:16:23 - 0:16:24     Text: That's the premise.

0:16:24 - 0:16:26     Text: The hypothesis is the actor danced.

0:16:26 - 0:16:28     Text: Now this is a simple syntactic thing.

0:16:28 - 0:16:31     Text: The doctor is doing the dancing; near the actor

0:16:31 - 0:16:33     Text: is this prepositional phrase.

0:16:33 - 0:16:37     Text: And so the model sort of uses this heuristic, oh, look, the actor danced.

0:16:37 - 0:16:38     Text: That's a sub-sequence, so it's entailed.

0:16:38 - 0:16:39     Text: Awesome.

0:16:39 - 0:16:42     Text: And it'll get this one wrong as well.

0:16:42 - 0:16:46     Text: And here's another one that's a lot like sub-sequence.

0:16:46 - 0:16:52     Text: So if the premise, if the model thinks that the premise entails all complete sub-trees,

0:16:52 - 0:16:55     Text: so this is like sort of fully formed phrases.

0:16:55 - 0:17:01     Text: So the artist slept here is a fully formed sort of sub-tree. If the artist slept,

0:17:01 - 0:17:04     Text: the actor ran, and then that's the premise.

0:17:04 - 0:17:05     Text: Does it entail the hypothesis?

0:17:05 - 0:17:11     Text: The actor slept, no, sorry, the artist slept.

0:17:11 - 0:17:13     Text: That does not entail it, because this is inside the conditional.
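
To make this evaluation setup concrete, here is a minimal sketch of how a HANS-style diagnostic set could be scored against a trained NLI system. The predict_label function is a hypothetical stand-in for any model that maps a (premise, hypothesis) pair to a label; the three examples are the ones from the slides, while the real HANS data set contains many templated cases per heuristic.

    # Minimal sketch of a HANS-style diagnostic evaluation.
    # `predict_label` is a hypothetical stand-in for a trained NLI model that
    # maps (premise, hypothesis) -> "entailment" or "non-entailment".

    diagnostic_examples = [
        # (premise, hypothesis, gold label, heuristic being probed)
        ("The doctor was paid by the actor.", "The doctor paid the actor.",
         "non-entailment", "lexical_overlap"),
        ("The doctor near the actor danced.", "The actor danced.",
         "non-entailment", "subsequence"),
        ("If the artist slept, the actor ran.", "The artist slept.",
         "non-entailment", "constituent"),
    ]

    def evaluate_heuristics(predict_label, examples):
        """Return accuracy per heuristic, so failures are not hidden in one average."""
        stats = {}
        for premise, hypothesis, gold, heuristic in examples:
            correct, total = stats.get(heuristic, (0, 0))
            correct += int(predict_label(premise, hypothesis) == gold)
            stats[heuristic] = (correct, total + 1)
        return {h: c / t for h, (c, t) in stats.items()}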

0:17:13 - 0:17:20     Text: Okay, let me pause here for some questions before I move on to see how these models do.

0:17:20 - 0:17:28     Text: Anyone unclear about how this sort of evaluation is being set up?

0:17:28 - 0:17:37     Text: Cool.

0:17:37 - 0:17:39     Text: Okay.

0:17:39 - 0:17:42     Text: Okay, so how do models perform?

0:17:42 - 0:17:46     Text: That's sort of the question of the hour.

0:17:46 - 0:17:51     Text: What we'll do is, we'll look at these results from the same paper that really released the

0:17:51 - 0:17:52     Text: data set.

0:17:52 - 0:17:57     Text: So they took four strong MultiNLI models with the following accuracies.

0:17:57 - 0:18:02     Text: So the accuracies here are something between 60 and 80-something percent, and BERT over

0:18:02 - 0:18:05     Text: here is doing the best.

0:18:05 - 0:18:12     Text: And in domain, in that first sort of setting that we talked about, you get these reasonable

0:18:12 - 0:18:14     Text: accuracies.

0:18:14 - 0:18:20     Text: And that is sort of what we said before about it, like looking pretty good.

0:18:20 - 0:18:27     Text: And when we evaluate on HANS, in this setting here, we have examples where the

0:18:27 - 0:18:30     Text: heuristics we talked about actually work.

0:18:30 - 0:18:34     Text: So if the model is using the heuristic, it will get this right.

0:18:34 - 0:18:37     Text: And it gets very high accuracies.

0:18:37 - 0:18:42     Text: And then if we evaluate the model in the settings where if it uses the heuristic, it gets the

0:18:42 - 0:18:44     Text: examples wrong.

0:18:44 - 0:18:51     Text: You know, maybe BERT's doing like epsilon better than some of the other stuff here, but it's

0:18:51 - 0:18:53     Text: a very different story.

0:18:53 - 0:18:54     Text: Okay.

0:18:54 - 0:18:55     Text: And you saw those examples.

0:18:55 - 0:19:03     Text: They're not complex in our sort of own idea of complexity.

0:19:03 - 0:19:08     Text: And so this is why it sort of feels like a clear failure of the system.

0:19:08 - 0:19:13     Text: Now you can say though that well, maybe the training data sort of wasn't, didn't have

0:19:13 - 0:19:14     Text: any of those sort of phenomena.

0:19:14 - 0:19:18     Text: So the model couldn't have learned not to do that.

0:19:18 - 0:19:22     Text: And that's sort of a reasonable argument except, well, you know, BERT is pre-trained on

0:19:22 - 0:19:23     Text: a bunch of language texts.

0:19:23 - 0:19:27     Text: So you might hope, you might expect, you might hope that it does better.

0:19:27 - 0:19:29     Text: Okay.

0:19:29 - 0:19:39     Text: So we saw that example of models performing well on examples that are like those that

0:19:39 - 0:19:40     Text: it was trained on.

0:19:40 - 0:19:46     Text: And then performing not very well at all on examples that seem reasonable, but are sort

0:19:46 - 0:19:49     Text: of a little bit tricky.

0:19:49 - 0:19:53     Text: Now we're going to take this idea of having a test set that we've carefully crafted and

0:19:53 - 0:19:55     Text: go in a slightly different direction.

0:19:55 - 0:19:59     Text: So we're going to ask, what does it mean to try to understand the linguistic properties

0:19:59 - 0:20:00     Text: of our models?

0:20:00 - 0:20:01     Text: Does it?

0:20:01 - 0:20:05     Text: So that syntactic heuristics question was one thing for natural language inference,

0:20:05 - 0:20:10     Text: but can we sort of test whether the models think certain things are sort of right

0:20:10 - 0:20:14     Text: or wrong as language models?

0:20:14 - 0:20:18     Text: And the first way that we'll do this is we'll ask, well, how do we think about sort

0:20:18 - 0:20:21     Text: of what humans think of as good language?

0:20:21 - 0:20:26     Text: How do we evaluate their sort of preferences about language?

0:20:26 - 0:20:29     Text: And one answer is minimal pairs.

0:20:29 - 0:20:34     Text: And the idea of a minimal pair is that you've got one sentence that sounds okay to a speaker.

0:20:34 - 0:20:38     Text: So this sentence is the chef who made the pizzas is here.

0:20:38 - 0:20:43     Text: It's called acceptable; it's an acceptable sentence, at least to me.

0:20:43 - 0:20:50     Text: And then with a small change, a minimal change, the sentence is no longer okay to the speaker.

0:20:50 - 0:20:53     Text: So the chef who made the pizzas are here.

0:20:53 - 0:21:01     Text: And this, whoops, this should be, present tense verbs.

0:21:01 - 0:21:05     Text: In English, present tense verbs agree in number with their subject when they are third

0:21:05 - 0:21:07     Text: person.

0:21:07 - 0:21:10     Text: So chef pizzas, okay.

0:21:10 - 0:21:14     Text: And this is sort of a pretty general thing.

0:21:14 - 0:21:16     Text: Most people don't like this.

0:21:16 - 0:21:18     Text: It's a misconjugated verb.

0:21:18 - 0:21:23     Text: And so the syntax here looks like you have the chef who made the pizzas.

0:21:23 - 0:21:30     Text: And then this arc of agreement in number is requiring the word is here to be singular

0:21:30 - 0:21:32     Text: is, instead of the plural are.

0:21:32 - 0:21:38     Text: Despite the fact that there's this noun pizzas, which is plural, closer linearly. This comes

0:21:38 - 0:21:40     Text: back to dependency parsing.

0:21:40 - 0:21:42     Text: Or back, okay.

0:21:42 - 0:21:49     Text: And what this looks like in the tree structure, right, is well, chef and is are attached in

0:21:49 - 0:21:52     Text: the tree.

0:21:52 - 0:21:57     Text: Chef is the subject of is; pizzas is down here in the subtree.

0:21:57 - 0:22:02     Text: And so that subject verb relationship has this sort of agreement thing.

0:22:02 - 0:22:08     Text: So this is a pretty sort of basic and interesting property of language that also reflects the

0:22:08 - 0:22:11     Text: syntactic sort of hierarchical structure of language.

0:22:11 - 0:22:14     Text: So we've been training these language models sampling from them, seeing that they get

0:22:14 - 0:22:15     Text: interesting things.

0:22:15 - 0:22:19     Text: And they tend to seem to generate syntactic content.

0:22:19 - 0:22:25     Text: But does it really understand or does it behave as if it understands this idea of agreement

0:22:25 - 0:22:29     Text: more broadly and does it sort of get the syntax right so that it matches the subjects and

0:22:29 - 0:22:31     Text: the verbs.

0:22:31 - 0:22:36     Text: But language models can't tell us exactly whether they think that a sentence is good or

0:22:36 - 0:22:40     Text: bad, they just tell us the probability of a sentence.

0:22:40 - 0:22:45     Text: So before we had acceptable and unacceptable, that's what we get from humans.

0:22:45 - 0:22:50     Text: And the language model's analog is just, does it assign higher probability to the acceptable

0:22:50 - 0:22:52     Text: sentence in the minimal pair, right?

0:22:52 - 0:22:58     Text: So you have the probability under the model of the chef who made the pizzas is here.

0:22:58 - 0:23:02     Text: And then you have the probability under the model of the chef who made the pizzas are

0:23:02 - 0:23:03     Text: here.

0:23:03 - 0:23:08     Text: And you want this probability here to be higher.

0:23:08 - 0:23:15     Text: And if it is, that's sort of like a simple way to test whether the model got it right effectively.
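
As a minimal sketch of that minimal-pair comparison, assuming a pretrained GPT-2 from the Hugging Face transformers library rather than the LSTM language models discussed in this lecture, you can score both sentences and check which one the model prefers:

    # Minimal-pair check: does the language model assign higher probability
    # to the acceptable sentence? (Illustrative setup with GPT-2.)
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def total_log_prob(sentence):
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            # .loss is the average negative log-likelihood per predicted token
            loss = model(ids, labels=ids).loss
        return -loss.item() * (ids.size(1) - 1)   # approximate total log-probability

    good = "The chef who made the pizzas is here."
    bad = "The chef who made the pizzas are here."
    print(total_log_prob(good) > total_log_prob(bad))  # hope this prints True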

0:23:15 - 0:23:22     Text: And just like in HANS, we can develop a test set with very carefully chosen properties,

0:23:22 - 0:23:23     Text: right?

0:23:23 - 0:23:29     Text: So most sentences in English don't have terribly complex subject verb agreement structure

0:23:29 - 0:23:34     Text: or a lot of words in the middle, like pizzas, that are going to make it difficult.

0:23:34 - 0:23:42     Text: So if I say, you know, the dog runs, there's sort of no way to get it wrong because the

0:23:42 - 0:23:44     Text: syntax is very simple.

0:23:44 - 0:23:53     Text: So we can create, well, we can look for sentences that have these things called attractors in

0:23:53 - 0:23:54     Text: the sentence.

0:23:54 - 0:23:59     Text: So pizzas is an attractor because the model might be attracted to the plurality here and

0:23:59 - 0:24:03     Text: get the conjugation wrong.

0:24:03 - 0:24:04     Text: So this is our question.

0:24:04 - 0:24:08     Text: Can language models sort of very generally handle these examples with attractors?

0:24:08 - 0:24:13     Text: So we can take examples with zero attractors, see whether the model gets the minimal pairs

0:24:13 - 0:24:14     Text: evaluation right.

0:24:14 - 0:24:18     Text: We can take examples with one attractor, two attractors.

0:24:18 - 0:24:22     Text: You can see how people would still reasonably understand the sentences, right?

0:24:22 - 0:24:24     Text: The chef who made the pizzas and prepped the ingredients is.

0:24:24 - 0:24:26     Text: It's still the chef who is.

0:24:26 - 0:24:32     Text: And then on and on and on, it gets rarer, obviously, but you can have more and more attractors.

0:24:32 - 0:24:36     Text: And so now we've created this test set that's intended to evaluate this very specific linguistic

0:24:36 - 0:24:39     Text: phenomenon.

0:24:39 - 0:24:46     Text: So in this paper here, Kuncoro et al. trained an LSTM language model on a subset of Wikipedia

0:24:46 - 0:24:48     Text: back in 2018.

0:24:48 - 0:24:54     Text: And they evaluate it sort of in these buckets that are specified by the paper that sort

0:24:54 - 0:25:02     Text: of introduced subject verb agreement to the NLP field, or more recently at least, and

0:25:02 - 0:25:06     Text: they evaluate it in buckets based on the number of attractors.

0:25:06 - 0:25:12     Text: And so in this table here that you're about to see, the numbers are sort of the percent of

0:25:12 - 0:25:19     Text: times that you assign higher probability to the correct sentence in the minimal pair.

0:25:19 - 0:25:23     Text: So if you were just to do random or majority class, you get these errors, oh, sorry, it's

0:25:23 - 0:25:26     Text: the percent of times that you get it wrong.

0:25:26 - 0:25:27     Text: Sorry about that.

0:25:27 - 0:25:30     Text: So lower is better.

0:25:30 - 0:25:33     Text: And so with no attractors, you get very low error rates.

0:25:33 - 0:25:39     Text: So this is 1.3 error rate with a 350-dimensional LSTM.

0:25:39 - 0:25:45     Text: And with one attractor, your error rate is higher, but actually humans start to get errors

0:25:45 - 0:25:47     Text: with more attractors too.

0:25:47 - 0:25:50     Text: So zero attractors is easy.

0:25:50 - 0:25:53     Text: The larger the LSTM, it looks like in general the better you're doing, right?

0:25:53 - 0:25:56     Text: So the smaller models doing worse, OK?

0:25:56 - 0:26:01     Text: And then even on sort of very difficult examples with four attractors, which, if you try to think

0:26:01 - 0:26:07     Text: of an example in your head like the chef made the pizzas and took out the trash and sort

0:26:07 - 0:26:12     Text: of has to be this long sentence, the error rate is definitely higher, so it gets more difficult,

0:26:12 - 0:26:15     Text: but it's still relatively low.

0:26:15 - 0:26:18     Text: And so even on these very hard examples, models are actually performing subject verb number

0:26:18 - 0:26:21     Text: agreement relatively well.

0:26:21 - 0:26:24     Text: Very cool.

0:26:24 - 0:26:25     Text: OK.

0:26:25 - 0:26:28     Text: Here are some examples that a model got wrong.

0:26:28 - 0:26:32     Text: This is actually a worse model than the ones from the paper that was just there, but I

0:26:32 - 0:26:35     Text: think actually the errors are quite interesting.

0:26:35 - 0:26:41     Text: So here's a sentence, the ship that the player drives has a very high speed.

0:26:41 - 0:26:47     Text: Now this model thought that was less probable than the ship that the player drives have a

0:26:47 - 0:26:51     Text: very high speed.

0:26:51 - 0:27:00     Text: My hypothesis, right, is that it sort of misanalyzes drives as a plural noun, for example,

0:27:00 - 0:27:01     Text: sort of a difficult construction there.

0:27:01 - 0:27:04     Text: I think it's pretty interesting.

0:27:04 - 0:27:07     Text: Likewise here, this one is fun.

0:27:07 - 0:27:09     Text: The lead is also rather long.

0:27:09 - 0:27:12     Text: Five paragraphs is pretty lengthy.

0:27:12 - 0:27:18     Text: So here five paragraphs together is a singular noun, it's like a unit of length,

0:27:18 - 0:27:19     Text: I guess.

0:27:19 - 0:27:26     Text: But the model thought that it was more likely to say five paragraphs are pretty lengthy,

0:27:26 - 0:27:32     Text: because it's referring to this sort of five paragraphs as the five actual paragraphs

0:27:32 - 0:27:37     Text: themselves as opposed to a single unit of length describing the lead.

0:27:37 - 0:27:39     Text: Fascinating.

0:27:39 - 0:27:42     Text: OK.

0:27:42 - 0:27:53     Text: Maybe questions again?

0:27:53 - 0:27:56     Text: So I guess there are a couple.

0:27:56 - 0:28:04     Text: Can we do similar heuristic analysis for other tasks such as QA or classification?

0:28:04 - 0:28:07     Text: Yes.

0:28:07 - 0:28:14     Text: So yes, I think that it's easy to do this kind of analysis, the HANS-style analysis,

0:28:14 - 0:28:23     Text: with question answering and other sorts of tasks, because you can construct examples

0:28:23 - 0:28:35     Text: that similarly have these heuristics and then have the answer depend on the syntax or

0:28:35 - 0:28:36     Text: not.

0:28:36 - 0:28:41     Text: The actual probability of one sentence being higher than the other, of course, is sort of a language

0:28:41 - 0:28:43     Text: model dependent thing.

0:28:43 - 0:28:52     Text: But the idea that you can sort of develop kind of bespoke test sets for various tasks,

0:28:52 - 0:28:54     Text: I think is very, very general.

0:28:54 - 0:28:59     Text: And something that I think is actually quite interesting.

0:28:59 - 0:29:00     Text: Yes.

0:29:00 - 0:29:05     Text: So I won't go on further, but I think the answer is just yes.

0:29:05 - 0:29:07     Text: So there's another one.

0:29:07 - 0:29:10     Text: How do you know where to find these failure cases?

0:29:10 - 0:29:13     Text: Maybe that's the right time to advertise linguistics classes.

0:29:13 - 0:29:14     Text: Sorry.

0:29:14 - 0:29:17     Text: You're still very quiet over here.

0:29:17 - 0:29:19     Text: How do you find what?

0:29:19 - 0:29:23     Text: How do you know where to find these failure cases?

0:29:23 - 0:29:24     Text: Oh, interesting.

0:29:24 - 0:29:25     Text: Yes.

0:29:25 - 0:29:27     Text: How do we know where to find the failure cases?

0:29:27 - 0:29:28     Text: That's a good question.

0:29:28 - 0:29:36     Text: I mean, I think I agree with Chris that actually thinking about what is interesting about things

0:29:36 - 0:29:39     Text: in language is one way to do it.

0:29:39 - 0:29:47     Text: I mean, the heuristics that we saw in our language model, sorry, in our NLI models with

0:29:47 - 0:29:56     Text: HANS, you can kind of imagine that they, if the model was sort of ignoring facts about

0:29:56 - 0:30:01     Text: language and sort of just doing this sort of rough bag of words with some extra magic,

0:30:01 - 0:30:05     Text: then it would do about as badly as it is doing here.

0:30:05 - 0:30:12     Text: And these sorts of ideas about understanding that this statement, if the artist slept

0:30:12 - 0:30:17     Text: the actor ran does not imply the artist slept, is the kind of thing that maybe you'd think

0:30:17 - 0:30:22     Text: up on your own, but also you'd spend time sort of pondering about and thinking broad

0:30:22 - 0:30:29     Text: thoughts about in linguistics curricula as well.

0:30:29 - 0:30:36     Text: Anything else, Chris?

0:30:36 - 0:30:42     Text: So there's also, well, I guess someone was also saying, I think it's about the sort of

0:30:42 - 0:30:48     Text: intervening verbs example, or intervening nouns, sorry, example, but the data set itself

0:30:48 - 0:30:53     Text: probably includes mistakes with higher attractors.

0:30:53 - 0:30:55     Text: Yeah, yeah, that's a good point.

0:30:55 - 0:31:03     Text: Yeah, because humans make more and more mistakes as the number of attractors gets larger.

0:31:03 - 0:31:10     Text: On the other hand, I think that the mistakes are fewer in written text than in spoken.

0:31:10 - 0:31:14     Text: Maybe I'm just making that up, but that's what I think.

0:31:14 - 0:31:19     Text: But yeah, it would be interesting to actually go through that test set and see how many

0:31:19 - 0:31:24     Text: of the errors the really strong model makes are actually due to the sort of observed form

0:31:24 - 0:31:25     Text: being incorrect.

0:31:25 - 0:31:32     Text: I'd be super curious.

0:31:32 - 0:31:36     Text: Okay, should I move on?

0:31:36 - 0:31:47     Text: Yep, great.

0:31:47 - 0:31:55     Text: Okay, so what does it feel like we're doing when we are kind of constructing these sort

0:31:55 - 0:31:59     Text: of bespoke, small, careful test sets for various phenomena?

0:31:59 - 0:32:03     Text: Well, it sort of feels like unit testing.

0:32:03 - 0:32:13     Text: And in fact, this sort of idea has been brought to the fore, you might say, in NLP: unit tests,

0:32:13 - 0:32:15     Text: but for these NLP neural networks.

0:32:15 - 0:32:21     Text: And in particular, the paper here that I'm citing at the bottom suggests this minimum

0:32:21 - 0:32:22     Text: functionality test.

0:32:22 - 0:32:28     Text: You want a small test set that targets a specific behavior that should sound like some of the

0:32:28 - 0:32:31     Text: things that we were, that we've already talked about.

0:32:31 - 0:32:34     Text: But in this case, we're going to get even more specific.

0:32:34 - 0:32:36     Text: So here's a single test case.

0:32:36 - 0:32:42     Text: We're going to have an expected label, what was actually predicted, whether the model passed

0:32:42 - 0:32:44     Text: this unit test.

0:32:44 - 0:32:47     Text: And the labels are going to be sentiment analysis here.

0:32:47 - 0:32:52     Text: So negative label, positive label, or neutral, are the three options.

0:32:52 - 0:32:57     Text: And the unit test is going to consist simply of sentences that follow this template.

0:32:57 - 0:33:02     Text: I, then a negation, then a positive verb, and then the thing.

0:33:02 - 0:33:07     Text: So a negation plus a positive verb means a negative sentiment, right?

0:33:07 - 0:33:08     Text: And so here's an example.

0:33:08 - 0:33:11     Text: I can't say I recommend the food.

0:33:11 - 0:33:13     Text: The expected label is negative.

0:33:13 - 0:33:17     Text: The answer that the model provided, and this is, I think, a commercial sentiment analysis

0:33:17 - 0:33:18     Text: system.

0:33:18 - 0:33:19     Text: It was positive.

0:33:19 - 0:33:21     Text: So it's pretty positive.

0:33:21 - 0:33:24     Text: And then I didn't love the flight.

0:33:24 - 0:33:30     Text: The expected label was negative, and then the predicted answer was neutral.

0:33:30 - 0:33:35     Text: And this commercial sentiment analysis system gets a lot of what you could imagine are

0:33:35 - 0:33:38     Text: pretty reasonably simple examples, wrong.
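
Here is a minimal sketch of such a minimum functionality test, in the spirit of the CheckList-style testing being described; predict_sentiment is a hypothetical stand-in for whatever sentiment system you want to probe, and the template fillers are just illustrative examples.

    # Minimum functionality test (MFT) sketch: negation of a positive verb
    # should come out negative. `predict_sentiment` is a hypothetical model
    # returning "positive", "negative", or "neutral".
    from itertools import product

    negations = ["can't say I", "didn't", "would never say I"]
    pos_verbs = ["recommend", "love", "enjoyed"]
    things = ["the food", "the flight", "the service"]

    def run_negation_mft(predict_sentiment):
        failures = []
        for neg, verb, thing in product(negations, pos_verbs, things):
            text = f"I {neg} {verb} {thing}."
            pred = predict_sentiment(text)
            if pred != "negative":            # expected label for every case
                failures.append((text, pred))
        total = len(negations) * len(pos_verbs) * len(things)
        print(f"failure rate: {len(failures)}/{total}")
        return failures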

0:33:38 - 0:33:44     Text: And so what Ribeiro et al. 2020 showed is that they could actually provide a system

0:33:44 - 0:33:50     Text: that sort of had this framework of building test cases for NLP models, to ML engineers

0:33:50 - 0:33:53     Text: working on these products.

0:33:53 - 0:34:00     Text: And given that interface, they would actually find bugs, you know, bugs being categories

0:34:00 - 0:34:02     Text: of high error, right?

0:34:02 - 0:34:06     Text: Find bugs in their models that they could then kind of try to go and fix.

0:34:06 - 0:34:10     Text: And that this was kind of an efficient way of trying to find things that were simple

0:34:10 - 0:34:16     Text: and still wrong with what should be pretty sophisticated neural systems.

0:34:16 - 0:34:21     Text: So I really like this, and it's sort of a nice way of thinking more specifically about

0:34:21 - 0:34:27     Text: what are the capabilities in sort of precise terms of our models.

0:34:27 - 0:34:33     Text: And altogether, now you've seen problems in natural language inference.

0:34:33 - 0:34:37     Text: You've seen language models actually perform pretty well at the language modeling objective.

0:34:37 - 0:34:42     Text: But then you see, you just saw an example of a commercial sentiment analysis system

0:34:42 - 0:34:45     Text: that sort of should do better and doesn't.

0:34:45 - 0:34:52     Text: And this comes with really, I think, broad and important takeaway, which is if you get

0:34:52 - 0:34:58     Text: high accuracy on the in-domain test set, you are not guaranteed high accuracy on even

0:34:58 - 0:35:05     Text: what you might consider to be reasonable out of domain evaluations.

0:35:05 - 0:35:08     Text: And life is always out of domain.

0:35:08 - 0:35:12     Text: And if you're building a system that you then give to users, it's immediately out of

0:35:12 - 0:35:17     Text: domain, at the very least because it's trained on text that's now older than the things

0:35:17 - 0:35:18     Text: that the users are now saying.

0:35:18 - 0:35:23     Text: So it's a really, really important takeaway that your sort of benchmark accuracy is a

0:35:23 - 0:35:28     Text: single number that does not guarantee good performance on a wide variety of things.

0:35:28 - 0:35:32     Text: And from a, what are our neural networks doing perspective?

0:35:32 - 0:35:36     Text: One way to think about it is that models seem to be learning the data set, fitting sort

0:35:36 - 0:35:42     Text: of the fine-grained, sort of heuristics and statistics that help it fit this one data

0:35:42 - 0:35:44     Text: set as opposed to learning the task.

0:35:44 - 0:35:48     Text: So humans can perform natural language inference if you give them examples from whatever data

0:35:48 - 0:35:49     Text: set.

0:35:49 - 0:35:54     Text: You know, once you've told them how to do the task, they'll be very generally strong at

0:35:54 - 0:35:55     Text: it.

0:35:55 - 0:36:01     Text: But you take your MNLI model and you test it on HANS and it got, you know, whatever that

0:36:01 - 0:36:03     Text: was, below chance accuracy.

0:36:03 - 0:36:05     Text: That's not the kind of thing that you want to see.

0:36:05 - 0:36:10     Text: So it definitely learns the data set well because the accuracy in domain is high.

0:36:10 - 0:36:17     Text: But our models are seemingly not frequently learning, sort of the mechanisms that we would

0:36:17 - 0:36:19     Text: like them to be learning.

0:36:19 - 0:36:23     Text: Last week, we heard about language models and sort of the implicit knowledge that they

0:36:23 - 0:36:26     Text: encode about the world through pre-training.

0:36:26 - 0:36:30     Text: And one of the ways that we sought to interact with language models was providing them with

0:36:30 - 0:36:36     Text: a prompt like Dante was born in mask and then seeing if it puts high probability on the

0:36:36 - 0:36:42     Text: correct continuation, which requires you to access knowledge about where Dante was

0:36:42 - 0:36:43     Text: born.

0:36:43 - 0:36:47     Text: And we didn't frame it this way last week, but this fits into the set of behavioral studies

0:36:47 - 0:36:49     Text: that we've done so far.

0:36:49 - 0:36:51     Text: This is a specific kind of input.

0:36:51 - 0:36:54     Text: You could ask this for multiple types of multiple people.

0:36:54 - 0:36:56     Text: You could swap out Dante for other people.

0:36:56 - 0:37:02     Text: You could swap out born in for, I don't know, died in or something.

0:37:02 - 0:37:04     Text: And then you can, there are like test suites again.

0:37:04 - 0:37:07     Text: And so it's all connected.
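
As a minimal sketch of this kind of cloze-style knowledge probe, assuming the Hugging Face fill-mask pipeline and bert-base-uncased purely for illustration:

    # Knowledge probing with a cloze prompt: does the masked language model
    # put high probability on a factually correct continuation?
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    for person in ["Dante", "Shakespeare"]:
        prompt = f"{person} was born in [MASK]."
        predictions = fill_mask(prompt, top_k=3)
        print(prompt, [(p["token_str"], round(p["score"], 3)) for p in predictions])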

0:37:07 - 0:37:11     Text: OK, so I won't go too deep into sort of the knowledge of language models in terms of

0:37:11 - 0:37:14     Text: world knowledge because we've gone over it some.

0:37:14 - 0:37:20     Text: But when you're thinking about ways of interacting with your models, this sort of behavioral study

0:37:20 - 0:37:22     Text: can be very, very general.

0:37:22 - 0:37:27     Text: Even though, remember, we're at still this highest level of abstraction where we're just

0:37:27 - 0:37:30     Text: looking at the probability distributions that are defined.

0:37:30 - 0:37:33     Text: All right.

0:37:33 - 0:37:38     Text: So now we'll go into, so we've sort of looked at understanding in fine grain areas what

0:37:38 - 0:37:41     Text: our model is actually doing.

0:37:41 - 0:37:48     Text: What about sort of why for an individual input is it getting the answer right or wrong?

0:37:48 - 0:37:52     Text: And then are there changes to the inputs that look fine to humans, but actually make the

0:37:52 - 0:37:55     Text: models do a bad job?

0:37:55 - 0:38:02     Text: So one study that I love to reference that really draws back into our original motivation

0:38:02 - 0:38:07     Text: of using LSTM networks instead of simple recurrent neural networks was that they could use

0:38:07 - 0:38:10     Text: long context.

0:38:10 - 0:38:15     Text: But like how long is your long short term memory?

0:38:15 - 0:38:23     Text: And the idea of Khandelwal et al. 2018 was to shuffle or remove contexts that are farther

0:38:23 - 0:38:29     Text: than some k words away, changing k.

0:38:29 - 0:38:35     Text: And if the accuracy, if the predictive ability of your language model, the perplexity,

0:38:35 - 0:38:39     Text: right, doesn't change once you do that, it means the model wasn't actually using that

0:38:39 - 0:38:40     Text: context.

0:38:40 - 0:38:42     Text: I think this is so cool.

0:38:42 - 0:38:48     Text: So on the x-axis, we've got how far away from the word that you're trying to predict,

0:38:48 - 0:38:54     Text: are you actually sort of corrupting, shuffling, or removing stuff from the sequence.

0:38:54 - 0:38:57     Text: And then on the y-axis is the increase in loss.

0:38:57 - 0:39:03     Text: So if the increase in loss is zero, it means that the model was not using the thing that

0:39:03 - 0:39:08     Text: you just removed because if it was using it, it would now do worse without it, right?

0:39:08 - 0:39:13     Text: And so if you shuffle in the blue line here, if you shuffle the history that's farther

0:39:13 - 0:39:18     Text: away than 50 words, the model does not even notice.

0:39:18 - 0:39:20     Text: I think that's really interesting.

0:39:20 - 0:39:25     Text: One, it says everything past 50 words of this LSTM language model, you could have given

0:39:25 - 0:39:28     Text: it in random order and it wouldn't have noticed.

0:39:28 - 0:39:32     Text: And then two it says that if you're closer than that, it actually is making use of the

0:39:32 - 0:39:33     Text: word order.

0:39:33 - 0:39:36     Text: That's a pretty long memory, okay, that's really interesting.

0:39:36 - 0:39:42     Text: And then if you actually remove the words entirely, you can kind of notice that the words

0:39:42 - 0:39:45     Text: are missing up to 200 words away.

0:39:45 - 0:39:48     Text: So you don't know the order that you don't care about the order they're in, but you

0:39:48 - 0:39:50     Text: care whether they're there or not.

0:39:50 - 0:39:54     Text: And so this is an evaluation of, well, do LSTMs have long term memory?

0:39:54 - 0:40:01     Text: Well, this one at least has effectively no longer than 200 words of memory, but also

0:40:01 - 0:40:02     Text: no less.

0:40:02 - 0:40:07     Text: So very cool.
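
Here is a minimal sketch of that kind of context ablation, where lm_loss is a hypothetical stand-in for a function returning your language model's loss on predicting a target word given a list of context tokens; perturbing only the context farther than k tokens away and measuring the increase in loss tells you whether that part of the context was actually being used.

    # Context-ablation sketch: shuffle or drop all tokens farther than k
    # positions from the target and see whether the loss changes.
    # `lm_loss(context_tokens, target_token)` is a hypothetical stand-in for
    # the model's loss on predicting `target_token` given `context_tokens`.
    import random

    def perturb_context(context, k, mode="shuffle"):
        near, far = context[-k:], list(context[:-k])
        if mode == "shuffle":
            random.shuffle(far)   # keep the distant words, scramble their order
        elif mode == "drop":
            far = []              # remove the distant context entirely
        return far + near

    def increase_in_loss(lm_loss, context, target, k, mode="shuffle"):
        return lm_loss(perturb_context(context, k, mode), target) - lm_loss(context, target)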

0:40:07 - 0:40:09     Text: So that's like a general study for a single model.

0:40:09 - 0:40:15     Text: It talks about, it's sort of average behavior over a wide range of examples, but we want

0:40:15 - 0:40:18     Text: to talk about individual predictions on individual inputs.

0:40:18 - 0:40:19     Text: So let's talk about that.

0:40:19 - 0:40:26     Text: So one way of interpreting why did my model make this decision, that's very popular, is

0:40:26 - 0:40:31     Text: for a single example, what parts of the input actually led to the decision?

0:40:31 - 0:40:34     Text: And this is where we come in with saliency maps.

0:40:34 - 0:40:40     Text: So saliency map provides a score for each word indicating its importance to the model's

0:40:40 - 0:40:41     Text: prediction.

0:40:41 - 0:40:44     Text: So you've got something like Bert here.

0:40:44 - 0:40:45     Text: You've got Bert.

0:40:45 - 0:40:47     Text: Bert is making a prediction for this mask.

0:40:47 - 0:40:52     Text: And the sentence is, mask rushed to the emergency room to see her patient.

0:40:52 - 0:40:58     Text: And the prediction that the model is making is, it thinks with 47% it's going to be nurse

0:40:58 - 0:41:04     Text: that's here in the mask instead, or maybe woman, or doctor, or mother, or girl.

0:41:04 - 0:41:08     Text: And then the saliency map is being visualized here in orange.

0:41:08 - 0:41:13     Text: According to this method of saliency called simple gradients, which we'll get into, emergency

0:41:13 - 0:41:18     Text: her and the SEP token, it's not worried about the SEP token for now, but the emergency and

0:41:18 - 0:41:21     Text: her are the important words apparently.

0:41:21 - 0:41:25     Text: And the SEP token shows up in every sentence, so I'm not going to, yeah.

0:41:25 - 0:41:30     Text: And so these two together are, according to this method, what's important for the model

0:41:30 - 0:41:33     Text: to make this prediction to mask.

0:41:33 - 0:41:38     Text: And you can see maybe some statistics, biases, etc., that are picked up in the predictions

0:41:38 - 0:41:41     Text: and then have it mapped out onto the sentence.

0:41:41 - 0:41:47     Text: And this is, well, it seems like it's really helping interpretability.

0:41:47 - 0:41:52     Text: And yeah, I think that this is sort of a very useful tool.

0:41:52 - 0:42:00     Text: And actually, this is part of a demo from AllenNLP that allows you to do this yourself

0:42:00 - 0:42:02     Text: for any sentence that you want.

0:42:02 - 0:42:05     Text: So what's this way of making saliency maps?

0:42:05 - 0:42:07     Text: We're not going to go, there's so many ways to do it.

0:42:07 - 0:42:12     Text: We're going to take a very simple one and work through why it sort of makes sense.

0:42:12 - 0:42:17     Text: So the sort of issue is how do you define importance?

0:42:17 - 0:42:20     Text: What does it mean to be important to the model's prediction?

0:42:20 - 0:42:22     Text: And here's one way of thinking about it.

0:42:22 - 0:42:23     Text: It's called the simple gradient method.

0:42:23 - 0:42:25     Text: It's got a little formula.

0:42:25 - 0:42:27     Text: You got words x1 to xn.

0:42:27 - 0:42:28     Text: Okay?

0:42:28 - 0:42:31     Text: And then you got a model score for a given output class.

0:42:31 - 0:42:36     Text: So maybe you've got, in the birth example, each output class was each output word that

0:42:36 - 0:42:38     Text: you could possibly predict.

0:42:38 - 0:42:43     Text: And then you take the norm of the gradient of the score with respect to each word.

0:42:43 - 0:42:53     Text: Okay, so what we're saying here is the score is sort of the unnormalized probability for

0:42:53 - 0:42:55     Text: that class.

0:42:55 - 0:42:56     Text: Okay, so you got a single class.

0:42:56 - 0:43:00     Text: You're taking the score, like how likely it is, not yet normalized by how likely everything

0:43:00 - 0:43:02     Text: else is sort of.

0:43:02 - 0:43:07     Text: So gradient, how much is it going to change if I move it a little bit in one direction

0:43:07 - 0:43:11     Text: or another, and then you take the norm to get a scalar from a vector.

0:43:11 - 0:43:12     Text: So it looks like this.

0:43:12 - 0:43:17     Text: So salience of word i, you have the norm bars on the outside, gradient with respect to

0:43:17 - 0:43:19     Text: xi.

0:43:19 - 0:43:25     Text: So that's if I change a little bit locally xi, how much does my score change?

0:43:25 - 0:43:30     Text: So the idea is that a high gradient norm means that if I were to change it locally, I'd

0:43:30 - 0:43:32     Text: affect the score a lot.

0:43:32 - 0:43:34     Text: That means it was very important to the decision.
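
Concretely, the simple gradients salience of word i is the norm of the gradient of the class score with respect to that word's embedding, salience(x_i) = || d s_c / d x_i ||. Here is a minimal PyTorch sketch on a toy bag-of-embeddings classifier; the model and vocabulary are made up for illustration, but the same recipe applies to something like BERT by taking gradients with respect to its input embeddings.

    # Simple gradients saliency sketch: salience(x_i) = ||d score_c / d x_i||.
    # Toy classifier for illustration only.
    import torch
    import torch.nn as nn

    vocab = {"[CLS]": 0, "she": 1, "rushed": 2, "to": 3, "the": 4,
             "emergency": 5, "room": 6}
    embed = nn.Embedding(len(vocab), 16)
    classifier = nn.Linear(16, 3)            # three output classes

    tokens = ["[CLS]", "she", "rushed", "to", "the", "emergency", "room"]
    ids = torch.tensor([vocab[t] for t in tokens])

    x = embed(ids)                           # (seq_len, dim) word vectors
    x.retain_grad()                          # keep gradients on this non-leaf tensor
    score = classifier(x.mean(dim=0))[1]     # unnormalized score of class 1
    score.backward()

    saliency = x.grad.norm(dim=-1)           # one scalar per input word
    for token, s in zip(tokens, saliency.tolist()):
        print(f"{token:10s} {s:.4f}")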

0:43:34 - 0:43:35     Text: Let's visualize this a little bit.

0:43:35 - 0:43:39     Text: So here on the y axis we've got loss.

0:43:39 - 0:43:43     Text: Just the loss of the model, sorry, this should be score.

0:43:43 - 0:43:44     Text: It should be score.

0:43:44 - 0:43:47     Text: And on the x axis you've got word space.

0:43:47 - 0:43:53     Text: The word space is like sort of a flattening of the ability to move your word embedding

0:43:53 - 0:43:54     Text: in thousand dimensional space.

0:43:54 - 0:43:58     Text: So I've just plotted it here in one dimension.

0:43:58 - 0:44:04     Text: Now a high saliency thing, you can see that the relationship between what should be score

0:44:04 - 0:44:09     Text: and moving the word in word space, you move it a little bit on the x axis and the score

0:44:09 - 0:44:10     Text: changes a lot.

0:44:10 - 0:44:13     Text: That's that derivative, that's the gradient, awesome, love it.

0:44:13 - 0:44:20     Text: Low saliency, you move the word around locally and the score doesn't change.

0:44:20 - 0:44:23     Text: So that's an interpretation is.

0:44:23 - 0:44:27     Text: That means that the actual identity of this word wasn't that important to the prediction

0:44:27 - 0:44:31     Text: because I could have changed it and the score wouldn't have changed.

0:44:31 - 0:44:34     Text: Now why are there more methods than this?

0:44:34 - 0:44:38     Text: Because honestly reading that sounds awesome, that sounds great.

0:44:38 - 0:44:45     Text: There are sort of a lot of issues with this kind of method in lots of ways of getting around

0:44:45 - 0:44:46     Text: them.

0:44:46 - 0:44:47     Text: Here's one issue.

0:44:47 - 0:44:52     Text: It's not perfect because well maybe your linear approximation that the gradient gives

0:44:52 - 0:44:56     Text: you holds only very, very locally.

0:44:56 - 0:45:00     Text: So here the gradient is zero.

0:45:00 - 0:45:05     Text: So this is a low saliency word because at the bottom of this parabola, but if I were to

0:45:05 - 0:45:10     Text: move even a little bit in either direction, the score would shoot up.

0:45:10 - 0:45:11     Text: Is this not an important word?

0:45:11 - 0:45:19     Text: It seems important to be right there as opposed to anywhere else even sort of nearby in

0:45:19 - 0:45:22     Text: order for the score not to go up.

0:45:22 - 0:45:26     Text: The simple gradients method won't capture this because it just looks at the gradient which

0:45:26 - 0:45:29     Text: is that zero right there.

0:45:29 - 0:45:30     Text: Okay.

0:45:30 - 0:45:35     Text: But if you want to look into more, there's a bunch of different methods that are sort of

0:45:35 - 0:45:37     Text: applied in these papers.

0:45:37 - 0:45:42     Text: And I think that there's a good tool for the toolbox.

0:45:42 - 0:45:43     Text: Okay.

0:45:43 - 0:45:47     Text: So that is one way of explaining a prediction.

0:45:47 - 0:45:55     Text: And it has some issues like why are individual words being scored as opposed to phrases or

0:45:55 - 0:45:57     Text: something like that.

0:45:57 - 0:45:59     Text: But for now, we're going to move on to another type of explanation.

0:45:59 - 0:46:02     Text: And I'm going to check the time.

0:46:02 - 0:46:03     Text: Okay.

0:46:03 - 0:46:04     Text: Cool.

0:46:04 - 0:46:06     Text: Actually, yeah, let me pause for a second.

0:46:06 - 0:46:10     Text: Any questions about this?

0:46:10 - 0:46:16     Text: I mean, the earlier on, they were a couple of questions.

0:46:16 - 0:46:22     Text: One of them was, what are your thoughts on whether looking at attention weights is a methodologically

0:46:22 - 0:46:28     Text: rigorous way of determining the importance of the model places on certain tokens?

0:46:28 - 0:46:32     Text: It seems like there's some back and forth in the literature.

0:46:32 - 0:46:34     Text: That is a great question.

0:46:34 - 0:46:39     Text: And I probably won't engage with that question as much as I could if we had like a second

0:46:39 - 0:46:40     Text: lecture on this.

0:46:40 - 0:46:45     Text: I actually will provide some attention analyses and tell you they're interesting.

0:46:45 - 0:46:54     Text: And then I'll say a little bit about why they can be interesting without being sort of

0:46:54 - 0:47:04     Text: maybe sort of the end all of analysis of where information is flowing in a transformer,

0:47:04 - 0:47:05     Text: for example.

0:47:05 - 0:47:11     Text: I think the debate is something that we would have to get into in a much longer period of

0:47:11 - 0:47:12     Text: time.

0:47:12 - 0:47:16     Text: Look at the slides that I show about attention and the caveats that I provide and let me

0:47:16 - 0:47:19     Text: know if that answers your question first because we have quite a number of slides on it.

0:47:19 - 0:47:24     Text: And if not, please, please ask again and we can chat more about it.

0:47:24 - 0:47:27     Text: And maybe you can go on.

0:47:27 - 0:47:28     Text: Great.

0:47:28 - 0:47:29     Text: Okay.

0:47:29 - 0:47:33     Text: So, I think this is a really fascinating question which also gets at what was important

0:47:33 - 0:47:39     Text: about the input but in actually kind of an even more direct way, which is, could I just

0:47:39 - 0:47:42     Text: keep some minimal part of the input and get the same answer.

0:47:42 - 0:47:44     Text: So, here's an example from Squad.

0:47:44 - 0:47:51     Text: You have this passage in 1899, John Jacob Astor IV invested $100,000 for Tesla.

0:47:51 - 0:47:52     Text: Okay.

0:47:52 - 0:47:55     Text: And then the answer that is being predicted by the model is going to always be in blue

0:47:55 - 0:47:57     Text: in these examples, Colorado Springs experiments.

0:47:57 - 0:47:59     Text: So, you got this passage.

0:47:59 - 0:48:03     Text: And the question is what did Tesla spend Astor's money on?

0:48:03 - 0:48:06     Text: That's why the prediction is Colorado Springs experiments.

0:48:06 - 0:48:10     Text: The model gets the answer right, which is nice.

0:48:10 - 0:48:14     Text: And we would like to think it's because it's doing some kind of reading comprehension.

0:48:14 - 0:48:16     Text: But here's the issue.

0:48:16 - 0:48:23     Text: It turns out, based on this fascinating paper, that if you just reduced the question to

0:48:23 - 0:48:30     Text: did, you actually get exactly the same, you actually get exactly the same answer.

0:48:30 - 0:48:36     Text: And in fact, with the original question, the model had sort of a.78 confidence, you

0:48:36 - 0:48:38     Text: know, probability in that answer.

0:48:38 - 0:48:44     Text: And with the reduced question did, you get even higher confidence.

0:48:44 - 0:48:48     Text: And that, if you give a human this, they would not be able to know really what you're

0:48:48 - 0:48:49     Text: trying to ask about.

0:48:49 - 0:48:53     Text: So, it seems like something is going really wonky here.

0:48:53 - 0:48:54     Text: Here's another.

0:48:54 - 0:48:58     Text: So, here's sort of like a very high level overview of the method.

0:48:58 - 0:49:01     Text: In fact, it actually references our input saline's theme methods.

0:49:01 - 0:49:03     Text: Nice, it's connected.

0:49:03 - 0:49:09     Text: So, you iteratively remove non-salient or unimportant words.

0:49:09 - 0:49:12     Text: So here's a passage again talking about football.

0:49:12 - 0:49:13     Text: I think.

0:49:13 - 0:49:14     Text: Yeah.

0:49:14 - 0:49:16     Text: And, oh, nice.

0:49:16 - 0:49:17     Text: Okay.

0:49:17 - 0:49:20     Text: So, the question is, where did the Broncos practice with a super bowl as the prediction

0:49:20 - 0:49:24     Text: of Stanford University?

0:49:24 - 0:49:25     Text: And that is correct.

0:49:25 - 0:49:27     Text: So again, seems nice.

0:49:27 - 0:49:31     Text: And now, we're not actually going to get the model to be incorrect.

0:49:31 - 0:49:36     Text: We're just going to say, how can I change this question such that I still look at the

0:49:36 - 0:49:37     Text: answer right?

0:49:37 - 0:49:41     Text: So, I'm going to remove the word that was least important according to a saliency method.

0:49:41 - 0:49:45     Text: So, now, it's where did the practice for the super bowl?

0:49:45 - 0:49:48     Text: Already, this is sort of unanswerable because you've got two teams practicing.

0:49:48 - 0:49:50     Text: You don't even know which one you're asking about.

0:49:50 - 0:49:55     Text: So, why the model still thinks it's so confident in Stanford University makes no sense.

0:49:55 - 0:49:59     Text: But you can just sort of keep going.

0:49:59 - 0:50:07     Text: And now, I think, here, the model stops being confident in the answer Stanford University.

0:50:07 - 0:50:13     Text: But I think this is really interesting just to show that if the model is able to do this

0:50:13 - 0:50:19     Text: with very high confidence, it's not reflecting the uncertainty that really should be there

0:50:19 - 0:50:22     Text: because you can't know what you're even asking about.

0:50:22 - 0:50:23     Text: Okay.

0:50:23 - 0:50:26     Text: So, what was important to make this answer?

0:50:26 - 0:50:32     Text: Well, at least these parts were important because you could keep just those parts and

0:50:32 - 0:50:34     Text: get the same answer, fascinating.

0:50:34 - 0:50:36     Text: All right.

0:50:36 - 0:50:44     Text: So, that's sort of the end of the admittedly brief section on thinking about input saliency

0:50:44 - 0:50:45     Text: methods and similar things.

0:50:45 - 0:50:48     Text: Now, we're going to talk about actually breaking models and understanding models by breaking

0:50:48 - 0:50:49     Text: them.

0:50:49 - 0:50:50     Text: Okay.

0:50:50 - 0:50:51     Text: Cool.

0:50:51 - 0:50:58     Text: So, if we have a passage here, Peyton Manning came the first quarterback, something Super

0:50:58 - 0:51:02     Text: Bowl, age 39, past record, held by John L. Wei.

0:51:02 - 0:51:03     Text: Again, we're doing question answering.

0:51:03 - 0:51:05     Text: We got this question.

0:51:05 - 0:51:08     Text: What was the name of the quarterback who was 38 in the Super Bowl?

0:51:08 - 0:51:10     Text: The prediction is correct.

0:51:10 - 0:51:11     Text: Looks good.

0:51:11 - 0:51:16     Text: Now, we're not going to change the question to try to sort of make the question nonsensical

0:51:16 - 0:51:23     Text: while keeping the same answer, instead we're going to change the passage by adding the

0:51:23 - 0:51:25     Text: sentence at the end, which really shouldn't distract anyone.

0:51:25 - 0:51:30     Text: This is quarterback, well known quarterback, Jeff Dean, you know, had jersey number 37

0:51:30 - 0:51:31     Text: in champ bull.

0:51:31 - 0:51:34     Text: So, this just doesn't, it's really not even related.

0:51:34 - 0:51:40     Text: But now the prediction is Jeff Dean for our nice QA model.

0:51:40 - 0:51:46     Text: And so, this shows as well that it seems like maybe there's this like end of the passage

0:51:46 - 0:51:50     Text: by as to what the answer should be, for example.

0:51:50 - 0:51:55     Text: And so, that's an adversarial example where we flipped the prediction by adding something

0:51:55 - 0:51:57     Text: that is innocuous to humans.

0:51:57 - 0:52:01     Text: And so, sort of like the higher level takeaway is like, oh, it seems like the QA model that

0:52:01 - 0:52:02     Text: we had that seemed good.

0:52:02 - 0:52:07     Text: It's not actually performing QA how we want it to, even though it's in domain accuracy

0:52:07 - 0:52:10     Text: it was good.

0:52:10 - 0:52:12     Text: And here's another example.

0:52:12 - 0:52:19     Text: So, you've got this paragraph with a question, what has been the result of this publicity?

0:52:19 - 0:52:22     Text: The answer is increased scrutiny on teacher misconduct.

0:52:22 - 0:52:28     Text: Now instead of changing the paragraph, we're going to change the question in really, really

0:52:28 - 0:52:32     Text: seemingly in significant ways to change the model's prediction.

0:52:32 - 0:52:39     Text: So first, what HA, and I've got this typo L, then the result of this publicity, the

0:52:39 - 0:52:45     Text: answer changes to teacher misconduct, likely a human would sort of ignore this typo or

0:52:45 - 0:52:47     Text: something and answer the right answer.

0:52:47 - 0:52:49     Text: And then this is really nuts.

0:52:49 - 0:52:54     Text: Instead of asking what has been the result of this publicity, if you ask what's been

0:52:54 - 0:52:59     Text: the result of this publicity, the answer also changes.

0:52:59 - 0:53:03     Text: And this is the author's call, this is semantically equivalent adversary.

0:53:03 - 0:53:05     Text: This is pretty rough.

0:53:05 - 0:53:13     Text: But in general, swapping what for what in this QA model breaks it pretty frequently.

0:53:13 - 0:53:19     Text: And so again, when you go back and sort of re-tinker how to build your model, you're going

0:53:19 - 0:53:23     Text: to be thinking about these things, not just the sort of average accuracy.

0:53:23 - 0:53:28     Text: So that's sort of talking about noise.

0:53:28 - 0:53:31     Text: Our models are bus to noise and their inputs.

0:53:31 - 0:53:32     Text: Our humans are bus to noise.

0:53:32 - 0:53:36     Text: And so this is another question we can ask.

0:53:36 - 0:53:43     Text: And so you can kind of go to this popular sort of meme passed around the internet from time

0:53:43 - 0:53:49     Text: to time where you have all the letters in these words scrambled, you say, according to

0:53:49 - 0:53:54     Text: research or Cambridge University, it doesn't matter in what order the letters in a word

0:53:54 - 0:53:55     Text: are.

0:53:55 - 0:54:00     Text: And so it seems like, I think I did a pretty good job there.

0:54:00 - 0:54:07     Text: And we can be robust as humans to reading and processing the language without actually

0:54:07 - 0:54:10     Text: all that much of a difficulty.

0:54:10 - 0:54:15     Text: So that's maybe something that we might want our models to also be robust to.

0:54:15 - 0:54:19     Text: And it's very practical as well.

0:54:19 - 0:54:23     Text: Noise is a part of all NLP systems inputs at all times.

0:54:23 - 0:54:28     Text: There's just no such thing, effectively, as having users, for example, and not having

0:54:28 - 0:54:30     Text: any noise.

0:54:30 - 0:54:36     Text: And so there's a study that was performed on some popular machine translation models where

0:54:36 - 0:54:42     Text: you train machine translation models in French, German and Czech, I think all to English.

0:54:42 - 0:54:43     Text: And you get blue scores.

0:54:43 - 0:54:47     Text: These blue scores will look a lot better than the ones in your Simon Four because much,

0:54:47 - 0:54:48     Text: much more training data.

0:54:48 - 0:54:53     Text: The idea is these are actually pretty strong machine translation systems.

0:54:53 - 0:54:56     Text: And that's an in domain clean text.

0:54:56 - 0:55:03     Text: Now if you add character swaps like the ones we saw in that sentence about Cambridge,

0:55:03 - 0:55:07     Text: the blue scores take a pretty harsh dive.

0:55:07 - 0:55:09     Text: Not very good.

0:55:09 - 0:55:17     Text: And even if you take somewhat more natural typo noise distribution here, you'll see

0:55:17 - 0:55:27     Text: that you're still getting 20-ish drops in blue score through simply natural noise.

0:55:27 - 0:55:30     Text: And so maybe you'll go back and retrain the model on more types of noise.

0:55:30 - 0:55:32     Text: And then you ask, oh, I do that.

0:55:32 - 0:55:35     Text: Is it robust to even different kinds of noise?

0:55:35 - 0:55:37     Text: These are the questions that are going to be really important.

0:55:37 - 0:55:41     Text: And it's important to know that you're able to break your model really easily so that

0:55:41 - 0:55:45     Text: you can then go and try to make it more robust.

0:55:45 - 0:55:51     Text: OK, now, let's see, 20 minutes.

0:55:51 - 0:55:53     Text: Some.

0:55:53 - 0:55:57     Text: Now we're going to, I guess, yeah.

0:55:57 - 0:56:01     Text: So now we're going to look at the representations of our neural networks.

0:56:01 - 0:56:06     Text: We've talked about sort of their behavior and then whether we could sort of change or

0:56:06 - 0:56:09     Text: observe reasons behind their behavior.

0:56:09 - 0:56:15     Text: Now we'll go into less abstraction, like more at the actual vector representations that

0:56:15 - 0:56:17     Text: are being built by models.

0:56:17 - 0:56:24     Text: And we can answer a different kind of question at the very least than with the other studies.

0:56:24 - 0:56:30     Text: The first thing is related to the question I was asked about attention, which is that

0:56:30 - 0:56:33     Text: some modeling components lend themselves to inspection.

0:56:33 - 0:56:37     Text: Now this is a sentence that I chose somewhat carefully actually because in part of this

0:56:37 - 0:56:42     Text: debate, are they interpretable components?

0:56:42 - 0:56:43     Text: We'll see.

0:56:43 - 0:56:46     Text: But they lend themselves to inspection in the following way.

0:56:46 - 0:56:51     Text: You can visualize them well and you can correlate them easily with various properties.

0:56:51 - 0:56:53     Text: So let's say you have attention heads in Burt.

0:56:53 - 0:57:00     Text: This is from a really nice study that was done here where you look at attention heads

0:57:00 - 0:57:05     Text: of Burt and you say, on most sentences, this attention head had one one seems to do this

0:57:05 - 0:57:08     Text: very sort of global aggregation.

0:57:08 - 0:57:11     Text: Simple kind of operation does this pretty consistently.

0:57:11 - 0:57:13     Text: That's cool.

0:57:13 - 0:57:16     Text: Is it interpretable?

0:57:16 - 0:57:18     Text: Well, maybe, right?

0:57:18 - 0:57:25     Text: So it's the first layer, which means that this word found is sort of uncontextualized.

0:57:25 - 0:57:31     Text: And then, you know, but in deeper layers, the problem is that like once you do some

0:57:31 - 0:57:37     Text: rounds of attention, you've had information mixing and flowing between words.

0:57:37 - 0:57:40     Text: And how do you know exactly what information you're combining, what you're attending

0:57:40 - 0:57:44     Text: to, even, the little hard to tell.

0:57:44 - 0:57:50     Text: And saliency methods more directly sort of evaluate the importance of models.

0:57:50 - 0:57:54     Text: But it's still interesting to see at sort of a local mechanistic point of view what

0:57:54 - 0:57:57     Text: kinds of things are being attended to.

0:57:57 - 0:58:01     Text: So let's take another example.

0:58:01 - 0:58:02     Text: Some attention heads seem to perform simple operations.

0:58:02 - 0:58:05     Text: So you have the global aggregation here that we saw already.

0:58:05 - 0:58:09     Text: Others seem to attend pretty robustly to the next token.

0:58:09 - 0:58:10     Text: Cool.

0:58:10 - 0:58:12     Text: Next token is a great signal.

0:58:12 - 0:58:14     Text: Some heads attend to the CEP token.

0:58:14 - 0:58:17     Text: So here you have attending to CEP.

0:58:17 - 0:58:18     Text: And then maybe some attend to periods.

0:58:18 - 0:58:23     Text: Maybe that's sort of a splitting sentences together and things like that.

0:58:23 - 0:58:25     Text: Not things that are hard to do.

0:58:25 - 0:58:30     Text: But things that some attention had seemed to pretty robustly perform.

0:58:30 - 0:58:35     Text: Again now though, deep in the network, what's actually represented at this period at layer

0:58:35 - 0:58:37     Text: 11?

0:58:37 - 0:58:38     Text: Little unclear.

0:58:38 - 0:58:39     Text: Little unclear.

0:58:39 - 0:58:41     Text: Okay.

0:58:41 - 0:58:46     Text: So some heads though are correlated with really interesting linguistic properties.

0:58:46 - 0:58:49     Text: So this head is actually attending to noun modifiers.

0:58:49 - 0:58:57     Text: So you got this the complicated language in the huge new law.

0:58:57 - 0:59:00     Text: That's pretty fascinating.

0:59:00 - 0:59:05     Text: Even if the model is not like doing this as a causal mechanism to do syntax necessarily,

0:59:05 - 0:59:10     Text: the fact that these things so strongly correlate is actually pretty, pretty cool.

0:59:10 - 0:59:13     Text: And so what we have in all of these studies is we've got sort of an approximate

0:59:13 - 0:59:19     Text: return partition and quantitative analysis relating, like allowing us to reason about very

0:59:19 - 0:59:21     Text: complicated model behavior.

0:59:21 - 0:59:24     Text: They're all approximations, but they're definitely interesting.

0:59:24 - 0:59:26     Text: One other example is that of co-reference.

0:59:26 - 0:59:29     Text: So we saw some work on co-reference.

0:59:29 - 0:59:36     Text: And it seems like this head does a pretty okay job of actually matching up co-referent

0:59:36 - 0:59:37     Text: entities.

0:59:37 - 0:59:39     Text: These are in red.

0:59:39 - 0:59:42     Text: Talks, negotiations, she, her.

0:59:42 - 0:59:43     Text: And that's not obvious how to do that.

0:59:43 - 0:59:45     Text: This is a difficult task.

0:59:45 - 0:59:50     Text: And so it does so with some percentage of the time.

0:59:50 - 0:59:56     Text: And again, it's sort of connecting very complex model behavior to these sort of interpretable

0:59:56 - 1:00:00     Text: summaries of correlating properties.

1:00:00 - 1:00:04     Text: Other cases you can have individual hidden units that lend themselves to interpretation.

1:00:04 - 1:00:10     Text: So here you've got a character level LSTM language model.

1:00:10 - 1:00:14     Text: Which row here is a sentence, if you can't read it, it's totally okay.

1:00:14 - 1:00:18     Text: The interpretation that you should take is that as we walk along the sentence, this single

1:00:18 - 1:00:23     Text: unit is going from I think very negative to very positive or very positive to very

1:00:23 - 1:00:24     Text: negative.

1:00:24 - 1:00:26     Text: I don't really remember.

1:00:26 - 1:00:30     Text: But it's tracking the position in the line.

1:00:30 - 1:00:34     Text: So it's just a linear position unit and pretty robustly doing so across all of these

1:00:34 - 1:00:36     Text: sentences.

1:00:36 - 1:00:42     Text: So this is from a nice visualization study way back in 2016, way back.

1:00:42 - 1:00:47     Text: Here's another cell from that same LSTM language model that seems to sort of turn on inside

1:00:47 - 1:00:48     Text: quotes.

1:00:48 - 1:00:50     Text: So here's a quote and then it turns on.

1:00:50 - 1:00:53     Text: Okay, so I guess that's positive in the blue.

1:00:53 - 1:00:55     Text: End quote here.

1:00:55 - 1:00:57     Text: And then it's negative.

1:00:57 - 1:01:01     Text: Here you start with no quote, negative in the red.

1:01:01 - 1:01:03     Text: See a quote and then blue.

1:01:03 - 1:01:08     Text: Again, very interpretable, also potentially a very useful feature to keep in mind.

1:01:08 - 1:01:11     Text: And this is just an individual unit in the LSTM that you can just look at and see that

1:01:11 - 1:01:13     Text: it does this.

1:01:13 - 1:01:17     Text: Very, very interesting.

1:01:17 - 1:01:26     Text: Even farther on this, and this is actually a study by some AI and neuroscience researchers,

1:01:26 - 1:01:29     Text: we saw the LSTMs were good at subject for a number agreement.

1:01:29 - 1:01:33     Text: Can we figure out the mechanisms by which the LSTM is solving the task?

1:01:33 - 1:01:35     Text: We actually get some insight into that.

1:01:35 - 1:01:37     Text: And so we have a word level language model.

1:01:37 - 1:01:41     Text: The word level language model is going to be a little small, but you have a sentence,

1:01:41 - 1:01:45     Text: the boy, gently, and kindly greets the.

1:01:45 - 1:01:51     Text: And this cell that's being tracked here, so it's an individual hidden unit, one dimension,

1:01:51 - 1:01:57     Text: right, is actually after it sees boy, it sort of starts to go higher.

1:01:57 - 1:02:02     Text: And then it goes down to something very small once it sees greets.

1:02:02 - 1:02:08     Text: And this cell seems to correlate with the scope of a subject for number agreement instance

1:02:08 - 1:02:09     Text: effectively.

1:02:09 - 1:02:14     Text: So here, the boy that watches the dog, that watches the cat greets, you got that cell,

1:02:14 - 1:02:20     Text: again, staying high, maintaining the scope of subject until greets, at which point it

1:02:20 - 1:02:22     Text: stops.

1:02:22 - 1:02:23     Text: What allows it to do that?

1:02:23 - 1:02:28     Text: Probably some complex other dynamics in the network, but it's still a fascinating, I

1:02:28 - 1:02:30     Text: think, insight.

1:02:30 - 1:02:37     Text: And yeah, this is just neuron, 1,150 in this LSTM.

1:02:37 - 1:02:46     Text: Now, so those are sort of all observational studies that you could do by picking out individual

1:02:46 - 1:02:51     Text: components of the model that you can sort of just take each one of and correlating them

1:02:51 - 1:02:53     Text: with some behavior.

1:02:53 - 1:03:00     Text: Now we'll look at a general class of methods called probing by which we still sort of use

1:03:00 - 1:03:06     Text: supervised knowledge, like the knowledge of the type of co-reference that we're looking

1:03:06 - 1:03:07     Text: for.

1:03:07 - 1:03:10     Text: But instead of thinking if it correlates with something that's immediately interpretable,

1:03:10 - 1:03:16     Text: like a attention head, we're going to look into the vector representations of the model

1:03:16 - 1:03:21     Text: and see if these properties can be read out by some simple function.

1:03:21 - 1:03:26     Text: To say, oh, maybe this property was made very easily accessible by my neural network.

1:03:26 - 1:03:28     Text: So let's dig into this.

1:03:28 - 1:03:34     Text: So the general paradigm is that you've got language data that goes into some big pre-trained

1:03:34 - 1:03:36     Text: transformer with fine tuning.

1:03:36 - 1:03:39     Text: And you get state-of-the-art results.

1:03:39 - 1:03:41     Text: So that means state-of-the-art.

1:03:41 - 1:03:46     Text: And so the question for the probing sort of methodology is like, if it's providing these

1:03:46 - 1:03:51     Text: general purpose language representations, what does it actually encode about language?

1:03:51 - 1:03:54     Text: Like, can we quantify this?

1:03:54 - 1:03:57     Text: Can we figure out what kinds of things is learning about language that we seemingly

1:03:57 - 1:04:00     Text: now don't have to tell it?

1:04:00 - 1:04:06     Text: And so you might have something like a sentence, like I record the record.

1:04:06 - 1:04:08     Text: That's an interesting sentence.

1:04:08 - 1:04:13     Text: And you put it into your transformer model with its word embeddings at the beginning,

1:04:13 - 1:04:17     Text: maybe some layers of self-attention and stuff, and you make some predictions.

1:04:17 - 1:04:21     Text: And now our objects of study are going to be these intermediate layers.

1:04:21 - 1:04:22     Text: Right?

1:04:22 - 1:04:27     Text: So it's a vector per word or sub word for every layer.

1:04:27 - 1:04:31     Text: And the question is, like, can we use these linguistic properties like the dependency

1:04:31 - 1:04:38     Text: parsing that we had way back in the early part of the course to understand correlations

1:04:38 - 1:04:44     Text: between properties in the vectors and these things that we can interpret.

1:04:44 - 1:04:46     Text: We can interpret dependency parses.

1:04:46 - 1:04:51     Text: So there are a couple of things that we might want to look for here.

1:04:51 - 1:04:53     Text: You might want to look for semantics.

1:04:53 - 1:04:56     Text: So here, in the sentence, I record the record.

1:04:56 - 1:04:58     Text: I am an agent.

1:04:58 - 1:05:01     Text: That's a semantics thing.

1:05:01 - 1:05:02     Text: Record is a patient.

1:05:02 - 1:05:04     Text: It's the thing I'm recording.

1:05:04 - 1:05:05     Text: You might have syntax.

1:05:05 - 1:05:07     Text: So you might have the syntax tree that you're interested in.

1:05:07 - 1:05:09     Text: That's the dependency parse tree.

1:05:09 - 1:05:11     Text: Maybe you're interested in part of speech, right?

1:05:11 - 1:05:14     Text: Because you have record and record.

1:05:14 - 1:05:17     Text: And the first one's a verb, the second one's a noun.

1:05:17 - 1:05:19     Text: They're identical strings.

1:05:19 - 1:05:23     Text: That's the model encode that one is one and the other is the other.

1:05:23 - 1:05:26     Text: So how do we do this kind of study?

1:05:26 - 1:05:29     Text: So we're going to decide on a layer that we want to analyze.

1:05:29 - 1:05:31     Text: And we're going to freeze Bert.

1:05:31 - 1:05:32     Text: So we're not going to fine tune Bert.

1:05:32 - 1:05:34     Text: All the parameters are frozen.

1:05:34 - 1:05:36     Text: So we decide on layer two of Bert.

1:05:36 - 1:05:38     Text: We're going to pass it some sentences.

1:05:38 - 1:05:42     Text: We decide on what's called a probe family.

1:05:42 - 1:05:49     Text: The question I'm asking is, can I use a model for my family, say linear, to decode a property

1:05:49 - 1:05:53     Text: that I'm interested in really well from this layer?

1:05:53 - 1:06:00     Text: So it's indicating that this property is easily accessible to linear models effectively.

1:06:00 - 1:06:06     Text: So maybe I get a train a model, a train a linear classifier on top of Bert.

1:06:06 - 1:06:09     Text: And I get a really high accuracy.

1:06:09 - 1:06:13     Text: That's sort of interesting already because you know from prior work in part of speech

1:06:13 - 1:06:18     Text: tagging that if you run a linear classifier on simpler features that aren't Bert, you

1:06:18 - 1:06:20     Text: probably don't get as high an accuracy.

1:06:20 - 1:06:22     Text: So that's an interesting sort of takeaway.

1:06:22 - 1:06:24     Text: But then you can also take like a baseline.

1:06:24 - 1:06:26     Text: So I want to compare two layers now.

1:06:26 - 1:06:27     Text: So I've got layer one here.

1:06:27 - 1:06:29     Text: I want to compare it to layer two.

1:06:29 - 1:06:32     Text: I train a probe on it as well.

1:06:32 - 1:06:34     Text: Maybe the accuracy isn't as good.

1:06:34 - 1:06:40     Text: Now I can say, oh wow, look, by layer two, part of speech is more easily accessible to linear

1:06:40 - 1:06:44     Text: functions than it was at layer one.

1:06:44 - 1:06:45     Text: So what did that?

1:06:45 - 1:06:49     Text: Well, the self-attention and feed-forward stuff made it more easily accessible.

1:06:49 - 1:06:53     Text: That's interesting because it's a statement about sort of the information processing of

1:06:53 - 1:06:55     Text: the model.

1:06:55 - 1:06:56     Text: Okay.

1:06:56 - 1:07:00     Text: Okay, so that's, we're going to analyze these layers.

1:07:00 - 1:07:06     Text: Just take a second more to think about it, and you just really give me just a second.

1:07:06 - 1:07:12     Text: So if you have the model representations, h1 to ht, and you have a function family f,

1:07:12 - 1:07:16     Text: that's the subset linear models, or maybe you have like a feed-forward neural network, some

1:07:16 - 1:07:21     Text: fixed set of hyper parameters, freeze the model, train the probe.

1:07:21 - 1:07:25     Text: So you get some predictions for part of speech tagging or whatever.

1:07:25 - 1:07:28     Text: That's just the probe applied to the hidden state of the model.

1:07:28 - 1:07:33     Text: The probe was a member of the probe family, and then the extent that we can predict why

1:07:33 - 1:07:34     Text: is a measure of accessibility.

1:07:34 - 1:07:37     Text: So that's just kind of written out not as pictorially.

1:07:37 - 1:07:38     Text: Okay.

1:07:38 - 1:07:44     Text: So I'm not going to stay on this for too much longer.

1:07:44 - 1:07:49     Text: And it may help in the search for causal mechanisms, but it sort of just gives us a rough

1:07:49 - 1:07:54     Text: understanding of sort of processing of the model and what things are accessible at what

1:07:54 - 1:07:55     Text: layer.

1:07:55 - 1:07:57     Text: So what are some results here?

1:07:57 - 1:08:03     Text: So one result is that BERT, if you run linear probes on it, does really, really well on things

1:08:03 - 1:08:07     Text: that require syntax in part of speech named NETA recognition.

1:08:07 - 1:08:12     Text: Actually in some cases, approximately as well as just doing the very best thing you could

1:08:12 - 1:08:15     Text: possibly do without BERT.

1:08:15 - 1:08:19     Text: So it just makes easily accessible, amazingly strong features for these properties.

1:08:19 - 1:08:26     Text: And that's an interesting sort of emergent quality of BERT, you might say.

1:08:26 - 1:08:31     Text: It seems like as well that the layers of BERT have this property where, so if you look

1:08:31 - 1:08:39     Text: at the columns of this plot here, each column is a task, you've got input words at the sort

1:08:39 - 1:08:44     Text: of layer zero of BERT here, layer 24 is the last layer of BERT large, lower performance

1:08:44 - 1:08:51     Text: is yellow, higher performance is blue, and I, the resolution isn't perfect, but consistently

1:08:51 - 1:08:55     Text: the best place to read out these properties is somewhere a bit past the middle of the

1:08:55 - 1:09:01     Text: model, which is a very consistent rule, which is fascinating.

1:09:01 - 1:09:07     Text: And then it seems as well like if you look at this function of increasingly abstract or

1:09:07 - 1:09:11     Text: increasingly difficult to compute linguistic properties on this axis, an increasing

1:09:11 - 1:09:17     Text: depth in the network on that axis, so the deeper you go in the network, it seems like

1:09:17 - 1:09:24     Text: the more easily you can access more and more abstract linguistic properties, suggesting

1:09:24 - 1:09:29     Text: that that accessibility is being constructed over time by the layers of processing of BERT,

1:09:29 - 1:09:32     Text: so it's building more and more abstract features.

1:09:32 - 1:09:37     Text: Which I think is again, sort of really interesting result.

1:09:37 - 1:09:43     Text: And now I think, yeah, one thing that I think comes to mind that really brings us back

1:09:43 - 1:09:48     Text: right today one is we built intuitions around word to veck.

1:09:48 - 1:09:51     Text: We were asking like what does each dimension of word to veck mean?

1:09:51 - 1:09:57     Text: And the answer was not really anything, but we could build intuitions about it and

1:09:57 - 1:10:01     Text: think about properties of it through sort of these connections between simple mathematical

1:10:01 - 1:10:08     Text: properties of word to veck and linguistic properties that we could sort of understand.

1:10:08 - 1:10:12     Text: So we had this approximation, which is not 100% true, but it's an approximation that

1:10:12 - 1:10:22     Text: says cosine similarity is effectively correlated with semantic similarity.

1:10:22 - 1:10:25     Text: Think about even if all we're going to do at the end of the day is fine tune these word

1:10:25 - 1:10:27     Text: embeddings anyway.

1:10:27 - 1:10:32     Text: Likewise we had this sort of idea about the analogies being encoded by linear offsets.

1:10:32 - 1:10:39     Text: So some relationships are linear in space and they didn't have to be, that's fascinating.

1:10:39 - 1:10:43     Text: This is emergent property that we've now been able to study since we discovered this.

1:10:43 - 1:10:45     Text: Why is that the case in word to veck?

1:10:45 - 1:10:51     Text: And in general, even though you can't interpret the individual dimensions of word to veck,

1:10:51 - 1:10:57     Text: these sort of emergent, interpretable connections between approximately linguistic ideas and sort

1:10:57 - 1:11:00     Text: of simple math on these objects is fascinating.

1:11:00 - 1:11:06     Text: And so one piece of work that sort of extends this idea comes back to dependency parse

1:11:06 - 1:11:09     Text: trees. So they describe the syntax of sentences.

1:11:09 - 1:11:18     Text: And in a paper that I did with Chris, we showed that actually birds and models like it make

1:11:18 - 1:11:24     Text: dependency parse tree structure emergent sort of more easily accessible than one might

1:11:24 - 1:11:26     Text: imagine in its vector space.

1:11:26 - 1:11:32     Text: So if you've got a tree right here, the chef who ran to the store was out of food.

1:11:32 - 1:11:39     Text: So what you can sort of do is think about the tree in terms of distances between words.

1:11:39 - 1:11:44     Text: So you've got the number of edges in the tree between two words is their path distance.

1:11:44 - 1:11:48     Text: So you've got sort of that the distance between chef and was is one.

1:11:48 - 1:11:52     Text: And we're going to use this interpretation of a tree as a distance to make a connection

1:11:52 - 1:11:54     Text: with birds embedding space.

1:11:54 - 1:12:00     Text: And what we were able to show is that under a single linear transformation, the squared

1:12:00 - 1:12:07     Text: Euclidean distance between bird vectors for the same sentence actually correlates well

1:12:07 - 1:12:12     Text: if you choose the B matrix right with the distances in the tree.

1:12:12 - 1:12:17     Text: So here in this Euclidean space that we've transformed, the approximate distance between

1:12:17 - 1:12:21     Text: chef and was is also one.

1:12:21 - 1:12:26     Text: Likewise the difference between was and store is four in the tree.

1:12:26 - 1:12:31     Text: And in my simple sort of transformation of bird space, the distance between store and

1:12:31 - 1:12:33     Text: was is also approximately four.

1:12:33 - 1:12:36     Text: And this is true across a wide range of sentences.

1:12:36 - 1:12:42     Text: And this is like to me a fascinating example of again emergent approximate structure in

1:12:42 - 1:12:49     Text: these very nonlinear models that don't necessarily need to encode things so simply.

1:12:49 - 1:12:51     Text: Okay.

1:12:51 - 1:12:52     Text: All right.

1:12:52 - 1:12:53     Text: Great.

1:12:53 - 1:12:59     Text: So probing studies and correlation studies are I think interesting and pointless in directions

1:12:59 - 1:13:01     Text: to build intuitions about models.

1:13:01 - 1:13:05     Text: But they're not arguments that the model is actually using the thing that you're finding

1:13:05 - 1:13:07     Text: to make a decision.

1:13:07 - 1:13:10     Text: Not causal studies.

1:13:10 - 1:13:12     Text: This is for probing and correlation studies.

1:13:12 - 1:13:18     Text: So and some work that I did around the same time, we showed actually that certain conditions

1:13:18 - 1:13:22     Text: on probes allow you to achieve high accuracy on a task.

1:13:22 - 1:13:25     Text: It's effectively just fitting random labels.

1:13:25 - 1:13:31     Text: And so there's a difficulty of interpreting what the model could or could not be doing

1:13:31 - 1:13:34     Text: with this thing that is somehow easily accessible.

1:13:34 - 1:13:38     Text: It's interesting that this property is easily accessible, but the model might not be doing

1:13:38 - 1:13:42     Text: anything with it, for example, because it's totally random.

1:13:42 - 1:13:47     Text: Likewise, another paper showed that you can achieve high accuracy with a probe, even

1:13:47 - 1:13:52     Text: if the model is trained to know that thing that you're probing for is not useful.

1:13:52 - 1:13:56     Text: And there's causal studies that sort of try to extend this work.

1:13:56 - 1:14:01     Text: It's much more difficult to read this paper than it's a fascinating line of future work.

1:14:01 - 1:14:07     Text: Now in my last two minutes, I want to talk about recasting model tweaks and ablations

1:14:07 - 1:14:09     Text: as analysis.

1:14:09 - 1:14:14     Text: So we had this improvement process where we had a network that was going to work, okay.

1:14:14 - 1:14:17     Text: And we would see whether we could tweak it in simple ways to improve it.

1:14:17 - 1:14:21     Text: And then you could see whether you could remove anything and how it still be okay.

1:14:21 - 1:14:22     Text: And that's kind of like analysis.

1:14:22 - 1:14:23     Text: Like I have my network.

1:14:23 - 1:14:26     Text: Do I want it to like, is it going to be better if it's more complicated, if it's going

1:14:26 - 1:14:30     Text: to be better, if it's simpler, can I get away with it being simpler?

1:14:30 - 1:14:35     Text: And so one example of some folks who did this is they took this idea of multi-headed

1:14:35 - 1:14:40     Text: attention and said, so many heads, all the heads important.

1:14:40 - 1:14:44     Text: And what they showed is that if you train a system with multi-headed attention and then

1:14:44 - 1:14:49     Text: just remove the heads at test time and not use them at all, you can actually do pretty

1:14:49 - 1:14:54     Text: well on the original task, not retraining at all without some of the attention heads,

1:14:54 - 1:14:56     Text: showing that they weren't important.

1:14:56 - 1:14:58     Text: You could just get rid of them after training.

1:14:58 - 1:15:02     Text: And likewise, you can do the same thing for, this is on machine translation, this is on

1:15:02 - 1:15:06     Text: multi-analye, you can actually get away without a large, large percentage of your attention

1:15:06 - 1:15:07     Text: heads.

1:15:07 - 1:15:11     Text: Let's see.

1:15:11 - 1:15:17     Text: Yeah, so another thing that you could think about is questioning sort of the basics of

1:15:17 - 1:15:19     Text: the models that we're building.

1:15:19 - 1:15:22     Text: So we have transformer models that are sort of self-attention, feed-forward, self-attention,

1:15:22 - 1:15:28     Text: feed-forward, but like why in that order, with some of the things emitted here, and this

1:15:28 - 1:15:33     Text: paper asked this question and said, if this is my transformer, self-attention, feed-forward,

1:15:33 - 1:15:36     Text: self-attention, feed-forward, et cetera, et cetera, et cetera.

1:15:36 - 1:15:40     Text: And if I just reordered it so that I had a bunch of self-attention at the head and a bunch

1:15:40 - 1:15:44     Text: of feed-forward at the back, and they tried a bunch of these orderings, and this one actually

1:15:44 - 1:15:45     Text: does better.

1:15:45 - 1:15:48     Text: So this achieves a lower perplexity on a benchmark.

1:15:48 - 1:15:53     Text: And this is a way of analyzing what's important about the architectures that I'm building,

1:15:53 - 1:15:56     Text: and how can they be changed in order to perform better.

1:15:56 - 1:16:00     Text: So neural models are very complex, and they're difficult to characterize and impossible to

1:16:00 - 1:16:05     Text: characterize with a single sort of statistic, I think, for your test set accuracy, especially

1:16:05 - 1:16:07     Text: in domain.

1:16:07 - 1:16:12     Text: And we want to find intuitive descriptions of model behaviors, but we should look at

1:16:12 - 1:16:16     Text: multiple levels of abstraction, and none of them are going to be complete.

1:16:16 - 1:16:20     Text: And someone tells you that their neural network is interpretable.

1:16:20 - 1:16:23     Text: I encourage you to engage critically with that.

1:16:23 - 1:16:28     Text: It's not necessarily false, but like the levels of interpretability and what you can interpret,

1:16:28 - 1:16:32     Text: these are the questions that you should be asking, because it's going to be opaque in

1:16:32 - 1:16:35     Text: some ways, almost definitely.

1:16:35 - 1:16:41     Text: And then bring this lens to your model building as you try to think about how to build better

1:16:41 - 1:16:46     Text: models, even if you're not going to be doing analysis as one of your main driving goals.

1:16:46 - 1:16:50     Text: And with that, good luck on your final projects.

1:16:50 - 1:16:52     Text: I realize we're at time.

1:16:52 - 1:16:57     Text: The teaching staff is really appreciative of your efforts over this difficult quarter.

1:16:57 - 1:17:04     Text: And yeah, I hope there's a lecture left on Thursday, but yeah, this is my last one.

1:17:04 - 1:17:05     Text: So thanks, everyone.