Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 13 - Coreference Resolution

0:00:00 - 0:00:14     Text: Okay, hi everyone. So we'll get started again. We're now into week seven of CS224N. If

0:00:14 - 0:00:18     Text: you're following along the syllabus really closely, we actually did a little bit of

0:00:18 - 0:00:25     Text: a rearrangement in classes. And so today it's me and I'm going to talk about co-reference

0:00:25 - 0:00:30     Text: resolution, which is another chance we get to take a deeper dive into a more linguistic

0:00:30 - 0:00:35     Text: topic. It will also show you a couple of new things for deep learning models at the

0:00:35 - 0:00:41     Text: same time. And then the lecture that had previously been scheduled at this point, which was going

0:00:41 - 0:00:49     Text: to be John on explanation in neural models, has been shifted later down into week nine,

0:00:49 - 0:00:52     Text: I think it is. But you'll still get him later.

0:00:52 - 0:00:59     Text: Before getting underway, just a couple of announcements on things. Well, first of all, congratulations

0:00:59 - 0:01:06     Text: on surviving assignment five, I hope. I know it was a bit of a challenge for some of you,

0:01:06 - 0:01:10     Text: but I hope it was a rewarding state of the art learning experience on the latest in neural

0:01:10 - 0:01:15     Text: nets. And at any rate, you know, this was a brand new assignment that we used for the

0:01:15 - 0:01:20     Text: first time this year. So we'll really appreciate later on when we do the second survey

0:01:20 - 0:01:26     Text: taking your feedback on it. We've been busy reading people's final project proposals.

0:01:26 - 0:01:32     Text: Thanks, lots of interesting stuff there. Our goal is to get them back to you tomorrow. But

0:01:32 - 0:01:37     Text: you know, as soon as you've had a good night's sleep after assignment five, now is also a

0:01:37 - 0:01:42     Text: great time to get started working on your final projects because there's just not that

0:01:42 - 0:01:47     Text: much time to the end of quarter. And I particularly want to encourage all of you to chat to your

0:01:47 - 0:01:54     Text: mentor, regularly go and visit office hours and keep in touch, get advice, just talking through

0:01:54 - 0:01:59     Text: things is a good way to keep you on track. We also plan to give back assignment four grades

0:01:59 - 0:02:07     Text: later this week. So the work never stops at this point. So the next thing for the

0:02:07 - 0:02:14     Text: final project is the final project milestone. So that we handed out the details of that last

0:02:14 - 0:02:22     Text: Friday and it's due a week from today. So the idea of this final project milestone is really to

0:02:22 - 0:02:28     Text: help keep you on track and keep things moving towards having a successful final project.

0:02:28 - 0:02:34     Text: So our hope is that sort of most of what you write for the final project milestone is material

0:02:34 - 0:02:40     Text: you can also include in your final project, except for a few paragraphs of here's exactly where I'm

0:02:40 - 0:02:46     Text: up to now. So the overall hope is that doing this in two parts and having a milestone before

0:02:46 - 0:02:52     Text: the final thing is just making you make progress and be on track for having a successful final project.

0:02:54 - 0:03:00     Text: Finally, the next class on Thursday is going to be Colin Raffel. This is going to be super

0:03:00 - 0:03:07     Text: exciting. So he's going to be talking more about the very latest in large pre-trained language

0:03:07 - 0:03:13     Text: models, both what some of their successes are and also what some of the disconcerting, not quite

0:03:13 - 0:03:19     Text: so good aspects of those models are. So that should be a really good, interesting lecture. When we

0:03:19 - 0:03:25     Text: had him come and talk to our NLP seminar, we had several hundred people come along for that.

0:03:27 - 0:03:35     Text: And so for this talk again, we're asking that you write a reaction paragraph following the same

0:03:35 - 0:03:43     Text: instructions as last time about what's in this lecture. And someone asked in the questions,

0:03:43 - 0:03:49     Text: well, what about last Thursday's? The answer to that is no. So the distinction here is we're only doing

0:03:49 - 0:03:58     Text: the reaction paragraphs for outside guest speakers. And although it was great to have Antoine Bosselut

0:03:58 - 0:04:04     Text: for last Thursday's lecture, he's a postdoc at Stanford. So we don't count him as an outside guest

0:04:04 - 0:04:10     Text: speaker. And so nothing needs to be done for that one. So there are three classes for which you

0:04:10 - 0:04:18     Text: need to do it. So there was the one before from Danqi Chen, Colin Raffel, who is Thursday,

0:04:18 - 0:04:22     Text: and then towards the end of the course, there's Yulia Tsvetkov.

0:04:25 - 0:04:32     Text: Okay, so this is the plan today. So in the first part of it, I'm actually going to spend a bit of time

0:04:32 - 0:04:37     Text: talking about what co-reference is, what different kinds of reference in language there are.

0:04:38 - 0:04:42     Text: And then I'm going to move on and talk about some of the kind of methods that people have used

0:04:42 - 0:04:52     Text: for solving co-reference resolution. Now there's one bug in our course design, which is that a lot of years

0:04:52 - 0:04:58     Text: we've had a whole lecture on doing convolutional neural nets for language applications. And

0:04:58 - 0:05:09     Text: that slight bug appeared the other day when Danqi talked about the BiDAF model, because she sort of

0:05:09 - 0:05:15     Text: slipped in all this character CNN representation of words. And we haven't actually covered that.

0:05:15 - 0:05:21     Text: And so that was a slight oopsie. I mean, actually for applications in co-reference as well,

0:05:21 - 0:05:27     Text: people commonly make use of character-level convnets. So I wanted to sort of spend a few

0:05:27 - 0:05:32     Text: minutes sort of doing the basics of convnets for language. The sort of reality here is that

0:05:33 - 0:05:38     Text: given that there's no exam week this year to give people more time for final projects,

0:05:38 - 0:05:44     Text: we sort of shorten the content by a week this year and so you're getting a little bit less of

0:05:44 - 0:05:52     Text: that content. Then going on from there, I'll say some stuff about a state-of-the-art neural co-reference

0:05:52 - 0:05:58     Text: system, and right at the end talk about how co-reference is evaluated and what some of the results are.

0:05:58 - 0:06:05     Text: Yeah. So first of all, what is this co-reference resolution term that I've been talking about a lot?

0:06:05 - 0:06:15     Text: So co-reference resolution means finding all the mentions in a piece of text that refer

0:06:15 - 0:06:19     Text: to the same entity. And sorry, that's a typo. It should be in the world, not in the word.

0:06:19 - 0:06:27     Text: So let's make this concrete. So here's part of a short story by Shruthi Rao called The Star.

0:06:27 - 0:06:36     Text: Now I have to make a confession here because this is an NLP class, not a literature class. I

0:06:36 - 0:06:43     Text: crudely made some cuts to the story to be able to have relevant parts appear on my slide in

0:06:43 - 0:06:49     Text: a decent size font for illustrating co-reference. So it's not quite the full original text, but

0:06:49 - 0:06:58     Text: it basically is a piece of this story. So what we're doing in co-reference resolution is we're

0:06:58 - 0:07:06     Text: working out which people are mentioned. So here's a mention of a person, Vanaja, and here's a

0:07:06 - 0:07:13     Text: mention of another person, Akhila. And well, mentions don't have to be people. So the local park, that's

0:07:13 - 0:07:23     Text: also a mention. And then here's Akhila again, and Akhila's son. And then there's Prajwal. Then there's

0:07:24 - 0:07:34     Text: another son here and then her son and Akash. And they both went to the same school. And then

0:07:34 - 0:07:47     Text: there's a preschool play. And there's Prajwal again. And then there's a naughty child, Lord

0:07:47 - 0:07:55     Text: Krishna. And there's some that are a bit complicated. Like the lead role is that a mention. It's sort

0:07:55 - 0:08:03     Text: of more of a functional specification of something in the play. There's Akash and it's a tree.

0:08:03 - 0:08:09     Text: I won't go through the whole thing yet. But I mean, in general, there are noun phrases that are

0:08:09 - 0:08:17     Text: mentioning things in the world. And so then what we want to do for co-reference resolution is

0:08:17 - 0:08:25     Text: work out which of these mentions are talking about the same real world entity. So if we start off,

0:08:25 - 0:08:40     Text: so there's Vanaja. And so Vanaja is the same person as her there. And then we could read through.

0:08:41 - 0:08:53     Text: She resigned herself. So that's both Vanaja. She bought him a brown T-shirt and brown trousers.

0:08:53 - 0:09:04     Text: And then she made a large cut-out tree. She attached. So all of that's about Vanaja.

0:09:05 - 0:09:14     Text: But then we can have another person. So here's Akhila. And here's Akhila.

0:09:14 - 0:09:27     Text: Maybe those are the only mentions of Akhila. So then we can go on from there.

0:09:29 - 0:09:36     Text: Okay. And so then there's Prajwal.

0:09:36 - 0:09:49     Text: But note that Prajwal is also Akhila's son. So really Akhila's son is also Prajwal.

0:09:49 - 0:09:57     Text: And so an interesting thing here is that you can get nested syntactic structure so that we have

0:09:57 - 0:10:08     Text: the sort of noun phrases. So that just overall we have sort of this noun phrase, Akhila's son

0:10:08 - 0:10:17     Text: Prajwal, which consists of two noun phrases in apposition, Akhila's son and Prajwal. And then for the

0:10:17 - 0:10:24     Text: noun phrase, Akhila's son, it sort of breaks down to itself having an extra possessive noun

0:10:24 - 0:10:32     Text: phrase in it. And then a noun, so that you have Akhila's and then this is son. So that you have

0:10:32 - 0:10:41     Text: these multiple noun phrases. And so that you can then be sort of having different parts of this

0:10:43 - 0:10:49     Text: be one person in the co-reference. But this noun phrase here referring to a different person

0:10:49 - 0:11:03     Text: in the co-reference. Okay, so back to Prajwal. All right, so well there's some easy other

0:11:03 - 0:11:15     Text: Prajwals, right? So there's Prajwal here. And then you've got some more complicated things. So

0:11:15 - 0:11:25     Text: one of the complicated cases here is that we have they went to the same school. So that they there

0:11:25 - 0:11:39     Text: is what gets referred to as split antecedents. Because the they refers to both Prajwal and

0:11:39 - 0:11:47     Text: Akash. And that's an interesting phenomenon, and so I could try and show that somehow. I could

0:11:47 - 0:11:56     Text: put some slashes in or something. And if I get a different color for Akash, we have Akash and her son.

0:11:56 - 0:12:04     Text: And then this one sort of both of them at once. Right, so human languages have this phenomenon

0:12:04 - 0:12:12     Text: of split antecedents. But you know, one of the things that you should notice when we start talking

0:12:12 - 0:12:19     Text: about algorithms that people use for doing co-reference resolution is that they make some

0:12:19 - 0:12:27     Text: simplifying assumptions as to how they go about treating the problem. And one of the simplifications

0:12:27 - 0:12:36     Text: that most algorithms make is for any noun phrase like this pronoun they that's trying to work out

0:12:36 - 0:12:43     Text: what it is co-referent with, the answer is one thing. And so actually most NLP algorithms

0:12:43 - 0:12:50     Text: for co-reference resolution just cannot get split antecedents right. Any time it occurs in

0:12:50 - 0:12:56     Text: the text they guess something and they always get it wrong. So that's a sort of a bit of a sad state

0:12:56 - 0:13:04     Text: of affairs, but that's the truth of how it is. Okay, so then going ahead we have Akash here.

0:13:05 - 0:13:13     Text: And then we have another tricky one. So moving on from there, we then have this a tree.

0:13:13 - 0:13:28     Text: So well, in this context of this story Akash is going to be the tree. So you could feel that it was

0:13:28 - 0:13:37     Text: okay to say, well this tree is also Akash. You could also feel that that's a little bit weird

0:13:37 - 0:13:47     Text: and not want to do that. And I mean actually different people's co-reference datasets differ in

0:13:47 - 0:13:54     Text: this. So really that you know that we're predicating identity relationship here between Akash

0:13:54 - 0:13:59     Text: and the property of being a tree. So do we regard the tree as the same as Akash or not? And people

0:13:59 - 0:14:09     Text: make different decisions there. Okay, but then going ahead we have here's Akash and she bought him.

0:14:09 - 0:14:22     Text: So that's Akash. And then we have Akash here. And so then we go on. Okay, so then if we don't

0:14:22 - 0:14:37     Text: regard the tree as the same as Akash, we have a tree here. But then note that the next place over here

0:14:37 - 0:14:47     Text: where we have a mention of a tree, the best tree. But that's sort of really a functional description

0:14:47 - 0:14:54     Text: of you know of possible trees making someone the best tree. It's not really referential to a tree.

0:14:56 - 0:15:05     Text: And so it seems like that's not really co-referent. But if we go on, there's definitely more mention

0:15:05 - 0:15:14     Text: of a tree. So, when she has made the tree truly the nicest tree. Or, well, I'm not sure. Is that

0:15:14 - 0:15:20     Text: one co-referent? It is definitely referring to our tree. And maybe this one again is a sort of a

0:15:20 - 0:15:31     Text: functional description that isn't referring to the tree. Okay. And so

0:15:34 - 0:15:40     Text: maybe this one though where it's a tree is referring to the tree. But I hope to have illustrated

0:15:40 - 0:15:48     Text: from this is you know most of the time when we do co-reference in NLP, we just make it look sort of

0:15:50 - 0:15:59     Text: like the conceptual phenomenon is you know kind of obvious that there's a mention of Sarah

0:15:59 - 0:16:07     Text: and then it says she and you say oh they're co-referent. This is easy. But if you actually start

0:16:07 - 0:16:14     Text: looking at real text especially when you're looking at something like this that is a piece of literature,

0:16:14 - 0:16:20     Text: the kinds of phenomena you get for co-reference and overlapping reference and the various

0:16:20 - 0:16:27     Text: other phenomena that I'll talk about, you know, they actually get pretty complex, and

0:16:27 - 0:16:32     Text: you know there are a lot of hard cases that you actually have to think about as to what things you

0:16:32 - 0:16:42     Text: think about as co-referent or not. Okay, but basically we do want to be able to do something with

0:16:42 - 0:16:47     Text: co-reference because it's useful for a lot of things that we'd like to do in natural language

0:16:47 - 0:16:53     Text: processing. So for one task that we've already talked about question answering but equally for

0:16:53 - 0:17:00     Text: other tasks such as summarization information extraction, if you're doing something like reading

0:17:00 - 0:17:08     Text: through a piece of text and you've got a sentence like he was born in 1961. You really want to know

0:17:08 - 0:17:16     Text: who he refers to to know if this is a good answer to the question of you know when was Barack

0:17:16 - 0:17:25     Text: Obama born or something like that. It turns out also that it's useful in machine translation.

0:17:25 - 0:17:36     Text: So in most languages pronouns have features for gender and number and in quite a lot of languages

0:17:37 - 0:17:45     Text: nouns and adjectives also show features of gender, number and case. And so when you're translating

0:17:45 - 0:17:56     Text: a sentence you want to be aware of these features and what is co-referent with what to be able to

0:17:56 - 0:18:05     Text: get the translations correct. So you know if you want to be able to work out a translation

0:18:05 - 0:18:11     Text: and know whether it's saying Alicia likes Juan because he's smart or Alicia likes Juan because

0:18:11 - 0:18:17     Text: she's smart then you have to be sensitive to co-reference relationships to be able to choose

0:18:17 - 0:18:27     Text: the right translation. For people who build dialogue systems dialogue systems also have issues

0:18:27 - 0:18:35     Text: of co-reference a lot of the time. So you know, if you sort of say book tickets to see James Bond

0:18:35 - 0:18:41     Text: and the system replies Spectre is playing near you at two and three today. Well there's actually a

0:18:41 - 0:18:47     Text: co-reference relation. Oh sorry, there's a reference relation between Spectre and James Bond

0:18:47 - 0:18:52     Text: because Spectre is a James Bond film. I'll come back to that one in a minute. But then it's

0:18:52 - 0:18:59     Text: how many tickets would you like two tickets for the showing at three? That three is not just the

0:18:59 - 0:19:08     Text: number three. That three is then a co-reference relationship back to the 3PM showing that was

0:19:08 - 0:19:16     Text: mentioned by the agents in the dialogue system. So again to understand these we need to be understanding

0:19:16 - 0:19:25     Text: the co-reference relationships. So how now can you go about doing co-reference? So the standard

0:19:25 - 0:19:32     Text: traditional answer which I'll present first is co-reference is done in two steps. On the first

0:19:32 - 0:19:40     Text: step what we do is detect mentions in a piece of text and that's actually a pretty easy problem.

0:19:40 - 0:19:48     Text: And then in the second step we work out how to cluster the mentions. So as in my example from

0:19:48 - 0:19:54     Text: the Shruthi Rao text, basically what you're doing with co-reference is you're building up these

0:19:54 - 0:20:03     Text: clusters sets of mentions that refer to the same entity in the world. So if we explore a little

0:20:03 - 0:20:11     Text: how we could do that as a two step solution the first part was detecting the mentions. And so pretty

0:20:11 - 0:20:20     Text: much there are three kinds of things, different kinds of noun phrases that can be mentions.

0:20:20 - 0:20:27     Text: There are pronouns like I, your, it, she, him, and also some demonstrative pronouns like this and

0:20:27 - 0:20:34     Text: that and things like that. There are explicitly named things, so things like Paris, Joe Biden, Nike,

0:20:35 - 0:20:42     Text: and then there are plain noun phrases that describe things. So a dog, the big fluffy cat stuck in the

0:20:42 - 0:20:50     Text: tree. And so all of these are things that we'd like to identify as mentions. And the straightforward

0:20:50 - 0:20:58     Text: way to identify these mentions is to use natural language processing tools several of which we've

0:20:58 - 0:21:08     Text: talked about already. So to work out pronouns we can use what's called a part of speech tagger.

0:21:12 - 0:21:18     Text: That's something we haven't really explicitly talked about, but we used one

0:21:18 - 0:21:25     Text: when you built dependency parsers. So that first of all assigns parts of speech to each word and so

0:21:25 - 0:21:33     Text: that we can just find the words that are pronouns. For named entities we did talk just a little bit

0:21:33 - 0:21:39     Text: about named entity recognizers as a use of sequence models for neural networks. So we can pick out

0:21:39 - 0:21:45     Text: things like person names and company names. And then for the ones like

0:21:45 - 0:21:54     Text: a big fluffy dog we could then be sort of picking out from syntactic structure noun phrases

0:21:54 - 0:22:00     Text: and regarding them as descriptions of things. So that we could use all of these tools and those

0:22:00 - 0:22:06     Text: would give us basically our mentions. It's a little bit more subtle than that because it turns out

0:22:06 - 0:22:16     Text: there are some noun phrases and things of all of those kinds which don't actually refer so that

0:22:16 - 0:22:22     Text: they're not referential in the world. So when you say it is sunny it doesn't really refer. When you

0:22:22 - 0:22:30     Text: make universal claims like every student well every student isn't referring to something you can

0:22:30 - 0:22:35     Text: point to in the world. And more dramatically when you have no student and making a negative universal

0:22:35 - 0:22:43     Text: claim it's not referential to anything. There are also things that you can describe functionally

0:22:43 - 0:22:51     Text: which don't have any clear reference. So if I say the best doughnut in the world that that's

0:22:51 - 0:22:57     Text: a functional claim but it doesn't necessarily have reference. Like if I've established

0:22:58 - 0:23:04     Text: that I think a particular kind of doughnut is the best doughnut in the world I could then say to you

0:23:04 - 0:23:12     Text: I ate the best doughnut in the world yesterday, and you know what I mean; it might have reference.

0:23:12 - 0:23:17     Text: But if I say something like I'm going around to all the doughnut stores trying to find the best

0:23:17 - 0:23:22     Text: doughnut in the world then it doesn't have any reference yet it's just a sort of a functional

0:23:22 - 0:23:28     Text: description I'm trying to satisfy. You also then have things like quantities, 100 miles

0:23:28 - 0:23:34     Text: it's a quantity that's not really something that has any particular reference. You can mark out 100

0:23:34 - 0:23:43     Text: miles in all sorts of places. So how do we deal with those things that aren't really mentions?

0:23:43 - 0:23:51     Text: Well one way is we could train a machine learning classifier to get rid of those spurious mentions,

0:23:51 - 0:23:59     Text: but actually mostly people don't do that. Most commonly, if you're using this kind of pipeline model

0:23:59 - 0:24:07     Text: where you use a parser and a named entity recognizer, you regard everything you've found as a candidate

0:24:07 - 0:24:13     Text: mention, and then you try and run your co-ref system, and some of them, like those ones, hopefully aren't

0:24:13 - 0:24:21     Text: co-referent with anything else, and so then you just discard them at the end of the process.
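Just to make that pipeline idea concrete, here is a minimal sketch of candidate mention detection. It assumes spaCy and its small English model rather than any system from this course; the idea is simply to collect pronouns from the tagger, names from the NER component, and noun phrases from the parser, and to keep everything as a candidate.

```python
# A minimal sketch of pipeline-style candidate mention detection (assumes spaCy's
# en_core_web_sm model; illustration only, not the systems discussed in this lecture).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Vanaja resigned herself. She bought him a brown T-shirt and brown trousers.")

candidates = []
candidates += [tok for tok in doc if tok.pos_ == "PRON"]   # pronouns via POS tags
candidates += list(doc.ents)                               # named entities via NER
candidates += list(doc.noun_chunks)                        # noun phrases via the parser

# Keep everything as a candidate mention; non-referential ones (dummy "it",
# quantities, ...) are expected to stay unlinked and be discarded after coreference.
print([c.text for c in candidates])
```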

0:24:21 - 0:24:25     Text: Yeah, I've got an interesting question to draw on your linguist's

0:24:25 - 0:24:33     Text: experience on this. A student asks: can we say that "it is sunny" has the "it" referring to the weather?

0:24:33 - 0:24:46     Text: I think so. That's a fair question. People have actually tried to suggest that when you say it is sunny

0:24:46 - 0:24:57     Text: it means the weather is sunny but I guess the majority opinion at least is that isn't plausible.

0:24:57 - 0:25:07     Text: I guess many of you aren't native speakers of English but similar phenomena occur in many other

0:25:07 - 0:25:16     Text: languages. I mean it just intuitively doesn't seem plausible when you say it's sunny or it's raining

0:25:16 - 0:25:24     Text: today that you're really saying that as a shortcut for the weather is raining today it just seems

0:25:24 - 0:25:30     Text: like really what the case is is English likes to have something filling the subject position

0:25:30 - 0:25:37     Text: and when there's nothing better to fill the subject position you stick it in there and get

0:25:37 - 0:25:43     Text: it's raining and so in general it's believed that you get this phenomenon of having these

0:25:43 - 0:25:49     Text: empty dummy it's that appear in various places. I mean another place in which it seems like you

0:25:49 - 0:25:56     Text: clearly get dummy it's is that when you have clauses that are subjects of a verb you can move

0:25:56 - 0:26:02     Text: them to the end of the sentence. So if you have a sentence where you put a clause in the subject

0:26:02 - 0:26:10     Text: position they normally in English sound fairly awkward so it's you have a sentence something like

0:26:11 - 0:26:19     Text: That CS224N is a lot of work is known by all students. People don't normally say that; the normal

0:26:19 - 0:26:23     Text: thing to do is to shift the clause to the end of the sentence but when you do that you stick in

0:26:23 - 0:26:31     Text: the dummy it to fill the subject position so you then have it is known by all students that CS224N

0:26:31 - 0:26:38     Text: is a lot of work. So that's the general feeling that this is a dummy it that doesn't have any reference.

0:26:42 - 0:26:48     Text: Okay, there's one more question: so what if someone says it is sunny when, like, we ask

0:26:48 - 0:26:55     Text: how is the weather? Okay, good point, you've got me on that one, right. So someone says how is the

0:26:55 - 0:27:02     Text: weather, and you answer it is sunny; it then does seem like the it is in reference to the weather.

0:27:02 - 0:27:09     Text: Oh by that well you know I guess this is what our co-reference systems are built trying to do

0:27:09 - 0:27:15     Text: in situations like that they're making a decision of co-reference or not and I guess what you'd

0:27:15 - 0:27:20     Text: want to say in that case is it seems reasonable to regard this one as co-referent with that weather

0:27:20 - 0:27:28     Text: that did appear before it. I mean but that also indicates another reason to think that in the normal

0:27:28 - 0:27:34     Text: case it's not co-reference, right, because normally pronouns are only used when their reference is established,

0:27:34 - 0:27:43     Text: when you've referred to something already. Like, John is answering questions, and then you can say he types really

0:27:43 - 0:27:50     Text: quickly and it seemed odd to just sort of start the conversation by he types really quickly because

0:27:50 - 0:27:55     Text: it doesn't have any established reference whereas that doesn't seem to be the case it seems like

0:27:55 - 0:28:01     Text: you can just sort of start a conversation by saying it's raining really hard today and that

0:28:01 - 0:28:14     Text: doesn't sound odd at all. Okay so I've sort of there presented the traditional picture but you know

0:28:14 - 0:28:19     Text: this traditional picture doesn't mean something that was done last millennium before you were born

0:28:19 - 0:28:30     Text: I mean essentially that was the picture until about 2016 that essentially every co-reference

0:28:30 - 0:28:37     Text: system that was built used tools like part-of-speech taggers, NER systems, and parsers to

0:28:37 - 0:28:43     Text: analyze sentences to identify mentions and to give you features for co-reference resolution and

0:28:43 - 0:28:52     Text: I'll show a bit more about that later but more recently in our neural systems people have moved

0:28:52 - 0:29:01     Text: to avoiding traditional pipeline systems and to doing one-shot, end-to-end co-reference resolution

0:29:01 - 0:29:09     Text: systems so if I skip directly to the second bullet there's a new generation of neural systems where

0:29:09 - 0:29:16     Text: you just start with your sequence of words and you do the maximally dumb thing: you just say, let's

0:29:16 - 0:29:24     Text: take all spans commonly with some heuristics for efficiency but you know conceptually all subsequences

0:29:24 - 0:29:31     Text: of this sentence they might be mentions let's feed them in to a neural network which will

0:29:31 - 0:29:37     Text: simultaneously do mention detection and co-reference resolution end to end in one model and I'll

0:29:37 - 0:29:46     Text: give an example of that kind of system later in the lecture.
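To make the "all spans" idea concrete, here is a tiny illustrative sketch, my own, with an assumed maximum span width for efficiency, of enumerating every contiguous subsequence as a candidate mention that the end-to-end model would then score.

```python
# Illustration only: enumerate all spans up to max_width tokens as candidate mentions,
# which an end-to-end neural coreference model scores jointly for mention detection
# and coreference.
def candidate_spans(tokens, max_width=10):
    spans = []
    for start in range(len(tokens)):
        for end in range(start, min(start + max_width, len(tokens))):
            spans.append((start, end))   # inclusive token indices
    return spans

print(candidate_spans(["She", "poured", "water", "from", "the", "pitcher"], max_width=3))
```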

0:29:46 - 0:29:58     Text: Okay, is everything good to there? Then I should go on. Okay, so I'm going to get on to how to do co-reference resolution systems, but before

0:29:58 - 0:30:05     Text: I do that I do actually want to show a little bit more the linguistics of co-reference because

0:30:05 - 0:30:12     Text: there's actually a few more interesting things to understand and know here I mean when we say

0:30:12 - 0:30:21     Text: co-reference resolution we really confuse together two linguistic things which are overlapping

0:30:21 - 0:30:27     Text: but different and so it's really actually good to understand the difference between these things

0:30:27 - 0:30:34     Text: so there are two things that can happen one is that you can have mentions which are essentially

0:30:34 - 0:30:43     Text: standalone but happen to refer to the same entity in the world so if I have a piece of text that

0:30:43 - 0:30:52     Text: said Barack Obama traveled yesterday to Nebraska Obama was there to open a new meat processing

0:30:52 - 0:30:59     Text: plant or something like that I've mentioned with Barack Obama and Obama there are two mentions there

0:30:59 - 0:31:05     Text: they refer to the same person in the world they are co-referent so that is true co-reference but there's

0:31:05 - 0:31:11     Text: a different but related linguistic concept called anaphora, and anaphora is when you have a

0:31:11 - 0:31:19     Text: textual dependence of an anaphor on another term, which is the antecedent, and in this case the

0:31:19 - 0:31:27     Text: meaning of the anaphor is determined by the antecedent in a textual context, and the canonical case

0:31:27 - 0:31:35     Text: of this is pronouns. So when it's Barack Obama said he would sign the bill, he is an anaphor; it's not a

0:31:35 - 0:31:41     Text: word where independently we can work out what its meaning is in the world, apart from knowing the

0:31:41 - 0:31:50     Text: vague feature that it's referring to something probably male. But in the context of this text

0:31:50 - 0:31:58     Text: we have that this anaphor is textually dependent on Barack Obama, and so then we have an anaphoric

0:31:58 - 0:32:05     Text: relationship which sort of means they refer to the same thing in the world and so therefore you

0:32:05 - 0:32:13     Text: can say they're co-referent so the picture we have is like this right so for co-reference we have

0:32:13 - 0:32:19     Text: these separate textual mentions which are basically standalone which refer to the same thing in

0:32:19 - 0:32:27     Text: the world, whereas in anaphora we actually have a textual relationship, and you know you essentially

0:32:27 - 0:32:35     Text: have to use pronouns like he and she in legitimate ways in which the hearer can reconstruct a

0:32:35 - 0:32:42     Text: relationship from the text because they can't work out what he refers to if that's not there

0:32:42 - 0:32:53     Text: and so that's a fair bit of the distinction but it's actually a little bit more to realize because

0:32:53 - 0:33:01     Text: there are more complex forms of anaphora which aren't co-reference, because you have a textual

0:33:01 - 0:33:09     Text: dependence but it's not actually one of reference and so this comes back to things like these

0:33:09 - 0:33:17     Text: quantifying noun phrases that don't have reference so when you have sentences like these ones

0:33:17 - 0:33:26     Text: every dancer twisted her knee, well, this her here has an anaphoric dependency on every dancer,

0:33:26 - 0:33:33     Text: or even more clearly, with no dancer twisted her knee, the her here has an anaphoric dependence on

0:33:33 - 0:33:42     Text: no dancer but for no dancer twisted her knee no dancer isn't referential it's not referring to

0:33:42 - 0:33:48     Text: anything in the world, and so there's no co-reference relationship because there's no reference

0:33:48 - 0:33:54     Text: relationship, but there's still an anaphoric relationship between these two noun phrases,

0:33:54 - 0:34:04     Text: and then you have this other complex case that turns up quite a bit where you can have where the

0:34:04 - 0:34:11     Text: things being talked about do have reference, but the anaphoric relationship is more subtle than

0:34:11 - 0:34:20     Text: identity so you commonly get the constructions like this one we went to a concert last night the

0:34:20 - 0:34:28     Text: tickets were really expensive well the concert and the tickets are two different things they're not

0:34:29 - 0:34:37     Text: co-referential, but in interpreting this sentence, what this really means is the tickets

0:34:37 - 0:34:45     Text: are the tickets to the concert, right, and so there's sort of this hidden, not-said

0:34:45 - 0:34:52     Text: dependence where this is referring back to the concert and so what we say is that these the tickets

0:34:53 - 0:34:59     Text: does have an anaphoric dependence on the concert, but they're not co-referential, and so that's

0:34:59 - 0:35:07     Text: referred to as bridging anaphora. And so overall there's the simple case and the common case, which

0:35:07 - 0:35:16     Text: is pronominal anaphora where it's both co-reference and anaphora you then have other cases of

0:35:16 - 0:35:23     Text: co-reference, such as: every time you see a mention of the United States,

0:35:23 - 0:35:28     Text: it's co-referential with every other mention of the United States but those don't have any

0:35:28 - 0:35:34     Text: textual dependence on each other and then you have textual dependencies like bridging anaphora

0:35:34 - 0:35:41     Text: which aren't co-reference. Phew. Now, I was going to say that's probably

0:35:41 - 0:35:46     Text: as much linguistics as you wanted to hear, but actually I have one more point of linguistics.

0:35:49 - 0:35:57     Text: One or two of you, but probably not many, might have been troubled by the fact that the term

0:35:57 - 0:36:05     Text: anaphora as a classical term means that you're looking backward for your antecedent

0:36:06 - 0:36:12     Text: that the ana part of anaphora means that you're looking backward for your antecedent, and in

0:36:13 - 0:36:21     Text: sort of classical terminology you have both anaphora and cataphora and it's cataphora

0:36:21 - 0:36:27     Text: where you look forward for your antecedent. Cataphora isn't that common but it does occur

0:36:28 - 0:36:34     Text: Here's a beautiful example of cataphora. So this is from Oscar Wilde: From the corner of the

0:36:34 - 0:36:41     Text: divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes,

0:36:41 - 0:36:47     Text: Lord Henry Wotton could just catch the gleam of the honey-sweet

0:36:47 - 0:36:56     Text: and honey-coloured blossoms of a laburnum. Okay, so in this example here, right, the he and then this

0:36:56 - 0:37:06     Text: his are actually referring to Lord Henry Wotton, and so these are both examples of cataphora.

0:37:06 - 0:37:18     Text: But in modern linguistics, even though most reference of pronouns is backwards, we don't

0:37:19 - 0:37:26     Text: distinguish in terms of order, and so the terms anaphor and anaphora are used for

0:37:26 - 0:37:33     Text: a textual dependence regardless of whether it's forward or backwards. Okay a lot of details there

0:37:33 - 0:37:44     Text: but taking stock of this, the basic observation is that language is interpreted in context, that in general

0:37:44 - 0:37:51     Text: you can't work out the meaning or reference of things without looking at the context of the

0:37:51 - 0:37:58     Text: linguistic utterance. So we've seen some simple examples before, so for something like word

0:37:58 - 0:38:05     Text: sense disambiguation, if you see just the word bank, you don't know what it means

0:38:05 - 0:38:10     Text: and you need to look at the context to get some sense as to whether it means a financial institution

0:38:10 - 0:38:18     Text: or the bank of a river or something like that and so anaphora and co-reference give us additional

0:38:18 - 0:38:25     Text: examples where you need to be doing contextual interpretation of language so when you see a pronoun

0:38:25 - 0:38:33     Text: you need to be looking at the context to see what it refers to and so if you think about text

0:38:33 - 0:38:40     Text: understanding as a human being does it reading a story or an article that we progress through the

0:38:40 - 0:38:47     Text: article from beginning to end and as we do it we build up a pretty complex discourse model in which

0:38:47 - 0:38:54     Text: new entities are introduced by mentions and then they're referred back to and relationships

0:38:54 - 0:38:59     Text: between them are established and they take actions and things like that and it sort of seems like

0:38:59 - 0:39:05     Text: in our head that we sort of build up a kind of a complex graph-like discourse representation

0:39:05 - 0:39:11     Text: of a piece of text with all these relationships and so part of that is these anaphora

0:39:11 - 0:39:17     Text: relationships and co-reference that we're talking about here and indeed in terms of CS224N

0:39:17 - 0:39:25     Text: the only kind of whole discourse meaning that we're going to look at is looking a bit at anaphora

0:39:25 - 0:39:31     Text: and co-reference but if you want to see more about higher level natural language understanding

0:39:31 - 0:39:41     Text: you can get more of this next quarter in CS224U so I want to tell you a bit about several different

0:39:41 - 0:39:49     Text: ways of doing co-reference so broadly there are four different kinds of co-reference models

0:39:50 - 0:39:58     Text: so the traditional old way of doing it was rule-based systems and this isn't the topic of this class

0:39:58 - 0:40:05     Text: and this is pretty archaic at this point this is stuff from last millennium but I wanted to say

0:40:05 - 0:40:11     Text: a little bit about it because it's actually kind of interesting as sort of food for thought as to

0:40:11 - 0:40:18     Text: how far along we are or aren't in solving, you know, artificial intelligence and really being able

0:40:18 - 0:40:24     Text: to understand texts then there are sort of classic machine learning methods of doing it

0:40:24 - 0:40:29     Text: which you can sort of divide up as mention pair methods mention ranking methods and really

0:40:29 - 0:40:35     Text: clustering methods and I'm sort of going to skip the clustering methods today because most of

0:40:35 - 0:40:41     Text: the work, especially most of the recent work, implicitly makes clusters by using either mention pair

0:40:41 - 0:40:47     Text: or mention ranking methods and so I'm going to talk about a couple of neural methods for doing that

0:40:49 - 0:40:55     Text: okay but first of all let me just tell you a little bit about rule-based co-reference so there's

0:40:55 - 0:41:06     Text: a famous historical algorithm in NLP for doing pronoun anaphora resolution, which is referred

0:41:06 - 0:41:12     Text: to as the Hobbs algorithm so everyone just refers to it as the Hobbs algorithm and if you sort of

0:41:12 - 0:41:19     Text: look up a textbook like Jurafsky and Martin's textbook, it's referred to as the Hobbs algorithm.

0:41:19 - 0:41:24     Text: but you know actually if you go back to Jerry Hobbs that's his picture over there in the corner

0:41:24 - 0:41:33     Text: if you actually go back to his original paper he refers to it as the naive algorithm and then his

0:41:33 - 0:41:41     Text: naive algorithm for pronoun co-reference was this sort of intricate handwritten set of rules

0:41:41 - 0:41:47     Text: to work out co-reference so this is the start of the set of the rules but there are more rules

0:41:47 - 0:41:55     Text: or more clauses of these rules for working out co-reference and you know this looks like a hot mess

0:41:56 - 0:42:03     Text: but the funny thing was that this set of rules for determining co-reference were actually pretty good

0:42:03 - 0:42:12     Text: and so in the sort of 1990s and 2000s decade even when people were using machine learning

0:42:12 - 0:42:18     Text: based systems for doing co-reference, they had hidden inside those machine learning based systems

0:42:18 - 0:42:24     Text: the fact that one of their features was the Hobbs algorithm, and that the predictions it made, with a certain

0:42:24 - 0:42:31     Text: weight, were then a feature in making your final decisions. And it's only really in the last five

0:42:31 - 0:42:37     Text: years that people have moved away from using the Hobbs algorithm let me give you a little bit of a

0:42:37 - 0:42:45     Text: sense of how it works okay so the Hobbs algorithm here's our example this is an example from a

0:42:45 - 0:42:51     Text: Guardian book review: Niall Ferguson is prolific, well-paid and a snappy dresser. Stephen Moss hated

0:42:51 - 0:42:56     Text: him. Okay, so what the Hobbs algorithm does is, oops,

0:42:56 - 0:43:08     Text: we start with a pronoun and then it says step one go to the NP that's immediately dominating the

0:43:08 - 0:43:19     Text: pronoun and then it says go up to the first NP or S call this X and the path P then it says

0:43:19 - 0:43:27     Text: traverse all branches below X to the left of P, left to right, breadth first. So then it's saying to go

0:43:27 - 0:43:35     Text: left to right for other branches below, breadth first, so that's sort of working down the tree. So we're

0:43:35 - 0:43:46     Text: going down and left to right and look for an NP okay and here's an NP but then we have to read

0:43:46 - 0:43:56     Text: more carefully and say propose as antecedent any NP that has an NP or S between it and X well

0:43:56 - 0:44:07     Text: this NP here has no NP or S between NP and X so this isn't a possible antecedent so this is

0:44:07 - 0:44:15     Text: all very you know complex and handwritten but basically he's sort of fit into the clauses of this

0:44:15 - 0:44:22     Text: kind a lot of facts about how the grammar of English works. And so what this is capturing is

0:44:22 - 0:44:28     Text: if you imagine a different sentence, you know, if you imagine the sentence Stephen Moss's

0:44:30 - 0:44:41     Text: brother hated him, well, then Stephen Moss would naturally be co-referent with him. And in that case

0:44:41 - 0:44:53     Text: well precisely what you'd have is the noun phrase with well the noun brother and you'd have another

0:44:53 - 0:45:03     Text: noun phrase inside it for the Stephen Moss, and then that would go up to the sentence. So in the case

0:45:03 - 0:45:11     Text: of Stephen Moss's brother, when you looked at this noun phrase, there would be an intervening noun phrase

0:45:11 - 0:45:21     Text: before you got to the node X, and therefore Stephen Moss is a possible, and in fact good, antecedent

0:45:21 - 0:45:29     Text: of him, and the algorithm would choose Stephen Moss. But the algorithm correctly captures that when

0:45:29 - 0:45:36     Text: you have the sentence Stephen Moss hated him, that him cannot refer to Stephen Moss. Okay, so having

0:45:36 - 0:45:44     Text: worked that out it then says if X is the highest S in the sentence okay so my X here is definitely

0:45:44 - 0:45:50     Text: the highest S in the sentence because I've got the whole sentence what you should do is then

0:45:50 - 0:45:58     Text: traverse the parse trees of previous sentences in the order of recency. So what I should now do

0:45:58 - 0:46:06     Text: is sort of work backwards in the text one sentence at a time going backwards looking for an antecedent

0:46:07 - 0:46:15     Text: and then for each tree, traverse each tree left to right, breadth first. So then within each tree

0:46:15 - 0:46:22     Text: I'm doing the same thing of going breadth first, so sort of working down, and then going left to right

0:46:22 - 0:46:30     Text: at an equal breadth. And so hidden inside these clauses it's capturing a lot of the facts of how

0:46:30 - 0:46:39     Text: co-reference typically works. So what you find in English, I'll say, but in general this is true

0:46:39 - 0:46:46     Text: of lots of languages is that there are general preferences and tendencies for co-reference

0:46:46 - 0:46:53     Text: so a lot of the time a pronoun will be co-referent with something in the same sentence, like Stephen

0:46:53 - 0:46:59     Text: Moss's brother hated him, but it can't be if it's too close to it, so you can't say Stephen Moss hated

0:46:59 - 0:47:05     Text: him and have the him be Stephen Moss. And if you're then looking for co-reference further away,

0:47:06 - 0:47:12     Text: the thing it's co-referent with is normally close by and so that's why you work backwards through

0:47:12 - 0:47:20     Text: sentences one by one but then once you're looking within a particular sentence the most likely

0:47:20 - 0:47:27     Text: thing that it's going to be co-referent to is a topical noun phrase, and default topics in English are

0:47:27 - 0:47:36     Text: subjects. So by doing things breadth first, left to right, a preferred antecedent is then a subject.
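To summarize the search order just described, here is a rough pseudocode sketch. This is my own paraphrase of the flavor of the procedure, not Hobbs' exact nine steps, and the helper functions in it are hypothetical.

```python
# Rough sketch of a Hobbs-style naive search (helper functions are hypothetical placeholders).
def naive_pronoun_resolution(pronoun, current_tree, previous_trees):
    # Search the current sentence first, then previous sentences in order of recency.
    for tree in [current_tree] + previous_trees:
        # Breadth-first, left-to-right: higher and earlier NPs (e.g. subjects) are tried first.
        for np in breadth_first_left_to_right_NPs(tree):
            if tree is current_tree and blocked_by_syntax(np, pronoun, tree):
                continue   # e.g. "Stephen Moss hated him": him cannot be Stephen Moss
            if agrees(np, pronoun):    # gender / number agreement
                return np              # propose the first acceptable NP as the antecedent
    return None
```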

0:47:36 - 0:47:43     Text: And so this algorithm, I won't go through all the complex clauses 5 to 9, ends up saying, okay, what you

0:47:43 - 0:47:51     Text: should do is propose Niall Ferguson as what is co-referent to him, which is the obvious correct reading

0:47:51 - 0:47:58     Text: in this example okay you probably didn't want to know that and in some sense the details of that

0:47:58 - 0:48:08     Text: aren't interesting, but what is, I think, actually still interesting in 2021 is what point Jerry Hobbs

0:48:08 - 0:48:19     Text: was actually trying to make last millennium and the point he was trying to make was the following

0:48:19 - 0:48:29     Text: So Jerry Hobbs wrote this algorithm, the naive algorithm, because what he said was, well, look, if you

0:48:29 - 0:48:37     Text: want to try and crudely determine co-reference well there are these various preferences right there's

0:48:37 - 0:48:43     Text: the preference for same sentence there's the preference for recency there's a preference for

0:48:43 - 0:48:48     Text: topical things like subject and there are things where you know if it has gender it has to agree

0:48:48 - 0:48:56     Text: in gender so there are sort of strong constraints of that sort so I can write an algorithm using my

0:48:56 - 0:49:05     Text: linguistic nous which captures all the main preferences, and actually it works pretty well;

0:49:05 - 0:49:15     Text: doing that is a pretty strong baseline system. But what Jerry Hobbs wanted to argue is that this

0:49:15 - 0:49:23     Text: algorithm just isn't something you should believe in this isn't a solution to the problem this is

0:49:23 - 0:49:32     Text: just sort of you know making a best guess according to the preferences of what's most likely

0:49:32 - 0:49:40     Text: without actually understanding what's going on in the text at all. And so actually what Jerry Hobbs

0:49:40 - 0:49:46     Text: wanted to argue with the so-called Hobbs algorithm, now, he wasn't a fan of the Hobbs algorithm, he

0:49:46 - 0:49:52     Text: was wanting to argue that the Hobbs algorithm is completely inadequate as a solution to the problem,

0:49:52 - 0:49:58     Text: and the only way we'll actually make progress in natural language understanding is by building systems

0:49:58 - 0:50:06     Text: that actually really understand the text and this is actually something that has come to the fore

0:50:06 - 0:50:16     Text: again more recently so the suggestion is that in general you can't work out co-reference or

0:50:16 - 0:50:21     Text: pronominal anaphora in particular, unless you're really understanding the meaning of the text,

0:50:21 - 0:50:28     Text: and people look at pairs of examples like these ones: so she poured water from the pitcher into the cup

0:50:28 - 0:50:37     Text: until it was full so think for just half a moment well what is it in that example that is full

0:50:38 - 0:50:46     Text: so that what's full there is the cup. But then if I say she poured water from the pitcher into the

0:50:46 - 0:50:53     Text: cup until it was empty, well, what's empty? Well, that's the pitcher. And the point that

0:50:53 - 0:51:01     Text: is being made with this example is the only thing that's been changed in these examples is

0:51:02 - 0:51:11     Text: the adjective right here so these two examples have exactly the same grammatical structure so in

0:51:11 - 0:51:20     Text: terms of the Hobbs naive algorithm, the Hobbs naive algorithm necessarily has to predict the same

0:51:20 - 0:51:27     Text: answer for both of these but that's wrong you just cannot determine the correct pronoun

0:51:27 - 0:51:34     Text: antecedent based on grammatical preferences of the kind that are used in the naive algorithm

0:51:34 - 0:51:40     Text: you actually have to conceptually understand about pitchers and cups and water and full and empty

0:51:40 - 0:51:49     Text: to be able to choose the right antecedent here's another famous example that goes along the same lines

0:51:49 - 0:51:57     Text: So Terry Winograd, shown here as a young man, so long, long ago Terry Winograd came to Stanford as

0:51:57 - 0:52:06     Text: the natural language processing faculty, and Terry Winograd became disillusioned with the symbolic AI

0:52:06 - 0:52:14     Text: of those days and just gave it up altogether and he reinvented himself as being an HCI person and

0:52:14 - 0:52:21     Text: so Terry was then essentially the person who established the HCI program at Stanford but before

0:52:21 - 0:52:29     Text: he lost faith in symbolic AI he talked about the co-reference problem and pointed out a

0:52:29 - 0:52:36     Text: similar pair of examples here so we have the city council refused the women a permit because they

0:52:36 - 0:52:43     Text: feared violence versus the city council refused the women a permit because they advocated violence

0:52:43 - 0:52:50     Text: so again you have this situation where these two sentences have identical syntactic structure

0:52:50 - 0:52:58     Text: and they differ only in the choice of verb here but once you add knowledge common sense knowledge

0:52:58 - 0:53:07     Text: of how the human world works, well, how this should pretty obviously be interpreted in the

0:53:07 - 0:53:14     Text: first one that they is referring to the city council whereas in the second one that they

0:53:14 - 0:53:21     Text: is referring to the women and so coming off of that example of Terry these have been

0:53:23 - 0:53:30     Text: referred to as Winograd schemas, so the Winograd schema challenge is sort of choosing the right

0:53:30 - 0:53:37     Text: reference here, and so it's basically just doing pronominal anaphora. But you know, the interesting

0:53:37 - 0:53:44     Text: thing is, people have been interested in, you know, what is a test of general intelligence, and one famous

0:53:44 - 0:53:49     Text: general test of intelligence, which I won't talk about now, is the Turing test, and there's been a lot of

0:53:49 - 0:53:55     Text: debate about problems with the Turing test and is it good and so in particular Hector Levesque is a

0:53:56 - 0:54:03     Text: very well-known senior AI person he actually proposed that a better alternative to the Turing test

0:54:03 - 0:54:10     Text: might be to do what he then dubbed Winograd schemas, and Winograd schemas are just solving

0:54:10 - 0:54:15     Text: pronominal co-reference in cases like this where you have to have knowledge about the situation

0:54:15 - 0:54:22     Text: in the world to get the answer right. And so he's basically arguing that, you know, you can view

0:54:22 - 0:54:30     Text: really solving co-reference as solving artificial intelligence and that's sort of what the position

0:54:30 - 0:54:36     Text: that Hobbs wanted to advocate. So what he actually said about his algorithm was that the naive

0:54:36 - 0:54:42     Text: approach is quite good computationally speaking it will be a long time before a semantically-based

0:54:42 - 0:54:48     Text: algorithm is sophisticated enough to perform as well and these results set a very high standard

0:54:48 - 0:54:53     Text: for any other approach to aim for. And he was proven right about that, because it sort of

0:54:53 - 0:54:58     Text: really took until around 2015 before people thought they could do without the Hobbs algorithm.

0:54:58 - 0:55:04     Text: but then he notes yet there is every reason to pursue a semantically-based approach the naive

0:55:04 - 0:55:11     Text: algorithm does not work anyone can think of examples where it fails in these cases it not only fails

0:55:11 - 0:55:17     Text: it gives no indication that it has failed and offers no help in finding the real antecedent

0:55:18 - 0:55:23     Text: and so I think this is actually still interesting stuff to think about because you know really for the

0:55:23 - 0:55:30     Text: kind of machine learning-based co-reference systems that we're building you know they're not a

0:55:30 - 0:55:37     Text: hot mess of rules like the Hobbs algorithm, but basically they're still sort of working out

0:55:37 - 0:55:45     Text: statistical preferences of what patterns are most likely and choosing the antecedent that way

0:55:46 - 0:55:53     Text: they really have exactly the same deficiencies still that Hobbs was talking about, right,

0:55:53 - 0:56:01     Text: that they fail in various cases it's easy to find places where they fail the algorithms give you

0:56:01 - 0:56:07     Text: no idea when they fail they're not really understanding the text in a way that a human does to

0:56:07 - 0:56:13     Text: determine the antecedent so we still actually have a lot more work to do before we're really doing

0:56:13 - 0:56:20     Text: full artificial intelligence but I best get on now and actually tell you a bit about some

0:56:20 - 0:56:28     Text: co-reference algorithms right so the simple way of thinking about co-reference is to say

0:56:28 - 0:56:36     Text: that you're making just a binary decision about a reference pair so if you have your mentions

0:56:38 - 0:56:46     Text: you can then say well I've come to my next mention she I want to work out what it's co-referent

0:56:46 - 0:56:53     Text: with and I can just look at all of the mentions that came before it and say is it co-referent or not

0:56:53 - 0:57:00     Text: and do a binary decision so at training time I'll be able to say I have positive examples assuming

0:57:00 - 0:57:06     Text: I've got some data labeled for what's co-referent of what as to these ones are co-referent and I've

0:57:06 - 0:57:13     Text: got some negative examples of these ones are not co-referent and what I want to do is build a model

0:57:13 - 0:57:19     Text: that learns to predict co-referent things and I can do that fairly straightforwardly in the

0:57:19 - 0:57:27     Text: kind of ways that we have talked about so I train with the regular kind of cross entropy loss

0:57:27 - 0:57:34     Text: where I'm now summing over every pairwise binary decision as to whether two mentions

0:57:34 - 0:57:42     Text: are co-referent to each other or not and so then when I'm at test time what I want to do is

0:57:42 - 0:57:48     Text: cluster the mentions that correspond to the same entity, and I do that by making use of my

0:57:48 - 0:57:58     Text: pairwise score so I can run my pairwise score and it will give a probability or a score that any

0:57:58 - 0:58:05     Text: two mentions are co-referent so by picking some threshold like 0.5 I can add co-reference links

0:58:05 - 0:58:13     Text: for when the classifier says it's above the threshold and then I do one more step to give me a

0:58:13 - 0:58:21     Text: clustering. I then say, okay, let's also make the transitive closure to give me clusters. So if it

0:58:21 - 0:58:28     Text: thought that "I" and "she" were co-referent, and "my" and "she" were co-referent, therefore I also have to

0:58:28 - 0:58:37     Text: regard "I" and "my" as co-referent, and so that's the completion by transitivity. And so since

0:58:37 - 0:58:45     Text: we always complete by transitivity note that this algorithm is very sensitive to making any mistake

0:58:45 - 0:58:52     Text: in a positive sense, because if you make one mistake, for example you say that he and my are co-referent,

0:58:52 - 0:59:00     Text: then by transitivity all of the mentions in the sentence become one big cluster and

0:59:00 - 0:59:07     Text: they're all co-referent with each other. So that's a workable algorithm and people have often used it.
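To make that concrete, here is a minimal Python sketch of the test-time step just described: threshold the pairwise scores and take the transitive closure (via union-find) to form clusters. The `pair_prob` function is an assumed placeholder standing in for whatever trained pairwise classifier you have; this is an illustration, not the implementation of any particular system.

```python
# Minimal sketch: turn pairwise coreference probabilities into clusters
# by thresholding and taking the transitive closure (via union-find).
# `pair_prob(i, j)` is a placeholder for a trained pairwise model.

def cluster_mentions(mentions, pair_prob, threshold=0.5):
    parent = list(range(len(mentions)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Add a coreference link whenever the classifier is above threshold;
    # union-find makes the transitive closure automatic.
    for j in range(len(mentions)):
        for i in range(j):
            if pair_prob(i, j) > threshold:
                union(i, j)

    clusters = {}
    for idx in range(len(mentions)):
        clusters.setdefault(find(idx), []).append(mentions[idx])
    return list(clusters.values())
```

Note how the union-find structure makes the sensitivity point above concrete: a single spurious positive link immediately merges two clusters into one.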

0:59:07 - 0:59:15     Text: But often people go a little bit beyond that and prefer a mention ranking model. So let

0:59:15 - 0:59:22     Text: me just explain the advantages of that. Normally, if you have a long document where it's

0:59:22 - 0:59:27     Text: Ralph Nader and he did this and some of them did something to him and we visited his house and

0:59:27 - 0:59:34     Text: blah blah blah, and then somebody voted for Nader because he... In terms of building a

0:59:34 - 0:59:43     Text: co-reference classifier, it seems like it's easy and reasonable to be

0:59:43 - 0:59:51     Text: able to recover that this he refers to Nader but in terms of building a classifier for it to

0:59:51 - 0:59:58     Text: recognize that this he should be referring to this Nader which might be three paragraphs back

0:59:58 - 1:00:04     Text: seems kind of unreasonable how you're going to recover that so those far away ones might be almost

1:00:04 - 1:00:10     Text: impossible to get correct and so that suggests that maybe we should have a different way of

1:00:10 - 1:00:19     Text: configuring this task so instead of doing it that way what we should say is well this he here

1:00:19 - 1:00:27     Text: has various possible antecedents and our job is to just choose one of them and that's almost

1:00:27 - 1:00:37     Text: sufficient, apart from we need to add one more choice, which is that some mentions won't be

1:00:37 - 1:00:43     Text: co-referent with anything that precedes them because we're introducing a new entity into the discourse

1:00:43 - 1:00:52     Text: so we can add one more dummy mention, the NA mention, meaning it doesn't refer to anything previously

1:00:52 - 1:00:59     Text: in the discourse, and then our job at each point is to do mention ranking to choose which

1:00:59 - 1:01:08     Text: one of these she refers to. And at that point, rather than doing binary yes/no classifiers,

1:01:08 - 1:01:14     Text: what we can do is say, aha, this is choose-one classification, and then we can use the kind of

1:01:14 - 1:01:23     Text: softmax classifiers that we've seen at many points previously.
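Here is a minimal PyTorch-style sketch of that mention-ranking setup: score every earlier mention as a candidate antecedent, prepend a dummy NA choice with a fixed score of zero, and take a softmax over all of them. The module name, dimensions, and the two-layer scorer are illustrative assumptions, not the architecture of any specific published system.

```python
import torch
import torch.nn as nn

class MentionRanker(nn.Module):
    """Sketch of mention ranking: for a mention, softmax over all earlier
    mentions plus a dummy "no antecedent" (NA) choice."""

    def __init__(self, dim):
        super().__init__()
        # Illustrative pairwise scorer over concatenated mention vectors.
        self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                    nn.Linear(dim, 1))

    def forward(self, mention_vec, antecedent_vecs):
        # mention_vec: (dim,); antecedent_vecs: (num_candidates, dim)
        pairs = torch.cat([mention_vec.expand_as(antecedent_vecs),
                           antecedent_vecs], dim=-1)
        scores = self.scorer(pairs).squeeze(-1)   # one score per candidate
        na_score = torch.zeros(1)                 # fixed score for the dummy NA
        return torch.log_softmax(torch.cat([na_score, scores]), dim=0)
```

One simple training objective under this sketch is the negative log-likelihood of the correct choice, where index 0 is the NA option and index i+1 is the i-th earlier mention.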

1:01:23 - 1:01:30     Text: Okay, so that gets us in business for building systems, and for either of these kinds of models there are several ways in which we

1:01:30 - 1:01:37     Text: can build the system: we could use any kind of traditional machine learning classifier, we could use a

1:01:37 - 1:01:43     Text: simple neural network, or we can use more advanced ones with all of the tools that we've been learning

1:01:43 - 1:01:51     Text: about more recently let me just quickly show you a simple neural network way of doing it so this

1:01:51 - 1:02:01     Text: is a model that my PhD student Kevin Clark did in 2015 so not that long ago but what he was doing was

1:02:01 - 1:02:09     Text: doing co-reference resolution based on the mentions with a simple feed forward neural network kind

1:02:09 - 1:02:14     Text: of in some sense like we did dependency parsing with a simple feed-forward neural network. So

1:02:14 - 1:02:23     Text: for the mention it had word embeddings, and for the candidate antecedent it had word embeddings; there were some additional

1:02:23 - 1:02:28     Text: features of each of the mention and candidate antecedent; and then there were some final

1:02:28 - 1:02:34     Text: additional features that captured things like distance away, which you can't see from either

1:02:34 - 1:02:40     Text: the mention or the candidate, and all of those features were just fed into several feed-

1:02:40 - 1:02:48     Text: forward layers of a neural network, and it gave you a score of whether these things are co-referent or not.
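As a rough illustration of that kind of feed-forward scorer, here is a sketch: embeddings for the mention and the candidate antecedent, plus a small vector of extra pairwise features such as distance, concatenated and pushed through a few hidden layers to a single coreference score. The layer sizes and the feature dimension are made up for illustration, not the actual configuration of that system.

```python
import torch
import torch.nn as nn

class PairwiseCorefScorer(nn.Module):
    """Feed-forward coreference scorer in the spirit of the simple model
    described above: mention embedding + candidate antecedent embedding
    + additional pairwise features (e.g. distance), through a few hidden
    layers to one score. All sizes here are illustrative."""

    def __init__(self, emb_dim=300, feat_dim=20, hidden=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim + feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, mention_emb, antecedent_emb, pair_feats):
        x = torch.cat([mention_emb, antecedent_emb, pair_feats], dim=-1)
        return self.net(x).squeeze(-1)   # higher score = more likely co-referent
```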

1:02:48 - 1:02:58     Text: and that by itself um just worked pretty well um and I won't say more details about that um but

1:02:58 - 1:03:06     Text: what I do want to show is sort of a more advanced um and modern neural co-reference system but before

1:03:06 - 1:03:13     Text: I do that I want to take a digression and sort of say a few words about convolutional neural networks

1:03:16 - 1:03:25     Text: so um the idea of when you apply a convolutional neural network to language i.e. to sequences

1:03:25 - 1:03:32     Text: is that what you're going to do is compute vectors, features effectively,

1:03:32 - 1:03:38     Text: for every possible word subsequence of a certain length. So if you have a piece of text

1:03:38 - 1:03:45     Text: like "tentative deal reached to keep government open", you might say I'm going to take every three words

1:03:45 - 1:03:53     Text: of that: "tentative deal reached", "deal reached to", "reached to keep", and I'm going to compute a vector

1:03:53 - 1:04:01     Text: based on that subsequence of words, and use those computed vectors in my model by somehow

1:04:01 - 1:04:11     Text: grouping them together. So the canonical case of convolutional neural networks is in vision,

1:04:11 - 1:04:19     Text: and so if after this, next quarter, you go along to CS231N, you'll be able to spend weeks

1:04:19 - 1:04:26     Text: doing convolutional neural networks for vision and so the idea there is that you've got these

1:04:26 - 1:04:35     Text: convolutional filters that you sort of slide over an image and you compute a function of each

1:04:35 - 1:04:41     Text: place. So the little red numbers are showing you what you're computing, but then you'll

1:04:41 - 1:04:47     Text: slide it over to the next position and fill in this cell and then you'll slide over the next

1:04:47 - 1:04:53     Text: position and fill in this cell and then you'll slide it down and fill in this cell and so you've

1:04:53 - 1:05:00     Text: got this sort of little function of a patch which you're sliding over your image and computing a

1:05:00 - 1:05:08     Text: convolution, which is just a dot product effectively, that you're then using to get an extra

1:05:08 - 1:05:15     Text: layer of representation and so by sliding things over you can pick out features and you've got a

1:05:15 - 1:05:23     Text: sort of a feature identifier that runs across every piece of the image. Well, for language we've

1:05:23 - 1:05:30     Text: just got a sequence but you can do basically the same thing and what you then have is a 1D

1:05:30 - 1:05:37     Text: convolution for text. So here's my sentence, "tentative deal reached to keep the government open",

1:05:37 - 1:05:47     Text: and what I can do is: these words have word representations, so this is my

1:05:47 - 1:05:56     Text: vector for each word, and then I can have a filter, sometimes called a kernel, which I use for my

1:05:56 - 1:06:03     Text: convolution, and what I'm going to do is slide that down the text. So I can start with the

1:06:03 - 1:06:11     Text: first three words, and I treat them as elements I can dot product and sum,

1:06:11 - 1:06:18     Text: and then I can compute a value as to what they all add up to, which is minus one it turns out, and

1:06:18 - 1:06:24     Text: so then I might have a bias that I add on and get an updated value if my bias is plus one

1:06:27 - 1:06:33     Text: and then I'd run it through a nonlinearity and that will give me a final value. And then I'll

1:06:33 - 1:06:44     Text: slide my filter down and work out a computation for this window of three words and take 0.5

1:06:44 - 1:06:53     Text: times 3 plus 0.2 times 1, etc., and that comes out as this value. I add the bias, I'm

1:06:53 - 1:06:59     Text: going to put it through my nonlinearity, and then I keep on sliding down and I'll do the next three

1:06:59 - 1:07:07     Text: words and keep on going down. And so that gives me a 1D convolution that computes a representation of

1:07:07 - 1:07:16     Text: the text. You might have noticed in the previous example that I started here with seven words,

1:07:16 - 1:07:24     Text: but because I wanted to have a window of three for my convolution, the end result is that things shrunk,

1:07:24 - 1:07:31     Text: so in the output I only had five things. That's not necessarily desirable, so commonly people will

1:07:31 - 1:07:39     Text: deal with that with padding. So if I put padding on both sides, I can then start my size-3 convolution

1:07:39 - 1:07:48     Text: here and compute this one, and then slide it down one

1:07:48 - 1:07:56     Text: and compute this one, and so now my output is the same size as my real input. And so that's a

1:07:56 - 1:08:06     Text: convolution with padding.
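Here is a tiny worked version of that sliding-window computation, with and without padding. The vectors and filter values are random rather than the numbers on the slide; the point is just the mechanics: dot product a size-3 filter with each window of word vectors, add a bias, apply a nonlinearity.

```python
import torch

# Toy 1D convolution by hand: slide a size-3 filter over 7 word vectors
# (dimension 4 here), dot product + bias + tanh at each position.
# The numbers are random, not the values from the slide.
torch.manual_seed(0)
words = torch.randn(7, 4)        # 7 words, each a 4-dim vector
filt  = torch.randn(3, 4)        # one filter covering 3 words at a time
bias  = 1.0

outputs = []
for start in range(len(words) - 3 + 1):          # 5 output positions
    window = words[start:start + 3]              # (3, 4) slice of the text
    outputs.append(torch.tanh((window * filt).sum() + bias))
print(torch.stack(outputs))                      # length-5 feature sequence

# With zero-padding of one word on each side, the output length matches
# the input length (7):
padded = torch.cat([torch.zeros(1, 4), words, torch.zeros(1, 4)])
outputs_padded = [torch.tanh((padded[s:s + 3] * filt).sum() + bias)
                  for s in range(len(padded) - 2)]
print(len(outputs_padded))                       # 7
```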

1:08:06 - 1:08:13     Text: Okay, so that was the start of things, but the way you get more power out of a convolutional network is that you don't only have one filter, you have several filters.

1:08:13 - 1:08:19     Text: So if I have three filters, each of which will have their own bias and nonlinearity, I can then get

1:08:19 - 1:08:26     Text: a three-dimensional representation coming out the end, and you can think of these as

1:08:26 - 1:08:35     Text: conceptually computing different features of your text.

1:08:35 - 1:08:45     Text: Okay, so that gives us a new feature re-representation of our text, but commonly we then want to somehow summarize

1:08:45 - 1:08:53     Text: what we have and a very common way of summarizing what we have is to then do pooling so

1:08:55 - 1:09:00     Text: if we think of these features as detecting different things in the text, they

1:09:00 - 1:09:09     Text: might even be high-level features, like does this show signs of toxicity or hate speech,

1:09:10 - 1:09:15     Text: is there reference to something you're interested in, and does it occur anywhere in the

1:09:15 - 1:09:23     Text: text, what people often then do is a max pooling operation, where for each feature they simply

1:09:23 - 1:09:30     Text: sort of compute the maximum value it ever achieved in any position as you went through the text

1:09:30 - 1:09:36     Text: and say that this vector ends up as the sentence representation sometimes for other purposes rather

1:09:36 - 1:09:42     Text: than max pooling people use average pooling where you take the averages of the different vectors

1:09:43 - 1:09:49     Text: to get the sentence representation, but in general max pooling has been found to be more successful,

1:09:49 - 1:09:55     Text: and that's kind of because if you think of it as feature detectors that are wanting to detect

1:09:55 - 1:10:01     Text: "was this present somewhere?", then something like positive sentiment isn't going to be

1:10:02 - 1:10:09     Text: present in every three-word subsequence you choose, but is it there somewhere, and so

1:10:09 - 1:10:19     Text: often max pooling works better.
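Putting the last few ideas together, here is a short PyTorch sketch with several size-3 filters (using `nn.Conv1d` with padding so the output length matches the input), followed by max pooling and, for comparison, average pooling over positions. All of the sizes are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

# Sketch: several size-3 filters over a sequence of word vectors, then
# max pooling (or average pooling) over positions to get one fixed-size
# vector for the whole text. Sizes are arbitrary.
word_dim, num_filters, seq_len = 4, 3, 7
conv = nn.Conv1d(in_channels=word_dim, out_channels=num_filters,
                 kernel_size=3, padding=1)

x = torch.randn(1, word_dim, seq_len)        # (batch, word_dim, num_words)
features = torch.relu(conv(x))               # (1, num_filters, seq_len)

max_pooled = features.max(dim=2).values      # (1, num_filters): was it anywhere?
avg_pooled = features.mean(dim=2)            # (1, num_filters): how much on average?
```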

1:10:19 - 1:10:27     Text: And so that's a very quick look at convolutional neural networks, except to say that this example was doing 1D convolutions with words, but a very common place that

1:10:27 - 1:10:33     Text: convolutional neural networks are being used in natural language is actually using them with characters,

1:10:33 - 1:10:42     Text: and so what you can do is you can do convolutions over subsequences of the characters in the same way

1:10:42 - 1:10:49     Text: and if you do that this allows you to compute a representation for any sequence of characters

1:10:49 - 1:10:56     Text: so you don't have any problems with being out of vocabulary or anything like that because for any

1:10:56 - 1:11:02     Text: sequence of characters you just compute your convolutional representation and max pool over it. And so

1:11:02 - 1:11:11     Text: quite commonly people use a character convolution to give a representation of words perhaps as the only

1:11:11 - 1:11:20     Text: representation of words, but otherwise as something that you use in addition to a word vector. And so

1:11:20 - 1:11:26     Text: in both BiDAF and the model I'm about to show, at the base level it makes use of both a word vector

1:11:26 - 1:11:33     Text: representation, like we saw right at the beginning, and a character-level convolutional

1:11:33 - 1:11:42     Text: representation of the words.
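Here is a hedged sketch of such a character-level convolution for building word representations: embed the characters, convolve over character windows, and max pool, so any character string gets a vector. The ASCII-based character vocabulary and all of the sizes are assumptions made for this toy example.

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Sketch: build a word vector from its characters via a 1D convolution
    and max pooling, so any character string gets a representation
    (no out-of-vocabulary problem). Sizes are illustrative."""

    def __init__(self, num_chars=128, char_dim=16, word_dim=64, width=3):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, word_dim, kernel_size=width, padding=1)

    def forward(self, word):
        # Assume ASCII for this toy example; real systems use a character vocabulary.
        char_ids = torch.tensor([[min(ord(c), 127) for c in word]])
        chars = self.char_emb(char_ids).transpose(1, 2)   # (1, char_dim, len)
        return torch.relu(self.conv(chars)).max(dim=2).values.squeeze(0)

enc = CharCNNWordEncoder()
print(enc("uncopyrightable").shape)   # torch.Size([64]); works for any string
```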

1:11:42 - 1:11:48     Text: Okay, with that said, I now want to show you, before time runs out, an end-to-end neural co-reference model. So the model I'm going to show you is Kenton Lee's one,

1:11:48 - 1:11:55     Text: done at the University of Washington in 2017. This is no longer the state of the art; I'll mention

1:11:55 - 1:12:02     Text: the state of the art at the end but this was the first model that really said get rid of all of that

1:12:02 - 1:12:09     Text: old stuff of having pipelines and mention detection first just build one end-to-end big model that

1:12:09 - 1:12:16     Text: does everything and returns co-reference, so it's a good one to show. So compared to the earlier

1:12:16 - 1:12:22     Text: simple thing I showed, we're now going to process the text with bi-LSTMs, we're going to make use

1:12:22 - 1:12:29     Text: of attention, and we're going to do all of mention detection and co-reference in one step, end-to-end,

1:12:29 - 1:12:37     Text: and the way it does that is by considering every span of the text up to a certain length as a

1:12:37 - 1:12:43     Text: candidate mention and just figures out a representation for it and whether it's co-referent to

1:12:43 - 1:12:50     Text: other things so what we do at the start is we start with the sequence of words and we calculate

1:12:50 - 1:12:59     Text: from those a standard word embedding and a character-level CNN embedding. We then feed those

1:12:59 - 1:13:08     Text: as inputs into a bidirectional LSTM of the kind that we saw quite a lot of before but then

1:13:08 - 1:13:17     Text: after this what we do is we compute representations for spans so when we have a sequence of words

1:13:18 - 1:13:24     Text: we're then going to work out a representation of a sequence of words which we can then put into

1:13:24 - 1:13:34     Text: our co-reference model. I can't fully illustrate it in this picture, but subsequences of

1:13:34 - 1:13:41     Text: different lengths, like "General Electric", "General Electric said", will all have a span representation,

1:13:41 - 1:13:48     Text: of which I've only shown a subset in green. So how are those computed? Well, the way they're

1:13:48 - 1:13:56     Text: computed is that the span representation is a vector that concatenates several vectors, and it

1:13:56 - 1:14:07     Text: consists of four parts. It consists of the representation that was computed for the start of the span from

1:14:07 - 1:14:17     Text: the bi-LSTM, the representation of the end from the bi-LSTM, that's over here, and then it has a third

1:14:17 - 1:14:24     Text: part that's kind of interesting: this is an attention-based representation that is calculated from

1:14:24 - 1:14:29     Text: the whole span but particularly sort of looks for the head of the span, and then there are still a

1:14:29 - 1:14:36     Text: few additional features. So it turns out that some of these additional things, like length

1:14:36 - 1:14:47     Text: and so on, are still a bit useful. So to work out the third part, the one that's not the beginning and the end,

1:14:47 - 1:14:53     Text: what's done is to calculate an attention-weighted average of the word embeddings. So what you're

1:14:53 - 1:15:01     Text: doing is you're taking the x star representation of the final word of the span and you're feeding

1:15:01 - 1:15:11     Text: that into a neural network to get attention scores for every word in the span which are these three

1:15:11 - 1:15:17     Text: and that's giving you an attention distribution as we've seen previously and then you're calculating

1:15:17 - 1:15:29     Text: the third component of this as an attention-weighted sum of the different words in the span, and

1:15:29 - 1:15:34     Text: so therefore you've got a sort of soft average of the representations of the words of

1:15:34 - 1:15:48     Text: the span.
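A minimal sketch of assembling that span representation, assuming we already have bi-LSTM outputs for the document: concatenate the output at the span start, the output at the span end, an attention-weighted average over the words inside the span (the soft head), and an extra feature such as an embedding of the span width. The exact attention parameterization and the feature set here are illustrative simplifications, not the published configuration.

```python
import torch
import torch.nn as nn

class SpanRepresentation(nn.Module):
    """Sketch of the span representation described above: concatenate the
    bi-LSTM output at the span start, at the span end, an attention-weighted
    average of the words inside the span (a soft head), and a small feature
    vector such as an embedding of the span width. Details are illustrative."""

    def __init__(self, hidden_dim, max_width=10, width_dim=20):
        super().__init__()
        self.attn = nn.Linear(hidden_dim, 1)             # per-word attention score
        self.width_emb = nn.Embedding(max_width + 1, width_dim)

    def forward(self, lstm_out, start, end):
        # lstm_out: (seq_len, hidden_dim) bi-LSTM outputs for the document;
        # assumes end - start <= max_width.
        span = lstm_out[start:end + 1]                    # words inside the span
        alpha = torch.softmax(self.attn(span).squeeze(-1), dim=0)
        soft_head = alpha @ span                          # attention-weighted average
        width = self.width_emb(torch.tensor(end - start))
        return torch.cat([lstm_out[start], lstm_out[end], soft_head, width])
```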

1:15:48 - 1:15:57     Text: Okay, so then once you've got that, what you're doing is feeding these representations into scoring whether spans are co-referent mentions. So you have a representation of

1:15:57 - 1:16:08     Text: the two spans, you have a score that's calculated for whether two different spans look co-referent,

1:16:08 - 1:16:14     Text: and so overall you're getting a score for whether different spans look co-referent or not.

1:16:16 - 1:16:24     Text: And so this model is just run on all spans. Now, it sort of would get intractable if you

1:16:24 - 1:16:30     Text: scored literally every span in a long piece of text so they do some pruning they sort of

1:16:30 - 1:16:38     Text: only allow spans up to a certain maximum size they only consider pairs of spans that aren't too

1:16:38 - 1:16:46     Text: distant from each other, etc., but basically it's, to some approximation, just a complete

1:16:46 - 1:16:52     Text: comparison of spans, and this turns into a very effective co-reference resolution algorithm.
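As a sketch of how the span scoring might be organized, in the spirit of this kind of end-to-end model: a mention score for each span plus a pairwise antecedent score, summed to give the overall coreference score for the pair, with the dummy "no antecedent" choice fixed at zero. The layer sizes and the exact inputs to the pairwise scorer are illustrative assumptions, not the published configuration.

```python
import torch
import torch.nn as nn

class CorefScorer(nn.Module):
    """Sketch of span-pair scoring: s(i, j) = s_m(i) + s_m(j) + s_a(i, j),
    where s_m asks "does this span look like a mention?" and s_a asks
    "do these two spans look co-referent?"; the dummy "no antecedent"
    choice gets a fixed score of 0. Layer sizes are illustrative."""

    def __init__(self, span_dim, hidden=150):
        super().__init__()
        self.mention = nn.Sequential(nn.Linear(span_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 1))
        self.antecedent = nn.Sequential(nn.Linear(2 * span_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, 1))

    def forward(self, span_i, span_j):
        s_m_i = self.mention(span_i)
        s_m_j = self.mention(span_j)
        s_a = self.antecedent(torch.cat([span_i, span_j], dim=-1))
        return (s_m_i + s_m_j + s_a).squeeze(-1)
```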

1:16:52 - 1:16:59     Text: Today it's not the best co-reference resolution algorithm, because maybe not surprisingly, like

1:16:59 - 1:17:05     Text: everything else that we've been dealing with there's now been these transformer models like

1:17:05 - 1:17:12     Text: BERT have come along, and they produce even better results. So the best co-reference systems now

1:17:12 - 1:17:20     Text: make use of BERT. In particular, when Danqi spoke she briefly mentioned SpanBERT,

1:17:20 - 1:17:28     Text: which was a variant of BERT which blanks out, for reconstruction, subsequences of words

1:17:28 - 1:17:33     Text: rather than just a single word, and SpanBERT has actually proven to be very effective

1:17:34 - 1:17:38     Text: for doing co-reference perhaps because you can blank out whole mentions

1:17:40 - 1:17:46     Text: People have also gotten gains, actually funnily, by treating co-reference as a question answering

1:17:46 - 1:17:56     Text: task. So effectively you can find a mention like "he" or "the person" and ask what is its

1:17:56 - 1:18:03     Text: antecedent, and get a question answering answer, and that's a good way to do co-reference.

1:18:05 - 1:18:10     Text: so if we put that together as time is running out let me just sort of give you some

1:18:10 - 1:18:18     Text: sense of how results come out for co-reference systems so I'm skipping a bit actually which

1:18:18 - 1:18:26     Text: you can find in the slides, which is how co-reference is scored, but essentially it's scored on a clustering

1:18:26 - 1:18:33     Text: metric, so a perfect clustering would give you a hundred, and something that makes no correct decisions

1:18:33 - 1:18:41     Text: would give you zero. And so this is sort of how the co-reference numbers have been panning out.

1:18:41 - 1:18:49     Text: so back in 2010 actually this was a Stanford system this was a state of the art system for

1:18:49 - 1:18:56     Text: co-reference and won a competition. It was actually a non-machine learning model, because again we

1:18:56 - 1:19:03     Text: were wanting to prove how these rule-based methods in practice work kind of well, and so its accuracy

1:19:03 - 1:19:13     Text: was around 55 for English, 50 for Chinese. Then gradually machine learning, these were sort of statistical

1:19:13 - 1:19:20     Text: machine learning models, got a bit better. Wiseman was the very first neural co-reference system,

1:19:20 - 1:19:26     Text: and that gave some gains. Here is a system that Kevin Clark and I did, which gave a little bit

1:19:26 - 1:19:35     Text: further gains. So Lee is the model that I've just shown you as the end-to-end model, and it got

1:19:35 - 1:19:42     Text: a bit of further gains. But then again, what gave the huge breakthrough, just like for question

1:19:42 - 1:19:49     Text: answering, was the use of SpanBERT. So once we move to here, we're now using SpanBERT,

1:19:49 - 1:19:56     Text: and that's giving you about an extra 10 percent or so. The CorefQA technique proved to be useful,

1:19:57 - 1:20:03     Text: and then the very latest best results are effectively combining together SpanBERT,

1:20:04 - 1:20:13     Text: or a larger version of SpanBERT, and CorefQA, and getting up to 83. So you might think from that

1:20:13 - 1:20:21     Text: that co-reference is sort of doing really well and is getting close to solved, like other NLP tasks.

1:20:22 - 1:20:27     Text: Well, it's somewhat true that in neural times the results have been getting way, way better than they

1:20:27 - 1:20:34     Text: had been before but I would caution you that these results that I just showed were on a corpus

1:20:34 - 1:20:40     Text: called OntoNotes, which is mainly newswire, and it turns out that newswire co-reference is pretty

1:20:40 - 1:20:47     Text: easy I mean in particular there's a lot of mention of the same entities right so the newspaper

1:20:47 - 1:20:54     Text: articles are full of mentions of the United States and China and leaders of the different countries

1:20:54 - 1:21:00     Text: and it's sort of very easy to work out what they're co-referent to, and so the co-reference scores

1:21:01 - 1:21:10     Text: are fairly high whereas if what you do is take something like a page of dialogue from a novel

1:21:10 - 1:21:16     Text: and feed that into a system and say okay do the co-reference correctly you'll find pretty rapidly

1:21:17 - 1:21:23     Text: that the performance of the models is much more modest if you'd like to try out a co-reference

1:21:23 - 1:21:32     Text: system for yourself, there are pointers to a couple of them here, where the top one is ours from

1:21:32 - 1:21:38     Text: Stanford, Kevin Clark's neural co-reference, and this is one that goes with the Hugging Face

1:21:38 - 1:21:41     Text: repository that we've mentioned.