Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 15 - Add Knowledge to Language Models

0:00:00 - 0:00:07     Text: Welcome to CS224N Lecture 15.

0:00:07 - 0:00:11     Text: So I'm Megan and I'm one of the CAs in this course, and I'm also a PhD student

0:00:11 - 0:00:12     Text: working with Chris Ré.

0:00:12 - 0:00:18     Text: And today I'll be talking about integrating knowledge and language models.

0:00:18 - 0:00:21     Text: So some quick reminders: your project milestones are due today.

0:00:21 - 0:00:24     Text: So hopefully you've turned those in already or will be turning them in in the next couple

0:00:24 - 0:00:25     Text: of days.

0:00:25 - 0:00:29     Text: And we'll try to get you feedback on those as fast as possible.

0:00:29 - 0:00:32     Text: So something to be aware of is that the change of grading basis and course withdrawal deadline

0:00:32 - 0:00:34     Text: is this Friday.

0:00:34 - 0:00:37     Text: So if you want to make any change to your grading basis, make sure to do that by then.

0:00:37 - 0:00:41     Text: And we'll be getting you the grades back on assignment five by then as well in case

0:00:41 - 0:00:44     Text: that's helpful in making your decision.

0:00:44 - 0:00:46     Text: And finally, your final projects are due in two weeks.

0:00:46 - 0:00:49     Text: So hopefully those are going smoothly.

0:00:49 - 0:00:52     Text: So topic of the day is integrating knowledge and language models.

0:00:52 - 0:00:56     Text: You've seen a bit about this idea in assignment five and also in Colin Raffel's lecture

0:00:56 - 0:00:57     Text: last class.

0:00:57 - 0:01:01     Text: In assignment five, the task was to train a model to predict the birthplace of a person

0:01:01 - 0:01:03     Text: given their name.

0:01:03 - 0:01:05     Text: And you saw that by pre-training on a larger data set,

0:01:05 - 0:01:09     Text: you were actually able to do better on this task, since you could encode some real-world knowledge

0:01:09 - 0:01:11     Text: into the language model.

0:01:11 - 0:01:16     Text: And then last lecture Colin Raffel presented how T5 could actually be fine-tuned for a closed-

0:01:16 - 0:01:21     Text: domain question answering task such that you can give T5 a natural language question and

0:01:21 - 0:01:24     Text: it'll return an answer.

0:01:24 - 0:01:27     Text: So today we'll be building on these threads and looking at techniques that researchers have

0:01:27 - 0:01:32     Text: recently been developing to increase the amount of knowledge in language models.

0:01:32 - 0:01:35     Text: So we're going to start with a quick recap of language models just to make sure we're

0:01:35 - 0:01:37     Text: all on the same page.

0:01:37 - 0:01:40     Text: Then we're going to talk about what types of knowledge language models can already encode

0:01:40 - 0:01:41     Text: and what they might struggle on.

0:01:41 - 0:01:46     Text: We'll also motivate why researchers are interested in increasing the amount of knowledge in language

0:01:46 - 0:01:50     Text: models and what this could enable for future AI systems if we have language models that can

0:01:50 - 0:01:54     Text: actually reliably recall knowledge.

0:01:54 - 0:01:58     Text: We'll talk about three broad classes of techniques that researchers have been using to add knowledge

0:01:58 - 0:01:59     Text: to language models.

0:01:59 - 0:02:04     Text: These include adding pre-trained entity embeddings, using an external memory or key-value store, or

0:02:04 - 0:02:07     Text: even just modifying the training data.

0:02:07 - 0:02:10     Text: And for each of these techniques, we'll talk about at least one recent work that uses

0:02:10 - 0:02:11     Text: the technique.

0:02:11 - 0:02:14     Text: So hopefully it's clear how to actually employ these in practice.

0:02:14 - 0:02:19     Text: And then finally, we'll wrap up by talking about how to evaluate the knowledge in language

0:02:19 - 0:02:24     Text: models and the challenges that come up in trying to do this.

0:02:24 - 0:02:25     Text: So let's dive right in.

0:02:25 - 0:02:28     Text: We're going to start by talking about standard language models.

0:02:28 - 0:02:31     Text: You learned about these at the beginning of the course.

0:02:31 - 0:02:35     Text: And the task is to predict the next word in a sequence of text and to compute the probability

0:02:35 - 0:02:36     Text: of a sequence.

0:02:36 - 0:02:39     Text: So you may remember the example "the students opened their blank."

0:02:39 - 0:02:43     Text: And we talked about how it could be minds, exams, or books here.

0:02:43 - 0:02:47     Text: And the task of a standard language model is to predict the most likely next word in the

0:02:47 - 0:02:48     Text: sequence.

0:02:48 - 0:02:53     Text: A couple of lectures ago, John also introduced the notion of masked language models.

0:02:53 - 0:02:56     Text: And instead of predicting the next word in a sequence of text, the task is to predict

0:02:56 - 0:02:58     Text: the masked token.

0:02:58 - 0:03:01     Text: And this is done using bidirectional context.

0:03:01 - 0:03:04     Text: So you may remember the example "I [MASK] to the [MASK]", and the goal of the masked language

0:03:04 - 0:03:09     Text: model is to predict the most likely token for each of the masked-out words.

0:03:09 - 0:03:11     Text: So maybe "I went to the store."
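
As a rough illustration of the masked-prediction task (a minimal sketch, assuming the Hugging Face transformers library is available; the model name and the sentence are just examples, not part of the lecture):

```python
# Minimal sketch: query a masked language model for the most likely fillers of a [MASK] slot.
# Assumes the Hugging Face "transformers" package (with a PyTorch or TensorFlow backend)
# is installed; the model name and sentence are illustrative.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Returns the top candidate tokens for the masked position, with scores.
for prediction in unmasker("I went to the [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```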

0:03:11 - 0:03:15     Text: So while there are some differences in these two types of language models, whether they are

0:03:15 - 0:03:19     Text: predicting the next word or predicting the masked-out token, they're similar

0:03:19 - 0:03:23     Text: in that they can both be trained over large amounts of unlabeled text.

0:03:23 - 0:03:26     Text: And this is one of the reasons why they've been so widely adopted.

0:03:26 - 0:03:30     Text: They don't require any human annotated data.

0:03:30 - 0:03:34     Text: So you've seen that language models can be used for a variety of tasks, from summarization,

0:03:34 - 0:03:40     Text: to dialogue, to fluency evaluation, tasks that involve either generating text or evaluating

0:03:40 - 0:03:43     Text: the probability of text.

0:03:43 - 0:03:46     Text: And more recently we've seen that language models can also be used to generate pre-trained

0:03:46 - 0:03:51     Text: representations of text that encode some notion of language understanding, and have

0:03:51 - 0:03:56     Text: been shown to be widely useful for different downstream NLP tasks.

0:03:56 - 0:03:59     Text: And then finally, today we're going to touch on this idea that if language models are

0:03:59 - 0:04:07     Text: trained over massive amounts of text, can they even be used as a knowledge base?

0:04:07 - 0:04:10     Text: So we're going to start by looking at what types of factual knowledge a language model

0:04:10 - 0:04:15     Text: might already know. And these examples are taken from a paper by Petroni et al, in

0:04:15 - 0:04:17     Text: EMNLP a couple of years ago.

0:04:17 - 0:04:22     Text: And the goal is to test the factual or common sense knowledge in existing language models

0:04:22 - 0:04:24     Text: such as BERT-large.

0:04:24 - 0:04:26     Text: So let's check out what BERT-large predicts.

0:04:26 - 0:04:33     Text: iPod Touch is produced by Apple, London Jazz Festival is located in London, Dani Alves

0:04:33 - 0:04:40     Text: plays with Santos, Carl III used to communicate in German, and ravens can fly.

0:04:40 - 0:04:44     Text: So here we have the correct predictions in green and the incorrect predictions in red.

0:04:44 - 0:04:48     Text: And if you know anything about sports, you may know that Dani Alves is a soccer player,

0:04:48 - 0:04:50     Text: Santos is a soccer team.

0:04:50 - 0:04:53     Text: Here they were hoping that it would predict Barcelona, because at least at the time of

0:04:53 - 0:04:58     Text: this data set, apparently he played for Barcelona, and Carl III actually used to communicate

0:04:58 - 0:05:01     Text: in Swedish, not German.

0:05:01 - 0:05:05     Text: So what's good about these examples is that the predictions are generally reasonable.

0:05:05 - 0:05:08     Text: If you didn't know the ground truth, they all make sense.

0:05:08 - 0:05:13     Text: When you want it to predict a language, it does in fact predict a language.

0:05:13 - 0:05:17     Text: But of course they're not all factually correct.

0:05:17 - 0:05:19     Text: So why might this happen?

0:05:19 - 0:05:22     Text: Well, for one, the fact might not have been seen in training.

0:05:22 - 0:05:25     Text: And you can't expect the language model to do more than recall facts that it has seen

0:05:25 - 0:05:26     Text: in training.

0:05:26 - 0:05:29     Text: It can't make up facts about the world for instance.

0:05:29 - 0:05:31     Text: It's also possible the fact is just really rare.

0:05:31 - 0:05:35     Text: So maybe the language model has seen the fact during training, but it hasn't seen it

0:05:35 - 0:05:38     Text: enough times to actually memorize the fact.

0:05:38 - 0:05:42     Text: And the last issue is a little more subtle, which is that the model might just be very sensitive

0:05:42 - 0:05:45     Text: to the phrasing of the fill-in-the-blank statement.

0:05:45 - 0:05:50     Text: And so for example, you might have statements like X was created in blank that the model

0:05:50 - 0:05:54     Text: can't predict correctly, but if you change it to X was made in blank, suddenly it can

0:05:54 - 0:05:55     Text: predict it correctly.

0:05:55 - 0:06:02     Text: And we'll come back to this when we talk about how to actually evaluate the knowledge in these language models.

0:06:02 - 0:06:07     Text: So this inability to reliably recall knowledge is a key challenge facing language models

0:06:07 - 0:06:08     Text: today.

0:06:08 - 0:06:09     Text: It'll be the focus of this talk.

0:06:09 - 0:06:14     Text: Recent works have found that language models can recover some knowledge, including the

0:06:14 - 0:06:16     Text: work that Colin presented last class.

0:06:16 - 0:06:18     Text: They've had very encouraging results.

0:06:18 - 0:06:21     Text: But there's still a way to go, as we saw with the fill-in-the-blank statements and with

0:06:21 - 0:06:24     Text: these challenges that we just discussed above.

0:06:24 - 0:06:28     Text: So as a result, the past couple years have had a ton of rapid progress in this area of

0:06:28 - 0:06:33     Text: research in terms of trying to figure out how do you actually encode more knowledge in

0:06:33 - 0:06:37     Text: language models.

0:06:37 - 0:06:41     Text: So I also want to motivate why researchers are interested in building language models

0:06:41 - 0:06:45     Text: that can more reliably recall knowledge.

0:06:45 - 0:06:49     Text: And one of these reasons is that the pre-trained representations are used in a variety of downstream

0:06:49 - 0:06:50     Text: tasks.

0:06:50 - 0:06:53     Text: And some of these downstream tasks are knowledge intensive.

0:06:53 - 0:06:58     Text: So for instance, you might have a downstream task to extract the relations between two

0:06:58 - 0:07:00     Text: entities in a sentence.

0:07:00 - 0:07:02     Text: And this is commonly known as relation extraction.

0:07:02 - 0:07:06     Text: And this is much easier if you have some knowledge of the entities, which could be potentially

0:07:06 - 0:07:11     Text: provided by this pre-trained language model representation.

0:07:11 - 0:07:12     Text: And when we talk about evaluation,

0:07:12 - 0:07:17     Text: we'll talk about what types of tasks are most likely to benefit from these knowledge-rich

0:07:17 - 0:07:20     Text: pre-trained representations.

0:07:20 - 0:07:25     Text: And then as a stretch goal, some researchers are starting to propose the idea that

0:07:25 - 0:07:30     Text: language models could actually ultimately be used to replace traditional knowledge bases.

0:07:30 - 0:07:34     Text: So instead of querying a knowledge base for a fact, like you might right now with SQL,

0:07:34 - 0:07:37     Text: you'd query a language model with a natural language prompt.

0:07:37 - 0:07:41     Text: And of course, this does require the language model to be high quality in terms of recalling

0:07:41 - 0:07:42     Text: facts.

0:07:42 - 0:07:49     Text: So we might not be there yet, but it's an interesting direction for us to be moving towards.

0:07:49 - 0:07:52     Text: So I want to make it super clear what I mean by a knowledge base.

0:07:52 - 0:07:55     Text: Here we're just talking about a knowledge graph where the nodes in the graph would be

0:07:55 - 0:07:57     Text: entities.

0:07:57 - 0:08:00     Text: And the edges are going to be relations between the entities.

0:08:00 - 0:08:04     Text: So for example, here we have a subset of a knowledge graph for Franklin D. Roosevelt.

0:08:04 - 0:08:08     Text: And you see the information about his spouse, his place of birth, his date of birth, and

0:08:08 - 0:08:09     Text: so on.

0:08:09 - 0:08:14     Text: An important thing to note is this is a structured way of storing the knowledge, since it's

0:08:14 - 0:08:15     Text: just in a graph form.

0:08:15 - 0:08:19     Text: And you can actually describe these graphs with knowledge graph triples, which will be

0:08:19 - 0:08:22     Text: an important vocabulary word throughout this talk.

0:08:22 - 0:08:29     Text: So a knowledge graph triple consists of a subject entity, a relation, and an object

0:08:29 - 0:08:30     Text: entity.

0:08:30 - 0:08:34     Text: So for instance, here we might have Franklin D. Roosevelt, date of birth, January 30th,

0:08:34 - 0:08:35     Text: 1882.

0:08:35 - 0:08:37     Text: And that would form a knowledge graph triple.

0:08:37 - 0:08:43     Text: We'll also refer to this as a parent entity, a relation, and a tail entity.
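
For instance, a triple can be written as a plain (subject, relation, object) tuple; a minimal sketch in Python, using the facts mentioned on this slide:

```python
# A knowledge graph triple is just (subject entity, relation, object entity),
# also called (parent entity, relation, tail entity) in this lecture.
fdr_triples = [
    ("Franklin D. Roosevelt", "spouse", "Eleanor Roosevelt"),
    ("Franklin D. Roosevelt", "date of birth", "January 30, 1882"),
]
```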

0:08:43 - 0:08:46     Text: So Wikidata is one very popular knowledge base you might come across if you're working

0:08:46 - 0:08:48     Text: in this area.

0:08:48 - 0:08:52     Text: It's a free knowledge base that's actually populated by humans, so they're filling in

0:08:52 - 0:08:54     Text: these relations and entities.

0:08:54 - 0:08:57     Text: And it's also multilingual.

0:08:57 - 0:09:02     Text: So if you want information from this knowledge base, what you do is you'd write a SQL query.

0:09:02 - 0:09:03     Text: This is a simplified one.

0:09:03 - 0:09:08     Text: But the idea is you'd want to figure out the date of birth of Franklin Roosevelt, so

0:09:08 - 0:09:12     Text: you would write a query like the following.
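
The query itself isn't reproduced in this transcript, but a simplified SQL-style query of the kind described might look like the sketch below (the triples table and column names are hypothetical, just for illustration):

```python
# Illustrative only: a simplified SQL query over a hypothetical "triples" table,
# mimicking how you might look up a fact in a structured knowledge base.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, relation TEXT, object TEXT)")
conn.execute(
    "INSERT INTO triples VALUES ('Franklin D. Roosevelt', 'date_of_birth', 'January 30, 1882')"
)

query = """
    SELECT object FROM triples
    WHERE subject = 'Franklin D. Roosevelt' AND relation = 'date_of_birth'
"""
print(conn.execute(query).fetchall())  # [('January 30, 1882',)]
```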

0:09:12 - 0:09:16     Text: Now if instead you want to query a language model as a knowledge base, you'll have something

0:09:16 - 0:09:20     Text: like this diagram that you've actually probably seen in several lectures now.

0:09:20 - 0:09:25     Text: And the idea is you'll train a language model over this unstructured text.

0:09:25 - 0:09:30     Text: And then you'll use a language model to just answer these natural language query statements.

0:09:30 - 0:09:36     Text: So here, this is the work on T5 where they're training T5 over natural language or just

0:09:36 - 0:09:39     Text: unstructured text with the span corruption task.

0:09:39 - 0:09:42     Text: And then they're asking T5 when was Franklin D. Roosevelt born?

0:09:42 - 0:09:46     Text: And the idea is T5 will produce a textual answer.

0:09:46 - 0:09:50     Text: So you can see this contrast very much with the old approach of using a traditional knowledge

0:09:50 - 0:09:58     Text: base, where the knowledge base is structured and you have SQL statements to query it.

0:09:58 - 0:10:01     Text: So what are the advantages of using language models over traditional knowledge bases?

0:10:01 - 0:10:04     Text: And why might people think this could be a good idea?

0:10:04 - 0:10:08     Text: Well for one, the language models are pre-trained over large amounts of unstructured and unlabeled

0:10:08 - 0:10:14     Text: text, whereas traditional knowledge bases require manual annotation, like with Wikidata where

0:10:14 - 0:10:19     Text: people actually are populating it, or complex NLP pipelines to extract from unstructured

0:10:19 - 0:10:24     Text: text into a structured form that forms a knowledge base.

0:10:24 - 0:10:28     Text: Language models can also support more flexible natural language queries.

0:10:28 - 0:10:34     Text: So if we take the example, what does the final F in the song UFOF stand for?

0:10:34 - 0:10:39     Text: A knowledge base probably won't have a field for final F, so it won't be able to answer your

0:10:39 - 0:10:40     Text: query.

0:10:40 - 0:10:43     Text: But there's a chance that a language model could actually learn and have a response for

0:10:43 - 0:10:46     Text: this natural language query.

0:10:46 - 0:10:50     Text: They also had a less extreme example in this paper by Petroni and others, where maybe

0:10:50 - 0:10:54     Text: your relation in the knowledge base would be "works for".

0:10:54 - 0:10:56     Text: And then you ask for "is working for".

0:10:56 - 0:11:00     Text: And the knowledge base doesn't have an exact match on the field, and so it returns an

0:11:00 - 0:11:01     Text: empty response.

0:11:01 - 0:11:06     Text: And it's reasonable to believe that your language model could figure out that

0:11:06 - 0:11:07     Text: these relations are similar.

0:11:07 - 0:11:13     Text: So if I know the answer to one of them, I probably know the answer to the other.

0:11:13 - 0:11:15     Text: Of course, it's not all advantages.

0:11:15 - 0:11:19     Text: There's also many open challenges using language models as knowledge bases.

0:11:19 - 0:11:21     Text: So for one, it's harder to interpret.

0:11:21 - 0:11:24     Text: When a traditional knowledge base produces an answer, there's actually provenance information

0:11:24 - 0:11:28     Text: associated with why it returned that particular answer.

0:11:28 - 0:11:34     Text: But with a language model, it's really not clear why it might produce a prediction.

0:11:34 - 0:11:38     Text: The knowledge is just encoded in the parameters of the model.

0:11:38 - 0:11:39     Text: It's also harder to trust.

0:11:39 - 0:11:45     Text: So you saw this in assignment 5, where the language model could produce realistic predictions

0:11:45 - 0:11:46     Text: that are incorrect.

0:11:46 - 0:11:50     Text: So it's not easy to know when the language model actually knows the fact versus when it's using

0:11:50 - 0:11:53     Text: some biases to make its prediction.

0:11:53 - 0:11:57     Text: And in the case of the traditional knowledge base, if it doesn't know a fact, it's just

0:11:57 - 0:12:00     Text: going to have an empty response.

0:12:00 - 0:12:05     Text: And then finally, language models are harder to modify.

0:12:05 - 0:12:09     Text: So in a knowledge base, if you want to update a fact, you just change the fact directly

0:12:09 - 0:12:11     Text: in the structured data.

0:12:11 - 0:12:14     Text: But with a language model, it's not quite clear how you would do this.

0:12:14 - 0:12:19     Text: You could fine tune the model longer on the updated data, but how do you know if it still

0:12:19 - 0:12:23     Text: has some memorization of the old fact?

0:12:23 - 0:12:27     Text: So there are a lot of open challenges to this goal of actually using language models as

0:12:27 - 0:12:29     Text: traditional knowledge bases.

0:12:29 - 0:12:33     Text: But hopefully you see why some people think this could actually be a good idea.

0:12:33 - 0:12:37     Text: And why researchers are interested in training language models that can actually integrate

0:12:37 - 0:12:41     Text: more knowledge.

0:12:41 - 0:12:43     Text: So that brings us to section two of the talk.

0:12:43 - 0:12:47     Text: So I want to pause here just in case there's any questions.

0:12:47 - 0:12:53     Text: Okay.

0:12:53 - 0:12:54     Text: Okay.

0:12:54 - 0:12:55     Text: Awesome.

0:12:55 - 0:13:00     Text: So now we're going to be talking about what techniques researchers are using to actually

0:13:00 - 0:13:03     Text: add more knowledge to language models.

0:13:03 - 0:13:06     Text: So we're going to talk about three broad classes of techniques.

0:13:06 - 0:13:10     Text: This is by no means exhaustive, but hopefully it gives you a good overview so that if you

0:13:10 - 0:13:13     Text: want to dive deeper, you can.

0:13:13 - 0:13:17     Text: So we'll start by talking about adding pre-trained entity embeddings and for each section we'll

0:13:17 - 0:13:22     Text: kind of focus on the first work that you see in the bullets, but we'll also talk about

0:13:22 - 0:13:23     Text: briefly some of the variants.

0:13:23 - 0:13:31     Text: So you see how the works within each class can differ and what knobs you can turn.

0:13:31 - 0:13:35     Text: So for adding pre-trained embeddings, we first need to figure out what pre-trained embeddings

0:13:35 - 0:13:39     Text: would actually be the most useful to add knowledge to language models.

0:13:39 - 0:13:43     Text: And this can start with an observation that facts about the world are usually in terms

0:13:43 - 0:13:45     Text: of entities.

0:13:45 - 0:13:49     Text: So if we have a fact like Washington was the first president of the United States, we have

0:13:49 - 0:13:53     Text: the entities Washington and the United States.

0:13:53 - 0:13:57     Text: But pre-trained word embeddings don't have this notion of entities.

0:13:57 - 0:14:02     Text: So we'd have different word embeddings for USA, United States of America, and America, even

0:14:02 - 0:14:05     Text: though these all refer to the same entity.

0:14:05 - 0:14:08     Text: And this makes it challenging for the language model to actually learn any representations

0:14:08 - 0:14:15     Text: over these entities, since they may be referred to many ways in the text.

0:14:15 - 0:14:22     Text: So what if instead we have a single embedding per entity? We'll refer to these as entity embeddings.

0:14:22 - 0:14:27     Text: So now you'd have a single entity embedding for USA, United States of America, and America.

0:14:27 - 0:14:33     Text: And whenever you see a phrase in text referring to this entity, you would use the same entity

0:14:33 - 0:14:35     Text: embedding.

0:14:35 - 0:14:40     Text: And these entity embeddings can actually be pre-trained to encode this factual knowledge about the

0:14:40 - 0:14:41     Text: world.

0:14:41 - 0:14:44     Text: And this first class of techniques we'll be looking at will be how to actually best use

0:14:44 - 0:14:50     Text: these pre-trained entity embeddings in a language model.

0:14:50 - 0:14:54     Text: So I do want to make a quick note that these entity embeddings are only useful to a language

0:14:54 - 0:14:55     Text: model

0:14:55 - 0:15:00     Text: if you can do another NLP task, called entity linking, well.

0:15:00 - 0:15:05     Text: So I'm going to take a quick aside and explain what is entity linking.

0:15:05 - 0:15:09     Text: So a definition of entity linking is to link mentions in text to entities in a knowledge

0:15:09 - 0:15:10     Text: base.

0:15:10 - 0:15:14     Text: I like to think about this in terms of how you use word embeddings.

0:15:14 - 0:15:17     Text: So if you want to use word embeddings and you have a sentence, you're going to first

0:15:17 - 0:15:19     Text: tokenize that sentence into words.

0:15:19 - 0:15:22     Text: And then for each word, you're going to look up their corresponding ID in some word embedding

0:15:22 - 0:15:23     Text: matrix.

0:15:23 - 0:15:26     Text: And now you have your word embedding.

0:15:26 - 0:15:30     Text: Well for entity embeddings, the dictionary look up isn't so easy.

0:15:30 - 0:15:34     Text: You might have sentences like Washington was the first president of the United States.

0:15:34 - 0:15:36     Text: Well, Washington has two different candidates.

0:15:36 - 0:15:40     Text: Are we talking about George Washington or are we talking about Washington state?

0:15:40 - 0:15:44     Text: And these are different entities that have different entity embeddings.

0:15:44 - 0:15:49     Text: And the QIDs here would just be their identifiers in Wikidata.

0:15:49 - 0:15:52     Text: And the United States just has a single entity.

0:15:52 - 0:15:56     Text: So the task of entity linking is to figure out, for these ambiguous mentions, what

0:15:56 - 0:16:00     Text: entity they actually link to in the knowledge base.

0:16:00 - 0:16:03     Text: And there's many different ways you can do this entity linking.

0:16:03 - 0:16:06     Text: One way you might be able to do this is to figure out that, oh, I see the context

0:16:06 - 0:16:07     Text: word of president.

0:16:07 - 0:16:11     Text: So Washington probably links to George Washington.
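
To make the contrast with word embedding lookup concrete, here is a toy sketch (the IDs, vectors, and linker output below are all made up for illustration, not real Wikidata entries):

```python
import numpy as np

# Word embeddings: a straightforward token-to-vector lookup.
word_embeddings = {"washington": np.zeros(4), "president": np.zeros(4)}

# Entity embeddings: one vector per *entity*, keyed here by made-up Wikidata-style IDs.
entity_embeddings = {
    "Q_GEORGE_WASHINGTON": np.zeros(4),
    "Q_WASHINGTON_STATE": np.zeros(4),
    "Q_UNITED_STATES": np.zeros(4),
}

# The surface form "Washington" alone is ambiguous: it has two candidate entities.
candidates = {"Washington": ["Q_GEORGE_WASHINGTON", "Q_WASHINGTON_STATE"]}

# An entity linker's job is to use context (e.g., the word "president") to resolve the mention,
# so we know *which* entity embedding to use at this position in the sequence.
linked = {"Washington": "Q_GEORGE_WASHINGTON"}  # output of a (hypothetical) entity linker
washington_entity_vector = entity_embeddings[linked["Washington"]]
```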

0:16:11 - 0:16:13     Text: Just some more definitions.

0:16:13 - 0:16:16     Text: We're going to refer to Washington as a mention and United States as a mention.

0:16:16 - 0:16:21     Text: And then the things that the mentions could link to, so the two options for Washington,

0:16:21 - 0:16:24     Text: are going to be called candidates.

0:16:24 - 0:16:25     Text: So this is a whole research area of its own.

0:16:25 - 0:16:28     Text: And I encourage you to check out the resources at the bottom if you're interested in learning

0:16:28 - 0:16:29     Text: more.

0:16:29 - 0:16:33     Text: But right now, the most important thing to understand is that entity linking is what is

0:16:33 - 0:16:36     Text: going to tell us which entity embeddings are actually relevant to the text and which

0:16:36 - 0:16:41     Text: ones you want to use as you iterate through a sequence.

0:16:41 - 0:16:46     Text: There are a few questions around here.

0:16:46 - 0:16:51     Text: One of them is, so that's entity linking, but what about the relations?

0:16:51 - 0:16:53     Text: Yeah.

0:16:53 - 0:16:57     Text: So some of the works we'll talk about will only use the entity embeddings.

0:16:57 - 0:17:01     Text: So some of these have been pre-trained with relation information, but in the end, you

0:17:01 - 0:17:04     Text: only have an entity embedding.

0:17:04 - 0:17:07     Text: So relation extraction is yet another NLP task that you could also do.

0:17:07 - 0:17:09     Text: But yeah, here we're just talking about entity linking.

0:17:09 - 0:17:14     Text: But then if you have the knowledge graph you showed earlier, it had relations in it, right?

0:17:14 - 0:17:20     Text: Do you get any connection between that and the text?

0:17:20 - 0:17:23     Text: I mean, that's the goal of relation extraction, right?

0:17:23 - 0:17:27     Text: Just to figure out, like given the entities, what is the relation between them, which would

0:17:27 - 0:17:33     Text: then form the full triple of parent entity, relation, and tail entity?

0:17:33 - 0:17:34     Text: Okay.

0:17:34 - 0:17:41     Text: Then I think people want to know more about how it's going to be used, but maybe you should

0:17:41 - 0:17:43     Text: go on and show some examples.

0:17:43 - 0:17:45     Text: Yeah, I will, for sure.

0:17:45 - 0:17:46     Text: Okay.

0:17:46 - 0:17:47     Text: Great.

0:17:47 - 0:17:53     Text: So entity embeddings, just to summarize, they're like word embeddings, but they're for entities

0:17:53 - 0:17:54     Text: instead.

0:17:54 - 0:17:58     Text: So you'll have some vector associated with George Washington, and it should be meaningful

0:17:58 - 0:18:02     Text: in embedding space such that maybe the George Washington vector is close to the vectors

0:18:02 - 0:18:05     Text: for other founding fathers.

0:18:05 - 0:18:09     Text: So we're going to briefly talk about some methods for training entity embeddings.

0:18:09 - 0:18:11     Text: There are knowledge graph embedding methods.

0:18:11 - 0:18:13     Text: You might have heard of the TransE embedding method.

0:18:13 - 0:18:17     Text: So this starts from the idea of having these knowledge graph triples, and you want to learn

0:18:17 - 0:18:20     Text: pre-trained entity and pre-trained relation embeddings.

0:18:20 - 0:18:23     Text: And you want it to be the case that the sum of the subject embedding and the relation embedding

0:18:23 - 0:18:28     Text: is close to the object embedding in vector space.

0:18:28 - 0:18:31     Text: So it's an algorithm to learn that constraint.
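
Written out, the TransE constraint for a true triple (subject s, relation r, object o) is roughly the following (margin-based training details omitted):

```latex
\mathbf{e}_s + \mathbf{e}_r \approx \mathbf{e}_o
\quad\Longleftrightarrow\quad
\text{minimize } \lVert \mathbf{e}_s + \mathbf{e}_r - \mathbf{e}_o \rVert \text{ for true triples } (s, r, o).
```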

0:18:31 - 0:18:33     Text: There's also word entity co-occurrence methods.

0:18:33 - 0:18:37     Text: So these build off of word2vec, one of them is even called Wikipedia2Vec, and the

0:18:37 - 0:18:42     Text: idea is given an entity, you want to figure out what words are most likely to co-occur around

0:18:42 - 0:18:44     Text: it.

0:18:44 - 0:18:48     Text: And then the last method, or one of the other methods that is common now, is actually just

0:18:48 - 0:18:53     Text: using a transformer to learn representations of an entity by encoding the entity description.

0:18:53 - 0:18:58     Text: And so BLINK from Facebook is an approach that does this.

0:18:58 - 0:19:01     Text: So the methods we'll talk about today are actually agnostic to how you train your pre-trained

0:19:01 - 0:19:02     Text: entity embedding.

0:19:02 - 0:19:06     Text: But I think it's important to know that there's actually a wide variety of methods to train

0:19:06 - 0:19:08     Text: these pre-trained entity embeddings.

0:19:08 - 0:19:13     Text: And it's actually not clear which method is best for using them downstream in language

0:19:13 - 0:19:14     Text: models.

0:19:14 - 0:19:20     Text: So one of the key challenges of using pre-trained entity embeddings in language models

0:19:20 - 0:19:24     Text: is figuring out how to incorporate them when they're from a different embedding space than

0:19:24 - 0:19:26     Text: the language model.

0:19:26 - 0:19:29     Text: And so what we'll do, or the approaches we'll look at today, will learn a fusion layer

0:19:29 - 0:19:32     Text: to combine this context and entity information.

0:19:32 - 0:19:37     Text: So we have entity embeddings and we have the contextualized word embeddings from our language

0:19:37 - 0:19:39     Text: model.

0:19:39 - 0:19:45     Text: So if we take a sequence of text, and we imagine that j indicates the j-th element in the sequence,

0:19:45 - 0:19:48     Text: then the challenge here is you want to figure out how to combine some word embedding

0:19:48 - 0:19:52     Text: w_j with some aligned entity embedding e_k.

0:19:52 - 0:19:57     Text: So here an alignment could be like in the example where we had Washington was the first president,

0:19:57 - 0:20:01     Text: Washington would be your word embedding and George Washington would be the aligned entity

0:20:01 - 0:20:02     Text: embedding there.

0:20:02 - 0:20:08     Text: So you could imagine, in this case, let's say your w_j is Washington and your e_k is your

0:20:08 - 0:20:11     Text: entity embedding for George Washington, and you want to align them together.

0:20:11 - 0:20:18     Text: So what you can do is learn a weight matrix W_t for the text and W_e for the entity to project

0:20:18 - 0:20:22     Text: these embeddings to the same dimension before you sum them and finally take an activation

0:20:22 - 0:20:25     Text: function over them.
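
In symbols, the fusion just described is something like the following, where F is the activation function and b a bias term (the exact notation varies across the papers discussed below):

```latex
h_j = F\big(W_t\, w_j + W_e\, e_k + b\big)
```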

0:20:25 - 0:20:30     Text: So the idea is that by having some fusion layer mechanism like this, you can actually

0:20:30 - 0:20:34     Text: use these entity embeddings and these contextual word embeddings that are in different embedding

0:20:34 - 0:20:40     Text: spaces and fuse them together to have this single hidden representation for the element

0:20:40 - 0:20:44     Text: in the sequence.

0:20:44 - 0:20:48     Text: So the works we'll talk about today all have some mechanism, either very similar to this

0:20:48 - 0:20:55     Text: or some variation of this, to do this combination of context and entity information.

0:20:55 - 0:20:59     Text: So the first approach we're going to talk about is called ERNIE: Enhanced Language Representation

0:20:59 - 0:21:01     Text: with Informative Entities.

0:21:01 - 0:21:03     Text: And so this just builds on what we've already talked about.

0:21:03 - 0:21:09     Text: It uses pre-trained entity embeddings and it also uses this notion of a fusion layer.

0:21:09 - 0:21:14     Text: So the first block in Ernie is a text encoder, which is a multi-layer bi-directional transformer

0:21:14 - 0:21:20     Text: encoder. For the experiments they use BERT, but it doesn't have to be BERT.

0:21:20 - 0:21:25     Text: And this is followed by a knowledge encoder which has stacked blocks composed of two multi-headed

0:21:25 - 0:21:26     Text: attentions.

0:21:26 - 0:21:31     Text: One is over the entity embeddings and one is over your token or subword embeddings.

0:21:31 - 0:21:35     Text: And then the output of these contextualized entity and token embeddings from the multi-headed

0:21:35 - 0:21:40     Text: attentions are passed to a fusion layer, which looks very similar to what we just looked

0:21:40 - 0:21:42     Text: at.

0:21:42 - 0:21:47     Text: But now you also have new word and entity embeddings that you're producing as output of your fusion

0:21:47 - 0:21:48     Text: layer.

0:21:48 - 0:21:55     Text: So you see this w_j and this e_k, which are produced as the next layer of word and entity embeddings.

0:21:55 - 0:22:00     Text: So the i here indicates that it's the i-th block in the knowledge encoder.

0:22:00 - 0:22:04     Text: So you'll actually have multiple stacks of these knowledge encoders and you'll be doing

0:22:04 - 0:22:09     Text: a fusion of the word and entity embeddings, producing new word and entity embeddings and then passing

0:22:09 - 0:22:14     Text: this to the next block of the knowledge encoder.

0:22:14 - 0:22:17     Text: So this is what the architecture diagram looks like.

0:22:17 - 0:22:22     Text: On the left side we have the T encoder or the text encoder followed by the K encoder

0:22:22 - 0:22:24     Text: or the knowledge encoder.

0:22:24 - 0:22:27     Text: And then on the right side we have a zoomed-in version of the knowledge encoder.

0:22:27 - 0:22:31     Text: So you see the multi-headed attentions over the tokens in orange and then over the entities

0:22:31 - 0:22:36     Text: in yellow, and then you have this alignment between the words and entities with the dashed

0:22:36 - 0:22:37     Text: lines.

0:22:37 - 0:22:42     Text: So they have this example as Bob Dylan wrote blowing in the wind in 1962.

0:22:42 - 0:22:46     Text: The entities here are Bob Dylan and blowing in the wind and they have a simple alignment

0:22:46 - 0:22:51     Text: rule where you want to align the entity to the first word in the entity phrase.

0:22:51 - 0:22:53     Text: So you want to align Bob Dylan to Bob.

0:22:53 - 0:22:57     Text: That's what the dashed lines are trying to indicate, and you want to align blowing in the wind

0:22:57 - 0:22:59     Text: to blowing.

0:22:59 - 0:23:02     Text: So here this already assumes the entity linking has been done and you know your entities

0:23:02 - 0:23:03     Text: in advance.

0:23:03 - 0:23:08     Text: So you can see that the entities are actually input into the model.

0:23:08 - 0:23:12     Text: So after you have your words and the alignment, this goes through the information fusion layer

0:23:12 - 0:23:16     Text: in this light purple-gray color, and then finally it produces these new word and entity

0:23:16 - 0:23:18     Text: embeddings as output.

0:23:18 - 0:23:22     Text: And then remember that you have multiple blocks of these, so this is passed into the next

0:23:22 - 0:23:26     Text: block of your knowledge encoder.

0:23:26 - 0:23:29     Text: So how do you actually train this? It's pretty similar to BERT.

0:23:29 - 0:23:33     Text: You have a masked language model loss and you have a next sentence prediction loss.

0:23:33 - 0:23:39     Text: And they also introduce a knowledge pre-training task, which they refer to as the dEA task.

0:23:39 - 0:23:45     Text: It stands for denoising entity auto-encoder and is inspired by a denoising auto-encoder from an ICML paper in 2008.

0:23:45 - 0:23:49     Text: And the idea is they're going to randomly mask these token-entity alignments.

0:23:49 - 0:23:53     Text: So the idea that Bob goes to Bob Dylan, they're going to mask that out with some random

0:23:53 - 0:23:54     Text: percentage.

0:23:54 - 0:23:58     Text: And then they're going to predict the corresponding entity for a token out of the entities in

0:23:58 - 0:24:00     Text: the sequence.

0:24:00 - 0:24:02     Text: So this looks as follows.

0:24:02 - 0:24:05     Text: The summation is over the m entities in the sequence.

0:24:05 - 0:24:09     Text: So this would be over Bob Dylan and blowing in the wind in the previous example.

0:24:09 - 0:24:13     Text: And given a particular word, they want to figure out what entity it's most likely to

0:24:13 - 0:24:15     Text: align to in that sequence.

0:24:15 - 0:24:21     Text: So does Bob align to Bob Dylan or does Bob align to blowing in the wind?
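
In equation form, this is roughly a softmax over the m entities appearing in the sequence (a paraphrase of the dEA objective; see the ERNIE paper for the exact parameterization):

```latex
p(e_j \mid w_i) \;=\;
\frac{\exp\!\big(\operatorname{linear}(w_i) \cdot e_j\big)}
     {\sum_{k=1}^{m} \exp\!\big(\operatorname{linear}(w_i) \cdot e_k\big)}
```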

0:24:21 - 0:24:24     Text: And their motivation for doing this is that if you don't have this task, all you're ever

0:24:24 - 0:24:28     Text: going to be predicting is a token with the mass language model loss.

0:24:28 - 0:24:33     Text: And you really, to encode knowledge, should also probably be predicting over entities.

0:24:33 - 0:24:38     Text: So by adding this task, they have some kind of task that is actually predicting the entity.

0:24:38 - 0:24:43     Text: And they also suggest that this might better fuse the knowledge or the entity and the word

0:24:43 - 0:24:48     Text: representations than just using the fusion layer.

0:24:48 - 0:24:52     Text: And the final loss is then the summation of the masked language model loss, the next sentence

0:24:52 - 0:24:59     Text: prediction loss, and this dEA knowledge pre-training task loss.
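
So, compactly:

```latex
\mathcal{L}_{\text{ERNIE}} \;=\; \mathcal{L}_{\text{MLM}} \;+\; \mathcal{L}_{\text{NSP}} \;+\; \mathcal{L}_{\text{dEA}}
```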

0:24:59 - 0:25:03     Text: So they show in an ablation experiment that it's actually very important to have this

0:25:03 - 0:25:04     Text: knowledge pre-training task.

0:25:04 - 0:25:10     Text: So this has BERT as the leftmost bar and ERNIE as the second bar from the left.

0:25:10 - 0:25:12     Text: And so that's with all the features of Ernie.

0:25:12 - 0:25:15     Text: And then they try removing the pre-trained entity embeddings and removing this knowledge

0:25:15 - 0:25:17     Text: pre-training task.

0:25:17 - 0:25:19     Text: So you see that BERT performs the worst.

0:25:19 - 0:25:22     Text: This isn't very surprising. And ERNIE performs the best.

0:25:22 - 0:25:26     Text: But what's interesting is that if you remove the entity embeddings or you remove the

0:25:26 - 0:25:30     Text: pre-training task, they only do a little better than Bert.

0:25:30 - 0:25:35     Text: And so it's really necessary to actually use this pre-training task to get the most

0:25:35 - 0:25:41     Text: use out of your pre-trained entity embeddings.

0:25:41 - 0:25:44     Text: So some strengths of this work were that they introduced some way to combine entity and

0:25:44 - 0:25:49     Text: context information through this fusion layer and this knowledge pre-training task.

0:25:49 - 0:25:52     Text: And then they also show improved performance on downstream tasks, which we'll come back

0:25:52 - 0:25:55     Text: to when we talk about evaluation.

0:25:55 - 0:25:58     Text: But of course, there's also some limitations.

0:25:58 - 0:26:01     Text: So it needs text data with the entities annotated as input.

0:26:01 - 0:26:03     Text: And this is even true for downstream tasks.

0:26:03 - 0:26:08     Text: So if you remember on the architecture diagram, we had the entity information actually input

0:26:08 - 0:26:10     Text: into the architecture.

0:26:10 - 0:26:14     Text: But it's not very realistic that you're necessarily going to have a good entity linker for any

0:26:14 - 0:26:18     Text: downstream tasks that you want to use Ernie on.

0:26:18 - 0:26:21     Text: And the next challenge is this requires more pre-training of your language model.

0:26:21 - 0:26:24     Text: So now you don't just need to pre-train BERT, but you also need to pre-train your knowledge

0:26:24 - 0:26:27     Text: encoder on top.

0:26:27 - 0:26:30     Text: For the first challenge, we're going to actually talk about a work that presents a solution

0:26:30 - 0:26:31     Text: to address this.

0:26:31 - 0:26:35     Text: For the second challenge, I encourage you to check out the footnote on the bottom.

0:26:35 - 0:26:40     Text: This introduces a work that actually uses pre-trained entity embeddings, uses them in a language

0:26:40 - 0:26:42     Text: model, and doesn't require any more pre-training.

0:26:42 - 0:26:45     Text: So it's pretty cool.

0:26:45 - 0:26:50     Text: I guess that's all I have for Ernie, so I want to pause here for questions.

0:26:50 - 0:26:55     Text: Well, here's one that's up here.

0:26:55 - 0:27:01     Text: So on the fusion layer, is it observed that passing the entity embedding into a fusion layer

0:27:01 - 0:27:05     Text: to combine it with the word embedding is more powerful than just concatenating the entity

0:27:05 - 0:27:08     Text: embedding onto the end of the word embedding?

0:27:08 - 0:27:14     Text: Yeah, so I guess people are still a little bit confused as to the motivation of that fusion

0:27:14 - 0:27:15     Text: layer.

0:27:15 - 0:27:20     Text: And so I guess here the simple strategy would be, since you've got the entity

0:27:20 - 0:27:25     Text: linking, you could just concatenate entity embeddings onto the end of word embeddings and

0:27:25 - 0:27:30     Text: do regular BERT, and wouldn't that work just as well?

0:27:30 - 0:27:37     Text: I think the idea is it would not, because if you imagine that, let's say, your magnitudes

0:27:37 - 0:27:44     Text: are very different, you need some way to, I guess, align the spaces so that anything meaningful

0:27:44 - 0:27:47     Text: in the entity embedding space is still meaningful in the word embedding space.

0:27:47 - 0:27:50     Text: So if you're close in the word embedding space, you also would be, you'd want to be close

0:27:50 - 0:27:52     Text: in the entity embedding space.

0:27:52 - 0:27:55     Text: So I guess that's one argument.

0:27:55 - 0:27:56     Text: Yeah.

0:27:56 - 0:28:05     Text: I mean, I think the question isn't, you know, it's a good question as people say, I mean,

0:28:05 - 0:28:10     Text: it's not completely obvious that it wouldn't work to do that.

0:28:10 - 0:28:16     Text: It seems like one of the potential problems is some words have entity links to them and

0:28:16 - 0:28:18     Text: some words don't.

0:28:18 - 0:28:22     Text: And so then you'd sort of have zero vectors for the ones that don't have anything linked

0:28:22 - 0:28:23     Text: in that way.

0:28:23 - 0:28:25     Text: That might act a bit weirdly, but.

0:28:25 - 0:28:26     Text: Yeah.

0:28:26 - 0:28:32     Text: In this case, when they don't have entities linked, which is a great point.

0:28:32 - 0:28:37     Text: Yeah, the first equation just simplifies to the first term plus the bias.

0:28:37 - 0:28:40     Text: So like there's an obvious solution in that case when you're not concatenating that you

0:28:40 - 0:28:42     Text: just don't add on the term.

0:28:42 - 0:28:48     Text: Yeah, that could be one reason too.

0:28:48 - 0:28:51     Text: Okay.

0:28:51 - 0:28:55     Text: Are there any other questions?

0:28:55 - 0:29:02     Text: I think you can go on.

0:29:02 - 0:29:03     Text: Okay.

0:29:03 - 0:29:04     Text: Cool.

0:29:04 - 0:29:05     Text: Right.

0:29:05 - 0:29:11     Text: So now we're talking about KnowBERT.

0:29:11 - 0:29:15     Text: And this is from the same folks that introduced the ELMo work.

0:29:15 - 0:29:20     Text: And the idea here is that they're going to pre-train an integrated entity linker as an extension

0:29:20 - 0:29:23     Text: to BERT.

0:29:23 - 0:29:28     Text: And so their loss function will now be the summation of the next sentence prediction,

0:29:28 - 0:29:30     Text: the masked language model loss, and this entity linking loss.

0:29:30 - 0:29:34     Text: So instead of the knowledge pre-training dEA task from ERNIE, we'll have an entity linking

0:29:34 - 0:29:35     Text: loss.

0:29:35 - 0:29:41     Text: And the idea of the entity linker is you'll now have just a normal sequence as input.

0:29:41 - 0:29:46     Text: And the integrated entity linker will figure out what the mentions in the sentence are,

0:29:46 - 0:29:50     Text: what the candidates of those mentions are, and then

0:29:50 - 0:29:55     Text: what the scores of those candidates should be given the context of the sentence.
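
A toy sketch of this kind of candidate scoring (a simplified stand-in, not the exact KnowBERT architecture; the mean pooling, the prior term, and all shapes and numbers here are assumptions for illustration):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def score_candidates(span_vectors, candidate_embeddings, candidate_priors):
    """Toy scoring of entity candidates for one mention.

    span_vectors: contextual vectors for the tokens in the mention span, shape (span_len, d)
    candidate_embeddings: pre-trained entity embeddings for the candidates, shape (n_cand, d)
    candidate_priors: how likely each candidate is a priori, shape (n_cand,)
    """
    mention_vec = span_vectors.mean(axis=0)        # simple mean pooling of the span
    sims = candidate_embeddings @ mention_vec      # context-based similarity per candidate
    return softmax(sims + candidate_priors)        # combine with the prior and normalize

# Example: the mention "Washington" with two candidates (random placeholder numbers).
d = 8
span = np.random.rand(1, d)                        # contextual vector(s) for "Washington"
cands = np.random.rand(2, d)                       # [George Washington, Washington (state)]
priors = np.array([0.7, 0.3])
print(score_candidates(span, cands, priors))       # probabilities over the two candidates
```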

0:29:55 - 0:29:59     Text: And so this is all done now as part of the model rather than requiring it as some external

0:29:59 - 0:30:03     Text: pipeline stage before you could even use Ernie for instance.

0:30:03 - 0:30:07     Text: So now for downstream tasks, you no longer need these entity annotations.

0:30:07 - 0:30:11     Text: Your integrated entity linker will figure out what the correct entity is and be able to

0:30:11 - 0:30:14     Text: use the correct entity embedding.

0:30:14 - 0:30:17     Text: So there's also this idea that learning entity linking may actually better encode

0:30:17 - 0:30:22     Text: knowledge than this dEA pre-training task, because they show that KnowBERT actually outperforms

0:30:22 - 0:30:25     Text: ERNIE on downstream tasks.

0:30:25 - 0:30:29     Text: So one reason this may occur is that if you think about the dEA task, it's actually a bit

0:30:29 - 0:30:32     Text: simpler than just entity linking.

0:30:32 - 0:30:36     Text: So you're trying to predict for instance what Bob linked to out of Bob Dylan and blowing

0:30:36 - 0:30:37     Text: in the wind.

0:30:37 - 0:30:41     Text: And it's much easier, even as a human, to see that Bob will more likely link to Bob Dylan

0:30:41 - 0:30:46     Text: than that Bob will link to blowing in the wind.

0:30:46 - 0:30:49     Text: In the entity linking task, you actually have a much harder set of candidates to predict

0:30:49 - 0:30:50     Text: over.

0:30:50 - 0:30:52     Text: You're not just looking at the ones in the sentence.

0:30:52 - 0:30:57     Text: So does Washington link to George Washington or Washington state? That actually requires

0:30:57 - 0:30:59     Text: you to use more information about the entity.

0:30:59 - 0:31:04     Text: So given it's a harder task, it's not too surprising that it might perform better than

0:31:04 - 0:31:10     Text: just this easier knowledge pre-training task that Ernie introduced.

0:31:10 - 0:31:12     Text: So otherwise, KnowBERT has a lot of similarities to ERNIE.

0:31:12 - 0:31:17     Text: It uses a fusion layer that combines this context and entity information and it introduces

0:31:17 - 0:31:19     Text: some knowledge pre-training task.

0:31:19 - 0:31:22     Text: So I'd say the high-level takeaway is: if you want to use pre-trained entity embeddings

0:31:22 - 0:31:26     Text: in a language model, you'll probably at least want to consider both of these components

0:31:26 - 0:31:31     Text: in terms of how you're actually going to integrate the pre-trained entity embeddings and take

0:31:31 - 0:31:37     Text: as much advantage of the knowledge in them as possible.

0:31:37 - 0:31:43     Text: So that brings us to the next class of techniques, which is using an external memory.

0:31:43 - 0:31:46     Text: And here we'll mainly focus on this work called KGLM, and then we'll also briefly talk

0:31:46 - 0:31:49     Text: about kNN-LM.

0:31:49 - 0:31:53     Text: So the previous methods that we've talked about have relied on pre-trained entity embeddings

0:31:53 - 0:31:57     Text: to encode the factual knowledge from knowledge bases.

0:31:57 - 0:32:01     Text: And the one problem with this or one of the problems with this is if you want to, let's

0:32:01 - 0:32:02     Text: say, modify your knowledge base.

0:32:02 - 0:32:06     Text: You now need to retrain your entity embeddings and then retrain your language model on top

0:32:06 - 0:32:08     Text: of those entity embeddings.

0:32:08 - 0:32:13     Text: So this begs the question: are there more direct ways than pre-trained entity embeddings

0:32:13 - 0:32:17     Text: to provide the model with factual knowledge?

0:32:17 - 0:32:20     Text: And so what we're going to talk about is how you can actually use an external memory or

0:32:20 - 0:32:25     Text: a key value store to give the model access to either knowledge graph triples or context

0:32:25 - 0:32:26     Text: information.

0:32:26 - 0:32:30     Text: And a key thing about this external memory is that it's independent of the learned model

0:32:30 - 0:32:33     Text: parameters.

0:32:33 - 0:32:37     Text: So this means you can actually support injecting and updating factual knowledge.

0:32:37 - 0:32:40     Text: You can do this directly to this symbolic external memory by, let's say, changing the

0:32:40 - 0:32:44     Text: value for a particular key or maybe adding another key.

0:32:44 - 0:32:49     Text: And you don't have to retrain your model or retrain your entity embeddings when you make this

0:32:49 - 0:32:50     Text: change.

0:32:50 - 0:32:54     Text: And the approaches we'll talk about today can actually even support these updates to the

0:32:54 - 0:32:58     Text: external memory without more pre-training of the language model.

0:32:58 - 0:33:01     Text: So that's pretty neat.

0:33:01 - 0:33:04     Text: And then another benefit of using external memory over these pre-trained entity embedding

0:33:04 - 0:33:07     Text: approaches is they can also be more interpretable.

0:33:07 - 0:33:13     Text: So if you have a bug, or not a bug but an error, in your model where it's not predicting a correct

0:33:13 - 0:33:18     Text: fact, it's very challenging to figure out with pre-trained entity embeddings what the

0:33:18 - 0:33:19     Text: problem might be.

0:33:19 - 0:33:20     Text: Was it the original knowledge base?

0:33:20 - 0:33:22     Text: Was it the encoding in the entity embeddings?

0:33:22 - 0:33:25     Text: Is it how the language model is using the entity embeddings?

0:33:25 - 0:33:28     Text: And here you have a little more information with an external memory,

0:33:28 - 0:33:33     Text: in that you can look in the external memory and see whether the fact was in the external memory

0:33:33 - 0:33:35     Text: or was not in the external memory, and so on.

0:33:35 - 0:33:40     Text: So it adds a little bit more interpretability than just using these pre-trained entity embeddings

0:33:40 - 0:33:45     Text: as an indirect way to encode the knowledge base.

0:33:45 - 0:33:49     Text: So the first work we're going to talk about is called KGLM, and unlike the other approaches

0:33:49 - 0:33:55     Text: we've talked about so far, this actually uses LSTMs and not transformers.

0:33:55 - 0:34:00     Text: So the key idea here is to condition the language model on a knowledge graph.

0:34:00 - 0:34:04     Text: So recall with the standard language model, we want to predict the next word given the

0:34:04 - 0:34:06     Text: previous words in the sequence.

0:34:06 - 0:34:11     Text: Well, now we also want to predict the next entity given the previous words in the sequence

0:34:11 - 0:34:16     Text: and given the previous entities in the sentence or the entities that are relevant to the sentence

0:34:16 - 0:34:19     Text: I should say.

0:34:19 - 0:34:24     Text: So KGLM will be building a local knowledge graph as it iterates over the sequence.

0:34:24 - 0:34:28     Text: And a local knowledge graph is just a subset of a full knowledge graph that only has the

0:34:28 - 0:34:32     Text: entities that are actually relevant to the sequence.

0:34:32 - 0:34:37     Text: So if we have this example here, a simplified example from the paper, that Super Mario Land

0:34:37 - 0:34:39     Text: is a game developed by blank.

0:34:39 - 0:34:43     Text: And Super Mario Land here is an entity.

0:34:43 - 0:34:47     Text: You'd want a local knowledge graph as follows, where you see that Super Mario Land is in the

0:34:47 - 0:34:52     Text: local knowledge graph, but we also have the relations from Super Mario Land to other entities

0:34:52 - 0:34:56     Text: that are copied from the full knowledge graph into this local knowledge graph.

0:34:56 - 0:34:59     Text: And you would build up this local knowledge graph as you iterate over the sentence.

0:34:59 - 0:35:03     Text: So whenever you see an entity, you would add it to the local knowledge graph as well as

0:35:03 - 0:35:06     Text: its relations to other entities.
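
A rough sketch of that bookkeeping (the data structures and the extra Kyoto fact are just illustrative, not the KGLM implementation):

```python
# Full knowledge graph as a set of (parent entity, relation, tail entity) triples.
# The Kyoto fact is extra, just to show that edges are only copied into the local KG
# once their parent entity has actually been seen in the sequence.
full_kg = {
    ("Super Mario Land", "publisher", "Nintendo"),
    ("Super Mario Land", "genre", "platform game"),
    ("Nintendo", "headquarters", "Kyoto"),
}

local_kg = set()

def observe_entity(entity):
    """When an entity is mentioned in the sequence, copy its edges into the local KG."""
    for parent, relation, tail in full_kg:
        if parent == entity:
            local_kg.add((parent, relation, tail))

# While iterating over "Super Mario Land is a game developed by ___":
observe_entity("Super Mario Land")
print(local_kg)  # Nintendo now appears as a tail entity: a strong hint for the next word.
```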

0:35:06 - 0:35:11     Text: So obviously this is a much smaller example than what you would really have, with all the relations

0:35:11 - 0:35:14     Text: to Super Mario Land, just for the purpose of the example.

0:35:14 - 0:35:20     Text: But hopefully it's clear that all of these are relevant to the sequence.

0:35:20 - 0:35:23     Text: Something important to note here is that this does assume that the entities are known during

0:35:23 - 0:35:27     Text: training so that you do have this entity annotated data for training.

0:35:27 - 0:35:30     Text: And therefore your local knowledge graph is always the ground truth local knowledge graph

0:35:30 - 0:35:33     Text: as you iterate over the sequence.

0:35:33 - 0:35:36     Text: So why might this be a good idea to do this?

0:35:36 - 0:35:39     Text: Well here the next word you want to predict is Nintendo.

0:35:39 - 0:35:43     Text: And you may notice that Nintendo is in your local knowledge graph.

0:35:43 - 0:35:47     Text: So sometimes this local knowledge graph can actually serve as a very strong signal for

0:35:47 - 0:35:50     Text: what you want to predict for your next word.

0:35:50 - 0:35:55     Text: Now you may be thinking well this wouldn't always be helpful and that's true as well.

0:35:55 - 0:35:58     Text: So if you look at just like the third word in the sequence and you want to predict that

0:35:58 - 0:36:02     Text: word, so is a game for instance.

0:36:02 - 0:36:06     Text: Well if this isn't in the local knowledge graph this wouldn't be necessarily that helpful.

0:36:06 - 0:36:10     Text: You would just do a standard language model prediction.

0:36:10 - 0:36:14     Text: Or if you're at the beginning of the sequence, your local knowledge graph is empty so of course

0:36:14 - 0:36:16     Text: you're not going to get any signal from it.

0:36:16 - 0:36:21     Text: So the first question they ask in KGLM is how can a language model know when to use a

0:36:21 - 0:36:29     Text: local knowledge graph and when it might actually be useful for predicting the next word.

0:36:29 - 0:36:32     Text: So we're going to keep the same example as a running example and we have our local knowledge

0:36:32 - 0:36:34     Text: graph here.

0:36:34 - 0:36:37     Text: We now have an LSTM that looks similar to the representations you've seen throughout

0:36:37 - 0:36:38     Text: this class.

0:36:38 - 0:36:41     Text: And normally you've seen the LSTM predict the next word.

0:36:41 - 0:36:46     Text: Well now we're also going to use the LSTM to predict the next type of the word.

0:36:46 - 0:36:50     Text: So is the next word going to be a related entity, meaning it's in the local knowledge

0:36:50 - 0:36:51     Text: graph already?

0:36:51 - 0:36:56     Text: Is it going to be a new entity meaning it's not in the local knowledge graph or is it going

0:36:56 - 0:37:02     Text: to be not an entity in which case you just revert to a normal LSTM prediction.

0:37:02 - 0:37:05     Text: And they're going to use the LSTM hidden state to do this prediction of the type of the

0:37:05 - 0:37:11     Text: next word over these three different classes that they might want to consider.
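
One simple way to parameterize that three-way decision (a sketch; not necessarily KGLM's exact parameterization) is a linear layer over the hidden state followed by a softmax:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Project the LSTM hidden state to three logits, one per type of the next word.
TYPES = ["related entity", "new entity", "not an entity"]

hidden_dim = 16
W_type = np.random.rand(len(TYPES), hidden_dim)  # learned parameters in practice
h_t = np.random.rand(hidden_dim)                 # current LSTM hidden state

type_probs = softmax(W_type @ h_t)
print(dict(zip(TYPES, type_probs.round(3))))
```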

0:37:11 - 0:37:15     Text: So in the case of "Super Mario Land is a game developed by Nintendo":

0:37:15 - 0:37:19     Text: We saw that this would be a related entity case because you saw that Nintendo was in

0:37:19 - 0:37:21     Text: the local knowledge graph.

0:37:21 - 0:37:25     Text: For the other cases, Super Mario Land would be a new entity case, since the local

0:37:25 - 0:37:27     Text: knowledge graph is empty at that point.

0:37:27 - 0:37:33     Text: And then any of the words between super Mario Land and Nintendo would be not an entity.

0:37:33 - 0:37:40     Text: That's just a standard LSTM language model prediction that doesn't involve any entities.

0:37:40 - 0:37:43     Text: So now we need to talk about what the language model actually does in these three different

0:37:43 - 0:37:51     Text: scenarios to predict the next entity and the next word.

0:37:51 - 0:37:53     Text: So we're going to keep the example up at the top in case you want to refer back to it for the

0:37:53 - 0:37:55     Text: three different cases.

0:37:55 - 0:37:59     Text: And we're going to start with a related entity case.

0:37:59 - 0:38:04     Text: So here we assume that the next word or entity is actually in your local knowledge graph.

0:38:04 - 0:38:07     Text: And remember that we can describe a knowledge graph in terms of triples.

0:38:07 - 0:38:12     Text: So in terms of parent entities, relations, and tail entities.

0:38:12 - 0:38:15     Text: And in the case of predicting the next word as Nintendo.

0:38:15 - 0:38:19     Text: There's only one possible parent entity in the local knowledge graph, which is super Mario

0:38:19 - 0:38:21     Text: Land.

0:38:21 - 0:38:25     Text: And the goal is you want to figure out what is the most relevant triple that will be useful

0:38:25 - 0:38:28     Text: in helping to predict the next word.

0:38:28 - 0:38:32     Text: So in this case, you could have the triple super Mario Land publisher Nintendo.

0:38:32 - 0:38:36     Text: You might have the triple Super Mario Land, genre, platform game. Which of these is actually

0:38:36 - 0:38:40     Text: helpful in predicting that Nintendo should be the next word?

0:38:40 - 0:38:45     Text: So here what you would want KGLM to do is predict that the top scoring parent entity is Super

0:38:45 - 0:38:46     Text: Mario Land.

0:38:46 - 0:38:49     Text: And the top scoring relation is publisher.

0:38:49 - 0:38:53     Text: You can see there are actually contextual cues in the sentence that could help you figure

0:38:53 - 0:38:56     Text: out which triple you're talking about.

0:38:56 - 0:39:00     Text: And then given that your top scoring parent entity is super Mario Land and your top scoring

0:39:00 - 0:39:05     Text: relation is publisher, you can figure out that using knowledge graph triples, the tail

0:39:05 - 0:39:07     Text: entity has to be Nintendo.

0:39:07 - 0:39:15     Text: And therefore, this gives you a strong signal that the next word will be Nintendo.

0:39:15 - 0:39:19     Text: So the goal is you're going to find the top scoring parent entity and the top scoring relation

0:39:19 - 0:39:21     Text: using the nodes in your local knowledge graph.

0:39:21 - 0:39:25     Text: And you can do this by using the LSTM hidden state combined with pre-trained entity and

0:39:25 - 0:39:27     Text: relation embeddings.

0:39:27 - 0:39:31     Text: So I do admit I cheated here a little bit in that this does use pre-trained embeddings,

0:39:31 - 0:39:34     Text: but hopefully you'll see by the end of this discussion why I think it fits a bit better

0:39:34 - 0:39:39     Text: in this external memory use case as well.

0:39:39 - 0:39:42     Text: So what they're going to do is they're going to take a softmax using the LSTM hidden state

0:39:42 - 0:39:46     Text: and the entity embeddings for each of the potential parent entities and they'll take this

0:39:46 - 0:39:52     Text: top scoring one as a parent entity and they'll do the same thing for the relation embeddings.

0:39:52 - 0:39:56     Text: The next entity is then just this tail entity from the knowledge graph triple.

0:39:56 - 0:40:00     Text: So it's relatively trivial to figure out what the next entity should be once you've figured

0:40:00 - 0:40:04     Text: out the top scoring parent entity and your top scoring relation.
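
As a hedged sketch of this related-entity step, assuming simple dot-product scoring between the LSTM hidden state and the pre-trained embeddings (the actual model's parameterization may differ, and the toy graph below stands in for the local knowledge graph):

    import torch

    hidden_dim = 256
    h_t = torch.randn(hidden_dim)                       # LSTM hidden state

    # Pre-trained embeddings for candidate parents and relations (random here for illustration).
    parent_names = ["Super Mario Land"]
    parent_emb = torch.randn(len(parent_names), hidden_dim)
    relation_names = ["publisher", "genre"]
    relation_emb = torch.randn(len(relation_names), hidden_dim)

    # Softmax over dot-product scores picks the top scoring parent entity and relation.
    parent = parent_names[torch.softmax(parent_emb @ h_t, dim=-1).argmax().item()]
    relation = relation_names[torch.softmax(relation_emb @ h_t, dim=-1).argmax().item()]

    # The next entity is just the tail of the chosen triple in the local knowledge graph.
    local_kg = {("Super Mario Land", "publisher"): "Nintendo",
                ("Super Mario Land", "genre"): "platform game"}
    next_entity = local_kg[(parent, relation)]          # e.g. "Nintendo" if publisher is chosen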

0:40:04 - 0:40:09     Text: And then finally to predict the next word, they take the vocabulary and they expand it

0:40:09 - 0:40:14     Text: to include different aliases that could refer to that entity.

0:40:14 - 0:40:18     Text: So what I mean by aliases here are phrases that could refer to the entity in text.

0:40:18 - 0:40:23     Text: So you might not just call it Nintendo, you might also say something like Nintendo Company,

0:40:23 - 0:40:28     Text: and you want any of these to be possible words that you could predict as the next word.

0:40:28 - 0:40:33     Text: So the goal of this vocabulary expansion is to increase the probability that the next

0:40:33 - 0:40:40     Text: word you predict will actually be related to this next entity.
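
A tiny sketch of what that expansion could look like, with a made-up alias table standing in for the alias sets KGLM gets from its knowledge base:

    # Toy alias table: surface phrases that can refer to each entity (illustrative only).
    aliases = {"Nintendo": ["Nintendo", "Nintendo Company"],
               "Super Mario Land": ["Super Mario Land"]}

    def expand_vocabulary(base_vocab, next_entity):
        """Add every alias of the predicted next entity to the candidate next words."""
        return list(base_vocab) + aliases.get(next_entity, [])

    vocab = expand_vocabulary(["the", "a", "game", "by"], "Nintendo")
    # The model can now put probability mass on "Nintendo" or "Nintendo Company" as the next word.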

0:40:40 - 0:40:44     Text: So the new entity case is a bit simpler, this means that the entity that you're predicting

0:40:44 - 0:40:45     Text: is not in the local knowledge graph.

0:40:45 - 0:40:48     Text: So you're not getting any signal from this local knowledge graph that you've been building

0:40:48 - 0:40:49     Text: up.

0:40:49 - 0:40:54     Text: And all you want to do is find the top scoring entity in the full knowledge graph and

0:40:54 - 0:40:58     Text: you can do this using the LSTM hidden state and pre-trained entity embeddings, similar to how

0:40:58 - 0:41:02     Text: we found the score for the top parent entity.

0:41:02 - 0:41:06     Text: Your next entity will just be the top scoring entity out of the full knowledge graph.

0:41:06 - 0:41:10     Text: And then your next word is once again, this vocabulary expanded to include aliases of

0:41:10 - 0:41:13     Text: that entity.

0:41:13 - 0:41:15     Text: The not-an-entity case is the simplest.

0:41:15 - 0:41:17     Text: You just revert to a normal LSTM prediction.

0:41:17 - 0:41:20     Text: You don't have a next entity to predict.

0:41:20 - 0:41:27     Text: And your next word is just the most likely next token over your normal vocabulary.

0:41:27 - 0:41:31     Text: So here's a diagram from the paper that hopefully summarizes and makes even clearer what I just

0:41:31 - 0:41:33     Text: went over.

0:41:33 - 0:41:37     Text: So they have a longer example than the one we were looking at, but it's the same prediction

0:41:37 - 0:41:39     Text: as Nintendo's next word.

0:41:39 - 0:41:40     Text: And they have their predictions in red.

0:41:40 - 0:41:43     Text: So this is what they want KGLM to predict.

0:41:43 - 0:41:45     Text: The three different cases are in the horizontal.

0:41:45 - 0:41:50     Text: And we see that here, you're in the related entity case, since Nintendo is in your local

0:41:50 - 0:41:52     Text: knowledge graph.

0:41:52 - 0:41:57     Text: So they want KGLM to predict that Nintendo should be a related entity type of word, that

0:41:57 - 0:42:03     Text: Super Mario Land should be its parent entity, that publisher should be the relevant relation.

0:42:03 - 0:42:06     Text: And as a result, the next entity is Nintendo.

0:42:06 - 0:42:08     Text: And then they expand the vocabulary.

0:42:08 - 0:42:11     Text: You see that aliases of Nintendo at the bottom.

0:42:11 - 0:42:14     Text: And then finally, they actually predict Nintendo is the next word.

0:42:14 - 0:42:20     Text: And the other case is just summarized what we also already went over.

0:42:20 - 0:42:27     Text: So they find that KGLM actually outperforms GPT-2 and AWD-LSTM, which is a strong LSTM language

0:42:27 - 0:42:28     Text: model,

0:42:28 - 0:42:31     Text: on a fact completion task, similar to the fill in the blank examples that we looked at

0:42:31 - 0:42:38     Text: at the beginning of the talk. They also find qualitatively that, compared to GPT-2, KGLM

0:42:38 - 0:42:43     Text: tends to predict more specific tokens, since it can predict these tokens from just copying

0:42:43 - 0:42:44     Text: from the local knowledge graph.

0:42:44 - 0:42:47     Text: Whereas GPT-2 will tend to predict more generic tokens.

0:42:47 - 0:42:51     Text: So if you want to predict the birthplace of someone, GPT-2 is more likely to predict

0:42:51 - 0:42:57     Text: New York, for example, and KGLM might predict some obscure place.

0:42:57 - 0:43:00     Text: And then they have this really cool set of experiments where they show that KGLM actually

0:43:00 - 0:43:03     Text: supports modifying or updating facts.

0:43:03 - 0:43:07     Text: So they made a direct change in the knowledge graph, and then they saw what is the change

0:43:07 - 0:43:10     Text: in KGLM's predictions.

0:43:10 - 0:43:15     Text: So they have this example where the sequence was Barack Obama is born on blank.

0:43:15 - 0:43:19     Text: They had their knowledge graph triple as Barack Obama's original birth date, and then

0:43:19 - 0:43:24     Text: their most likely next tokens were as expected, August 4, 1961.

0:43:24 - 0:43:26     Text: And then they just changed their knowledge graph.

0:43:26 - 0:43:28     Text: So they changed the birth date of Obama.

0:43:28 - 0:43:30     Text: And they said, OK, he's now born 2013.

0:43:30 - 0:43:35     Text: And they looked to see what the next predictions were for KGLM, and it changed its predictions

0:43:35 - 0:43:38     Text: to match what was in the local knowledge graph.

0:43:38 - 0:43:42     Text: So this is something that's pretty cool, and that really only external memory approaches

0:43:42 - 0:43:46     Text: can do, compared to the original pre-trained entity embedding approach we talked

0:43:46 - 0:43:47     Text: about.

0:43:47 - 0:43:51     Text: And I think it's one of the reasons that KGLM, at least in my opinion, fits better in this

0:43:51 - 0:43:53     Text: external memory class of use cases.

0:43:53 - 0:43:58     Text: Right, so the next slide is a different paper.

0:43:58 - 0:44:01     Text: So I guess I'll take questions on KGLM.

0:44:01 - 0:44:04     Text: Is there any?

0:44:04 - 0:44:10     Text: It's a pretty complex method, so feel free to have questions.

0:44:10 - 0:44:15     Text: Yeah, could you one more time explain what the definition of the local knowledge graph

0:44:15 - 0:44:18     Text: is in relationship to the global knowledge graph?

0:44:18 - 0:44:20     Text: Yep.

0:44:20 - 0:44:24     Text: So local knowledge graph is supposed to be a subset of the full knowledge graph, and

0:44:24 - 0:44:28     Text: it's only supposed to consist of entities that have actually been seen in

0:44:28 - 0:44:35     Text: the sequence, as well as their related entities.

0:44:35 - 0:44:43     Text: OK, so here you see that Super Mario Land is in the local knowledge graph because Super Mario

0:44:43 - 0:44:45     Text: Land is an entity that is seen in the sequence.

0:44:45 - 0:44:50     Text: And then you also want to copy over all the edges from Super Mario Land that would be in

0:44:50 - 0:44:52     Text: the full knowledge graph.

0:44:52 - 0:44:55     Text: So this is just a subset of them for the purpose of the example, but you see that Super Mario

0:44:55 - 0:44:59     Text: Land has an edge to Nintendo, to Game Boy, and to platform game.

0:44:59 - 0:45:03     Text: And so you would copy all edges that Super Mario Land has to another node in the full knowledge

0:45:03 - 0:45:04     Text: graph.

0:45:04 - 0:45:09     Text: And they know in advance like they have the labels here for what the entities are during

0:45:09 - 0:45:10     Text: training.

0:45:10 - 0:45:14     Text: So that's how they can actually create this ground truth knowledge graph.

0:45:14 - 0:45:20     Text: And briefly, a student asked why we can't just use the whole knowledge graph and I gave

0:45:20 - 0:45:22     Text: an answer, but maybe you know better.

0:45:22 - 0:45:27     Text: Yeah, I think the idea is the signal will be much stronger if you just use local knowledge

0:45:27 - 0:45:28     Text: graph.

0:45:28 - 0:45:36     Text: So in the softmax for the related entity case, you would just be predicting over the potential

0:45:36 - 0:45:39     Text: parent entities in your local knowledge graph, which is a much smaller set than what's in

0:45:39 - 0:45:41     Text: your full knowledge graph.

0:45:41 - 0:45:44     Text: So I guess it's more likely that you're going to predict something that is correct in

0:45:44 - 0:45:45     Text: that case.

0:45:45 - 0:45:49     Text: Versus when you have like 5 million or so entities in your full knowledge graph. It's also

0:45:49 - 0:45:51     Text: much cheaper to compute.

0:45:51 - 0:45:54     Text: In this case, there's only a single parent entity, but you could have multiple parent entities

0:45:54 - 0:45:57     Text: that you're computing the softmax over to figure out which one is most likely.

0:45:57 - 0:46:00     Text: Is that what you were also thinking?

0:46:00 - 0:46:05     Text: Yeah, I mainly just said efficiency.

0:46:05 - 0:46:07     Text: So the signal thing is cool too.

0:46:07 - 0:46:14     Text: Here's an exciting question: what about queries that require more than one step in the

0:46:14 - 0:46:21     Text: knowledge graph, such as the location of the publisher of Super Mario Land?

0:46:21 - 0:46:25     Text: Yeah, that's a good question.

0:46:25 - 0:46:27     Text: So the idea is, can it support those types of queries?

0:46:27 - 0:46:32     Text: Does it support multi-hop kind of building of the knowledge graph?

0:46:32 - 0:46:36     Text: Yeah, yeah, like how does KGLM perform in those cases?

0:46:36 - 0:46:37     Text: Yeah, I don't know.

0:46:37 - 0:46:39     Text: That's a very good question.

0:46:39 - 0:46:43     Text: They build up the knowledge graph, so that is just single hop as far as I know.

0:46:43 - 0:46:47     Text: But like if you saw the other entities, if you were to see the entities along the hops,

0:46:47 - 0:46:49     Text: it would have them in the local knowledge graph.

0:46:49 - 0:46:51     Text: Yeah, that's a good question.

0:46:51 - 0:46:54     Text: I don't know if they explored that.

0:46:54 - 0:46:55     Text: Great.

0:46:55 - 0:46:56     Text: Okay.

0:46:56 - 0:47:05     Text: Let's move along then.

0:47:05 - 0:47:06     Text: Okay.

0:47:06 - 0:47:14     Text: So the next piece of work we're going to talk about, you guys have actually briefly seen

0:47:14 - 0:47:20     Text: in the natural language generation lecture, but I'm going to go over it again quickly here.

0:47:20 - 0:47:23     Text: So unlike the other works we talked about that use knowledge graph triples, this

0:47:23 - 0:47:27     Text: is actually going to take kind of a looser notion of knowledge, in that the knowledge will

0:47:27 - 0:47:30     Text: just be encoded in the text of the training data set.

0:47:30 - 0:47:36     Text: So this is called kNN-LM, and it builds on the idea that language models

0:47:36 - 0:47:40     Text: not only learn to predict the next word in text, but they also learn these representations

0:47:40 - 0:47:41     Text: of text.

0:47:41 - 0:47:46     Text: And the authors suggest that it might actually be easier to learn similarities between

0:47:46 - 0:47:50     Text: text sequences than it is to predict the next word in the text.

0:47:50 - 0:47:55     Text: So you have this example that Dickens is the author of blank and Dickens wrote blank.

0:47:55 - 0:48:00     Text: And they argue that it's easier to tell, for a human but also for a model, that these sequences

0:48:00 - 0:48:04     Text: are similar and they should probably have the same next word, even if you don't know

0:48:04 - 0:48:06     Text: what the next word is.

0:48:06 - 0:48:10     Text: So that's suggesting that it's easier to learn these similarities than to actually predict

0:48:10 - 0:48:12     Text: the next word.

0:48:12 - 0:48:15     Text: And they argue that this is even more true for long-tail patterns, where it's much more challenging

0:48:15 - 0:48:21     Text: for the model to predict that the next word is some rarely seen token or rare entity than

0:48:21 - 0:48:25     Text: it is to find another similar sequence that it's already seen and just copy the next word

0:48:25 - 0:48:28     Text: from that sequence.

0:48:28 - 0:48:32     Text: So what they propose to do is store representations of text sequences in a nearest neighbor data

0:48:32 - 0:48:34     Text: store.

0:48:34 - 0:48:37     Text: And then at inference, what you'll want to do is you find the K most similar sequences

0:48:37 - 0:48:38     Text: of text.

0:48:38 - 0:48:42     Text: You then retrieve their corresponding values, so you just peek at those sequences and

0:48:42 - 0:48:45     Text: see what were their next words.

0:48:45 - 0:48:50     Text: And then you combine the probability from this nearest neighbor data store with just a typical

0:48:50 - 0:48:52     Text: language model prediction.

0:48:52 - 0:48:56     Text: And so they call this an interpolation step, in that they're weighing how much to pay attention

0:48:56 - 0:49:00     Text: to the probability from this kNN approach and how much to pay attention to this language

0:49:00 - 0:49:02     Text: model approach.

0:49:02 - 0:49:08     Text: And the lambda here is just a hyperparameter they tune.

0:49:08 - 0:49:11     Text: So they have this diagram from their paper where they want to predict the next word in

0:49:11 - 0:49:13     Text: the sequence, Shakespeare's play blank.

0:49:13 - 0:49:17     Text: And so what they do is they have all the training context already encoded in their data

0:49:17 - 0:49:18     Text: store.

0:49:18 - 0:49:21     Text: So they have representations of all the training context.

0:49:21 - 0:49:24     Text: And then they compute a representation of their test context.

0:49:24 - 0:49:28     Text: And they want to figure out which representations of the training contexts are most similar

0:49:28 - 0:49:32     Text: to this test context representation.

0:49:32 - 0:49:37     Text: And so here in external memory view of things, the keys would be the representations of

0:49:37 - 0:49:42     Text: the training context and the values would be the next words.

0:49:42 - 0:49:46     Text: So they get the K nearest training representations.

0:49:46 - 0:49:48     Text: They then copy over their values.

0:49:48 - 0:49:51     Text: So that's what you see with this Macbeth Hamlet Macbeth example.

0:49:51 - 0:49:55     Text: They have a normalization step where they convert this to probability space.

0:49:55 - 0:49:58     Text: And then finally, they have an aggregation step.

0:49:58 - 0:50:03     Text: So if a word is seen as the next word and several of these K nearest neighbors, then they

0:50:03 - 0:50:04     Text: want to count more for that.

0:50:04 - 0:50:05     Text: So that's why they aggregate.

0:50:05 - 0:50:06     Text: So they see Macbeth twice.

0:50:06 - 0:50:10     Text: It means Macbeth is more likely.

0:50:10 - 0:50:15     Text: And then finally, they have this interpolation step where they try to balance between the classification

0:50:15 - 0:50:20     Text: probabilities from the language model and from the kNN approach.

0:50:20 - 0:50:25     Text: So some immediate observation you might have is this seems really expensive.

0:50:25 - 0:50:30     Text: They do propose ways to kind of try to minimize the expense of actually having to store all

0:50:30 - 0:50:35     Text: the training contexts in this data store, because they actually store an entry for every single

0:50:35 - 0:50:38     Text: next-word position in the training contexts.

0:50:38 - 0:50:42     Text: And you can do quantization and approximate nearest neighbor approaches to try to make this less

0:50:42 - 0:50:44     Text: expensive.

0:50:44 - 0:50:47     Text: But I imagine this would still be pretty expensive for really large training data sets.

0:50:47 - 0:50:53     Text: They also have some cool experiments that show that this is very good for domain adaptation.

0:50:53 - 0:50:56     Text: So if you take your language model and you have a new domain that you want to apply your

0:50:56 - 0:51:02     Text: language model to, you could just create a nearest neighbor data store of your new domain.

0:51:02 - 0:51:07     Text: So you encode all the representations of that new domain and stick them in a data store.

0:51:07 - 0:51:12     Text: And then you can just use your language model with these kNN probabilities

0:51:12 - 0:51:18     Text: immediately on this new domain, without actually having to further train your language

0:51:18 - 0:51:19     Text: model.

0:51:19 - 0:51:23     Text: So I thought that was a pretty cool use case of this external memory approach.

0:51:23 - 0:51:27     Text: So while it doesn't leverage knowledge bases directly, it does have this looser idea

0:51:27 - 0:51:32     Text: of encoding knowledge that is in textual form into

0:51:32 - 0:51:37     Text: some external memory that the model can then take advantage of.

0:51:37 - 0:51:41     Text: That's all I have for this approach.

0:51:41 - 0:51:43     Text: Are there any questions on this approach?

0:51:43 - 0:51:55     Text: Well, so one person is asking, how does the kNN make predictions for the next

0:51:55 - 0:51:56     Text: word?

0:51:56 - 0:52:00     Text: The k neighbors are for the context instead of the next word.

0:52:00 - 0:52:01     Text: Oh, okay.

0:52:01 - 0:52:02     Text: That was unclear.

0:52:02 - 0:52:07     Text: So the keys are the representations of the context, and the values in your external memory are

0:52:07 - 0:52:09     Text: the next words.

0:52:09 - 0:52:12     Text: So you figure out your nearest neighbors using your keys and then you

0:52:12 - 0:52:14     Text: copy over their values.

0:52:14 - 0:52:19     Text: So it does actually know what the next words are for each of those representations.

0:52:19 - 0:52:23     Text: Okay.

0:52:23 - 0:52:29     Text: So finally, we're going to talk about how you can just modify the training data to better

0:52:29 - 0:52:32     Text: encode knowledge in language models.

0:52:32 - 0:52:36     Text: So the approaches we've talked about so far are actually incorporating knowledge explicitly

0:52:36 - 0:52:40     Text: by using the pre-trained embeddings or an external memory.

0:52:40 - 0:52:44     Text: We also want to talk about how can you just incorporate knowledge implicitly through the

0:52:44 - 0:52:48     Text: unstructured text.

0:52:48 - 0:52:51     Text: So what we're going to do is either mask or crop the data to introduce additional training

0:52:51 - 0:52:58     Text: tasks that require factual knowledge to figure out what data was masked, for instance.

0:52:58 - 0:52:59     Text: So some clear advantages.

0:52:59 - 0:53:02     Text: It doesn't have additional memory or computation requirements.

0:53:02 - 0:53:04     Text: You don't have a data store to deal with.

0:53:04 - 0:53:07     Text: You don't have extra knowledge encoder layers to train.

0:53:07 - 0:53:09     Text: All you do is modify the training data.

0:53:09 - 0:53:11     Text: And you don't have to modify your architecture either.

0:53:11 - 0:53:15     Text: So you can continue using your favorite bird model and just make these changes to the

0:53:15 - 0:53:18     Text: training data.

0:53:18 - 0:53:22     Text: So the first work we're going to look at is called WKLM, Weakly Supervised Knowledge-

0:53:22 - 0:53:25     Text: Pretrained Language Model.

0:53:25 - 0:53:31     Text: And the key idea here is to train the model to distinguish between true and false knowledge.

0:53:31 - 0:53:34     Text: So they're going to corrupt the data by replacing mentions in the text with mentions that

0:53:34 - 0:53:39     Text: refer to different entities of the same type to create what they refer to as negative

0:53:39 - 0:53:41     Text: knowledge statements.

0:53:41 - 0:53:47     Text: And then the model will just predict has the entity been replaced or corrupted.

0:53:47 - 0:53:52     Text: This type constraint is necessary to encourage the model to actually

0:53:52 - 0:53:55     Text: use factual knowledge to figure out if this corruption is taking place.

0:53:55 - 0:53:58     Text: So you could imagine if you replace it with something that's not realistic at all, the

0:53:58 - 0:54:04     Text: model could just be basing its prediction on whether the sentence is linguistically coherent.

0:54:04 - 0:54:09     Text: So as an example, we have a true knowledge statement: J.K. Rowling is the author of

0:54:09 - 0:54:11     Text: Harry Potter.

0:54:11 - 0:54:15     Text: And then we want to modify this to replace it with another author.

0:54:15 - 0:54:19     Text: So let's say we change this to J.R.R. Tolkien is the author of Harry Potter.

0:54:19 - 0:54:24     Text: So you can see that this requires some amount of background knowledge to actually be

0:54:24 - 0:54:27     Text: able to figure out which statement is true and which statement is false.

0:54:27 - 0:54:31     Text: And the idea is that the model will be able to predict, for each of these mentions, whether

0:54:31 - 0:54:36     Text: it's a true or false mention.

0:54:36 - 0:54:40     Text: So this diagram here is from the paper and hopefully explains this a bit better.

0:54:40 - 0:54:43     Text: They have their original article on the left and then they have their replaced article

0:54:43 - 0:54:47     Text: with the corruptions on the right and the entities are in blue.

0:54:47 - 0:54:52     Text: So what they do is for a given entity, they first look up its type, they find other entities

0:54:52 - 0:54:53     Text: of that type.

0:54:53 - 0:54:59     Text: And they randomly sample the entity and get an alias of it to replace in the text.

0:54:59 - 0:55:03     Text: So they're going to replace Stan Lee, for instance, with Brian Johnson, and Marvel Comics

0:55:03 - 0:55:08     Text: with DC Comics, and the replacements are in red on the right.

0:55:08 - 0:55:12     Text: And then the idea is that the model will be able to predict for each of these mentions

0:55:12 - 0:55:14     Text: was it replaced or not.

0:55:14 - 0:55:19     Text: So in the case of Brian Johnson, they have the red X for this is a false mention and in

0:55:19 - 0:55:22     Text: the case of the true mentions, they have the check mark.
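
As a hedged sketch of that corruption step, with toy type and alias tables standing in for the entity information WKLM actually pulls from a real knowledge base:

    import random

    # Toy stand-ins for a real knowledge base's type and alias information.
    entity_type = {"Stan Lee": "person", "Brian Johnson": "person",
                   "Marvel Comics": "organization", "DC Comics": "organization"}
    aliases = {"Brian Johnson": ["Brian Johnson"], "DC Comics": ["DC Comics", "DC"]}

    def corrupt_mention(mention, text):
        """Replace a mention with a randomly sampled entity of the same type (a false mention)."""
        same_type = [e for e, t in entity_type.items()
                     if t == entity_type[mention] and e != mention]
        sampled = random.choice(same_type)
        replacement = random.choice(aliases.get(sampled, [sampled]))
        return text.replace(mention, replacement), 0    # label 0 = replaced / false mention

    corrupted, label = corrupt_mention("Marvel Comics", "Stan Lee worked for Marvel Comics.")
    # e.g. -> ("Stan Lee worked for DC Comics.", 0)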

0:55:22 - 0:55:27     Text: So this is a pretty simple approach, but they actually show that it can help the model

0:55:27 - 0:55:32     Text: increase the amount of knowledge that's encoded in parameters.

0:55:32 - 0:55:40     Text: So WKLM uses an entity replacement loss to train the model to distinguish between

0:55:40 - 0:55:42     Text: these true and false mentions.

0:55:42 - 0:55:46     Text: And this just looks like a binary classification loss where your true mentions are on the

0:55:46 - 0:55:49     Text: left and your false mentions are on the right.

0:55:49 - 0:55:54     Text: And you want to increase the probability that this P of E given C, so the probability

0:55:54 - 0:55:58     Text: of entity given the context, you want to increase that for the true mentions and decrease

0:55:58 - 0:56:01     Text: it for the false mentions.

0:56:01 - 0:56:05     Text: The total loss is then just a combination of the masked language model loss and this entity

0:56:05 - 0:56:08     Text: replacement loss.

0:56:08 - 0:56:14     Text: The masked language model loss is defined at the token level, and the entity replacement

0:56:14 - 0:56:19     Text: loss is defined at the entity level, meaning it's not just over sub words, it's even potentially

0:56:19 - 0:56:25     Text: over multiple words if you have multi-word entity phrases, for instance.
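
In a hedged PyTorch-style sketch (the function and argument names are made up, and the equal weighting of the two terms is an assumption), the total objective is just the sum of the token-level masked LM loss and the entity-level binary replacement loss:

    import torch.nn.functional as F

    def wklm_loss(mlm_logits, mlm_targets, entity_logits, replaced_labels):
        """Masked-LM loss over masked tokens plus binary true-vs-replaced loss over mentions."""
        mlm_loss = F.cross_entropy(mlm_logits, mlm_targets)              # token level
        replacement_loss = F.binary_cross_entropy_with_logits(
            entity_logits, replaced_labels.float())                      # entity level
        return mlm_loss + replacement_loss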

0:56:25 - 0:56:30     Text: And this is an important theme that we really see occurring throughout these works that

0:56:30 - 0:56:35     Text: we'll look at in that modifying the data at the entity level seems to be an important

0:56:35 - 0:56:39     Text: component of actually increasing the amount of knowledge that a language model can encode.

0:56:39 - 0:56:47     Text: So they find that WKLM improves over BERT and GPT-2 on fact completion tasks like the

0:56:47 - 0:56:50     Text: fill in the blank statements that we looked at at the beginning.

0:56:50 - 0:56:54     Text: They also find that it improves over the Ernie paper that we talked about on a downstream

0:56:54 - 0:56:55     Text: task.

0:56:55 - 0:56:59     Text: And they had a set of ablation experiments where they looked at, can you just remove

0:56:59 - 0:57:02     Text: this masked language model loss?

0:57:02 - 0:57:08     Text: And if you just train BERT for longer, do you really need this entity replacement loss?

0:57:08 - 0:57:11     Text: So the second row of the table here is looking at what happens if we remove the

0:57:11 - 0:57:14     Text: masked language model loss.

0:57:14 - 0:57:17     Text: We see that it performs much worse without the masked language model loss.

0:57:17 - 0:57:18     Text: So you really need both losses.

0:57:18 - 0:57:24     Text: The intuition there was that the masked language model loss helps to encode just general language

0:57:24 - 0:57:26     Text: understanding.

0:57:26 - 0:57:31     Text: And then training BERT for longer performs much worse than using this entity replacement

0:57:31 - 0:57:32     Text: loss.

0:57:32 - 0:57:35     Text: So this motivates even further that you really do need it.

0:57:35 - 0:57:39     Text: The entity replacement loss is actually really helping encode more knowledge in these

0:57:39 - 0:57:43     Text: language models.

0:57:43 - 0:57:46     Text: So in addition to corrupting the data, we're also going to look at, can we just mask

0:57:46 - 0:57:47     Text: the data differently?

0:57:47 - 0:57:50     Text: Can we be more clever about how we do the masking?

0:57:50 - 0:57:53     Text: And this is a thread that runs through several recent works.

0:57:53 - 0:57:55     Text: So there's actually another paper called Ernie.

0:57:55 - 0:57:57     Text: So this is different than the one we talked about before.

0:57:57 - 0:58:01     Text: And this is enhanced representation through knowledge integration.

0:58:01 - 0:58:06     Text: And what they do is show improvements on downstream Chinese NLP tasks by doing phrase-level

0:58:06 - 0:58:08     Text: and entity-level masking.

0:58:08 - 0:58:13     Text: So instead of just masking out sub words, they're going to mask out phrases of multiple words

0:58:13 - 0:58:18     Text: and entities, where an entity is a phrase which corresponds to some entity mention in the text that

0:58:18 - 0:58:23     Text: they might find with NER techniques, for example.

0:58:23 - 0:58:27     Text: And then the second work is actually something you heard about in the last lecture, which

0:58:27 - 0:58:32     Text: is the idea of using salient span masking to mask out salient spans.

0:58:32 - 0:58:34     Text: And a salient span is just a named entity or a date.

0:58:34 - 0:58:38     Text: So you can see this is pretty similar to what Ernie is doing.

0:58:38 - 0:58:43     Text: And they found that using salient span masking actually significantly helped T5 performance

0:58:43 - 0:58:48     Text: on these closed domain question answering tasks.
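
A toy sketch of entity-level (salient span) masking, where a hard-coded list of spans stands in for the named entity and date tagger these papers actually rely on:

    def mask_salient_spans(tokens, spans, mask_token="[MASK]"):
        """Replace every token inside a salient span (a named entity or a date) with [MASK]."""
        masked = list(tokens)
        for start, end in spans:                  # spans are (start, end) token indices
            for i in range(start, end):
                masked[i] = mask_token
        return masked

    tokens = ["J.K.", "Rowling", "wrote", "Harry", "Potter", "in", "1997", "."]
    spans = [(0, 2), (3, 5), (6, 7)]              # "J.K. Rowling", "Harry Potter", "1997"
    print(mask_salient_spans(tokens, spans))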

0:58:48 - 0:58:52     Text: So just to make sure we're all on the same page with the different masking techniques, this

0:58:52 - 0:58:56     Text: diagram from the ERNIE paper is comparing what BERT does versus what ERNIE does.

0:58:56 - 0:59:00     Text: The top shows that BERT masks out the sub

0:59:00 - 0:59:06     Text: word tokens, whereas ERNIE masks out phrases like "a series of" as well as entities like J.K.

0:59:06 - 0:59:08     Text: Rowling.

0:59:08 - 0:59:15     Text: There's some interesting results on showing that salient span masking is helping encode

0:59:15 - 0:59:18     Text: more knowledge in these representations.

0:59:18 - 0:59:22     Text: So on the left, we're looking at the results of the original paper that proposed salient

0:59:22 - 0:59:23     Text: span masking.

0:59:23 - 0:59:27     Text: So this is the REALM work.

0:59:27 - 0:59:30     Text: And the idea here was that they were training a knowledge retriever.

0:59:30 - 0:59:35     Text: So it's actually more of an external memory class of techniques, but they find that by using

0:59:35 - 0:59:41     Text: the salient span masking technique, they could actually train a much better knowledge retriever.

0:59:41 - 0:59:45     Text: So it's a good example of how these techniques are really complementary.

0:59:45 - 0:59:49     Text: So while I presented three classes of techniques, you can definitely get benefits by doing multiple

0:59:49 - 0:59:52     Text: techniques together.

0:59:52 - 0:59:56     Text: And they found that, compared to using the masking from BERT, which

0:59:56 - 1:00:01     Text: would be random uniform masking, or doing random masking of spans from a paper called

1:00:01 - 1:00:06     Text: SpanBERT, it performs much better to do salient span masking.

1:00:06 - 1:00:13     Text: So you see a 38 exact match score versus like a 32 exact match score, for instance.

1:00:13 - 1:00:20     Text: And on the right, we have results from fine tuning T5 with either salient span masking or

1:00:20 - 1:00:23     Text: the span corruption task that you saw on assignment five.

1:00:23 - 1:00:27     Text: And you can see that on these different QA data sets, salient span masking is significantly

1:00:27 - 1:00:31     Text: better than just using the span corruption technique.

1:00:31 - 1:00:36     Text: So this really suggests that doing salient span masking, and masking out the salient

1:00:36 - 1:00:46     Text: spans of these entities is in fact helping to encode more knowledge in these language models.

1:00:46 - 1:00:49     Text: So to recap, we talked about three different classes of techniques to add knowledge to

1:00:49 - 1:00:51     Text: language models.

1:00:51 - 1:00:54     Text: We talked about using pre trained entity embeddings.

1:00:54 - 1:00:58     Text: These weren't too difficult to apply to existing architectures and are a way to leverage this

1:00:58 - 1:01:01     Text: knowledge graph pretraining.

1:01:01 - 1:01:06     Text: But it's a rather indirect way of incorporating knowledge, and it can be hard to interpret.

1:01:06 - 1:01:10     Text: We also talked about approaches to add an external memory.

1:01:10 - 1:01:12     Text: This could support modifying the knowledge base.

1:01:12 - 1:01:15     Text: It was also easier to interpret.

1:01:15 - 1:01:20     Text: But they tended to be more complex in implementation, like we saw with KGLM, and they also required more

1:01:20 - 1:01:24     Text: memory, like we saw with the kNN-LM approach.

1:01:24 - 1:01:28     Text: And then finally we talked about modifying the training data.

1:01:28 - 1:01:31     Text: So this requires no model changes or additional computation.

1:01:31 - 1:01:34     Text: It also might be the easiest to theoretically analyze.

1:01:34 - 1:01:37     Text: So it's actually an active area of research right now.

1:01:37 - 1:01:42     Text: But it's still an open question whether modifying the training data is always as effective as model changes

1:01:42 - 1:01:46     Text: and what the trade-offs are in terms of how much data is required versus doing one of these

1:01:46 - 1:01:52     Text: other knowledge enhancement approaches.

1:01:52 - 1:01:55     Text: So that leads us to section three.

1:01:55 - 1:01:59     Text: So I guess I'll pause again for questions.

1:01:59 - 1:02:05     Text: I think we may be good.

1:02:05 - 1:02:06     Text: Awesome.

1:02:06 - 1:02:07     Text: Okay.

1:02:07 - 1:02:10     Text: So section three is about how researchers are actually going about evaluating the knowledge

1:02:10 - 1:02:12     Text: and language models.

1:02:12 - 1:02:17     Text: And I guess how some of the techniques we actually just talked about stand up in this evaluation.

1:02:17 - 1:02:22     Text: So first we're going to talk about probes which don't require any fine tuning of the language

1:02:22 - 1:02:23     Text: model.

1:02:23 - 1:02:27     Text: And then we're going to talk about downstream tasks which look at how well do these pre-trained

1:02:27 - 1:02:32     Text: representations actually transfer their knowledge to other tasks.

1:02:32 - 1:02:35     Text: So one of the initial works in this area was called LAMA.

1:02:35 - 1:02:40     Text: And this really started a series of works to look into how much knowledge is already

1:02:40 - 1:02:43     Text: encoded in these language models.

1:02:43 - 1:02:48     Text: So their question was how much relational, common sense and factual knowledge is in off

1:02:48 - 1:02:49     Text: the shelf language models.

1:02:49 - 1:02:54     Text: So this is just taking pre-trained language models and evaluating the knowledge in them.

1:02:54 - 1:02:57     Text: And this is without any additional training or fine tuning.

1:02:57 - 1:03:01     Text: So they mainly constructed a set of what they refer to as cloze statements.

1:03:01 - 1:03:04     Text: And these are just the fill in the blank statements that we actually drew from at the beginning

1:03:04 - 1:03:05     Text: of the talk.

1:03:05 - 1:03:10     Text: And we'll have some more examples here.

1:03:10 - 1:03:14     Text: And they mainly created these templates of cloze statements using knowledge graph triples

1:03:14 - 1:03:19     Text: and question answering pairs from existing data sets.
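
As a hedged illustration of turning a knowledge graph triple into a cloze statement (the template strings here are invented; LAMA's actual templates were written per relation by the authors):

    # One template per relation; the object slot becomes the blank the model must fill.
    templates = {
        "place_of_birth": "{subject} was born in [MASK].",
        "author_of": "{subject} is the author of [MASK].",
    }

    def triple_to_cloze(subject, relation, obj):
        """Turn a (subject, relation, object) triple into a cloze query plus its gold answer."""
        return templates[relation].format(subject=subject), obj

    query, answer = triple_to_cloze("Barack Obama", "place_of_birth", "Honolulu")
    # query  -> "Barack Obama was born in [MASK]."
    # answer -> "Honolulu"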

1:03:19 - 1:03:23     Text: They wanted to compare pre-trained language models to supervised relation extraction

1:03:23 - 1:03:27     Text: and question answering systems to see how do these language models that were trained

1:03:27 - 1:03:33     Text: in unsupervised fashion compared to these baseline systems that are not only supervised

1:03:33 - 1:03:37     Text: but really targeted for this task of knowledge extraction.

1:03:37 - 1:03:41     Text: And their goal was to evaluate the knowledge in existing pre-trained language models.

1:03:41 - 1:03:46     Text: And a key point about this is like they're just using the language models as they are available

1:03:46 - 1:03:47     Text: to researchers.

1:03:47 - 1:03:51     Text: So this means there could be differences in the pre-trained corpora, for example.

1:03:51 - 1:03:54     Text: So when you look at the following table and you're comparing language models, also keep in

1:03:54 - 1:04:00     Text: mind that these don't account for the differences in the pre-trained corpora.

1:04:00 - 1:04:05     Text: So a lot of these language models probably look familiar to you either from previous lectures

1:04:05 - 1:04:07     Text: or maybe your final projects.

1:04:07 - 1:04:12     Text: And what we see is that overall, the BERT-base and BERT-large pre-trained models are performing

1:04:12 - 1:04:16     Text: much better than the other language models here.

1:04:16 - 1:04:20     Text: Ah, I forgot to mention what mean precision at one is.

1:04:20 - 1:04:21     Text: This is a pretty simple metric.

1:04:21 - 1:04:26     Text: The idea is, if you look at the blank and you look at the model's top

1:04:26 - 1:04:28     Text: prediction for the blank, is it correct or not?

1:04:28 - 1:04:32     Text: That's what precision at one means. Precision at 10 would be, let's look at the top 10

1:04:32 - 1:04:37     Text: predictions, is the correct prediction in the top 10?
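
A minimal sketch of mean precision at k for these probes, assuming each example comes with the model's ranked predictions and a single gold answer:

    def mean_precision_at_k(ranked_predictions, gold_answers, k=1):
        """Fraction of examples whose gold answer appears among the model's top-k predictions."""
        hits = sum(gold in preds[:k]
                   for preds, gold in zip(ranked_predictions, gold_answers))
        return hits / len(gold_answers)

    # precision at 1: is the single top prediction correct?
    # precision at 10: is the correct answer anywhere in the top 10?
    score = mean_precision_at_k([["paris", "lyon"], ["rome", "milan"]],
                                ["paris", "milan"], k=1)   # -> 0.5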

1:04:37 - 1:04:42     Text: So in addition to BERT-large and BERT-base performing well overall, we do see that

1:04:42 - 1:04:46     Text: on the T-REx dataset, the relation extraction baseline is performing a bit better than

1:04:46 - 1:04:48     Text: BERT.

1:04:48 - 1:04:52     Text: One thing they notice here that's pretty interesting is that this dataset has a lot of different

1:04:52 - 1:04:53     Text: types of relations.

1:04:53 - 1:04:58     Text: And relations can be classified in terms of, are they a one-to-one relation, are they

1:04:58 - 1:05:02     Text: an N-to-1 relation, or are they an N-to-M (many-to-many) relation?

1:05:02 - 1:05:07     Text: An example of a one-to-one relation would be your student ID relation, so you have a unique

1:05:07 - 1:05:08     Text: student ID.

1:05:08 - 1:05:13     Text: An example of an N-to-M relation would be the enrolled-in relation, so there are lots

1:05:13 - 1:05:17     Text: of students enrolled in lots of classes, so this would be an N-to-M relation.

1:05:17 - 1:05:21     Text: And they find that BERT really struggles on these N-to-M relations.

1:05:21 - 1:05:26     Text: So while it performs better than the relation extraction baseline on some types of relations,

1:05:26 - 1:05:30     Text: overall it does pretty terribly on these N-to-M relations, so overall it does a bit worse

1:05:30 - 1:05:33     Text: than the baseline on this T-REx dataset.

1:05:33 - 1:05:39     Text: They also compare to DrQA on SQuAD, and they find that BERT does a fair amount worse.

1:05:39 - 1:05:43     Text: They note that the language model is not fine tuned here, and also has no access to

1:05:43 - 1:05:45     Text: an information retrieval system.

1:05:45 - 1:05:49     Text: And then when they look at the precision at 10, they find that this gap between DrQA's

1:05:49 - 1:05:54     Text: performance and BERT's actually closes quite a bit, which suggests that these language

1:05:54 - 1:05:59     Text: models do have some amount of knowledge encoded in them, and that they're even competitive

1:05:59 - 1:06:06     Text: with these knowledge extraction supervised baselines.

1:06:06 - 1:06:10     Text: So you can also try out examples on their GitHub repo for the LAMA probe.

1:06:10 - 1:06:15     Text: We have an example from their repo, which was "the cat is on the [MASK]".

1:06:15 - 1:06:19     Text: You can see what the top 10 predictions are to fill in the cloze statement.

1:06:19 - 1:06:22     Text: Here they have the cat is on the phone.

1:06:22 - 1:06:26     Text: So this can be a fun way to just figure out what factual and common sense knowledge is

1:06:26 - 1:06:33     Text: in existing language models, and it's pretty easy to use with this interactive prompt.

1:06:33 - 1:06:37     Text: So some limitations on the Lama probe are that it can be hard to understand why the models

1:06:37 - 1:06:40     Text: perform well when they do.

1:06:40 - 1:06:44     Text: So for instance, BERT might just be predicting the most popular token, and this happens to

1:06:44 - 1:06:45     Text: be right.

1:06:45 - 1:06:50     Text: Maybe it's just memorizing co-occurrence patterns and doesn't really understand the knowledge

1:06:50 - 1:06:54     Text: statement and doesn't understand what the fact is.

1:06:54 - 1:06:59     Text: It might also just be identifying similarities between surface forms of the subject and object.

1:06:59 - 1:07:03     Text: So for instance, an example: Pope Clement VII has the position of blank.

1:07:03 - 1:07:08     Text: Even if you don't know anything about Pope Clement VII, you might be able to figure out

1:07:08 - 1:07:15     Text: that Pope is a likely next word for this triple or for this template.

1:07:15 - 1:07:18     Text: So the problem with this is if the model is just making these predictions based on these

1:07:18 - 1:07:23     Text: surface forms or co-occurrence patterns, it's difficult to know whether you're actually evaluating

1:07:23 - 1:07:25     Text: the knowledge in the model.

1:07:25 - 1:07:29     Text: Maybe it's just making correct predictions for other reasons.

1:07:29 - 1:07:33     Text: And then a more subtle issue that we've brought up is that language models might just be

1:07:33 - 1:07:35     Text: sensitive to the phrasing of the statement.

1:07:35 - 1:07:40     Text: So for each triple in their data set or for each relation in their data set, they just

1:07:40 - 1:07:42     Text: had one manually defined template.

1:07:42 - 1:07:45     Text: And qualitatively they found that if they just make small changes to the template, it could

1:07:45 - 1:07:51     Text: actually change whether or not the model could recall the correct prediction or not.

1:07:51 - 1:07:55     Text: And so this means that the probe results are really a lower bound on the knowledge that's

1:07:55 - 1:07:58     Text: encoded in the language model.

1:07:58 - 1:08:01     Text: So if you change the phrasing, it's possible that the model might show that it actually

1:08:01 - 1:08:04     Text: does have the knowledge encoded in it.

1:08:04 - 1:08:08     Text: So the next lines of work we'll talk about are really building on these two limitations

1:08:08 - 1:08:12     Text: of this original Lama probe.

1:08:12 - 1:08:16     Text: So the first one is called LAMA-UHN, or LAMA UnHelpful Names.

1:08:16 - 1:08:20     Text: And the key idea is to remove these examples from Lama that can be answered without the

1:08:20 - 1:08:21     Text: relational knowledge.

1:08:21 - 1:08:25     Text: So this is kind of addressing the first limitation on the last slide.

1:08:25 - 1:08:30     Text: So they observed that BERT relies on just the surface forms of entities, and might not be using knowledge to

1:08:30 - 1:08:31     Text: make these predictions.

1:08:31 - 1:08:35     Text: This includes a string match situation that we talked about with the Pope.

1:08:35 - 1:08:40     Text: This also is dealing with the revealing person name issue that you saw in assignment five.

1:08:40 - 1:08:45     Text: So this is where the name could be an incorrect prior for the native language of someone,

1:08:45 - 1:08:47     Text: their place of birth, their nationality.

1:08:47 - 1:08:51     Text: They have this example from the paper, where they look at different

1:08:51 - 1:08:56     Text: person names, and then they look at BERT's prediction for their native language.

1:08:56 - 1:08:58     Text: And these are all French speaking actors.

1:08:58 - 1:09:04     Text: And BERT just predicts very biased and stereotypical languages for these particular names.

1:09:04 - 1:09:06     Text: So this can really work both ways.

1:09:06 - 1:09:12     Text: It can lead BERT to make incorrect predictions in some cases, but it could also work to

1:09:12 - 1:09:16     Text: let BERT make correct predictions, even if it has no factual knowledge of those people.

1:09:16 - 1:09:19     Text: So that's the issue they're trying to get at here is do we know that BERT actually

1:09:19 - 1:09:24     Text: knows this fact or is it just using some bias to make its prediction?

1:09:24 - 1:09:27     Text: So what they do is they introduce a couple heuristics to basically just filter out these

1:09:27 - 1:09:32     Text: examples from the LAMA probe that can either be solved by the string match setting or

1:09:32 - 1:09:35     Text: the revealing person name setting.

1:09:35 - 1:09:39     Text: So they make a harder subset of the LAMA data set essentially.

1:09:39 - 1:09:43     Text: They find that when they test BERT on this harder subset that its performance drops about

1:09:43 - 1:09:44     Text: 8%.

1:09:44 - 1:09:48     Text: But when they test their knowledge enhanced model, which they call E-BERT, the score only

1:09:48 - 1:09:49     Text: drops about 1%.

1:09:49 - 1:09:54     Text: So it's possible that as you make harder knowledge probes, we'll actually see even bigger differences

1:09:54 - 1:10:02     Text: in the performance of knowledge enhanced models to models without these knowledge enhancements.

1:10:02 - 1:10:09     Text: The next piece of work we'll talk about is actually getting at this issue that the phrasing

1:10:09 - 1:10:14     Text: of the prompt might actually trigger different responses from the language model.

1:10:14 - 1:10:19     Text: So the language model might know the fact, but it might fail on the task due to the phrasing.

1:10:19 - 1:10:23     Text: One reason this might happen is that the pre-training is on different contexts and sentence structures

1:10:23 - 1:10:24     Text: than the query.

1:10:24 - 1:10:29     Text: So for example, you might have in your pre-training corpus, the birthplace of Barack Obama is

1:10:29 - 1:10:30     Text: Honolulu Hawaii.

1:10:30 - 1:10:33     Text: And this might be something it's seen in Wikipedia, for instance, which is a common training

1:10:33 - 1:10:34     Text: data set.

1:10:34 - 1:10:38     Text: And then as a researcher, you write Barack Obama is born in blank.

1:10:38 - 1:10:40     Text: And you can see that these sentence structures are pretty different.

1:10:40 - 1:10:44     Text: So the model might have seen the first fact, but the sentence structure difference is actually

1:10:44 - 1:10:49     Text: enough to confuse it so it can't answer this query.

1:10:49 - 1:10:53     Text: So what they do is they generate a lot more of these prompts by mining templates from

1:10:53 - 1:10:57     Text: Wikipedia (one of their techniques actually uses dependency parsing), and also by generating

1:10:57 - 1:11:03     Text: paraphrased prompts, taking inspiration from the machine translation literature and using

1:11:03 - 1:11:05     Text: back-translation.

1:11:05 - 1:11:08     Text: So they generate a lot more prompts to try to query the language models and figure out

1:11:08 - 1:11:13     Text: whether small variations in the prompt trigger the correct prediction from the language model.

1:11:13 - 1:11:16     Text: They also experiment with ensembling prompts.

1:11:16 - 1:11:21     Text: So if we give the model multiple prompts and then take some probability averaged over

1:11:21 - 1:11:25     Text: these different prompts, can we improve the performance on the model returning the correct

1:11:25 - 1:11:26     Text: prediction?

1:11:26 - 1:11:30     Text: So we give it a higher chance of seeing a context that it might have actually seen during

1:11:30 - 1:11:32     Text: pre-training.
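
A hedged sketch of that ensembling idea, where predict_distribution is a stand-in for whichever masked language model you are querying, and a simple uniform average over prompts is assumed:

    from collections import defaultdict

    def ensemble_prompts(prompts, predict_distribution):
        """Average the model's answer distribution over several paraphrased prompts."""
        combined = defaultdict(float)
        for prompt in prompts:
            for answer, prob in predict_distribution(prompt).items():   # {answer: probability}
                combined[answer] += prob / len(prompts)
        return max(combined, key=combined.get)

    prompts = ["Barack Obama was born in [MASK].",
               "The birthplace of Barack Obama is [MASK]."]
    # best_answer = ensemble_prompts(prompts, predict_distribution=my_masked_lm)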

1:11:32 - 1:11:37     Text: They find that the performance on LAMA increases when they either use a top performing prompt

1:11:37 - 1:11:39     Text: or when they use this ensembling approach.

1:11:39 - 1:11:43     Text: So it suggests that the original Lama really was a lower bound on the amount of knowledge

1:11:43 - 1:11:45     Text: encoded in these language models.

1:11:45 - 1:11:52     Text: And changing the phrasing can actually help the model recall the correct answer.

1:11:52 - 1:11:56     Text: This table is a bit frightening, but they find that small changes in the query can lead

1:11:56 - 1:11:58     Text: to really large gains on performance.

1:11:58 - 1:12:03     Text: So if you just have a query like x plays in y position, and then you change that to x plays

1:12:03 - 1:12:08     Text: at y position, this can actually lead to like a 23% accuracy gain on this particular

1:12:08 - 1:12:13     Text: relation in terms of the model actually being able to recall the correct answer.

1:12:13 - 1:12:19     Text: Or even just x was created in y to x is created in y 10% accuracy gain.

1:12:19 - 1:12:24     Text: So I think this motivates the need to not only develop better ways to query these models,

1:12:24 - 1:12:31     Text: but probably also build language models that are actually more robust to the query itself.

1:12:31 - 1:12:35     Text: So in addition to probes, another way to evaluate these language models is by looking

1:12:35 - 1:12:42     Text: at how well they transfer from the pre-trained representation to downstream tasks.

1:12:42 - 1:12:45     Text: And so the idea here is you're actually going to fine-tune the pre-trained representation

1:12:45 - 1:12:51     Text: on different downstream tasks, similar to how you evaluate Bert on glue tasks.

1:12:51 - 1:12:56     Text: So common tasks that are used for this are relation extraction, entity typing, and question

1:12:56 - 1:12:57     Text: answering.

1:12:57 - 1:13:01     Text: Relation extraction is where you want to predict the relation between two entities.

1:13:01 - 1:13:05     Text: So this is getting back at one of the questions earlier in this talk in terms of well, how

1:13:05 - 1:13:08     Text: do you get the relation that's the edges in these knowledge bases?

1:13:08 - 1:13:13     Text: So given two entities, you learn a model to predict what is a relation between them.

1:13:13 - 1:13:16     Text: Entity typing is the task of, given an entity, predicting the type of that entity.

1:13:16 - 1:13:20     Text: So here, in "Alice robbed the bank", you want to predict Alice's type as criminal.

1:13:20 - 1:13:23     Text: And then you guys are very familiar with question answering.

1:13:23 - 1:13:28     Text: So the idea of these tasks is that they're knowledge intensive, so they're good candidates

1:13:28 - 1:13:32     Text: to see how well do these pre-trained representations actually transfer the knowledge to these downstream

1:13:32 - 1:13:36     Text: tasks.

1:13:36 - 1:13:41     Text: So we look at the performance on a relation extraction benchmark called TACRED, and all

1:13:41 - 1:13:45     Text: the models that we show here were at one point state of the art on TACRED.

1:13:45 - 1:13:50     Text: So this C-GCN is a graph convolutional neural network over dependency trees.

1:13:50 - 1:13:55     Text: The BERT-LSTM base is one of the first works that showed that you could actually

1:13:55 - 1:13:58     Text: get state-of-the-art performance with BERT on relation extraction.

1:13:58 - 1:14:02     Text: And this is just putting an LSTM layer over BERT's output.

1:14:02 - 1:14:05     Text: Ernie is the work that we talked about with the pre-trained entity embeddings.

1:14:05 - 1:14:09     Text: Matching the Blanks we didn't get to today, but it's a really interesting work about learning

1:14:09 - 1:14:11     Text: meaningful relation representations.

1:14:11 - 1:14:16     Text: And it falls more into the training data modification approaches, in that they are actually

1:14:16 - 1:14:19     Text: masking out entities again.

1:14:19 - 1:14:22     Text: And then KnowBERT is what we talked about.

1:14:22 - 1:14:26     Text: The W and W here mean they actually encode two knowledge bases in KnowBERT.

1:14:26 - 1:14:30     Text: So they're encoding WordNet and they're also encoding Wikipedia.

1:14:30 - 1:14:34     Text: And the high level takeaway from this table is that you can see that the recent knowledge

1:14:34 - 1:14:39     Text: enhanced models have achieved state of the art over the original models that once performed

1:14:39 - 1:14:40     Text: very well on TACRED.

1:14:40 - 1:14:44     Text: And we have gains of about five F1 points here.

1:14:44 - 1:14:47     Text: Another interesting takeaway from this table is there seems to be a trade-off in the size

1:14:47 - 1:14:50     Text: of a language model that's necessary to get a certain performance.

1:14:50 - 1:14:55     Text: So if you consider the size of the language model, then KnowBERT performs the best.

1:14:55 - 1:15:00     Text: But if you don't consider that, then it's tied with Matching the Blanks.

1:15:00 - 1:15:05     Text: So overall, this is pretty good evidence that these knowledge enhanced methods are in

1:15:05 - 1:15:10     Text: fact transferring to these knowledge intensive downstream tasks that can really take advantage

1:15:10 - 1:15:14     Text: of these pre-trained representations.

1:15:14 - 1:15:16     Text: We also have results on entity typing.

1:15:16 - 1:15:18     Text: So here we're comparing to slightly different set of models.

1:15:18 - 1:15:23     Text: Some of the baselines are LSTM models that were designed for entity typing.

1:15:23 - 1:15:29     Text: And we have ERNIE and KnowBERT leading the, I guess, leaderboard here on the entity typing

1:15:29 - 1:15:30     Text: task of Open Entity.

1:15:30 - 1:15:34     Text: And we see gains of about 15 F1 points with ERNIE and KnowBERT.

1:15:34 - 1:15:39     Text: So once again, we really do see that these knowledge-rich pre-trained representations

1:15:39 - 1:15:45     Text: are transferring and helping on these knowledge intensive downstream tasks.

1:15:45 - 1:15:50     Text: So just to recap, we talked about probes which evaluate the knowledge already present in models.

1:15:50 - 1:15:52     Text: These don't require any more training.

1:15:52 - 1:15:57     Text: But it can be challenging to construct benchmarks to actually make sure you're testing the knowledge

1:15:57 - 1:15:58     Text: in these language models.

1:15:58 - 1:16:03     Text: It can also be challenging to construct the queries used in the probe.

1:16:03 - 1:16:05     Text: We then talked about downstream tasks.

1:16:05 - 1:16:08     Text: These are a bit of an indirect way to evaluate knowledge in that they have this extra component

1:16:08 - 1:16:09     Text: of fine tuning.

1:16:09 - 1:16:14     Text: But it's a good way to evaluate how useful is this knowledge-rich pre-trained representation

1:16:14 - 1:16:19     Text: in actual applications.

1:16:19 - 1:16:23     Text: So I just touched on the exciting work in this area, but there are many other directions

1:16:23 - 1:16:25     Text: if you want to dive more into this.

1:16:25 - 1:16:30     Text: So there's retrieval augmented language models which learn knowledge retrievers to figure

1:16:30 - 1:16:34     Text: out what documents might be relevant for predicting the next word.

1:16:34 - 1:16:37     Text: There's work in modifying the knowledge in language models.

1:16:37 - 1:16:41     Text: So I talked about how this is one of the obstacles and challenges to using language models

1:16:41 - 1:16:42     Text: as knowledge bases.

1:16:42 - 1:16:45     Text: So there's been recent work in this area.

1:16:45 - 1:16:48     Text: We also saw how important the knowledge-pre-training task was.

1:16:48 - 1:16:53     Text: Well, there's many papers that are proposing different tasks to do the knowledge-pre-training.

1:16:53 - 1:16:57     Text: So it's still an open question in terms of what tasks are best for adding and encoding more

1:16:57 - 1:16:58     Text: knowledge.

1:16:58 - 1:17:02     Text: And there's also been work on more efficient knowledge systems.

1:17:02 - 1:17:07     Text: For instance, there's the EfficientQA challenge, which aims at building the smallest QA system.

1:17:07 - 1:17:12     Text: And then finally, there's been work on building better knowledge benchmarks that build on the

1:17:12 - 1:17:16     Text: benchmarks that we saw today.

1:17:16 - 1:17:26     Text: So that's all I have for today and I hope your final projects are going well.