Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 11 - Question Answering

0:00:00 - 0:00:13     Text: Okay, hi everyone. Welcome back. We're now past the halfway point, week six of CS 224

0:00:13 - 0:00:23     Text: N. And so let me just give a couple of quick announcements first. So today is the day

0:00:23 - 0:00:29     Text: that you have to have done the mid-quarter survey by. Hundreds of people have. But if you

0:00:29 - 0:00:36     Text: haven't, this is your last chance to get the half point for that. Today is also the day

0:00:36 - 0:00:41     Text: that final project proposals are due. We really encourage you to try and hand them in

0:00:41 - 0:00:47     Text: on time or nearly on time. That's really just to help you so we can more quickly give you

0:00:47 - 0:00:53     Text: feedback on final project proposals. And in the background then there's also assignment

0:00:53 - 0:00:58     Text: five. You'll have seen the message that we're giving you one extra day for that. But we

0:00:58 - 0:01:03     Text: do certainly encourage you to be hard at work on assignment five at this point. Hopefully

0:01:03 - 0:01:10     Text: it's a great exciting opportunity to be learning all the latest stuff about transformers.

0:01:10 - 0:01:17     Text: And then today we're delighted to have our first invited speaker. Let me just mention that going

0:01:17 - 0:01:27     Text: along with the half point of participation credit is you guys writing a reaction paragraph

0:01:27 - 0:01:33     Text: about something that the speaker talks about. The instructions are up for that

0:01:33 - 0:01:42     Text: on Ed. But without further ado, let me introduce Danqi Chen. So Danqi is one of the foremost

0:01:42 - 0:01:48     Text: researchers in question answering. And she's particularly well known in recent work for

0:01:48 - 0:01:55     Text: being one of the co-authors of the Roberta paper, the Spanbert paper, and on using dense

0:01:55 - 0:02:02     Text: passage retrieval methods for open domain question answering. And as a professor at the Princeton

0:02:02 - 0:02:12     Text: University, but as one other comment, Dan Tichun, once upon a time was the head TA of CS224N.

0:02:12 - 0:02:19     Text: So she's quite familiar with the context of this class. So really delighted to have Dan Tichun

0:02:19 - 0:02:26     Text: here to give this lecture on question answering. Thanks. Thank you Chris for the introduction.

0:02:26 - 0:02:33     Text: For me, it's a great opportunity for me to come back to CS224N today and give this lecture

0:02:33 - 0:02:37     Text: although being virtually. So questions are in the areas that have been working quite a

0:02:37 - 0:02:42     Text: bit in the last few years. So today I'm very happy to introduce you some of the fundamentals

0:02:42 - 0:02:49     Text: in this field as well as some cutting edge and saves our topics. So here is my plan for

0:02:49 - 0:02:55     Text: this lecture. So first I will give a brief introduction of what is question answering

0:02:55 - 0:03:00     Text: and what kind of problems that people are starting today. Then I'm going to use the most

0:03:00 - 0:03:06     Text: of the this lecture focusing one type of question answering problems called reading comprehension.

0:03:06 - 0:03:10     Text: So this is basically a problem that of how we build systems to answer questions over

0:03:10 - 0:03:16     Text: a single passive text. So I know that many of you are going to do a default project on

0:03:16 - 0:03:20     Text: the Stanford question answering data set. So understanding this part will be very crucial

0:03:20 - 0:03:26     Text: for your final project. So at the end of this lecture, I'm hoping to spend hopefully

0:03:26 - 0:03:32     Text: like 10 days a minute to talk about a more practical and a more exciting problem called

0:03:32 - 0:03:37     Text: open-to-man question answering. So basically try to answer questions over a very large collection

0:03:37 - 0:03:43     Text: of the documents. So my plan is try to quickly go over some of the state art methods in

0:03:43 - 0:03:53     Text: this area. Okay, so let's just get started. So first, what is the question answering?

0:03:53 - 0:03:58     Text: So the goal of question answering is to build systems that can automatically answer questions

0:03:58 - 0:04:05     Text: posed by humans in a natural language. Question answering or let's say QA in short is

0:04:05 - 0:04:12     Text: one of the earliest tasks and the early systems can even date back to 1960s. So here is one

0:04:12 - 0:04:22     Text: example of the early QA systems in back to 1964. So I think I see that this system is trying

0:04:22 - 0:04:28     Text: to answer questions like what do you want it and finally return on the answer that's

0:04:28 - 0:04:33     Text: the graph. So to do this, so this system is better to try to find some kind of text matching

0:04:33 - 0:04:38     Text: between the question and some kind of text segments and by using some kind of dependency

0:04:38 - 0:04:45     Text: analysis, I assume that you have already learned the defense of course in this class.

0:04:45 - 0:04:49     Text: And there are many different types of the question answering problems. And we can also

0:04:49 - 0:04:55     Text: look at category or this question answer problems based on the either the information source

0:04:55 - 0:05:01     Text: or the type of questions or the type of answers. So for the information source, we can build

0:05:01 - 0:05:09     Text: a system that can put like condition or text or very large collection of documents or

0:05:09 - 0:05:16     Text: even like structured database or structured knowledge based basis or even tables or images.

0:05:16 - 0:05:22     Text: So for the question part, we can also be a system that can also affect questions or non-factual

0:05:22 - 0:05:28     Text: questions, open open questions or coastal questions or simple questions versus more complex

0:05:28 - 0:05:34     Text: or compositional questions. And for the answer type, it can also be like a short segment

0:05:34 - 0:05:40     Text: of text or a paragraph or document or list or even the yes or no questions. So just

0:05:40 - 0:05:44     Text: have in mind there's many different types of question answering problems and all these

0:05:44 - 0:05:50     Text: problems may require very different techniques or different data or even different evaluation

0:05:50 - 0:05:58     Text: metrics to evaluate all these different problems. And the question answer has enabled a lot

0:05:58 - 0:06:04     Text: of the use for real-world applications. For example, today if you're just putting a question

0:06:04 - 0:06:10     Text: in a search engine like Google, for example, you can put in a question like where's the

0:06:10 - 0:06:15     Text: deepest lake in the world. So you can see that the current system basically can find a short

0:06:15 - 0:06:25     Text: snippet of text with including like like by call, by call in Siberia holds a distinction

0:06:25 - 0:06:30     Text: of being both the deepest lake in the world and the largest fresh water lake blah blah.

0:06:30 - 0:06:36     Text: And then it can actually ping pong the crack answer which is actually a concise answer

0:06:36 - 0:06:43     Text: which should be Siberia. And those current systems are also able to handle more complex

0:06:43 - 0:06:48     Text: questions like how two questions. I guess this is probably a question that everyone currently

0:06:48 - 0:06:55     Text: cares about. So the question is how can I protect myself from COVID-19? So there is another

0:06:55 - 0:07:00     Text: really simple and short answer to this question. So you can see that the system actually returns

0:07:00 - 0:07:05     Text: a very long paragraph including the best way to prevent the year is to avoid being exposed to

0:07:05 - 0:07:12     Text: this virus. And to help prevent the spread of COVID-19 you can do the following. So actually

0:07:12 - 0:07:19     Text: this paragraph is actually a summary from this CDC article if you just click this link and

0:07:19 - 0:07:27     Text: read through the article. So this is also one kind of question. And now this is a survey of the

0:07:27 - 0:07:34     Text: use cases for the current digital system such as Alexa or Google Home. So according to this survey

0:07:34 - 0:07:41     Text: result in January 2020 which is one year ago. So you can see that also people actually really like

0:07:41 - 0:07:47     Text: to ask questions on this digital assistance. So you can see that also question is actually the

0:07:47 - 0:07:53     Text: second most used case only around after listening to music and before the check weather in a set of

0:07:53 - 0:08:02     Text: time timer. So question is really useful in these digital systems. Another very famous example

0:08:02 - 0:08:10     Text: of the question answering system is this IBM was from the question answering system. So in 2011

0:08:10 - 0:08:17     Text: so this IBM was not used to a system has been shown to be too national. Jeffrey Champions in answering

0:08:17 - 0:08:25     Text: chapter questions. So this is kind of this all like a historical event and it's in the LK history.

0:08:27 - 0:08:32     Text: So if you look at the you are working of this system more closely. So you can see that it is

0:08:32 - 0:08:38     Text: actually a very complicated and highly modularized system. So it's a system that builds on both

0:08:38 - 0:08:46     Text: the unstructured text and also the structured data. So by looking at the system if you go from the

0:08:46 - 0:08:51     Text: left to right you can see that this system consists of the four stages including the question processing.

0:08:52 - 0:08:57     Text: The candidate answer generation and the candidate answer scoring and the confidence margin

0:08:57 - 0:09:02     Text: and ranking. And then if you look at each stage you can see that there are many different LP

0:09:02 - 0:09:08     Text: techniques that have actually included in this complex QE system including question classification,

0:09:08 - 0:09:13     Text: parsing, relation extraction, correctness. So it's actually there really a lot of the LP systems

0:09:13 - 0:09:19     Text: modules that have been included. And this system has been over 10 years actually exactly 10 years now

0:09:20 - 0:09:25     Text: and this is actually represented in a safe art like 10 years ago at that time.

0:09:28 - 0:09:33     Text: So we know that this class is about deep learning. So today deep learning has completely

0:09:33 - 0:09:39     Text: really transformed the landscape of the question answering systems. So there's no doubt that we

0:09:39 - 0:09:44     Text: can say that almost all the states are question answering systems today are built on top of the

0:09:44 - 0:09:50     Text: end-training of the deep learning networks and the pretty chain language models such as BERT.

0:09:50 - 0:09:54     Text: So today in this lecture we also going to learn a lot of deep learning models in for question

0:09:54 - 0:10:01     Text: answering. And this statement is probably also true for almost all the LP problems that we can see

0:10:01 - 0:10:06     Text: today. But we can also argue that question answering is probably one of those fields that we have

0:10:06 - 0:10:12     Text: seen the most remarkable progress in the last couple of years driven by deep learning.

0:10:16 - 0:10:23     Text: So in this lecture I would be mostly focused on like focusing on the text based or textual

0:10:23 - 0:10:29     Text: question answering problems. So basically we are trying to answer questions based on the unstructured text.

0:10:29 - 0:10:35     Text: So before I start I jump to that part. I also would quickly point out that there are many

0:10:35 - 0:10:40     Text: other really bigger question answering problems and the issue of them can be really a

0:10:40 - 0:10:46     Text: like a big subfield in NLP and they actually have very different challenges and also model designs.

0:10:47 - 0:10:51     Text: So one bigger class of this question answer problem is this knowledge based question answering.

0:10:51 - 0:10:56     Text: So basically we want to build question answering systems to answer questions that can answer

0:10:57 - 0:11:03     Text: answer questions over a very large database. So to solve this problem some approaches need to

0:11:03 - 0:11:09     Text: take this question and convert this question into some kind of logic forms and this kind of logic

0:11:09 - 0:11:14     Text: forms can be executed against this database to give you the final answer.

0:11:14 - 0:11:21     Text: And another class bigger class of the question answering problem is called visual question answering.

0:11:21 - 0:11:26     Text: So it's basically need to answer questions based on the images. So this problem basically

0:11:26 - 0:11:32     Text: requires both understanding of the questions and also images and there is actually a very active

0:11:32 - 0:11:37     Text: field between the computer vision and NLP. So if we are interested in all these type of problems

0:11:37 - 0:11:44     Text: I encourage you to check out these problems but I'm not going to give you these problems today.

0:11:45 - 0:11:51     Text: Okay so next I'm going to start with a part to review comprehension. I just want to quickly check

0:11:51 - 0:11:57     Text: if there are any quick questions I can answer before I get started. I'll start with a part to you.

0:11:58 - 0:12:04     Text: Now I think we could do now. Okay so yeah so let's talk about the review comprehension then.

0:12:04 - 0:12:09     Text: So a read comprehension is a basic problem that we want to compare

0:12:09 - 0:12:14     Text: and passive text and answer questions about the content. So it's an input of discovering

0:12:14 - 0:12:20     Text: visual passive text a question and going to return the answer that actually can't answer this question.

0:12:20 - 0:12:28     Text: So here is one example. So let's talk here is a passive text and we want to answer a question

0:12:28 - 0:12:34     Text: the question is what language the test will start if while you sport. Okay so I'm going to pause like

0:12:34 - 0:12:44     Text: five or 10 seconds and see if you will find the answer to this question based on this passage.

0:12:44 - 0:13:04     Text: And you guys. Okay. Well people stress the German. Yeah, German is a crap. So the answer should be

0:13:04 - 0:13:10     Text: German. So basically answer this question so you need to find this sentence like in 1861 test

0:13:10 - 0:13:16     Text: out 10 years school where he started German arithmetic and religion and only the German is a language

0:13:16 - 0:13:23     Text: so the answer to this question should be German. Okay here is another example. Okay another passive

0:13:23 - 0:13:32     Text: text and the question is which linguistic minority larger painting or Malal Yalan I think yeah.

0:13:32 - 0:13:38     Text: Five seconds.

0:13:49 - 0:13:55     Text: Okay so the answer to this question should be Hindi. So this probably is not very hard question

0:13:55 - 0:14:00     Text: for humans it actually a pretty hard question for machines because to get this question correctly

0:14:00 - 0:14:05     Text: so the machine is basically to understand that for the Hindi like 3.3% of the population

0:14:05 - 0:14:14     Text: speaks Hindi and only like 1.27% speaks Malal Yalan this language and then also compare these two

0:14:14 - 0:14:21     Text: numbers and the final case 3% 3.3% is a bigger number so the answer should be Hindi to this question.

0:14:23 - 0:14:28     Text: Okay so next I'm going to talk a little bit so why do we care about this problem so why do we

0:14:28 - 0:14:34     Text: care about the reading comprehension problem so besides that it actually tries many useful real

0:14:34 - 0:14:40     Text: work practical applications so as I already saw some examples at the beginning I think there are

0:14:40 - 0:14:48     Text: also two other few reasons so the first reason also besides the application the first reason is

0:14:48 - 0:14:54     Text: so reading comprehension has been also viewed as a very important test path for evaluating how well

0:14:54 - 0:15:01     Text: computer systems understand human language so this is really just similar to how we humans actually

0:15:01 - 0:15:07     Text: test a reading comprehension test to evaluate how well we actually understand what language

0:15:07 - 0:15:12     Text: so this is also the way that we actually post questions to test the machines language understanding

0:15:12 - 0:15:20     Text: and language understanding ability so this actually has been formally stated in back in 1937 by

0:15:20 - 0:15:25     Text: Wendy Lengard in her dissertation so she's fitting the saying is that she said that

0:15:26 - 0:15:32     Text: these questions can be devised to query any aspect of test comprehension so be it to answer

0:15:32 - 0:15:37     Text: questions is the strongest possible demonstration understanding so that's why reading comprehension

0:15:37 - 0:15:43     Text: can be a very important test path because we can't be devised design very complex questions to test

0:15:43 - 0:15:50     Text: that and also I think there's another interesting and important reason that reading comprehension

0:15:50 - 0:15:56     Text: is important so in the recent few years so many some researchers actually found that okay so for

0:15:56 - 0:16:02     Text: many other NLP tests that we also reduced them to a reading comprehension problem so I'm going

0:16:02 - 0:16:09     Text: to give you two examples so one example is already a information extraction so basically if we want to

0:16:10 - 0:16:17     Text: so give answers to a person like subject brought Obama give a relation educated at so we want to

0:16:17 - 0:16:23     Text: fill in what is a fill in this question mark and figure out okay where Barack Obama was educated at

0:16:23 - 0:16:29     Text: so one way to solve this problem is basically trying to cover this relation into a question

0:16:29 - 0:16:35     Text: so where did the Barack Obama graduate from and take a relevant piece of text and then

0:16:35 - 0:16:41     Text: by writing a reading comprehension problem then basically we can find out the extract the correct

0:16:41 - 0:16:48     Text: answer should be Columbia University that is also the output of this information extraction system

0:16:48 - 0:16:52     Text: another example is actually called a cement for the labeling I'm not sure if you have noticed

0:16:52 - 0:16:58     Text: in the class yet probably not but it basically is a task of the technical labeling is trying to

0:16:58 - 0:17:04     Text: taking one sentence and trying to identify the roles for different verbs at least for verbs in

0:17:04 - 0:17:11     Text: this case in one sentence so basically trying to give one sentence or give one verb finish trying

0:17:11 - 0:17:19     Text: to figure out like who did what to whom and then and well so by trying to so to go you try to

0:17:19 - 0:17:26     Text: figure out all this like roles with respect to the words so one way to solve this problem is by

0:17:26 - 0:17:32     Text: also by converting all these different roles into questions such as who finished something what

0:17:32 - 0:17:38     Text: is someone finished and what is someone finished something else so by converting all these kind of

0:17:38 - 0:17:45     Text: like cement relations we can also just apply the reading comprehension problem and give you a

0:17:45 - 0:17:50     Text: correct answer so this is actually a very interesting perspective that reading comprehension

0:17:50 - 0:17:59     Text: can be actually very universally useful to many other questions so next I'm going to introduce

0:17:59 - 0:18:04     Text: this like a Stanford question three data set calls God so if you're going to do the before the final

0:18:04 - 0:18:10     Text: project you will need to use this data set so Stanford question three data set is actually a

0:18:10 - 0:18:17     Text: super advised reading comprehension data set so which consists of 100 K annotated passage and

0:18:17 - 0:18:26     Text: answer question answer triples so here is one example from this data set and I just want to say

0:18:26 - 0:18:33     Text: that also what important thing to have in mind is that so this data set has consists of 100

0:18:33 - 0:18:39     Text: K annotated examples and this kind of large scale supervised data set is also very key

0:18:39 - 0:18:43     Text: in grade and for the training the effective neural models for reading comprehension so after

0:18:43 - 0:18:49     Text: this God data set many other like later data set having also collected basically runs this size

0:18:49 - 0:18:57     Text: or around like 100 K so 100 K is actually very important to trend these neural models so for this

0:18:57 - 0:19:04     Text: data set so the question the passages is like a single passage a single paragraph selected from

0:19:04 - 0:19:09     Text: the English Wikipedia which usually consists of like 100 to 150 words and the questions are

0:19:09 - 0:19:15     Text: crowdsourced basically like from the kind of turkey and there is a very important property of

0:19:15 - 0:19:23     Text: this data set is that each answer is a short segment text or we can spend in the passage so as

0:19:23 - 0:19:29     Text: you can see from this example so here are three different questions and each of this answer can

0:19:29 - 0:19:34     Text: be actually find also like a short segment text in the passage so this is actually a

0:19:34 - 0:19:41     Text: pretty interesting property you know it's also important property of this data set but also

0:19:41 - 0:19:48     Text: just to your coverage that this also limitation because not all the questions can be answered in

0:19:48 - 0:19:54     Text: this way so only the questions that that you can find answered as a stand in the passage can actually

0:19:54 - 0:20:02     Text: be included in this data set basically but today so this data set yeah I forgot to say so this

0:20:02 - 0:20:08     Text: data set was collected in 2016 by several researchers at Stanford so it's called Stanford

0:20:08 - 0:20:14     Text: Question 3 data set and today like after four or five years now so Scott still remains the most

0:20:14 - 0:20:19     Text: popular reading comprehension data set so he's actually he's a very clean on the high quality data

0:20:19 - 0:20:25     Text: set but he's also not as very difficult data set so today basically the score data set has

0:20:25 - 0:20:30     Text: been almost sold and what safe are already since estimated human performance

0:20:30 - 0:20:41     Text: and also until quickly mentioned the evaluation for this Stanford question data set so there are

0:20:41 - 0:20:46     Text: basically two evaluation metrics to evaluate how well a system can do on this data set the two

0:20:46 - 0:20:52     Text: metrics are like the exact match and the affine score so when you find match is basically just a binary

0:20:52 - 0:20:57     Text: indicator zero one based measures whether the answer can actually be exactly matched to the

0:20:57 - 0:21:03     Text: gold answer and the affine score basically measures kind of some partial credit and not to do the

0:21:03 - 0:21:08     Text: evaluation so basically for the development and testing set there will be like three gold answers

0:21:08 - 0:21:14     Text: collected because for some questions there might be not just one one unique answer so they're

0:21:14 - 0:21:20     Text: quite multiple possible answers and the evaluation makes it basically takes a pretty good answer

0:21:20 - 0:21:26     Text: and compares or compares the predicted answer to each gold answer with some kind of like some articles

0:21:26 - 0:21:32     Text: and also the computations excluded and the picture you can compute a exact match score and also

0:21:32 - 0:21:40     Text: a score by comparing the predicted answer to the gold answer and finally you take the match scores

0:21:40 - 0:21:45     Text: and because there are many different examples in the demo test set and finally we just take the

0:21:45 - 0:21:51     Text: average of all the examples for the post-example match and the reference score so by using this

0:21:51 - 0:21:58     Text: evaluation metric so estimating the human performance is by the researchers as a time

0:21:58 - 0:22:05     Text: estimated by the researchers as a time is the exact match score is 82.3% and the affine score is

0:22:05 - 0:22:16     Text: 91.2 so here's just a quick example so here's a question what do tests are doing in December 1878

0:22:16 - 0:22:22     Text: and the Diatry possible answers so you can see that the first two answers are the same left

0:22:22 - 0:22:32     Text: grass and the third answer is left grass and as a serve as is a title here or relations with his family

0:22:33 - 0:22:37     Text: and then you feel that if you find a prediction is a span which is left grass and serve

0:22:38 - 0:22:44     Text: so you can see that the exact there is an exact match score between the predicted answer and

0:22:44 - 0:22:50     Text: any of the gold answer so the exact match will be zero and the affine score will be taking the max

0:22:50 - 0:22:55     Text: I'm not going to talk about how this is computed so I suggest you check out the original paper

0:22:55 - 0:23:02     Text: so by computing this scores and taking the max and the final is the affine score will be 0.67

0:23:02 - 0:23:07     Text: which is the affine score for this predicted answer on this data set.

0:23:07 - 0:23:15     Text: So Danchi one question you might answer is so if you can do other tasks like named entity

0:23:15 - 0:23:22     Text: recognition or relation extraction by sticking something on top of bird as and fine tuning

0:23:22 - 0:23:28     Text: forward or do it as question answering there's one or the other method work better and by how much.

0:23:28 - 0:23:38     Text: That's an interesting question so I haven't really seen the okay so there has been some

0:23:38 - 0:23:43     Text: claims that okay also tasks can be converted into questions and tasks but I'm not sure there

0:23:43 - 0:23:49     Text: is a really a very fair comparison let's say an entity recognition but by really converting that into

0:23:49 - 0:23:54     Text: questions and tasks so I don't have to answer to that so the kind of states are in your system

0:23:54 - 0:24:01     Text: and still trying to just change sequence tagger on top of the bird so yeah I don't really have

0:24:01 - 0:24:03     Text: a pre-set answer to that.

0:24:07 - 0:24:08     Text: Should I continue?

0:24:10 - 0:24:19     Text: Okay so next I'm going to talk about how to build newer models for really comprehension in particular

0:24:19 - 0:24:24     Text: how we can build a model to solve this Stanford question answering data sets for data sets.

0:24:24 - 0:24:29     Text: Also I want to just quickly mention that because there are many different papers it actually uses

0:24:29 - 0:24:36     Text: like different notions to refer to the sensing so I'm starting from so I'm going to use the passage

0:24:36 - 0:24:41     Text: paragraph and context and also question query basically interchangeably so they are basically

0:24:41 - 0:24:46     Text: referred to sensing because different papers use also different notions so I just want to quickly

0:24:46 - 0:24:53     Text: mention that okay so can we build a model to solve this problem so let's first form this problem

0:24:54 - 0:25:00     Text: so the input of this problem is let's take let's take our context or paragraph so see which

0:25:00 - 0:25:07     Text: consists of the intel can see one to see and and also we take our question q and look the question

0:25:07 - 0:25:14     Text: consists of m tokens q1 to qm so n could be something like around 100 to on between 100 and

0:25:14 - 0:25:22     Text: 200 for Scott and m would be much shorter be something like 10 or 15 and because the answer has

0:25:22 - 0:25:27     Text: these constraints as the answer must be a second text in the passage so the output can be just

0:25:28 - 0:25:34     Text: reading this way so we are going to predict a start and so start an end will be

0:25:34 - 0:25:40     Text: wrench be basically in the wrench between the one and so it is basically just two check points

0:25:40 - 0:25:51     Text: oh sorry two end points of the answer and then so Scott has been collected by the late 2016

0:25:51 - 0:25:59     Text: so after 2016 they are having like visit two families of the models newer models to solve

0:25:59 - 0:26:05     Text: to solve in this like STEM score data set so the first family basically like there are a lot of

0:26:05 - 0:26:12     Text: models that count out during the lecture between 2016 and 2018 so this was family models because

0:26:12 - 0:26:18     Text: they are always team based models result tension so these are like just like a list of the

0:26:18 - 0:26:23     Text: representing models that come out during the period and including some works that I did when

0:26:23 - 0:26:30     Text: I was a PhD student at Stanford and also second the second class models I put here is really

0:26:30 - 0:26:37     Text: there that divided here before the birth and after birth so after birth 10 miles so all of the

0:26:37 - 0:26:42     Text: system reading comprehension models were built on like how to find two the first models not just

0:26:42 - 0:26:47     Text: bird models are put a bird like models so pretty and long with models and for this kind of reading

0:26:47 - 0:26:55     Text: comprehension problems so here I like to the some you know to the illustrations of these two

0:26:55 - 0:27:02     Text: families of the models so on the left is like I always team based models result tension on the

0:27:02 - 0:27:09     Text: right is on the version model and then we need to find two this model for the passion for the

0:27:09 - 0:27:16     Text: reading comprehension task so I know that so my plan today is first I talk to talk about this

0:27:16 - 0:27:21     Text: I always team based models so I'm going to spend a little bit more time on this part because I know

0:27:21 - 0:27:27     Text: that for the default final project you need to increment this model from the scratch so I'm

0:27:27 - 0:27:31     Text: going to work through how to build this model like step by step and hopefully that you can have

0:27:31 - 0:27:36     Text: a good understanding of how this model works and then I'm just going to briefly talk about how

0:27:36 - 0:27:43     Text: to build this use the bird models for the reading comprehension okay so before I start talking

0:27:43 - 0:27:48     Text: about this always team models I know that you have already learned sequence to sequence models

0:27:48 - 0:27:54     Text: result tension for machine translation so I was I want to draw some connections between the

0:27:54 - 0:27:59     Text: machine translation part and the reading comprehension problem because they really share

0:27:59 - 0:28:06     Text: little similarities so first so in the machine translation model all these like sequence

0:28:06 - 0:28:12     Text: use sequence model there is a like source and package the sentence so basically two sequences

0:28:12 - 0:28:17     Text: so that in our case in this reading comprehension case that we also have two sequences

0:28:17 - 0:28:22     Text: one is a passage and another is a question but the lens could be a slightly in balance because

0:28:22 - 0:28:26     Text: the passage really much longer than the question but it's essentially also two sequences

0:28:29 - 0:28:35     Text: and so in the reading comprehension we need to model like which words in a passage are most

0:28:35 - 0:28:41     Text: relevant to the question and then if they're relevant to the question so it's also relevant to which

0:28:41 - 0:28:48     Text: that of the question works so this is basically a very key important thing that the important

0:28:48 - 0:28:53     Text: thing that we actually need to model and this is actually very similar to the machine translation

0:28:53 - 0:28:59     Text: model that we need to model which words in the source sentence that actually are most relevant

0:28:59 - 0:29:04     Text: to the current packet word so if I imagine that the attention will be also really the key

0:29:04 - 0:29:09     Text: break in here that just like some sequence to six model we need to model the attention between

0:29:09 - 0:29:15     Text: the source sentence and the packet sentence we also need to model the attention between the

0:29:15 - 0:29:20     Text: passage and the question so this is actually very similar so something that's actually not very

0:29:20 - 0:29:26     Text: similar is for the sequence to six model we need to build like a decoder auto-regressivity

0:29:26 - 0:29:32     Text: decoder to generate the packet sentence word by word but in this reading comprehension problem

0:29:32 - 0:29:37     Text: we we don't need to really generate anything so we just take the pass into question so at least for

0:29:37 - 0:29:44     Text: the scope on data set we just need to try to cross-spare to predict the start and positions of

0:29:44 - 0:29:50     Text: answer so that's very much you simply find so we need to be to try the decoder to generate the

0:29:50 - 0:29:59     Text: target sentence okay so next I'm going to talk about one this model called by death so it's

0:29:59 - 0:30:05     Text: sent for by directional attention flow for machine comprehension so either was proposed by

0:30:05 - 0:30:12     Text: mention seal and other folks in 2017 so it remains before the first time out it remains one of the

0:30:12 - 0:30:18     Text: most popular reading comprehension models and a very good performance at that time at least on

0:30:18 - 0:30:26     Text: the spot dataset so you can see that this model seems to be pretty complicated but if you look

0:30:26 - 0:30:33     Text: look at this model from the bottom to the top it actually can be decomposed into many different layers

0:30:33 - 0:30:39     Text: so the next I'm going to just dissect this model layer by layer and talk about okay what this

0:30:39 - 0:30:44     Text: layer is actually doing and how we can really build this model from the bottom layer to the top layer

0:30:44 - 0:30:53     Text: and the final retrancess like model in an end to end away okay so the first part it actually

0:30:53 - 0:30:59     Text: the bottom three layers called character embedding layer wording embedding layer and the first

0:30:59 - 0:31:05     Text: thing that later so I just put them together called this as a encoding function so the idea here

0:31:05 - 0:31:11     Text: is that okay let's take the context query or the passage in question we need to encode them separately

0:31:13 - 0:31:19     Text: so to do this so this model basically proposed to use a concatenation of the wording embedding

0:31:19 - 0:31:26     Text: as well as the character embedding for each word in the context and query so for the wording

0:31:26 - 0:31:31     Text: embedding straightforward so you have a wording embedding so you can just look up the word

0:31:31 - 0:31:37     Text: for the this word like Seattle just use the global embedding as a reputation for this word

0:31:37 - 0:31:44     Text: and for the character embedding part so you basically need to represent each character in this

0:31:44 - 0:31:50     Text: word like Seattle and the hypothesis to a convolutional neural network with some kind of max

0:31:50 - 0:31:54     Text: point operations and finally you can just get one reputation I will talk and then you just

0:31:54 - 0:32:00     Text: concatenate the wording embedding and the character embedding so this character embedding has been

0:32:00 - 0:32:07     Text: shown exactly to improve the reputation for the unseen or the real world so mathematically a

0:32:07 - 0:32:12     Text: mathematical you can see that so for each word in the context query you can just we can just

0:32:12 - 0:32:18     Text: represent the rotation of the block with the embedding and the character embedding and then we

0:32:18 - 0:32:23     Text: just concatenate them and pass each other all highway networks so I don't write the function here so

0:32:23 - 0:32:31     Text: so you can just look up orange on paper and the second part so other we call the issue a very visual

0:32:31 - 0:32:38     Text: work so and the next we are going to pass this wording embedding into two separate by directional

0:32:38 - 0:32:45     Text: LSTNs to separate me to produce these contextualized embeddings for both the context and query so

0:32:47 - 0:32:52     Text: let's look at these equations so we take the reputation of this word and then we just

0:32:52 - 0:32:59     Text: base this in like a one LSTN model from one direction and this LSTN model from another direction

0:32:59 - 0:33:06     Text: so we just need to concatenate the two key rotations two directions and then finally we can get

0:33:06 - 0:33:13     Text: a contextualized reputation for each single word in the context and then we can do the same

0:33:13 - 0:33:19     Text: similar thing for the question reputation also want to query mention because I mentioned the sequence

0:33:19 - 0:33:24     Text: your sequence model so sequence to signal although we can already do this bad directional LSTNs

0:33:24 - 0:33:29     Text: for the two sequences again like because the decoder is all all the embarrassing model so that's

0:33:29 - 0:33:35     Text: why the decoder is usually just implemented as a unit direction on this team but because of

0:33:35 - 0:33:41     Text: here we don't really care about the generation so we can just use two bad directional LSTNs to

0:33:41 - 0:33:47     Text: represent the rotation this is actually very important this bad directional lentil is actually

0:33:47 - 0:33:55     Text: very important to capture the context from both the left and right set okay so the next component

0:33:55 - 0:34:00     Text: is the next layer it's called the attention flow layer so I just call it the attention here

0:34:01 - 0:34:06     Text: so the attention idea the idea of attention is trying to capture the interactions between the

0:34:06 - 0:34:12     Text: context and query and in this paper the baddest paper they propose two types of tension

0:34:12 - 0:34:20     Text: so the first type of tension we call the context your query attention so the idea is for each context

0:34:20 - 0:34:29     Text: word can we find the most relevant words in the question from the question for the query for the

0:34:29 - 0:34:35     Text: query words so here's the one example so here the context context the problem of Rama is a present

0:34:35 - 0:34:42     Text: of the USA so for each context word we need to find an alignment because it's fun like the wish words

0:34:42 - 0:34:46     Text: in the question can be actually aligned with this context word so we can see that both

0:34:46 - 0:34:52     Text: both of them can be aligned to pool and the president will align to the list and the USA is

0:34:52 - 0:34:56     Text: aligned to the United States so basically for each content we'll try to find the most relevant

0:34:56 - 0:35:04     Text: query words and then not the second type of tension is called query to context or tension so it's

0:35:04 - 0:35:12     Text: a very or not direction so here the idea is to choose some context words that are most relevant to

0:35:12 - 0:35:18     Text: one of the query words because the context can be very long so a lot of the context could be

0:35:18 - 0:35:25     Text: just not relevant to the discussion so we just run over several examples you can see that

0:35:25 - 0:35:30     Text: the first thing we need to do is try to locate okay wish cards of the sentences in this

0:35:30 - 0:35:36     Text: context can be actually relevant to this question so this type of query to context or tension

0:35:36 - 0:35:45     Text: is trying to capture so which which context words actually can be most relevant to the query to

0:35:45 - 0:35:51     Text: one of the query words so for this example the question is which seeding the glooming in winter

0:35:51 - 0:35:57     Text: so because the question asked about glooming so you can find a triacy of okay glooming

0:35:57 - 0:36:05     Text: species is actually very relevant to this question and now we also find this in winter because

0:36:05 - 0:36:11     Text: in winter it also mentioned the question so this part of context words to be also relevant to

0:36:11 - 0:36:17     Text: this question so this context words could be probably need to capture and not in this

0:36:17 - 0:36:23     Text: tension that okay this actually relevant to this question okay so this actually basically just

0:36:23 - 0:36:30     Text: a intuition of this two types of tension and this also wise model is called a bi-directional

0:36:30 - 0:36:36     Text: tension flow because there is a context query or tension and there is also a parity context

0:36:36 - 0:36:45     Text: tension so let me just talk about how to actually do this like query to context tension the

0:36:45 - 0:36:52     Text: context query or tension in this model so the way they do this is first to compute a similarity

0:36:52 - 0:36:59     Text: sport for every period of the contextualized vector C i and for every pair of the question with

0:36:59 - 0:37:05     Text: QJ so this is actually the output from the encoding layer so this already is the output from the LSTM

0:37:05 - 0:37:13     Text: layers and the way they say basically just compute a similarity sport by taking the C i QJ

0:37:14 - 0:37:20     Text: and also the element wise amygdication of the C i QJ so it's a basically just concatenate

0:37:20 - 0:37:27     Text: these three vectors so the output will be a six-inch dimensional vector and they just

0:37:27 - 0:37:33     Text: match this to compute the dot product of another like a learnable vector and the family just

0:37:33 - 0:37:41     Text: this can't this can't give you one scalar one number the is ij which matters how on the similarity

0:37:41 - 0:37:48     Text: between this context word C i and also this question word QJ so if you have so if I learned some

0:37:48 - 0:37:54     Text: attention before so this is actually just one choice of this model so there could be many different

0:37:54 - 0:38:00     Text: ways to define this similarity and a similarity sport so this is basically just one design choice

0:38:00 - 0:38:09     Text: of this model okay so after defined this similarity sport is ij so the context to query

0:38:09 - 0:38:15     Text: attention again like which question words are most relevant to C i so the way they do this is so

0:38:15 - 0:38:21     Text: basically just taking this matrix the similarity sport is ij for each row each row basically

0:38:21 - 0:38:30     Text: corresponds to like one context word for each row they are going to compute a soft max for each

0:38:30 - 0:38:36     Text: row and this can give us like normalization sports r for ij which is our probability distribution

0:38:36 - 0:38:42     Text: over all the question words r for ij so this is just really similar to all the attention

0:38:42 - 0:38:47     Text: on the kind of things that you probably have seen in this class so basically for each

0:38:47 - 0:38:57     Text: context word taking the soft max over all the question words and get us probability distribution

0:38:57 - 0:39:03     Text: and finally just take the linear combination of the weighted combination of these attention

0:39:03 - 0:39:08     Text: score r for ij and also the question vector the QJ and the finally you can't get a vector

0:39:08 - 0:39:15     Text: a i which is actually two h-dimensional vector so this context to query attention basically

0:39:15 - 0:39:21     Text: just try to capture which questions was a most relevant to each context word so the next part

0:39:21 - 0:39:29     Text: part is a query to sorry the title here sorry this actually the query to context or attention so

0:39:29 - 0:39:35     Text: which means that which context was relevant to some question words so we don't so a lot of

0:39:35 - 0:39:41     Text: context words would be no relevant to this question so the idea to do this is for each row of

0:39:41 - 0:39:48     Text: this is ij this basically just takes a mass scores over all the question words and after taking

0:39:48 - 0:39:55     Text: this mass score they compute the soft max over all the context words here so here i actually

0:39:55 - 0:40:01     Text: numerous over all the context words and this can give us like attention another attention score

0:40:01 - 0:40:09     Text: beta i which captures how important this context word is relevant to this question so after computing

0:40:09 - 0:40:17     Text: this beta i so we can again like compute this like a weighted combination by computing by some

0:40:18 - 0:40:24     Text: by summing up the beta i and also the context context vector ci and the finally you can't get

0:40:24 - 0:40:31     Text: a vector bi which is also another two h-dimensional vector and the final output of this attention

0:40:31 - 0:40:37     Text: function that's a very complicated here is also the design choice of this model so the takes a

0:40:37 - 0:40:44     Text: context vector ci and as it takes a i from this part the context to query attention and the

0:40:44 - 0:40:50     Text: takes the element of multiplication between the ci and ai and also the ci and bi and the final

0:40:50 - 0:40:56     Text: is to take the contactination and the can give you a produce of h-dimensional vector

0:40:56 - 0:41:01     Text: okay maybe i want to pause a little bit and check if there are any questions because this part

0:41:01 - 0:41:10     Text: is a little bit complicated yeah one one question is why is query to context and context to query

0:41:10 - 0:41:20     Text: attention not symmetrical um um that's a good question yes so here because essentially the

0:41:20 - 0:41:25     Text: goal is trying to because the goal is final goal you're trying to final span in the passage

0:41:25 - 0:41:31     Text: so the the whole the point of the this attention function is trying to produce a rotation

0:41:31 - 0:41:39     Text: for each single context word in this context so that's um so so we are not trying to generate

0:41:39 - 0:41:45     Text: questions rotations here it's going to try to generate the um contact rotations so one so the

0:41:45 - 0:41:51     Text: difference between these two like first try to see which questions are relevant to this context work

0:41:51 - 0:41:55     Text: another part is trying to figure out which contact work can be relevant and which contact work can

0:41:55 - 0:42:02     Text: be not relevant i hope it sounds just answers your question yeah here's an easier question sort of

0:42:02 - 0:42:09     Text: on the same topic which might help is there a reason why you use both query to context and context

0:42:09 - 0:42:17     Text: to query attention is it sometimes advantageous or okay to use just one that's a good question um

0:42:17 - 0:42:23     Text: um the reason is yeah so the i'm going to show some relations already from this figure so

0:42:23 - 0:42:29     Text: they basically just find both both directions can really help um by drawing the context for

0:42:29 - 0:42:34     Text: and query to context so there'll be some relations studies so by using one strategy useful

0:42:34 - 0:42:46     Text: but then just not a bit as using the both directions yeah um right let's see uh in the bottom right

0:42:46 - 0:42:55     Text: we sum over i so far does the i remain in bi is that correct or so typo there uh this is not a typo

0:42:55 - 0:43:02     Text: so again sorry so the output yeah you know it's a bit confusing so the output of this model this

0:43:02 - 0:43:10     Text: come for module it to get a rotation for each context work at the end so both the output for

0:43:10 - 0:43:17     Text: AI and bi i is actually um in numerates from like um actually in numerates uh you know if first over

0:43:17 - 0:43:24     Text: all the context works so bi would be still um just to try to aggregate over all the questions

0:43:24 - 0:43:31     Text: uh on the over all the context works but the beta i measures the importance of this context works

0:43:32 - 0:43:38     Text: compared to all the context works so both AI and bi are actually disrespecting the context works

0:43:38 - 0:43:44     Text: yes so you can see that here is basically doing some kind of the animal wise multiplication so the

0:43:44 - 0:43:50     Text: output of the g i would be actually only arranged from the one to and uh in which is another the context

0:43:50 - 0:43:57     Text: works there are lots of questions about this um what is the rationale for the expression for g i

0:43:58 - 0:44:05     Text: how does one come up with such an expression okay i don't know i guess not also try out a lot of

0:44:05 - 0:44:12     Text: things uh so okay so keep on here trying to understand okay so the roles of the context require

0:44:12 - 0:44:16     Text: attention or context or tension so i bet there's actually been many different

0:44:16 - 0:44:21     Text: formations to do this i also think there also have child many different variants but uh just

0:44:21 - 0:44:29     Text: what they kind of can up as and i think that if after week um it's gonna be smart the way string

0:44:29 - 0:44:39     Text: copper is a both attention but it doesn't have to be written this way yeah i mean one other question

0:44:39 - 0:44:46     Text: would be in the query the context attention why do you do a max inside the soft max

0:44:48 - 0:44:54     Text: yeah oh yeah sorry i should have expense more clearly so here again query to contest attention

0:44:54 - 0:45:01     Text: to try to measure whether this the importance of this context works with respect to some

0:45:01 - 0:45:09     Text: some answer or question words so if the so the so by taking the max for each row in this ace

0:45:09 - 0:45:15     Text: matrix so it's basically trying to see okay which question word um is actually most relevant

0:45:15 - 0:45:21     Text: to this context word if this number is still very low that means there isn't any question

0:45:21 - 0:45:27     Text: words that could be online with this context word so that we would just by taking after taking

0:45:27 - 0:45:32     Text: the max if this number is still very low that means this query context word is not very relevant

0:45:32 - 0:45:37     Text: so so basically just as well we take the soft max uh try to soft max i won't talk to the next

0:45:41 - 0:45:48     Text: uh i know do you want even more do you want to go on uh i probably should know all i

0:45:48 - 0:45:53     Text: do have a lot of slides but i'm happy on the questions after the answer yeah maybe you should go on

0:45:54 - 0:46:02     Text: yeah okay so the last part of this model is actually um the idea is the most simple so it's

0:46:02 - 0:46:08     Text: two of the last last three there are two layers smaller layer and all four layers so for the

0:46:08 - 0:46:15     Text: smaller layer so again for absolute attention layer they take the key derivation um of so basically

0:46:15 - 0:46:22     Text: g i which captures the attention between all contacts and the query and then basically just

0:46:22 - 0:46:28     Text: passes g i to another two layers of bi-directional errors gems and the many reasons they do this is

0:46:29 - 0:46:33     Text: the attention layer is basically modeling the interactions between the query and context

0:46:33 - 0:46:38     Text: and the bi-passing this to another two layers of bi-directional errors gems the modeling layers

0:46:38 - 0:46:43     Text: is basically modeling they can also first model the interactions using the context words

0:46:43 - 0:46:50     Text: so this is a formulation here um so these are two layers bi-directional and

0:46:50 - 0:46:55     Text: here by taking the g i as input and the output will be on the m i which is another two-h

0:46:55 - 0:47:00     Text: direction dimensional vector for each context work in the in the passage

0:47:03 - 0:47:08     Text: okay so the final is all four layers so all four layers they did just two cross-fers

0:47:08 - 0:47:15     Text: just trying to predict the starting end positions so by doing this so the first contact in the g i

0:47:15 - 0:47:22     Text: and m i so this would be actually a 10-h dimensional vector and by computing the dot project over

0:47:22 - 0:47:28     Text: another vector called double start and this resulting vector and they can get basic data score

0:47:28 - 0:47:35     Text: for each position in the context and then you can just up up higher softmax uh and then this

0:47:35 - 0:47:41     Text: will give you a probability that okay what is the probability this position i will be actually

0:47:43 - 0:47:51     Text: uh be on the start position on the final answer string and they also have another class file

0:47:51 - 0:47:57     Text: to predict the end position of the answer but there also be something a little bit more complicated

0:47:57 - 0:48:02     Text: so they actually passed the m i to another bi-directional error is tm here so they call the m fine i

0:48:02 - 0:48:10     Text: and they come cat in g i and m prime i oh sorry this is the title so this will be w and so the

0:48:10 - 0:48:17     Text: computer dot project between w and and this vector and this can be reproduced on all the probability

0:48:18 - 0:48:24     Text: probability over all the positions which predicts the um how likely this position will be the

0:48:24 - 0:48:31     Text: end position of the answer so by doing it by crossing the m i to another bi-directional is tm the

0:48:31 - 0:48:36     Text: reason is that they're trying to capture some kind of dependence between the choice of the start and

0:48:36 - 0:48:42     Text: end so you can imagine that start and shouldn't be too separate so shouldn't be a cool

0:48:42 - 0:48:49     Text: business independent predict it but if they claim that if you add some kind of dependence

0:48:49 - 0:48:54     Text: between the m i and um the p-start and p-end this can actually perform better

0:48:54 - 0:49:03     Text: okay and don't visit this part on describing the bi-directional model any quick questions i can ask

0:49:03 - 0:49:12     Text: this i think you can actually go on okay okay sorry i forgot to mention this is the okay the final

0:49:12 - 0:49:18     Text: training loss will be just by taking these two probability distributions and this is basically

0:49:18 - 0:49:25     Text: just next next lot like the code of the gold as a gold answer does the protocol start position

0:49:25 - 0:49:33     Text: of the gold answer and end position of the answer and by just um basically taking the

0:49:33 - 0:49:39     Text: product of the these two probabilities but you're planning a pilot lock so it's a

0:49:39 - 0:49:45     Text: sum of the two next log terms will be the final training loss and the whole the whole model can be

0:49:45 - 0:49:51     Text: just changing the end to end away from the encoding layer to a tension layer to modern layer and to

0:49:51 - 0:49:57     Text: outlayer so this will be just to accomplish the whole the whole model of the bi-directional model

0:49:58 - 0:50:06     Text: okay so this model is actually achieved like on the data set it achieved a 77 point

0:50:06 - 0:50:11     Text: history f1 school so as i mentioned earlier so just on some operations started they found

0:50:11 - 0:50:17     Text: the both of tension in two directions are actually important if we remove the one direction the

0:50:17 - 0:50:23     Text: performance will actually drop a bit if we remove the contrast to error tension the performance

0:50:23 - 0:50:30     Text: will drop to 67 point seven f1 school and if we remove this part it will drop to four point f1

0:50:30 - 0:50:35     Text: four and then also the character embedding styles help so if we remove the character embedding

0:50:35 - 0:50:42     Text: you'll get like a 1.9 point drop and all the right of this figure you can

0:50:43 - 0:50:48     Text: slide you can see a very big table: it is basically all the models that came out at

0:50:48 - 0:50:55     Text: that time, between 2016 and 2018. You can see that BiDAF is here, achieving a

0:50:55 - 0:51:02     Text: 77.3 F1 score, and basically all the models are in a very similar ballpark:

0:51:02 - 0:51:10     Text: the numbers range up to a highest of 79.8, until ELMo was introduced,

0:51:10 - 0:51:14     Text: after which the numbers improved quite a bit. Before ELMo, basically all the numbers

0:51:14 - 0:51:20     Text: were fairly similar, with each model improving over the previous model by only about

0:51:20 - 0:51:29     Text: 1.2 points and now here is our tension visualization to show that on how these like the similarities

0:51:29 - 0:51:34     Text: for the tension actually can capture the similarity between the question words and the contrast words

0:51:34 - 0:51:40     Text: so here's an example of the question the word in super form 50 takes place so each show is actually

0:51:41 - 0:51:47     Text: a question word here and each column is matrix based in the case the attention score the

0:51:47 - 0:51:54     Text: similarity score that has been learned by this model so you can see that on the right is basically

0:51:54 - 0:52:02     Text: trying to print out or display so the the the contrast words that have the highest scores

0:52:03 - 0:52:10     Text: so you can see that the where it has been online very well with the at the stadium liva

0:52:10 - 0:52:15     Text: and also the super bowl 50 is basically line very well with the super bowl 50 so this basically

0:52:15 - 0:52:20     Text: really tells that this kind of attention scores can actually capture the similarity scores pretty well

0:52:20 - 0:52:30     Text: Yeah, okay. So next I'm going to talk about BERT, how to use the BERT model to solve this problem.

0:52:30 - 0:52:35     Text: I know that you learned about BERT in the last lecture, so I'm not going to repeat it;

0:52:35 - 0:52:41     Text: very quickly, BERT is basically a deep bidirectional Transformer encoder pre-trained on a large

0:52:41 - 0:52:46     Text: amount of text, and it is trained with two objectives: masked language modeling

0:52:46 - 0:52:53     Text: and next sentence prediction. This model has a lot of parameters: BERT-base has

0:52:53 - 0:53:01     Text: 110 million parameters and the BERT-large model has 330 million parameters. Okay, so how can we

0:53:01 - 0:53:07     Text: actually use BERT for reading comprehension? It is actually very easy and very straightforward:

0:53:07 - 0:53:12     Text: the idea is to take the question and the passage as the two segments. As you know, BERT is

0:53:12 - 0:53:18     Text: pre-trained with two segments for the next sentence prediction task, so when you apply BERT

0:53:18 - 0:53:23     Text: to the reading comprehension task you basically just take the question as segment A and take

0:53:23 - 0:53:29     Text: the passage as segment B, and finally the goal is to predict the two endpoints of the answer in segment B.

0:53:31 - 0:53:37     Text: Here is a more concrete example. The question is "How many parameters does BERT-large have?"

0:53:37 - 0:53:44     Text: You can see that we basically just take the question here and the passage here,

0:53:44 - 0:53:49     Text: put in the [CLS] token and the [SEP] token, and just concatenate the question and the

0:53:49 - 0:53:54     Text: passage tokens. For the question side you just use the segment A

0:53:54 - 0:54:01     Text: embeddings, and for the passage you feed BERT the segment B embeddings. Finally,

0:54:01 - 0:54:07     Text: the training loss is also the same: you basically just try to minimize the sum of

0:54:07 - 0:54:13     Text: the negative log likelihoods of both the start and end positions. But the way we compute

0:54:13 - 0:54:20     Text: the start and end probabilities here is slightly different. It is actually very straightforward: you just

0:54:20 - 0:54:28     Text: pass the whole input sequence into BERT, and BERT gives you a hidden vector h_i that

0:54:28 - 0:54:35     Text: represents the hidden state corresponding to each context word c_i. We just

0:54:37 - 0:54:43     Text: introduce another two vectors, w_start and w_end, compute the dot products, and then apply the

0:54:43 - 0:54:48     Text: softmax; this gives you something very similar to what we had before, but here the h_i

0:54:48 - 0:54:54     Text: is just the output from BERT's encoder, and we train these two vectors w_start and

0:54:54 - 0:55:04     Text: w_end for the two probability distributions P_start and P_end. Okay, so for this model,

0:55:05 - 0:55:10     Text: all the BERT parameters (which is a very large number: if you use BERT-base

0:55:10 - 0:55:16     Text: that is 110 million parameters) as well as the newly introduced parameters w_start and w_end

0:55:17 - 0:55:24     Text: (if you take BERT-base, the hidden size is 768, so that is only about 1,500

0:55:24 - 0:55:29     Text: new parameters) are all optimized together, jointly, for this training objective.
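To make this setup concrete, here is a minimal sketch of the BERT fine-tuning head for reading comprehension, written against the Hugging Face transformers library purely as an illustration; the variable names and the tiny example passage are assumptions, and a real system would batch the data and run an optimizer over the whole training set.

```python
# Sketch of the BERT reading-comprehension setup described above: question as
# segment A, passage as segment B, plus two new vectors w_start and w_end that
# score every token. This is an illustration, not the original implementation.
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")     # ~110M parameters

hidden = bert.config.hidden_size            # 768 for BERT-base
w_start = nn.Linear(hidden, 1, bias=False)  # ~768 new parameters
w_end = nn.Linear(hidden, 1, bias=False)    # ~768 more (about 1,500 in total)

question = "How many parameters does BERT-large have?"
passage = "BERT-large has 330 million parameters."
enc = tokenizer(question, passage, return_tensors="pt")   # adds [CLS]/[SEP]
                                                           # and segment A/B ids
h = bert(**enc).last_hidden_state            # (1, seq_len, hidden): one h_i per token
start_logits = w_start(h).squeeze(-1)        # dot product w_start . h_i
end_logits = w_end(h).squeeze(-1)

p_start = start_logits.softmax(dim=-1)       # P_start(i)
p_end = end_logits.softmax(dim=-1)           # P_end(i)
# Training would minimize -log P_start(gold_start) - log P_end(gold_end),
# fine-tuning all BERT parameters jointly with w_start and w_end.
```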

0:55:30 - 0:55:36     Text: And this model actually works really, really well: if you just take this BERT model

0:55:36 - 0:55:41     Text: and optimize all the parameters together, it gives you very high performance,

0:55:41 - 0:55:45     Text: as I will show you in a minute. And if you use even stronger

0:55:45 - 0:55:51     Text: pre-trained language models, stronger models than the BERT models,

0:55:51 - 0:55:56     Text: they can lead to even better performance on SQuAD, and SQuAD has also become a standard

0:55:56 - 0:56:01     Text: dataset for testing these kinds of pre-trained models. Let me show you some numbers.

0:56:02 - 0:56:08     Text: Again, human performance is around 91 F1 and BiDAF is 77.3, and then if we just do this

0:56:08 - 0:56:16     Text: fine-tuning approach, BERT-base can give you about 88.5 and BERT-large can give you about 90.9. So you can

0:56:16 - 0:56:21     Text: see that this is a huge jump from the BiDAF model to the BERT models. And finally,

0:56:21 - 0:56:28     Text: if you look at even the latest pre-trained models, including XLNet

0:56:28 - 0:56:33     Text: or RoBERTa or ALBERT, these models are either trained on bigger

0:56:33 - 0:56:41     Text: corpora or have bigger model sizes, and basically these models can give you another three to four points

0:56:41 - 0:56:48     Text: of F1 score compared to the BERT-large model, which is already way higher than the estimated human F1 score.

0:56:48 - 0:56:54     Text: So this just works really well. Any quick questions?

0:56:59 - 0:57:09     Text: Okay, so yeah, I guess I've been a little bit fast on these BERT models,

0:57:09 - 0:57:14     Text: but next I would also like to do a bit of comparison between the BiDAF models and the

0:57:14 - 0:57:21     Text: BERT models. The BERT model has many, many more parameters, something like 110

0:57:21 - 0:57:27     Text: million or 330 million parameters, while BiDAF has only about 2.5 million parameters.

0:57:28 - 0:57:33     Text: And BiDAF is built on top of several bidirectional LSTMs, while BERT is built on top

0:57:33 - 0:57:39     Text: of Transformers. Transformers means there isn't any recurrent structure in the architecture,

0:57:39 - 0:57:45     Text: so the Transformers are much easier to parallelize. And a very important difference between the

0:57:45 - 0:57:50     Text: BERT models and the BiDAF models is that BERT is pre-trained, while BiDAF is only built on top

0:57:50 - 0:57:57     Text: of GloVe embeddings (which are pre-trained), and all the remaining parameters need to be

0:57:57 - 0:58:02     Text: learned from the supervised datasets. So here it is very

0:58:02 - 0:58:10     Text: clear that pre-training is a game changer: pre-training basically changes everything

0:58:10 - 0:58:13     Text: and also gives you a very, very large boost in terms of performance.

0:58:15 - 0:58:21     Text: But I also want to convey another message: if we set aside pre-training,

0:58:22 - 0:58:26     Text: are BiDAF and the BERT models really fundamentally different?

0:58:26 - 0:58:32     Text: I don't think so, and here is my argument. Let's try to see how these two

0:58:32 - 0:58:38     Text: models are actually connected, especially in terms of the modeling. The BiDAF model essentially

0:58:38 - 0:58:44     Text: tries to model the interactions between the question and the passage, right, in both directions,

0:58:44 - 0:58:49     Text: question-to-passage and passage-to-question. And the BERT model essentially tries to

0:58:49 - 0:58:57     Text: use self-attention on top of the concatenation of the question and the passage. It is a Transformer

0:58:57 - 0:59:02     Text: model, so you take the question and the passage (these are the question and passage tokens) and then

0:59:02 - 0:59:08     Text: you apply many, many layers of self-attention. Essentially this self-attention is able

0:59:08 - 0:59:15     Text: to capture the attention between the passage words themselves, the attention between the passage and the question

0:59:15 - 0:59:20     Text: words, the attention from the question to the passage side, and also the attention

0:59:21 - 0:59:27     Text: from one question word to another question word. So compared to BiDAF, BiDAF tries to model

0:59:27 - 0:59:32     Text: this part, but the BERT model essentially captures the attention between all these four parts.

0:59:33 - 0:59:39     Text: And actually, after BiDAF came out, and before BERT came

0:59:39 - 0:59:46     Text: out, people also showed that if we just add a self-attention layer on the passage side,

0:59:46 - 0:59:51     Text: so basically explicitly modeling the attention between the passage words and the

0:59:51 - 0:59:57     Text: passage words, to BiDAF, this also improves the performance. So you can see that these two models

0:59:57 - 1:00:03     Text: are essentially just trying to model the attention between the passage and the question and also the attention

1:00:03 - 1:00:10     Text: between the passage words and the passage words, and this is exactly what the BERT model is doing.
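A tiny numerical sketch of the point just made: self-attention over the concatenated sequence [question; passage] already contains all four attention blocks as sub-matrices. All the shapes and random states here are made up purely to illustrate the indexing.

```python
# Toy illustration: one self-attention map over the concatenated sequence
# [question; passage] contains the four attention blocks that BiDAF (plus a
# passage self-attention layer) models separately. Shapes are arbitrary.
import torch

n_q, n_p, d = 5, 20, 64                      # question length, passage length, dim
q_states = torch.randn(n_q, d)               # question token states
p_states = torch.randn(n_p, d)               # passage token states
x = torch.cat([q_states, p_states], dim=0)   # one sequence of length n_q + n_p

attn = torch.softmax(x @ x.T / d ** 0.5, dim=-1)   # (n_q+n_p, n_q+n_p)

q2q = attn[:n_q, :n_q]    # question words attending to question words
q2p = attn[:n_q, n_q:]    # question -> passage (like BiDAF's one direction)
p2q = attn[n_q:, :n_q]    # passage -> question (like BiDAF's other direction)
p2p = attn[n_q:, n_q:]    # passage self-attention (the extra layer added to BiDAF)
```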

1:00:10 - 1:00:17     Text: Okay, so if there are no further questions, at this point I'll move on. We've seen that

1:00:18 - 1:00:22     Text: BERT models can do really well on these kinds of reading comprehension datasets, and we just

1:00:22 - 1:00:27     Text: talked about how pre-training can really change the performance, can be a game changer, for reading

1:00:27 - 1:00:34     Text: comprehension. Can I jump in and ask one question first, Danqi? People wonder whether you

1:00:34 - 1:00:40     Text: can do well with a transformer that isn't pre-trained, right? If you tried to build a question

1:00:40 - 1:00:46     Text: answering system using a transformer rather than LSTMs, but no pre-training, does that work?

1:00:47 - 1:00:53     Text: That's a good question. Yeah, it works, but you probably cannot train a model as big as

1:00:53 - 1:01:01     Text: 110 million parameters or 330 million parameters. There is actually a model in between,

1:01:03 - 1:01:09     Text: sorry, between this family of LSTM models and the BERT models, called QANet,

1:01:09 - 1:01:14     Text: from Google. QANet is actually built on top of Transformers but without the real pre-training,

1:01:14 - 1:01:19     Text: and that model actually performs better than the BiDAF models and other models, but it actually

1:01:19 - 1:01:25     Text: underperforms the BERT models quite a bit. So just check out QANet.

1:01:29 - 1:01:36     Text: Okay, I will just continue. So, given that pre-training has been so important, next I want to

1:01:36 - 1:01:42     Text: quickly talk about this question: can we actually design even better pre-training

1:01:42 - 1:01:47     Text: objectives for reading comprehension or question answering? The answer is actually yes. This is

1:01:47 - 1:01:54     Text: work I did with Mandar Joshi and other folks about a year ago, called SpanBERT. Think about

1:01:54 - 1:02:00     Text: it: for SQuAD and a lot of these other types of reading comprehension datasets, the goal

1:02:00 - 1:02:07     Text: is to predict the answer span from the passage given a question, as the answer to the

1:02:07 - 1:02:15     Text: question. There are two key ideas proposed in SpanBERT. The first idea is that, instead of

1:02:15 - 1:02:20     Text: masking only individual words, we propose to mask contiguous

1:02:20 - 1:02:27     Text: spans of words in the passage, because the final answer is just a segment of text in the

1:02:27 - 1:02:33     Text: passage. So we try to mask out these possible answer spans from the passage as the

1:02:33 - 1:02:40     Text: pre-training objective. And the second idea proposed in SpanBERT is that, because in the end

1:02:40 - 1:02:46     Text: we want to predict an answer span, we are essentially trying to predict two endpoints

1:02:46 - 1:02:53     Text: as the answer. So the idea is: can we take the two endpoints of the answer span,

1:02:56 - 1:03:02     Text: can we try to compress all the information in this span into the two endpoints?

1:03:02 - 1:03:08     Text: Here is the idea. Let's think about this: if we mask out the four words here, can we try to

1:03:08 - 1:03:16     Text: use the two endpoints here in this figure, x4 and x9, to predict all the words in the middle?

1:03:16 - 1:03:21     Text: So essentially we take the two endpoints, plus some kind

1:03:21 - 1:03:27     Text: of position encoding, and we try to predict all the words in the span. This

1:03:27 - 1:03:33     Text: is why it is called SpanBERT, and I encourage you to check out our paper.

1:03:33 - 1:03:39     Text: helps a lot at least for the questions and data sets so as you can see from this figure so this

1:03:39 - 1:03:46     Text: is called 1.1 and it's called 2.0 and this are many other questions and data sets so you can see

1:03:46 - 1:03:54     Text: here so the blue bars here we call the google bird is actually the original check points that

1:03:54 - 1:04:00     Text: released by google researchers and our bird is actually just exactly our re-evaluation of the

1:04:00 - 1:04:05     Text: bird model but we are having trying to using the same data but we have been trying to transit

1:04:05 - 1:04:11     Text: model for slightly longer so it's actually achieved a better performance than our original google bird

1:04:11 - 1:04:16     Text: so as you can see the yellow bars here is actually the spambert so spambert actually

1:04:17 - 1:04:22     Text: brings with also performed google bird and all the bird basically across all the data sets

1:04:22 - 1:04:28     Text: that really tells us that okay even if we are not going to increase the model size we are not

1:04:28 - 1:04:33     Text: going to increase the data by designing better criteria objectives can also be very go a long way

1:04:33 - 1:04:37     Text: and do a much better job in at least in the question answering and reading comprehension

1:04:37 - 1:04:48     Text: data sets okay so I have several few slides left in this part so so far I have to demonstrate

1:04:48 - 1:04:55     Text: that on by using by death model and by using bird models we can get a very good performance on

1:04:55 - 1:05:00     Text: the scope data set and this number has already exists even the human performance on scope

1:05:00 - 1:05:05     Text: that this means that reading comprehension is already solved the answer is of course not

1:05:06 - 1:05:12     Text: so let me just so in the recent last couple of years that's been a lot of evidence

1:05:12 - 1:05:18     Text: showing that the current systems still perform poorly on adversarial examples or the examples

1:05:18 - 1:05:26     Text: from the out of domain distributions so here is a very classical example so proposed by

1:05:26 - 1:05:34     Text: Robin John personally on 2017 so the idea is that they take a pass and take a question

1:05:34 - 1:05:40     Text: and they're trying to just insert a random sentence to the end of the paragraph so you can see

1:05:40 - 1:05:46     Text: that distance passes like even like a nonsense entity in this context, drafting here but this

1:05:46 - 1:05:51     Text: sentence actually has a like a great some love score overlap between the question

1:05:51 - 1:05:55     Text: is actually very similar to this question but actually the word numbers have been changed

1:05:55 - 1:06:01     Text: the entinence has been changed and they found that these kind of adversarial examples can actually

1:06:01 - 1:06:06     Text: very easy to fool the prime systems and the final and the makes the system to predict

1:06:06 - 1:06:13     Text: answer to be the drafting so the by shoots the table shows that by adding a lot of these

1:06:13 - 1:06:19     Text: adversarial examples they found that the performance actually drops a lot this by that model so

1:06:19 - 1:06:26     Text: drops from 75.5 to even like 30% so for even like this kind of attack the performance will just drop

1:06:26 - 1:06:35     Text: to very low like 4.8% so here's another paper that actually just came out in 2020 so it has

1:06:35 - 1:06:40     Text: made a lot of the evidence showing the similar things that so today we can be a very good reading

1:06:40 - 1:06:45     Text: kind of attention data set on individual data on the individual data sets but this system is

1:06:45 - 1:06:50     Text: channel one data sets basically cannot really generalize to other data sets so the diagonal is basically

1:06:51 - 1:06:57     Text: of this table is basically channel one model on one data set and the evaluate on the send data set

1:06:57 - 1:07:03     Text: and for all the other numbers in this table basically shows that if you turn one from system

1:07:03 - 1:07:08     Text: on one data set and then evaluate on another data set the performance will drop a lot so it's

1:07:08 - 1:07:14     Text: basically really cannot generalize from one data set to another data set so finally this is

1:07:14 - 1:07:20     Text: actually a very interesting result so this model this paper is actually the best paper from

1:07:20 - 1:07:28     Text: ACL 2020 is called checklist paper so the idea is that this this also basically try to propose

1:07:28 - 1:07:34     Text: some kind of the test cases to check whether this model can actually really under

1:07:34 - 1:07:40     Text: also some simple questions whether we sound specific or particular film they find that by just

1:07:41 - 1:07:48     Text: kind of a really simple question for example here Jeremy is more optimistic than Taylor

1:07:48 - 1:07:54     Text: and who is more pessimistic and the they found that a birth lot model channel stop and this

1:07:54 - 1:08:04     Text: basically can fill this type of test cases 100% time and up here is another table so you can see

1:08:04 - 1:08:11     Text: that here is another correct example like Victoria and Alex are friends her mom is an agent

1:08:11 - 1:08:17     Text: who's mom is an agent and so to get this kind of question correctly it has to understand

1:08:17 - 1:08:23     Text: the Victoria actually refers to a female person and Alex refers to a male person so this

1:08:23 - 1:08:28     Text: this model this kind of questions also makes a kind of model for large models can also

1:08:28 - 1:08:40     Text: totally fill on this kind of test cases okay so I have 10 minutes left Chris is any question

1:08:40 - 1:08:47     Text: I should answer this point here you can go on okay so in the last 10 minutes I'm going to be

1:08:47 - 1:08:52     Text: very very very with introduction of what is open to my question and what we are having

1:08:52 - 1:08:59     Text: trying to do in the last couple years so open the main question is the problem that so it

1:08:59 - 1:09:04     Text: different from reading comprehension that we don't assume a given passage so here we have

1:09:04 - 1:09:10     Text: assumption that we only have access to a large collection of weapons so one example is just

1:09:10 - 1:09:15     Text: taking the whole way you should be keeping it on which has like 5 million articles so we don't

1:09:15 - 1:09:20     Text: really know where the answer is located and the goal is to return the answer for any open

1:09:20 - 1:09:25     Text: of questions so this problem so there is an annual single passage so we have to answer questions

1:09:25 - 1:09:30     Text: against a very large collection document or even the whole web documents so this is actually

1:09:30 - 1:09:38     Text: much more challenging and also more practical problem so if you look at the example of Google

1:09:38 - 1:09:43     Text: example I showed at the beginning so this is with techniques where will be very useful in the

1:09:43 - 1:09:51     Text: practical applications so the time here open domain is just in contrast to closed domains that

1:09:51 - 1:10:01     Text: deal with questions under specific domain here okay so how can we solve this type problem

1:10:03 - 1:10:06     Text: because for the reading comprehension problem we just need to answer questions based on

1:10:06 - 1:10:12     Text: single passage so this is a paper that I wrote in 2017 four years now so the paper is called

1:10:12 - 1:10:19     Text: reading Wikipedia to answer open domain questions and system called up to Q8 so the paper

1:10:19 - 1:10:25     Text: basically proposes the idea that we can actually solve this problem by using a retrieval and also

1:10:25 - 1:10:32     Text: read a framework so idea is that let's take a question okay so here all we go is trying to answer

1:10:32 - 1:10:37     Text: questions using like a very large collection document such as the Wikipedia so the idea is that

1:10:37 - 1:10:44     Text: there's a retrieval and also read a component so the retrieval takes in the question and I try to

1:10:44 - 1:10:49     Text: find out like a smaller number of the our documents that to be relevant to this question and this

1:10:49 - 1:10:55     Text: reading model basically trying to read through all the documents that this retrieval return and the

1:10:55 - 1:11:03     Text: fact try to find out the correct answers so formally defined here is that you put a large collection

1:11:03 - 1:11:10     Text: documents D and the question Q and the output could be our answer stream A so we had just

1:11:10 - 1:11:15     Text: decomposed this problem into as I just mentioned in our retrieval and the reader component so the

1:11:15 - 1:11:21     Text: retrieval is basically trying to take a large collection document D and Q and try to return

1:11:21 - 1:11:26     Text: a set document or set of passages so here the set this number K could be very small

1:11:26 - 1:11:35     Text: could be very small such as like one of just like a 100 so it's basically trying to pull out

1:11:35 - 1:11:41     Text: the found out like 100 passages of documents from like let's say five million documents

1:11:41 - 1:11:47     Text: and the finally the reader is basically takes a question and takes this set of the passages and

1:11:47 - 1:11:52     Text: finally finally returns the answer so the second problem exactly the reading component

1:11:52 - 1:12:00     Text: model that we just learned so in the just 70 paper result so it's actually doing a very simple

1:12:00 - 1:12:05     Text: thing so the retrieval is just a standard a new formation retrieval model the sparse

1:12:06 - 1:12:11     Text: pfid information retrieval sparse model and the real model essentially just a new already

1:12:11 - 1:12:17     Text: comprehend the model I just talked about so it's very trying to solve and some other questions

1:12:17 - 1:12:23     Text: during the process so this is the drill it's the idea is very simple but trying to bridge two things

1:12:23 - 1:12:28     Text: how to how to have bridge this retrieval and also the reader to do this kind of open domain question

1:12:28 - 1:12:38     Text: history so so I'm just going to quickly go over some really exciting ideas that that has been

1:12:38 - 1:12:45     Text: having in the last two years basically so the first idea is that this retrieval part can be

1:12:45 - 1:12:50     Text: also trained so we can actually even do this kind of drawing the training of the retrieval and the

1:12:50 - 1:12:57     Text: reader so here is actually so this this idea has been first proposed in Cantonese paper in 2019

1:12:57 - 1:13:04     Text: called later in the retrieval for weekly supervised open domain questions so this part is basically

1:13:04 - 1:13:11     Text: the first model for reading comprehension and this not how it's based in the retrieval model

1:13:11 - 1:13:16     Text: so to get this in retrieval model working they also try to use the birth to you call the passage

1:13:16 - 1:13:19     Text: and also you call the question and they try to use a birth product between the question

1:13:19 - 1:13:27     Text: station the passage repetition to model how the relevance the similarity between the question

1:13:27 - 1:13:33     Text: the passage but this is actually a very difficult problem because the scalar scalability of this

1:13:33 - 1:13:38     Text: problem because there are like 20 million passages in a Wikipedia so it's actually very hard to model

1:13:38 - 1:13:47     Text: this part but so I encourage you to check out this paper and also on second paper I want to quickly

1:13:47 - 1:13:55     Text: mention is also work ideas on last year it's called the best pass of the retrieval so the idea is

1:13:55 - 1:14:01     Text: actually very similar to the the the previous paper because the idea is that is actually much more

1:14:01 - 1:14:06     Text: simply by model and very easy very simple straightforward approach the idea is that we can also

1:14:06 - 1:14:12     Text: really just trend the retrieval part by using two birth models using only the question answer pairs

1:14:12 - 1:14:19     Text: and this model can work really well and you can largely all form the traditional IR retrieval models

1:14:19 - 1:14:27     Text: if you see this figure so the blue curve here is a traditional IR approach like a BM25 approach

1:14:27 - 1:14:32     Text: and the so the other curve the orange curve based training this kind of retrieval is only

1:14:32 - 1:14:37     Text: 1000 question answer pairs so by looking at all these different curves basically using different

1:14:37 - 1:14:44     Text: number of training examples so it's actually largely crazy from the traditional IR models

1:14:49 - 1:14:53     Text: okay so again really I don't have time to talk about the details of all these approaches

1:14:53 - 1:15:01     Text: so I just encourage you to check out this paper this paper is nice and this result is really exciting

1:15:02 - 1:15:08     Text: so here's actually a really nice demo so the demo is actually hosted at this website you

1:15:08 - 1:15:15     Text: can check out so again so the database here is a whole Wikipedia you can see that if you ask a question

1:15:15 - 1:15:21     Text: who tells higher portals that he is a wizard and the higher you know higher portals series and

1:15:21 - 1:15:26     Text: the system I really found out the correct article should be higher portals film series and finally give

1:15:26 - 1:15:33     Text: you the correct answer which is exactly what you have seen from the Google example here so the answer

1:15:33 - 1:15:39     Text: could be the rubers hybrid which is actually the person who tells higher portals that he is a wizard

1:15:39 - 1:15:47     Text: so this is actually the perfect answer to this question okay I'm going to skip this slide

1:15:47 - 1:15:56     Text: and the final is very quick so so this is something that can out very recently that some researchers

1:15:56 - 1:16:02     Text: have demonstrated that maybe you don't even need to do this a retrieval study so you can if you

1:16:02 - 1:16:06     Text: just use a very large language model you can also just do the open domain question answering

1:16:07 - 1:16:12     Text: so the way they did this is that I hope that you have learned the TFI model in this class already

1:16:12 - 1:16:18     Text: so they just take a prediction language model TFI and they're trying to find to this model by taking

1:16:18 - 1:16:25     Text: the question and taking the question as an answer as output without any explicit retrieval and they

1:16:25 - 1:16:31     Text: just find to this on the data set and they find this model can be pretty well at the testing time by

1:16:31 - 1:16:37     Text: just taking the question and the directly generous answer without resorting to any like documents or

1:16:37 - 1:16:42     Text: like a retrieval system so this is actually very amazing so this kind of model is also called

1:16:43 - 1:16:53     Text: close book Q-assistence okay very long the last life so so this is one direction and personally

1:16:53 - 1:16:59     Text: I'm very excited about so this is actually a new direction that it basically shows that maybe

1:16:59 - 1:17:04     Text: for the open domain question answering maybe this rhythm model is also not necessary anymore

1:17:04 - 1:17:11     Text: so so this idea was first proposed by a museum in 2019 and we recently wrote a paper called dense

1:17:11 - 1:17:18     Text: phrases that try to demonstrate that maybe it doesn't even need this like a rhythm model

1:17:18 - 1:17:25     Text: so instead we can just you you code all the phrases in Wikipedia using some kind of dense letters

1:17:25 - 1:17:30     Text: so what you just need to do is just to do this kind of nearest neighbor search in the answer space

1:17:30 - 1:17:36     Text: you just encode all encode all the phrases in Wikipedia encodes and using vectors and by taking a

1:17:36 - 1:17:41     Text: question you can just encode this question using a vector and then we can just do the vector the

1:17:41 - 1:17:47     Text: nearest neighbor search and then you can directly give you the answer so this is a bit of a new

1:17:47 - 1:17:52     Text: paradigm of this kind of the question answer model so you don't need the you just need a retrieval

1:17:52 - 1:17:58     Text: you don't need a rhythm so good a great advantage for doing this is that so for the perfect

1:17:58 - 1:18:02     Text: rhythm model essentially you have to run a very model at the entrance time this is actually very

1:18:02 - 1:18:09     Text: expensive you can just do the similarity search you can just do the nearest neighbor search

1:18:09 - 1:18:14     Text: without running a burden model so this could be very fast and it can even run on the CPUs

1:18:14 - 1:18:21     Text: without needing to run like a very expensive different neural network and it can still run

1:18:21 - 1:18:29     Text: very well perfect very well okay finally I hope this works so I actually prepared a

1:18:29 - 1:18:33     Text: memo for this dance versus so I want to show you how this actually works

1:18:43 - 1:18:48     Text: so you can see that I've been trying this question like who on the not no bell prize EPs

1:18:48 - 1:18:54     Text: into 2014 so everything just tied for little piece of the input question and all this system

1:18:54 - 1:19:00     Text: can basically just to find out the answer the relevant test test it is and the family's answer

1:19:00 - 1:19:05     Text: is actually it shows up it's actually very fast because it's a bit of real time we don't we don't

1:19:05 - 1:19:13     Text: do wrong in the version model so it's just a ritual model here okay I'm actually done this is

1:19:13 - 1:19:21     Text: lecture so they are you're 515 now yeah thank you very much Dan chief that awesome survey of

1:19:22 - 1:19:26     Text: question answering I guess given that demo at the end people will want to know whether you're

1:19:26 - 1:19:35     Text: launching your own search engine soon but at any rate Dan chief can stay for a bit to answer

1:19:35 - 1:19:42     Text: questions but not forever but today because of you know she doesn't have a standard login

1:19:42 - 1:19:50     Text: we're going to do questions inside zoom so if you'd like to ask a question if you use the raise hand

1:19:50 - 1:19:57     Text: button we can promote you so that you appear in the regular zoom window and can just ask questions

1:19:57 - 1:20:05     Text: and see each other and if you hang around and don't leave the zoom for more than a few minutes maybe

1:20:05 - 1:20:13     Text: we'll just promote everybody who's still there into people in the regular zoom for some bits of

1:20:13 - 1:20:20     Text: discussion but we'd welcome anyone who'd like to ask a question by asking it themselves at this point

1:20:20 - 1:20:32     Text: okay I've got a one volunteer I've got more volunteers

1:20:35 - 1:20:41     Text: sure I mean so questions oh sure look at a chev or I mean so there are now four people who've

1:20:41 - 1:20:51     Text: been promoted there four people was the first so maybe he could start by asking a question and then

1:20:51 - 1:20:59     Text: the other people that we've promoted okay so thank you so much for the lecture today my question

1:20:59 - 1:21:06     Text: is mainly like if you use like a model mate for example Burke how a small kind of training

1:21:06 - 1:21:15     Text: dataset be really to get like reasonable results so the question is how we can try to

1:21:15 - 1:21:23     Text: recover a single model using only a small number of training samples yeah I think it's a really

1:21:23 - 1:21:31     Text: good question especially like you probably have heard the GPT stream model side you show that

1:21:31 - 1:21:35     Text: if you only use like a few very few examples you can also do the open-to-one question answering

1:21:35 - 1:21:45     Text: pretty well so but this kind of model is huge like what numbers like how many parameters I've

1:21:45 - 1:21:50     Text: got in the GPT stream model yeah so it's a very large very few model but okay so this

1:21:50 - 1:21:57     Text: amount is that if we can leverage a very large and very powerful precision-longed model there is a

1:21:57 - 1:22:03     Text: way that we are there is a possibility that we can actually do the question stream well we

1:22:03 - 1:22:08     Text: only have small number examples and also there are some other promising directions including like

1:22:08 - 1:22:16     Text: on supervised passion three so by using some kind of approach like the from the machine on supervised

1:22:16 - 1:22:22     Text: machine translation this kind of idea that can be borrowed and by yeah the kind of borrow ideas

1:22:24 - 1:22:29     Text: can also work pretty well reasonable reasonably well in on supervised passion three

1:22:29 - 1:22:38     Text: yeah also I have seen some of our other works like very pretty showing that since that you clearly

1:22:38 - 1:22:43     Text: assess how it's also quite helpful not imposing the performance if you don't have enough

1:22:43 - 1:22:51     Text: supervised data sets so nice examples yeah so my question is it's I guess it's kind of interesting

1:22:51 - 1:22:58     Text: that there's not really that strong of a transfer effect between data sets that are kind of

1:22:58 - 1:23:06     Text: ostensibly similar so my question is like has there been any research done on how close I

1:23:06 - 1:23:13     Text: guess like the formatting and the semantic content of these question answering data sets actually

1:23:13 - 1:23:21     Text: adheres to the data that like BERT is pre-trained on and if so like has there been sort of any effect

1:23:21 - 1:23:30     Text: found between those similarities or differences I use a question asking like there has been like

1:23:30 - 1:23:37     Text: a stonk cap okay maybe I can just try to clarify it but why the current models can already

1:23:37 - 1:23:44     Text: generalize well from one data set from the data set yeah so I actually really believe that

1:23:44 - 1:23:49     Text: most existing question stream data set already comprehension data set have been collected

1:23:49 - 1:23:56     Text: from the kind of perk so it's very hard it's very difficult to avoid some kind of artifact or

1:23:56 - 1:24:03     Text: like a simple clue or super visual clue that is not super visual but some simple clue that

1:24:03 - 1:24:09     Text: for the machines to pick up so for let's take those photos example set so it has to be that

1:24:09 - 1:24:14     Text: actually if you look at the data set more closely there has been a lot of examples that the

1:24:14 - 1:24:19     Text: question had been like a lot overlap in terms of the words between the question and the passage so

1:24:19 - 1:24:27     Text: the model is actually very good at picking up this kind of clues to get very high performance on

1:24:27 - 1:24:34     Text: this data set and another data set is called job so it's basically about comparison the two numbers

1:24:34 - 1:24:40     Text: something like that so that's the reason that one specialized model that has been very well in

1:24:40 - 1:24:46     Text: one one data set is very easy to pick up this kind of clues then there is a very hard to generalize

1:24:46 - 1:24:51     Text: this kind of thing to another dataset what about the natural questions data set doesn't

1:24:51 - 1:24:59     Text: better avoid that objection yeah natural questions would be much better but there are some other

1:24:59 - 1:25:04     Text: issues I'm not sure you have seen that there are the recent paper called like a question

1:25:04 - 1:25:11     Text: or a trend passed overlap paper so that means it demonstrate natural questions was a

1:25:11 - 1:25:17     Text: data set that Google put out about a year and a half ago maybe where they were actually taking

1:25:17 - 1:25:25     Text: real questions from Google search logs and then finding answers trying to find answers for them

1:25:25 - 1:25:33     Text: in web documents sorry go on Dachi oh I just want to see yeah I think the definite natural

1:25:33 - 1:25:37     Text: questions is on much better data set because the questions are natural like you're collected

1:25:37 - 1:25:43     Text: are real like real questions that are asking by like users so it kind of avoid this kind of

1:25:43 - 1:25:48     Text: are super fish of the artifact between the question the passage but there are some other issues

1:25:48 - 1:25:56     Text: that people like to ask some common questions so if you just do the Reynolds bait of questions

1:25:56 - 1:26:01     Text: you do trend that in test and there's a recent paper that's showing that there is actually a

1:26:01 - 1:26:09     Text: big model is inevitable that there is a high overlap between the trends that so if you find

1:26:09 - 1:26:17     Text: the question if one question that you're trying to test in the test set that has already appeared

1:26:17 - 1:26:23     Text: in the trends that that's a really generalization right yeah but this is more also like all

1:26:23 - 1:26:29     Text: open domain settings not in the reading comprehension set yeah yeah um do you want to ask a question

1:26:29 - 1:26:37     Text: yes so you mentioned that in the last part of the presentations that the read of models may not

1:26:37 - 1:26:43     Text: be necessary and you presented the answers please kind of also work well to use so

1:26:45 - 1:26:52     Text: do we know how how it performs on the question and answering data sets and compared to other

1:26:52 - 1:26:59     Text: other models including bread as well as other and some GPU of course yeah I just encourage you to

1:26:59 - 1:27:04     Text: check out this paper so this model is basically performs on par with the like the dense path

1:27:04 - 1:27:12     Text: retrieval retrieval model so it is performs on par with all the retrieval with the models but it

1:27:12 - 1:27:19     Text: is actually Reynolds so I skipped one slide so so right now the saved art is actually dominating by

1:27:19 - 1:27:26     Text: just kind of dense path retrieval as a generating model so this kind of so using a T5 model

1:27:26 - 1:27:31     Text: class that's a retrieval this is actually performed really well so I would just say so this is

1:27:31 - 1:27:37     Text: can work with a similar in this block but compared to this kind of generous model we still like

1:27:37 - 1:27:46     Text: have two points behind yeah okay and what is the kind of the uh intuition behind the test phrases

1:27:47 - 1:27:52     Text: upperformed like the answers are probably on the close proximity and what if the data sets

1:27:52 - 1:28:01     Text: has answers and has answers to a specific question like very far from the actual information

1:28:04 - 1:28:09     Text: let's see the answers to the question may may not be may not reside in close proximity to the

1:28:10 - 1:28:19     Text: to the words in the question so let me just clarify this okay it's a goal of this project is trying

1:28:19 - 1:28:27     Text: to um index all the phrases in the Wikipedia so and by and the these kind of expressions are

1:28:27 - 1:28:34     Text: built using the training set of the questions in data sets so the assumption still the distribution

1:28:34 - 1:28:38     Text: of the examples in the different test set will be similar to the Chinese set for sure

1:28:40 - 1:28:44     Text: that is that is also a question like so basically we still trying to consider all the phrases in

1:28:44 - 1:28:49     Text: the Wikipedia and that test now we just take the question of vision and I would compute the thought

1:28:49 - 1:28:57     Text: okay so if we use say a different data sets that does not present the information using a

1:28:57 - 1:29:02     Text: structure presented in Wikipedia then this model may not work as well as

1:29:03 - 1:29:09     Text: for what do you what do you mean by structure you present so uh say if we um

1:29:09 - 1:29:17     Text: lean more towards uh lean more towards uh structures like the passages we see in standardized tests

1:29:17 - 1:29:26     Text: where the answers to the question may not be like um may not be close proximity to where the

1:29:26 - 1:29:33     Text: information was first introduced oh no so the the the answers doesn't have to be seen or Chinese

1:29:33 - 1:29:39     Text: stuff so basically it's a goal is to take in the Chinese set channel in code for the phrases and

1:29:39 - 1:29:45     Text: by using and then we apply this in code to all the free all the phrases all the a lot of like

1:29:45 - 1:29:51     Text: six billion phrases in this video so it so the model is definitely able to generalize from the

1:29:52 - 1:29:57     Text: Chinese set to all the phrases in this video so it doesn't have to be seen in the things that

1:29:57 - 1:30:05     Text: that is this what you're asking

1:30:08 - 1:30:13     Text: this actually very so it's actually similar to like the retrieval of the best passive retrieval

1:30:13 - 1:30:19     Text: so you still like um yeah try to channel pass irritation here is the first representation

1:30:20 - 1:30:25     Text: but the the revisions only try to use the the Chinese set of the cross answered data sets

1:30:25 - 1:30:32     Text: but um by taking the encoder and then we are going to encode uh all the rotations all the passages

1:30:32 - 1:30:38     Text: of phrases in Wikipedia and then we can um use practice this rotation can actually generalize

1:30:38 - 1:30:47     Text: well for the unseen questions yeah so uh so the question is um what if the nearest neighbor

1:30:47 - 1:30:55     Text: search doesn't return to answer so why do you think the nearest neighbor I mean you always

1:30:55 - 1:30:59     Text: can find something right you just a question is that whether it's a question is not not

1:31:00 - 1:31:05     Text: yes so the question is what if in the data says that the answer is not close enough then

1:31:06 - 1:31:13     Text: um yeah this good question I don't know uh if you really come up with something that is really

1:31:13 - 1:31:18     Text: very far away from all the questions that we have been seeing the Chinese set that could be

1:31:18 - 1:31:26     Text: possible I don't know basically depend on um uh found the text or um formatted

1:31:27 - 1:31:32     Text: then uh the nearest neighbor search may not uh work as well as the models

1:31:33 - 1:31:38     Text: so again the question is also the question is also returned by a question encoder

1:31:39 - 1:31:44     Text: so so the question is the whether this question encoder can give you something reasonable

1:31:44 - 1:31:51     Text: that space or not but uh yeah so we have been testing along like a random even the input

1:31:51 - 1:31:56     Text: sentences or even like the question then I have to go real question could be a sentence

1:31:56 - 1:32:02     Text: it doesn't seem to be a problem so far yeah maybe maybe maybe we should give a couple of other people

1:32:02 - 1:32:08     Text: ago and you're allowed to turn your camera on and ask a question if you want um so um next person is

1:32:08 - 1:32:18     Text: all right hi uh thank you for taking the time to meet us um my my question is kind of quick

1:32:18 - 1:32:24     Text: so you mentioned work that brought up a set of relatively simple questions that show how brittle

1:32:24 - 1:32:33     Text: or poor the current models can be right I'm curious if that's right yeah yeah exactly exactly did

1:32:33 - 1:32:39     Text: that turn out to change the community to improve how to evaluate the models because

1:32:40 - 1:32:46     Text: they're actually doing pretty poorly on some of those right yeah so first these questions are

1:32:46 - 1:32:53     Text: simple in uh in terms of the the wording is very simple the for the template is very simple

1:32:53 - 1:32:58     Text: but they still trying to test like a negational temporal relational forever so the questions are not

1:32:58 - 1:33:02     Text: the I mean in terms of the reasoning of the code of plastic is not that simple it's just the wording

1:33:02 - 1:33:09     Text: very simple um I do think um okay so this paper that we receive a lot of attention you the best

1:33:09 - 1:33:15     Text: paper last year and I see all the biggest conference um so I think a lot of people are trying to

1:33:15 - 1:33:21     Text: solve the problem I cannot tell you that okay whether we really have a solution to this yet or not

1:33:21 - 1:33:29     Text: yeah cool yeah thank you for bringing this one up it's really interesting okay next is

1:33:29 - 1:33:37     Text: yeah thanks for saying it at time so my question is kind of not relevant but like to view the

1:33:37 - 1:33:44     Text: robust system of question answering in what extent can in context learning how models to be more

1:33:44 - 1:33:58     Text: robust with respect to different domains oh so like uh basically you provide um template

1:33:58 - 1:34:06     Text: generated by bird and then instead of directly predicting the classes of text classifications

1:34:06 - 1:34:12     Text: it is um use some word to represent that class of enter to the word so

1:34:15 - 1:34:22     Text: okay um so I assume that you are actually referred to the in context learning in the GPS

1:34:22 - 1:34:29     Text: stream or that's the way that okay um actually I have been doing something related to the

1:34:29 - 1:34:34     Text: learning recently um questions but I'm not sure how we actually use that in context learning

1:34:34 - 1:34:42     Text: in at least in for school type of problems um yeah so I don't know if that could

1:34:42 - 1:34:47     Text: solve the robustness or not or even the whole how to use that technique for the questions

1:34:47 - 1:34:55     Text: or yet nice thanks and I also mentioned that we can train a retriever without a reader so

1:34:55 - 1:35:04     Text: is there a paper of the current like attempt to do that yeah so the library also just uh yeah

1:35:07 - 1:35:10     Text: thank you all okay next is

1:35:10 - 1:35:19     Text: hey how's it going uh thanks so much for the uh for the lecture um i put a little broader question

1:35:19 - 1:35:28     Text: um so we got the about the future of NLP um do you think that in order to solve NLP in a sense

1:35:28 - 1:35:36     Text: that you can perform on par with humans on all NLP tasks it's efficient to only interact with

1:35:36 - 1:35:41     Text: with text you know whatever do you think will eventually need some sort of with the sort of experience

1:35:41 - 1:35:47     Text: and confidence that you get um only from seeing and then sort of feeling the world and having the

1:35:47 - 1:35:54     Text: sake of interactions that we assume it's how yeah I mean conversation is definitely very

1:35:54 - 1:36:00     Text: difficult even in the context of our sensory conversation is a very yeah no very very important topic

1:36:00 - 1:36:07     Text: that um still remains I think it still remains on three so uh that's for that part the one

1:36:07 - 1:36:13     Text: six definitely yes and also want to mention that okay so for a lot of the reading conversation

1:36:13 - 1:36:18     Text: data sets or questions and data sets you have seen that we are people will start start to achieve

1:36:18 - 1:36:23     Text: the human performance but this but we also see that how great all these systems are because

1:36:24 - 1:36:29     Text: yeah I mean they cannot regenerize also the easy problems all these things need to be with a

1:36:29 - 1:36:37     Text: solved um it depends so if you think about your descent time maybe a little bit too easy

1:36:38 - 1:36:45     Text: oh yeah one point oh no sorry guys yeah one point out for what is there is that

1:36:45 - 1:36:49     Text: uh dabbing a lot of trans-versa in trying to have a few maintenance

1:36:49 - 1:36:56     Text: um a framework to evaluate this kind of system just try to break the current system come

1:36:56 - 1:37:02     Text: out with some harder questions so um yeah so that means maybe it's a kind of static

1:37:02 - 1:37:07     Text: data system of good enough to measure the progress so we actually really need some kind of dynamic

1:37:07 - 1:37:13     Text: evaleration and also introduce all more these kind of adverse examples or the um yeah harder questions

1:37:13 - 1:37:21     Text: or something like yeah hey still game for a couple more questions uh sure I don't only mention

1:37:21 - 1:37:28     Text: my 10 p.m uh yeah um only used cost yeah okay um so next is next is

1:37:30 - 1:37:38     Text: hey yeah that's a much more much to the arm so in uh just 2020 there was this efficient open domain

1:37:38 - 1:37:45     Text: question answering challenge um and from you know from performance to team what there was like

1:37:45 - 1:37:52     Text: quite substantial uh decreased versus human accuracy um probably like yeah well for

1:37:52 - 1:37:59     Text: primarily dear quantization and um processing drift that occurred uh when they were quantizing

1:37:59 - 1:38:07     Text: um so I I recently encountered this paper called uh random quantizers uh which is essentially

1:38:08 - 1:38:14     Text: learned uh learns like basis representations for the quantizers like jointly with the weight

1:38:14 - 1:38:23     Text: of the network um and while this would be like extremely effective if you were to just like say

1:38:23 - 1:38:29     Text: change of scratch I was just really curious do you think it like such a uh there's a way to do

1:38:29 - 1:38:35     Text: this with say a pre-trained model or something like that um I had a few ideas with like the

1:38:35 - 1:38:45     Text: insurance insurance bloggers uh I don't think uh I don't see a very clear way of doing them um yeah I

1:38:45 - 1:38:51     Text: don't think I'm really experts to answer those questions um yeah I'm not sure if I really have

1:38:51 - 1:38:55     Text: the answer but I also want to give you mentions that yeah quantization has been very useful

1:38:55 - 1:39:00     Text: technique to make the model smaller right so we have been also exploring the quantization

1:39:00 - 1:39:07     Text: in the dense forest's project recently because of the storage has been still very uh has been

1:39:07 - 1:39:14     Text: still very large so we are having trying to reduce that storage um yeah I'm not sure uh

1:39:14 - 1:39:19     Text: about the question about the connection between conversation and also creating uh yeah I'm not

1:39:19 - 1:39:24     Text: sure I'm I'm I have also to ask you sorry thank you

1:39:24 - 1:39:31     Text: that don't you is too um modest to mention that she was one of the co-organizers of the efficient QA

1:39:31 - 1:39:34     Text: um share task um okay next question is

1:39:40 - 1:39:46     Text: I think she thinks so much for being here today um so my question is a bit different um so one

1:39:46 - 1:39:52     Text: example you gave the competition was this Alex Victoria example the checklist um and I was

1:39:52 - 1:39:57     Text: like in technically Alex was an evolved answer right it's gender neutral and there wasn't enough

1:39:57 - 1:40:04     Text: context context in the question to determine who it's referring to so my question is how concerned

1:40:04 - 1:40:11     Text: should we be about potentially including uh sort of biases into these or go labels or how we

1:40:11 - 1:40:20     Text: evaluate them or is that just more of a concern for more open and new questions um yeah this is

1:40:20 - 1:40:25     Text: definitely a very important again a lot of people are trying to study okay how much bias have been

1:40:25 - 1:40:32     Text: coding this model time how we can yeah um yeah I'm not sure if I'm good on so to do that um

1:40:34 - 1:40:38     Text: again I like it just I want to say like talk to do the debiasing of the pre-tronement models all

1:40:38 - 1:40:45     Text: these things are very important and um yeah this is just one so you're talking about this example

1:40:45 - 1:40:55     Text: right so this is just one test case um yeah um yeah yeah right yeah so I guess I'm just

1:40:55 - 1:41:06     Text: wondering who comes up with a test case is we will have more discussion of toxicity and bias

1:41:06 - 1:41:13     Text: coming up very soon including actually first-day lecture as well as a later lecture um not specifically

1:41:13 - 1:41:23     Text: about QA though um okay next person is uh right thank you for the lecture um yeah and my question

1:41:23 - 1:41:31     Text: is also related to the open domain question answering so um I was just wondering how much of like

1:41:32 - 1:41:40     Text: the learning side of um domain like sort of authorization or like domain alignments um

1:41:40 - 1:41:49     Text: techniques can be combined with like the language level like question answering like to what

1:41:49 - 1:41:55     Text: extends where they work and like what kind of like the language specific design should be

1:41:55 - 1:42:02     Text: leveraged to combine with those two two different um it's if if we want like higher performance

1:42:02 - 1:42:08     Text: and stuff like that the question about how to generalize between different domains or like

1:42:08 - 1:42:13     Text: all about how to extend open domain to assist them for different languages I'm watching

1:42:13 - 1:42:21     Text: together yeah great so I was wondering so um so there's like um some some like many specific designs

1:42:21 - 1:42:29     Text: like um um uh domain alignments and like uh future level disentanglement techniques uh that have

1:42:29 - 1:42:38     Text: been that has shown some like like interesting performance and um other tasks um it may solve that

1:42:38 - 1:42:44     Text: like recently some people also like leverage some other things um like for for for question answering

1:42:44 - 1:42:52     Text: so I was just wondering um like to what extent these kind of techniques come work on um like

1:42:52 - 1:42:59     Text: uh what in group tasks modules limited question answering but like um mainly uh for question answering

1:43:00 - 1:43:05     Text: Sorry, which work are you talking about? I'm not sure what you mean by this

1:43:05 - 1:43:11     Text: disentanglement for question answering. Right, so, basically, I believe this is a

1:43:11 - 1:43:19     Text: little bit more specific, so there is this paper called, um,

1:43:19 - 1:43:27     Text: I forgot the exact name, it's, like, sorry, aligning two domains by disentanglement, something like that.

1:43:35 - 1:43:38     Text: Okay, so... okay, I just want to make sure that we are on the same page. So there is this line

1:43:38 - 1:43:44     Text: of work that tries to learn some kind of disentangled representation so it can better generalize

1:43:44 - 1:43:51     Text: to different domains or adversarial examples, is that what you are saying? Yeah, yeah. And the question

1:43:51 - 1:43:57     Text: is whether this kind of technique can generally apply to question answering? Oh yeah,

1:43:57 - 1:44:04     Text: you're just wondering, um, to what extent would they work, because I think language has, like,

1:44:05 - 1:44:10     Text: a lot of, like, specific things, like dependencies and other structure, that, like, these techniques

1:44:10 - 1:44:22     Text: don't actually take care of, across languages. Um, yeah. Yeah, I'm not sure, I think

1:44:22 - 1:44:28     Text: we would have to try that, that's an interesting point. Yeah, I don't know, at least for the work that

1:44:28 - 1:44:34     Text: I have seen, though, in fact all of them operated on a very simple sentence classification

1:44:34 - 1:44:41     Text: task, maybe I'm wrong, maybe that's not correct, but my understanding is that they basically take a

1:44:41 - 1:44:47     Text: pretrained encoder, apply it to a simple text classification task, take the representation, and do some kind of

1:44:47 - 1:44:53     Text: disentanglement or transformation, and make sure that, yeah, it can learn some kind of invariant features

1:44:53 - 1:45:01     Text: in the hidden representation, something like that. Yeah, cool. Yeah, I'm not sure. I feel like

1:45:01 - 1:45:10     Text: QA is a more structured task and also involves kind of longer sequences, um, yeah, so I don't know

1:45:10 - 1:45:17     Text: if it works unless people have tried that. Yeah, thank you. Thank you. Okay, and then we've got...

1:45:17 - 1:45:23     Text: and maybe we should call this the last question. I'm just wondering what is, like, the

1:45:23 - 1:45:32     Text: fundamental difference between solving question answering with, um, generative models like T5 versus encoders

1:45:32 - 1:45:42     Text: like BERT. Okay, that's a good point. Okay, so I skipped this slide: why does this model work so

1:45:42 - 1:45:48     Text: well? The reason is actually, it's not really about extractive model versus generative model. The

1:45:48 - 1:45:55     Text: reason is that, for the extractive model, if the retriever returns, let's say,

1:45:55 - 1:46:02     Text: 100 passages, they have to extract an answer from each of the passages and finally figure out which

1:46:02 - 1:46:07     Text: one has the highest score. But for the generative model, essentially they're trying to aggregate

1:46:07 - 1:46:14     Text: all the 100 passages and their representations together and do the generation jointly. Do you

1:46:14 - 1:46:20     Text: understand? So essentially, taking the 100 representations together and doing joint generation, instead of only

1:46:20 - 1:46:26     Text: doing the extraction from each of the passages. So I think that's actually the key difference. So that's

1:46:26 - 1:46:33     Text: why this generative model can do really well compared to the extractive models. I'll also

1:46:33 - 1:46:40     Text: mention that, okay, so if you look at this RAG model, it's actually, like, compare this DPR

1:46:40 - 1:46:45     Text: and RAG model: the RAG model is also doing generation, but they're not doing this kind

1:46:45 - 1:46:51     Text: of aggregation, they're just trying to take a single passage and do the generation. So the RAG model

1:46:51 - 1:46:56     Text: actually doesn't perform as well as this model. But I'll also mention the RAG model is actually not

1:46:56 - 1:47:02     Text: doing better than DPR, because this is a base model and this is a large model, so these numbers are a little

1:47:02 - 1:47:06     Text: bit confusing; they're actually basically on par, they basically perform similarly.

1:47:06 - 1:47:11     Text: But, so the key difference between the generative model and the extractive model is that for

1:47:11 - 1:47:17     Text: generative models you can actually leverage more input passages together and do the generation.

1:47:17 - 1:47:28     Text: Does that make sense, is that clear or not? Yes, thanks. Yeah, you can also just check out this
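[Editor's note: below is a minimal sketch of this aggregation idea, assuming the Hugging Face transformers T5 API; it is not the original Fusion-in-Decoder implementation, and the encoder_outputs-based generate call is an assumption about that API. Each (question, passage) pair is encoded separately, the encodings are concatenated, and the decoder generates one answer while attending over all passages at once, whereas an extractive reader would score spans in each passage independently and take the highest-scoring one.]

    import torch
    from transformers import T5Tokenizer, T5ForConditionalGeneration
    from transformers.modeling_outputs import BaseModelOutput

    tokenizer = T5Tokenizer.from_pretrained("t5-small")
    model = T5ForConditionalGeneration.from_pretrained("t5-small")

    question = "who wrote the play Hamlet?"
    passages = [
        "Hamlet is a tragedy written by William Shakespeare around 1600.",
        "The Globe Theatre staged many of Shakespeare's plays in London.",
    ]

    # Encode each (question, passage) pair separately, Fusion-in-Decoder style.
    inputs = tokenizer(
        [f"question: {question} context: {p}" for p in passages],
        padding=True, truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        enc = model.encoder(input_ids=inputs.input_ids,
                            attention_mask=inputs.attention_mask)

        # Fuse: flatten the per-passage encodings into one long sequence so the
        # decoder can attend over all passages jointly during generation.
        n_passages, seq_len, dim = enc.last_hidden_state.shape
        fused = enc.last_hidden_state.reshape(1, n_passages * seq_len, dim)
        fused_mask = inputs.attention_mask.reshape(1, n_passages * seq_len)

        answer_ids = model.generate(
            encoder_outputs=BaseModelOutput(last_hidden_state=fused),
            attention_mask=fused_mask,
            max_new_tokens=16,
        )
    # An off-the-shelf t5-small will not answer well; FiD fine-tunes T5 for this setup.
    print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))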

1:47:28 - 1:47:33     Text: paper. Yeah, so this paper actually asks the question of why this model works better than the

1:47:33 - 1:47:41     Text: previous generative models. Actually, a follow-up: could you force the generative models to, like, in a

1:47:42 - 1:47:49     Text: simple fashion, you know, when you have the documents, could you force the generative models to

1:47:49 - 1:47:59     Text: generate the span? Wait, the... yeah, you can definitely do that here. Oh, so you are talking about

1:47:59 - 1:48:06     Text: this one, not this one? Yeah, so I'm just wondering about, like, if you use encoders, it's like

1:48:06 - 1:48:12     Text: you're finding similarities between query encodings, and then with generative models, are you,

1:48:12 - 1:48:20     Text: like, remembering the whole question and trying to retrieve the answer in a way?

1:48:20 - 1:48:31     Text: Okay, so for this model there isn't any retrieval, so you can't really find the answer from the

1:48:31 - 1:48:36     Text: question, right? So this model really has to rely on all the parameters to remember

1:48:36 - 1:48:41     Text: all the information. So by just taking the question, the task is to just rely on the parameters

1:48:41 - 1:48:47     Text: to infer this answer, so it's actually very hard. So it's, yeah, it's definitely a balance between,

1:48:47 - 1:48:57     Text: yeah, it's a memorization and a generalization problem. Yeah. I see. So I'm just going to say what I

1:48:57 - 1:49:02     Text: think it is: when you've got this question, you're embedding it in some space, and then,

1:49:03 - 1:49:12     Text: using that, the generator matches it to what it has memorized, is that what is going on?

1:49:12 - 1:49:20     Text: Yeah, exactly, yes. The model has to, like... it's very large, like 11 billion parameters, so all these

1:49:20 - 1:49:25     Text: parameters, it's trying to, yeah, memorize a lot of information, because the

1:49:25 - 1:49:31     Text: model has been pretrained on the text and also has been fine-tuned, so the model has been trying to

1:49:31 - 1:49:38     Text: memorize a lot of information about the text here.
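[Editor's note: for contrast, a minimal sketch of this closed-book setting, assuming the Hugging Face transformers API and one of the publicly released T5 "ssm-nq" checkpoints; the checkpoint name is an assumption about what is available on the model hub. The input is the question alone, with no retrieved passages, so any correct answer has to come out of the parameters.]

    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Closed-book QA: no retriever, no passages; the model must rely on what it
    # memorized during pre-training and fine-tuning.
    name = "google/t5-small-ssm-nq"  # assumed checkpoint name on the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)

    inputs = tokenizer("when was Princeton University founded?", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))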

1:49:38 - 1:49:48     Text: Do you want to call it a day, or do you want one more question? Either way, yeah. Sorry... okay, let's take one more question.

1:49:48 - 1:49:57     Text: Okay, let me just do one more question then. Okay, the first question is about how well do

1:49:57 - 1:50:04     Text: these techniques generalize to other languages, like, say, languages that are quite different,

1:50:04 - 1:50:10     Text: or with quite different grammatical rules, like Chinese, Japanese, or Arabic, or some other languages.

1:50:11 - 1:50:18     Text: And the second question, maybe not exactly your domain expertise: there's a lot of

1:50:18 - 1:50:25     Text: interest in modeling user behavior, say online searching behavior, browsing behavior, as a sequence,

1:50:25 - 1:50:31     Text: using, say, transformers, self-attention, and then you can use that to, like, embed a user or

1:50:31 - 1:50:37     Text: a session and then predict the user's actions. How promising do you think that

1:50:37 - 1:50:43     Text: would be? I know this may not be your domain expertise, but there is a lot of interest in extending

1:50:43 - 1:50:49     Text: these question answering techniques, or just encoding techniques, embedding techniques, to recommender

1:50:49 - 1:50:59     Text: systems; I just want to get your thoughts on that. Okay, um, the first question is whether

1:50:59 - 1:51:05     Text: these techniques can be generalized to other languages. I think the answer is yes, and there has

1:51:05 - 1:51:11     Text: been a lot of active research in this direction, but there are also concerns that, as a lot of the models

1:51:11 - 1:51:18     Text: and systems I described here, a lot of them, require a very strong pretrained

1:51:18 - 1:51:25     Text: language model and also require lots of training examples for the QA task, that would actually be,

1:51:25 - 1:51:33     Text: I would say, a bottleneck for many low-resource languages, right? So it's very hard to collect so

1:51:33 - 1:51:39     Text: many examples for other languages as we have. But I do think the techniques can be

1:51:39 - 1:51:44     Text: generally applied to other languages, and there has also been a lot of work trying to do

1:51:44 - 1:51:58     Text: cross-lingual question answering. So, that's it there.