Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 8 - Final Projects: Practical Tips

0:00:00 - 0:00:14     Text: Okay, so now we'd better get going with the lecture. Okay, so we're now at lecture eight,

0:00:14 - 0:00:21     Text: second half of week four. So the agenda today is that I'll first of all finish off

0:00:21 - 0:00:25     Text: the last pieces of attention that I didn't talk about last time.

0:00:25 - 0:00:31     Text: But the main topic today is to talk about final projects, and that consists of a

0:00:31 - 0:00:37     Text: sort of grab bag of different things. I'll talk about the final projects and finding research topics

0:00:37 - 0:00:43     Text: and finding data and doing research. I'll give a very brief introduction to reading

0:00:43 - 0:00:47     Text: comprehension and question answering, which is our default final project.

0:00:47 - 0:00:51     Text: But we get a whole lecture on that at the beginning of week six.

0:00:51 - 0:00:57     Text: So this is just to give you a little bit of what it's about and if you're thinking about

0:00:57 - 0:01:05     Text: doing the default final project. In some sense, this lecture is also a rare chance in this course

0:01:05 - 0:01:10     Text: to pause and take stock, because, you know, really what's been happening up until now

0:01:10 - 0:01:14     Text: has been having the fire hose going full steam ahead,

0:01:14 - 0:01:21     Text: spraying you with new facts and approaches, algorithms and models and linguistic things.

0:01:21 - 0:01:26     Text: So this is a brief respite from that. So if you have any questions,

0:01:26 - 0:01:31     Text: that you've been wondering about for weeks, today's lecture could be a good time to ask them

0:01:31 - 0:01:37     Text: because after today's lecture in week five, we'll turn the fire hose right back on

0:01:37 - 0:01:43     Text: and we'll have a lot of new information about transformers and large pre-trained

0:01:43 - 0:01:48     Text: language models, which have become a huge part of modern neural NLP,

0:01:48 - 0:01:54     Text: as I've already mentioned a little bit earlier in this class.

0:01:54 - 0:01:58     Text: Okay, so this is where we left things last time, basically.

0:01:58 - 0:02:05     Text: So I sort of talked through the rough idea of what we're going to do for this new

0:02:05 - 0:02:10     Text: attention model, which is that we are going to use the encoder just as before.

0:02:10 - 0:02:16     Text: And then once we're running the decoder at each time step,

0:02:16 - 0:02:22     Text: we're going to compute a new hidden representation using the same kind of sequence model

0:02:22 - 0:02:28     Text: as we had before. But now we're going to use that hidden representation

0:02:28 - 0:02:33     Text: of the decoder to look back at the encoder.

0:02:33 - 0:02:40     Text: And it's going to then work out, effectively, some kind of function of similarity

0:02:40 - 0:02:45     Text: between the encoder hidden states and decoder hidden states.

0:02:45 - 0:02:50     Text: And based on those, it's going to work out what are called attention scores.

0:02:50 - 0:02:57     Text: And attention scores are effectively weights as to how much it likes the different elements.

0:02:57 - 0:03:04     Text: And based on those attention scores, we're going to compute an attention distribution.

0:03:04 - 0:03:07     Text: So this is our probability distribution.

0:03:07 - 0:03:13     Text: And then based on that, what we do is compute a weighted average of the encoder

0:03:13 - 0:03:18     Text: hidden states, weighted by the attention distribution.

0:03:18 - 0:03:21     Text: And so that's going to give us a new attention output vector,

0:03:21 - 0:03:25     Text: which is like the hidden state vector of the decoder.

0:03:25 - 0:03:28     Text: So that is an extra hidden vector.

0:03:28 - 0:03:35     Text: And so we're going to use both of them to then generate our next output,

0:03:35 - 0:03:40     Text: which is going to be, here, the word 'pie' at the end of the sequence.

0:03:40 - 0:03:45     Text: So let's start off now by doing that with some equations.

0:03:45 - 0:03:50     Text: So before you move on, there's a good question here.

0:03:50 - 0:03:58     Text: And then we'll go back to the question, which is: why are the encoder and decoder both required, as opposed to just the same RNN for both?

0:03:58 - 0:03:59     Text: I think it's.

0:03:59 - 0:04:02     Text: Okay, I'll address that.

0:04:02 - 0:04:08     Text: So maybe there are a couple of still possible interpretations to that, but I'll say something about it.

0:04:08 - 0:04:19     Text: So, you know, the basic case that we've been doing here is this case of machine translation, where we've got a source sentence being encoded and then a translation being generated,

0:04:19 - 0:04:20     Text: which is in the target language.

0:04:20 - 0:04:29     Text: So since these two are in

0:04:29 - 0:04:40     Text: Different languages, it makes sense to have separate sequence models with different RNN parameters for each one.

0:04:40 - 0:04:48     Text: And at that point, it's just a fact about what we want to do with machine translation, which is that.

0:04:48 - 0:04:56     Text: Well, we actually want to look back at the source to try and decide what extra words to put into the translation.

0:04:56 - 0:05:02     Text: So it makes sense to attend back from the translation to the source.

0:05:02 - 0:05:07     Text: But what you might be asking is, well, why do we only do that?

0:05:07 - 0:05:14     Text: Why don't we also consider doing attention from here back into the decoder RNN?

0:05:14 - 0:05:20     Text: And if that's what you were thinking about, that's a great suggestion.

0:05:20 - 0:05:28     Text: And actually very quickly after these attention models were developed, that's exactly what people started doing.

0:05:28 - 0:05:33     Text: They decided that well, actually we could start using more forms of attention.

0:05:33 - 0:05:38     Text: And we could also use attention that looks back in the decoder sequence.

0:05:38 - 0:05:46     Text: And that often gets referred to as self attention and self attention has proven to be an extremely powerful concept.

0:05:46 - 0:05:55     Text: And indeed that then leads into the transformer models that we're going to see next week.

0:05:55 - 0:06:12     Text: But going into things, I think self-attention wasn't quite as obviously a needed idea from this initial translation motivation, which was where attention was developed.

0:06:12 - 0:06:26     Text: It's fairly clear that when you're running the decoder RNN as a conditional language model, it only gets some information about the source fed into its initial state.

0:06:26 - 0:06:33     Text: And it seemed pretty clear you were losing a lot of information about the details of what was in the source sentence.

0:06:33 - 0:06:37     Text: And therefore, it'd be really, really useful to have this idea of attention.

0:06:37 - 0:07:04     Text: So you could directly look at it as you proceed to translate. In a way, it's a little bit less obvious that you need that for the decoder RNN, because after all, last week we introduced those really clever LSTMs, and the whole argument of the LSTM was that they're actually pretty good at maintaining the history of a sequence through quite a bunch of time steps.

0:07:04 - 0:07:12     Text: And to the extent that the LSTM is doing a perfect job, maybe you shouldn't really need self-attention in your decoder.

0:07:12 - 0:07:24     Text: But actually, precisely what's been shown is that this mechanism of attention is, again, a much more effective method of selectively addressing elements of your past state.

0:07:24 - 0:07:34     Text: And it's sort of lighter weight: rather than having to kind of cook up the parameters of the LSTM so it's carrying just the right information forward all the time,

0:07:34 - 0:07:46     Text: you carry only enough information that the model knows where to look back to, and you can then kind of grab more information from past states when you want to.

0:07:46 - 0:07:58     Text: So that's actually a great approach, but I won't talk about it more now and that will come up more next week.

0:07:58 - 0:08:11     Text: Okay, the equations — hopefully these won't actually seem difficult. So what we have is our encoder hidden states, which are vectors.

0:08:11 - 0:08:18     Text: And then at time step t we have a decoder hidden state, which is also a vector of the hidden state dimension.

0:08:18 - 0:08:32     Text: And then what we want to do is get attention scores as to, for s_t, how much attention it pays to each of the hidden states of the encoder.

0:08:32 - 0:08:48     Text: The easiest way to do that is just to use dot products between the decoder hidden state s_t and the encoder hidden states h.

0:08:48 - 0:09:10     Text: So these give us a bunch of numbers, which might be negative or positive, for the attention scores. And so, just like we did right from the first lecture with word vectors, we then put those through a softmax and we get a probability distribution over the time steps of the encoder.

0:09:10 - 0:09:31     Text: Okay, so now we've got that probability distribution, we can construct a new vector by creating a weighted sum of the encoder hidden states based on these attention distribution probabilities, and that gives us the attention output.

0:09:31 - 0:09:44     Text: So to make use of that in generating the output word, we're going to concatenate the attention output with the decoder hidden state s_t. So now that's something of size 2h.

0:09:44 - 0:09:57     Text: And then we're going to proceed as with a non-attention model: we put that through another softmax to generate a probability distribution over the output words, and then we emit some word.

0:09:57 - 0:10:22     Text: Hopefully that's a fairly obvious implementation of what we have here. So we've got our vectors for the encoder and decoder; we take dot products of s_t with each one, softmax those into a probability distribution, we take the weighted average of the ones in red to get the attention output, we combine that with

0:10:22 - 0:10:36     Text: the s_t decoder hidden state, and then we put it through another softmax and we can generate the word 'pie'.
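
As a concrete illustration of the computation just described, here is a minimal PyTorch sketch of dot-product attention for a single decoder time step (the tensor shapes and variable names are illustrative assumptions, not code from the assignment):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes (assumptions): N = 5 encoder positions, hidden size h = 256.
enc_hiddens = torch.randn(5, 256)   # encoder hidden states h_1..h_N, shape (N, h)
dec_hidden = torch.randn(256)       # decoder hidden state s_t at time step t, shape (h,)

# Attention scores: dot product of s_t with each encoder hidden state.
scores = enc_hiddens @ dec_hidden            # shape (N,)

# Attention distribution: softmax over the encoder time steps.
alpha = F.softmax(scores, dim=0)             # shape (N,), sums to 1

# Attention output: weighted sum of the encoder hidden states.
attn_output = alpha @ enc_hiddens            # shape (h,)

# Concatenate with the decoder hidden state to get a vector of size 2h,
# which is then fed through a projection and a softmax over the output vocabulary.
combined = torch.cat([attn_output, dec_hidden], dim=0)   # shape (2h,)
```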

0:10:36 - 0:10:39     Text: Okay, so.

0:10:39 - 0:11:08     Text: I sort of almost can't stress enough: attention is great. So the very first modern neural machine translation system was done in 2014 at Google by Ilya Sutskever and colleagues, and they had a straightforward encoder-decoder with two LSTMs, and by a bunch of tricks of having very deep LSTMs,

0:11:08 - 0:11:25     Text: a huge amount of data, a huge amount of training, and other tricks that I don't want to go into now, they were actually able to get good results by just putting together a straight seq2seq neural machine translation system.

0:11:25 - 0:11:50     Text: But very soon after that — in fact, later that same year — a group from Montreal, Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, introduced a sequence-to-sequence model with attention, and it was just obviously better. So attention significantly improved neural machine translation performance, and that sort of makes sense.

0:11:50 - 0:12:19     Text: It allows the decoder to focus on particular parts of the source sentence. So I think about it as giving you a much more human-like model of doing machine translation, because, you know, exactly what a human translator would do is: you read the source sentence, you've got an idea of what it's about, you start writing the first couple of words of the translation, and then what you do is you look back to see what exactly it said, such as the modifiers of the nouns, as you

0:12:19 - 0:12:35     Text: translate the next few words. Technically, people think of it as solving the bottleneck problem, because attention now allows us full access to the entire set of source hidden states and we can get any information that we need.

0:12:35 - 0:12:40     Text: It's not the case that all information has to be encoded in the final hidden state.

0:12:40 - 0:12:51     Text: It also helps with the vanishing gradient problem. So effectively now we have shortcuts back to every hidden state of the encoder.

0:12:51 - 0:12:59     Text: And so therefore there's always a short path with gradient flow and that greatly mitigates the vanishing gradient problem.

0:12:59 - 0:13:15     Text: The final neat thing is that attention provides some effective interpretability to sequence-to-sequence models, because by looking at the attention distribution we can see what the decoder was focusing on.

0:13:15 - 0:13:39     Text: So in a soft probabilistic way, we get for free, through the operation of the model, a soft alignment as to which words translate which words. So in this example, the French sentence is being translated as 'he hit me with a pie', where there's sort of a single verb here in the French, which is kind of like how in English sometimes people playfully use 'to pie' someone.

0:13:39 - 0:13:59     Text: So the model is getting that 'il' translates as 'he', 'm'' is being translated as 'me', and essentially 'entarté' is being translated as 'hit ... with a pie'.

0:13:59 - 0:14:27     Text: The amazing thing is, you know, the model was never told about any of these alignments; there was no explicit separate model which was trying to learn these alignments, as in the earlier statistical phrase-based systems. We just built a sequence-to-sequence model with attention and said: here are a lot of translated sentences, start running backprop, and try and get good at

0:14:27 - 0:14:43     Text: translating the sentences, and it just learns by itself, in deciding where it's best to pay attention, what is a good alignment between the source and the target languages.

0:14:43 - 0:15:12     Text: Okay, so that's the basic idea of attention. I want to go a bit more into the general concept of attention, since it's such an important idea that we'll see a lot as we continue on in the course. Right, so there are several attention variants, but first of all, you know, what's the common part? So we have some values

0:15:12 - 0:15:32     Text: — some vectors that we're going to be sort of using as our memory — and we have a query vector. So attention always involves calculating the attention scores and turning those into a probability distribution with a softmax,

0:15:32 - 0:15:45     Text: and we use the attention distribution to calculate a weighted sum of the elements in our memory, giving us an attention output.

0:15:45 - 0:15:55     Text: So the main place where you immediately see variation in attention is: how do you compute these attention scores?

0:15:55 - 0:16:14     Text: So let's go through some of the ways that that's done. The simplest, most obvious way to do it is to say, let's just take the dot product between the current hidden state of the decoder and all the vectors here — the

0:16:14 - 0:16:42     Text: source encoder vectors that we are putting attention over. And, you know, that sort of makes sense, right — the dot product is our most basic similarity score. But it seems like there's something wrong with this, and that is, you know, it seems wrong to want to think that the entirety of the source hidden states and the entirety of the target hidden states is all

0:16:42 - 0:17:05     Text: having information about where to attend to because really these LSTMs are doing multiple things so the LSTMs are carrying forward information along their own sequence to help you know record information about the past so it can be used in the future they have information in the hidden state to tell you which output you should

0:17:05 - 0:17:34     Text: generate next, and perhaps they're encoding some information that will serve as a kind of query key for getting out information by attention from the source hidden states. So it seems like probably we only want to use some of the information in them to calculate our attention score, and so that's the kind of approach that was taken very quickly in subsequent work.

0:17:34 - 0:18:03     Text: So the next year, Thang Luong, working with me, explored this idea, now normally called multiplicative attention. So for multiplicative attention, we put an extra matrix in the middle of our dot product. So this gives us a matrix of learnable parameters, and effectively inside this matrix we can learn what parts of s to pay attention to and what parts of

0:18:03 - 0:18:32     Text: h to pay attention to when calculating the similarity, and hence the attention score, between the source hidden states and the decoder hidden state. So that is a good thing which in general works much better, but there's perhaps a problem with that, which is, you know, maybe this W matrix has too many parameters, because we've got

0:18:32 - 0:18:59     Text: these two vectors, which in the simple case are both of the same dimension d (though they don't have to be the same), and, you know, we're now putting in d-squared new parameters for the matrix W. And that sort of feels like it's too many, because arguably it seems like we should only have about 2d parameters: one set of d saying how much attention to pay to different parts of s, and the other how much attention to pay to the parts of h.

0:18:59 - 0:19:23     Text: You know, there's a reason for more: by having a whole matrix here, you're not only doing the scoring element-wise — you can have any element of s being combined with any element of the h vector and see that as a useful part of your similarity score. But it's still a lot of parameters. So a bit after that,

0:19:23 - 0:19:52     Text: people said, well, maybe we can do the same thing with fewer parameters. So if you have a matrix W and you'd like it to have fewer parameters, the obvious linear algebra thing to do is to say, okay, we can model W as U-transpose V, where U and V are low-rank skinny matrices. So we can choose some number k for how skinny these matrices are going to be,

0:19:52 - 0:20:08     Text: and they can be k-by-d matrices, and then we're getting a sort of reduced-rank matrix here. So we still end up with a d-by-d matrix W, but it has a lot fewer parameters in it.

0:20:08 - 0:20:35     Text: And so people explored that, and, you know, at that point, if you just do a little bit of basic linear algebra, what we have here is exactly taking the source hidden vector and the decoder hidden vector, projecting each of them with a low-rank linear projection, and then taking the dot product of the projections.

0:20:35 - 0:20:48     Text: And if you remember this equation here until next Tuesday's lecture, you'll see that that's actually exactly what's happening in transformer models.

0:20:48 - 0:20:58     Text: And none of these were actually the original form of attention, which was suggested by Bahdanau et al. What Bahdanau

0:20:58 - 0:21:17     Text: suggested as the way we could calculate an attention score is taking the two vectors, multiplying each by a matrix, adding them, and putting that through a tanh function, giving us another vector, which we then dot product with yet another vector,

0:21:17 - 0:21:32     Text: and we get out a weight. So in the literature that compares attention variants, this one is normally referred to as additive attention. I've always thought that's a really lazy name,

0:21:32 - 0:22:00     Text: and at least it never made sense to me, because really what you're doing here is using a neural net layer that calculates an attention score. Right, this looks just like the kind of neural net layers that we used when we wanted to calculate scores, such as when we were doing simple sort of feed-forward networks at the beginning and we wanted to calculate scores.

0:22:00 - 0:22:15     Text: And in assignment 4, which you should look at really soon if you haven't looked at it yet, actually one of the things in the written problems is thinking about these different attention variants.
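
As a reference point for those written problems, here is a hedged PyTorch sketch of the scoring variants just described, for one decoder state s and a stack of encoder states H (all dimensions and variable names are illustrative assumptions):

```python
import torch

d = 256      # hidden size (assumed the same for encoder and decoder here)
k = 32       # rank of the low-rank factorization / additive attention size
N = 5        # number of encoder positions

H = torch.randn(N, d)    # encoder hidden states
s = torch.randn(d)       # decoder hidden state

# 1. Basic dot-product attention: score_i = s^T h_i
dot_scores = H @ s                                    # (N,)

# 2. Multiplicative (bilinear) attention: score_i = s^T W h_i, with W a learnable d x d matrix.
W = torch.nn.Parameter(torch.randn(d, d))
mult_scores = H @ (W.t() @ s)                         # (N,)

# 3. Reduced-rank multiplicative attention: W = U^T V with skinny k x d matrices U, V,
#    i.e. score_i = (U s)^T (V h_i): project both vectors down, then take a dot product.
U = torch.nn.Parameter(torch.randn(k, d))
V = torch.nn.Parameter(torch.randn(k, d))
low_rank_scores = (H @ V.t()) @ (U @ s)               # (N,)

# 4. Additive (Bahdanau) attention: score_i = v^T tanh(W1 h_i + W2 s)
W1 = torch.nn.Parameter(torch.randn(k, d))
W2 = torch.nn.Parameter(torch.randn(k, d))
v = torch.nn.Parameter(torch.randn(k))
add_scores = torch.tanh(H @ W1.t() + (W2 @ s)) @ v    # (N,)
```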

0:22:15 - 0:22:38     Text: Okay, yeah, so I've only presented this idea of attention as something good for machine translation, to use between the source and the target sequence models — except for when I tried to answer the question — but really this is a general deep learning technique.

0:22:38 - 0:23:05     Text: Right, so it's not only great for that application: you can use it in many architectures, not just sequence-to-sequence, and you can use it for many tasks, not just for machine translation. So anytime that you have a bunch of vector values and you have some other vector that you can regard as a query, attention is a technique to compute a weighted sum of the values, where

0:23:05 - 0:23:13     Text: you have the query attend to the values.

0:23:13 - 0:23:32     Text: And so, you know, once you think about it like that, you can think about attention as a kind of memory access mechanism: the weighted sum that attention calculates gives you a kind of selective summary of the information contained in the values,

0:23:32 - 0:23:58     Text: And the query tells you which values to pay attention to so on the one hand you could say this is a good general technique anytime you have a whole bunch of vectors and you want to just get one vector out well you know the dumbest thing you can do is just average them all or do max pooling take the element wise max of each element.

0:23:58 - 0:24:15     Text: This gives you a much more flexible way to combine them together into a single vector, and so you can think of it as actually giving us an operation that's sort of much more like a conventional computer and its memory access.

0:24:15 - 0:24:41     Text: So we can think of the values as like our RAM, but it feels like an associative memory: we have a query vector, and the query vector acts as a sort of associative memory pointer that says how much weight to put on different parts of RAM, and then we sort of proportionally retrieve those bits of RAM to get a new value.

0:24:41 - 0:24:49     Text: Another cool thing about attention that i'll just mention here is that.

0:24:49 - 0:25:15     Text: You know, attention is actually a genuinely new deep learning idea from the middle of the last decade. There's sort of a slightly depressing fact about deep learning, that a lot of what's been done in deep learning in recent years wasn't actually new ideas; they are ideas that were developed in the 80s and 90s — it's just people weren't able to get much use out of them then because

0:25:15 - 0:25:43     Text: the computers were too small and the amount of data they had was too small, and they were sort of — well, not reinvented, but they were given new life in the 2010s. Whereas attention is a genuinely new idea from the 2010s, and we'll see next week how it then turns into this huge idea of transformers.

0:25:43 - 0:26:01     Text: Okay so that's all I wanted to say about attention and now i'm going to get on and go and talk about final projects and all this stuff about that so am I good to start on to other stuff.

0:26:01 - 0:26:27     Text: Okay, so this is a quick reminder — don't forget, this is about coursework. So you're right in the middle, or past halfway, almost, going through the assignments, and they're worth about half the grade, but the other half of the grade comes from the final project, which I'm going to talk about today, and that's divided up as follows.

0:26:27 - 0:26:48     Text: So there's a project proposal, which we've handed out instructions for today — it's worth 5% — there's a milestone that's worth 5% in the middle, and there's a report at the end, which is the big part, which is worth 30%. And then we also want to have a representation of your project that people can easily browse. So in

0:26:48 - 0:27:16     Text: normal years, when there are people on campus, we had poster sessions in the class, which it seems like we just can't do in the current world. Instead, we want to put up a website where there are sort of descriptions of each project, some nice picture of what you've done, and then we can link to your project write-up. And so you also then get 3% for that; that's mainly just to make sure you deliver it when it comes down to it.

0:27:16 - 0:27:43     Text: And, yeah, I should just mention again, as we go along, the honor code; that's important also for the final projects. So just to be clear on that: for the final projects, in a lot of cases you're going to use a lot of stuff that already exists — you may well be using some model that you can just download from a GitHub repository or similar places.

0:27:43 - 0:28:12     Text: You may well be taking various ideas. You know, it's fine for the final project to use any amount of existing stuff, but you have to make sure that you acknowledge what you're using and document it, and in terms of evaluating your projects, what we'll be interested in is the value added in terms of what you did, not what you were able to download from other people.

0:28:12 - 0:28:37     Text: So for the final project you can have teams of one to three. In almost every case, every member of the team gets the same score, but we do ask for a group statement of the work done by each team member, and if it's clear that there was sort of some egregious imbalance — in sort of one in a hundred cases — we do do something about that.

0:28:37 - 0:29:03     Text: Okay, so for the final project you've essentially got two choices: you can either do our default final project, where we give you the scaffolding and send you pointed in some direction, or you can propose a custom final project, which we then need to approve as a suitable final project for the course, and I'm going to talk a bit about both of those in this class.

0:29:03 - 0:29:18     Text: So you can work in teams of one to three; if you have a bigger team, we expect you to do more. Sometimes people use the same project for multiple classes, so the team might also include others.

0:29:18 - 0:29:47     Text: It's sort of the same general rule, right: if there are two of you and you're using the project for two classes each, that's sort of like it should be four persons' worth of work that is being done, and so we expect bigger projects. You know, if it's just one of you, we can be totally satisfied with something compact and small — it sort of has to be done well,

0:29:47 - 0:30:02     Text: but it can be compact and small — whereas if there are three of you, we sort of feel like, well, you really have enough time to actually implement a variant model and see if that works better.

0:30:02 - 0:30:27     Text: You can use any framework or language you want for the final project, but in practice basically everyone goes on using PyTorch. And then, finally, I've got to mention, on the default final project — I'll get back to this in a minute — but actually, new this year, we've got two variants of the default final project and you can choose one or the other.

0:30:27 - 0:30:56     Text: So for custom final projects, I'm really happy to talk to people about final projects, and I'd encourage you to sign up for an office hours slot, but the problem is that there's only one of me. So I'd also encourage you to talk to the TAs about final projects; many of the TAs have experience with all sorts of things, and we've tried to sort of summarize some of what they have experience in.

0:30:56 - 0:31:25     Text: So for the final projects, we make sure everyone has some kind of mentor, and the mentor can either be a TA or instructor in this class, or it could be somebody else. And for the somebody-else option, well, you know, if there's someone you know at Stanford and there's some cool

0:31:25 - 0:31:37     Text: project they have — you know, whatever it is, whether it's something in political science on understanding documents, or in the med school doing something —

0:31:37 - 0:32:05     Text: you can have them as your mentor. Or we've also collected some projects from people in the Stanford NLP group and community, and we're going to be distributing a list of those, so you can try and sort of sign up to be doing one of those projects and have them be a mentor. So in that case, that other person we expect to be the mentor that's keeping an eye on the project and telling you to do something sort of sensible, but it's one of the

0:32:05 - 0:32:11     Text: TAs in the class who will be grading your various pieces of work for the class.

0:32:11 - 0:32:33     Text: Okay, so these are the details of the two default final projects, and the handouts for them — there are separate handouts, each one out on the web now. So both of them involve question answering, which I'll mention a bit at the end, and there are sort of two variants.

0:32:33 - 0:33:02     Text: One is for you to actually build a question answering architecture by yourself. I describe this as 'from scratch', but it's not really from scratch, because we give you a baseline question answering system and the starter code, but you're doing all the work of: what else could I add to the model, how can I add extra layers, different attention, and other things to make it better, using the SQuAD dataset.

0:33:02 - 0:33:23     Text: But one of the things that's happened in NLP in the last few years — which is really then the topic of next week's classes — has been this revolution in using large pre-trained language models with names like BERT

0:33:23 - 0:33:44     Text: and others, which have just been sort of fantastically good for doing natural language problems. So the other choice is to make use of those models, and then effectively the model and the model architecture, as a starting point, are given to you.

0:33:44 - 0:34:13     Text: And what we're hoping to have people focus on is how to build a robust question answering system which works across different datasets and domains. So a huge problem with a lot of NLP models is, if you train them on, say, Wikipedia data, they work great on Wikipedia data, but then as soon as you try and use them on something else, whether it's customer support questions or web questions, their performance

0:34:13 - 0:34:22     Text: degrades greatly, despite the fact that, to a human being, it seems like you're doing the same task.

0:34:22 - 0:34:43     Text: So in the building-a-robust-QA-system track, you're then going to be sort of training or fine-tuning pre-trained language models on several datasets, and your goal is to produce something that then performs well on different datasets.

0:34:43 - 0:35:00     Text: So for this topic of question answering, I'm going to spend a few minutes on it at the end, but on Tuesday of week six the entire lecture is on question answering, so there will be a lot of content there on the different kinds of models for really getting up to speed.

0:35:00 - 0:35:24     Text: It's good stuff to know even if you're not doing the default final project, since it's a major application of NLP, but even better stuff to know if you are doing the default final project, so look out for that. But just to give you one example of the kind of thing: we have a question answering problem. So question answering is taking a paragraph of text — but I only put one sentence on my slide to keep it short.

0:35:24 - 0:35:47     Text: 'Bill Aiken, adopted by a Mexican movie actress, grew up in the neighboring town of Madera, and his song chronicled...' — and the question is: in what town did Bill Aiken grow up? This doesn't actually seem so hard; I presume all of you could have done this when you were in sixth grade in school or something like that. The answer should be Madera.

0:35:47 - 0:36:07     Text: But somehow Google's BERT model fails to find that answer, and it seems it's partly because there's extra stuff in the middle there — 'adopted by a Mexican movie actress' — and so it says, no, this sentence doesn't answer the question. So you can hope to do better than that.

0:36:07 - 0:36:36     Text: Okay, so there's this choice about doing either the default or a custom final project, and, you know, the overall stats have been that about half the people do each, so there's no sort of clear winner of the choice. In terms of thinking about what you should choose, I mean, I think the default final project is great if you have limited experience with doing this sort of thing.

0:36:36 - 0:37:02     Text: If you don't have any clear idea of what you would want to do as a custom final project, if you think it'd be good to have, you know, guidance and a clear goal to work towards — indeed, we give you a leaderboard so you can compete against the other people doing the default final project — then you should do the default final project: it gives you guidance, scaffolding, and clear goalposts.

0:37:02 - 0:37:31     Text: And in particular, just to try and give a sharp edge to this: I mean, you know, the fact of the matter is, every year there are a few people who do a custom final project, and when we're grading, we look at their custom final project and we say, you know, this just looks pretty lame compared to what people are doing in the default final project.

0:37:31 - 0:37:54     Text: And that's a bad state to be in, right? If you're doing a custom final project, you want to have a sort of clear thing that you've thought out that is interesting, so it will seem at least as interesting as the default final project, and if you don't think you've got such a thing, you're actually better off doing the default final project.

0:37:54 - 0:38:20     Text: Why should you do a custom final project? Well, if you have some research project — possibly something you've already been working on, or at any rate something that you think, oh, this would be great to do. There are two requirements for your project: it has to substantially involve human language, and it has to substantially involve neural networks. That doesn't mean it has to be solely about those: if you want to do a language-and-

0:38:20 - 0:38:36     Text: vision project, that's fine, or if you want to compare neural networks versus other machine learning methods on some problem, that's fine, but you do have to, in your project, substantially use both of those things.

0:38:36 - 0:39:02     Text: Yeah, so if you'd like to sort of design your own thing and come up with something different on your own, or you just don't like question answering, or if you basically want more experience of going through the whole process of trying to find a good research goal, finding data and tools to explore it, and working it out on your own, then doing the custom final project is a great choice.

0:39:02 - 0:39:22     Text: Either way, the steps that you go through are the following. So the first thing, which is coming up soon, is to write a project proposal, which is three or four pages.

0:39:22 - 0:39:50     Text: So, you know, some of the details vary with the kind of project. So you need to decide on the research topic for your project — which is kind of easy if you're doing the default final project: it's either A or B — but we then want you to choose one research paper that's relevant to your final project, read it, and then learn some stuff from it that you can write about.

0:39:50 - 0:40:12     Text: And we want you to write about what your plan is as to what you're going to do for the final project, and that will include describing things as needed, such as the data and the evaluation. And again, this is especially important for custom final projects; it might be kind of obvious if you're doing the default final project.

0:40:12 - 0:40:26     Text: And so typically, if you're doing the default final project this should be three pages, and if you're doing a custom final project this should be four pages.

0:40:26 - 0:40:52     Text: And there are two parts to it. The first part is writing about the paper you read, and so that part is two pages. There are longer-form instructions, but we want you to sort of read and think about this paper: you know, what are its novel contributions, is there an idea that could be employed in other ways, are there things it didn't really do well?

0:40:52 - 0:41:08     Text: How does it compare to other ways that people have approached this problem? Even though it did things one way, does it actually suggest ideas that you really could do a different way that might even be better? And so

0:41:08 - 0:41:36     Text: we want you to write this two-page paper summary, which we will grade, and effectively we're going to grade that on how good a job you do at thinking about, analyzing, and having critical comments on this paper. So then the other half — which might be longer or shorter depending on whether it's custom or default — is to propose what you're going to do.

0:41:36 - 0:41:46     Text: And really this part is formative so formative means we're not going to grade it harshly we want you to write it so we can help you.

0:41:46 - 0:42:04     Text: So we want you to have sort of thought through what you're going to do, and then we have an opportunity to say, no, that's unrealistically large, or that sounds insufficiently ambitious, or this isn't going to work unless you can get more data than you actually have, and things like that.

0:42:04 - 0:42:26     Text: So this is mainly just for feedback. But, you know, on the things to think about: project plans that are lacking are generally just lacking concreteness in nuts-and-bolts ways, which is essential to being able to do a good final project in a short amount of time. So

0:42:26 - 0:42:53     Text: you need to have found good data, or have a realistic plan to be able to collect it; you need to have a realistic way that you can evaluate your work; you need to have thought about what kinds of experiments you can run so you can show whether your model is working well or badly; and we'll be sort of looking to see whether you've done those things.

0:42:53 - 0:43:09     Text: Okay, so then a couple of weeks after that is the project milestone, which is again for everyone. This is a progress report, and this is again just meant to help keep you on track, so you should be more than halfway done.

0:43:09 - 0:43:38     Text: In nearly all cases — I'll talk more about this in a minute — you should already have been able to implement some baseline system and have some initial experimental results to show. You know, you might still be working on your main model and have nothing to show for it, but hopefully you've at least set up this simple baseline to know how well it works, so you can say, I've got that and I have some numbers. And then we want an update

0:43:38 - 0:44:07     Text: on how you plan to spend the rest of your time, and again, a lot of this is about us giving you more feedback as to what are the best things you can do for the final two weeks of the class. And then at the end there's writing up the final project, and for this, you know, the quality of your write-up is really, really important to your grade,

0:44:07 - 0:44:22     Text: right? You know, by and large we're going to evaluate how good your project is by reading the write-up, so make sure you budget sufficient time to actually do a good job.

0:44:22 - 0:44:51     Text: Now, do look at good projects from past years; they're all up on the CS224N website. 2020 was kind of a mess because of the start of the pandemic, so possibly you should look at 2019's as even better models of what you should do. So, you know, the details vary, but this is sort of a picture to have in mind of what a write-up normally looks like: you know, eight pages; you want to have an abstract,

0:44:51 - 0:45:19     Text: an introduction to the paper; you want to talk about related prior work; you want to present the model you're using; you want to talk about the data you're using; talk about your experiments, what the results are, what you learned from that. You know, the details may vary — for some papers there's less to say about the model and more to say about the experiments, so you can sort of move things around a bit — but roughly something like this.

0:45:19 - 0:45:48     Text: Okay, so I now want to kind of go on and say a bit about research and practical things that we need to do. A lot of these things are relevant to everybody; at any rate, they are things that you should know a little bit about. The very first one is finding research topics, which is sort of especially vital for custom final projects.

0:45:48 - 0:46:16     Text: So really, for all of science, there are sort of only two ways that you can have a research project. One way is doing domain science, where you start with a problem of interest in the domain — such as, how can I build a decent Cherokee-to-English machine translation system — and you work on finding ways to be able to do it better than people currently know how to,

0:46:16 - 0:46:44     Text: or to understand the problem better than people currently understand it. And the other way is to take a methodological approach, where you start off with some method or approach of interest and then you work out good ways to extend or improve it, or new ways to apply it. And so basically you're doing one of those. So there are effectively different kinds of projects you can do.

0:46:44 - 0:47:11     Text: This is a non-exhaustive list, but most projects sort of fall into one of these buckets, very non-uniformly. So the most common type is: you find some application or task of interest and you explore how to approach and solve it as well as possible, often using existing models and trying out different options and things like that.

0:47:11 - 0:47:30     Text: The second kind is: you can take some relatively complex neural architecture — i.e., it has to be something more complex than we built for assignments one through five — implement it, and get something that works on some data.

0:47:30 - 0:47:50     Text: And, you know, because you're doing something fairly complex, it's fine just to implement it and get it to work. But if there's some way that you can tweak it and try and do something different and see if that makes it even better — or maybe it makes it worse, and you can do experiments either way — that's even better.

0:47:50 - 0:48:17     Text: There are then other kinds of projects that you can do. So another kind of project is an analysis project: you can take some existing model and you can poke at it and find out things about what it knows. So you can take anything, even as simple as word vectors, and poke at them, and you can find out things like, well, how much do the word vectors know about word senses?

0:48:17 - 0:48:39     Text: Sometimes the same word is both a noun and a verb — can you tell those apart, can you sort of get the different similarities out of the word vectors? So analysis projects are perfectly fine, often interesting. And then, lastly, the rarest kind of project — we have a couple — is a theoretical project.

0:48:39 - 0:48:48     Text: You know there's a lot of interesting deep learning theory is that how do these things work and why and what would it take to make it work better.

0:48:48 - 0:48:59     Text: And so you can work on sort of getting any kind of non trivial property or understanding of a deep learning model.

0:48:59 - 0:49:12     Text: And here are, just quickly, a couple of examples of some projects from past years of CS224N, just to give you a sense of things like these. So, Deep Poetry:

0:49:12 - 0:49:40     Text: this was doing generation of Shakespearean sonnets. So the generation was done with the kind of LSTM networks we've now seen for machine translation, but with interesting differences, because if you want to generate poetry you need to know about metrical structure and rhyme, and so they were working out how to add components to a model that could do that.

0:49:40 - 0:50:03     Text: Here's someone who was implementing a complex new model. So there's been a line of work at DeepMind on trying to build general-purpose computers out of neural architectures — first of all Neural Turing Machines, and then a subsequent model, the Differentiable Neural Computer.

0:50:03 - 0:50:15     Text: They hadn't released the code of those as open source, so Carol decided that she was going to implement neural computers and get them to work.

0:50:15 - 0:50:32     Text: This was a very dangerous idea, because, you know, we were in week 10 of the class and she still hadn't gotten it to work at all, but luckily she got it together at the last moment and actually got her model working, and she was able to run it and get results on some of the problems that DeepMind

0:50:32 - 0:50:44     Text: had also shown results on. And so she pulled the rabbit out of the hat and had an enormous success, and we were very impressed that she managed to do that.

0:50:44 - 0:50:59     Text: So sometimes final projects have become papers. Here's a final project that became a paper published at a top machine learning conference; this was a few years ago, in 2017.

0:50:59 - 0:51:17     Text: It actually has a couple of fairly simple ideas, but, you know, they are ideas that at that time people weren't using, and these two people showed they work to improve things, and they got a conference paper out of it. So I'll just mention one of them now.

0:51:17 - 0:51:41     Text: So if you think about our recurrent neural network language models, they have, for the input words, an encoding into distributed vectors, and at the other end, you know, the softmax matrix basically hides inside it — just like for our word vectors — a word vector for each word, and then you're sort of

0:51:41 - 0:52:06     Text: deciding the probabilities of generating different words based on similarities between the query vector and each of those words. So their idea was: look, maybe we're actually able to build better language models if we tie together the word embedding matrix and the matrix used to project the RNN output.

0:52:06 - 0:52:26     Text: And actually they showed that you can get significant gains by doing that, and so now, you know, it's basically become standard, if you're wanting to build strong neural language models, that you want to tie those two sets of parameters together. So that was pretty cool.
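
A minimal sketch of that weight-tying trick in PyTorch (the sizes here are assumptions, and it requires the embedding dimension to match the hidden size, or an extra projection in between):

```python
import torch.nn as nn

vocab_size, hidden_size = 10000, 512   # illustrative sizes

embedding = nn.Embedding(vocab_size, hidden_size)             # input word embeddings
output_proj = nn.Linear(hidden_size, vocab_size, bias=False)  # projects RNN output to vocabulary logits

# Tie the two parameter matrices: both are (vocab_size, hidden_size), so the model
# learns a single set of word vectors used at both the input and the output.
output_proj.weight = embedding.weight
```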

0:52:26 - 0:52:55     Text: Here's a more systems-y project. So I'll mention this again later, but, you know, something people have been interested in is: how can you squish neural nets and make them small, so you can run them on, you know, a regular laptop or a smaller device like a mobile phone? And so these people worked on how you could do quantization of neural networks down to sort of

0:52:55 - 0:53:23     Text: one or two bits per parameter and still get good results. So this is sort of an example for the 'has to involve neural nets and human language' rule: in a way you could say this didn't involve human language at all, because this was really about quantizing neural nets, but, you know, effectively where we draw the line is we say you're allowed to do this, providing the task

0:53:23 - 0:53:45     Text: you use to demonstrate the model's success or failure is a natural language task. And so they did both word similarity tasks and question answering tasks, seeing how well the system performed after doing the compression by quantization.

0:53:45 - 0:54:11     Text: How can you find a good project if you don't have a good idea? One place is to sort of look at recent papers. So most NLP papers, they're in the ACL Anthology; also look at major machine learning conferences; most past CS224N projects are up on the class website, so you can look through them; and there's the general arXiv.org preprint server.

0:54:11 - 0:54:24     Text: But what's even better is looking for an interesting problem in the world. So Hal Varian is an economist,

0:54:24 - 0:54:48     Text: and he wrote this cool paper that I often recommend to students on how to build an economic model in your spare time, and the paper isn't really about economic models — it's about how to do research. And what he says right at the beginning — section one of the paper is called 'getting ideas' — is:

0:54:48 - 0:55:11     Text: But where to get ideas, that's the question. Most graduate students are convinced that the way you get ideas is to read journal articles. But in my experience, journals really aren't a very good source of original ideas. You can get lots of things from journal articles — technique, insight, even truth — but most of the time you will only get somebody else's ideas.

0:55:11 - 0:55:21     Text: And so he then talks further about better ways of getting ideas by thinking about problems in the world.

0:55:21 - 0:55:40     Text: So arXiv is this huge repository of papers; it's hard to navigate. There are various tools that can make it easier to navigate; one of them is Arxiv Sanity Preserver, which is a website that was written by Andrej Karpathy, who is the original person who constructed and taught the

0:55:40 - 0:55:46     Text: CS231N course. It's still a good thing to use.

0:55:46 - 0:56:04     Text: There are lots of leaderboards for different tasks in NLP, so a place you can find tasks and work on them is looking at leaderboards. Papers with Code and NLP-progress are two good general sources of leaderboards.

0:56:04 - 0:56:10     Text: There are also then lots of more specialized ones for particular tasks.

0:56:10 - 0:56:19     Text: Then, for research topics — this material on the next four slides is brand new for this year —

0:56:19 - 0:56:36     Text: I wanted to sort of just say a few words about the kind of funny, somewhat different time that we're in, where there's sort of old deep learning NLP and new deep learning NLP.

0:56:36 - 0:56:54     Text: In the early days of the deep learning revival, which I'll call 2010 to 2018 — because 2010's the year I started doing deep learning for NLP — most of the work was in defining and exploring better deep learning architectures.

0:56:54 - 0:57:08     Text: The typical paper was: I can improve a summarization system by not only using the kind of attention I just explained, but adding an extra kind of attention which I'll use as a copying mechanism.

0:57:08 - 0:57:29     Text: So I'll do additional attention calculations to work out source words which I could copy verbatim into the output, rather than having to regenerate them through the standard kind of softmax. That's just one example of millions, but you were changing the architecture of the model to make things better.

0:57:29 - 0:57:55     Text: You know, that's what a lot of good CS224N projects did too, and in particular, you know, one branch of the default final project — the one where you start with the baseline SQuAD question answering system and work out ways to make it better by adding more neural architecture — is essentially doing this.

0:57:55 - 0:58:19     Text: But actually, if you look out in the research world of the last couple of years, the sad truth is that approach is dead. Well, it's not actually dead — I mean, that's too strong — but it's now much more difficult and rarer to do.

0:58:19 - 0:58:40     Text: So if we look at the sort of last three years of NLP — and this is true both for people who are implementing production NLP systems at companies and for people who are doing research in NLP — most work is more like this.

0:58:40 - 0:59:07     Text: There are these enormously good big pre-trained language models — BERT, GPT-2, RoBERTa, XLNet, T5 — they exist on the web and you can download them with one Python command, and they give you a great basis for doing most NLP tasks. We'll learn all about them next week.

0:59:07 - 0:59:20     Text: So you're not actually starting from scratch, defining your own model architecture and playing with its variants; you're saying, I'm going to be using RoBERTa,

0:59:20 - 0:59:45     Text: and then further fine-tuning it for your task, doing domain adaptation, and things like that. So, you know, very quickly, for 2021 NLP, for all of your practical projects and industry needs, this is basically the formula which you should probably use.

0:59:45 - 1:00:02     Text: There's this enormously great library by the company Hugging Face; you install it — pip install transformers — and then that gives you a great implementation of the transformers we'll learn about next week, and then effectively what you're doing is something like my code below.

1:00:02 - 1:00:25     Text: This code doesn't quite run — it's missing some pieces if you try it — but, you know, the sort of pieces are: you load a big pre-trained language model, so here I'm loading the BERT-base-cased model. It turns out these models all have very special tokenizers — you'll hear about those next week as well — so we grab the tokenizer for it.

1:00:25 - 1:00:37     Text: Then we're going to fine-tune it for our domain: like, maybe I'm going to be working with legal data, so I want to take the general model and fine-tune it to understand legal data better.

1:00:37 - 1:01:04     Text: And then I've got something I want to do, like question answering — or perhaps here, where I'm using BERT for sequence classification, maybe the task I want to do is label mentions of sections of legal code — so at that point I'm then going to sort of fine-tune for that particular task and then run on that task.
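
Here is a hedged, runnable sketch of that Hugging Face workflow; the model name, number of labels, and example sentence are illustrative assumptions rather than the exact code on the slide:

```python
# Requires: pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "bert-base-cased"  # a pre-trained model downloaded from the Hugging Face hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize a (hypothetical) example and run it through the pre-trained model.
inputs = tokenizer("The lessor shall maintain the premises.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# In practice you would first fine-tune on in-domain (e.g. legal) text and then on
# labeled task data, e.g. with the Trainer API or a standard PyTorch training loop.
print(logits.shape)   # (1, num_labels)
```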

1:01:04 - 1:01:22     Text: And so this is kind of what we're doing and we're just sort of doing stuff on top of a pre existing model so a lot of what is exciting now is problems that work within or around that world.

1:01:22 - 1:01:51     Text: So you might sort of think about that and look at recent papers and things that could be done in that world. Well, one of them — and this is the other half of the default final project — is we still have this problem of robustness to domain shift: these models are normally trained on one kind of text, but we often want to use them for different kinds of text, including, often, when you're talking about problem- or application-specific

1:01:51 - 1:01:59     Text: questions in domains where you don't have much data. How can you do that well?

1:01:59 - 1:02:16     Text: If you've got one of these big models, in some sense they're good, but are they actually, you know, robust — do they work for all of the different things in the space you'd like them to work on? So we're not only interested in basic accuracy, but robustness.

1:02:16 - 1:02:45     Text: So this Robustness Gym was a very recently announced project by a Stanford PhD student, Karan Goel; it's for actually testing the robustness of NLP models, and they've got some stuff built into it, but there's lots of stuff that isn't built into it. If someone would like to sort of choose some NLP problem — whether it's dependency parsing or something else like summarization — and build out robustness tests for that, that could be kind of interesting to do.

1:02:45 - 1:03:08     Text: You know, there are a bunch of other things: you can look at issues of whether there's bias embedded in these models, how you can explain what they're doing, how you can use data augmentation — there's been an enormous amount of work on using data augmentation to improve models — lots of different areas.

1:03:08 - 1:03:36     Text: I'll just mention two other things and then I'll go on. So there's been a ton of work on scaling models up and down. These big pre-trained language models are already big, so the fact of the matter is, it's just not possible, with the time and resources that you have in CS224N, to be thinking, I'm going to

1:03:36 - 1:04:04     Text: pre-train my own big models; all you can do is use ones that already exist. But, you know, on the other hand, people are actually interested in: can you build very small models that still work pretty well? And there are lots of examples, but actually one that was done recently for question answering was that there was a bakeoff competition at the last

1:04:04 - 1:04:20     Text: NeurIPS, which was called EfficientQA, and one of the divisions of this bakeoff was: can you build a performant question answering system that will run in

1:04:20 - 1:04:48     Text: a limited amount of memory? And, well, you know, that's actually a reasonable thing that you could attempt for a final project. Another thing people are very interested in is wanting to explore more advanced learning capabilities in neural networks — ideas like compositionality, systematic generalization, fast learning such as meta-learning — and a lot of the time people have investigated these

1:04:48 - 1:05:02     Text: in small remains so here's a couple of examples that you go look at baby AI and G scan so those could be kind of interesting places you can look for a final project.

1:05:02 - 1:05:15     Text: Okay other stuff to know quickly probably I can't quite go through all of these slides and some of them you could just take home and look at you need data.

1:05:15 - 1:05:44     Text: Now we actually love it if people feel like they can collect their own data and sometimes a good ways to collect your own data and but you know the reality is a lot of the time the easiest way to get a fast start on a final project when you have a few weeks is to make use of an existing data set and there are lots of pre existing data sets so there's data from the linguistic data consortium which is licensed data we have licenses that stand there.

1:05:44 - 1:06:13     Text: You can look through their catalog and find all the things they have there are websites that have lots of data so if you want to machine translation tasks you can find lots of machine translation data on this website dependency parsing if you're keen on assignment three lots of data on the universal dependencies website there are now several websites that are collecting a lot of data sets so hugging faces also just recently actually now it's hugging the web site.

1:06:13 - 1:06:27     Text: And we actually now hugging face data sets which is sort of an index of data sets and the people with code people also have people with code data sets so you can look at those there are many more.
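
For example, many of the datasets indexed on Hugging Face Datasets can be pulled down in a couple of lines; the sketch below uses SQuAD, but any dataset name from their index could be swapped in.

```python
from datasets import load_dataset

# Downloads and caches the dataset the first time it is called.
squad = load_dataset("squad")            # SQuAD 1.1 train/validation splits
print(squad)                             # split names and sizes
print(squad["train"][0]["question"])     # one example question
print(squad["train"][0]["answers"])      # its gold answer span(s)
```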

1:06:27 - 1:06:29     Text: Yeah.

1:06:29 - 1:06:58     Text: Here's just a quick summary of thinking about a research project so you know this is just one example right that's post you think summarization so that's sort of going from a longer piece of text or short summary of that so you know what you have to do you need to you know find the data set and well probably easiest thing to do is to sort of use an existing text summarization data set.

1:06:58 - 1:07:26     Text: You can find several online but you know this is where you can think of interesting ways to create your own data so something that you might have noticed if you look at Twitter is the journalist often promote their own stories by or their their newspaper or TV station does by putting out a tweet and so that's sort of like found data which is sort of self summaries could you collect some of those and try and learn to generate tweets for a new story.

1:07:26 - 1:07:55     Text: So I'll say a bit about data set hygiene you want to be careful about working things out so you have trains sets and test sets I'll say that in a minute you want some way to evaluate whether you're doing well as you build models and you pretty much need to have an automatic evaluation metric even though human evaluation is great since you want to train a bunch of models and see whether they're better or worse you normally want an automatic metric.

1:07:55 - 1:08:23     Text: You should make sure you have some kind of baseline that you want to have some sense of whether you're doing well and so commonly you first off want to implement some simple model like a logistic regression or just averaging word vectors or something and see how well that works because then you know if you're not doing better than that that you're actually making no progress whatsoever.
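
A minimal sketch of that kind of simple baseline, logistic regression over averaged word vectors, might look like the following. It assumes you already have a `word_vectors` dictionary (for example, loaded from GloVe) plus lists of texts and labels; those names are placeholders here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def average_vector(text, word_vectors, dim=100):
    """Mean of the word vectors for the in-vocabulary tokens of `text`."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def baseline_accuracy(train_texts, train_labels, dev_texts, dev_labels,
                      word_vectors, dim=100):
    """Train a logistic regression on averaged vectors; return dev accuracy."""
    X_train = np.stack([average_vector(t, word_vectors, dim) for t in train_texts])
    X_dev = np.stack([average_vector(t, word_vectors, dim) for t in dev_texts])
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return clf.score(X_dev, dev_labels)
```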

1:08:23 - 1:08:33     Text: So you can make some neural net model and see if you can that you think might be good and see if you can get that to work and have that work well.

1:08:33 - 1:08:52     Text: Make sure you keep looking at your data to see what kind of errors you're making and think about how you could change the model to avoid them and then hopefully you've got time to try out some model variance and see if they're better or worse and then that will help in having a good project.

1:08:52 - 1:09:18     Text: Yeah, so quickly what I call here pots of data so many public data sense come with a structure where they have trained dead and test and the idea is that you keep your test data until the end and you only sort of do test runs when development is complete or at least almost complete.

1:09:18 - 1:09:35     Text: And the thing you're more trained on the train data and you evaluate on the dev data sometimes you need even more data sets is sometimes you want to tune your parameters and you might need a train set a tune set a depth set and a test set.

1:09:35 - 1:09:56     Text: If the data set if doesn't come pre splitter it doesn't have enough pieces it's sort of your job to work out how to cut it into pieces and to get things working and the reason why you need to do this is.

1:09:56 - 1:10:22     Text: That it's necessary to have these different sets to get realistic measures of performance in fact ideally you're only actually testing on the test set once and at any rate very few times so if you're doing the default final project we limits you to three runs on the test set so remember that and save your three runs.

1:10:22 - 1:10:40     Text: The reason that we do that is that if you sort of mix up and over use these data sets the results just become invalid so you know when you train on data well you always build a model what does well on the training data so that's not exciting.

1:10:40 - 1:10:55     Text: And well if you want to tune hyper parameters if you tune them on the training data set you won't get good valid values for them because they're not tuned to good values for something that would work on different test data.

1:10:55 - 1:11:24     Text: But the subtlest part of it is you know I think a lot of people starting off think oh there's no harm in just keeping on testing on the test set every time I you know try out some variant I'll just see how it's going on the test set it's a really compelling thing that you want to do right you change your model you'd like to know whether they've the score on the test set go up or down but you know the truth is that if you do that.

1:11:24 - 1:11:53     Text: You're cheating because what you do is you're sort of slowly training on the test set because you keep every change that helps on the test set and you throw away every change that doesn't help on the test set all makes it worse and so you're learning about the test set and what things happen by chance to work on that particular test set so you're effectively slowly training on the test set and you get biased unrealistically high.

1:11:53 - 1:11:56     Text: Play at levels of performance.

1:11:56 - 1:11:59     Text: Okay.

1:11:59 - 1:12:03     Text: Let's see.

1:12:03 - 1:12:21     Text: I think there then sort of a bunch more slides on neural nets that I think I'm going to just skip for the moment but I think in general they use forward of scene maybe I've just sort of

1:12:21 - 1:12:26     Text: mentioned this one.

1:12:26 - 1:12:39     Text: So for the final projects you're much more on your own and you have to work out for yourself how to get your neural network to work.

1:12:39 - 1:12:51     Text: The first thing is you want to start with a positive attitude you know these neural nets are amazing they really want to learn they want to find any pattern they can anywhere and data.

1:12:51 - 1:13:03     Text: They really just do that it's in their DNA so if your neural network isn't learning it means you're doing something wrong that's preventing it from learning successfully.

1:13:03 - 1:13:22     Text: But you know at that point there's a grim reality there are all sorts of things that cause neural nets to not learn at all or more common case actually is they sort of learn a bit that they don't learn very well you know there are bugs in the code you're

1:13:22 - 1:13:39     Text: doing through by the wrong thing you've missing some connections in your network so nothing information is flowing from one place to another place you're calculating the gradients wrong and some layer there are all sorts of things that can go wrong.

1:13:39 - 1:14:00     Text: So you need to then work out how to find and fix them and the truth is that this debugging and tuning can often take way more time than I'm going to implement the model so in terms of thinking about how much you could get through you should be thinking okay I've coded up my model that doesn't mean that you're 75%

1:14:00 - 1:14:10     Text: and it's pretty frequent it means that you're only 20% done and there's still a lot of work to go to get things working.

1:14:10 - 1:14:29     Text: Okay so for the last minutes I just want to say a few minutes about the final projects and I've read like some idea about what this is about as an NLP problem.

1:14:29 - 1:14:52     Text: Okay so the problem is most commonly called question answering over documents but really what the part that we're doing is perhaps better called reading comprehension and so the idea of this is that you want to actually be able to answer questions based on documents.

1:14:52 - 1:15:15     Text: So here's an example of the question who was Australia's third prime minister and you know once upon a time when if you typed a question into Google all you got was web search and it returned to you list of pages with the implied promise that some of the ones right near the top probably had the answer the question.

1:15:15 - 1:15:44     Text: But if you do this now it gives you back an answer and so here's the answer John Christian Watson and the important thing to realize is that this kind of featured snippet you know Google has 101 or maybe a thousand one different pieces inside it but you know this feature snippet isn't coming from the Google knowledge graph structured data it's coming straight from a web page

1:15:44 - 1:15:54     Text: where some part of the Google search system has actually read this web page and decided what the answer is.

1:15:54 - 1:16:13     Text: So for this kind of system the most straightforward way to do it is you have two parts first you have a web search system that finds a page that probably has the answer and then you have a reading comprehension system that actually looks inside the text and works to extract this.

1:16:13 - 1:16:31     Text: And so it's looked through this piece of text and this sentence says was an Australian politician who served as the third prime minister of Australia that's why I'm asking for and slightly different wording Australia's third prime minister.

1:16:31 - 1:16:51     Text: So this answers the question and so it correctly says that the answer is John Christian Watson and so what we want to build in the default final project is systems that do that second part that given a piece of text and a question they can give the answer.

1:16:51 - 1:17:10     Text: So the simple motivation for why this is important is once we have massive full text document collections that you know simply we're saying here's a list of maybe relevant documents is of limited use that we really much prefer to get answers to our questions.

1:17:10 - 1:17:29     Text: And you know that's true in general but you know it's especially true if you're using your phone to try and look for information rather than sitting in front of a 27 inch monitor it's especially true if you're using a virtual assistant device like Alexa or Google assistant.

1:17:29 - 1:17:55     Text: That's the problem that's been worked on reading comprehension or question answering and so the squad data set which was built by front of Rajperker and personally and consists of passages taken from Wikipedia and questions which team one super bowl 50 and what you're meant to be able to do is read through this passage.

1:17:55 - 1:18:03     Text: And so the answer is the dead and the bonkers.

1:18:03 - 1:18:17     Text: And so there are 100,000 such examples and so the answer is always just taken to be a span of the passage and that's referred to as extractive question answering.

1:18:17 - 1:18:35     Text: So to let this data what was done was that people were shown passages asked several questions just like reading comprehension at school perhaps slightly simple questions and they're asked to choose a span that answered it.

1:18:35 - 1:18:50     Text: You know as in these examples show the they showed it to three human beings and they didn't always choose exactly the same span because sort of there's summer uncertainty as to how many words to include but roughly.

1:18:50 - 1:19:09     Text: They're answering the question in that way and so then we have a valuation measures and so there are two evaluation measures one is exact match whether you return exactly what one of the humans returned and the other one is the F1 measure which is.

1:19:09 - 1:19:22     Text: Is the overlap in words of your span to one of the humans roughly so for squad the initial.

1:19:22 - 1:19:45     Text: So we're going to do squad 2.0 which is just a little bit more complex because they make it a little bit trickier that some of the questions have no answers in the text so here's a piece of text about genghis Khan and the question is when did genghis Khan kill great Khan.

1:19:45 - 1:20:06     Text: Well if we sort of read through this text there's genghis Khan at the beginning and there's talk about different kinds and there's the person down here who became great Khan in 1251 then genghis Khan didn't kill great Khan it doesn't say that at all.

1:20:06 - 1:20:33     Text: But you know actually here's Microsoft and L net which is another strong question answering system and if you ask at this question it says 1234 so the reality is that a lot of these models have effectively heuristically behave like okay well this is asking for a year let me look for a year in this passage this near discussion of genghis Khan.

1:20:33 - 1:20:47     Text: And maybe weakening is somehow similar to killing i'm going to guess 1234 and that's sort of the wrong thing to have done here so this is a good reliability test for question answering models so that's an interesting add on problem to look at.

1:20:47 - 1:21:13     Text: Okay i'm going to stop there today good luck with your projects.