Stanford CS224N - NLP w/ DL | Winter 2021 | Lecture 5 - Recurrent Neural Networks (RNNs)

0:00:00 - 0:00:12     Text: So we're now starting in week three with lecture five.

0:00:12 - 0:00:19     Text: So unfortunately, the last class, I guess I really got behind and went a bit slowly.

0:00:19 - 0:00:23     Text: I guess I must just enjoy talking about natural languages too much.

0:00:23 - 0:00:29     Text: And so I never really got to the punchline of showing how you could do good things with the new dependency parser.

0:00:29 - 0:00:44     Text: So I'm going to say for the first piece I'll in some sense be finishing the content of last time and talk about neural dependency parsing, which also gives us the opportunity to introduce a simple feed forward neural net classifier.

0:00:44 - 0:00:56     Text: That will then lead into a little bit of just background things that you need to know about neural networks content because the fact of the matter is there is a bunch of stuff you need to know about neural networks.

0:00:56 - 0:01:06     Text: So both of those things are getting to what's really meant to be the topic of today's lecture, which is looking at language modeling and recurrent neural networks.

0:01:06 - 0:01:17     Text: And those two things are important topics that we'll then be talking about really for the whole of next week as well.

0:01:17 - 0:01:29     Text: So there's a couple of reminders before we get underway. The first is that you should have handed in assignment two before you joined this class, and in turn assignment three is out today.

0:01:29 - 0:01:39     Text: And it's an assignment where you're going to build essentially the neural dependency parser that I'm just about to present, in PyTorch.

0:01:39 - 0:01:51     Text: So part of the role of this assignment is actually to get you up to speed with PyTorch. So this assignment is highly scaffolded with lots of comments and hints about what to do.

0:01:51 - 0:01:59     Text: And so the hope is that by the time you come to the end of it, you'll feel fairly familiar and comfortable with PyTorch.

0:01:59 - 0:02:08     Text: Don't forget there was also a tutorial on PyTorch last week; if you didn't catch that at the time you might want to go back and look at the video.

0:02:08 - 0:02:21     Text: Another thing to mention about the assignments is that assignment three is the last assignment where our great team of TAs are happy to look at your code and sort out your bugs for you.

0:02:21 - 0:02:25     Text: So maybe take advantage of that but not too much.

0:02:25 - 0:02:38     Text: But starting with assignment four, for assignments four, five and the final project, the TAs are very happy to help in general, but it's just not going to be their job to be actually sorting out bugs for you.

0:02:38 - 0:02:46     Text: You should be looking at your code and discussing ideas and concepts and reasons why things might not work with them.

0:02:46 - 0:03:02     Text: Okay, so if you remember where we were last time, I'd introduced this idea of transition based dependency parsers and that these were an efficient linear time method for giving the syntactic structure of natural language text.

0:03:02 - 0:03:19     Text: And they worked pretty well before neural nets came along and took over NLP again, but they had some disadvantages, and their biggest disadvantage is that, like most machine learning models of that time, they worked with indicator features.

0:03:19 - 0:03:43     Text: So that means that you were specifying some condition and then checking whether it was true of a configuration. So something like: the word on the top of the stack is 'good' and its part of speech is adjective, or the next word coming up is a personal pronoun. Those are conditions that would be features in a conventional

0:03:43 - 0:03:54     Text: transition-based dependency parser. And so what are the problems with doing that? Well, one problem is that those features are very sparse.

0:03:54 - 0:04:19     Text: The second problem is the features are incomplete. What I mean by that is, depending on what words and configurations occurred in the training data, there are certain features that will exist because you saw a certain word preceding a verb, and certain features that just won't exist because that word never occurred before a verb in the training data.

0:04:19 - 0:04:34     Text: But the third problem, which is the biggest problem and the biggest opportunity for doing better with a neural dependency parser, is that in a symbolic dependency parser, computing all these features just turns out to actually be pretty expensive.

0:04:34 - 0:04:56     Text: But although the actual transition system that I showed last time is fast and efficient to run, you actually have to compute all of these features and what you found was that about 95% of the parsing time of one of these models was spent just computing all of the features of every configuration.

0:04:56 - 0:05:07     Text: So that suggests that perhaps we can do better with a neural approach where we're going to learn a dense and compact feature representation. And so that's what I want to go through now.

0:05:07 - 0:05:36     Text: So this time we're still going to have exactly the same kind of configuration of a stack and a buffer and running exactly the same transition sequence except this time rather than representing the configuration the stack and the buffer by having several million symbolic features, we're instead going to summarize this configuration as a dense vector of dimensionality perhaps approximately a thousand.

0:05:36 - 0:05:42     Text: And our neural approach is going to learn this dense compact feature representation.

0:05:42 - 0:05:58     Text: And so quite explicitly, what I'm going to show you now briefly, and what you're going to implement, is essentially the neural dependency parser that was developed by Danqi Chen in 2014.

0:05:58 - 0:06:25     Text: And to skip to the advertisement right at the beginning as to how this works so well, these are the kind of results that you got from it, using the measures that I introduced last time: the unlabeled attachment score, whether you attached dependencies correctly to the right word, and the labeled attachment score, whether you also get the type of grammatical relation of that dependency correct.

0:06:25 - 0:06:37     Text: And so essentially this Chen and Manning parser gave a neural version of something like a transition-based dependency parser such as MaltParser, shown in yellow.

0:06:37 - 0:06:53     Text: And the interesting thing was that, taking advantage of a neural classifier in ways that I'm about to explain, that could produce something that was about 2% more accurate than the symbolic dependency parser.

0:06:53 - 0:07:07     Text: And because of the fact that it's not doing all of the symbolic feature computation, despite the fact that you might think at first that there's a lot of real number math and matrix vector multiplies in a neural dependency parser.

0:07:07 - 0:07:16     Text: It actually ran noticeably faster than the symbolic dependency parser because it didn't have all feature computation.

0:07:16 - 0:07:42     Text: The other major approach to dependency parsing that I'm also showing here and I'll get back to at the end is what's referred to as graph based dependency parsing and so that's a different approach to dependency parsing and so these are two symbolic graph based dependency parsers and in the pre neural world, they were somewhat more accurate than the transition based parsers as you could see.

0:07:42 - 0:08:03     Text: But on the other hand, they were close to two orders of magnitude slower. And so essentially with the Chen and Manning parser, we were able to provide something that was basically as accurate as the best graph-based dependency parsers, which were the best dependency parsers, while operating about two orders of magnitude more quickly.

0:08:03 - 0:08:15     Text: And how did we do it? It was actually a very straightforward implementation, which is part of what makes it great for assignment three.

0:08:15 - 0:08:33     Text: But this is how we did it and we got wins. So the first win, which is what we've already talked about extensively starting in week one is to make use of distributed representations. So we represent each word as a word embedding and you've had a lot of experience with that already.

0:08:33 - 0:08:46     Text: And so that means when words weren't seen in a particular configuration, we still know what they're like, because we'll have seen similar words in that configuration.

0:08:46 - 0:09:04     Text: But we don't stop only with word embeddings. The other things that are central to our dependency parser are the parts of speech of words and the dependency labels. And so what we decided to do is that although those are much smaller sets.

0:09:04 - 0:09:16     Text: So the dependency labels are about 40 in number, and the parts of speech are around that order of magnitude, sometimes less, sometimes more. And even within those sets of categories,

0:09:16 - 0:09:33     Text: there are ones that are very strongly related. So we also adopted distributed representations for them. So for example, there might be parts of speech for singular nouns and plural nouns. And basically most of the time they behave similarly. And there are

0:09:33 - 0:09:48     Text: adjectival modifiers and numerical modifiers. So the latter are just numbers like 3, 4, 5. And again, a lot of the time they behave the same, in that you have both three cows and brown cows.

0:09:48 - 0:10:04     Text: So everything is going to be represented in the distributed representation. So at that point, we have exactly the same kind of configuration where we have our stack, our buffer, and we've started to build some arcs.

0:10:04 - 0:10:19     Text: So the classification decisions of the next transition are going to be made out of a few elements of this configuration. So we're looking at the top thing on the stack, the thing that second on the stack, the first word on the buffer.

0:10:19 - 0:10:36     Text: And then we actually added in some additional features: to the extent that we've already built arcs for words on the stack, we can be looking at the dependents on the left and right of those words on the stack that are already in the set of arcs.

0:10:36 - 0:10:58     Text: And so for each of those things, there is a word, there is a part of speech. And for some of them, there is a dependency, where it's already connected up to something else. So for example, the left corner of S2 here has an nsubj dependency back to the second thing on the stack.

0:10:58 - 0:11:15     Text: So we can take these elements of the configuration and can look up the embedding of each one. So we have word embeddings, part of speech embeddings, and dependency embeddings, and just concatenate them all together, kind of like we did before with the window classifier.

0:11:15 - 0:11:20     Text: And that will give us a neural representation of the configuration.
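
As a rough sketch of that lookup-and-concatenate step, here's what it might look like in PyTorch. The feature counts, indices, and embedding sizes below are purely illustrative, not the exact ones from the Chen and Manning parser or the assignment:

```python
import torch
import torch.nn as nn

# Separate embedding tables for words, POS tags, and dependency labels (illustrative sizes).
word_emb = nn.Embedding(num_embeddings=50_000, embedding_dim=50)
pos_emb = nn.Embedding(num_embeddings=45, embedding_dim=50)
dep_emb = nn.Embedding(num_embeddings=40, embedding_dim=50)

# Hypothetical feature indices extracted from one configuration
# (top of stack, second on stack, first buffer word, left/right dependents, ...).
word_ids = torch.tensor([[3, 17, 256, 42, 9, 1]])  # (batch, n_word_features)
pos_ids = torch.tensor([[5, 12, 3, 7, 5, 0]])      # (batch, n_pos_features)
dep_ids = torch.tensor([[2, 11]])                  # (batch, n_dep_features)

# Look up each feature's embedding and concatenate into one dense configuration vector.
x = torch.cat([word_emb(word_ids).flatten(1),
               pos_emb(pos_ids).flatten(1),
               dep_emb(dep_ids).flatten(1)], dim=1)
print(x.shape)  # torch.Size([1, 700]) = 6*50 + 6*50 + 2*50
```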

0:11:20 - 0:11:34     Text: Well, there's a second reason why we can hope to win by using a deep learning classifier to predict the next transition. And we haven't really said much about that yet. So I just wanted to detour and say a little bit more about that.

0:11:34 - 0:11:54     Text: So the simplest kind of classifier that's close to what we've been talking about in neural models is a softmax classifier. So suppose we have d-dimensional vectors x and we have classes y to assign things to.

0:11:54 - 0:12:14     Text: So y is an element of a set of C classes, and then we can build a softmax classifier using the softmax distribution that we've seen before, where we decide the classes based on having a weight matrix that's C by d.

0:12:14 - 0:12:25     Text: And we train on supervised data, the values of this W weight matrix to minimize our negative log likelihood loss that we've seen before.

0:12:25 - 0:12:32     Text: That loss is also commonly referred to as cross-entropy loss, a term that you'll see in PyTorch among other places.
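
Written out, the softmax classifier and its averaged negative log-likelihood (cross-entropy) loss are:

```latex
p(y \mid x) = \frac{\exp(W_y \cdot x)}{\sum_{c=1}^{C} \exp(W_c \cdot x)},
\qquad
J(\theta) = \frac{1}{N} \sum_{i=1}^{N} -\log p(y_i \mid x_i),
```

where $W \in \mathbb{R}^{C \times d}$ and $W_y$ is the row of $W$ for class $y$.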

0:12:32 - 0:12:55     Text: So that is a straightforward machine learning classifier, and if you've done CS229 you've seen softmax classifiers. But a simple softmax classifier like this shares a limitation with most traditional machine learning classifiers and models, including naive Bayes models, support vector machines, and logistic regression:

0:12:55 - 0:13:05     Text: at the end of the day, they're not very powerful classifiers. They're classifiers that only give linear decision boundaries. And so this can be quite limiting.

0:13:05 - 0:13:18     Text: So if you have a difficult problem like the one I'm indicating in the picture in the bottom left. Well, there's just no way you can divide the green points from the red points by simply drawing a straight line.

0:13:18 - 0:13:32     Text: So you're going to have a quite imperfect classifier. So the second big win of neural classifiers is that they can be much more powerful because they can provide nonlinear classification.

0:13:32 - 0:13:45     Text: So rather than only being able to do something like in the left picture, we can come up with classifiers that do something like in the right picture and therefore can separate the green and the red points.

0:13:45 - 0:13:56     Text: As an aside, these pictures I've taken from Andrej Karpathy's ConvNetJS software, which is kind of a fun little tool to play around with if you've got a bit of spare time.

0:13:56 - 0:14:11     Text: And so there's something subtle going on here, because for our more powerful neural net classifiers, at the end of the day, what they have at the top of them is a softmax layer.

0:14:11 - 0:14:23     Text: So that softmax layer is indeed still a linear classifier. But what they have below that is other layers of neural net.

0:14:23 - 0:14:51     Text: And so effectively what happens is that the classification decisions are linear as far as the top softmax is concerned, but nonlinear in the original representation space. So precisely what a neural net can do is warp the space around and move the representation of data points to provide something that at the end of the day can be classified by linear classifier.

0:14:51 - 0:15:05     Text: And so that's what a simple feed-forward neural network multi-class classifier does. So it starts with an input representation, some dense representation of the input.

0:15:05 - 0:15:30     Text: It puts it through a hidden layer h with a matrix multiply followed by a nonlinearity. So that matrix multiply can transform the space and map things around. And so then the output of that we can put into a softmax layer and get out softmax probabilities, from which we make classification decisions.

0:15:30 - 0:15:44     Text: And to the extent that our probabilities don't assign one to the correct class, we then get some log loss or cross entropy error, which we back propagate towards the parameters and embeddings of our model.

0:15:44 - 0:16:03     Text: And as learning goes on via backpropagation, we increasingly well learn the parameters of this hidden layer of the model, which learn to re-represent the input: they move the inputs around in an intermediate hidden vector space

0:16:03 - 0:16:17     Text: so that it can be easily classified with what at the end of the day is a linear softmax. So this is basically the whole of a simple feed-forward neural network multi-class classifier.
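
In equations, that simple feed-forward classifier is something like the following, where $f$ is a nonlinearity (such as the ReLU mentioned in a moment) and $c$ is the correct class:

```latex
h = f(W_1 x + b_1), \qquad
\hat{y} = \operatorname{softmax}(W_2 h + b_2), \qquad
J = -\log \hat{y}_c .
```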

0:16:17 - 0:16:42     Text: And if we had something like a visual signal, we'd just sort of feed real numbers straight in here and we'd be done. But normally with human language material, we actually effectively have one more layer that we're feeding in before that, because really below this dense input layer, we actually have one-hot vectors for what words or parts of speech were involved.

0:16:42 - 0:16:53     Text: And we're doing a lookup process, which you can think of as one more matrix multiply to convert the one hot features into our dense input layer.

0:16:53 - 0:17:03     Text: Okay, in my picture here, the one other thing that's different is I've introduced a different nonlinearity in the hidden layer, which is a rectified linear unit.

0:17:03 - 0:17:16     Text: So what we'll be using in our neural dependency parser looks like the picture in the bottom right, and I'll come back to that in a few minutes. That's one of the extra neural net things to talk about.

0:17:16 - 0:17:43     Text: Okay, so our neural net dependency parser model architecture is essentially exactly that, but applied to the configuration of our transition based dependency parser. So based on our transition based dependency parser configuration, we construct an input layer embedding by looking up on the various elements as I discussed previously.

0:17:43 - 0:18:01     Text: And then we feed it through this hidden layer to the softmax layer to get probabilities out of which we can choose what the next action is, and it's no more complicated than that.
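
Here is a minimal PyTorch sketch of that architecture. The layer sizes, the dropout, and the number of output transitions are illustrative placeholders rather than the exact choices of the Chen and Manning parser or of assignment three:

```python
import torch.nn as nn

class ParserClassifier(nn.Module):
    """Feed-forward classifier over a dense encoding of a parser configuration."""
    def __init__(self, input_dim=700, hidden_dim=200, n_transitions=3, p_drop=0.5):
        super().__init__()
        self.hidden = nn.Linear(input_dim, hidden_dim)    # hidden layer: W x + b
        self.relu = nn.ReLU()                             # nonlinearity
        self.dropout = nn.Dropout(p_drop)                 # regularization (discussed later in the lecture)
        self.out = nn.Linear(hidden_dim, n_transitions)   # scores for shift / left-arc / right-arc

    def forward(self, x):
        h = self.dropout(self.relu(self.hidden(x)))
        return self.out(h)  # raw scores; nn.CrossEntropyLoss applies the softmax internally
```

Training would then pair this with nn.CrossEntropyLoss against the gold next transition for each configuration.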

0:18:01 - 0:18:28     Text: What we found is that just simply, you know, in some sense, using the simplest kind of feed forward neural classifier could provide a very accurate dependency parser that determines the structure of sentences supporting meaning interpretation, the kind of way that I suggested last time.

0:18:28 - 0:18:39     Text: Indeed, you know, despite the fact that it was a quite simple architecture in 2014. This was the first successful neural dependency parser.

0:18:39 - 0:18:57     Text: And the dense representations, especially, but also partly the nonlinearity of the classifier gave us this good result that it could both outperform symbolic parsers in terms of accuracy and it could outperform them in terms of speed.

0:18:57 - 0:19:07     Text: So that was 2014 just quickly here a couple more slides on what's happened since then.

0:19:07 - 0:19:26     Text: So lots of people got excited by the success of this neural dependency parser and a number of people, particularly at Google, then set about building a bigger fancier transition based neural dependency parser. So they explored bigger, deeper networks. There's no reason to only have one hidden layer.

0:19:26 - 0:19:40     Text: Two hidden layers, you can do beam search that I briefly mentioned last time. Another thing that I'm not going to talk about now is adding conditional random field style inference over decision sequences.

0:19:40 - 0:19:57     Text: And that led in 2016 to a model that they called Parsey McParseface, which is hard to say with a straight face, and which was then about two and a half to three percent more accurate than the model that we had produced.

0:19:57 - 0:20:09     Text: But still in basically the same family of transition based parser with the neural net classifier to choose the next transition.

0:20:09 - 0:20:18     Text: So the alternative to transition-based parsers is graph-based dependency parsers. And for a graph-based dependency parser,

0:20:18 - 0:20:30     Text: What you're doing is effectively considering every pair of words and considering a word as a dependent of root and you're coming up with a score as to how likely is that.

0:20:30 - 0:20:45     Text: That 'big' is a dependent of root, or how likely is 'big' to be a dependent of 'cat'. And similarly for every other word: for the word 'sat', how likely is it to be a dependent of root, or a dependent of 'the', etc.

0:20:45 - 0:21:09     Text: And well to do that well, you need to know more than just what the two words involved are. And so what you want to do is understand the context. So you want to have an understanding of the context of big what's to the left or what's to the right of it to understand how you might hook it up into the dependency representations of the sentence.

0:21:09 - 0:21:18     Text: And so while there had been previous work on graph-based dependency parsing, like the MST parser I showed on the earlier results slide,

0:21:18 - 0:21:31     Text: It seemed appealing that we could come up with a much better representation of context using neural nets that look at context and how we do that is actually what I'll be talking about in the end part of the lecture.

0:21:31 - 0:21:42     Text: And so at Stanford, we became interested in trying to work out how to come up with a better graph based dependency parser using context.

0:21:42 - 0:21:49     Text: So I forgot this. This was showing that if we can score each pair wise dependency, we can simply choose the best one.

0:21:49 - 0:22:04     Text: So we can say probably 'big' is a dependent of 'cat'. And to a first approximation, we're going to want to choose, for each word, the head that it seems most likely to be a dependent of.

0:22:04 - 0:22:22     Text: But we want to do that with some constraints, because we want to get out something that is a tree with a single root, as I discussed last time, and you can do that by making use of a minimum spanning tree algorithm that uses these scores of how likely different dependencies are.

0:22:22 - 0:22:41     Text: Okay, so then in 2017 another student, Tim Dozat, and me worked on saying, well, can we now also build a much better neural graph-based dependency parser? And we developed novel methods for scoring

0:22:41 - 0:23:04     Text: dependency parses in a graph-based model, which I'm not going to get into the details of right now. But that also had a very nice result, because getting back to graph-based parsing, we could then build a graph-based parser that performed about a percent better than the best of the Google transition-based neural dependency parsers.

0:23:04 - 0:23:11     Text: But I'll point out that this is a mixed win, because although its accuracy is better,

0:23:11 - 0:23:24     Text: these graph-based parsers are n squared in performance rather than linear time. So kind of like the earlier results I showed, they don't operate nearly as quickly when you're wanting to

0:23:24 - 0:23:39     Text: parse large amounts of text with complex long sentences. Okay, so that's everything you need to know about dependency parsers to do assignment three, so grab it this evening and start to work.

0:23:39 - 0:23:57     Text: I did want to sort of before going on to the next topic just mention a few more things about neural networks since some of you know this well already some of you have seen less of it, but you know, there just are a bunch of things you have to be aware of for building neural networks.

0:23:57 - 0:24:24     Text: Now again for assignment three, essentially we give you everything, and if you follow the recipe your parser should work well. But what you should minimally do is actually look carefully at some of the things that this parser does, which is questions like how do we initialize the matrices of our neural network.

0:24:24 - 0:24:29     Text: What kind of optimizers do we use and things like that.

0:24:29 - 0:24:36     Text: Because these are all important decisions and so I wanted to say just a few words about that.

0:24:36 - 0:24:43     Text: Okay, so the first thing that we haven't discussed at all is the concept of regularization.

0:24:43 - 0:25:04     Text: So we're building these neural nets we're now building models with a huge number of parameters so essentially just about all neural net models that work well actually their full loss function is a regularized loss function.

0:25:04 - 0:25:26     Text: So this is our loss function J here. Well, this part here is the part that we've seen before, where we're using a softmax classifier and then taking a negative log likelihood loss, which we're averaging over the different examples. But then we actually stick on the end of it

0:25:26 - 0:25:36     Text: a regularization term, and this regularization term sums the square of every parameter in the model.

0:25:36 - 0:25:53     Text: And so what that effectively says is you only want to make parameters non-zero if they're really useful, right? So to the extent that the parameters don't help much, you're just being penalized here,

0:25:53 - 0:26:09     Text: and they're not going to end up very far from zero. But to the extent that the parameters do help, you will gain in your estimation of likelihood, and therefore it's OK for them to be non-zero. In particular, notice that this penalty is assessed only once per parameter.

0:26:09 - 0:26:13     Text: It's not being assessed separately for each example.
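
So the full regularized loss being described looks something like this, with the regularization term added onto the averaged cross-entropy loss:

```latex
J(\theta) = \frac{1}{N} \sum_{i=1}^{N}
-\log \frac{\exp(f_{y_i}(x_i))}{\sum_{c=1}^{C} \exp(f_{c}(x_i))}
\;+\; \lambda \sum_{k} \theta_k^{2}
```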

0:26:13 - 0:26:21     Text: Okay, and having this kind of regularization is essential to build neural net models that generalize well.

0:26:21 - 0:26:25     Text: So the classic problem is referred to as overfitting.

0:26:25 - 0:26:31     Text: And what overfitting means is that if you have a particular training data set and you start training your model,

0:26:32 - 0:26:37     Text: your error will go down because you'll shift the parameters so they better predict

0:26:37 - 0:26:46     Text: the correct answer for data points in the model and you can keep on doing that and it will

0:26:46 - 0:26:54     Text: keep on reducing your error rate. But if you then look at your partially trained classifier and say

0:26:54 - 0:27:02     Text: how well does this classifier classify independent data, different test data that you weren't training

0:27:02 - 0:27:10     Text: the model on, what you'll find is up until a certain point you'll get better at classifying

0:27:10 - 0:27:17     Text: independent test examples as well. And after that, commonly what will happen is you'll actually

0:27:17 - 0:27:23     Text: start to get worse at classifying independent test examples even though you're continuing to get

0:27:23 - 0:27:29     Text: better at predicting the training examples. And so this was then referred to as your overfitting

0:27:29 - 0:27:35     Text: the training examples that you're fiddling the parameters of the model so they're really good at

0:27:35 - 0:27:42     Text: predicting the training examples which aren't useful things that can then predict on independent

0:27:43 - 0:27:53     Text: examples that you come to at runtime. Okay, that classic view of regularization is sort of actually

0:27:53 - 0:28:03     Text: outmoded and wrong for modern neural networks. So the right way to think of it for the kind of

0:28:03 - 0:28:12     Text: modern big neural networks that we build is that overfitting on the training data isn't a problem

0:28:12 - 0:28:21     Text: but nevertheless you need regularization to make sure that your models generalize well to

0:28:21 - 0:28:29     Text: independent test data. So what you'd like is for your graph not to look like this example

0:28:29 - 0:28:38     Text: with test error starting to head up. You'd like to have it at worst case flatline and best case

0:28:38 - 0:28:45     Text: still be gradually dropping. It'll always be higher than the training error but it's not actually

0:28:45 - 0:28:54     Text: showing a failure to generalize. So when we train big neural nets these days our big neural nets

0:28:55 - 0:29:02     Text: always overfit on the training data they hugely overfit on the training data. In fact in many

0:29:02 - 0:29:08     Text: circumstances our neural nets have so many parameters that you can continue to train them on the

0:29:08 - 0:29:14     Text: training data until the error on the training data is zero. They get every single example right

0:29:14 - 0:29:20     Text: because they can just memorize enough stuff about it to predict the right answer. But in general

0:29:20 - 0:29:27     Text: providing the models are regularized well those models will still also generalize well and

0:29:27 - 0:29:34     Text: predict well an independent data. And so for part of what we want to do for that is to work out how

0:29:34 - 0:29:42     Text: much to regularize. And so this lambda parameter here is the strength of regularization. So if you're

0:29:42 - 0:29:48     Text: making that lambda number big you're getting more regularization and if you're making it small

0:29:48 - 0:29:53     Text: you're getting less. And you don't want to have it be too big or else you won't fit the data well

0:29:53 - 0:30:01     Text: and you don't want it to be too small or else you have the problem that you don't generalize well.

0:30:01 - 0:30:08     Text: Okay so this is classic L2 regularization and it's a starting point but our big neural nets are

0:30:08 - 0:30:14     Text: sufficiently complex and have sufficiently many parameters that essentially L2 regularization

0:30:14 - 0:30:20     Text: doesn't cut it. So the next thing that you should know about and is a very standard good feature

0:30:21 - 0:30:32     Text: for building your own nets is a technique called dropout. So dropout is generally introduced as a

0:30:32 - 0:30:40     Text: sort of a slightly funny process that you do when training to avoid feature co-adaptation.

0:30:40 - 0:30:48     Text: So in dropout what you do is at the time that you're training your model that for each

0:30:49 - 0:31:00     Text: instance or for each batch in your training then for each neuron in the model you drop 50% of its

0:31:00 - 0:31:07     Text: inputs you just treat them as zero and so that you can do by sort of zeroing out elements of

0:31:08 - 0:31:18     Text: the sort of layers. And then at test time you don't drop any of the model weights you keep them all

0:31:18 - 0:31:23     Text: but actually you halve the model weights, because you're now keeping twice as many things as you'd

0:31:23 - 0:31:33     Text: used at training time. And so effectively that little recipe prevents what's called feature co-adaptation.
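
A minimal sketch of that training recipe, written by hand rather than with PyTorch's built-in nn.Dropout (which uses the equivalent "inverted" formulation that rescales at training time instead of at test time):

```python
import torch

def dropout_layer(h, p_drop=0.5, train=True):
    """Classic dropout as described above: randomly zero units while training,
    then scale activations down at test time to compensate."""
    if train:
        mask = (torch.rand_like(h) > p_drop).float()  # keep each unit with probability 1 - p_drop
        return h * mask
    # At test time nothing is dropped, but we halve (for p_drop = 0.5) to match the training scale.
    return h * (1.0 - p_drop)
```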

0:31:33 - 0:31:42     Text: So you can't you can't have features that are only useful in the presence of particular other

0:31:42 - 0:31:47     Text: features because the model can't guarantee which features are going to be present for different

0:31:47 - 0:31:54     Text: examples because different features are being randomly dropped all of the time. And so effectively

0:31:54 - 0:31:59     Text: dropout gives you a kind of a middle ground between naive Bayes and a logistic regression model:

0:31:59 - 0:32:05     Text: in naive Bayes models all the weights are set independently, and in a logistic regression model

0:32:05 - 0:32:10     Text: all the weights are set in the context of all the others, and here you are aware of other weights

0:32:10 - 0:32:17     Text: but they can randomly disappear from you. It's also related to ensemble models like model bagging

0:32:17 - 0:32:24     Text: because you're using different subsets of the features every time. But after all of those

0:32:24 - 0:32:30     Text: explanations there's actually another way of thinking about dropout which was actually developed

0:32:30 - 0:32:38     Text: here at Stanford, as a paper by Percy Liang and students, which is to argue that really what dropout

0:32:38 - 0:32:45     Text: gives you is a strong regularizer that isn't a uniform regularizer like L2 that regularizes

0:32:45 - 0:32:51     Text: everything with an L2 loss, but can learn a feature-dependent regularization, and so that dropout

0:32:51 - 0:32:58     Text: has just emerged as, in general, the best way to do regularization for neural nets. I think you've

0:32:58 - 0:33:06     Text: already seen and heard this one but just have it on my slides once. If you want to have your neural

0:33:06 - 0:33:14     Text: networks go fast it's really essential that you make use of vectors matrices tensors and you

0:33:14 - 0:33:22     Text: don't do things with for loops. So here's a teeny example where I'm using timeit, which is a useful

0:33:22 - 0:33:27     Text: tool that you can use, too, to see how fast your neural nets run with different ways of writing

0:33:27 - 0:33:41     Text: it and so when I'm doing this doing these dot products here I can either do the dot product in a

0:33:41 - 0:33:49     Text: for loop against each word vector or I can do the dot product with a single word vector matrix

0:33:49 - 0:34:00     Text: and if I do it in a for loop doing each loop takes me almost a second whereas if I do it with a

0:34:01 - 0:34:07     Text: matrix multiply it takes me an order of magnitude less time so you should always be looking to use

0:34:07 - 0:34:14     Text: vectors and matrices not for loops and this is a speed up of about 10 times when you're doing things

0:34:14 - 0:34:22     Text: on a CPU. Heading forward we're going to be using GPUs, and they only further exaggerate the

0:34:22 - 0:34:27     Text: advantages of using vectors and matrices, where you'll commonly get two orders of magnitude of speedup by doing things that way.
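
Here's a small example in the spirit of that comparison; the sizes are made up and the exact timings depend on your machine, but the gap between the loop and the single matrix-vector multiply is what matters:

```python
import timeit
import numpy as np

d, n = 300, 10_000
query = np.random.rand(d)       # one "query" vector
vecs = np.random.rand(n, d)     # n word vectors stacked into a matrix

def with_loop():
    return [query.dot(vecs[i]) for i in range(n)]   # one dot product per word

def with_matmul():
    return vecs @ query                             # a single matrix-vector multiply

print("loop:  ", timeit.timeit(with_loop, number=10))
print("matmul:", timeit.timeit(with_matmul, number=10))  # typically ~10x faster on CPU
```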

0:34:27 - 0:34:38     Text: Yeah, so for the backward pass, you are running the backward pass as before

0:34:38 - 0:34:47     Text: on the dropped-out examples, right? So for the things that were dropped out, no gradient is going through

0:34:47 - 0:34:54     Text: them because they weren't present they're not affecting things so in a particular batch

0:34:54 - 0:35:01     Text: you're only training weights for the things that aren't dropped out but then since you for each

0:35:01 - 0:35:07     Text: successive batch you drop out different things that over a bunch of batches you're then training

0:35:07 - 0:35:18     Text: all of the weights of the model. And so a feature-dependent regularizer means that

0:35:18 - 0:35:29     Text: the different features can be regularized by different amounts to maximize performance,

0:35:29 - 0:35:38     Text: so back in this model every feature was just sort of being penalized by taking lambda times

0:35:38 - 0:35:46     Text: its squared value, so this is sort of uniform regularization, whereas the end result of this dropout

0:35:46 - 0:35:54     Text: style training is that you end up with some features being regularized much more strongly

0:35:54 - 0:36:03     Text: and some other features being regularized less strongly and how much they regularize depends on

0:36:03 - 0:36:09     Text: how much they're being used so you're regularizing more features that are being used less but I'm

0:36:11 - 0:36:15     Text: I'm not going to get through into the details of how you can understand that perspective

0:36:15 - 0:36:22     Text: that's outside the scope of what I'm going to get through right now. So the final bit is

0:36:22 - 0:36:31     Text: I just wanted to give a little bit of perspective on non-linearities in our neural nets so the first

0:36:31 - 0:36:39     Text: thing to remember is you have to have non-linearities so if you're building a multi-layer neural net

0:36:39 - 0:36:46     Text: and you've just got, you know, W1 x plus b1, then you put it through W2 x plus b2, and then put it

0:36:46 - 0:36:53     Text: through W3 x. Well, I guess they're different hidden layers, so rather than x I should have said

0:36:53 - 0:37:01     Text: hidden one, hidden two, hidden three: W3 h3 plus b3. Multiple linear transformations

0:37:02 - 0:37:08     Text: composed like that can just be collapsed down into a single linear transformation, so you don't get

0:37:08 - 0:37:19     Text: any extra power as a data representation by having multiple linear layers.
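
Concretely, for two stacked linear layers:

```latex
W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\, x + (W_2 b_1 + b_2) = W' x + b'
```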

0:37:19 - 0:37:23     Text: There's a slightly longer story there, because you actually do get some interesting learning effects, but I'm not going to talk about

0:37:23 - 0:37:33     Text: that now but standardly we have to have some kind of non-linearity to do something interesting

0:37:33 - 0:37:41     Text: in a deep neural network. Okay, so as a starting point, the most classic non-linearity

0:37:41 - 0:37:48     Text: is the logistic, often just called the sigmoid non-linearity because of its S shape, which we've

0:37:48 - 0:37:56     Text: seen before in previous lectures so this will take any real number and map it on to the range

0:37:56 - 0:38:06     Text: of 0 1 and that was sort of basically what people used in sort of 1980s neural nets now one

0:38:06 - 0:38:14     Text: disadvantage of this non-linearity is that it's moving everything into the positive space

0:38:14 - 0:38:21     Text: because the output was always between 0 and 1 so people then decided that for many purposes

0:38:21 - 0:38:28     Text: it was useful to have this variant sigmoid shape of hyperbolic tan, which is then being shown

0:38:28 - 0:38:35     Text: in the second picture now you know logistic and hyperbolic tan they sound like they're very

0:38:35 - 0:38:41     Text: different things but actually as you maybe remember from a math class hyperbolic tan can be

0:38:41 - 0:38:47     Text: represented in terms of exponentials as well and if you do a bit of math which possibly we might

0:38:47 - 0:38:53     Text: make you do on an assignment it's actually the case that a hyperbolic tangent is just a rescaled

0:38:53 - 0:38:59     Text: and shifted version of the logistic so it's really exactly the same curve just squeezed a bit so

0:38:59 - 0:39:09     Text: it goes now symmetrically between minus 1 and 1 well these kind of transcendental functions like

0:39:09 - 0:39:15     Text: hyperbolic tangent they're kind of slow and expensive to compute right even on our fast computers

0:39:15 - 0:39:22     Text: calculating exponentials is a bit slow so something people became interested in was well could we do

0:39:22 - 0:39:30     Text: things with a much simpler non-linearity? So what if we used a so-called hard tanh? So the hard tanh,

0:39:31 - 0:39:42     Text: at some point up to some point it just flat lines at minus 1 then it is y equals x up until 1

0:39:42 - 0:39:49     Text: and then it just flat lines again and you know that seems a slightly weird thing to use because

0:39:49 - 0:39:57     Text: if your input is over on the left or over on the right you're sort of not getting any discrimination

0:39:57 - 0:40:04     Text: in everything's giving the same output but somewhat surprisingly I mean I was surprised when people

0:40:04 - 0:40:13     Text: started doing this these kind of models proved to be very successful and so that then led into

0:40:13 - 0:40:19     Text: what's proven to be kind of the most successful and generally widely used non-linearity in a lot

0:40:19 - 0:40:28     Text: of recent deep learning work, which was what was being used in the dependency parser model I showed,

0:40:28 - 0:40:34     Text: which is what's called the rectified linear unit, or ReLU. So a ReLU is kind of the simplest kind of

0:40:34 - 0:40:41     Text: non-linearity that you can imagine so if the value of x is negative it's value is 0 so effectively

0:40:41 - 0:40:49     Text: it's just dead it's not doing anything in the computation and if it's value of x is greater than

0:40:49 - 0:40:58     Text: 0 then it's just simply y equals x the value as being passed through and at first sight this might

0:40:58 - 0:41:05     Text: seem really really weird and how could this be useful as a non-linearity but if you sort of think

0:41:05 - 0:41:11     Text: a bit about how you can approximate things with piecewise linear functions very accurately you

0:41:11 - 0:41:18     Text: might kind of start to see how you could use this to do accurate function approximation

0:41:18 - 0:41:24     Text: with piecewise linear functions, and that's what ReLU units have been found to do extremely,

0:41:24 - 0:41:32     Text: extremely successfully so logistic and tanH are still used in various places you use logistic when

0:41:32 - 0:41:38     Text: you want a probability output. We'll see tanh again very soon when we get to recurrent neural

0:41:38 - 0:41:44     Text: networks but they're no longer the default when making deep networks that in a lot of places the

0:41:44 - 0:41:50     Text: first thing you should think about trying is relu non-linearities and so in particular

0:41:52 - 0:42:01     Text: part of why they're good is that ReLU networks train very quickly, because you get

0:42:01 - 0:42:07     Text: this sort of very straightforward gradient backflow because providing you on the right hand side

0:42:07 - 0:42:13     Text: of it you then just getting this sort of constant gradient backflow from the slope one and so they

0:42:13 - 0:42:21     Text: train very quickly the somewhat surprising fact is that sort of almost the simplest non-linearity

0:42:21 - 0:42:28     Text: imaginable is still enough to have a very good neural network but it just is people have played

0:42:28 - 0:42:34     Text: around with variants of that. So people have then played around with leaky ReLUs, where rather than

0:42:35 - 0:42:41     Text: the left hand side just going completely to zero it goes slightly negative on a

0:42:41 - 0:42:49     Text: very much shallower slope. And then there's been a parametric ReLU where you have an extra parameter

0:42:49 - 0:42:56     Text: where you learn the slope of the negative part another thing that's been used recently is this

0:42:56 - 0:43:04     Text: swish non-linearity which looks almost like a relu but it sort of curves down just a little bit

0:43:04 - 0:43:10     Text: there and starts to go up I mean I think it's fair to say that you know none of these have really

0:43:10 - 0:43:15     Text: proven themselves vastly superior there are papers saying I can get better results by using one of

0:43:15 - 0:43:21     Text: these and maybe you can but you know it's not night and day and a vast majority of work that you

0:43:21 - 0:43:33     Text: see around is still just using relus in many places okay a couple more things parameter initialization

0:43:33 - 0:43:45     Text: so in almost all cases you must must must initialize the matrices of your neural nets with small random

0:43:45 - 0:43:53     Text: values neural nets just don't work if you start the matrices off as zero because effectively then

0:43:53 - 0:44:03     Text: everything is symmetric, nothing can specialize in different ways, and you then get sort of

0:44:03 - 0:44:08     Text: you just don't have an ability for a neural net to learn you sort of get this defective solution

0:44:09 - 0:44:17     Text: so standardly you're using some methods such as drawing random numbers uniformly between

0:44:17 - 0:44:25     Text: minus r and r for a small value r and just filling in all the parameters with that. The exception

0:44:25 - 0:44:30     Text: is bias weights: it's fine to set bias weights to zero, and in some sense that's better.

0:44:32 - 0:44:41     Text: in terms of choosing what the r value is essentially for traditional neural nets what we want to

0:44:41 - 0:44:49     Text: set that r range for is so that the numbers in our neural network stay of a reasonable size,

0:44:49 - 0:44:56     Text: they don't get too big and they don't get too small and whether they kind of blow up or not

0:44:56 - 0:45:02     Text: depends on how many connections there are in the neural network, so we're looking at the fan-in and

0:45:02 - 0:45:12     Text: fan out of connections in the neural network and so a very common initialization that you'll see

0:45:12 - 0:45:19     Text: in PyTorch is what's called Xavier initialization, named after the person who suggested it,

0:45:20 - 0:45:29     Text: and it's working out a value of r based on the fan-in and fan-out of the layers. But you can just sort of ask for it, say initialize with this initialization, and it will.
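
In PyTorch that looks something like the following sketch (the layer sizes are placeholders):

```python
import torch.nn as nn

layer = nn.Linear(700, 200)
nn.init.xavier_uniform_(layer.weight)  # range chosen from the fan-in and fan-out of the layer
nn.init.zeros_(layer.bias)             # biases can simply start at zero
```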

0:45:29 - 0:45:36     Text: This is another area where there have

0:45:36 - 0:45:44     Text: been some subsequent development so around week five we'll start talking about layer normalization

0:45:44 - 0:45:49     Text: and if you're using layer normalization then it sort of doesn't matter the same how you initialize

0:45:49 - 0:45:58     Text: the weights so finally we have to train our models and I've briefly introduced the idea of

0:45:58 - 0:46:06     Text: stochastic gradient descent and you know the good news is that most of the time that if training

0:46:06 - 0:46:14     Text: your networks with stochastic gradient descent works just fine use it and you will get good results

0:46:16 - 0:46:23     Text: however often that requires choosing a suitable learning rate which is my final slide of tips

0:46:23 - 0:46:31     Text: on the next slide but there's been an enormous amount of work on optimization of neural networks

0:46:31 - 0:46:39     Text: and people have come up with the whole series of more sophisticated optimizers and I'm not going

0:46:39 - 0:46:46     Text: again to the details of optimization this class but the very loose idea is that these optimizers

0:46:46 - 0:46:53     Text: are adaptive in that they can kind of keep track of how much slope there was, how much gradient

0:46:53 - 0:46:59     Text: there is for different parameters and therefore based on that make decisions as to how much to

0:46:59 - 0:47:06     Text: adjust the weights when doing the gradient update rather than adjusting it by a constant amount

0:47:06 - 0:47:13     Text: and so in that family of methods there are methods that include Adagrad, RMSprop,

0:47:13 - 0:47:19     Text: Adam, and then variants of Adam including SparseAdam, AdamW, etc.

0:47:20 - 0:47:27     Text: The one called Adam is a pretty good place to start and a lot of the time that's a good one to use

0:47:27 - 0:47:33     Text: and again from the perspective of PyTorch when you're initializing an optimizer you can just say

0:47:33 - 0:47:38     Text: please use Adam, and you don't actually need to know much more about it than that.
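
For example, a sketch with a toy stand-in model and an illustrative learning rate:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                                   # stand-in for your real network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr here is the initial learning rate

# One illustrative update step on fake data:
x, y = torch.randn(4, 10), torch.tensor([0, 1, 0, 1])
loss = nn.CrossEntropyLoss()(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```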

0:47:38 - 0:47:47     Text: If you are using simple stochastic gradient descent you have to choose a learning rate, so

0:47:47 - 0:47:54     Text: that was the eta value that you multiplied the gradient by for how much to adjust the weights,

0:47:54 - 0:48:01     Text: and so I talked about that slightly how you didn't want it to be too big or your model could diverge

0:48:01 - 0:48:09     Text: or bounce around you didn't want it to be too small or else training could take place exceedingly

0:48:09 - 0:48:16     Text: slowly and you'll miss the assignment deadline. How big it should be depends on all sorts of details

0:48:16 - 0:48:23     Text: of the model and so you sort of want to try out some different order of magnitude numbers to see

0:48:23 - 0:48:29     Text: what numbers seem to work well for training stably but reasonably quickly; something around 10 to

0:48:29 - 0:48:36     Text: the minus 3 or 10 to the minus 4 is not a crazy place to start. In principle you can do fine just

0:48:36 - 0:48:43     Text: using a constant learning rate in SGD in practice people generally find they can get better results

0:48:43 - 0:48:51     Text: by decreasing learning rates as you train, so a very common recipe is that you halve the learning

0:48:51 - 0:48:57     Text: rate after every k epochs, where an epoch means that you've made a pass through the entire set of

0:48:57 - 0:49:03     Text: training data; so perhaps something like every three epochs you halve the learning rate.

0:49:05 - 0:49:09     Text: And a final little note there in purple is when you make a pass through the data you don't want

0:49:09 - 0:49:16     Text: to go through the data items in the same order each time, because that leads you to kind of

0:49:16 - 0:49:23     Text: have a sort of patterning of the training examples, and the model will sort of fall into that

0:49:23 - 0:49:28     Text: periodicity of those patterns so it's best to shuffle the data before each pass through it.
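
A sketch of that recipe in PyTorch, halving the learning rate every three epochs and reshuffling on every pass; the model, dataset, and sizes are placeholders:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(10, 2)                                              # placeholder model
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=3, gamma=0.5)  # halve the lr every 3 epochs

data = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))
loader = DataLoader(data, batch_size=16, shuffle=True)                # reshuffles before each epoch

for epoch in range(9):
    for x, y in loader:
        opt.zero_grad()
        nn.CrossEntropyLoss()(model(x), y).backward()
        opt.step()
    sched.step()  # decay the learning rate once per epoch
```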

0:49:30 - 0:49:37     Text: Okay now more sophisticated ways to set learning rates and I won't really get into those now.

0:49:38 - 0:49:43     Text: Fancy optimizers like Adam also have a learning rate so you still have to choose

0:49:43 - 0:49:48     Text: a learning rate value but it's effectively it's an initial learning rate which typically the

0:49:48 - 0:49:54     Text: optimizer shrinks as it runs and so you commonly want to have the number it starts off with

0:49:54 - 0:50:02     Text: be on the larger side, because it'll be shrinking as it goes. Okay, so that's all by way of

0:50:02 - 0:50:09     Text: introduction, and I'm now ready to start on language models and RNNs. So what is language

0:50:09 - 0:50:16     Text: modeling? I mean as two words of English language modeling can mean just about anything but in the

0:50:16 - 0:50:22     Text: natural language processing literature language modeling has a very precise technical definition

0:50:22 - 0:50:30     Text: which you should know so language modeling is the task of predicting the word that comes next.

0:50:31 - 0:50:38     Text: So if you have some context like 'the students open their', you want to be able to predict

0:50:38 - 0:50:47     Text: what words will come next is it their books their laptops their exams their minds and so in particular

0:50:48 - 0:50:56     Text: what you want to be doing is being able to give a probability that different words will occur

0:50:56 - 0:51:05     Text: in this context. So a language model is a probability distribution over next words given a

0:51:05 - 0:51:17     Text: preceding context and a system that does that is called a language model. So as a result of that

0:51:17 - 0:51:23     Text: you can also think of a language model as a system that assigns a probability score to a piece of

0:51:23 - 0:51:29     Text: text. So if we have a piece of text then we can just work out its probability according to a

0:51:29 - 0:51:36     Text: language model. So the probability of a sequence of tokens we can decompose via the chain rule

0:51:36 - 0:51:42     Text: probability of the first times probability of the second given the first etc etc and then

0:51:42 - 0:51:49     Text: we can work that out, using the probabilities our language model provides, as a product of the probabilities of predicting each next word.
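
That is, for a sequence of tokens $x^{(1)}, \dots, x^{(T)}$:

```latex
P(x^{(1)}, \dots, x^{(T)})
= P(x^{(1)}) \, P(x^{(2)} \mid x^{(1)}) \cdots P(x^{(T)} \mid x^{(T-1)}, \dots, x^{(1)})
= \prod_{t=1}^{T} P(x^{(t)} \mid x^{(t-1)}, \dots, x^{(1)})
```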

0:51:49 - 0:51:59     Text: Okay, language models are really the cornerstone of human language

0:51:59 - 0:52:07     Text: technology everything that you do with computers and involves human language you are using language

0:52:07 - 0:52:15     Text: models. So when you're using your phone and it's suggesting whether well or badly what the next

0:52:15 - 0:52:21     Text: word that you probably want to type is that's a language model working to try and predict the likely

0:52:21 - 0:52:28     Text: next words. When the same thing happens in a Google doc and it's suggesting a next word or a next

0:52:28 - 0:52:36     Text: few words that's a language model. You know the main reason why the one in Google Docs works much

0:52:36 - 0:52:42     Text: better than the one on your phone is that for the keyboard phone models they have to be very compact

0:52:42 - 0:52:49     Text: so they can run quickly and in not much memory. So they're sort of only mediocre language models, whereas

0:52:49 - 0:52:56     Text: something like Google Docs can do a much better language modeling job. Query completion, same thing:

0:52:56 - 0:53:05     Text: there's a language model. And so then the question is well how do we build language models and so

0:53:05 - 0:53:12     Text: I briefly wanted to first again give the traditional answer since you should have at least some

0:53:12 - 0:53:19     Text: understanding of how NLP was done without a neural network and the traditional answer that

0:53:19 - 0:53:26     Text: powered speech recognition and other applications for at least two decades, three decades really

0:53:27 - 0:53:33     Text: was what are called n-gram language models, and these were a very simple but still quite effective

0:53:33 - 0:53:42     Text: idea. So we want to give probabilities of next words. So what we're going to work with is what

0:53:42 - 0:53:50     Text: are referred to as n-grams, and so an n-gram is just a chunk of n consecutive words, which are usually

0:53:50 - 0:53:56     Text: referred to as unigrams, bigrams, trigrams, and then four-grams and five-grams. A horrible set of

0:53:56 - 0:54:05     Text: names which would offend any humanist but that's what people normally say. And so effectively what

0:54:05 - 0:54:12     Text: we do is just collect statistics about how often different Ngrams occur in a large amount of

0:54:12 - 0:54:19     Text: text and then use those to build a probability model. So the first thing we do is what's referred

0:54:19 - 0:54:25     Text: to as making a Markov assumption, so these are also referred to as Markov models, and we decide

0:54:25 - 0:54:32     Text: that the word in position t plus one only depends on the preceding n minus one words.

0:54:32 - 0:54:43     Text: So if we want to predict t plus one given the entire preceding text we actually throw away the

0:54:43 - 0:54:50     Text: early words and just use the preceding N minus one words as context. Well once we've made that

0:54:50 - 0:54:56     Text: simplification we can then just use the definition of conditional probability and say all that

0:54:56 - 0:55:06     Text: conditional probability is the probability of the n words divided by the probability of the preceding n minus one words,

0:55:06 - 0:55:12     Text: and so we have the probability of an Ngram over the probability of an N minus one gram.
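
Putting the Markov assumption together with the counting estimate that's described next:

```latex
P(x^{(t+1)} \mid x^{(t)}, \dots, x^{(t-n+2)})
= \frac{P(x^{(t+1)}, x^{(t)}, \dots, x^{(t-n+2)})}{P(x^{(t)}, \dots, x^{(t-n+2)})}
\approx \frac{\operatorname{count}(x^{(t+1)}, x^{(t)}, \dots, x^{(t-n+2)})}{\operatorname{count}(x^{(t)}, \dots, x^{(t-n+2)})}
```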

0:55:13 - 0:55:19     Text: And so then how do we get these Ngram and N minus one gram probabilities? We simply take a

0:55:19 - 0:55:27     Text: large amount of text in some language and we count how often the different n-grams occur.

0:55:27 - 0:55:34     Text: And so our crude statistical approximation starts off as the count of the Ngram over the count of

0:55:34 - 0:55:41     Text: the N minus one gram. So here's an example of that. Suppose we are learning a four-gram language

0:55:41 - 0:55:48     Text: model. Okay so we throw away all words apart from the last three words and they're our conditioning.

0:55:48 - 0:55:59     Text: We look in some large, we use the counts from some large training corpus and we see how often did

0:55:59 - 0:56:05     Text: students open their books occur, how often did students open their minds occur and then for each

0:56:05 - 0:56:11     Text: of those counts we divide through by the count of how often students open their occurred and that

0:56:11 - 0:56:20     Text: gives us our probability estimates. So for example if in the corpus students open their occurred

0:56:20 - 0:56:26     Text: a thousand times, students open their books occurred four hundred times, we get a probability

0:56:26 - 0:56:32     Text: estimate of 0.4 for books, if exams occurred a hundred times it gets 0.1 for exams.
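
A toy version of that estimation from raw counts; the "corpus" here is obviously made up, so the numbers only illustrate the mechanics:

```python
from collections import Counter

corpus = ("the students open their books . "
          "the students open their exams . "
          "as the proctor started the clock the students open their books .").split()

# Count 4-grams and their 3-gram prefixes.
fourgrams = Counter(tuple(corpus[i:i + 4]) for i in range(len(corpus) - 3))
trigrams = Counter(tuple(corpus[i:i + 3]) for i in range(len(corpus) - 2))

context = ("students", "open", "their")
for w in ("books", "exams"):
    prob = fourgrams[context + (w,)] / trigrams[context]
    print(w, prob)   # books 0.666..., exams 0.333... on this toy corpus
```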

0:56:33 - 0:56:39     Text: And we sort of see here already the disadvantage of having made the Markov assumption

0:56:39 - 0:56:46     Text: and having gotten rid of all of this earlier context, which would be useful for helping us to predict.

0:56:48 - 0:56:56     Text: The one other point that I'll just mention, that I confuse myself on, is the naming of the n-gram

0:56:56 - 0:57:04     Text: language model. So a four-gram language model is called a four-gram language model because

0:57:04 - 0:57:11     Text: in its estimation you're using four grams in the numerator and trigrams in the denominator.

0:57:11 - 0:57:18     Text: So you use the size of the numerator. So that terminology is different to the terminology

0:57:18 - 0:57:25     Text: that's used in Markov models. So when people talk about the order of a Markov model, that

0:57:25 - 0:57:34     Text: refers to the amount of context you're using, so this would correspond to a third-order Markov model.

0:57:35 - 0:57:43     Text: Yeah, so someone asked: is this similar to a naive Bayes model? So in naive Bayes models you also

0:57:43 - 0:57:52     Text: estimate the probabilities just by counting. So they're they're related and they're sort of in

0:57:52 - 0:58:01     Text: some sense two different answers. The first difference, or specialization, is that naive Bayes models

0:58:03 - 0:58:11     Text: work out probabilities of words independent of their neighbors. So in one part, a naive Bayes

0:58:11 - 0:58:17     Text: language model is a unigram language model. So you're just using the counts of individual words.

0:58:17 - 0:58:24     Text: But the other part of a naive Bayes model is that you're learning a different set of unigram counts

0:58:24 - 0:58:34     Text: for every class of your classifier. And so effectively a naive Bayes

0:58:34 - 0:58:46     Text: model is: you've got class-specific unigram language models. Okay, I gave this as a

0:58:46 - 0:58:52     Text: simple statistical model for estimating your probabilities with an n-gram model. You can't

0:58:52 - 0:58:58     Text: actually get away with just doing that, because you have sparsity problems. So it will often

0:58:58 - 0:59:07     Text: be the case that phrases like 'students opened their books' or 'students opened their backpacks'

0:59:07 - 0:59:11     Text: just never occurred in the training data. If you think about it, if you have something like

0:59:11 - 0:59:18     Text: 10 to the fifth different words and you want a probability for a sequence of four words,

0:59:18 - 0:59:23     Text: with 10 to the fifth choices for each, there are around 10 to the 20th different

0:59:23 - 0:59:29     Text: combinations. So unless you've seen a truly astronomical amount of data, most

0:59:29 - 0:59:35     Text: four-word sequences you've never seen. So then your numerator will be zero and your probability

0:59:35 - 0:59:40     Text: estimate will be zero, and that's bad. And so the most common way of solving that is just to add

0:59:40 - 0:59:46     Text: a little delta to every count, and then everything is non-zero, and that's called smoothing.
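
In symbols, one common add-delta (Laplace-style) version of that fix, with |V| the vocabulary size, would look something like:

    P(w \mid \text{students opened their})
      = \frac{\mathrm{count}(\text{students opened their } w) + \delta}{\mathrm{count}(\text{students opened their}) + \delta \, |V|}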

0:59:47 - 0:59:51     Text: But sometimes it's worse than that, because sometimes you won't even have seen

0:59:51 - 0:59:56     Text: 'students opened their', and that's more problematic because that means our denominator

0:59:56 - 1:00:04     Text: is zero, and so the division will be ill-defined and we can't usefully calculate any probabilities

1:00:04 - 1:00:09     Text: in a context that we've never seen. And so the standard solution to that is to

1:00:09 - 1:00:17     Text: shorten the context, and that's called back-off. So we condition only on 'opened their', or if we

1:00:17 - 1:00:23     Text: still haven't seen 'opened their' we'll condition only on 'their', or we could just forget

1:00:23 - 1:00:28     Text: all conditioning and actually use a unigram model for our probabilities.
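
A rough Python sketch of that back-off idea (ignoring the discounting that a proper scheme such as Katz back-off would add, and assuming a counts table that holds counts for every n-gram order, keyed by tuples of words):

    def backoff_prob(context, word, counts, total_tokens):
        """Estimate P(word | context), backing off to shorter contexts when unseen."""
        while context:
            denom = counts.get(context, 0)
            if denom:                                  # context was seen: use the full ratio
                return counts.get(context + (word,), 0) / denom
            context = context[1:]                      # e.g. ("opened", "their") -> ("their",)
        return counts.get((word,), 0) / total_tokens   # last resort: the unigram model

Here counts[("students", "opened", "their")] and counts[("students", "opened", "their", "books")] are both assumed to have been filled in when the counts were collected.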

1:00:30 - 1:00:38     Text: Yeah. And so as you increase the order n of the n-gram language model,

1:00:38 - 1:00:44     Text: these sparsity problems become worse and worse. So in the early days people normally worked with

1:00:44 - 1:00:51     Text: trigram models; as it became easier to collect billions of words of text, people commonly moved

1:00:51 - 1:01:00     Text: to five-gram models. But every time you go up an order of conditioning, you effectively need to be

1:01:00 - 1:01:08     Text: collecting orders of magnitude more data, because of the size of the vocabularies of human languages.

1:01:09 - 1:01:16     Text: There's also a problem that these models are huge. You basically have to be storing

1:01:16 - 1:01:22     Text: counts of all of these word sequences so you can work out these probabilities. And I mean

1:01:22 - 1:01:31     Text: that's actually had a big effect in terms of what technology is available. So in the 2000s decade,

1:01:31 - 1:01:38     Text: up until whenever it was, 2014, Google Translate was already using

1:01:39 - 1:01:45     Text: probabilistic models that included language models of the n-gram sort. But the only

1:01:45 - 1:01:52     Text: way they could possibly be run was in the cloud, because you needed to have these huge tables of

1:01:52 - 1:01:58     Text: probabilities. But now we have neural nets, and you can have Google Translate just actually run

1:01:58 - 1:02:06     Text: on your phone, and that's possible because neural net models can be massively more compact than these

1:02:06 - 1:02:17     Text: old n-gram language models. But nevertheless, before we get onto the neural models, let's just

1:02:20 - 1:02:27     Text: sort of look at an example of how these work. So it's trivial to train an n-gram language model,

1:02:27 - 1:02:33     Text: because you really just count how often word sequences occur in a corpus and you're ready to go.

1:02:33 - 1:02:37     Text: So these models can be trained in seconds. That's really good; that's not like sitting around

1:02:37 - 1:02:45     Text: waiting for neural networks to train. So if I train on my laptop a small language model on, you know, about

1:02:47 - 1:02:54     Text: 1.7 million words as a trigram model, I can then ask it to generate text. If I give it a couple of

1:02:54 - 1:03:02     Text: words, 'today the', I can then get it to sort of suggest a word that might come next. And the way I do that

1:03:02 - 1:03:07     Text: is the language model knows the probability distribution of things that can come next.

1:03:08 - 1:03:14     Text: Now it's a kind of crude probability distribution, I mean, because effectively over

1:03:14 - 1:03:20     Text: this relatively small corpus there were things that occurred once ('Italian', 'Emirate'), there are

1:03:20 - 1:03:25     Text: things that occurred twice ('price'), there were things that occurred four times ('company' and 'bank').

1:03:25 - 1:03:34     Text: It's sort of fairly crude and rough, but nevertheless it gives probability estimates. I can then say, okay,

1:03:34 - 1:03:42     Text: based on this, let's take this probability distribution and then we'll just sample a next word. So the

1:03:42 - 1:03:48     Text: two most likely words to sample are 'company' or 'bank', but we're rolling the dice and we might get

1:03:48 - 1:03:55     Text: any of the words that can come next. So maybe I sample 'price'. Now I'll condition

1:03:55 - 1:04:03     Text: on 'the price' and look up the probability distribution of what comes next; the most likely thing is

1:04:03 - 1:04:11     Text: 'of'. And so again I'll sample, and maybe this time I'll pick 'of', and then I will now condition

1:04:11 - 1:04:18     Text: on 'price of', and I will look up the probability distribution of words following that, and I get

1:04:18 - 1:04:25     Text: this probability distribution and I'll sample randomly some word from it, and maybe this time I

1:04:25 - 1:04:33     Text: sample a rare but possible one like 'gold', and I can keep on going and I'll get out something like

1:04:33 - 1:04:39     Text: this: 'Today the price of gold per ton, while production of shoe lasts and shoe industry, the bank

1:04:39 - 1:04:46     Text: intervened just after it considered and rejected an IMF demand to rebuild depleted European stocks

1:04:46 - 1:04:55     Text: Sept 30 end primary 76 cts a share.'
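
That sampling loop is easy to write down; here is a minimal sketch (the variable names and the corpus are made up, and a real run would of course produce different text):

    import random
    from collections import Counter, defaultdict

    def train_trigram(tokens):
        """Count how often each word follows each two-word context."""
        following = defaultdict(Counter)
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            following[(a, b)][c] += 1
        return following

    def generate(following, w1, w2, max_len=30):
        """Repeatedly sample the next word from the count-based distribution."""
        out = [w1, w2]
        for _ in range(max_len):
            dist = following[(out[-2], out[-1])]
            if not dist:                      # unseen context: stop (or back off)
                break
            words, weights = zip(*dist.items())
            out.append(random.choices(words, weights=weights)[0])
        return " ".join(out)

    # e.g. generate(train_trigram(corpus_tokens), "today", "the")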

1:04:55 - 1:05:01     Text: So what just a simple trigram model can produce over not very much text is actually already kind of interesting: it's surprisingly grammatical.

1:05:01 - 1:05:07     Text: Right, there are whole pieces of it, 'while production of shoe lasts and shoe industry the bank

1:05:07 - 1:05:12     Text: intervened just after it considered and rejected an IMF demand', that are really actually pretty good

1:05:12 - 1:05:19     Text: grammatical text, so it's sort of amazing that these simple n-gram models actually can model

1:05:19 - 1:05:25     Text: a lot of human language. On the other hand, it's not a very good piece of text: it's completely

1:05:25 - 1:05:34     Text: incoherent and makes no sense. And so to actually be able to generate text that seems like it makes

1:05:34 - 1:05:41     Text: sense, we're going to need a considerably better language model, and that's precisely what neural

1:05:41 - 1:05:49     Text: language models have allowed us to build, as we'll see later. Okay, so how can we build a neural

1:05:49 - 1:05:56     Text: language model? And so first of all we're going to do a simple one, and then we'll see where we get,

1:05:56 - 1:06:04     Text: but moving on to recurrent neural nets might still take us to next time. So we're going to have an input

1:06:04 - 1:06:10     Text: sequence of words and we want a probability distribution over the next word. Well the simplest

1:06:10 - 1:06:18     Text: thing that we could try is to say well kind of the only tool we have so far is a window-based

1:06:18 - 1:06:27     Text: classifier. So what we can say, you know, what we've done previously, either for our named entity

1:06:27 - 1:06:32     Text: recognizer in lecture three or what I just showed you for the dependency parser, is we have some

1:06:32 - 1:06:39     Text: context window we put it through a neural net and we predict something as a classifier. So before

1:06:39 - 1:06:48     Text: we were predicting a location but maybe instead we could reuse exactly the same technology and say

1:06:48 - 1:06:54     Text: we're going to have a window-based classifier so we're discarding the further away words just like

1:06:54 - 1:07:03     Text: in an n-gram language model, but we'll feed this fixed window into a neural net. So we

1:07:03 - 1:07:11     Text: concatenate the word embeddings, we put it through a hidden layer, and then we have a softmax classifier

1:07:11 - 1:07:19     Text: over our vocabulary. And so now, rather than predicting something like location or left-arc

1:07:19 - 1:07:26     Text: in the dependency parser, we're going to have a softmax over the entire vocabulary, sort of like

1:07:26 - 1:07:33     Text: we did with the skip-gram negative sampling model in the first two lectures and so we're going to

1:07:33 - 1:07:41     Text: see this as the choice of predicting what word comes next: whether it's 'laptops', 'minds', 'books',

1:07:41 - 1:07:51     Text: etc.
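
A minimal PyTorch sketch of a fixed-window model of this kind (the layer sizes and names are illustrative, not those of the paper mentioned next or of the assignment):

    import torch
    import torch.nn as nn

    class FixedWindowLM(nn.Module):
        """Concatenate the window's word embeddings, one hidden layer, softmax over the vocabulary."""
        def __init__(self, vocab_size, emb_dim=100, window=4, hidden_dim=200):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.hidden = nn.Linear(window * emb_dim, hidden_dim)
            self.output = nn.Linear(hidden_dim, vocab_size)

        def forward(self, window_ids):                 # window_ids: (batch, window) word indices
            e = self.embed(window_ids).flatten(1)      # concatenated embeddings: (batch, window * emb_dim)
            h = torch.tanh(self.hidden(e))             # hidden layer
            return self.output(h)                      # logits over the vocabulary; softmax is applied in the loss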

1:07:51 - 1:08:01     Text: Okay, so this is a fairly simple fixed-window neural net classifier, but it's essentially a famous early model in the use of neural nets for NLP applications. So, first in a 2000 conference

1:08:01 - 1:08:09     Text: paper and then in a somewhat later journal paper, Yoshua Bengio and colleagues introduced precisely

1:08:09 - 1:08:15     Text: this model as the neural probabilistic language model and they were already able to show

1:08:15 - 1:08:24     Text: that this could give interesting, good results for language modeling. Now, it wasn't a great

1:08:24 - 1:08:31     Text: solution for neural language modeling, but it still had value. It didn't solve the problem

1:08:31 - 1:08:38     Text: of allowing us to have a bigger context to predict what words are going to come next; it's in that way

1:08:38 - 1:08:46     Text: limited exactly like an n-gram language model is. But it does have all the advantages of distributed

1:08:46 - 1:08:54     Text: representations. So rather than having these counts for word sequences that are very sparse and

1:08:54 - 1:09:02     Text: very crude, we can use distributed representations of words, which then make the prediction

1:09:02 - 1:09:10     Text: that semantically similar words should give similar probability distributions. So the idea of that

1:09:10 - 1:09:19     Text: is, if we use some other word here, like maybe 'the pupils opened their', well, maybe in our training data

1:09:19 - 1:09:25     Text: we'd seen sentences about students but we've never seen sentences about pupils. An n-gram

1:09:25 - 1:09:31     Text: language model then would sort of have no idea what probabilities to use, whereas a neural language

1:09:31 - 1:09:37     Text: model can say, well, 'pupils' is kind of similar to 'students', therefore I can predict similarly to

1:09:37 - 1:09:46     Text: what I would have predicted for 'students'. Okay, so there's now no sparsity problem. We don't need to

1:09:47 - 1:09:58     Text: store billions of n-gram counts; we simply need to store our word vectors and our W and U matrices.

1:09:58 - 1:10:05     Text: But we still have the remaining problems that our fixed window is too small. We can try and make the

1:10:05 - 1:10:13     Text: window larger; if we do that, the W matrix gets bigger. But that also points out another problem

1:10:13 - 1:10:22     Text: with this model not only can the window never be large enough but W is just a trained matrix

1:10:22 - 1:10:29     Text: and so therefore we're learning completely different weights for each position of context the

1:10:29 - 1:10:35     Text: word minus one position the word minus two the word minus three and the word minus four so that

1:10:35 - 1:10:44     Text: there's no sharing in the model as to how it treats words in different positions even though in

1:10:44 - 1:10:52     Text: some sense they will contribute semantic components that are at least somewhat position independent so

1:10:52 - 1:10:59     Text: Again, if you sort of think back to either a Naive Bayes model or what we saw with the

1:10:59 - 1:11:05     Text: word2vec model at the beginning: the word2vec model or a Naive Bayes model completely ignores word

1:11:05 - 1:11:11     Text: order, so it has one set of parameters regardless of what position things occur in. That doesn't work

1:11:11 - 1:11:17     Text: well for language modeling, because word order is really important in language modeling. If the last

1:11:17 - 1:11:22     Text: word is 'the', that's a really good predictor of there being an adjective or noun following, whereas

1:11:22 - 1:11:30     Text: if the word four back is 'the', it doesn't give you the same information. So you do want to somewhat

1:11:31 - 1:11:38     Text: make use of word order but this model is at the opposite extreme that each position is being

1:11:38 - 1:11:47     Text: modeled completely independently so what we'd like to have is a neural architecture that can process

1:11:48 - 1:11:56     Text: an arbitrary amount of context and have more sharing of the parameters, while still being sensitive

1:11:56 - 1:12:03     Text: to proximity. And so that's the idea of recurrent neural networks, and I'll say about five minutes' worth

1:12:03 - 1:12:10     Text: about these today, and then next time we'll return and do more about recurrent neural

1:12:10 - 1:12:20     Text: networks so for the recurrent neural network rather than having a single hidden layer inside our

1:12:20 - 1:12:28     Text: classifier here that we compute each time, for the recurrent neural network we have the hidden layer,

1:12:28 - 1:12:36     Text: which is often referred to as the hidden state, but we maintain it over time and we feed it back

1:12:36 - 1:12:42     Text: into itself. So that's what the word 'recurrent' means: that you're sort of feeding the hidden

1:12:42 - 1:12:53     Text: layer back into itself so what we do is based on the first word we compute a hidden representation

1:12:53 - 1:13:01     Text: kind of like before which can be used to predict the next word but then for when we want to

1:13:01 - 1:13:09     Text: predict what comes after the second word we not only feed in the second word we feed in the hidden

1:13:10 - 1:13:18     Text: layer from the previous word to have it help predict the hidden layer above the second word

1:13:18 - 1:13:24     Text: and so formally the way we're doing that is we're taking the hidden layer above the first word

1:13:24 - 1:13:33     Text: multiplying it by a matrix w and then that's going to be going in together with x2 to generate

1:13:33 - 1:13:40     Text: the next hidden state. And so we keep on doing that at each time step, so that we are kind of repeating

1:13:40 - 1:13:50     Text: a pattern of creating a next hidden layer based on the next input word and the previous hidden state

1:13:50 - 1:13:56     Text: by updating it by multiplying it by a matrix W. Okay, so in my slide here I've still only got

1:13:56 - 1:14:02     Text: four words of context, because that's nice for my slide, but you know, in principle there could be, you

1:14:02 - 1:14:12     Text: know, any number of words of context. Okay, so what we're doing is, we start off by having

1:14:12 - 1:14:21     Text: input vectors, which can be our word vectors that we've looked up for each word. So, sorry, yeah, so

1:14:21 - 1:14:27     Text: we can have the one-hot vectors for word identity, we look up our word embeddings, so then we've got

1:14:27 - 1:14:34     Text: word embeddings for each word and then we want to compute hidden states so we need to start from

1:14:34 - 1:14:42     Text: somewhere: h zero is the initial hidden state, and h zero is normally taken as a zero vector, so this

1:14:42 - 1:14:50     Text: is actually just initialized to zeros. And so for working out the first hidden state, we calculate it

1:14:50 - 1:14:59     Text: based on the first word embedding, by multiplying this embedding by a matrix W_e, and that gives us

1:14:59 - 1:15:09     Text: the first hidden state but then you know as we go on we want to apply the same formula over again

1:15:09 - 1:15:18     Text: so we have just two parameter matrices in the recurrent neural network one matrix for multiplying

1:15:18 - 1:15:25     Text: input embeddings and one matrix for updating the hidden state of the network and so for the second

1:15:25 - 1:15:36     Text: word, from its word embedding, we multiply it by the W_e matrix, we take the previous time step's hidden

1:15:36 - 1:15:45     Text: state and multiply it by the W_h matrix, and we use the two of those to generate the new hidden state.

1:15:45 - 1:15:52     Text: And precisely how we generate the new hidden state is shown in this equation on the left:

1:15:52 - 1:15:59     Text: we take the previous hidden state, multiply it by W_h, we take the input embedding, multiply it by

1:15:59 - 1:16:10     Text: W_e, we sum those two, we add on a learned bias weight, and then we put that through a nonlinearity.

1:16:10 - 1:16:17     Text: And although on the slide that nonlinearity is written as sigma, by far the most common nonlinearity

1:16:17 - 1:16:27     Text: to use here actually is a tanh nonlinearity. And so this is the core equation for a simple recurrent

1:16:27 - 1:16:34     Text: neural network and for each successive time step we're just going to keep on applying that to work

1:16:34 - 1:16:41     Text: out hidden states and then from those hidden states we can use them just like in our window

1:16:41 - 1:16:49     Text: classifier to predict what would be the next word so at any position we can take this hidden vector

1:16:50 - 1:16:56     Text: put it through a softmax layer, which is multiplying by a U matrix and adding on another bias,

1:16:56 - 1:17:00     Text: and then making a softmax distribution out of that, and that then gives us the probability

1:17:00 - 1:17:10     Text: distribution over the next word.
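
Written out, with e^(t) the word embedding at step t and sigma the nonlinearity (in practice usually tanh), the two equations just described look roughly like this (the bias names b_1 and b_2 are just labels for "the learned bias" and "another bias"):

    h^{(t)} = \sigma\big(W_h\, h^{(t-1)} + W_e\, e^{(t)} + b_1\big), \qquad h^{(0)} = \mathbf{0}
    \hat{y}^{(t)} = \mathrm{softmax}\big(U\, h^{(t)} + b_2\big)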

1:17:10 - 1:17:18     Text: And what we saw here, right, is the entire math of a simple recurrent neural network. Next time I'll come back and say more about them, but this is the entirety

1:17:18 - 1:17:24     Text: of what you need to know, in some sense, for the forward computation of a simple

1:17:24 - 1:17:32     Text: recurrent neural network.
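
As a rough PyTorch sketch of that forward computation, writing the recurrence as an explicit loop over time steps (the sizes and names here are illustrative):

    import torch
    import torch.nn as nn

    class SimpleRNNLM(nn.Module):
        """Simple RNN language model: one matrix for inputs (W_e), one for the hidden state (W_h)."""
        def __init__(self, vocab_size, emb_dim=100, hidden_dim=200):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.W_e = nn.Linear(emb_dim, hidden_dim)                 # input embedding -> hidden
            self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # previous hidden -> hidden
            self.U = nn.Linear(hidden_dim, vocab_size)                # hidden -> logits over the vocabulary

        def forward(self, word_ids):                      # word_ids: (batch, seq_len)
            batch, seq_len = word_ids.shape
            h = torch.zeros(batch, self.W_h.out_features, device=word_ids.device)  # h0 = zero vector
            logits = []
            for t in range(seq_len):                      # the outer for-loop over time steps
                e_t = self.embed(word_ids[:, t])
                h = torch.tanh(self.W_h(h) + self.W_e(e_t))
                logits.append(self.U(h))                  # scores for the word at position t+1
            return torch.stack(logits, dim=1)             # (batch, seq_len, vocab_size)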

1:17:34 - 1:17:40     Text: So the advantages we have now: it can process a text input of any length; in theory at least, it can use information from any number of steps back (we'll talk more about

1:17:40 - 1:17:46     Text: how well that actually works in practice); the model size is fixed, it doesn't matter how much

1:17:46 - 1:17:55     Text: past context there is, all we have are our W_h and W_e parameters; and at each time step

1:17:55 - 1:18:03     Text: we use exactly the same weights to update our hidden state, so there's a symmetry in how different

1:18:03 - 1:18:11     Text: inputs are processed in producing our predictions. RNNs in practice, though, or these simple RNNs

1:18:11 - 1:18:18     Text: in practice, aren't perfect. So a disadvantage is that they're actually kind of slow, because with

1:18:18 - 1:18:24     Text: this recurrent computation, in some sense we are sort of stuck with having to have, on the outside,

1:18:24 - 1:18:31     Text: a for loop. So we can do vector-matrix multiplies on the inside here, but really we have to do,

1:18:31 - 1:18:40     Text: for time step t equals one to N, calculate these successive hidden states. And so that's not a

1:18:40 - 1:18:49     Text: perfect neural net architecture and we'll discuss alternatives to that later and although in theory

1:18:49 - 1:18:55     Text: this model can access information any number of steps back in practice we find that it's

1:18:55 - 1:19:01     Text: pretty imperfect at doing that, and that will then lead to more advanced forms of recurrent neural

1:19:01 - 1:19:09     Text: network that I'll talk about next time, which are able to more effectively access past context.

1:19:09 - 1:19:11     Text: okay I think I'll stop there for the day