Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 6 - Simple and LSTM RNNs

0:00:00 - 0:00:14     Text: Okay, hi everyone. Welcome back to CS224N. So today is a pretty key lecture where we get

0:00:14 - 0:00:20     Text: through a number of important topics for neural networks, especially as applied to natural

0:00:20 - 0:00:24     Text: language processing. So right at the end of last time I started on recurrent neural

0:00:24 - 0:00:29     Text: networks. So we'll talk in more detail about recurrent neural networks in the first

0:00:29 - 0:00:36     Text: part of the class. And we'll emphasize language models, but then also get a bit beyond

0:00:36 - 0:00:42     Text: that. And then look at more advanced kinds of recurrent neural networks towards the

0:00:42 - 0:00:49     Text: end part of the class. I just wanted to sort of say a word before getting underway about

0:00:49 - 0:00:55     Text: the final project. So hopefully by now you've started looking at assignment three, which

0:00:55 - 0:00:59     Text: is the middle of the five assignments for the first half of the course. And then in the

0:00:59 - 0:01:05     Text: second half of the course most of your effort goes into a final project. So next week the

0:01:05 - 0:01:10     Text: first day lecture is going to be about final projects and choosing the final project and

0:01:10 - 0:01:15     Text: tips for final projects, etc. So it's fine to delay thinking about final projects until

0:01:15 - 0:01:20     Text: next week if you want, but you shouldn't delay it too long because we do want you to get

0:01:20 - 0:01:25     Text: underway with what topic you're going to do for your final project. If you are thinking

0:01:25 - 0:01:31     Text: about final projects you can find some info on the website, but note that the info that's

0:01:31 - 0:01:36     Text: there at the moment is still last year's information and it will be being updated over

0:01:36 - 0:01:42     Text: the coming week. We'll also talk about project mentors. If you've got ideas of people who

0:01:42 - 0:01:46     Text: you can line up on your own as a mentor, now would be a good time to ask them about

0:01:46 - 0:01:51     Text: doing that, and we'll sort of talk about what the alternatives are.

0:01:51 - 0:01:58     Text: Okay so last lecture I introduced the idea of language models, so probabilistic models

0:01:58 - 0:02:04     Text: that predict the probability of next words after a word sequence and then we looked at

0:02:04 - 0:02:09     Text: n-gram language models and started on recurrent neural network models. So today we

0:02:09 - 0:02:15     Text: are going to talk more about the simple RNNs we saw before, talking about training RNNs and

0:02:15 - 0:02:22     Text: uses of RNNs, but then we'll also look into the problems that occur with RNNs and how we

0:02:22 - 0:02:28     Text: might fix them. These will motivate a more sophisticated RNN architecture called LSTMs

0:02:28 - 0:02:34     Text: and we'll talk about other more complex RNN options: bidirectional RNNs and multi-layer

0:02:34 - 0:02:43     Text: RNNs. Then next Tuesday we're essentially going to further exploit and build on the RNN

0:02:43 - 0:02:47     Text: based architectures that we've been looking at to discuss how to build a neural machine

0:02:47 - 0:02:53     Text: translation system with a sequence-to-sequence model with attention, and effectively

0:02:53 - 0:02:59     Text: that model is what we'll use in assignment 4, but it also means that you'll be using all

0:02:59 - 0:03:02     Text: of the stuff that we're talking about today.

0:03:02 - 0:03:08     Text: Okay so if you remember from last time this was the idea of a simple recurrent neural

0:03:08 - 0:03:15     Text: network language model. So we had a sequence of words as our context for which we've looked

0:03:15 - 0:03:23     Text: up word embeddings and then the recurrent neural network model ran this recurrent layer

0:03:23 - 0:03:29     Text: where at each point we have a previous hidden state which can just be zero at the beginning

0:03:29 - 0:03:36     Text: of a sequence, and you feed into the next hidden state the previous hidden

0:03:36 - 0:03:44     Text: state and a transformed encoding of a word, using this recurrent neural network

0:03:44 - 0:03:49     Text: equation that I have on the left that's very central and based on that you compute a

0:03:49 - 0:03:57     Text: new hidden representation for the next time step and you can repeat that along for successive

0:03:57 - 0:04:05     Text: time steps. Now we also usually want our recurrent neural networks to produce outputs.

0:04:05 - 0:04:11     Text: So I only show it at the end here but at each time step we're then also going to generate

0:04:11 - 0:04:18     Text: an output and so to do that we're feeding the hidden layer into a softmax layer so we're

0:04:18 - 0:04:23     Text: doing another matrix multiply add on a bias, put it through the softmax equation and that

0:04:23 - 0:04:29     Text: will then give us the probability distribution over words and we can use that to predict

0:04:29 - 0:04:36     Text: how likely it is that different words are going to occur after "the students opened their".
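
As a rough illustration of the equations just described (not code from the lecture), here is a minimal PyTorch sketch of a simple RNN language model; the class name and the dimensions vocab_size, d_embed and d_hidden are made up for the example:

```python
import torch
import torch.nn as nn

class SimpleRNNLM(nn.Module):
    def __init__(self, vocab_size=10000, d_embed=100, d_hidden=200):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_embed)   # word embeddings e_t
        self.W_h = nn.Linear(d_hidden, d_hidden)         # applied to the previous hidden state
        self.W_e = nn.Linear(d_embed, d_hidden)          # applied to the current word embedding
        self.U = nn.Linear(d_hidden, vocab_size)         # hidden state -> scores over the vocabulary

    def forward(self, word_ids):                         # word_ids: (seq_len,) token indices
        h = torch.zeros(self.W_h.in_features)            # h_0 = 0 at the start of the sequence
        logits = []
        for e_t in self.embed(word_ids):                 # step along the sequence
            h = torch.sigmoid(self.W_h(h) + self.W_e(e_t))   # h_t = sigma(W_h h_{t-1} + W_e e_t + b)
            logits.append(self.U(h))                     # softmax over these gives P(next word)
        return torch.stack(logits)
```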

0:04:36 - 0:04:42     Text: Okay, so I did introduce that model, but I hadn't really gone through the specifics

0:04:42 - 0:04:50     Text: of how we train this model, how we use it and evaluate it so let me go through this now.

0:04:50 - 0:04:56     Text: So here's how we train an RNN language model, we get a big corpus of text, just a lot of

0:04:56 - 0:05:03     Text: text and we regard that as just a long sequence of words x1 to xT and what we're going

0:05:03 - 0:05:13     Text: to do is feed it into the RNN LM. So we're going to take prefixes of that sequence

0:05:13 - 0:05:19     Text: and based on each prefix we're going to want to predict the probability distribution

0:05:19 - 0:05:27     Text: for the word that comes next and then we're going to train our model by assessing how

0:05:27 - 0:05:35     Text: good a job we do about that and so the loss function we use is the loss function normally

0:05:35 - 0:05:41     Text: referred to as cross entropy loss in the literature which is this negative log likelihood loss.
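
Written out (in roughly the notation of the lecture slides, where the hat-y at step t is the predicted distribution and x_{t+1} is the actual next word), the per-step loss and its average over a corpus of length T are:

```latex
J^{(t)}(\theta) = \mathrm{CE}\big(y^{(t)}, \hat{y}^{(t)}\big) = -\log \hat{y}^{(t)}_{x_{t+1}},
\qquad
J(\theta) = \frac{1}{T}\sum_{t=1}^{T} J^{(t)}(\theta)
```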

0:05:41 - 0:05:48     Text: So we are going to predict some word to come next. Well we have a probability distribution

0:05:48 - 0:05:53     Text: over predictions of what word comes next and actually there was an actual next word in

0:05:53 - 0:05:58     Text: the text and so we say well what probability did you give to that word and maybe we gave

0:05:58 - 0:06:03     Text: it a probability estimate of 0.01. Well, it would have been better if we'd given a probability

0:06:03 - 0:06:09     Text: estimate of almost one, because that would mean we were almost certain about what did come next

0:06:09 - 0:06:15     Text: in our model and so we'll take a loss to the extent that we're giving the actual next

0:06:15 - 0:06:23     Text: word a predicted probability of less than one. So then get an idea of how well we're doing

0:06:23 - 0:06:31     Text: over the entire corpus we work out that loss at each position and then we work out the

0:06:31 - 0:06:37     Text: average loss of the entire training set. So let's just go through that again more graphically

0:06:37 - 0:06:45     Text: in the next couple of slides. So down the bottom here's our corpus of text. We're running

0:06:45 - 0:06:52     Text: it through our simple recurrent neural network and at each position we're predicting a probability

0:06:52 - 0:07:01     Text: distribution over words. We then say well actually at each position we know what word is actually

0:07:01 - 0:07:07     Text: next. So when we're at time step one the actual next word is students because we can see

0:07:07 - 0:07:12     Text: it just to the right of us here and we say what probability estimate did you give to students

0:07:12 - 0:07:19     Text: and to the extent that it's not high it's not one we take a loss and then we go on to the

0:07:19 - 0:07:25     Text: time step two and we say well at time step two you predict the probability distribution

0:07:25 - 0:07:30     Text: over words. The actual next word is opened so to the extent that you haven't given high

0:07:30 - 0:07:37     Text: probability to "opened" you take a loss, and then that repeats: at time step three we're hoping the

0:07:37 - 0:07:44     Text: model predicts "their", at time step four we're hoping the model predicts "exams", and then to work out

0:07:44 - 0:07:53     Text: our overall loss we're then averaging our per time step loss. So in a way this is a pretty

0:07:53 - 0:08:01     Text: obvious thing to do but note that there is a little subtlety here and in particular this algorithm

0:08:01 - 0:08:07     Text: is referred to in the literature as teacher forcing and so what does that mean? Well you know you

0:08:07 - 0:08:15     Text: can imagine what you can do with a recurrent neural network is say okay just start generating

0:08:15 - 0:08:20     Text: maybe I'll give you a hint as to where to start, I'll say the sentence starts "the students",

0:08:20 - 0:08:27     Text: and then let it run and see what it generates coming next it might start saying the students

0:08:27 - 0:08:34     Text: have been locked out in the classroom or whatever it is right and that we could say as well that's

0:08:34 - 0:08:39     Text: not very close to what the actual text says and somehow we want to learn from that and if you go

0:08:39 - 0:08:44     Text: in that direction there's a space of things you can do that leads into more complex algorithms

0:08:44 - 0:08:52     Text: such as reinforcement learning but from the perspective of training these neural models that's

0:08:52 - 0:09:00     Text: unduly complex and unnecessary so we have this very simple way of doing things which is what we do

0:09:00 - 0:09:07     Text: is just predict one time step forward so we say we know that the prefix is the students predict a

0:09:07 - 0:09:13     Text: probability distribution over the next word it's good to the extent that you give probability

0:09:13 - 0:09:19     Text: mass to "opened". Okay, now the prefix is "the students opened"; predict a probability distribution over

0:09:19 - 0:09:27     Text: the next word; it's good to the extent that you give probability mass to "their", and so effectively

0:09:27 - 0:09:32     Text: at each step we're resetting to what was actually in the corpus so you know it's possible after the

0:09:32 - 0:09:40     Text: students opened the model thinks that by far the most probable thing to come next is some other word, say "a",

0:09:40 - 0:09:45     Text: I mean we don't actually use what the model suggested we penalize the model for not having

0:09:45 - 0:09:50     Text: suggested "their", but then we just go with what's actually in the corpus and ask it to predict again.
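
A sketch of what one teacher-forced training step might look like in code, reusing the hypothetical SimpleRNNLM from earlier (again, this is illustrative, not the lecture's code):

```python
import torch.nn.functional as F

# One teacher-forced training step on a single token sequence. At every position the model
# predicts the next word, but the actual corpus word is fed in at the next step regardless
# of what the model predicted.
def training_step(model, optimizer, token_ids):      # token_ids: (seq_len,) indices from the corpus
    logits = model(token_ids[:-1])                   # predictions at each position
    targets = token_ids[1:]                          # the words that actually came next
    loss = F.cross_entropy(logits, targets)          # average negative log-likelihood
    optimizer.zero_grad()
    loss.backward()                                  # backpropagation (through time)
    optimizer.step()
    return loss.item()
```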

0:09:55 - 0:10:02     Text: this is just a little side thing but it's an important part to know if you're actually training

0:10:02 - 0:10:09     Text: your own neural language model I sort of presented as one huge corpus that we chug through

0:10:09 - 0:10:17     Text: but in practice we don't chug through a whole corpus one step at a time what we do is we cut the whole

0:10:17 - 0:10:24     Text: corpus into shorter pieces which might commonly be sentences or documents or sometimes they're

0:10:24 - 0:10:29     Text: literally just pieces that are chopped. Right, so you recall that stochastic gradient descent

0:10:29 - 0:10:35     Text: allows us to compute a loss and gradients from a small chunk of data and update, so what we do is we

0:10:35 - 0:10:42     Text: take these small pieces compute gradients from those and update weights and repeat and in particular

0:10:43 - 0:10:50     Text: we get a lot more speed and efficiency in training if we aren't actually doing an update for just

0:10:50 - 0:10:56     Text: one sentence at a time but actually a batch of sentences so typically what we'll actually do

0:10:56 - 0:11:03     Text: is we'll feed to the model 32 sentences say of a similar length at the same time compute gradients for

0:11:03 - 0:11:13     Text: them, update weights, and then get another batch of sentences to train on. So how do we train? I haven't

0:11:13 - 0:11:19     Text: sort of gone through the details of this I mean in one sense the answer is just like we talked

0:11:19 - 0:11:25     Text: about in lecture three: we use back propagation to get gradients and update parameters, but let's

0:11:25 - 0:11:33     Text: take at least a minute to go through the differences and subtleties of the recurrent neural network case

0:11:33 - 0:11:40     Text: and the central thing that's a bit you know as before we're going to take our loss and we're going

0:11:40 - 0:11:47     Text: to back propagate it to all of the parameters of the network everything from word embeddings to biases

0:11:47 - 0:11:56     Text: etc but the central bit that's a little bit different and is more complicated is that we have this

0:11:56 - 0:12:05     Text: WH matrix that runs along the sequence that we keep on applying to update our hidden state so what's

0:12:05 - 0:12:15     Text: the derivative of Jt of theta with respect to the repeated weight matrix WH and well the answer

0:12:15 - 0:12:25     Text: that is that what we do is we look at it in each position and work out what the

0:12:25 - 0:12:34     Text: partials are of Jt with respect to WH in position one, or position two, position three, position four,

0:12:34 - 0:12:40     Text: etc., right along the sequence, and we just sum up all of those partials and that gives us

0:12:40 - 0:12:51     Text: a partial for Jt with respect to WH overall. So the answer for recurrent neural networks is: the

0:12:51 - 0:12:57     Text: gradient with respect to a repeated weight in a recurrent network is the sum of the gradient

0:12:57 - 0:13:06     Text: with respect to each time it appears and let me just then go through a little why that is the case

0:13:06 - 0:13:14     Text: but before I do that let me just note one gotcha I mean it's just not the case that this means

0:13:14 - 0:13:25     Text: it equals t times the partial of Jt with respect to WH because we're using WH here here here here here

0:13:25 - 0:13:31     Text: through the sequence and for each of the places we use it there's a different upstream gradient

0:13:31 - 0:13:37     Text: that's being fed into it so each of the values in this sum will be completely different from each other

0:13:39 - 0:13:48     Text: well why we get this answer is essentially a consequence of what we talked about in the third

0:13:48 - 0:13:56     Text: lecture so to take the simplest case of it right that if you have a multivariable function f of xy

0:13:56 - 0:14:05     Text: and you have two single variable functions x of t and y of t which are fed one input t well then

0:14:05 - 0:14:15     Text: the simple version of working out the derivative of this function is you take the derivative down one

0:14:15 - 0:14:23     Text: path and you take the derivative down the other path and so in the slides in lecture 3 that was

0:14:23 - 0:14:31     Text: what was summarized on a couple of slides by the slogan gradient sum at outward branches so t

0:14:31 - 0:14:37     Text: has outward branches and so you take gradient here on the left gradient on the right and you

0:14:37 - 0:14:43     Text: sum them together and so really what's happening with the recurrent neural network is just

0:14:44 - 0:14:52     Text: a many-branch generalization of this: we have one WH matrix and we're using it to keep on updating

0:14:52 - 0:15:01     Text: the hidden state at time one time two time three right through time t and so what we're going to get

0:15:01 - 0:15:11     Text: is that this has a lot of outward branches and we're going to sum the gradient path at each one

0:15:11 - 0:15:19     Text: of them but what is this gradient path here it kind of goes down here and then goes down there

0:15:19 - 0:15:27     Text: but you know actually the bottom part is that we're just using WH at each position so we have the

0:15:27 - 0:15:35     Text: partial of WH used at position i with respect to the partial of WH which is just our weight matrix

0:15:35 - 0:15:42     Text: for our recurrent neural network so that's just one because you know we're just using the same matrix

0:15:42 - 0:15:49     Text: everywhere and so we are just then summing the partials in each position that we use it
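
A tiny numerical check of this point (my own illustration, not from the lecture): when the same weight matrix is reused at every time step, an autograd framework accumulates exactly this sum of per-position partials into its gradient.

```python
import torch

torch.manual_seed(0)
d = 4
W_h = torch.randn(d, d, requires_grad=True)      # one weight matrix, reused at every step
xs = [torch.randn(d) for _ in range(5)]

h = torch.zeros(d)
for x in xs:
    h = torch.tanh(W_h @ h + x)                  # same W_h applied at each time step
loss = h.sum()
loss.backward()
print(W_h.grad)                                  # already the sum of contributions over all time steps
```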

0:15:52 - 0:16:00     Text: Okay, and practically what does that mean in terms of how you compute this? Well, if you're doing

0:16:00 - 0:16:07     Text: it by hand um what happens is you start at the end just like the general lecture three story

0:16:07 - 0:16:16     Text: you work out um derivatives um with respect to the hidden layer and then with respect to WH

0:16:16 - 0:16:24     Text: at the last time step and so that gives you one update for WH but then you continue passing the

0:16:24 - 0:16:31     Text: gradient back to the t minus one time step and after a couple more steps of the chain rule

0:16:31 - 0:16:38     Text: you get another update for WH and you simply sum that onto your previous update for WH and then you

0:16:38 - 0:16:46     Text: go to ht minus two you get another update for WH and you sum that onto your update for WH and you go

0:16:46 - 0:16:53     Text: back all the way um and you sum up the gradients as you go um and that gives you a total update

0:16:53 - 0:17:01     Text: um for WH um and so there's sort of two tricks here and i'll just mention um the two tricks

0:17:01 - 0:17:08     Text: you have to kind of separately sum the updates for WH and then once you finished um apply them all

0:17:08 - 0:17:14     Text: at once you don't want to actually be changing the WH matrix as you go because that's then invalid

0:17:14 - 0:17:21     Text: because the forward calculations were done with the constant WH that you had from the previous

0:17:21 - 0:17:29     Text: um state all through the network um the second trick is well if you're doing this for sentences

0:17:29 - 0:17:35     Text: you can normally just go back to the beginning of the sentence um but if you've got very long

0:17:35 - 0:17:41     Text: sequences this can really slow you down if you're having to sort of run this algorithm back for a

0:17:41 - 0:17:48     Text: huge amount of time so something people commonly do is what's called truncated back propagation

0:17:48 - 0:17:54     Text: through time where you choose some constants say 20 and you say well i'm just going to run

0:17:54 - 0:18:03     Text: this back propagation for 20 time steps some those 20 gradients and then i'm just done that's what i'll

0:18:03 - 0:18:14     Text: update the WH matrix with, and that works just fine.
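
A rough sketch of truncated backpropagation through time using PyTorch's built-in nn.RNN (the sizes and the random stand-in data are assumptions for illustration): the corpus is processed in chunks of k steps, gradients only flow within a chunk, and the hidden state is detached at chunk boundaries so the backward pass stops there.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d, k = 10000, 128, 20
embed = nn.Embedding(vocab_size, d)
rnn = nn.RNN(d, d, batch_first=True)
out = nn.Linear(d, vocab_size)
params = list(embed.parameters()) + list(rnn.parameters()) + list(out.parameters())
optimizer = torch.optim.Adam(params)

tokens = torch.randint(vocab_size, (1, 1000))       # stand-in for one long token sequence
h = None
for start in range(0, tokens.size(1) - 1, k):
    tgt = tokens[:, start + 1:start + k + 1]        # next-word targets for this chunk
    inp = tokens[:, start:start + tgt.size(1)]      # inputs, kept the same length as the targets
    states, h = rnn(embed(inp), h)                  # continue from the previous hidden state
    loss = F.cross_entropy(out(states).reshape(-1, vocab_size), tgt.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                 # gradients flow back at most k steps
    optimizer.step()
    h = h.detach()                                  # truncation: cut the graph at the chunk boundary
```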

0:18:14 - 0:18:24     Text: Okay, so now, given a corpus, we can train a simple RNN, and so that's good progress. But this is a model that can also generate text

0:18:24 - 0:18:30     Text: in general. So how do we generate text? Well, just like with an n-gram language model, we're going to

0:18:30 - 0:18:37     Text: generate text by repeated sampling. So we're going to start off with an initial state,

0:18:37 - 0:18:49     Text: and, yeah, this slide is imperfect, but the initial state for the hidden state is normally

0:18:49 - 0:18:56     Text: just taken as a zero vector and well then we need to have something for a first input and on this

0:18:56 - 0:19:01     Text: slide the first input is shown as the first word my and if you want to feed a starting point you

0:19:01 - 0:19:07     Text: could feed my but a lot of the time you'd like to generate a sentence from nothing and if you

0:19:07 - 0:19:13     Text: want to do that what's conventional is to additionally have a beginning of sequence token which is a

0:19:13 - 0:19:18     Text: special token so you'll feed in the beginning of sequence token in at the beginning as the first

0:19:18 - 0:19:26     Text: token it has an embedding and then you use the um RNN update and then you generate using the soft

0:19:26 - 0:19:34     Text: max and next word and um well you generate a probability probability distribution over next words

0:19:34 - 0:19:41     Text: and then at that point you sample from that and it chooses some word like favorite and so then

0:19:41 - 0:19:47     Text: the trick is for doing generation that you take this word that you sampled and you copy it back

0:19:47 - 0:19:54     Text: down to the input and then you feed it in as an input at the next step of the RNN, sample from

0:19:54 - 0:20:00     Text: the soft max get another word and just keep repeating this over and over again and you start

0:20:00 - 0:20:09     Text: generating the text and how you end is as well as having a beginning of sequence um special symbol

0:20:09 - 0:20:14     Text: you usually have an end of sequence special symbol and at some point um the recurrent neural

0:20:14 - 0:20:22     Text: network will generate the end-of-sequence symbol and then you say, okay, I'm done, I'm finished generating text.
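
A sketch of this generate-by-sampling loop, again written against the hypothetical SimpleRNNLM from earlier; BOS_ID and EOS_ID stand for made-up ids of the beginning- and end-of-sequence tokens:

```python
import torch

def generate(model, BOS_ID, EOS_ID, max_len=50):
    h = torch.zeros(model.W_h.in_features)             # initial hidden state: a zero vector
    token = BOS_ID                                      # feed the beginning-of-sequence token first
    output = []
    for _ in range(max_len):
        e = model.embed(torch.tensor(token))
        h = torch.sigmoid(model.W_h(h) + model.W_e(e))  # one RNN update step
        probs = torch.softmax(model.U(h), dim=-1)       # distribution over next words
        token = torch.multinomial(probs, 1).item()      # sample a word from it
        if token == EOS_ID:                             # stop once end-of-sequence is generated
            break
        output.append(token)                            # the sampled word is fed back in next step
    return output
```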

0:20:22 - 0:20:31     Text: So before going on to more of the difficult content of the lecture,

0:20:31 - 0:20:37     Text: we can just have a little bit of fun with this and try um training up and generating text with

0:20:37 - 0:20:43     Text: a recurrent neural network model so you can generate you can train an RNN on any kind of text

0:20:43 - 0:20:50     Text: and so that means one of the fun things that you can do is generate text um in different styles

0:20:50 - 0:20:58     Text: based on what you train it on. So here, Harry Potter: there's a fair amount

0:20:58 - 0:21:05     Text: of a corpus of text, so you can train an RNN LM on the Harry Potter books and then say, go off and

0:21:05 - 0:21:11     Text: generate some text and it'll generate text like this sorry how Harry shouted panicking I'll leave

0:21:11 - 0:21:17     Text: those brooms in London are they no idea said nearly headless Nick casting low close by Cedric

0:21:17 - 0:21:22     Text: carrying the last bit of treacle charms from Harry's shoulder and to answer him the common room

0:21:22 - 0:21:29     Text: pushed upon it four arms held a shining knob from when the spider hadn't felt it seemed he reached

0:21:29 - 0:21:37     Text: the teams too. Well, so on the one hand that's still kind of a bit incoherent as a story,

0:21:37 - 0:21:42     Text: on the other hand it sort of sounds like Harry Potter and certainly the kind of you know vocabulary

0:21:42 - 0:21:49     Text: and constructions it uses and like I think you'd agree that you know even though it gets sort of

0:21:49 - 0:21:56     Text: incoherent it's sort of more coherent than what we got from an n-gram language model when I

0:21:56 - 0:22:04     Text: showed a generation in the last um lecture um you can choose a very different style of text um so

0:22:04 - 0:22:11     Text: you could instead train the model on a bunch of cookbooks um and if you do that you can then say

0:22:11 - 0:22:17     Text: generate um based on what you've learned about cookbooks um and it'll just generate a recipe so

0:22:17 - 0:22:27     Text: here's a recipe um chocolate ranch barbecue um categories yield six servings two tablespoons of

0:22:27 - 0:22:34     Text: parmesan cheese chopped a one cup of coconut milk three eggs beaten place each pasta over layers of

0:22:34 - 0:22:42     Text: lumps shaped mixture into the moderate oven and simmer until firm serve hot and bodied fresh mustard

0:22:42 - 0:22:49     Text: orange and cheese combine the cheese and salt together the dough in a large skillet add the ingredients

0:22:49 - 0:22:56     Text: and stir in the chocolate and pepper so you know um this recipe makes um no sense and it's

0:22:56 - 0:23:02     Text: sufficiently um incoherent there's actually even no danger that you'll try cooking this at home um

0:23:02 - 0:23:07     Text: but you know something that's interesting is although you know this really just

0:23:08 - 0:23:16     Text: isn't a recipe and the things that are done in the instructions have no relation um to the

0:23:16 - 0:23:22     Text: ingredients, the thing that's interesting, that it has learned as this recurrent neural network

0:23:22 - 0:23:28     Text: model is that it's really mastered the overall structure of a recipe it knows that a recipe has a

0:23:28 - 0:23:35     Text: title, it often tells you about how many people it serves, it lists the ingredients and then it has

0:23:35 - 0:23:41     Text: instructions to make it, so that's sort of fairly impressive in some sense, the high-level text

0:23:41 - 0:23:50     Text: structuring. So the one other thing I wanted to mention was, when I say you can train an

0:23:50 - 0:23:54     Text: RNN language model on any kind of text, the other difference from where we were with

0:23:54 - 0:24:00     Text: n-gram language models was, for n-gram language models that just meant counting n-grams and meant

0:24:00 - 0:24:07     Text: it took two minutes or so, even on a large corpus, with any modern computer; training your RNN

0:24:07 - 0:24:12     Text: LM can actually be a time-intensive activity and you can spend hours doing that, as you

0:24:12 - 0:24:21     Text: might find next week when you're training um machine translation models okay um how do we decide

0:24:21 - 0:24:31     Text: if our models are good or not um so the standard evaluation metric for language models is what's

0:24:31 - 0:24:41     Text: called perplexity. And what perplexity is, is kind of like when you were training your

0:24:41 - 0:24:49     Text: model: you use teacher forcing over a piece of text, but it's a different piece of text, test text, which

0:24:49 - 0:24:57     Text: isn't text that was in the training data and you say well given a sequence of t words um what

0:24:57 - 0:25:05     Text: probability do you give to the actual t plus oneth word and you repeat that at each position

0:25:05 - 0:25:11     Text: and then you take the inverse of that probability and raise it to the power one over T, for the length of your

0:25:11 - 0:25:21     Text: test text sample and that number is the perplexity so it's a geometric mean of the inverse probabilities
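
Written out in the standard form that matches this verbal description, perplexity over a test text of length T is the geometric mean of the inverse predicted probabilities, which is the same as the exponentiated average cross-entropy loss:

```latex
\mathrm{PP}
  = \prod_{t=1}^{T} \left( \frac{1}{P\big(x_{t+1} \mid x_1, \dots, x_t\big)} \right)^{1/T}
  = \exp\!\left( \frac{1}{T} \sum_{t=1}^{T} -\log P\big(x_{t+1} \mid x_1, \dots, x_t\big) \right)
  = \exp\big(J(\theta)\big)
```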

0:25:22 - 0:25:30     Text: now um after that explanation perhaps an easier way to think of it is that the perplexity um is

0:25:30 - 0:25:40     Text: simply the cross-entropy loss that I introduced before, exponentiated. But, you know, it's

0:25:40 - 0:25:47     Text: now the other way around so low perplexity um is better so there's actually an interesting

0:25:47 - 0:25:54     Text: story about these perplexities um so a famous figure in the development of um probabilistic

0:25:54 - 0:26:00     Text: and machine learning approaches to natural language processing is Fred Jelinek, who died a few years

0:26:00 - 0:26:10     Text: ago, and he was trying to interest people in the idea of using probability models and

0:26:10 - 0:26:17     Text: machine learning for natural language processing at a time (this is the 1970s and early 1980s)

0:26:18 - 0:26:26     Text: when nearly everyone in the field of AI um was still in the thrall of logic-based models and

0:26:26 - 0:26:32     Text: blackboard architectures and things like that for artificial intelligence systems and so Fred

0:26:32 - 0:26:41     Text: Jelinek was actually an information theorist by background, who then got interested in

0:26:41 - 0:26:49     Text: working with speech and then language data. So at that time this stuff, the sort of

0:26:49 - 0:26:55     Text: exponentials and cross-entropy losses, was completely bread and butter to Fred Jelinek,

0:26:55 - 0:27:02     Text: but he'd found that no one in AI could understand the bottom half of the slide and so he wanted

0:27:02 - 0:27:08     Text: to come up with something simple that AI people at that time could understand and perplexity

0:27:08 - 0:27:18     Text: has a kind of a simple interpretation you can tell people so if you get a perplexity of 53

0:27:18 - 0:27:25     Text: that means how uncertain you are um of the next word is equivalent to the uncertainty of that

0:27:25 - 0:27:34     Text: you're tossing a 53 sided dice and it coming up as a one right so um that was kind of an easy

0:27:34 - 0:27:43     Text: simple metric and so he introduced um that idea um but you know i guess things stick and to this day

0:27:44 - 0:27:49     Text: everyone evaluates their language models by providing perplexity numbers and so here are

0:27:49 - 0:27:58     Text: some perplexity numbers um so traditional n-gram language models commonly had perplexities over 100

0:27:58 - 0:28:04     Text: but if you made them really big and were really careful, you could get them down to a number

0:28:04 - 0:28:12     Text: like 67 as people started to build more advanced recurrent neural networks especially as they move

0:28:12 - 0:28:18     Text: beyond the kind of simple RNNs, which is all I've shown you so far, one of which is in the second

0:28:18 - 0:28:27     Text: line of the slide, into LSTMs, which I talk about later, that people started producing

0:28:28 - 0:28:35     Text: much better perplexities and here we're getting perplexities down to 30 and this is results actually

0:28:35 - 0:28:43     Text: from a few years ago so nowadays people get perplexities even lower than 30 you have to be realistic

0:28:43 - 0:28:49     Text: and what you can expect right because if you're just generating a text some words are almost

0:28:49 - 0:28:59     Text: determined. So you know, if it's something like, you know, someone gave the man a napkin, he said thank,

0:28:59 - 0:29:05     Text: you know, basically 100 percent you should be able to say the word that comes next is "you", and

0:29:05 - 0:29:11     Text: so that you can predict really well but um you know if it's a lot of other sentences like um he

0:29:11 - 0:29:19     Text: looked out the window and saw, uh, something, right, no probability model, no model in the world,

0:29:19 - 0:29:23     Text: can give a very good estimate of what's actually going to be coming next at that point and so that

0:29:23 - 0:29:30     Text: gives us the sort of residual um uncertainty that leads to perplexities that on average might be around

0:29:30 - 0:29:40     Text: 20 or something okay um so we've talked a lot about language models now why should we care about

0:29:40 - 0:29:47     Text: language modeling you know well there's sort of an intellectual scientific answer that says

0:29:47 - 0:29:53     Text: this is a benchmark task right if we what we want to do is build machine learning models of

0:29:53 - 0:29:59     Text: language and our ability to predict what word will come next in the context that shows how well

0:29:59 - 0:30:06     Text: we understand both the structure of language and the structure of the human world that um language

0:30:06 - 0:30:13     Text: talks about um but there's a much more practical answer than that um which is you know

0:30:13 - 0:30:20     Text: language models are really the secret tool of natural language processing so if you're talking

0:30:20 - 0:30:30     Text: to any nlp person and you've got almost any task it's quite likely they'll say oh I bet we could

0:30:30 - 0:30:38     Text: use a language model for that and so language models are sort of used as a not the whole solution

0:30:38 - 0:30:46     Text: but a part of almost any task any task involves generating or estimating the probability of text

0:30:46 - 0:30:53     Text: so you can use it for predictive typing speech recognition grammar correction identifying authors

0:30:53 - 0:30:58     Text: machine translation summarization dialogue just that anything you do with natural language

0:30:58 - 0:31:05     Text: involves language models and we'll see examples of that in following classes including next Tuesday

0:31:05 - 0:31:13     Text: where we're using language models for machine translation okay so a language model is just a system

0:31:13 - 0:31:21     Text: that predicts the next word a recurrent neural network is a family of neural networks which can

0:31:21 - 0:31:29     Text: take sequential input of any length they reuse the same weights to generate a hidden state

0:31:29 - 0:31:37     Text: and optionally but commonly an output on each step um note that these two things are different um so

0:31:38 - 0:31:43     Text: we've talked about two ways that you could build language models, with one of them being RNNs,

0:31:43 - 0:31:49     Text: RNNs being a great way, but RNNs can also be used for a lot of other things so let me just quickly

0:31:49 - 0:31:55     Text: preview a few other things you can do with RNNs. So there are lots of tasks that people want to do

0:31:55 - 0:32:04     Text: in NLP which are referred to as sequence tagging tasks where we'd like to take words of text and do

0:32:04 - 0:32:11     Text: some kind of classification along the sequence so one simple common one is to give words parts of

0:32:11 - 0:32:18     Text: speech, that is, "the" is a determiner, "startled" is an adjective, "cat" is a noun, "knocked" is a verb, and well you can

0:32:18 - 0:32:26     Text: do this straightforwardly by using a recurrent neural network as a sequential classifier, where it's now

0:32:26 - 0:32:33     Text: going to generate parts of speech rather than the next word you can use a recurrent neural network

0:32:33 - 0:32:41     Text: for sentiment classification. Well, this time we don't actually want to generate an output at

0:32:41 - 0:32:46     Text: each word necessarily but we want to know what the overall sentiment looks like so somehow we

0:32:46 - 0:32:53     Text: want to get out a sentence encoding that we can perhaps put through another neural network layer

0:32:53 - 0:33:00     Text: to judge whether the sentence is positive or negative well the simplest way to do that is to think

0:33:00 - 0:33:11     Text: well after I've run my LSTM through the whole sentence actually this final hidden state it's

0:33:11 - 0:33:16     Text: encoded the whole sentence because remember I updated that hidden state based on each previous

0:33:16 - 0:33:21     Text: word and so you could say that this is the whole meaning of the sentence so let's just say that

0:33:21 - 0:33:29     Text: is the sentence encoding um and then put an extra um classifier layer on that with something like a

0:33:29 - 0:33:35     Text: softmax classifier um that method has been used and it actually works reasonably well and if you

0:33:35 - 0:33:42     Text: sort of train this model end to end well it's actually then motivated to preserve sentiment information

0:33:42 - 0:33:47     Text: in the hidden state of the recurrent neural network because that will allow it to better predict

0:33:47 - 0:33:54     Text: the sentiment of the whole sentence um which is the final task and hence loss function that we're

0:33:54 - 0:34:00     Text: giving the network but it turns out that you can commonly do better than that by actually

0:34:00 - 0:34:06     Text: doing things like feeding all hidden states into the sentence encoding perhaps by making the

0:34:06 - 0:34:14     Text: sentence encoding an element wise max or an element wise mean of all the hidden states because

0:34:14 - 0:34:20     Text: this then more symmetrically encodes the hidden state over each time step
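
A small sketch of the two sentence-encoding options just mentioned, using PyTorch's built-in nn.RNN with made-up sizes and a random stand-in sentence (illustrative only):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=100, hidden_size=200, batch_first=True)
classifier = nn.Linear(200, 2)                 # e.g. positive vs. negative sentiment

embeddings = torch.randn(1, 12, 100)           # 1 sentence of 12 word vectors (stand-in data)
states, h_final = rnn(embeddings)              # states: all hidden states; h_final: the last one

enc_last = h_final.squeeze(0)                  # option 1: final hidden state as the encoding
enc_max = states.max(dim=1).values             # option 2: element-wise max over all hidden states
enc_mean = states.mean(dim=1)                  #           or element-wise mean over all hidden states

logits = classifier(enc_max)                   # extra classifier layer (softmax applied in the loss)
```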

0:34:24 - 0:34:34     Text: another big use of recurrent neural networks is what I'll call language encoder module uses so

0:34:34 - 0:34:40     Text: anytime you have some text for example here we have a question of what nationality was Beethoven

0:34:41 - 0:34:49     Text: we'd like to construct some kind of neural representation of this so one way to do it is to run

0:34:50 - 0:34:56     Text: a recurrent neural network over it and then, just like last time, either take the final hidden

0:34:56 - 0:35:05     Text: state or take some kind of um function of all the hidden states and say that's the sentence representation

0:35:05 - 0:35:11     Text: and we could do the same thing um for the context so for question answering we're going to build some more

0:35:11 - 0:35:17     Text: neural net structure on top of that, and we'll learn more about that in a couple of weeks when we

0:35:17 - 0:35:24     Text: have the question answering lecture, but the key thing is that what we've built so far is used to get a

0:35:24 - 0:35:31     Text: sentence representation, so it's a language encoder module. So that was the language encoding part;

0:35:31 - 0:35:39     Text: we can also use RNNs to decode into language and that's commonly used in speech recognition

0:35:39 - 0:35:46     Text: machine translation summarization so if we have a speech recognizer the input is an audio signal

0:35:46 - 0:35:53     Text: and what we want to do is decode that into language well what we could do is use some function of

0:35:53 - 0:36:02     Text: the input, which is probably itself going to be a neural net, as the initial hidden state of

0:36:02 - 0:36:12     Text: our RNN LM, and then we say, start generating text based on that, and so it will then generate a

0:36:12 - 0:36:18     Text: word at a time by the method that we just looked at, and we turn the speech into text. So this is an

0:36:18 - 0:36:25     Text: example of a conditional language model because we're now generating text conditioned on the speech

0:36:25 - 0:36:31     Text: signal, and a lot of the time you can do interesting, more advanced things with recurrent neural

0:36:31 - 0:36:38     Text: networks by building conditional language models another place you can use conditional language

0:36:38 - 0:36:45     Text: models is for text classification tasks and including sentiment classification so if you can

0:36:45 - 0:36:52     Text: condition um your language model based on a kind of sentiment you can build a kind of classifier

0:36:52 - 0:36:57     Text: for that and another use that we'll see a lot of next class is for machine translation
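
A rough sketch of the conditional language model idea (illustrative assumptions only): some encoding of the conditioning input, here just a random vector standing in for an encoder's output, is used as the initial hidden state of the decoder RNN, which then generates conditioned on it.

```python
import torch
import torch.nn as nn

d = 200
decoder_rnn = nn.RNN(input_size=100, hidden_size=d, batch_first=True)

conditioning = torch.randn(1, 1, d)            # stand-in for an encoder's output (audio, source text, ...)
word_embeddings = torch.randn(1, 5, 100)       # embeddings of the first few words being generated

states, h = decoder_rnn(word_embeddings, conditioning)   # decoding starts from the conditioning state
```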

0:36:59 - 0:37:09     Text: okay so that's the end of the intro um to um doing things with um recurrent neural networks and

0:37:09 - 0:37:17     Text: language models now I want to move on and tell you about the fact that everything is not perfect

0:37:17 - 0:37:24     Text: and these recurrent neural networks tend to have a couple of problems and we'll talk about those

0:37:24 - 0:37:30     Text: and then in part that'll then motivate coming up with a more advanced recurrent neural network

0:37:30 - 0:37:39     Text: architecture so the first problem to be mentioned is the idea of what's called vanishing gradients

0:37:39 - 0:37:47     Text: and what does that mean well at the end of our sequence we have some overall um loss that we're

0:37:47 - 0:37:55     Text: calculating and well what we want to do is back propagate that loss um and we want to back propagate

0:37:55 - 0:38:03     Text: it right along the sequence and so we're working out the partials of j4 um with respect to the hidden

0:38:03 - 0:38:09     Text: state at time one and when we have a longer sequence we'll be working out the partials of j20

0:38:09 - 0:38:18     Text: with respect to the hidden state at time one and how do we do that well how we do it is by composition

0:38:18 - 0:38:27     Text: and the chain rule we've got a big long chain rule along the whole sequence um well if we're doing

0:38:27 - 0:38:37     Text: that um you know we're multiplying a ton of things together and so the danger of what tends to happen

0:38:37 - 0:38:46     Text: is that as we do these um multiplications a lot of time these partials between successive hidden

0:38:46 - 0:38:54     Text: states become small and so what happens is as we go along the gradient gets smaller and smaller

0:38:54 - 0:39:03     Text: and smaller and starts to peter out, and to the extent that it peters out, well then we've kind of

0:39:03 - 0:39:10     Text: got no upstream gradient and therefore we won't be changing the parameters at all and that turns out

0:39:10 - 0:39:19     Text: to be pretty problematic um so the next couple of slides sort of um say a little bit about the

0:39:19 - 0:39:31     Text: why and how this happens. What's presented here is a kind of only semi-formal, wave-your-hands account

0:39:31 - 0:39:36     Text: of the kind of problems that you might expect. If you really want to sort of get into all the

0:39:36 - 0:39:42     Text: details of this, you should look at the couple of papers that are mentioned in small print

0:39:42 - 0:39:47     Text: at the bottom of the slide but at any rate if you remember that this is our basic um

0:39:47 - 0:39:54     Text: recurrent neural network equation, well, let's consider an easy case: suppose we sort of get rid

0:39:54 - 0:40:02     Text: of our non-linearity and just assume that it's an identity function okay so then when we're working

0:40:02 - 0:40:08     Text: out the partials of the hidden state with respect to the previous hidden state um we can work those

0:40:08 - 0:40:19     Text: out in the usual way according to the chain rule and then if um sigma is um simply the identity

0:40:19 - 0:40:28     Text: function um well then everything gets really easy for us so only the the sigma just goes away

0:40:28 - 0:40:36     Text: and only the first term involves um h at time t minus 1 so the later terms go away

0:40:36 - 0:40:45     Text: and so um our gradient ends up as wh well that's doing it for just one time step what happens when

0:40:45 - 0:40:51     Text: you want to work out these partials a number of time steps away so we want to work it out the

0:40:51 - 0:41:03     Text: partial of time step i with respect to j um well what we end up with is a product of the

0:41:03 - 0:41:14     Text: partials of successive time steps um and well each of those um is coming out as wh and so we end

0:41:14 - 0:41:26     Text: up getting WH raised to the ℓ-th power, and well, our potential problem is that if WH is small

0:41:26 - 0:41:33     Text: in some sense, then this term gets exponentially problematic, i.e. it becomes vanishingly small

0:41:33 - 0:41:41     Text: as our sequence length becomes long well what can we mean by small well a matrix is small

0:41:41 - 0:41:47     Text: if its eigenvalues are all less than one, so we can rewrite what's happening with this

0:41:47 - 0:41:56     Text: successive multiplication using eigenvalues and eigenvectors, and I should say that all eigen-

0:41:56 - 0:42:01     Text: values less than one is a sufficient but not necessary condition for what I'm about to say.

0:42:01 - 0:42:10     Text: right so we can rewrite um things using the eigenvectors as a basis and if we do that um

0:42:10 - 0:42:20     Text: we end up getting the eigenvalues being raised to the ℓ-th power, and so if all of our eigen-

0:42:20 - 0:42:26     Text: values are less than one, if we're taking a number less than one and raising it to the ℓ-th power,

0:42:26 - 0:42:32     Text: that's going to approach zero as the sequence length grows and so the gradient vanishes
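
Sketching the informal argument above in symbols (with the identity-activation simplification):

```latex
\frac{\partial h^{(t)}}{\partial h^{(t-1)}} = W_h
\quad\Longrightarrow\quad
\frac{\partial h^{(i)}}{\partial h^{(j)}}
  = \prod_{t=j+1}^{i} \frac{\partial h^{(t)}}{\partial h^{(t-1)}}
  = W_h^{\,\ell}, \qquad \ell = i - j .
```

Writing the upstream gradient in the eigenvector basis of W_h with eigenvalues lambda_k, each component gets scaled by lambda_k to the power ℓ, which goes to zero as ℓ grows whenever |lambda_k| < 1.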

0:42:33 - 0:42:39     Text: okay now the reality is more complex than that because actually we always use a non-linear

0:42:39 - 0:42:45     Text: activation sigma but you know in principle it's sort of the same thing um apart from we have to

0:42:45 - 0:42:54     Text: consider in the effect of the non-linear activation okay so why is this a problem that the gradients

0:42:54 - 0:43:01     Text: disappear well suppose we're wanting to look at the influence of time steps well in the future

0:43:01 - 0:43:11     Text: uh on um the representations we want to have early in the sentence well um what's happening

0:43:11 - 0:43:16     Text: late in the sentence just isn't going to be giving much information about what we should

0:43:16 - 0:43:25     Text: be storing in the h at time one vector whereas on the other hand the loss at time step two is

0:43:25 - 0:43:32     Text: going to be giving a lot of information at what um should be stored in the hidden vector at time step

0:43:32 - 0:43:42     Text: one so the end result of that is that what happens is that these simple RNNs are very good at

0:43:42 - 0:43:50     Text: modeling nearby effects but they're not good at all at modeling long term effects because the

0:43:50 - 0:43:58     Text: gradient signal from far away is just lost too much and therefore the model never effectively gets

0:43:58 - 0:44:06     Text: to learn um what information from far away it would be useful to preserve into the future so let's

0:44:06 - 0:44:13     Text: consider that concretely um for the example of language models that we've worked on so here's

0:44:13 - 0:44:19     Text: a piece of text um when she tried to print her tickets she found that the printer was out of

0:44:19 - 0:44:25     Text: toner she went to the stationery store to buy more toner it was very overpriced after installing

0:44:25 - 0:44:32     Text: the toner into the printer she finally printed her and well you're all smart human beings i trust

0:44:32 - 0:44:40     Text: you can all guess what the word that comes next is it should be tickets um but well the problem is

0:44:40 - 0:44:47     Text: that for the RNN to learn cases like this it would have to carry through in its hidden state

0:44:47 - 0:44:56     Text: a memory of the word tickets for sort of whatever it is about 30 hidden state updates and well

0:44:56 - 0:45:04     Text: we'll train on this um example and so we'll be wanting it to predict tickets um is the next word

0:45:04 - 0:45:12     Text: and so a gradient update will be sent right back through the hidden states of the LSTM corresponding

0:45:12 - 0:45:20     Text: to this sentence and that should tell the model um is good to preserve information about the word

0:45:20 - 0:45:25     Text: tickets because that might be useful in the future here it was useful in the future um but the

0:45:25 - 0:45:33     Text: problem is that the gradient signal will just become far too weak out after a bunch of words

0:45:33 - 0:45:41     Text: and it just never learns that dependency um and so what we find in practice is the model is just

0:45:41 - 0:45:48     Text: unable to predict similar long distance dependencies at test time i spent quite a long time on

0:45:48 - 0:45:55     Text: vanishing gradients and and really vanishing gradients are the big problem in practice um with using

0:45:57 - 0:46:04     Text: recurrent neural networks over long sequences um but you know i have to do justice to the fact

0:46:04 - 0:46:09     Text: that you could actually also have the opposite problem you can also have exploding gradients so

0:46:09 - 0:46:19     Text: if a gradient becomes too big that's also a problem and it's a problem because the stochastic gradient

0:46:20 - 0:46:28     Text: update step becomes too big right so remember our parameter update is um based on the product

0:46:28 - 0:46:34     Text: of the learning rate and the gradient so if your gradient is huge right you've calculated

0:46:34 - 0:46:42     Text: oh it's got a lot of slope here this has a slope of 10,000 um then your parameter update can be

0:46:42 - 0:46:49     Text: arbitrarily large and that's potentially problematic that can cause a bad update where you take a

0:46:49 - 0:46:57     Text: huge step and you end up at a weird and bad parameter configuration so you sort of think

0:46:57 - 0:47:03     Text: you're coming up with a to a steep hill to climb and well you want to be climbing the hill to

0:47:03 - 0:47:11     Text: high likelihood but actually the gradient is so steep that you make an enormous um update and then

0:47:11 - 0:47:16     Text: suddenly your parameters are off somewhere far away and you've lost your hill altogether. There's also the

0:47:16 - 0:47:21     Text: practical difficulty that we only have so much resolution in floating point numbers, so

0:47:22 - 0:47:29     Text: if your gradient gets too steep you start getting NaNs (not-a-numbers) in your calculations, which

0:47:29 - 0:47:36     Text: ruin all your hard training work um we use a kind of an easy fix to this which is called gradient

0:47:36 - 0:47:44     Text: clipping um which is we choose some reasonable number and we say we're just not going to deal with

0:47:44 - 0:47:50     Text: gradients that are bigger than this number. A commonly used number is 20; you know, there's

0:47:50 - 0:47:56     Text: a range of spread, but not that high, you wouldn't use 10,000, somewhere sort of in that range.

0:47:56 - 0:48:05     Text: um and if the norm of the gradient is greater than that threshold we simply just scale it down

0:48:05 - 0:48:12     Text: which means that we then make a smaller gradient update so we're still moving in exactly the same

0:48:12 - 0:48:20     Text: um direction but we're taking a smaller step um so doing this gradient clipping is important um

0:48:20 - 0:48:31     Text: you know, but it's an easy problem to solve.
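
A minimal sketch of how this looks in code with PyTorch's built-in clipping utility, added to the hypothetical training step from earlier (the threshold of 20 is just the example value mentioned above):

```python
import torch
import torch.nn.functional as F

def training_step_with_clipping(model, optimizer, token_ids, max_norm=20.0):
    logits = model(token_ids[:-1])
    loss = F.cross_entropy(logits, token_ids[1:])
    optimizer.zero_grad()
    loss.backward()
    # If the overall gradient norm exceeds max_norm, every gradient is scaled down so the
    # norm equals max_norm: same direction, smaller step.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```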

0:48:31 - 0:48:42     Text: Okay, so the thing that we've still got left to solve is how to really solve this problem of vanishing gradients. So the problem is, yeah, these

0:48:42 - 0:48:51     Text: RNNs just can't preserve information over many time steps and one way to think about that intuitively

0:48:51 - 0:49:01     Text: is at each time step we have a hidden state and the hidden state is being completely changed

0:49:01 - 0:49:08     Text: at each time step and it's being changed in a multiplicative manner by multiplying by wh and

0:49:08 - 0:49:18     Text: then putting it through a nonlinearity. Maybe we can make some more progress if we

0:49:18 - 0:49:26     Text: could more flexibly maintain a memory in our recurrent neural network which we can

0:49:27 - 0:49:34     Text: manipulate in a more flexible manner that allows us to more easily preserve information

0:49:34 - 0:49:41     Text: and so this was an idea that people started thinking about and actually they started thinking

0:49:41 - 0:49:51     Text: about it a long time ago, in the late 1990s, and Hochreiter and Schmidhuber came up with this

0:49:51 - 0:50:00     Text: idea that got called long short term memory RNNs as a solution to the problem of vanishing gradients

0:50:00 - 0:50:08     Text: I mean so this 1997 paper is the paper you always see cited for LSTM's but you know actually

0:50:09 - 0:50:16     Text: in terms of what we now understand as an LSTM um it was missing part of it in fact it's missing

0:50:16 - 0:50:24     Text: what in retrospect has turned out to be the most important part of um the um modern LSTM so really

0:50:24 - 0:50:32     Text: in some sense the real paper that the modern LSTM is due to is this slightly later paper by

0:50:32 - 0:50:38     Text: Gers, Schmidhuber and Cummins from 2000, which additionally introduces the forget gate

0:50:38 - 0:50:48     Text: that I'll explain in a minute um yeah so um so this was some very clever stuff that was introduced

0:50:48 - 0:50:56     Text: and it turned out later to have an enormous impact um if I just diverge from the technical part

0:50:56 - 0:51:04     Text: for one more moment: you know, for those of you who these days think that mastering neural

0:51:04 - 0:51:11     Text: networks is the path to fame and fortune um the funny thing is you know at the time that this work

0:51:11 - 0:51:20     Text: was done that just was not true right very few people were interested in neural networks and

0:51:20 - 0:51:26     Text: although long short term memories have turned out to be one of the most important successful and

0:51:26 - 0:51:34     Text: influential ideas in neural networks for the following 25 years um really the original authors

0:51:34 - 0:51:41     Text: didn't get recognition for that so both of them are now professors at German universities um but

0:51:41 - 0:51:51     Text: Hochreiter moved over into doing bioinformatics work to find something to do, and

0:51:51 - 0:51:59     Text: Gers actually is doing kind of multimedia studies, so that's the fate of history.

0:52:00 - 0:52:11     Text: okay so what is an LSTM so the crucial innovation of an LSTM is to say well rather than just having

0:52:11 - 0:52:22     Text: one hidden vector in the recurrent model we're going to um build a model with two um hidden vectors

0:52:22 - 0:52:30     Text: at each time step one of which is still called the hidden state H and the other of which is called

0:52:30 - 0:52:37     Text: the cell state um now you know arguably in retrospect these were named wrongly because as you'll

0:52:37 - 0:52:44     Text: see when we look at in more detail in some sense the cell is more equivalent to the hidden state of

0:52:44 - 0:52:50     Text: the simple RNN than vice versa but we're just going with the names that everybody uses so both

0:52:50 - 0:52:57     Text: of these are vectors of length N um and it's going to be the cell that stores long term information

0:52:58 - 0:53:05     Text: and so we want to have something that's more like memory so the meaning like RAM and the computer

0:53:05 - 0:53:13     Text: um so the cell is designed so you can read from it you can erase parts of it and you can write new

0:53:13 - 0:53:22     Text: information to the cell um and the interesting part of an LSTM is then it's got control structures

0:53:22 - 0:53:29     Text: to decide how you do that so the selection of which information to erase write and read is controlled

0:53:29 - 0:53:39     Text: by probabilistic gates so the gates are also vectors of length N and on each time step um we work out

0:53:39 - 0:53:45     Text: a state for the gate vectors so each element of the gate vectors is a probability so they can be

0:53:45 - 0:53:52     Text: open probability one close probability zero or somewhere in between and their value will be saying

0:53:52 - 0:54:00     Text: how much do you erase how much do you write how much do you read and so these are dynamic gates

0:54:00 - 0:54:08     Text: with a value that's computed based on the current context okay so in this next slide we go

0:54:08 - 0:54:14     Text: through the equations of an LSTM but following this there are some more graphic slides which will

0:54:14 - 0:54:21     Text: probably be easier to absorb. Right, so again, just like before with a recurrent neural network,

0:54:21 - 0:54:29     Text: we have a sequence of inputs xt, and we're going to, at each time step, compute a cell state

0:54:29 - 0:54:37     Text: and a hidden state. So how do we do that? So firstly we're going to compute values of the three gates

0:54:37 - 0:54:45     Text: and so we're computing the gate values using an equation that's identical to the equation

0:54:45 - 0:54:53     Text: for the simple recurrent neural network, but in particular, oops, sorry, let me

0:54:53 - 0:55:00     Text: just say what the gates are first. So there's a forget gate, which will control what is kept

0:55:00 - 0:55:07     Text: in the cell at the next time step versus what is forgotten there's an input gate which is going

0:55:07 - 0:55:14     Text: to determine which parts of the calculated new cell content get written to the cell memory and there's

0:55:14 - 0:55:20     Text: an output gate which is going to control what parts of the cell memory are moved over into the hidden

0:55:20 - 0:55:28     Text: state and so each of these is using the logistic function because we want them to be in each

0:55:28 - 0:55:35     Text: element of this vector a probability which will say whether to fully forget partially forget

0:55:35 - 0:55:43     Text: or fully remember. Yeah, and the equation for each of these is exactly like the simple RNN

0:55:43 - 0:55:49     Text: equation, but note of course that we've got different parameters for each one, so we've got a

0:55:49 - 0:55:58     Text: forget weight matrix W with a forget bias and a forget multiplier of the input.

0:56:00 - 0:56:08     Text: okay so then we have the other equations that really are the mechanics of the LSTM

0:56:08 - 0:56:16     Text: so we have something that will calculate a new cell content so this is our candidate update

0:56:16 - 0:56:24     Text: and so for calculating the candidate update we're again essentially using exactly the same simple

0:56:24 - 0:56:31     Text: r and n equation apart from now it's usual to use 10h so you get something that I discussed

0:56:31 - 0:56:40     Text: last time is balanced around zero okay so then to actually update things we use our gates so

0:56:40 - 0:56:48     Text: for our new cell content what the idea is is that we want to remember some but probably not all

0:56:48 - 0:56:57     Text: of what we had in the cell from previous time steps and we want to store some but probably not

0:56:57 - 0:57:08     Text: all of the value that we've calculated as the new cell update and so the way we do that is we take

0:57:08 - 0:57:16     Text: the previous cell content and then we take its hard-amired product with the forget vector

0:57:16 - 0:57:26     Text: and then we add to it the hard-amired product of the input gate times the candidate cell update

0:57:30 - 0:57:37     Text: and then for working out the new hidden state we then work out which parts of the cell

0:57:38 - 0:57:46     Text: to expose in the hidden state and so after taking a 10h transform of the cell we then

0:57:46 - 0:57:52     Text: take the hard-amired product with the output gate and that gives us our hidden representation

0:57:52 - 0:57:58     Text: and if this hidden representation that we then put through a soft softmax layer to generate

0:57:59 - 0:58:08     Text: our next output of our LSTM or current neural network yeah so the gates and the things that they're

0:58:08 - 0:58:16     Text: put with our vectors of size n and what we're doing is we're taking each element of them and

0:58:16 - 0:58:23     Text: multiplying them element wise to work out a new vector and then we get two vectors and that

0:58:23 - 0:58:30     Text: we're adding together so this way of doing things element wise you sort of don't really see and

0:58:30 - 0:58:37     Text: standard linear algebra course it's referred to as the hard-amired product it's represented by some

0:58:37 - 0:58:42     Text: kind of circle I mean actually in more modern work it's been more usual to represent it with

0:58:42 - 0:58:49     Text: this slightly bigger circle with the dot at the middle as the hard-amired product symbol and

0:58:49 - 0:58:54     Text: someday I'll change these slides to be like that but I was lazy and redoing the equations but the

0:58:54 - 0:59:00     Text: other notation you do see quite often is just using the same little circle that you use for function

0:59:00 - 0:59:08     Text: composition to represent hard-amired product okay so all of these things are being done as vectors

0:59:08 - 0:59:16     Text: of the same length n and the other thing that you might notice is that the candidate update and

0:59:17 - 0:59:26     Text: forget import and output gates all have a very similar form the only difference is three logistics

0:59:26 - 0:59:34     Text: in one 10-h and none of them depend on each other so all four of those can be calculated parallel

0:59:34 - 0:59:40     Text: and if you want to have an efficient LSTM implementation that's what you do okay so here's the more

0:59:40 - 0:59:48     Text: graphical presentation of this so these pictures come from Chris Ola and I guess he did such a

0:59:48 - 0:59:56     Text: nice job at producing pictures for LSTMs that almost everyone uses them these days and so this

0:59:56 - 1:00:06     Text: sort of pulls apart the computation graph of an LSTM unit so blowing this up you've got from

1:00:06 - 1:00:15     Text: the previous time step both your cell and hidden recurrent vectors and so you feed the hidden

1:00:18 - 1:00:25     Text: vector from the previous time step and the new input xt into the computation of the gates which

1:00:25 - 1:00:32     Text: is happening down the bottom so you compute the forget gate and then you use the forget gate in a

1:00:32 - 1:00:40     Text: hard-amired product here drawn as a actually a time symbol so forget some cell content you work out

1:00:40 - 1:00:49     Text: the input gate and then using the input gate and a regular recurrent neural network like computation

1:00:49 - 1:01:00     Text: you can compute candidate new cell content and so then you add those two together to get the new cell

1:01:00 - 1:01:08     Text: content which then heads out as the new cell content at time t but then you also have worked out

1:01:08 - 1:01:16     Text: an output gate and so then you take the cell content put it through another non-linearity and

1:01:16 - 1:01:24     Text: multi-hard-amired product it with the output gate and that then gives you the new hidden state

1:01:26 - 1:01:34     Text: so this is all kind of complex but as to understanding why something is different as happening here

1:01:34 - 1:01:44     Text: the thing to notice is that the cell state from t minus 1 is passing right through this to be the

1:01:44 - 1:01:54     Text: cell state at time t without very much happening to it so some of it is being deleted by the forget gate

1:01:55 - 1:02:03     Text: and then some new stuff is being written to it as a result of using this candidate new cell

1:02:03 - 1:02:15     Text: content but the real secret of the LSTM is that new stuff is just being added to the cell with an

1:02:15 - 1:02:23     Text: addition right so in the simple RNN at each success of step you are doing a multiplication

1:02:23 - 1:02:30     Text: and that makes it incredibly difficult to learn to preserve information in the hidden state

1:02:30 - 1:02:36     Text: over a long period of time it's not completely impossible but it's a very difficult thing to

1:02:36 - 1:02:44     Text: learn whereas with this new LSTM architecture it's trivial to preserve information the cell

1:02:44 - 1:02:51     Text: from one time step to the next you just don't forget it and it'll carry right through with perhaps

1:02:51 - 1:03:00     Text: some new stuff added in to also remember and so that's the sense in which the cell behaves much

1:03:00 - 1:03:06     Text: more like RAM and a conventional computer that storing stuff and extra stuff can be stored

1:03:06 - 1:03:15     Text: into it and other stuff can be deleted from it as you go along. Okay so the LSTM architecture

1:03:15 - 1:03:24     Text: makes it much easier to preserve information from many time steps and I just right so in particular

1:03:24 - 1:03:32     Text: standard practice with LSTMs is to initialize the forget gate to a one vector which it's just

1:03:32 - 1:03:39     Text: so that a starting point is to say preserve everything from previous time steps and then

1:03:40 - 1:03:47     Text: it is then learning when it's appropriate to forget stuff and contrast is very hard to get

1:03:47 - 1:03:55     Text: or a simple RNN to preserve stuff for a very long time. I mean what does that actually mean?

1:03:56 - 1:04:03     Text: Well you know I've put down some numbers here I mean you know how what you get in practice

1:04:03 - 1:04:08     Text: you know depends on a million things it depends on the nature of your data and how much data you

1:04:08 - 1:04:15     Text: have and what dimensionality your hidden states are blurdy blurdy blur but just to give you

1:04:15 - 1:04:23     Text: some idea of what's going on is typically if you train a simple recurrent neural network

1:04:23 - 1:04:28     Text: that it's effective memory it's ability to be able to use things in the past to condition the

1:04:28 - 1:04:34     Text: future goes for about seven time steps you just really can't get it to remember stuff further

1:04:34 - 1:04:44     Text: back in the past than that whereas for the LSTM it's not complete magic it doesn't work forever

1:04:44 - 1:04:51     Text: but you know it's effectively able to remember and use things from much much further back so typically

1:04:51 - 1:04:58     Text: you find that with an LSTM you can effectively remember and use things about a hundred time steps

1:04:58 - 1:05:04     Text: back and that's just enormously more useful for a lot of the natural language understanding

1:05:04 - 1:05:13     Text: tasks that we want to do and so that was precisely what the LSTM was designed to do and I mean so

1:05:13 - 1:05:20     Text: in particular just going back to its name quite a few people miss paths its name the idea of

1:05:20 - 1:05:26     Text: its name was there's a concept of short term memory which comes from psychology and it'd been

1:05:26 - 1:05:34     Text: suggested for simple RNNs that the hidden state of the RNN could be a model of human short term

1:05:34 - 1:05:41     Text: memory and then there would be something somewhere else that would deal with human long term memory

1:05:41 - 1:05:47     Text: but while people had found that this only gave you a very short short term memory so what

1:05:48 - 1:05:52     Text: Hock-Rider and Schmidt who were interested in was how we could give

1:05:53 - 1:06:00     Text: construct models with a long short term memory and so that then gave us this name of LSTM.

1:06:03 - 1:06:10     Text: LSTMs don't guarantee that there are no vanishing exploding gradients but in practice they provide

1:06:10 - 1:06:17     Text: they they don't tend to explode nearly the same way again that plus sign is crucial rather than a

1:06:17 - 1:06:23     Text: multiplication and so they're a much more effective way of learning long distance dependencies.

1:06:25 - 1:06:36     Text: Okay so despite the fact that LSTMs were developed around 1997 2000 it was really only in the early

1:06:36 - 1:06:45     Text: 2010s that the world woke up to them and how successful they were so it was really around 2013 to

1:06:45 - 1:06:53     Text: 2015 that LSTMs sort of hit the world achieving state-of-the-art results on all kinds of problems.

1:06:53 - 1:06:59     Text: One of the first big demonstrations was for handwriting recognition then speech recognition

1:06:59 - 1:07:06     Text: and then going on to a lot of natural language tasks including machine translation, parsing,

1:07:07 - 1:07:13     Text: vision and language tasks like minskapshening as well of course using them for language models

1:07:13 - 1:07:20     Text: and around these years LSTMs became the dominant approach for most NLP tasks. The easiest way to

1:07:20 - 1:07:30     Text: build a good strong model was to approach the problem with an LSTM. So now in 2021 actually LSTMs

1:07:30 - 1:07:36     Text: are starting to be supplanted or have been supplanted by other approaches particularly transformer models

1:07:37 - 1:07:42     Text: which we'll get to in the class in a couple of weeks time. So this is the sort of picture you can see.

1:07:42 - 1:07:49     Text: So for many years there's been a machine translation conference and so a Bake Off competition

1:07:49 - 1:07:57     Text: called WMT workshop on machine translation. So if you look at the history of that in WMT 2014

1:07:58 - 1:08:05     Text: there was zero neural machine translation systems in the competition. 2014 was actually the first

1:08:05 - 1:08:14     Text: year that the success of LSTMs for machine translation was proven in a conference paper

1:08:14 - 1:08:25     Text: but nothing occurred in this competition. By 2016 everyone had jumped on LSTMs as working great

1:08:26 - 1:08:31     Text: and lots of people including the winner of the competition was using an LSTM model.

1:08:32 - 1:08:41     Text: If you then jump ahead to 2019 then there's relatively little use of LSTMs and the vast majority

1:08:41 - 1:08:47     Text: of people are now using transformers. So things change quickly in your network land and I keep

1:08:47 - 1:08:56     Text: on having to rewrite these lectures. So quick further note on vanishing and exploding gradients.

1:08:56 - 1:09:03     Text: Is it only a problem with recurrent neural networks? It's not. It's actually a problem that also

1:09:03 - 1:09:09     Text: occurs anywhere where you have a lot of depth including feed forward and convolutional neural networks.

1:09:09 - 1:09:17     Text: As any time when you've got long sequences of chain rules which give you multiplications the

1:09:17 - 1:09:26     Text: gradient can become vanishingly small as it back propagates. And so generally sort of lower layers

1:09:26 - 1:09:32     Text: are learned very slowly in a hard to train. So there's been a lot of effort in other places as well

1:09:32 - 1:09:41     Text: to come up with different architectures that let you learn more efficiently in deep network.

1:09:41 - 1:09:48     Text: And the commonest way to do that is to add more direct connections that allow the gradient to flow.

1:09:48 - 1:09:56     Text: So the big thing in vision in the last few years has been resnets where the res stands for residual

1:09:56 - 1:10:04     Text: connections. And so the way they made this picture is upside down so the input is at the top is that you

1:10:05 - 1:10:12     Text: have these sort of two paths that are summed together. One path is just an identity path and the other

1:10:12 - 1:10:18     Text: one goes through some neural network layers. And so therefore it's default behavior is just to preserve

1:10:18 - 1:10:26     Text: the input which might sound a little bit like what we just saw for LSTMs. There are other methods

1:10:26 - 1:10:31     Text: that there have been dense nets where you add skip connections forward to every layer.

1:10:33 - 1:10:39     Text: Highway nets were also actually developed by Schmitt Hoover and sort of a reminiscent of what was

1:10:39 - 1:10:48     Text: done with LSTMs. So rather than just having an identity connection as a resnet has, it introduces an extra

1:10:48 - 1:10:55     Text: gate. So it looks more like an LSTM which says how much to send the input through the highway

1:10:56 - 1:11:02     Text: versus how much to put it through a neural net layer and those two are then combined into the

1:11:02 - 1:11:14     Text: output. So essentially this problem occurs anywhere when you have a lot of depth in your layers of

1:11:14 - 1:11:22     Text: neural network. But it first arose and turns out to be especially problematic with recurrent

1:11:22 - 1:11:28     Text: neural networks. They're particularly unstable because of the fact that you've got this one

1:11:28 - 1:11:33     Text: weight matrix that you're repeatedly using through the time sequence.

1:11:35 - 1:11:36     Text: Okay.

1:11:40 - 1:11:46     Text: So Chris, we've got a couple of questions more or less about whether you would ever want to use

1:11:46 - 1:11:52     Text: an RN like a simple RNN instead of an LSTM. How does the LSTM learn what to do with its gates?

1:11:52 - 1:12:02     Text: How can you apply in on those things? Sure. So I think basically the answer is you should never

1:12:02 - 1:12:09     Text: use a simple RNN these days. You should always use an LSTM. I mean, you know, obviously that depends

1:12:09 - 1:12:14     Text: on what you're doing. If you're wanting to do some kind of analytical paper or something, you might

1:12:14 - 1:12:23     Text: prefer a simple RNN. And it is the case that you can actually get decent results with simple RNNs

1:12:23 - 1:12:29     Text: providing you're very careful to make sure that things aren't exploding nor vanishing.

1:12:32 - 1:12:39     Text: But, you know, in practice, getting simple RNNs to work and preserve long contexts is

1:12:39 - 1:12:45     Text: incredibly difficult where you can train LSTMs and they will just work. So really, you should

1:12:45 - 1:12:54     Text: always just use an LSTM. Now wait, the second question was... I think there's a bit of confusion

1:12:54 - 1:13:04     Text: about like whether the gates are learning differently. Yeah. So the gates are also just learned. So

1:13:04 - 1:13:13     Text: if we go back to these equations, you know, this is the complete model. And when we're training the

1:13:13 - 1:13:22     Text: model, every one of these parameters, so all of these WU and B's, everything is simultaneously

1:13:22 - 1:13:32     Text: being trained by BackProp. So that what you hope and indeed it works is the model is learning

1:13:32 - 1:13:38     Text: what stuff should I remember for a long time versus what stuff should I forget, what things in

1:13:38 - 1:13:44     Text: the input are important versus what things in the input don't really matter. So it can learn things

1:13:44 - 1:13:50     Text: like function words like A and D, don't really matter even though everyone uses them in English.

1:13:51 - 1:13:57     Text: So you can just not worry about those. So all of this is learned. And the models do actually

1:13:57 - 1:14:04     Text: successfully learn gate values about what information is useful to preserve long term versus what

1:14:04 - 1:14:10     Text: information is really only useful short term for predicting the next one or two words.

1:14:11 - 1:14:19     Text: Finally, the gradient improvements due to the... So you said that the addition is really important

1:14:19 - 1:14:23     Text: between the New Cell candidate and the Cell State. I don't think at least a couple of students

1:14:23 - 1:14:29     Text: have sort of questioned that. So if you want to go over that again, then maybe useful. Sure.

1:14:32 - 1:14:42     Text: So what we would like is an easy way for memory to be preserved long term. And

1:14:43 - 1:14:49     Text: you know, one way, which is what ResNet's use is just to sort of completely have a direct path

1:14:49 - 1:14:56     Text: from CT minus one to CT and will preserve entirely the history. So there's kind of... There's

1:14:56 - 1:15:06     Text: the fault action of preserving information about the past long term. LSTMs don't quite do that,

1:15:06 - 1:15:14     Text: but they allow that function to be easy. So you start off with the previous Cell State and you

1:15:14 - 1:15:19     Text: can forget some of it by the Forget Gate, so you can delete stuff out of your memory that's used for

1:15:19 - 1:15:26     Text: operation. And then while you're going to be able to update the content of the Cell with this,

1:15:26 - 1:15:33     Text: the right operation that occurs in the plus where depending on the input gate, some parts of what's

1:15:33 - 1:15:41     Text: in the Cell will be added to. But you can think of that adding as overlaying extra information.

1:15:41 - 1:15:47     Text: Everything that was in the Cell that wasn't forgotten is still continuing on to the next time step.

1:15:49 - 1:15:56     Text: And in particular, when you're doing the back propagation through time, that there isn't...

1:15:58 - 1:16:04     Text: I want to say there isn't a multiplication between CT and CT minus one. And there's this unfortunate

1:16:05 - 1:16:10     Text: time symbol here, but remember that's the Hadamard product, which is zeroing out part of it with

1:16:10 - 1:16:16     Text: the Forget Gate. It's not a multiplication by a matrix like in the simple RNN.

1:16:22 - 1:16:27     Text: I hope that's good. Okay, so there are a couple of other things that I

1:16:28 - 1:16:32     Text: wanted to get through before the end. I guess I'm not going to have time to do both of them,

1:16:32 - 1:16:38     Text: I think, so I'll do the last one probably next time. So these are actually simple and easy things,

1:16:38 - 1:16:47     Text: but they complete our picture. So I sort of briefly alluded to this example of sentiment classification

1:16:47 - 1:16:58     Text: where what we could do is run an RNN, maybe an LSTM over a sentence, call this our representation

1:16:58 - 1:17:08     Text: of the sentence and you feed it into a softmax classifier to classify for sentiment. So what we

1:17:08 - 1:17:15     Text: are actually saying there is that we can regard the hidden state as a representation of a word in

1:17:15 - 1:17:25     Text: context, that below that we have just a word vector for terribly, but we then looked at our context

1:17:25 - 1:17:34     Text: and say, okay, we've now created a hidden state representation for the word terribly in the context

1:17:34 - 1:17:41     Text: of the movie was and that proves to be a really useful idea because words have different meanings

1:17:41 - 1:17:49     Text: in different contexts, but it seems like there's a defect of what we've done here because our context

1:17:49 - 1:17:56     Text: only contains information from the left. What about right context? Surely it also be useful to have the

1:17:56 - 1:18:04     Text: meaning of terribly depend on exciting because often words mean different things based on what follows

1:18:04 - 1:18:12     Text: them. So if you have something like red wine, it means something quite different from a red light.

1:18:14 - 1:18:20     Text: So how could we deal with that? Well, an easy way to deal with that would be to say, well, if we're

1:18:20 - 1:18:27     Text: just going to come up with a neural encoding of a sentence, we could have a second RNN with

1:18:27 - 1:18:32     Text: completely separate parameters learned and we could run it backwards through the sentence

1:18:33 - 1:18:39     Text: to get a backward representation of each word and then we can get an overall representation of

1:18:39 - 1:18:45     Text: each word in context by just concatenating those two representations and now we've got a

1:18:45 - 1:18:55     Text: representation of terribly that has both left and right context. So we're simply running a forward

1:18:55 - 1:19:02     Text: RNN and when I say RNN here, that just means any kind of recurrent neural network so commonly

1:19:02 - 1:19:09     Text: it'll be an LSTM and the backward one and then at each time step we just concatenating their

1:19:09 - 1:19:17     Text: representations with each of these having separate weights. And so then we regard this concatenated

1:19:17 - 1:19:24     Text: thing as the hidden state, the contextual representation of a token at a particular time that we pass

1:19:24 - 1:19:33     Text: forward. This is so common that people use a shortcut to denote that and now just draw this picture

1:19:33 - 1:19:40     Text: with two sided arrows and when you see that picture with two sided arrows it means that you're

1:19:40 - 1:19:50     Text: running two RNNs one in each direction and then concatenating their results at each time step

1:19:50 - 1:19:59     Text: and that's what you're going to use later in the model. Okay, but so if you're doing an encoding

1:19:59 - 1:20:08     Text: problem like for sentiment classification or question answering using bidirectional RNNs is a

1:20:08 - 1:20:14     Text: great thing to do but they're only applicable if you have access to the entire input sequence.

1:20:15 - 1:20:22     Text: They're not applicable to language modeling because in a language model necessarily you have to

1:20:22 - 1:20:29     Text: generate the next word based on only the preceding context. But if you do have the entire input

1:20:29 - 1:20:36     Text: sequence that bidirectionality gives you greater power and indeed that's been an idea that people

1:20:36 - 1:20:44     Text: have built on in subsequent work. So when we get to transformers in a couple of weeks we'll spend

1:20:44 - 1:20:51     Text: plenty of time on the BERT model where that acronym stands for bidirectional encoder representations

1:20:51 - 1:21:00     Text: from transformers. So part of what's important in that model is the transformer but really a central

1:21:00 - 1:21:07     Text: point of the paper was to say that you could build more powerful models using transformers by again

1:21:08 - 1:21:17     Text: exploiting bidirectionality. Okay, there's one teeny bit left on RNNs but I'll sneak it into next class

1:21:17 - 1:21:23     Text: and I'll call it the end for today and if there are other things you'd like to ask questions about

1:21:23 - 1:21:53     Text: you can find me on NOX again in just in just a minute. Okay, so see you again next Tuesday.