Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 3 - Backprop and Neural Networks

0:00:00 - 0:00:14     Text: Hi everyone, I'll get started. Okay, so we're now back to the second week of CS224N on

0:00:14 - 0:00:19     Text: Natural Language Processing with Deep Learning. Okay, so for today's lecture, what we're

0:00:19 - 0:00:27     Text: going to be looking at is all the math details of neural net learning. First of

0:00:27 - 0:00:33     Text: all, looking at how we can work out by hand, gradients for training neural networks, and

0:00:33 - 0:00:40     Text: then looking at how it's done more algorithmically, which is known as the back propagation algorithm.

0:00:40 - 0:00:47     Text: And correspondingly for you guys, well, I hope you remembered that one minute ago was when

0:00:47 - 0:00:52     Text: assignment one was due and everyone has handed that in. If there are some of you who haven't handed it

0:00:52 - 0:00:58     Text: in, you really should do so as soon as possible; it's best to preserve those late days for the

0:00:58 - 0:01:04     Text: harder assignments. So I mean, I actually forgot to mention, we actually did make one change

0:01:04 - 0:01:09     Text: for this year to make it a bit easier when occasionally people join the class a week

0:01:09 - 0:01:16     Text: late. If you want, this year in the grading, assignment one can be discounted and we'll

0:01:16 - 0:01:21     Text: just use your other four assignments. But if you've been in the class so far, like that 98%

0:01:21 - 0:01:26     Text: of people, well, since assignment one is the easiest assignment, again, it's silly not

0:01:26 - 0:01:34     Text: to do it and have it as part of your grade. Okay, so starting today, we've put out assignment

0:01:34 - 0:01:39     Text: two and assignment two is all about making sure you really understand the math of neural

0:01:39 - 0:01:47     Text: networks and then the software that we use to do that math. So this is going to be a

0:01:47 - 0:01:54     Text: bit of a tough week for some. Some people, who are great on all their math background,

0:01:54 - 0:01:59     Text: will feel like this is stuff they know well, nothing very difficult, but I know there

0:01:59 - 0:02:06     Text: are quite a few of you for whom this lecture and week is the biggest struggle of the course.

0:02:06 - 0:02:11     Text: We really do want people to actually have an understanding of what goes on in neural network

0:02:11 - 0:02:17     Text: learning rather than viewing it as some kind of deep magic. And I hope that some of

0:02:17 - 0:02:22     Text: the material we give today and that you read up on and use in the assignment will really

0:02:22 - 0:02:29     Text: give you more of a sense of what these neural networks are doing and how it is just math

0:02:29 - 0:02:35     Text: that's applied systematically at large scale to work out the answers, and that this will

0:02:35 - 0:02:40     Text: be valuable and give you a deeper sense of what's going on. But if this material seems

0:02:40 - 0:02:48     Text: very scary and difficult, you can take some refuge in the fact that there's at least light

0:02:48 - 0:02:53     Text: at the end of the tunnel since this is really the only lecture that's heavily going through

0:02:53 - 0:02:58     Text: the math details of neural networks. After that, we'll be kind of popping back up to a

0:02:58 - 0:03:05     Text: higher level and by and large, after this week, we'll be making use of software to do a

0:03:05 - 0:03:13     Text: lot of the complicated math for us. But nevertheless, I hope this is valuable. I'll go through everything

0:03:13 - 0:03:18     Text: quickly today, but if this isn't stuff that you know backwards, I really do encourage

0:03:18 - 0:03:27     Text: you to work through it and get help as you need it. So do come along to our office hours.

0:03:27 - 0:03:32     Text: There are also a number of pieces of tutorial material given in the syllabus. So there's

0:03:32 - 0:03:38     Text: both the lecture notes. There's some materials from CS231N. In the list of readings, the

0:03:38 - 0:03:46     Text: very top reading is some material put together by Kevin Clark a couple of years ago. And

0:03:46 - 0:03:52     Text: actually, that one's my favorite. The presentation there fairly closely follows the presentation

0:03:52 - 0:03:57     Text: in this lecture of going through matrix calculus. So, you know, personally, I'd recommend

0:03:57 - 0:04:01     Text: starting with that one, but there are four different ones you can choose from. If one

0:04:01 - 0:04:09     Text: of them seems more helpful to you. Two other things on what's coming up. Actually, for Thursday's

0:04:09 - 0:04:14     Text: lecture, we make a big change. And Thursday's lecture is probably the most linguistic

0:04:14 - 0:04:20     Text: lecture of the whole class where we go through the details of dependency grammar and dependency

0:04:20 - 0:04:24     Text: parsing. Some people find that tough as well, but at least it'll be tough in a different

0:04:24 - 0:04:30     Text: way. And then one other really good opportunity is this Friday, we have our second tutorial

0:04:30 - 0:04:36     Text: at 10am, which is an introduction to PyTorch, which is the deep learning framework that

0:04:36 - 0:04:41     Text: we'll be using for the rest of the class once we've gone through these first two assignments

0:04:41 - 0:04:46     Text: where you do things by yourself. So this is a great chance, again, to get an intro to PyTorch.

0:04:46 - 0:04:55     Text: It will be really useful for later in the class. Okay. Today's material is really all about

0:04:55 - 0:05:01     Text: sort of the math of neural networks, but just to sort of introduce a setting where we can

0:05:01 - 0:05:08     Text: work through this, I'm going to introduce a simple NLP task and a simple form of classifier

0:05:08 - 0:05:14     Text: that we can use for it. So the task of named entity recognition is a very common basic

0:05:14 - 0:05:20     Text: NLP task. And the goal of this is you're looking through pieces of text and you're wanting

0:05:20 - 0:05:26     Text: to label by labeling the words, which words belong to entity categories like persons,

0:05:26 - 0:05:33     Text: locations, products, dates, times, et cetera. So for this piece of text, last night Paris

0:05:33 - 0:05:39     Text: Hilton wowed in a sequin gown, Samuel Quinn was arrested in the Hilton Hotel in Paris

0:05:39 - 0:05:47     Text: in April 1989, some words are being labeled as named entities as shown. These two sentences

0:05:47 - 0:05:53     Text: don't actually belong together in the same article, but I chose those two sentences to illustrate

0:05:53 - 0:05:59     Text: the basic point that it's not the case that you can just do this task by using a dictionary. Yes,

0:05:59 - 0:06:06     Text: a dictionary is helpful to know that Paris can possibly be a location, but Paris can also

0:06:06 - 0:06:12     Text: be a person name. So you have to use context to get named entity recognition right.

0:06:12 - 0:06:20     Text: Okay, well, how might we do that with the neural network? There are much more advanced

0:06:20 - 0:06:29     Text: ways of doing this, but a simple yet already pretty good way of doing named entity recognition

0:06:29 - 0:06:36     Text: with a simple neural net is to say, well, what we're going to do is use the word vectors

0:06:36 - 0:06:43     Text: that we've learned about and we're going to build up a context window of word vectors.

0:06:43 - 0:06:49     Text: And then we're going to put those through a neural network layer and then feed it through

0:06:49 - 0:06:55     Text: a softmax classifier of the kind that we, sorry, I said that wrong. And then we're going

0:06:55 - 0:07:00     Text: to feed it through a logistic classifier of the kind that we saw when looking at negative

0:07:00 - 0:07:09     Text: sampling, which is going to say for a particular entity type such as location, is it high probability

0:07:09 - 0:07:15     Text: location or is it not a high probability location. So for a sentence like the museums in

0:07:15 - 0:07:20     Text: Paris are amazing to see what we're going to do is for each word say we're doing the

0:07:20 - 0:07:27     Text: word Paris, we're going to form a window around it say a plus or minus two word window.

0:07:27 - 0:07:33     Text: And so for those five words, we're going to get word vectors for them from the kind of word

0:07:33 - 0:07:39     Text: to vac or glove word vectors we've learned. And we're going to make a long vector out of

0:07:39 - 0:07:44     Text: the concatenation of those five word vectors. So the word of interest is in the middle.

0:07:44 - 0:07:50     Text: And then we're going to feed this vector to a classifier, which at the end is going

0:07:50 - 0:07:56     Text: to have a probability of the word being a location. And then we could have another

0:07:56 - 0:08:01     Text: classifier that says the probability of the word being a person name. And so once we've done

0:08:01 - 0:08:05     Text: that, we're then going to run it at the next position. So we then say, well, is the word

0:08:05 - 0:08:12     Text: 'are' a location, and we'd feed a window of five words, which is then in Paris are amazing to,

0:08:12 - 0:08:18     Text: and put it through the same kind of classifier. And so this is the classifier that we'll use.

0:08:18 - 0:08:25     Text: So its input will be this word window. So if we have d-dimensional word vectors, this

0:08:25 - 0:08:32     Text: will be a 5d vector. And then we're going to put it through a layer of a neural network.

0:08:32 - 0:08:39     Text: So the layer of the neural network is going to multiply this vector by a matrix, add

0:08:39 - 0:08:49     Text: on a bias vector, and then put that through a non-linearity such as the softmax transformation

0:08:49 - 0:08:55     Text: that we've seen before. And that will give us a hidden vector, which might be of a smaller

0:08:55 - 0:09:02     Text: dimensionality such as this one here. And so then with that hidden vector, we're then

0:09:02 - 0:09:10     Text: going to take the dot product of it with an extra vector here, here's u. So we take

0:09:10 - 0:09:17     Text: u dot product h. And so when we do that, we're getting out a single number. And that

0:09:17 - 0:09:24     Text: number can be any real number. And so then finally, we're going to put that number through

0:09:24 - 0:09:31     Text: a logistic transform of the same kind that we saw when doing negative sampling. And the

0:09:31 - 0:09:38     Text: logistic transform will take any real number and it'll transform it into a probability

0:09:38 - 0:09:44     Text: that that word is a location. So its output is the predicted probability of the word

0:09:44 - 0:09:50     Text: belonging to a particular class. And so this could be our location classifier, which

0:09:50 - 0:09:56     Text: could classify each word in a window as to what the probability is that it's a location

0:09:56 - 0:10:03     Text: word. And so this little neural network here is the neural network I'm going to use today when going through some of the math.
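As a rough illustration of the forward pass just described, here is a minimal NumPy sketch; the sizes, the random initialization, and the choice of sigmoid for f are assumptions for illustration only, while W, b, u, and the 5d window follow the lecture's setup:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, n_hidden = 4, 8                          # tiny made-up sizes
rng = np.random.default_rng(0)

# five d-dimensional word vectors for "museums in Paris are amazing"
window = [rng.standard_normal(d) for _ in range(5)]
x = np.concatenate(window)                  # the 5d-dimensional input vector

W = rng.standard_normal((n_hidden, 5 * d))  # layer weights
b = rng.standard_normal(n_hidden)           # bias vector
u = rng.standard_normal(n_hidden)           # scoring vector

z = W @ x + b                               # linear layer
h = sigmoid(z)                              # elementwise nonlinearity f
s = u @ h                                   # score: a single real number
prob_location = sigmoid(s)                  # logistic transform -> P(center word is a location)
print(prob_location)
```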

0:10:03 - 0:10:09     Text: But actually, I'm going to make it even easier on myself.

0:10:09 - 0:10:16     Text: I'm going to throw away the logistic function at the top. And I'm really just going to

0:10:16 - 0:10:23     Text: work through the math of the bottom three quarters of this. If you look at Kevin Clark's handout

0:10:23 - 0:10:28     Text: that I just mentioned, when he works through it he also includes working through the logistic

0:10:28 - 0:10:35     Text: function. And we also saw working through a softmax in the first lecture when I was

0:10:35 - 0:10:42     Text: working through some of the word2vec model. Okay. So the overall question we want to

0:10:42 - 0:10:51     Text: be able to answer is, so here's our stochastic gradient descent equation that we have existing

0:10:51 - 0:11:00     Text: parameters of our model. And we want to update them based on our current loss, which is

0:11:00 - 0:11:08     Text: the J of theta. So for getting our loss here, the true answer as to whether a word

0:11:08 - 0:11:15     Text: is a location or not will be either one, if it is a location or zero, if it isn't, our

0:11:15 - 0:11:21     Text: logistic classifier returns some number like 0.9. And we'll use the distance away from

0:11:21 - 0:11:27     Text: what it should have been squared as our loss. So we work out a loss. And then we're moving

0:11:27 - 0:11:35     Text: a little distance in the negative of the gradient, which will be changing our parameter estimates

0:11:35 - 0:11:41     Text: in such a way that they reduce the loss. And so this is already being written in terms

0:11:41 - 0:11:48     Text: of a whole vector of parameters, which is being updated to give a new vector of parameters.
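In code, that update is just a vector subtraction; a minimal sketch, where the learning rate and the example numbers are made up and the gradient would come from the calculus discussed below:

```python
import numpy as np

alpha = 0.01                          # learning rate (step size), a hypothetical value

def sgd_step(theta, grad):
    # theta_new = theta_old - alpha * gradient of the loss J at theta_old
    return theta - alpha * grad

theta = np.array([0.5, -1.2, 3.0])    # hypothetical parameter vector
grad = np.array([0.1, -0.4, 0.2])     # hypothetical gradient of J at theta
theta = sgd_step(theta, grad)
```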

0:11:48 - 0:11:54     Text: But you can also think about it that for each individual parameter theta j that we're

0:11:54 - 0:12:00     Text: working out the partial derivative of the loss with respect to that parameter. And then

0:12:00 - 0:12:07     Text: we're moving a little bit in the negative direction of that. That's going to give us

0:12:07 - 0:12:15     Text: a new value for parameter theta j. And we're going to update all of the parameters of our

0:12:15 - 0:12:23     Text: model as we learn. I mean, in particular, in contrast to what commonly happens in statistics,

0:12:23 - 0:12:30     Text: we also, we update not only the sort of parameters of our model that are sort of weights in the

0:12:30 - 0:12:36     Text: classifier, but we also will update our data representation. So we'll also be changing

0:12:36 - 0:12:43     Text: our word vectors as we learn. Okay, so to build neural nets, i.e. to train neural nets based

0:12:43 - 0:12:50     Text: on data, what we need is to be able to compute this gradient of the parameters so that we

0:12:50 - 0:12:56     Text: can then iteratively update the weights of the model and efficiently train a model that

0:12:56 - 0:13:05     Text: has good weights, i.e. that has high accuracy. And so how can we do that? Well, what I'm

0:13:05 - 0:13:12     Text: going to talk about today is first of all how you can do it by hand. And so for doing

0:13:12 - 0:13:20     Text: it by hand, this is basically a review of matrix calculus. And that'll take quite a bit

0:13:20 - 0:13:28     Text: of the lecture. And then after we've talked about that for a while, I'll then shift gears

0:13:28 - 0:13:34     Text: and introduce the back propagation algorithm, which is the central technology for neural

0:13:34 - 0:13:41     Text: networks. And that technology is essentially the efficient application of calculus on a

0:13:41 - 0:13:48     Text: large scale as we'll come to talking about soon. So for computing gradients by hand, what

0:13:48 - 0:13:56     Text: we're doing is matrix calculus. So we're working with vectors and matrices and working out

0:13:56 - 0:14:06     Text: gradients. And this can seem like pretty scary stuff. And well, to the extent that you're

0:14:06 - 0:14:14     Text: kind of scared and don't know what's going on, one choice is to work out a non-vectorized

0:14:14 - 0:14:22     Text: gradient by just working out what the partial derivative is for one parameter at a time.

0:14:22 - 0:14:29     Text: And I showed a little example of that in the first lecture. But it's much, much faster

0:14:29 - 0:14:39     Text: and more useful to actually be able to work with vectorized gradients. And in some sense,

0:14:39 - 0:14:44     Text: if you're not very confident, this is kind of almost a leap of faith. But it really is

0:14:44 - 0:14:50     Text: the case that multivariable calculus is just like single variable calculus, except you're

0:14:50 - 0:14:56     Text: using vectors and matrices. So providing you remember some basics of single variable

0:14:56 - 0:15:04     Text: calculus, you really should be able to do this stuff and get it to work out. Lots of other

0:15:04 - 0:15:11     Text: sources, I've mentioned the notes. You can also look at the textbook from Math 51, which also

0:15:11 - 0:15:17     Text: has quite a lot of material on this. I know some of you have bad memories of Math 51.

0:15:17 - 0:15:22     Text: OK, so let's go through this and see how it works from ramping up from the beginning.

0:15:22 - 0:15:28     Text: So the beginning of calculus is, you know, we have a function with one input and one

0:15:28 - 0:15:35     Text: output, f of x equals x cubed. And so then its gradient is its slope, right? So that's

0:15:35 - 0:15:43     Text: its derivative. So its derivative is 3x squared. And the way to think about this is how much

0:15:43 - 0:15:49     Text: will the output change if we change the input a little bit, right? So what we're wanting

0:15:49 - 0:15:56     Text: to do in our neural net models is change what they output so that they do a better job

0:15:56 - 0:16:01     Text: of predicting the correct answers when we're doing supervised learning. And so what we want

0:16:01 - 0:16:06     Text: to know is if we fiddle different parameters of the model, how much will they have on the

0:16:06 - 0:16:12     Text: output? Because then we can choose how to fiddle them in the right way to move things down,

0:16:12 - 0:16:18     Text: right? So, you know, when we're saying that the derivative here is 3x squared, well,

0:16:18 - 0:16:26     Text: what we're saying is that if you're at x equals 1, if you fiddle the input a little bit,

0:16:26 - 0:16:32     Text: the output will change 3 times as much, 3 times 1 squared. And it does. So if I say what's

0:16:32 - 0:16:40     Text: the value at 1.01, it's about 1.03, it's changed 3 times as much and that's its slope. But

0:16:40 - 0:16:49     Text: At x equals 4, the derivative is 16 times 3, which is 48. So if we fiddle the input a little, it'll

0:16:49 - 0:16:57     Text: change 48 times as much and that's roughly what happens, 4.01 cubed is 64.48. Now, of

0:16:57 - 0:17:02     Text: course, you know, this is just sort of showing it for a small fiddle, but, you know, that's

0:17:02 - 0:17:10     Text: an approximation to the actual truth. Okay, so then we sort of ramp up to the more complex

0:17:10 - 0:17:16     Text: cases, which are more reflective of what we do with neural networks. So if we have a function

0:17:16 - 0:17:23     Text: with one output and n inputs, then we have a gradient. So a gradient is a vector of partial

0:17:23 - 0:17:29     Text: derivatives with respect to each input. So we've got n inputs x1 to xn and we're working

0:17:29 - 0:17:35     Text: out the partial derivative f with respect to x1, the partial derivative f with respect to

0:17:35 - 0:17:42     Text: x2, et cetera. And we then get a vector of partial derivatives, where each element of

0:17:42 - 0:17:50     Text: this vector is just like a simple derivative with respect to one variable. Okay, so from

0:17:50 - 0:17:57     Text: that point, we just keep on ramping up for what we do with neural networks. So commonly,

0:17:57 - 0:18:03     Text: when we have something like a layer in a neural network, we'll have a function with n inputs

0:18:03 - 0:18:10     Text: that will be like our word vectors, then we do something like multiply by a matrix and

0:18:10 - 0:18:17     Text: then we'll have m outputs. So we have a function now, which is taking n inputs and is producing

0:18:17 - 0:18:25     Text: m outputs. So at this point, what we're calculating for the gradient is what's called a

0:18:25 - 0:18:34     Text: Jacobian matrix. So for n inputs and m outputs, the Jacobian is an m by n matrix of every

0:18:34 - 0:18:43     Text: combination of partial derivatives. So function f splits up into these different sub functions

0:18:43 - 0:18:50     Text: f1 through fm, which generate each of the m outputs. And so then we're taking the

0:18:50 - 0:18:56     Text: partial derivative f1 with respect to x1 through the partial derivative f1 with respect to

0:18:56 - 0:19:02     Text: xn, then heading down, you know, we make it up to the partial derivative of fm with respect

0:19:02 - 0:19:09     Text: to x1, et cetera. So we have every possible partial derivative of an output variable with

0:19:09 - 0:19:20     Text: respect to one of the input variables.
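To make the definition concrete, here is a small finite-difference sketch, where the function and the numbers are made up, that builds the m-by-n Jacobian one column at a time:

```python
import numpy as np

def f(x):
    # A made-up function with n = 3 inputs and m = 2 outputs.
    return np.array([x[0] * x[1], np.sin(x[2]) + x[0]])

def numeric_jacobian(f, x, eps=1e-6):
    # Approximate the m x n Jacobian: J[i, j] = d f_i / d x_j.
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for j in range(x.size):
        x_plus = x.copy()
        x_plus[j] += eps
        J[:, j] = (f(x_plus) - fx) / eps
    return J

x = np.array([1.0, 2.0, 0.5])
print(numeric_jacobian(f, x))
# analytic Jacobian for comparison: [[x1, x0, 0], [1, 0, cos(x2)]]
```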

0:19:20 - 0:19:30     Text: Okay. So in simple calculus, when you have a composition of one-variable functions, so that if you have y equals x squared and then z equals 3y,

0:19:30 - 0:19:38     Text: then z is a composition of two functions of, or you're composing two functions, z is

0:19:38 - 0:19:44     Text: a function of x. Then you can work out the derivative of z with respect to x. And the

0:19:44 - 0:19:49     Text: way you do that is with the chain rule. And so in the chain rule, you multiply derivatives.

0:19:49 - 0:20:03     Text: So dzdx equals dzdy times dydx. So dzdy is just three, and dydx is 2x. So we get three

0:20:03 - 0:20:13     Text: times 2x. So that overall, the derivative here is 6x. And since if we multiply this together,

0:20:13 - 0:20:20     Text: we're really saying that z equals 3x squared, you should trivially be able to see again,

0:20:20 - 0:20:27     Text: aha, its derivative is 6x. So that works. Okay. So once we move into vectors and matrices

0:20:27 - 0:20:35     Text: and Jacobians, it's actually the same game. So when we're working with those, we can compose

0:20:35 - 0:20:41     Text: functions and work out their derivatives by simply multiplying Jacobians. So if we have

0:20:41 - 0:20:48     Text: start with an input x and then put it through the simplest form of neural network layer

0:20:48 - 0:20:55     Text: and say that z equals wx plus b. So we multiply the vector x by the matrix w and then add on a

0:20:55 - 0:21:01     Text: bias vector b. And then typically we'd put things through a nonlinearity f. So f could

0:21:01 - 0:21:08     Text: be a sigmoid function. We'll then say h equals f of z. So this is the composition of two

0:21:08 - 0:21:15     Text: functions in terms of vectors and matrices. So we can use Jacobians and we can say the

0:21:15 - 0:21:23     Text: partial of h with respect to x is going to be the product of the partial of h with respect

0:21:23 - 0:21:31     Text: to z and the partial of z with respect to x. And this all does work out. So let's start

0:21:31 - 0:21:39     Text: going through some examples of how these things work slightly more concretely. First,

0:21:39 - 0:21:47     Text: just particular Jacobians and then composing them together. So one case we look at is the

0:21:47 - 0:21:53     Text: nonlinearities that we put a vector through. So this is something like putting a vector

0:21:53 - 0:22:01     Text: through the sigmoid function f. And so if we have an intermediate vector z and we're turning

0:22:01 - 0:22:09     Text: into vector h by putting it through a logistic function, we can say what is dhdz.

0:22:12 - 0:22:22     Text: Well, for this, formally, this is a function that has n inputs and n outputs. So at the end

0:22:22 - 0:22:30     Text: of the day, we're computing an n by n Jacobian. And so what that's meaning is the elements

0:22:30 - 0:22:39     Text: of this n by n Jacobian are going to take the partial derivative of each output with respect

0:22:39 - 0:22:48     Text: to each input. And well, what is that going to be in this case? Well, in this case, because

0:22:48 - 0:22:56     Text: we're actually just computing element-wise a transformation such as a logistic transform

0:22:56 - 0:23:05     Text: of each element zi, like the second equation here. If i equals j, we've got something to compute.

0:23:05 - 0:23:13     Text: Whereas if i doesn't equal j, the input just has no influence on the output

0:23:13 - 0:23:19     Text: and so the derivative is zero. So if i doesn't equal j, we're going to get a zero. And if i does

0:23:19 - 0:23:27     Text: equal j, then we're going to get the regular one variable derivative of the logistic function,

0:23:27 - 0:23:34     Text: which if I remember correctly, you were asked to compute, now I can't remember whether it's

0:23:34 - 0:23:40     Text: assignment one or assignment two, but one of the two asks you to compute it. So our Jacobian

0:23:40 - 0:23:50     Text: for this case looks like this. We have a diagonal matrix with the derivatives of each element

0:23:50 - 0:23:57     Text: along the diagonal and everything else is zero.
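As a quick sanity check of that diagonal structure, here is a small sketch, assuming f is the sigmoid, that compares the analytic Jacobian against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.5, -1.0, 2.0])

# analytic Jacobian of the elementwise sigmoid: diagonal entries are sigma'(z_i)
analytic = np.diag(sigmoid(z) * (1 - sigmoid(z)))

# finite-difference approximation, one column per input
eps = 1e-6
numeric = np.column_stack([
    (sigmoid(z + eps * np.eye(3)[j]) - sigmoid(z)) / eps for j in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-4))   # True: off-diagonals are zero
```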

0:23:58 - 0:24:06     Text: Okay, so let's look at a couple of other Jacobians. So if we've got this wx plus b basic neural network layer and we're asking

0:24:06 - 0:24:14     Text: for the gradient with respect to x, then what we're going to have coming out is that's actually

0:24:14 - 0:24:24     Text: going to be the matrix w. So this is where what I hope you can do is look at the notes at home

0:24:24 - 0:24:33     Text: and work through this exactly and see that this is actually the right answer. But this is the way

0:24:33 - 0:24:41     Text: in which if you just have faith and think this is just like single variable calculus, except I've

0:24:41 - 0:24:46     Text: now got vectors and matrices, the answer you get is actually what you expected to get because this

0:24:46 - 0:24:55     Text: is just like the derivative of ax plus b with respect to x, which is a. So similarly, if we take

0:24:55 - 0:25:05     Text: the partial derivative with respect to b of wx plus b, we get out the identity matrix. Okay, then one

0:25:05 - 0:25:12     Text: other Jacobian that we mentioned while in the first lecture while working through word to veck

0:25:12 - 0:25:24     Text: is if you have the dot product of two vectors, i.e., that's a number, then what you get coming out of

0:25:24 - 0:25:34     Text: that, so the partial derivative of u transpose h with respect to u, is h transpose. And at this point,

0:25:34 - 0:25:42     Text: there's some fine print that I'm going to come back to in a minute. So this is the correct Jacobian,

0:25:42 - 0:25:53     Text: right? Because in this case, we have the dimension of h inputs and we have one output. And so we want

0:25:53 - 0:26:00     Text: to have a row vector. But there's a little bit more to say on that that I'll come back to in

0:26:00 - 0:26:09     Text: about 20 slides. But this is the correct Jacobian. Okay, so if you aren't familiar with these

0:26:09 - 0:26:17     Text: kind of Jacobians, do please look at some of the notes that are available and try and compute these

0:26:17 - 0:26:23     Text: in more detail element wise and convince yourself that they really are right. But I'm going to assume

0:26:23 - 0:26:30     Text: these now and show you what happens when we actually then work out gradients for at least a mini little

0:26:30 - 0:26:47     Text: neural net. Okay, so here is most of this neural net. I mean, as I commented that, you know, really

0:26:47 - 0:26:53     Text: we'd be working out the partial derivative of the loss j with respect to these variables.

0:26:53 - 0:26:58     Text: But for the example I'm doing here, I just I've locked that off to keep it a little simpler and

0:26:58 - 0:27:04     Text: more manageable for the lecture. And so we're going to just work out the partial derivative of the

0:27:04 - 0:27:11     Text: score s, which is a real number with respect to the different parameters of this model where the

0:27:11 - 0:27:21     Text: parameters of this model are going to be the w and the b and the u and also the input because we

0:27:21 - 0:27:31     Text: can update the word vectors of different words based on tuning them to better

0:27:31 - 0:27:37     Text: predict the classification outputs that we desire. So let's start off with a fairly easy one

0:27:37 - 0:27:46     Text: where we want to update the bias vector b to have our system classify better. So to be able to

0:27:46 - 0:27:51     Text: do that, what we want to work out is the partial derivatives of s with respect to b.

0:27:53 - 0:27:59     Text: So we know how to put that into our stochastic gradient update for the b parameters.

0:28:00 - 0:28:07     Text: Okay, so how do we go about doing these things? So the first step is we want to sort of break things

0:28:07 - 0:28:15     Text: up into different functions of minimal complexity that compose together. So in particular,

0:28:15 - 0:28:22     Text: this neural net layer, a equals f of wx plus b, it's still a little bit complex. So let's decompose

0:28:22 - 0:28:33     Text: that one further step. So we have the input x, we then calculate the linear transformation z equals

0:28:33 - 0:28:44     Text: wx plus b. And then we put things through the sort of element wise non-linearity, a equals f of z,

0:28:44 - 0:28:54     Text: and then we do the dot product with u. And it's useful for working these things out to split into

0:28:54 - 0:29:00     Text: pieces like this, have straight what your different variables are, and to know what the

0:29:00 - 0:29:07     Text: dimensionality of each of these variables is, it's well worth just writing out the dimensionality

0:29:07 - 0:29:12     Text: of every variable and making sure that the answers that you're computing are of the right dimensionality.

0:29:13 - 0:29:22     Text: So at this point though, what we can see is that calculating s is the product of three,

0:29:22 - 0:29:31     Text: sorry, is the composition of three functions around x. So for working out the partials of s with

0:29:31 - 0:29:40     Text: respect to b, it's the composition of the three functions shown on the left. And so therefore,

0:29:40 - 0:29:51     Text: the gradient of s with respect to b, we're going to take the product of these three partial derivatives.

0:29:53 - 0:30:04     Text: Okay, so how do we do this? So we've got s equals u transpose h, so that's sort of the top corresponding

0:30:04 - 0:30:11     Text: partial derivative, partial derivative of h with respect to z, partial derivative of z with respect

0:30:11 - 0:30:18     Text: to b, which is the first one that we're working out. Okay, so we want to work this out,

0:30:18 - 0:30:25     Text: and if we're lucky, we remember those Jacobians I showed previously about the Jacobian for

0:30:25 - 0:30:34     Text: a vector dot product, the Jacobian for the nonlinearity and the Jacobian for the simple linear

0:30:34 - 0:30:43     Text: transformation. And so we can use those. So for the partials of s with respect to h,

0:30:45 - 0:30:52     Text: well, that's going to be u transpose, using the first one. The partials of h with respect to z,

0:30:52 - 0:30:59     Text: okay, so that's the nonlinearity. And so that's going to be the matrix that's the diagonal matrix

0:30:59 - 0:31:09     Text: with the element-wise derivative of f prime of z and zero elsewhere. And then for the wx plus b,

0:31:09 - 0:31:16     Text: when we're taking the partials with respect to b, that's just the identity matrix. So we can

0:31:16 - 0:31:28     Text: simplify that down a little, the identity matrix disappears. And since u transpose is a vector, and this is

0:31:28 - 0:31:36     Text: a diagonal matrix, we can rewrite this as u transpose Hadamard product f prime of z. I think this is the

0:31:36 - 0:31:43     Text: first time I've used this little circle for Hadamard product, but it's something that you'll see

0:31:43 - 0:31:50     Text: quite a bit in your network work since it's often used. So when we have two vectors,

0:31:53 - 0:32:00     Text: u transpose and this vector here, sometimes you want to do an element-wise product. So the output of this

0:32:00 - 0:32:05     Text: will be a vector where you've taken the first element of each and multiply them, the second element

0:32:05 - 0:32:10     Text: of each and multiply them, etc. downwards. And so that's called the Hadamard product, and it's

0:32:10 - 0:32:20     Text: what we're calculating to calculate a vector, which is the gradient of s with respect to b.
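Here is a small numeric sketch of that result, with made-up sizes and values and with f assumed to be the sigmoid, checking u Hadamard f prime of z against a finite-difference estimate of ds/db:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, m = 3, 5                              # hidden size and input size (made up)
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
u = rng.standard_normal(n)
x = rng.standard_normal(m)

def score(b_):
    return u @ sigmoid(W @ x + b_)       # s = u . f(Wx + b)

z = W @ x + b
analytic = u * sigmoid(z) * (1 - sigmoid(z))     # u Hadamard f'(z), shape (n,)

eps = 1e-6
numeric = np.array([(score(b + eps * np.eye(n)[i]) - score(b)) / eps
                    for i in range(n)])
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```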

0:32:22 - 0:32:31     Text: Okay, so that's good. So we now have a gradient of s with respect to b, and we could use that

0:32:31 - 0:32:38     Text: in our stochastic gradient. But we don't stop there. We also want to work out the gradient

0:32:38 - 0:32:47     Text: with respect to others of our parameters. So we might want to next go on and work out the gradient

0:32:47 - 0:32:57     Text: of s with respect to w. Well, we can use the chain rule just like we did before. So we've got the

0:32:57 - 0:33:03     Text: same product of functions, and everything is going to be the same, apart from now taking

0:33:03 - 0:33:13     Text: the derivatives with respect to w rather than b. So it's now going to be the partial of s with

0:33:13 - 0:33:22     Text: respect to h, h with respect to z, and z with respect to w. And the important thing to notice here,

0:33:22 - 0:33:29     Text: and this leads into what we do with the back propagation algorithm, is wait a minute,

0:33:29 - 0:33:36     Text: that this is very similar to what we've already done. So when we're all working out the gradients

0:33:36 - 0:33:44     Text: of s with respect to b, the first two terms were exactly the same. It's only the last one that

0:33:44 - 0:33:54     Text: differs. So to be able to build or to train neural networks efficiently, this is what happens all

0:33:54 - 0:34:01     Text: the time, and it's absolutely essential that we use an algorithm that avoids repeated computation.

0:34:02 - 0:34:09     Text: And so the idea we're going to develop is when we have this equation stack that there's sort of

0:34:09 - 0:34:17     Text: stuff that's above where we compute z, and we're going to be sort of that'll be the same each time,

0:34:17 - 0:34:24     Text: and we want to compute something from that that we can then sort of feed downwards when working

0:34:24 - 0:34:36     Text: out the gradients with respect to w x or b. And so we do that by defining delta, which is delta is

0:34:36 - 0:34:44     Text: the partial's composed that are above the linear transform, and that's referred to as the local

0:34:44 - 0:34:51     Text: error signal. It's what's being passed in from above to the linear transform. And we've already

0:34:51 - 0:35:00     Text: computed the gradient of that in the preceding slides. And so the final form of the partial

0:35:00 - 0:35:11     Text: s with respect to b will be delta times the remaining part. And well, we'd seen that, you know,

0:35:11 - 0:35:19     Text: for partial s with respect to b, the partial z with respect to b is just the identity. So the end

0:35:19 - 0:35:25     Text: result was delta. But in this time, we then go and have to work out the partial of z with respect to

0:35:25 - 0:35:33     Text: w and multiply that by delta. So that's the part that we still haven't yet done. So

0:35:33 - 0:35:40     Text: and this is where things get in some sense a little bit hairier.

0:35:42 - 0:35:49     Text: And so there's something that's important to explain. So, you know, what should we have for the

0:35:49 - 0:36:03     Text: Jacobian of dsdw? Well, that's a function that has one output, the output is just a score of real

0:36:03 - 0:36:16     Text: number. And then it has n by m inputs. So the Jacobian is a 1 by nm matrix, i.e., a very long

0:36:16 - 0:36:25     Text: row vector. But that's correct math. But it turns out that that's kind of bad for our neural

0:36:25 - 0:36:31     Text: networks. Because remember, what we want to do with our neural networks is do stochastic gradient

0:36:31 - 0:36:41     Text: descent. And we want to say theta new equals theta old minus a small multiplier times the gradient.

0:36:41 - 0:36:56     Text: And well, actually, the w matrix is an n by m matrix. And so we couldn't actually do the subtraction

0:36:56 - 0:37:02     Text: if this gradient we calculate is just a huge row vector. We'd like to have it as the same

0:37:02 - 0:37:11     Text: shape as the w matrix. In neural network land, when we do this, we depart from pure math that

0:37:11 - 0:37:17     Text: this point. And we use what we call the shape convention. So what we're going to say is,

0:37:19 - 0:37:24     Text: and you're meant to use this for answers in the assignment, that the shape of the gradient

0:37:24 - 0:37:29     Text: we're always going to make to be the shape of the parameters. And so therefore,

0:37:29 - 0:37:40     Text: ds/dw we're also going to represent as an n by m matrix just like w. And we're going to reshape

0:37:41 - 0:37:50     Text: the Jacobian to place it into this matrix shape. Okay, so if we want to place it into this matrix

0:37:50 - 0:38:02     Text: shape, what are we going to want to get for ds/dw? Well, we know that it's

0:38:03 - 0:38:16     Text: going to involve delta, our local error signal. And then we have to work out something for

0:38:16 - 0:38:27     Text: dz/dw. Well, since z equals wx plus b, you'd kind of expect that the answer should be x.

0:38:28 - 0:38:40     Text: And that's right. So the answer to ds/dw is going to be delta transpose times x transpose.

0:38:40 - 0:38:46     Text: And so the form that we're getting for this derivative is going to be the product of the local

0:38:46 - 0:38:56     Text: error signal that comes from above versus what we calculate from the local input x.

0:38:58 - 0:39:04     Text: So that shouldn't yet be obvious why that is true. So let me just go through in a bit more detail

0:39:04 - 0:39:14     Text: why that's true. So when we want to work out ds/dw, right, it's sort of delta times

0:39:14 - 0:39:23     Text: dz/dw, where what that's computing for z is wx plus b. So let's just consider for a moment what the

0:39:23 - 0:39:34     Text: derivative is with respect to a single weight w ij. So w ij might be w two three that's shown in

0:39:34 - 0:39:44     Text: my little neural network here. And so the first thing to notice is the w ij only contributes to z i.

0:39:44 - 0:39:55     Text: So it's going into z two, which then computes h two. And it has no effect whatsoever on h one.

0:39:55 - 0:40:09     Text: Okay, so when we're working out dzi/dwij, zi is going to be wi, that sort of row of

0:40:09 - 0:40:20     Text: the matrix, times x, plus bi, which means that we've got a kind of a sum of wik times xk.

0:40:20 - 0:40:27     Text: And then for this sum, this is like one variable calculus that when we're taking the derivative of

0:40:27 - 0:40:36     Text: this with respect to w ij, every term in this sum is going to be zero. The derivative is going to

0:40:36 - 0:40:44     Text: be zero except for the one that involves w ij. And then the derivative of that is just like a x

0:40:44 - 0:40:53     Text: with respect to a, it's going to be x. So you get x j out as the answer. And so the end result of

0:40:53 - 0:41:03     Text: that is that when we're working out what we want as the answer, we're going to get these

0:41:04 - 0:41:12     Text: columns where x1 is all that's left, x2 is all that's left, through xm is all that's left.

0:41:12 - 0:41:20     Text: And then that's multiplied by the vectors of the local error signal from above. And what we want

0:41:20 - 0:41:27     Text: to compute is this outer product matrix, we're getting the different combinations of the delta

0:41:28 - 0:41:36     Text: and the x. And so we can get the n by m matrix that we'd like to have by our shape convention

0:41:36 - 0:41:44     Text: by taking delta transpose, which is n by 1, times x transpose, which is 1 by m. And then we

0:41:44 - 0:41:52     Text: get this outer product matrix. So like that's a kind of a hacky argument that I've made. It's

0:41:52 - 0:41:59     Text: certainly a way of doing things that the dimensions work out and it sort of makes sense. There's a more

0:41:59 - 0:42:06     Text: detailed run-through of this that appears in the lecture notes. And I encourage you to sort of also look

0:42:06 - 0:42:14     Text: at the more mathy version of that.
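As a quick check of the shapes and values in that argument, with the same kind of made-up sizes as before and sigmoid for f, the outer product delta transpose times x transpose does match a finite-difference estimate of ds/dW:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n, m = 3, 5
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
u = rng.standard_normal(n)
x = rng.standard_normal(m)

z = W @ x + b
delta = u * sigmoid(z) * (1 - sigmoid(z))        # local error signal, shape (n,)
analytic = np.outer(delta, x)                    # delta^T x^T: an n x m matrix like W

def score(W_):
    return u @ sigmoid(W_ @ x + b)

eps = 1e-6
numeric = np.zeros_like(W)
for i in range(n):
    for j in range(m):
        W_plus = W.copy()
        W_plus[i, j] += eps
        numeric[i, j] = (score(W_plus) - score(W)) / eps
print(np.allclose(analytic, numeric, atol=1e-4))  # True
```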

0:42:14 - 0:42:26     Text: Here's a little bit more information about the shape convention. So, first of all, one more example of this. So when you're working out ds/db,

0:42:26 - 0:42:37     Text: its Jacobian comes out as a row vector. But similarly, you know, according to the shape

0:42:37 - 0:42:46     Text: convention, we want our gradient to be the same shape as b, and b is a column vector. So again,

0:42:46 - 0:42:53     Text: they're different shapes and you have to transpose one to get the other. And so effectively,

0:42:53 - 0:43:00     Text: what we have is a disagreement between the Jacobian form. So the Jacobian form makes sense for

0:43:01 - 0:43:08     Text: you know, calculus and math. Because if you want to have it like I claimed that matrix calculus

0:43:08 - 0:43:14     Text: is just like single variable calculus apart from using vectors and matrices, you can just multiply

0:43:14 - 0:43:21     Text: together the partials. That only works out if you're using Jacobians. But on the other hand,

0:43:21 - 0:43:29     Text: if you want to do stochastic gradient descent and be able to sort of subtract off a piece of the

0:43:29 - 0:43:38     Text: gradient, that only works if you have the same shape matrix for the gradient as you do for the

0:43:38 - 0:43:46     Text: original matrix. And so this is a bit confusing, but that's just the reality. There are both of these

0:43:46 - 0:43:57     Text: two things. So the Jacobian form is useful in doing the calculus. But for the answers in the

0:43:57 - 0:44:06     Text: assignment, we want the answers to be presented using the shape convention so that the gradient is

0:44:06 - 0:44:14     Text: shown in the same shape as the parameters. And therefore, you'll be able to, it's the right shape

0:44:14 - 0:44:23     Text: for doing a gradient update by just subtracting a small amount of the gradient. So for working

0:44:23 - 0:44:32     Text: through things, there are then basically two choices. One choice is to work through all the math

0:44:32 - 0:44:39     Text: using Jacobians and then write at the end to reshape following the shape convention to give the

0:44:39 - 0:44:50     Text: answer. So that's what I did when I worked out ds/db. We worked through it using Jacobians. We

0:44:50 - 0:44:57     Text: got an answer, but it turned out to be a row vector. And so, well, then we have to transpose it at

0:44:57 - 0:45:08     Text: the end to get into the right shape for the shape convention. The alternative is to always follow

0:45:08 - 0:45:16     Text: the shape convention. And that's kind of what I did when I was then working out ds/dw. I didn't

0:45:16 - 0:45:26     Text: faithfully use Jacobians. I said, oh, well, when we work out whatever it was, dz/dw, let's work out what

0:45:26 - 0:45:34     Text: shape we want it to be and what to fill in the cells with. And if you're sort of trying to do it

0:45:34 - 0:45:43     Text: immediately with the shape convention, it's a little bit more hacky in a way since you know,

0:45:43 - 0:45:48     Text: you have to look at the dimensions for what you want and figure out when to transpose or to reshape

0:45:48 - 0:45:56     Text: the matrix to be it the right shape. But the kind of informal reasoning that I gave is what you do

0:45:56 - 0:46:03     Text: and what works. And, you know, there are sort of hints that you can use, right? That

0:46:03 - 0:46:09     Text: you know that your gradient should always be the same shape as your parameters. And you know that

0:46:09 - 0:46:16     Text: the error message coming in will always have the same dimensionality as that hidden layer. And

0:46:16 - 0:46:30     Text: you can sort of work it out always following the shape convention. Okay. So that is doing this

0:46:30 - 0:46:43     Text: all as matrix calculus. So after pausing for breath for a second, the rest of the lecture is then,

0:46:43 - 0:46:53     Text: okay, let's look at how our software trains neural networks using what's referred to as the back

0:46:53 - 0:47:11     Text: propagation algorithm. So the short answer is, you know, basically we've already done it,

0:47:11 - 0:47:17     Text: the rest of the lecture is easy. So, you know, essentially I've just shown you what the

0:47:17 - 0:47:28     Text: back propagation algorithm does. So the back propagation algorithm is judiciously taking and

0:47:30 - 0:47:41     Text: propagating derivatives using the matrix chain rule. The rest of the back propagation algorithm

0:47:41 - 0:47:50     Text: is to say, okay, when we have these neural networks, we have a lot of shared structure and shared

0:47:50 - 0:48:01     Text: derivatives. So what we want to do is maximally, efficiently reuse derivatives of higher layers

0:48:01 - 0:48:08     Text: when we're computing derivatives for lower layers so that we minimize computation. And I already

0:48:08 - 0:48:16     Text: pointed that out in the first half, but we want to systematically exploit that. And so the way we do

0:48:16 - 0:48:26     Text: that in our computational systems is they construct computation graphs. So this maybe looks a little

0:48:26 - 0:48:34     Text: bit like what you saw in a compilers class if you did one, right, that you're creating, what I call

0:48:34 - 0:48:39     Text: here a computation graph, but it's really a tree, right. So you're creating here this tree of

0:48:39 - 0:48:47     Text: computations in this case, but in more general case, it's some kind of directed graph of computations,

0:48:48 - 0:48:59     Text: which has source nodes, which are inputs, either inputs like x or input parameters like w and b.

0:48:59 - 0:49:06     Text: And its interior nodes are operations. And so then once we've constructed a graph,

0:49:06 - 0:49:11     Text: and so this graph corresponds to exactly the example I did before, right, that this was our little

0:49:11 - 0:49:17     Text: neural net that's in the top right. And here's the corresponding computation graph of computing

0:49:17 - 0:49:26     Text: wx plus b, put it through the sigmoid nonlinearity f, and take the dot product of the

0:49:26 - 0:49:37     Text: resulting vector with u, which gives us our output score s. Okay, so what we do to compute this is we

0:49:37 - 0:49:44     Text: pass along the edges the results of operations. So this is wx, then z, then h, and then our output is s.

0:49:45 - 0:49:51     Text: And so the first thing we want to be able to do to compute with neural networks is to be able to

0:49:51 - 0:49:59     Text: compute for different inputs what the output is. And so that's referred to as forward propagation.

0:49:59 - 0:50:10     Text: And so we simply run this expression much like you just standardly do in a compiler to compute

0:50:10 - 0:50:16     Text: the value of s. And that's the forward propagation phase. But the essential additional element of

0:50:16 - 0:50:25     Text: neural networks is that we then also want to be able to send back gradients, which will tell us how

0:50:25 - 0:50:33     Text: to update the parameters of the model. And so it's this ability to send back gradients, which gives us

0:50:33 - 0:50:40     Text: the ability for these models to learn once we have a loss function at the end, we can work out how to

0:50:40 - 0:50:48     Text: change the parameters of the model so that they more accurately produce the desired output, i.e.

0:50:48 - 0:50:57     Text: they minimize the loss. And so it's doing that part that then is called back propagation. So we then

0:50:57 - 0:51:06     Text: once we forward propagated a value with our current parameters, we then head backwards reversing

0:51:06 - 0:51:16     Text: the direction of the arrows and pass along gradients down to the different parameters like B and W and U

0:51:16 - 0:51:22     Text: that we can use to change using stochastic gradient descent what the value of B is or what the

0:51:22 - 0:51:32     Text: value of W is. So we start off with ds/ds, which is just one. And then we run our back propagation.
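Here is a minimal sketch of one forward pass and one backward pass through this graph, assuming sigmoid for f and made-up sizes; note how the shared delta is computed once and reused for the b, W, and x gradients:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
n, m = 3, 5                           # made-up sizes
W = rng.standard_normal((n, m))
b = rng.standard_normal(n)
u = rng.standard_normal(n)
x = rng.standard_normal(m)

# forward propagation: compute and cache the intermediate values
z = W @ x + b
h = sigmoid(z)
s = u @ h

# backward propagation: reuse upstream gradients instead of recomputing them
ds_dh = u                             # upstream gradient arriving at the nonlinearity
delta = ds_dh * h * (1 - h)           # ds/dz, the local error signal (sigmoid' = h(1-h))
grads = {
    "u": h,                           # ds/du
    "b": delta,                       # ds/db (shape convention: same shape as b)
    "W": np.outer(delta, x),          # ds/dW = delta^T x^T
    "x": W.T @ delta,                 # ds/dx, passed further down if x were itself computed
}
```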

0:51:32 - 0:51:41     Text: And we're using the sort of same kind of composition of Jacobian. So we have ds dh here and ds dz

0:51:41 - 0:51:49     Text: and we progressively pass back those gradients. So we just need to work out how to efficiently and

0:51:49 - 0:51:57     Text: cleanly do this in a computational system. And so let's sort of work through again a few of these

0:51:57 - 0:52:07     Text: cases. So the general situation is we have a particular node. So a node is where some kind of

0:52:08 - 0:52:17     Text: operation like multiplication or a nonlinearity happens. And so the simplest case is that we've got

0:52:17 - 0:52:26     Text: one output and one input. So we'll do that first. So that's like h equals f of z. So what we have is

0:52:26 - 0:52:38     Text: an upstream gradient ds dh. And what we want to do is compute the downstream gradient of ds dz.

0:52:38 - 0:52:46     Text: And the way we're going to do that is say, well, for this function f, it's a function, it's got

0:52:46 - 0:52:53     Text: a derivative or gradient. So what we want to do is work out that local gradient, dh/dz.

0:52:53 - 0:53:03     Text: And then that gives us everything that we need to work out ds/dz. Because precisely, we're

0:53:03 - 0:53:10     Text: going to use the chain rule. We're going to say that ds/dz equals the product of ds/dh times dh/dz,

0:53:10 - 0:53:17     Text: where this is again using Jacobians. Okay, so the general principle that we're going to use is

0:53:17 - 0:53:25     Text: that downstream gradient equals the upstream gradient times the local gradient. Okay, sometimes

0:53:25 - 0:53:31     Text: it gets a little bit more complicated. So we might have multiple inputs to a function. So this is

0:53:32 - 0:53:40     Text: the matrix vector multiply. So z equals wx. Okay, when there are multiple inputs, we still have

0:53:40 - 0:53:50     Text: an upstream gradient ds dz. But what we're going to do is work out a local gradient with respect to

0:53:50 - 0:54:00     Text: each input. So we have dz dw and dz dx. And so then at that point, it's exactly the same for each

0:54:00 - 0:54:08     Text: piece of it. We're going to work out the downstream gradients ds dw and ds dx by using the chain rule

0:54:08 - 0:54:17     Text: with respect to the particular local gradient. So let's go through an example of this. I mean,

0:54:17 - 0:54:24     Text: this is kind of a silly example. It's not really an example that looks like a typical neural net.

0:54:24 - 0:54:30     Text: But it's sort of a simple example where we can show some of the components of what we do. So

0:54:30 - 0:54:39     Text: what we're going to do is want to calculate f of xyz, which is being calculated as x plus y times

0:54:39 - 0:54:47     Text: the max of y and z. And we've got, you know, particular values that we're starting off with x equals

0:54:47 - 0:54:55     Text: one y equals two and z equals zero. So these are the current values of our parameters. And so we can

0:54:55 - 0:55:03     Text: say, okay, well, we want to build an expression tree for that. Here's our expression tree. We're

0:55:03 - 0:55:10     Text: taking x plus y. We're taking the max of y and z. And then we're multiplying them. And so our

0:55:10 - 0:55:18     Text: forward propagation phase is just to run this. So we take the values of our parameters. And we

0:55:18 - 0:55:25     Text: simply start to compute with them, right? So we have one, two, two, zero. And we add them as three,

0:55:25 - 0:55:35     Text: the max is two. We multiply them. And that gives us six. Okay. So then at that point, we then

0:55:35 - 0:55:45     Text: want to go and work out how to do things for back propagation and how these back propagation

0:55:45 - 0:55:52     Text: steps work. And so the first part of that is sort of working out what our local gradients are

0:55:52 - 0:56:03     Text: going to be. So, so this is a here. And this is x and y. So dADX, since a equals x plus y is

0:56:03 - 0:56:18     Text: just going to be one. And dADY is also going to be one. Then for b equals the max of y z. So this

0:56:18 - 0:56:25     Text: is this max node. So the local gradients for that is it's going to depend on y, where the y is

0:56:25 - 0:56:35     Text: greater than z. So dBDY is going to be one, if and only if y is greater than z, which it is at

0:56:35 - 0:56:45     Text: our particular point here. So that's one. And dBdz is going to be one only if z is greater than y.

0:56:45 - 0:56:55     Text: So for our particular values here, that one is going to be zero. And then finally, here,

0:56:55 - 0:57:08     Text: we're calculating the product f equals a times b. So for that, we're going to, sorry, that slide

0:57:08 - 0:57:15     Text: isn't perfect. Okay, so for the product, the derivative of f with respect to a is equal to b,

0:57:15 - 0:57:21     Text: which is two. And the derivative f with respect to b is a equals three. So that gives us all of

0:57:21 - 0:57:30     Text: the local gradients at each node. And so then to run back propagation, we start with dF dF,

0:57:30 - 0:57:39     Text: which is just one. And then we're going to work out the downstream equals the upstream times the

0:57:39 - 0:57:48     Text: local. Okay, so the local, so when you have a product like this, note the sort of the gradients flip.

0:57:48 - 0:58:04     Text: So we take upstream times the local, which is two. Oops. So the downstream is two on this side.

0:58:06 - 0:58:15     Text: DFDB is three. So we're taking upstream times local. That gives us three. And so that gives us

0:58:15 - 0:58:24     Text: back propagates values to the plus and max nodes. And so then we continue along. So for the max node,

0:58:25 - 0:58:34     Text: the local gradient dBDY equals one. So we're going to take upstream as three. So it's going to take

0:58:34 - 0:58:43     Text: three times one. And that gives us three. dB/dz is zero because of the fact that z's value is not

0:58:43 - 0:58:50     Text: the max. So we're taking three times zero and saying that the gradient there is zero. So finally,

0:58:50 - 0:58:58     Text: doing the plus node, the local gradients for both x and y, there are one. So we're just getting two

0:58:58 - 0:59:06     Text: times one in both cases. And we're saying the gradients there are two. Okay. And so again, at the end

0:59:06 - 0:59:15     Text: of the day, the interpretation here is that this is giving us this information as to if we wiggle the

0:59:15 - 0:59:23     Text: values of x, y and z, how much of a difference does it make to the output? What is the slope, the

0:59:23 - 0:59:35     Text: gradient, with respect to the variable? So what we've seen is that since Z isn't the max of y and z,

0:59:35 - 0:59:43     Text: if I change the value of z a little, like making it 0.1 or minus 0.1, it makes no difference at all

0:59:43 - 0:59:52     Text: to what I compute as the output. So therefore, the gradient there is zero. If I change the value

0:59:52 - 1:00:02     Text: of x a little, then that is going to have an effect. And it's going to affect the output by

1:00:02 - 1:00:17     Text: twice as much as the amount I change it. Right. And that's because df/dx equals two.

1:00:19 - 1:00:27     Text: So interestingly, so I mean, we can basically work that out. So if we imagine

1:00:27 - 1:00:36     Text: making sort of x 2.1, well, then what we'd calculate the max is two.

1:00:37 - 1:00:46     Text: Oh, sorry, sorry, if we make x 1.1, we then get the max here is two, and we get 1.1 plus two

1:00:46 - 1:00:58     Text: is 3.1. So we get 3.1 times two. So that'd be about 6.2. So changing x by 0.1 has added 0.2 to the

1:00:58 - 1:01:08     Text: value of f. Conversely, for the value of y, we find that the df dy equals five. So what we do when

1:01:08 - 1:01:14     Text: we've got two things coming out here, as I'll go through again in a moment, is we sum the

1:01:14 - 1:01:19     Text: gradients. So again, three plus two equals five. And empirically, that's what happens. So if we

1:01:19 - 1:01:28     Text: consider fiddling the value of y a little, let's say we make it a value of 2.1, then the prediction

1:01:28 - 1:01:35     Text: is that it'll have five times as big an effect on the output value we compute. And well, what do we

1:01:35 - 1:01:47     Text: compute? So we compute 1 plus 2.1. So that's 3.1. And we compute the max of 2.1 and 0 is 2.1. So

1:01:47 - 1:01:54     Text: we'll take the product of 2.1 and 3.1. And I calculated that in advance, as I can't really do

1:01:54 - 1:02:01     Text: this arithmetic in my head. And the product of those two is 6.51. So it has gone up by about

1:02:01 - 1:02:09     Text: 0.5. So my fiddling of y by 0.1 has been multiplied by five, which gives the magnitude of the effect on the output.
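
To mirror this wiggling argument numerically, here is a small sketch (again my own illustration, not from the lecture) that perturbs each input by 0.1 and checks that the output moves by roughly 2, 5, and 0 times the wiggle:

```python
# Numerically "wiggle" each input of f = (x + y) * max(y, z) by 0.1
# and watch how much the output moves, matching df/dx = 2, df/dy = 5, df/dz = 0.
def f(x, y, z):
    return (x + y) * max(y, z)

base = f(1.0, 2.0, 0.0)            # 6.0
print(f(1.1, 2.0, 0.0) - base)     # ~0.2  -> 2 times the 0.1 wiggle in x
print(f(1.0, 2.1, 0.0) - base)     # ~0.51 -> about 5 times the wiggle in y
print(f(1.0, 2.0, 0.1) - base)     # 0.0   -> z doesn't affect the output here
```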

1:02:09 - 1:02:19     Text: Okay. So earlier, I did the case of

1:02:19 - 1:02:32     Text: when we had one in and one out, and then multiple ins and one out; the case that I

1:02:32 - 1:02:40     Text: hadn't actually dealt with is the case where you have multiple outward branches, but that then turned

1:02:40 - 1:02:48     Text: up in the computation of y. So once you have multiple outward branches, what you're doing is you're

1:02:48 - 1:03:03     Text: summing. So when you want to work out df/dy, you've got a local gradient and you've got two

1:03:03 - 1:03:11     Text: upstream gradients. And you're working it out with respect to each of them as in the chain rule,

1:03:11 - 1:03:22     Text: and then you're summing them together to work out the impact at the end. Right. So we also saw

1:03:22 - 1:03:30     Text: some of the other node intuitions, which are useful to have when doing this. So when you have an addition,

1:03:31 - 1:03:40     Text: it distributes the upstream gradient to each of the things below it. When you have max,

1:03:40 - 1:03:47     Text: it's like a routing node. So when you have max, you have the upstream gradient, and it goes to one

1:03:47 - 1:03:56     Text: of the branches below it and the rest of them get no gradient. When you then have a multiplication,

1:03:56 - 1:04:06     Text: it has this effect of switching the gradient. So if you're taking three times two, the gradient on

1:04:06 - 1:04:12     Text: the two side is three, and on the three side is two. And if you think about it in terms of how much

1:04:12 - 1:04:18     Text: effect you get when you're doing this sort of wiggling, that totally makes sense, right? Because

1:04:18 - 1:04:25     Text: if you're multiplying another number by three, then any change here is going to be multiplied by three

1:04:25 - 1:04:37     Text: and vice versa. Okay. So this is the kind of computation graph that we want to use to work out

1:04:38 - 1:04:45     Text: derivatives in an automated computational fashion, which is the basis of the back propagation

1:04:45 - 1:04:54     Text: algorithm. At this point, that's what we're doing, but there's still one mistake that we can make.

1:04:54 - 1:05:00     Text: It would be wrong for us to sort of say, okay, well, first of all, we want to work out ds/db.

1:05:00 - 1:05:09     Text: So look, we can start up here. We can propagate our upstream errors, work out local gradients,

1:05:09 - 1:05:19     Text: upstream error, local gradient, and keep going all the way down and get ds/db down here. Okay,

1:05:19 - 1:05:27     Text: next we want to do it for ds/dW. Let's just run it all over again. But if we do that, we'd be

1:05:27 - 1:05:36     Text: doing repeated computation, as I showed in the first half, that this term is the same both times,

1:05:36 - 1:05:42     Text: this term is the same both times, this term is the same both times, that only the bits at the end

1:05:42 - 1:05:50     Text: differ. So what we want to do is avoid duplicated computation and compute all the gradients

1:05:53 - 1:06:00     Text: that we're going to need, successively, so that we only do them once. And so that was analogous

1:06:00 - 1:06:08     Text: to when I introduced this delta variable when we computed gradients by hand. So we start off

1:06:08 - 1:06:21     Text: here with ds/ds equal to one. We then want to one-time compute the gradient in green here,

1:06:21 - 1:06:28     Text: and one-time compute the gradient in green here; that's all common work. Then we're

1:06:28 - 1:06:38     Text: going to take the local gradient dz/db and multiply that by the upstream gradient to have

1:06:38 - 1:06:47     Text: worked out ds/db. And then we're going to take the same upstream gradient and then work out the

1:06:47 - 1:06:58     Text: local gradient here and then propagate that down to give us ds/dW. So the end result is we want to

1:06:58 - 1:07:06     Text: systematically do the forward computation forward in the graph and the backward computation,

1:07:06 - 1:07:14     Text: back propagation, backward in the graph, in a way that we do things efficiently.
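
As a sketch of what avoiding that duplicated work looks like in code, here is my own numpy illustration using the small network from the first half of the lecture (z = Wx + b, h = f(z), s = u·h), with f chosen to be the logistic sigmoid just to make it concrete. The shared upstream gradient delta is computed once and then reused for both the b and W gradients:

```python
import numpy as np

# Compute ds/db and ds/dW without duplicating the shared upstream work.
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # input vector
W = rng.normal(size=(3, 4))     # weight matrix
b = rng.normal(size=3)          # bias
u = rng.normal(size=3)          # output weights

# Forward pass
z = W @ x + b
h = sigmoid(z)
s = u @ h

# Backward pass: delta is the shared upstream gradient, computed only once
delta = u * h * (1 - h)         # ds/dz = u elementwise-times f'(z)
grad_b = delta                  # ds/db: the local gradient dz/db is the identity
grad_W = np.outer(delta, x)     # ds/dW: each W[i, j] affects z[i], scaled by x[j]
```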

1:07:14 - 1:07:24     Text: So this is the general form of the algorithm, which works for an arbitrary computation graph. So at the end

1:07:24 - 1:07:36     Text: of the day, we've got a single scalar output Z and then we have inputs and parameters which compute

1:07:36 - 1:07:46     Text: Z. And so once we have this computation graph and I added in this funky extra arrow here to make

1:07:46 - 1:07:53     Text: it a more general computation graph, well we can always say that we can work out a starting point,

1:07:53 - 1:08:00     Text: something that doesn't depend on anything. So in this case both of these bottom two nodes don't

1:08:00 - 1:08:07     Text: depend on anything else. So we can start with them and we can start to compute forward. We can compute

1:08:07 - 1:08:13     Text: values for all of these sort of second row from the bottom nodes and then we're able to compute

1:08:15 - 1:08:22     Text: the third level up. So we can have a topological sort of the nodes based on the dependencies

1:08:22 - 1:08:31     Text: in this directed graph, and we can compute the value of each node given some subset of its predecessors

1:08:31 - 1:08:38     Text: which it depends on. And so doing that is referred to as the forward propagation phase, and it gives us

1:08:38 - 1:08:45     Text: a computation of the scalar output Z using our current parameters and our current inputs.

1:08:45 - 1:08:53     Text: And so then after that we run back propagation. So for back propagation we initialize the output

1:08:53 - 1:09:04     Text: gradient, dz/dz, as one, and then we visit nodes in the reverse order of the topological sort

1:09:04 - 1:09:12     Text: and we compute the gradients downward. And so our recipe is that for each node as we head down,

1:09:12 - 1:09:21     Text: we're going to compute the gradient of the node with respect to its successors, the things that it

1:09:21 - 1:09:29     Text: feeds into. And how we compute that gradient is using this chain rule that we've looked at. So

1:09:29 - 1:09:35     Text: this is sort of the generalized form of the chain rule where we have multiple outputs. And so we're

1:09:35 - 1:09:41     Text: summing over the different outputs. And then for each output we're computing the product of the

1:09:41 - 1:09:49     Text: upstream gradient and the local gradient with respect to that node. And so we head downwards.

1:09:49 - 1:09:57     Text: And we continue down the reverse top logical sort order and we work out the gradient with respect

1:09:57 - 1:10:08     Text: to each variable in this graph.
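
Here is a minimal sketch of that general recipe (my own illustration; the Node, Add, and Mul classes are invented for this example): nodes are visited in topological order for the forward pass and in reverse order for the backward pass, with downstream = upstream × local, summed over branches.

```python
# Minimal backprop driver over an arbitrary computation graph (illustration only).
class Node:
    def __init__(self, inputs=()):
        self.inputs = list(inputs)   # predecessor nodes this node depends on
        self.value = None
        self.grad = 0.0              # d(output)/d(this node), accumulated over successors

class Input(Node):
    def __init__(self, value):
        super().__init__()
        self.value = value
    def forward(self): pass
    def local_grads(self): return []

class Add(Node):
    def forward(self):
        self.value = self.inputs[0].value + self.inputs[1].value
    def local_grads(self):
        return [1.0, 1.0]            # addition distributes the upstream gradient

class Mul(Node):
    def forward(self):
        self.value = self.inputs[0].value * self.inputs[1].value
    def local_grads(self):
        a, b = self.inputs
        return [b.value, a.value]    # multiplication "switches" the gradient

def forward_backward(topo_order):
    for node in topo_order:          # forward pass: predecessors before successors
        node.forward()
    topo_order[-1].grad = 1.0        # dz/dz = 1 at the single scalar output
    for node in reversed(topo_order):            # backward pass: reverse topological order
        for inp, local in zip(node.inputs, node.local_grads()):
            inp.grad += node.grad * local        # downstream = upstream * local, summed over branches

# The earlier running example, with max(y, z) simplified to y (since y is the max anyway):
x, y = Input(1.0), Input(2.0)
a = Add((x, y)); f = Mul((a, y))     # f = (x + y) * y
forward_backward([x, y, a, f])
print(f.value, x.grad, y.grad)       # 6.0 2.0 5.0
```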

1:10:08 - 1:10:18     Text: And so it hopefully looks kind of intuitive, looking at this picture, that if you think of it like this, the big-O complexity of forward propagation and backward

1:10:18 - 1:10:26     Text: propagation is the same. Right. In both cases you're doing a linear pass through all of these nodes

1:10:26 - 1:10:33     Text: and calculating values given predecessors and then values given successors. I mean you have to

1:10:33 - 1:10:40     Text: do a little bit more work for working out the gradients, as shown by this chain rule,

1:10:40 - 1:10:45     Text: but it's the same big-O complexity. So if somehow you're implementing stuff for yourself rather

1:10:45 - 1:10:52     Text: than relying on the software, and you're calculating the gradients at a different order of complexity

1:10:52 - 1:10:57     Text: than forward propagation, it means that you're doing something wrong. You're doing repeated work that

1:10:57 - 1:11:04     Text: you shouldn't have to do. Okay. So this algorithm works for a completely arbitrary

1:11:04 - 1:11:11     Text: computation graph, any directed acyclic graph. You can apply this algorithm.

1:11:12 - 1:11:18     Text: In general, what we find is that we build neural networks that have a regular layer structure.

1:11:18 - 1:11:25     Text: So we have things like a vector of inputs, and then that's multiplied by a matrix. It's transformed

1:11:25 - 1:11:31     Text: into another vector, which might be multiplied by another matrix or summed with another vector

1:11:31 - 1:11:37     Text: or something. Right. So once we're using that kind of regular layer structure, we can then parallelize

1:11:37 - 1:11:48     Text: the computation by working out the gradients in terms of Jacobians of vectors and matrices and do

1:11:48 - 1:11:56     Text: things in parallel much more efficiently.
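
As a sketch of that vectorized, layer-level view (my own numpy illustration, not code from the lecture), an affine layer h = Wx + b can do its whole backward pass with a couple of matrix operations instead of looping over scalar nodes:

```python
import numpy as np

# Vectorized affine layer h = W x + b; the backward pass is expressed with
# matrix operations (Jacobian-vector products) rather than one scalar node at a time.
rng = np.random.default_rng(1)
x = rng.normal(size=5)                 # input vector
W = rng.normal(size=(3, 5))            # weights
b = rng.normal(size=3)                 # bias

h = W @ x + b                          # forward: one matrix-vector product

upstream = rng.normal(size=3)          # pretend dL/dh arrived from the layer above
grad_x = W.T @ upstream                # dL/dx: the Jacobian of h w.r.t. x is W, so multiply by W^T
grad_W = np.outer(upstream, x)         # dL/dW: each W[i, j] only affects h[i], scaled by x[j]
grad_b = upstream                      # dL/db: the Jacobian w.r.t. b is the identity
```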

1:11:56 - 1:12:05     Text: Okay. So doing this is then referred to as automatic differentiation. And so essentially, if you know the computation graph, you should be able to have

1:12:05 - 1:12:15     Text: your clever computer system work out what the derivatives of everything are, and then

1:12:15 - 1:12:23     Text: apply back propagation to work out how to update the parameters and learn. And there's actually

1:12:23 - 1:12:33     Text: an interesting way in which history has gone backwards here, which I'll just

1:12:33 - 1:12:43     Text: note. So some of you might be familiar with symbolic computation packages. So those are things

1:12:43 - 1:12:51     Text: like Mathematica. So in Mathematica, you can give it a symbolic form of a computation, and then it

1:12:51 - 1:12:58     Text: can work out derivatives for you. So it should be the case that if you give a complete symbolic form

1:12:58 - 1:13:06     Text: of a computation graph, then it should be able to work out all the derivatives for you and you never

1:13:06 - 1:13:12     Text: have to work out a derivative by hand whatsoever. And that was actually attempted in the famous

1:13:12 - 1:13:19     Text: deep learning library called Theano, which came out of Yoshua Bengio's group at the University of

1:13:19 - 1:13:28     Text: Montreal that had a compiler that did that kind of symbolic manipulation. But you know somehow

1:13:28 - 1:13:37     Text: that sort of proved a little bit too hard a road to follow. I imagine it actually might come back

1:13:37 - 1:13:44     Text: again in the future. And so modern deep learning frameworks, which include both TensorFlow

1:13:44 - 1:13:56     Text: and PyTorch, do 90% of this computation of automatic differentiation for you, but they don't

1:13:56 - 1:14:04     Text: actually symbolically compute derivatives. So for each particular node or layer of your deep

1:14:04 - 1:14:15     Text: learning system, somebody, either you or the person who wrote that layer, has handwritten the

1:14:15 - 1:14:22     Text: local derivatives. But then everything from that point on, the doing of the

1:14:22 - 1:14:29     Text: chain rule of combining upstream gradients with local gradients to work out downstream gradients,

1:14:29 - 1:14:34     Text: is then all done automatically by back propagation on the computation graph.

1:14:35 - 1:14:42     Text: And so that what that means is for a whole neural network, you have a computation graph,

1:14:42 - 1:14:50     Text: and it's going to have a forward pass and a backward pass. And so for the forward pass,

1:14:50 - 1:14:56     Text: you're topologically sorting the nodes based on their dependencies and the computation graph.

1:14:56 - 1:15:04     Text: And then for each node, you're running forward, the forward computation on that node. And then

1:15:04 - 1:15:11     Text: for backward propagation, you're reversing the topological sort of the graph. And then for each node

1:15:11 - 1:15:17     Text: in the graph, you're running the backward propagation, which is a little bit of backprop, the chain

1:15:17 - 1:15:26     Text: rule at that node. And then the result of doing that is you have gradients for your inputs and parameters.

1:15:28 - 1:15:38     Text: And so the overall software runs this for you. And so what you then actually want to do is

1:15:38 - 1:15:45     Text: have stuff for particular nodes or layers in the graph. So if I have a multiply

1:15:45 - 1:15:54     Text: gate, it's going to have a forward algorithm, which just computes that the output is x times y in terms

1:15:54 - 1:16:00     Text: of the two inputs. And then I'm also going to want to tell it how to calculate the

1:16:00 - 1:16:10     Text: local derivatives. So I want to say, what are the local derivatives, dL/dx and dL/dy, in terms of the

1:16:10 - 1:16:19     Text: upstream gradient dL/dz? And so I will then manually work out how to calculate that. And normally,

1:16:19 - 1:16:29     Text: what I have to do is I assume the forward pass is being run first. And I'm going to shove into some

1:16:29 - 1:16:35     Text: local variables for my class, the values that we used in the forward computation. So as well as

1:16:35 - 1:16:44     Text: computing z equals x times y, I'm going to sort of remember what x and y were. So then when I'm

1:16:44 - 1:16:52     Text: asked to compute the backward pass, I'm then going to have implemented here what we saw earlier of

1:16:54 - 1:17:02     Text: that when it's xy, you're going to sort of swap the y and the x to work out the local gradients.

1:17:02 - 1:17:07     Text: And so then I'm going to multiply those by the upstream gradient. And I'm going to return,

1:17:07 - 1:17:14     Text: I've just written it here as a sort of a little list, but really it's going to be a numpy vector

1:17:14 - 1:17:25     Text: of the gradients.
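
Here is a minimal sketch of the multiply gate just described (my own version of the idea; the class and method names are invented for illustration): the forward pass caches x and y, and the backward pass swaps them and multiplies by the upstream gradient.

```python
# Sketch of a multiply "gate" of the kind described above.
class MultiplyGate:
    def forward(self, x, y):
        # Cache the inputs; the backward pass needs them for the local gradients.
        self.x, self.y = x, y
        return x * y

    def backward(self, dL_dz):
        # Local gradients: dz/dx = y and dz/dy = x (the inputs swap roles),
        # each multiplied by the upstream gradient dL/dz.
        dL_dx = self.y * dL_dz
        dL_dy = self.x * dL_dz
        return [dL_dx, dL_dy]

gate = MultiplyGate()
z = gate.forward(3.0, 2.0)       # 6.0
print(gate.backward(1.0))        # [2.0, 3.0] -- the gradients "flip"
```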

1:17:25 - 1:17:34     Text: Okay, so that's 98% of what I wanted to cover today, just a couple of quick comments left. So that can and should all be automated. Sometimes you want to just check that you're

1:17:35 - 1:17:41     Text: computing the right gradients. And so the standard way of checking that you're computing the right

1:17:41 - 1:17:49     Text: gradients is to manually work out the gradient by doing a numeric calculation of the gradient. And so

1:17:49 - 1:17:58     Text: you can do that. So you can work out what the derivative of f with respect to x should be

1:17:59 - 1:18:06     Text: by choosing some sort of small number like 10 to the minus 4, adding it to x, subtracting it from x.

1:18:06 - 1:18:12     Text: So the difference between those two inputs is 2h, and dividing through by 2h, you're simply

1:18:12 - 1:18:19     Text: working out the rise over the run, which is the slope at that point with respect to x. And that's

1:18:19 - 1:18:28     Text: an approximation of the gradient of f with respect to x at that value of x. So this is so simple,

1:18:28 - 1:18:33     Text: you can't make a mistake implementing it. And so therefore you can use this to check

1:18:34 - 1:18:41     Text: whether your gradient values are correct or not. This isn't something that you'd want to use much

1:18:41 - 1:18:47     Text: because not only is it approximate, it's also extremely slow, because to work this out you have to run

1:18:47 - 1:18:53     Text: the forward computation for every parameter of the model. So if you have a model with a million

1:18:53 - 1:19:00     Text: parameters, you're now doing a million times as much work to get the gradients as you would if you were

1:19:00 - 1:19:06     Text: actually using calculus. So calculus is a good thing to know. But it can be really useful to check

1:19:06 - 1:19:12     Text: that the right values are being calculated. And in the old days when we hand-wrote everything,

1:19:13 - 1:19:18     Text: this was kind of the key unit test that people used everywhere. These days most of the time you're

1:19:18 - 1:19:24     Text: reusing layers that are built into PyTorch or some other deep learning framework. So it's much less

1:19:25 - 1:19:29     Text: needed. But sometimes you're implementing your own layer and you really do want to check

1:19:29 - 1:19:35     Text: that things are implemented correctly. There's a fine point in the way this is written. If you saw

1:19:35 - 1:19:44     Text: this in a high school calculus class, you will have seen rise over run as (f(x + h) - f(x)) / h.

1:19:44 - 1:19:55     Text: It turns out that doing a two-sided estimate, (f(x + h) - f(x - h)) / 2h, is much, much more

1:19:55 - 1:20:01     Text: accurate than doing a one-sided estimate. And so you're very much encouraged to use this two-sided approximation.
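
Here is a small sketch of that two-sided numeric gradient check (my own code; the function f is an arbitrary toy example, not anything from the lecture):

```python
# Two-sided numeric gradient check, as described above.
def f(x):
    return x ** 3          # toy function with known derivative 3 x^2

def analytic_grad(x):
    return 3 * x ** 2

def numeric_grad(fn, x, h=1e-4):
    # Central difference: much more accurate than the one-sided (f(x+h) - f(x)) / h estimate.
    return (fn(x + h) - fn(x - h)) / (2 * h)

x = 1.5
print(analytic_grad(x), numeric_grad(f, x))   # should agree to several decimal places
```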

1:20:01 - 1:20:08     Text: Okay, so at that point, we've mastered the core technology of neural nets. Back

1:20:08 - 1:20:16     Text: propagation is recursively and hence efficiently applying the chain rule along the computation graph

1:20:16 - 1:20:25     Text: with this sort of key step that downstream gradient equals upstream gradient times local gradient.

1:20:25 - 1:20:31     Text: And so for calculating with neural nets, we do the forward pass to work out values with current

1:20:31 - 1:20:40     Text: parameters, then run back propagation to work out the gradient of the currently computed

1:20:40 - 1:20:49     Text: loss with respect to those parameters. Now to some extent, you know, with modern deep learning

1:20:49 - 1:20:54     Text: frameworks, you don't actually have to know how to do any of this, right? It's the same as you

1:20:54 - 1:21:02     Text: don't have to know how to implement a C compiler. You can just write C code, call GCC, and it'll

1:21:02 - 1:21:09     Text: compile it and it'll run the right stuff for you. And that's the kind of functionality you get

1:21:09 - 1:21:16     Text: from the PyTorch framework. So do come along to the PyTorch tutorial this Friday and get a sense

1:21:16 - 1:21:23     Text: about how easy it is to write neural networks using a framework like PyTorch or TensorFlow. And you

1:21:23 - 1:21:29     Text: know, it's so easy. That's why high school students across the nation are now doing their science

1:21:29 - 1:21:36     Text: projects, training deep learning systems because you don't actually have to understand very much

1:21:36 - 1:21:43     Text: to bung a few neural network layers together and set it computing on some data. But you know,

1:21:43 - 1:21:48     Text: we hope in this class that you are also learning how these things are implemented,

1:21:48 - 1:21:55     Text: so you have a deeper understanding than that. And you know, it turns out that sometimes you

1:21:55 - 1:22:01     Text: need to have a deeper understanding. So back propagation doesn't always work perfectly.

1:22:01 - 1:22:07     Text: And so understanding what it's really doing can be crucial to debugging things. And so we'll

1:22:07 - 1:22:12     Text: actually see an example of that fairly soon when we start looking at recurrent models and some of

1:22:12 - 1:22:18     Text: the problems that they have, which will require us to think a bit more deeply about what's happening

1:22:18 - 1:22:26     Text: in our gradient computations. Okay, that's it for the day.