0:00:00 - 0:00:14 Text: Hi everyone, I'll get started. Okay, so we're now back to the second week of CS224N on
0:00:14 - 0:00:19 Text: Natural Language Processing with Deep Learning. Okay, so for today's lecture, what we're
0:00:19 - 0:00:27 Text: going to be looking at is all the math details of neural net learning. First of
0:00:27 - 0:00:33 Text: all, looking at how we can work out by hand, gradients for training neural networks, and
0:00:33 - 0:00:40 Text: then looking at how it's done more algorithmically, which is known as the back propagation algorithm.
0:00:40 - 0:00:47 Text: And correspondingly for you guys, well, I hope you remembered that one minute ago was when
0:00:47 - 0:00:52 Text: assignment one was due and everyone has handed that in. If there's someone who hasn't handed
0:00:52 - 0:00:58 Text: it in, you really should do so as soon as possible; it's best to preserve those late days for the
0:00:58 - 0:01:04 Text: harder assignments. So, I mean, I actually forgot to mention, we actually did make one change
0:01:04 - 0:01:09 Text: for this year to make it a bit easier when occasionally people join the class a week
0:01:09 - 0:01:16 Text: late. If you want, this year in the grading, assignment one can be discounted and we'll
0:01:16 - 0:01:21 Text: just use your other four assignments. But if you've been in the class so far, like that 98%
0:01:21 - 0:01:26 Text: of people, well, since assignment one is the easiest assignment, again, it's silly not
0:01:26 - 0:01:34 Text: to do it and have it as part of your grade. Okay, so starting today, we've put out assignment
0:01:34 - 0:01:39 Text: two and assignment two is all about making sure you really understand the math of neural
0:01:39 - 0:01:47 Text: networks and then the software that we use to do that math. So this is going to be a
0:01:47 - 0:01:54 Text: bit of a tough week for some. So for some people who are great on all their math and backgrounds,
0:01:54 - 0:01:59 Text: they'll feel like this is stuff they know. Well, nothing very difficult, but I know there
0:01:59 - 0:02:06 Text: are quite a few of you who this lecture and week is the biggest struggle of the course.
0:02:06 - 0:02:11 Text: We really do want people to actually have an understanding of what goes on in neural network
0:02:11 - 0:02:17 Text: learning rather than viewing it as some kind of deep magic. And I hope that some of
0:02:17 - 0:02:22 Text: the material we give today and that you read up on and use in the assignment will really
0:02:22 - 0:02:29 Text: give you more of a sense of what these neural networks are doing and how it is just math
0:02:29 - 0:02:35 Text: that's applied in a systematic large scale that works out the answers and that this will
0:02:35 - 0:02:40 Text: be valuable and give you a deeper sense of what's going on. But if this material seems
0:02:40 - 0:02:48 Text: very scary and difficult, you can take some refuge in the fact that there's light
0:02:48 - 0:02:53 Text: at the end of the tunnel since this is really the only lecture that's heavily going through
0:02:53 - 0:02:58 Text: the math details of neural networks. After that, we'll be kind of popping back up to a
0:02:58 - 0:03:05 Text: higher level and by and large, after this week, we'll be making use of software to do a
0:03:05 - 0:03:13 Text: lot of the complicated math for us. But nevertheless, I hope this is valuable. I'll go through everything
0:03:13 - 0:03:18 Text: quickly today, but if this isn't stuff that you know backwards, I really do encourage
0:03:18 - 0:03:27 Text: you to work through it and get help as you need it. So do come along to our office hours.
0:03:27 - 0:03:32 Text: There are also a number of pieces of tutorial material given in the syllabus. So there's
0:03:32 - 0:03:38 Text: both the lecture notes. There's some materials from CS231. In the list of readings, the
0:03:38 - 0:03:46 Text: very top reading is some material put together by Kevin Clark a couple of years ago. And
0:03:46 - 0:03:52 Text: actually, that one's my favorite. The presentation there fairly closely follows the presentation
0:03:52 - 0:03:57 Text: in this lecture of going through matrix calculus. So, you know, personally, I'd recommend
0:03:57 - 0:04:01 Text: starting with that one, but there are four different ones you can choose from, in case one
0:04:01 - 0:04:09 Text: of them seems more helpful to you. Two other things on what's coming up. Actually, for Thursday's
0:04:09 - 0:04:14 Text: lecture, we make a big change. And Thursday's lecture is probably the most linguistic
0:04:14 - 0:04:20 Text: lecture of the whole class where we go through the details of dependency grammar and dependency
0:04:20 - 0:04:24 Text: parsing. Some people find that tough as well, but at least it'll be tough in a different
0:04:24 - 0:04:30 Text: way. And then one other really good opportunity is this Friday, we have our second tutorial
0:04:30 - 0:04:36 Text: at 10am, which is an introduction to PyTorch, which is the deep learning framework that
0:04:36 - 0:04:41 Text: we'll be using for the rest of the class, once we've gone through these first two assignments
0:04:41 - 0:04:46 Text: where you do things by yourself. So this is a great chance, again, to get an intro to PyTorch.
0:04:46 - 0:04:55 Text: It will be really useful for later in the class. Okay. Today's material is really all about
0:04:55 - 0:05:01 Text: sort of the math of neural networks, but just to sort of introduce a setting where we can
0:05:01 - 0:05:08 Text: work through this, I'm going to introduce a simple NLP task and a simple form of classifier
0:05:08 - 0:05:14 Text: that we can use for it. So the task of named entity recognition is a very common basic
0:05:14 - 0:05:20 Text: NLP task. And the goal of this is you're looking through pieces of text and you're wanting
0:05:20 - 0:05:26 Text: to label by labeling the words, which words belong to entity categories like persons,
0:05:26 - 0:05:33 Text: locations, products, dates, times, et cetera. So for this piece of text, "Last night Paris
0:05:33 - 0:05:39 Text: Hilton wowed in a sequin gown. Samuel Quinn was arrested in the Hilton Hotel in Paris
0:05:39 - 0:05:47 Text: in April 1989," some words are being labeled as named entities as shown. These two sentences
0:05:47 - 0:05:53 Text: don't actually belong together in the same article, but I chose those two sentences to illustrate
0:05:53 - 0:05:59 Text: the basic point that you can't just do this task by using a dictionary. Yes,
0:05:59 - 0:06:06 Text: a dictionary is helpful to know that Paris can possibly be a location, but Paris can also
0:06:06 - 0:06:12 Text: be a person name. So you have to use context to get named entity recognition right.
0:06:12 - 0:06:20 Text: Okay, well, how might we do that with the neural network? There are much more advanced
0:06:20 - 0:06:29 Text: ways of doing this, but a simple yet already pretty good way of doing named entity recognition
0:06:29 - 0:06:36 Text: with a simple neural net is to say, well, what we're going to do is use the word vectors
0:06:36 - 0:06:43 Text: that we've learned about and we're going to build up a context window of word vectors.
0:06:43 - 0:06:49 Text: And then we're going to put those through a neural network layer and then feed it through
0:06:49 - 0:06:55 Text: a softmax classifier of the kind that we, sorry, I said that wrong. And then we're going
0:06:55 - 0:07:00 Text: to feed it through a logistic classifier of the kind that we saw when looking at negative
0:07:00 - 0:07:09 Text: sampling, which is going to say for a particular entity type such as location, is it high probability
0:07:09 - 0:07:15 Text: location or is it not a high probability location. So for a sentence like "the museums in
0:07:15 - 0:07:20 Text: Paris are amazing to see," what we're going to do is, for each word, say we're doing the
0:07:20 - 0:07:27 Text: word Paris, we're going to form a window around it, say a plus or minus two word window.
0:07:27 - 0:07:33 Text: And so for those five words, we're going to get word vectors for them from the kind of
0:07:33 - 0:07:39 Text: word2vec or GloVe word vectors we've learned. And we're going to make a long vector out of
0:07:39 - 0:07:44 Text: the concatenation of those five word vectors. So the word of interest is in the middle.
0:07:44 - 0:07:50 Text: And then we're going to feed this vector to a classifier, which at the end is going
0:07:50 - 0:07:56 Text: to give a probability of the word being a location. And then we could have another
0:07:56 - 0:08:01 Text: classifier that says the probability of the word being a person name. And so once we've done
0:08:01 - 0:08:05 Text: that, we're then going to run it at the next position. So we then say, well, is the word
0:08:05 - 0:08:12 Text: "are" a location, and we'd feed in a window of five words, which is then "in Paris are amazing to,"
0:08:12 - 0:08:18 Text: and put it through the same kind of classifier. And so this is the classifier that we'll use.
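As a concrete sketch of building these window vectors (toy sizes and random stand-in embeddings, not the actual word2vec or GloVe vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
sentence = ["the", "museums", "in", "Paris", "are", "amazing", "to", "see"]
d = 4  # toy word-vector dimensionality (real vectors would be much bigger)
vectors = {w: rng.standard_normal(d) for w in sentence}  # stand-in embeddings
PAD = np.zeros(d)  # pad positions that fall off the edge of the sentence

def window_vector(sent, i, k=2):
    """Concatenate the word vectors in a +/- k window around position i."""
    window = [sent[j] if 0 <= j < len(sent) else None for j in range(i - k, i + k + 1)]
    return np.concatenate([vectors[w] if w is not None else PAD for w in window])

x = window_vector(sentence, sentence.index("Paris"))
print(x.shape)  # (20,): a 5d vector with the word of interest in the middle
```

Sliding the same function across positions gives the input for classifying each word in turn.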
0:08:18 - 0:08:25 Text: So its input will be this word window. So if we have d-dimensional word vectors, this
0:08:25 - 0:08:32 Text: will be a 5d vector. And then we're going to put it through a layer of a neural network.
0:08:32 - 0:08:39 Text: So the layer of the neural network is going to multiply this vector by a matrix, add
0:08:39 - 0:08:49 Text: on a bias vector, and then put that through a non-linearity such as the logistic transformation
0:08:49 - 0:08:55 Text: that we've seen before. And that will give us a hidden vector, which might be of a smaller
0:08:55 - 0:09:02 Text: dimensionality such as this one here. And so then with that hidden vector, we're then
0:09:02 - 0:09:10 Text: going to take the dot product of it with an extra vector u here. So we take
0:09:10 - 0:09:17 Text: u dot product h. And so when we do that, we're getting out a single number. And that
0:09:17 - 0:09:24 Text: number can be any real number. And so then finally, we're going to put that number through
0:09:24 - 0:09:31 Text: a logistic transform of the same kind that we saw when doing negative sampling. And the
0:09:31 - 0:09:38 Text: logistic transform will take any real number and it'll transform it into a probability
0:09:38 - 0:09:44 Text: that that word is a location. So its output is the predicted probability of the word
0:09:44 - 0:09:50 Text: belonging to a particular class. And so this could be our location classifier, which
0:09:50 - 0:09:56 Text: could classify each word in a window as to what the probability is that it's a location
0:09:56 - 0:10:03 Text: word. And so this little neural network here is the neural network I'm going to use today
0:10:03 - 0:10:09 Text: when going through some of the math. But actually, I'm going to make it even easier on myself.
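The little network described so far can be sketched with made-up sizes as follows (the final logistic transform is left out, matching the simplification being made here):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 20, 8                     # 5d input (d = 4 here) and hidden size: both toy choices
x = rng.standard_normal(n)       # the concatenated window vector
W = rng.standard_normal((m, n))  # layer weights
b = rng.standard_normal(m)       # bias vector
u = rng.standard_normal(m)       # extra vector for the final dot product

def f(z):                        # elementwise nonlinearity, here the logistic
    return 1.0 / (1.0 + np.exp(-z))

z = W @ x + b                    # linear transformation
h = f(z)                         # hidden vector
s = u @ h                        # the score: a single real number
print(float(s))
```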
0:10:09 - 0:10:16 Text: I'm going to throw away the logistic function at the top. And I'm really just going to
0:10:16 - 0:10:23 Text: work through the math of the bottom three quarters of this. If you look at Kevin Clark's handout
0:10:23 - 0:10:28 Text: that I just mentioned, he includes when he works through it also working through the logistic
0:10:28 - 0:10:35 Text: function. And we also saw working through a softmax in the first lecture when I was
0:10:35 - 0:10:42 Text: working through some of the word2vec model. Okay. So the overall question we want to
0:10:42 - 0:10:51 Text: be able to answer is, so here's our stochastic gradient descent equation that we have existing
0:10:51 - 0:11:00 Text: parameters of our model. And we want to update them based on our current loss, which is
0:11:00 - 0:11:08 Text: the j of theta. So for getting our loss here, the true answer as to whether a word
0:11:08 - 0:11:15 Text: is a location or not will be either one, if it is a location, or zero, if it isn't. Our
0:11:15 - 0:11:21 Text: logistic classifier returns some number like 0.9. And we'll use the distance away from
0:11:21 - 0:11:27 Text: what it should have been squared as our loss. So we work out a loss. And then we're moving
0:11:27 - 0:11:35 Text: a little distance in the negative of the gradient, which will be changing our parameter estimates
0:11:35 - 0:11:41 Text: in such a way that they reduce the loss. And so this is already being written in terms
0:11:41 - 0:11:48 Text: of a whole vector of parameters, which is being updated as to a new vector of parameters.
0:11:48 - 0:11:54 Text: But you can also think about it that for each individual parameter theta j that we're
0:11:54 - 0:12:00 Text: working out the partial derivative of the loss with respect to that parameter. And then
0:12:00 - 0:12:07 Text: we're moving a little bit in the negative direction of that. That's going to give us
0:12:07 - 0:12:15 Text: a new value for parameter theta j. And we're going to update all of the parameters of our
0:12:15 - 0:12:23 Text: model as we learn. I mean, in particular, in contrast to what commonly happens in statistics,
0:12:23 - 0:12:30 Text: we update not only the sort of parameters of our model that are the weights in the
0:12:30 - 0:12:36 Text: classifier, but we also will update our data representation. So we'll also be changing
0:12:36 - 0:12:43 Text: our word vectors as we learn. Okay, so to build neural nets, i.e., to train neural nets based
0:12:43 - 0:12:50 Text: on data, what we need is to be able to compute this gradient of the parameters so that we
0:12:50 - 0:12:56 Text: can then iteratively update the weights of the model and efficiently train a model that
0:12:56 - 0:13:05 Text: has good weights, i.e. that has high accuracy. And so how can we do that? Well, what I'm
0:13:05 - 0:13:12 Text: going to talk about today is first of all how you can do it by hand. And so for doing
0:13:12 - 0:13:20 Text: it by hand, this is basically a review of matrix calculus. And that'll take quite a bit
0:13:20 - 0:13:28 Text: of the lecture. And then after we've talked about that for a while, I'll then shift gears
0:13:28 - 0:13:34 Text: and introduce the back propagation algorithm, which is the central technology for neural
0:13:34 - 0:13:41 Text: networks. And that technology is essentially the efficient application of calculus on a
0:13:41 - 0:13:48 Text: large scale as we'll come to talking about soon. So for computing gradients by hand, what
0:13:48 - 0:13:56 Text: we're doing is matrix calculus. So we're working with vectors and matrices and working out
0:13:56 - 0:14:06 Text: gradients. And this can seem like pretty scary stuff. And well, to the extent that you're
0:14:06 - 0:14:14 Text: kind of scared and don't know what's going on, one choice is to work out a non-vectorized
0:14:14 - 0:14:22 Text: gradient by just working out what the partial derivative is for one parameter at a time.
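That non-vectorized, one-parameter-at-a-time approach can be sketched numerically with a toy function (everything here is an assumption for illustration):

```python
import numpy as np

def f(theta):                       # toy scalar "loss" of a parameter vector
    return (theta[0] ** 2) * theta[1] + np.sin(theta[2])

theta = np.array([1.0, 2.0, 0.5])
eps = 1e-6
grad = np.zeros_like(theta)
for j in range(len(theta)):         # one parameter at a time
    bump = np.zeros_like(theta)
    bump[j] = eps
    grad[j] = (f(theta + bump) - f(theta - bump)) / (2 * eps)

print(grad)  # close to the analytic [2*theta0*theta1, theta0**2, cos(theta2)]
```

This works, but it costs two function evaluations per parameter, which is exactly why the vectorized route below is so much faster.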
0:14:22 - 0:14:29 Text: And I showed a little example of that in the first lecture. But it's much, much faster
0:14:29 - 0:14:39 Text: and more useful to actually be able to work with vectorized gradients. And in some sense,
0:14:39 - 0:14:44 Text: if you're not very confident, this is kind of almost a leap of faith. But it really is
0:14:44 - 0:14:50 Text: the case that multivariable calculus is just like single variable calculus, except you're
0:14:50 - 0:14:56 Text: using vectors and matrices. So provided you remember some basics of single variable
0:14:56 - 0:15:04 Text: calculus, you really should be able to do this stuff and get it to work out. There are lots of other
0:15:04 - 0:15:11 Text: sources; I've mentioned the notes. You can also look at the textbook from Math 51, which also
0:15:11 - 0:15:17 Text: has quite a lot of material on this. I know some of you have bad memories of Math 51.
0:15:17 - 0:15:22 Text: OK, so let's go through this and see how it works, ramping up from the beginning.
0:15:22 - 0:15:28 Text: So the beginning of calculus is, you know, we have a function with one input and one
0:15:28 - 0:15:35 Text: output, f of x equals x cubed. And so then its gradient is its slope, right? So that's
0:15:35 - 0:15:43 Text: its derivative. So its derivative is 3x squared. And the way to think about this is how much
0:15:43 - 0:15:49 Text: will the output change if we change the input a little bit, right? So what we're wanting
0:15:49 - 0:15:56 Text: to do in our neural net models is change what they output so that they do a better job
0:15:56 - 0:16:01 Text: of predicting the correct answers when we're doing supervised learning. And so what we want
0:16:01 - 0:16:06 Text: to know is if we fiddle different parameters of the model, how much effect will that have on the
0:16:06 - 0:16:12 Text: output? Because then we can choose how to fiddle them in the right way to move things down,
0:16:12 - 0:16:18 Text: right? So, you know, when we're saying that the derivative here is 3x squared, well,
0:16:18 - 0:16:26 Text: what we're saying is that if you're at x equals 1, if you fiddle the input a little bit,
0:16:26 - 0:16:32 Text: the output will change 3 times as much, 3 times 1 squared. And it does. So if I say what's
0:16:32 - 0:16:40 Text: the value at 1.01, it's about 1.03, it's changed 3 times as much and that's its slope. But
0:16:40 - 0:16:49 Text: at x equals 4, the derivative is 3 times 16, which is 48. So if we fiddle the input a little, it'll
0:16:49 - 0:16:57 Text: change 48 times as much, and that's roughly what happens: 4.01 cubed is 64.48. Now, of
0:16:57 - 0:17:02 Text: course, you know, this is just sort of showing it for a small fiddle, but, you know, that's
0:17:02 - 0:17:10 Text: an approximation to the actual truth. Okay, so then we sort of ramp up to the more complex
0:17:10 - 0:17:16 Text: cases, which are more reflective of what we do with neural networks. So if we have a function
0:17:16 - 0:17:23 Text: with one output and n inputs, then we have a gradient. So a gradient is a vector of partial
0:17:23 - 0:17:29 Text: derivatives with respect to each input. So we've got n inputs x1 to xn and we're working
0:17:29 - 0:17:35 Text: out the partial derivative f with respect to x1, the partial derivative f with respect to
0:17:35 - 0:17:42 Text: x2, et cetera. And we then get a vector of partial derivatives, where each element of
0:17:42 - 0:17:50 Text: this vector is just like a simple derivative with respect to one variable. Okay, so from
0:17:50 - 0:17:57 Text: that point, we just keep on ramping up for what we do with neural networks. So commonly,
0:17:57 - 0:18:03 Text: when we have something like a layer in a neural network, we'll have a function with n inputs
0:18:03 - 0:18:10 Text: that will be like our word vectors, then we do something like multiply by a matrix and
0:18:10 - 0:18:17 Text: then we'll have m outputs. So we have a function now, which is taking n inputs and is producing
0:18:17 - 0:18:25 Text: m outputs. So at this point, what we're calculating for the gradient is what's called a
0:18:25 - 0:18:34 Text: Jacobian matrix. So for n inputs and m outputs, the Jacobian is an m by n matrix of every
0:18:34 - 0:18:43 Text: combination of partial derivatives. So the function f splits up into these different sub-functions
0:18:43 - 0:18:50 Text: f1 through fm, which generate each of the m outputs. And so then we're taking the
0:18:50 - 0:18:56 Text: partial derivative f1 with respect to x1 through the partial derivative f1 with respect to
0:18:56 - 0:19:02 Text: xn, then heading down, you know, we make it up to the partial derivative of fm with respect
0:19:02 - 0:19:09 Text: to x1, et cetera. So we have every possible partial derivative of an output variable with
0:19:09 - 0:19:20 Text: respect to one of the input variables. Okay. So in simple calculus, when you have a composition
0:19:20 - 0:19:30 Text: of one variable functions, so that if you have y equals x squared and then z equals 3y,
0:19:30 - 0:19:38 Text: then z is a composition of two functions of, or you're composing two functions, z is
0:19:38 - 0:19:44 Text: a function of x. Then you can work out the derivative of z with respect to x. And the
0:19:44 - 0:19:49 Text: way you do that is with the chain rule. And so in the chain rule, you multiply derivatives.
0:19:49 - 0:20:03 Text: So dz/dx equals dz/dy times dy/dx. So dz/dy is just 3, and dy/dx is 2x. So we get 3
0:20:03 - 0:20:13 Text: times 2x. So that overall, the derivative here is 6x. And since if we multiply this together,
0:20:13 - 0:20:20 Text: we're really saying that z equals 3x squared, you should trivially be able to see again,
0:20:20 - 0:20:27 Text: aha, it's derivative is 6x. So that works. Okay. So once we move into vectors and matrices
0:20:27 - 0:20:35 Text: and Jacobians, it's actually the same game. So when we're working with those, we can compose
0:20:35 - 0:20:41 Text: functions and work out their derivatives by simply multiplying Jacobians. So if we have
0:20:41 - 0:20:48 Text: start with an input x and then put it through the simplest form of neural network layer
0:20:48 - 0:20:55 Text: and say that z equals wx plus b. So we multiply the input x by the matrix w and then add on a
0:20:55 - 0:21:01 Text: bias vector b. And then typically we'd put things through a nonlinearity f. So f could
0:21:01 - 0:21:08 Text: be a sigmoid function. We'll then say h equals f of z. So this is the composition of two
0:21:08 - 0:21:15 Text: functions in terms of vectors and matrices. So we can use Jacobians and we can say the
0:21:15 - 0:21:23 Text: partial of h with respect to x is going to be the product of the partial of h with respect
0:21:23 - 0:21:31 Text: to z and the partial of z with respect to x. And this all does work out. So let's start
0:21:31 - 0:21:39 Text: going through some examples of how these things work slightly more concretely. First,
0:21:39 - 0:21:47 Text: just particular Jacobians and then composing them together. So one case we look at is the
0:21:47 - 0:21:53 Text: nonlinearities that we put a vector through. So this is something like putting a vector
0:21:53 - 0:22:01 Text: through the sigmoid function f. And so if we have an intermediate vector z and we're turning
0:22:01 - 0:22:09 Text: it into a vector h by putting it through a logistic function, we can ask what dh/dz is.
0:22:12 - 0:22:22 Text: Well, for this, formally, this is a function that has n inputs and n outputs. So at the end
0:22:22 - 0:22:30 Text: of the day, we're computing an n by n Jacobian. And so what that's meaning is the elements
0:22:30 - 0:22:39 Text: of this n by n Jacobian are going to take the partial derivative of each output with respect
0:22:39 - 0:22:48 Text: to each input. And well, what is that going to be in this case? Well, in this case, because
0:22:48 - 0:22:56 Text: we're actually just computing element-wise a transformation such as a logistic transform
0:22:56 - 0:23:05 Text: of each element zi, like the second equation here. If i equals j, we've got something to compute.
0:23:05 - 0:23:13 Text: Whereas if i doesn't equal j, there's just the input has no influence on the output
0:23:13 - 0:23:19 Text: and so the derivative is zero. So if i doesn't equal j, we're going to get a zero. And if i does
0:23:19 - 0:23:27 Text: equal j, then we're going to get the regular one variable derivative of the logistic function,
0:23:27 - 0:23:34 Text: which, if I remember correctly, you were asked to compute; now I can't remember whether it's
0:23:34 - 0:23:40 Text: assignment one or assignment two, but one of the two asks you to compute it. So our Jacobian
0:23:40 - 0:23:50 Text: for this case looks like this. We have a diagonal matrix with the derivatives of each element
0:23:50 - 0:23:57 Text: along the diagonal and everything else is zero. Okay, so let's look at a couple of other Jacobians.
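Before moving on, that diagonal Jacobian can be checked numerically; here is a small sketch with assumed toy values:

```python
import numpy as np

def f(z):  # the elementwise logistic transform
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.5, -1.0, 2.0])
n, eps = len(z), 1e-6

# finite-difference Jacobian, entry (i, j) = d h_i / d z_j
J = np.zeros((n, n))
for j in range(n):
    bump = np.zeros(n)
    bump[j] = eps
    J[:, j] = (f(z + bump) - f(z - bump)) / (2 * eps)

# the claim: a diagonal matrix with f'(z_i) = f(z_i) (1 - f(z_i)) on the diagonal
expected = np.diag(f(z) * (1 - f(z)))
print(np.allclose(J, expected, atol=1e-6))  # True
```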
0:23:58 - 0:24:06 Text: So if we're asking, if we've got this wx plus b basic neural network layer and we're asking
0:24:06 - 0:24:14 Text: for the gradient with respect to x, then what we're going to have coming out is actually
0:24:14 - 0:24:24 Text: going to be the matrix w. So this is where what I hope you can do is look at the notes at home
0:24:24 - 0:24:33 Text: and work through this exactly and see that this is actually the right answer. But this is the way
0:24:33 - 0:24:41 Text: in which if you just have faith and think this is just like single variable calculus, except I've
0:24:41 - 0:24:46 Text: now got vectors and matrices, the answer you get is actually what you expected to get, because this
0:24:46 - 0:24:55 Text: is just like the derivative of ax plus b with respect to x, which is a. So similarly, if we take
0:24:55 - 0:25:05 Text: the partial derivative with respect to b of wx plus b, we get out the identity matrix. Okay, then one
0:25:05 - 0:25:12 Text: other Jacobian that we mentioned while in the first lecture while working through word to veck
0:25:12 - 0:25:24 Text: is if you have the dot product of two vectors, u transpose h, that's a number, then what you get coming out of
0:25:24 - 0:25:34 Text: that, the partial derivative of u transpose h with respect to u, is h transpose. And at this point,
0:25:34 - 0:25:42 Text: there's some fine print that I'm going to come back to in a minute. So this is the correct Jacobian,
0:25:42 - 0:25:53 Text: right? Because in this case, we have the dimension of h inputs and we have one output. And so we want
0:25:53 - 0:26:00 Text: to have a row vector. But there's a little bit more to say on that that I'll come back to in
0:26:00 - 0:26:09 Text: about 20 slides. But this is the correct Jacobian. Okay, so if you aren't familiar with these
0:26:09 - 0:26:17 Text: kinds of Jacobians, do please look at some of the notes that are available and try to compute these
0:26:17 - 0:26:23 Text: in more detail element wise and convince yourself that they really are right. But I'm going to assume
0:26:23 - 0:26:30 Text: these now and show you what happens when we actually then work out gradients for at least a mini little
0:26:30 - 0:26:47 Text: neural net. Okay, so here is most of this neural net. I mean, as I commented, you know, really
0:26:47 - 0:26:53 Text: we'd be working out the partial derivative of the loss j with respect to these variables.
0:26:53 - 0:26:58 Text: But for the example I'm doing here, I've lopped that off to keep it a little simpler and
0:26:58 - 0:27:04 Text: more manageable for the lecture. And so we're going to just work out the partial derivative of the
0:27:04 - 0:27:11 Text: score s, which is a real number with respect to the different parameters of this model where the
0:27:11 - 0:27:21 Text: parameters of this model are going to be the w and the b and the u and also the input because we
0:27:21 - 0:27:31 Text: can update the word vectors of different words based on tuning them to better
0:27:31 - 0:27:37 Text: predict the classification outputs that we desire. So let's start off with a fairly easy one
0:27:37 - 0:27:46 Text: where we want to update the bias vector b to have our system classify better. So to be able to
0:27:46 - 0:27:51 Text: do that, what we want to work out is the partial derivatives of s with respect to b.
0:27:53 - 0:27:59 Text: So we know how to put that into our stochastic gradient update for the b parameters.
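Once that partial derivative is in hand, the stochastic gradient update for b is one line; a minimal sketch with made-up numbers (grad_b here is just a stand-in value, not the derived gradient):

```python
import numpy as np

alpha = 0.1                              # learning rate (assumed)
b = np.array([0.2, -0.4, 0.1])           # current bias parameters (toy values)
grad_b = np.array([0.05, -0.02, 0.03])   # stand-in for the computed gradient
b = b - alpha * grad_b                   # move a little against the gradient
print(b)
```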
0:28:00 - 0:28:07 Text: Okay, so how do we go about doing these things? So the first step is we want to sort of break things
0:28:07 - 0:28:15 Text: up into different functions of minimal complexity that compose together. So in particular,
0:28:15 - 0:28:22 Text: this neural net layer, a equals f of wx plus b, it's still a little bit complex. So let's decompose
0:28:22 - 0:28:33 Text: that one further step. So we have the input x, we then calculate the linear transformation z equals
0:28:33 - 0:28:44 Text: wx plus b. And then we put things through the sort of element wise non-linearity, a equals f of z,
0:28:44 - 0:28:54 Text: and then we do the dot product with u. And it's useful for working these things out to split things into
0:28:54 - 0:29:00 Text: pieces like this, to have straight what your different variables are, and to know what the
0:29:00 - 0:29:07 Text: dimensionality of each of these variables is. It's well worth just writing out the dimensionality
0:29:07 - 0:29:12 Text: of every variable and making sure that the answers that you're computing are of the right dimensionality.
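That dimensionality bookkeeping can be written out directly; a sketch with assumed toy sizes:

```python
import numpy as np

d = 4                  # word-vector dimensionality (toy value)
n = 5 * d              # input size: five concatenated word vectors
m = 8                  # hidden-layer size (toy value)

x = np.zeros(n)        # input window vector   : shape (n,)
W = np.zeros((m, n))   # layer weight matrix   : shape (m, n)
b = np.zeros(m)        # bias vector           : shape (m,)
u = np.zeros(m)        # output weight vector  : shape (m,)

z = W @ x + b                 # (m, n) @ (n,) + (m,)  -> (m,)
h = 1.0 / (1.0 + np.exp(-z))  # elementwise           -> (m,)
s = u @ h                     # (m,) . (m,)           -> scalar

assert z.shape == (m,) and h.shape == (m,) and np.ndim(s) == 0
print("all shapes consistent")
```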
0:29:13 - 0:29:22 Text: So at this point though, what we can see is that calculating s is the product of three,
0:29:22 - 0:29:31 Text: sorry, is the composition of three functions around x. So for working out the partials of s with
0:29:31 - 0:29:40 Text: respect to b, it's the composition of the three functions shown on the left. And so therefore,
0:29:40 - 0:29:51 Text: the gradient of s with respect to b, we're going to take the product of these three partial derivatives.
0:29:53 - 0:30:04 Text: Okay, so how do we do this? So we've got s equals u transpose h, so that's sort of the top; the corresponding
0:30:04 - 0:30:11 Text: partial derivative of h with respect to z; and the partial derivative of z with respect
0:30:11 - 0:30:18 Text: to b, which is the first one that we're working out. Okay, so we want to work this out,
0:30:18 - 0:30:25 Text: and if we're lucky, we remember those Jacobians I showed previously about the Jacobian for
0:30:25 - 0:30:34 Text: a vector dot product, the Jacobian for the nonlinearity and the Jacobian for the simple linear
0:30:34 - 0:30:43 Text: transformation. And so we can use those. So for the partials of s with respect to h,
0:30:45 - 0:30:52 Text: well, that's going to be u transpose, using the first one. The partials of h with respect to z,
0:30:52 - 0:30:59 Text: okay, so that's the nonlinearity. And so that's going to be the matrix that's the diagonal matrix
0:30:59 - 0:31:09 Text: with the element-wise derivative of f prime of z and zero elsewhere. And then for the wx plus b,
0:31:09 - 0:31:16 Text: when we're taking the partials with respect to b, that's just the identity matrix. So we can
0:31:16 - 0:31:28 Text: simplify that down a little: the identity matrix disappears. And since u transpose is a vector, and this is
0:31:28 - 0:31:36 Text: a diagonal matrix, we can rewrite this as u transpose Hadamard product f prime of z. I think this is the
0:31:36 - 0:31:43 Text: first time I've used this little circle for the Hadamard product, but it's something that you'll see
0:31:43 - 0:31:50 Text: quite a bit in neural network work since it's often used. So when we have two vectors,
0:31:53 - 0:32:00 Text: u transpose and this vector here, sometimes you want to do an element-wise product. So the output of this
0:32:00 - 0:32:05 Text: will be a vector where you've taken the first element of each and multiply them, the second element
0:32:05 - 0:32:10 Text: of each and multiply them, etc., downwards. And so that's called the Hadamard product, and it's
0:32:10 - 0:32:20 Text: what we're calculating here to get a vector which is the gradient of s with respect to b.
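This result can be checked numerically; a small sketch with made-up values, comparing u Hadamard f prime of z against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 4                                     # toy sizes
x, b = rng.standard_normal(n), rng.standard_normal(m)
W, u = rng.standard_normal((m, n)), rng.standard_normal(m)

f = lambda z: 1.0 / (1.0 + np.exp(-z))          # logistic nonlinearity
def score(b_):                                  # s as a function of b only
    return u @ f(W @ x + b_)

z = W @ x + b
analytic = u * f(z) * (1 - f(z))                # u Hadamard f'(z)

eps = 1e-6                                      # finite-difference check
numeric = np.array([(score(b + eps * e) - score(b - eps * e)) / (2 * eps)
                    for e in np.eye(m)])
print(np.allclose(analytic, numeric, atol=1e-6))  # True
```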
0:32:22 - 0:32:31 Text: Okay, so that's good. So we now have a gradient of s with respect to b, and we could use that
0:32:31 - 0:32:38 Text: in our stochastic gradient. But we don't stop there. We also want to work out the gradient
0:32:38 - 0:32:47 Text: with respect to others of our parameters. So we might want to next go on and work out the gradient
0:32:47 - 0:32:57 Text: of s with respect to w. Well, we can use the chain rule just like we did before. So we've got the
0:32:57 - 0:33:03 Text: same product of functions, and everything is going to be the same, apart from now taking
0:33:03 - 0:33:13 Text: the derivatives with respect to w rather than b. So it's now going to be the partial of s with
0:33:13 - 0:33:22 Text: respect to h, h with respect to z, and z with respect to w. And the important thing to notice here,
0:33:22 - 0:33:29 Text: and this leads into what we do with the back propagation algorithm, is wait a minute,
0:33:29 - 0:33:36 Text: that this is very similar to what we've already done. So when we're all working out the gradients
0:33:36 - 0:33:44 Text: of s with respect to b, the first two terms were exactly the same. It's only the last one that
0:33:44 - 0:33:54 Text: differs. So to be able to build or to train neural networks efficiently, this is what happens all
0:33:54 - 0:34:01 Text: the time, and it's absolutely essential that we use an algorithm that avoids repeated computation.
0:34:02 - 0:34:09 Text: And so the idea we're going to develop is when we have this equation stack that there's sort of
0:34:09 - 0:34:17 Text: stuff that's above where we compute z, and we're going to be sort of that'll be the same each time,
0:34:17 - 0:34:24 Text: and we want to compute something from that that we can then sort of feed downwards when working
0:34:24 - 0:34:36 Text: out the gradients with respect to w, x, or b. And so we do that by defining delta, which is
0:34:36 - 0:34:44 Text: the partials composed that are above the linear transform, and that's referred to as the local
0:34:44 - 0:34:51 Text: error signal. It's what's being passed in from above to the linear transform. And we've already
0:34:51 - 0:35:00 Text: computed the gradient of that in the preceding slides. And so the final form of the partial
0:35:00 - 0:35:11 Text: s with respect to b will be delta times the remaining part. And well, we'd seen that, you know,
0:35:11 - 0:35:19 Text: for partial s with respect to b, the partial z with respect to b is just the identity. So the end
0:35:19 - 0:35:25 Text: result was delta. But this time, we then go and have to work out the partial of z with respect to
0:35:25 - 0:35:33 Text: w and multiply that by delta. So that's the part that we still haven't yet done. So
0:35:33 - 0:35:40 Text: and this is where things get in some sense a little bit hairier.
0:35:42 - 0:35:49 Text: And so there's something that's important to explain. So, you know, what should we have for the
0:35:49 - 0:36:03 Text: Jacobian of dsdw? Well, that's a function that has one output, the output is just a score, a real
0:36:03 - 0:36:16 Text: number. And then it has n by m inputs. So the Jacobian is a 1 by nm matrix, i.e., a very long
0:36:16 - 0:36:25 Text: row vector. And that's correct math. But it turns out that that's kind of bad for our neural
0:36:25 - 0:36:31 Text: networks. Because remember, what we want to do with our neural networks is do stochastic gradient
0:36:31 - 0:36:41 Text: descent. And we want to say theta new equals theta old minus a small multiplier times the gradient.
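One step of that update can be sketched in NumPy; the shapes and learning rate here are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.standard_normal((4, 3))  # current parameters (any shape)
grad = rng.standard_normal((4, 3))   # gradient, same shape as theta
alpha = 0.01                         # the small multiplier (learning rate)

# theta_new = theta_old - alpha * gradient
theta_new = theta - alpha * grad
print(theta_new.shape)  # (4, 3)
```

Note that the elementwise subtraction only makes sense because the gradient has the same shape as the parameters.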
0:36:41 - 0:36:56 Text: And well, actually, the w matrix is an n by m matrix. And so we couldn't actually do the subtraction
0:36:56 - 0:37:02 Text: if this gradient we calculate is just a huge row vector. We'd like to have it as the same
0:37:02 - 0:37:11 Text: shape as the w matrix. In neural network land, when we do this, we depart from pure math at
0:37:11 - 0:37:17 Text: this point. And we use what we call the shape convention. So what we're going to say is,
0:37:19 - 0:37:24 Text: and you're meant to use this for answers in the assignment, that the shape of the gradient
0:37:24 - 0:37:29 Text: we're always going to make to be the shape of the parameters. And so therefore,
0:37:29 - 0:37:40 Text: ds dw we're also going to represent as an n by m matrix just like w. And we're going to reshape
0:37:41 - 0:37:50 Text: the Jacobian to place it into this matrix shape. Okay, so if we want to place it into this matrix
0:37:50 - 0:38:02 Text: shape, what are we going to want to get for ds dw? Well, we know that it's
0:38:03 - 0:38:16 Text: going to involve delta our local error signal. And then we have to work out something for d z
0:38:16 - 0:38:27 Text: w. Well, since z equals w x plus b, you'd kind of expect that the answer should be x.
0:38:28 - 0:38:40 Text: And that's right. So the answer to d s d w is going to be delta transpose times x transpose.
0:38:40 - 0:38:46 Text: And so the form that we're getting for this derivative is going to be the product of the local
0:38:46 - 0:38:56 Text: error signal that comes from above versus what we calculate from the local input x.
0:38:58 - 0:39:04 Text: So that shouldn't yet be obvious why that is true. So let me just go through in a bit more detail
0:39:04 - 0:39:14 Text: why that's true. So when we want to work out d s d w, right, it's sort of delta times d z
0:39:14 - 0:39:23 Text: d w, where z is computed as w x plus b. So let's just consider for a moment what the
0:39:23 - 0:39:34 Text: derivative is with respect to a single weight w ij. So w ij might be w two three that's shown in
0:39:34 - 0:39:44 Text: my little neural network here. And so the first thing to notice is the w ij only contributes to z i.
0:39:44 - 0:39:55 Text: So it's going into z two, which then computes h two. And it has no effect whatsoever on h one.
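You can see this locality numerically by nudging a single entry of W; the sizes and values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((2, 3))  # 2 hidden units, 3 inputs
x = rng.standard_normal(3)
b = rng.standard_normal(2)

z = W @ x + b

# Nudge w_23 (0-indexed: row 1, column 2) and recompute z.
W2 = W.copy()
W2[1, 2] += 0.1
z2 = W2 @ x + b

# Only z_2 changes, and by 0.1 * x_3; z_1 is untouched.
print(z2 - z)
```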
0:39:55 - 0:40:09 Text: Okay, so when we're working out d z i d w ij, z i is w i, that row of the
0:40:09 - 0:40:20 Text: matrix, times x, plus b i, which means we've got a kind of a sum of w i k times x k.
0:40:20 - 0:40:27 Text: And then for this sum, this is like one variable calculus that when we're taking the derivative of
0:40:27 - 0:40:36 Text: this with respect to w ij, every term in this sum is going to be zero. The derivative is going to
0:40:36 - 0:40:44 Text: be zero except for the one that involves w ij. And then the derivative of that is just like the
0:40:44 - 0:40:53 Text: derivative of a x with respect to a: it's going to be x. So you get x j out as the answer. And so the end result of
0:40:53 - 0:41:03 Text: that is that when we're working out, what we want as the answer is that we're going to get that these
0:41:04 - 0:41:12 Text: columns where x one is all that's left, x two is all that's left, through x m is all that's left.
0:41:12 - 0:41:20 Text: And then that's multiplied by the vectors of the local error signal from above. And what we want
0:41:20 - 0:41:27 Text: to compute is this outer product matrix, we're getting the different combinations of the delta
0:41:28 - 0:41:36 Text: and the x. And so we can get the n by m matrix that we'd like to have by our shape convention
0:41:36 - 0:41:44 Text: by taking delta transpose, which is n by one, times x transpose, which is one by m. And then we
0:41:44 - 0:41:52 Text: get this outer product matrix. So like that's a kind of a hacky argument that I've made. It's
0:41:52 - 0:41:59 Text: certainly a way of doing things that the dimensions work out and it sort of makes sense. There's a more
0:41:59 - 0:42:06 Text: detailed run-through of this that appears in the lecture notes. And I encourage you to sort of also look
0:42:06 - 0:42:14 Text: at the more mathy version of that. Here's a little bit more information about the shape convention.
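One way to convince yourself of the outer-product form is a finite-difference check; here is a minimal sketch, assuming a sigmoid f and small made-up shapes:

```python
import numpy as np

def f(z):  # sigmoid, whose derivative is f(z) * (1 - f(z))
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
n, m = 2, 3
W = rng.standard_normal((n, m))
x = rng.standard_normal(m)
b = rng.standard_normal(n)
u = rng.standard_normal(n)

def score(W):
    return u @ f(W @ x + b)  # s = u . h with h = f(Wx + b)

# Local error signal: delta = u Hadamard f'(z).
z = W @ x + b
delta = u * f(z) * (1 - f(z))

# Shape-convention gradient: an n-by-m matrix, the outer product of delta and x.
dsdW = np.outer(delta, x)

# Finite-difference check of one entry.
eps = 1e-6
W2 = W.copy()
W2[1, 2] += eps
numeric = (score(W2) - score(W)) / eps
print(abs(numeric - dsdW[1, 2]))  # tiny
```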
0:42:14 - 0:42:26 Text: So well, first of all, one more example of this. So when you're working out DSDB,
0:42:26 - 0:42:37 Text: its Jacobian comes out as a row vector. But, you know, according to the shape
0:42:37 - 0:42:46 Text: convention, we want our gradient to be the same shape as B, and B is a column vector. So that's sort
0:42:46 - 0:42:53 Text: of again, they're different shapes and you have to transpose one to get the other. And so effectively,
0:42:53 - 0:43:00 Text: what we have is a disagreement between the Jacobian form. So the Jacobian form makes sense for
0:43:01 - 0:43:08 Text: you know, calculus and math. Because if you want to have it like I claimed that matrix calculus
0:43:08 - 0:43:14 Text: is just like single variable calculus apart from using vectors and matrices, you can just multiply
0:43:14 - 0:43:21 Text: together the partials. That only works out if you're using Jacobians. But on the other hand,
0:43:21 - 0:43:29 Text: if you want to do stochastic gradient descent and be able to sort of subtract off a piece of the
0:43:29 - 0:43:38 Text: gradient, that only works if you have the same shape matrix for the gradient as you do for the
0:43:38 - 0:43:46 Text: original matrix. And so this is a bit confusing, but that's just the reality. There are both of these
0:43:46 - 0:43:57 Text: two things. So the Jacobian form is useful in doing the calculus. But for the answers in the
0:43:57 - 0:44:06 Text: assignment, we want the answers to be presented using the shape convention so that the gradient is
0:44:06 - 0:44:14 Text: shown in the same shape as the parameters. And therefore, you'll be able to, it's the right shape
0:44:14 - 0:44:23 Text: for doing a gradient update by just subtracting a small amount of the gradient. So for working
0:44:23 - 0:44:32 Text: through things, there are then basically two choices. One choice is to work through all the math
0:44:32 - 0:44:39 Text: using Jacobians and then, right at the end, reshape following the shape convention to give the
0:44:39 - 0:44:50 Text: answer. So that's what I did when I worked out DSDB. We worked through it using Jacobians. We
0:44:50 - 0:44:57 Text: got an answer, but it turned out to be a row vector. And so, well, then we have to transpose it at
0:44:57 - 0:45:08 Text: the end to get into the right shape for the shape convention. The alternative is to always follow
0:45:08 - 0:45:16 Text: the shape convention. And that's kind of what I did when I was then working out DSDW. I didn't
0:45:16 - 0:45:26 Text: faithfully use Jacobians. I said, oh, well, when we work out, whatever it was, DZDW, let's work out what
0:45:26 - 0:45:34 Text: shape we want it to be and what to fill in the cells with. And if you're sort of trying to do it
0:45:34 - 0:45:43 Text: immediately with the shape convention, it's a little bit more hacky in a way since you know,
0:45:43 - 0:45:48 Text: you have to look at the dimensions for what you want and figure out when to transpose or to reshape
0:45:48 - 0:45:56 Text: the matrix to be it the right shape. But the kind of informal reasoning that I gave is what you do
0:45:56 - 0:46:03 Text: and what works. And you know, there are sort of hints that you can use, right? That
0:46:03 - 0:46:09 Text: you know that your gradient should always be the same shape as your parameters. And you know that
0:46:09 - 0:46:16 Text: the error signal coming in will always have the same dimensionality as that hidden layer. And
0:46:16 - 0:46:30 Text: you can sort of work it out always following the shape convention. Okay. So that is how we do
0:46:30 - 0:46:43 Text: all this matrix calculus. So after pausing for breath for a second, the rest of the lecture is then,
0:46:43 - 0:46:53 Text: okay, let's look at how our software trains neural networks using what's referred to as the back
0:46:53 - 0:47:11 Text: propagation algorithm. So the short answer is, you know, basically we've already done it,
0:47:11 - 0:47:17 Text: the rest of the lecture is easy. So, you know, essentially I've just shown you what the
0:47:17 - 0:47:28 Text: back propagation algorithm does. So the back propagation algorithm is judiciously taking and
0:47:30 - 0:47:41 Text: propagating derivatives using the matrix chain rule. The rest of the back propagation algorithm
0:47:41 - 0:47:50 Text: is to say, okay, when we have these neural networks, we have a lot of shared structure and shared
0:47:50 - 0:48:01 Text: derivatives. So what we want to do is maximally, efficiently reuse derivatives of higher layers
0:48:01 - 0:48:08 Text: when we're computing derivatives for lower layers so that we minimize computation. And I already
0:48:08 - 0:48:16 Text: pointed that out in the first half, but we want to systematically exploit that. And so the way we do
0:48:16 - 0:48:26 Text: that in our computational systems is to construct computation graphs. So this maybe looks a little
0:48:26 - 0:48:34 Text: bit like what you saw in a compilers class if you did one, right, that you're creating, I call it
0:48:34 - 0:48:39 Text: here a computation graph, but it's really a tree, right. So you're creating here this tree of
0:48:39 - 0:48:47 Text: computations in this case, but in more general case, it's some kind of directed graph of computations,
0:48:48 - 0:48:59 Text: which has source nodes, which are inputs, either inputs like x or input parameters like w and b.
0:48:59 - 0:49:06 Text: And its interior nodes are operations. And so then once we've constructed a graph,
0:49:06 - 0:49:11 Text: and so this graph corresponds to exactly the example I did before, right, that this was our little
0:49:11 - 0:49:17 Text: neural net that's in the top right. And here's the corresponding computation graph of computing
0:49:17 - 0:49:26 Text: wx plus b, put it through the sigmoid nonlinearity f, and take the dot product of the
0:49:26 - 0:49:37 Text: resulting vector with u, which gives us our output score s. Okay, so what we do to compute this is we
0:49:37 - 0:49:44 Text: pass along the edges the results of operations. So this is wx, then z, then h, and then our output is s.
0:49:45 - 0:49:51 Text: And so the first thing we want to be able to do to compute with neural networks is to be able to
0:49:51 - 0:49:59 Text: compute for different inputs what the output is. And so that's referred to as forward propagation.
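As a concrete sketch, here is that forward pass computed edge by edge for made-up parameter values (assuming a sigmoid for f):

```python
import numpy as np

def f(z):  # sigmoid nonlinearity
    return 1.0 / (1.0 + np.exp(-z))

# Made-up inputs and parameters.
W = np.array([[1.0, 0.0],
              [0.0, 2.0]])
b = np.array([0.5, -0.5])
u = np.array([1.0, 1.0])
x = np.array([2.0, 1.0])

# Forward propagation: each line computes one edge value of the graph.
wx = W @ x   # matrix-vector multiply
z = wx + b   # add the bias
h = f(z)     # elementwise nonlinearity
s = u @ h    # dot product with u gives the scalar score s
print(s)
```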
0:49:59 - 0:50:10 Text: And so we simply run this expression much like you just standardly do in a compiler to compute
0:50:10 - 0:50:16 Text: the value of s. And that's the forward propagation phase. But the essential additional element of
0:50:16 - 0:50:25 Text: neural networks is that we then also want to be able to send back gradients, which will tell us how
0:50:25 - 0:50:33 Text: to update the parameters of the model. And so it's this ability to send back gradients, which gives us
0:50:33 - 0:50:40 Text: the ability for these models to learn once we have a loss function at the end, we can work out how to
0:50:40 - 0:50:48 Text: change the parameters of the model so that they more accurately produce the desired output, i.e.
0:50:48 - 0:50:57 Text: they minimize the loss. And so it's doing that part that then is called back propagation. So we then
0:50:57 - 0:51:06 Text: once we forward propagated a value with our current parameters, we then head backwards reversing
0:51:06 - 0:51:16 Text: the direction of the arrows and pass along gradients down to the different parameters like B and W and U
0:51:16 - 0:51:22 Text: that we can use to change using stochastic gradient descent what the value of B is or what the
0:51:22 - 0:51:32 Text: value of W is. So we start off with ds ds, which is just one. And then we run our back propagation.
0:51:32 - 0:51:41 Text: And we're using the sort of same kind of composition of Jacobians. So we have ds dh here and ds dz
0:51:41 - 0:51:49 Text: and we progressively pass back those gradients. So we just need to work out how to efficiently and
0:51:49 - 0:51:57 Text: cleanly do this in a computational system. And so let's sort of work through again a few of these
0:51:57 - 0:52:07 Text: cases. So the general situation is we have a particular node. So a node is where some kind of
0:52:08 - 0:52:17 Text: operation like multiplication or a nonlinearity happens. And so the simplest case is that we've got
0:52:17 - 0:52:26 Text: one output and one input. So we'll do that first. So that's like h equals f of z. So what we have is
0:52:26 - 0:52:38 Text: an upstream gradient ds dh. And what we want to do is compute the downstream gradient of ds dz.
0:52:38 - 0:52:46 Text: And the way we're going to do that is say, well, for this function f, it's a function, it's got
0:52:46 - 0:52:53 Text: a derivative or gradient. So what we want to do is work out that local gradient dh dz.
0:52:53 - 0:53:03 Text: And then that gives us everything that we need to work out ds dz. Because that's precisely where we're
0:53:03 - 0:53:10 Text: going to use the chain rule. We're going to say that ds dz equals the product of ds dh times dh dz,
0:53:10 - 0:53:17 Text: where this is again using Jacobians. Okay, so the general principle that we're going to use is
0:53:17 - 0:53:25 Text: that downstream gradient equals the upstream gradient times the local gradient. Okay, sometimes
0:53:25 - 0:53:31 Text: it gets a little bit more complicated. So we might have multiple inputs to a function. So this is
0:53:32 - 0:53:40 Text: the matrix vector multiply. So z equals wx. Okay, when there are multiple inputs, we still have
0:53:40 - 0:53:50 Text: an upstream gradient ds dz. But what we're going to do is work out a local gradient with respect to
0:53:50 - 0:54:00 Text: each input. So we have dz dw and dz dx. And so then at that point, it's exactly the same for each
0:54:00 - 0:54:08 Text: piece of it. We're going to work out the downstream gradients ds dw and ds dx by using the chain rule
0:54:08 - 0:54:17 Text: with respect to the particular local gradient. So let's go through an example of this. I mean,
0:54:17 - 0:54:24 Text: this is kind of a silly example. It's not really an example that looks like a typical neural net.
0:54:24 - 0:54:30 Text: But it's sort of a simple example where we can show some of the components of what we do. So
0:54:30 - 0:54:39 Text: what we're going to do is want to calculate f of xyz, which is being calculated as x plus y times
0:54:39 - 0:54:47 Text: the max of y and z. And we've got, you know, particular values that we're starting off with x equals
0:54:47 - 0:54:55 Text: one y equals two and z equals zero. So these are the current values of our parameters. And so we can
0:54:55 - 0:55:03 Text: say, okay, well, we want to build an expression tree for that. Here's our expression tree. We're
0:55:03 - 0:55:10 Text: taking x plus y. We're taking the max of y and z. And then we're multiplying them. And so our
0:55:10 - 0:55:18 Text: forward propagation phase is just to run this. So we take the values of our parameters. And we
0:55:18 - 0:55:25 Text: simply start to compute with them, right? So we have one, two, two, zero. And we add them as three,
0:55:25 - 0:55:35 Text: the max is two. We multiply them. And that gives us six. Okay. So then at that point, we then
0:55:35 - 0:55:45 Text: want to go and work out how to do things for back propagation and how these back propagation
0:55:45 - 0:55:52 Text: steps work. And so the first part of that is sort of working out what our local gradients are
0:55:52 - 0:56:03 Text: going to be. So this is a here, and these are x and y. So dADX, since a equals x plus y, is
0:56:03 - 0:56:18 Text: just going to be one. And dADY is also going to be one. Then for b equals the max of y z. So this
0:56:18 - 0:56:25 Text: is this max node. So the local gradients for that is it's going to depend on y, where the y is
0:56:25 - 0:56:35 Text: greater than z. So dBDY is going to be one, if and only if y is greater than z, which it is at
0:56:35 - 0:56:45 Text: our particular point here. So that's one. And dBdz is going to be one only if z is greater than y.
0:56:45 - 0:56:55 Text: So for our particular values here, that one is going to be zero. And then finally, here,
0:56:55 - 0:57:08 Text: we're calculating the product f equals a b. So for that, we're going to, sorry, that slide's
0:57:08 - 0:57:15 Text: a little imperfect. Okay, so for the product, the derivative of f with respect to a is equal to b,
0:57:15 - 0:57:21 Text: which is two. And the derivative f with respect to b is a equals three. So that gives us all of
0:57:21 - 0:57:30 Text: the local gradients at each node. And so then to run back propagation, we start with dF dF,
0:57:30 - 0:57:39 Text: which is just one. And then we're going to work out the downstream equals the upstream times the
0:57:39 - 0:57:48 Text: local. Okay, so the local, so when you have a product like this, note the sort of the gradients flip.
0:57:48 - 0:58:04 Text: So we take upstream times the local, which is two. Oops. So the downstream is two on this side.
0:58:06 - 0:58:15 Text: DFDB is three. So we're taking upstream times local. That gives us three. And so that gives us
0:58:15 - 0:58:24 Text: back propagates values to the plus and max nodes. And so then we continue along. So for the max node,
0:58:25 - 0:58:34 Text: the local gradient dBDY equals one. So we're going to take upstream as three. So it's going to take
0:58:34 - 0:58:43 Text: three times one. And that gives us three. DBDZ is zero because of the fact that Z's value is not
0:58:43 - 0:58:50 Text: the max. So we're taking three times zero and saying that the gradient there is zero. So finally,
0:58:50 - 0:58:58 Text: doing the plus node, the local gradients for both x and y, there are one. So we're just getting two
0:58:58 - 0:59:06 Text: times one in both cases. And we're saying the gradients there are two. Okay. And so again, at the end
0:59:06 - 0:59:15 Text: of the day, the interpretation here is that this is giving us this information as to if we wiggle the
0:59:15 - 0:59:23 Text: values of x, y and z, how much of a difference does it make to the output? What is the slope, the
0:59:23 - 0:59:35 Text: gradient, with respect to the variable? So what we've seen is that since Z isn't the max of y and z,
0:59:35 - 0:59:43 Text: if I change the value of z a little, like to 0.1 or minus 0.1, it makes no difference at all
0:59:43 - 0:59:52 Text: to what I compute as the output. So therefore, the gradient there is zero. If I change the value
0:59:52 - 1:00:02 Text: of x a little, then that is going to have an effect. And it's going to affect the output by
1:00:02 - 1:00:17 Text: twice as much as the amount I change it. Right. And that's because df dx equals two.
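These slopes can be confirmed with a quick finite-difference check on the toy function:

```python
def f(x, y, z):
    return (x + y) * max(y, z)

x, y, z = 1.0, 2.0, 0.0
eps = 1e-6

# Finite-difference estimates of each partial at (1, 2, 0).
dfdx = (f(x + eps, y, z) - f(x, y, z)) / eps
dfdy = (f(x, y + eps, z) - f(x, y, z)) / eps
dfdz = (f(x, y, z + eps) - f(x, y, z)) / eps

print(dfdx, dfdy, dfdz)  # roughly 2, 5, and 0, matching backpropagation
```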
1:00:19 - 1:00:27 Text: So interestingly, so I mean, we can basically work that out. So if we imagine
1:00:27 - 1:00:36 Text: making x 1.1, well, then what we'd calculate for the max is still two,
1:00:37 - 1:00:46 Text: and we get 1.1 plus two
1:00:46 - 1:00:58 Text: is 3.1. So we get 3.1 times two. So that'd be about 6.2. So changing x by 0.1 has added 0.2 to the
1:00:58 - 1:01:08 Text: value of f. Conversely, for the value of y, we find that the df dy equals five. So what we do when
1:01:08 - 1:01:14 Text: we've got two things coming out here, as I'll go through again in a moment, is we're summing the
1:01:14 - 1:01:19 Text: gradients. So again, three plus two equals five. And empirically, that's what happens. So if we
1:01:19 - 1:01:28 Text: consider fiddling the value of y a little, let's say we make it a value of 2.1, then the prediction
1:01:28 - 1:01:35 Text: is it'll have five times as big an effect on the output value we compute. And well, what do we
1:01:35 - 1:01:47 Text: compute? So we compute 1 plus 2.1. So that's 3.1. And we compute the max of 2.1 and 0 is 2.1. So
1:01:47 - 1:01:54 Text: we'll take the product of 2.1 and 3.1. And I calculated that in advance, as I can't really do
1:01:54 - 1:02:01 Text: this arithmetic in my head. And the product of those two is 6.51. So it has gone up about by
1:02:01 - 1:02:09 Text: 0.5. So we've multiplied my fiddling by 0.1 by five to work out the magnitude of the
1:02:09 - 1:02:19 Text: effect on the output. Okay. So before, I did the case of, you know,
1:02:19 - 1:02:32 Text: when we had one in and one out here, and multiple ins and one out here. The case that I
1:02:32 - 1:02:40 Text: hadn't actually dealt with is the case when you have multiple outward branches, but that then turned
1:02:40 - 1:02:48 Text: up in the computation for y. So once you have multiple outward branches, what you're doing is you're
1:02:48 - 1:03:03 Text: summing. So that when you want to work out the dfdy, you've got a local gradient, you've got two
1:03:03 - 1:03:11 Text: upstream gradients. And you're working it out with respect to each of them as in the chain rule,
1:03:11 - 1:03:22 Text: and then you're summing them together to work out the impact at the end. Right. So we also saw
1:03:22 - 1:03:30 Text: some of the other node intuitions, which are useful to have when doing this. So when you have an addition,
1:03:31 - 1:03:40 Text: that distributes the upstream gradient to each of the things below it. When you have max,
1:03:40 - 1:03:47 Text: it's like a routing node. So when you have max, you have the upstream gradient, and it goes to one
1:03:47 - 1:03:56 Text: of the branches below it and the rest of them get no gradient. When you then have a multiplication,
1:03:56 - 1:04:06 Text: it has this effect of switching the gradient. So if you're taking three by two, the gradient on
1:04:06 - 1:04:12 Text: the two side is three, and on the three side is two. And if you think about in terms of how much
1:04:12 - 1:04:18 Text: effect you get from when you're doing this sort of wiggling, that totally makes sense, right? Because
1:04:18 - 1:04:25 Text: if you're multiplying another number by three, then any change here is going to be multiplied by three
1:04:25 - 1:04:37 Text: and vice versa. Okay. So this is the kind of computation graph that we want to use to work out
1:04:38 - 1:04:45 Text: derivatives in an automated computational fashion, which is the basis of the back propagation
1:04:45 - 1:04:54 Text: algorithm. But at that point, this is what we're doing, but there's still one mistake that we can make.
1:04:54 - 1:05:00 Text: It would be wrong for us to sort of say, okay, well, first of all, we want to work out DSDB.
1:05:00 - 1:05:09 Text: So look, we can start up here. We can propagate our upstream errors, work out local gradients,
1:05:09 - 1:05:19 Text: upstream error, local gradient, and keep going all the way down and get the DSDB down here. Okay,
1:05:19 - 1:05:27 Text: next we want to do it for DSDW. Let's just run it all over again. Because if we do that, we'd be
1:05:27 - 1:05:36 Text: doing repeated computation, as I showed in the first half, that this term is the same both times,
1:05:36 - 1:05:42 Text: this term is the same both times, this term is the same both times, that only the bits at the end
1:05:42 - 1:05:50 Text: differ. So what we want to do is avoid duplicated computation and compute all the gradients
1:05:53 - 1:06:00 Text: that we're going to need, successively, so that we only do them once. And so that was analogous
1:06:00 - 1:06:08 Text: to when I introduced this delta variable when we computed gradients by hand. So, starting off
1:06:08 - 1:06:21 Text: here with DSDS, which is one, we then want to one time compute the gradient in the
1:06:21 - 1:06:28 Text: green here, one time compute the gradient in green here; that's all common work. Then we're
1:06:28 - 1:06:38 Text: going to take the local gradient for DZDB and multiply that by the upstream gradient to have
1:06:38 - 1:06:47 Text: worked out DSDB. And then we're going to take the same upstream gradient and then work out the
1:06:47 - 1:06:58 Text: local gradient here and then propagate that down to give us DSDW. So the end result is we want to
1:06:58 - 1:07:06 Text: systematically do forward computation forward in the graph, and backward computation,
1:07:06 - 1:07:14 Text: back propagation, backward in the graph in a way that we do things efficiently. So this is
1:07:14 - 1:07:24 Text: the general form of the algorithm which works for an arbitrary computation graph. So at the end
1:07:24 - 1:07:36 Text: of the day, we've got a single scalar output Z and then we have inputs and parameters which compute
1:07:36 - 1:07:46 Text: Z. And so once we have this computation graph and I added in this funky extra arrow here to make
1:07:46 - 1:07:53 Text: it a more general computation graph, well we can always say that we can work out a starting point,
1:07:53 - 1:08:00 Text: something that doesn't depend on anything. So in this case both of these bottom two nodes don't
1:08:00 - 1:08:07 Text: depend on anything else. So we can start with them and we can start to compute forward. We can compute
1:08:07 - 1:08:13 Text: values for all of these sort of second row from the bottom nodes and then we're able to compute
1:08:15 - 1:08:22 Text: the third level up. So we can have a topological sort of the nodes based on the dependencies
1:08:22 - 1:08:31 Text: in this directed graph, and we can compute the value of each node given the subset of its predecessors
1:08:31 - 1:08:38 Text: which it depends on. And so doing that is referred to as the forward propagation phase, and gives us
1:08:38 - 1:08:45 Text: a computation of the scalar output Z using our current parameters and our current inputs.
1:08:45 - 1:08:53 Text: And so then after that we run back propagation. So for back propagation we initialize the output
1:08:53 - 1:09:04 Text: gradient, dZ dZ, as one, and then we visit nodes in the reverse order of the topological sort
1:09:04 - 1:09:12 Text: and we compute the gradients downward. And so our recipe is that for each node as we head down,
1:09:12 - 1:09:21 Text: we're going to compute the gradient of the node with respect to its successors, the things that it
1:09:21 - 1:09:29 Text: feeds into. And how we compute that gradient is using this chain rule that we've looked at. So
1:09:29 - 1:09:35 Text: this is sort of the generalized form of the chain rule where we have multiple outputs. And so we're
1:09:35 - 1:09:41 Text: summing over the different outputs. And then for each output we're computing the product of the
1:09:41 - 1:09:49 Text: upstream gradient and the local gradient with respect to that node. And so we head downwards.
1:09:49 - 1:09:57 Text: And we continue down the reverse topological sort order and we work out the gradient with respect
1:09:57 - 1:10:08 Text: to each variable in this graph. And so it hopefully looks kind of intuitive looking at this picture
1:10:08 - 1:10:18 Text: that if you think of it like this, the big O complexity of forward propagation and backward
1:10:18 - 1:10:26 Text: propagation is the same. Right. In both cases you're doing a linear pass through all of these nodes
1:10:26 - 1:10:33 Text: and calculating values given predecessors and then values given successors. I mean, you have to
1:10:33 - 1:10:40 Text: do a little bit more work for working out the gradients, sort of as shown by this chain rule,
1:10:40 - 1:10:45 Text: but it's the same big O complexity. So if somehow you're implementing stuff for yourself rather
1:10:45 - 1:10:52 Text: than relying on the software, and you're calculating the gradients at a different order of complexity
1:10:52 - 1:10:57 Text: than forward propagation, it means that you're doing something wrong. You're doing repeated work that
1:10:57 - 1:11:04 Text: you shouldn't have to do. Okay. So this algorithm works for a completely arbitrary
1:11:04 - 1:11:11 Text: computation graph, any directed acyclic graph. You can apply this algorithm.
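Here is a minimal toy sketch of that general algorithm, not how real frameworks implement it: each node records its inputs together with handwritten local gradients, and backward() walks the reverse topological order applying downstream = sum of upstream times local. The max node is dropped for brevity, since max(y, z) = y at the point used earlier.

```python
class Node:
    def __init__(self, value, parents=()):
        self.value = value      # forward value at this node
        self.parents = parents  # pairs of (input node, local gradient)
        self.grad = 0.0         # accumulates dZ/d(this node)

def add(a, b):  # local gradients of a + b are 1 and 1
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):  # local gradients of a * b "switch": b and a
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def topo_order(root):
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for parent, _ in n.parents:
                visit(parent)
            order.append(n)
    visit(root)
    return order

def backward(root):
    root.grad = 1.0  # dZ/dZ = 1
    for node in reversed(topo_order(root)):
        for parent, local in node.parents:
            # downstream += upstream * local; the += sums over branches
            parent.grad += node.grad * local

# The toy expression from before, (x + y) * max(y, z), with max(y, z) = y here.
x, y = Node(1.0), Node(2.0)
s = mul(add(x, y), y)
backward(s)
print(s.value, x.grad, y.grad)  # 6.0 2.0 5.0
```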
1:11:12 - 1:11:18 Text: In general, what we find is that we build neural networks that have a regular layer structure.
1:11:18 - 1:11:25 Text: So we have things like a vector of inputs and then that's multiplied by matrix. It's transformed
1:11:25 - 1:11:31 Text: into another vector, which might be multiplied by another matrix or summed with another vector
1:11:31 - 1:11:37 Text: or something. Right. So once we're using that kind of regular layer structure, we can then parallelize
1:11:37 - 1:11:48 Text: the computation by working out the gradients in terms of Jacobians of vectors and matrices and do
1:11:48 - 1:11:56 Text: things in parallel much more efficiently. Okay. So doing this is then referred to as automatic
1:11:56 - 1:12:05 Text: differentiation. And so essentially if you know the computation graph, you should be able to have
1:12:05 - 1:12:15 Text: your clever computer system work out what the derivatives of everything are and then
1:12:15 - 1:12:23 Text: apply back propagation to work out how to update the parameters and learn. And there's actually
1:12:23 - 1:12:33 Text: a sort of an interesting sort of thing of how history has gone backwards here, which I'll just
1:12:33 - 1:12:43 Text: note. So some of you might be familiar with symbolic computation packages. So those are things
1:12:43 - 1:12:51 Text: like Mathematica. So in Mathematica, you can give it a symbolic form of a computation and then it
1:12:51 - 1:12:58 Text: can work out derivatives for you. So it should be the case that if you give a complete symbolic form
1:12:58 - 1:13:06 Text: of a computation graph, then it should be able to work out all the derivatives for you and you never
1:13:06 - 1:13:12 Text: have to work out a derivative by hand whatsoever. And that was actually attempted in the famous
1:13:12 - 1:13:19 Text: deep learning library called Theano, which came out of Yoshua Bengio's group at the University of
1:13:19 - 1:13:28 Text: Montreal that had a compiler that did that kind of symbolic manipulation. But you know somehow
1:13:28 - 1:13:37 Text: that sort of proved a little bit too hard a road to follow. I imagine it actually might come back
1:13:37 - 1:13:44 Text: again in the future. And so for modern deep learning frameworks, which include both TensorFlow
1:13:44 - 1:13:56 Text: and PyTorch, they do 90% of this computation of automatic differentiation for you, but they don't
1:13:56 - 1:14:04 Text: actually symbolically compute derivatives. So for each particular node or layer of your deep
1:14:04 - 1:14:15 Text: learning system, somebody, either you or the person who wrote that layer, has handwritten the
1:14:15 - 1:14:22 Text: local derivatives. But then everything from that point on, the work of applying the
1:14:22 - 1:14:29 Text: chain rule, combining upstream gradients with local gradients to work out downstream gradients,
1:14:29 - 1:14:34 Text: that's then all being done automatically for back propagation on the computation graph.
1:14:35 - 1:14:42 Text: And so that what that means is for a whole neural network, you have a computation graph,
1:14:42 - 1:14:50 Text: and it's going to have a forward pass and a backward pass. And so for the forward pass,
1:14:50 - 1:14:56 Text: you're topologically sorting the nodes based on their dependencies in the computation graph.
1:14:56 - 1:15:04 Text: And then for each node, you're running forward, the forward computation on that node. And then
1:15:04 - 1:15:11 Text: for backward propagation, you're reversing the topological sort of the graph. And then for each node
1:15:11 - 1:15:17 Text: in the graph, you're running the backward propagation, which is a little bit of backprop, the chain
1:15:17 - 1:15:26 Text: rule at that node. And then the result of doing that is you have gradients for your inputs and parameters.
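As a rough, illustrative sketch of that forward/backward loop (this is not any real framework's API: the `Node`, `Input`, `Multiply`, and `Add` classes and the function names here are all made up for the example):

```python
# Hypothetical sketch of backprop over a computation graph: topologically
# sort the nodes, run each node's forward computation in order, then run
# the backward (chain rule) steps in reverse order.

class Node:
    def __init__(self, inputs=()):
        self.inputs = list(inputs)   # nodes this node depends on
        self.value = None
        self.grad = 0.0

class Input(Node):
    """A leaf node holding a given value (an input or a parameter)."""
    def __init__(self, value):
        super().__init__()
        self.value = value
    def compute_value(self):
        pass                         # leaf: value is already set
    def backprop_to_inputs(self):
        pass                         # leaf: nothing below it

class Multiply(Node):
    def compute_value(self):
        x, y = self.inputs
        self.value = x.value * y.value
    def backprop_to_inputs(self):
        x, y = self.inputs           # local gradients: d(xy)/dx = y, d(xy)/dy = x
        x.grad += self.grad * y.value
        y.grad += self.grad * x.value

class Add(Node):
    def compute_value(self):
        self.value = sum(n.value for n in self.inputs)
    def backprop_to_inputs(self):
        for n in self.inputs:        # local gradient of a sum is 1 per input
            n.grad += self.grad

def topological_sort(output):
    """Order nodes so each one comes after everything it depends on."""
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.inputs:
                visit(parent)
            order.append(node)
    visit(output)
    return order

def forward_backward(output):
    order = topological_sort(output)
    for node in order:               # forward pass, in dependency order
        node.compute_value()
    output.grad = 1.0                # dL/dL = 1 seeds the backward pass
    for node in reversed(order):     # backward pass, in reverse order
        node.backprop_to_inputs()    # chain rule step at this node
    return output.value
```

For example, with `x, y = Input(2.0), Input(3.0)` and `loss = Multiply((x, y))`, calling `forward_backward(loss)` returns 6.0 and leaves `x.grad == 3.0` and `y.grad == 2.0`.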
1:15:28 - 1:15:38 Text: And so the overall software runs this for you. And so what you want to do is then actually
1:15:38 - 1:15:45 Text: have stuff for particular nodes or layers in the graph. So if I have a multiply
1:15:45 - 1:15:54 Text: gate, it's going to have a forward algorithm, which just computes that the output is x times y in terms
1:15:54 - 1:16:00 Text: of the two inputs. And then I'm also going to want to tell it how to calculate the
1:16:00 - 1:16:10 Text: local derivative. So I want to say, what is the local derivative? So dL dx and dL dy in terms of the
1:16:10 - 1:16:19 Text: upstream gradient, dL dz. And so I will then manually work out how to calculate that. And normally,
1:16:19 - 1:16:29 Text: what I have to do is I assume the forward pass is being run first. And I'm going to shove into some
1:16:29 - 1:16:35 Text: local variables for my class, the values that we used in the forward computation. So as well as
1:16:35 - 1:16:44 Text: computing z equals x times y, I'm going to sort of remember what x and y were. So then when I'm
1:16:44 - 1:16:52 Text: asked to compute the backward pass, I'm then going to have implemented here what we saw earlier of
1:16:54 - 1:17:02 Text: that when it's xy, you're going to sort of swap the y and the x to work out the local gradients.
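Such a multiply gate, caching x and y on the forward pass and swapping them to form the local gradients on the backward pass, might be sketched like this (the class and method names are made up for illustration):

```python
# A sketch of the multiply gate being described. forward caches x and y,
# and backward "swaps" them for the local gradients, multiplying each by
# the upstream gradient dL/dz.

class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y   # remember the inputs for the backward pass
        return x * y            # z = x * y

    def backward(self, dz):
        dx = dz * self.y        # dL/dx = dL/dz * dz/dx, where dz/dx = y
        dy = dz * self.x        # dL/dy = dL/dz * dz/dy, where dz/dy = x
        return [dx, dy]         # in practice this might be a numpy array
```

So with inputs 2 and 3 the forward pass returns 6, and an upstream gradient of 5 comes back as the pair [15, 10].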
1:17:02 - 1:17:07 Text: And so then I'm going to multiply those by the upstream gradient. And I'm going to return,
1:17:07 - 1:17:14 Text: I've just written it here as a sort of a little list, but really it's going to be a numpy vector
1:17:14 - 1:17:25 Text: of the gradients. Okay, so that's 98% of what I wanted to cover today, just a couple of quick
1:17:25 - 1:17:34 Text: comments left. So that can and should all be automated. Sometimes you want to just check if you're
1:17:35 - 1:17:41 Text: computing the right gradients. And so the standard way of checking that you're computing the right
1:17:41 - 1:17:49 Text: gradients is to manually work out the gradient by doing a numeric calculation of the gradient. And so
1:17:49 - 1:17:58 Text: you can do that. So you can work out what the derivative of f with respect to x should be
1:17:59 - 1:18:06 Text: by choosing some sort of small number like 10 to the minus 4, adding it to x, subtracting it from x.
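That nudge-up, nudge-down recipe can be sketched in a couple of lines (the function name here is made up for illustration):

```python
# A minimal sketch of the two-sided (central difference) gradient check:
# nudge x up and down by a small h and take rise over run across the
# two evaluation points.

def numeric_gradient(f, x, h=1e-4):
    return (f(x + h) - f(x - h)) / (2 * h)   # the two points differ by 2h
```

For instance, `numeric_gradient(lambda x: x * x, 3.0)` comes out very close to the true derivative 6.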
1:18:06 - 1:18:12 Text: And then, since the difference between these two points is 2h, you divide through by 2h. And you're simply
1:18:12 - 1:18:19 Text: working out the rise over the run, which is the slope at that point with respect to x. And that's
1:18:19 - 1:18:28 Text: an approximation of the gradient of f with respect to x at that value of x. So this is so simple,
1:18:28 - 1:18:33 Text: you can't make a mistake implementing it. And so therefore you can use this to check
1:18:34 - 1:18:41 Text: whether your gradient values are correct or not. This isn't something that you'd want to use much
1:18:41 - 1:18:47 Text: because not only is it approximate, it's also extremely slow. Because to work this out you have to run
1:18:47 - 1:18:53 Text: the forward computation for every parameter of the model. So if you have a model with a million
1:18:53 - 1:19:00 Text: parameters, you're now doing on the order of a million times as much work as you would do if you were
1:19:00 - 1:19:06 Text: actually using calculus and backprop. So calculus is a good thing to know. But it can be really useful to check
1:19:06 - 1:19:12 Text: that the right values are being calculated. In the old days when we hand-wrote everything,
1:19:13 - 1:19:18 Text: this was kind of the key unit test that people used everywhere. These days most of the time you're
1:19:18 - 1:19:24 Text: reusing layers that are built into PyTorch or some other deep learning framework. So it's much less
1:19:25 - 1:19:29 Text: needed. But sometimes you're implementing your own layer and you really do want to check
1:19:29 - 1:19:35 Text: that things are implemented correctly. There's a fine point in the way this is written. If you saw
1:19:35 - 1:19:44 Text: this in sort of high school calculus class, you will have seen rise over run of f of x plus h minus
1:19:44 - 1:19:55 Text: f of x divided by h. It turns out that doing a two-sided estimate like this is much, much more
1:19:55 - 1:20:01 Text: accurate than doing a one-sided estimate. And so you're really much encouraged to use this
1:20:01 - 1:20:08 Text: approximation. Okay, so at that point, we've mastered the core technology of neural nets. Back
1:20:08 - 1:20:16 Text: propagation is recursively and hence efficiently applying the chain rule along the computation graph
1:20:16 - 1:20:25 Text: with this sort of key step that downstream gradient equals upstream gradient times local gradient.
1:20:25 - 1:20:31 Text: And so for calculating with neural nets, we do the forward pass to work out values with current
1:20:31 - 1:20:40 Text: parameters, then run back propagation to work out the gradient of the currently computed
1:20:40 - 1:20:49 Text: loss with respect to those parameters. Now to some extent, you know, with modern deep learning
1:20:49 - 1:20:54 Text: frameworks, you don't actually have to know how to do any of this, right? It's the same as you
1:20:54 - 1:21:02 Text: don't have to know how to implement a C compiler. You can just write C code, run GCC, and it'll
1:21:02 - 1:21:09 Text: compile it and it'll run the right stuff for you. And that's the kind of functionality you get
1:21:09 - 1:21:16 Text: from the PyTorch framework. So do come along to the PyTorch tutorial this Friday and get a sense
1:21:16 - 1:21:23 Text: about how easy it is to write neural networks using a framework like PyTorch or TensorFlow. And you
1:21:23 - 1:21:29 Text: know, it's so easy. That's why high school students across the nation are now doing their science
1:21:29 - 1:21:36 Text: projects, training deep learning systems because you don't actually have to understand very much
1:21:36 - 1:21:43 Text: to bung a few neural network layers together and set it computing on some data. But you know,
1:21:43 - 1:21:48 Text: we hope in this class that you actually are also learning how these things are implemented.
1:21:48 - 1:21:55 Text: So you have a deeper understanding than that. And you know, it turns out that sometimes you
1:21:55 - 1:22:01 Text: need to have a deeper understanding. So back propagation doesn't always work perfectly.
1:22:01 - 1:22:07 Text: And so understanding what it's really doing can be crucial to debugging things. And so we'll
1:22:07 - 1:22:12 Text: actually see an example of that fairly soon when we start looking at recurrent models and some of
1:22:12 - 1:22:18 Text: the problems that they have, which will require us to think a bit more deeply about what's happening
1:22:18 - 1:22:26 Text: in our gradient computations. Okay, that's it for the day.