Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 2 - Neural Classifiers

0:00:00 - 0:00:07     Text: So what are we going to do for today?

0:00:07 - 0:00:17     Text: So the main content for today is to go through sort of more stuff about word vectors,

0:00:17 - 0:00:23     Text: including touching on word senses and then introducing the notion of neural network classifiers.

0:00:23 - 0:00:41     Text: So our biggest goal is that by the end of today's class, you should feel like you could confidently look at one of the word embeddings papers, such as the Google word2vec paper or the GloVe paper or Sanjeev Arora's paper that we'll come to later, and feel like, yeah, I can understand this.

0:00:41 - 0:00:43     Text: I know what they're doing and it makes sense.

0:00:43 - 0:00:52     Text: So let's go back to where we were. So this was sort of introducing this model of word2vec and

0:00:52 - 0:01:03     Text: the idea was that we started with random word vectors and then we're going to sort of, we have a big corpus of text and we're going to iterate through each word in the whole corpus.

0:01:03 - 0:01:10     Text: And for each position, we're going to try and predict what words surround our center word.

0:01:10 - 0:01:23     Text: And we're going to do that with a probability distribution that's defined in terms of the dot product between the word vectors for the center word and the context words.

0:01:23 - 0:01:28     Text: And so that will give a probability estimate of a word appearing in the context of into.

0:01:28 - 0:01:42     Text: Well, actual words did occur in the context of 'into' on this occasion. So what we want to do is make it more likely that 'turning', 'problems', 'banking' and 'crises' will turn up in the context of 'into'.

0:01:42 - 0:01:49     Text: And so that's learning: updating the word vectors so that they can predict actual surrounding words better.

0:01:49 - 0:02:05     Text: And the thing that is almost magical is that doing no more than this simple algorithm allows us to learn word vectors that capture word similarity and meaningful directions in a word space.

0:02:05 - 0:02:20     Text: So more precisely right for this model, the only parameters of this model are the word vectors. So we have outside word vectors and center word vectors for each word.

0:02:20 - 0:02:37     Text: And then we're going to get a probability: we take a dot product to get a score of how likely a particular outside word is to occur with the center word, and then we use the softmax transformation to convert those scores into probabilities, as I discussed last time.
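As a minimal sketch of the probability just described (my own illustration, not the course's starter code), assuming hypothetical matrices U of outside vectors and V of center vectors, each of shape (vocab_size, d):

```python
import numpy as np

def skipgram_prob(U, V, center_idx, outside_idx):
    """P(o | c) = exp(u_o . v_c) / sum_w exp(u_w . v_c)."""
    scores = U @ V[center_idx]                     # dot product of v_c with every outside vector
    scores -= scores.max()                         # stabilize the exponentials
    probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the whole vocabulary
    return probs[outside_idx]
```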

0:02:37 - 0:02:41     Text: And I'll kind of come back to that at the end this time.

0:02:41 - 0:02:54     Text: The next thing to note is that this model is what we call in NLP a bag of words model. So bag of words models are models that don't actually pay any attention to word order or position.

0:02:54 - 0:03:02     Text: It doesn't matter if you're next to the center word or a bit further away on the left or right, the probability estimate would be the same.

0:03:02 - 0:03:13     Text: And that seems like a very crude model of language that will offend any linguist, and it is a very crude model of language, and we'll move on to better models of language as we go on.

0:03:13 - 0:03:23     Text: But even that crude model of language is enough to learn quite a lot of the probability, sorry, quite a lot about the properties of words.

0:03:23 - 0:03:40     Text: And then the second note is, well, with this model, we wanted to give reasonably high probabilities to the words that do occur in the context of the center word, at least if they do so at all often.

0:03:40 - 0:03:53     Text: Obviously, lots of different words can occur. So we're not talking about probabilities like point three and point five, we're more likely going to be talking about probabilities like point one and numbers like that.

0:03:53 - 0:04:11     Text: Well, how do we achieve that? And well, the way that the word2vec model achieves this, and this is the learning phase of the model, is to place words that are similar in meaning close to each other in this high dimensional vector space.

0:04:11 - 0:04:28     Text: So again, you can't read this one, but if we zoom in on this one, we see lots of words that are similar in meaning grouped close together in the space. So here are days of the week like Tuesday, Thursday, Sunday, and also Christmas.

0:04:28 - 0:04:48     Text: So what else do we have? We have Samsung and Nokia. This is a diagram I made quite a few years ago. So that's when Nokia was still an important maker of cell phones. We have various sort of fields like mathematics and economics over here.

0:04:48 - 0:05:01     Text: So we have words that are similar in meaning grouped close together. Actually, one more note I wanted to make on this. I mean, again, this is a two dimensional picture, which is all I can show you on a slide.

0:05:01 - 0:05:08     Text: And it's done with the principal components projection that you're also using in the assignment.

0:05:08 - 0:05:22     Text: So something that's important to remember, but hard to remember, is that high dimensional spaces have very different properties to the two dimensional spaces that we can look at. And so in particular,

0:05:22 - 0:05:32     Text: a word vector can be close to many other words in a high dimensional space, but close to them along different dimensions.

0:05:32 - 0:05:48     Text: So I've mentioned doing learning. So the next question is, well, how do we learn good word vectors? And this was the bit that I didn't quite hook up at the end of last class.

0:05:48 - 0:06:09     Text: So at the end of last class, I said we have to work out the gradient of the loss function with respect to the parameters, and that will allow us to make progress. But I didn't altogether put that together. So what we're going to do is we start off with random word vectors.

0:06:09 - 0:06:30     Text: And we're going to initialize them to small numbers near zero in each dimension. We've defined our loss function J, which we looked at last time. And then we're going to use a gradient descent algorithm, which is an iterative algorithm that learns to minimize J of theta by changing theta.

0:06:30 - 0:06:44     Text: The idea of this algorithm is that from the current values of theta, you calculate the gradient of J of theta. And then what you're going to do is make a small step in the direction of the negative gradient.

0:06:44 - 0:06:55     Text: So the gradient is pointing upwards. And we're taking a small step in the direction of the negative of the gradient to gradually move down towards the minimum.

0:06:55 - 0:07:13     Text: And so one of the parameters of neural nets that you can fiddle with in your software package is the step size. So if you take a really, really itsy-bitsy step, it might take you a long time to minimize the function. You do a lot of wasted computation.

0:07:13 - 0:07:30     Text: On the other hand, if your step size is much too big, well, then you can actually diverge and start going to worse places. Or even if you are going downhill a little bit, what's going to happen is you're then going to end up bouncing back and forth.

0:07:30 - 0:07:34     Text: And it'll take you much longer to get to the minimum.

0:07:34 - 0:07:46     Text: In this picture, I have a beautiful quadratic, and it's easy to minimize it. Something that you might know about neural networks is that in general they're not convex.

0:07:46 - 0:07:53     Text: So you could think that this is just all going to go awry. But the truth is, in practice life works out to be OK.

0:07:53 - 0:08:05     Text: So I think I won't get into that more right now and will come back to that in a later class. So this is our gradient descent. So we have the current values of the parameters theta.

0:08:05 - 0:08:32     Text: And then we walk a little bit in the negative direction of the gradient, using our learning rate or step size alpha, and that gives us new parameter values. These are vectors, but for each individual parameter, we're updating it a little bit by working out the partial derivative of J with respect to that parameter.
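A minimal sketch of this update rule, where grad_J is a hypothetical function standing in for "the gradient of the loss over the whole corpus":

```python
def gradient_descent_step(theta, grad_J, alpha=0.01):
    # theta_new = theta_old - alpha * gradient of J at theta_old
    return theta - alpha * grad_J(theta)
```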

0:08:32 - 0:08:39     Text: So that's the simple gradient descent algorithm. Nobody uses it and you shouldn't use it.

0:08:39 - 0:08:54     Text: The problem is that our J is a function of all windows in the corpus. Remember, we're doing this sum over every center word in the entire corpus. And we'll often have billions of words in the corpus.

0:08:54 - 0:09:04     Text: So actually working out J of theta or the gradient of J of theta would be extremely extremely expensive because we have to iterate over our entire corpus.

0:09:04 - 0:09:12     Text: So you'd wait a very long time before you made a single gradient update. And so optimization would be extremely slow.

0:09:12 - 0:09:28     Text: And so basically 100% of the time in neural network land, we don't use gradient descent. We instead use what's called stochastic gradient descent. And stochastic gradient descent is a very simple modification of this.

0:09:28 - 0:09:45     Text: So instead of working out an estimate of the gradient based on the entire corpus, you simply take one center word or a small batch like 32 center words, and you work out an estimate of the gradient based on them.

0:09:45 - 0:10:02     Text: And that estimate of the gradient will be noisy and bad because you've only looked at a small fraction of the corpus rather than the whole corpus. But nevertheless, you can use that estimate of the gradient to update your theta parameters in exactly the same way.
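As a sketch of that stochastic variant (again with hypothetical helpers: corpus_positions is a list of center-word positions and batch_grad returns the gradient estimated from just that batch):

```python
import random

def sgd_step(theta, corpus_positions, batch_grad, alpha=0.01, batch_size=32):
    batch = random.sample(corpus_positions, batch_size)  # a small random sample of positions
    return theta - alpha * batch_grad(theta, batch)      # noisy but cheap gradient update
```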

0:10:02 - 0:10:23     Text: So this is the algorithm that we use. And so then if we have a billion word corpus and we do it on each center word, we can make a billion updates to the parameters as we pass through the corpus once, rather than only making one more accurate update to the parameters

0:10:23 - 0:10:43     Text: once you've been through the corpus. So overall, we can learn several orders of magnitude more quickly. And so this is the algorithm that you'll be using everywhere, including right from the beginning in our assignments.

0:10:43 - 0:10:59     Text: And then just an extra comment on more complicated stuff we'll come back to: stochastic gradient descent is sort of a performance hack; it lets you learn much more quickly.

0:10:59 - 0:11:14     Text: But it's not only a performance hack. Neural nets have some quite counterintuitive properties, and actually the fact that stochastic gradient descent is kind of noisy and bounces around as it does its thing

0:11:14 - 0:11:31     Text: means that in complex networks, it learns better solutions than if you were to run plain gradient descent very slowly. So you can both compute much more quickly and do a better job.

0:11:31 - 0:11:49     Text: And then we have a final note on running stochastic gradients with word vectors. This is kind of an aside. But something to note is that if we're doing a stochastic gradient update based on one window, then actually in that window will have seen almost none of our parameters.

0:11:49 - 0:12:09     Text: Because if we have a window of something like five words to each side of the center word, we've seen at most 11 distinct word types. So we will have gradient information for those 11 words, but the other 100,000-odd words in our vocabulary will have no gradient update information.

0:12:09 - 0:12:25     Text: So it will be a very, very sparse gradient update. So if you're only thinking math, you can just have your entire gradient and use the equation that I showed before.

0:12:25 - 0:12:44     Text: But if you're thinking systems optimization, then you'd want to think, well, actually, I only want to update the parameters for a few words and there have to be and there are much more efficient ways that I could do that.

0:12:44 - 0:12:58     Text: So this is another aside that will be useful for the assignment. I will say that up until now, when I presented word vectors, I presented them as column vectors.

0:12:58 - 0:13:16     Text: And that makes the most sense if you think about it as a piece of math, whereas actually in all common deep learning packages, including PyTorch that we're using word vectors are actually represented as row vectors.

0:13:16 - 0:13:35     Text: And if you remember back to the representation of matrices in CS107 or something like that, you'll know that that's obviously efficient for representing words, because then you can access an entire word vector as a contiguous range of memory.

0:13:35 - 0:13:47     Text: So I'm just throwing that in as a bit of forewarning. Anyway, our word vectors will be row vectors when you look at them inside PyTorch.
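For example, a small PyTorch snippet showing that word vectors sit in the rows of the embedding matrix (the sizes here are just made up):

```python
import torch
import torch.nn as nn

vocab_size, dim = 100_000, 300
emb = nn.Embedding(vocab_size, dim)   # weight matrix has shape (vocab_size, dim)
idx = torch.tensor([42])              # some word id
vec = emb(idx)                        # shape (1, 300): each word is a contiguous row
assert torch.equal(vec[0], emb.weight[42])
```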

0:13:47 - 0:13:59     Text: Okay, now I wanted to say a bit more about the word2vec algorithm family and also what you're going to do in homework two.

0:13:59 - 0:14:09     Text: You're still meant to be working on homework one, which, remember, is due next Tuesday, but really with today's content, we're starting into homework two.

0:14:09 - 0:14:16     Text: And I'll kind of go through the first part of homework two today and the other stuff you need to know for homework two.

0:14:16 - 0:14:36     Text: I mentioned briefly the idea that we have two separate vectors for each word type, the center vector and the outside vector, and we just average them both at the end. They're similar, but not identical, for multiple reasons, including the random initialization and the stochastic gradient descent.

0:14:36 - 0:14:50     Text: You can implement a word2vec algorithm with just one vector per word, and actually if you do, it works slightly better, but it makes the algorithm much more complicated.

0:14:50 - 0:15:17     Text: The reason for that is sometimes you'll have the same word type as the center word and the context word, and that means that when you're doing your calculus at that point, you've then got this sort of messy case that just for that word, you're getting an x squared term, sorry, a dot product, an x dot x term, which makes it much messier to work out.

0:15:17 - 0:15:22     Text: And we use this sort of simple optimization of having two vectors per word.

0:15:22 - 0:15:39     Text: Okay, so the word2vec model as introduced in the Mikolov et al. paper in 2013 wasn't really just one algorithm, it was a family of algorithms.

0:15:39 - 0:15:49     Text: There were two basic model variants. One was called the skip-gram model, which is the one that I've explained to you.

0:15:49 - 0:16:09     Text: The other one was called the continuous bag of words model, CBOW, and in this one, you predict the center word from a bag of context words.

0:16:09 - 0:16:20     Text: The skip-gram one is more natural in various ways, so it's normally the one that people have gravitated to in subsequent work.

0:16:20 - 0:16:34     Text: But then as to how you train this model, what I've presented so far is the naive softmax equation, which is a simple but relatively expensive training method.

0:16:34 - 0:16:42     Text: So that isn't really what they suggest using; in the paper, they suggest using a method that's called negative sampling.

0:16:42 - 0:16:49     Text: So an acronym you'll see sometimes is SGNS, which means skip-gram negative sampling.

0:16:49 - 0:16:59     Text: So let me just say a little bit about what this is, but actually doing the skip-gram model with negative sampling is part of homework two.

0:16:59 - 0:17:15     Text: So you'll get to know this model well. So the point is that if you use this naive softmax, you know, even though people commonly do use this naive softmax in various neural net models, that working out the denominator is pretty expensive.

0:17:15 - 0:17:23     Text: And that's because you have to iterate over every word in the vocabulary and work out these dot products.

0:17:23 - 0:17:35     Text: So if you have a hundred thousand word vocabulary, you have to do a hundred thousand dot products to work out the denominator and that seems a little bit of a shame.

0:17:35 - 0:17:52     Text: And so instead of that, the idea of negative sampling is that, instead of using this softmax, we're going to train binary logistic regression models for both the true pair of the center word

0:17:52 - 0:18:07     Text: and the context word, versus noise pairs where we keep the true center word and we just randomly sample words from the vocabulary.

0:18:07 - 0:18:22     Text: And as presented in the paper, the idea is like this. So overall, what we want to optimize is still an average of the loss for each particular center word.

0:18:22 - 0:18:28     Text: But when we're working out the loss for each particular center word and each particular window,

0:18:28 - 0:18:40     Text: we're going to take the dot product as before of the center word and the outside word.

0:18:40 - 0:18:54     Text: And that's sort of the main quantity. But now, instead of using that inside the softmax, we're going to put it through the logistic function, which is also often called the sigmoid function; the name logistic is more precise.

0:18:54 - 0:19:05     Text: So that's this function here. So the logistic function is a handy function that will map any real number to a probability between zero and one open interval.

0:19:05 - 0:19:13     Text: So basically if the dot product is large, the logistic of the dot product will be virtually one.

0:19:13 - 0:19:33     Text: Okay, so we want this to be large. And then what we'd like is, on average, we'd like the dot product between the center word and words that we just chose randomly, i.e. they most likely didn't actually occur in the context of the center word, to be small.

0:19:33 - 0:19:52     Text: And there's just one little trick of how this is done, which is that this sigmoid function is symmetric. And so if we want this probability to be small, we can take the negative of the dot product.

0:19:52 - 0:20:05     Text: We want it to be over here, i.e. the dot product of the random word and the center word is a negative number, and so then we're going to take the negation of that.

0:20:05 - 0:20:09     Text: And then again, once we put that through the sigmoid, we'd like a big number.

0:20:09 - 0:20:26     Text: Okay, so the way they're presenting things, they're actually maximizing this quantity. But if I go back to making it a bit more similar to the way we had written things, where we worked with minimizing the negative log likelihood,

0:20:26 - 0:20:43     Text: it looks like this. So we're taking the negative log likelihood of the sigmoid of the dot product. And again, for the negative samples, we're using the negated dot product put through the sigmoid.

0:20:43 - 0:20:57     Text: And then we're going to work out this quantity for a handful of randomly sampled negative words.

0:20:57 - 0:21:15     Text: And this loss function is going to be minimized, given this negation, by making these dot products large and these dot products small, meaning negative.
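Here is a minimal NumPy sketch of that per-window loss, J = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c), with v_c the center vector, u_o the true outside vector, and neg_U a stack of K sampled negative outside vectors (all hypothetical inputs, not the assignment's variable names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_c, u_o, neg_U):
    pos = -np.log(sigmoid(u_o @ v_c))             # push the true pair's dot product up
    neg = -np.sum(np.log(sigmoid(-neg_U @ v_c)))  # push sampled pairs' dot products down
    return pos + neg
```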

0:21:15 - 0:21:28     Text: And then there's just one other trick that they use, actually there's more than one other trick that's used in the word2vec paper to get it to perform well, but I'll only mention one of their other tricks here.

0:21:28 - 0:21:49     Text: When they sample the words, they don't simply sample the words based on their probability of occurrence in the corpus, or uniformly. What they do is start with what we call the unigram distribution of words, so that is how often words actually occur in our big corpus.

0:21:49 - 0:22:01     Text: So if you have a billion word corpus and a particular word occurred 90 times in it, you're taking 90 divided by a billion. And so that's the unigram probability of the word.

0:22:01 - 0:22:24     Text: But what they then do is take that to the three-quarters power, which is then renormalized with a Z to make a probability distribution, kind of like we saw last time with the softmax. Taking the three-quarters power has the effect of dampening the difference between common and rare words.

0:22:24 - 0:22:37     Text: So that less frequent words are sampled somewhat more often, but still not nearly as much as they would be if you just use something like a uniform distribution over the vocabulary.
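A small sketch of that sampling distribution, using made-up counts:

```python
import numpy as np

def negative_sampling_probs(counts):
    p = np.asarray(counts, dtype=float) ** 0.75  # unigram counts to the 3/4 power
    return p / p.sum()                           # the normalizer plays the role of Z

counts = np.array([900_000, 50_000, 90])         # hypothetical common, medium, rare word counts
print(negative_sampling_probs(counts))           # rare words get boosted relative to raw frequency
```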

0:22:37 - 0:22:57     Text: So that's basically everything to say about the basics of how we have this very simple neural network algorithm, word2vec, and how we can train it and learn word vectors.

0:22:57 - 0:23:11     Text: So the next bit, what I want to do is step back a bit and say, well, here's an algorithm that I've shown you that works great. What else could we have done and what can we say about that.

0:23:11 - 0:23:39     Text: The first thing that you might think about is, well, here's this funny iterative algorithm to give you word vectors. You know, if we have a lot of words in a corpus, seems like a more obvious thing that we could do is just look at the counts of how words occur with each other and build a matrix of counts.

0:23:39 - 0:23:51     Text: So here's the idea of a co-occurrence matrix. So I've got a teeny little corpus. I like deep learning. I like NLP. I enjoy flying.

0:23:51 - 0:24:13     Text: I can define a window size. I made my window simply size one to make it easy to fill in my matrix, and it's symmetric, just like our word2vec algorithm. And so then the counts in these cells are simply how often things co-occur in the window of size one.

0:24:13 - 0:24:27     Text: 'I like' occurs twice, so we get twos in these cells because it's symmetric. 'Deep learning' occurs once, so we get one here, and lots of other things occur zero times.

0:24:27 - 0:24:51     Text: And we build up a co-occurrence matrix like this. And well, these actually give us a representation of words as co-occurrence vectors. So I can take the word 'I', with either a row or column vector since it's symmetric, and say, OK, my representation of the word 'I' is this row vector.
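Here is a minimal sketch of building such a window-size-1 co-occurrence matrix for the toy corpus (ignoring the sentence-final periods that appear in the slide's matrix):

```python
from collections import defaultdict

corpus = ["I like deep learning".split(),
          "I like NLP".split(),
          "I enjoy flying".split()]

counts = defaultdict(int)
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):              # window of size one on each side
            if 0 <= j < len(sent):
                counts[(w, sent[j])] += 1     # symmetric by construction

vocab = sorted({w for sent in corpus for w in sent})
matrix = [[counts[(r, c)] for c in vocab] for r in vocab]
```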

0:24:51 - 0:25:15     Text: That is a representation of the word 'I', and I think you can maybe convince yourself that, to the extent that words have similar meaning and usage, you'd sort of expect them to have somewhat similar vectors. Right. So if I had the word 'you' as well in a larger corpus, you might expect 'I' and 'you' to have similar vectors, because you'd see 'I like', 'you like', 'I enjoy', 'you enjoy'.

0:25:15 - 0:25:23     Text: You'd see the same kinds of possibilities. Hey, Chris, could you look at answering some questions? Sure.

0:25:23 - 0:25:29     Text: Alright, so we got some questions from the negative sampling slides.

0:25:29 - 0:25:41     Text: In particular, can you give some intuition for negative sampling? What is the negative sampling doing? And why do we only take one positive example? Those are two questions.

0:25:41 - 0:26:00     Text: Okay, that's a good question. Okay, I'll try and give more intuition. So the idea is to work out something like what the softmax did in a much more efficient way.

0:26:00 - 0:26:14     Text: So in the softmax, well, you wanted to give high probability, in predicting the context, to a context word that actually did appear with the center word.

0:26:14 - 0:26:34     Text: And well, the way you do that is by having the dot product between those two words be as big as possible. But you know, it's more than that, because in the denominator, you're also working out the dot product with every other word in the vocabulary.

0:26:34 - 0:26:52     Text: So as well as wanting the dot product with the actual word that you see in the context to be big, you maximize your likelihood by making the dot products of other words that weren't in the context smaller because that's shrinking your denominator.

0:26:52 - 0:27:14     Text: And therefore, you've got a bigger number coming out, and you're maximizing the likelihood. So even for the softmax, the general thing that you want to do to maximize it is have the dot product with words actually in the context be big, and the dot product with words not in the context be small, to the extent possible.

0:27:14 - 0:27:24     Text: And obviously you have to average this is best you can over all kinds of different contexts, because sometimes different words appear in different contexts, obviously.

0:27:24 - 0:27:34     Text: So negative sampling is a way of trying to maximize the same objective.

0:27:34 - 0:27:45     Text: You know, you only have one positive term because you're actually wanting to use the actual data. So you're not wanting to invent data.

0:27:45 - 0:27:54     Text: So for working out the entire J we do do work this quantity out for every center word and every context word.

0:27:54 - 0:28:05     Text: So you know we are iterating over the different words in the context window, and then we're moving through positions in the corpus. So we're doing different v_c's, so gradually we do this.

0:28:05 - 0:28:17     Text: But for one particular center word and one particular context word, we only have one real piece of data that's positive. So that's all we use, because we don't know what other words

0:28:17 - 0:28:30     Text: should be counted as positive words. Now for the negative words, you could just sample one negative word and that would probably work.

0:28:30 - 0:28:52     Text: But to get a slightly better, more stable sense of, okay, we'd like in general to have other words have low probability, it seems like you might be able to get better, more stable results if you instead say let's have 10 or 15 sampled negative words, and indeed that's been found to be true.

0:28:52 - 0:29:06     Text: And for the negative words, well, it's easy to sample any number of random words you want. And at that point it's kind of a probabilistic argument. The words that you're sampling might not be actually bad words to appear in the context.

0:29:06 - 0:29:22     Text: They might actually be other words that are in the context, but 99.9% of the time they will be unlikely words to occur in the context. And so they're good ones to use. And yes, you only sample 10 or 15 of them.

0:29:22 - 0:29:37     Text: And it's enough to make progress, because the center word is going to turn up on other occasions, and when it does, you'll sample different words over here, so that you gradually sample different parts of the space and start to learn.

0:29:37 - 0:29:55     Text: And it gives a representation of words as co-occurrence vectors. And just one more note on that. I mean, there are actually two ways that people have commonly made these co-occurrence matrices.

0:29:55 - 0:30:11     Text: The first corresponds to what we've seen already: you use a window around the word, which is similar to word2vec, and that allows you to capture some locality and some of the sort of syntactic and semantic proximity that's more fine-grained.

0:30:11 - 0:30:25     Text: The other way these co-occurrence matrices are often made is that normally documents have some structure, whether it's paragraphs or just actual web-page-sized documents.

0:30:25 - 0:30:40     Text: So you can just make your window size a paragraph or a whole web page and count co-occurrences in those. And this is the kind of method that's often been used in information retrieval, in methods like latent semantic analysis.

0:30:40 - 0:30:57     Text: Okay, so the question then is are these kind of count word vectors good things to use. Well, people have used them. They're not terrible. But they have certain problems.

0:30:57 - 0:31:20     Text: The kind of problems that they have: well, firstly, they're huge, though very sparse. So this is back to what I said before: if we have a vocabulary of half a million words, then we have a half a million dimensional vector for each word, which is much, much bigger than the word vectors that we typically use.

0:31:20 - 0:31:39     Text: And it also means that because we have these very high dimensional vectors that we have a lot of sparsity and a lot of randomness. So the results that you get tend to be noisier and less robust depending on what particular stuff was in the corpus.

0:31:39 - 0:32:01     Text: So in general, people have found that you can get much better results by working with low dimensional vectors. So then the idea is we can store the most of the important information about the distribution of words in the context of other words in a fixed small number of dimensions, giving a dense vector.

0:32:01 - 0:32:20     Text: And in practice, the dimensionality of the vectors that are used is normally somewhere between 25 and 1000. And so at that point, we need some way to reduce the dimensionality of our count co-occurrence vectors.

0:32:20 - 0:32:49     Text: So if you have a good memory from a linear algebra class, you hopefully saw singular value decomposition, which has various mathematical properties that I'm not going to talk about here. It gives you an optimal way, under a certain definition of optimality, of producing a reduced dimensionality

0:32:49 - 0:33:18     Text: pair of matrices that maximally well lets you recover the original matrix. The idea of the singular value decomposition is you can take any matrix, such as our count matrix, and decompose it into three matrices: U, a diagonal matrix sigma, and a V transpose matrix.

0:33:18 - 0:33:36     Text: And this works for any shape of matrix. Now in these matrices, some parts are never used, because since this matrix is rectangular, there's nothing over here, and so this part of the V transpose matrix gets ignored.

0:33:36 - 0:33:53     Text: If you're wanting to get smaller dimensional representations, what you do is take advantage of the fact that the singular values inside the diagonal sigma matrix are ordered from largest down to smallest.

0:33:53 - 0:34:10     Text: So what we can do is just delete out more of the matrix, deleting out the smallest singular values, which effectively means that in this product, some of U and some of V transpose is also not used.

0:34:10 - 0:34:34     Text: And as a result of that, we're getting lower dimensional representations for our words, if we're wanting to have word vectors, which still do as good as possible a job, within the given dimensionality, of enabling you to recover the original co-occurrence matrix.

0:34:34 - 0:34:57     Text: So from a linear algebra background, this is the obvious thing to use. So how does that work? Well, if you just build a raw count co occurrence matrix and run SVD on that and try and use those as word vectors, it actually works poorly.

0:34:57 - 0:35:15     Text: It works poorly because, if you get into the mathematical assumptions of SVD, you're expecting to have normally distributed errors, and what you get with word counts looks not at all

0:35:15 - 0:35:30     Text: like a normal distribution, because you have exceedingly common words, like 'that' and 'then', and you have a very large number of rare words. So that doesn't work very well, but you can actually get something that works a lot better

0:35:30 - 0:35:33     Text: if you scale the counts in the cells.

0:35:33 - 0:35:45     Text: To deal with this problem of extremely frequent words, there are some things we can do. We could just take the log of the raw counts. We could cap the maximum count.

0:35:45 - 0:35:59     Text: We could throw away the function words. And with any of these kinds of ideas, you then build a co-occurrence matrix from which you get more useful word vectors by running something like SVD.
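A minimal sketch of that recipe, capping and log-scaling the counts before a truncated SVD (the exact scalings explored in the literature differ; the array names here are my own):

```python
import numpy as np

def svd_word_vectors(cooc, dim=50, cap=100):
    X = np.log1p(np.minimum(cooc, cap))               # cap very large counts, then log-scale
    U, S, Vt = np.linalg.svd(X, full_matrices=False)  # singular values in S are sorted largest first
    return U[:, :dim] * S[:dim]                       # keep only the top `dim` singular directions
```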

0:35:59 - 0:36:18     Text: These kinds of models were explored in the 1990s and the 2000s, and in particular Doug Rohde explored a number of these ideas of how to improve the co-occurrence matrix in a model that he built that was called COALS.

0:36:18 - 0:36:37     Text: And actually, in his COALS model, he observed that you could get the same kind of linear meaning components, semantic components, that we saw last time when talking about analogies.

0:36:37 - 0:36:50     Text: So for example, this is a figure from his paper and you can see that we seem to have a meaning component going from a verb to the person who does the verb.

0:36:50 - 0:37:04     Text: So drive to driver, swim to swimmer, teach to teacher, marry to priest. And these vector components are not perfectly, but roughly, parallel and roughly the same size.

0:37:04 - 0:37:18     Text: And so we have a meaning component there that we could add on to another word, just like we did previously for analogies. We could say drive is to driver as marry is to what.

0:37:18 - 0:37:33     Text: And we'd add on this green vector component, which is roughly the same as this one, and we'd say, oh, priest. So this space could actually get some word vector analogies right as well.

0:37:33 - 0:37:45     Text: And so that seemed really interesting to us around the time word2vec came out: wanting to understand better what the iterative updating algorithm of word2vec did.

0:37:45 - 0:37:53     Text: And how it related to these more linear algebra based methods that had been explored in the couple of decades previously.

0:37:53 - 0:38:07     Text: And so for the next bit, I want to tell you a little bit about the GloVe algorithm, which was an algorithm for word vectors that was made by Jeffrey Pennington, Richard Socher and me in 2014.

0:38:07 - 0:38:26     Text: And so the starting point of this was to try to connect together the linear algebra based methods on co-occurrence matrices, like LSA and COALS, with models like skip-gram and CBOW and their other friends, which were iterative neural updating algorithms.

0:38:26 - 0:38:37     Text: So on the one hand, you know, the linear algebra methods actually seemed like they had advantages for fast training and efficient usage of statistics.

0:38:37 - 0:38:44     Text: But although there had been work on capturing word similarities with them, by and large

0:38:44 - 0:39:06     Text: the results weren't as good, perhaps because of the disproportionate importance given to large counts in the main. Conversely, for the neural models, it seems like if you're just doing these gradient updates on windows, you're somehow inefficiently using statistics, compared to working with a co-occurrence matrix.

0:39:06 - 0:39:14     Text: On the other hand, it's actually easier to scale to a very large corpus by trading time for space.

0:39:14 - 0:39:33     Text: And at that time, it seemed like the neural methods just worked better for people, in that they generated improved performance on many tasks, not just on word similarity, and that they could capture complex patterns, such as the analogies, that went beyond word similarity.

0:39:33 - 0:39:45     Text: And so what we wanted to do was understand a bit more about what properties you need to have these analogies work out, as I showed last time.

0:39:45 - 0:40:14     Text: And so what we realized was that if you'd like to have these sort of vector subtractions and additions work for an analogy, the property that you want is for meaning components, where a meaning component is something like going from male to female, queen to king, or going from

0:40:14 - 0:40:26     Text: a truck to a truck driver, that those meaning components should be represented as ratios of co-occurrence probabilities.

0:40:26 - 0:40:29     Text: So here's an example that shows that.

0:40:29 - 0:40:51     Text: Okay, so suppose the meaning component that we want to get out is the spectrum from solid to gas, as in physics. Well, you'd think that you can get at the solid part of it, perhaps by asking whether the word co-occurs with ice, and the word solid does occur with ice.

0:40:51 - 0:41:09     Text: So it looks hopeful and gas doesn't occur with ice much so that looks hopeful, but the problem is the word water will also occur a lot with ice and if you just take some other random word like the word random, it probably doesn't occur with ice much.

0:41:09 - 0:41:21     Text: So if you look at words co-occurring with steam, solid won't occur with steam much, but gas will, and water will again, and the random word will be small.

0:41:21 - 0:41:46     Text: So to get out the meaning component we want, going from gas to solid, what's actually really useful is to look at the ratio of these co-occurrence probabilities, because then we get a spectrum from large to small between solid and gas, whereas for water and a random word, it basically cancels out and gives you one.

0:41:46 - 0:42:15     Text: I just wrote these numbers in, but if you count them up in a large corpus, this is basically what you get. So here are actual co-occurrence probabilities, and for water and my random word, which was fashion here, these ratios are approximately one, whereas the ratio of the probability of co-occurrence of solid with ice versus steam is about 10, and for gas, it's about a tenth.

0:42:15 - 0:42:31     Text: So how can we capture these ratios of co-occurrence probabilities as linear meaning components, so that in our word vector space, we can just add and subtract linear meaning components?

0:42:31 - 0:42:48     Text: Well, it seems like the way we can achieve that is if we build a log-bilinear model, so that the dot product between two word vectors attempts to approximate the log of the probability of co-occurrence.

0:42:48 - 0:43:07     Text: So if you do that, you then get the property that the difference between two vectors, in its similarity to another word, corresponds to the log of the probability ratio shown on the previous slide.

0:43:07 - 0:43:29     Text: So the GloVe model wanted to try and unify the thinking between the co-occurrence matrix models and the neural models by being in some way similar to a neural model, but actually calculated on top of a co-occurrence count matrix.

0:43:29 - 0:43:42     Text: So we had an explicit loss function, and our explicit loss function is that we wanted the dot product to be similar to the log of the co-occurrence count.

0:43:42 - 0:44:06     Text: And we actually added in some bias terms here, but I'll ignore those for the moment. And we wanted to not have very common words dominate, and so we capped the effect of high word counts using this f function that's shown here. And then we could optimize this J function directly on the co-occurrence count matrix.

0:44:06 - 0:44:12     Text: And that gave us fast training scalable to huge corpora.
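As a sketch of that objective, J = sum_ij f(X_ij) (w_i . w~_j + b_i + b~_j - log X_ij)^2, here is one way it could be written in NumPy (the matrix and bias names are my own; the capped weighting with x_max and a 3/4 exponent follows the choices reported in the GloVe paper):

```python
import numpy as np

def glove_weight(x, x_max=100, alpha=0.75):
    # f: grows with the count, then goes flat once counts exceed x_max
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    i, j = np.nonzero(X)                        # only nonzero co-occurrence counts contribute
    diff = (W[i] * W_ctx[j]).sum(axis=1) + b[i] + b_ctx[j] - np.log(X[i, j])
    return np.sum(glove_weight(X[i, j]) * diff ** 2)
```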

0:44:12 - 0:44:29     Text: And so this algorithm worked very well. So if you run this algorithm and ask what are the nearest words to frog, you get frogs, toad, and then you get some complicated words, but it turns out they are all frogs, until you get down to lizard.

0:44:29 - 0:44:37     Text: And so this is a litoria, that lovely tree frog there. And so this actually seemed to work out pretty well.

0:44:37 - 0:44:46     Text: How well did it work out? To discuss that a bit more, I now want to say something about how we evaluate word vectors.

0:44:46 - 0:44:50     Text: Are we good for up to there for questions.

0:44:50 - 0:45:11     Text: We've got some questions. What do you mean by an inefficient use of statistics as a con for skip-gram? Well, what I mean is that, you know, for word2vec, you're just looking at one center word at a time and generating a few negative samples.

0:45:11 - 0:45:28     Text: And so it sort of seems like you're doing something less precise there, whereas if you're running an optimization algorithm on the whole matrix at once, well, you actually know everything about the matrix at once.

0:45:28 - 0:45:42     Text: So rather than just looking at what other words occurred in this one context of the center word, you've got the entire vector of co-occurrence counts for the center word and every other word.

0:45:42 - 0:45:51     Text: And so therefore you can much more efficiently and less noisily work out how to minimize your loss.

0:45:51 - 0:46:06     Text: So I'm going to say, I'll go on. Okay, so I've sort of said, look at these word vectors. They're great. And I sort of showed you a few things at the end of the last class, which argued, hey, these are great.

0:46:06 - 0:46:27     Text: They work out these analogies, they show similarity and things like this. We want to make this a bit more precise. And indeed for natural language processing as in other areas of machine learning, a big part of what people are doing is working out good ways to evaluate knowledge that things have.

0:46:27 - 0:46:50     Text: So how can we really evaluate word vectors. So in general, for NLP evaluation, people talk about two ways of evaluation intrinsic and extrinsic. So an intrinsic evaluation means that you evaluate directly on the specific or intermediate subtasks that you've been working on.

0:46:50 - 0:47:04     Text: A measure where I can directly score how good my word vectors are. And normally intrinsic evaluations are fast to compute. They helped you to understand the component you've been working on.

0:47:04 - 0:47:17     Text: But often, simply trying to optimize that component may or may not have a very big good effect on the overall system that you're trying to build.

0:47:17 - 0:47:34     Text: So people have also been very interested in extrinsic evaluations. So an extrinsic evaluation is that you take some real task of interest to human beings, whether that's a web search or machine translation or something like that.

0:47:34 - 0:47:48     Text: And you say your goal is to actually improve performance on that task. Well, that's a real proof that this is doing something useful. So in some ways, it's just clearly better.

0:47:48 - 0:48:17     Text: But on the other hand, it also has some disadvantages. It takes a lot longer to evaluate on an extrinsic task because it's a much bigger system. And sometimes, you know, when you change things, it's unclear whether the fact that the numbers went down was because you now have worse word vectors or whether it's just somehow the other components of the system.

0:48:17 - 0:48:30     Text: It might just be that the other components worked better with your old word vectors, and if you changed the other components as well, things would get better again. So in some ways, it can sometimes be muddier to see if you're making progress.

0:48:30 - 0:48:34     Text: But I'll touch on both of these methods here.

0:48:34 - 0:48:50     Text: So for intrinsic evaluation of word vectors, one way, which we mentioned last time was this word vector analogies. So we could simply give our models a big collection of word vector analogy problems.

0:48:50 - 0:49:06     Text: So we could say man is to woman as king is to what, and ask the model to find the word that is closest using that sort of word analogy computation, and hope that what comes out there is queen.

0:49:06 - 0:49:14     Text: And so that's something people have done, and they've worked out an accuracy score of how often you are right.

0:49:14 - 0:49:40     Text: At this point, I should just mention one little trick of these word vector analogies that everyone uses, but not everyone talks about up front. I mean, there's a little trick, which you can find in the Gensim code if you look at it, that when it does man is to woman as king is to what,

0:49:40 - 0:49:54     Text: something that could often happen is that actually, once you do your pluses and your minuses, the word that will actually be closest is still king.

0:49:54 - 0:50:11     Text: So the way people always do this is that they don't allow one of the three input words in the selection process. So you're choosing the nearest word that isn't one of the input words.
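A minimal sketch of that evaluation procedure (W is a hypothetical matrix of unit-normalized word vectors; word2id and id2word are hypothetical lookup tables):

```python
import numpy as np

def analogy(a, b, c, W, word2id, id2word):
    """a is to b as c is to ?, excluding the three input words from the answer."""
    target = W[word2id[b]] - W[word2id[a]] + W[word2id[c]]
    sims = W @ (target / np.linalg.norm(target))  # cosine similarity with every word
    for w in (a, b, c):
        sims[word2id[w]] = -np.inf                # never return one of the inputs
    return id2word[int(np.argmax(sims))]
```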

0:50:11 - 0:50:26     Text: So this slide here is showing results from the GloVe vectors. The GloVe vectors have a strong linear component property, just like I showed before for COALS.

0:50:26 - 0:50:40     Text: So this is for the male female dimension. And so because of this, you'd expect in a lot of cases that word analogies would work because I can take the vector difference of man and woman.

0:50:40 - 0:51:02     Text: And then if I add that vector difference on to brother, I expect to get to sister, and king to queen, and so on for any of these examples. But of course they may not always work, because if I start from emperor, it's sort of on more of a lean, and so it might turn out that I get countess or duchess coming out instead.

0:51:02 - 0:51:12     Text: So you can do this for various different relations, a different semantic relation. So these sort of word vectors actually learn quite a bit of just world knowledge.

0:51:12 - 0:51:23     Text: So here's company to CEO, or rather, this is company to CEO around 2010 to 2014, when the data for the word vectors was taken.

0:51:23 - 0:51:35     Text: And as well as semantic things, or pragmatic things like this, they also learn syntactic things. So here are vectors for positive, comparative and superlative forms of adjectives.

0:51:35 - 0:51:40     Text: And you can see those also move in roughly linear components.

0:51:40 - 0:51:55     Text: So the word2vec people built a data set of analogies, so you could evaluate different models on the accuracy of their analogies. And so here's how you can do this, and this gives some numbers.

0:51:55 - 0:52:01     Text: So there are semantic and syntactic analogies. I'll just look at the totals.

0:52:01 - 0:52:15     Text: Okay, so what I said before is, if you just use unscaled co-occurrence counts and pass them through an SVD, things work terribly, and you see that there you only get 7.3.

0:52:15 - 0:52:33     Text: And as I also pointed out, if you do some scaling, you can actually get an SVD of a scaled count matrix to work reasonably well. So this SVD-L is similar to the COALS model, and now we're getting up to 60.1, which actually isn't a bad score.

0:52:33 - 0:52:47     Text: So we can actually do a decent job without a neural network. And then here are the two variants of the word2vec model, and here are our results from the GloVe model.

0:52:47 - 0:53:00     Text: And of course, at the time, 2014, we took this as absolute proof that our model was better and our more efficient use of statistics was really working in our favor.

0:53:00 - 0:53:11     Text: But with seven years of retrospect, I think that's kind of not really true. It turns out, I think, the main part of why we scored better is that we actually had better data.

0:53:11 - 0:53:29     Text: And so there's a bit of evidence about that on this next slide here. So this looks at the semantic, syntactic, and overall performance on word analogies of GloVe models that were trained on different subsets of data.

0:53:29 - 0:53:45     Text: And in particular, the two on the left are trained on Wikipedia. And you can see that training on Wikipedia makes you do really well on semantic analogies, which maybe makes sense because Wikipedia just tells you a lot of semantic facts.

0:53:45 - 0:53:54     Text: I mean, that's kind of what encyclopedias do. And so one of the big advantages we actually had was

0:53:54 - 0:54:06     Text: that the GloVe model was partly trained on Wikipedia as well as other text, whereas the word2vec model that was released was trained exclusively on Google News, so newswire data.

0:54:06 - 0:54:22     Text: And if you only train on a smaller amount of newswire data, you can see that for the semantics, it's just not as good as even a one-quarter-the-size amount of Wikipedia data.

0:54:22 - 0:54:40     Text: But if you get a lot of data, you can compensate for that. So here on the right end, you then have Common Crawl web data, and once there's a lot of web data, so now 42 billion words, you're then starting to get good scores again on the semantic side.

0:54:40 - 0:54:54     Text: The figure on the right then shows how well you do as you increase the vector dimension. And so what you can see there is, you know, 25 dimensional vectors aren't very good.

0:54:54 - 0:55:17     Text: The 100 dimensional vectors, which is what I used when I showed my examples in class here, work reasonably well, but you still get significant gains for 200 and somewhat for 300 dimensions.

0:55:17 - 0:55:25     Text: So from around 2013 to 2015, everyone sort of gravitated to the idea that 300 dimensional vectors is the sweet spot.

0:55:25 - 0:55:40     Text: So most frequently, if you look through the best known sets of word vectors, which include the word2vec vectors and the GloVe vectors, usually what you get is 300 dimensional word vectors.

0:55:40 - 0:55:53     Text: So that's not the only intrinsic evaluation you can do. Another intrinsic evaluation you can do is see how these models model human judgments of words similarity.

0:55:53 - 0:56:16     Text: So psychologists for several decades have actually taken human judgments of words similarity where literally you're asking people for pairs of words like professor and doctor to give them a similarity score that's sort of being measured as some continuous quantity giving you a score between say zero and 10.

0:56:16 - 0:56:38     Text: And so there are human judgments, which are then averaged over multiple human judgments, as to how similar different words are. So tiger and cat is pretty similar, computer and internet is pretty similar, plane and car is less similar, stock and CD aren't very similar at all, and stock and jaguar even less similar.

0:56:38 - 0:56:54     Text: So we could then ask, for our models, do they have the same similarity judgments? And in particular, we can measure a correlation coefficient of whether they give the same ordering of similarity judgments.

0:56:54 - 0:57:19     Text: And so then we can get data for that. And so there are various different data sets of word similarities, and we can score different models as to how well they do on similarities. And again, you see here that plain SVD works comparatively better here for similarities than it did for analogies; it's not great, but it's now not completely terrible.
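A sketch of how such a similarity evaluation could be scored, comparing human ratings to model cosine similarities with a rank correlation (pairs and vec are hypothetical stand-ins for a dataset like WordSim-353 and a word-vector lookup):

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_eval(pairs, vec):
    human, model = [], []
    for w1, w2, score in pairs:                  # pairs of words plus a human similarity score
        v1, v2 = vec(w1), vec(w2)
        model.append(v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        human.append(score)
    rho, _ = spearmanr(human, model)             # rank correlation between the two orderings
    return rho
```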

0:57:19 - 0:57:34     Text: That's because we no longer need that linear property. But again, scaled SVDs work a lot better, word2vec works a bit better than that, and we got some of the same kind of minor advantages from the GloVe model.

0:57:34 - 0:57:45     Text: Chris, sorry to interrupt, a lot of the students are asking if you could re-explain the objective function for the GloVe model and also what log-bilinear means.

0:57:45 - 0:57:50     Text: Okay, sure.

0:57:50 - 0:57:57     Text: Okay, here is my here is my objective function.

0:57:57 - 0:58:15     Text: So anyway, if I go so one slide before that, right, so the property that we want is that we want the dot product to represent the log probability of co occurrence.

0:58:15 - 0:58:31     Text: And that then gives me this tricky term, log-bilinear. So the 'bi' is that there's the w_i and the w_j, so that there are sort of two linear things, and it's linear in each one of them.

0:58:31 - 0:58:48     Text: So rather than having sort of an 'ax', where you just have something that's linear in x and a is a constant, it's bilinear because we have the w_i and the w_j, and it's linear in both of them.

0:58:48 - 0:58:57     Text: And that's then related to the log of a probability, and so that gives us the log by linear model.

0:58:57 - 0:59:13     Text: And so since we'd like these things to be equal, what we're doing here, if you ignore these two center terms, is that we're wanting to take the difference between these two,

0:59:13 - 0:59:28     Text: and we want that to be as small as possible, so we're taking this difference and squaring it so it's always positive, and we want that squared term to be as small as possible.

0:59:28 - 0:59:51     Text: And you can basically stop there, but the other bit that's in here is a lot of the time when you're building models, rather than simply having sort of an ax model, it seems useful to have a bias term, which can move things up and down for the word in general.

0:59:51 - 1:00:09     Text: And so we added bias terms into the model, so that there's a bias term for both words. So if in general probabilities are high for a certain word, this bias term can model that, and for the other word, that bias term can model it, okay.

1:00:09 - 1:00:24     Text: And now I'll pop back. Ah, actually, I just saw someone asked why we're multiplying by the f of X_ij; sorry, I did skip that last term.

1:00:24 - 1:00:53     Text: Okay, why are we multiplying by this f of X_ij? So this last bit was to scale things depending on the frequency of a word, because you want to pay more attention to words that are more common, or word pairs that are more common, because, you know, if you think about it,

1:00:53 - 1:01:12     Text: if things have a co-occurrence count of 50 versus three, you want to do a better job at modeling the co-occurrence of the things that occurred together 50 times.

1:01:12 - 1:01:39     Text: So you want to take into account the count of co-occurrences, but then the argument is that that actually leads you astray when you have extremely common words like function words. And so effectively you pay more attention to words that co-occurred together more often, up until a certain point, and then the curve just goes flat, so it doesn't matter if it was an extremely, extremely common word.

1:01:39 - 1:01:58     Text: So then for extrinsic word vector evaluation, so at this point, you're now wanting to sort of say well, can we embed our word vectors in some end user task and do they help.

1:01:58 - 1:02:27     Text: And do different word vectors work better or worse than other word vectors? So this is something that we'll see a lot of later in the class. In particular, when you get on to doing assignment three, you get to build dependency parsers, and you can then use word vectors in the dependency parser and see how much they help. We don't actually make you test out different sets of word vectors, but you could.

1:02:27 - 1:02:46     Text: So here's just one example of this to give you a sense, so the task of named entity recognition is going through a piece of text and identifying mentions of a person name or an organization name like a company or a location and so.

1:02:46 - 1:03:13     Text: If you have good word vectors, do they help you do named entity recognition better? And the answer to that is yes. So if one starts off with a model that simply has discrete features, so it uses word identity as features, you can build a pretty good named entity model doing that, but if you add into it word vectors, you get a better representation of the meaning of words,

1:03:13 - 1:03:25     Text: so you can have the numbers go up quite a bit, and then you can compare different models to see how much gain they give you in terms of this extrinsic task.

1:03:25 - 1:03:40     Text: So skipping ahead, this was a question that I was asked after class, which was about word senses, because so far we've had just one vector for each word.

1:03:40 - 1:04:08     Text: So for one particular string, we've got some string, 'house', and we're going to say for each of those strings there's a word vector. And if you think about it a bit more, that seems very weird, because actually most words, especially common words, and especially words that have existed for a long time, actually have many meanings, which are very different.

1:04:08 - 1:04:22     Text: So how could that be captured if you only have one word vector for the word because you can't actually capture the fact that you've got different meanings for the word because your meaning for the word is just one point in space one vector.

1:04:22 - 1:04:38     Text: And so as an example of that, here's the word pike.

1:04:38 - 1:05:00     Text: So pause for a minute and think what word meanings the word pike has. And it actually turns out, you know, it has a lot of different meanings. So perhaps the most basic meaning, if you did fantasy games or something, is medieval weapons.

1:05:00 - 1:05:09     Text: And there's a kind of fish that has a similar elongated shape; that's a pike.

1:05:09 - 1:05:26     Text: It was used for railroad lines, maybe that usage isn't used much anymore, but it certainly still survives in referring to roads, so this is like when you have turnpikes. We have expressions where pike means the future, like coming down the pike.

1:05:26 - 1:05:31     Text: And then there's a position in diving that divers do, a pike.

1:05:31 - 1:05:39     Text: Those are all noun uses. There are also verbal uses, so you can pike somebody with your pike.

1:05:39 - 1:05:50     Text: You know, different usages might have different currency. In Australia you can also use pike to mean that you pull out of doing something, like I reckon he's going to pike.

1:05:50 - 1:05:58     Text: And I'm not sure that usage is used in America. But lots of meanings; and actually, for words that are commoner, if you start thinking of words like line or field,

1:05:58 - 1:06:06     Text: I mean, they just have even more meanings than this. So what are we actually doing with just one vector for a word?

1:06:06 - 1:06:29     Text: Well, one way you could go is to say, okay, up until now what we've done is crazy: pike and other words have all of these different meanings, so maybe what we should do is have different word vectors for the different meanings of pike. So we'd have one word vector for the medieval pointy weapon,

1:06:29 - 1:06:39     Text: another word vector for the kind of fish, another word vector for the kind of road, so that they'd then be word sense vectors.

1:06:39 - 1:06:50     Text: And you can do that. I mean, actually we were working on that in the early 2010s, actually even before word2vec came out.

1:06:50 - 1:07:14     Text: So this picture is a little bit small to see, but what we were doing was clustering instances of a word, so clustering the word tokens, hoping that clusters of similar tokens represented senses, and then for the clusters of word tokens,

1:07:14 - 1:07:34     Text: treating them like they were separate words and learning a word vector for each. And, you know, basically that actually works. So in green we have two senses for the word bank, and so there's one sense for the word bank that's over here, where it's close to words like banking, finance, transaction, laundering.

1:07:34 - 1:07:45     Text: And then we have another sense for the word bank over here, where it's close to words like plateau, boundary, gap, territory, which is the river bank sense of the word bank.
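As a rough sketch of the clustering idea just described (an illustration under the assumption that each occurrence of a word is represented by the average of its context words' vectors; this is not the exact pipeline from that early work):

    import numpy as np
    from sklearn.cluster import KMeans

    def split_into_pseudo_senses(token_context_vectors, n_senses=2):
        # token_context_vectors: (n_tokens, dim) array, one row per occurrence of the word,
        # e.g. the average of the surrounding words' vectors for that occurrence.
        # Tokens in cluster k can then be relabelled as a pseudo-word like bank_k,
        # and a separate vector learned for each pseudo-word.
        km = KMeans(n_clusters=n_senses, n_init=10, random_state=0)
        return km.fit_predict(token_context_vectors)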

1:07:45 - 1:07:49     Text: And then there's the word jaguar, which is in purple.

1:07:49 - 1:08:08     Text: Well, jaguar has a number of senses, and so we have those as well. So this sense down here is close to hunter, so that's the sort of big game animal sense of jaguar. Up the top here, it's being shown close to luxury and convertible, so that's the Jaguar car sense.

1:08:08 - 1:08:36     Text: Then jaguar here is near string, keyboard and words like that, so Jaguar is the name of a kind of keyboard. And then this final jaguar over here is close to software and Microsoft, and if you're old enough you'll remember that there was an old version of macOS that was called Jaguar, so that's then the computer sense. So basically this does work, and we can learn word vectors

1:08:36 - 1:08:55     Text: for the different senses of a word. But actually this isn't the way things have mostly gone in practice, and there are a couple of reasons for that. I mean, one is just simplicity: if you do this,

1:08:55 - 1:09:12     Text: then you've got something kind of complex, because you first of all have to learn word senses and then start learning word vectors in terms of the word senses.

1:09:12 - 1:09:27     Text: And the other reason is that, even though cutting words up into senses is commonly what's been used in natural language processing, it tends to be imperfect in its own way, because we're trying to take all the uses of the word pike and sort of cut them up into a number of discrete senses, where

1:09:27 - 1:09:55     Text: the differences are kind of overlapping, and it's often not clear which ones to count as distinct. So for example here, right, a railroad line and a type of road, sort of, that's the same sense of pike, it's just that they're different forms of transportation, and so, you know, this could be a type of transportation line and cover both of them. So it's always sort of very unclear how you cut word meaning into

1:09:55 - 1:10:02     Text: different senses, and indeed if you look at different dictionaries, everyone does it differently.

1:10:02 - 1:10:22     Text: So it actually turns out that in practice you can do rather well by simply having one word vector per word type. And what happens if you do that? Well, what you find

1:10:22 - 1:10:50     Text: is that what you learn as a word vector is what gets referred to in fancy talk as a superposition of the word vectors for the different senses of a word, where the word superposition means no more and no less than a weighted sum. So the vector that we learn for pike will be a weighted average of the

1:10:50 - 1:11:17     Text: vectors that you would have learned for the medieval weapon sense, plus the fish sense, plus the road sense, plus whatever other senses you have, where the weighting that's given to these different sense vectors corresponds to the frequencies of use of the different senses. So we end up with the vector for pike being a kind of average vector.
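Written out, the claim is that the learned vector is a frequency-weighted sum of the underlying sense vectors, roughly:

    $$
    v_{\text{pike}} = \alpha_1 v_{\text{pike}_1} + \alpha_2 v_{\text{pike}_2} + \alpha_3 v_{\text{pike}_3} + \cdots,
    \qquad
    \alpha_i = \frac{f_i}{f_1 + f_2 + f_3 + \cdots},
    $$

    where $f_i$ is the frequency of use of sense $i$.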

1:11:17 - 1:11:40     Text: So if you say, okay, you've just added up several different vectors into an average, you might think that that's kind of useless, because, you know, you've lost the real meanings of the word; you've just got some kind of funny average vector that's in between them.

1:11:40 - 1:12:08     Text: And yet it turns out that if you use this average vector in applications, it tends to sort of self-disambiguate, because if you ask, is the word pike similar to the word fish, well, part of this vector represents the fish sense of pike, and so in those components it will be kind of similar to the fish vector,

1:12:08 - 1:12:35     Text: so yes, you'll see substantial similarity. Whereas if there's another piece of text that says, you know, the men were armed with pikes and lances, or pikes and maces, or whatever other medieval weapons you remember, well, actually some of that meaning is in the pike vector as well, and so it will say, yeah, there's good similarity to

1:12:35 - 1:12:52     Text: mace and staff and words like that as well. And in fact, we can work out which sense of pike is intended by just sort of seeing which components are similar to other words that are used in the same context.
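A tiny numerical illustration of this self-disambiguation effect (the vectors here are made up purely for illustration):

    import numpy as np

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Toy, made-up sense vectors and a frequency-weighted average, as in the formula above.
    v_weapon = np.array([1.0, 0.0, 0.0])
    v_fish   = np.array([0.0, 1.0, 0.0])
    v_road   = np.array([0.0, 0.0, 1.0])
    v_pike   = 0.5 * v_weapon + 0.3 * v_fish + 0.2 * v_road

    v_mace   = np.array([0.9, 0.1, 0.0])  # a weapon-context word
    v_salmon = np.array([0.1, 0.9, 0.0])  # a fish-context word

    print(cos(v_pike, v_mace), cos(v_pike, v_salmon))
    # Both similarities are substantial, and which one is larger signals which sense is active.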

1:12:52 - 1:13:11     Text: Indeed, there's actually a much more surprising result than that, and this is a result that's due to Sanjeev Arora, Tengyu Ma, who is now on our Stanford faculty, and others in 2018, and that's the following result, which I'm not actually going to explain, but

1:13:11 - 1:13:34     Text: if you think that the vector for pike is just a sum of the vectors for the different senses, well, you'd think that it should be just completely impossible to reconstruct the sense vectors from the vector for the word type,

1:13:34 - 1:13:51     Text: because normally if I say I've got two numbers and the sum of them is 17, you just have no information as to what my two numbers are, right, you can't resolve it, and it's even worse if I tell you I've got three numbers and they sum to 17.

1:13:51 - 1:14:11     Text: But it turns out that when we have these high-dimensional vector spaces, things are so sparse in those high-dimensional vector spaces that you can use ideas from sparse coding to actually separate out the different senses, provided they're relatively common.

1:14:11 - 1:14:29     Text: So they show in their paper that you can start with the vector of, say, pike and actually separate out components of that vector that correspond to different senses of the word pike. And so here's an example at the bottom of this slide, which is for the word tie, where they

1:14:29 - 1:14:50     Text: separate out that vector into five different senses. And so one sense is close to the words trousers, blouse, waistcoat, and this is the sort of clothing sense of tie. Another sense is close to wires, cables, wiring, electrical, so that's the sense of tie used in electrical stuff.

1:14:50 - 1:15:02     Text: Then we have sort of scoreline, goalless, equalizer, so this is the sporting game sense of tie; this one also seems to, in a different way, evoke the sporting game sense of tie.

1:15:02 - 1:15:20     Text: And then there's finally this one here; maybe my music knowledge is just really bad, but maybe it's because you get ties in music, when you tie notes together, I guess. So you get these different senses out of it.
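As a hedged sketch of the sparse-coding idea (using scikit-learn's generic dictionary learning as a stand-in; the function name, the number of atoms, and the sparsity level are illustrative assumptions, not the authors' exact method):

    import numpy as np
    from sklearn.decomposition import DictionaryLearning

    def sense_atoms(embeddings, n_atoms=2000, n_nonzero=5):
        # embeddings: (vocab_size, dim) matrix of word vectors, assumed given.
        # Each word vector is approximated as a sparse combination of a few learned
        # "atoms"; the atoms active for a word like tie are candidates for its senses.
        dl = DictionaryLearning(n_components=n_atoms,
                                transform_algorithm="omp",
                                transform_n_nonzero_coefs=n_nonzero)
        codes = dl.fit_transform(embeddings)  # sparse coefficients, one row per word
        return codes, dl.components_          # components_: (n_atoms, dim) atom vectors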