Stanford CS224N: NLP with Deep Learning | Winter 2021 | Lecture 1 - Intro & Word Vectors

0:00:00 - 0:00:15     Text: Hi everybody. Welcome to Stanford CS224N, also known as Ling284, natural language processing

0:00:15 - 0:00:22     Text: with deep learning. I'm Christopher Manning and I'm the main instructor for this class.

0:00:22 - 0:00:28     Text: So what we hope to do today is to dive right in. So I'm going to spend about 10 minutes

0:00:28 - 0:00:34     Text: talking about the course. And then we're going to get straight into content for reasons

0:00:34 - 0:00:40     Text: I'll explain in a minute. So we'll talk about human language and word meaning. I'll

0:00:40 - 0:00:45     Text: then introduce the ideas of the word2vec algorithm for learning word meaning.

0:00:45 - 0:00:50     Text: And then going from there we'll kind of concretely work through how you can work out objective

0:00:50 - 0:00:56     Text: function gradients with respect to the word2vec algorithm and say a teeny bit about

0:00:56 - 0:01:01     Text: how optimization works. And then right at the end of the class I then want to spend a little

0:01:01 - 0:01:08     Text: bit of time giving you a sense of how these word vectors work and what you can do with them.

0:01:08 - 0:01:15     Text: So really the key learning for today is I want to give you a sense of how amazing deep

0:01:15 - 0:01:21     Text: learning word vectors are. So we have this really surprising result that word meaning

0:01:21 - 0:01:28     Text: can be represented not perfectly but really rather well by a large vector of real numbers.

0:01:28 - 0:01:33     Text: And you know that's sort of in a way a commonplace of the last decade of deep learning

0:01:33 - 0:01:40     Text: but it flies in the face of thousands of years of tradition and it's really rather an unexpected

0:01:40 - 0:01:47     Text: result to start focusing on. Okay so quickly what do we hope to teach in this course. So

0:01:47 - 0:01:54     Text: we've got three primary goals. The first is to teach you the foundations, that is a good

0:01:54 - 0:02:00     Text: deep understanding of the effective modern methods for deep learning applied to NLP.

0:02:00 - 0:02:05     Text: So we are going to start with and go through the basics and then go on to key methods

0:02:05 - 0:02:12     Text: that are used in NLP: recurrent networks, attention, transformers and things like that.

0:02:12 - 0:02:18     Text: We want to do something more than just that, but we'd also like to give you some sense of a

0:02:18 - 0:02:23     Text: big picture understanding of human languages and what are the reasons for why they're actually

0:02:23 - 0:02:29     Text: quite difficult to understand and produce even though humans seem to do it easily. Now

0:02:29 - 0:02:33     Text: obviously if you really want to learn a lot about this topic you should enroll in and

0:02:33 - 0:02:38     Text: go and start doing some classes in the linguistics department but nevertheless for a lot of

0:02:38 - 0:02:44     Text: you this is the only human language content you'll see during your master's degree or whatever

0:02:44 - 0:02:51     Text: and so we do hope to spend a bit of time on that starting today. And then finally we want

0:02:51 - 0:02:57     Text: to give you an understanding of an ability to build systems in PyTorch for some of the

0:02:57 - 0:03:03     Text: major problems in NLP so we'll look at learning word meanings dependency parsing machine translation

0:03:03 - 0:03:14     Text: question answering. Let's dive into human language. Once upon a time I had a lot longer introduction

0:03:14 - 0:03:21     Text: that gave lots of examples about how human languages can be misunderstood and complex. I'll show

0:03:21 - 0:03:29     Text: a few of those examples in later lectures but since for today we're going to be focused

0:03:29 - 0:03:38     Text: on word meaning I thought I'd just give one example which comes from a very nice xkcd cartoon

0:03:38 - 0:03:46     Text: and that isn't sort of about some of the sort of syntactic ambiguities of sentences but instead

0:03:46 - 0:03:52     Text: it's really emphasizing the important point that language is a social system constructed

0:03:52 - 0:04:01     Text: and interpreted by people and that's part of how it changes, as people decide to adapt its

0:04:01 - 0:04:07     Text: construction, and that's part of the reason why human languages are a great and adaptive system

0:04:07 - 0:04:15     Text: for human beings but difficult as a system for our computers to understand to this day. So in this

0:04:15 - 0:04:23     Text: conversation between the two women one says anyway I could care less and the other says I think

0:04:23 - 0:04:29     Text: you mean you couldn't care less saying you could care less implies you care at least some amount

0:04:29 - 0:04:35     Text: and the other one says I don't know, these unbelievably complicated brains drifting through

0:04:35 - 0:04:42     Text: a void trying in vain to connect with one another by blindly flinging words out into the darkness.

0:04:42 - 0:04:49     Text: Every choice of phrasing and spelling and tone and timing carries countless signals

0:04:49 - 0:04:56     Text: and contexts and subtexts and more and every listener interprets those signals in their own way.

0:04:56 - 0:05:03     Text: Language isn't a formal system language is glorious chaos. You can never know for sure what any

0:05:03 - 0:05:10     Text: words will mean to anyone. All you can do is try to get better at guessing how your words affect

0:05:10 - 0:05:15     Text: people so you can have a chance of finding the ones that will make them feel something like what

0:05:15 - 0:05:21     Text: you want them to feel. Everything else is pointless. I assume you're giving me tips on how you

0:05:21 - 0:05:28     Text: interpret words because you want me to feel less alone. If so then thank you that means a lot

0:05:29 - 0:05:35     Text: but if you're just running my sentences past some mental checklist so you can show off how well

0:05:35 - 0:05:43     Text: you know it then I could care less. Okay so that's ultimately what our goal is: to do a better

0:05:43 - 0:05:53     Text: job at building computational systems that try to get better at guessing how their words will affect

0:05:53 - 0:05:58     Text: other people and what other people are meaning by the words that they choose to say.

0:05:58 - 0:06:10     Text: So an interesting thing about human language is it is a system that was constructed by human beings

0:06:11 - 0:06:20     Text: and it's a system that was constructed relatively recently in some sense. So in discussions of

0:06:20 - 0:06:29     Text: artificial intelligence a lot of the time people focus a lot on human brains and the neurons buzzing

0:06:29 - 0:06:36     Text: by and this intelligence that's meant to be inside people's heads but I just wanted to focus for a

0:06:36 - 0:06:43     Text: moment on the role of language. There's actually you know this is kind of controversial but

0:06:44 - 0:06:49     Text: you know it's not necessarily the case that humans are much more intelligent than some of the

0:06:49 - 0:06:56     Text: higher apes like chimpanzees or bonobos right so chimpanzees and bonobos have been shown to be

0:06:56 - 0:07:03     Text: able to use tools to make plans and in fact chimps have much better short term memory than human beings

0:07:03 - 0:07:11     Text: do. So relative to that if you look through the history of life on earth human beings develop language

0:07:11 - 0:07:17     Text: really recently. How recently we kind of actually don't know because you know there's no fossils

0:07:17 - 0:07:25     Text: that say okay here's a language speaker but you know most people estimate that language arose

0:07:25 - 0:07:33     Text: for human beings sort of you know somewhere in the range of a hundred thousand to a million years ago

0:07:33 - 0:07:39     Text: okay that's a while ago but compared to the process of evolution of life on earth that's kind of

0:07:39 - 0:07:47     Text: the blink of an eyelid but the power from this communication between human beings quickly set off

0:07:47 - 0:07:55     Text: our ascendancy over other creatures. So it's kind of interesting that the ultimate power turned out not

0:07:55 - 0:08:01     Text: to be having poisonous fangs or being super fast or super big but having the ability to communicate

0:08:01 - 0:08:08     Text: with other members of your tribe. It was much more recently again that humans developed writing

0:08:08 - 0:08:15     Text: which allowed knowledge to be communicated across distances of time and space and so that's only

0:08:15 - 0:08:22     Text: about five thousand years old the power of writing. So in just a few thousand years the ability

0:08:22 - 0:08:28     Text: to preserve and share knowledge took us from the Bronze Age to the smartphones and tablets of today.

0:08:30 - 0:08:36     Text: So a key question for artificial intelligence and human computer interaction is how to get

0:08:36 - 0:08:42     Text: computers to be able to understand the information conveyed in human languages. Simultaneously artificial

0:08:42 - 0:08:49     Text: intelligence requires computers with a knowledge of people. Fortunately now our AI systems might be

0:08:49 - 0:08:55     Text: able to benefit from a virtuous cycle. We need knowledge to understand language and people well

0:08:55 - 0:09:02     Text: but it's also the case that a lot of that knowledge is contained in language spread out across the

0:09:02 - 0:09:06     Text: books and web pages of the world and that's one of the things we're going to look at in this

0:09:06 - 0:09:13     Text: course is how that we can sort of build on that virtuous cycle. A lot of progress has already been

0:09:13 - 0:09:23     Text: made and I just wanted to very quickly give a sense of that. So in the last decade or so and especially

0:09:23 - 0:09:29     Text: in the last few years with neural methods of machine translation where now in a space where

0:09:29 - 0:09:36     Text: machine translation really works moderately well. So again from the history of the world this is

0:09:36 - 0:09:43     Text: just amazing right for thousands of years learning other people's languages was a human task which

0:09:43 - 0:09:49     Text: required a lot of effort and concentration but now we're in a world where you could just hop on

0:09:49 - 0:09:56     Text: your web browser and think oh I wonder what the news is in Kenya today and you can head off over to

0:09:56 - 0:10:03     Text: a Kenyan website and you can see something like this and you can go and you can then ask Google

0:10:03 - 0:10:09     Text: to translate it for you from Swahili and you know the translation isn't quite perfect

0:10:09 - 0:10:15     Text: but it's you know it's reasonably good so the newspaper Tukko has been informed that local

0:10:15 - 0:10:21     Text: government minister Lingsan, Bella Kanyama and his transport counterparts sitting me

0:10:21 - 0:10:27     Text: died within two separate hours so you know within two separate hours is kind of awkward but essentially

0:10:27 - 0:10:33     Text: we're doing pretty well at getting the information out of this page and so that's quite amazing.

0:10:35 - 0:10:41     Text: The single biggest development in NLP for the last year certainly in the popular media

0:10:41 - 0:10:53     Text: was GPT3, which was a huge new model that was released by OpenAI. What GPT3 is about and why

0:10:53 - 0:10:59     Text: it's great is actually a bit subtle and so I can't really go through all the details of this here

0:10:59 - 0:11:05     Text: but it's exciting because it seems like it's the first step on the path to what we might call

0:11:05 - 0:11:13     Text: universal models where you can train up one extremely large model on something like that library

0:11:13 - 0:11:20     Text: picture I showed before and it just has knowledge of the world, knowledge of human languages, knowledge

0:11:20 - 0:11:27     Text: of how to do tasks and then you can apply it to do all sorts of things so no longer are we building

0:11:27 - 0:11:33     Text: a model to detect spam and then a model to detect pornography and then a model to detect

0:11:34 - 0:11:39     Text: whatever foreign language content and just building all these separate supervised classifiers

0:11:39 - 0:11:46     Text: for every different task, we've now just built up one model that understands a lot. So exactly what it does

0:11:46 - 0:11:58     Text: is it just predicts following words so on the left it's been told to write about Elon Musk in the

0:11:58 - 0:12:07     Text: style of Dr. Seuss and it started off with some text and then it's generating more text and the

0:12:07 - 0:12:14     Text: way it generates more text is literally by just predicting, one word at a time, what words

0:12:14 - 0:12:22     Text: come next to complete its text, but this is a very powerful facility because what you can do with

0:12:22 - 0:12:30     Text: GPT3 is you can give it a couple of examples of what you'd like it to do so I can give it some

0:12:30 - 0:12:36     Text: text and say: I broke the window, changed into a question: what did I break? I gracefully saved

0:12:36 - 0:12:44     Text: the day, changed into a question: what did I gracefully save? So this prompt tells GPT3 what

0:12:45 - 0:12:51     Text: I'm wanting it to do and so then if I give it another statement like I gave John flowers I can

0:12:51 - 0:12:58     Text: then say, GPT3, predict what words come next, and it'll follow my prompt and produce: who did I give

0:12:58 - 0:13:05     Text: flowers to or I can say I gave her a rose and a guitar and it will follow the idea of the pattern

0:13:05 - 0:13:12     Text: and do: who did I give a rose and a guitar to? And actually this one model can then do an amazing

0:13:12 - 0:13:18     Text: range of things, including many that it's quite surprising it can do at all. To give just one example of

0:13:18 - 0:13:27     Text: that another thing that you can do is get it to translate human language sentences into SQL so

0:13:27 - 0:13:35     Text: this can make it much easier to do CS 145. So having given it a couple of examples of SQL

0:13:35 - 0:13:41     Text: translation of human language text which I'm this time not showing because it won't fit on my slide

0:13:41 - 0:13:47     Text: I can then give it a sentence like how many users have signed up since the start of 2020 and it turns

0:13:47 - 0:13:53     Text: it into SQL or I can give it another query what is the average number of influencers each user

0:13:53 - 0:14:04     Text: subscribe to and again it then converts that into SQL so GPT3 knows a lot about the meaning of

0:14:04 - 0:14:09     Text: language and the meaning of other things like SQL and can fluently manipulate them (the rough shape of these prompts is sketched below).
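
Concretely, the few-shot prompt pattern described above looks roughly like this (reconstructed from the spoken examples; the exact formatting used on the slide may differ):

    I broke the window.
    Question: What did I break?
    I gracefully saved the day.
    Question: What did I gracefully save?
    I gave John flowers.
    Question:

GPT3 is then asked to predict the words that come next, and it continues the text with something like "Who did I give flowers to?"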

0:14:13 - 0:14:19     Text: okay so that leads us straight into this topic of meaning and how do we represent the meaning of a

0:14:19 - 0:14:26     Text: word well what is meaning well we can look up something like the Webster dictionary and say okay

0:14:27 - 0:14:33     Text: the idea that is represented by a word, the idea that a person wants to express by using words,

0:14:33 - 0:14:40     Text: signs, etc. The Webster dictionary definition is really focused on the word idea somehow

0:14:40 - 0:14:46     Text: but this is pretty close to the commonest way that linguists think about meaning so that they think

0:14:46 - 0:14:54     Text: of word meaning as being a pairing between a word which is a signifier or symbol and the thing

0:14:54 - 0:15:00     Text: that it signifies the signified thing which is an idea or thing so that the meaning of the word

0:15:00 - 0:15:07     Text: chair is the set of things that are chairs and that's referred to as denotational semantics

0:15:07 - 0:15:13     Text: a term that's also used and similarly applied for the semantics of programming languages

0:15:13 - 0:15:23     Text: this model isn't something that's very readily implementable, like how do I go from the idea that okay chair

0:15:23 - 0:15:28     Text: means the set of chairs in the world to something I can use to manipulate meaning within my computer

0:15:28 - 0:15:36     Text: so traditionally the way that meaning has normally been handled in natural language processing

0:15:36 - 0:15:43     Text: systems is to make use of resources like dictionaries and thesauri, and a particularly popular one is

0:15:43 - 0:15:52     Text: WordNet, which organizes words and terms into both synonym sets, words that can mean the same thing,

0:15:52 - 0:15:59     Text: and hypernyms, which correspond to the is-a relationship, and so for the is-a relationship you know we can

0:15:59 - 0:16:06     Text: kind of look at the hypernyms of panda, and a panda is a kind of procyonid, whatever those are, I guess

0:16:06 - 0:16:12     Text: probably along with red pandas, which is a kind of carnivore, which is a kind of placental, which

0:16:12 - 0:16:20     Text: is a kind of mammal, and you sort of head up this hypernym hierarchy. So WordNet has been a great

0:16:20 - 0:16:29     Text: resource for nlp but it's also been highly deficient so it lacks a lot of nuance so for example

0:16:29 - 0:16:36     Text: in WordNet proficient is listed as a synonym for good but you know maybe that's sometimes true

0:16:36 - 0:16:40     Text: but it seems like in a lot of context it's not true and you mean something rather different when

0:16:40 - 0:16:48     Text: you say proficient versus good. It's limited as a human-constructed thesaurus, so in particular there are

0:16:48 - 0:16:55     Text: lots of words and lots of uses of words that just aren't there including you know anything that is

0:16:55 - 0:17:03     Text: you know sort of more current terminology like wicked is there for the wicked witch but not for

0:17:03 - 0:17:10     Text: more modern colloquial uses ninja certainly isn't there for the kind of description some people make

0:17:10 - 0:17:18     Text: of programmers and it's impossible to keep up to date so it requires a lot of human labor but even

0:17:18 - 0:17:26     Text: when you have that, you know, it has sets of synonyms but doesn't really have a good sense of words

0:17:26 - 0:17:35     Text: that mean something similar, so fantastic and great mean something similar without really being

0:17:35 - 0:17:41     Text: synonyms and so this idea of meaning similarity is something that'd be really useful to make

0:17:41 - 0:17:48     Text: progress on and where deep learning models excel okay so what's the problem with a lot of

0:17:48 - 0:17:56     Text: traditional NLP well the problem with a lot of traditional NLP is that words are regarded as

0:17:56 - 0:18:04     Text: discrete symbols so we have symbols like hotel, conference, motel as our words, which in deep learning

0:18:04 - 0:18:14     Text: speak we refer to as a localist representation and that's because if you in statistical or machine

0:18:14 - 0:18:21     Text: learning systems want to represent these symbols that each of them is a separate thing so the

0:18:21 - 0:18:27     Text: standard way of representing them and this is what you do in something like a statistical model

0:18:27 - 0:18:34     Text: if you're building a logistic regression model with words as features is that you represent them as one

0:18:34 - 0:18:40     Text: hot vectors so you have a dimension for each different word so maybe like my example here are my

0:18:40 - 0:18:49     Text: representations as vectors for motel and hotel and so that means that we have to have huge vectors

0:18:49 - 0:18:54     Text: corresponding to the number of words in our vocabulary so the kind of if you had a high school

0:18:54 - 0:19:01     Text: English dictionary it probably had about 250,000 words in it but there are many many more words

0:19:01 - 0:19:07     Text: in the language really so maybe we at least want to have a 500,000 dimensional vector to be able to

0:19:07 - 0:19:16     Text: cope with that okay but the bigger even bigger problem with the discrete symbols is that we don't

0:19:16 - 0:19:22     Text: have this notion of word relationships and similarity so for example in web search if a user

0:19:22 - 0:19:29     Text: searches for Seattle motel we'd also like to match on documents containing Seattle hotel

0:19:29 - 0:19:35     Text: but our problem is we've got these one-hot vectors for the different words and so in a formal

0:19:35 - 0:19:41     Text: mathematical sense these two vectors are orthogonal, so there's no natural notion of similarity

0:19:41 - 0:19:47     Text: between them whatsoever (the little numerical sketch below illustrates this).
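
A minimal sketch of that orthogonality point, with made-up vocabulary indices:

    import numpy as np

    vocab_size = 500_000                     # size of the vocabulary
    motel_id, hotel_id = 11_237, 18_842      # hypothetical word indices for the example

    def one_hot(index, size):
        # Localist representation: a vector of zeros with a single 1.
        v = np.zeros(size)
        v[index] = 1.0
        return v

    motel = one_hot(motel_id, vocab_size)
    hotel = one_hot(hotel_id, vocab_size)

    # Distinct one-hot vectors are orthogonal: the dot product is 0,
    # so this encoding gives no notion of similarity between the words.
    print(motel @ hotel)   # 0.0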

0:19:47 - 0:19:55     Text: Well, there are some things that we could try to do about that, and people did do about that, you know, before 2010: we could say hey, we could use WordNet synonyms

0:19:55 - 0:20:01     Text: and count anything that's listed as a synonym as similar anyway, or hey, maybe we could somehow build

0:20:01 - 0:20:08     Text: up representations of words that have meaning overlap and people did all of those things but

0:20:08 - 0:20:15     Text: they tended to fail badly from incompleteness so instead what I want to introduce today is the

0:20:15 - 0:20:23     Text: modern deep learning method of doing that where we encode similarity in the real-valued vectors

0:20:23 - 0:20:31     Text: themselves so how do we go about doing that okay and the way we do that is by exploiting this

0:20:31 - 0:20:39     Text: idea called distributional semantics so the idea of distributional semantics is again something

0:20:39 - 0:20:46     Text: that when you first see it maybe feels a little bit crazy because rather than having something

0:20:46 - 0:20:53     Text: like denotational semantics what we're now going to do is say that a word's meaning is going to be

0:20:53 - 0:21:02     Text: given by the words that frequently appear close to it. JR Firth was a British linguist from the

0:21:02 - 0:21:09     Text: middle of last century and one of his pithy slogans that everyone quotes at this moment is you

0:21:09 - 0:21:18     Text: shall know a word by the company it keeps and so this idea that you can represent a sense of a word's

0:21:18 - 0:21:27     Text: meaning as a notion of what contexts it appears in has been a very successful idea one of the most

0:21:27 - 0:21:35     Text: successful ideas that's used throughout statistical and deep learning NLP is actually an interesting idea

0:21:35 - 0:21:42     Text: more philosophically so that there are kind of interesting connections for example in

0:21:42 - 0:21:48     Text: Wittgenstein's later writings he became enamored of a use theory of meaning and this is

0:21:48 - 0:21:52     Text: in some sense a use theory of meaning but whether you know it's the ultimate theory of

0:21:52 - 0:21:58     Text: semantics it's actually still pretty controversial but it proves to be an extremely computational

0:21:58 - 0:22:06     Text: sense of semantics which has just led to it being used everywhere very successfully in deep

0:22:06 - 0:22:13     Text: learning systems so when a word appears in a text it has a context which are the set of words

0:22:13 - 0:22:22     Text: that appear nearby and so for a particular word my example here is banking we'll find a bunch of

0:22:22 - 0:22:29     Text: places where banking occurs in texts and we'll collect the sort of nearby words as context words

0:22:29 - 0:22:35     Text: and we'll say that those words that are appearing in that kind of muddy brown color around

0:22:35 - 0:22:42     Text: banking that those context words well in some sense represent the meaning of the word banking

0:22:43 - 0:22:48     Text: while I'm here let me just mention one distinction that will come up regularly when we're talking

0:22:48 - 0:22:57     Text: about a word in our natural language processing class we sort of have two senses of word which

0:22:57 - 0:23:04     Text: are referred to as types and tokens so there are particular instances of words so in the

0:23:04 - 0:23:10     Text: first example government debt problems turning into banking crises there's banking there and that's

0:23:10 - 0:23:18     Text: a token of the word banking but then I've collected a bunch of instances of quote unquote the word

0:23:18 - 0:23:24     Text: banking and when I say the word banking and a bunch of examples of it I'm then treating banking

0:23:24 - 0:23:31     Text: as a type which refers to you know the uses and meaning the word banking has across instances

0:23:31 - 0:23:44     Text: okay so what are we going to do with these distributional models of language well what we want to do

0:23:45 - 0:23:52     Text: is that we're going, based on looking at the words that occur in context, to build up

0:23:52 - 0:24:03     Text: a dense real valued vector for each word that in some sense represents the meaning of that word

0:24:03 - 0:24:10     Text: and the way it represents the meaning of that word is that this vector will be useful

0:24:10 - 0:24:20     Text: for predicting other words that occur in the context so in this example to keep it manageable on

0:24:20 - 0:24:28     Text: the slide, vectors are only eight dimensional but in reality we use considerably bigger vectors so

0:24:28 - 0:24:35     Text: a very common size is actually 300 dimensional vectors okay so for each word that's a word type

0:24:35 - 0:24:43     Text: we're going to have a word vector. These are also used with other names: they're referred to as neural word

0:24:43 - 0:24:49     Text: representations or, for a reason that will become clear on the next slide, they're referred to as word

0:24:49 - 0:24:57     Text: embeddings so these are now a distributed representation not a localist representation because the meaning

0:24:57 - 0:25:05     Text: of the word banking is spread over all 300 dimensions of the vector okay these are called

0:25:05 - 0:25:12     Text: word embeddings because effectively when we have a whole bunch of words these representations

0:25:12 - 0:25:18     Text: place them all in a high dimensional vector space and so they're embedded into that space
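
As a minimal sketch of what "one dense vector per word type" looks like in code (the words, the dimension and the random values here are purely illustrative; real systems learn these values, as described in the rest of the lecture):

    import numpy as np

    vocab = ["banking", "crises", "regulation", "debt"]   # tiny illustrative vocabulary
    d = 8                                                 # embedding dimension (300 is more typical)

    rng = np.random.default_rng(0)
    # One row per word type; each row is that word's distributed representation,
    # i.e. its meaning is spread over all d dimensions.
    embeddings = rng.normal(scale=0.1, size=(len(vocab), d))

    word_to_id = {w: i for i, w in enumerate(vocab)}
    banking_vector = embeddings[word_to_id["banking"]]
    print(banking_vector)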

0:25:18 - 0:25:27     Text: now unfortunately human beings are very bad at looking at 300 dimensional vector spaces or even

0:25:27 - 0:25:33     Text: eight dimensional vector spaces so the only thing that I can really display to you here is a two

0:25:33 - 0:25:40     Text: dimensional projection of that space now even that's useful but it's also important to realize that

0:25:40 - 0:25:46     Text: when you're making a two dimensional projection of a 300 dimensional space you're losing almost

0:25:46 - 0:25:51     Text: all the information in that space and a lot of things will be crushed together that don't

0:25:51 - 0:25:59     Text: actually deserve to be together so here's my word embeddings of course you can't see any of those at

0:25:59 - 0:26:08     Text: all but if I zoom in and then I zoom in further what you'll already see is that the representations

0:26:08 - 0:26:18     Text: we've learnt distributionally do quite a good job at grouping together similar words so in this

0:26:18 - 0:26:24     Text: sort of overall picture I zoomed into one part of the space which is actually the part that's up here

0:26:24 - 0:26:33     Text: in this view of it and it's got words for countries so not only are countries generally grouped together

0:26:33 - 0:26:40     Text: even the sort of particular subgroupings of countries make a certain amount of sense and down here

0:26:40 - 0:26:46     Text: we then have nationality words if we go to another part of the space we can see different kind of words

0:26:46 - 0:26:55     Text: so here are verbs and we have ones like come and go which are very similar; saying and thinking words, say,

0:26:55 - 0:27:03     Text: think, expect, are kind of similar and nearby; over in the bottom right we have sort of verbal

0:27:03 - 0:27:11     Text: auxiliaries and copulas, so have, had, has, forms of the verb to be, and certain contentful verbs are

0:27:11 - 0:27:18     Text: similar to copula verbs because they describe states you know he remained angry he became angry

0:27:18 - 0:27:23     Text: and so they're actually then grouped close together with the verb to be so there's a lot of

0:27:23 - 0:27:31     Text: interesting structure in this space that then represents the meaning of words so the algorithm I'm

0:27:31 - 0:27:40     Text: going to introduce now is one that's called word2vec which was introduced by Tomas Mikolov

0:27:40 - 0:27:46     Text: and colleagues in 2013 as a framework for learning word vectors and it's sort of a simple and

0:27:46 - 0:27:54     Text: easy to understand place to start so the idea is we have a lot of text from somewhere which we

0:27:54 - 0:28:00     Text: commonly refer to as a corpus of text corpus is just the Latin word for body so it's a body of text

0:28:02 - 0:28:08     Text: and so then we choose a fixed vocabulary which will typically be large but nevertheless truncated

0:28:08 - 0:28:15     Text: so we get rid of some of the really rare words so we might say vocabulary size of 400,000

0:28:15 - 0:28:24     Text: and we then create for ourselves a vector for each word okay so then what we do is we want to

0:28:24 - 0:28:33     Text: work out what's a good vector for each word and the really interesting thing is that we can

0:28:33 - 0:28:41     Text: learn these word vectors from just a big pile of text by doing this distributional similarity

0:28:41 - 0:28:49     Text: task of being able to predict well what words occur in the context of other words so in particular

0:28:49 - 0:28:58     Text: we're going to iterate through the text and so at any moment we have a center word c and context

0:28:58 - 0:29:06     Text: words outside of it which we'll call o and then based on the current word vectors we're going to

0:29:06 - 0:29:14     Text: calculate the probability of a context word occurring given the center word according to our

0:29:14 - 0:29:21     Text: current model but then we know that certain words did actually occur in the context of that center

0:29:21 - 0:29:28     Text: word and so what we want to do is then keep adjusting the word vectors to maximize the probability

0:29:28 - 0:29:35     Text: that's assigned to words that actually occur in the context of the center word as we proceed through

0:29:35 - 0:29:43     Text: these texts so to start to make that a bit more concrete this is what we're doing um so we have a

0:29:43 - 0:29:49     Text: piece of text we choose our center word which is here 'into' and then we say well

0:29:52 - 0:29:58     Text: for our model of predicting the probability of context words given the center word and this model

0:29:58 - 0:30:05     Text: will come to in a minute but it's defined in terms of our word vectors so let's see what probability

0:30:05 - 0:30:13     Text: it gives to the words that actually occurred in the context of this word. Huh, it gives them

0:30:13 - 0:30:19     Text: some probability, but maybe it'd be nice if the probability assigned was higher, so then how can we

0:30:19 - 0:30:26     Text: change our word vectors to raise those probabilities and so we'll do some calculations with 'into' being

0:30:26 - 0:30:32     Text: the center word and then we'll just go on to the next word and then we'll do the same kind of

0:30:32 - 0:30:40     Text: calculations and keep on chugging along so the big question then is well what are we doing for working out

0:30:40 - 0:30:47     Text: the probability of a word occurring in the context of the center word and so that's the central part

0:30:47 - 0:30:57     Text: of what we develop as the word2vec model so this is the overall model that we want to use so

0:30:57 - 0:31:05     Text: for each position in our corpus our body of text we want to predict context words within a window

0:31:05 - 0:31:13     Text: of fixed size m given the center word w_j and we want to become good at doing that so we want to give

0:31:13 - 0:31:20     Text: high probability to words that occur in the context and so what we're going to do is we're going to

0:31:20 - 0:31:26     Text: work out what's formally the data likelihood as to how good a job we do at predicting words in the

0:31:26 - 0:31:34     Text: context of other words and so formally that likelihood is going to be defined in terms of our word

0:31:34 - 0:31:40     Text: vectors so they're the parameters of our model and it's going to be calculated as taking the product

0:31:40 - 0:31:47     Text: of using each word as the center word and then the product of each word in a window around that

0:31:47 - 0:31:55     Text: of the probability of predicting that context word given the center word and so to learn this model

0:31:55 - 0:32:00     Text: we're going to have an objective function sometimes also called a cost or a loss that we want to

0:32:00 - 0:32:08     Text: optimize and essentially what we want to do is we want to maximize the likelihood of the context

0:32:08 - 0:32:14     Text: we see around center words but following standard practice we slightly fiddle that

0:32:15 - 0:32:21     Text: because rather than dealing with products it's easier to deal with sums and so we work with log

0:32:21 - 0:32:28     Text: likelihood and once we take log likelihood all of our products turn into sums we also work with

0:32:28 - 0:32:36     Text: the average log likelihood so we've got a 1/T term here for the number of words in the corpus

0:32:36 - 0:32:42     Text: and finally for no particular reason we like to minimize our objective function rather than

0:32:42 - 0:32:49     Text: maximizing it so we stick a minus sign in there and so then by minimizing this objective function

0:32:49 - 0:33:00     Text: J of theta, that corresponds to maximizing our predictive accuracy okay so that's the setup but we still

0:33:00 - 0:33:07     Text: haven't made any progress in how do we calculate the probability of a word occurring in the context

0:33:07 - 0:33:15     Text: given the center word and so the way we're actually going to do that is we have vector representations

0:33:15 - 0:33:23     Text: for each word and we're going to work out the probability simply in terms of the word vectors

0:33:24 - 0:33:30     Text: now at this point there's a little technical point we're actually going to give to each word two

0:33:30 - 0:33:37     Text: word vectors one word vector for when it's used as the center word and a different word vector

0:33:37 - 0:33:44     Text: when it's used as a context word this is done because it just simplifies the math and the optimization

0:33:44 - 0:33:52     Text: so it seems a little bit ugly but actually makes building word vectors a lot easier and really

0:33:52 - 0:33:59     Text: we can come back and discuss it later but that's what it is and so then once we have these word

0:33:59 - 0:34:08     Text: vectors the equation that we're going to use for giving the probability of a context word appearing

0:34:08 - 0:34:13     Text: given the center word is that we're going to calculate it using the expression in the middle

0:34:13 - 0:34:26     Text: bottom of my slide so let's sort of pull that apart just a little bit more so what we have here

0:34:26 - 0:34:34     Text: with this expression is so for a particular center word and a particular context word o we're

0:34:34 - 0:34:41     Text: going to look up the vector representation of each word so they're u of o and v of c and so then

0:34:41 - 0:34:48     Text: we're simply going to take the dot product of those two vectors so dot product is a natural measure

0:34:48 - 0:34:56     Text: for similarity between words, because in any particular dimension, if both are positive you'll get some

0:34:56 - 0:35:02     Text: component that adds to the dot product sum; if both are negative it'll also add to the dot product

0:35:02 - 0:35:09     Text: sum; if one's positive and one's negative it'll subtract from the similarity measure; if either of

0:35:09 - 0:35:15     Text: them is zero it won't change the similarity so it sort of seems a sort of plausible idea to just

0:35:15 - 0:35:22     Text: take a dot product and thinking well if two words have a larger dot product that means they're more

0:35:22 - 0:35:31     Text: similar and so then after that we're really doing nothing more than saying okay we want to use dot

0:35:31 - 0:35:38     Text: products to represent word similarity and now let's do the dumbest thing that we know how to turn

0:35:38 - 0:35:45     Text: this into a probability distribution well what do we do well firstly well taking a dot product of

0:35:45 - 0:35:52     Text: two vectors that might come out as positive or negative but well we want to have probabilities we

0:35:52 - 0:35:57     Text: can't have negative probabilities so a simple way to avoid negative probabilities is to

0:35:57 - 0:36:03     Text: exponentiate them because then we know everything is positive and so then we are always getting a

0:36:03 - 0:36:11     Text: positive number in the numerator but for probabilities we also want to have the numbers add up to one

0:36:11 - 0:36:17     Text: so we have a probability distribution so we're just normalizing in the obvious way where we divide

0:36:17 - 0:36:23     Text: through by the sum of the numerator quantity for each different word in the vocabulary and so

0:36:23 - 0:36:30     Text: then necessarily that gives us a probability distribution so all the rest of that that I was just

0:36:30 - 0:36:36     Text: talking through what we're using there is what's called the softmax function so the softmax

0:36:36 - 0:36:47     Text: function will take any vector in R^n and turn it into numbers between 0 and 1, and so we can take

0:36:47 - 0:36:53     Text: numbers and put them through this softmax and turn them into a probability distribution.
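
Written out, the general softmax and the probability just described are:

    \mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)},
    \qquad
    P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}

where u_o is the outside (context) vector for word o, v_c is the center vector for word c, and the sum in the denominator runs over the whole vocabulary V.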

0:36:53 - 0:36:59     Text: the name comes from the fact that it's sort of like a max so because of the fact that we

0:36:59 - 0:37:07     Text: exponentiate, that really emphasizes the biggest components in the different dimensions when calculating

0:37:08 - 0:37:16     Text: similarity so most of the probability goes to the most similar things and it's called soft

0:37:16 - 0:37:22     Text: because well it doesn't do that absolutely it'll still give some probability to everything

0:37:23 - 0:37:28     Text: that's in the slightest bit similar I mean on the other hand it's a slightly weird name

0:37:28 - 0:37:35     Text: because you know max normally takes a set of things and just returns one the biggest of them

0:37:35 - 0:37:42     Text: whereas the softmax is taking a set of numbers and is scaling them but is returning the whole

0:37:42 - 0:37:50     Text: probability distribution okay so now we have all the pieces of our model and so how do we

0:37:50 - 0:38:01     Text: make our word vectors well the idea of what we want to do is we want to fiddle our word vectors

0:38:01 - 0:38:08     Text: in such a way that we minimize our loss, i.e. that we maximize the probability of the words that we

0:38:08 - 0:38:16     Text: actually saw in the context of the center word and so theta represents all of our

0:38:16 - 0:38:24     Text: model parameters in one very long vector so for our model here the only parameters are our word

0:38:24 - 0:38:34     Text: vectors so we have for each word two vectors its context vector and center vector and each of those

0:38:34 - 0:38:42     Text: is a d-dimensional vector where d might be 300 and we have v many words so we end up with this

0:38:42 - 0:38:51     Text: big huge vector which is 2dV long, which if you have a 500,000 word vocab times 300 dimensions

0:38:52 - 0:38:57     Text: times two, that's more math than I can do in my head, but it's got many millions of parameters, so

0:38:57 - 0:39:01     Text: we've got many millions of parameters and we somehow want to fiddle them all

0:39:01 - 0:39:12     Text: to maximize the prediction of context words and so the way we're going to do that then is we use

0:39:12 - 0:39:20     Text: calculus so what we want to do is take that math that we've seen previously and say well with this

0:39:20 - 0:39:30     Text: objective function we can work out derivatives and so we can work out where the gradient is so how we

0:39:30 - 0:39:39     Text: can walk downhill to minimize loss, so at some point we can figure out what is downhill and we can

0:39:39 - 0:39:49     Text: then progressively walk downhill and improve our model and so what our job is going to be is to compute

0:39:49 - 0:39:59     Text: all of those vector gradients okay so at this point I then want to kind of show a little bit more

0:39:59 - 0:40:10     Text: as to how we can actually do that and a couple more slides here but maybe I'll just try and

0:40:11 - 0:40:20     Text: jigger things again and move to my interactive whiteboard. What we wanted to do, right, so we had

0:40:20 - 0:40:30     Text: our overall, we had our overall J of theta that we were wanting to minimize, our average

0:40:30 - 0:40:39     Text: negative log likelihood, so that was the minus 1/T of the sum of t equals 1 to big T, which was our

0:40:39 - 0:40:45     Text: text length, and then we were going through the words in each context, so we were summing over j

0:40:45 - 0:40:56     Text: between minus m and m words on each side, excluding the word itself, and then what we wanted to do inside there

0:40:56 - 0:41:05     Text: was work out the log probability of the context word at that position given the

0:41:05 - 0:41:17     Text: word that's in the center position t, and so then we converted that into our word vectors by saying

0:41:17 - 0:41:44     Text: that the probability of o given c is going to be expressed as the softmax of the dot product.
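
In symbols, the two whiteboard quantities are:

    J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-m \le j \le m,\ j \neq 0} \log P(w_{t+j} \mid w_t; \theta),
    \qquad
    P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)}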

0:41:44 - 0:41:56     Text: okay and so now what we want to do is work out the gradient, the direction of downhill, for this

0:41:57 - 0:42:04     Text: loss function, and so the way we're doing that is we're working out the partial derivative of this

0:42:04 - 0:42:14     Text: expression with respect to every parameter in the model and all the parameters in the model are

0:42:14 - 0:42:22     Text: the components the dimensions of the word vectors of every word and so we have the center word

0:42:22 - 0:42:32     Text: vectors and the outside word vectors so here I'm just going to do the center word vectors

0:42:32 - 0:42:40     Text: but on a future homework assignment 2 the outside word vectors will show up and they're kind of

0:42:40 - 0:42:48     Text: similar so what we're doing is we're working out the partial derivative with respect to our center

0:42:48 - 0:42:59     Text: word vector, which is you know maybe a 300 dimensional word vector, of this probability of o given c,

0:42:59 - 0:43:05     Text: and since we're using log probabilities, of the log of this probability of o given c, of this

0:43:05 - 0:43:15     Text: exp of u_o transpose v_c over, my writing will get worse and worse, sorry, I've already made a mistake,

0:43:15 - 0:43:28     Text: we should be having a sum, the sum over w equals 1 to the vocabulary size, of exp of u_w transpose v_c. Okay, well at this point things

0:43:28 - 0:43:37     Text: start off pretty easy so what we have here is something that's log of a over b so that's easy

0:43:37 - 0:43:43     Text: we can turn this into log a minus log b but before I go further I'll just make a comment at this

0:43:43 - 0:43:54     Text: point, you know, so at this point my audience divides in two, right, there are some people in the audience

0:43:54 - 0:44:02     Text: for which maybe a lot of people in the audience this is really elementary math I've seen this

0:44:02 - 0:44:08     Text: a million times before and he isn't even explaining it very well and if you're in that group well

0:44:08 - 0:44:15     Text: feel free to look at your email or the newspaper or whatever else is best suited to you but I think

0:44:15 - 0:44:22     Text: there are also other people in the class who oh the last time I saw calculus was when I was in high

0:44:22 - 0:44:28     Text: school for which that's not the case and so I wanted to spend a few minutes going through this a

0:44:28 - 0:44:38     Text: bit concretely so that to try and get over the idea that you know even though most of deep learning

0:44:38 - 0:44:46     Text: and even word vector learning seems like magic that it's not really magic it's really just doing

0:44:46 - 0:44:52     Text: math and one of the things that we hope is that you do actually understand this math that's being done

0:44:53 - 0:45:00     Text: so I'll keep along and do a bit more of it. Okay, so then what we have is, so we use this way of writing

0:45:00 - 0:45:09     Text: the log and so then we can say that that expression above equals the partial derivative with respect

0:45:09 - 0:45:31     Text: to v_c of the log of the numerator, log exp of u_o transpose v_c, minus the partial derivative of the log of the

0:45:31 - 0:45:47     Text: denominator, so that's then the sum over w equals 1 to V of the exp of u_w transpose v_c. Okay, so at that point

0:45:47 - 0:46:00     Text: I have my numerator here and my former denominator there, so at that point there are two parts; the first

0:46:00 - 0:46:11     Text: part is the numerator part, so the numerator part is really really easy, so we have here the log

0:46:11 - 0:46:19     Text: and exp which are just inverses of each other, so they just go away, so that becomes the derivative

0:46:19 - 0:46:34     Text: with respect to v_c of just what's left behind, which is u_o dot producted with v_c, okay,

0:46:35 - 0:46:41     Text: and so the thing to be aware of is you know we're still doing this multivariate calculus so

0:46:41 - 0:46:48     Text: what we have here is calculus with respect to a vector like hopefully you saw some of in math 51

0:46:48 - 0:46:57     Text: or some other place not high school single variable calculus on the other hand you know to the

0:46:57 - 0:47:04     Text: extent you only half remember some of this stuff, most of the time you can just do perfectly well

0:47:04 - 0:47:12     Text: by thinking about what happens with one dimension at a time and it generalizes to multivariable

0:47:12 - 0:47:22     Text: calculus so if about all that you remember of calculus is that d dx of ax equals a, really

0:47:22 - 0:47:29     Text: it's the same thing that we're going to be using here that here we have the

0:47:33 - 0:47:41     Text: the outside word u_o dot producted with v_c, well at the end of the day that's going to have

0:47:41 - 0:47:54     Text: terms of sort of u_o component 1 times the center word component 1, plus u_o component 2 times

0:47:58 - 0:48:04     Text: the center word component 2, and so on, and so we're sort of using this bit over here, and so what we're

0:48:04 - 0:48:13     Text: going to be getting out is the u_o, the u_o 1 and the u_o 2, so this will be all that is left with

0:48:13 - 0:48:19     Text: respect to v_c 1 when we take the derivative with respect to v_c 1, and this term will be the only

0:48:19 - 0:48:27     Text: thing left when we take the derivative with respect to the variable v_c 2, so the end result of

0:48:27 - 0:48:38     Text: taking the vector derivative of u_o dot producted with v_c is simply going to be u_o.
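
That is, for the numerator term:

    \frac{\partial}{\partial v_c} \, u_o^{\top} v_c = u_o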

0:48:40 - 0:48:51     Text: Okay great so that's progress so then at that point we go on and we say oh damn we still have

0:48:51 - 0:49:03     Text: the denominator too, and that's slightly more complex but not so bad, so then we try to take

0:49:03 - 0:49:09     Text: the partial derivative with respect to v_c of the log of the denominator.
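
That is, the remaining term to differentiate is:

    \frac{\partial}{\partial v_c} \log \sum_{w=1}^{V} \exp(u_w^{\top} v_c)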

0:49:09 - 0:49:27     Text: Okay and so then at this point the one tool that we need to know and remember is how to use the

0:49:27 - 0:49:37     Text: chain rule, so the chain rule is for when you're wanting to work out derivatives of

0:49:37 - 0:49:47     Text: compositions of functions, so we have f of g of, whatever, x, but here it's going to be v_c, and so

0:49:47 - 0:49:54     Text: we want to say okay, what we have here is we're working out a composition of functions, so here's our

0:49:54 - 0:50:11     Text: f and here is our x, which is g of v_c, actually maybe I shouldn't call it x, maybe it's

0:50:12 - 0:50:23     Text: probably better to call it z or something. Okay, so when we then want to work out the chain rule, well

0:50:23 - 0:50:33     Text: what do we do, we take the derivative of f at the point z, and so at that point we have to

0:50:33 - 0:50:39     Text: actually remember something, we have to remember that the derivative of log is the one-on-x function,

0:50:39 - 0:50:50     Text: so this is going to be equal to the one-on-x for z, so that's then going to be one over the sum

0:50:50 - 0:51:06     Text: of w equals 1 to V of exp of u_w transpose v_c, multiplied by the derivative of the inner function, so,

0:51:06 - 0:51:20     Text: so the derivative of the part that is remaining, am I getting this right, the sum of, oh, and there's

0:51:20 - 0:51:25     Text: one trick here, at this point we do want to have a change of index, so we want to say the sum of x

0:51:25 - 0:51:38     Text: equals 1 to V of exp of u_x transpose v_c, since we can get into trouble if we don't change that variable

0:51:39 - 0:51:48     Text: to be using a different one. Okay, so at that point we're making some progress but we still want to

0:51:48 - 0:51:55     Text: work out the derivative of this, and so what we want to do is apply the chain rule once more, so now

0:51:55 - 0:52:10     Text: here's our f and in here is our new z equals g of v_c, and so we then sort of repeat over, so we can

0:52:10 - 0:52:27     Text: move the derivative inside a sum always, so we're then taking the derivative of this, and so then

0:52:27 - 0:52:42     Text: the derivative of exp is itself, so we're going to just have exp of u_x transpose v_c, inside the sum over x equals 1 to V,

0:52:42 - 0:52:59     Text: times the derivative of u_x transpose v_c, okay, and so then this, which is what we'd worked out before, we can just

0:52:59 - 0:53:09     Text: rewrite as u_x. Okay, so we're now making progress, so if we start putting all of that together

0:53:09 - 0:53:20     Text: what we have is the derivative, or the partial derivative with respect to v_c, of this log probability,

0:53:22 - 0:53:33     Text: right, we have the numerator part which was just u_o, minus, we then had from the denominator the

0:53:33 - 0:53:46     Text: sum over x equals 1 to V of exp of u_x transpose v_c times u_x, and then that was multiplied by our first term

0:53:46 - 0:54:00     Text: that came from the one-on-x, which gives you one over the sum of w equals 1 to V of the exp of u_w transpose v_c, and this

0:54:00 - 0:54:07     Text: is where the fact that we changed the variables became important, and so by just sort of rewriting that

0:54:07 - 0:54:28     Text: a little we can get that that equals u_o minus the sum of, sorry, x equals 1 to V of

0:54:28 - 0:54:43     Text: this exp of u_x transpose v_c over the sum of w equals 1 to V of exp of u_w transpose v_c, times u_x, and so at that point

0:54:43 - 0:54:50     Text: this sort of interesting thing has happened, that we've ended up getting straight back exactly

0:54:50 - 0:54:59     Text: the softmax probability formula that we saw when we started, and we can just rewrite that more

0:54:59 - 0:55:09     Text: conveniently as saying this equals u_o minus the sum over x equals 1 to V of the probability of

0:55:09 - 0:55:22     Text: x given c times u_x, and so what we have at that moment is, this thing here is an expectation, and so

0:55:22 - 0:55:29     Text: this is an average over all the context vectors weighted by their probability according to the

0:55:29 - 0:55:35     Text: model, and so it's always the case with these softmax-style models that what you get out for the

0:55:35 - 0:55:48     Text: derivatives is you get the observed minus the expected, so our model is good if our model on average predicts

0:55:48 - 0:55:56     Text: exactly the word vector that we actually see, and so we're going to try and adjust the parameters

0:55:56 - 0:56:06     Text: of our model so that it does that as much as possible (the end result of the derivation is written out below).
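
Putting the numerator and denominator pieces together, the derivation just walked through gives:

    \frac{\partial}{\partial v_c} \log P(o \mid c)
      = u_o - \sum_{x=1}^{V} \frac{\exp(u_x^{\top} v_c)}{\sum_{w=1}^{V} \exp(u_w^{\top} v_c)} \, u_x
      = u_o - \sum_{x=1}^{V} P(x \mid c) \, u_x

i.e. the observed context vector minus the context vector expected under the current model; the gradient steps then move v_c a little in the direction that shrinks this gap.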

0:56:06 - 0:56:13     Text: I mean of course as you'll find you can never get close right you know if I just say to you okay

0:56:13 - 0:56:20     Text: the word is croissant, which words are going to occur in the context of croissant, I mean you can't

0:56:20 - 0:56:25     Text: answer that, there are all sorts of sentences that you could say that then involve the word croissant, so

0:56:25 - 0:56:33     Text: actually our particular probability estimates are going to be kind of small but nevertheless

0:56:33 - 0:56:40     Text: we want to sort of fiddle our word vectors to try and make those estimates as high as we possibly

0:56:40 - 0:56:51     Text: can so I've gone on about this stuff a bit but haven't actually sort of shown you any of what

0:56:51 - 0:57:00     Text: actually happens so I just want to quickly show you a bit of that as to what actually happens with

0:57:00 - 0:57:07     Text: word vectors so here's a simple little ipython notebook which is also what you'll be using for

0:57:07 - 0:57:15     Text: assignment one. Okay, so in the first cell I import a bunch of stuff, so we've got numpy for our

0:57:15 - 0:57:24     Text: vectors, matplotlib for plotting, scikit-learn which is kind of your machine learning Swiss army knife,

0:57:24 - 0:57:29     Text: and Gensim, which is a package that you may well not have seen before, it's a package that's often used for

0:57:29 - 0:57:35     Text: word vectors it's not really used for deep learning so this is the only time you'll see it in the

0:57:35 - 0:57:40     Text: class but if you just want a good package for working with word vectors and some other application

0:57:40 - 0:57:50     Text: it's a good one to know about okay so then in my second cell here I'm loading a particular set

0:57:50 - 0:57:58     Text: of word vectors, so these are our GloVe word vectors that we made at Stanford in 2014, and I'm

0:57:58 - 0:58:04     Text: loading a hundred dimensional word vectors so that things are a little bit quicker for me while

0:58:04 - 0:58:12     Text: I'm doing things here. So let's look at this model with bread and croissant; well, what I've just got here

0:58:12 - 0:58:20     Text: is word vectors so I just wanted to sort of show you that there are word vectors
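
A minimal sketch of what these first cells might look like, assuming the pre-packaged "glove-wiki-gigaword-100" vectors from Gensim's downloader (the lecture loads comparable 100-dimensional Stanford GloVe vectors from its own files):

    import gensim.downloader as api

    # Load 100-dimensional GloVe word vectors (downloaded the first time it is used).
    model = api.load("glove-wiki-gigaword-100")

    # Each word maps to a dense 100-dimensional vector.
    print(model["bread"])
    print(model["croissant"])

    # Nearest neighbours of a word in the vector space.
    print(model.most_similar("croissant"))
    print(model.most_similar("banana"))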

0:58:25 - 0:58:30     Text: well maybe I should have loaded those word vectors in advance

0:58:30 - 0:58:33     Text: hmm let's see

0:58:42 - 0:58:51     Text: oh okay well I'm in business um okay so right so here are my word vectors for bread and croissant

0:58:53 - 0:58:57     Text: and while I'm seeing them maybe these two words are a bit similar so both of them are negative

0:58:57 - 0:59:04     Text: in the first dimension positive in the second negative in the third positive in the fourth negative

0:59:04 - 0:59:08     Text: in the fifth so it sort of looks like they might have a fair bit of dot product which is kind of

0:59:08 - 0:59:14     Text: what we want because bread and croissant are kind of similar but what we can do is actually ask the

0:59:14 - 0:59:21     Text: model, and these are Gensim functions now, you know, what are the most similar words, so I can ask

0:59:21 - 0:59:28     Text: for croissant what are the most similar words to that and it will tell me it's things like

0:59:28 - 0:59:34     Text: brioche, baguette, focaccia, so that's pretty good. Perhaps a little bit more questionable,

0:59:34 - 0:59:43     Text: we can ask most similar to the USA and it says Canada, or America, U.S.A. with periods, United States,

0:59:43 - 0:59:53     Text: that's pretty good. Most similar to banana, I get out coconut, mangoes, bananas, sort of fairly tropical

0:59:53 - 0:59:59     Text: fruit. Great. Before finishing though I want to show you something slightly more

0:59:59 - 1:00:05     Text: than just similarity which was one of the amazing things that people observed with these word

1:00:05 - 1:00:12     Text: vectors and that was to say you can actually sort of do arithmetic in this vector space that makes

1:00:12 - 1:00:18     Text: sense and so in particular people suggested this analogy task and so the idea of the analogy

1:00:18 - 1:00:24     Text: task is you should be able to start with a word like king and you should be able to subtract out a

1:00:24 - 1:00:31     Text: male component from it add back in a woman component and then you should be able to ask well what

1:00:31 - 1:00:44     Text: word is over here and what you'd like is that the word over there is queen um and so um this

1:00:44 - 1:00:52     Text: sort of little bit of so we're going to do that um with this sort of same most similar function

1:00:52 - 1:00:59     Text: which is actually more so as well as having positive words you can ask for most similar negative

1:00:59 - 1:01:05     Text: words and you might wonder what's most negatively similar to a banana and you might be thinking oh

1:01:05 - 1:01:12     Text: it's um I don't know um some kind of meat or something um actually that by itself isn't very

1:01:12 - 1:01:17     Text: useful because when you could just ask for um most negatively similar to things you tend to get

1:01:18 - 1:01:22     Text: crazy strings that were found in the data set um that you don't know what they mean if anything

1:01:22 - 1:01:29     Text: um but if we put the two together we can use the most similar function with positives and negatives

1:01:29 - 1:01:37     Text: to do analogies so we're going to say we want a positive king we want to subtract out negatively man

1:01:37 - 1:01:44     Text: we want to then add in positively woman and find out what's most similar to this point in the space

1:01:44 - 1:01:53     Text: so my analogy function does that precisely that by taking um a couple of most similar ones and then

1:01:53 - 1:02:00     Text: subtracting out um the negative one and so we can try out this analogy function so I can do the

1:02:00 - 1:02:09     Text: analogy I show in the picture um with man as to king as woman is uh fight so I'm not saying

1:02:09 - 1:02:17     Text: that's right um yeah man is to king as woman is to blah sorry I haven't done myself um

1:02:21 - 1:02:30     Text: okay man is to king as woman is to queen so um that's great and that um works well I mean and you

1:02:30 - 1:02:37     Text: can do it the sort of other way around king is to man as queen is to woman um if this only worked

1:02:37 - 1:02:44     Text: for that one freakish example um you maybe um wouldn't be very impressed but you know it actually

1:02:44 - 1:02:49     Text: turns out like it's not perfect but you can do all sorts of fun analogies with this and they

1:02:49 - 1:02:59     Text: actually work so you know I could ask for something like an analogy um oh here's a good one um

1:02:59 - 1:03:10     Text: Australia um is to be uh as France is to what um and you can think about what you think the answer

1:03:10 - 1:03:18     Text: that one should be and it comes out as um champagne which is pretty good or I could ask for

1:03:18 - 1:03:34     Text: something like analogy pencil is to sketching as camera is to what um and it says photographing um

1:03:34 - 1:03:40     Text: you can also do the analogies with people um at this point I have to point out that this data was

1:03:40 - 1:03:48     Text: um and the model was built in 2014 so you can't ask anything about um Donald Trump in it

1:03:48 - 1:03:53     Text: well you can he Trump is in there but not as president but I could ask something like analogy

1:03:54 - 1:04:08     Text: of bomb is to Clinton as Reagan is um to what and you can think of what you think is the right

1:04:08 - 1:04:15     Text: um analogy there um the analogy it returns is Nixon um so I guess that depends on what you think

1:04:15 - 1:04:22     Text: of Bill Clinton as to whether you think that was a good analogy or not you can also um do sort of

1:04:22 - 1:04:32     Text: linguistic analogies with it so you can do something like analogy tall is to tallest as long

1:04:32 - 1:04:39     Text: is to what and it does longest so it really just sort of knows a lot about the meaning

1:04:39 - 1:04:47     Text: behavior of words and you know I think when these um methods were first developed and hopefully

1:04:47 - 1:04:53     Text: still for you that you know people were just gobsmacked about how well this actually worked at capturing

1:04:54 - 1:05:01     Text: other enough words and so these word vectors then went everywhere as a new representation that was

1:05:01 - 1:05:07     Text: so powerful for working out word meaning and so that's our starting point for this class and we'll

1:05:07 - 1:05:13     Text: say a bit more about them next time and they're also the basis of what you're looking at for the

1:05:13 - 1:05:18     Text: first assignment can I ask a quick question about the distinction between the two vectors per word

1:05:19 - 1:05:27     Text: yes so um my understanding is that there can be several context words per uh word in the vocabulary

1:05:27 - 1:05:32     Text: or like word in the vocabulary um but then if there's only two vectors I kind of I thought the

1:05:32 - 1:05:36     Text: distinction between the two is that one it's like the actual word and one's like the context word but

1:05:36 - 1:05:42     Text: if there are multiple context words right how do you how do you pick to just two then well so we're

1:05:42 - 1:05:50     Text: doing every one of them right so like um maybe I won't turn back on the screen share but you know

1:05:50 - 1:05:57     Text: we were doing in the objective function there was a sum over you so you've got you know this big

1:05:57 - 1:06:04     Text: corpus of text right so you're taking a sum over every word which is it appearing as the center word

1:06:04 - 1:06:10     Text: and then inside that there's a second sum um which is for each word in the context so you are

1:06:10 - 1:06:17     Text: going to count each word as a context word and so then for one particular term of that objective

1:06:17 - 1:06:23     Text: function you've got a particular context word and a particular um center word but you're then

1:06:23 - 1:06:30     Text: sort of summing over different context words for each center word and then you're summing over

1:06:30 - 1:06:38     Text: all of the decisions of different center words and and to say um a little just a sentence more

1:06:38 - 1:06:44     Text: about having two vectors I mean you know in some senses in ugly detail but it was done to make

1:06:44 - 1:06:54     Text: things sort of simple and fast so you know if you um look at um the math carefully if you sort of

1:06:54 - 1:07:01     Text: treated um this two vectors is the same so if you use the same vectors for center and context

1:07:02 - 1:07:09     Text: and you say okay let's work out the derivatives um things get uglier and the reason that they get

1:07:09 - 1:07:18     Text: uglier is it's okay when I'm iterating over all the choices of um context word oh my god sometimes

1:07:18 - 1:07:24     Text: the context word is going to be the same as the center word and so that messes with working out

1:07:25 - 1:07:33     Text: my derivatives um whereas by taking them as separate vectors that never happens so it's easy um

1:07:33 - 1:07:39     Text: but the kind of interesting thing is you know saying that you have these two different representations

1:07:40 - 1:07:47     Text: sort of just ends up really sort of doing no harm and my wave my hands argument for that is

1:07:47 - 1:07:54     Text: you know since we're kind of moving through each position the corpus one by one you know

1:07:54 - 1:08:00     Text: something a word that is the center word at one moment is going to be the context word at the

1:08:00 - 1:08:06     Text: next moment and the word that was the context word is going to become the center word so you're

1:08:06 - 1:08:14     Text: sort of doing the um the computation both ways in each case and so you should be able to convince

1:08:14 - 1:08:21     Text: yourself that the two representations for the word end up being very similar and they do not

1:08:21 - 1:08:26     Text: not identical for technical reasons of the ends of documents and things like that but very

1:08:26 - 1:08:34     Text: very similar um and so effectively tend to get two very similar representations for each word

1:08:34 - 1:08:39     Text: and we just average them and call that the word vector and so when we use word vectors we just have

1:08:39 - 1:08:47     Text: one vector for each word that makes sense thank you i have a question purely of curiosity so we

1:08:47 - 1:08:52     Text: saw that when we projected the um vectors the word vectors onto the 2d surface we saw like

1:08:52 - 1:08:57     Text: little clusters of where's our similar each other and then later on we saw that um with the

1:08:57 - 1:09:02     Text: analogy thing we kind of see that there's these directional vectors that sort of anything like

1:09:02 - 1:09:08     Text: the ruler of or the CEO of something like that and so I'm wondering is there are other relationships

1:09:08 - 1:09:14     Text: between those relational vectors themselves such as like is the um the ruler of vector sort of

1:09:14 - 1:09:20     Text: similar to the CEO of vector which is very different from like is makes a good sandwich with

1:09:20 - 1:09:31     Text: vector um is there any research on that that's a good question um how will you stump me already

1:09:31 - 1:09:41     Text: in the first lecture uh i mean that yeah i can't actually think of a piece of research and so

1:09:41 - 1:09:46     Text: i'm not sure i have a confident and i'm not sure i have a confident answer i mean it seems like

1:09:46 - 1:09:53     Text: that's a really easy thing to check um with how much you have one of these sets of um

1:09:53 - 1:10:00     Text: word vectors that it seems um like and for any relationship that is represented well

1:10:00 - 1:10:08     Text: enough by word you should be able to see if it comes out kind of similar um huh i mean i'm

1:10:08 - 1:10:17     Text: not sure we can we can look and see yeah that's totally okay just just curious sorry i missed the

1:10:17 - 1:10:23     Text: last little bit your answer to first question so when you wanted to collapse to vectors for the same

1:10:23 - 1:10:28     Text: word did you say you usually take the average um different people have done different things for

1:10:28 - 1:10:35     Text: the most common practice is after you uh you know there's still a bit more i have to cover about

1:10:35 - 1:10:39     Text: running word devec that we didn't really get through today so i've still got a bit more work to do

1:10:39 - 1:10:47     Text: on first day but you know once you run your word devec algorithm and you sort of your output is two

1:10:47 - 1:10:55     Text: vectors for each word and kind of a when it's center and when it's context and so typically people

1:10:55 - 1:11:01     Text: just average those two vectors and say okay that's the representation of the word croissant

1:11:01 - 1:11:07     Text: and that's what appears in the sort of word vectors file like the one i loaded

1:11:11 - 1:11:17     Text: oh thanks so my question is if a word have two different meanings or multiple different meanings

1:11:17 - 1:11:24     Text: can we still represent that's the same single vector? Yes that's a very good question um and actually

1:11:24 - 1:11:31     Text: there is some content on that in first days lecture so i can say more about that um but yeah the

1:11:31 - 1:11:39     Text: first reaction is you kind of should be scared because um something i've said nothing about at all

1:11:39 - 1:11:46     Text: is you know most words especially short common words have lots of meanings so if you have a word

1:11:46 - 1:11:54     Text: like star that can be astronomical object or it can be you know a film star a Hollywood star

1:11:54 - 1:12:00     Text: or it can be something like the gold stars that you've got an elementary school and we've just

1:12:00 - 1:12:09     Text: taking all those uses of the word star and collapsing them together um into one word vector um

1:12:09 - 1:12:15     Text: and you might think that's really crazy and bad um but actually turns out to work

1:12:15 - 1:12:22     Text: rather well um maybe i won't go through all of that um right now because there is actually

1:12:22 - 1:12:29     Text: stuff on that on first days lecture oh i see thanks you can just add up the slides for next time

1:12:29 - 1:12:31     Text: oh white

1:12:32 - 1:12:39     Text: hey i know this makes me seem as good as but i guess a lot of us were also taking this course because

1:12:39 - 1:12:49     Text: of the height, good speed, AI, speech recognition and my basic question is maybe two basic is

1:12:49 - 1:12:56     Text: do we look at how to implement or do we look at the stack of like some of the

1:12:56 - 1:13:03     Text: lecture something prox speech to uh contact actions in this course was it just the priority

1:13:04 - 1:13:13     Text: uh understanding so this is an unusual content an unusual quarter um but for this quarter there's

1:13:13 - 1:13:22     Text: a very clear answer which is um this quarter um there's also a speech class being taught which is

1:13:22 - 1:13:30     Text: CS224S um a speech class being taught by Andrew Mars and you know this is a class that's been more

1:13:30 - 1:13:36     Text: regularly offered sometimes it's only been offered every third year um but it's being offered

1:13:36 - 1:13:44     Text: right now so if what you want to do is learn about speech recognition and learn about sort of methods

1:13:44 - 1:13:54     Text: for building dialogue systems um you should do CS224S um so you know for this class in general um the

1:13:54 - 1:14:05     Text: vast bulk of this class is working with text and doing various kinds of text analysis and understanding

1:14:05 - 1:14:13     Text: so we do tasks like some of the ones I've mentioned we do machine translation um we um do question

1:14:13 - 1:14:19     Text: answering um we look at how to parse this structure of sentences and things like that you know

1:14:19 - 1:14:25     Text: in other years I sometimes say a little bit about speech um but since this quarter there's a

1:14:25 - 1:14:28     Text: whole different class that's focused on speech that's similar but silly.

1:14:28 - 1:14:34     Text: Well that's what I'm talking about right now together uh the part of pardon and we're in each audio

1:14:34 - 1:14:41     Text: like I guess I guess I guess I guess I can you're in your focus more on speech

1:14:41 - 1:14:46     Text: feeling standing I guess I guess I guess I'll be I'll be this is the question I'm going to have

1:14:46 - 1:14:52     Text: on I'm now getting a bad echo I'm not sure if that's my fault or your fault but I'm anyway um

1:14:52 - 1:15:02     Text: um anyway answer yeah so the speech class does a mix of stuff so I mean the sort of pure speech

1:15:02 - 1:15:09     Text: problems classically have been um doing speech recognitions are going from a speech signal to text

1:15:09 - 1:15:18     Text: and doing text to speech going from text to a speech signal and both of those are problems which

1:15:18 - 1:15:24     Text: are now normally done including by the cell phone that sits in your pocket um using your networks

1:15:24 - 1:15:32     Text: and so it covers both of those but then between that um the class covers quite a bit and in particular

1:15:32 - 1:15:40     Text: it starts off with um looking at building dialogue systems so this is sort of something like Alexa

1:15:40 - 1:15:47     Text: Google Assistance series as to well assuming you have a speech recognition a text to speech system

1:15:49 - 1:15:55     Text: then you do have text in and text out what are the kind of ways that people go about building um

1:15:56 - 1:16:02     Text: um dialogue systems like the ones that I just mentioned

1:16:02 - 1:16:09     Text: um actually how to question so I think there is some people in the chat noticing that the

1:16:10 - 1:16:14     Text: like opposites were really near to each other which was kind of odd but I was also wondering um

1:16:15 - 1:16:21     Text: what about like positive and negative uh violins or like offense um is that captured well

1:16:21 - 1:16:27     Text: in this type of model or is it like not captured well like we're like with the opposites how those

1:16:27 - 1:16:33     Text: weren't really so the short answer is for both of those and so there's this is a good question

1:16:33 - 1:16:40     Text: a good observation and the short answer is no both of those are captured really really badly I mean

1:16:40 - 1:16:49     Text: there's there's a definition um you know when I say really really badly I mean what I mean is if

1:16:49 - 1:16:56     Text: that's what you want to focus on um you've got problems I mean it's not that the algorithm doesn't

1:16:56 - 1:17:05     Text: work so precisely what you find um is that you know antennas generally occur in very similar topics

1:17:05 - 1:17:12     Text: because you know whether it's um saying you know John is really tall or John is really short

1:17:12 - 1:17:19     Text: or that movie was fantastic or that movie was terrible right you get antennas occurring in the

1:17:19 - 1:17:26     Text: same context so because of that their vectors are very similar and similarly for sort of affect and

1:17:26 - 1:17:34     Text: sentiment-based words well like make um great and terrible example their contexts are similar um

1:17:34 - 1:17:42     Text: therefore um that if you're just learning this kind of predict words in context models um that

1:17:42 - 1:17:49     Text: no that's not captured now that's not the end of the story I mean you know absolutely people wanted

1:17:49 - 1:17:56     Text: to use neural networks for sentiment and other kinds of sort of connotation affect and there are

1:17:56 - 1:18:02     Text: very good ways of doing that but somehow you have to do something more than simply predicting words

1:18:02 - 1:18:09     Text: in context because that's not sufficient to um capture that dimension um more on that later

1:18:09 - 1:18:16     Text: but this happened to like adjectives too like very basic adjectives like so and like not

1:18:17 - 1:18:23     Text: because those would like appear like some like context right what was your first example before not

1:18:23 - 1:18:30     Text: like so this is so cool yeah so that's actually a good question as well so yeah so there are

1:18:30 - 1:18:35     Text: these very common words there are commonly referred to as function words by linguists which

1:18:35 - 1:18:44     Text: know includes ones like um so and not but other ones like and and prepositions like you know

1:18:44 - 1:18:52     Text: two and on um you sort of might suspect that the word vectors for those don't work out very well

1:18:52 - 1:18:59     Text: because they occur in all kinds of different contexts and they're not very distinct from each other

1:18:59 - 1:19:05     Text: in many cases and to a first approximation I think that's true and part of why I didn't use those as

1:19:05 - 1:19:14     Text: examples in my slides yeah but you know at the end of the day we do build up vector representations

1:19:14 - 1:19:20     Text: of those words too and you'll see in a few um lectures time when we start building what we call

1:19:20 - 1:19:26     Text: language models that actually they do do a great job in those words as well I mean to explain what

1:19:26 - 1:19:35     Text: I'm meaning there I mean you know another feature of the word to vector model is it actually

1:19:35 - 1:19:42     Text: ignore the position of words right so it's said I'm going to predict every word around the center

1:19:42 - 1:19:48     Text: word but you know I'm predicting it in the same way I'm not predicting differently the word before

1:19:48 - 1:19:54     Text: me or versus the word after me or the word two away in either direction right they're all just

1:19:54 - 1:20:02     Text: predicted the same by that one um probability function and so if that's all you've got that sort

1:20:02 - 1:20:09     Text: of destroys your ability to do a good job at um capturing these sort of common more grammatical

1:20:09 - 1:20:16     Text: words like so not an an um but we build slightly different models that are more sensitive to the

1:20:16 - 1:20:21     Text: structure of sentences and then we start doing a good job on those two okay thank you

1:20:21 - 1:20:31     Text: I had a question about um the characterization of word to vector um because I read all these

1:20:31 - 1:20:37     Text: such and it seems to characterize architecture as well but skip very model which was slightly

1:20:37 - 1:20:41     Text: different from how I was presented in the module so at least like two hundred and three hundred

1:20:41 - 1:20:50     Text: years ago or yeah so I've I've still gotten more to say so I'm stationed first day

1:20:50 - 1:20:59     Text: um for more stuff on word vectors um you know so word to vac is kind of a framework for building

1:20:59 - 1:21:07     Text: word vectors and that there are sort of several variant precise algorithms within the framework

1:21:07 - 1:21:14     Text: and you know one of them is have whether you're predicting the context words or whether you're

1:21:14 - 1:21:22     Text: predicting the center word um so the model I showed was predicting the context words so it was

1:21:22 - 1:21:32     Text: the skip gram model but then there's sort of a detail of how in particular do you do the optimization

1:21:32 - 1:21:40     Text: and what I presented was the sort of easiest way to do it which is naive optimization with the

1:21:40 - 1:21:49     Text: equation the softmax equation for word vectors um it turns out that that naive optimization is sort of

1:21:49 - 1:21:57     Text: exneedlessly expensive and people have come up with um faster ways of doing it in particular

1:21:57 - 1:22:02     Text: um the commonest thing you see is what's called skip gram with negative sound playing and the

1:22:02 - 1:22:08     Text: negative sound playing is then sort of a much more efficient way to estimate things and I'll mention

1:22:08 - 1:22:17     Text: that on Thursday. Right okay thank you. Who's asking for more information about how word vectors

1:22:17 - 1:22:24     Text: are constructed uh beyond the summary of random initialization and then gradient based

1:22:24 - 1:22:31     Text: uh iterative operator optimization. Yeah um so I sort of will do a bit more connecting this

1:22:31 - 1:22:37     Text: together um in the Thursday lecture I guess there's sort of I mean so much I'm gonna fit in the

1:22:37 - 1:22:44     Text: first class um but the picture the picture is essentially the picture I showed the pieces of

1:22:44 - 1:22:57     Text: so to learn word vectors you start off by having a vector for each word type both for context and

1:22:57 - 1:23:07     Text: outside and those vectors you to initialize randomly um so that you just put small little

1:23:07 - 1:23:13     Text: numbers that are randomly generated in each vector component and that's just your starting point

1:23:13 - 1:23:20     Text: and so from there on you're using an iterative algorithm where you're progressively updating

1:23:20 - 1:23:27     Text: those word vectors so they do a better job at predicting which words appear in the context of

1:23:27 - 1:23:36     Text: other words and the way that we're going to do that is by using um the gradients that I was sort of

1:23:36 - 1:23:42     Text: starting to show how to calculate and then you know once you have a gradient you can walk in the

1:23:42 - 1:23:49     Text: opposite direction of the gradient and you're then walking downhill I you're minimizing your loss

1:23:49 - 1:23:56     Text: and we're going to sort of do lots of that until our word vectors get as good as possible so you know

1:23:57 - 1:24:05     Text: um it's really all math but in some sense you know word vector learning is sort of miraculous since

1:24:05 - 1:24:14     Text: you do literally just start off with completely random word vectors and run this algorithm of

1:24:14 - 1:24:21     Text: predicting words for a long time and out of nothing emerges these word vectors that represent meaning

1:24:21 - 1:24:37     Text: well