Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 7 - Translation, Seq2Seq, Attention

0:00:00 - 0:00:09     Text: Hello, everyone, and welcome back into week four.

0:00:09 - 0:00:14     Text: So for week four, it's going to come in two halves.

0:00:14 - 0:00:19     Text: So today I'm going to talk about machine translation related topics.

0:00:19 - 0:00:28     Text: And then in the second half of the week, we take a little bit of a break from learning more and more neural network topics and talk about

0:00:28 - 0:00:34     Text: final projects, but also some practical tips for building neural network systems.

0:00:34 - 0:00:40     Text: So for today's lecture, this is an important lecture content-wise.

0:00:40 - 0:00:45     Text: So first of all, I'm going to introduce a new task: machine translation.

0:00:45 - 0:00:48     Text: And it turns out that

0:00:48 - 0:00:58     Text: this task is a major use case of a new architectural technique I'm going to teach you about in deep learning, which is sequence-to-sequence models.

0:00:58 - 0:01:01     Text: And so we'll spend a lot of time on those.

0:01:01 - 0:01:09     Text: And then there's a crucial way that's been developed to improve sequence to sequence models, which is the idea of attention.

0:01:09 - 0:01:14     Text: And so that's what I'll talk about in the final part of the class.

0:01:14 - 0:01:24     Text: Just checking everyone's keeping up with what's happening. So first of all, assignment three is due today.

0:01:24 - 0:01:29     Text: So hopefully you've all got your neural dependency parsers parsing text well.

0:01:29 - 0:01:32     Text: At the same time assignment four is out today.

0:01:32 - 0:01:41     Text: And really today's lecture is the primary content for what you'll be using for building your assignment four systems.

0:01:41 - 0:01:47     Text: Switching it up a little for assignment four, we give you a mighty two extra days.

0:01:47 - 0:01:52     Text: So you get nine days for it and it's due on Thursday.

0:01:52 - 0:02:02     Text: On the other hand, do please be aware that assignment four is bigger and harder than the previous assignments.

0:02:02 - 0:02:05     Text: So do make sure you get started on it early.

0:02:05 - 0:02:10     Text: And then as I mentioned, Thursday, I'll turn to final projects.

0:02:10 - 0:02:16     Text: Okay, so let's get straight into this with machine translation.

0:02:16 - 0:02:27     Text: So very quickly, I wanted to tell you a little bit about, you know, where we were and what we did before we get to neural machine translation.

0:02:27 - 0:02:31     Text: And so let's do the prehistory of machine translation.

0:02:31 - 0:02:44     Text: So machine translation is the task of translating a sentence X from one language, which is called the source language to another language, the target language forming a sentence Y.

0:02:44 - 0:02:51     Text: So we start off with a source language sentence x, here Rousseau's French: L'homme est né libre, et partout il est dans les fers.

0:02:51 - 0:03:01     Text: And then we translate it and we get out the translation man is born free, but everywhere he is in chains.

0:03:01 - 0:03:04     Text: Okay, so there's our machine translation.

0:03:04 - 0:03:10     Text: Okay, so in the early 1950s, there started to be work on machine translation.

0:03:10 - 0:03:18     Text: And it's actually interesting to think about computer science: if you find things that have machine in the name, most of them are old things.

0:03:18 - 0:03:24     Text: And this really kind of came about in the US context in the context of the Cold War.

0:03:24 - 0:03:47     Text: So there was this desire to keep tabs on what the Russians were doing and people had the idea that because some of the earliest computers had been so successful at doing code breaking during the second world war, then maybe we could set early computers to work during the Cold War to do translation.

0:03:47 - 0:04:02     Text: And hopefully this will play and you'll be able to hear it; it's a little video clip showing some of the earliest work in machine translation, from 1954.

0:04:02 - 0:04:19     Text: They hadn't reckoned with ambiguity when they set out to use computers to translate languages.

0:04:19 - 0:04:28     Text: One of the first non numerical applications of computers, it was hyped as the solution to the Cold War obsession of keeping tabs on what the Russians were doing.

0:04:28 - 0:04:32     Text: Claims were made that the computer would replace most human translators.

0:04:32 - 0:04:38     Text: Now, at present of course you're just in the experimental stage. When you go in for full-scale production, what will the capacity be?

0:04:38 - 0:04:45     Text: We should be able to do, with a commercial computer, about one to two million words an hour.

0:04:45 - 0:04:52     Text: And this would be quite an adequate speed to cope with the whole output of the Soviet Union in just a few hours of computer time a week.

0:04:52 - 0:04:54     Text: When do you hope to be able to achieve this speed?

0:04:54 - 0:04:58     Text: If our experiments go well, then perhaps within five years or so.

0:04:58 - 0:05:03     Text: And finally Mr. McDaniel does this mean the end of human translators?

0:05:03 - 0:05:12     Text: So yes for translators of scientific and technical material, but as regards poetry and novels, no I don't think we'll ever replace the translators of that type of material.

0:05:12 - 0:05:15     Text: Mr. McDaniel, thank you very much.

0:05:15 - 0:05:18     Text: But despite the hype, it ran into deep trouble.

0:05:18 - 0:05:27     Text: Yeah, so the experiments did not go well.

0:05:27 - 0:05:36     Text: And so, you know, in retrospect, it's not very surprising that the early work did not work out very well.

0:05:36 - 0:05:47     Text: I mean, this was in the sort of really beginning of the computer age in the 1950s, but it was also the beginning of, you know, people starting to understand the science of human languages.

0:05:47 - 0:05:54     Text: The field of linguistics. So really people had not much understanding of either side of what was happening.

0:05:54 - 0:06:02     Text: So what you had was people were trying to write systems on really incredibly primitive computers, right?

0:06:02 - 0:06:14     Text: You know, it's probably the case that now if you have a USB-C power brick, it has more computational capacity inside it than the computers that they were using to translate.

0:06:14 - 0:06:21     Text: And so effectively what you were getting were very simple rule based systems and word look up.

0:06:21 - 0:06:30     Text: So it was like dictionary lookup: look up a word and get its translation. But that just didn't work well, because human languages are much more complex than that.

0:06:30 - 0:06:36     Text: Often words have many meanings and different senses, as we've discussed a bit.

0:06:36 - 0:06:45     Text: Often there are idioms; you need to understand the grammar to rewrite the sentences. So for all sorts of reasons, it didn't work well.

0:06:45 - 0:06:59     Text: And this idea was largely canned. In particular, there was a famous US government report in the mid-1960s, the ALPAC report, which basically concluded this wasn't working.

0:06:59 - 0:07:10     Text: Okay, work then did revive in AI on rule-based methods of machine translation in the 90s.

0:07:10 - 0:07:24     Text: But when things really came alive was once you got into the mid-90s, when we were in the period of statistical NLP that we've seen in other places in the course.

0:07:24 - 0:07:40     Text: And then the idea began: can we start with just data about translation, i.e. sentences and their translations, and learn a probabilistic model that can predict the translations of fresh sentences?

0:07:40 - 0:07:44     Text: So suppose we're translating French into English.

0:07:44 - 0:07:58     Text: So what we want to do is build a probabilistic model that given a French sentence, we can say what's the probability of different English translations and then we'll choose the most likely translation.

0:07:58 - 0:08:18     Text: It was then found to be felicitous to break this down into two components by just reversing this with Bayes' rule. So instead we have a probability over English sentences, P(y),

0:08:18 - 0:08:34     Text: and then a probability of a French sentence given an English sentence, P(x|y). People were able to make more progress with this, and it's not immediately obvious why that should be, because this is just a trivial rewrite with Bayes' rule.

0:08:34 - 0:08:58     Text: But the problem could be separated into two parts which proved to be more tractable. So on the one hand, you effectively had a translation model, where you could just give a probability of words or phrases being translated between the two languages, without having to bother about the structure and word order of the languages.

0:08:58 - 0:09:27     Text: And on the other hand, you have precisely what we spent a long time with last week, which is just a probabilistic language model. So if we have a very good model of what good, fluent English sentences sound like, which we can build just from monolingual data, we can then get it to make sure we're producing sentences that sound good, while the translation model hopefully puts the right words into them.
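
Written out, the decomposition being described here is the standard noisy-channel formulation (a minimal rendering consistent with the lecture, with x the source sentence and y the target sentence):

```latex
\hat{y} \;=\; \arg\max_{y} P(y \mid x)
       \;=\; \arg\max_{y} P(x \mid y)\, P(y)
```

Here P(x|y) is the translation model learned from parallel data, P(y) is the language model learned from monolingual data, and the denominator P(x) can be dropped because it does not depend on y.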

0:09:27 - 0:09:40     Text: So how do we learn the translation model, since we haven't covered that? The starting point was to get a large amount of parallel data, which is human-translated sentences.

0:09:40 - 0:09:59     Text: At this point it is mandatory that I show a particular picture of the Rosetta Stone, which is the famous original piece of parallel data that allowed the decoding of Egyptian hieroglyphs, because it had the same piece of text in different languages.

0:09:59 - 0:10:11     Text: In the modern world, there are fortunately for people who build natural language processing systems, quite a few places where parallel data is produced in large quantities.

0:10:11 - 0:10:36     Text: So the European Union produces a huge amount of parallel text across European languages; the French, sorry, not the French, the Canadian parliament conveniently produces parallel text between French and English, and even a limited amount in Inuktitut, a Canadian Inuit language.

0:10:36 - 0:10:49     Text: And then the Hong Kong parliament produces English and Chinese. So there's a fair availability from different sources and we can use that to build models.

0:10:49 - 0:10:58     Text: So how do we do it, though? All we have is these sentences, and it's not quite obvious how to build a probabilistic model out of those.

0:10:58 - 0:11:11     Text: So as before, what we want to do is break this problem down. So in this case, what we're going to do is introduce an extra variable, which is an alignment variable.

0:11:11 - 0:11:24     Text: So a is the alignment variable, which is going to give a word level or sometimes phrase level correspondence between parts of the source sentence and the target sentence.

0:11:24 - 0:11:46     Text: And so if we can induce this alignment between the two sentences, then we can have probabilities for the pieces, that is, for how likely a word or a short phrase is to be translated in a particular way.
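
As a worked form of what introducing the alignment variable buys us (a minimal sketch in the spirit of classic IBM-style statistical MT, not a formula read out in the lecture): the translation model is obtained by summing over the latent alignments,

```latex
P(x \mid y) \;=\; \sum_{a} P(x, a \mid y)
```

where the alignment a specifies which source words correspond to which target words, and the alignment-level probabilities decompose into word- or phrase-translation probabilities.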

0:11:46 - 0:11:59     Text: In general, you know, alignment is working out the correspondence between words, and that is capturing the grammatical differences between languages.

0:11:59 - 0:12:14     Text: So words will occur in different orders in different languages, depending on whether it's a language that puts the subject before the verb, or the subject after the verb, or the verb before both the subject and the object.

0:12:14 - 0:12:28     Text: So alignments will also capture something about the differences in the ways that languages do things. And what we find is that we get every possibility of how words can align between languages.

0:12:28 - 0:12:45     Text: So we can have words that don't get translated at all in the other language. In French, you put a definite article, le, before country names, like le Japon. So when that gets translated into English, you just get Japan.

0:12:45 - 0:13:01     Text: So there's no translation of the le; it just goes away. On the other hand, you can get one-to-many translations, where one French word gets translated as several English words.

0:13:01 - 0:13:18     Text: So here the last French word gets translated as aboriginal people, multiple words. And you can get the reverse, where you have several French words that get translated as one English word.

0:13:18 - 0:13:38     Text: So this is what we see with mis en application being translated as implemented. And you can get even more complicated ones. So here we sort of have four English words being translated as French words, but they don't really break down and translate each other word for word.

0:13:38 - 0:13:54     Text: And these things don't only happen across languages, they also happen within a language when you have different ways of saying the same thing. So another way you might have expressed the poor don't have any money is to say the poor are moneyless.

0:13:54 - 0:14:06     Text: And that's much more similar to how the French is being rendered here. And so even English to English, you have the same kind of alignment problem.

0:14:06 - 0:14:16     Text: In probabilistic, or statistical, machine translation as it's more commonly known, what we wanted to do is learn these alignments.

0:14:16 - 0:14:39     Text: There's a bunch of sources of information you could use: if you start with parallel sentences, you can see how often words and phrases co-occur in parallel sentences, you can look at their positions in the sentences, and figure out what good alignments are. But alignments are a categorical thing; they're not

0:14:39 - 0:15:02     Text: probabilistic, and so they are latent variables, and you need to use special learning algorithms like the expectation-maximization algorithm for learning about latent variables. In the olden days of CS224N, before we started doing it all with deep learning, we spent tons of CS224N dealing with latent variable algorithms.

0:15:02 - 0:15:13     Text: But these days we don't cover that at all, and you're going to have to go off and see CS 228 if you want to know more about that. And, you know, we're not really expecting you to understand the details here.

0:15:13 - 0:15:25     Text: But I did then want to say a bit more about how decoding was done in a statistical machine translation system.

0:15:25 - 0:15:43     Text: So what we want to do is say: we have a translation model and a language model, and we want to pick out the most likely y, that is, the translation of the sentence. What kind of process could we use to do that?

0:15:43 - 0:16:03     Text: Well, you know, the naive thing is to say, well, let's just enumerate every possible y and calculate its probability. But we can't possibly do that, because the number of possible sentences in the target language is exponential in the length of the sentence.

0:16:03 - 0:16:19     Text: So that's way too expensive, so we need to have some way to break it down more. Well, we had a simple way for language models: we just generated words one at a time and laid out the sentence.

0:16:19 - 0:16:34     Text: And so that seems a reasonable thing to do, but here we need to deal with the fact that things occur in different orders in the source language and in its translations.

0:16:34 - 0:16:47     Text: And so we do want to break it into pieces with an independence assumption like the language model, but then we want a way of breaking things apart and exploring it in what's called a decoding process.

0:16:47 - 0:16:54     Text: So this is the way it was done. So we'd start with a source sentence. So this is a German sentence.

0:16:54 - 0:17:08     Text: And as is standard in German, you're getting this second position verb. So that's probably not in the right position for where the English translation is going to be.

0:17:08 - 0:17:28     Text: So we might need to rearrange the words. So what we have, based on the translation model, is words or phrases that are reasonably likely translations of each German word, or sometimes a German phrase.

0:17:28 - 0:17:49     Text: So these are effectively the Lego pieces out of which we're going to want to create the translation. And so then, making use of this data, we're going to generate the translation piece by piece, kind of like we did with our neural language models.

0:17:49 - 0:18:02     Text: So we're going to start with an empty translation and then we're going to say, well, we want to use one of these Lego pieces. And so we could explore different possible ones.

0:18:02 - 0:18:18     Text: So there's a search process, and one of the possible pieces is we could translate er with he, or we could start the sentence with are, translating the second word. So we could explore various likely possibilities.

0:18:18 - 0:18:33     Text: So we're guided by our language model: it's probably much more likely to start the sentence with he than it is to start the sentence with are, though are is not impossible. Okay. And then the other thing we're doing, with these little blotches of black up the top,

0:18:33 - 0:18:49     Text: is keeping track of which words of the source sentence we've already translated, so we don't translate them again as we build up the output.

0:18:49 - 0:19:03     Text: So we could translate the negation here and translate that as does not, and we explore various continuations, in a process I'll go through in more detail later when we do the neural equivalent.

0:19:03 - 0:19:21     Text: So we do this search where we explore likely translations and prune, and eventually we've translated the whole of the input sentence and have worked out a fairly likely translation: he does not go home. And that's what we use as the translation.

0:19:21 - 0:19:38     Text: Okay. So in the period from about 1997 to around 2013 statistical machine translation was a huge research field.

0:19:38 - 0:19:55     Text: The best systems were extremely complex. They had hundreds of details that I certainly haven't mentioned here. The systems had lots of separately designed and built components. So I mentioned language model and a translation model.

0:19:55 - 0:20:10     Text: And there were lots of other components for reordering models and inflection models and other things. There was lots of feature engineering. Typically the models also made use of lots of extra resources.

0:20:10 - 0:20:33     Text: They took lots of human effort to maintain. But nevertheless, they were already fairly successful. So Google Translate launched in the mid-2000s and people thought, wow, this is amazing: you could start to get sort of semi-decent automatic translations for different web pages.

0:20:33 - 0:20:43     Text: That was chugging along well enough. And then we got to 2014, and really with enormous suddenness,

0:20:43 - 0:21:01     Text: people worked out ways of doing machine translation using large neural networks. And these large neural networks proved to be just extremely successful and largely blew away everything that preceded them.

0:21:01 - 0:21:10     Text: And for the next big part of the lecture, what I'd like to do is tell you something about neural machine translation.

0:21:10 - 0:21:30     Text: Well, it means you're using a neural network to do machine translation. But in practice, it's meant slightly more than that. It has meant that we're going to build one very large neural network, which completely does translation end to end.

0:21:30 - 0:21:43     Text: So we're going to have a large neural network. We're going to feed in the source sentence into the input and what's going to come out of the output of the neural network is the translation of the sentence.

0:21:43 - 0:21:59     Text: We're going to train that model end to end on parallel sentences, and it's the entire system, rather than being lots of separate components like an old-fashioned machine translation system. We'll see that in a bit.

0:21:59 - 0:22:19     Text: These neural network architectures are called sequence-to-sequence models, commonly abbreviated seq2seq, and they involve two neural networks. Here it says two RNNs; the version I'm presenting now has two RNNs, but more generally they involve two neural networks.

0:22:19 - 0:22:33     Text: There's one neural network that is going to encode the source sentence. So if we have a source sentence here, we are going to encode that sentence. And well, we know about a way that we can do that.

0:22:33 - 0:22:51     Text: So using the kind of LSTMs that we saw last class, we can start at the beginning and go through a sentence and update the hidden state each time. And that will give us a representation of the content of the source sentence.

0:22:51 - 0:23:11     Text: So that's the first sequence model, which encodes the source sentence. And we'll use the idea that the final hidden state of the encoder RNN is going to, in essence, represent the source sentence.

0:23:11 - 0:23:34     Text: And we're going to feed it in directly as the initial hidden state for the decoder RNN. So then on the other side of the picture, we have our decoder RNN. And it's a language model that's going to generate a target sentence conditioned on the final hidden state of the encoder RNN.

0:23:34 - 0:23:49     Text: So we're going to start with a start symbol as input. We're going to feed in the hidden state from the encoder RNN. And this second, green RNN has completely separate parameters, I might just emphasize.

0:23:49 - 0:24:05     Text: And we do the same kind of LSTM computations and generate the first word of the sentence, he. And so then, doing LSTM generation just like last class, we copy that down as the next input

0:24:05 - 0:24:31     Text: and run the next step of the LSTM, generate another word, copy it down, and chug along until we've translated the sentence. Right, so this is showing the test-time behavior, when we're generating a new sentence. For the training-time behavior, when we have parallel sentences,

0:24:31 - 0:24:51     Text: we use the same kind of sequence-to-sequence model, but for the decoder part we do it just like training a language model, where we want to do teacher forcing and predict each word that's actually found in the target language sentence.
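
To make the architecture concrete, here is a minimal PyTorch sketch of the two-LSTM encoder-decoder just described. All of the names, sizes, and the single-layer setup are illustrative assumptions, not the assignment's actual model.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal LSTM encoder-decoder sketch (illustrative names and sizes)."""

    def __init__(self, src_vocab_size, tgt_vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab_size, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        # The decoder has completely separate parameters from the encoder.
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab_size)

    def forward(self, src_ids, tgt_in_ids):
        # Encode the source sentence; keep only the final (h, c) state.
        _, enc_state = self.encoder(self.src_embed(src_ids))
        # Decode conditioned on that final state. tgt_in_ids is the gold target
        # shifted right (starting with <s>), i.e. teacher forcing at train time.
        dec_out, _ = self.decoder(self.tgt_embed(tgt_in_ids), enc_state)
        return self.out(dec_out)  # (batch, tgt_len, tgt_vocab_size) logits
```

The only conditioning the decoder receives here is the encoder's final hidden and cell state, which is exactly the information-transfer choice discussed a little later in the lecture.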

0:24:51 - 0:25:13     Text: Sequence-to-sequence models have been an incredibly powerful, widely used workhorse for neural networks in NLP. So although, you know, historically machine translation was the first big use of them and is sort of the canonical use,

0:25:13 - 0:25:29     Text: they're used everywhere else as well. You can do many other NLP tasks with them. You can do summarization: you can think of text summarization as translating a long text into a short text.

0:25:29 - 0:25:51     Text: And there are other things that are in no way a translation whatsoever. So they're commonly used for neural dialogue systems: the encoder will encode the previous two utterances, say, and then you will use the decoder to generate the next utterance.

0:25:51 - 0:26:07     Text: Some uses are even freakier, but have proven to be quite successful. So if you have any way of representing the parse of a sentence as a string,

0:26:07 - 0:26:25     Text: and if you think a little, it's fairly obvious how you can turn the parse of a sentence into a string by just making use of extra syntax like parentheses, or putting in explicit symbols that say left-arc, right-arc,

0:26:25 - 0:26:43     Text: shift, shift, like the transition systems that you used for assignment three. Well, then we could say: let's feed the input sentence to the encoder and let the decoder output the transition sequence of a dependency parser.

0:26:43 - 0:27:03     Text: So somewhat surprisingly that actually works well as another way to build a dependency parser or other kinds of parser. These models have also been applied not just to natural languages, but to other kinds of languages, including music and also programming language code.

0:27:03 - 0:27:21     Text: So you can train a seq2seq system where it reads in pseudocode in natural language and it generates Python code. And if you have a good enough one, it can do this assignment for you.

0:27:21 - 0:27:42     Text: So the essential new idea here with our sequence-to-sequence models is that we have an example of a conditional language model. So previously, the main thing we were doing was just to start at the beginning of the sentence and generate a sentence based on nothing.

0:27:42 - 0:27:58     Text: And here we have something that is going to determine, or partially determine, that is, condition, what we should produce. So we have a source sentence, and that's going to strongly determine what is a good translation.

0:27:58 - 0:28:15     Text: And so to achieve that, what we're going to do is have some way of transferring information about the source sentence from the encoder to trigger what the decoder should do.

0:28:15 - 0:28:29     Text: And the two standard ways of doing that are you either feed in a hidden state as the initial hidden state to the decoder or sometimes you will feed something in as the initial input to the decoder.

0:28:29 - 0:28:50     Text: And so in neural machine translation, we're directly calculating this conditional model: the probability of the target language sentence given the source language sentence. And so at each step, as we break down the word-by-word generation, we're

0:28:50 - 0:29:00     Text: conditioning not only on previous words of the target language, but also each time on our source language sentence x.
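
In symbols, the conditional language model factorizes the probability of a target sentence y = (y_1, ..., y_T) given the source sentence x one word at a time:

```latex
P(y \mid x) \;=\; \prod_{t=1}^{T} P\big(y_t \mid y_1, \ldots, y_{t-1},\, x\big)
```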

0:29:00 - 0:29:17     Text: Because of this, we actually know a ton more about what the sentence that we generate should be. So if you look at the perplexities of these kinds of conditional language models, you will find them quite unlike the numbers I showed last time.

0:29:17 - 0:29:34     Text: So we frequently have very low perplexities: you will have models with perplexities that are something like four, or even less, sometimes, you know, 2.5, because you get a lot of information about what words you should be generating.

0:29:34 - 0:29:51     Text: So then we have the same questions as we had for language models in general: how to train a neural machine translation system, and then how to use it at runtime. So let's go through both of those in a bit more detail.

0:29:51 - 0:30:08     Text: So the first step is we get a large parallel corpus. So we run off to the European Union, for example, and we grab a lot of parallel English French data from the European Parliament proceedings.

0:30:08 - 0:30:35     Text: And then once we have our parallel sentences, what we're going to do is take batches of source sentences and target sentences. We'll encode the source sentence with our encoder LSTM and feed its final hidden state into the decoder LSTM.

0:30:35 - 0:30:55     Text: And this one we are now going to train word by word, by comparing what it predicts as the most likely word to be produced versus what the actual first word, and then the actual second word, is. And to the extent that we get it wrong,

0:30:55 - 0:31:07     Text: we're going to suffer some loss. So this is going to be the negative log probability of generating the correct next word, he, and so on along the sentence.

0:31:07 - 0:31:26     Text: In the same way that we saw last time for language models, we can work out our overall loss for the sentence doing this teacher-forcing style: generate one word at a time, and calculate a loss relative to the word that you should have produced.

0:31:26 - 0:31:49     Text: And so that loss then gives us information that we can backpropagate through the entire network, and the crucial thing about these sequence-to-sequence models that has made them extremely successful in practice is that the entire thing is optimized as a single system end to end.

0:31:49 - 0:32:17     Text: So starting with our final loss, we backpropagate right through the system. So we not only update all the parameters of the decoder model, but we also update all of the parameters of the encoder model, which in turn will influence what conditioning gets passed over from the encoder to the decoder.
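
Here is what that training objective looks like in code, as a hedged sketch building on the Seq2Seq class above; the batch layout, PAD_ID, and function names are assumptions for illustration, not the assignment's starter code.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumed padding index for the target vocabulary

# Per-word negative log likelihood (cross-entropy over the vocabulary),
# ignoring padding positions.
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD_ID)

def train_step(model, optimizer, src_ids, tgt_ids):
    # Teacher forcing: feed the gold target shifted right, predict the next gold word.
    tgt_in, tgt_gold = tgt_ids[:, :-1], tgt_ids[:, 1:]
    logits = model(src_ids, tgt_in)                      # (batch, len, vocab)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)),  # average -log P(correct word)
                   tgt_gold.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # gradients flow through the decoder AND the encoder: end to end
    optimizer.step()
    return loss.item()
```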

0:32:17 - 0:32:31     Text: So this is a good moment for me to return to the three slides that I skipped, running out of time at the end of last class, which is to mention multi-layer RNNs.

0:32:31 - 0:32:51     Text: So the RNNs that we've looked at so far are already deep in one dimension, in that they unroll horizontally over many time steps, but they've been shallow in that there's just been a single layer of recurrent structure above our sentences.

0:32:51 - 0:33:05     Text: We can also make them deep in the other dimension by applying multiple RNNs on top of each other, and this gives us a multi-layer RNN, often also called a stacked RNN.

0:33:05 - 0:33:25     Text: And having a multi-layer RNN allows the network to compute more complex representations. So simply put, the lower RNN layers tend to compute lower-level features and the higher RNN layers should compute higher-level features.

0:33:25 - 0:33:44     Text: And just like in other neural networks, whether it's feed-forward networks or the kind of networks you see in vision systems, you get much greater power and success by having a stack of multiple layers of recurrent neural networks.

0:33:44 - 0:34:13     Text: And you might think that there are two things I could do: I could have a single LSTM with a hidden state of dimension 2000, or I could have four layers of LSTMs with a hidden state of 500 each, and it shouldn't make any difference because I've got roughly the same number of parameters. But that's not true: in practice it does make a big difference, and multi-layer or stacked RNNs are more powerful.
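
In PyTorch, stacking is just a num_layers argument; the dimensions below mirror the 1 x 2000 versus 4 x 500 comparison from the lecture and are purely illustrative.

```python
import torch
import torch.nn as nn

emb_dim = 256
x = torch.randn(8, 20, emb_dim)  # (batch, time, features) dummy input

# One wide layer versus four narrower stacked layers.
wide_lstm = nn.LSTM(emb_dim, hidden_size=2000, num_layers=1, batch_first=True)
deep_lstm = nn.LSTM(emb_dim, hidden_size=500, num_layers=4, batch_first=True)

out_wide, _ = wide_lstm(x)  # (8, 20, 2000)
out_deep, _ = deep_lstm(x)  # (8, 20, 500): top layer's states; lower layers feed upward
```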

0:34:13 - 0:34:23     Text: So, could I ask you, there's a good student question here about what lower level versus higher level features mean in this context?

0:34:23 - 0:34:38     Text: Sure. Yeah, so I mean, in some sense, these are kind of somewhat flimsy, vague terms, whose meaning isn't precise.

0:34:38 - 0:34:50     Text: But typically what that's meaning is that lower level features are knowing sort of more basic things about words and phrases.

0:34:50 - 0:35:09     Text: So that commonly might be things like what part of speech is this word or are these words, the name of a person or the name of a company, whereas higher level features refer to things that are at a higher semantic level.

0:35:09 - 0:35:36     Text: So knowing more about the overall structure of a sentence, knowing something about what it means, whether a phrase has positive or negative connotations, what its semantics are when you put together several words into an idiomatic phrase; roughly, the higher-level kinds of things.

0:35:36 - 0:35:41     Text: Okay, jump ahead.

0:35:41 - 0:36:04     Text: Okay, so when we build one of these end-to-end neural machine translation systems, if we want them to work well: single-layer LSTM encoder-decoder neural machine translation systems just don't work well.

0:36:04 - 0:36:21     Text: But you can build something that is no more complex than the model that I've just explained that does work pretty well, by making it a multi-layer stacked LSTM neural machine translation system.

0:36:21 - 0:36:43     Text: So the picture looks like this: we've got this multi-layer LSTM that's going through the source sentence. And so now at each point in time, we calculate a new hidden representation, but rather than stopping there, we feed it as the input into another layer of LSTM.

0:36:43 - 0:36:54     Text: And we calculate its new hidden representation in the standard way, and the output of it we feed into a third layer of LSTM. And so we run that right along.

0:36:54 - 0:37:23     Text: So the representation of the source sentence from our encoder is then this stack of three hidden layers, and that is what we feed in as the initial hidden layers for then generating translations, or, when training the model, for computing losses.

0:37:23 - 0:37:33     Text: So this is kind of what the picture of an LSTM encoder-decoder neural machine translation system really looks like.

0:37:33 - 0:37:59     Text: So in particular, you know, to give you some idea of that, in a 2017 paper by Denny Britz and others, what they found was that for the encoder RNN, it worked best if it had two to four layers, and four layers was best for the decoder RNN.

0:37:59 - 0:38:21     Text: And the details here, like for a lot of neural nets, depend so much on what you're doing and how much data you have and things like that. But you know, as a rule of thumb to have in your head, it's almost invariably the case that having a two-layer LSTM works a lot better than having a one-layer LSTM.

0:38:21 - 0:38:31     Text: After that, things become much less clear. You know, it's not so infrequent that if you try three layers, it's a fraction better than two, but not really.

0:38:31 - 0:38:50     Text: And if you try four layers, it's actually getting worse again; you know, it depends on how much data, etc., you have. At any rate, it's normally very hard, with the kind of model architecture that I just showed back here, to get better results with more than four layers of LSTM.

0:38:50 - 0:39:07     Text: Normally to do deeper LSTM models and get even better results, you have to be adding extra skip connections of the kind that I talked about at the very end of the last class.

0:39:07 - 0:39:30     Text: Next week, John is going to talk about transformer-based networks; in contrast, for fairly fundamental reasons, those are typically much deeper, but we'll leave discussing them until we get on further.

0:39:30 - 0:39:46     Text: So that was how we train the model, so let's just go a bit more through what the possibilities are for decoding and explore a more complex form of decoding than we've looked at.

0:39:46 - 0:40:08     Text: The simplest way to decode is the one that we've presented so far: we have our LSTM, we generate a hidden state, it gives a probability distribution over words, and you choose the most probable one, the argmax, and you say he, and you copy it down and you repeat over.

0:40:08 - 0:40:27     Text: Doing this is referred to as greedy decoding, taking the most probable word on each step, and it's sort of the obvious thing to do and doesn't seem like it could be a bad thing to do, but it turns out that it actually can be a fairly problematic thing to do.
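
As a concrete sketch of greedy decoding with the Seq2Seq class from earlier (batch size 1; the helper name and interface are assumptions for illustration):

```python
import torch

def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
    """Greedy decoding for the Seq2Seq sketch above:
    take the argmax word at each step and feed it back in."""
    with torch.no_grad():
        _, state = model.encoder(model.src_embed(src_ids))  # final encoder (h, c)
        prev = torch.tensor([[bos_id]])                     # start symbol
        output = []
        for _ in range(max_len):
            dec_out, state = model.decoder(model.tgt_embed(prev), state)
            next_id = model.out(dec_out[:, -1]).argmax(dim=-1)  # most probable word
            if next_id.item() == eos_id:
                break
            output.append(next_id.item())
            prev = next_id.unsqueeze(0)  # committed: no way to undo this choice
        return output
```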

0:40:27 - 0:40:41     Text: And the problem with that is that with greedy decoding, you're taking locally what seems the best choice, and then you're stuck with it, and you have no way to undo decisions.

0:40:41 - 0:40:58     Text: So these examples have been using this sentence about he hit me with a pie, translating from French to English. So, you know, if you start off, and you say, okay, il, the first word in the translation should be he.

0:40:58 - 0:41:23     Text: That looks good, but then you say, well, I'll generate hit. Then somehow the model thinks that the most likely word after hit is a, and there are lots of reasons it could think so, because after hit most commonly there's a direct object noun: he hit a guy, he hit

0:41:23 - 0:41:40     Text: a roadblock, right? So that sounds pretty likely, but you know, once you've generated it there's no way to go backwards, and so you just have to keep on going from there and you may not be able to generate the translation you want.

0:41:40 - 0:42:09     Text: At best you can generate he hit a pie or something. So we'd like to be able to explore a bit more in generating our translations. Well, what could we do? Well, you know, I sort of mentioned this before, looking at the statistical MT models: overall, what we'd like to do is find translations

0:42:09 - 0:42:31     Text: that maximize the probability of y given x, and at least if we know what the length of that translation is, we can compute that as a product, generating a word at a time; and so to have a full model, we also have to have a probability distribution over what the translation length would be.

0:42:31 - 0:42:52     Text: So we could say this is the model, and let's, you know, generate and score all possible sequences y using this model. But that then requires generating an exponential number of translations, and it's far, far too expensive.

0:42:52 - 0:43:21     Text: So beyond greedy decoding, the most important method that is used, and that you'll see in lots of places, is something called beam search decoding. Machine translation is one place where it's commonly used, but this isn't a method that's specific to neural machine translation; you find it in lots of other places, including all other kinds of sequence-to-sequence models.

0:43:21 - 0:43:32     Text: It's not the only other decoding method; once we get on to the language generation class, we'll see a couple more, but this is sort of the next one that you should know about.

0:43:32 - 0:43:47     Text: So beam search's idea is that you're going to keep multiple hypotheses, to make it more likely that you'll find a good generation while keeping the search tractable.

0:43:47 - 0:44:11     Text: So what we do is choose a beam size, and for neural MT the beam size is normally fairly small, something like five to ten. And at each step of the decoder, we're going to keep track of the k most probable partial translations, so initial subsequences of what we're generating, which we call hypotheses.

0:44:11 - 0:44:28     Text: So a hypothesis, which is then sort of the prefix of the translation, has a score, which is its log probability up to what's been generated so far; we can compute that in the typical way using our conditional language model.

0:44:28 - 0:44:44     Text: As written, all of the scores are negative, and so the least negative one, i.e. the highest probability one, is the best one. So what we want to do is search for high-probability hypotheses.
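
Concretely, the score of a hypothesis y_1, ..., y_t on the beam is its cumulative log probability under the conditional language model:

```latex
\mathrm{score}(y_1, \ldots, y_t)
  \;=\; \log P_{\mathrm{LM}}(y_1, \ldots, y_t \mid x)
  \;=\; \sum_{i=1}^{t} \log P_{\mathrm{LM}}\big(y_i \mid y_1, \ldots, y_{i-1},\, x\big)
```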

0:44:44 - 0:45:02     Text: So this is a heuristic method; it's not guaranteed to find the highest probability decoding, but at least it gives you more of a shot than simply doing greedy decoding. So let's go through an example to see how it works.

0:45:02 - 0:45:30     Text: In this case, so I can fit it on a slide, the size of our beam is just two, though normally it would actually be a bit bigger than that, and the blue numbers are the scores of the prefixes, so these are the log probabilities of a prefix. So we start off with our start symbol and we're going to say, okay, what are the two most likely words to generate first

0:45:30 - 0:45:40     Text: according to our language model? And so maybe the two most likely first words are he and I, and there are their log probabilities.

0:45:40 - 0:46:03     Text: Then what we do next is, for each of these k hypotheses, we find what words are likely to follow them; in particular, we find the k most likely words to follow each of those. So we might generate he hit, he struck, I was, I got.

0:46:03 - 0:46:11     Text: Okay, so at this point it sort of looks like we're heading down what will turn into an exponentially-

0:46:11 - 0:46:32     Text: sized tree structure again. But what we do now is we work out the scores of each of these partial hypotheses. So we have four partial hypotheses, he hit, he struck, I was, I got, and we can score each by taking the previous score that we have for the partial hypothesis

0:46:32 - 0:47:00     Text: and adding on the log probability of generating the next word, here hit. So this gives us scores for each hypothesis, and then we can say which of those partial hypotheses, because our beam size k equals two, have the two highest scores, and they are I was and he hit, so we keep those two and ignore the rest.

0:47:00 - 0:47:13     Text: And so then for those two, we're going to generate k hypotheses for the most likely following words: he hit a, he hit me, I was hit, I was struck.

0:47:13 - 0:47:39     Text: And again, now we want to find the k most likely hypotheses out of this full set, and so those are going to be he hit me and he hit a, so we keep just those ones, and then for each of those we generate the k most likely next words.

0:47:39 - 0:48:05     Text: And then again we filter back down to size k by saying, okay, the two most likely things here are pie or with. So we continue working on those: generate continuations, find the two most likely, generate continuations, find the two most likely.

0:48:05 - 0:48:28     Text: And at this point we would generate end of string and say, okay, we've got a complete hypothesis, he hit me with a pie, and we could then trace back through the tree to obtain the full hypothesis for this sentence.

0:48:28 - 0:48:48     Text: So that's most of the algorithm. There's one more detail, which is the stopping criterion. So in greedy decoding, we usually decode until the model produces an END token, and when it produces the END token, we say we are done.

0:48:48 - 0:49:11     Text: In beam search decoding different hypotheses may produce end tokens on different time steps and so we don't want to stop as soon as one path through the search tree has generated end because it could turn out there's a different path through the search tree which will still prove to be better.

0:49:11 - 0:49:22     Text: So what we do is put it aside as a complete hypothesis and continue exploring other hypotheses with our beam search.

0:49:22 - 0:49:48     Text: So usually we will then either stop when we've hit a cutoff length, or when we've completed n complete hypotheses, and then we'll look through the hypotheses that we've completed and say which is the best one of those, and that's the one we'll use.
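
Putting the whole procedure together, here is a compact sketch of beam search. The step_fn interface, the fixed maximum length, and setting completed hypotheses aside are illustrative simplifications of what the lecture describes, not a particular library's API.

```python
import heapq

def beam_search(step_fn, bos_id, eos_id, k=5, max_len=50):
    """Beam search sketch. step_fn(prefix) is assumed to return a list of
    (log_prob, word_id) candidates for the next word given the prefix;
    that interface is an assumption made for this example."""
    beam = [(0.0, [bos_id])]  # (cumulative log probability, hypothesis)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, hyp in beam:
            for logp, word in step_fn(hyp):
                candidates.append((score + logp, hyp + [word]))
        # Keep only the k highest-scoring partial hypotheses.
        beam = heapq.nlargest(k, candidates, key=lambda c: c[0])
        # Hypotheses that produced <eos> are set aside as complete; the rest go on.
        still_open = []
        for score, hyp in beam:
            (completed if hyp[-1] == eos_id else still_open).append((score, hyp))
        beam = still_open
        if not beam:
            break
    # Pick the winner using the length-normalized score discussed next.
    pool = completed if completed else beam
    return max(pool, key=lambda c: c[0] / max(len(c[1]) - 1, 1))
```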

0:49:48 - 0:50:17     Text: Okay, so at that point we have a list of completed hypotheses and we want to select the top one with the highest score. Well, that's exactly what we've been computing: each one has a probability that we've worked out. But it turns out that we might not want to use that quite so naively, because there turns out to be a kind of systematic problem, which is,

0:50:17 - 0:50:43     Text: not as a theorem but in general, longer hypotheses have lower scores. So if you think about this as probabilities of successively generating each word, basically at each step you're multiplying by another probability of generating the next word, and commonly those might be, you know, 10 to the minus three or 10 to the minus two. So just from the length of the sentence,

0:50:43 - 0:51:06     Text: your probabilities are getting much lower the longer they go on, in a way that appears to be unfair, since although in some sense extremely long sentences aren't as likely as short ones, they're not less likely by that much; a lot of the time we produce long sentences. So for example, you know, in a newspaper the

0:51:06 - 0:51:32     Text: median length of sentences is over 20, so you wouldn't want to be having a decoding model when translating news articles that sort of says, hah, just generate two-word sentences, they have way higher probability according to my language model. So the common way of dealing with that is that we normalize by length, so if we're working in log probabilities, that means

0:51:32 - 0:51:52     Text: dividing through by the length of the sentence, and then you have a per-word log probability score. And you know, you can argue that this isn't quite right in some theoretical sense, but in practice it works pretty well and it's very commonly used.
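
In symbols, the length-normalized score used to compare completed hypotheses of different lengths is the average per-word log probability:

```latex
\mathrm{score}(y_1, \ldots, y_t) \;=\; \frac{1}{t} \sum_{i=1}^{t} \log P_{\mathrm{LM}}\big(y_i \mid y_1, \ldots, y_{i-1},\, x\big)
```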

0:51:52 - 0:52:21     Text: Neural translation has proven to be much, much better; I'll show you a couple of statistics about that in a moment. It has many advantages: it gives better performance; the translations are better, in particular they're more fluent, because neural language models produce much more fluent sentences; but also they make much better use of context.

0:52:21 - 0:52:44     Text: Because neural language models, including conditional neural language models, give us a very good way of conditioning on a lot of context: in particular, we can just run along the encoder and condition on the previous sentence, or we can translate words well in context by making use of neural context.

0:52:44 - 0:53:12     Text: So neural models better understand phrase similarities, phrases that mean approximately the same thing. And then the technique of optimizing all parameters of the model end to end in a single large neural network has just proved to be a really powerful idea. So previously, a lot of the time, people were building separate components and tuning them indi-

0:53:12 - 0:53:40     Text: vidually, which just meant that they weren't actually optimal when put into a much bigger system. So a really hugely powerful guiding idea in neural network land is that if you can sort of build one huge network and just optimize the entire thing end to end, that will give you much better performance than component-wise systems. We'll come back to the cost of that later in the course.

0:53:40 - 0:53:59     Text: The models are also actually great in other ways: they require much less human effort to build, there's no feature engineering, there are in general no language-specific components, and you're using the same method for all language pairs.

0:53:59 - 0:54:22     Text: Of course, it's rare for things to be perfect in every way, so neural machine translation systems also have some disadvantages compared to the older statistical machine translation systems. They're less interpretable: it's harder to see why they're doing what they're doing, whereas before you could actually look at phrase tables and they were useful.

0:54:22 - 0:54:49     Text: So they're hard to debug. They also tend to be sort of difficult to control: compared to anything like writing rules, you can't really give much of a specification, as in, you'd like to say, I'd like my translations to be more casual, or something like that. It's hard to know what they'll generate, so there are various safety concerns.

0:54:49 - 0:55:16     Text: I'll show a few examples of that in just a minute, but before doing that, quickly: how do we evaluate machine translation? The best way to evaluate machine translation is to show a human being who's fluent in the source and target languages the sentences, and get them to give a judgment on how good a translation it is.

0:55:16 - 0:55:45     Text: But that's expensive to do and might not even be possible if you don't have the right human beings around, so a lot of work was put into finding automatic methods of scoring translations that were good enough, and the most famous method of doing that is what's called BLEU. The way you do BLEU is you have a human translation, or several human translations, of the source sentence,

0:55:45 - 0:56:06     Text: and you're comparing a machine-generated translation to those pre-given human-written translations, and you score them for similarity by calculating n-gram precision, i.e. words that overlap between the computer and human-written translations, bigrams,

0:56:06 - 0:56:33     Text: trigrams, and four-grams, and then working out a geometric average of the overlaps of n-grams, plus there's a penalty for too-short system translations. So BLEU has proven to be a really useful measure, but it's an imperfect measure, in that commonly there are many valid ways to translate a sentence, and so there's some luck as to whether

0:56:33 - 0:56:44     Text: the human-written translations you have happen to correspond to what might be a good translation from the system.

0:56:44 - 0:57:01     Text: There's more to say about the details of BLEU and how it's implemented; you're going to see all of that doing assignment four, because you will be building your machine translation systems and evaluating them with the

0:57:01 - 0:57:26     Text: BLEU algorithm, and there are full details about BLEU in the assignment handout. At the end of the day, BLEU gives a score between zero and a hundred, where your score is a hundred if you're exactly producing one of the human-written translations, and zero if there's not even a single unigram that overlaps between the two.
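
As a hedged illustration of how BLEU is typically computed in practice (the sentences below are made up, and the sacrebleu library is just one common implementation; for assignment four, follow the definition in the assignment handout):

```python
# pip install sacrebleu
import sacrebleu

hypotheses = ["he hit me with a pie"]            # system outputs (made-up example)
references = [["he hit me with a pie"],          # reference set 1, one line per hypothesis
              ["he threw a pie at me"]]          # reference set 2 (a second human translation)

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # a number between 0 and 100
```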

0:57:26 - 0:57:35     Text: With that rather brief intro I wanted to show you sort of what happened in machine translation.

0:57:35 - 0:57:55     Text: So machine translation with statistical models, the phrase-based statistical machine translation that I showed at the beginning of the class, had been going on since the mid-2000s decade, and it had produced sort of semi-good results of the kind that were in Google Translate in those days.

0:57:55 - 0:58:19     Text: But by the time you entered the 2010s, progress in statistical machine translation had basically stalled, and you were getting barely any increase over time, and most of the increase you were getting over time was simply because you were training your models on more data.

0:58:19 - 0:58:37     Text: In those years around the early 2010s (someone asked what the y-axis is here: the y-axis is the BLEU score that I told you about on the previous slide),

0:58:37 - 0:58:58     Text: the big hope that most people in the machine translation field had was: well, if we build a more complex kind of machine translation model that knows about the syntactic structure of languages, that makes use of tools like dependency parsers, we'll be able to build much better translations.

0:58:58 - 0:59:12     Text: And so those are the purple systems here, which I haven't described at all; but as the years went by, it was pretty obvious that that barely seemed to help.

0:59:12 - 0:59:36     Text: And so then in the mid-2010s, in 2014, came the first modern attempt to build a neural network for machine translation, an encoder-decoder model, and by the time it was evaluated in bakeoffs in 2015, it wasn't as good as what had been built up over the preceding decade.

0:59:36 - 1:00:03     Text: But it was already getting pretty good, and what was found was that these neural models just really opened up a whole new pathway to start building much, much better machine translation systems, and since then things have just sort of taken off, and year by year neural machine translation systems have gotten much better, far better than anything we had preceding that.

1:00:03 - 1:00:18     Text: For, you know, at least the early part of the application of deep learning to natural language processing, neural machine translation was the huge big success story.

1:00:18 - 1:00:47     Text: In the last few years, when we've had models like GPT-2 and GPT-3 and other huge neural models like BERT improving web search, you know, it's a bit more complex, but this was the first area where there was a neural network which was hugely better than what had preceded it and was actually solving a practical problem that lots of people in the world need solved.

1:00:47 - 1:00:55     Text: And it was stunning the speed with which success was achieved.

1:00:55 - 1:01:22     Text: So in 2014 there were the first, what I call here, fringe research attempts to build a neural machine translation system, meaning that three or four people who were working on neural network models thought, oh, why don't we see if we can use one of these to learn to translate sentences, where they weren't really people with a background in machine translation at all.

1:01:22 - 1:01:51     Text: Success was achieved so quickly that within two years' time, Google had switched to using neural machine translation for most languages, and a couple of years later after that, essentially anybody who does machine translation was deploying live neural machine translation systems and getting much, much better results.

1:01:51 - 1:02:09     Text: So that was sort of just an amazing technological transition, in that for the preceding decade, the big statistical machine translation systems, like the previous generation of Google Translate, had literally been built up by hundreds of engineers over years.

1:02:09 - 1:02:30     Text: And a comparatively small group of deep learning people, in a few months, with a small amount of code (and hopefully you'll even get a sense of this doing assignment four), were able to build neural machine translation systems that proved to work much better.

1:02:30 - 1:02:49     Text: Does that mean that machine translation is solved? No, there are still lots of difficulties which people continue to work on very actively, and you can see more about it in the Skynet Today article linked at the bottom. But, you know, there are lots of problems: out-of-vocabulary words,

1:02:49 - 1:03:01     Text: domain mismatches between the training and test data, so it might be trained mainly on newswire data but you want to translate people's Facebook messages.

1:03:01 - 1:03:18     Text: There are still problems of maintaining context over longer texts. We'd like to translate languages for which we don't have much data, and these methods work by far the best when we have huge amounts of parallel data.

1:03:18 - 1:03:46     Text: Even the best multi-layer LSTMs aren't that great at capturing sentence meaning. There are particular problems such as interpreting what pronouns refer to, or, in languages like Chinese or Japanese where there's often no pronoun present but there is an implied reference to some person, working out how to translate that. And for languages that have lots of inflectional forms

1:03:46 - 1:04:10     Text: on nouns, verbs, and adjectives, these systems often get them wrong. So there's still tons of stuff to do. So here are just some quick funny examples of the kinds of things that go wrong. So if you ask it to translate paper jam, Google Translate decides that this is a kind of jam, just like there's

1:04:10 - 1:04:18     Text: raspberry jam and strawberry jam and so this becomes a jam of paper.

1:04:18 - 1:04:39     Text: There are problems of agreement and choice. So many languages don't distinguish gender in pronouns, and so the sentences are neutral between whether things are masculine or feminine; Malay and Turkish are two well-known languages of that sort.

1:04:39 - 1:05:00     Text: But what happens when that gets translated into English by Google Translate is that the English language model just kicks in and applies stereotypical biases, and so these gender-neutral sentences get translated into she works as a nurse, he works as a programmer.

1:05:00 - 1:05:20     Text: So if you want to help solve this problem, you all can help by using singular they in all contexts when you're putting material online, and that could then change the distribution of what's generated; but people also work on modeling improvements to try and avoid this.

1:05:20 - 1:05:40     Text: Here's one more example that's kind of funny. People noticed a couple of years ago that if you choose one of the rarer languages on Google Translate, such as Somali,

1:05:40 - 1:06:02     Text: and you just write in some rubbish like ag ag ag ag, freakily it produced out of nowhere prophetic and biblical text: as the name of the Lord was written in the Hebrew language, it was written in the language of the Hebrew nation, which makes no sense at all.

1:06:02 - 1:06:22     Text: We're about to see a bit more about why this happens, but that was sort of a bit worrying. As far as I can see, this problem is now fixed: in 2021, I couldn't actually get Google Translate to generate examples like this anymore.

1:06:22 - 1:06:33     Text: So there are lots of ways to keep on doing research, and MT certainly is, you know, a flagship task for NLP and deep learning.

1:06:33 - 1:07:01     Text: And it was a place where many of the innovations of deep learning for NLP were pioneered, and people continue to work hard on it; people have found many, many improvements. And actually, for the last bit of the class in a minute, I'm going to present one huge improvement, which is so important that it's really come to dominate the whole of the recent field of neural

1:07:01 - 1:07:28     Text: networks for NLP, and that's the idea of attention. But before I get on to attention, I want to spend three minutes on assignment four. So for assignment four this year, we've got a new version of the assignment, which we hope will be interesting, but it's also a real challenge.

1:07:28 - 1:07:55     Text: We decided to do Cherokee-English machine translation. Cherokee is an endangered Native American language with about 2,000 fluent speakers. It's an extremely low-resource language: there just isn't much written Cherokee data available, period, and in particular there aren't a lot of parallel sentences between Cherokee and English.

1:07:55 - 1:08:15     Text: Here we also get the answer to Google's freaky prophetic translations: for languages for which there isn't much parallel data available, commonly the biggest source of parallel data is Bible translations.

1:08:15 - 1:08:44     Text: You can hold whatever personal position you like with respect to religion, but the fact of the matter is that if you work on indigenous languages, what you very quickly find is that a lot of the work that's been done on collecting data for indigenous languages, and a lot of the material that's available in written form for many indigenous languages, is Bible translation.

1:08:44 - 1:09:12     Text: Okay, so this is what Cherokee looks like. You can see that the writing system has a mixture of things that look like English letters and all sorts of letters that don't. Here's the initial bit of a story: 'Long ago there were seven boys who used to spend all their time down by the townhouse.' So this is a piece of parallel data

1:09:12 - 1:09:40     Text: that we can learn from. The Cherokee writing system has 85 letters, and the reason it has so many is that each of these letters actually represents a syllable. Many languages of the world have a strict consonant-vowel syllable structure, so you have words made of syllables like 'ra-ta' or something like that, and Cherokee is a language like that.

1:09:40 - 1:10:00     Text: Another language like that is Hawaiian. Each of the letters represents a combination of a consonant and a vowel, and when you take the set of those combinations, 17 times 5 gives you 85 letters.

1:10:00 - 1:10:15     Text: For being able to do this assignment, big thanks to the people from the University of North Carolina, Chapel Hill, who provided the resources we're using.

1:10:15 - 1:10:38     Text: Although you can do quite a lot of languages on Google Translate, Cherokee is not a language that Google offers, so we'll see how far we can get. But we have to be modest in our expectations, because it's hard to build a very good MT system with only a fairly limited amount of data.

1:10:38 - 1:10:59     Text: There is a flip side, which is that for you students doing the assignment, the advantage of not having too much data is that your models will train relatively quickly, so we'll actually have less trouble than we did last year with people's models taking hours to train as the assignment deadline closed in.

1:10:59 - 1:11:16     Text: A couple more words about Cherokee, so we have some idea what we're talking about. The Cherokee originally lived in western North Carolina and eastern Tennessee. They then got shunted

1:11:16 - 1:11:39     Text: southwest from there, and in particular, for those of you who went to American high schools and paid attention, you might remember discussion of the Trail of Tears, when a lot of the Native Americans from the southeast of the US were forcibly moved a long way to the west.

1:11:39 - 1:11:47     Text: And so most Cherokee now live in Oklahoma though there are some that are in North Carolina.

1:11:47 - 1:12:07     Text: The writing system that I showed on the previous slide was invented by a Cherokee man, Sequoyah; that's a drawing of him there. And that was actually a kind of incredible thing: he started off illiterate and

1:12:07 - 1:12:31     Text: worked out how to produce a writing system that would be good for Cherokee, and given that it has this consonant-vowel structure, he chose a syllabary, which turned out to be a good choice. Here's a neat historical fact: in the 1830s and 1840s,

1:12:31 - 1:12:47     Text: the percentage of Cherokee who were literate in Cherokee, written like this, was actually higher than the percentage of white people in the southeastern United States who were literate at that point in time.

1:12:47 - 1:13:00     Text: Okay, before time disappears (oops, time has almost disappeared), I'll just start on this, and then I'll have to do a bit more of it next time.

1:13:00 - 1:13:25     Text: Right, so the final idea that's really important for sequence-to-sequence models is the idea of attention. We had this model for doing sequence-to-sequence tasks such as neural machine translation, and the problem with this architecture is that

1:13:25 - 1:13:41     Text: we have this one hidden state which has to encode all the information about the source sentence, so it acts as a kind of information bottleneck, and that's all the information that the generation is conditioned on.

1:13:41 - 1:14:07     Text: Well, I did already mention one idea last time for how to get more information, where I said, look, maybe you could average all of the vectors of the source to get a sentence representation. But that method turns out to be better for things like sentiment analysis and not so good for machine translation, where the order of words is very important to preserve.

1:14:07 - 1:14:30     Text: So it seems like we would do better if somehow we could get more information from the source sentence while we're generating the translation, and in some sense this just corresponds to what a human translator does. If you're a human translator, you read the sentence that you're meant to translate,

1:14:30 - 1:14:55     Text: maybe start translating a few words, but then you look back at the source sentence to see what else was in it and translate some more words. So very quickly after the first neural machine translation systems, people came up with the idea that maybe we could build a better neural MT model that did that, and that's the idea of attention.

1:14:55 - 1:15:19     Text: The core idea is that on each step of the decoder, we're going to use a direct link between the encoder and the decoder that will allow us to focus on a particular word or words in the source sequence and use that to help us generate what word comes next.

1:15:19 - 1:15:32     Text: I'll just go through the pictures now, showing you what attention does, and then at the start of next time we'll go through the equations in more detail.

1:15:32 - 1:15:58     Text: So we use our encoder just as before and generate our representations, feed in our conditioning as before, and say we're starting our translation. But at this point we take this hidden representation and say: I'm going to use this hidden representation to look back at the source to get information directly from it.

1:15:58 - 1:16:26     Text: So what I will do is compare the hidden state of the decoder with the hidden state of the encoder at each position and generate an attention score, which is a kind of similarity score like a dot product. Then, based on those attention scores, I'm going to calculate a probability distribution,

1:16:26 - 1:16:51     Text: using a softmax as usual, as to which of these encoder states is most like my decoder state. And so we'll be training the model here to say: well, probably you should translate the first word of the sentence first, so that's where the attention should be placed.
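
To make that step concrete, here is a minimal PyTorch sketch of dot-product attention scores and the softmax that turns them into an attention distribution. The tensor names and toy sizes are illustrative assumptions, not the assignment's actual code.

    import torch

    src_len, h = 5, 8                          # toy sizes: source length and hidden size
    enc_hiddens = torch.randn(src_len, h)      # one encoder hidden state per source position
    dec_hidden = torch.randn(h)                # decoder hidden state at the current step

    # Attention score = dot product between the decoder state and each encoder state.
    scores = enc_hiddens @ dec_hidden          # shape (src_len,)

    # Softmax turns the scores into a probability distribution over source positions.
    attn_dist = torch.softmax(scores, dim=0)   # entries sum to 1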

1:16:51 - 1:17:19     Text: From this attention distribution, which is the probability distribution coming out of the softmax, we're going to generate a new attention output, and this attention output is going to be an average of the hidden states of the encoder model: a weighted average based on our attention distribution.

1:17:19 - 1:17:43     Text: We're then going to take that attention output, combine it with the hidden state of the decoder RNN, and together the two of them are going to be used to predict, via a softmax, what word to generate first, and we hope to generate 'he'.
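
Continuing that sketch, this is roughly how the attention output and the next-word prediction could look. The output projection layer and the vocabulary size here are illustrative assumptions rather than the assignment's exact architecture.

    import torch
    import torch.nn as nn

    src_len, h, vocab_size = 5, 8, 50
    enc_hiddens = torch.randn(src_len, h)      # encoder hidden states
    dec_hidden = torch.randn(h)                # current decoder hidden state

    attn_dist = torch.softmax(enc_hiddens @ dec_hidden, dim=0)

    # Attention output = weighted average of encoder states under the attention distribution.
    attn_output = attn_dist @ enc_hiddens      # shape (h,)

    # Combine the attention output with the decoder state and predict the next word.
    combined = torch.cat([attn_output, dec_hidden])           # shape (2h,)
    to_vocab = nn.Linear(2 * h, vocab_size)                   # hypothetical output projection
    word_probs = torch.softmax(to_vocab(combined), dim=0)     # distribution over target words
    next_word_id = word_probs.argmax()                        # e.g. the id for 'he', we hope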

1:17:43 - 1:17:53     Text: And then at that point we chug along and keep doing the same kind of computations at each position.

1:17:53 - 1:18:12     Text: There's a little side note here: sometimes we take the attention output from the previous step and also feed it into the decoder along with the usual decoder inputs, so we're taking this attention output and actually feeding it back into the hidden state calculation.

1:18:12 - 1:18:21     Text: That can sometimes improve performance, and we actually have that trick in the assignment 4 system, so you can try it out.
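
As a rough sketch of that input-feeding trick (the layer sizes and names are made up for the example), the previous step's attention output is concatenated with the current word embedding before going into the decoder cell:

    import torch
    import torch.nn as nn

    h, emb_size = 8, 6
    # The decoder cell's input is the word embedding plus the previous attention output.
    decoder_cell = nn.LSTMCell(emb_size + h, h)

    word_emb = torch.randn(1, emb_size)        # embedding of the previously generated word
    prev_attn_output = torch.randn(1, h)       # attention output from the previous step
    state = (torch.zeros(1, h), torch.zeros(1, h))

    decoder_input = torch.cat([word_emb, prev_attn_output], dim=1)
    hidden, cell = decoder_cell(decoder_input, state)   # next decoder hidden and cell state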

1:18:21 - 1:18:41     Text: Okay, so we generate along and produce our whole sentence in this manner, and that's proven to be a very effective way of getting more information from the source sentence, with more flexibility to allow us to generate a good translation.

1:18:41 - 1:18:51     Text: So we'll stop here for now and at the start of next time I'll finish this off by going through the actual equations for how attention works.