Stanford CS224N: NLP with Deep Learning | Winter 2020 | Low Resource Machine Translation

0:00:00 - 0:00:10     Text: For today, I'm really delighted to introduce our third guest speaker,

0:00:10 - 0:00:12     Text: who's Marc'Aurelio Ranzato.

0:00:12 - 0:00:18     Text: So he's originally from Italy and then worked at NYU with Yann LeCun

0:00:19 - 0:00:21     Text: and then did a post-doc with Geoffrey Hinton.

0:00:21 - 0:00:24     Text: So he's a very dyed-in-the-wool deep learning researcher.

0:00:25 - 0:00:29     Text: A lot of his early work was in areas like feature learning

0:00:29 - 0:00:34     Text: and vision, but over the last few years, he's really turned his interest

0:00:34 - 0:00:36     Text: to natural language processing.

0:00:37 - 0:00:41     Text: And in particular, in the last few years, he's worked a huge amount

0:00:41 - 0:00:44     Text: in looking at machine translation in general

0:00:44 - 0:00:48     Text: and particular machine translation for languages

0:00:48 - 0:00:50     Text: for which less resources are available.

0:00:50 - 0:00:54     Text: So I saw a talk of his about six months ago on this topic

0:00:55 - 0:00:58     Text: and he and his team at Facebook

0:00:58 - 0:01:01     Text: have really got a lot of exciting new work on ways

0:01:01 - 0:01:05     Text: to bring neural machine translation up to the next level.

0:01:05 - 0:01:08     Text: And so I hope that this would be a really great opportunity

0:01:08 - 0:01:12     Text: for everyone to see some of the latest and most exciting techniques

0:01:12 - 0:01:14     Text: in neural machine translation.

0:01:14 - 0:01:18     Text: It's sort of the next level beyond what we talked about

0:01:18 - 0:01:21     Text: and what you guys all did in assignments four and five for the class.

0:01:21 - 0:01:24     Text: So take it away, Marc'Aurelio.

0:01:24 - 0:01:26     Text: OK, thank you so much, Chris, for inviting me.

0:01:26 - 0:01:29     Text: Let me just put my face.

0:01:29 - 0:01:30     Text: I'm here.

0:01:30 - 0:01:31     Text: Hi, everybody.

0:01:31 - 0:01:34     Text: I'm going to disable it now so you can focus on the presentation.

0:01:36 - 0:01:38     Text: So share.

0:01:40 - 0:01:43     Text: I hope you should be able to see my presentation now.

0:01:43 - 0:01:47     Text: OK, so I'm very excited to tell you a little bit

0:01:47 - 0:01:49     Text: about low-resource machine translation.

0:01:50 - 0:01:56     Text: And let's start by revisiting the machine translation problem.

0:01:56 - 0:01:58     Text: So let's say that we want to translate

0:01:58 - 0:02:01     Text: between English and French.

0:02:01 - 0:02:06     Text: And we start with a big training set

0:02:06 - 0:02:09     Text: where we have a collection of sentences in English

0:02:09 - 0:02:12     Text: with their corresponding translation in French.

0:02:12 - 0:02:15     Text: And this is what we call a parallel data set.

0:02:15 - 0:02:18     Text: And in particular, the sentences in English,

0:02:18 - 0:02:21     Text: we call them source sentences, right?

0:02:21 - 0:02:23     Text: And the corresponding sentences in French

0:02:23 - 0:02:26     Text: are what we call the target sentences.

0:02:26 - 0:02:33     Text: And now the learning problem is about for a given sentence

0:02:33 - 0:02:37     Text: in English, you want to predict the corresponding sentence

0:02:37 - 0:02:38     Text: in French.

0:02:38 - 0:02:42     Text: And the way that we do that is by minimizing the cross-entropy

0:02:42 - 0:02:46     Text: loss, which maximizes the probability

0:02:46 - 0:02:50     Text: of the reference human translation given the input source

0:02:50 - 0:02:51     Text: sentence.

0:02:51 - 0:02:55     Text: And we do this by stochastic gradient descent

0:02:55 - 0:02:58     Text: using as an architecture a sequence-to-sequence

0:02:58 - 0:03:02     Text: model with attention, which, as far as I know, you

0:03:02 - 0:03:07     Text: studied and worked on at length a few weeks ago.

0:03:07 - 0:03:11     Text: And then after you train this, at test time,

0:03:11 - 0:03:14     Text: you are given a novel English sentence

0:03:14 - 0:03:17     Text: and you want to produce the corresponding translation.

0:03:17 - 0:03:21     Text: In order to do that, we usually employ

0:03:21 - 0:03:27     Text: a heuristic search method like beam search that tries to find

0:03:27 - 0:03:31     Text: approximately the target sentence that maximizes the probability

0:03:31 - 0:03:37     Text: given the source sentence.
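
A minimal sketch (not from the lecture) of the beam search idea just described, assuming a hypothetical per-step scorer log_next(prefix) that returns next-token log-probabilities, for example one decoder step of a trained NMT model:

def beam_search(log_next, start, eos, beam_size=4, max_len=20):
    # log_next(prefix) -> dict {token: log-prob of that token as the next word};
    # a hypothetical stand-in for one decoder step of the trained model.
    beams = [([start], 0.0)]
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:                     # finished hypotheses are carried over
                candidates.append((seq, score))
                continue
            for tok, lp in log_next(seq).items():  # expand each hypothesis
                candidates.append((seq + [tok], score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return max(beams, key=lambda c: c[1])[0]       # approximate argmax_y log P(y | x)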

0:03:37 - 0:03:41     Text: So this is at the high level how machine translation works.

0:03:41 - 0:03:43     Text: And let's think about the assumptions

0:03:43 - 0:03:46     Text: that we have been making through this discussion.

0:03:46 - 0:03:50     Text: So the first assumption is that we are working with two

0:03:50 - 0:03:53     Text: fairly related languages like English and French.

0:03:53 - 0:03:56     Text: And the second assumption is that we have at our

0:03:56 - 0:04:00     Text: disposal a large data set of parallel sentences.

0:04:00 - 0:04:04     Text: Because here we are essentially doing supervised learning.

0:04:04 - 0:04:08     Text: And it is a beautiful example of end-to-end supervised learning

0:04:08 - 0:04:13     Text: that relies on the availability of a large parallel data set.

0:04:13 - 0:04:20     Text: And so in the world, there are more than 6,000 languages.

0:04:20 - 0:04:25     Text: And needless to say, most of these languages

0:04:25 - 0:04:29     Text: don't belong to the European family

0:04:29 - 0:04:32     Text: on which most of the recent research on machine translation

0:04:32 - 0:04:34     Text: has been focusing.

0:04:34 - 0:04:36     Text: And even if you look at English, English

0:04:36 - 0:04:41     Text: is spoken as a native language

0:04:41 - 0:04:45     Text: by less than 5% of the world population.

0:04:45 - 0:04:51     Text: And so if you were to count how many people speak

0:04:51 - 0:04:53     Text: a certain language, and you look at that histogram,

0:04:53 - 0:04:56     Text: it's a very heavy-tailed distribution.

0:04:56 - 0:05:02     Text: So even if you take the top 10 spoken languages,

0:05:02 - 0:05:04     Text: you find that these account for less than 50%

0:05:04 - 0:05:05     Text: of the people in the world.

0:05:05 - 0:05:13     Text: And now, if you look at the very far right of the tail,

0:05:13 - 0:05:16     Text: those are languages for which there are very few speakers.

0:05:16 - 0:05:18     Text: And essentially, there is no digitized data,

0:05:18 - 0:05:20     Text: no material for you to train anything on.

0:05:20 - 0:05:25     Text: So for those, I think it's almost hopeless, I would say.

0:05:25 - 0:05:30     Text: But in the middle of this tail, we have a lot of languages

0:05:30 - 0:05:33     Text: for which there is some digital data.

0:05:33 - 0:05:38     Text: And for which we don't have good ways to translate nowadays.

0:05:38 - 0:05:41     Text: If you think about major providers, like Google,

0:05:41 - 0:05:44     Text: a Yandex, Baidu, Facebook, and so on and so forth,

0:05:44 - 0:05:48     Text: they provide translation for the top 100 languages.

0:05:48 - 0:05:51     Text: So we are still very much at the far right

0:05:51 - 0:05:54     Text: of this long-tailed distribution.

0:05:54 - 0:06:01     Text: And so if we are able to improve machine translation

0:06:01 - 0:06:03     Text: in the middle, I think

0:06:03 - 0:06:06     Text: it would be very impactful, right?

0:06:06 - 0:06:09     Text: But so what happens as we walk down this tail?

0:06:09 - 0:06:13     Text: What happens is that the amount of data,

0:06:13 - 0:06:18     Text: of parallel data, decreases and that correlates very much

0:06:19 - 0:06:22     Text: with the quality of the automatic machine translation

0:06:22 - 0:06:24     Text: systems that we have.

0:06:24 - 0:06:26     Text: And particularly, as you can see here,

0:06:27 - 0:06:29     Text: at some point there is actually a drastic drop

0:06:29 - 0:06:33     Text: in accuracy of your machine translation system.

0:06:33 - 0:06:44     Text: So perhaps the initial picture that we had in mind is a little different.

0:06:44 - 0:06:48     Text: So now if we take a fairly low resource language,

0:06:48 - 0:06:51     Text: like Nepali, which is the language spoken in Nepal,

0:06:51 - 0:06:54     Text: a lovely country, just north of India,

0:06:54 - 0:06:57     Text: with more than 25 million people.

0:06:57 - 0:07:00     Text: So it's not just a handful of people.

0:07:00 - 0:07:05     Text: So first of all, the amount of training data is not as much

0:07:05 - 0:07:09     Text: as for English-French; it's much, much less than that.

0:07:09 - 0:07:12     Text: And here, let's use a different visual representation.

0:07:12 - 0:07:16     Text: So let's use filled rectangles with a color

0:07:16 - 0:07:18     Text: that corresponds to the language.

0:07:18 - 0:07:21     Text: So the blue rectangle is English data

0:07:21 - 0:07:25     Text: and the red rectangle is Nepali data.

0:07:25 - 0:07:35     Text: Now, in practice, the parallel data set is not just such a monolithic thing.

0:07:35 - 0:07:38     Text: Because some part originates in English

0:07:38 - 0:07:42     Text: and some parts originates in Nepali.

0:07:42 - 0:07:50     Text: And now let's represent the Nepali translations of English data

0:07:50 - 0:07:53     Text: with an empty rectangle,

0:07:53 - 0:07:55     Text: where the color corresponds to the language.

0:07:55 - 0:07:57     Text: And whether it is filled or not depends on

0:07:57 - 0:07:59     Text: whether this is a translation,

0:07:59 - 0:08:01     Text: so whether this is a human translation,

0:08:01 - 0:08:06     Text: or whether it is data originating in that language.

0:08:06 - 0:08:11     Text: So in this case, we take data that originates in English

0:08:11 - 0:08:12     Text: and we translate it into Nepali.

0:08:12 - 0:08:18     Text: And so this is the empty rectangle and the same

0:08:18 - 0:08:21     Text: for when you go from Nepali to English.

0:08:21 - 0:08:25     Text: Now, in general, the data that originates in English

0:08:25 - 0:08:30     Text: and the data that originates in Nepali may come from different domains.

0:08:30 - 0:08:33     Text: So here on the Y axis, you have the domain.

0:08:33 - 0:08:37     Text: And so this is an example that I made up,

0:08:37 - 0:08:41     Text: but it's pretty indicative of what happens in practice.

0:08:41 - 0:08:46     Text: You may have that English sentences may come from, let's say, Bible.

0:08:46 - 0:08:50     Text: And so the Nepali here are translations from the Bible.

0:08:50 - 0:08:55     Text: And the Nepali sentences may come from parliamentary data.

0:08:58 - 0:09:02     Text: So you may agree with me that translating

0:09:02 - 0:09:07     Text: novel sentence from the Bible is not a super interesting task

0:09:07 - 0:09:10     Text: because the Bible is a pretty static data set, right?

0:09:10 - 0:09:14     Text: And so maybe we want to translate news data.

0:09:14 - 0:09:21     Text: And but so in practice, we don't have any parallel data

0:09:21 - 0:09:22     Text: in the news domain.

0:09:22 - 0:09:26     Text: Perhaps what we have, so what we really want to do at the end,

0:09:26 - 0:09:30     Text: is translate sentences from this test set

0:09:30 - 0:09:33     Text: that is English news into Nepali.

0:09:33 - 0:09:35     Text: But all we have in the news domain

0:09:35 - 0:09:40     Text: is at most monolingual data both in English and in Nepali.

0:09:40 - 0:09:45     Text: So these are English sentences that are not aligned at all

0:09:45 - 0:09:48     Text: with the Nepali sentences over here.

0:09:48 - 0:09:53     Text: They just happen to be data that you got from news sources.

0:09:53 - 0:09:56     Text: And so this is a pretty complicated learning setting

0:09:56 - 0:10:02     Text: because you have a little bit of parallel sentences

0:10:02 - 0:10:06     Text: and that are in a different domain from the test set.

0:10:06 - 0:10:10     Text: And all you have in the domain of interest

0:10:10 - 0:10:13     Text: is monolingual data.

0:10:13 - 0:10:20     Text: And in fact, you may have also some other parallel data

0:10:20 - 0:10:23     Text: but in another language, let's say Hindi

0:10:23 - 0:10:26     Text: that is in the same family as Nepali.

0:10:26 - 0:10:29     Text: But maybe this parallel data is in a different domain,

0:10:29 - 0:10:30     Text: let's say books.

0:10:30 - 0:10:33     Text: And perhaps you have also monolingual data in Hindi

0:10:33 - 0:10:36     Text: that is also in the book domain.

0:10:36 - 0:10:42     Text: So in fact, what you really, in practice,

0:10:42 - 0:10:46     Text: what you find is that you may have a lot of languages here

0:10:46 - 0:10:51     Text: from which you could learn and a lot of domains.

0:10:51 - 0:10:54     Text: And all you want to do at the end

0:10:54 - 0:10:58     Text: is to be able to translate news data in English into Nepali.

0:10:58 - 0:11:01     Text: But you don't have any supervision for that.

0:11:01 - 0:11:03     Text: You don't have any label data and a parallel data for that.

0:11:03 - 0:11:07     Text: What you have is a bunch of data in different domains

0:11:07 - 0:11:08     Text: and in different languages.

0:11:08 - 0:11:13     Text: And so the question is, how can you leverage all these data

0:11:13 - 0:11:17     Text: in order to perform your original translation task?

0:11:17 - 0:11:21     Text: And so this is a Mondrian-like learning setting,

0:11:21 - 0:11:24     Text: which is pretty tricky.

0:11:24 - 0:11:29     Text: And this is going to be the topic of this lecture.

0:11:29 - 0:11:35     Text: And so there is not a very precise definition

0:11:35 - 0:11:39     Text: of what low-resource machine translation is.

0:11:39 - 0:11:42     Text: Loosely speaking, a language pair can be considered

0:11:42 - 0:11:46     Text: low resource when the number of in-domain parallel sentences

0:11:46 - 0:11:50     Text: is less than 10,000,

0:11:50 - 0:11:55     Text: as an order of magnitude.

0:11:55 - 0:11:57     Text: And this is very little.

0:11:57 - 0:12:00     Text: Particularly, if you think the modern neural machine

0:12:00 - 0:12:04     Text: translation systems have easily hundreds

0:12:04 - 0:12:06     Text: of millions of parameters.

0:12:06 - 0:12:08     Text: And so there are several challenges.

0:12:08 - 0:12:10     Text: There are challenges that pertain to the data

0:12:10 - 0:12:13     Text: and challenges that pertain to the model design.

0:12:13 - 0:12:19     Text: So in terms of the data, it is very hard to get data to train.

0:12:19 - 0:12:23     Text: It is very hard to figure out where to get the data to train.

0:12:23 - 0:12:27     Text: Data that is in a domain similar to the domain

0:12:27 - 0:12:31     Text: that you are interested in eventually translating.

0:12:31 - 0:12:35     Text: If that doesn't exist, how to get data in similar languages,

0:12:35 - 0:12:40     Text: on other domains, and even how to get data to evaluate

0:12:40 - 0:12:41     Text: your system on.

0:12:41 - 0:12:46     Text: And on the modeling side, there is the question of course,

0:12:46 - 0:12:49     Text: how to learn with so little supervision,

0:12:49 - 0:12:52     Text: so little direct supervision at the very least,

0:12:52 - 0:12:57     Text: and how to operate in this framework for which

0:12:57 - 0:13:02     Text: you have so many languages and so many domains.

0:13:02 - 0:13:09     Text: So as Chris mentioned at the very beginning,

0:13:09 - 0:13:11     Text: my background is not really NLP.

0:13:11 - 0:13:18     Text: I have always been interested in learning with less supervision.

0:13:18 - 0:13:22     Text: And I think working on low-resource machine translation

0:13:22 - 0:13:26     Text: is, at least personally, a very unique opportunity.

0:13:26 - 0:13:29     Text: It's a very rare case in which my research agenda

0:13:29 - 0:13:31     Text: is aligned with an application.

0:13:31 - 0:13:35     Text: Because in low-resource machine translation,

0:13:35 - 0:13:37     Text: you don't have much labeled data.

0:13:37 - 0:13:41     Text: And you need to make the best use of auxiliary tasks

0:13:41 - 0:13:45     Text: and auxiliary data in order to perform well.

0:13:45 - 0:13:47     Text: And this is a general problem.

0:13:47 - 0:13:50     Text: And at the same time, machine translation

0:13:50 - 0:13:53     Text: is a real application, is something that if we improve,

0:13:53 - 0:13:58     Text: we can really have a chance to improve

0:13:58 - 0:14:04     Text: a lot of applications and the life of a lot of people.

0:14:04 - 0:14:07     Text: So this concludes my introduction

0:14:07 - 0:14:13     Text: about low-resource machine translation and the issues

0:14:13 - 0:14:16     Text: that we face when working on these languages.

0:14:16 - 0:14:22     Text: Before going on, let me just pause for a second,

0:14:22 - 0:14:26     Text: saying that the outline of this talk is around three pillars

0:14:26 - 0:14:31     Text: that in a way define the cycle of research.

0:14:31 - 0:14:34     Text: So the first pillar is data.

0:14:34 - 0:14:40     Text: So I'm going to review how we can get data

0:14:40 - 0:14:42     Text: in particular for evaluation.

0:14:42 - 0:14:46     Text: So data is the prerequisite to do anything in our life

0:14:46 - 0:14:49     Text: as machine learning practitioners.

0:14:49 - 0:14:53     Text: And then afterwards, I'm going to move to the modeling.

0:14:53 - 0:15:01     Text: So describing some algorithms to learn on low-resource languages.

0:15:01 - 0:15:03     Text: And finally, we conclude with some work

0:15:03 - 0:15:08     Text: on analyzing what the model does when we train on low-resource

0:15:08 - 0:15:09     Text: languages.

0:15:09 - 0:15:14     Text: And in practice, like throughout my work here,

0:15:14 - 0:15:19     Text: I keep going around the circle because as I figure out

0:15:19 - 0:15:21     Text: the issues that we have with the data and with the model,

0:15:21 - 0:15:25     Text: then I may come up with a data set that better fits the kind

0:15:25 - 0:15:27     Text: of problems that I'm interested in.

0:15:27 - 0:15:30     Text: And then I may go back to the modeling side

0:15:30 - 0:15:34     Text: to improve the models and so and so forth.

0:15:34 - 0:15:36     Text: And here I'm giving some references

0:15:36 - 0:15:40     Text: of the works that I'm presenting, not all of them.

0:15:40 - 0:15:44     Text: And just to be clear, this is not meant

0:15:44 - 0:15:46     Text: to be a chronological survey.

0:15:46 - 0:15:50     Text: So these are not necessarily the works that

0:15:50 - 0:15:53     Text: introduce a certain idea.

0:15:53 - 0:15:56     Text: But it's just, I would say, the most accessible entry

0:15:56 - 0:15:58     Text: points on the topic.

0:15:58 - 0:16:01     Text: And then you can go into the related-work sections

0:16:01 - 0:16:07     Text: to figure out if there was some seminal paper that led

0:16:07 - 0:16:09     Text: to that line of research.

0:16:09 - 0:16:11     Text: And of course, there is quite a bit of presenter bias

0:16:11 - 0:16:16     Text: because most of these works have been co-authored by me.

0:16:16 - 0:16:18     Text: So be mindful of that.

0:16:18 - 0:16:22     Text: That said, do you have any questions so far?

0:16:22 - 0:16:24     Text: I have a quick question about, I saw

0:16:24 - 0:16:26     Text: in the first week the paper about phrase-based

0:16:26 - 0:16:29     Text: and neural unsupervised MT. I was wondering if you could

0:16:29 - 0:16:33     Text: talk about the different approaches to unsupervised learning

0:16:33 - 0:16:36     Text: and also whether algorithms like GloVe and word2vec

0:16:36 - 0:16:39     Text: are possible in low resource languages.

0:16:39 - 0:16:40     Text: Yeah, yeah.

0:16:40 - 0:16:46     Text: So in this lecture, I'm going to focus on neural machine translation.

0:16:46 - 0:16:51     Text: Actually, I'm not even going in the details of the architecture.

0:16:51 - 0:16:54     Text: I'm more talking about algorithms actually.

0:16:54 - 0:16:56     Text: And so these algorithms are applicable both

0:16:56 - 0:16:59     Text: to a neural machine translation system, as well as

0:16:59 - 0:17:03     Text: to statistical machine translation systems.

0:17:03 - 0:17:09     Text: When I go over this part, I can address a little bit

0:17:09 - 0:17:11     Text: your question and tell you a little bit

0:17:11 - 0:17:14     Text: about the differences between these two.

0:17:17 - 0:17:21     Text: And then in terms of methods to learn

0:17:21 - 0:17:25     Text: word embeddings and sentence embeddings,

0:17:25 - 0:17:28     Text: I'm going to touch very briefly on that.

0:17:28 - 0:17:30     Text: So at the end of the lecture, I'm going

0:17:30 - 0:17:34     Text: to refer to some recent work on filtering

0:17:34 - 0:17:39     Text: where people use sentence embedding methods.

0:17:39 - 0:17:45     Text: It's not GloVe, but it's something similar in a way.

0:17:45 - 0:17:51     Text: In practice, for word embeddings,

0:17:51 - 0:17:56     Text: it's kind of, I would say, a prerequisite for machine

0:17:56 - 0:17:59     Text: translation, because if you can align

0:17:59 - 0:18:01     Text: word embeddings, you learn a dictionary,

0:18:01 - 0:18:06     Text: and that's a primitive way to do machine translation.

0:18:06 - 0:18:08     Text: So oftentimes, we look at those things

0:18:08 - 0:18:15     Text: as a good sanity check or as a simplified machine translation

0:18:15 - 0:18:16     Text: task.

0:18:16 - 0:18:18     Text: Whenever you have a reference dictionary

0:18:18 - 0:18:22     Text: for which you can then check the accuracy of your alignment.

0:18:27 - 0:18:31     Text: So if you, let me get back to you

0:18:31 - 0:18:34     Text: when we talk about this paper.

0:18:34 - 0:18:38     Text: So let's talk about data then.

0:18:38 - 0:18:47     Text: So let's go back to our English Nepali translation task.

0:18:47 - 0:18:54     Text: So there is a resource called OPUS, which

0:18:54 - 0:18:57     Text: is a very nice collection

0:18:57 - 0:19:03     Text: of datasets, all publicly available in lots of languages.

0:19:03 - 0:19:06     Text: And when you go to this website, the Opus website,

0:19:06 - 0:19:10     Text: you find that for English Nepali, actually,

0:19:10 - 0:19:13     Text: there are 1 million parallel sentences.

0:19:13 - 0:19:16     Text: So maybe I lied to you, telling you

0:19:16 - 0:19:17     Text: that this is a low-resource language.

0:19:17 - 0:19:23     Text: But actually, if you look at what these corpora

0:19:23 - 0:19:26     Text: are, you realize that pretty much half a million

0:19:26 - 0:19:32     Text: these sentences come from JW300, which

0:19:32 - 0:19:34     Text: is a religious magazine.

0:19:34 - 0:19:37     Text: And then you have 60,000 sentences from the Bible.

0:19:37 - 0:19:40     Text: And the rest come from GNOME, KDE, Ubuntu.

0:19:40 - 0:19:45     Text: So these are computer related materials.

0:19:45 - 0:19:49     Text: And so again, unless you're interested in translating novel

0:19:49 - 0:19:56     Text: sentences from the Bible, this is not super useful, I would

0:19:56 - 0:19:58     Text: say.

0:19:58 - 0:20:03     Text: And so one thing to note is that all these data originates

0:20:03 - 0:20:03     Text: from English.

0:20:03 - 0:20:07     Text: You have nothing that originates from Nepali, first of all.

0:20:07 - 0:20:09     Text: And second of all, if you're interested in, let's say,

0:20:09 - 0:20:14     Text: translating Wikipedia, all you have is Wikipedia monolingual

0:20:14 - 0:20:16     Text: data, both in English and in Nepali.

0:20:16 - 0:20:20     Text: And in Nepali there is not even very much.

0:20:20 - 0:20:22     Text: And then of course, you can add some monolingual data

0:20:22 - 0:20:24     Text: in another domain, like Common Crawl, which

0:20:24 - 0:20:28     Text: is just a dump of the internet.

0:20:28 - 0:20:32     Text: But again, translating between English and Nepali using

0:20:32 - 0:20:35     Text: publicly available data is going to be a challenge,

0:20:35 - 0:20:40     Text: because you don't have any in-domain parallel data.

0:20:40 - 0:20:45     Text: All you have is at most some in domain monolingual data.

0:20:45 - 0:20:47     Text: But there is an even bigger problem, which

0:20:47 - 0:20:52     Text: is that there is no test data.

0:20:52 - 0:20:56     Text: So here, we don't have reference translations in Nepali

0:20:56 - 0:21:00     Text: to measure the quality of our machine translation system.

0:21:00 - 0:21:04     Text: And this is a big problem, because if you don't have a high-

0:21:04 - 0:21:08     Text: quality test set, or you don't have one at all,

0:21:08 - 0:21:11     Text: It's very hard to compare models, and it's very hard

0:21:11 - 0:21:14     Text: to do model selection to compare algorithms

0:21:14 - 0:21:17     Text: to do model selection, to compare algorithms, and our field would be crippled.

0:21:17 - 0:21:22     Text: We need strong evaluation benchmarks.

0:21:22 - 0:21:26     Text: And so this motivated a project that's called Flores,

0:21:26 - 0:21:29     Text: which stands for Facebook Low Resource

0:21:29 - 0:21:33     Text: Evaluation Benchmark for Machine Translation,

0:21:33 - 0:21:39     Text: where we took Wikipedia sentences in English and translated

0:21:39 - 0:21:41     Text: them into Nepali and Sinhala.

0:21:41 - 0:21:48     Text: And then we took Wikipedia sentences from Nepali, Wikipedia,

0:21:48 - 0:21:51     Text: and translated them into English, as well as from Sinhala

0:21:51 - 0:21:55     Text: Wikipedia and translated them into English.

0:21:55 - 0:21:58     Text: So you may say, this is a little bit boring,

0:21:58 - 0:22:02     Text: because what's hard about it?

0:22:02 - 0:22:06     Text: Just tell me about tricks to do better modeling.

0:22:06 - 0:22:09     Text: But actually, you'll be surprised that this data collection

0:22:09 - 0:22:15     Text: process was harder and more interesting also than we thought.

0:22:15 - 0:22:22     Text: So it is hard because there are very few fluent professional

0:22:22 - 0:22:24     Text: translators in these languages, and this

0:22:24 - 0:22:27     Text: is not even super low resource.

0:22:27 - 0:22:33     Text: And so we dealt with translation agencies.

0:22:33 - 0:22:38     Text: And typically, there are not enough people with whom you can do

0:22:38 - 0:22:41     Text: a kind of A/B test to compare the translation of one person

0:22:41 - 0:22:43     Text: with another one.

0:22:43 - 0:22:44     Text: That's number one.

0:22:44 - 0:22:47     Text: Number two, in general, it's very hard

0:22:47 - 0:22:51     Text: to assess automatically the quality of a translation,

0:22:51 - 0:22:54     Text: because we don't have enough parallel data

0:22:54 - 0:22:56     Text: to train machine translation system, right?

0:22:56 - 0:22:59     Text: And so we need to rely on other methods

0:22:59 - 0:23:01     Text: than a well-trained machine translation system

0:23:01 - 0:23:04     Text: to assess the quality.

0:23:04 - 0:23:10     Text: And so we built a pipeline where we would have,

0:23:10 - 0:23:13     Text: we would send the sentences to the translators

0:23:13 - 0:23:15     Text: once the translators are back.

0:23:15 - 0:23:23     Text: We would do several checks like fluency checks using a language

0:23:23 - 0:23:24     Text: model.

0:23:24 - 0:23:26     Text: We would check for transliteration to make sure

0:23:26 - 0:23:30     Text: that a sentence is not translated by simply transliterating.

0:23:30 - 0:23:35     Text: We would check that the language is the desired one, right?

0:23:35 - 0:23:37     Text: And so we would have a lot of checks like that.

0:23:37 - 0:23:39     Text: And of course, here there

0:23:39 - 0:23:42     Text: are thresholds that you need to set somehow.

0:23:42 - 0:23:50     Text: And then for those sentences that would fail this step,

0:23:50 - 0:23:55     Text: we would send them back to retranslation.

0:23:55 - 0:23:58     Text: And so after a few iterations, eventually

0:23:58 - 0:24:00     Text: we also do a human evaluation.

0:24:00 - 0:24:04     Text: And then the sentences in this evaluation

0:24:04 - 0:24:12     Text: benchmark are those that have passed all the automatic

0:24:12 - 0:24:15     Text: and human assessment checks.

0:24:15 - 0:24:21     Text: Now, it turns out that there is not even a very good literature

0:24:21 - 0:24:23     Text: that tells you how to collect data.

0:24:23 - 0:24:25     Text: And in particular, for low-resource languages,

0:24:25 - 0:24:29     Text: there are a lot of issues related to the quality

0:24:29 - 0:24:30     Text: of the translations.

0:24:30 - 0:24:34     Text: And so this was a project that we thought

0:24:34 - 0:24:35     Text: would take us a couple of months.

0:24:35 - 0:24:40     Text: But instead it took us more than six months.

0:24:40 - 0:24:44     Text: And so eventually we collected a validation set,

0:24:44 - 0:24:46     Text: a test set, and also a hidden test set,

0:24:46 - 0:24:50     Text: because we use this data for a WMT competition.

0:24:50 - 0:24:54     Text: And for that, they needed to have a test set that was not

0:24:54 - 0:24:57     Text: available to people to make sure that people were not

0:24:57 - 0:25:00     Text: cross-validating on the test set.

0:25:00 - 0:25:04     Text: And here's some examples of sentences.

0:25:04 - 0:25:08     Text: So this is a sentence from the Sinhala Wikipedia

0:25:08 - 0:25:11     Text: translated into English, a couple of sentences here.

0:25:11 - 0:25:14     Text: And this is from English Wikipedia translated into Sinhala.

0:25:14 - 0:25:19     Text: I don't know how many people in the audience come from Sri Lanka.

0:25:19 - 0:25:22     Text: That could appreciate this effort.

0:25:22 - 0:25:25     Text: But one interesting thing that you can already see

0:25:25 - 0:25:28     Text: is that although this is totally anecdotal,

0:25:28 - 0:25:31     Text: because it's just a couple of sentences

0:25:31 - 0:25:35     Text: for Sinhala-English, you can see that the kind

0:25:35 - 0:25:37     Text: of topic distribution is different.

0:25:37 - 0:25:42     Text: And here you have things that would be a little unlikely

0:25:42 - 0:25:46     Text: in English Wikipedia.

0:25:46 - 0:25:49     Text: And the same is for Nepali-English and English

0:25:49 - 0:25:51     Text: Nepali.

0:25:51 - 0:25:55     Text: So we have a GitHub repository where we host the data

0:25:55 - 0:26:01     Text: and also baseline models that we train on public available data

0:26:01 - 0:26:04     Text: and then tested on this FLoRes benchmark.

0:26:04 - 0:26:08     Text: And last week we released another couple of languages,

0:26:08 - 0:26:10     Text: English-Pashto and English-Khmer.

0:26:10 - 0:26:14     Text: And we are adding more and more languages

0:26:14 - 0:26:16     Text: in the coming months.

0:26:16 - 0:26:23     Text: So the point of this section is just to say that data is often-

0:26:23 - 0:26:27     Text: times more important than model design.

0:26:27 - 0:26:29     Text: Because without data, in particular,

0:26:29 - 0:26:31     Text: without a good evaluation benchmark,

0:26:31 - 0:26:37     Text: it is essentially impossible to do research in this area.

0:26:37 - 0:26:39     Text: And collecting data is not trivial.

0:26:39 - 0:26:45     Text: It's not trivial: the process that you use is not well

0:26:45 - 0:26:52     Text: established, and in practice, it is hard to do.

0:26:52 - 0:26:55     Text: And another thing to consider, sorry,

0:26:55 - 0:27:00     Text: is to look at the data, look at the data when you collected

0:27:00 - 0:27:02     Text: and also before you start training your model.

0:27:02 - 0:27:06     Text: Because you may realize some issues with the quality

0:27:06 - 0:27:09     Text: of the translations, if you speak the language,

0:27:09 - 0:27:12     Text: oftentimes English is on one side.

0:27:12 - 0:27:15     Text: And or you may discover biases or you

0:27:15 - 0:27:17     Text: may discover interesting things.

0:27:17 - 0:27:19     Text: So always look at the data as opposed

0:27:19 - 0:27:25     Text: to just apply your method in a black box way.

0:27:25 - 0:27:31     Text: That concludes my little discussion of the data part.

0:27:31 - 0:27:32     Text: Are there any questions on this?

0:27:32 - 0:27:35     Text: If you could talk about building a language model

0:27:35 - 0:27:36     Text: for low resource language.

0:27:36 - 0:27:37     Text: Yeah, yeah, yeah.

0:27:37 - 0:27:41     Text: So in this case, what we did, actually,

0:27:41 - 0:27:47     Text: we took the common crawl data.

0:27:47 - 0:27:49     Text: And I think I actually don't remember exactly.

0:27:49 - 0:27:53     Text: So for Nepali, I think we had to concatenate the Wikipedia

0:27:53 - 0:27:56     Text: data and the common crawl data because the Wikipedia data

0:27:56 - 0:27:58     Text: was just too small.

0:27:58 - 0:28:02     Text: And we simply trained a count-based n-gram language model.

0:28:02 - 0:28:04     Text: And then the count-based n-gram gives you,

0:28:04 - 0:28:06     Text: I don't know if you studied this, but it gives you

0:28:06 - 0:28:11     Text: the probability of a word given some fixed window of context.

0:28:11 - 0:28:15     Text: And then for a given sentence, we would, like,

0:28:15 - 0:28:18     Text: let's say, what is it?

0:28:18 - 0:28:23     Text: For a given sentence, you would compute the score for every word.

0:28:23 - 0:28:29     Text: And then the score of a sentence is simply the average

0:28:29 - 0:28:32     Text: probability score across all the words in the sentence.

0:28:32 - 0:28:33     Text: And that would give you a score.

0:28:33 - 0:28:38     Text: And then we would simply have a threshold on that.

0:28:38 - 0:28:42     Text: And so all the sentences that would score too low,

0:28:42 - 0:28:44     Text: that would be deemed not fluent enough,

0:28:44 - 0:28:48     Text: would be sent back for rework.
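
A rough sketch of the fluency check described above, assuming a hypothetical n-gram language model exposed through lm_logprob(word, context); the threshold value is illustrative, not the one used in the project:

def sentence_score(lm_logprob, words, order=5):
    # Average per-word log-probability under an n-gram LM with a fixed context window.
    scores = [lm_logprob(w, tuple(words[max(0, i - order + 1):i]))
              for i, w in enumerate(words)]
    return sum(scores) / len(scores)

def flag_for_rework(lm_logprob, sentence, threshold=-8.0):
    # Translations scoring below the fluency threshold get sent back for rework.
    return sentence_score(lm_logprob, sentence.split()) < threshold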

0:28:48 - 0:28:52     Text: But of course, with a model like this,

0:28:52 - 0:28:58     Text: you know, it's not super reliable.

0:28:58 - 0:29:03     Text: And if you go on languages that are even lower resource

0:29:03 - 0:29:06     Text: than Sinhala, then you don't even really have

0:29:06 - 0:29:08     Text: in domain data.

0:29:08 - 0:29:10     Text: Like Wikipedia is not in all the languages.

0:29:10 - 0:29:13     Text: And then it becomes even harder.

0:29:13 - 0:29:18     Text: And so oftentimes, so now that we are scaling this up,

0:29:18 - 0:29:22     Text: we are looking at neural language models

0:29:22 - 0:29:24     Text: that are trained in a multilingual way

0:29:24 - 0:29:29     Text: and that are fine-tuned on a small in-domain

0:29:29 - 0:29:33     Text: monolingual data set, if available.

0:29:33 - 0:29:40     Text: But also this step is not particularly obvious how to do it.

0:29:40 - 0:29:40     Text: Yeah, sure.

0:29:40 - 0:29:45     Text: So thank you for this amazing result.

0:29:45 - 0:29:49     Text: But I just want to comment, because I've noticed that Wikipedia

0:29:49 - 0:29:52     Text: actually will have different content with different language

0:29:52 - 0:29:53     Text: you choose.

0:29:53 - 0:29:57     Text: So for example, they'll have very detailed description

0:29:57 - 0:30:00     Text: of some basic topic.

0:30:00 - 0:30:04     Text: And then in other languages, even with really commonly

0:30:04 - 0:30:07     Text: used language like Chinese, they'll actually just

0:30:07 - 0:30:11     Text: have completely different content or basically simplified

0:30:11 - 0:30:12     Text: content.

0:30:12 - 0:30:16     Text: So I'm pretty sure this is also going

0:30:16 - 0:30:19     Text: to happen with rarely used languages.

0:30:19 - 0:30:23     Text: So yeah, I just generally think that Wikipedia

0:30:23 - 0:30:28     Text: might not be, basically they might not

0:30:28 - 0:30:32     Text: be very direct reference to the translation.

0:30:32 - 0:30:33     Text: Yeah.

0:30:33 - 0:30:34     Text: Yeah.

0:30:34 - 0:30:36     Text: So it's an excellent point.

0:30:36 - 0:30:37     Text: And this is something that I'm going

0:30:37 - 0:30:40     Text: to discuss more in the third part of the lecture.

0:30:40 - 0:30:46     Text: And in a way, this is the translation problem.

0:30:46 - 0:30:52     Text: So we need to accept the fact that content that is originated

0:30:52 - 0:30:56     Text: in a certain language may have a different topic distribution

0:30:56 - 0:30:59     Text: than content that originates in another language.

0:30:59 - 0:31:03     Text: And what you want to translate is really content that originates

0:31:03 - 0:31:06     Text: in the source language.

0:31:06 - 0:31:09     Text: And so you need to live with it.

0:31:09 - 0:31:11     Text: That's what it is.

0:31:11 - 0:31:18     Text: So oftentimes in the public benchmarks, in the literature,

0:31:18 - 0:31:22     Text: you find that people assume that corpora are comparable.

0:31:22 - 0:31:25     Text: So everything that originates in English

0:31:25 - 0:31:29     Text: and let's say Nepali essentially comes from the same kind

0:31:29 - 0:31:29     Text: of sources.

0:31:29 - 0:31:34     Text: So it's news, and it's all news talking about the same things.

0:31:34 - 0:31:35     Text: But in practice, this is not true.

0:31:35 - 0:31:39     Text: It's not true for Wikipedia, as you correctly said.

0:31:39 - 0:31:41     Text: And the same goes for news, right?

0:31:41 - 0:31:45     Text: Because if I look at local news in Nepal

0:31:45 - 0:31:49     Text: and local news over here, it's quite different.

0:31:49 - 0:31:52     Text: So this is a general problem.

0:31:52 - 0:31:56     Text: And this has implications in terms of the methods

0:31:56 - 0:31:58     Text: that we are going to use as we will discuss later.

0:32:05 - 0:32:07     Text: Other questions?

0:32:07 - 0:32:10     Text: Or I'm not sure if I was clear.

0:32:10 - 0:32:17     Text: It's really hard to speak without seeing, without feedback.

0:32:17 - 0:32:21     Text: Please, please let me know if anything is not clear.

0:32:21 - 0:32:23     Text: OK, let's talk about modeling.

0:32:23 - 0:32:28     Text: And this is going to be most of our,

0:32:28 - 0:32:31     Text: where we are going to spend most of our time.

0:32:31 - 0:32:35     Text: So remember that we have this funky chart,

0:32:35 - 0:32:36     Text: where we have domain and languages.

0:32:36 - 0:32:41     Text: And it's a pretty complicated learning setting.

0:32:41 - 0:32:43     Text: And here for simplicity, we are going

0:32:43 - 0:32:49     Text: to focus just on English and Nepali languages.

0:32:49 - 0:32:51     Text: And we start with a simplest setting ever,

0:32:51 - 0:32:53     Text: which is supervised learning, assuming

0:32:53 - 0:32:57     Text: that all data is in the same domain.

0:32:57 - 0:33:01     Text: So perhaps you have a small training set.

0:33:01 - 0:33:05     Text: And the test set is in the same domain as the training set,

0:33:05 - 0:33:08     Text: as the parallel training set.

0:33:08 - 0:33:11     Text: So we denote as x the source sentence,

0:33:11 - 0:33:14     Text: as y, the target sentence, d.

0:33:14 - 0:33:17     Text: So d is the parallel data set that

0:33:17 - 0:33:21     Text: collects all these sentence pairs.

0:33:21 - 0:33:25     Text: And so this is the typical empirical risk minimization framework,

0:33:25 - 0:33:28     Text: whereby you do supervised learning.

0:33:28 - 0:33:30     Text: In this case, you minimize the cross-entropy loss,

0:33:30 - 0:33:33     Text: and you want to maximize the probability of the target

0:33:33 - 0:33:35     Text: sentence given the source sentence.

0:33:35 - 0:33:41     Text: And so a way to visualize this is to say that x is my English sentence.

0:33:41 - 0:33:45     Text: It goes to my encoder-decoder NMT system

0:33:45 - 0:33:47     Text: that produces a prediction.

0:33:47 - 0:33:49     Text: And then we have a loss that measures

0:33:49 - 0:33:51     Text: the discrepancy with the human reference:

0:33:51 - 0:33:54     Text: you took the sentence x,

0:33:54 - 0:33:58     Text: you asked your translator, who gave you the human reference,

0:33:58 - 0:33:59     Text: and so the cross-entropy loss measures

0:33:59 - 0:34:02     Text: the discrepancy between the model prediction

0:34:02 - 0:34:03     Text: and the human reference.
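
A minimal sketch of this supervised objective, assuming a hypothetical PyTorch seq2seq `model` that returns next-token logits; it computes the token-level cross-entropy, i.e. the negative log-probability of the reference translation given the source:

import torch
import torch.nn.functional as F

def supervised_mt_loss(model, src_ids, tgt_ids, pad_id=0):
    # -log P(y | x) factorized over target tokens: predict y_t from y_<t and the
    # encoded source; `model` is a hypothetical seq2seq returning logits of shape
    # (batch, tgt_len - 1, vocab).
    logits = model(src_ids, tgt_ids[:, :-1])
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (batch * (tgt_len - 1), vocab)
        tgt_ids[:, 1:].reshape(-1),           # next-token targets
        ignore_index=pad_id,                  # skip padding positions
    )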

0:34:06 - 0:34:12     Text: Now, notice that here I'm denoting with boxes

0:34:12 - 0:34:16     Text: the model components: in this case, the blue box is

0:34:16 - 0:34:20     Text: the encoder that processes English sentences,

0:34:20 - 0:34:26     Text: and the red box is the decoder that operates in Nepali.

0:34:26 - 0:34:30     Text: And I just wanted to add one more thing,

0:34:30 - 0:34:33     Text: which is that if you don't have a lot of parallel data,

0:34:33 - 0:34:34     Text: you need to regularize.

0:34:34 - 0:34:38     Text: And so you can do weight decay, which is pretty standard.

0:34:38 - 0:34:42     Text: So you kind of minimize the norm of the parameters,

0:34:42 - 0:34:44     Text: but there are also other methods

0:34:44 - 0:34:46     Text: that I think in the machine learning class,

0:34:46 - 0:34:49     Text: you may have seen like dropout, where you set to 0

0:34:49 - 0:34:53     Text: at random hidden units in your encoder, decoder,

0:34:53 - 0:34:55     Text: or you can do label smoothing, whereby you,

0:34:58 - 0:35:01     Text: in your cross-entropy loss, instead of,

0:35:01 - 0:35:02     Text: actually, should be more over here.

0:35:03 - 0:35:08     Text: Instead of setting as a target for the correct word,

0:35:09 - 0:35:11     Text: so this is the probability over the whole sequence,

0:35:11 - 0:35:15     Text: which you can factorize over each individual word,

0:35:15 - 0:35:17     Text: by the product rule.

0:35:18 - 0:35:23     Text: So for every word, you have the correct word that you want,

0:35:24 - 0:35:26     Text: sorry, at every time step,

0:35:26 - 0:35:28     Text: you want to predict the next word.

0:35:28 - 0:35:31     Text: And now instead of assigning 100% of probability

0:35:31 - 0:35:34     Text: on the next word, you, let's say,

0:35:34 - 0:35:37     Text: you assign 90% of the probability,

0:35:37 - 0:35:40     Text: and the remaining 10% you evenly distribute

0:35:40 - 0:35:42     Text: across all the remaining words, so that the model

0:35:42 - 0:35:43     Text: is not too overly confident.

0:35:43 - 0:35:45     Text: So the combinations of these two things

0:35:45 - 0:35:50     Text: are usually good ways to regularize the system.
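
A sketch of the label smoothing just described, in PyTorch; here the smoothing mass eps is spread uniformly over the whole vocabulary, a common approximation of the 90%/10% split mentioned above:

import torch
import torch.nn.functional as F

def label_smoothed_loss(logits, targets, eps=0.1):
    # Instead of putting 100% of the target probability on the gold next word,
    # put (1 - eps) on it and spread eps uniformly over the vocabulary.
    logp = F.log_softmax(logits, dim=-1)                        # (batch, vocab)
    gold = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # -log p(gold word)
    uniform = -logp.mean(dim=-1)                                # uniform smoothing term
    return ((1.0 - eps) * gold + eps * uniform).mean()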

0:35:51 - 0:35:53     Text: So that's the simplest setting.

0:35:53 - 0:35:55     Text: Now, let's see what happens when we have also

0:35:55 - 0:35:58     Text: some source side monolingual data.

0:35:58 - 0:36:02     Text: So here now we have an additional data set

0:36:02 - 0:36:07     Text: that has only sentences in the source language English.

0:36:07 - 0:36:10     Text: So in addition to T, now we have also MS,

0:36:10 - 0:36:14     Text: which is the monolingual data on the source side.

0:36:14 - 0:36:16     Text: So we have a bunch of x's.

0:36:16 - 0:36:21     Text: So typically, M is much greater than the number of parallel sentences, right?

0:36:21 - 0:36:25     Text: And now a typical way to use this data

0:36:25 - 0:36:30     Text: is to model the margin distribution of the data of X, right?

0:36:30 - 0:36:33     Text: And so there are many ways to do that.

0:36:33 - 0:36:35     Text: One way that has proved that,

0:36:35 - 0:36:38     Text: one way that has proven to be pretty effective

0:36:38 - 0:36:42     Text: in machine translation is to do denoising auto encoding.

0:36:42 - 0:36:47     Text: And so here the idea is that you have something

0:36:47 - 0:36:52     Text: similar to what we had before, except that now the input

0:36:52 - 0:36:56     Text: is taken from this monolingual data set, okay?

0:36:56 - 0:36:58     Text: And you add noise to it,

0:36:58 - 0:37:01     Text: and I'm going to describe the noise in a second.

0:37:01 - 0:37:05     Text: And then the job of the encoder decoder

0:37:05 - 0:37:10     Text: is simply to denoise the noisy input.

0:37:10 - 0:37:13     Text: And the cross-central P loss measured this discrepancy

0:37:13 - 0:37:18     Text: between the prediction and the actual clean input.

0:37:18 - 0:37:23     Text: But now notice that the decoder is not this red decoder,

0:37:23 - 0:37:27     Text: because the decoder now is a decoder that operates in English.

0:37:27 - 0:37:29     Text: But the encoder does not.

0:37:29 - 0:37:35     Text: Now the encoder is the same that you have seen here.

0:37:35 - 0:37:41     Text: So again, the loss function here is very similar to before,

0:37:41 - 0:37:46     Text: except that the target is the clean input X.

0:37:46 - 0:37:53     Text: And the input is a noised version of X.

0:37:53 - 0:37:55     Text: So in this case,

0:37:55 - 0:37:58     Text: we are not predicting something in Nepali from something in English.

0:37:58 - 0:38:02     Text: So this is a, if you want, a limitation of this work.

0:38:02 - 0:38:08     Text: But this is useful because you are anyway doing some good modeling

0:38:08 - 0:38:11     Text: of the input sentences.

0:38:11 - 0:38:14     Text: And you're going to train the encoder parameters

0:38:14 - 0:38:18     Text: that are going to be shared with your supervised system.

0:38:18 - 0:38:24     Text: So the encoder is shared between the translation task

0:38:24 - 0:38:29     Text: on parallel data, right, and the denoising autoencoding task.

0:38:29 - 0:38:31     Text: So essentially you have an encoder and two decoders.

0:38:31 - 0:38:35     Text: One that operates in Nepali, one that operates in English.

0:38:35 - 0:38:40     Text: And so in terms of noise,

0:38:40 - 0:38:44     Text: there are essentially two types of noise that we have been using our work.

0:38:44 - 0:38:47     Text: Others are possible.

0:38:47 - 0:38:51     Text: In the simplest case, you can drop words or swap words.

0:38:51 - 0:38:55     Text: So assume that the input sentence is "the cat sat on the mat".

0:38:55 - 0:39:01     Text: Then if you swap words, you may provide a scrambled input like "the cat on sat the mat".

0:39:01 - 0:39:06     Text: And so here the encoder-decoder needs to understand a little bit of the syntax,

0:39:06 - 0:39:10     Text: the grammatical rules in order to reorder.

0:39:10 - 0:39:14     Text: If you drop, let's say you drop the last word, "the cat sat on the",

0:39:14 - 0:39:17     Text: then the model needs to understand a little bit of the semantics

0:39:17 - 0:39:22     Text: because it needs to assign higher probability to mat now, right?

0:39:22 - 0:39:27     Text: And so you can see that there is a little bit of,

0:39:27 - 0:39:29     Text: so there are two hyper parameters here.

0:39:29 - 0:39:34     Text: So one, actually, there are several ways to use denoising autoencoding.

0:39:34 - 0:39:38     Text: So you can use denoising autoencoding as a way to pre-train the encoder.

0:39:38 - 0:39:44     Text: Or you can use it as auxiliary loss when you do supervised learning.

0:39:44 - 0:39:49     Text: So you can have this term plus lambda this term.

0:39:49 - 0:39:52     Text: Okay, so both ways are fine.

0:39:52 - 0:39:58     Text: And so there is a very critical hyper parameter here,

0:39:58 - 0:40:00     Text: which is the noise level.

0:40:00 - 0:40:03     Text: If you don't have any noise or if the noise level is too low,

0:40:03 - 0:40:07     Text: then this task is trivial because of the attention,

0:40:07 - 0:40:09     Text: you can simply copy the input.

0:40:09 - 0:40:13     Text: And so the encoder and the decoder don't need to learn anything.

0:40:13 - 0:40:18     Text: If the noise level is too high, then you destroy the input here.

0:40:18 - 0:40:20     Text: So the encoder is not useful.

0:40:20 - 0:40:23     Text: You just do language modeling using the decoder.

0:40:23 - 0:40:30     Text: But remember that this decoder is then not used for translation

0:40:30 - 0:40:35     Text: because what you use in the machine translation system is the encoder box, right?

0:40:35 - 0:40:38     Text: The encoder module.

0:40:38 - 0:40:42     Text: Okay, so there are other ways to use source-

0:40:42 - 0:40:45     Text: side monolingual data.

0:40:45 - 0:40:47     Text: In addition to denoising autoencoding,

0:40:47 - 0:40:49     Text: you could also do self-training,

0:40:49 - 0:40:55     Text: which is a method that comes from the 90s, if not earlier.

0:40:55 - 0:40:57     Text: And the idea is very simple.

0:40:57 - 0:41:02     Text: So again, you take a sentence from your source side monolingual data set.

0:41:02 - 0:41:06     Text: And you add noise to it, and then you have an encoder-decoder that

0:41:06 - 0:41:10     Text: tries, this time, to translate from this noisy input.

0:41:10 - 0:41:13     Text: Okay, and now what's the reference?

0:41:13 - 0:41:20     Text: The reference is given by a stale version of your machine translation system.

0:41:20 - 0:41:24     Text: Okay, where the reference is produced by, let's say, beam search.

0:41:24 - 0:41:32     Text: And so the cross-entropy loss is then going to measure the discrepancy between your prediction and

0:41:32 - 0:41:36     Text: the prediction that a stale version of your system gave.

0:41:36 - 0:41:43     Text: The reason why this works is that when you do beam search, you actually typically

0:41:43 - 0:41:46     Text: produce better quality

0:41:46 - 0:41:48     Text: outputs.

0:41:48 - 0:41:53     Text: And so when you train now this encoder decoder by cross entropy loss,

0:41:53 - 0:41:56     Text: you're going to learn the decoding process.

0:41:56 - 0:41:59     Text: Okay, and so this is something.

0:41:59 - 0:42:01     Text: Good for you.

0:42:01 - 0:42:05     Text: In addition, when you train you inject noise and the noise is regularizing.

0:42:11 - 0:42:15     Text: And so if you're predicting one sentence correctly, now also nearby sentences,

0:42:15 - 0:42:21     Text: and by nearby I mean sentences that are similar phrases or have a good overlap

0:42:21 - 0:42:26     Text: with the current sentence, are going to be more likely predicted correctly.

0:42:26 - 0:42:30     Text: And so we have this paper where we analyze a little bit of this aspects.

0:42:30 - 0:42:33     Text: And so the algorithm is very simple.

0:42:33 - 0:42:37     Text: So you can train your machine translation system on the parallel data.

0:42:37 - 0:42:39     Text: And then you repeat the following process.

0:42:39 - 0:42:46     Text: So first you decode your monolingual data set using your current machine translation system.

0:42:46 - 0:42:53     Text: And you make a new parallel data set of sentences from your monolingual data set with,

0:42:53 - 0:42:58     Text: sorry, with the translations from your current system.

0:42:58 - 0:43:08     Text: And then you retrain the model, this p of y given x, on the union of your original parallel data plus this auxiliary data set.

0:43:08 - 0:43:12     Text: And so here you have two hyper parameters.

0:43:12 - 0:43:19     Text: One is the noise level and the other is the hyper parameters that weight this auxiliary data set.

0:43:19 - 0:43:23     Text: So this is a training loss.
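
A sketch of this self-training loop; train_mt, translate and add_noise are hypothetical placeholders standing in for cross-entropy training, beam-search decoding, and the noising function:

def self_training(train_mt, translate, add_noise, parallel, mono_src, rounds=2, weight=1.0):
    # Step 0: train on the real parallel data only.
    model = train_mt(parallel, synthetic=[], synthetic_weight=weight)
    for _ in range(rounds):
        # 1) Decode the source-side monolingual data with the current ("stale") model,
        #    pairing a noised input with the model's own beam output as the target.
        synthetic = [(add_noise(x), translate(model, x)) for x in mono_src]
        # 2) Retrain on the union of the real parallel data and the weighted synthetic pairs.
        model = train_mt(parallel, synthetic=synthetic, synthetic_weight=weight)
    return model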

0:43:23 - 0:43:30     Text: So that concludes how we can use source-side monolingual data.

0:43:30 - 0:43:35     Text: Let me say a word about how we can use target-side monolingual data.

0:43:35 - 0:43:43     Text: So you could use the target-side monolingual data to train a language model and then train the machine translation system in the residual space of this language model.

0:43:43 - 0:43:48     Text: But it turns out that there is a much more effective way to leverage this data.

0:43:48 - 0:43:50     Text: And that's called back translation.

0:43:50 - 0:44:00     Text: At the high level it works as follows: you take a sentence from your target-side monolingual data set, y here.

0:44:00 - 0:44:07     Text: And on the parallel data set you also train a backward machine translation system that goes from Nepali to English.

0:44:07 - 0:44:16     Text: Okay, so now you have a red encoder that takes Nepali and a blue decoder that works in the English space.

0:44:16 - 0:44:20     Text: So you map the Nepali sentence into English,

0:44:20 - 0:44:26     Text: and now this may not be a correct translation.

0:44:26 - 0:44:33     Text: It's a noisy input that you feed to your encoder decoder that you want to train.

0:44:33 - 0:44:43     Text: And so now the input is noisy but the target here is clean because it comes from the original target-side monolingual data set.

0:44:43 - 0:44:48     Text: So this is a very powerful algorithm.

0:44:48 - 0:44:52     Text: Because.

0:44:52 - 0:45:12     Text: Unlike self-training, here the target is clean but the input is a little noisy, and that's usually much better than having clean inputs but noisy targets, right, because the targets affect essentially all the signals that you propagate through the NMT system.

0:45:12 - 0:45:26     Text: And you can see that back-translation is a way to do data augmentation, because you produce noisy versions of inputs for a given target, a little bit like in vision.

0:45:26 - 0:45:38     Text: I guess this is not the right audience for this analogy, but in vision you would do scaling, rotation, different croppings, and that's a little bit similar to what we're doing here.

0:45:38 - 0:45:54     Text: The algorithm again is: you train a backward and a forward machine translation system on the parallel data, and then you use your backward model to decode the target-side monolingual data set to produce an auxiliary parallel data set.

0:45:54 - 0:46:06     Text: And then you concatenate the two data sets the original parallel data set and the auxiliary one to train the new forward model.
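
A sketch of the back-translation loop in the same style, again with hypothetical train_mt and translate placeholders; note that the synthetic (noisy) side is the input while the target side stays clean:

def back_translation(train_mt, translate, parallel, mono_tgt, rounds=2, weight=1.0):
    # parallel holds (src, tgt) pairs; mono_tgt holds target-language sentences.
    forward = train_mt(parallel, synthetic=[], synthetic_weight=weight)   # src -> tgt
    backward = train_mt([(y, x) for x, y in parallel],
                        synthetic=[], synthetic_weight=weight)            # tgt -> src
    for _ in range(rounds):
        # Back-translate target monolingual data: synthetic source, clean target.
        synthetic = [(translate(backward, y), y) for y in mono_tgt]
        forward = train_mt(parallel, synthetic=synthetic, synthetic_weight=weight)
    return forward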

0:46:06 - 0:46:08     Text: So you can combine.

0:46:08 - 0:46:15     Text: self-training and back-translation, if you have both a source-side monolingual data set and a target-side monolingual data set.

0:46:15 - 0:46:35     Text: You can do the following: you use the parallel data to train the forward and the backward machine translation systems, and then at step two you use the forward model to translate the source-side monolingual data set into these translations.

0:46:35 - 0:46:46     Text: And then you use the backward machine translation system to translate the target-side monolingual data set into these translations.

0:46:46 - 0:46:54     Text: And then you treat this parallel sentences as real data and you concatenate them to the parallel data set.

0:46:54 - 0:46:59     Text: And now you retrain both the forward and the backward machine translation systems.

0:46:59 - 0:47:15     Text: And as long as this improves, you can go and do another iteration, whereby you again re-decode: you translate the source- and target-side monolingual data sets, and then you go and retrain the models.

0:47:15 - 0:47:27     Text: And this is, as far as I know, the most effective way to leverage monolingual data in low-resource languages nowadays.
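
A sketch of the combined loop, following the standard iterative recipe in which each round both directions are re-decoded and retrained on each other's synthetic data (train_mt and translate are hypothetical placeholders as before):

def joint_bt(train_mt, translate, parallel, mono_src, mono_tgt, rounds=2):
    reverse = [(y, x) for x, y in parallel]
    forward = train_mt(parallel, synthetic=[])   # e.g. English -> Nepali
    backward = train_mt(reverse, synthetic=[])   # e.g. Nepali -> English
    for _ in range(rounds):
        fwd_synth = [(translate(backward, y), y) for y in mono_tgt]  # feeds the forward model
        bwd_synth = [(translate(forward, x), x) for x in mono_src]   # feeds the backward model
        forward = train_mt(parallel, synthetic=fwd_synth)
        backward = train_mt(reverse, synthetic=bwd_synth)
    return forward, backward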

0:47:27 - 0:47:40     Text: And let me talk a little bit about how we can do multilingual training. So in this case, we have parallel data sets on different language pairs.

0:47:40 - 0:47:50     Text: And so you have a parallel data set for English-Nepali, one for English-Hindi, one for Hindi-English or Nepali-Hindi, or any subset of these.

0:47:50 - 0:47:58     Text: This is super simple. The way that it works is that you have a single encoder and a single decoder.

0:47:58 - 0:48:02     Text: And you train by supervised learning.

0:48:02 - 0:48:15     Text: The only change that needs to be made is that at the input of the encoder you also concatenate a token that specifies the language into which you want to translate.

0:48:15 - 0:48:28     Text: And so the encoder will learn to process multiple languages. The decoder will learn to produce multiple languages as well.

0:48:28 - 0:48:34     Text: And we pick the output language based on the token specified at the encoder input.

0:48:34 - 0:48:53     Text: And so training is just minimizing the cross-entropy loss on all the parallel data sets that you have, where you simply add an extra token to the source sentence that specifies the target language you want to translate into.
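A minimal illustration of the language-token trick; the "<2xx>" tag format is one common convention and is assumed here, not necessarily the one used on the slides.

    # Prepend a target-language tag to each source sentence so a single
    # encoder/decoder can serve many language pairs.
    def add_language_tag(src_sentence, tgt_lang):
        return f"<2{tgt_lang}> {src_sentence}"

    examples = [
        (add_language_tag("How are you?", "ne"), "तपाईंलाई कस्तो छ?"),  # English -> Nepali
        (add_language_tag("How are you?", "hi"), "आप कैसे हैं?"),       # English -> Hindi
    ]
    # A single model is then trained with the usual cross-entropy loss on the
    # concatenation of all tagged parallel data sets.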

0:48:53 - 0:49:14     Text: And the only thing that I wanted to add on this is that oftentimes you preprocess the data with, I'm not sure if you learned about them, byte pair encoding or SentencePiece, which are essentially ways to segment words into subword units or frequent character sequences.

0:49:14 - 0:49:26     Text: And so if you concatenate the data across languages in order to learn these segmentations, then for many languages there is a good fraction of the dictionary that ends up being shared.

0:49:26 - 0:49:36     Text: And so this also helps make sure that you can do a good job translating multiple languages at once.
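For example, a joint subword vocabulary can be learned with the SentencePiece library roughly as follows; the file names and parameter values are placeholders, and exact options vary by version.

    import sentencepiece as spm

    # "all_languages.txt" would hold English, Nepali and Hindi text concatenated.
    spm.SentencePieceTrainer.train(
        input="all_languages.txt",
        model_prefix="joint_spm",
        vocab_size=32000,
        character_coverage=0.9995,   # useful when scripts differ across languages
    )

    sp = spm.SentencePieceProcessor(model_file="joint_spm.model")
    print(sp.encode("unbelievable", out_type=str))  # e.g. ['▁un', 'believ', 'able']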

0:49:36 - 0:49:49     Text: And so my conclusion so far is that, even without considering domain effects, there are a lot of training paradigms depending on the data that you have available.

0:49:49 - 0:50:06     Text: And so a priori it's very hard to tell which method works best nowadays, because it really depends on how much data you have, how different the domains are, and what language pair you're working with.

0:50:06 - 0:50:15     Text: For instance, the domains may be very different, but if you have a lot of data, maybe it doesn't matter much.

0:50:15 - 0:50:25     Text: But we do have a few key ingredients, denoising auto-encoding, back-translation, and multilingual training, that work pretty well.

0:50:25 - 0:50:39     Text: And nowadays the field is at the stage in which we are trying to figure out the best way to combine them, and right now there is a lot of what I would call craftsmanship involved in doing that.

0:50:39 - 0:50:47     Text: But hopefully we can figure that out, and I think there is a lot of effort in trying to automate this process, because right now

0:50:47 - 0:50:54     Text: there is a lot of cross-validation, I would say, to figure out all these hyperparameters.

0:50:54 - 0:51:21     Text: So the open challenges here are dealing with the diversity of domains, dealing with data sets that have wildly different translation quality (some are very noisy, some are very clean), and dealing with data sets of different sizes and with very different language pairs.

0:51:21 - 0:51:44     Text: And yeah, so I would say that in general it may be counterintuitive, but working in low-resource machine translation doesn't mean training small models on small data; it actually means training even bigger models on even more data, because you need to compensate for the lack of direct supervision.

0:51:44 - 0:51:56     Text: Very good. Before I go on, are there any questions? Yeah, I just had a quick question regarding a few of the previous algorithms that you described.

0:51:56 - 0:52:09     Text: Is it necessary to retrain the model entirely, or is there some way to augment the model or fine-tune it on the generated data?

0:52:09 - 0:52:17     Text: So actually, what usually happens is that as you iterate, you can make the model bigger.

0:52:17 - 0:52:26     Text: So when you train on the parallel data set, usually there is not much data, and so you need to train something smaller, otherwise you overfit too much.

0:52:26 - 0:52:48     Text: Once you add the monolingual data, this back-translated data set, then this model can be much bigger than your original model. Now, it's not super obvious how to, you know, initialize a bigger model from a smaller model.

0:52:48 - 0:53:02     Text: So that's why people usually initialize from random. At the next iterations, you can initialize from the model at the previous iteration.

0:53:02 - 0:53:11     Text: What we usually find is that initializing at random usually works as well.

0:53:11 - 0:53:18     Text: Thank you. Thank you. Okay, any other questions?

0:53:18 - 0:53:29     Text: When you say you make the model bigger, I was wondering, do you mean that you add more layers or more parameters as the model keeps training?

0:53:29 - 0:53:45     Text: So usually you just make it bigger. Yeah, more layers, more parameters. So either wider or deeper.

0:53:45 - 0:54:01     Text: I think usually, yeah, I'm not entirely sure there is a definite answer on that. Usually making the encoder deeper is a good thing.

0:54:01 - 0:54:05     Text: Making the decoder deeper doesn't buy you much.

0:54:05 - 0:54:19     Text: So usually we play with the encoder, I would say. But yeah, there is not so much difference in practice, I would say.

0:54:19 - 0:54:29     Text: So you can imagine just doubling the size of your hidden state; that would work.

0:54:29 - 0:54:42     Text: Okay, so let's see how this works. So I didn't speak about models, I spoke about algorithms. You could turn these algorithms into models and talk about joint distributions and marginal distributions.

0:54:42 - 0:54:51     Text: But in my view, it's just simpler to think in terms of algorithms, because that's also the way we implement them.

0:54:51 - 0:55:06     Text: And so let's see how these algorithms can be put together in some interesting case studies. So actually, I realized that I'm really going slow. So let's see the case where you only have monolingual data and no parallel data.

0:55:06 - 0:55:10     Text: So this is what we call unsupervised machine translation.

0:55:10 - 0:55:24     Text: So let's say that you have an English and a French monolingual data set. This is not a typical use case of unsupervised machine translation, but this is where it works really well. So let's focus on this for now.

0:55:24 - 0:55:34     Text: And so you take a sentence from the target-side monolingual data set, you go through your encoder-decoder, and you produce an English translation. Obviously, you don't have the reference here.

0:55:34 - 0:55:51     Text: So what you could do is feed this to a machine translation system that goes from English to French, so that you kind of reconstruct your original French sentence. And now you have a signal that you can backpropagate through your machine translation system.

0:55:51 - 0:55:55     Text: You can do the same going from English to French to English.

0:55:55 - 0:56:09     Text: This is very much what people have done in vision, called cycle consistency. You can see this as an auto-encoder where the intermediate representation is, you know, a sentence in English.

0:56:09 - 0:56:18     Text: The problem is that, as it is, the model is not constrained to produce something that is a fluent English sentence.

0:56:18 - 0:56:35     Text: So in the vision domain people use adversarial training, but in NLP it's kind of tricky because the output is a discrete sequence. And so, in order to make sure that this decoder produces fluent English sentences, you could imagine doing denoising auto-encoding.

0:56:35 - 0:56:53     Text: So you could take an English sentence, add noise to it, and pass it through your denoising auto-encoder. Now this decoder, which is the same block as up here, is going to be forced to learn the statistics and the regularities of the English language.

0:56:53 - 0:57:11     Text: The problem is that in the denoising auto-encoding game this decoder is operating on the output of an encoder that takes English as input, while here the encoder takes French as input.

0:57:11 - 0:57:17     Text: It could very well be the case that the representations produced by these two encoders are different.

0:57:17 - 0:57:33     Text: So this decoder may work very well in one setting but not in the other. And so, in other words, how can we make sure that these red and blue blocks are interchangeable? How can we make sure that there is good modularity?

0:57:33 - 0:57:49     Text: So one way to do this is to use the trick that we used for multilingual training, whereby we have a single encoder and a single decoder. So the decoder is shared across French and English, and the encoder is shared across English and French.

0:57:49 - 0:58:17     Text: And we specify the target language by an extra token at the input. And so in particular, if you learn common BPEs and if you share parameters, then this process really works well, and you have a decoder that operates well whenever it is fed with hidden states that come from an encoder operating on English or on French.

0:58:17 - 0:58:28     Text: And so again, the key ingredients are iterative back-translation, denoising auto-encoding, and multilingual training. And that's how we do unsupervised machine translation.
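A sketch of one unsupervised MT training step combining these ingredients, with a single shared model and hypothetical translate/loss helpers; this illustrates the idea, it is not the exact recipe used in the papers.

    # "model" is a single shared encoder/decoder; the first token of the source
    # selects the output language, as in the multilingual setup above.
    def unsupervised_mt_step(model, en_sentence, fr_sentence, noise, loss_fn):
        # Denoising auto-encoding: reconstruct each sentence from a noised copy.
        dae_loss = (loss_fn(model, f"<2en> {noise(en_sentence)}", en_sentence) +
                    loss_fn(model, f"<2fr> {noise(fr_sentence)}", fr_sentence))

        # On-the-fly back-translation (cycle consistency): translate with the
        # current model, then learn to reconstruct the original sentence.
        fr_to_en = model.translate(f"<2en> {fr_sentence}")   # may be a poor translation
        en_to_fr = model.translate(f"<2fr> {en_sentence}")
        bt_loss = (loss_fn(model, f"<2fr> {fr_to_en}", fr_sentence) +
                   loss_fn(model, f"<2en> {en_to_fr}", en_sentence))

        return dae_loss + bt_loss   # minimized jointly over both monolingual corpora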

0:58:28 - 0:58:57     Text: And so when you do this on English-French, you find that you can get pretty good performance, a BLEU of 30, which usually gives you pretty fluent translations that are also adequate.

0:58:57 - 0:59:22     Text: And if you compare that to what you get with a supervised baseline trained on a parallel data set, you find that training on 10 million monolingual sentences in English and 10 million in French gives you the same translation accuracy as training a supervised baseline, which is this red curve.

0:59:22 - 0:59:43     Text: Actually, looking at this blue curve and this red curve, one is the neural version and the other is the phrase-based version trained with 100,000 parallel sentences. So in other words, each parallel sentence is roughly equivalent to 100 monolingual sentences,

0:59:43 - 1:00:03     Text: in the sense that they give you a machine translation system of similar accuracy. And now, the more the domains are different, and the more the languages are different from each other, the worse it gets.

1:00:03 - 1:00:19     Text: That's why when you do low-resource machine translation, and this is the extreme case of unsupervised machine translation, you need to learn from lots of data in order to compensate for the lack of direct supervision.

1:00:19 - 1:00:41     Text: And maybe let me give you an example on FLoRes, where, as we have seen, there was no in-domain parallel data. There was some monolingual data that was in domain, but not very much. And there was quite a bit of out-of-domain parallel data.

1:00:41 - 1:00:52     Text: Do you remember, the one million sentences from the Bible and Ubuntu. And then we have quite a bit of monolingual data that is out of domain.

1:00:52 - 1:01:06     Text: And so this is the supervised baseline. Unsupervised machine translation here didn't work at all, because, very much as was mentioned, the Wikipedia domains are not quite aligned.

1:01:06 - 1:01:19     Text: So unsupervised machine translation doesn't help here. If you do back-translation, if you do iterative back-translation, you do quite a bit better than the supervised baseline, which is quite good.

1:01:19 - 1:01:41     Text: If you also add English-Hindi parallel data, you do quite a bit better. And now the unsupervised machine translation works as well: it's unsupervised for English-Nepali, but you do have supervision for English-Hindi. And so the combination of back-translation and multilingual training is here the winning combination.

1:01:41 - 1:01:57     Text: And this is something that we see in general. So I'm going to skip the results on English-Burmese; actually, I had a nice demo, but I'll show it to you later if there is time.

1:01:57 - 1:02:20     Text: And so, as I said, we have quite a few good components, which we can combine pretty easily. And now the research is about how to best combine them, how to best weight data sets, how to best weight examples in order to automate the current cross validation based process, I would say.

1:02:20 - 1:02:38     Text: And the other message here is that low-resource machine translation is a big data problem. It requires big compute. It's a pretty big engineering feat in order to compensate for the lack of parallel data.

1:02:38 - 1:02:41     Text: Are there any questions on this?

1:02:41 - 1:02:51     Text: I was just wondering, when you mentioned the parallel to cycle consistency in vision, you mentioned that we can't do adversarial training.

1:02:51 - 1:03:00     Text: And I was just wondering if you could flesh that out, and why we couldn't just use, say, an LSTM and perform adversarial training.

1:03:00 - 1:03:13     Text: Yeah, yes, so there are actually a bunch of papers trying to do adversarial training, or GANs, for text generation.

1:03:13 - 1:03:24     Text: I must say that it's a pretty active research area. I haven't seen a very compelling demonstration that these methods work very well when we tried them.

1:03:24 - 1:03:34     Text: It's a little difficult to backpropagate. So when this produces a sentence, you need to produce, you know, a sentence, and that's discrete.

1:03:34 - 1:03:48     Text: And so you could backpropagate using REINFORCE-style methods. You could do a lot of these things, but essentially it's just a little hard to make it work. And it's very finicky.

1:03:48 - 1:04:06     Text: So it may work on simple data sets, but at scale it's very hard. Another consideration is that anything that you do has to work at scale, because, again, the amount of information that you get from a monolingual sentence is not very much.

1:04:06 - 1:04:18     Text: Now, if you need a lot of compute, or if your gradients have a lot of noise, like when you're working with REINFORCE, then it's not going to work.

1:04:18 - 1:04:28     Text: It's possible that people will come up with ways to make it work. I don't think that is true at present, but it could be in the future.

1:04:28 - 1:04:49     Text: Let me spend five minutes on the analysis, and then you will have these slides so you can go over the remaining details. So here the starting point is to say, well, if I were to simulate low-resource machine translation with a high-resource language pair, like French to English, let's say you take Europarl,

1:04:49 - 1:05:08     Text: let's say 20,000 parallel sentences and 100,000 monolingual target sentences, and you apply back-translation, you get a very nice improvement. Now, if you come here to Facebook and you try this on Facebook data, you find that the improvement is actually very, very minimal.

1:05:08 - 1:05:26     Text: And that relates to the discussion that we had at the very beginning, that what people talk about in different parts of the world is very different. And so it's like you need to align two point clouds, but the distributions of the two point clouds are very, very different from each other.

1:05:26 - 1:05:44     Text: And so here I was making the example that even across English-speaking countries, if you look at topics in sports, in America people may talk more about football on Facebook, while in the UK they talk more about cricket and soccer, right.

1:05:44 - 1:06:05     Text: For the same topic, you have a different distribution of words, but you also have a different distribution of topics. And this is what we call source-target domain mismatch. So you may have several kinds of domain mismatch. Typically, you have a mismatch between the training distribution and the test distribution.

1:06:05 - 1:06:27     Text: But here I'm talking also about the mismatch between the source domain, the source language side, and the target domain. Okay. And so the hypothesis is that this may make back-translation less effective, because even if you were to perfectly translate the target-side monolingual data, once you translate it, it is going to be out of domain

1:06:27 - 1:06:41     Text: with respect to the data that you really want to translate, which originates in the source domain. And so we had a very nice controlled setting to study this problem.

1:06:41 - 1:07:01     Text: And so we created a synthetic data set where the source domain comes from Europarl, the European Parliament data, and the target domain comes from OpenSubtitles, which are movie captions. And now, by creating the target domain as a mixture of the two, you can precisely control the amount of

1:07:01 - 1:07:22     Text: domain mismatch between the source and the target domain. And by varying alpha, you can vary that. And so the major result is this figure, where alpha controls how much the target domain differs from the source domain. So if alpha is equal to one, they are

1:07:22 - 1:07:38     Text: completely different: one is Europarl and the other is OpenSubtitles. And so it turns out that in this extreme regime, self-training, which is this dashed line, actually works better than back-translation.

1:07:38 - 1:07:51     Text: But as you make the domains more and more similar, back-translation is much better than self-training, and both of them are much better than just using the parallel data.
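A possible way to build such a controlled mixture, as a sketch; the mixing protocol in the actual study may differ, and the corpus variables are placeholders.

    import random

    def mix_domains(europarl_tgt, opensubtitles_tgt, alpha, size, seed=0):
        """alpha = fraction of target sentences drawn from the mismatched domain."""
        rng = random.Random(seed)
        n_mismatch = int(alpha * size)
        corpus = (rng.sample(opensubtitles_tgt, n_mismatch) +
                  rng.sample(europarl_tgt, size - n_mismatch))
        rng.shuffle(corpus)
        return corpus   # target-side monolingual data with controlled mismatch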

1:07:51 - 1:08:06     Text: So I'm going to skip all of this; you can look at the paper and the slides. I want to conclude by saying that there are other things that I didn't talk about, like filtering and mining. This is one of the most exciting things nowadays.

1:08:06 - 1:08:27     Text: And the idea is essentially to learn a joint embedding space for sentences by simply training a multilingual system on lots of publicly available data. And then you use this to do nearest-neighbor retrieval of a sentence, to find what the corresponding translation would be in other languages.

1:08:27 - 1:08:41     Text: And they collected a large data set this way, and they were able to beat the performance of state-of-the-art machine translation systems on high-resource language pairs like English-German and English-Russian.
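A minimal sketch of this kind of mining with a generic multilingual sentence encoder embed(); real systems built on LASER-style embeddings use margin-based scoring and approximate nearest-neighbor search rather than this brute-force cosine similarity.

    import numpy as np

    def mine_pairs(src_sentences, tgt_sentences, embed, threshold=0.8):
        src_emb = np.array([embed(s) for s in src_sentences])
        tgt_emb = np.array([embed(t) for t in tgt_sentences])
        # Cosine similarity between every source and target sentence.
        src_emb /= np.linalg.norm(src_emb, axis=1, keepdims=True)
        tgt_emb /= np.linalg.norm(tgt_emb, axis=1, keepdims=True)
        sims = src_emb @ tgt_emb.T
        pairs = []
        for i, s in enumerate(src_sentences):
            j = int(np.argmax(sims[i]))          # nearest neighbor on the target side
            if sims[i, j] >= threshold:          # keep only confident matches
                pairs.append((s, tgt_sentences[j], float(sims[i, j])))
        return pairs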

1:08:41 - 1:09:02     Text: And the idea is that by using much more data, although noisy, you can do better than using a curated, high-quality data set. And this is something that we see over and over. And again, the point here is that we need to figure out how to best combine back-translation, this filtering and mining, and pre-training in order to

1:09:02 - 1:09:20     Text: get the best combination for solving, or at least improving, low-resource machine translation. And so maybe I should conclude here by thanking my collaborators and by telling you that,

1:09:20 - 1:09:36     Text: if you have any questions about this lecture, you can always email me, drop me a line. I'd be happy to follow up. And also, in my lab we have a lot of opportunities, from internships to full-time positions as research scientists or research engineers.

1:09:36 - 1:09:47     Text: So if you're interested or curious, just also drop me an email. Okay, thank you.

1:09:47 - 1:10:02     Text: So maybe there are still a few people that might have questions, and you're happy to stay a few more minutes for questions.

1:10:02 - 1:10:07     Text: Happy to take your questions. Yes.

1:10:07 - 1:10:17     Text: I'd love to learn more about the models that you used. Actually, if you go back to the model that you spoke about right before back-translation, I wanted to understand.

1:10:17 - 1:10:21     Text: You have a pipeline from English to English, right?

1:10:21 - 1:10:32     Text: In this one, do you use data augmentation techniques like in vision, such as dropping a word or switching words, to be able to make an augmented data set?

1:10:32 - 1:10:56     Text: That's right. So the analogy that I made is for back-translation, where, yes, in all these methods you essentially don't have golden X and Y pairs. And so for self-training, what you do is you fantasize the target; for back-translation, you fantasize the input.

1:10:56 - 1:11:12     Text: And so you can see all these methods as a form of data augmentation. In particular, back-translation is very similar to the data augmentation that people do in vision, in the sense that here the transformation is not rule-based; it is produced by a backward machine translation system.

1:11:12 - 1:11:19     Text: It has the same objective of regularizing by adding a lot of noisy labeled data.

1:11:19 - 1:11:31     Text: So if you go back to the previous slide, when you say you fantasize the target: in this case, you have one pass where you predict the gold target, and one where you change the input and then predict the target?

1:11:31 - 1:11:37     Text: Yeah, so for self-training, the way that it works is that you take the clean input.

1:11:37 - 1:11:49     Text: And you pass it through your machine translation system from the previous iteration, you decode with beam search or with other methods, and you get a prediction for what the label should be.

1:11:49 - 1:11:52     Text: And that's now your reference.

1:11:52 - 1:12:07     Text: The way that you train your machine translation system is by noisifying the input. So you add noise to your input, and the noise is: you drop words, you swap words. And then you try to predict the target that you fantasized.

1:12:07 - 1:12:10     Text: And the two targets should be the same.

1:12:10 - 1:12:19     Text: Yeah, so the prediction and this target, when you train with the cross-entropy loss, you try to tie them together as much as possible.
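A small sketch of the kind of input noise described here, with word dropout and a local word shuffle standing in for swaps; the rates are illustrative only, not the values used in the lecture.

    import random

    def noisify(sentence, p_drop=0.1, max_shuffle_dist=3, seed=None):
        rng = random.Random(seed)
        # Word dropout: each word is dropped independently with probability p_drop.
        words = [w for w in sentence.split() if rng.random() > p_drop]
        # Local shuffle: words can only move a few positions (bounded by max_shuffle_dist).
        keys = [i + rng.uniform(0, max_shuffle_dist) for i in range(len(words))]
        words = [w for _, w in sorted(zip(keys, words))]
        return " ".join(words)

    # Self-training pair: (noisify(clean_source), fantasized_target).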

1:12:19 - 1:12:29     Text: Yeah, this is one of the first semi-supervised learning methods that you find in the machine learning community.

1:12:29 - 1:12:40     Text: There are a lot of variants of this, where perhaps a committee of experts produces the label.

1:12:40 - 1:12:58     Text: There are a lot of variants of this, and it's something that makes a lot of sense, particularly for asymmetric tasks: if you do image classification, if you do text classification, if you do summarization, then back-translation is not really applicable, because

1:12:58 - 1:13:18     Text: you know, if you go from a categorical label back to a whole sentence, that's a very difficult task, right. So back-translation works really well for symmetric tasks like machine translation, but for things that are many-to-one mappings,

1:13:18 - 1:13:28     Text: self-training definitely works better. Self-training also works well in machine translation when there is a lot of domain mismatch between the source and the target, as we saw.

1:13:28 - 1:13:42     Text: Yeah, so unfortunately with these algorithms it's hard to say in general what works best, because it really depends on the application and on the kind of data that you have.

1:13:42 - 1:14:02     Text: Does anyone else have a question they'd like to ask? Well, it seems like we may not be getting another immediate question, and I guess we have come to the end of the time that we're meant to use, so maybe we should bring it to a close. But thank you so much, Marc'Aurelio.

1:14:02 - 1:14:28     Text: I hope everyone really enjoyed that. And, you know, speaking as someone who did work in machine translation for a decade, though not so much for the last few years, it actually still seems to me just amazing how successfully you can build things with monolingual data and with ideas like back-translation.

1:14:28 - 1:14:39     Text: I mean, it's just actually incredible that that has provided such competitive machine translation systems now. And, you know, obviously this is something that isn't just

1:14:39 - 1:14:57     Text: of academic interest, as you might have realized if you've thought about it: if you've got a company like Facebook, being able to translate well on data in domains that are very far from news data or the Bible, and

1:14:57 - 1:15:14     Text: for every community of speakers, is just actually super-duper important to people being happy users of, and members of, communities. Yeah, and I just want to add that these methods are pretty general, so we apply them to summarization

1:15:14 - 1:15:32     Text: and to style transfer, so, you know, it's really beautiful that it's a set of tools that you can use in many places. And it's all about, in a way, aligning domains with little

1:15:32 - 1:15:36     Text: supervision of correspondences, right. So.

1:15:36 - 1:15:46     Text: Okay, thank you very much. Thank you. Thank you. Bye bye.