Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 12 - Natural Language Generation

0:00:00 - 0:00:09     Text: Hi everybody, welcome back to CS224N.

0:00:09 - 0:00:13     Text: First just a couple of announcements.

0:00:13 - 0:00:16     Text: Originally this was going to be the day when assignment five was due,

0:00:16 - 0:00:20     Text: but as you've seen, we're giving you one extra day.

0:00:20 - 0:00:22     Text: So it's now due Friday at 430.

0:00:22 - 0:00:27     Text: We do realize that assignment five has been a bit of a tough

0:00:27 - 0:00:31     Text: challenge for many people, though we've been trying to help people out and offer

0:00:31 - 0:00:32     Text: a thousand otherwise.

0:00:32 - 0:00:37     Text: So I hope at the end of the day it will seem like it was a really good learning experience

0:00:37 - 0:00:43     Text: to really get some much more kind of close hands on, look at how transformers work,

0:00:43 - 0:00:50     Text: rather than it's simply being loading up a transformer as a black mystery box.

0:00:50 - 0:00:56     Text: After Friday, I guess there's no rest since we do really hope that you can sort of

0:00:56 - 0:00:57     Text: get a lot of help.

0:00:57 - 0:01:01     Text: Basically immediately transition to working on final projects and so it's basically

0:01:01 - 0:01:06     Text: four weeks to go on final projects and in particular we're hoping to get feedback

0:01:06 - 0:01:11     Text: on your project proposals back by next Tuesday to help that process or on

0:01:11 - 0:01:16     Text: when people the after get started on them soon.

0:01:16 - 0:01:23     Text: And you know, it's just maybe a good moment to say that we do really appreciate

0:01:23 - 0:01:28     Text: all the people have been putting tons of effort into these assignments and we like

0:01:28 - 0:01:32     Text: that sort of all the keenness we're seeing from the students.

0:01:32 - 0:01:39     Text: Okay, so with that out of the way today, I'm delighted to have giving today's lecture on

0:01:39 - 0:01:44     Text: your language generation, Antoine Bosleau, who's at present a postdoc at Stanford.

0:01:44 - 0:01:49     Text: He's someone who's done a lot of work on natural language generation in his

0:01:49 - 0:01:56     Text: previous life as a university, Washington PhD student and next year he's going to be

0:01:56 - 0:02:01     Text: taking up a position as a professor in Switzerland.

0:02:01 - 0:02:04     Text: Okay, so welcome and Trams.

0:02:04 - 0:02:08     Text: Thanks Chris, that's a very kind introduction.

0:02:08 - 0:02:13     Text: It's great to be here giving this lecture.

0:02:13 - 0:02:20     Text: I'm going to see CS224N particularly on one of my favorite topics in deep learning

0:02:20 - 0:02:24     Text: for NLP natural language generation.

0:02:24 - 0:02:29     Text: So hopefully by the end of this lecture, most of you will have at least learned a bit

0:02:29 - 0:02:34     Text: about NLG with deep learning and hopefully be motivated to start doing some research

0:02:34 - 0:02:42     Text: on it or launch a start up in NLG or perhaps go work on it in a larger organization.

0:02:42 - 0:02:48     Text: Okay, so to start, I think it might be really helpful to define what we meet at a high

0:02:48 - 0:02:53     Text: level when we talk about natural language generation because over the last few

0:02:53 - 0:02:57     Text: years, the definition has actually sort of changed and has really grown as a

0:02:57 - 0:03:04     Text: subfield to really encapsulate any part of NLP that involves the production of

0:03:04 - 0:03:07     Text: written or spoken language.

0:03:07 - 0:03:12     Text: So in other words, if you're given some inputs and your goal is to generate text

0:03:12 - 0:03:18     Text: to describe, respond, translate or summary is that piece of text.

0:03:18 - 0:03:22     Text: NLG really focuses on how you can actually build a system that can automatically

0:03:22 - 0:03:31     Text: produce a coherent and useful written piece of text for that human consumption.

0:03:31 - 0:03:35     Text: And it used to be a much more limited research area since many tasks that we now

0:03:35 - 0:03:40     Text: view as NLG problems didn't actually involve much text production prior to

0:03:40 - 0:03:45     Text: neural networks. But now that scope has expanded considerably and we have this

0:03:45 - 0:03:49     Text: much larger area to sort of work in.

0:03:49 - 0:03:54     Text: Unfortunately, we're not quite yet at the level of the types of A.I. NLG

0:03:54 - 0:03:58     Text: tools that we've seen in pop culture and that we like to imagine.

0:03:58 - 0:04:03     Text: But we are starting to see many areas where NLG tools are having a massive

0:04:03 - 0:04:11     Text: impact to start with machine translation is kind of the classical example of an

0:04:11 - 0:04:18     Text: NLG task these days ever since the task moved to neural networks and

0:04:18 - 0:04:22     Text: a NLG framework around a 2014 or so.

0:04:22 - 0:04:26     Text: And now we've seen a rapid improvement in the quality and applicability of

0:04:26 - 0:04:30     Text: translation systems. In fact, you can often use Google Translates for most

0:04:30 - 0:04:37     Text: of your kind of retail translation needs as a good starting point.

0:04:37 - 0:04:42     Text: Similarly, NLG technologies really underpin some of the dialogue systems that

0:04:42 - 0:04:47     Text: you might interact with on a daily basis. Any time you use, let's say,

0:04:47 - 0:04:52     Text: Siri, Alexa, Cortana, Google Home, Bixby, or pretty much any other major

0:04:52 - 0:04:55     Text: companies dialogue system. There's a good chance that there's a neural

0:04:55 - 0:05:00     Text: NLG component embedded in that system that's involved with providing you an answer

0:05:00 - 0:05:04     Text: to your query. And there's really still a ton of progress to be made in this area.

0:05:04 - 0:05:08     Text: And it's led to some major companies to actually crowdsource chatbot

0:05:08 - 0:05:13     Text: technologies from researchers and students such as yourself to continue to try to

0:05:13 - 0:05:16     Text: make big advances in this area.

0:05:16 - 0:05:22     Text: We're also seeing lots of NLG technologies used in areas such as

0:05:22 - 0:05:28     Text: summarization, where systems often have to aggregate information from

0:05:28 - 0:05:34     Text: potentially multiple sources and rephrase the most salient content in a

0:05:34 - 0:05:38     Text: shortened but still very engaging way.

0:05:38 - 0:05:43     Text: While our go-to example for summarization is generally related to, let's say,

0:05:43 - 0:05:47     Text: generating news highlights, summarization systems have actually achieved

0:05:47 - 0:05:51     Text: broad applicability in many areas where we ingest content such as

0:05:51 - 0:05:56     Text: summarizing emails or summarizing meeting transcripts.

0:05:56 - 0:05:59     Text: And there's actually many more areas not listed here. I didn't actually put it

0:05:59 - 0:06:03     Text: on the slide, but a few months back, a tool called Semantic Scholar actually

0:06:03 - 0:06:07     Text: developed a neural system for generating summaries of scientific papers, which is

0:06:07 - 0:06:11     Text: something that I personally end up using quite a bit as an example of how

0:06:11 - 0:06:15     Text: humans can interact with these technologies.

0:06:15 - 0:06:19     Text: But these modalities aren't actually limited to text in or text out.

0:06:19 - 0:06:23     Text: So actually the classical NLG area that I mentioned earlier is how the

0:06:23 - 0:06:27     Text: task used to be framed was actually around what we now call data to text

0:06:27 - 0:06:31     Text: generation. So can you learn to compile or let's say, summarize the most

0:06:31 - 0:06:37     Text: interesting facts from a table or a knowledge graph or some type of data stream.

0:06:37 - 0:06:40     Text: That way humans can get the most most interesting and salient content that's

0:06:40 - 0:06:46     Text: being presented in these data structures rapidly and in easier to ingest

0:06:46 - 0:06:51     Text: format than having to look through the structures themselves.

0:06:51 - 0:06:56     Text: We've also seen a lot of recent work in visual description that tries to

0:06:56 - 0:06:59     Text: use language to describe the content and images or videos.

0:06:59 - 0:07:05     Text: So around 2014 or so we start to see the first neural NLG systems in the

0:07:05 - 0:07:09     Text: space. And they've really continued to mature in the last six years.

0:07:09 - 0:07:13     Text: And now we actually tackle much more challenging description tasks such

0:07:13 - 0:07:18     Text: as generating full descriptive paragraphs of scenes or generating

0:07:18 - 0:07:22     Text: streams of visual generating descriptions for streams of visual

0:07:22 - 0:07:27     Text: contents such as in video captioning and these tools have really broad

0:07:27 - 0:07:31     Text: applicability in different areas of AI.

0:07:31 - 0:07:35     Text: And finally the last sort of application I kind of want to mention is that we've

0:07:35 - 0:07:39     Text: also started seeing NLG systems being developed in more creative

0:07:39 - 0:07:44     Text: applications such as story generation where AI systems can now help humans

0:07:44 - 0:07:49     Text: write short stories, blog posts or even full books in some case as creative

0:07:49 - 0:07:54     Text: writing assistants. In another area such as post tree generation we can

0:07:54 - 0:07:58     Text: actually have kind of full automated settings where you can have AI agents

0:07:58 - 0:08:02     Text: that can generate something like a sonnet and in fact condition that's on a

0:08:02 - 0:08:06     Text: lot of demands that are given through a user interface.

0:08:06 - 0:08:11     Text: So I hope that by this point I've really given you a look into the breadth of

0:08:11 - 0:08:18     Text: NLG applications and how it sort of encompasses any task that you might

0:08:18 - 0:08:23     Text: think of that involves production of text. And each of these tasks really

0:08:23 - 0:08:27     Text: requires different algorithms and different models and a different way of

0:08:27 - 0:08:31     Text: designing the system to get right. But what they have in common is that a

0:08:31 - 0:08:39     Text: lot of our power is by next generation advances in deep learning for NLG.

0:08:39 - 0:08:44     Text: And the goal of today is to really give you the introduction to these

0:08:44 - 0:08:48     Text: topics that really allows you to contribute to the next era of these

0:08:48 - 0:08:52     Text: technologies and designing deep learning systems for NLG.

0:08:52 - 0:08:57     Text: And so I think to start what might be interesting to do is to quickly

0:08:57 - 0:09:01     Text: reap cap topics that you may have seen in previous lectures, but which are

0:09:01 - 0:09:07     Text: going to be very relevant today when we're trying to design an NLG system.

0:09:07 - 0:09:11     Text: And what we're effectively trying to do in the setting is take a sequence of

0:09:11 - 0:09:16     Text: tokens as inputs and produce new text that is conditioned on this input.

0:09:16 - 0:09:20     Text: And then what we typically call the auto regressive setting, which is the most

0:09:20 - 0:09:25     Text: common sort of text generation setting, we take these produced tokens of text

0:09:25 - 0:09:29     Text: and we feed them back into our model to generate the next token in the sequence

0:09:29 - 0:09:32     Text: that we want to generate.

0:09:32 - 0:09:37     Text: But so to really understand what's going on in an auto regressive NLG system,

0:09:37 - 0:09:41     Text: what we really need to start to do is look at what happens for the generation of an

0:09:41 - 0:09:45     Text: individual token since further stage is really depend on taking that generated

0:09:45 - 0:09:48     Text: token and passing it back in as input and doing the same thing.

0:09:48 - 0:09:52     Text: So what happens at a low level is that your model takes in this sequence of

0:09:52 - 0:09:58     Text: inputs, so these Ys and it computes a vector of scores using the model itself.

0:09:58 - 0:10:03     Text: And each index in that vector corresponds to the score for a token in your

0:10:03 - 0:10:07     Text: vocabulary. So the only tokens that your model is actually allowed to generate.

0:10:07 - 0:10:11     Text: And then what you do is that you compute a probability distribution over these

0:10:11 - 0:10:16     Text: scores using what we call a softmax function to compute a probability estimate

0:10:16 - 0:10:24     Text: for each token in your vocabulary given the context that precedes it.

0:10:24 - 0:10:29     Text: And as a shorthand, I'll just mention that sometimes I'll remove the W from this

0:10:29 - 0:10:33     Text: probability equation, but just know that when I write out the probability of a token

0:10:33 - 0:10:38     Text: Y at time t, what I mean is the probability that Y of t is a particular word.

0:10:38 - 0:10:43     Text: So it's kind of a variable assignment.

0:10:43 - 0:10:48     Text: But so what actually is the output of what we typically call a text generation model

0:10:48 - 0:10:52     Text: at this point is actually this vector of scores. And then that vector gets past

0:10:52 - 0:10:57     Text: the softmax function to give us a probability distribution over the set of tokens in the

0:10:57 - 0:11:02     Text: vocabulary. And then to actually generate a token, we can define what we call a

0:11:02 - 0:11:07     Text: decoding algorithm. That is a function that takes in this distribution peak over

0:11:07 - 0:11:12     Text: all the tokens of the vocabulary. And you know, it defines a function for selecting

0:11:12 - 0:11:17     Text: a function from this distribution as the next token that is produced by our

0:11:17 - 0:11:22     Text: NLG system. And for that distribution to be calibrated in such a way that

0:11:22 - 0:11:27     Text: it means anything, we need to train the model to actually be able to do the task.

0:11:27 - 0:11:32     Text: So the most common way of training text generation models is to use maximum

0:11:32 - 0:11:37     Text: likelihood training. And despite its name, we don't actually maximize

0:11:37 - 0:11:41     Text: likelihoods. We actually minimize negative log likelihoods. And what that actually

0:11:41 - 0:11:46     Text: is is just a multi-class classification task where each word in our vocabulary

0:11:46 - 0:11:50     Text: is a class that can be predicted by the model. And so at each step in the

0:11:50 - 0:11:54     Text: sequence, we're actually trying to predict the class that corresponds to the word

0:11:54 - 0:11:58     Text: that comes next in the sequence of text that we're trying to train on.

0:11:58 - 0:12:03     Text: And this word is often called the gold or ground truth token as just kind of

0:12:03 - 0:12:08     Text: interchangeable vocabulary that we use. And another term for this

0:12:08 - 0:12:12     Text: training algorithm is teacher forcing. So you might see these expressions

0:12:12 - 0:12:17     Text: kind of used interchangeably if you read papers on this topic.

0:12:17 - 0:12:20     Text: But so at each step, you're really computing a loss term that is the negative

0:12:20 - 0:12:25     Text: log likelihood of predicting this gold token y1 at every step. So in these slides,

0:12:25 - 0:12:29     Text: whenever you see an asterisk next to a y, that means that this is a gold

0:12:29 - 0:12:33     Text: token that comes from a training sequence. And you can do this for multiple

0:12:33 - 0:12:39     Text: steps, adding up the log likelihoods along the way. Eventually, you're going

0:12:39 - 0:12:42     Text: to arrive at the end of your of your gold sequence. And you'll be able to

0:12:42 - 0:12:47     Text: compute gradients with respect to this sum-dLoss term for every parameter in

0:12:47 - 0:12:51     Text: your model, which allows you to update it so that the next time around when you see

0:12:51 - 0:12:55     Text: this sequence, your model is more confident in the probability that this sequence

0:12:55 - 0:12:59     Text: is a correct sequence given the same context it has seen before.

0:12:59 - 0:13:03     Text: But most of this should just be a recap from previous lectures on language

0:13:03 - 0:13:08     Text: modeling and machine translation. But now that we've got that out of the

0:13:08 - 0:13:13     Text: way, let's get to the fun part and talk about some new topics.

0:13:13 - 0:13:16     Text: The first one of which is decoding, which is actually one of my favorite

0:13:16 - 0:13:22     Text: topics in natural language generation research. So if you recall, your decoding

0:13:22 - 0:13:26     Text: algorithm is really the function that takes in this this induced probability

0:13:26 - 0:13:31     Text: distribution from your model over the next possible tokens that can be generated

0:13:31 - 0:13:36     Text: and selects which one of those tokens should be outputted.

0:13:36 - 0:13:40     Text: So once your model is trained, this distribution should be meaningful.

0:13:40 - 0:13:44     Text: And you want to be able to generate a sensible next token.

0:13:44 - 0:13:48     Text: And then you can use these generated next tokens. So the blue y hats here,

0:13:48 - 0:13:52     Text: as input in the next step of the model, which allows you to recompute a new

0:13:52 - 0:13:57     Text: distribution decode a new token, repeat the process and eventually end up with a

0:13:57 - 0:14:02     Text: full sequence that your model has now generated given a fixed starting sequence of

0:14:02 - 0:14:03     Text: text.

0:14:03 - 0:14:07     Text: And so let's talk a bit about the algorithms that we can use to decode

0:14:07 - 0:14:10     Text: tokens from this distribution.

0:14:10 - 0:14:14     Text: So you've actually already seen some of these in a previous lecture on

0:14:14 - 0:14:19     Text: neural machine translation, I believe, where you started off by by seeing a

0:14:19 - 0:14:24     Text: relatively simple decoding algorithm that nonetheless remains very popular,

0:14:24 - 0:14:28     Text: argmax decoding. And with argmax decoding, you pretty much just take the highest

0:14:28 - 0:14:33     Text: probability token from your distribution as the decoded token and feed it back

0:14:33 - 0:14:37     Text: into the model for to get the distribution of the next step. And you keep repeating

0:14:37 - 0:14:41     Text: this process and it's very nice and it's very convenient.

0:14:41 - 0:14:45     Text: And you've also, I think, learned about beam search where you can scale up these

0:14:45 - 0:14:51     Text: these greedy methods by doing a wider search over the set of tokens that follow,

0:14:51 - 0:14:56     Text: the set of most likely tokens that follow to try to find a seek a sub sequence that

0:14:56 - 0:15:02     Text: is a lower overall negative log likelihood, even if it in the intermediate step,

0:15:02 - 0:15:07     Text: it tends to be higher than what would be the argmax decoded token.

0:15:07 - 0:15:13     Text: And while these greedy methods work great for machine translation and in other

0:15:13 - 0:15:18     Text: tests as well, such as summarization, they do tend to be problematic in many other

0:15:18 - 0:15:23     Text: text generation tasks, particularly ones that end up being more open-ended.

0:15:23 - 0:15:27     Text: So one of these big problems that they have is that they often end up repeating

0:15:27 - 0:15:32     Text: themselves. So here in this example from from Holtsman at all 2020, we can see that

0:15:32 - 0:15:38     Text: after around 20 or sorry, yet 60 tokens or so of generation, the model really

0:15:38 - 0:15:42     Text: devolves into just repeating the same thing over and over again. And this actually

0:15:42 - 0:15:47     Text: tends to happen a lot in text generation systems. Repetition was actually one of

0:15:47 - 0:15:51     Text: the biggest problems that we that we tried to tackle in text generation for many

0:15:51 - 0:15:56     Text: years and then still face to this day. And you know, I think it's worth taking a

0:15:56 - 0:16:02     Text: look at why repetition happens a bit more analytically so you can perhaps

0:16:02 - 0:16:07     Text: better understand the interaction between your model and your decoding algorithm.

0:16:07 - 0:16:12     Text: So just as a quick little visual demonstration, here I'm showing you the step by step

0:16:12 - 0:16:16     Text: negative log likelihoods from two different language models, one based on recurrent

0:16:16 - 0:16:21     Text: neural networks and one based on a transformer language model called GPT.

0:16:21 - 0:16:27     Text: And I'm showing this this plot for a particular phrase I don't know, which for

0:16:27 - 0:16:33     Text: anyone who's worked in chatbots has probably seen many times potentially in nightmares.

0:16:33 - 0:16:38     Text: It's not a very interesting plot, though, you know, what we do see is that the

0:16:38 - 0:16:42     Text: transformer model does tend to be a bit less confident in the probability of each

0:16:42 - 0:16:46     Text: word on the recurrent neural network does. What's more interesting though is what

0:16:46 - 0:16:50     Text: happens if I repeat this same phrase multiple times in a row.

0:16:50 - 0:16:54     Text: And one of the things that we notice here is that the repetition of this phrase

0:16:54 - 0:17:00     Text: actually causes the token level negative log likelihoods to get lower and

0:17:00 - 0:17:03     Text: lower for each of these tokens, which actually means that the model is becoming

0:17:03 - 0:17:08     Text: more confident that these are the right tokens and that they should probably follow

0:17:08 - 0:17:11     Text: the preceding context as we generate it more times.

0:17:11 - 0:17:16     Text: And this doesn't really subside as the sequence gets longer and longer.

0:17:16 - 0:17:21     Text: And as you keep repeating the same phrases again over and over again, the model

0:17:21 - 0:17:27     Text: becomes more and more confident the next time around it should say the same thing.

0:17:27 - 0:17:31     Text: And, you know, while this actually kind of makes sense and that if you, you know,

0:17:31 - 0:17:36     Text: say the phrase, let's say I'm tired 15 times, it's a fair bet that on a 16th

0:17:36 - 0:17:39     Text: time you're actually going to say it again, it's not necessarily the behavior that

0:17:39 - 0:17:43     Text: we want our generation systems to get stuck in.

0:17:43 - 0:17:47     Text: Another interesting thing to note here as an aside is that this behavior is actually

0:17:47 - 0:17:52     Text: less problematic in recurrent neural networks than in transformer language models.

0:17:52 - 0:17:57     Text: So you can see that for the LSTM, the curve flat, flat ends after a certain point.

0:17:57 - 0:18:01     Text: And so if you remember in perhaps a previous lecture on why we might like

0:18:01 - 0:18:05     Text: transformer language models, one of their benefits is that they don't have the temporal

0:18:05 - 0:18:10     Text: bottleneck of tracking a state, which a recurrent neural network tends to have.

0:18:10 - 0:18:14     Text: And so the removal of that bottleneck actually ends up making them more prone to repetitive

0:18:14 - 0:18:19     Text: behavior when you use greedy algorithms to decode.

0:18:19 - 0:18:23     Text: So what can we actually do to reduce repetition since it's a pretty big problem in these

0:18:23 - 0:18:27     Text: systems? Well, there are actually quite a few proposed approaches in the last few

0:18:27 - 0:18:32     Text: years, some which I'll summarize here, which, you know, range from the kind of hacky,

0:18:32 - 0:18:37     Text: but surprisingly effective, don't repeat any end grams at inference time.

0:18:37 - 0:18:41     Text: But there's also been training time approaches to do it, such as having a loss

0:18:41 - 0:18:46     Text: function that minimizes the similarity between hidden activations at different

0:18:46 - 0:18:51     Text: steps or coverage loss that penalizes attending to the same tokens over time.

0:18:51 - 0:18:54     Text: So if you, you know, change the inputs that your model is allowed to focus on,

0:18:54 - 0:18:58     Text: it's naturally going to produce different text or more recently an unlikely

0:18:58 - 0:19:02     Text: hood objective that actually penalizes outputting the same words, which we'll talk a bit

0:19:02 - 0:19:05     Text: more about later.

0:19:05 - 0:19:11     Text: But the truth is, is that the problem here really lies in using greedy algorithms

0:19:11 - 0:19:16     Text: in the first place. In many applications of, you know, human language,

0:19:16 - 0:19:20     Text: humans don't actually speak in a probability maximizing way.

0:19:20 - 0:19:25     Text: So if you look at this plot from Ultimate All 2020, it shows the per-time step

0:19:25 - 0:19:30     Text: probability of human written text in orange and beam search decoded text in blue

0:19:30 - 0:19:34     Text: on the same graph. And what you can see is that beam search decoded text tends to be

0:19:34 - 0:19:38     Text: very high probability with little variance over time.

0:19:38 - 0:19:42     Text: And this makes a lot of sense, you know, because it's literally trying to maximize

0:19:42 - 0:19:44     Text: the probability of the sequences that it produces.

0:19:44 - 0:19:48     Text: And a big part of that is maximizing the, you know, step by step probability

0:19:48 - 0:19:50     Text: of the tokens that it uses.

0:19:50 - 0:19:53     Text: But meanwhile, we can see that human written text is a lot more variable,

0:19:53 - 0:19:56     Text: often actually dipping into very low probability territory.

0:19:56 - 0:20:00     Text: That actually makes a lot of sense. If we could always, you know, predict human text

0:20:00 - 0:20:05     Text: with high probability, you know, there really be no reason to listen to each other's

0:20:05 - 0:20:08     Text: comments. We know what we were going to say.

0:20:08 - 0:20:12     Text: But so ultimately, what we want to be able to do is to match the uncertainty of human

0:20:12 - 0:20:17     Text: language patterns in how we decode text, which is why in many applications that

0:20:17 - 0:20:23     Text: tend to have this higher variability, sampling from these distributions has kind of

0:20:23 - 0:20:29     Text: become a go-to decoding method, particularly in creative generation tasks.

0:20:29 - 0:20:34     Text: And so with sampling, we take the distribution over tokens that's produced

0:20:34 - 0:20:40     Text: in our model, and we generate a token randomly, according to the probability

0:20:40 - 0:20:43     Text: mass that is assigned to each potential option.

0:20:43 - 0:20:48     Text: So rather than doing any type of greedy step, we use the probability on each token

0:20:48 - 0:20:54     Text: to give us a chance that that token is generated.

0:20:54 - 0:20:58     Text: And so this does allow us to kind of get much more, you know,

0:20:58 - 0:21:01     Text: stochasticity and the types of tokens that are generated.

0:21:01 - 0:21:05     Text: But a challenge that pops up here is that these distributions tend to be

0:21:05 - 0:21:07     Text: over a very large vocabulary.

0:21:07 - 0:21:11     Text: And so even if there's clearly tokens that have a higher chance of being

0:21:11 - 0:21:17     Text: generated, the tail of the distribution can often be spread over a much larger

0:21:17 - 0:21:19     Text: number of possible tokens.

0:21:19 - 0:21:24     Text: And so this becomes a bit of a problem because these tokens in the long tail

0:21:24 - 0:21:27     Text: are probably completely irrelevant to the current context.

0:21:27 - 0:21:31     Text: So they shouldn't have any chance of being selected individually.

0:21:31 - 0:21:35     Text: But as a group, it ends up that there's a decent chance that you could output

0:21:35 - 0:21:37     Text: a completely irrelevant token.

0:21:37 - 0:21:41     Text: Even if 90% of your probability mass is on relevant tokens, that means that you have

0:21:41 - 0:21:45     Text: a one-intent chance of outputting something that completely throws off your entire

0:21:45 - 0:21:49     Text: text generation pipeline and is completely enane.

0:21:49 - 0:21:54     Text: So to mitigate this, the field has developed a new set of algorithms

0:21:54 - 0:21:57     Text: that tries to prune these distributions at inference time.

0:21:57 - 0:22:02     Text: And so top-case sampling is kind of the most obvious way to do this.

0:22:02 - 0:22:07     Text: So here we recognize that most of the tokens in our vocabulary should have no

0:22:07 - 0:22:09     Text: probability of being selected at all.

0:22:09 - 0:22:13     Text: So we just truncate the set of tokens that were allowed to sample from to be

0:22:13 - 0:22:19     Text: the K tokens with the highest amount of probability mass of the distributions.

0:22:19 - 0:22:24     Text: And common values of K are often 5, 10, 20, sometimes up to 100.

0:22:24 - 0:22:29     Text: But really, it's a hyperparameter that you end up setting as the designer of this system.

0:22:29 - 0:22:33     Text: In general, though, what's important to note is that the higher you make K,

0:22:33 - 0:22:37     Text: the more you'll be able to generate diverse outputs, which is good,

0:22:37 - 0:22:39     Text: because that's what we're trying to do.

0:22:39 - 0:22:43     Text: But you're also going to increase the chance of letting that long tail seep in

0:22:43 - 0:22:47     Text: and generating something that's completely irrelevant to the current context.

0:22:47 - 0:22:52     Text: Oh, sorry. Meanwhile, if you decrease K, your outputs are going to be safer

0:22:52 - 0:22:56     Text: from these long tail effects, but your text may end up being boring and generic

0:22:56 - 0:23:01     Text: because your sampling algorithm starts to look a lot more greedy in nature.

0:23:01 - 0:23:06     Text: And this kind of shows the problem of having a fixed K as the number of tokens

0:23:06 - 0:23:09     Text: that you can generate from your distribution.

0:23:09 - 0:23:13     Text: If your distribution is certain points is pretty flat, such as in this example,

0:23:13 - 0:23:18     Text: she said, I never blank, you might not want to truncate a lot of interesting options

0:23:18 - 0:23:22     Text: that, you know, using a small value of K when there's so many good choices

0:23:22 - 0:23:26     Text: that could fit in this potential context.

0:23:26 - 0:23:30     Text: You know, conversely in a different example, you might want to cut off much more

0:23:30 - 0:23:35     Text: than let's say your minimum K options, because only a subset of them end up being quite suitable.

0:23:35 - 0:23:39     Text: And, you know, a higher K than is really necessary, lets that long tail seep in

0:23:39 - 0:23:42     Text: and potentially, you know, ruin your generation.

0:23:42 - 0:23:48     Text: And so, you know, in response to this top P or nucleus sampling is a way around this issue.

0:23:48 - 0:23:52     Text: So here, instead of sampling from a fixed number of tokens at each step,

0:23:52 - 0:23:55     Text: you sample from a fixed amount of probability mass.

0:23:55 - 0:23:59     Text: And so depending on the flatness of your distribution,

0:23:59 - 0:24:06     Text: you end up including a variable number of tokens that is, that is kind of dynamically

0:24:06 - 0:24:11     Text: changing depending on, you know, how that probability mass is spread across the distribution.

0:24:11 - 0:24:16     Text: And so, you know, to kind of describe this visually, if you have, you know,

0:24:16 - 0:24:20     Text: three different distributions at a particular step to generate a particular token,

0:24:20 - 0:24:26     Text: they're each going to prune a different number of tokens from the available set that you can sample from,

0:24:26 - 0:24:38     Text: depending on what the value of P is and what the peakingness of that distribution actually ends up being.

0:24:38 - 0:24:43     Text: So, you know, I keep talking about this concept of flatness of a distribution,

0:24:43 - 0:24:49     Text: being kind of critical and understanding how many tokens we can actually end up sampling from.

0:24:49 - 0:24:55     Text: And in fact, as we try to use sampling algorithms, we might find that the model that we've learned

0:24:55 - 0:25:01     Text: may not actually be producing probability distributions that lend themselves very nicely to using these types of sampling algorithms.

0:25:01 - 0:25:05     Text: You know, the distributions might be too flat, they might be too peaky.

0:25:05 - 0:25:12     Text: And in fact, what we might want to do is rescale those distributions to better fit the decoding algorithm that we might want to use.

0:25:12 - 0:25:19     Text: And we can do this with a method that's, you know, goes by a variety of different names, which I call temperature scaling.

0:25:19 - 0:25:25     Text: And here what you do is that you apply a linear coefficient to every score for each token.

0:25:25 - 0:25:30     Text: Before you pass it through the softmax, that temperature coefficient is the same for every token.

0:25:30 - 0:25:35     Text: It's not dynamically changing amongst your vocabulary. It stays the same.

0:25:35 - 0:25:41     Text: But what happens is that that change ends up being amplified by the softmax function.

0:25:41 - 0:25:50     Text: And what ends up happening is that if your temperature coefficient is greater than one, you're actually going to make your probability distribution much more uniform.

0:25:50 - 0:25:53     Text: In other words, you're going to make it flatter.

0:25:53 - 0:25:59     Text: Meanwhile, if your temperature coefficient is less than one, these scores are going to increase, which is going to make your distributions more spiky.

0:25:59 - 0:26:04     Text: And make the probability mass kind of be pushed towards the most likely tokens.

0:26:04 - 0:26:09     Text: One last thing to note about temperature is that it's not actually a decoding algorithm.

0:26:09 - 0:26:13     Text: It's just a way of rebalancing your probability distribution.

0:26:13 - 0:26:17     Text: So in fact, it can be applied to all of the sampling algorithms I described before.

0:26:17 - 0:26:21     Text: And some greedy decoding algorithms as well.

0:26:21 - 0:26:27     Text: The only one whose behavior is not affected by softmax temperature scaling is argmax decoding.

0:26:27 - 0:26:38     Text: Because even though you change the relative magnitudes of the probability mass in your distribution, you don't actually change the relative ranking amongst tokens in that distribution.

0:26:38 - 0:26:46     Text: So argmax decoding will give you the exact same output as before.

0:26:46 - 0:26:57     Text: But now that we're thinking about how we might be changed to the distribution that's produced by our model, we might realize that we might want to change more than the relative magnitudes I mentioned.

0:26:57 - 0:27:00     Text: And also instead change how they're ranked with respect to one another.

0:27:00 - 0:27:09     Text: Maybe in fact, our model is not a perfect approximation of what the distribution over tokens should be.

0:27:09 - 0:27:16     Text: You know, perhaps the training was done right or we didn't have enough training data to actually make it well calibrated.

0:27:16 - 0:27:24     Text: And so if we decide that our model isn't well calibrated for the task that we're doing, we may want to bring in outside information at decoding time.

0:27:24 - 0:27:38     Text: And so here I want to talk a bit about new classes of methods that let us change our model prediction distributions at inference time, rather than relying on a fixed static model that's that's only been trained once.

0:27:38 - 0:27:56     Text: And a cool way of doing this that came out last year is to actually use cane years neighbor language models, which allow you to to recalibrate your output probability distribution by using phrase statistics from let's say a much larger corpus.

0:27:56 - 0:28:07     Text: And so, you know, what you do in this method is that you initialize a large database of phrases along with vector representations for those phrases.

0:28:07 - 0:28:13     Text: And then at decoding time, you can search for the most freight similar phrases in the database.

0:28:13 - 0:28:19     Text: And so what you do is that, you know, you take the representation of the context that you have from your model.

0:28:19 - 0:28:30     Text: And then compute a similarity function with all the representations of the phrases that you have stored. And you know, based off the relative differences amongst these different phrases to your current context.

0:28:30 - 0:28:39     Text: You can compute a distribution over these most similar phrases. And then you can take the next tokens that follow these phrases.

0:28:39 - 0:29:02     Text: And then you can add the statistics around those phrases to the distribution from your model. And so this allows you to really rebalance the probability distribution that your model has given you with this this this induced distribution over phrases and interpolate them together to get a different estimate of how likely certain words are.

0:29:02 - 0:29:10     Text: And then one question right now is, how do you know what to cage cash.

0:29:10 - 0:29:14     Text: Yeah, so that's a really good question.

0:29:14 - 0:29:23     Text: I guess, you know, the answer there is that you could probably, you know, decide on what might be a salient set to phrases, depending on let's say named entities that you might be interested in.

0:29:23 - 0:29:44     Text: And then you know that you know that you know that your models distribution doesn't handle very well. But I'm pretty sure in this work, they took every phrase in their training corpus, cashed it and then relied on very efficient algorithms for doing this search over over over representation similarity to actually find the most likely once.

0:29:44 - 0:30:00     Text: And then you did prune the number of phrases that they actually used to make this distribution, the phrase distributions that it wasn't over the entire corpus.

0:30:00 - 0:30:21     Text: So it's fantastic that we can now rebalance these distributions if we find that our model is is is doing poorly for you know, particularly you know this might be relevant if let's say we're jumping into a new domain. So we've trained a nice big generation model to on Wikipedia text and now we're going into something that's you know more.

0:30:21 - 0:30:38     Text: More linked to stories, you know, we might want to use this type of system to get distributions from phrases from there. But you know it's also possible that we may not always have a good database of phrases to help us calibrate output distributions for all the types of text that we want to generate.

0:30:38 - 0:30:56     Text: And so luckily last year there's there's also been you know new approaches that look at doing this in a gradient based way. And so the idea here is that you can actually define some type of external objective using a classifier that we typically call a discriminator.

0:30:56 - 0:31:09     Text: And in this figure from the paper that proposed it, they called an attribute model. And what that classifier does is that it's approximates some property that you'd like to encourage your text to exhibit as you decode.

0:31:09 - 0:31:18     Text: So perhaps it's a sentiment classifier because you are working on a dialogue model that and you want to encourage positive sending comments.

0:31:18 - 0:31:36     Text: So as you generate text, you input the output of your text generation model to this attribute model. And there's some tricks on how you should do that in order to actually not have it be a discrete token that you provide to the model but instead a distribution over tokens.

0:31:36 - 0:31:54     Text: And so the thing is that you know if you do this the right way by using the soft distribution of tokens as inputs to the attribute model, it's going to be able to compute a score for the sequence that it receives so that you know if it's a sentiment classifier, it can evaluate how positive of the sequence you provided to it.

0:31:54 - 0:32:06     Text: So what you can do is that you can compute gradients with respect to this property and back propagate those gradients back to your text generation model directly.

0:32:06 - 0:32:19     Text: And but instead of updating the parameters, which is what you would do during training, you instead update the intermediate activations at each layer of your model, which you can then forward propagate to compute a new distribution over the sets of tokens.

0:32:19 - 0:32:37     Text: It's a neat trick that allows you to do real time distribution updating based on some outside discriminator that's allowing you to update your internal representations of the sequence such that it hopefully generates something more positive at the output in this case.

0:32:37 - 0:32:51     Text: So these distribution rebalancing methods either you know based off of nearest neighbor search or on using some type of discriminator are quite promising and interesting, but they also end up being quite computationally intensive.

0:32:51 - 0:33:05     Text: In the first case, you're essentially doing a search over you know thousands of phrases to to rebalance your distribution. In the second, you're doing you know multiple forwards and backwards passes at every step to try and make the tokens exhibit a particular behavior more.

0:33:05 - 0:33:16     Text: And unfortunately neither of them actually stop you from decoding bad sequences either it's possible, and even after you rebalance your distribution, you're still generating something that looks terrible.

0:33:16 - 0:33:23     Text: So in practice, something that we often use in text generation to improve our sequence outputs are what are called re-rankers.

0:33:23 - 0:33:32     Text: And so what we do here is that we actually decode multiple sequences, perhaps using sampling or a wider greedy search, say maybe 10.

0:33:32 - 0:33:42     Text: And then what we can do is that we can initialize a score to evaluate the sequences we produce and re-rank the sequences according to the score.

0:33:42 - 0:34:00     Text: And the simplest thing we can do is to actually just score them by the likelihood given given by the model for example, you know especially if we're using a sampling algorithm, we might want to you know make sure that we didn't generate something that you know totally deviated from good text, which would tend to have a very high perplexity.

0:34:00 - 0:34:12     Text: So you know this perplexity is perhaps a good re-ranking function. It's just important to be careful as well that you remember repetitive sequences tend to have very low perplexity as well.

0:34:12 - 0:34:19     Text: And so if you you know rank by perplexity, you're likely to just generate something you were trying to avoid in the first case.

0:34:19 - 0:34:41     Text: But you know we can also make our re-ranker is evaluate more complex behaviors. So in the same way that we could use gradient based methods to update our distributions to exhibit more complex behaviors, we can actually just you know take those same attribute models and use them as re-rankers to re-rank a fixed set of sequences rather than have them back propagate gradients to the main model.

0:34:41 - 0:35:04     Text: And so we can use them to rank things such as style, discourse, factuality, logical consistency, you know, but you know just be careful if your re-ranker ends up being poorly calculated, you know just because you've trained a classifier to predict whether a sentence makes a factual statement doesn't actually mean that it will be good at ranking different factual statements with respect to one another.

0:35:04 - 0:35:22     Text: And finally a nice thing about re-rankers as well is that you can use multiple re-rankers in parallel. There's multiple properties you want to score and come up with let's say a weighted average of different ranking scores to decide on what might be the best sequence according to different properties.

0:35:22 - 0:35:34     Text: But so you know to recap what we've talked about in terms of decoding, you know I just I want to mention that decoding is still a very challenging problem in natural language generation that we haven't really figured out yet.

0:35:34 - 0:35:39     Text: Our algorithms still don't you know really reflect the way that humans choose words when they speak.

0:35:39 - 0:35:49     Text: And our best approaches you know are currently based on trying to calibrate probability distributions produced by models to perhaps be more representative of human likelihood.

0:35:49 - 0:36:02     Text: But the truth is you know the human language distribution is it's quite noisy and doesn't reflect the simple properties that are decoding algorithms often capture such as probability maximization.

0:36:02 - 0:36:10     Text: But different decoding algorithms do allow us to perhaps inject biases that encourage different properties of going here at natural language generation.

0:36:10 - 0:36:23     Text: And that's allowed us to make promising improvements in this area as well. And in fact some of the most impactful advances in an algae over the last few years have really come from simple but very effective modifications to decoding algorithms.

0:36:23 - 0:36:30     Text: Because you can often have impact across a very large number of tasks by making a good change to a decoding algorithm.

0:36:30 - 0:36:39     Text: But really there's there's still a lot more work to be done in the space and hopefully you know many of you will be the ones to make these next breakthroughs.

0:36:39 - 0:36:42     Text: I'm happy to take questions at this point if any of popped up.

0:36:42 - 0:36:49     Text: Here's one question. How do you evaluate how do you tell whether rebalance distribution is better.

0:36:49 - 0:36:57     Text: And if I editorialize from there I guess you're admitting on this slide that you can't just look at the probability.

0:36:57 - 0:37:12     Text: So yes you you can't just look at the probability. I mean there is a certain amount of trust that happens that if you don't trust that your ranker is giving you a better assessment of whether you're you've produced a quality piece of text.

0:37:12 - 0:37:22     Text: Perhaps you shouldn't be using that re ranker in the first place and you know hopefully you've you've actually means tested that re ranker to actually show that it has that it improves the quality of text.

0:37:22 - 0:37:27     Text: And there's a lot more about how you can actually evaluate the quality of text later on.

0:37:27 - 0:37:32     Text: Though I should warn you ahead of time that the answers are not as.

0:37:32 - 0:37:41     Text: Not as direct and complete as you might want them to be and in fact there's there's a lot of room for interpretation in how you would actually do that.

0:37:41 - 0:37:51     Text: Maybe you should go on about that later but I guess people are puzzled on that because there's another question that's asking.

0:37:51 - 0:37:56     Text: If you said you don't know how to make a model choose was like a human.

0:37:56 - 0:38:04     Text: How do we model different humans from different background.

0:38:04 - 0:38:09     Text: Yeah that's that's a really good question.

0:38:09 - 0:38:28     Text: The answer to that is that you could potentially try to you know do do kind of fine tuning on the language distribution of a particular human starting from let's say a pre train language model since you'll probably never have enough data to only use a single humans outputs.

0:38:28 - 0:38:40     Text: Or you could try to do some of these rebalancing methods that we've spoken about to perhaps use those to actually make your distributions approach a particular humans language distribution more closely.

0:38:40 - 0:38:47     Text: So in the gradient based methods that I described I could perhaps train a single language model only on my type of language.

0:38:47 - 0:39:00     Text: Even if I have a much larger one that's trained on a much larger corpus of language from different speakers but then try to make it so that my model ranks the outputs of the main model.

0:39:00 - 0:39:03     Text: I've got a question.

0:39:03 - 0:39:14     Text: Nuclear sampling and top case sampling are really effective in practice and you made the argument that there are all these little tiny things with very little probability mass but it sums up to more probability mass.

0:39:14 - 0:39:26     Text: But if it sums up to more probability mass than they actually should have under the real distribution of human language shouldn't our models have been trained to put less probability mass on them.

0:39:26 - 0:39:39     Text: So why don't we why aren't our language models better in that case like why do we have this issue if that's if they actually are getting more probability and they should.

0:39:39 - 0:39:43     Text: Yeah, that's a really good question.

0:39:43 - 0:40:00     Text: I think the answer to that is that the way we train them which which I'll get to in a bit is really trying to is really trying to model the distribution of human language as it sees it in its training corpus and it's actually surprisingly effective at doing that.

0:40:00 - 0:40:20     Text: But a but at the same time when we actually start to use these language models out of the box in the NLT task that we work with first of all there's there's always you know slight defiations in the distribution of text that we're actually trying to model for the task we're doing and what the large corpus we trained on was in the first place which can make these these decoding algorithms less effective.

0:40:20 - 0:40:46     Text: And you know the second thing I would notice that even though these these decoding algorithms are quite effective in practice. They aren't doing what humans do when we speak we're at no point when a human speaks are we potentially trying to maximize the probability of a potential next token or randomly select a word amongst a set of tokens we ultimately have world models that drive how we select the tokens that we choose to say in order to make our points.

0:40:46 - 0:40:57     Text: And that's very different from what we get in probabilistic language models now you know should we throw away probabilistic language models no because as you mentioned they end up working quite well.

0:40:57 - 0:41:10     Text: But at some point we also do need to mitigate these differences between how humans end up speaking and how language models end up model language.

0:41:10 - 0:41:16     Text: So you know now that now the john has primed as perfectly for what comes next.

0:41:16 - 0:41:39     Text: You know let's let's jump back into training these models because you know the pipeline you know that I framed earlier being you know you train your model then you choose your decoding algorithm depending on properties you're interested in is is is great but the truth is there's interactions between your decoding model and your training algorithm that you might want to be thinking about during training which is not really what we're doing right now.

0:41:39 - 0:41:53     Text: And so if you recall the training algorithm that we've proposed up to this point is one where we just try to minimize the negative log likelihood of the next token in a sequence given the preceding ones at every step.

0:41:53 - 0:42:06     Text: And you know as I mentioned this actually works pretty well for training auto regress the models of human language but it actually causes a few issues which which john hinted at.

0:42:06 - 0:42:19     Text: So in the next few slides i'm going to talk a bit about these issues and then highlight some some training solutions to these problems that I found interesting or that I think are important from from the last few years.

0:42:19 - 0:42:26     Text: So the first issue is actually one that I hinted at in the last section which is that training with maximum likelihood.

0:42:26 - 0:42:43     Text: So I wanted to discourage textual diversity and I showed this sequence on a slide earlier as an example of greedy algorithms being prone to generating repetitive sequences which you know it's just about the worst form of diversity that you could that you could get.

0:42:43 - 0:42:57     Text: And so many algorithms are just trying to maximize the probability of the sequences that they produce so it can really only be prone to to repetitive and undiverse phrases if those phrases are scored highly by the model to begin with.

0:42:57 - 0:43:10     Text: And that ends up being one of the issues with maximum likelihood training is that it tends to end up favoring generic expressions because those are the ones that are you know often the most likely in human language production.

0:43:10 - 0:43:24     Text: As we all know and as I mentioned earlier human language production isn't about maximizing the likelihood of the words that we produce so even though we might produce generic phrases more often than engineering phrases that's not the goal that we're setting out with when we speak.

0:43:24 - 0:43:37     Text: There's a lot more to communication that really isn't synthesized by a training objective that tries to maximize you know probability over the human language that's being read so how can we end up mitigating this problem.

0:43:37 - 0:43:48     Text: You know an interesting approach that I really like that came out last year was was actually you know proposed by by wellic it all which was called unlikely hood training.

0:43:48 - 0:44:06     Text: And so here what you do is that you actually discourage the production of particular tokens by the model in certain contexts and so it happens that this loss term here decreases as the probability of the why neg tokens decreases so for any token that you don't want to generate as the probability of generating that token goes down.

0:44:06 - 0:44:14     Text: So so so does the loss term which means that you're not actually updating the model as much for the to for this particular behavior.

0:44:14 - 0:44:34     Text: What's important though is that you still have your teacher forcing objective so what's going to happen is that the models going to learn to capture both the distribution of language from the training corbis which it needs to do to learn how to generate text but also it's going to learn how to not say particular words that you might not want it to.

0:44:34 - 0:45:00     Text: And then what you can do is that you can set this list of words that you don't want the model to generate to actually be words that you've already generated before and so in essence you're you're teaching the model to not say the same things again and that's just naturally going to limit the amount of repetition that your model is going to be able to spit out and you're going to be able to generate more diverse texts as a result as well.

0:45:00 - 0:45:18     Text: But a second and very important issue that comes from training with with maximum likelihood is you know what we often call exposure bias, which is that the context that we train on to generate the next token are going to look different from the ones that we see a generation time.

0:45:18 - 0:45:21     Text: And so why might that be.

0:45:21 - 0:45:31     Text: The key to the challenge that happens during training is that we always get a token from a gold document or human text as it put it's the gold sequence as we call it.

0:45:31 - 0:45:44     Text: But then during generation we're feeding our previously generated tokens back into the model as the input rather than these these teacher force tokens that are that are from the gold documents.

0:45:44 - 0:45:53     Text: And so if tokens is actually quite affected by things like the distributions produced by our model and the decoding algorithm we use to get tokens.

0:45:53 - 0:45:55     Text: And so, can this end up being a problem?

0:45:55 - 0:46:07     Text: Well, yes, because as we've seen before, the types of text that our model generates are often not a very close approximation of the human language patterns in the training set.

0:46:07 - 0:46:19     Text: And so there could be a mismatch between the type of text that our model has learned to predict and expect to see, and the type of text that it will see once we actually start decoding.

0:46:19 - 0:46:34     Text: And so once your model starts receiving its own outputs as inputs, which are going to deviate from the distribution of text it expects, it's going to be very challenging for it to generate coherent text going forward, because it doesn't really know how to make sense of the information that it has generated itself.
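As a toy illustration of this mismatch, here is a hedged sketch contrasting the two regimes, assuming a hypothetical `model` that maps a token prefix to next-token logits: during training the inputs are always gold prefixes, while at generation time the model consumes its own, possibly off-distribution, samples.

```python
import torch

# Training (teacher forcing): the model always conditions on gold prefixes.
def teacher_forcing_inputs(gold_tokens):
    return gold_tokens[:-1]            # the targets are gold_tokens[1:]

# Generation: the model conditions on its own previous outputs,
# so early mistakes push later inputs off the training distribution.
def free_running_generate(model, bos_token, max_len=20):
    tokens = [bos_token]
    for _ in range(max_len):
        logits = model(torch.tensor(tokens))
        tokens.append(int(logits.argmax()))   # greedy here purely for illustration
    return tokens
```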

0:46:34 - 0:46:42     Text: And so there's a variety of ways to try to counter this exposure bias issue, and many more continue to come out.

0:46:42 - 0:46:51     Text: Unfortunately, there's not really enough time to talk about all of them, so I've added slides to discuss two of them that are based on semi-supervised learning here.

0:46:51 - 0:46:57     Text: But I really want to focus more on two other approaches that I personally find very interesting.

0:46:57 - 0:47:10     Text: The first is called sequence rewriting. And so you know in this setting your model first learns to retrieve a sequence from an existing database of human written prototypes.

0:47:10 - 0:47:21     Text: So it's kind of like our nearest neighbor decoders earlier: there we cached a bunch of phrases, and here you cache a bunch of sequences that might be similar to the one that you're supposed to produce for this new situation.

0:47:21 - 0:47:39     Text: And then what we do is that once we take this sequence and retrieve it, we learn to edit it by doing things like adding, removing or modifying tokens to more accurately reflect the context that we're actually given rather than the one that this original sequence was designed for in the first place.

0:47:39 - 0:47:59     Text: And so we can still use an algorithm here that tries to maximize likelihood during training, but because there's this sort of latent variable of retrieving the right prototype involved, it makes it less likely that our generated text ends up suffering from exposure bias, because you're already starting from something that looks more like a training sequence that you might have seen.
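Here is a hedged sketch of the retrieve-then-edit idea, with a cosine-similarity retriever over cached prototype embeddings; the `editor` seq2seq model and the embedding function are hypothetical stand-ins for illustration, not a specific published system.

```python
import torch
import torch.nn.functional as F

def retrieve_prototype(query_emb, prototype_embs, prototypes):
    """query_emb: (1, d) embedding of the new context.
    prototype_embs: (N, d) cached embeddings of human-written sequences."""
    sims = F.cosine_similarity(query_emb, prototype_embs)   # (N,) similarity scores
    return prototypes[int(sims.argmax())]

# Hypothetical second stage: a seq2seq editor conditioned on both the new
# context and the retrieved prototype, trained to add/remove/modify tokens:
#   output = editor.generate(context=context,
#                            prototype=retrieve_prototype(q, embs, protos))
```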

0:47:59 - 0:48:17     Text: Another general class of possibilities we can do is to let our model learn to generate text by learning from its own samples. And you know this naturally maps itself nicely to reinforcement learning, which is actually one of my favorite ways to learn how to generate text.

0:48:17 - 0:48:31     Text: In this setting, you're going to cast your text generation model as a Markov decision process, where your state s is the model's representation of the preceding context, and your actions a are the words that can be generated.

0:48:31 - 0:48:37     Text: Your policy is the decoder and your rewards are provided by some type of external score.

0:48:37 - 0:48:46     Text: And here you can learn many different behaviors for your text generation model by rewarding it when it exhibits those behaviors.

0:48:46 - 0:48:59     Text: So to kind of quickly join this framing with the perspective of the text generation models that we've seen so far: you're going to be taking actions by sampling words ŷ from the distribution.

0:48:59 - 0:49:05     Text: And then you're going to feed them back into the input to get a new state, which is what we've been doing at every point.

0:49:05 - 0:49:16     Text: What's different though is that as you generate text, you're using some external reward function to compute rewards for each token that you generate. So you're rewarding every action that you take.

0:49:16 - 0:49:29     Text: And then you scale the sampling loss on this particular token that you generate by this reward, which is going to encourage the model to generate the same sequence in similar contexts if the reward is high.

0:49:29 - 0:49:42     Text: So, very clearly, you're minimizing the negative log likelihood of your sampled token. So here it's not a gold token; notice the hat that is on the ŷ expression in the reward function.

0:49:42 - 0:49:48     Text: And then you're going to compute a reward for that token and scale this negative log likelihood by that reward.

0:49:48 - 0:49:59     Text: So if the reward is high, the model is going to be more likely to generate this same sequence in a similar context in the future, if the reward is low, it will be less likely.
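Here is a minimal REINFORCE-style sketch of that loop, assuming a hypothetical `model` that maps a token prefix to next-token logits and a hypothetical `reward_fn` that scores each generated prefix; real systems add baselines and teacher-forcing pre-training, as discussed later in the lecture.

```python
import torch

def reinforce_loss(model, context, reward_fn, max_len=20):
    """Sample a continuation, score it, and scale each token's NLL by its reward."""
    log_probs, tokens = [], []
    state = list(context)
    for _ in range(max_len):
        logits = model(torch.tensor(state))               # next-token scores
        dist = torch.distributions.Categorical(logits=logits)
        y_hat = dist.sample()                             # action: sample a word
        log_probs.append(dist.log_prob(y_hat))
        tokens.append(int(y_hat))
        state.append(int(y_hat))                          # feed the sample back in
    rewards = torch.tensor([reward_fn(tokens[:t + 1]) for t in range(len(tokens))])
    # High-reward tokens get their likelihood pushed up; low-reward ones, down.
    return -(torch.stack(log_probs) * rewards).sum()
```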

0:49:59 - 0:50:07     Text: But this sort of brings up a natural question: what can we actually use as a reward to encourage the behaviors we want in this text generation system?

0:50:07 - 0:50:22     Text: Well, that depends on what you value as you design your generation pipeline. A common practice in the early days of using RL for text generation was to set the reward to be the final evaluation metric that you were going to evaluate on.

0:50:22 - 0:50:34     Text: And so here instead of having a unique reward for each generated token at every time step, you would just take the final sequence score that you get and reward every token in the generated sequence with that value.

0:50:34 - 0:50:49     Text: And it seemed absolutely magical: you would set your evaluation metric as the reward, and you'd end up learning to get more reward, because that's what RL algorithms do, which in turn means that you were learning to generate sequences that do better on your evaluation metric.

0:50:49 - 0:50:52     Text: So NLG benchmark scores were shooting through the roof.

0:50:52 - 0:51:04     Text: And so it looked like we were making real progress. But it was actually all a lie. It turns out, as I'll talk about later, that evaluation metrics, particularly for text generation, are just approximations.

0:51:04 - 0:51:12     Text: And it's not always clear that optimizing toward those approximations is going to lead to better, more coherent text generation.

0:51:12 - 0:51:41     Text: And so instead, oftentimes what ends up happening is that the model just learns to exploit the noise in the evaluation metric. And in fact, in the work where they introduced Google's neural machine translation system in 2016, Google researchers found that training machine translation models with RL and BLEU scores as rewards didn't actually improve the translation quality at all, even if it did lead to higher BLEU scores.

0:51:41 - 0:51:50     Text: So designing your reward function is a very important problem in RL for actually learning the behavior that you want.

0:51:50 - 0:52:10     Text: And I've listed some cool work here on how you can actually tie fairly complex behaviors to reward functions, by implementing the scorers that you use as rewards as neural networks that get trained on an auxiliary task ahead of time, but can then be used to provide scores as rewards to the system that you train.

0:52:10 - 0:52:24     Text: So to go back to our example earlier of trying to create a dialogue agent that is very positive and only says positive things, you could use a sentiment classifier to produce a reward for the sequences that you generate.
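As a hedged sketch of that idea, one could plug an off-the-shelf sentiment classifier in as the reward function; this uses the Hugging Face `transformers` sentiment pipeline, and the sign convention for the reward is an illustrative choice, not a prescribed recipe.

```python
from transformers import pipeline

# Off-the-shelf sentiment classifier used as a sequence-level reward.
sentiment = pipeline("sentiment-analysis")

def sentiment_reward(generated_text: str) -> float:
    result = sentiment(generated_text)[0]      # {"label": ..., "score": ...}
    # Reward positive sequences, penalize negative ones.
    return result["score"] if result["label"] == "POSITIVE" else -result["score"]
```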

0:52:24 - 0:52:28     Text: And so that's a lot of fun.

0:52:28 - 0:52:40     Text: But unfortunately, despite all the fun that you can have using RL to train text generation engines, there's a bit of a dark side too, which is that reinforcement learning algorithms can be notoriously unstable.

0:52:40 - 0:52:49     Text: And so to get these text generation systems to learn with RL, you often have to be thorough in tuning different dials in your model setup accurately.

0:52:49 - 0:52:59     Text: There are many of them; two that I think are worth mentioning are, first, that you always need to pre-train with teacher forcing: you generally can't train with reinforcement learning from scratch.

0:52:59 - 0:53:04     Text: And second, you need to provide some type of baseline reward that your model should be achieving.

0:53:04 - 0:53:10     Text: So for example, BLEU score, which I described earlier, is always a positive value unless it's zero.

0:53:10 - 0:53:20     Text: But that means that if you use it alone as a reward, every single sequence that you sample ends up, you know, being encouraged in the future. So what you want to have is some type of baseline.

0:53:20 - 0:53:30     Text: That is, an expectation of how much reward you should be getting, which can be subtracted from the reward that you actually get, so you can discourage certain behaviors as well.
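Here is a minimal sketch of that baseline subtraction, using a running average of recent rewards as the baseline; the momentum value is an arbitrary illustrative choice.

```python
def advantage(reward, baseline, momentum=0.99):
    """Subtract a running-average baseline so below-average samples are discouraged."""
    adv = reward - baseline                           # negative when the sample underperforms
    baseline = momentum * baseline + (1 - momentum) * reward
    return adv, baseline
```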

0:53:30 - 0:53:41     Text: One last note about this is that neural networks are quite good at finding the easiest way to learn something. So if there's a way to exploit your reward function,

0:53:41 - 0:53:46     Text: it'll find a way to do it, particularly if that's easier than learning the behavior that you want it to learn.

0:53:46 - 0:53:59     Text: So that's something to remember that's kind of important if you try to use reinforcement learning for text generation systems, particularly because there's such a large action space of words that the model can generate to try to accomplish certain behaviors.

0:53:59 - 0:54:12     Text: So, to end this section, I just want to say that, in general, we still use teacher forcing as the primary means of learning to generate coherent text.

0:54:12 - 0:54:20     Text: It has diversity issues, but it still lets us learn a model with decent text generation abilities.

0:54:20 - 0:54:32     Text: One thing I haven't focused too much on in this lecture is the type of model that you can use to actually generate text, because those tend to be less universal and much more designed for very specific end tasks.

0:54:32 - 0:54:48     Text: But in general, a common approach in NLG is to try to design a neural architecture that allows your model to be perhaps less sensitive to the problems of teacher forcing, or to address them with additional loss terms that are perhaps task-specific.

0:54:48 - 0:54:54     Text: Exposure bias, though, is a problem everywhere, pretty much regardless of your neural architecture.

0:54:54 - 0:55:15     Text: And to kind of mitigate it, you can either train your model to be more resistant to its own distribution changes through things like semi-supervised learning, or you can change your pipeline so that you're learning to make modifications to an existing sequence that you retrieve from your training set, rather than trying to learn how to generate sequences from scratch.

0:55:15 - 0:55:27     Text: And the caveat there is that, as the type of text that you're generating gets longer and longer, doing this kind of retrieval and editing becomes just as challenging as generating from scratch in many cases.

0:55:27 - 0:55:32     Text: And finally, you can use reinforcement learning as another means of learning from your own samples.

0:55:32 - 0:55:46     Text: And in effect, you can also use it to encourage different behaviors than just likelihood maximization. But this type of learning can end up being quite unstable, and unless your reward is well shaped, the model can often learn to exploit it.

0:55:46 - 0:56:03     Text: And a question from the audience: can you also use the same kind of training with, you know, online language simulators, where you can train with RL online, or are you only talking about training offline?

0:56:03 - 0:56:23     Text: No, in this setting we're generally talking about training offline. So you train it all ahead of time, and then you use your model as it's been trained the first time. Okay, yeah, so now we've finally reached the section that I hinted at earlier, on a very important topic: evaluation.

0:56:23 - 0:56:38     Text: And, to be honest, this is something that we should be thinking about before we even start designing a model or training algorithm or decoding algorithm. It's: how can we actually check and evaluate that our method is even working?

0:56:38 - 0:56:57     Text: I want to talk a bit about three types of evaluation metrics. So first we'll talk about automatic evaluation metrics, because generally you need to be able to rapidly prototype and diagnose failures in your model. So it's essential to have this quick feedback, even if it's very coarse, and automatic evaluation metrics give you that.

0:56:57 - 0:57:09     Text: And we've traditionally used what I'm calling content overlap metrics, which focus on how much a sequence explicitly resembles another sequence, usually in terms of word or phrase matching.

0:57:09 - 0:57:29     Text: And lately, there have also been new automatic evaluations that are model-based, where we try to use advances in embeddings and neural models to define more implicit similarity measures between sequences. And finally, we'll talk a bit about human evaluations, which are kind of the gold standard of evaluating text generation.

0:57:29 - 0:57:43     Text: And they also do have downsides that we'll get to as well. And I just want to note that some of these slides here are actually repurposed from slides by Asli Celikyilmaz, who's a leading expert on NLG evaluation.

0:57:43 - 0:58:07     Text: But so, let's jump in. Content overlap metrics generally compute an explicit similarity score between two sequences: the one that's been generated by your model, and some gold-standard reference sequence that was attached to the inputs that you had. In other words, a sequence that you know was an appropriate generation for the inputs you had.

0:58:07 - 0:58:19     Text: And these metrics are often a popular starting point because they're fast and very efficient, which is their main benefit, as long as you have a reference sequence to compare to you can compute these scores rapidly to get feedback.

0:58:19 - 0:58:33     Text: And I'm going to categorize them into two groups here: first, n-gram overlap metrics, which compute different functions of word and word-like overlap, and semantic overlap metrics, which involve more complex overlap functions based off semantic structures.

0:58:33 - 0:58:46     Text: Unfortunately, besides being fast and efficient, most n-gram overlap metrics don't actually give you a great approximation of sequence quality a lot of the time.

0:58:46 - 0:59:01     Text: They're already not ideal for something like machine translation, where there can be multiple ways to translate the same sequence with things like synonyms, and they get progressively much worse for tasks that are more open-ended than MT.

0:59:01 - 0:59:09     Text: So for example, in summarization, you know, the longer output texts make it naturally harder to measure something using word match.

0:59:09 - 0:59:22     Text: And something like dialogue is incredibly open-ended. In fact, you can have multiple responses to a particular utterance that mean the same thing, but don't use any common words.

0:59:22 - 0:59:38     Text: And so we can illustrate this with a simple, fairly contrived example, where you have a dialogue context utterance that asks a question, such as, you know, "are you going to Antoine's incredible CS 224N lecture?", and a completely unbiased reference text derived from a human.

0:59:38 - 0:59:56     Text: And then a dialogue agent that spits out different answers, such as "yes", which gets a pretty high score on our n-gram overlap metric, or "you know it", which already scores a lot lower despite depicting the exact same idea, or "yup", which actually gets a score of zero.

0:59:56 - 1:00:11     Text: And meanwhile, a completely incorrect answer, "heck no", gets the highest score out of all of them, which kind of points to the issue when you use n-gram overlap measures: in a lot of applications, you're going to be missing the salient elements of what the generated sequence should capture.
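To reproduce the flavor of this failure, here is a small sketch using NLTK's sentence-level BLEU; the reference and candidate strings are made-up stand-ins in the spirit of the slide, not the exact strings from it.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["heck", "yes", "!"]]     # hypothetical gold reply
candidates = [["yes", "!"], ["you", "know", "it", "!"], ["yup", "."], ["heck", "no", "!"]]

smooth = SmoothingFunction().method1   # avoid hard zeros on very short strings
for cand in candidates:
    bleu = sentence_bleu(reference, cand, smoothing_function=smooth)
    print(" ".join(cand), "->", round(bleu, 3))
# The scores track token overlap, not meaning: the wrong answer "heck no !"
# gets substantial credit, while the correct "yup ." scores near zero.
```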

1:00:11 - 1:00:19     Text: And instead, you're rewarding stylistic similarities between texts, even as you miss the most important content.

1:00:19 - 1:00:29     Text: And if you prefer empirical validations to contrived examples, it's actually been shown that many dialogue evaluation metrics don't really correlate well with human judgments at all.

1:00:29 - 1:00:46     Text: And this typically gets worse as your sequence length increases. So with an open-ended task like story generation, you can get improved scores by just matching a whole lot of stopwords that have nothing to do with the content of the story itself.

1:00:46 - 1:00:55     Text: And we also have another category of overlap metrics that I'll call semantic overlap metrics, because they aren't necessarily tied directly to the words you used,

1:00:55 - 1:01:01     Text: but instead try to create conceptual representations of the generated and reference outputs.

1:01:01 - 1:01:15     Text: So the center one here, SPICE, for example, creates a scene graph of your generated text and then compares that to your reference caption to see how similar this more semantic representation ends up being.

1:01:15 - 1:01:27     Text: But clearly, there are some limitations to how well explicit content overlap metrics can do, particularly as we start thinking about more open-ended tasks.

1:01:27 - 1:01:42     Text: So in response, over the last few years there's been a focus on using model-based metrics, whose representations come from machine learning models and which can actually be used to evaluate the fidelity of generated text.

1:01:42 - 1:01:56     Text: And these are nice because there's no more need for explicit matches between words in your reference and generated text; instead, you can rely on much more implicit notions of similarity that you get from word embeddings.

1:01:56 - 1:02:17     Text: And so there have been a lot of models developed in this area. Some of the original ones kind of focused on defining composition functions over the embeddings of words in your generated and reference sequences, and then computing a distance between the compositions of the two sequences.

1:02:17 - 1:02:32     Text: Some more involved takes on this idea are things like Word Mover's Distance, where you actually try to map word vectors in both your generated and reference sequences into pairs with one another.

1:02:32 - 1:02:53     Text: So each word vector is paired to another word vector in the opposite sequence and the distance between them is computed, and then you allow the evaluation metric to compute the optimal matching between these pairs of words such that the total distance is minimized. And BERTScore, which has become quite popular over the last year

1:02:53 - 1:03:04     Text: as an evaluation metric, is pretty much just Word Mover's Distance, but using contextualized BERT embeddings to compute these distances.
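For reference, a quick usage sketch of the `bert-score` package (`pip install bert-score`), which exposes BERTScore through its `score` function; the example strings are the made-up dialogue replies from earlier.

```python
from bert_score import score

cands = ["you know it!"]                  # no word overlap with the reference
refs = ["heck yes!"]
P, R, F1 = score(cands, refs, lang="en")  # precision, recall, F1 tensors per pair
print(float(F1[0]))  # reasonably high: the replies are close in embedding space
```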

1:03:04 - 1:03:22     Text: Sentence Mover's Similarity is kind of another extension of Word Mover's Distance that adds sentence embeddings from recurrent neural networks to this distance optimization, and that allows it to be more effective for evaluating, let's say, long multi-sentence texts, as opposed to just single sentences.

1:03:22 - 1:03:45     Text: And finally, last year there was a new model named BLEURT, which is actually a regression model based on BERT. Here, it takes a pair of sentences, the reference and the generated one, and returns a score that indicates to what extent the candidate is grammatical and conveys the meaning of the reference text.

1:03:45 - 1:03:58     Text: So we could talk about a lot more evaluation metrics that are computed automatically, and there are far more of them than I actually mentioned, though they do tend to fit in those two categories that I described.

1:03:58 - 1:04:08     Text: But it's important to remember that, at the end of the day, the true mark of an NLG system's performance is whether it's valuable to the human user that has to interact with it, or read the text that's produced from it.

1:04:08 - 1:04:33     Text: And unfortunately, automatic metrics tend to fall short of replicating human opinions of the quality of generated text. And that's why human evaluations are viewed as the most important form of evaluation for text generation systems: almost all work in NLG generally includes some form of human evaluation, particularly if the task is more open-ended.

1:04:33 - 1:04:43     Text: And if it doesn't, well, let me be frank: you should probably be skeptical of any claim that's being made by that work.

1:04:43 - 1:05:02     Text: And finally, another use of human evaluations is that, in addition to evaluating model performance, you can also use them to actually train new machine learning models that are meant to serve as evaluation scoring functions themselves.

1:05:02 - 1:05:13     Text: So I guess the main thing to mention about human evaluations, since we could talk about them for a while, is that they're both very simple but also very difficult to run.

1:05:13 - 1:05:20     Text: You know they're simple because you generally just need to find judges and ask them to rate an intrinsic dimension of the quality of your generated text.

1:05:20 - 1:05:34     Text: So you have to define a set of criteria that you decide is important for the task that you're designing a system for, and that can be things like fluency, where you measure things like grammar, spelling, and word choice: does this actually look like human language?

1:05:34 - 1:05:47     Text: Or things like correctness: does the text accurately reflect facts that are described in the context? Or common sense: does it sort of follow the logical rules of the world that we might expect?

1:05:47 - 1:06:00     Text: And once you've defined these criteria, you can have humans evaluate the generated text for how well it actually caters to these particular criteria.

1:06:00 - 1:06:06     Text: The catch here is that, while these dimensions are common and repeated across different evaluations,

1:06:06 - 1:06:29     Text: they can often be referred to by other names, explained to evaluators in different terms, and measured in different ways. And in fact, one of the problems with human evaluations is that across works they tend to be very unstandardized, which can make the replication of human results quite difficult. That's why, when you read text generation papers, you rarely see a comparison of human evaluation scores between two different studies,

1:06:29 - 1:06:34     Text: Even if they evaluated the same dimensions.

1:06:34 - 1:06:48     Text: But another set of issues with human evaluations, beyond the fact that they're slow, expensive, and unstandardized, is that humans themselves aren't actually perfect.

1:06:48 - 1:07:05     Text: And I guess, as a few negatives that I can say about humans, even though we're all humans: we tend to not be very consistent folks, often changing our minds about how we view something depending on something as trivial as the time of day.

1:07:05 - 1:07:25     Text: And we don't always reason in the way that we're expected to when presented with a task such as evaluating something; we can lose concentration and not really be focused on what we're doing; and we can often misinterpret what, let's say, a human evaluation is asking us to do, such that we inject our own biases into the task.

1:07:25 - 1:07:42     Text: And on top of all these things, when we run human evaluations, we're also dealing with the fact that one of the big motivators our human judges have is to do the task as quickly as possible, which isn't a great mix, particularly if we want them to really give us high-quality ratings.

1:07:42 - 1:07:54     Text: But humans are kind of the best thing that we have to actually give us the most accurate assessments of whether text generation systems are doing well, so we do the best that we can.

1:07:54 - 1:08:17     Text: I'm actually going to skip this slide, but I mentioned earlier that one of the things we can also do is use human ratings to train models to actually predict scores for text itself. And for two systems that do things along these lines, I've provided citations here so that you can take a look at them if you're curious later on.

1:08:17 - 1:08:45     Text: So the takeaways I kind of want you to get from this section are that evaluation is quite hard, particularly in text generation. Content overlap metrics do provide a good starting point for evaluating the quality of generated text: if you run these n-gram overlap metrics and they show scores that are worse than they should be, that's the first sign that you have a problem. But they're generally not good enough on their own.

1:08:45 - 1:09:14     Text: Model-based metrics tend to be more correlated with human judgments than content overlap ones, particularly as the tasks become more open-ended, such as dialogue and storytelling. But the downside is that they're not very interpretable: unlike with a content overlap metric, where you can say exactly why a score is what it is, because these words match up with those words, with a model-based metric you get a much more implicit definition of similarity, which, while useful, is also less

1:09:14 - 1:09:35     Text: interpretable. Human judgments are absolutely critical because, even if they're inconsistent and sometimes don't do the tasks that you want them to, humans are actually able to intrinsically evaluate dimensions that we don't even know how to formulate using any type of automatic metric.

1:09:35 - 1:10:03     Text: But lastly, slightly unrelated to what I've spoken about in this section, I just want to say that the number one evaluator of any NLG system that you create should really be you. A look at your model's outputs as you do a project that involves NLG can really be worth days, weeks, and sometimes months of staring at evaluation metrics that are perhaps only a bit informative. So if you design NLG systems, make sure to evaluate your own generations very consistently.

1:10:05 - 1:10:28     Text: So in this last section, I think it's quite important to talk about ethical topics in natural language generation as well, because ultimately, while NLG really allows us to tackle new and interesting applications, if we're not careful we can end up deploying fairly dangerous and harmful systems.

1:10:28 - 1:10:36     Text: And as a warning, I just want to say that some of the content on the next few slides is potentially going to be quite uncomfortable.

1:10:36 - 1:10:41     Text: But I think it's important to make very clear how these systems can go very very wrong.

1:10:41 - 1:10:54     Text: And without picking on a particular example, I do think that perhaps one of the most famous examples of this was the Tay dialogue chatbot, which was released onto Twitter in 2016.

1:10:54 - 1:11:09     Text: And within 24 hours, it had started making some very nasty comments that exhibited racist, sexist, anti-Semitic, white supremacist leanings, which is ultimately probably not what the designers of Tay had in mind.

1:11:09 - 1:11:19     Text: So what actually ended up going wrong with Tay? Well, here's the thing: Tay behaved exactly as we should have expected it would.

1:11:19 - 1:11:34     Text: It was designed to learn to exhibit the conversational patterns of the users that it interacted with, and it did exactly that. NLG models are very good at capturing the language distribution of their training examples; that's been the one thing that we've been remarkably consistent at over

1:11:34 - 1:11:43     Text: the last few years. And it turns out that if their training examples end up having toxic content, they will learn to repeat that content.

1:11:43 - 1:11:51     Text: And this is perhaps no clearer than if you look at what pre-trained language models have to say about, let's say, different demographics.

1:11:51 - 1:12:03     Text: So if you remember, large pre-trained language models that underlie many modern NLG systems are trained on massive corpora of text, which are often opaque and crawled from online resources.

1:12:03 - 1:12:13     Text: And if it turns out that those corpora have toxic content, the language models are going to learn it, and in fact can make it even worse.

1:12:13 - 1:12:23     Text: And it turns out that if you prompt these language models for certain pieces of information, they can spit out that toxic content, showing very different opinions across, you know, genders, races, and sexual orientations.

1:12:23 - 1:12:40     Text: Now, I concede that it would actually be rare to ask a language model to weigh in with its opinions on these matters, but you do have to ask yourself: if this type of information is encoded in the model in some way, in what other ways could these learned patterns end up being reflected by the model once it's actually deployed in practice?

1:12:40 - 1:12:45     Text: And that kind of leads us to the second problem with these language models.

1:12:45 - 1:12:55     Text: And we actually don't really know how information ends up being learned and encoded by them which means that we don't have a rigorous understanding of what types of inputs are going to trigger what types of outputs.

1:12:55 - 1:13:08     Text: And in fact, Wallace et al. showed in their EMNLP 2019 work that this was a big problem, because if you primed these models with particular adversarial inputs, they would generally devolve immediately into producing very toxic content.

1:13:08 - 1:13:17     Text: In other words, what it took Tay 24 hours to learn, these systems can kind of do out of the box if primed with the wrong examples.

1:13:17 - 1:13:21     Text: And unfortunately, the wrong examples

1:13:21 - 1:13:41     Text: can end up being a lot less nasty than we might have expected, a lot less nasty than the ones on the previous slide at the very least. In a work at EMNLP Findings last year, it was shown that even fairly innocuous prompts could cause these models to devolve into generating toxic content.

1:13:41 - 1:13:51     Text: And it was not as consistent as in the previous work, but it was still often enough.

1:13:51 - 1:13:56     Text: And these examples really go to show that we need to be careful with how these systems are deployed.

1:13:56 - 1:14:01     Text: If you have an NLG system you need safeguards to stop it from outputting harmful content.

1:14:01 - 1:14:12     Text: And beyond toxicity and bias, a model that can be primed to generate incorrect or unfactual information can be quite dangerous too.

1:14:12 - 1:14:18     Text: And also, NLG models shouldn't be deployed without an understanding of who their users will be.

1:14:18 - 1:14:28     Text: And there's always going to be adversarial users for any model that you create even if you can't think of them in the moment.

1:14:28 - 1:14:39     Text: And so, the final point is that the advances in NLG have really allowed us to build text production systems for many new applications.

1:14:39 - 1:14:46     Text: As we do this, though, it's important to ask about the content that we're building a system to automatically generate

1:14:46 - 1:14:59     Text: for easy human ingestion: does it really need to be generated automatically?

1:14:59 - 1:15:04     Text: And I think a good example of this is the work of Zellers et al. at NeurIPS 2019, which showed off the potential dangers of fake news generators built from pre-trained language models.

1:15:04 - 1:15:14     Text: I actually thought this was a great work, and it highlighted many of the defenses that could be developed against fake news generation systems.

1:15:14 - 1:15:21     Text: But you know the point is more so that you should always imagine that any tool that you create could be used in a negative way.

1:15:21 - 1:15:30     Text: So a storytelling NLG system can also potentially be you know repurposed to do fake news generation.

1:15:30 - 1:15:38     Text: And you should really always ask yourself whether the positive applications of a particular technology outweigh the potential negative ones.

1:15:38 - 1:15:44     Text: And that turns out to often not be an easy question.

1:15:44 - 1:15:48     Text: So I guess as concluding thoughts for today.

1:15:48 - 1:16:02     Text: I just want to mention that if you start interacting with NLG systems in practice, you're quickly going to see the fairly large limitations that they tend to have.

1:16:02 - 1:16:11     Text: Even in tasks where we've achieved a large amount of progress at building systems that can do the task fairly well,

1:16:11 - 1:16:17     Text: There's still a lot of improvements that can be made to make them even better.

1:16:17 - 1:16:23     Text: in pretty much any NLG task. At the same time, evaluating them effectively remains a huge challenge,

1:16:23 - 1:16:32     Text: and you often have to rely on humans to give us the best estimates of how well our systems are doing.

1:16:32 - 1:16:43     Text: And so an area where a large improvement would really kind of bootstrap larger improvements in many other areas of NLG would be finding better automatic evaluation metrics.

1:16:43 - 1:16:56     Text: On the other hand, on a very optimistic note, I do want to say that with the advent of large-scale language models,

1:16:56 - 1:17:06     Text: deep NLG research is far from finished, but it's never been easier to jump into the space and start playing around with these systems and designing cool new tools that can help humans ingest content and information more rapidly and more efficiently.

1:17:06 - 1:17:18     Text: And as a result, I think that it's one of the most exciting areas of NLP to work in, and I think that if you start working in it as well, you'll feel the same way, and I would encourage you to do so.

1:17:18 - 1:17:46     Text: Thanks a lot for having me today. It was really exciting.