Stochastic Language Generation in Dialogue using Factored Language Models

Authors

  • François Mairesse, Amazon
  • Steve Young, University of Cambridge

Abstract

Most previous work on trainable language generation has focused on two paradigms: (a) using a statistical model to rank a set of pre-generated utterances, or (b) using statistics to determine the generation decisions of an existing generator. Both approaches rely on the existence of a handcrafted generation component, which is likely to limit their scalability to new domains. The first contribution of this paper is to present BAGEL, a fully data-driven generation method that treats the language generation task as a search for the most likely sequence of semantic concepts and realisation phrases according to Factored Language Models (FLMs). As domain utterances are not readily available for most NLG tasks, a large creative effort is required to produce the data necessary to represent human linguistic variation for non-trivial domains. This paper is based on the assumption that learning to produce paraphrases can be facilitated by collecting data from a large sample of untrained annotators through crowdsourcing rather than from a few domain experts, made possible by relying on a coarse meaning representation.
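Schematically, and in notation of our own rather than the paper's, this search can be written as a cascaded maximisation in which each conditional distribution is estimated by an FLM with backoff over its conditioning factors:

    (S*, R*) = \arg\max_{S,R} P(S \mid d) \, P(R \mid S)
             = \arg\max_{S,R} \prod_{t=1}^{T} P(s_t \mid s_{t-1}, s_{t-2}, d) \; P(r_t \mid r_{t-1}, s_t)

where d is the input dialogue act, S = s_1 ... s_T is the sequence of semantic concepts, and R = r_1 ... r_T the corresponding realisation phrases. The particular conditioning factors shown (previous concepts, the dialogue act, the current concept) are illustrative assumptions; BAGEL's FLMs may use different factor sets and backoff paths.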

A second contribution of this paper is to use crowdsourced data to show how dialogue naturalness can be improved by learning to vary the output utterances generated for a given semantic input. Two data-driven methods for generating paraphrases in dialogue are presented: (a) sampling from the N-best list of realisations produced by BAGEL’s FLM reranker; and (b) learning a structured perceptron that predicts whether candidate realisations are valid paraphrases. We train BAGEL on a set of 1,956 utterances produced by 137 annotators, covering 10 types of dialogue acts and 128 semantic concepts in a tourist information system for Cambridge. An automated evaluation shows that BAGEL outperforms utterance-class LM baselines on this domain. A human evaluation of 600 resynthesised dialogue extracts shows that BAGEL’s FLM output produces utterances comparable to a handcrafted baseline, while the perceptron classifier performs worse. Interestingly, human judges find the system sampling from the N-best list to be more natural than a system always returning the 1-best utterance, and they are also more willing to interact with the N-best system in the future. These results suggest that capturing the large variation found in human language using data-driven methods is beneficial for dialogue interaction.
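As a rough illustration of the two methods, the Python sketches below show (a) sampling an utterance from an N-best list in proportion to its exponentiated model score, and (b) a single perceptron update on a paraphrase-validity decision. All names, the temperature parameter, the example utterances, and the binary (rather than structured) perceptron are our own simplifying assumptions, not BAGEL's exact procedures.

    import math
    import random

    def sample_from_nbest(nbest, temperature=1.0):
        """Sample one realisation from an N-best list of (utterance, log_prob)
        pairs, with probability proportional to exp(log_prob / temperature).
        temperature -> 0 approaches the deterministic 1-best system; higher
        values flatten the distribution and increase output variation."""
        max_lp = max(lp for _, lp in nbest)  # shift for numerical stability
        weights = [math.exp((lp - max_lp) / temperature) for _, lp in nbest]
        utterances = [u for u, _ in nbest]
        return random.choices(utterances, weights=weights, k=1)[0]

    def perceptron_update(weights, features, label):
        """One update of a binary perceptron deciding whether a candidate
        realisation is a valid paraphrase (+1) or not (-1). `features`
        maps feature names to values; `weights` is updated in place."""
        score = sum(weights.get(f, 0.0) * v for f, v in features.items())
        if (1 if score >= 0 else -1) != label:
            for f, v in features.items():
                weights[f] = weights.get(f, 0.0) + label * v

    # Hypothetical usage on candidate realisations of one dialogue act:
    nbest = [("X is in the city centre.", -4.1),
             ("X is located in the centre of town.", -4.6),
             ("There is a venue called X in the centre.", -5.3)]
    print(sample_from_nbest(nbest, temperature=1.0))

Sampling from the N-best list trades a small amount of per-utterance model score for variation across turns, which is consistent with the human preference for the N-best system reported above.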

Author Biography

  • Steve Young, University of Cambridge
    Engineering Department, Professor

Published

2014-12-05

Issue

Section

Long paper