Stochastic Language Generation in Dialogue using Factored Language Models

Authors

  • François Mairesse, Amazon
  • Steve Young, University of Cambridge

Abstract

Most previous work on trainable language generation has focused on two paradigms: (a) using a statistical model to rank a set of pre-generated utterances, or (b) using statistics to determine the generation decisions of an existing generator. Both approaches rely on the existence of a handcrafted generation component, which is likely to limit their scalability to new domains. The first contribution of this paper is to present BAGEL, a fully data-driven generation method that treats the language generation task as a search for the most likely sequence of semantic concepts and realisation phrases according to Factored Language Models (FLMs). As domain utterances are not readily available for most NLG tasks, a large creative effort is required to produce the data necessary to represent human linguistic variation for non-trivial domains. This paper is based on the assumption that learning to produce paraphrases can be facilitated by collecting data from a large sample of untrained annotators through crowdsourcing rather than from a few domain experts, made possible by relying on a coarse meaning representation.
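Schematically, and in notation of our own rather than the paper's, this search can be written as a cascaded maximisation in which each conditional distribution is estimated by an FLM with backoff over its conditioning factors:

    (S*, R*) = \arg\max_{S,R} P(S \mid d) \, P(R \mid S)
             = \arg\max_{S,R} \prod_{t=1}^{T} P(s_t \mid s_{t-1}, s_{t-2}, d) \; P(r_t \mid r_{t-1}, s_t)

where d is the input dialogue act, S = s_1 ... s_T is the sequence of semantic concepts, and R = r_1 ... r_T the corresponding realisation phrases. The particular conditioning factors shown (previous concepts, the dialogue act, the current concept) are illustrative assumptions; BAGEL's FLMs may use different factor sets and backoff paths.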

A second contribution of this paper is to use crowdsourced data to show how dialogue naturalness can be improved by learning to vary the output utterances generated for a given semantic input. Two data-driven methods for generating paraphrases in dialogue are presented: (a) sampling from the N-best list of realisations produced by BAGEL’s FLM reranker; and (b) learning a structured perceptron that predicts whether candidate realisations are valid paraphrases. We train BAGEL on a set of 1,956 utterances produced by 137 annotators, covering 10 types of dialogue acts and 128 semantic concepts in a tourist information system for Cambridge. An automated evaluation shows that BAGEL outperforms utterance-class LM baselines on this domain. A human evaluation of 600 resynthesised dialogue extracts shows that BAGEL’s FLM output produces utterances comparable to a handcrafted baseline, while the perceptron classifier performs worse. Interestingly, human judges find the system sampling from the N-best list to be more natural than a system always returning the 1-best utterance, and they are also more willing to interact with the N-best system in the future. These results suggest that capturing the large variation found in human language using data-driven methods is beneficial for dialogue interaction.
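As a rough illustration of the two methods, the Python sketches below show (a) sampling an utterance from an N-best list in proportion to its exponentiated model score, and (b) a single perceptron update on a paraphrase-validity decision. All names, the temperature parameter, the example utterances, and the binary (rather than structured) perceptron are our own simplifying assumptions, not BAGEL's exact procedures.

    import math
    import random

    def sample_from_nbest(nbest, temperature=1.0):
        """Sample one realisation from an N-best list of (utterance, log_prob)
        pairs, with probability proportional to exp(log_prob / temperature).
        temperature -> 0 approaches the deterministic 1-best system; higher
        values flatten the distribution and increase output variation."""
        max_lp = max(lp for _, lp in nbest)  # shift for numerical stability
        weights = [math.exp((lp - max_lp) / temperature) for _, lp in nbest]
        utterances = [u for u, _ in nbest]
        return random.choices(utterances, weights=weights, k=1)[0]

    def perceptron_update(weights, features, label):
        """One update of a binary perceptron deciding whether a candidate
        realisation is a valid paraphrase (+1) or not (-1). `features`
        maps feature names to values; `weights` is updated in place."""
        score = sum(weights.get(f, 0.0) * v for f, v in features.items())
        if (1 if score >= 0 else -1) != label:
            for f, v in features.items():
                weights[f] = weights.get(f, 0.0) + label * v

    # Hypothetical usage on candidate realisations of one dialogue act:
    nbest = [("X is in the city centre.", -4.1),
             ("X is located in the centre of town.", -4.6),
             ("There is a venue called X in the centre.", -5.3)]
    print(sample_from_nbest(nbest, temperature=1.0))

Sampling from the N-best list trades a small amount of per-utterance model score for variation across turns, which is consistent with the human preference for the N-best system reported above.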

Author Biography

  • Steve Young, University of Cambridge
    Engineering Department, Professor

Published

2014-12-05

Issue

Section

Long paper