CODRA: A Novel Discriminative Framework for Rhetorical Analysis
Abstract
Clauses and sentences rarely stand on their own in an actual discourse; rather the relationship
between them carry important information which allows the discourse to express a meaning
as a whole beyond the sum of its individual parts. Rhetorical analysis seeks to uncover this coherence
structure. In this article, we present CODRA — a COmplete probabilistic Discriminative
framework for performing Rhetorical Analysis in accordance with Rhetorical Structure Theory,
which posits a tree representation of a discourse.
CODRA comprises a discourse segmenter and a discourse parser. First, the discourse
segmenter, which is based on a binary classifier, identifies the elementary discourse units in a
given text. Then the discourse parser builds a discourse tree by applying an optimal parsing
algorithm to probabilities inferred from two Conditional Random Fields: one for intra-sentential
parsing and the other for multi-sentential parsing. We present two approaches to combine these
two stages of parsing effectively. By conducting a series of empirical evaluations over two
different datasets, we demonstrate that CODRA significantly outperforms the state-of-the-art,
often by a wide margin. We also show that a reranking of the k -best parse hypotheses generated
by CODRA can potentially improve the accuracy even further.