Reflections on the Penn Discourse TreeBank and its relatives
Abstract
The Penn Discourse Treebank (PDTB) was released to the public in 2008. It remains the largest manually annotated corpus of discourse relations to date. The related decisions to (1) require lexical grounding for each discourse relation (creating such grounding for relations that are otherwise implicit) and (2) avoid commitment to any structure above these lexically grounded relations (in this sense, remaining ``theory neutral'') have not only allowed the development of basic software for recognizing discourse relations, their component elements and their senses, but have also spawned similar annotation efforts in other languages and genres.
By highlighting aspects of the PDTB and its relation to other levels of annotation and similar annotation efforts in other languages and genres, this paper aims to demonstrate benefits that are not specific to the PDTB --- namely, the value of (1) recording the evidence for discourse relations; (2) allowing for both multiple relations and multiple sense labels; (3) annotating attribution in tandem with discourse relations; (4) employing an annotation workflow that allows annotators to attend to the flow of discourse; and (5) recognizing decisions made in annotating discourse relations in other languages and genres.