Reflections on the Penn Discourse TreeBank and its relatives

Bonnie Webber; Rashmi Prasad; Aravind Joshi

Authors

Bonnie Webber, School of Informatics, University of Edinburgh
Rashmi Prasad, Center for Biomedical Data and Language Processing, University of Wisconsin-Milwaukee
Aravind Joshi, Dept of Computer & Information Science, University of Pennsylvania

Abstract

The Penn Discourse Treebank (PDTB) was released to the public in 2008. It remains the largest manually annotated corpus of discourse relations to date. The related decisions to (1) require lexical grounding for each discourse relation (creating such grounding for relations that are otherwise implicit) and (2) avoid commitment to any structure above these lexically grounded relations (in this sense, remaining "theory neutral") have not only allowed the development of basic software for recognizing discourse relations, their component elements and their senses, but have also spawned similar annotation efforts in other languages and genres.

By highlighting aspects of the PDTB and its relation to other levels of annotation and similar annotation efforts in other languages and genres, this paper aims to demonstrate benefits that are not specific to the PDTB, namely the value of (1) recording the evidence for discourse relations; (2) allowing for both multiple relations and multiple sense labels; (3) annotating attribution in tandem with discourse relations; (4) employing an annotation workflow that allows annotators to attend to the flow of discourse; and (5) recognizing decisions made in annotating discourse relations in other languages and genres.
