Reflections on the Penn Discourse TreeBank and its relatives

Authors

  • Bonnie Webber, School of Informatics, University of Edinburgh
  • Rashmi Prasad, Center for Biomedical Data and Language Processing, University of Wisconsin-Milwaukee
  • Aravind Joshi, Dept of Computer & Information Science, University of Pennsylvania

Abstract

The Penn Discourse Treebank (PDTB) was released to the public in 2008. It remains the largest manually annotated corpus of discourse relations to date. Two related decisions shaped it: (1) requiring lexical grounding for each discourse relation (creating such grounding for relations that are otherwise implicit) and (2) avoiding commitment to any structure above these lexically grounded relations (in this sense, remaining "theory neutral"). These decisions have not only allowed the development of basic software for recognizing discourse relations, their component elements and their senses, but have also spawned similar annotation efforts in other languages and genres.


By highlighting aspects of the PDTB and its relation to other levels of annotation and similar annotation efforts in other languages and genres, this paper aims to demonstrate benefits that are not specific to the PDTB, namely the value of (1) recording the evidence for discourse relations; (2) allowing for both multiple relations and multiple sense labels; (3) annotating attribution in tandem with discourse relations; (4) employing an annotation workflow that allows annotators to attend to the flow of discourse; and (5) recognizing decisions made in annotating discourse relations in other languages and genres.

Published

2024-12-05

Section

Long paper