Automatically Assessing Machine Summary Content Without a Gold-standard

Authors

  • Annie Louis, Department of Computer and Information Science, University of Pennsylvania
  • Ani Nenkova, Department of Computer and Information Science, University of Pennsylvania

Abstract

The most widely adopted approaches to evaluating summary content essentially define a protocol for comparing a summary against a gold-standard. This traditional evaluation paradigm, however, falls short when human summaries are not available. We propose novel evaluation metrics that are model-free: they do not rely on a gold-standard and involve little or no human judgement. We show that quantifying the similarity between the source text and its summary with appropriately chosen measures produces summary scores that replicate human assessments very accurately. In addition, we explore the feasibility of another measure: similarity between the summary and the pool of other system summaries for the same input. This comparison with the consensus of systems also produces impressively accurate rankings of system summaries, achieving correlation with human scores above 0.9. Finally, we consider how to improve evaluation quality when only one human model summary is available as gold-standard. We introduce pseudomodels, which are system summaries deemed to contain good content according to automatic evaluation. Combining the pseudomodels with the single human model to form the gold-standard leads to higher correlations with human judgements than using the one available model alone.
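
The abstract leaves the choice of similarity measure open, so the fragment below is an illustrative sketch only: the use of Jensen-Shannon divergence over smoothed unigram distributions, the smoothing constant, and all function names are our assumptions rather than the paper's stated method. It scores a summary against its source text (the model-free measure) and against a pool of peer summaries (the consensus measure), with lower divergence read as better content.

```python
import math
from collections import Counter


def unigram_dist(text, vocab, alpha=0.5):
    """Smoothed unigram distribution of `text` over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def js_divergence(p, q):
    """Jensen-Shannon divergence between two distributions on the same vocabulary."""
    def kl(a, b):
        return sum(a[w] * math.log2(a[w] / b[w]) for w in a if a[w] > 0)
    m = {w: 0.5 * (p[w] + q[w]) for w in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def input_summary_score(source, summary):
    """Model-free score: divergence of the summary from its source text (lower is better)."""
    vocab = set(source.lower().split()) | set(summary.lower().split())
    return js_divergence(unigram_dist(source, vocab), unigram_dist(summary, vocab))


def consensus_score(summary, peer_summaries):
    """Average divergence from the pool of other systems' summaries (lower means closer to the consensus)."""
    return sum(input_summary_score(peer, summary) for peer in peer_summaries) / len(peer_summaries)
```

In such a setup, system-level rankings would presumably be obtained by averaging either score over all inputs and correlating the resulting ranking with human judgements, which is the kind of comparison the correlations quoted in the abstract refer to.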

Author Biographies

  • Annie Louis, Department of Computer and Information Science, University of Pennsylvania

    Research Assistant

  • Ani Nenkova, Department of Computer and Information Science, University of Pennsylvania

    Assistant Professor

Published

2024-12-05

Section

Short paper