Common Flaws in Running Human Evaluation Experiments in NLP

Authors

Abstract

While conducting a coordinated set of repetitions of human evaluation experiments, we discovered flaws in the running of at least one experiment from every paper that we systematically selected for reproduction. We describe these flaws, as well as some we found in other papers; they include bugs in code (such as loading the wrong outputs to evaluate), failure to follow standard scientific procedures (e.g. ad hoc exclusion of participants and responses), and mistakes in reported numerical results (e.g. reported numbers not matching experimental data). If these problems are widespread, this would have worrying implications for the rigour of NLP evaluations. We then discuss ways of reducing such flaws, including better code development practices, increased testing and piloting by authors, and post-publication discussion of papers.

Published

2024-11-10

Section

Squibs and Discussions