DiBiMT: A Gold Evaluation Benchmark for Studying Lexical Ambiguity in Machine Translation
Abstract
Despite the remarkable progress made in the field of Machine Translation
(MT), current systems still struggle when translating ambiguous words,
especially when these express infrequent meanings. In order to
investigate and analyze the impact of lexical ambiguity on automatic
translations, several tasks and evaluation benchmarks have been proposed
over the course of the last few years. However, existing work in this
research direction suffers from critical shortcomings. Indeed, existing
evaluation datasets are not entirely manually curated, which
significantly compromises their reliability. Furthermore, the current
literature fails to provide detailed insights into the nature of the
errors produced by models when translating ambiguous words, as it lacks
a thorough manual analysis across languages.
With a view to overcoming these limitations, we propose Disambiguation
Biases in MT (DiBiMT), an entirely manually curated evaluation benchmark
for investigating disambiguation biases in eight language combinations
and assessing the ability of both commercial and non-commercial systems
to handle ambiguous words. We also examine and detail the errors
produced by models in this scenario by carrying out a manual error
analysis across all language pairs. Additionally, we perform an extensive
array of experiments aimed at studying the behavior of models when
dealing with ambiguous words.
Finally, we show the ineffectiveness of standard MT evaluation settings
for assessing the disambiguation capabilities of systems and highlight
the need for additional efforts in this research direction and for ad-hoc
testbeds such as DiBiMT. Our benchmark is available at:
https://nlp.uniroma1.it/dibimt/.