DiBiMT: A Gold Evaluation Benchmark for Studying Lexical Ambiguity in Machine Translation
Abstract
Despite the remarkable progress made in the field of Machine Translation
(MT), current systems still struggle when translating ambiguous words,
especially when these express infrequent meanings. In order to
investigate and analyze the impact of lexical ambiguity on automatic
translations, several tasks and evaluation benchmarks have been proposed
over the course of the last few years. However, existing work in this
research direction suffers from critical shortcomings. Indeed, existing
evaluation datasets are not entirely manually curated, which
significantly compromises their reliability. Furthermore, the current
literature fails to provide detailed insights into the nature of the
errors produced by models when translating ambiguous words, as it lacks
a thorough manual analysis across languages.
With a view to overcoming these limitations, we propose Disambiguation
Biases in MT (DiBiMT), an entirely manually curated evaluation benchmark
for investigating disambiguation biases in eight language combinations
and assessing the ability of both commercial and non-commercial systems
to handle ambiguous words. We also examine and detail the errors
produced by models in this scenario by carrying out a manual error
analysis across all language pairs. Additionally, we perform an extensive
array of experiments aimed at studying the behavior of models when
dealing with ambiguous words.
Finally, we show the ineffectiveness of standard MT evaluation settings
for assessing the disambiguation capabilities of systems and highlight
the need for additional efforts in this research direction and for ad-hoc
testbeds such as DiBiMT. Our benchmark is available at:
https://nlp.uniroma1.it/dibimt/.