Discriminative Bilingual Lexicon Induction

Authors

  • Ann Irvine, Center for Language and Speech Processing, Johns Hopkins University
  • Chris Callison-Burch, Computer and Information Science Department, University of Pennsylvania

Abstract

Bilingual lexicon induction is the task of inducing word translations from monolingual corpora in two languages. We introduce a novel discriminative approach to bilingual lexicon induction. Our discriminative model is capable of combining a wide variety of features, which individually provide only weak indications of translation equivalence. When feature weights are discriminatively set, these signals produce dramatically higher translation quality than previous approaches that combined signals in an unsupervised fashion (e.g., using mean reciprocal rank). We present experiments on a wide range of languages and data sizes. We examine translation into English from 25 foreign languages: Albanian, Azeri, Bengali, Bosnian, Bulgarian, Cebuano, Gujarati, Hindi, Hungarian, Indonesian, Latvian, Nepali, Romanian, Serbian, Slovak, Somali, Spanish, Swedish, Tamil, Telugu, Turkish, Ukrainian, Uzbek, Vietnamese, and Welsh. Rather than testing solely on high-frequency words, as previous research has done, we also test on low-frequency words, so that our results are more relevant to statistical machine translation, where systems typically lack translations of rare words that fall outside of their training data. We systematically explore a wide range of features and phenomena that affect the quality of the translations discovered by bilingual lexicon induction. We give illustrative examples of the highest-ranking translations for orthogonal signals of translation equivalence such as contextual similarity and temporal similarity. We analyze the effects of frequency and burstiness, and of the sizes of the seed bilingual dictionaries and the monolingual training corpora. We directly compare our model's performance against the state-of-the-art matching canonical correlation analysis (MCCA) algorithm used by Haghighi et al. (2008). Our algorithm achieves an accuracy of 42% versus MCCA's 15%.
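The abstract contrasts unsupervised combination of weak monolingual signals (e.g., aggregating per-signal rankings with mean reciprocal rank) against a discriminative combination whose weights are learned from a seed bilingual dictionary. The sketch below is only a toy illustration of that contrast, not the authors' implementation: every signal name, candidate word, score, and seed entry in it is hypothetical, and the perceptron learner stands in for whatever discriminative model one might actually train.

```python
# Toy illustration (hypothetical data, not the paper's implementation):
# contrast unsupervised aggregation of translation signals via mean
# reciprocal rank (MRR) with a discriminative linear combination whose
# weights are learned from a small seed dictionary.

import random

# Hypothetical monolingual similarity signals; higher score = more similar.
SIGNALS = ["contextual_sim", "temporal_sim", "orthographic_sim", "topic_sim"]


def mrr_combine(scores_by_signal):
    """Unsupervised combination: rank candidates under each signal
    separately, then score each candidate by the mean of its reciprocal
    ranks across signals."""
    combined = {}
    for signal_scores in scores_by_signal.values():
        ranked = sorted(signal_scores, key=signal_scores.get, reverse=True)
        for rank, cand in enumerate(ranked, start=1):
            combined[cand] = combined.get(cand, 0.0) + 1.0 / rank
    n = len(scores_by_signal)
    return {cand: total / n for cand, total in combined.items()}


def train_perceptron(training_pairs, epochs=10, lr=0.1):
    """Discriminative combination: learn one weight per signal from seed
    dictionary entries (label +1) and sampled non-translations (label -1)
    using a simple perceptron update."""
    weights = {s: 0.0 for s in SIGNALS}
    for _ in range(epochs):
        random.shuffle(training_pairs)
        for features, label in training_pairs:
            activation = sum(weights[s] * features[s] for s in SIGNALS)
            if label * activation <= 0:  # misclassified -> update weights
                for s in SIGNALS:
                    weights[s] += lr * label * features[s]
    return weights


def discriminative_score(weights, features):
    return sum(weights[s] * features[s] for s in SIGNALS)


if __name__ == "__main__":
    # Hypothetical English candidates for one foreign source word.
    candidates = {
        "house":  {"contextual_sim": 0.8, "temporal_sim": 0.6, "orthographic_sim": 0.1, "topic_sim": 0.7},
        "garden": {"contextual_sim": 0.7, "temporal_sim": 0.2, "orthographic_sim": 0.1, "topic_sim": 0.6},
        "horse":  {"contextual_sim": 0.3, "temporal_sim": 0.3, "orthographic_sim": 0.9, "topic_sim": 0.2},
    }
    per_signal = {s: {c: f[s] for c, f in candidates.items()} for s in SIGNALS}
    print("MRR combination:", mrr_combine(per_signal))

    # Tiny hypothetical training set derived from a seed dictionary.
    seed = [
        ({"contextual_sim": 0.9, "temporal_sim": 0.7, "orthographic_sim": 0.2, "topic_sim": 0.8}, +1),
        ({"contextual_sim": 0.2, "temporal_sim": 0.1, "orthographic_sim": 0.8, "topic_sim": 0.1}, -1),
        ({"contextual_sim": 0.8, "temporal_sim": 0.5, "orthographic_sim": 0.1, "topic_sim": 0.9}, +1),
        ({"contextual_sim": 0.3, "temporal_sim": 0.2, "orthographic_sim": 0.2, "topic_sim": 0.3}, -1),
    ]
    w = train_perceptron(seed)
    ranked = sorted(candidates, key=lambda c: discriminative_score(w, candidates[c]), reverse=True)
    print("Discriminative ranking:", ranked)
```

The point of the contrast is that MRR weights every signal equally, whereas the learned weights can downweight signals that are unreliable for a given language pair and data size.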

Published

2024-12-05

Section

Long paper