Source Language Adaptation Approaches for Resource-Poor Machine Translation
Abstract
Most of the world's languages are resource-poor for statistical machine translation; still, many of them are related to some resource-rich language. Thus, we propose three novel, language-independent approaches to source language adaptation for resource-poor statistical machine translation. Specifically, we build improved statistical machine translation models from a resource-poor language POOR into a target language TGT by adapting and using a large bi-text for a related resource-rich language RICH and the same target language TGT. We assume a small POOR-TGT bi-text, from which we learn word-level and phrase-level paraphrases and cross-lingual morphological variants between the resource-rich and the resource-poor language. Our work can serve as a practical guideline for building machine translation systems for other resource-poor languages.
Our experiments for Indonesian/Malay–English translation show that using the large adapted resource-rich bi-text yields an improvement of 7.26 BLEU points over the unadapted one and of 3.09 BLEU points over the original small bi-text. Moreover, combining the small POOR-TGT bi-text with the adapted bi-text outperforms the corresponding combinations with the unadapted bi-text by 1.93–3.25 BLEU points. We also demonstrate the applicability of our approaches to other languages and domains.