Statistical Models for Unsupervised, Semi-supervised and Supervised Transliteration Mining

Authors

  • Hassan Sajjad, Qatar Computing Research Institute
  • Helmut Schmid, Ludwig Maximilian University of Munich
  • Alexander Fraser, Ludwig Maximilian University of Munich
  • Hinrich Schuetze, Ludwig Maximilian University of Munich

Abstract

We present two novel ideas for mining transliteration pairs in a fully unsupervised, language-independent way from noisy data that contains both transliteration and non-transliteration word pairs. The first method is iterative: in each iteration, a transliteration model is trained on the noisy data. The data is then cleaned by removing a small number of word pairs that are very unlikely to be transliterations according to the model. In the next iteration, the model is retrained on the reduced data and the data is filtered again. This process is repeated until a clean list of transliteration word pairs is left. The second method uses a novel generative model. It efficiently mines transliteration pairs in a consistent fashion in three very different settings: unsupervised, semi-supervised and supervised transliteration mining. The new model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on the noisy unlabelled data using the EM algorithm. During training, the transliteration sub-model learns to generate transliteration pairs while the fixed non-transliteration sub-model generates the noise pairs. After training, the unlabelled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our two transliteration mining systems on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our second system outperforms all semi-supervised and supervised systems that participated in the shared task. On parallel corpora with less than 2% transliteration pairs, our system achieves up to 86.7% F-measure.
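
The interpolation and posterior-based disambiguation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each word pair already has a probability under a transliteration sub-model and under a noise sub-model, holds both fixed, and re-estimates only the interpolation weight with EM (in the paper's model the transliteration sub-model is itself learned during training). The function names, the toy word pairs, the toy probabilities and the 0.5 threshold are all hypothetical.

```python
def posterior_noise(p_translit, p_noise, lam):
    """Posterior probability that a pair was generated by the noise sub-model.

    The mixture is (1 - lam) * p_translit + lam * p_noise.
    """
    noise_part = lam * p_noise
    total = (1.0 - lam) * p_translit + noise_part
    return noise_part / total if total > 0.0 else 0.0


def em_mixture_weight(scores, lam=0.5, iterations=20):
    """Re-estimate the interpolation weight with EM.

    Simplification (assumption): the sub-model probabilities in `scores`
    stay fixed, so only the weight `lam` is updated.
    """
    for _ in range(iterations):
        # E-step: posterior of the noise sub-model for every pair.
        posteriors = [posterior_noise(pt, pn, lam) for pt, pn in scores]
        # M-step: the new weight is the average noise posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam


def mine(pairs, scores, lam, threshold=0.5):
    """Keep pairs whose noise posterior is below the (hypothetical) threshold."""
    return [
        pair
        for pair, (pt, pn) in zip(pairs, scores)
        if posterior_noise(pt, pn, lam) < threshold
    ]


# Toy data: hypothetical word pairs with made-up sub-model probabilities
# given as (p_translit, p_noise).
pairs = [("london", "landan"), ("paris", "peris"), ("house", "ghar")]
scores = [(1e-3, 1e-6), (2e-3, 1e-6), (1e-9, 1e-6)]

lam = em_mixture_weight(scores)
mined = mine(pairs, scores, lam)
print(lam, mined)
```

In this toy run the third pair is far more probable under the noise sub-model than under the transliteration sub-model, so its noise posterior is high and it is filtered out, while the first two pairs are kept.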

Author Biography

  • Hassan Sajjad, Qatar Computing Research Institute
    Scientist in the Arabic Language Technology group at QCRI

Published

2024-12-05

Issue

Section

Long paper