A Scalable Distributed Syntactic, Semantic and Lexical Language Model

Authors

  • Ming Tan, Wright State University
  • Wenli Zhou, Wright State University
  • Lei Zheng, Wright State University
  • Shaojun Wang, Wright State University

Abstract

This paper presents an attempt at building a large-scale distributed composite language model formed by seamlessly integrating an n-gram model, a structured language model, and probabilistic latent semantic analysis under a directed Markov random field paradigm, so as to simultaneously account for local lexical word information, mid-range sentence syntactic structure, and long-span document semantic content. The composite language model is trained by a convergent N-best list approximate EM algorithm and a follow-up EM algorithm that improves word prediction power, on corpora of up to a billion tokens stored on a supercomputer. The large-scale distributed composite language model yields drastic perplexity reductions over n-grams and achieves significantly better translation quality, measured by BLEU score and the "readability" of translations, when applied to re-ranking the N-best lists produced by a state-of-the-art parsing-based machine translation system.
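To make the re-ranking step concrete, below is a minimal illustrative sketch in Python of scoring an N-best list with a weighted combination of the decoder score and a language-model log-probability. The `Hypothesis` class, `rerank` function, and the toy length-penalty "LM" are hypothetical stand-ins for exposition only; the paper's composite model instead combines n-gram, structured-LM, and PLSA components under a directed Markov random field rather than by simple interpolation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    words: List[str]
    decoder_score: float  # log-score assigned by the MT system


def rerank(nbest: List[Hypothesis],
           lm_logprob: Callable[[List[str]], float],
           lm_weight: float = 1.0) -> List[Hypothesis]:
    """Sort hypotheses by decoder score plus weighted LM log-probability."""
    return sorted(
        nbest,
        key=lambda h: h.decoder_score + lm_weight * lm_logprob(h.words),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy placeholder LM: penalizes longer hypotheses (illustration only).
    toy_lm = lambda words: -0.5 * len(words)
    nbest = [
        Hypothesis("the cat sat on the mat".split(), decoder_score=-4.2),
        Hypothesis("the cat sat the mat on".split(), decoder_score=-4.0),
    ]
    best = rerank(nbest, toy_lm, lm_weight=0.8)[0]
    print(" ".join(best.words))
```

In practice, `lm_logprob` would be replaced by the composite model's sentence-level log-probability, and `lm_weight` would be tuned on held-out data.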

Author Biographies

  • Ming Tan, Wright State University
    Department of Computer Science and Engineering
  • Wenli Zhou, Wright State University
    Department of Computer Science and Engineering
  • Shaojun Wang, Wright State University
    Department of Computer Science and Engineering

Published

2024-12-05

Issue

Section

Long paper