Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Abstract
Languages may share universal features at a deep, abstract level, but the structures found in real-world, surface-level natural language vary significantly. This variation makes it challenging to transfer systems across languages or to develop Natural Language Processing (NLP) systems that apply to a wide range of languages. As a result, there is a vast geographical and linguistic disparity in the availability of NLP technology, with the majority of approaches developed for a handful of resource-rich languages, leaving many other languages behind. Understanding linguistic variation in a structured and systematic way is essential for developing effective multilingual NLP applications and thus for making language technology accessible to the wider world.
The field of linguistic typology studies and classifies the world’s languages according to their structural and functional features, with the aim of explaining both the common properties and the structural diversity of languages. In NLP, typological information has provided valuable guidance for multilingual tasks, as shown most clearly in the areas of morphosyntax and phonology. The investigated approaches include transfer from resource-rich to resource-poor languages (Padó and Lapata 2005; Khapra et al. 2011; Das and Petrov 2011; Täckström, McDonald, and Uszkoreit 2012), joint multilingual learning (Snyder 2010; Cohen, Das, and Smith 2011; Navigli and Ponzetto 2012), and the development of universal models (de Marneffe et al. 2014; Nivre et al. 2016). Multilingual models can even outperform the best monolingual models (Ammar et al. 2016a; Tsvetkov et al. 2016; Adel, Vu, and Schultz 2013) by exploiting the systematic similarities and differences across languages. Such models can, in turn, help and inform research on linguistic typology itself, facilitating the data-driven induction of typological knowledge. Despite this clear potential to benefit each other, the two fields have largely continued to develop independently, co-existing as two separate research communities. The goal of this paper is to provide a comprehensive review of research at the intersection of multilingual NLP and linguistic typology. By offering an in-depth analysis of available typological information resources and of approaches that integrate typological information into NLP, this review can not only encourage further advances in the two fields but also help build a bridge between the two communities for mutual benefit.
We will considerably expand on the two existing short surveys of this area (Bender 2016; O’Horan et al. 2016), covering the field in both greater depth and greater breadth. We will provide more detailed background on linguistic typology and multilingual NLP and will offer an extensive analysis of relevant methods, their feature sets, and their performance and limitations. We will also cover areas neglected thus far, most notably typological semantics, and will provide an analysis of how typological constraints can be integrated into NLP methods from a machine learning perspective.