Automatic Identification and Production of Related Words for Historical Linguistics
Abstract
Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to be used in historical linguistics, to assist researchers and domain experts in the study of language evolution.
Firstly, we address cognate identification. We propose an algorithm for extracting cognates from electronic dictionaries that contain etymology-related information. Having built a dataset of related words, we further develop machine learning methods for cognate identification. Words undergo various changes when entering new languages. Based on the assumption that these linguistic changes follow certain rules, we propose a method for automatically detecting pairs of cognates employing an orthographic alignment method. We use aligned subsequences as features for machine learning algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates.
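As a rough illustration of how aligned subsequences can be turned into features, the following sketch aligns two word forms with a standard Needleman-Wunsch global alignment and collects the character n-grams around each mismatch. The function names (`align`, `mismatch_features`), the scoring values, and the n-gram window are assumptions made for this example; the abstract does not specify the alignment algorithm or the exact feature scheme.

```python
from typing import List, Tuple

GAP = "-"  # gap symbol used in the aligned strings (assumed convention)


def align(a: str, b: str, match: int = 1, mismatch: int = -1,
          gap: int = -1) -> Tuple[str, str]:
    """Global (Needleman-Wunsch) alignment of two strings.

    The scoring values are illustrative defaults, not taken from the source.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a: List[str] = []
    out_b: List[str] = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def mismatch_features(al_a: str, al_b: str, n: int = 2) -> List[str]:
    """Collect aligned n-gram pairs that cover at least one mismatch."""
    feats = set()
    for i in range(len(al_a)):
        if al_a[i] != al_b[i]:
            for start in range(max(0, i - n + 1), i + 1):
                end = start + n
                if end <= len(al_a):
                    feats.add(f"{al_a[start:end]}>{al_b[start:end]}")
    return sorted(feats)


if __name__ == "__main__":
    # Italian "notte" and Spanish "noche" both descend from Latin "noctem".
    al_it, al_es = align("notte", "noche")
    print(al_it, al_es, mismatch_features(al_it, al_es))
```

Features of this kind could be fed, for instance as a bag of strings, into any standard classifier to separate cognate pairs from non-cognate pairs.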
Secondly, we extend the method to a finer-grained level, to identify the type of relationship between words. Discriminating between cognates and borrowings provides a deeper insight into the history of a language and allows a better characterization of language relatedness. We show that orthographic features have discriminative power and we analyze the underlying linguistic factors that prove relevant in the classification task. To our knowledge, this is the first attempt of this kind.
Thirdly, we address proto-word reconstruction, which is central to the study of language evolution. It consists of recreating the words of an ancient language from their forms in its modern daughter languages. Given modern word forms in multiple languages, we infer the form of their common ancestor. Our approach relies on the regularities that occurred when the Latin words entered the modern languages. We leverage information from all modern languages, building an ensemble system for proto-word reconstruction. We use conditional random fields for sequence labeling, and we also conduct preliminary experiments with recurrent neural networks. We apply our method to multiple datasets and show that it improves on previous results while requiring less input data, which is essential in historical linguistics, where resources are generally scarce.
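One simple way to picture an ensemble over several modern languages is to let each language-specific model propose scored candidate proto-forms and then sum the scores per candidate. The sketch below does exactly that. It is an assumption for illustration only: the abstract does not say how the ensemble combines its members, and the function name `combine_reconstructions`, the optional per-language weights, and the candidate scores in the example are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def combine_reconstructions(
    per_language: Dict[str, List[Tuple[str, float]]],
    weights: Optional[Dict[str, float]] = None,
) -> List[Tuple[str, float]]:
    """Rank candidate proto-forms by the weighted sum of per-language scores.

    per_language maps a language code to (candidate form, score) pairs, as
    might be produced by one sequence labeler per modern language
    (hypothetical input format).
    """
    totals: Dict[str, float] = defaultdict(float)
    for lang, candidates in per_language.items():
        w = 1.0 if weights is None else weights.get(lang, 1.0)
        for form, score in candidates:
            totals[form] += w * score
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    # Made-up candidate scores, for illustration only.
    candidates = {
        "it": [("noctem", 0.5), ("nottem", 0.25)],
        "es": [("noctem", 0.5), ("nochem", 0.25)],
    }
    print(combine_reconstructions(candidates)[0][0])
```

Summing scores rewards forms that several languages agree on, which is the intuition behind pooling evidence from all daughter languages. Other combination schemes, such as position-wise voting over aligned outputs, would fit the same interface.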