Automatic Identification and Production of Related Words for Historical Linguistics

Authors

  • Alina Maria Ciobanu, Univ. of Bucharest, Faculty of Mathematics and Computer Science, Human Languages Technologies Research Center
  • Liviu P Dinu, Univ. of Bucharest, Faculty of Mathematics and Computer Science, Human Languages Technologies Research Center

Abstract

Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to be used in historical linguistics, to assist researchers and domain experts in the study of language evolution.
Firstly, we address cognate identification. We propose an algorithm for extracting cognates from electronic dictionaries that contain etymology-related information. Having built a dataset of related words, we further develop machine learning methods for cognate identification. Words undergo various changes when entering new languages. Based on the assumption that these linguistic changes follow certain rules, we propose a method for automatically detecting pairs of cognates employing an orthographic alignment method. We use aligned subsequences as features for machine learning algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates.
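The idea of using aligned subsequences as features can be illustrated with a minimal sketch. It assumes a Needleman-Wunsch global alignment with unit scores and character n-grams taken around mismatched positions; the function names, scoring values, and n-gram size are illustrative assumptions, not the configuration used in the article.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two strings (Needleman-Wunsch); '-' marks a gap.

    The scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def mismatch_ngrams(x, y, n=2):
    """Return 'src>tgt' features for aligned n-gram windows that differ."""
    feats = []
    for k in range(len(x) - n + 1):
        src, tgt = x[k:k + n], y[k:k + n]
        if src != tgt:
            feats.append(f"{src}>{tgt}")
    return feats


# Example pair: Romanian "noapte" and Italian "notte" (both meaning "night").
x, y = align("noapte", "notte")
features = mismatch_ngrams(x, y, n=2)
```

Features of this kind could then be counted per word pair and passed to any standard classifier; which classifier and feature weighting work best is an empirical question that this sketch leaves open.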
Secondly, we extend the method to a finer-grained level, to identify the type of relationship between words. Discriminating between cognates and borrowings provides a deeper insight into the history of a language and allows a better characterization of language relatedness. We show that orthographic features have discriminative power and we analyze the underlying linguistic factors that prove relevant in the classification task. To our knowledge, this is the first attempt of this kind.
Thirdly, we address proto-word reconstruction, which is central to the study of language evolution. It consists of recreating the words of an ancient language from their reflexes in its modern daughter languages. Given modern word forms in multiple languages, we infer the form of their common ancestor. Our approach relies on the regularities in the changes that Latin words underwent when they entered the modern languages. We leverage information from all modern languages, building an ensemble system for proto-word reconstruction. We use conditional random fields for sequence labeling, and we also conduct preliminary experiments with recurrent neural networks. We apply our method to multiple datasets and show that it improves on previous results while requiring less input data, which is essential in historical linguistics, where resources are generally scarce.
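One way to picture how an ensemble might combine evidence from several modern languages is the sketch below. It assumes each per-language model returns a ranked list of candidate proto-forms and fuses the lists by weighted reciprocal rank; this fusion rule, the function name `fuse_candidates`, and the candidate strings are hypothetical illustrations, not the ensemble described in the article.

```python
from collections import defaultdict


def fuse_candidates(nbest_by_language, weights=None):
    """Merge ranked candidate lists into one ranking.

    Each candidate receives weight / rank from every language that
    proposes it (reciprocal-rank fusion, an assumption of this sketch).
    Ties are broken alphabetically so the result is deterministic.
    """
    scores = defaultdict(float)
    for lang, candidates in nbest_by_language.items():
        w = 1.0 if weights is None else weights.get(lang, 1.0)
        for rank, form in enumerate(candidates, start=1):
            scores[form] += w / rank
    return sorted(scores, key=lambda form: (-scores[form], form))


# Hypothetical n-best outputs from three per-language models.
nbest = {
    "ro": ["form_a", "form_b"],
    "it": ["form_b", "form_a"],
    "es": ["form_b"],
}
ranking = fuse_candidates(nbest)
```

Here `form_b` accumulates 0.5 + 1 + 1 = 2.5 and `form_a` accumulates 1 + 0.5 = 1.5, so `form_b` is ranked first. Per-language weights could reflect how reliable each language is as evidence, though choosing them would require held-out data.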

Author Biographies

  • Alina Maria Ciobanu, Univ. of Bucharest, Faculty of Mathematics and Computer Science, Human Languages Technologies Research Center
    PhD student, Dept. Computer Science, Univ. of Bucharest
  • Liviu P Dinu, Univ. of Bucharest, Faculty of Mathematics and Computer Science, Human Languages Technologies Research Center
    Professor, Dept. Computer Science, Univ. of Bucharest

Published

2024-12-05

Section

Long paper