Automatic Identification and Production of Related Words for Historical Linguistics

Alina Maria Ciobanu; Liviu P Dinu

Authors

Alina Maria Ciobanu, Univ. of Bucharest, Faculty of Mathematics and Computer Science, Human Languages Technologies Research Center
Liviu P Dinu, Univ. of Bucharest, Faculty of Mathematics and Computer Science, Human Languages Technologies Research Center

Abstract

Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to be used in historical linguistics, to assist researchers and domain experts in the study of language evolution.
Firstly, we address cognate identification. We propose an algorithm for extracting cognates from electronic dictionaries that contain etymology-related information. Having built a dataset of related words, we further develop machine learning methods for cognate identification. Words undergo various changes when entering new languages. Based on the assumption that these linguistic changes follow certain rules, we propose a method for automatically detecting pairs of cognates employing an orthographic alignment method. We use aligned subsequences as features for machine learning algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates.
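To make the idea of using aligned subsequences as features concrete, the following is a minimal sketch, not the authors' implementation. It assumes a Needleman-Wunsch global alignment with unit scores (the abstract only says "orthographic alignment"), and the function names `align` and `mismatch_features`, the scoring parameters, the `$` boundary marker, and the three-character window are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

GAP = "-"  # hypothetical gap symbol used in the aligned strings


def align(a: str, b: str, match: int = 1, mismatch: int = -1,
          gap: int = -1) -> Tuple[str, str]:
    """Globally align two words (Needleman-Wunsch, assumed unit scores)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def mismatch_features(x: str, y: str) -> List[str]:
    """Emit one 'source>target' feature per mismatching aligned position.

    Each feature is the three-character window around the mismatch in both
    aligned words, with '$' marking word boundaries (an assumed convention).
    """
    assert len(x) == len(y), "inputs must be aligned to the same length"
    px, py = f"${x}$", f"${y}$"
    return [f"{px[i - 1:i + 2]}>{py[i - 1:i + 2]}"
            for i in range(1, len(px) - 1) if px[i] != py[i]]


if __name__ == "__main__":
    # Illustrative word pair; the resulting strings would feed a classifier.
    src, tgt = align("noapte", "notte")
    print(src, tgt, mismatch_features(src, tgt))
```

In such a setup, the feature strings from many word pairs could be fed as sparse binary or count features to a standard classifier, which then weights the recurring orthographic correspondences when separating cognates from non-cognates.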
Secondly, we extend the method to a finer-grained level, to identify the type of relationship between words. Discriminating between cognates and borrowings provides a deeper insight into the history of a language and allows a better characterization of language relatedness. We show that orthographic features have discriminative power and we analyze the underlying linguistic factors that prove relevant in the classification task. To our knowledge, this is the first attempt of this kind.
Thirdly, we address proto-word reconstruction, which is central to the study of language evolution. It consists of recreating the words in an ancient language from its modern daughter languages. Having modern word forms in multiple languages, we infer the form of their common ancestors. Our approach relies on the regularities that occurred when the Latin words entered the modern languages. We leverage information from all modern languages, building an ensemble system for proto-word reconstruction. We use conditional random fields for sequence labeling, but we conduct preliminary experiments with recurrent neural networks as well. We apply our method to multiple datasets, showing that it improves on previous results while also requiring less input data, which is essential in historical linguistics, where resources are generally scarce.

Author Biographies

Alina Maria Ciobanu, Univ. of Bucharest, Faculty of Mathematics and Computer Science, Human Languages Technologies Research Center

PhD student, Dept. Computer Science, Univ. of Bucharest

Liviu P Dinu, Univ. of Bucharest, Faculty of Mathematics and Computer Science, Human Languages Technologies Research Center

Professor, Dept. Computer Science, Univ. of Bucharest
