Levenshtein distances fail to accurately identify language relationships

Authors

  • Simon James Greenhill, Computational Evolution Group, University of Auckland, New Zealand

Abstract

The Levenshtein distance is a simple distance metric derived from the number of edit operations needed to transform one string into another. This metric has received recent attention as a means of automatically classifying languages into subgroups. In this paper, I test the performance of the Levenshtein distance for classifying languages by subsampling three language subsets from a large database of Austronesian languages. Comparing the classification proposed by the Levenshtein distance to that of the comparative method shows that the Levenshtein classification is correct only 40% of the time. Standardising the orthography increases the performance, but only to a maximum of 65% accuracy. There is a very weak correlation between the Levenshtein distance and phylogenetic path lengths, indicating that the Levenshtein distance fails to discriminate between homology and chance similarity across distantly related languages. This poor performance suggests the need for more linguistically nuanced automated methods for language classification tasks.
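For readers unfamiliar with the metric, the following is a minimal Python sketch of the edit-distance computation described in the first sentence of the abstract. It is an illustration only and is not taken from the paper: the function names are hypothetical, unit costs are assumed for insertion, deletion and substitution, and the length normalisation in the second function is one common convention, assumed here rather than confirmed by the abstract.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b (unit costs assumed)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalised_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer string.
    This normalisation is an assumption made for illustration."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

With this sketch, `levenshtein("kitten", "sitting")` counts two substitutions and one insertion. The computation sees only characters, so it has no way to tell whether two similar word forms are cognate or alike by chance, which is the limitation the abstract points to.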

Published

2024-12-05

Issue

Section

Short paper