What do Language Representations Really Represent?
Abstract
Similarly to how a large vocabulary of words can be represented by distributed word representations, the over 7,000 languages of the world can be represented with distributed language representations, thus modelling similarities between languages. This type of representation can aid both multilingual NLP modelling and computational approaches to challenging typological research questions. However, it is potentially detrimental to such research when properties of language representations are insufficiently interpreted. For instance, it has been claimed that language representations encode genetic relationships between languages (Rabinovich, Ordan, and Wintner 2017). We question this claim: we study the correlation and causal relationships between genetic, geographical, and structural distances and language representations trained on three levels of syntactic information, and show that the supposedly genetic relationship is the result of a confounding factor, namely structural similarities between languages.
Published
2024-12-05
Section
Squibs and Discussions