What do Language Representations Really Represent?

Authors

  • Johannes Bjerva, Department of Computer Science, University of Copenhagen
  • Robert Östling, Department of Linguistics, Stockholm University
  • Maria Han Veiga, Institute of Computational Science, University of Zurich
  • Jörg Tiedemann, Department of Digital Humanities, University of Helsinki
  • Isabelle Augenstein, Department of Computer Science, University of Copenhagen

Abstract

Just as a large vocabulary of words can be represented by distributed word representations, the more than 7,000 languages of the world can be represented by distributed language representations, which model similarities between languages. Such representations can aid both multilingual NLP modelling and computational approaches to challenging typological research questions. However, such research can suffer when the properties of language representations are insufficiently interpreted. For instance, it has been claimed that language representations encode genetic relationships between languages (Rabinovich, Ordan, and Wintner 2017). We question this claim: we study the correlation and causal relationships of genetic, geographical, and structural distances with language representations trained on three levels of syntactic information, and show that the supposedly genetic relationship results from a confounding factor, namely structural similarities between languages.

Published

2024-12-05

Section

Squibs and Discussions