The Taxonomy of Writing Systems: How to Measure how Logographic a System is

Authors

  • Richard Sproat, Google
  • Alexander Gutkin, Google

Abstract

Taxonomies of writing systems since Gelb (1952) have classified systems based on what the written symbols represent: if they represent words or morphemes, they are logographic; if syllables, syllabic; if segments, alphabetic; etc. Sproat (2000) and Rogers (2005) broke with tradition by splitting the logographic and phonographic aspects into two dimensions, with logography being a graded rather than a categorical distinction. A system could be syllabic and highly logographic, or alphabetic and mostly non-logographic. This accords better with how writing systems actually work, but neither author proposed a method for measuring logography.

In this paper we show that a measure proposed by Penn and Choma (2006) does not work. We then propose a novel measure that uses an attention-based sequence-to-sequence model trained to predict the spelling of a token from its pronunciation in context. In an ideal phonographic system, the model should need to attend only to the current token in order to compute how to spell it, and this would show in the activations of the attention matrix. In contrast, in a logographic system, where a given pronunciation might correspond to several different spellings, the model would need to attend to a broader context. The ratio of the activation outside the token to the total activation forms the basis of our measure. We compare this with a simple lexical measure and an entropic measure, and argue that our attention-based measure accords best with intuition about how logographic various systems are. We suggest that by quantifying the concept, we come to a better understanding of what it means.
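
To make the ratio concrete, the following is a minimal sketch of the raw quantity described above, not the measure as finally defined in the paper. It assumes the attention weights for one target token are available as a matrix whose rows are output (spelling) steps and whose columns are input (pronunciation) positions, and that the input positions belonging to the token are known as a half-open span. The function name `outside_attention_ratio` and the data layout are illustrative assumptions.

```python
from typing import Sequence, Tuple


def outside_attention_ratio(
    attention: Sequence[Sequence[float]],
    token_span: Tuple[int, int],
) -> float:
    """Share of attention mass that falls outside the target token.

    attention:  rows are output steps for the token being spelled,
                columns are input positions of the whole context
                (assumed layout, not taken from the paper).
    token_span: half-open (start, end) range of input positions that
                belong to the token itself.
    """
    start, end = token_span
    total = 0.0
    outside = 0.0
    for row in attention:
        for position, weight in enumerate(row):
            total += weight
            # Count the weight as "outside" when the attended input
            # position is not part of the token's own pronunciation.
            if not (start <= position < end):
                outside += weight
    if total == 0.0:
        raise ValueError("attention matrix has no activation")
    return outside / total


# Toy example: three input positions, the token occupies position 1.
toy_attention = [
    [0.25, 0.5, 0.25],  # first output step looks partly at the context
    [0.0, 1.0, 0.0],    # second output step looks only at the token
]
ratio = outside_attention_ratio(toy_attention, (1, 2))
print(ratio)
```

Under this sketch, a value near 0 corresponds to the ideal phonographic case, where the token's own pronunciation suffices, and larger values indicate greater reliance on context. How such per-token values are aggregated over a corpus is left open here.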

Published

2024-11-22