Languages through the looking glass of BPE compression
Abstract
Byte-pair encoding (BPE) is widely used in NLP for subword tokenization, owing to its ability to uncover redundant patterns when compressing data. Subwords discovered during the first merge operations tend to have the greatest impact on the compression of the text. This property appears universal across natural languages; nevertheless, the structural properties that enable compression are rarely analyzed cross-linguistically. We inspected these subwords closely and found that the types of recurrent patterns enabling compression are an indicator of the typological properties of the respective languages. Languages with richer inflectional morphology favor highly productive subwords in the early merges, whereas in languages with less inflectional morphology, idiosyncratic subwords are more prominent. Both types of patterns contribute to efficient compression.
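The effect of early merges on compression can be illustrated with a minimal BPE sketch (a simplified, character-level illustration under assumed toy data, not the exact implementation or corpora used in this study): each merge replaces the most frequent adjacent symbol pair with a single new symbol, and the number of symbols saved per merge tends to shrink as merges proceed.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the most frequent adjacent symbol pair in the token stream."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Toy corpus (illustrative only): inflection-like repetition of stems and suffixes.
text = "lower lowest newer newest " * 50
tokens = list(text)  # start from individual characters

lengths = [len(tokens)]
for _ in range(5):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
    lengths.append(len(tokens))

# Symbols removed by each successive merge; the earliest merge saves the most.
savings = [a - b for a, b in zip(lengths, lengths[1:])]
print(savings)
```

On this toy corpus the first merge (the pair `('w', 'e')`, shared by all four word forms) removes far more symbols than any later merge, mirroring the observation that the first merge operations contribute most to compression.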
Even though BPE is commonly regarded as linguistically unmotivated, we find patterns across languages that resemble those described in traditional typology. We thus propose a novel way of characterizing languages according to the properties of their BPE subwords, inspired by the linguistic notion of morphological productivity. This study covers 47 diverse languages, different corpora, and registers, but our approach is readily applicable to other languages, as it requires neither annotated data nor external linguistic knowledge. Our research lies at the nexus of computational linguistics and linguistic typology.