Using Semantics for Granularities of Tokenization
Abstract
Depending on downstream applications, it is advisable to extend the notion of tokenization from low-level character-based token boundary detection to identifying meaningful and useful language units. This entails both identifying units composed of several single words that form a multiword expression (MWE) and splitting single-word compounds into their meaningful parts. In this article, we introduce unsupervised and knowledge-free methods for these two tasks. The main novelty of our research is that our methods are primarily based on distributional similarity, of which we use two flavors: a sparse count-based and a dense neural distributional semantic model.
First, we introduce DRUID, a method for detecting MWEs. The evaluation on MWE-annotated datasets in two languages and on newly extracted evaluation datasets for 32 languages shows that DRUID compares favorably to previous methods that do not use distributional information.
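As a rough illustration of the sparse count-based flavor mentioned above (this is not the DRUID scoring function itself; the toy corpus, the window size, and the pre-joined candidate hot_dog are assumptions made purely for the example), the following sketch builds context-count vectors and compares a multiword candidate to single words via cosine similarity:

```python
# Toy sketch of a sparse count-based distributional model. This is NOT the
# DRUID algorithm; it only illustrates how a multiword candidate (pre-joined
# here as "hot_dog") can be compared to single words through context-vector
# similarity. Corpus, window size, and candidate are illustrative assumptions.
from collections import Counter, defaultdict
from math import sqrt

corpus = [
    "she ate a hot_dog at the stadium".split(),
    "he bought a sandwich at the stadium".split(),
    "the hot_dog stand sold a sandwich".split(),
    "a hot day at the beach".split(),
]

def context_vectors(sentences, window=2):
    """Count co-occurring words within a symmetric window for every token."""
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, tok in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[tok][sent[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    num = sum(u[k] * v[k] for k in u if k in v)
    den = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

vecs = context_vectors(corpus)
print(cosine(vecs["hot_dog"], vecs["sandwich"]))  # higher: behaves like a noun
print(cosine(vecs["hot_dog"], vecs["hot"]))       # lower: unlike its component
```

In this toy setting, the candidate patterns more closely with the single noun sandwich than with its component hot, which is the kind of distributional evidence alluded to above.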
Second, we present SECOS, an algorithm for decompounding close compounds. In an evaluation on four dedicated decompounding datasets across four languages and on datasets extracted from Wiktionary for 14 languages, we demonstrate the superiority of our approach over unsupervised baselines, sometimes even matching the performance of previous language-specific and supervised methods.
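To make the decompounding task concrete, here is a minimal sketch using naive greedy dictionary lookup. This is explicitly not SECOS, which needs no such word list; the mini lexicon and parameters are assumptions for illustration only.

```python
# Illustration of the decompounding task only: a greedy dictionary-based
# splitter, not SECOS. The mini lexicon below is an assumed toy resource.
LEXICON = {"bahnhof", "strasse", "bundes", "finanz", "ministerium"}

def naive_split(compound, lexicon=LEXICON, min_len=3):
    """Greedily split a closed compound into known parts, left to right."""
    parts, rest = [], compound.lower()
    while rest:
        for end in range(len(rest), min_len - 1, -1):
            if rest[:end] in lexicon:
                parts.append(rest[:end])
                rest = rest[end:]
                break
        else:
            return [compound]  # give up: no known prefix found
    return parts

print(naive_split("Bahnhofstrasse"))           # ['bahnhof', 'strasse']
print(naive_split("Bundesfinanzministerium"))  # ['bundes', 'finanz', 'ministerium']
```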
In a final experiment, we show how both decompounding and MWE information can be used in information retrieval. Here, we obtain the best results when combining word information with MWEs and the compound parts in a bag-of-words retrieval setup.
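A minimal sketch of such an index-time expansion is shown below, assuming hypothetical inputs: a toy MWE lexicon and toy compound splits standing in for what DRUID and SECOS would produce.

```python
# Minimal sketch of the bag-of-words expansion described above: original
# tokens are kept and additionally enriched with detected MWEs and with the
# parts of split compounds. The inputs are hypothetical stand-ins for the
# output of MWE detection and decompounding.
from collections import Counter

def expand_tokens(tokens, mwes, compound_splits):
    """Return original tokens plus joined MWE tokens plus compound parts."""
    expanded = list(tokens)
    # Add a joined token for every known MWE found in the token stream.
    for mwe in mwes:                      # e.g. ("new", "york")
        n = len(mwe)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == mwe:
                expanded.append("_".join(mwe))
    # Add the parts of single-word compounds, if a split is known.
    for tok in tokens:
        expanded.extend(compound_splits.get(tok, []))
    return expanded

doc = "the railway line runs through new york".split()
mwes = [("new", "york")]                   # toy MWE lexicon
splits = {"railway": ["rail", "way"]}      # toy decompounding output
print(Counter(expand_tokens(doc, mwes, splits)))
```

Keeping the original tokens alongside the added units means single-word queries still match, while queries containing MWEs or compound parts gain additional term overlap.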
Overall, our methodology paves the way for the automatic detection of lexical units beyond standard tokenization, without language-specific preprocessing steps such as POS tagging.