C'est kif-kif: Parsing and Multiword Expression Identification for Arabic and French
Abstract
We develop constituency parsing models for Arabic and French by analyzing the interplay of linguistic phenomena, annotation choices, and model design. Although these morphologically rich languages are linguistically unrelated, we show that similar NLP techniques improve parsing accuracy for both languages. One of these novel techniques is a factored lexicon, in which surface forms are decomposed into stems and sets of morphosyntactic features. Our models obtain state-of-the-art parsing results for both languages.
Next, we investigate the effectiveness of our parsers for multiword expression (MWE) identification. Work in theoretical linguistics has shown that idiomatic constructions have predictable syntactic patterns. We show that our models, especially one based on Tree Substitution Grammars, can effectively use syntactic context to identify MWEs. Our experimental results suggest that although Arabic and French parsing accuracy still lags behind that of English, our parsers are already useful for a hard NLP task.