Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources
Abstract
We propose a framework for employing multiple sources of linguistic information in the task of identifying multi-word expressions in natural language texts. We define various linguistically motivated classification features and introduce novel ways for computing them. We then manually define interrelationships among the features, and express them in a Bayesian network. The result is a powerful classifier that can identify multi-word expressions of various types and syntactic constructions in text corpora. Our methodology is unsupervised and language-independent; it requires relatively few language resources and is thus suitable for a large number of languages. We report results on English, French, and Hebrew, and demonstrate a significant improvement in identification accuracy compared with less sophisticated baselines.
Published
2024-12-05
Section
Short Paper