Identification of Multi-word Expressions by Combining Multiple Linguistic Information Sources

Authors

  • Yulia Tsvetkov, Carnegie Mellon University
  • Shuly Wintner, Department of Computer Science, University of Haifa

Abstract

We propose a framework for employing multiple sources of linguistic information in the task of identifying multi-word expressions in natural language texts. We define various linguistically motivated classification features and introduce novel ways of computing them. We then manually define interrelationships among the features and express them in a Bayesian network. The result is a powerful classifier that can identify multi-word expressions of various types and syntactic constructions in text corpora. Our methodology is unsupervised and language-independent; it requires relatively few language resources and is thus suitable for a large number of languages. We report results on English, French, and Hebrew, and demonstrate a significant improvement in identification accuracy compared with less sophisticated baselines.

Published

2024-12-05

Section

Short paper