Multi-SimLex: A Large-Scale Evaluation of Multilingual and Cross-Lingual Lexical Semantic Similarity

Ivan Vulić; Simon Baker; Edoardo Maria Ponti; Ulla Petti; Ira Leviant; Kelly Wing; Olga Majewska; Eden Bar; Matt Malone; Thierry Poibeau; Roi Reichart; Anna Korhonen

Authors

Ivan Vulić University of Cambridge
Simon Baker University of Cambridge
Edoardo Maria Ponti University of Cambridge
Ulla Petti University of Cambridge
Ira Leviant Faculty of Industrial Engineering and Management, Technion, IIT
Kelly Wing University of Cambridge
Olga Majewska University of Cambridge
Eden Bar Faculty of Industrial Engineering and Management, Technion, IIT
Matt Malone University of Cambridge
Thierry Poibeau LATTICE Lab, CNRS and ENS/PSL and Univ. Sorbonne nouvelle/USPC
Roi Reichart Faculty of Industrial Engineering and Management, Technion, IIT
Anna Korhonen University of Cambridge

Abstract

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering datasets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language dataset is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 cross-lingual semantic similarity datasets. Due to its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and cross-lingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and cross-lingual representation models, including static and contextualized word embeddings (such as fastText, M-BERT and XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised cross-lingual word embeddings. We also present a step-by-step dataset creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of the Multi-SimLex datasets, their creation protocol, strong baseline results, and in-depth analyses that can help guide future developments in multilingual lexical semantics and representation learning) available via a website, which will encourage community effort in further expanding Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
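
The count of 66 cross-lingual datasets matches the number of unordered pairs that can be formed from 12 languages, 12 × 11 / 2 = 66. The short sketch below checks that arithmetic. It assumes one dataset per unordered language pair, which the abstract implies but does not state outright, and it uses placeholder language labels because the abstract names only some of the 12 languages.

```python
from itertools import combinations
from math import comb

# Placeholder labels (hypothetical): the abstract names only some of the
# 12 languages, so generic identifiers stand in for the full list.
languages = [f"lang{i:02d}" for i in range(1, 13)]

# Assumption: each unordered pair of distinct languages yields one
# cross-lingual dataset.
language_pairs = list(combinations(languages, 2))

num_pairs = len(language_pairs)

# Cross-check the enumeration against the closed-form binomial coefficient.
assert num_pairs == comb(len(languages), 2)
print(num_pairs)  # 66
```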

Author Biography

Ivan Vulić, University of Cambridge

Language Technology Lab, Department of Theoretical and Applied Linguistics
Senior Research Associate
