Can a Large Language Model Replace Humans at Rating Lexical Semantic Relations Strength?
Keywords:
Semantic Measures, Semantic Similarity, Semantic Relatedness, Large Language Models (LLMs), Semantic Relations Datasets
Abstract
This paper investigates the ability of large language models (LLMs) to evaluate semantic relations between word pairs by examining their alignment with human-generated semantic ratings. Semantic relations represent the degree of connection (e.g., relatedness or similarity) between linguistic elements and are traditionally validated against human-annotated datasets. Due to the challenges of building such datasets and recent progress in LLMs' capacity to model human-like understanding, we explore whether LLMs can serve as reliable substitutes for traditional human ratings.
We conducted experiments using multiple LLMs from OpenAI, Google, Mistral, and Anthropic, evaluating their performance across diverse English and Portuguese semantic relations datasets. We also included PAP900, a recently published dataset of semantic relations in Portuguese, to examine whether prior exposure to a dataset during LLM training influences the results.
The results show that the LLM predictions correlate strongly with human ratings. The findings reveal the potential of LLMs to supplement or replace traditional semantic measure algorithms and crowd-sourced human annotations in semantic tasks.
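As an illustration of what alignment with human ratings can mean in practice, the following minimal sketch computes a rank correlation between two lists of ratings for the same word pairs. It assumes Spearman's rank correlation, a common choice for this kind of comparison; the abstract does not state which coefficient was used. The function names and the rating values are hypothetical and are not taken from the paper or its datasets.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical ratings for five word pairs (illustration only, not paper data).
human_ratings = [9.2, 7.5, 3.1, 0.8, 5.0]
llm_ratings = [8.5, 9.0, 2.0, 1.0, 6.0]

rho = spearman(human_ratings, llm_ratings)
print(f"Spearman rho = {rho:.3f}")
```

A rank-based coefficient is used in this sketch because human and LLM ratings may be on different scales, and only the ordering of word pairs needs to agree.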