Hashtag Sense Clustering based on Temporal Similarity

Authors

  • Paola Velardi Department of Computer Science, University of Roma "La Sapienza"

Abstract

Hashtags are creative labels used in micro-blogs to characterize the topic of a message or discussion. However, since hashtags are created spontaneously and in a highly dynamic way by users in multiple languages, the same topic can be associated with different hashtags and, conversely, the same hashtag may refer to different topics in different time spans. Unlike for common words, sense clustering for hashtags is complicated by the fact that no sense catalogues, such as Wikipedia or WordNet, are available; furthermore, hashtag labels are often obscure to the evaluators. In this paper we propose a sense clustering algorithm, named SAX*, based on temporal mining. First, hashtag time series are converted into strings of symbols using Symbolic Aggregate ApproXimation (SAX); then, hashtags are clustered based on string similarity and temporal co-occurrence. Evaluation is performed on an available Twitter stream, both manually and using two reference sets of semantically tagged hashtags. We show that SAX* efficiently copes with the problems of ambiguity, polysemy and sense shifts. Finally, we also perform a complexity evaluation of our algorithm, since efficiency is a crucial performance factor when processing large-scale data streams such as Twitter.
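
To illustrate the first step named in the abstract, the following is a minimal sketch of the standard SAX discretization (z-normalization, piecewise aggregate approximation, then mapping to symbols through Gaussian breakpoints). It is not the SAX* algorithm of the paper: the function name sax_string, the parameters n_segments and alphabet, and the toy hashtag counts are all assumptions made for this example.

```python
from statistics import NormalDist, mean, pstdev


def sax_string(series, n_segments, alphabet="abc"):
    """Discretize a numeric time series into a string with standard SAX.

    Hypothetical helper for illustration; not the paper's SAX* variant.
    """
    n = len(series)
    if not 0 < n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")

    # Step 1: z-normalize so that series of different volume are comparable.
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        z = [0.0] * n  # a flat series carries no shape information
    else:
        z = [(x - mu) / sigma for x in series]

    # Step 2: piecewise aggregate approximation (mean of each window).
    paa = []
    for i in range(n_segments):
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        paa.append(mean(z[start:end]))

    # Step 3: breakpoints that split the standard normal distribution
    # into len(alphabet) equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]

    # Step 4: each window mean becomes the symbol of the region it falls in.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)


# Toy hourly counts for two hypothetical hashtags that peak in the same window.
tag_one = [0, 0, 0, 0, 10, 10, 0, 0]
tag_two = [1, 1, 1, 1, 50, 50, 1, 1]
print(sax_string(tag_one, 4, "ab"), sax_string(tag_two, 4, "ab"))
```

In this toy case both series map to the same string even though their volumes differ, which is the kind of temporal similarity a string-based clustering step could then exploit. How SAX* actually compares strings and accounts for temporal co-occurrence is not specified in the abstract and is not reproduced here.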

Author Biography

  • Paola Velardi, Department of Computer Science, University of Roma "La Sapienza"
    Full Professor and Head of the Undergraduate and Master's programs in Computer Science, Dipartimento di Informatica,
    Sapienza Università di Roma

Published

2024-12-05

Section

Long paper