Re-evaluating the Word Token for Bilingual Speech Processing: The Case for Intonation Units

Rebecca Pattichis; Rena Torres Cacoullos; Dora LaCasse

Authors

Rebecca Pattichis, Independent Researcher
Rena Torres Cacoullos
Dora LaCasse

Keywords:

Code-Switching, Intonation Units

Abstract

Natural Language Processing (NLP) metrics for bilingual code-switching (CS) have, until now, used words as the token level. However, the assumption that any two words constitute an equally likely switch point is erroneous. In spoken language, a major delimiter of CS is a prosodic chunk known as the Intonation Unit (IU). Switch points are far more likely between words at IU boundaries than between words in the same IU. The word as an elementary NLP unit is thus incommensurate with bilingual speech patterns. Here, we put forward an IU-based adaptation of a familiar metric of CS probability. We then compare the token levels on this metric for ten bilingual datasets featuring multi-word CS. Our comparison shows that the currently standard two-significant-figure precision of the word-based metric is insufficient, as the token level compresses the range of values by inflating the universe of CS. More discerning CS probability values can be obtained by normalizing word-based counts using mean IU length.
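The contrast between the two token levels can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the word-based metric is the proportion of adjacent word pairs whose language tags differ, and that normalizing by mean IU length means replacing the count of word junctures in the denominator with an estimated count of IU boundaries. The function names, the language tags, and the mean IU length of 4 are all hypothetical.

```python
from typing import Sequence


def count_switches(langs: Sequence[str]) -> int:
    """Count adjacent word pairs whose language tags differ."""
    return sum(a != b for a, b in zip(langs, langs[1:]))


def word_switch_rate(langs: Sequence[str]) -> float:
    """Switches divided by the number of word junctures (assumed word-based metric)."""
    if len(langs) < 2:
        return 0.0
    return count_switches(langs) / (len(langs) - 1)


def iu_normalized_switch_rate(langs: Sequence[str], mean_iu_length: float) -> float:
    """Switches divided by an estimated number of IU boundaries.

    Assumption: the number of IUs is approximated as
    (number of words) / (mean IU length), so the estimated number of
    IU boundaries is that quantity minus one.
    """
    if mean_iu_length <= 0:
        raise ValueError("mean_iu_length must be positive")
    estimated_boundaries = len(langs) / mean_iu_length - 1
    if estimated_boundaries <= 0:
        return 0.0
    return count_switches(langs) / estimated_boundaries


# Hypothetical 12-word stretch: English, then Spanish, then English again.
tags = ["en"] * 4 + ["es"] * 4 + ["en"] * 4

word_rate = word_switch_rate(tags)            # 2 switches over 11 word junctures
iu_rate = iu_normalized_switch_rate(tags, 4)  # 2 switches over 2 estimated IU boundaries
```

Under these assumptions the same two switches yield a small value when every word juncture counts as a possible switch point and a much larger one when only IU boundaries do, which is the compression of range that the abstract attributes to the word token.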
