Dotless Arabic text for Natural Language Processing
Abstract
This paper introduces a novel representation of Arabic text as an alternative approach for
Arabic NLP, inspired by the dotless script of ancient Arabic. We explored this representation
through extensive analysis of various text corpora that differ in size and domain, tokenized
using multiple tokenization techniques. Furthermore, we examined the information density of
this representation and compared it with the standard dotted Arabic text using text entropy
analysis. Utilizing parallel corpora, we also drew comparisons between Arabic and English
text analysis to gain additional insights. Our investigation extended to various upstream and
downstream NLP tasks, including language modeling, text classification, sequence labeling, and
machine translation, examining the implications of both representations. Specifically, we
performed seven different downstream tasks using various tokenization schemes, comparing the
standard dotted text with the dotless Arabic text representation. Performance with both
representations was comparable across the different tokenizations. However, the dotless
representation achieves these results with a significant reduction in vocabulary size, of up
to 50% in some scenarios. Additionally, we present a system that restores dots to dotless
Arabic text, which is useful for tasks that require standard Arabic text as output.