Dotless Arabic text for Natural Language Processing
Abstract
This paper introduces a novel representation of Arabic text as an alternative approach for
Arabic NLP, inspired by the dotless script of ancient Arabic. We explored this representation
through extensive analysis of various text corpora that differ in size and domain, tokenized
using multiple tokenization techniques. Furthermore, we examined the information density of
this representation and compared it with the standard dotted Arabic text using text entropy
analysis. Utilizing parallel corpora, we also drew comparisons between Arabic and English
text analysis to gain additional insights. Our investigation extended to various upstream and
downstream NLP tasks, including language modeling, text classification, sequence labeling, and
machine translation, examining the implications of both representations. Specifically, we
performed seven different downstream tasks using various tokenization schemes, comparing the
standard dotted text with the dotless Arabic text representation. Performance with both
representations was comparable across the different tokenizations. However, the dotless
representation achieves these results with a significant reduction in vocabulary size, of up
to 50% in some scenarios. Additionally, we present a system that restores dots to dotless
Arabic text, which is useful for tasks that require standard Arabic text as output.