How Much Does Lookahead Matter for Disambiguation? Partial Arabic Diacritization Case Study
Abstract
We propose a model for partial diacritization of deep orthographies. We focus on Arabic, where optionally marking selected vowels with diacritics can improve readability. Our partial diacritizer restores short vowels only where they make a given running text easier to understand. The idea is to mark those instances that require lookahead to disambiguate. We employ two independent neural networks: one takes the entire sentence as input, and the other considers only the text that has been read so far. Partial diacritization is then achieved by keeping the vowels on which the two networks disagree, preferring the reading based on the whole sentence. For evaluation, we prepared a new dataset of Arabic texts with both full and partial vowelization.
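The selection rule described in the abstract can be illustrated with a minimal sketch. It assumes the two networks have already produced one diacritic prediction per letter; the function name `partial_diacritize`, its parameters, and the Latin stand-in symbols in the usage example are hypothetical and chosen only for illustration, not taken from the authors' implementation.

```python
from typing import Sequence


def partial_diacritize(
    letters: Sequence[str],
    full_context: Sequence[str],
    left_context: Sequence[str],
) -> str:
    """Keep a diacritic only where the two predictions disagree.

    letters       -- undiacritized base letters, one per position
    full_context  -- per-letter diacritic predicted from the whole sentence
    left_context  -- per-letter diacritic predicted from the text read so far
    An empty string means "no diacritic" at that position (an assumption
    of this sketch).
    """
    if not (len(letters) == len(full_context) == len(left_context)):
        raise ValueError("all three sequences must have the same length")

    out = []
    for letter, full, left in zip(letters, full_context, left_context):
        out.append(letter)
        # Disagreement signals that lookahead changed the reading,
        # so emit the full-sentence prediction at this position.
        if full != left:
            out.append(full)
    return "".join(out)


# Toy usage with Latin stand-ins for letters and short vowels:
# the models disagree only on the second letter, so only it is marked.
marked = partial_diacritize("ktb", ["a", "a", ""], ["a", "u", ""])
print(marked)  # ktab
```

In this sketch, positions where both predictions agree are left bare, on the assumption that a reader can resolve them without looking ahead. Where the full-sentence prediction is empty and the left-context one is not, nothing visible is emitted; how the actual system handles that case is not stated in the abstract.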