To Augment or Not to Augment? A Comparative Study on Text Augmentation Techniques for Low-Resource NLP

Authors

  • Gozde Gul Sahin, Technical University of Darmstadt

Abstract

Data-hungry deep neural networks have established themselves as the de facto standard for many NLP tasks, including traditional sequence tagging. Despite their state-of-the-art performance on high-resource languages, they still fall behind their statistical counterparts in low-resource scenarios. One methodology to counteract this problem is text augmentation, i.e., generating new synthetic training data points from existing data. Although NLP has recently witnessed a wealth of textual augmentation techniques, the field still lacks a systematic performance analysis on a diverse set of languages and sequence tagging tasks. To fill this gap, we investigate three existing augmentation methodologies which perform changes at the syntactic (e.g., cropping sub-sentences), token (e.g., random word insertion) and character (e.g., character swapping) levels. We systematically compare the methods on part-of-speech tagging, dependency parsing and semantic role labeling for languages from a diverse set of families such as Turkic, Slavic, Mongolic and Dravidian. We find that character-level techniques are more likely to provide improvements on most languages and tasks, while sophisticated syntactic-level techniques mostly benefit higher-level tasks and morphologically richer languages.
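
For intuition, the three augmentation levels named in the abstract can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's implementation: the function names and probabilities are our assumptions, and the cropping function is simplified to a contiguous span, whereas the paper's syntactic operations work on dependency trees.

```python
import random

def swap_adjacent_chars(word, p=0.1):
    # Character-level augmentation: with probability p, swap each pair of
    # adjacent characters. The rate p is a hypothetical parameter.
    chars = list(word)
    for i in range(len(chars) - 1):
        if random.random() < p:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def insert_random_words(tokens, vocab, p=0.1):
    # Token-level augmentation: after each token, insert a random word
    # drawn from a vocabulary with probability p. In sequence tagging,
    # the label sequence must be adjusted alongside the tokens.
    augmented = []
    for tok in tokens:
        augmented.append(tok)
        if random.random() < p:
            augmented.append(random.choice(vocab))
    return augmented

def crop_subsentence(tokens, start, end):
    # Syntactic-level augmentation, heavily simplified: keep a contiguous
    # sub-span. The actual cropping described in the paper operates on
    # dependency trees, which this sketch does not model.
    return tokens[start:end]

if __name__ == "__main__":
    random.seed(0)
    sentence = "the quick brown fox jumps over the lazy dog".split()
    print(swap_adjacent_chars("jumps", p=0.3))
    print(insert_random_words(sentence, vocab=["very", "quite"], p=0.2))
    print(crop_subsentence(sentence, 0, 4))
```

Each function returns a new synthetic training example from an existing one; in practice such transformations are applied to a fraction of the training set and the augmented sentences are added to the original data.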

Published

2024-11-20