String Kernels for Native Language Identification: Insights from Behind the Curtains

Authors

  • Radu Tudor Ionescu, University of Bucharest
  • Marius Popescu, University of Bucharest
  • Aoife Cahill, Educational Testing Service

Abstract

The most common approach in text mining classification tasks is to rely on features such as words, part-of-speech tags, stems, or other high-level linguistic features. Recently, an approach that uses only character n-grams as features has been proposed for the task of native language identification. The approach obtained state-of-the-art results by combining several string kernels using multiple kernel learning. Although the approach based on string kernels performs so well, several questions about this method have not yet been answered. First, it is not clear why such a simple approach can compete with far more complex approaches that take words, lemmas, syntactic information, or even semantics into account. Second, although the approach is designed to be language independent, all experiments to date have been on English. This work is an extensive study that aims to systematically present the string kernel approach and to clarify the open questions mentioned above. A broad set of native language identification experiments is conducted to compare the string kernel approach with other state-of-the-art methods. The empirical results obtained in all the experiments conducted in this work indicate that the proposed approach achieves state-of-the-art performance in native language identification, reaching an accuracy that is 1.7% above the top scoring system of the 2013 NLI Shared Task. Furthermore, the results obtained on both the Arabic and the Norwegian corpora demonstrate that the proposed approach is language independent. In the Arabic native language identification task, string kernels show an increase of more than 17% over the best accuracy reported so far. The results of string kernels on Norwegian native language identification are also significantly better than those of the state-of-the-art approach.
In addition, in a cross-corpus experiment, the proposed approach shows that it can also be topic independent, improving on the state-of-the-art system by 32.3%. To gain additional insights into why this technique works well, the features selected by the classifier as more discriminating are analyzed. The analysis also offers information about localized language transfer effects, since the features used by the proposed model are n-grams of various lengths. The features captured by the model typically include stems, function words, and word prefixes and suffixes, which have the potential to generalize over purely word-based features. By analyzing the discriminating features, this paper offers insights into two kinds of language transfer effects, namely word choice (lexical transfer) and morphological differences. The goal of the current study is to give a full view of the string kernel approach and to shed some light on why it works so well.

Published

2024-12-05

Issue

Section

Long paper