A New Unsupervised Approach to Word Segmentation

Authors

  • Hanshi Wang, Beijing Institute of Technology
  • Jian Zhu, Beijing Institute of Technology
  • Shiping Tang, Beijing Institute of Technology
  • Xiaozhong Fan, Beijing Institute of Technology

Abstract

This article proposes ESA, a new unsupervised approach to word segmentation. ESA is an iterative process consisting of three phases: Evaluation, Selection, and Adjustment. In Evaluation, both the certainty and the uncertainty of character-sequence co-occurrence in the corpus serve as statistical evidence for the goodness measure. In addition, a simple process called Balancing makes the statistics of character sequences of different lengths comparable with one another. In Selection, a local-maximum strategy is adopted that requires no thresholds and can be implemented with dynamic programming. In Adjustment, part of the statistical data is updated to improve subsequent results. In our experiments, ESA was evaluated on the SIGHAN Bakeoff-2 dataset. The results suggest that ESA is effective on Chinese corpora. Notably, the F-measure increases almost monotonically and converges rapidly to relatively high values. Furthermore, empirical formulae derived from the results can be used to predict ESA's parameter, avoiding the usually time-consuming step of parameter estimation.
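To make the Selection phase more concrete, the following is a minimal sketch of threshold-free segmentation by dynamic programming. It is an illustration under stated assumptions, not the ESA implementation: the abstract does not give the goodness measure or the exact selection rule, so the sketch simply picks the segmentation with the highest total score. The names `segment`, `toy_score`, `TOY_SCORES`, and the `max_len` limit are hypothetical, and the scores are made-up values.

```python
from typing import Callable, List


def segment(text: str, score: Callable[[str], float], max_len: int = 4) -> List[str]:
    """Split text into the sequence of chunks with the highest total score.

    `score` stands in for a goodness measure (an assumption; the real
    ESA measure is not specified in the abstract).
    """
    n = len(text)
    best = [float("-inf")] * (n + 1)  # best[i]: top score for text[:i]
    back = [0] * (n + 1)              # back[i]: start of the last chunk in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            candidate = best[start] + score(text[start:end])
            if candidate > best[end]:
                best[end] = candidate
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the chunks.
    words: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words


# Hypothetical scores, for illustration only.
TOY_SCORES = {"ab": 3.0, "cd": 3.0, "abc": 3.5}


def toy_score(seq: str) -> float:
    # Unknown single characters are neutral; unknown longer chunks are penalized.
    return TOY_SCORES.get(seq, 0.0 if len(seq) == 1 else -1.0)


print(segment("abcd", toy_score))  # ['ab', 'cd']
```

No threshold appears anywhere in the sketch: a chunk is kept only because it belongs to the best-scoring overall split, which is the property the abstract attributes to the Selection phase.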

Published

2024-12-05

Section

Long paper