Tokenization as Finite-State Transduction

Authors

  • Marco Cognetta, Institute of Science Tokyo
  • Naoaki Okazaki, Institute of Science Tokyo

Keywords

Tokenization, automata theory, Byte-Pair Encoding

Abstract

Tokenization is the first step in modern neural language model pipelines, in which an input text is converted to a sequence of subword tokens. We introduce from first principles a finite-state transduction framework that can encode all possible tokenizations of a regular language. We then constructively show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece), two popular tokenization schemes, are also efficiently representable by simple finite-state transducers. For BPE, this is particularly surprising given that it does not tokenize strings from left to right and relies on a notion of merge priority.
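
To make the distinction concrete: for a fixed subword vocabulary, a string generally admits many tokenizations, while a scheme like MaxMatch picks out a single canonical one. The Python sketch below illustrates this on a toy, hypothetical vocabulary; it is an illustration of the concepts only, not the paper's transducer construction.

    VOCAB = {"a", "b", "ab", "ba", "aba"}  # hypothetical toy vocabulary

    def all_tokenizations(s):
        # Enumerate every segmentation of s into vocabulary tokens.
        if not s:
            return [[]]
        results = []
        for i in range(1, len(s) + 1):
            if s[:i] in VOCAB:
                for rest in all_tokenizations(s[i:]):
                    results.append([s[:i]] + rest)
        return results

    def maxmatch(s):
        # Greedy left-to-right longest-match (WordPiece-style) tokenization.
        tokens = []
        while s:
            for i in range(len(s), 0, -1):  # try the longest prefix first
                if s[:i] in VOCAB:
                    tokens.append(s[:i])
                    s = s[i:]
                    break
            else:
                raise ValueError("untokenizable input: " + s)
        return tokens

    print(all_tokenizations("abab"))  # six segmentations in total
    print(maxmatch("abab"))           # the single canonical one: ['aba', 'b']

The framework described above represents the full set returned by all_tokenizations (and the canonical choice within it) as paths in a finite-state transducer, rather than by explicit enumeration.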

We also discuss an application of subword-level pattern matching to guided generation, where the outputs of a language model are constrained to match a specified pattern, and how tokenization-aware pattern matching offers a theoretical benefit to modeling.
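
For intuition, guided generation is commonly implemented by masking, at each decoding step, any subword token whose characters would drive the pattern's automaton into a dead state. The sketch below shows this character-level masking with a toy DFA for the pattern (ab)+; the DFA, vocabulary, and function names are all hypothetical, and the sketch deliberately omits the tokenization-awareness discussed above.

    # Transition table of a toy DFA for the pattern (ab)+ (hypothetical).
    DFA = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1}
    ACCEPTING = {2}

    def step(state, token):
        # Advance the DFA over each character of token; None means dead.
        for ch in token:
            state = DFA.get((state, ch))
            if state is None:
                return None
        return state

    def allowed_tokens(state, vocab):
        # The decoding-time mask: tokens the pattern still permits here.
        return {t for t in vocab if step(state, t) is not None}

    VOCAB = {"a", "b", "ab", "ba", "aba"}
    print(allowed_tokens(0, VOCAB))        # {'a', 'ab', 'aba'}
    print(step(0, "abab") in ACCEPTING)    # True: "abab" matches (ab)+

Plain character-level masking like this still admits token sequences such as ("a", "b") that the tokenizer itself would never emit (it would produce "ab" instead); ruling out such non-canonical sequences, by also encoding the tokenizer as a transducer, is what makes the approach tokenization-aware.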

Published

2025-09-09