Decode, move and speak! Self-supervised learning of speech units, gestures and sounds relationships using vocal imitation

Authors

  • Marc-Antoine Georges Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab
  • Marvin Lavechin Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab
  • Jean-Luc Schwartz Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab
  • Thomas Hueber Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab http://orcid.org/0000-0002-8296-5177

Abstract

Speech learning involves controlling a complex motor system for uttering speech sounds from articulatory gestures and discovering a set of discrete and invariant units that provide entry to the linguistic system. Importantly, children seem to learn the relationships between speech sounds, the corresponding articulatory gestures, and these units in a weakly supervised manner, with no explicit labeling of auditory inputs and no access to the articulatory gestures they should produce to reach an acoustic target. In this study, we propose a computational agent that learns to drive a virtual vocal apparatus in order to repeat an auditory speech input. This model combines i) an articulatory synthesizer able to reproduce complex speech stimuli from a limited set of interpretable articulatory parameters, ii) two internal models respectively providing articulatory-to-acoustic forward predictions and acoustic-to-articulatory inverse computations, and iii) a discrete speech unit discovery module based on vector-quantized variational autoencoders (VQ-VAE). From this architecture, we provide two contributions. In a first experiment, we analyze the quantized embeddings learned by the VQ-VAE from ground-truth data and show a complementarity between the acoustic and articulatory modalities that is potentially useful for the discovery of invariance. Then, we evaluate the performance of the proposed agent at both the acoustic and articulatory levels. We show that while most of the agent's productions are intelligible, the underlying articulatory trajectories of those productions are not systematically plausible. Finally, we present future perspectives for testing a developmental scenario of speech learning using end-to-end neural models.
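The abstract refers to a speech unit discovery module based on vector-quantized variational autoencoders. The sketch below illustrates only the standard VQ-VAE quantization step (van den Oord et al., 2017) that such a module relies on; the framework (PyTorch), codebook size, embedding dimension, and commitment cost are assumptions for illustration and do not reproduce the authors' actual configuration.

```python
# Minimal, illustrative VQ-VAE quantizer sketch (standard recipe, not the
# authors' implementation). Codebook size, code dimension and the
# commitment cost beta are assumed values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=64, code_dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment cost

    def forward(self, z_e):
        # z_e: continuous encoder outputs, shape (batch, time, code_dim)
        flat = z_e.reshape(-1, z_e.size(-1))                      # (B*T, D)
        # Squared Euclidean distance from each frame to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)                               # discrete unit ids
        z_q = self.codebook(indices).view_as(z_e)                  # quantized vectors
        # Codebook + commitment losses, straight-through gradient to the encoder
        loss = (F.mse_loss(z_q, z_e.detach())
                + self.beta * F.mse_loss(z_e, z_q.detach()))
        z_q = z_e + (z_q - z_e).detach()
        return z_q, indices.view(z_e.shape[:-1]), loss
```

Applied frame by frame to acoustic or articulatory embeddings, the discrete indices returned by such a quantizer would play the role of the candidate speech units whose properties the first experiment analyzes.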

Published

2024-12-23

Section

Special Issue on Language Learning, Representation, and Processing in Humans and Machines