Pretrained Speech Models Learn Boundaries, Not Patterns: An Analysis of Supervised vs. Unsupervised Capabilities

Authors: Rehman, A., Jayathunge, K., Zhang, J.J., Yang, X.

Journal: 2025 13th International Conference on Speech Technology and Human-Computer Dialogue, SpeD 2025

Publication Date: 01/01/2025

Pages: 114-119

DOI: 10.1109/SpeD67700.2025.11252189

Abstract:

Do pretrained speech models genuinely understand speech patterns, or do they simply learn to classify? We investigate this fundamental question by analyzing 10 state-of-the-art speech models across diverse tasks. We tested the models on six speech characteristics (gender, accent, age, emotion, words, speaker identity) using both classification and clustering approaches. While the models achieve impressive classification accuracy (up to 100% for gender, 94.9% for words), they cluster poorly when asked to discover the same patterns without supervision. Most critically, we find systematic negative correlations between the two capabilities: models that are better at accent classification are worse at discovering accent patterns through clustering (r=-0.904). Our analysis suggests that current speech models learn equilateral decision boundaries rather than orthogonal pattern representations, because their training objectives promote uniform dimensional alignment and the resulting representations lack orthogonality across multiple dimensions. These findings expose a critical evaluation crisis in self-supervised learning (SSL): current benchmarks may systematically overestimate model capabilities, since high classification scores do not guarantee the pattern discovery capabilities that form SSL's core value proposition.
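
The negative correlation reported in the abstract (r=-0.904 for accent) relates one classification score per model to one clustering score per model. The sketch below shows how such a coefficient could be computed, assuming it is a Pearson correlation; the abstract does not name the statistic or the clustering metric, and the function name `pearson_r` and all scores are hypothetical placeholders, not values from the paper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-model scores for one characteristic (e.g. accent).
# These are illustrative placeholders, NOT results from the paper.
classification_accuracy = [0.62, 0.70, 0.75, 0.81, 0.88]
clustering_score = [0.30, 0.27, 0.21, 0.18, 0.10]

r = pearson_r(classification_accuracy, clustering_score)
print(f"r = {r:.3f}")  # negative when better classifiers cluster worse
```

A strongly negative r on real per-model scores would correspond to the pattern the abstract describes; with only ten models, such a coefficient would normally be reported alongside a significance test.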

Source: Scopus