Optimizing Neural Topic Modeling Pipelines for Low-Quality Speech Transcriptions

Authors: Taati, E., Budka, M., Neville, S. and Canniffe, J.

Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Volume: 14795 LNAI

Pages: 184-197

eISSN: 1611-3349

ISSN: 0302-9743

DOI: 10.1007/978-981-97-4982-9_15

Abstract:

Gaining insights from large-scale document archives is a challenging task. Recent advances in natural language processing, specifically unsupervised topic modeling, allow for automated discovery of abstract “topics” that characterize groups of semantically related documents within textual corpora. Neural topic modeling has emerged as a scalable approach by integrating state-of-the-art sentence embedding models into modeling pipelines. This embedding-based architecture enables efficient processing of large datasets. However, topic quality, which often depends on input data quality, particularly in the case of speech-to-text output, remains an open issue. This study presents a comparative evaluation of various component configurations within a neural topic modeling pipeline, as applied to a corpus of telephony transcriptions. Incorporating four embedding models (E5, Instructor, MiniLM, and SGPT), three dimensionality reduction approaches (keeping the original embeddings versus reducing them with Truncated-SVD or UMAP), and two clustering algorithms (K-Means and HDBSCAN), 48 topic modeling pipelines are evaluated. The experimental results reveal that placing a context-aware embedding model in the pipeline leads to a significant improvement in topic coherence, while larger models tend to achieve better topic diversity. Based on these results, we also propose best practices for the model layout in the pipeline, considering coherence and topic diversity scores.
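
The abstract names topic diversity as an evaluation score but does not define it. As a rough illustration only, the sketch below assumes one commonly used definition: the fraction of unique words among the top-k words of all topics. The function name `topic_diversity`, the `top_k` parameter, and the example topics are hypothetical and are not taken from the paper, which may compute the score differently.

```python
from typing import Sequence


def topic_diversity(topics: Sequence[Sequence[str]], top_k: int = 10) -> float:
    """Return the fraction of unique words among the top_k words of every topic.

    Assumed definition (not confirmed by the abstract): 1.0 means no topic
    shares a top word with another; values near 0 mean heavy overlap.
    """
    # Collect the first top_k words of each topic into one flat list.
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("at least one topic with at least one word is required")
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, invented for illustration (not results from the paper).
example_topics = [
    ["billing", "invoice", "payment", "refund"],
    ["network", "signal", "outage", "payment"],
]
score = topic_diversity(example_topics, top_k=4)
print(score)
```

Here "payment" appears in both topics, so 7 of the 8 top words are unique. Under this assumed definition, a pipeline whose clusters keep producing the same high-frequency words would score lower.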

Source: Scopus

Optimizing Neural Topic Modeling Pipelines for Low-Quality Speech Transcriptions

Authors: Taati, E., Budka, M., Neville, S. and Canniffe, J.

Journal: INTELLIGENT INFORMATION AND DATABASE SYSTEMS, PT I, ACIIDS 2024

Volume: 14795

Pages: 184-197

eISSN: 1611-3349

ISBN: 978-981-97-4981-2

ISSN: 2945-9133

DOI: 10.1007/978-981-97-4982-9_15

Source: Web of Science (Lite)

Optimizing Neural Topic Modeling Pipelines for Low-Quality Speech Transcriptions

Authors: Taati, E., Budka, M., Neville, S. and Canniffe, J.

Editors: Nguyen, N.T., Chbeir, R., Manolopoulos, Y., Fujita, H., Hong, T.-P., Nguyen, M.L. and Wojtkiewicz, K.

Journal: ACIIDS (1)

Volume: 14795

Pages: 184-197

Publisher: Springer

ISBN: 978-981-97-4981-2

https://doi.org/10.1007/978-981-97-4982-9

Source: DBLP