Optimizing Neural Topic Modeling Pipelines for Low-Quality Speech Transcriptions
Authors: Taati, E., Budka, M., Neville, S. and Canniffe, J.
Journal: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14795 LNAI
Pages: 184-197
eISSN: 1611-3349
ISSN: 0302-9743
DOI: 10.1007/978-981-97-4982-9_15
Abstract: Gaining insights from large-scale document archives is a challenging task. Recent advances in natural language processing, specifically unsupervised topic modeling, allow for automated discovery of abstract “topics” that characterize groups of semantically related documents within textual corpora. Neural topic modeling has emerged as a scalable approach by integrating state-of-the-art sentence embedding models into modeling pipelines. This embedding-based architecture enables efficient processing of large datasets. However, topic quality, which is often tied to input data quality, particularly in the case of speech-to-text output, remains an open issue. This study presents a comparative evaluation of various component configurations within a neural topic modeling pipeline, as applied to a corpus of telephony transcriptions. Incorporating four embedding models (E5, Instructor, MiniLM, and SGPT), three dimensionality reduction approaches (maintaining the original embeddings versus reducing them with Truncated-SVD or UMAP), and two clustering algorithms (K-Means and HDBSCAN), 48 topic modeling pipelines are evaluated. The experimental results reveal that placing a context-aware embedding model in the pipeline leads to a significant improvement in topic coherence, while larger models tend to achieve better topic diversity. Based on these results, we also propose best practices for the model layout in the pipeline, considering coherence and topic diversity scores.
Source: Scopus
Optimizing Neural Topic Modeling Pipelines for Low-Quality Speech Transcriptions
Authors: Taati, E., Budka, M., Neville, S. and Canniffe, J.
Journal: INTELLIGENT INFORMATION AND DATABASE SYSTEMS, PT I, ACIIDS 2024
Volume: 14795
Pages: 184-197
eISSN: 1611-3349
ISBN: 978-981-97-4981-2
ISSN: 2945-9133
DOI: 10.1007/978-981-97-4982-9_15
Source: Web of Science (Lite)
Optimizing Neural Topic Modeling Pipelines for Low-Quality Speech Transcriptions
Authors: Taati, E., Budka, M., Neville, S. and Canniffe, J.
Editors: Nguyen, N.T., Chbeir, R., Manolopoulos, Y., Fujita, H., Hong, T.-P., Nguyen, M.L. and Wojtkiewicz, K.
Journal: ACIIDS (1)
Volume: 14795
Pages: 184-197
Publisher: Springer
ISBN: 978-981-97-4981-2
https://doi.org/10.1007/978-981-97-4982-9
Source: DBLP