Speech emotion recognition based on convolutional neural network with attention-based bidirectional long short-term memory network and multi-task learning

Authors: Liu, Z.T., Han, M.T., Wu, B.H. and Rehman, A.

Journal: Applied Acoustics

Volume: 202

eISSN: 1872-910X

ISSN: 0003-682X

DOI: 10.1016/j.apacoust.2022.109178

Abstract:

Speech emotion recognition (SER) is a challenging task because feature distributions differ from speaker to speaker. To improve the generalization performance and accuracy of SER, we apply balanced augmented sampling to triple-channel log-Mel spectrograms, which reduces the imbalance of the sample distribution among emotional categories and provides sufficient input for the deep neural network model. A time-domain filter and a frequency-domain filter are each applied to the triple-channel log-Mel spectrograms to increase the diversity of the features. A deep neural network composed of a convolutional neural network (CNN) and an attention-based bidirectional long short-term memory network (ABLSTM) is then employed for feature extraction, with multi-task learning adopted to improve its performance. We select seven candidate auxiliary tasks and determine the optimal ones through experimental comparison. Finally, the method is evaluated on the IEMOCAP and MSP-IMPROV databases. It achieves a WAR of 70.27% and a UAR of 66.27% on IEMOCAP, and a WAR of 60.90% and a UAR of 61.83% on MSP-IMPROV, which demonstrates better performance than other works.
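
The abstract reports results as WAR and UAR without defining them. A minimal sketch of how the two metrics are usually computed in the SER literature follows, assuming WAR means overall accuracy (frequent classes weigh more) and UAR means the mean of per-class recalls (every class weighs the same). The function name `war_uar` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def war_uar(y_true, y_pred):
    """Return (WAR, UAR) for two equal-length label sequences.

    Assumed definitions (not stated in the abstract):
      WAR = correct predictions / all samples
      UAR = mean over classes of (correct in class / samples in class)
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label sequences must be non-empty and of equal length")

    total_per_class = Counter(y_true)
    correct_per_class = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # WAR: overall accuracy, dominated by the most frequent classes.
    war = sum(correct_per_class.values()) / len(y_true)

    # UAR: average of per-class recalls, insensitive to class frequency.
    recalls = [correct_per_class[c] / n for c, n in total_per_class.items()]
    uar = sum(recalls) / len(recalls)
    return war, uar


# Made-up, imbalanced toy labels for illustration only.
y_true = ["neutral"] * 6 + ["angry"] * 2
y_pred = ["neutral"] * 6 + ["neutral", "angry"]
war, uar = war_uar(y_true, y_pred)
print(f"WAR={war:.3f} UAR={uar:.3f}")
```

On imbalanced data the two values can diverge noticeably, which is one reason both are commonly reported together for emotion databases.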

Source: Scopus
