Utilizing Dimensional Emotion Representations in Speech Emotion Recognition

Open Access
Conference Proceedings
Authors: John Lorenzo Bautista, Yun Kyung Lee, Seungyoon Nam, Chanki Park, Hyun Soon Shin

Abstract: Speech is a natural mode of human communication, and advancements in speech emotion recognition (SER) technology allow further improvement of human-computer interaction (HCI) by understanding human emotions from speech. SER systems have traditionally focused on categorizing emotions into discrete classes. However, discrete classes often overlook subtleties between emotions, as they are prone to individual and cultural differences. In this study, we focus on using dimensional emotion values: valence, arousal, and dominance as the outputs of an SER system instead of the traditional categorical classification. An SER model is developed using the large pre-trained models Wav2Vec 2.0 and HuBERT as feature encoders that extract features directly from raw audio input. The model’s performance is assessed using the mean concordance correlation coefficient (CCC) for models trained on an English-language dataset, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus, and a Korean-language dataset, the Korean Emotion Multimodal Database (KEMDy19). For the experiments on the IEMOCAP dataset, the Wav2Vec 2.0-based model achieved a mean CCC of 0.3673, with CCC values of 0.3004, 0.4585, and 0.3431 for valence, arousal, and dominance, respectively, when trained on the “anger”, “happy”, “sad”, and “neutral” emotion classes. The HuBERT-based model achieved a mean CCC of 0.3573, with CCC values of 0.2789, 0.3295, and 0.3361 for valence, arousal, and dominance, respectively, on the same set of emotion classes. For the experiments on the KEMDy19 dataset, the Wav2Vec 2.0-based model achieved a mean CCC of 0.5473, with CCC values of 0.5804 and 0.5142 for valence and arousal, using all available emotion classes in the dataset, while a mean CCC of 0.5580, from CCC values of 0.5941 and 0.5219, was observed on the four emotion classes “anger”, “happy”, “sad”, and “neutral”.
For the HuBERT-based model, a mean CCC of 0.5271, with CCC values of 0.5429 and 0.5113 for valence and arousal, was recorded using all available emotion classes, while a mean CCC of 0.5392, from CCC values of 0.5765 and 0.5019 for valence and arousal, was obtained on the four selected emotion classes. The proposed approach outperforms traditional machine learning methods and previously reported CCC values in the literature. Moreover, dimensional emotion values provide a more fine-grained view of the user’s emotional state, allowing a much deeper understanding of one’s affective state with reduced dimensionality. By applying such SER technologies to areas such as HCI, affective computing, and psychological research, more personalized and adaptable user interfaces can be developed to suit the emotional needs of each individual. This could also contribute to advancing our understanding of human factors through the development of emotion recognition systems.
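The evaluation metric above, the concordance correlation coefficient, measures agreement between predicted and annotated dimensional values by combining correlation with penalties for mean and variance mismatch. A minimal sketch of the standard formula (Lin, 1989) is shown below; this is an illustration of the metric, not the authors' implementation, and the function name is ours:

```python
import numpy as np

def ccc(pred, true):
    """Concordance correlation coefficient between two 1-D arrays.

    CCC = 2*cov(pred, true) /
          (var(pred) + var(true) + (mean(pred) - mean(true))**2)

    Ranges from -1 to 1; 1 means perfect agreement, 0 means no
    agreement, and values drop when means or variances differ.
    """
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    mp, mt = pred.mean(), true.mean()
    vp, vt = pred.var(), true.var()          # population variance (ddof=0)
    cov = ((pred - mp) * (true - mt)).mean() # population covariance
    return 2.0 * cov / (vp + vt + (mp - mt) ** 2)
```

For a model predicting valence, arousal, and dominance, the mean CCC reported in the abstract is simply the average of this score over the three dimensions.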

Keywords: speech emotion recognition, dimensional emotional values, valence, arousal

DOI: 10.54941/ahfe1004283
