Emotion Recognition from Speech via the Use of Different Audio Features, Machine Learning and Deep Learning Algorithms
Authors: Alperen Sayar, Tuna Çakar, Tunahan Bozkan, Seyit Ertuğrul, Fatma Gümüş
Abstract: Speech is one of the most basic, efficient and powerful means of communication. At the beginning of the 20th century, electroacoustic analysis was used in psychology to determine emotions. In academia, Speech Emotion Recognition (SER) has become one of the most studied research areas. This research area aims to determine the emotional state of the speaker from speech signals. Significant studies have been undertaken during the last two decades to identify emotions from speech using machine learning. However, it remains a challenging task because emotions shift from one to another and environmental factors have significant effects on emotions. Furthermore, sound consists of numerous parameters and there are various anatomical characteristics to take into consideration. Determining an appropriate audio feature set is still a critical decision point for an emotion recognition system. The demand for voice technology in both art and human-machine interaction systems has recently increased. Our voice conveys both linguistic and paralinguistic messages in the course of speaking. The paralinguistic part, for example rhythm and pitch, provides emotional cues about the speaker. Speech emotion recognition examines the question ‘How is it said?’, and an algorithm detects the emotional state of the speaker from an audio recording. Although a considerable number of studies have been conducted on selecting and extracting an optimal set of features, appropriate attributes for automatic emotion recognition from audio are still under research. The main aim of this study is to obtain the most distinctive emotional audio features. For this purpose, time-based, frequency-based and spectral shape-based features are used to compare recognition accuracies. Besides these features, a pre-trained model is used to obtain input for emotion recognition.
Machine learning models are developed for classifying emotions with Support Vector Machine, Multi-Layer Perceptron and Convolutional Neural Network algorithms. Three emotional databases in English and German are combined to obtain a larger database for training and testing the models. The emotions, namely Happy, Calm, Angry, Boredom, Disgust, Fear, Neutral, Sad and Surprised, are classified with these models. The classification results show that the pre-trained representations make the most successful predictions. The weighted accuracy is 91% for both the Convolutional Neural Network and Multi-Layer Perceptron algorithms, while it is 87% for the Support Vector Machine algorithm. A hybrid model that contains both a pre-trained model and spectral shape-based features is being developed. Speech contains silent and noisy sections, which increase the computational complexity. Time performance is the other major factor that deserves careful consideration. Although there have been many advances in SER, custom architectures are designed to balance accuracy and time performance. Furthermore, for a more realistic emotion estimation, cues such as voice, body movement and facial expression can be captured together, since humans use them collectively to express themselves.
Keywords: Voice Analysis, Deep Learning, Emotion Detection, Machine Learning, Human-Computer Interaction
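To make the feature families named in the abstract more concrete, the following is a minimal sketch of one time-based feature pair (zero-crossing rate and RMS energy) and one spectral shape feature (spectral centroid) computed on a single audio frame. The abstract does not list the exact features used in the study, so these three are common illustrative choices rather than the authors' feature set. The function names, the synthetic 1 kHz test tone and the frame length are all assumptions made for this example.

```python
import math

def zero_crossing_rate(frame):
    """Time-based feature: fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def rms_energy(frame):
    """Time-based feature: root-mean-square amplitude of the frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def magnitude_spectrum(frame):
    """Naive O(n^2) DFT magnitudes for bins 0..n/2 (kept simple for illustration;
    a real pipeline would use an FFT routine)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def spectral_centroid(frame, sample_rate):
    """Spectral shape feature: magnitude-weighted mean frequency in Hz."""
    mags = magnitude_spectrum(frame)
    total = sum(mags)
    if total == 0:
        return 0.0  # silent frame: centroid is undefined, return 0 by convention
    n = len(frame)
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / total

# Hypothetical input: one 256-sample frame of a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 1000 * i / sample_rate) for i in range(256)]

# A per-frame feature vector that a classifier (e.g. SVM or MLP) could consume.
features = [
    zero_crossing_rate(frame),
    rms_energy(frame),
    spectral_centroid(frame, sample_rate),
]
print(features)
```

In practice such per-frame values are usually aggregated over an utterance (for example by mean and standard deviation) before classification; that aggregation step is likewise an assumption here and is not described in the abstract.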