Neural Network Model for Visualization of Conversational Mood with Four Adjective Pairs

Open Access
Article
Conference Proceedings
Authors: Koichi Yamagata, Koya Kawahara, Yuto Suzuki, Yuki Nakahodo, Shunsuke Ito, Haruka Matsukura, Maki Sakamoto

Abstract: In recent years, the accuracy of speech recognition has improved remarkably. Speech recognition software can be used to obtain text information from conversational speech data. Although text can be treated as surface-level information, several studies have indicated that speech recognition can also be used to estimate emotions, which represent higher-level information in a conversation. Several newly proposed models use LSTM or GRU to estimate emotion in conversations. However, when attempting to monitor or influence conversations conducted as part of a meeting or a chat, the mood of the conversation is more important than the emotion. In normal conversation, emotions such as anger and sadness are unlikely to be explicitly expressed, for reasons such as avoiding an unexpected argument or offending others. Thus, when attempting to control or monitor the state of a conversation during a meeting or casual discussion, it is often more important to estimate the mood than the emotion. Some researchers have examined mood as distinct from emotion, describing mood as a diffuse emotional state that persists over a long period and is usually distinguished from emotion by its duration and intensity of expression. However, these differences are rarely quantified, and no specific durations are fixed. Accurate identification of the mood of a conversation is especially important for Japanese people who are engaged in collaborative and democratic decision making. To construct the training data for the model designed to estimate the conversational mood, we first selected representative adjective pairs that could describe the conversational mood. We utilized a system developed by Iiba et al. to estimate 21 affective scales of adjective pairs from input text. The 21 adjective pairs were clustered into 4 groups based on the output scales, and the 4 adjective pairs to be annotated were chosen as representatives of the 4 clusters. We expected these 4 adjective pairs (gloomy-happy, easy-serious, calm-aggressive, tidy-messy) to capture the mood of a conversation. Based on the four adjective pairs, we constructed a new training data set containing 60 hours of conversations in Japanese. In this study, only data obtained by microphones are used to estimate the conversational mood. The data set was annotated on the four adjective scales to learn the mood of the conversations. We developed an LSTM deep neural network model that can read the "conversational mood" in real time. Furthermore, in our proposed neural network model, the amount of laughter, which is generally measured by capturing facial expressions with a camera, is estimated together with the conversational mood. Because laughter is considered to play an important role in creating a cheerful environment, it can be used to evaluate the conversational mood. The evaluation results demonstrate the validity of our model. This model is expected to be applied to a system that can influence or control the mood of conversations in various ways, including the presentation of ambient music and aromas, depending on the purpose of the discussion, such as a conference, a chat, or a business meeting.
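The abstract describes an LSTM model that jointly estimates four adjective-pair mood scores and the amount of laughter from conversational features. The following is a minimal sketch of such a multi-task LSTM, not the authors' released code: the feature dimension, hidden size, and bounded output ranges are illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's implementation) of a
# multi-task LSTM that maps a sequence of per-window conversation features to
# four adjective-pair mood scores plus an estimated amount of laughter.
import torch
import torch.nn as nn


class MoodLSTM(nn.Module):
    def __init__(self, feature_dim=128, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        # Four mood scales: gloomy-happy, easy-serious, calm-aggressive, tidy-messy
        self.mood_head = nn.Linear(hidden_dim, 4)
        # Scalar estimate of the amount of laughter in the current window
        self.laughter_head = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        # x: (batch, time, feature_dim) sequence of conversation features
        out, _ = self.lstm(x)
        last = out[:, -1, :]                             # hidden state at the latest step
        mood = torch.tanh(self.mood_head(last))          # mood scores in [-1, 1]
        laughter = torch.relu(self.laughter_head(last))  # non-negative laughter amount
        return mood, laughter


# Example: score a batch of 2 conversations, each 20 time steps long
model = MoodLSTM()
features = torch.randn(2, 20, 128)
mood_scores, laughter_amount = model(features)
print(mood_scores.shape, laughter_amount.shape)  # torch.Size([2, 4]) torch.Size([2, 1])
```

Reading only the final hidden state keeps the sketch simple; a real-time variant could instead emit predictions at every time step as new speech features arrive.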

Keywords: Mood, conversation, deep learning, affective ambient intelligence

DOI: 10.54941/ahfe1004396

