Automatic Text-to-sound Generation by Doc2Vec

Open Access
Conference Proceedings
Authors: Kakeru Iwamoto, Hironori Uchida, Yujie Li, Yoshihisa Nakatoh

Abstract: The market for games and video viewing services is expanding, and with it the demand for sound effects used to produce content for these services. However, sound-effect production often requires expensive software or hardware, skill with specialized equipment, and experience. This study therefore aims to reduce the cost of sound-effect production and to let creators obtain the sound effects they imagine. It examines a method for automatically generating sound effects from text input using Doc2Vec. A natural language processing (NLP) model, built with Doc2Vec and pre-trained on the sound-related language expressions (labels) of the VGG-Sound dataset, calculates the similarity between the input text and the labels in the dataset. A specified number of the sounds whose labels are most similar to the input are then downloaded from VGG-Sound, synthesized in order of similarity, and output as audio. Furthermore, we verify the correlation between the sentences used in the proposed method and the generated sound effects. In an experiment, we presented participants with the generated sounds and the sentences used to create them and asked them to rate each pair on a 5-point scale. The sentences used for generation were of five types: those that exist in the dataset, those that appear only a small number of times in the dataset, those that do not exist in the dataset, those with added information such as location or scene, and those that contain multiple events. The results show that the more frequently a sentence appears in the dataset, the higher the rating, and that sentences with added information or multiple events receive lower ratings.
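
The label-matching step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a bag-of-words cosine similarity stands in for the Doc2Vec vectors, the labels are hypothetical examples rather than actual VGG-Sound entries, and the function names and the `top_k` parameter are assumptions made for illustration. It shows only how an input sentence might be scored against dataset labels and how the most similar labels could be selected, in similarity order, for download and synthesis.

```python
import math
import re
from collections import Counter

# Hypothetical labels in the style of a sound dataset (not taken from the paper).
LABELS = [
    "dog barking",
    "rain falling on a roof",
    "people clapping",
    "car engine starting",
]


def embed(text):
    """Stand-in embedding: a bag-of-words count vector (an assumption, NOT Doc2Vec)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_labels(query, labels, top_k=2):
    """Return the top_k (score, label) pairs, most similar first."""
    q = embed(query)
    scored = [(cosine(q, embed(label)), label) for label in labels]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# The selected labels would identify which clips to fetch and combine.
ranked = rank_labels("a dog barking in the rain", LABELS)
for score, label in ranked:
    print(f"{score:.3f}  {label}")
```

In a setup closer to the one described, `embed` would presumably be replaced by vectors inferred from a Doc2Vec model trained on the dataset labels; the ranking logic would stay the same.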

Keywords: NLP, Doc2Vec, Text to Sound, AI, Automatic Generation

DOI: 10.54941/ahfe1004033
