Lip-Reading Research Based on ShuffleNet and Attention-GRU

Open Access
Conference Proceedings
Authors: Yixian Fu, Yuanyao Lu

Abstract: Human-computer interaction has seen a paradigm shift from textual or display-based control towards more intuitive modes of control such as voice, gesture and mimicry. Speech recognition in particular has attracted a lot of attention because speech is the most prominent mode of communication. However, the performance of speech recognition systems varies significantly with the source of background noise, the type of talker and the listener's hearing ability. This has motivated lip recognition technology, which detects spoken words by tracking the speaker's lip movements. It provides an alternative for environments with high background noise and for people with hearing impairments. Lip reading technology also has widespread applications in public safety analysis, animation lip synthesis, identity authentication and other fields. Traditionally, most work in lipreading was based on hand-engineered features that were usually modeled by an HMM-based pipeline. More recently, deep learning methods have been deployed either for extracting 'deep' features or for building end-to-end architectures. In this paper, we propose a neural network architecture combining a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) with a plug-in attention mechanism. The model consists of five parts: (1) Input: We use the Dlib library to detect 68 facial landmarks, crop the lip area and extract 29 consecutive frames from the video sequence. The frames go through a simple C3D network for generic feature extraction. (2) CNN: As neural networks become deeper and deeper, computational complexity increases significantly as well, which has motivated lightweight model architecture design. We therefore use a lightweight CNN named ShuffleNet, pre-trained on the ImageNet dataset, to perform spatial downsampling of a single image. 
ShuffleNet mainly uses two new operations, namely pointwise group convolution and channel shuffle, which greatly reduce the computational cost without affecting recognition accuracy. (3) CBAM: In image processing, a feature map contains a variety of important information. A traditional convolutional neural network performs convolution in the same way on all channels, but the importance of the information varies greatly from channel to channel. To improve the feature extraction performance of convolutional neural networks, we use an attention mechanism named the Convolutional Block Attention Module (CBAM), a simple and effective attention module for feedforward convolutional neural networks. It contains two independent sub-modules, the Channel Attention Module (CAM) and the Spatial Attention Module (SAM), which perform channel and spatial attention respectively. (4) RNN: The traditional Recurrent Neural Network (RNN) is mainly used to process sequential data, but as the RNN is extended over longer sequences it may be unable to connect all related information, which may cause key information to be lost. It cannot solve the long-distance dependency problem, and its performance may drop significantly. Because of this shortcoming of the traditional RNN, we select the GRU network in this paper, which is a variant of the LSTM. It has a simpler structure and better performance than the LSTM neural network. (5) Output: Lastly, we pass the result of the backend to a SoftMax layer to classify the final word. In our experiments, we compare several model architectures and find that our model achieves accuracy comparable to the current state-of-the-art model at a lower computational cost.
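
The two ShuffleNet operations named in part (2) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names channel_shuffle and pointwise_conv_macs are hypothetical, channels are represented as a flat Python list rather than a tensor, and the layer sizes in the usage lines are arbitrary illustration values. The sketch shows how channel shuffle interleaves channels across groups, and why splitting a 1x1 convolution into g groups divides its multiply-accumulate count by g.

```python
def channel_shuffle(channels, groups):
    """Reorder a flat list of channels so each output group mixes all input groups.

    Equivalent to viewing the channels as a (groups, per_group) grid,
    transposing it, and flattening the result.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


def pointwise_conv_macs(height, width, c_in, c_out, groups=1):
    """Multiply-accumulate count of a 1x1 convolution split into `groups` groups.

    Each output channel only sees c_in / groups input channels, so the cost
    shrinks by a factor of `groups` compared with an ungrouped 1x1 convolution.
    """
    if c_in % groups != 0 or c_out % groups != 0:
        raise ValueError("c_in and c_out must be divisible by groups")
    return height * width * (c_in // groups) * c_out


# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))  # [0, 3, 1, 4, 2, 5]

# Illustrative sizes only (assumed, not taken from the paper).
dense = pointwise_conv_macs(28, 28, 240, 240, groups=1)
grouped = pointwise_conv_macs(28, 28, 240, 240, groups=3)
print(dense // grouped)  # 3
```

Without the shuffle, stacked group convolutions would keep each group's channels isolated from the others; interleaving them lets the next grouped layer see features from every group at no arithmetic cost.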

Keywords: Lip reading, CBAM, Light-weight network

DOI: 10.54941/ahfe1004024
