A Lip Reading Recognition System Based on SimAM and TCN
Open Access Article (Conference Proceedings)
Authors: Yi Liu, Yuanyao Lu
Abstract: Lip-reading recognition converts the visual information of a speaker’s lip movements into the corresponding text. It has broad applications in fields such as national defense, healthcare, and public safety, and holds significant academic value. In recent years, with the rapid advancement of deep learning, lip-reading technology has made notable progress and produced a number of breakthrough results. This paper proposes a novel lip-reading recognition architecture that integrates a Residual Network (ResNet) with a Temporal Convolutional Network (TCN) and introduces a simple yet highly effective attention mechanism, the Simple Attention Module (SimAM). The key components of the proposed approach are as follows. (1) Feature extraction: ResNet extracts spatial features from lip images. By adding residual connections to conventional convolutional neural networks, ResNet alleviates information loss, mitigates the vanishing-gradient problem, and makes deep-layer features easier to exploit. (2) SimAM: Traditional attention mechanisms typically enhance features along either the spatial or the channel dimension, which limits their ability to learn full three-dimensional attention weights and often incurs high computational cost. SimAM addresses these limitations: building on a spatial-suppression mechanism, it computes an attention weight for each neuron without introducing any additional parameters, attending to the spatial and channel dimensions simultaneously. (3) Temporal modeling: A TCN performs sequence modeling by applying convolutions along the temporal axis. Unlike recurrent networks, a TCN computes in parallel, captures long-range dependencies effectively, and offers a simpler architecture with faster and more stable training, which makes it particularly well suited to large-scale lip-reading datasets. To validate the proposed model, experiments were conducted on LRW, the largest publicly available lip-reading dataset, which covers diverse pronunciation scenarios and a large number of samples. Comparative experiments against several state-of-the-art architectures show that the proposed model achieves significant improvements in both recognition accuracy and computational efficiency.
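As a minimal sketch of the SimAM weighting step, following the published closed-form energy formulation: each neuron’s weight is derived from how much it stands out from its spatial neighbors (lower energy marks a more distinctive neuron), so no extra parameters are learned. The regularizer e_lambda and the (batch, channel, height, width) layout are illustrative assumptions, not settings reported in this paper.

```python
import torch

def simam(x: torch.Tensor, e_lambda: float = 1e-4) -> torch.Tensor:
    """Parameter-free SimAM attention over a (B, C, H, W) feature map."""
    b, c, h, w = x.shape
    n = h * w - 1                                    # spatial neighbors per channel
    d = (x - x.mean(dim=[2, 3], keepdim=True)).pow(2)
    v = d.sum(dim=[2, 3], keepdim=True) / n          # spatial variance estimate
    # Inverse of the minimal energy: distinctive neurons get larger values.
    e_inv = d / (4 * (v + e_lambda)) + 0.5
    return x * torch.sigmoid(e_inv)                  # reweight the features
```

On the temporal side, the core of a TCN layer is a dilated causal 1-D convolution: left-padding keeps the output at frame t from seeing later frames, and doubling the dilation from layer to layer grows the receptive field exponentially. The sketch below is one common way to realize this; the kernel size, channel count, and dilation schedule are illustrative, not the paper’s configuration.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Dilated 1-D convolution made causal by trimming the right-side padding."""
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=self.pad, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)                           # x: (batch, channels, frames)
        return out[:, :, :-self.pad] if self.pad else out

# Stacking such layers with dilations 1, 2, 4, ... covers a long clip of
# per-frame ResNet features, e.g. x = torch.randn(8, 512, 29) for 29 frames.
```

Because these convolutions process every frame of the sequence at once rather than stepping through time, they permit the parallel computation and stable training that the abstract contrasts with recurrent networks.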
Keywords: Lip Reading, Simple Attention Module, Temporal Convolutional Networks
DOI: 10.54941/ahfe1006622