A Lip Reading Recognition System Based on SimAM and TCN
Abstract
Lip-reading recognition converts the visual information in a speaker’s lip movements into the corresponding text. It has applications in national defense, healthcare, and public safety, and is of significant academic interest. With the rapid advance of deep learning, lip-reading technology has made notable progress in recent years. This paper proposes a lip-reading recognition architecture that combines a Residual Network (ResNet) with a Temporal Convolutional Network (TCN) and adds a simple, parameter-free attention mechanism, the Simple Attention Module (SimAM). The approach has three key components. (1) Feature extraction: ResNet extracts spatial features from lip images. By adding residual connections to a conventional convolutional neural network, ResNet alleviates information loss and mitigates the vanishing gradient problem, so that deep-layer features are used more efficiently. (2) SimAM: Traditional attention mechanisms usually enhance features along either the spatial or the channel dimension, which limits their ability to learn complex, multi-dimensional attention weights, and they typically incur high computational cost. SimAM addresses these limitations: it uses a spatial suppression mechanism to compute an attention weight for each neuron, requires no additional parameters, and attends to the spatial and channel dimensions simultaneously. (3) Temporal modeling: TCN performs sequence modeling by applying convolutions along the temporal axis. Unlike recurrent networks, TCN supports parallel computation, captures long-range dependencies effectively, and has a simpler architecture that trains faster and more stably, which makes it well suited to large-scale lip-reading datasets.
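As an illustration of the parameter-free weighting attributed to SimAM above, the following is a minimal sketch in plain Python, not the implementation used in the paper. It follows the commonly published closed-form SimAM weighting, in which each neuron's weight grows with its squared deviation from the channel mean. The function names `simam_channel` and `simam` and the regularizer value `e_lambda=1e-4` are assumptions chosen for illustration.

```python
import math


def simam_channel(fmap, e_lambda=1e-4):
    """Reweight one channel (an H x W list of floats) with SimAM-style attention.

    Assumption: e_lambda is a small regularizer; 1e-4 is an illustrative default.
    """
    values = [v for row in fmap for v in row]
    n = len(values) - 1  # number of "other" neurons in the channel
    if n < 1:
        raise ValueError("a channel needs at least two spatial positions")

    mu = sum(values) / len(values)
    sq_dev = [[(v - mu) ** 2 for v in row] for row in fmap]
    variance = sum(d for row in sq_dev for d in row) / n

    out = []
    for row, dev_row in zip(fmap, sq_dev):
        out_row = []
        for v, d in zip(row, dev_row):
            # Inverse energy: neurons that stand out from the channel mean score higher.
            inv_energy = d / (4.0 * (variance + e_lambda)) + 0.5
            weight = 1.0 / (1.0 + math.exp(-inv_energy))  # sigmoid
            out_row.append(v * weight)
        out.append(out_row)
    return out


def simam(feature_maps, e_lambda=1e-4):
    """Apply the weighting to every channel of a C x H x W nested list."""
    return [simam_channel(channel, e_lambda) for channel in feature_maps]
```

No learnable parameters appear anywhere in the sketch: every weight is computed from the statistics of the channel itself, and each neuron receives its own weight, which is how the module can cover spatial and channel dimensions together.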
To validate the effectiveness of the proposed model, experiments were conducted on the largest publicly available lip-reading dataset, LRW, which features diverse pronunciation scenarios and a large number of samples. Comparative experiments with various state-of-the-art architectures demonstrate that the proposed model achieves significant improvements in both recognition accuracy and computational efficiency.
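The temporal modeling step in the abstract, convolution along the time axis, can be illustrated with a small plain-Python sketch. The abstract does not say whether the model's TCN is causal or which kernel sizes and dilations it uses, so the causal zero-padding, the dilation schedule, and the helper names `causal_dilated_conv1d` and `receptive_field` are assumptions taken from the generic TCN formulation, not details of the proposed model.

```python
def causal_dilated_conv1d(sequence, kernel, dilation=1):
    """Convolve a 1-D sequence so that output[t] only sees inputs at or before t.

    Positions before the start of the sequence are treated as zeros.
    """
    k_size = len(kernel)
    out = []
    for t in range(len(sequence)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - (k_size - 1 - k) * dilation
            if idx >= 0:
                acc += w * sequence[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Frames visible to one output of a stack with one convolution per level."""
    return 1 + (kernel_size - 1) * sum(dilations)


frames = [1, 2, 3, 4]
print(causal_dilated_conv1d(frames, [1, 1], dilation=1))  # [1.0, 3.0, 5.0, 7.0]
print(causal_dilated_conv1d(frames, [1, 1], dilation=2))  # [1.0, 2.0, 4.0, 6.0]
print(receptive_field(3, [1, 2, 4]))                      # 15
```

Each output position depends only on a fixed window of inputs, so all time steps can be computed in parallel, and growing the dilation at each level widens the window quickly. This is the sense in which a TCN can capture long-range dependencies without recurrence.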
Keywords: Lip Reading, Simple Attention Module, Temporal Convolutional Networks
DOI: 10.54941/ahfe1006622
AHFE Open Access