An Action Recognition Method Based on 3D Feature Fusion
Open Access Article | Conference Proceedings
Authors: Yinhao Xu, Yuanyao Lu
Abstract: Video, unlike a single image, encompasses both spatial and temporal dimensions. In the spatial dimension, it contains the same visual elements found in static images, such as color, texture, shape, and edge information, which are crucial for identifying objects within each frame. The temporal dimension, however, makes video far more complex: motion features describe how objects move over time, including their velocity, acceleration, and direction, while external factors such as lighting conditions, background clutter, and occlusion further affect the content. As an important branch of video understanding, human action recognition has attracted widespread attention from both the research community and industry. Accurately recognizing human actions in video has numerous applications, ranging from surveillance systems to human-computer interaction, sports analysis, and entertainment. At present, there are three mainstream approaches to processing video data for action recognition: C3D, two-stream networks, and (2+1)D networks. SlowFast is a typical variant of C3D. Its core idea is to process the video with two pathways, named the Slow pathway and the Fast pathway. Compared with the Fast pathway, the Slow pathway operates at a lower frame rate but with more channels; it captures spatial semantic information, that is, the relatively static content of the video. The Fast pathway operates at a higher frame rate with fewer channels, which greatly reduces its computational cost but weakens its ability to model spatial information, so it focuses on content that changes markedly along the temporal dimension. The two pathways do not operate independently: they exchange information through multiple lateral connections, but the fusion is unidirectional, from Fast to Slow. As a result, the Fast pathway receives no information from the Slow pathway, which inevitably loses some spatial semantic information. We believe that a more effective feature fusion method can further improve recognition accuracy. Building on the well-known two-branch SlowFast network, this paper proposes an enhanced network named ESL Net. Its key innovation is an improved 3D feature fusion module designed to make full use of the temporal information in the video. The module employs temporal and spatial attention mechanisms to identify the most significant parts of the features, and by analyzing temporal information it determines the crucial elements shared between the dual temporal features. Extensive experiments on the UCF-101 and HMDB51 datasets demonstrate that the proposed method outperforms existing methods, in particular the SlowFast network, in terms of accuracy and robustness for human action recognition.
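To make the fusion idea in the abstract concrete, the following PyTorch sketch shows one way a bidirectional lateral connection with temporal and spatial attention could be wired between a Slow-like and a Fast-like feature map. It is an illustrative approximation, not the authors' ESL Net: the class names, layer sizes, attention design, and the temporal-stride factor alpha are all assumptions chosen only to demonstrate the general technique of attention-guided two-way fusion.

```python
# Minimal sketch (assumed design, not the authors' ESL Net): a bidirectional
# fusion block that exchanges information between a Slow-like and a Fast-like
# feature map using simple temporal and spatial attention.
# Tensors follow the common video-CNN convention (N, C, T, H, W).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalSpatialAttention(nn.Module):
    """Re-weights a 3D feature map along the temporal axis, then spatially."""

    def __init__(self, channels: int):
        super().__init__()
        # Temporal attention: pool over space, score each frame.
        self.temporal_fc = nn.Conv3d(channels, 1, kernel_size=1)
        # Spatial attention: pool over channels, score each location.
        self.spatial_conv = nn.Conv3d(2, 1, kernel_size=(1, 7, 7), padding=(0, 3, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, t, h, w = x.shape
        # Temporal weights: (N, 1, T, 1, 1), softmax over T so frames compete.
        t_score = self.temporal_fc(x.mean(dim=(3, 4), keepdim=True))
        t_weight = torch.softmax(t_score, dim=2)
        x = x * t_weight * t  # rescale so overall magnitude stays comparable
        # Spatial weights from concatenated average- and max-pooled channel maps.
        avg_map = x.mean(dim=1, keepdim=True)
        max_map = x.amax(dim=1, keepdim=True)
        s_weight = torch.sigmoid(self.spatial_conv(torch.cat([avg_map, max_map], dim=1)))
        return x * s_weight


class BidirectionalFusion(nn.Module):
    """Fuses Slow and Fast features in both directions via lateral projections."""

    def __init__(self, slow_channels: int, fast_channels: int, alpha: int = 8):
        super().__init__()
        # Fast -> Slow: strided temporal conv compresses T by alpha (as in SlowFast).
        self.fast_to_slow = nn.Conv3d(
            fast_channels, slow_channels,
            kernel_size=(5, 1, 1), stride=(alpha, 1, 1), padding=(2, 0, 0))
        # Slow -> Fast: 1x1x1 projection; temporal upsampling happens in forward().
        self.slow_to_fast = nn.Conv3d(slow_channels, fast_channels, kernel_size=1)
        self.slow_attn = TemporalSpatialAttention(slow_channels)
        self.fast_attn = TemporalSpatialAttention(fast_channels)

    def forward(self, slow: torch.Tensor, fast: torch.Tensor):
        # Enrich the Slow pathway with motion cues from Fast.
        slow_out = slow + self.slow_attn(self.fast_to_slow(fast))
        # Enrich the Fast pathway with spatial semantics from Slow,
        # upsampled along time to match Fast's frame rate.
        slow_proj = self.slow_to_fast(slow)
        slow_proj = F.interpolate(slow_proj, size=fast.shape[2:], mode="nearest")
        fast_out = fast + self.fast_attn(slow_proj)
        return slow_out, fast_out


if __name__ == "__main__":
    # Toy shapes: Slow sees 4 frames with 64 channels, Fast sees 32 frames with 8.
    slow = torch.randn(2, 64, 4, 56, 56)
    fast = torch.randn(2, 8, 32, 56, 56)
    fused_slow, fused_fast = BidirectionalFusion(64, 8)(slow, fast)
    print(fused_slow.shape, fused_fast.shape)  # shapes unchanged, content fused
```

The second lateral projection (Slow to Fast) is the part that the original unidirectional SlowFast connections omit; in this sketch it is what lets spatial semantic information reach the Fast pathway.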
Keywords: Human action recognition, 3D feature fusion, Two-branch network, SlowFast
DOI: 10.54941/ahfe1005815