Video Action Recognition Based on Spatio-Temporal Feature Pyramid Module

by: GONG Suming, CHEN Ying

Format: Article
Published: Journal of Computer Engineering and Applications Beijing Co., Ltd., Science Press 2022-09-01

Description

Mainstream 2D convolutional neural network methods for video action recognition cannot extract the correlations between input frames, which makes it difficult for the network to capture spatio-temporal feature information across frames and limits recognition accuracy. To address this problem, a universal spatio-temporal feature pyramid module (STFPM) is proposed. STFPM consists of a feature pyramid and a dilated convolution pyramid, and it can be embedded directly into an existing 2D convolutional network to form a new action recognition network named spatio-temporal feature pyramid net (STFP-Net). Given a multi-frame image input, STFP-Net first extracts the spatial features of each frame independently and records them as the original features. The STFPM then uses matrix operations to construct a feature pyramid from the original features. Next, the dilated convolution pyramid is applied to the feature pyramid to extract spatio-temporal features that carry both temporal and spatial correlation. The original features and the spatio-temporal features are then fused by weighted summation and passed on to the deeper layers of the network. Finally, a fully connected layer classifies the action in the video. Compared with the baseline, STFP-Net introduces negligible additional parameters and computational cost. Experimental results show that, compared with mainstream methods of recent years, STFP-Net significantly improves classification accuracy on the widely used datasets UCF101 and HMDB51.
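The fusion idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes each frame has already been reduced to a single feature vector, replaces the unspecified "matrix operation" feature pyramid with the raw per-frame features, and uses a hypothetical fixed smoothing kernel where the real module would use learned weights. The function names, the dilation rates (1, 2, 3), and the fusion weight `alpha` are all assumptions made for illustration.

```python
def dilated_temporal_conv(frames, kernel, dilation):
    """Convolve along the frame (time) axis with zero padding.

    frames: list of T per-frame feature vectors, each a list of C floats.
    kernel: list of K scalar weights (K odd), shared across channels
            (an assumption; a real module would learn its weights).
    """
    t_len = len(frames)
    channels = len(frames[0])
    half = len(kernel) // 2
    out = []
    for t in range(t_len):
        acc = [0.0] * channels
        for k, w in enumerate(kernel):
            src = t + (k - half) * dilation  # neighbor frame index
            if 0 <= src < t_len:  # frames outside the clip count as zero
                for c in range(channels):
                    acc[c] += w * frames[src][c]
        out.append(acc)
    return out


def spatio_temporal_fuse(frames, kernel=(0.25, 0.5, 0.25),
                         dilations=(1, 2, 3), alpha=0.5):
    """Average several dilated temporal branches, then fuse the result
    with the original per-frame features by weighted summation."""
    branches = [dilated_temporal_conv(frames, list(kernel), d)
                for d in dilations]
    fused = []
    for t in range(len(frames)):
        row = []
        for c in range(len(frames[0])):
            st = sum(b[t][c] for b in branches) / len(branches)
            row.append((1.0 - alpha) * frames[t][c] + alpha * st)
        fused.append(row)
    return fused


# Four frames, one feature channel each (toy values).
clip = [[1.0], [2.0], [3.0], [4.0]]
fused_clip = spatio_temporal_fuse(clip)
```

The output has the same shape as the input, so a block like this can sit between two layers of a 2D backbone without changing the layers that follow it. Setting `alpha` to 0 returns the original features unchanged.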