Article Abstract
Hu Zhengping* **, Wang Xinyu*, Dong Jiawei*, Zhao Yanshuang*, Liu Yang*. Fine-grained 2D convolutional network model for action recognition based on spatio-temporal multi-scale correlation feature fusion [J]. Chinese High Technology Letters, 2024, 34(6): 590-601
Fine-grained 2D convolutional network model for action recognition based on spatio-temporal multi-scale correlation feature fusion
  
DOI: 10.3772/j.issn.1002-0470.2024.06.004
Keywords: fine-grained action recognition; multi-scale spatio-temporal correlation feature; long-range dependency modeling; self-attention mechanism
Author affiliations:
Hu Zhengping* ** (*School of Information Science and Engineering, Yanshan University, Qinhuangdao 066004) (**Hebei Key Laboratory of Information Transmission and Signal Processing, Yanshan University, Qinhuangdao 066004)
Wang Xinyu*
Dong Jiawei*
Zhao Yanshuang*
Liu Yang*
Abstract:
      To address two problems of traditional 2-dimensional (2D) convolutional networks, namely that they extract spatio-temporal features at only a single scale and that they make insufficient use of the long-range temporal correlations between frames in fine-grained action datasets, this paper proposes a fine-grained 2D convolutional network model for action recognition based on spatio-temporal multi-scale correlation feature fusion. First, to model multi-scale spatial correlations in video and strengthen the spatial representation of fine-grained video data, the model uses a multi-scale "feature squeeze, feature excitation" scheme that makes the spatial features extracted by the network richer and more effective. Second, to fully exploit the motion information along the temporal dimension of fine-grained video data, a temporal window self-attention mechanism is introduced: it leverages the strong long-range dependency modeling ability of self-attention while performing attention only along the temporal dimension, so long-range temporal dependencies are modeled at low computational cost. Finally, because the extracted spatial and temporal features contribute unequally to the classification of different action types, an adaptive feature fusion module is introduced that dynamically assigns weights to the features. The model reaches Top-1 accuracies of 86.0% and 46.9% on the two fine-grained action recognition datasets Diving48 and Something-somethingV1, improving on the Top-1 accuracy of the original backbone network by 3.8% and 1.3% respectively. Experimental results show that, using only video frames as input, the model achieves recognition accuracy comparable to existing algorithms based on Transformers and 3-dimensional convolutional neural networks (3D CNNs).
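To make the first component concrete, the following is a minimal PyTorch sketch of a multi-scale "feature squeeze, feature excitation" block in the spirit of squeeze-and-excitation networks. It assumes the 2D backbone produces per-frame feature maps of shape (N*T, C, H, W); the class name MultiScaleSE and the scales/reduction parameters are illustrative assumptions, not the authors' published code.

import torch
import torch.nn as nn

class MultiScaleSE(nn.Module):
    """Channel gating from spatial descriptors pooled at several scales (illustrative sketch)."""
    def __init__(self, channels: int, scales=(1, 2, 4), reduction: int = 16):
        super().__init__()
        # "Squeeze": pool the feature map to several grid sizes so the
        # descriptor captures spatial context at more than one scale.
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(s) for s in scales)
        squeezed = channels * sum(s * s for s in scales)
        # "Excitation": map the concatenated multi-scale descriptor to
        # per-channel gates in (0, 1) via a bottleneck MLP.
        self.fc = nn.Sequential(
            nn.Linear(squeezed, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N*T, C, H, W) per-frame features from the 2D backbone.
        desc = torch.cat([p(x).flatten(1) for p in self.pools], dim=1)
        gate = self.fc(desc).view(x.size(0), -1, 1, 1)
        return x * gate  # reweight channels with the multi-scale gates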
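The second component restricts self-attention to the temporal axis. The sketch below, again under assumed shapes and names, treats the T frame features at each spatial location as one token sequence and attends within non-overlapping temporal windows, so the cost grows with T^2 per location rather than with (T*H*W)^2 as in full spatio-temporal attention.

import torch
import torch.nn as nn

class TemporalWindowAttention(nn.Module):
    def __init__(self, channels: int, heads: int = 4, window: int = 8):
        super().__init__()  # heads must divide channels
        self.window = window
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, T, C, H, W). Fold the spatial grid into the batch so each
        # token sequence is one (h, w) location observed over T frames;
        # attention therefore mixes information across time only.
        n, t, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(n * h * w, t, c)
        outs = []
        for s in range(0, t, self.window):  # non-overlapping windows
            q = self.norm(seq[:, s:s + self.window])
            a, _ = self.attn(q, q, q)
            outs.append(a)
        seq = seq + torch.cat(outs, dim=1)  # residual connection
        return seq.reshape(n, h, w, t, c).permute(0, 3, 4, 1, 2)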
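Finally, the adaptive feature fusion module weights the two streams per sample instead of summing them uniformly. A minimal sketch, assuming pooled clip-level descriptors of shape (N, C) from a spatial branch and a temporal branch; the gating MLP design is an illustrative assumption.

import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(inplace=True),
            nn.Linear(channels, 2),
            nn.Softmax(dim=-1),  # the two branch weights sum to 1
        )

    def forward(self, f_spatial: torch.Tensor, f_temporal: torch.Tensor):
        # Predict per-sample branch weights from both descriptors, then blend.
        w = self.gate(torch.cat([f_spatial, f_temporal], dim=-1))
        return w[:, :1] * f_spatial + w[:, 1:] * f_temporal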