Han Chang*, Zhu Zhenlin**, Chai Xinling**, Wang Runmin**, Xiong Zhengqiang*. Research on a Transformer-based multi-scale grouped dilated self-attention mechanism for complex scene segmentation[J]. 高技术通讯(中文), 2025, 35(8): 847-860
Research on a Transformer-based multi-scale grouped dilated self-attention mechanism for complex scene segmentation
DOI: 10.3772/j.issn.1002-0470.2025.08.004
Keywords: Transformer; multi-scale features; self-attention mechanism; pedestrian re-identification; semantic segmentation
Funding:
Authors: Han Chang*, Zhu Zhenlin**, Chai Xinling**, Wang Runmin**, Xiong Zhengqiang*
Affiliations: *School of Mechanical and Electrical Engineering, Wuhan Business University, Wuhan 430056; **College of Information Science and Engineering, Hunan Normal University, Changsha 410081
Chinese abstract:
Image processing in complex scenes faces challenges such as varied segmentation scenes, difficult localization, and excessive computational load. To address them, this paper proposes a Transformer-based multi-scale grouped dilated self-attention mechanism: attention is extracted within local groups using dilated self-attention whose dilation rates are not multiples of one another, which avoids the checkerboard effect; local-enhanced positional encoding (LePE) is used to handle local positional information; edge self-attention is extracted between the local self-attention groups and further fused with inter-group self-attention; conditional positional encodings (CPE) then re-integrate global positional information, improving the accuracy of image feature extraction. The proposed method is evaluated on image classification, pedestrian re-identification, and semantic segmentation datasets, and achieves performance comparable to most current typical architectures on ImageNet (image classification), Market-1501 and MSMT17 (pedestrian re-identification), and ADE20K and Cityscapes (semantic segmentation). Experiments on these public datasets fully validate the effectiveness of the proposed method and suggest promising directions for future research in image processing.
English abstract:
To address the challenges of image processing in complex scenes, such as different segmentation scenes, localization difficulties, and overloaded computation, this paper proposes a Transformer-based multi-scale grouped dilated self-attention mechanism. Attention is extracted at the local group level by dilated self-attention whose dilation rates are non-multiples of one another, which avoids the checkerboard effect. The LePE module is used to enhance the processing of local positional encoding information. Edge self-attention is extracted between the local self-attention groups and further fused with inter-group self-attention, and CPE positional encoding then integrates global positional information again to enhance the accuracy of image feature extraction. The proposed method is evaluated on image classification, pedestrian re-identification, and semantic segmentation datasets. The Transformer-based multi-scale grouped dilated self-attention mechanism achieves performance comparable to most current typical architectures on the ImageNet image classification dataset, the Market-1501 and MSMT17 pedestrian re-identification datasets, and the ADE20K and Cityscapes semantic segmentation datasets. The experiments conducted on several public datasets further validate the effectiveness of the model and provide thought-provoking cutting-edge directions for future technical research in image processing.
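The abstract describes the architecture only at a high level, so the following PyTorch snippet is a minimal, hypothetical sketch of how grouped dilated (atrous) local self-attention with a LePE-style positional term could be organized. The class name GroupedDilatedAttention, the window size, the dilation rates (1, 2, 3), and the one-query-per-window formulation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch only: grouped dilated local self-attention with a
# LePE-style positional term (depthwise conv on the values). Not the
# authors' code; window size, dilation rates and layout are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupedDilatedAttention(nn.Module):
    """Channels are split into groups; group i attends inside a local w x w
    window sampled with dilation rate dilations[i] (rates such as 1, 2, 3
    keep the sampling grids from aligning into a checkerboard pattern)."""

    def __init__(self, dim, window=7, dilations=(1, 2, 3)):
        super().__init__()
        assert dim % len(dilations) == 0, "dim must split evenly into groups"
        self.window, self.dilations = window, dilations
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1)
        # LePE-style term: depthwise conv applied to the value tensor.
        self.lepe = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def _attend(self, q, k, v, d):
        # q, k, v: (B, Cg, H, W); every position queries its dilated window.
        B, C, H, W = q.shape
        w = self.window
        pad = d * (w - 1) // 2  # keeps the spatial size unchanged
        win = lambda t: F.unfold(t, w, dilation=d, padding=pad).view(B, C, w * w, H * W)
        k_win, v_win = win(k), win(v)                    # (B, Cg, w*w, H*W)
        q = q.reshape(B, C, 1, H * W)
        attn = (q * k_win).sum(1) / C ** 0.5             # (B, w*w, H*W)
        attn = attn.softmax(dim=1)
        out = (attn.unsqueeze(1) * v_win).sum(2)         # (B, Cg, H*W)
        return out.view(B, C, H, W)

    def forward(self, x):                                # x: (B, C, H, W)
        q, k, v = self.qkv(x).chunk(3, dim=1)
        groups = zip(q.chunk(len(self.dilations), 1),
                     k.chunk(len(self.dilations), 1),
                     v.chunk(len(self.dilations), 1),
                     self.dilations)
        out = torch.cat([self._attend(qi, ki, vi, d) for qi, ki, vi, d in groups], dim=1)
        return self.proj(out + self.lepe(v))             # add the LePE term


if __name__ == "__main__":
    x = torch.randn(1, 96, 56, 56)                       # e.g. a stage-1 feature map
    print(GroupedDilatedAttention(96)(x).shape)          # torch.Size([1, 96, 56, 56])
```

In this sketch each channel group uses the same window size but its own dilation rate, so the groups together cover several receptive-field scales at roughly the cost of a single local attention pass.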
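The conditional positional encoding (CPE) named in the abstract is, in its original CPVT formulation, generated from local context by a depthwise convolution over the token grid. The sketch below illustrates only that generic idea; how the authors integrate it with the fused attention output is not specified in the abstract, and the class name and 3x3 kernel size are assumptions.

```python
# Hypothetical sketch only: a CPE/PEG-style positional encoding generator.
import torch
import torch.nn as nn


class ConditionalPositionalEncoding(nn.Module):
    """Positions are inferred from local context by a depthwise 3x3
    convolution over the token grid and added back to the tokens."""

    def __init__(self, dim):
        super().__init__()
        self.peg = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens, height, width):
        # tokens: (B, H*W, C) sequence laid out on an H x W grid
        b, n, c = tokens.shape
        grid = tokens.transpose(1, 2).reshape(b, c, height, width)
        return tokens + self.peg(grid).flatten(2).transpose(1, 2)


if __name__ == "__main__":
    cpe = ConditionalPositionalEncoding(96)
    out = cpe(torch.randn(1, 56 * 56, 96), 56, 56)
    print(out.shape)  # torch.Size([1, 3136, 96])
```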