SUN Haocheng (孙浩诚), DUAN Yong. Medical visual question answering enhanced by multimodal feature augmentation and tri-path collaborative attention [J]. High Technology Letters, 2025, 31(2): 175-183
Medical visual question answering enhanced by multimodal feature augmentation and tri-path collaborative attention |
|
DOI: 10.3772/j.issn.1006-6748.2025.02.007
Keywords: multimodal, deep learning, visual question answering (VQA), feature extraction, attention mechanism
SUN Haocheng (孙浩诚), DUAN Yong
(School of Information Science and Engineering, Shenyang University of Technology, Shenyang 110870, P. R. China)
(Shenyang Key Laboratory of Advanced Computing and Application Innovation, Shenyang 110870, P. R. China)
|
|
Abstract:
Medical visual question answering (MedVQA) faces unique challenges due to the high precision required for medical images and the specialized nature of the questions. These challenges include insufficient feature extraction capability, a lack of textual priors, and incomplete information fusion and interaction. This paper proposes an enhanced bootstrapping language-image pre-training (BLIP) model for MedVQA based on multimodal feature augmentation and tri-path collaborative attention (FCA-BLIP) to address these issues. First, FCA-BLIP employs a unified bootstrapped multimodal architecture that integrates ResNet and bidirectional encoder representations from Transformers (BERT) models to strengthen feature extraction, enabling a more precise analysis of details in both images and questions. Next, the pre-trained BLIP model extracts features from image-text sample pairs, allowing the model to capture the semantic relationships and shared information between images and text. Finally, a novel attention structure fuses the multimodal feature vectors, thereby improving the alignment accuracy between modalities. Experimental results demonstrate that the proposed method performs well on clinical visual question answering tasks. For the MedVQA task of staging diabetic macular edema in fundus imaging, the proposed method outperforms existing mainstream models on several performance metrics.
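Since only the abstract is available here, the following is a minimal, hypothetical PyTorch sketch of what a tri-path collaborative attention fusion block could look like. The class name TriPathCollaborativeAttention, the feature dimensions, and the exact three-path layout (question-to-image attention, image-to-question attention, and joint self-attention) are illustrative assumptions, not the paper's actual FCA-BLIP implementation.

```python
# Hypothetical sketch of a tri-path collaborative attention fusion block.
# The real FCA-BLIP design is not specified in the abstract; names, dims,
# and the three-path layout below are illustrative assumptions.
import torch
import torch.nn as nn


class TriPathCollaborativeAttention(nn.Module):
    """Fuses image and question features along three attention paths:
    question-guided, image-guided, and joint self-attention."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # Path 1: question tokens attend to image regions
        self.img_to_txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Path 2: image regions attend to question tokens
        self.txt_to_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Path 3: self-attention over the concatenated joint sequence
        self.joint = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(3 * dim, dim)

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # img: (B, N_regions, dim), e.g. ResNet grid features projected to dim
        # txt: (B, N_tokens, dim),  e.g. BERT token embeddings
        p1, _ = self.img_to_txt(txt, img, img)   # (B, N_tokens, dim)
        p2, _ = self.txt_to_img(img, txt, txt)   # (B, N_regions, dim)
        joint_seq = torch.cat([img, txt], dim=1)
        p3, _ = self.joint(joint_seq, joint_seq, joint_seq)
        # Mean-pool each path to one vector, then fuse the three paths
        pooled = torch.cat([p1.mean(1), p2.mean(1), p3.mean(1)], dim=-1)
        return self.fuse(pooled)                 # (B, dim)


if __name__ == "__main__":
    block = TriPathCollaborativeAttention()
    img_feats = torch.randn(2, 49, 768)  # dummy 7x7 image grid features
    txt_feats = torch.randn(2, 20, 768)  # dummy question token features
    print(block(img_feats, txt_feats).shape)  # torch.Size([2, 768])
```

In a full pipeline, the image features would come from the BLIP/ResNet image encoder and the question features from BERT, with the fused vector feeding an answer classifier; those upstream components are omitted here for brevity.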
|
|
|