SOM-NCSCM+: A Study of an Extractive Neural Network Method for Chinese Headline Generation

Zi Kangli* **; Wang Shi*; Cao Cungen*

Article abstract

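Zi Kangli* **, Wang Shi*, Cao Cungen*. SOM-NCSCM+: Research on Chinese headline generation method based on extractive neural network [J]. Chinese High Technology Letters (高技术通讯, Chinese edition), 2023, 33(8): 836-848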


SOM-NCSCM+: Research on Chinese headline generation method based on extractive neural network

DOI: 10.3772/j.issn.1002-0470.2023.08.006

Keywords: Chinese headline generation; neural network model; topic model; clustering model; sequence labeling

Funding:

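Author	Affiliation
Zi Kangli* **	(*Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) (**University of Chinese Academy of Sciences, Beijing 100049)
Wang Shi*
Cao Cungen*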


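Abstract (translated from the Chinese):

Headline generation, a branch of text summarization, helps people obtain information efficiently. To address the lack of large-scale, high-quality annotated Chinese data for the Chinese headline generation task, this paper exploits the fact that a headline can often be composed of words from the source text. Combining unsupervised learning models with a supervised sequence labeling model, it proposes an extractive deep neural network method and model for Chinese headline generation that integrates a clustering model and a topic model. On Chinese news datasets that lack manually annotated category labels, the model uses the clustering and topic models to automatically mine latent feature information within the data, obtaining distinct data clusters and the topic words within each cluster to assist Chinese news headline generation. This makes the model well suited to Chinese news datasets that have latent topic-category features and headlines of uneven quality. Experiments on a publicly available Chinese news headline dataset from the Internet also show that the proposed model outperforms the baseline models on evaluation metrics including micro F1, BLEU, ROUGE, and compression ratio.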

Abstract (authors' English version):

As a branch of the text summarization task, headline generation can help people obtain information efficiently. To address the lack of large-scale, high-quality annotated Chinese data for the Chinese headline generation task, and taking advantage of the fact that headlines can often be formed from words in the content, this paper proposes a Chinese headline generation method and model based on an extractive deep neural network. The model is enhanced with a clustering model and a topic model, combining unsupervised learning models with a supervised sequence labeling model. On Chinese news data that lack manually annotated classifications, the model applies the clustering model and the topic model to automatically mine latent feature information within the data, obtaining distinct data clusters and their topic words to assist Chinese news headline generation. This makes the model more adaptable to Chinese news data covering different topics and having uneven annotation quality. Experimental results on a Chinese news headline generation dataset publicly available on the Internet also show that the model achieves better performance than the baseline models on evaluation metrics including micro F1, BLEU, ROUGE, and compression ratio.
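
The extractive, sequence-labeling view of headline generation mentioned in the abstract can be illustrated with a small sketch. This is not the SOM-NCSCM+ model: it only shows how per-token keep/drop labels yield a headline and how two of the named metrics might be computed. The function names, the toy sentence, and the labels are all hypothetical, and the compression ratio is assumed here to be the fraction of source tokens kept, which may differ from the paper's definition.

```python
from typing import List, Sequence


def extract_headline(tokens: Sequence[str], labels: Sequence[int]) -> List[str]:
    """Keep the tokens labelled 1 (keep), preserving source order."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [tok for tok, lab in zip(tokens, labels) if lab == 1]


def micro_f1(gold: Sequence[Sequence[int]], pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1 for the keep label: counts are pooled over all documents."""
    tp = fp = fn = 0
    for gold_doc, pred_doc in zip(gold, pred):
        for g, p in zip(gold_doc, pred_doc):
            tp += int(g == 1 and p == 1)
            fp += int(g == 0 and p == 1)
            fn += int(g == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def compression_ratio(tokens: Sequence[str], labels: Sequence[int]) -> float:
    """Assumed definition: kept tokens divided by source tokens."""
    return sum(labels) / len(tokens) if tokens else 0.0


# Made-up, pre-segmented news sentence with made-up gold and predicted labels.
tokens = ["央行", "今日", "宣布", "下调", "存款", "准备金率", "0.5", "个", "百分点"]
gold = [1, 0, 1, 1, 1, 1, 0, 0, 0]
pred = [1, 0, 0, 1, 1, 1, 0, 0, 0]

headline = "".join(extract_headline(tokens, pred))
score = micro_f1([gold], [pred])
ratio = compression_ratio(tokens, pred)
```

In a real system the `pred` labels would come from a trained sequence labeling model, and features such as cluster membership or topic words would be additional inputs to that model rather than part of this post-processing step.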
