Keyphrase Extraction Method Based on Part-of-Speech Automata

Wang Lingxiao (王凌霄); Wang Yibo (王弋波); Zhu Lijun (朱礼军)

Article Abstract

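Wang Lingxiao, Wang Yibo, Zhu Lijun. Keyphrase Extraction Method Based on Part-of-Speech Automata [J]. China Science & Technology Resources Review (中国科技资源导刊), 2023, (5): 31-40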

Keyphrase Extraction Algorithm via Tagging Finite Automation

Submitted: 2023-02-21

DOI：

Keywords: named entity recognition; keyword extraction; keyphrase extraction; finite state automaton; part-of-speech tagging

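Funding: Innovation Research Fund of the Institute of Scientific and Technical Information of China, project "A Method for Identifying Applications of Artificial Intelligence Technology in New Drug Discovery Based on Text Entity Mining" (QN2022-06)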

Authors	Affiliation
Wang Lingxiao, Wang Yibo, Zhu Lijun	(Institute of Scientific and Technical Information of China, Beijing 100038)

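Chinese abstract (translated):

Keyphrase extraction is a natural language processing task that identifies combinations of keywords with special value in a target text, and it has important practical value for mining intelligence from scientific and technical literature. Because sufficient labeled data, knowledge bases, and pre-trained models are lacking, extracting keyphrases for disruptive content in frontier sub-disciplines still poses many challenges. This work introduces the concept of finite state automata into the keyphrase extraction task and abstracts the part-of-speech tag combination patterns of keyphrases into a series of finite state automaton grammars. The resulting unsupervised keyphrase extraction algorithm, based on part-of-speech automata, can extract variable-length keyphrases in specialized fields through highly customizable part-of-speech combination patterns, without relying on labeled data or high-performance computing equipment. The algorithm runs fast, has few environment dependencies, supports many matching patterns, and extracts well. Using the SemEval-2017 dataset and literature abstracts from the field of intelligent new drug discovery as test data, the proposed algorithm is compared with several widely used keyphrase extraction algorithms. The results show that on all keywords the algorithm reaches a precision of 30.8%, a recall of 34.1%, and an F1 score of 32.4%; on keyphrases it reaches a precision of 30.8%, a recall of 52.0%, and an F1 score of 38.7%. Its recall and F1 score improve significantly over open-source keyword extraction libraries.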
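The idea of treating a part-of-speech pattern as a finite state automaton can be illustrated with a minimal Python sketch. The abstract does not list the authors' actual grammars, so everything specific here is an assumption made for illustration: the single pattern (zero or more adjectives followed by one or more nouns), the use of Penn Treebank tags, the state names, and the helper names `coarse` and `extract_keyphrases`. The input is assumed to be already tokenized and tagged by some external tagger.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical automaton for the tag pattern ADJ* NOUN+ (not taken from the paper).
# Keys are (current state, coarse tag class); values are the next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("START", "ADJ"): "IN_ADJ",
    ("START", "NOUN"): "IN_NOUN",
    ("IN_ADJ", "ADJ"): "IN_ADJ",
    ("IN_ADJ", "NOUN"): "IN_NOUN",
    ("IN_NOUN", "NOUN"): "IN_NOUN",
}
ACCEPTING: Set[str] = {"IN_NOUN"}


def coarse(tag: str) -> str:
    """Map a Penn Treebank tag (assumed tagset) to a coarse class."""
    if tag.startswith("JJ"):
        return "ADJ"
    if tag.startswith("NN"):
        return "NOUN"
    return "OTHER"


def extract_keyphrases(tagged: List[Tuple[str, str]], min_len: int = 2) -> List[str]:
    """Scan left to right and keep the longest accepted match at each position."""
    phrases: List[str] = []
    i = 0
    while i < len(tagged):
        state = "START"
        last_accept = -1  # end index (exclusive) of the longest accepted match
        j = i
        while j < len(tagged):
            next_state = TRANSITIONS.get((state, coarse(tagged[j][1])))
            if next_state is None:  # no transition: the automaton stops here
                break
            state = next_state
            j += 1
            if state in ACCEPTING:
                last_accept = j
        if last_accept - i >= min_len:
            phrases.append(" ".join(word for word, _ in tagged[i:last_accept]))
            i = last_accept  # resume after the match, so matches never overlap
        else:
            i += 1
    return phrases


SAMPLE = [
    ("finite", "JJ"), ("state", "NN"), ("automata", "NNS"), ("extract", "VBP"),
    ("unsupervised", "JJ"), ("keyphrases", "NNS"), ("from", "IN"), ("text", "NN"),
]
print(extract_keyphrases(SAMPLE))
# → ['finite state automata', 'unsupervised keyphrases']
```

Because the pattern lives entirely in the transition table, supporting another tag combination in this sketch means adding states and transitions rather than changing the scanning loop, and no labeled training data is involved.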

English abstract:

Keyphrase extraction is a natural language processing task that identifies keyword combinations with special value in target texts, and it has important practical value for mining scientific and technical literature. Because sufficient labeled data, knowledge bases, and pre-trained models are lacking, extracting keyphrases for disruptive content in cutting-edge sub-disciplines still poses many practical challenges. This paper introduces the concept of finite state automata into the keyphrase extraction task and abstracts the part-of-speech tag combination patterns of keyphrases into a series of finite state automaton grammars. The resulting unsupervised keyphrase extraction algorithm, based on part-of-speech automata, can extract keyphrases of variable length in specialized fields through highly customizable part-of-speech combination patterns, without relying on labeled data or high-performance computing equipment. The algorithm runs fast, has few environment dependencies, supports many matching patterns, and extracts well. Using the SemEval-2017 dataset and literature abstracts from the field of intelligent new drug discovery as test data, the paper compares the proposed algorithm with several widely used keyphrase extraction algorithms. On all keywords, the algorithm reaches a precision of 30.8%, a recall of 34.1%, and an F1 score of 32.4%; on keyphrases, it reaches a precision of 30.8%, a recall of 52.0%, and an F1 score of 38.7%. Compared with open-source keyword extraction libraries, its recall and F1 score are significantly improved.
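The reported figures can be sanity-checked with the standard definition of F1 as the harmonic mean of precision and recall. The short Python sketch below assumes that the first value in each reported triple is precision, and that the percentages are combined directly; the function name `f1_score` is illustrative and not from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same unit, here percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, in percent.
print(round(f1_score(30.8, 34.1), 1))  # all keywords → 32.4
print(round(f1_score(30.8, 52.0), 1))  # keyphrases   → 38.7
```

Both results agree with the F1 values stated in the abstract, which supports reading the first figure in each triple as precision.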
