A BLSTM-Based Method for Term Extraction from Scientific and Technical Literature

Zhao Dongyue (赵东玥); Du Yongping (杜永萍); Shi Chongde (石崇德)

Article Abstract

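Zhao Dongyue, Du Yongping, Shi Chongde. Scientific Literature Terms Extraction Based on Bidirectional Long Short-Term Memory Model [J]. 情报工程 (Technology Intelligence Engineering), 2018, 4(1): 67-74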

Scientific Literature Terms Extraction Based on Bidirectional Long Short-Term Memory Model

DOI: 10.3772/j.issn.2095-915X.2018.01.008

Keywords: term extraction; scientific literature; long short-term memory (LSTM)

Funding: Research on Entity Recognition and Relation Extraction for Science and Technology Monitoring (71403257)

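Author	Affiliation
Zhao Dongyue (赵东玥)	1. Faculty of Information Technology, Beijing University of Technology; 2. Institute of Scientific and Technical Information of China
Du Yongping (杜永萍)	1. Faculty of Information Technology, Beijing University of Technology; 2. Institute of Scientific and Technical Information of China
Shi Chongde (石崇德)	1. Faculty of Information Technology, Beijing University of Technology; 2. Institute of Scientific and Technical Information of China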

Abstract views: 2789

Full-text downloads: 1663

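Abstract (translated from the Chinese):

Term extraction is an important technique for studying scientific and technical literature. To further improve the precision and recall of term extraction from such literature, this paper adopts a neural network model based on BLSTM (Bidirectional Long Short-Term Memory). A pre-trained word-vector dictionary maps the Chinese word segmentation results to vectors, which serve as the input to the BLSTM model; a sequence labeling approach then maps the output classification results to term boundaries in order to extract the terms. On datasets from the fields of automation technology and computer technology, experiments compare the extraction results of the BLSTM model using words as features with those of a conditional random field (CRF) model. The results show that the BLSTM-based model achieves better performance: on the Chinese dataset, the highest precision is 0.7821, the highest recall is 0.8020, and the highest F1 score is 0.7860; on the English dataset, the corresponding values are 0.8525, 0.8677, and 0.8543.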
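The step that maps per-token classification results to term boundaries can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the tag scheme, so the B/I/O labels, the function name `decode_terms`, and the `sep` parameter are all assumptions made for the example.

```python
from typing import List, Tuple


def decode_terms(tokens: List[str], tags: List[str], sep: str = " ") -> List[Tuple[int, int, str]]:
    """Convert per-token B/I/O tags into (start, end, term) spans.

    Assumed tag scheme (not stated in the abstract):
      "B" = first token of a term, "I" = inside a term, "O" = outside any term.
    `end` is exclusive. Use sep="" for segmented Chinese text.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    spans: List[Tuple[int, int, str]] = []
    start = None  # index where the currently open term began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new term begins; close any term that is still open.
            if start is not None:
                spans.append((start, i, sep.join(tokens[start:i])))
            start = i
        elif tag == "I":
            # Tolerate an "I" with no preceding "B" by opening a term here.
            if start is None:
                start = i
        else:
            # Any other tag is treated as "O" and closes an open term.
            if start is not None:
                spans.append((start, i, sep.join(tokens[start:i])))
                start = None
    if start is not None:
        spans.append((start, len(tokens), sep.join(tokens[start:])))
    return spans


tokens = ["the", "conditional", "random", "field", "model", "uses", "word", "vectors"]
tags = ["O", "B", "I", "I", "O", "O", "B", "I"]
terms = decode_terms(tokens, tags)
```

In this toy input the tags would come from the tagger's per-token predictions; here they are written by hand so that only the boundary decoding is shown.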

English abstract:

Term extraction plays an important role in the study of scientific literature. To improve the precision and recall of term extraction, this research designed a neural network model based on BLSTM (Bidirectional Long Short-Term Memory). The Chinese word segmentation results were mapped to vectors via a pre-trained word vector dictionary, and the output classification results were mapped to term boundaries via sequence tagging. Experiments were conducted to compare the BLSTM model using word features with the conditional random field method on datasets from the fields of automation technology and computer technology. The results show that the BLSTM model achieved better performance, with the highest precision of 0.7821, the highest recall of 0.8020, and the highest F1 value of 0.7860 on the Chinese dataset. On the English dataset, the highest precision, recall, and F1 value were 0.8525, 0.8677, and 0.8543, respectively.
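For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute precision, recall, and F1 for extracted terms. It assumes exact-match scoring over sets of term spans; the abstract does not state the matching criterion, and the function name `precision_recall_f1` and the toy spans are hypothetical.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(predicted: Iterable[Hashable], gold: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Score extracted terms against reference terms by exact match (an assumption)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    # Precision: share of extracted terms that are correct.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    # Recall: share of reference terms that were extracted.
    recall = true_positives / len(gold_set) if gold_set else 0.0
    # F1: harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy (start, end) spans, invented for illustration only.
predicted_spans = {(1, 4), (6, 8), (10, 11)}
gold_spans = {(1, 4), (6, 8), (12, 14), (15, 16)}
p, r, f1 = precision_recall_f1(predicted_spans, gold_spans)
```

With these toy spans, two of three predictions are correct and two of four reference terms are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.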
