Unsupervised Construction of Thesaurus in the Science and Technology Policy Based on Dependency Syntax Analysis

Shao Wei; Hua Bolin

Article Abstract

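Shao Wei, Hua Bolin. Unsupervised Construction of Thesaurus in the Science and Technology Policy Based on Dependency Syntax Analysis [J]. 情报工程, 2020, 6(6): 033-044.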

Unsupervised Construction of Thesaurus in the Science and Technology Policy Based on Dependency Syntax Analysis

DOI：10.3772/j.issn.2095-915X.2020.06.004

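Chinese keywords: 科技政策 (science and technology policy); 无监督构建 (unsupervised construction); 依存句法分析 (dependency syntax analysis); 主题词表 (thesaurus); 文本挖掘 (text mining)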

English keywords: Science and technology policy; unsupervised construction; dependency syntax analysis; thesaurus; text mining

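Author	Affiliation
Shao Wei	Department of Information Management, Peking University, Beijing 100871
Hua Bolin	Department of Information Management, Peking University, Beijing 100871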

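Abstract (translated from the Chinese):

To address the problem of vocabulary construction in the science and technology policy domain, this paper proposes a keyword extraction algorithm for science and technology policy texts based on dependency syntax analysis. Building on this, it proposes a text topic-word index for constructing text topic words, uses a synonym recognition algorithm together with encyclopedia knowledge to discover and confirm synonymy between words, and applies literal matching to identify hypernyms and hyponyms. These four parts are finally combined into a thesaurus for the science and technology policy domain. To suit the practical situation in which labeled data are lacking, and to give the work greater practical value, the paper uses unsupervised methods. The results show that the resulting thesaurus has pronounced domain characteristics, helps resolve problems such as the segmentation of domain-specific out-of-vocabulary words and the lack of relations between topic words, and effectively supports word segmentation and text analysis.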

English abstract:

To address the problem of vocabulary construction in the field of science and technology policy, this paper proposes a keyword extraction algorithm for science and technology policy texts based on dependency syntax analysis. On this basis, a text topic-word index is proposed to construct text topic words; a synonym recognition algorithm and encyclopedia knowledge are used to discover and confirm synonymy between words; and literal matching is used to identify hypernyms and hyponyms. These four parts are combined to form a thesaurus for science and technology policy. Because labeled data are often lacking in practice, and to improve the practical value of the work, all of the proposed methods are unsupervised. The results show that the vocabulary generated by this construction method has significant domain characteristics and can effectively support word segmentation and text analysis.
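
The abstract names literal (string) matching as the way hypernyms and hyponyms are told apart, but it does not state the matching rule. The following is a minimal sketch under one plausible assumption: a shorter term is treated as a hypernym of any longer term that ends with it, which is a common pattern in head-final Chinese compounds. The function name `find_hypernym_pairs` and the sample terms are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def find_hypernym_pairs(terms):
    """Return {hypernym: [hyponyms]} using suffix-based literal matching.

    Assumption (not from the paper): term A is treated as a hypernym of
    term B when B is strictly longer than A and B ends with A.
    """
    # Sort by length, then lexicographically, so the output order is stable.
    unique_terms = sorted(set(terms), key=lambda t: (len(t), t))
    relations = defaultdict(list)
    for shorter in unique_terms:
        for longer in unique_terms:
            if len(longer) > len(shorter) and longer.endswith(shorter):
                relations[shorter].append(longer)
    return dict(relations)


# Hypothetical sample terms, for illustration only.
sample_terms = ["政策", "科技政策", "创新政策", "科技创新政策", "人才"]
hierarchy = find_hypernym_pairs(sample_terms)
for hypernym, hyponyms in hierarchy.items():
    print(hypernym, "->", hyponyms)
```

A pure suffix rule is cheap and needs no labeled data, which fits the unsupervised setting the abstract describes, but it would also link unrelated terms that merely share an ending, so in practice some filtering of candidate pairs would likely be needed.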
