Chemical and Biological Entity Recognition System from Patent Documents

Lai Hongchang (赖鸿昌); Zhu Lijun (朱礼军); Xu Shuo (徐硕)

Article Abstract

Lai Hongchang, Zhu Lijun, Xu Shuo. Chemical and Biological Entity Recognition System from Patent Documents [J]. 情报工程, 2015, 1(4): 095-103

Chemical and Biological Entity Recognition System from Patent Documents

DOI: 10.3772/j.issn.2095-915X.2015.04.011

Keywords: Conditional Random Field (CRF), chemical and biological entity recognition, patent mining, cross validation

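Funding: National Natural Science Foundation of China project "Research on Technology Opportunity Discovery Based on Paper and Patent Resources" (Grant No. 71403255); Institute of Scientific and Technical Information of China key work project "Construction and Application Demonstration of an Intelligent Analysis Service Platform for Scientific and Technical Literature Integrating Multi-Source Information in a Big Data Environment" (No. ZD2014-7-1)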

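Author	Affiliation
Lai Hongchang	Information Technology Support Center, Institute of Scientific and Technical Information of China
Zhu Lijun	Information Technology Support Center, Institute of Scientific and Technical Information of China
Xu Shuo	Information Technology Support Center, Institute of Scientific and Technical Information of China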

Abstract views: 7722

Full-text downloads: 9729

Abstract:

Exploring the chemical and biological knowledge contained in patent documents has become crucial. To recognize chemical and biological entities, a patent-oriented recognition system was developed on top of open-source machine learning and natural language processing (NLP) toolkits. The system runs as a pipeline with three main stages, each described in detail in the paper: pre-processing (sentence splitting and tokenization), recognition (a conditional random field (CRF) based approach), and post-processing (a rule-based approach). Extensive experiments on an annotated chemical patent corpus with 10-fold cross validation yield a balanced F-measure of 69.20%. The results indicate that performance on patent documents is slightly lower than on paper and news corpora.
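The pre-processing and evaluation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the splitting rules, the label scheme, or the matching criterion, so the regex-based splitter and tokenizer, the BIO label scheme, the entity types `CHEM` and `BIO`, exact-match span scoring, and all function names here are assumptions made for illustration. The CRF model and the rule-based post-processing are left out because the abstract gives no detail on them.

```python
import re
from typing import List, Set, Tuple

# (start token index, end token index exclusive, entity type); hypothetical layout
Span = Tuple[int, int, str]


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after . ! ? when followed by whitespace and a capital."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]


def tokenize(sentence: str) -> List[str]:
    """Split into alphanumeric runs and single punctuation characters."""
    return re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", sentence)


def bio_to_spans(labels: List[str]) -> Set[Span]:
    """Decode a BIO label sequence (as a sequence labeler might emit) into spans."""
    spans: Set[Span] = set()
    start, etype = None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes a trailing entity
        continues = lab.startswith("I-") and etype == lab[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if lab.startswith("B-"):
            start, etype = i, lab[2:]
    return spans


def f_measure(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and balanced F (harmonic mean of the two)."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


def k_fold_indices(n: int, k: int = 10) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs; item j is tested in fold j % k."""
    splits = []
    for i in range(k):
        test = list(range(i, n, k))
        train = [j for j in range(n) if j % k != i]
        splits.append((train, test))
    return splits


if __name__ == "__main__":
    text = "Aspirin inhibits COX-1. It is also known as acetylsalicylic acid."
    tokens = tokenize(split_sentences(text)[0])
    gold = bio_to_spans(["B-CHEM", "O", "B-BIO", "I-BIO", "I-BIO", "O"])
    pred = bio_to_spans(["B-CHEM", "O", "B-BIO", "I-BIO", "O", "O"])
    print(tokens)
    print(f_measure(gold, pred))
```

In this toy example the predicted `BIO` span stops one token early, so under exact matching it counts as both a false positive and a false negative. In a 10-fold setup, a model would be trained on each `train` index list and scored on the matching `test` list, with the fold scores then aggregated.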
