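A Constituent Parser for Science and Technology Corpus

Wang Yanan; Ma Chunpeng; Cao Hailong; Zhao Tiejun

Article Abstract

Wang Yanan, Ma Chunpeng, Cao Hailong, Zhao Tiejun. A Constituent Parser for Science and Technology Corpus [J]. 情报工程 (Technology Intelligence Engineering), 2017, 3(3): 010-020

DOI: 10.3772/j.issn.2095-915X.2017.03.003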

Keywords: constituent parsing, science and technology corpus, multi-task learning

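Funding: This work was supported by the National Natural Science Foundation of China (91520204, 61572154), the 863 Program (2015AA015405), and a collaborative research program with Microsoft Research Asia.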

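Author	Affiliation
Wang Yanan (王亚楠)	Machine Intelligence and Translation Laboratory, Harbin Institute of Technology
Ma Chunpeng (马春鹏)	Machine Intelligence and Translation Laboratory, Harbin Institute of Technology
Cao Hailong (曹海龙)	Machine Intelligence and Translation Laboratory, Harbin Institute of Technology
Zhao Tiejun (赵铁军)	Machine Intelligence and Translation Laboratory, Harbin Institute of Technology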

Abstract:

This paper presents a constituent parser for science and technology corpora, designed and developed at Harbin Institute of Technology. Unlike traditional constituent parsers, it does not require the input corpus to be pre-processed. Given raw text as input, the parser jointly performs word segmentation, POS tagging, and constituent parsing, which can be regarded as an instance of multi-task learning. In addition, to suit the characteristics of science and technology corpora, we optimized the feature templates used by the parser and constructed a treebank of word-internal structures for science and technology text. Experimental results show that the parser performs well on both general-domain and science and technology domain test sets.
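
To illustrate how one structure can carry all three outputs at once, the following is a minimal sketch and not the authors' implementation. It assumes a character-level tree in which each word's top node is flagged and labelled with its POS tag, so that segmentation, tagging, and the phrase structure can all be read off the same tree. The `Node` class, the `is_word` flag, the helper names, and the example sentence are hypothetical; the abstract does not specify the parser's actual representation.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Node:
    label: str                            # phrase label (e.g. NP) or POS tag (e.g. NN)
    children: List[Union["Node", str]]    # sub-nodes, or single characters at the leaves
    is_word: bool = False                 # assumption: True marks the top node of a word


def chars(node: Union[Node, str]) -> str:
    """Concatenate all characters covered by a node."""
    if isinstance(node, str):
        return node
    return "".join(chars(child) for child in node.children)


def words_and_tags(node: Union[Node, str]) -> List[Tuple[str, str]]:
    """Read word segmentation and POS tags off the character-level tree."""
    if isinstance(node, str):
        return []                         # a bare character outside any word node
    if node.is_word:
        return [(chars(node), node.label)]
    result: List[Tuple[str, str]] = []
    for child in node.children:
        result.extend(words_and_tags(child))
    return result


def phrase_tree(node: Union[Node, str]) -> str:
    """Bracketed word-level tree, with word-internal structure collapsed."""
    if isinstance(node, str):
        return node
    if node.is_word:
        return f"({node.label} {chars(node)})"
    inner = " ".join(phrase_tree(child) for child in node.children)
    return f"({node.label} {inner})"


# Hypothetical example: 分析器 处理 语料 ("the parser processes the corpus").
# The word 分析器 has internal structure ((分 析) 器).
tree = Node("IP", [
    Node("NP", [Node("NN", [Node("NN", ["分", "析"]), "器"], is_word=True)]),
    Node("VP", [
        Node("VV", ["处", "理"], is_word=True),
        Node("NP", [Node("NN", ["语", "料"], is_word=True)]),
    ]),
])

print(words_and_tags(tree))
print(phrase_tree(tree))
```

Under this assumed encoding, a parser that predicts the character-level tree directly from raw text has no separate segmentation or tagging step to run beforehand, which is one way to read the joint, multi-task setup described in the abstract.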
