Construction and Characteristics of a Chinese-English Bilingual Corpus in the Context of Science and Technology Big Data

Su Xiaojuan (苏晓娟)1, Zhang Yingjie (张英杰)2, Bai Chen (白晨)2, Wu Si (吴思)2

Article abstract

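Su Xiaojuan1, Zhang Yingjie2, Bai Chen2, Wu Si2. Construction and characteristics of a Chinese-English bilingual corpus in the context of science and technology big data [J]. 中国科技资源导刊 (China Science & Technology Resources Review), 2019, (6): 87-92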

Research of Bilingual Corpus Construction and Its Characteristics in Big Data

Submitted: 2019-06-21

DOI:

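Chinese keywords: 科技大数据 (science and technology big data); 双语语料库 (bilingual corpus); 机器学习 (machine learning); 语料库构建 (corpus construction); 机器翻译引擎 (machine translation engine)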

English keywords: big data, bilingual corpus, machine learning, corpus construction, machine translation engine

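Funding: Key project of the Institute of Scientific and Technical Information of China (ISTIC), "Construction of a Content Acquisition and Fusion Platform for Multi-source Heterogeneous Databases for ISTIC Resource Big Data Construction (Phase II)" (ZD2019-04).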

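Authors	Affiliations
Su Xiaojuan1, Zhang Yingjie2, Bai Chen2, Wu Si2	(1. Beijing Institute of Petrochemical Technology, Beijing 102617; 2. Institute of Scientific and Technical Information of China, Beijing 100038)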

Chinese abstract (translated):

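First, by describing the full process of building a bilingual corpus, the paper proposes a way to build a bilingual corpus quickly from a specialized domain lexicon. The approach is used to quickly discover the multiple attributes of the base corpus of science and technology big data and to complete corpus annotation, which provides foundational support for applications of science and technology big data in knowledge retrieval and knowledge graphs. Next, by analyzing the requirements that science and technology big data in the new era places on corpus construction, a "distributed energy" topic dataset is selected from journals and patents and combined with "neural machine translation + statistical machine translation" techniques to build an initial test corpus of 20,834 bilingual word pairs. A set of 6,428 patent records drawn from the patent database of the Institute of Scientific and Technical Information of China and the Derwent patent database is used to test the initial corpus. Finally, manual evaluation is carried out in three respects: fidelity, fluency, and intelligibility.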

English abstract:

First, based on a description of the whole process of constructing a bilingual corpus, this paper puts forward a fast way to construct a bilingual corpus from a specialized domain lexicon. The method can be used to quickly discover the multiple attributes of the basic corpus of science and technology big data and to complete the annotation of the corpus, which plays a fundamental supporting role in knowledge retrieval and knowledge graph applications of science and technology big data. Then, by analyzing the requirements that science and technology big data in the new era places on corpus construction, we select a "distributed energy" subject data set from journals and patents and combine it with "neural network machine translation + statistical machine translation" technology to construct a preliminary test corpus of 20,834 bilingual word pairs. A total of 6,428 patent records drawn from the ISTIC Patent Database and the Derwent Patent Database are used to test the preliminary corpus. Finally, manual evaluation is carried out in three aspects: adequacy, fluency and intelligibility.
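The abstract does not say how the manual ratings were recorded or combined. As a minimal sketch only, assuming each evaluated translation receives one numeric score per dimension (the 1 to 5 scale, the `summarize` helper, and the sample ratings below are all hypothetical and not taken from the paper), per-dimension averages could be computed like this:

```python
from statistics import mean

# The three evaluation dimensions named in the abstract.
DIMENSIONS = ("adequacy", "fluency", "intelligibility")


def summarize(ratings):
    """Return the mean score per dimension over all rated translations.

    `ratings` is a list of dicts, one per evaluated translation, each
    mapping a dimension name to a numeric score (assumed scale: 1 to 5).
    """
    if not ratings:
        raise ValueError("no ratings supplied")
    return {dim: mean(r[dim] for r in ratings) for dim in DIMENSIONS}


# Made-up illustrative ratings, not results from the paper.
ratings = [
    {"adequacy": 4, "fluency": 5, "intelligibility": 4},
    {"adequacy": 3, "fluency": 4, "intelligibility": 5},
]
summary = summarize(ratings)
print(summary)
```

Reporting the three dimensions separately, rather than as one combined score, keeps a weakness in one dimension from being hidden by strength in another.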
