Sub-sentence Alignment and Its Application for Statistical Patent Machine Translation

He Yanqing; Zhang Juan

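Article abstract

He Yanqing, Zhang Juan. Sub-sentence Alignment and Its Application for Statistical Patent Machine Translation [J]. China Science & Technology Resources Review, 2014, (4): 86-93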

Submitted: 2014-05-12

DOI：

Keywords: sub-sentence alignment; word alignment; simple sub-sentence; patent documents; statistical machine translation

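Funding: National Natural Science Foundation of China project "Context Analysis for Statistical Machine Translation of Patent Documents" (61303152); National Key Technology R&D Program of the 12th Five-Year Plan project "Research on Key Technologies of Electric Vehicle Data Mining Based on Multi-source Information" (2013BAG06B01); International Science and Technology Cooperation Program of China project "Cooperative Research on Practical Japanese-Chinese Bidirectional Machine Translation for Scientific and Technical Literature" (2014DFA11350).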

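Authors	Affiliations
He Yanqing, Zhang Juan	1. Institute of Scientific and Technical Information of China, Beijing 100038; 2. Beijing Union University, Beijing 100101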

Abstract:

Sentences in patent documents tend to be long. To address this, the training corpus of a statistical machine translation system is segmented into bilingual sub-sentence sequences, and a strategy combining statistics and rules is used to align the sub-sentences. A bilingual corpus of simple sub-sentences is built from these alignments and used to retrain the statistical machine translation system. This improves, to some extent, the phrase alignment and word alignment of the original bilingual training corpus and makes fuller use of the translation information contained in the parallel corpus. The method was applied to statistical machine translation of patents; in comparative experiments on the NTCIR-9 test set, it achieved fairly satisfactory translation results.
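
The abstract does not give the segmentation rules or the alignment model, so the following is only a minimal sketch of the general idea: split each side of a sentence pair at punctuation (the "rule" part), then pair sub-sentences by a lexical overlap score (a stand-in for the "statistical" part). The delimiter sets, the toy lexicon, the scoring function, the greedy one-to-one matching, and the 0.5 threshold are all assumptions made for illustration, not the authors' method.

```python
import re

# Hypothetical delimiter sets; the paper's actual segmentation rules are not stated.
ZH_DELIMS = "，；：。"
EN_DELIMS = ",;:."


def split_sub_sentences(sentence, delims):
    """Split a sentence into non-empty sub-sentences at the given punctuation.

    This is a crude rule: it would also split decimals such as "3.5".
    """
    parts = re.split("[" + re.escape(delims) + "]", sentence)
    return [p.strip() for p in parts if p.strip()]


def lexical_score(zh_sub, en_sub, lexicon):
    """Fraction of lexicon terms found in zh_sub that have a translation in en_sub."""
    en_words = set(en_sub.lower().split())
    hits = [zh for zh in lexicon if zh in zh_sub]
    if not hits:
        return 0.0
    matched = sum(1 for zh in hits if lexicon[zh] & en_words)
    return matched / len(hits)


def align_sub_sentences(zh_subs, en_subs, lexicon, threshold=0.5):
    """Greedy one-to-one alignment: each Chinese sub-sentence takes the
    best-scoring unused English sub-sentence whose score reaches the threshold."""
    used = set()
    pairs = []
    for zh in zh_subs:
        best_j, best_score = None, 0.0
        for j, en in enumerate(en_subs):
            if j in used:
                continue
            score = lexical_score(zh, en, lexicon)
            if score >= threshold and score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            used.add(best_j)
            pairs.append((zh, en_subs[best_j]))
    return pairs


# Toy bilingual lexicon (illustrative only); a real system would more likely
# derive such scores from word-alignment probabilities.
LEXICON = {
    "装置": {"device"},
    "包括": {"comprises"},
    "壳体": {"housing"},
    "开口": {"opening"},
}

zh_subs = split_sub_sentences("该装置包括壳体，所述壳体设有开口。", ZH_DELIMS)
en_subs = split_sub_sentences(
    "The device comprises a housing, and the housing is provided with an opening.",
    EN_DELIMS,
)
aligned = align_sub_sentences(zh_subs, en_subs, LEXICON)
```

Each pair in `aligned` could then be written out as one line of a new, shorter-sentence parallel corpus for retraining. A one-to-one greedy match is the simplest choice here; sub-sentences that merge or reorder across languages would need a many-to-many alignment, which this sketch does not attempt.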
