Research on the Construction of English-Chinese Bilingual Rich Media Knowledge Graph: A Case Study of CNS English Journal

Wei Xiangfeng; Miao Jianming; Zhang Quan; Yuan Yi

Article Abstract

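Wei Xiangfeng, Miao Jianming, Zhang Quan, Yuan Yi. Research on the Construction of English-Chinese Bilingual Rich Media Knowledge Graph: A Case Study of CNS English Journal[J]. Technology Intelligence Engineering (情报工程), 2023, 9(5): 084-096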

Research on the Construction of English-Chinese Bilingual Rich Media Knowledge Graph: A Case Study of CNS English Journal

DOI: 10.3772/j.issn.2095-915X.2023.05.007

Keywords: rich media; knowledge graph; entity extraction; entity alignment; move recognition

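Funding: 2022 Open Fund of the Key Laboratory of Rich Media Digital Publishing Content Organization and Knowledge Service, "Research on Cross-Language Rich Media Knowledge Engineering Based on English Scientific and Technical Publications" (ZD2022-10/01).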

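Author	Affiliation
Wei Xiangfeng	1. Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190; 2. Key Laboratory of Rich Media Digital Publishing Content Organization and Knowledge Service, Beijing 100038
Miao Jianming	3. China Ordnance Industry Information Center, Beijing 100089
Zhang Quan	1. Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190
Yuan Yi	1. Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190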

Abstract:

[Objective/Significance] This paper studies methods and processes for automatically constructing an English-Chinese bilingual rich media knowledge graph. The work offers a reference for the automatic construction of cross-language, multimodal knowledge graphs and supports timely access to the latest English-language research results and the monitoring of scientific and technological intelligence. [Methods/Processes] A combined top-down and bottom-up approach is adopted. The main entities, attributes, and relations to be extracted are first designed at the top level; fine-grained entities and attributes are then extracted bottom-up by analyzing unstructured text. Ambiguous entities and cross-lingual entities are resolved through entity alignment, cross-media entities are connected through entity linking, and a graph database is used to store the knowledge graph and support its applications. [Limitations] Future work needs to further improve the accuracy of fine-grained entity extraction and to add feature extraction and automatic content recognition for audio and video media. [Results/Conclusions] Taking the websites of CNS (Cell, Nature, Science) and other English-language scientific journals as examples, the paper constructs, stores, and visualizes an English-Chinese bilingual rich media knowledge graph through data crawling, entity extraction, attribute extraction, knowledge fusion, and cross-media linking.
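
The pipeline summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the entity types, the small English-Chinese lexicon, the relation names, and the in-memory `Graph` class (a stand-in for a real graph database) are all hypothetical and chosen only to show how a top-down schema, bottom-up term extraction, cross-lingual alignment, and cross-media linking might fit together.

```python
from dataclasses import dataclass, field

# Hypothetical top-level schema: entity types fixed in advance (top-down).
ENTITY_TYPES = {"Article", "Term", "Figure"}

# Hypothetical English-to-Chinese lexicon, used only for this illustration.
EN_ZH_LEXICON = {"knowledge graph": "知识图谱", "entity alignment": "实体对齐"}


@dataclass
class Graph:
    """In-memory stand-in for a graph database: nodes plus (head, relation, tail) triples."""
    nodes: dict = field(default_factory=dict)
    triples: set = field(default_factory=set)

    def add_node(self, node_id, node_type, **attrs):
        if node_type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {node_type}")
        self.nodes[node_id] = {"type": node_type, **attrs}

    def add_edge(self, head, relation, tail):
        if head not in self.nodes or tail not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.triples.add((head, relation, tail))


def extract_terms(text, lexicon):
    """Bottom-up step: find lexicon terms mentioned in unstructured text."""
    lowered = text.lower()
    return sorted(term for term in lexicon if term in lowered)


def build_graph(article_id, abstract, figure_captions):
    graph = Graph()
    graph.add_node(article_id, "Article")
    for term in extract_terms(abstract, EN_ZH_LEXICON):
        # Cross-lingual alignment: one node carries both language labels.
        graph.add_node(f"term:{term}", "Term", en=term, zh=EN_ZH_LEXICON[term])
        graph.add_edge(article_id, "mentions", f"term:{term}")
    for i, caption in enumerate(figure_captions, start=1):
        fig_id = f"{article_id}#fig{i}"
        graph.add_node(fig_id, "Figure", caption=caption)
        graph.add_edge(article_id, "has_figure", fig_id)
        # Cross-media linking: connect a figure to terms named in its caption.
        for term in extract_terms(caption, EN_ZH_LEXICON):
            if f"term:{term}" in graph.nodes:
                graph.add_edge(fig_id, "depicts", f"term:{term}")
    return graph


graph = build_graph(
    "art1",
    "We build a Knowledge Graph with entity alignment.",
    ["Overview of the knowledge graph"],
)
```

Storing both language labels on a single node is one simple way to represent an aligned bilingual entity. A production system would more plausibly rely on learned extraction and alignment models and a dedicated graph database, as the abstract indicates.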
