Design and Implementation of Incremental Web Data Crawling at the National Library of China

Ji Shiyan (季士妍); Zhao Danyang (赵丹阳)

Article Abstract

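Ji Shiyan, Zhao Danyang. Design and Implementation of Incremental Web Data Crawling at the National Library of China [J]. Digital Library Forum (数字图书馆论坛), 2021, (1): 32-37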

Design and Implementation on the Web Data Deduplicated Crawlers of the National Library of China

Submitted: 2020-12-12

DOI: 10.3772/j.issn.1673-2286.2021.01.005

Keywords (translated from Chinese): National Library of China; incremental crawling; Heritrix

English keywords: National Library of China; Deduplicated Crawlers; Heritrix

Author	Affiliation
Ji Shiyan (季士妍)	National Library of China
Zhao Danyang (赵丹阳)	National Library of China

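Abstract (translated from Chinese):

This paper reviews in detail the current state of technical strategies for web archiving. Starting from the practical requirements of web resource collection at the National Library of China, it designs an incremental crawling strategy that meets those requirements, and briefly describes the library's Heritrix 3.4-based implementation of incremental crawling and its experimental results, with the aim of providing a useful reference for the field.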

English abstract:

This paper reviews in detail the current state of web archiving technology strategies and designs a deduplicated crawling strategy based on the actual web archiving practice of the National Library of China. It describes an implementation of deduplicated crawling based on Heritrix 3.4, so as to provide a useful reference for the industry.
