Research on Incremental Collection for Internet Website Archiving

Yang Yunpeng (杨云鹏)

Article Abstract

Yang Yunpeng. Research on Incremental Collection of Internet Archive [J]. Digital Library Forum (数字图书馆论坛), 2020, (12): 17-21

Title (English, as published): Research on Incremental Collection of Internet Archive

Submission date: 2020-11-14

DOI: 10.3772/j.issn.1673-2286.2020.12.003

Keywords: Internet Archive; Incremental Acquisition; Acquisition Strategy; Web Scraping

Funding:

Author	Affiliation
Yang Yunpeng	National Library of China

Abstract views: 3308

Full-text downloads: 2242

Abstract:

As the internet has become ubiquitous, the volume of archived website content grows rapidly every year, and servers' storage space, processing capacity, and network bandwidth can no longer keep pace with the growth in collection volume. Filtering out documents that are repeated within a collection cycle, that is, incremental collection, is therefore key to solving these problems. This paper first discusses strategies and tools for incremental collection, then selects a suitable tool according to the chosen strategy and runs an actual collection to verify its effect. Incremental collection of internet website archives is implemented by adding supplementary tools to the collection system, and the collected results are analyzed and discussed. The approach achieves the goals of reducing server load, network bandwidth usage, and archive storage space, while improving the display quality of collected resources.
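The idea of filtering out documents repeated within a collection cycle can be illustrated with a minimal sketch. This is not the tool or strategy used in the paper, which the abstract does not name; it assumes a simple content-digest comparison, and the names `DigestIndex` and `incremental_collect` are hypothetical, introduced only for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class DigestIndex:
    """Hypothetical index mapping each URL to the digest stored for it last."""
    seen: dict = field(default_factory=dict)

    def is_duplicate(self, url: str, payload: bytes) -> bool:
        # Hash the fetched payload and compare it with the digest recorded
        # for the same URL in an earlier collection.
        digest = hashlib.sha1(payload).hexdigest()
        if self.seen.get(url) == digest:
            return True
        # New or changed document: remember its digest for later cycles.
        self.seen[url] = digest
        return False


def incremental_collect(fetched, index):
    """Return only the (url, payload) pairs that are new or have changed."""
    return [(url, payload) for url, payload in fetched
            if not index.is_duplicate(url, payload)]


if __name__ == "__main__":
    index = DigestIndex()
    first = incremental_collect([("/a", b"v1"), ("/b", b"v1")], index)
    second = incremental_collect([("/a", b"v1"), ("/b", b"v2")], index)
    print(len(first), len(second))
```

In this sketch the index is keyed by URL, so identical content served at two different URLs is stored twice; keying by digest alone would be an alternative design. The comparison also happens after the document is fetched, so it illustrates only the storage saving.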
