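Xu Xiao, Zhang Weizhe, Zhang Hongli, Fang Binxing. Research on Agent collaboration and Web partition in WAN based distributed Web crawlers [J]. High Technology Letters (Chinese edition), 2010, 20(3): 239-245 |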
Research on Agent collaboration and Web partition in WAN based distributed Web crawlers |
|
DOI: |
Keywords: distributed Web crawler, Agent collaboration, Web partition, consultant service |
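Funding: Supported by the 863 Program (2009AA01Z437), the 973 Program (G2005CB321806), the National Natural Science Foundation of China (60703014), the Specialized Research Fund for the Doctoral Program of Higher Education (20070213044), and the Harbin Institute of Technology Outstanding Young Teacher Training Program (HITQNJS.2007.034) |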
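Author | Affiliation |
Xu Xiao (许笑) | School of Computer Science and Technology, Harbin Institute of Technology |
Zhang Weizhe (张伟哲) | School of Computer Science and Technology, Harbin Institute of Technology |
Zhang Hongli (张宏莉) | School of Computer Science and Technology, Harbin Institute of Technology |
Fang Binxing (方滨兴) | School of Computer Science and Technology, Harbin Institute of Technology |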
|
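Abstract (translated from Chinese): |
This paper studies in depth two core problems of distributed Web crawlers in a wide area network (WAN) environment: Agent collaboration and Web partition. It proposes a distributed Web crawler system model based on a consultant service, gives a detailed system design and an Agent collaboration algorithm framework, and shows by derivation that involving the consultant service in Agent collaboration lets the distributed crawler system bear a relatively small network load. It introduces the concept of Web partition for distributed Web crawlers and, focusing on the choice of Web partition unit and the Web partition strategy, discusses the classification and implementation of Web partition in detail. Experiments compare and evaluate several Web partition methods, verify the advantage of WAN systems over LAN systems, and find that the operator-interconnection factor, with respect to the crawler system's … |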
Abstract (English): |
This paper focuses on agent collaboration and Web partition, the two core issues in WAN-based distributed crawling. First, a new agent collaboration method based on a consultant service is proposed, together with the corresponding system model. The new method has lower communication overhead than crawling systems built around a central coordinator, and it exploits location proximity better than systems based on a Distributed Hash Table (DHT). Second, detailed definitions of Web partition are presented, and the selection of the Web partition unit and the Web partition strategy are discussed. Experiments in a real Internet environment show that WAN-based distributed Web crawling systems perform better than LAN-based ones. The experiments also show that interconnectivity between Internet service providers has a greater impact on system performance than geographical locality. |
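To make the idea of a Web partition unit and a partition strategy concrete, the following is a minimal sketch, not the paper's method. It assumes the host name is the partition unit and uses plain hash-based assignment, which is the kind of locality-blind scheme the abstract contrasts the consultant service approach with. The function names `partition_unit` and `assign_agent` and the agent labels are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def partition_unit(url: str) -> str:
    """Return the partition unit for a URL (assumption: the host name)."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    return host


def assign_agent(url: str, agents: list[str]) -> str:
    """Map a URL to one crawling agent by hashing its partition unit.

    Every URL on the same host goes to the same agent, so a host is
    crawled by exactly one agent. Geographic location and ISP
    interconnectivity are ignored in this baseline.
    """
    if not agents:
        raise ValueError("at least one agent is required")
    digest = hashlib.sha256(partition_unit(url).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(agents)
    return agents[index]


if __name__ == "__main__":
    agents = ["agent-a", "agent-b", "agent-c"]  # hypothetical agent labels
    for u in ("http://example.com/a", "http://example.com/b", "http://example.org/"):
        print(u, "->", assign_agent(u, agents))
```

Because the hash depends only on the host, the two `example.com` URLs above always land on the same agent. A locality-aware strategy of the kind the abstract describes would replace the hash with a decision that takes network position into account.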