Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (8): 52-60     https://doi.org/10.11925/infotech.2096-3467.2021.1125
Research Paper
Crawler with Dynamic Thesaurus and Improved Shark-Search Algorithm: Case Study of Military Equipment*
Ding Shengchun, Liu Kai, Fang Zhen
School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China
Abstract

[Objective] This paper addresses two problems common to traditional focused crawlers: low crawl rates and insufficient topic relevance. [Methods] Building on the Shark-Search algorithm, we propose Two-step Dynamic Shark-Search (TDSS), a focused crawling algorithm that works in two steps and dynamically expands its topic thesaurus. TDSS splits the topic relevance calculation of the traditional algorithm into two separate steps: link topic relevance and page topic relevance. A topic thesaurus is first built and extended from related materials and tools. While the crawler runs, new keywords extracted from topic-relevant pages are added to the thesaurus, which improves topic judgment. [Results] In the same experimental environment, the crawling precision of the TDSS crawler was up to 14.2% higher than that of the comparison algorithms, and its collection efficiency was up to 35% higher. [Limitations] The dynamic topic-word expansion algorithm needs further refinement, because over-expanding the thesaurus lowers crawling precision. [Conclusions] A TDSS-based focused crawler can effectively improve the accuracy of the topic information obtained and crawl more topic-related webpages.

Keywords: Focused Crawler; Shark-Search; Topic Relevance; Thesaurus
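The two-step idea in the abstract (score a link first, score the fetched page second, and grow the thesaurus from relevant pages) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function names (`relevance`, `expand_thesaurus`, `two_step_crawl`), the thresholds, the toy in-memory `WEB`, the token-overlap score, and the frequency-based keyword pick are stand-ins, not the formulas or extraction method used in the paper. The sketch also omits Shark-Search features such as inherited scores and anchor context.

```python
import heapq
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for real word segmentation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(text, thesaurus):
    """Share of tokens found in the thesaurus (illustrative score only)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in thesaurus) / len(tokens)


def expand_thesaurus(page_text, thesaurus, top_k=1, min_count=2):
    """Add the most frequent unseen terms of a relevant page to the thesaurus."""
    counts = Counter(t for t in tokenize(page_text)
                     if t not in thesaurus and len(t) > 3)
    new_terms = [t for t, c in counts.most_common(top_k) if c >= min_count]
    thesaurus.update(new_terms)
    return new_terms


def two_step_crawl(web, seed, thesaurus, link_min=0.3, page_min=0.2):
    """Crawl a toy in-memory web; return the URLs judged topic-relevant."""
    frontier = [(-1.0, seed)]  # max-heap emulated with negated scores
    seen, relevant = {seed}, []
    while frontier:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        # Page check: page topic relevance decides whether the page is kept.
        if relevance(text, thesaurus) < page_min:
            continue
        relevant.append(url)
        expand_thesaurus(text, thesaurus)  # dynamic thesaurus growth
        # Link check: anchor-text relevance decides which links are queued.
        for anchor, target in links:
            score = relevance(anchor, thesaurus)
            if target not in seen and target in web and score >= link_min:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return relevant


# Hypothetical toy web: url -> (page text, [(anchor text, target url)]).
WEB = {
    "seed": ("missile radar missile news destroyer destroyer",
             [("new destroyer launched", "a"), ("football scores", "b")]),
    "a": ("destroyer sea trials destroyer radar", []),
    "b": ("football league results", []),
}
thesaurus = {"missile", "radar"}
found = two_step_crawl(WEB, "seed", thesaurus)
```

In this toy run, the link to page `a` passes the link check only because `destroyer` was added to the thesaurus while processing the seed page, which is the effect the dynamic expansion is meant to have. The same mechanism also hints at the stated limitation: admitting too many new terms would let off-topic links pass.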
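Received: 2021-10-06      Published: 2022-09-23
CLC numbers (ZTFLH):  E91; TP391
Funding: *This work is one of the research outcomes of a Jiangsu Provincial Social Science Fund project (20TQB004)
Corresponding author: Ding Shengchun, ORCID: 0000-0002-4269-021X     E-mail: todingding@163.com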
Cite this article:
Ding Shengchun, Liu Kai, Fang Zhen. Crawler with Dynamic Thesaurus and Improved Shark-Search Algorithm: Case Study of Military Equipment[J]. Data Analysis and Knowledge Discovery, 2022, 6(8): 52-60.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.1125      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I8/52
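Method  Reference  Precision  Recall  Limitation
VIPS page-depth analysis + Shark-Search algorithm  [9]  0.66  Topic relevance and link weights are slow to compute
URL-pattern-set-based focused crawler  [10]  0.69  0.52  Requires collecting nearly all URLs of a site, which is hard in practice
Best-First algorithm + HITS algorithm  [11]  0.61  0.75  HITS is time-consuming, which lowers crawling efficiency
Local community detection + topic relevance analysis  [12]  0.63  Local community detection has limited applicability
Semantic relevance + regression analysis of webpage importance  [13]  0.91  0.65  Algorithm design is overly redundant; hardware requirements are high
Table 1  Focused crawler methods based on link analysis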
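Method  Reference  Precision  Recall  Limitation
Ontology + improved tabu search strategy focused crawler  [16]  0.82  Ontology construction is relatively complex
VIPS analysis of webpage visual blocks + topic annealing  [17]  0.95  Rule engine construction is not published
Convolutional neural network focused crawler with LDA  [18]  0.85  0.66  Building the LDA model is labor-intensive
KNN-classifier-based focused crawler  [19]  0.75  Dictionary is not refined for the specific task
Table 2  Focused crawlers based on content analysis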
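Tables 1 and 2 compare methods by crawling precision and recall, but this page does not define either metric. The short sketch below assumes the usual definitions: precision is the share of crawled pages that are topic-relevant, and recall is the share of all relevant pages that were crawled. The function names and the sample URLs are hypothetical.

```python
def crawl_precision(crawled, relevant):
    """Share of crawled pages that are topic-relevant."""
    crawled = set(crawled)
    if not crawled:
        return 0.0
    return len(crawled & set(relevant)) / len(crawled)


def crawl_recall(crawled, relevant):
    """Share of all relevant pages that the crawler actually fetched."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(crawled) & relevant) / len(relevant)


# Hypothetical example: 4 pages crawled, 5 relevant pages exist, 3 overlap.
crawled_urls = ["p1", "p2", "p3", "p4"]
relevant_urls = ["p1", "p2", "p3", "p5", "p6"]
precision = crawl_precision(crawled_urls, relevant_urls)
recall = crawl_recall(crawled_urls, relevant_urls)
```

Recall requires knowing every relevant page in the target set, which may be one reason a comparison table reports only one score for some methods.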
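Fig.1  Overall architecture of the TDSS method
Fig.2  Comparison of crawling precision across the 5 methods
Fig.3  Comparison of the number of relevant webpages crawled by the 5 methods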
[1] Fan Hao, Zheng Xiaochuan. A Review of the Research on Open Source Intelligence at Home and Abroad[J]. Information Studies: Theory & Application, 2021, 44(10): 185-192, 201.
[2] Ding Botao. The Development of Open Source Information Abroad and the Strategic Studies in China[J]. Information and Documentation Services, 2011(6): 103-106.
[3] Fu Chang, Song Jiaqing. Design and Realization of Web Military Intelligence Mining System Based on Document Clustering[J]. Journal of China Academy of Electronics and Information Technology, 2015, 10(5): 541-545.
[4] Xie Ling. Terrorism Information Mining and Analysis in the Dark Web[J]. Global Review, 2021, 13(3): 135-151.
[5] Fei Chenjie, Liu Baisong. Focused Crawler Based on LDA Extended Topic Terms[J]. Computer Applications and Software, 2018, 35(4): 49-54.
[6] Wang Chong, Ji Xianhui. Improved PageRank Algorithm Based on User Interest and Topic[J]. Computer Science, 2016, 43(3): 275-278, 312.
doi: 10.11896/j.issn.1002-137X.2016.03.051
[7] Liu Hao, Hong Yu, Yao Liang, et al. HITS-Based Optimization Method for Bilingual Corpus Mining[J]. Journal of Chinese Information Processing, 2017, 31(2): 25-35.
[8] Seyfi A, Patel A, Celestino J. Empirical Evaluation of the Link and Content-Based Focused Treasure-Crawler[J]. Computer Standards & Interfaces, 2016, 44: 54-62.
doi: 10.1016/j.csi.2015.09.007
[9] Liu N W, Yao R B. The Crawling Strategy of Shark-Search Algorithm Based on Multi Granularity[C]// Proceedings of the 8th International Symposium on Computational Intelligence and Design. IEEE, 2015: 41-44.
[10] Hu Pingrui, Li Shijun. Focused Crawler Based on URL Patterns[J]. Application Research of Computers, 2018, 35(3): 694-699.
[11] Liu Shaotao, Li Hongsheng. Topic Crawler Algorithm with Link Structure[J]. Journal of Huaqiao University(Natural Science), 2017, 38(2): 195-200.
[12] Shen Guilan, Sun Jie, Yang Xiaoping. Focused Crawling Method Based on Detecting Local Communities in Complex Networks[J]. Journal of Henan Normal University(Natural Science Edition), 2014, 42(4): 134-138.
[13] Huang Wei, Zhang Zhancheng, Zhu Bin, et al. A Network Counter-Terrorism Information Crawler Based on the Regression Analysis[J]. Library and Information Service, 2018, 62(4): 121-129.
doi: 10.13266/j.issn.0252-3116.2018.04.016
[14] Cheng Yuankun, Liao Wenjian, Cheng Guang. Strategy of Focused Crawler with Word Embedding Clustering Weighted in Shark-Search Algorithm[J]. Computer & Digital Engineering, 2018, 46(1): 144-148.
[15] Zhang W H, Chen Y. Bayes Topic Prediction Model for Focused Crawling of Vertical Search Engine[C]// Proceedings of the 2014 IEEE Computers, Communications and IT Applications Conference. IEEE, 2014: 294-299.
[16] Liu Jingfa, Gu Yaoping, Liu Wenjie. Focused Crawler Method Combining Ontology and Improved Tabu Search for Meteorological Disaster[J]. Journal of Computer Applications, 2020, 40(8): 2255-2261.
doi: 10.11772/j.issn.1001-9081.2019122238
[17] Huang Jinjing, Huang Jinhuan, Chen Ruizhi. Topic Annealing Crawler Technology Based on Improved VIPS Algorithm[J]. Computer Simulation, 2021, 38(8): 412-416.
[18] Wang Kui, Fei Chenjie, Liu Baisong. Convolutional Neural Network Themed Reptile Research Based on LDA[J]. Computer Engineering and Applications, 2019, 55(11): 123-128, 178.
doi: 10.3778/j.issn.1002-8331.1810-0127
[19] Li Hongzhi, Song Jie. Focused Crawling Technology Based on KNN Classifier[J]. Journal of Yibin University, 2017, 17(12): 61-65.
[20] Liu Can, Ren Jianyu, Li Wei, et al. The Personalized Recommendation-Oriented Education News Crawling and Displaying System[J]. Software Engineering, 2018, 21(2): 38-40.
[21] Meng Fanjiang, Ji Xiang, Yuan Qi, et al. Research and Implementation of Agricultural Prices Subject Search Engine[J]. Journal of Northeast Agricultural University, 2016, 47(9): 64-71.
[22] Li Xuebo. Study of the Evaluation System of Web TCM Information Resource Based on Hadoop[D]. Jinan: Shandong University of Traditional Chinese Medicine, 2016.
[23] Ding Shengchun, Gong Silan, Zhou Wenjie, et al. Research on Network Public Opinion Real-Time Monitoring of the South China Sea Issue Based on Knowledge Base and Focused Crawler[J]. Journal of Intelligence, 2016, 35(5): 32-37.
[24] Wu Yuping, Yang Renguang. Comparative Research on Network Multimedia Topic Search Algorithms[J]. Library and Information Service, 2013, 57(7): 112-115.
[25] Zhang Ling, Qi Yujuan, Jiang Hua. Application of Improved Shark-Search Algorithm in Web Crawler[J]. Computer Technology and Development, 2017, 27(8): 192-194, 199.
[26] Qiu Lei, Lou Yuansheng, Chang Min. An Improved Shark-Search Algorithm for Theme Crawler[J]. Microcomputer Applications, 2017, 33(2): 19-21.
[27] Gao Le, Zhang Jian, Tian Xianzhong. Improvement and Implementation of VIPS Algorithm[J]. Computer Systems & Applications, 2009, 18(4): 65-69.
[28] Hasan K S, Ng V. Automatic Keyphrase Extraction: A Survey of the State of the Art[C]// Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 2014: 1262-1273.
[29] Zhang Y Y, Li J, Song Y, et al. Encoding Conversation Context for Neural Keyphrase Extraction from Microblog Posts[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018: 1676-1686.
[30] Zhang Q, Wang Y, Gong Y Y, et al. Keyphrase Extraction Using Deep Recurrent Neural Networks on Twitter[C]// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016: 836-845.
[31] Lilleberg J, Zhu Y, Zhang Y Q. Support Vector Machines and Word2Vec for Text Classification with Semantic Features[C]// Proceedings of the 14th International Conference on Cognitive Informatics & Cognitive Computing. IEEE, 2015: 136-140.
[32] Gleich D F. PageRank Beyond the Web[J]. SIAM Review, 2015, 57(3): 321-363.
doi: 10.1137/140976649
[33] Mihalcea R, Tarau P. TextRank: Bringing Order into Text[C]// Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004: 404-411.
[34] Chen Xin. General Web Page Text Extraction Based on Line Block Distribution Function[OL]. [2021-10-28]. www.doc88.com/p-912707793066.html.
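Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing  Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn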