Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (8): 52-60    DOI: 10.11925/infotech.2096-3467.2021.1125
Crawler with Dynamic Thesaurus and Improved Shark-Search Algorithm: Case Study of Military Equipment
Ding Shengchun, Liu Kai, Fang Zhen
School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China
Abstract  

[Objective] This paper addresses two problems of traditional focused crawlers: low indexing rates and insufficient topic relevance. [Methods] We propose a Two-step Dynamic Shark-Search (TDSS) algorithm based on Shark-Search, which splits the topic relevance calculation into two parts: the relevance of hyperlinks and the relevance of webpages. New keywords extracted from topic-related pages are then added to the established topic thesaurus, which improves the effectiveness of topic judgment. [Results] In the same experimental environment, the TDSS crawler’s accuracy and indexing efficiency were 14.2% and 35% higher, respectively, than those of the comparable algorithms. [Limitations] More research is needed to improve the crawler’s accuracy when the thesaurus contains excessive topic words. [Conclusions] The proposed algorithm effectively improves the accuracy of topic information and retrieves more topic-related webpages.
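
The abstract does not give the TDSS formulas, so the following is only an illustrative sketch of the two ideas it names: judging a hyperlink and the fetched webpage in two separate steps, and growing the thesaurus from pages judged relevant. The scoring function (share of tokens found in the thesaurus), the thresholds, the frequency-based keyword extraction, and the `TOY_WEB` data are all assumptions made for this example, not the authors' method. The original Shark-Search also passes decayed scores from parent pages to child links, which is omitted here for brevity.

```python
import heapq
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokenizer; Chinese pages would need a word segmenter instead."""
    return re.findall(r"[a-z0-9]+", text.lower())

def relevance(text, thesaurus):
    """Assumed score: share of tokens in `text` that appear in the thesaurus."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in thesaurus) / len(tokens)

def expand_thesaurus(page_text, thesaurus, top_k=2, min_count=2):
    """Add the most frequent unseen terms of a relevant page to the thesaurus."""
    counts = Counter(t for t in tokenize(page_text)
                     if t not in thesaurus and len(t) > 3)
    new_terms = [t for t, c in counts.most_common(top_k) if c >= min_count]
    thesaurus.update(new_terms)
    return new_terms

def crawl(seed, fetch, thesaurus, link_threshold=0.2, page_threshold=0.1,
          max_pages=10):
    """Best-first crawl with a link-level check and a page-level check."""
    frontier = [(-1.0, seed)]          # max-heap via negated scores
    seen = {seed}
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        fetched += 1
        # Page-level step: judge the fetched page against the thesaurus.
        if relevance(text, thesaurus) < page_threshold:
            continue
        relevant.append(url)
        expand_thesaurus(text, thesaurus)   # dynamic thesaurus update
        # Link-level step: only queue links whose anchor text looks on-topic.
        for anchor, target in links:
            score = relevance(anchor, thesaurus)
            if target not in seen and score >= link_threshold:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return relevant

# Hypothetical in-memory "web": url -> (page text, [(anchor text, target url)]).
TOY_WEB = {
    "seed": ("tank armor tank armor missile report",
             [("missile test", "a"), ("cooking recipes", "b")]),
    "a": ("missile launch radar radar tracking", [("radar station", "c")]),
    "b": ("pasta sauce recipe", []),
    "c": ("radar station upgrade", []),
}

thesaurus = {"tank", "missile"}
pages = crawl("seed", TOY_WEB.__getitem__, thesaurus)
print(pages, sorted(thesaurus))
# prints ['seed', 'a', 'c'] ['armor', 'missile', 'radar', 'tank']
```

In this toy run, page "c" is reached only because "radar" was added to the thesaurus after page "a" was judged relevant; with a static thesaurus the anchor "radar station" would score zero and the link would be dropped.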

Key words: Focused Crawler; Shark-Search; Topic Relevance; Thesaurus
Received: 06 October 2021      Published: 23 September 2022
ZTFLH: E91; TP391
Fund: Social Science Fund of Jiangsu Province (20TQB004)
Corresponding Authors: Ding Shengchun, ORCID: 0000-0002-4269-021X     E-mail: todingding@163.com

Cite this article:

Ding Shengchun, Liu Kai, Fang Zhen. Crawler with Dynamic Thesaurus and Improved Shark-Search Algorithm: Case Study of Military Equipment. Data Analysis and Knowledge Discovery, 2022, 6(8): 52-60.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.1125     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I8/52

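Method    Reference    Precision    Recall    Limitation
VIPS-based page depth analysis + Shark-Search algorithm    [9]    0.66    Topic relevance and link weights are slow to compute
Focused crawler based on URL pattern sets    [10]    0.69    0.52    Requires collecting as many of a site's URLs as possible, which is hard in practice
Best-First algorithm + HITS algorithm    [11]    0.61    0.75    HITS is time-consuming and lowers crawling efficiency
Local community detection + topic relevance analysis    [12]    0.63    Local community detection has limited applicability
Semantic relevance + regression analysis of page importance    [13]    0.91    0.65    Algorithm design is overly redundant; hardware requirements are high
Topic Crawler Methods Based on Link Analysis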
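Method    Reference    Precision    Recall    Limitation
Focused crawler with ontology + improved tabu search strategy    [16]    0.82    Ontology construction is relatively complex
VIPS analysis of page visual blocks + topic annealing    [17]    0.95    Construction of the rule engine is not disclosed
Convolutional neural network focused crawler incorporating LDA    [18]    0.85    0.66    Building the LDA model is labor-intensive
Focused crawler based on the KNN classification algorithm    [19]    0.75    Dictionary is not refined for the specific task
Topic Crawler Methods Based on Web Content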
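
For readers comparing the precision and recall columns above, the sketch below shows how these two measures are commonly computed for a focused crawler. It assumes the usual definitions (precision as the share of crawled pages that are relevant, recall as the share of all relevant pages that were crawled); the cited studies may define or estimate them differently, and the function names and sample data are hypothetical.

```python
def harvest_precision(crawled, relevant):
    """Share of crawled pages that are relevant (assumed definition)."""
    crawled = set(crawled)
    if not crawled:
        return 0.0
    return len(crawled & set(relevant)) / len(crawled)

def recall(crawled, relevant):
    """Share of all relevant pages that the crawler retrieved (assumed definition)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(crawled) & relevant) / len(relevant)

# Hypothetical example: 4 pages crawled, 3 relevant pages exist, 2 overlap.
crawled_pages = ["a", "b", "c", "d"]
relevant_pages = ["a", "c", "e"]
print(harvest_precision(crawled_pages, relevant_pages))  # prints 0.5
print(round(recall(crawled_pages, relevant_pages), 2))   # prints 0.67
```

Recall needs the full set of relevant pages, which is rarely known on the open web; that may be one reason some rows in the tables report only a single value.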
Architecture of TDSS Method
Accuracy Comparison of 5 Methods
Related Web Pages Crawled by 5 Methods
[1] Fan Hao, Zheng Xiaochuan. A Review of the Research on Open Source Intelligence at Home and Abroad[J]. Information Studies: Theory & Application, 2021, 44(10): 185-192, 201.
[2] Ding Botao. The Development of Open Source Information Abroad and the Strategic Studies in China[J]. Information and Documentation Services, 2011(6): 103-106.
[3] Fu Chang, Song Jiaqing. Design and Realization of Web Military Intelligence Mining System Based on Document Clustering[J]. Journal of China Academy of Electronics and Information Technology, 2015, 10(5): 541-545.
[4] Xie Ling. Terrorism Information Mining and Analysis in the Dark Web[J]. Global Review, 2021, 13(3): 135-151.
[5] Fei Chenjie, Liu Baisong. Focused Crawler Based on LDA Extended Topic Terms[J]. Computer Applications and Software, 2018, 35(4): 49-54.
[6] Wang Chong, Ji Xianhui. Improved PageRank Algorithm Based on User Interest and Topic[J]. Computer Science, 2016, 43(3): 275-278, 312.
doi: 10.11896/j.issn.1002-137X.2016.03.051
[7] Liu Hao, Hong Yu, Yao Liang, et al. HITS-Based Optimization Method for Bilingual Corpus Mining[J]. Journal of Chinese Information Processing, 2017, 31(2): 25-35.
[8] Seyfi A, Patel A, Celestino J. Empirical Evaluation of the Link and Content-Based Focused Treasure-Crawler[J]. Computer Standards & Interfaces, 2016, 44: 54-62.
doi: 10.1016/j.csi.2015.09.007
[9] Liu N W, Yao R B. The Crawling Strategy of Shark-Search Algorithm Based on Multi Granularity[C]// Proceedings of the 8th International Symposium on Computational Intelligence and Design. IEEE, 2015: 41-44.
[10] Hu Pingrui, Li Shijun. Focused Crawler Based on URL Patterns[J]. Application Research of Computers, 2018, 35(3): 694-699.
[11] Liu Shaotao, Li Hongsheng. Topic Crawler Algorithm with Link Structure[J]. Journal of Huaqiao University (Natural Science), 2017, 38(2): 195-200.
[12] Shen Guilan, Sun Jie, Yang Xiaoping. Focused Crawling Method Based on Detecting Local Communities in Complex Networks[J]. Journal of Henan Normal University (Natural Science Edition), 2014, 42(4): 134-138.
[13] Huang Wei, Zhang Zhancheng, Zhu Bin, et al. A Network Counter-Terrorism Information Crawler Based on the Regression Analysis[J]. Library and Information Service, 2018, 62(4): 121-129.
doi: 10.13266/j.issn.0252-3116.2018.04.016
[14] Cheng Yuankun, Liao Wenjian, Cheng Guang. Strategy of Focused Crawler with Word Embedding Clustering Weighted in Shark-Search Algorithm[J]. Computer & Digital Engineering, 2018, 46(1): 144-148.
[15] Zhang W H, Chen Y. Bayes Topic Prediction Model for Focused Crawling of Vertical Search Engine[C]// Proceedings of the 2014 IEEE Computers, Communications and IT Applications Conference. IEEE, 2014: 294-299.
[16] Liu Jingfa, Gu Yaoping, Liu Wenjie. Focused Crawler Method Combining Ontology and Improved Tabu Search for Meteorological Disaster[J]. Journal of Computer Applications, 2020, 40(8): 2255-2261.
doi: 10.11772/j.issn.1001-9081.2019122238
[17] Huang Jinjing, Huang Jinhuan, Chen Ruizhi. Topic Annealing Crawler Technology Based on Improved VIPS Algorithm[J]. Computer Simulation, 2021, 38(8): 412-416.
[18] Wang Kui, Fei Chenjie, Liu Baisong. Convolutional Neural Network Themed Reptile Research Based on LDA[J]. Computer Engineering and Applications, 2019, 55(11): 123-128, 178.
doi: 10.3778/j.issn.1002-8331.1810-0127
[19] Li Hongzhi, Song Jie. Focused Crawling Technology Based on KNN Classifier[J]. Journal of Yibin University, 2017, 17(12): 61-65.
[20] Liu Can, Ren Jianyu, Li Wei, et al. The Personalized Recommendation-Oriented Education News Crawling and Displaying System[J]. Software Engineering, 2018, 21(2): 38-40.
[21] Meng Fanjiang, Ji Xiang, Yuan Qi, et al. Research and Implementation of Agricultural Prices Subject Search Engine[J]. Journal of Northeast Agricultural University, 2016, 47(9): 64-71.
[22] Li Xuebo. Study of the Evaluation System of Web TCM Information Resource Based on Hadoop[D]. Jinan: Shandong University of Traditional Chinese Medicine, 2016.
[23] Ding Shengchun, Gong Silan, Zhou Wenjie, et al. Research on Network Public Opinion Real-Time Monitoring of the South China Sea Issue Based on Knowledge Base and Focused Crawler[J]. Journal of Intelligence, 2016, 35(5): 32-37.
[24] Wu Yuping, Yang Renguang. Comparative Research on Network Multimedia Topic Search Algorithms[J]. Library and Information Service, 2013, 57(7): 112-115.
[25] Zhang Ling, Qi Yujuan, Jiang Hua. Application of Improved Shark-Search Algorithm in Web Crawler[J]. Computer Technology and Development, 2017, 27(8): 192-194, 199.
[26] Qiu Lei, Lou Yuansheng, Chang Min. An Improved Shark-Search Algorithm for Theme Crawler[J]. Microcomputer Applications, 2017, 33(2): 19-21.
[27] Gao Le, Zhang Jian, Tian Xianzhong. Improvement and Implementation of VIPS Algorithm[J]. Computer Systems & Applications, 2009, 18(4): 65-69.
[28] Hasan K S, Ng V. Automatic Keyphrase Extraction: A Survey of the State of the Art[C]// Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. 2014: 1262-1273.
[29] Zhang Y Y, Li J, Song Y, et al. Encoding Conversation Context for Neural Keyphrase Extraction from Microblog Posts[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018: 1676-1686.
[30] Zhang Q, Wang Y, Gong Y Y, et al. Keyphrase Extraction Using Deep Recurrent Neural Networks on Twitter[C]// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016: 836-845.
[31] Lilleberg J, Zhu Y, Zhang Y Q. Support Vector Machines and Word2Vec for Text Classification with Semantic Features[C]// Proceedings of the 14th International Conference on Cognitive Informatics & Cognitive Computing. IEEE, 2015: 136-140.
[32] Gleich D F. PageRank Beyond the Web[J]. SIAM Review, 2015, 57(3): 321-363.
doi: 10.1137/140976649
[33] Mihalcea R, Tarau P. TextRank: Bringing Order into Text[C]// Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. 2004: 404-411.
[34] Chen Xin. General Web Page Text Extraction Based on Line Block Distribution Function[OL]. [2021-10-28]. www.doc88.com/p-912707793066.html.