Crawler with Dynamic Thesaurus and Improved Shark-Search Algorithm: Case Study of Military Equipment
Ding Shengchun, Liu Kai, Fang Zhen
School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China

Abstract [Objective] This paper addresses two problems of traditional topic crawlers: low indexing rates and insufficient topic relevance. [Methods] We propose a Two-step Dynamic Shark-Search (TDSS) algorithm based on Shark-Search, which splits the topic relevance calculation into two steps: the relevance of the hyperlink and the relevance of the webpage. New keywords extracted from topic-related pages are then added to the established topic thesaurus, which improves the effectiveness of topic judgment. [Results] In the same experimental environment, the TDSS crawler's accuracy and indexing efficiency were 14.2% and 35% higher, respectively, than those of the comparable algorithms. [Limitations] More research is needed to improve the crawler's accuracy when there are excessive topic words. [Conclusions] The proposed algorithm can effectively improve the accuracy of topic information and retrieve more topic-related webpages.
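
The two-step judgment and the dynamic thesaurus described in the abstract can be illustrated with a minimal sketch. It is not the authors' TDSS implementation: the scoring function, the thresholds, the default weight for new terms, and the names `relevance` and `two_step_judge` are all assumptions made for illustration, and whitespace tokenization stands in for the word segmentation and keyword extraction a real crawler would need.

```python
from collections import Counter


def relevance(text, thesaurus):
    """Average thesaurus weight per token (a simplified, assumed score)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(thesaurus.get(tok, 0.0) for tok in tokens) / len(tokens)


def two_step_judge(anchor_text, page_text, thesaurus,
                   link_threshold=0.2, page_threshold=0.1, new_terms=2):
    """Return True if a link and its page both pass the topic check.

    All thresholds and the 0.5 weight for new terms are hypothetical values.
    The thesaurus (a dict of term -> weight) is updated in place.
    """
    # Step 1: judge the hyperlink by its anchor text before fetching the page.
    if relevance(anchor_text, thesaurus) < link_threshold:
        return False
    # Step 2: judge the fetched webpage by its body text.
    if relevance(page_text, thesaurus) < page_threshold:
        return False
    # Dynamic thesaurus: add the most frequent unseen terms from the
    # relevant page (simple frequency counting, an assumed extractor).
    counts = Counter(
        tok for tok in page_text.lower().split()
        if tok not in thesaurus and len(tok) > 3
    )
    for term, _ in counts.most_common(new_terms):
        thesaurus[term] = 0.5
    return True
```

Checking the hyperlink first lets the crawler skip downloading pages whose links already look off-topic, and each accepted page can widen the thesaurus used for later judgments.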

Received: 06 October 2021
Published: 23 September 2022
Fund: Social Science Fund of Jiangsu Province (20TQB004)
Corresponding Author: Ding Shengchun, ORCID: 0000-0002-4269-021X, E-mail: todingding@163.com