[Objective] This paper addresses two problems of traditional topic crawlers: low indexing rates and insufficient topic relevance. [Methods] We propose a Two-step Dynamic Shark-Search (TDSS) algorithm, based on Shark-Search, which splits the topic relevance calculation into two parts: the topic relevance of the hyperlink and the topic relevance of the webpage. New keywords extracted from topic-related pages are then added to the established topic thesaurus, which improves the effectiveness of topic judgment. [Results] In the same experimental environment, the TDSS crawler's accuracy and indexing efficiency were 14.2% and 35% higher, respectively, than those of the comparable algorithms. [Limitations] More research is needed to improve the crawler's accuracy when the thesaurus contains excessive topic words. [Conclusions] The proposed algorithm can effectively improve the accuracy of topic information and retrieve more topic-related webpages.
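The abstract gives no formulas, so the following is only a minimal sketch of how a two-step relevance check with a growing thesaurus might be organized, not the authors' implementation. It assumes cosine similarity against a weighted keyword thesaurus, Shark-Search-style mixing weights (`beta`, `gamma`, `delta`), that the hyperlink is scored before the page is fetched, and a simple frequency-based keyword extraction. All function names, thresholds, and default values are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for real word segmentation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(tokens, thesaurus):
    """Cosine similarity between token counts and thesaurus weights."""
    counts = Counter(tokens)
    dot = sum(counts[t] * w for t, w in thesaurus.items())
    norm_doc = math.sqrt(sum(c * c for c in counts.values()))
    norm_topic = math.sqrt(sum(w * w for w in thesaurus.values()))
    if norm_doc == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_doc * norm_topic)


def link_score(anchor, context, parent_score, thesaurus,
               beta=0.8, gamma=0.5, delta=0.5):
    """Step 1 (assumed): Shark-Search-style score for an unvisited link."""
    neighborhood = (beta * cosine(tokenize(anchor), thesaurus)
                    + (1 - beta) * cosine(tokenize(context), thesaurus))
    inherited = delta * parent_score  # decayed relevance of the parent page
    return gamma * inherited + (1 - gamma) * neighborhood


def update_thesaurus(thesaurus, page_tokens,
                     top_k=2, new_weight=0.5, min_count=2):
    """Add frequent unseen terms from a relevant page (assumed heuristic)."""
    counts = Counter(t for t in page_tokens
                     if t not in thesaurus and len(t) > 3)
    for term, count in counts.most_common(top_k):
        if count >= min_count:
            thesaurus[term] = new_weight


def two_step_relevance(link, fetch, thesaurus,
                       link_threshold=0.2, page_threshold=0.3):
    """Return the page score, or None if the link is rejected in step 1."""
    s1 = link_score(link["anchor"], link["context"],
                    link["parent_score"], thesaurus)
    if s1 < link_threshold:
        return None  # page is never downloaded
    tokens = tokenize(fetch(link["url"]))
    s2 = cosine(tokens, thesaurus)  # Step 2: score the page content itself
    if s2 >= page_threshold:
        update_thesaurus(thesaurus, tokens)  # thesaurus grows dynamically
    return s2
```

Here `fetch` is any caller-supplied function that maps a URL to page text, which keeps the sketch free of network code. Rejecting weak links before download is one plausible reading of how a two-step scheme could save bandwidth; the paper itself should be consulted for the actual scoring formulas and keyword extraction method.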
Ding Shengchun, Liu Kai, Fang Zhen. Crawler with Dynamic Thesaurus and Improved Shark-Search Algorithm: Case Study of Military Equipment[J]. Data Analysis and Knowledge Discovery, 2022, 6(8): 52-60.