[Objective] This paper addresses two problems of traditional topic crawlers: low indexing rates and insufficient topic relevance. [Methods] We propose a Two-step Dynamic Shark-Search (TDSS) algorithm, based on Shark-Search, that splits the topic relevance calculation into two parts: the relevance of the hyperlink and the relevance of the webpage. New keywords extracted from topic-related pages are then added to the established topic thesaurus, which improves the effectiveness of topic judgment. [Results] In the same experimental environment, the TDSS crawler's accuracy and indexing efficiency were 14.2% and 35% higher, respectively, than those of the comparison algorithms. [Limitations] More research is needed to improve the crawler's accuracy when the number of topic words becomes excessive. [Conclusions] The proposed algorithm effectively improves the accuracy of topic information and retrieves more topic-related webpages.
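The two-step judgment and the dynamic thesaurus described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the TDSS formulas, so the scoring function (weighted share of thesaurus words), the thresholds, the weight given to new words, the crude length filter, and all names (`tokenize`, `relevance`, `two_step_judge`, `fetch_page`) are assumptions chosen only to show the control flow. A link is scored first, the page is downloaded and scored only if the link passes, and frequent new words from relevant pages are added to the thesaurus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (assumption; Chinese text would need a word segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(tokens, thesaurus):
    """Assumed score: sum of thesaurus weights of the tokens, divided by token count."""
    if not tokens:
        return 0.0
    return sum(thesaurus.get(t, 0.0) for t in tokens) / len(tokens)


def two_step_judge(anchor_text, fetch_page, thesaurus,
                   link_threshold=0.2, page_threshold=0.1,
                   new_word_weight=0.5, top_k=1):
    """Return True if the page behind a link is judged topic-related.

    All thresholds and weights are hypothetical illustration values.
    `fetch_page` is a caller-supplied function returning the page text.
    """
    # Step 1: hyperlink relevance, computed from the anchor text only.
    if relevance(tokenize(anchor_text), thesaurus) < link_threshold:
        return False  # the page is never downloaded

    # Step 2: webpage relevance, computed from the downloaded text.
    page_tokens = tokenize(fetch_page())
    if relevance(page_tokens, thesaurus) < page_threshold:
        return False

    # Dynamic thesaurus: add the most frequent unseen words of a relevant page.
    # The length filter is a crude stand-in for real keyword extraction.
    unseen = Counter(t for t in page_tokens
                     if t not in thesaurus and len(t) > 3)
    for word, _ in unseen.most_common(top_k):
        thesaurus[word] = new_word_weight
    return True
```

Filtering on the link first saves downloads for clearly off-topic links, while the thesaurus update lets later links that mention newly learned words pass the first step.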
Ding Shengchun, Liu Kai, Fang Zhen. Crawler with Dynamic Thesaurus and Improved Shark-Search Algorithm: Case Study of Military Equipment[J]. Data Analysis and Knowledge Discovery, 2022, 6(8): 52-60.