Focused Crawling for Network Public Opinion’s Topic Information
Huang Wei1,2, Jin Yabo1, Hu Changlong1
1. School of Management, Hubei University of Technology, Wuhan 430068, China; 2. School of Management, Wuhan University of Technology, Wuhan 430070, China
Abstract: Network public opinion information is increasingly unfocused, which makes it harder to collect by topic. By analyzing the features and evolution of network group events, this article proposes a focused crawler for network public opinion based on a content topic selection strategy with time and spatial dimension factors. Experimental results show that this focused crawler has higher execution efficiency and also achieves good focusing ability. It thereby provides focused resources for processing network public opinion group events.
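The abstract does not give the scoring formula, so the following is only a minimal sketch of how a content topic score might be combined with time and spatial factors to order a crawl frontier. Every name and constant here (`content_score`, `time_factor`, `spatial_factor`, `Frontier`, the 7-day half-life, the 0.5 miss weight) is a hypothetical illustration, not the authors' method.

```python
import heapq


def content_score(text, topic_terms):
    """Fraction of topic terms found in the text (a crude relevance proxy)."""
    if not topic_terms:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for term in topic_terms if term in words) / len(topic_terms)


def time_factor(age_days, half_life_days=7.0):
    """Assumed exponential decay: the weight halves every half_life_days."""
    return 0.5 ** (max(age_days, 0.0) / half_life_days)


def spatial_factor(text, place_terms, miss_weight=0.5):
    """1.0 if any place term is mentioned, otherwise an assumed penalty."""
    words = set(text.lower().split())
    return 1.0 if any(place in words for place in place_terms) else miss_weight


def topic_priority(text, age_days, topic_terms, place_terms):
    """Product of the content, time, and spatial factors (assumed form)."""
    return (content_score(text, topic_terms)
            * time_factor(age_days)
            * spatial_factor(text, place_terms))


class Frontier:
    """Best-first URL queue: the highest-priority URL is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url in self._seen:  # skip URLs that were already queued
            return
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score
        heapq.heappush(self._heap, (-score, url))

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    topic_terms = {"protest", "factory"}   # hypothetical topic vocabulary
    place_terms = {"wuhan"}                # hypothetical place vocabulary
    frontier = Frontier()
    frontier.push("http://example.com/a",
                  topic_priority("Factory protest reported in Wuhan", 0,
                                 topic_terms, place_terms))
    frontier.push("http://example.com/b",
                  topic_priority("Factory opens new line", 7,
                                 topic_terms, place_terms))
    while len(frontier):
        print(frontier.pop())
```

A multiplicative combination is used here so that a page scoring zero on content is never fetched regardless of recency or location; a weighted sum would be an equally plausible design under other assumptions.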
黄炜, 金雅博, 胡昌龙. 网络舆情主题信息采集研究[J]. 现代图书情报技术, 2012, (11): 65-71.
Huang Wei, Jin Yabo, Hu Changlong. Focused Crawling for Network Public Opinion’s Topic Information. New Technology of Library and Information Service, 2012, (11): 65-71.