Research on Chinese Micro-blog Bursty Topics Detection
Wang Yong1, Xiao Shibin1,2, Guo Yixiu1, Lv Xueqiang1,2
1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China; 2. Beijing TRS Information Technology Co., Ltd., Beijing 100101, China
Abstract: Mining bursty topics from micro-blogs accurately and efficiently has attracted much attention. In this paper, a set of burst terms is extracted by counting term frequency, calculating the growth rate of each term, and weighting the terms with the Term Frequency-Proportional Document Frequency (TF-PDF) algorithm. The micro-blog texts are then represented by these burst terms. By analyzing how bursty topics propagate on the micro-blog platform, the authors filter out texts that do not contribute to bursty topic detection. The paper proposes a novel clustering strategy, “Absolute Clustering”, to cluster the micro-blog texts. The hotness of each text is computed as a weighted value of its reply and retweet counts, and the top 5 texts are extracted as the result of bursty topic detection. Experiments show a precision of 92.60%, a recall of 85.51%, and an F-measure of 0.89. A comparison with the traditional method demonstrates the validity of the proposed method.
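As a quick consistency check on the reported figures, the sketch below recomputes the F-measure from the stated precision and recall. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the abstract does not say which variant was used, and the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced F1 variant.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, expressed as fractions.
precision = 0.9260
recall = 0.8551

score = f_measure(precision, recall)
print(round(score, 2))  # 0.89
```

Under that assumption, the recomputed value rounds to the 0.89 reported in the abstract.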
王勇, 肖诗斌, 郭跇秀, 吕学强. 中文微博突发事件检测研究[J]. 现代图书情报技术, 2013, 29(2): 57-62.
Wang Yong, Xiao Shibin, Guo Yixiu, Lv Xueqiang. Research on Chinese Micro-blog Bursty Topics Detection. New Technology of Library and Information Service, 2013, 29(2): 57-62.