Research on Chinese Micro-blog Bursty Topics Detection
Wang Yong1, Xiao Shibin1,2, Guo Yixiu1, Lv Xueqiang1,2
1. Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China; 2. Beijing TRS Information Technology Co., Ltd., Beijing 100101, China
Abstract Mining bursty topics from micro-blogs accurately and efficiently has attracted much attention. In this paper, a set of burst terms is extracted by counting term frequencies, calculating the growth rate of each term, and weighting the terms with the Term Frequency-Proportional Document Frequency (TF-PDF) algorithm. Micro-blog texts are then represented by these burst terms. Based on an analysis of how bursty topics propagate on the micro-blog platform, the authors filter out texts that do not contribute to bursty topic detection. The paper proposes a novel clustering strategy, “Absolute Clustering”, to cluster the micro-blog texts. The hotness of each text is computed as a weighted value of its reply and retweet counts, and the top 5 texts are extracted as the result of bursty topic detection. Experiments show a precision of 92.60%, a recall of 85.51%, and an F-measure of 0.89. A comparison with the traditional method demonstrates the validity of the proposed method.
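
The TF-PDF weighting mentioned in the abstract can be illustrated with a minimal sketch. It follows the commonly cited TF*PDF formulation, in which a term's weight is the sum over channels of its normalized frequency multiplied by the exponential of its document-frequency ratio. The abstract does not say how the authors define a channel or how they combine the weight with the growth rate, so the function name `tf_pdf_weights`, the input layout (a list of channels, each a list of tokenized posts), and the sample tokens are assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_pdf_weights(channels):
    """Return a TF*PDF weight per term.

    channels: list of channels; each channel is a list of documents and
    each document is a list of tokens (hypothetical input layout).
    """
    weights = Counter()
    for docs in channels:
        if not docs:
            continue  # an empty channel contributes nothing
        # Term frequency within the channel.
        term_freq = Counter(tok for doc in docs for tok in doc)
        # Number of documents in the channel that contain each term.
        doc_freq = Counter(tok for doc in docs for tok in set(doc))
        # L2 norm of the channel's term frequencies, used for normalization.
        norm = math.sqrt(sum(f * f for f in term_freq.values()))
        if norm == 0:
            continue  # channel holds only empty documents
        for term, freq in term_freq.items():
            weights[term] += (freq / norm) * math.exp(doc_freq[term] / len(docs))
    return dict(weights)


if __name__ == "__main__":
    # One channel with two short tokenized posts (made-up tokens).
    sample = [[["quake", "rescue"], ["quake"]]]
    for term, weight in sorted(tf_pdf_weights(sample).items()):
        print(term, round(weight, 4))
```

A term that appears in a larger share of a channel's posts receives an exponentially larger boost, which is why this weighting favors terms that are widely discussed rather than repeated within a single post.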
Received: 18 January 2013
Published: 24 April 2013