New Technology of Library and Information Service  2015, Vol. 31 Issue (2): 46-54    DOI: 10.11925/infotech.1003-3513.2015.02.07
Parallel Implementing Bursty Events Detection Using MapReduce
Zhuo Keqiu, Yu Wei, Su Xinning
School of Information Management, Nanjing University, Nanjing 210023, China
Abstract  

[Objective] This paper aims to detect bursty events in text streams accurately and quickly in a big data environment. [Methods] A detection method based on Kleinberg's burst detection algorithm and the LDA topic model is extended to the MapReduce framework, parallelizing four stages: corpus preprocessing, bursty word detection, bursty document filtering, and topic extraction. [Results] Simulation experiments on a news text stream show that, when detecting bursty events in specific domains, the parallel method reaches a precision of 87.50%, a recall of 77.78%, and an F-measure of 82.35%. [Limitations] The MapReduce-based parallel method is not well suited to online, real-time detection of bursty events in large-scale dynamic text streams. [Conclusions] Compared with the traditional serial method for detecting bursty events, the distributed parallel method preserves detection accuracy and scales well.
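
The three reported figures can be cross-checked against each other. The following is a minimal sketch, assuming the F-measure in the abstract is the balanced F1 score (the harmonic mean of precision and recall); the abstract does not state the weighting, and the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" weights precision and recall
    equally; the source does not say so explicitly.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, expressed as fractions.
precision = 0.8750
recall = 0.7778

f1 = f_measure(precision, recall)
print(f"F-measure = {f1:.2%}")  # prints "F-measure = 82.35%"
```

Under that assumption, the computed value matches the 82.35% given in the abstract.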

Key words: Bursty event detection; MapReduce; Distributed process; LDA topic model
Received: 04 August 2014      Published: 17 March 2015
CLC Number: TP311.1

Cite this article:

Zhuo Keqiu, Yu Wei, Su Xinning. Parallel Implementing Bursty Events Detection Using MapReduce. New Technology of Library and Information Service, 2015, 31(2): 46-54.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.02.07     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I2/46

