New Technology of Library and Information Service  2015, Vol. 31 Issue (2): 46-54    DOI: 10.11925/infotech.1003-3513.2015.02.07
Research Paper
Parallel Implementing Bursty Events Detection Using MapReduce
Zhuo Keqiu (卓可秋), Yu Wei (虞为), Su Xinning (苏新宁)
School of Information Management, Nanjing University, Nanjing 210023, China
Abstract

[Objective] To accurately and quickly detect domain-specific bursty events from text streams in a big data environment. [Methods] Kleinberg's burst detection method and the LDA topic model are extended to the MapReduce parallel framework to implement parallel corpus preprocessing, parallel bursty word detection, parallel bursty document filtering, and parallel topic extraction. [Results] Simulation experiments on a news text stream show that, for domain-specific bursty event detection, the parallel method achieves precision (P), recall (R), and F-measure (F, the harmonic mean of P and R) of up to 87.50%, 77.78%, and 82.35%, respectively. [Limitations] The MapReduce-based parallel method is poorly suited to online, real-time bursty event detection on large-scale dynamic text streams. [Conclusions] Compared with traditional serial bursty event detection methods, the distributed parallel method preserves the correctness of the detection results while offering good scalability and substantially improved performance.
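
The bursty word detection step named in the abstract can be illustrated with a minimal, single-process sketch. It assumes the two-state batched form of Kleinberg's burst automaton, applied independently to each word, and it simulates the map, shuffle, and reduce stages in memory instead of running on Hadoop. The function names, the `(batch, words)` input format, and the parameters `s` and `gamma` are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict

def map_phase(docs):
    """Map: emit ((word, batch), 1) once per document that contains the word."""
    for batch, words in docs:
        for word in set(words):
            yield (word, batch), 1

def shuffle_and_sum(pairs):
    """Stand-in for shuffle/combine: group by word, sum counts per batch."""
    per_word = defaultdict(lambda: defaultdict(int))
    for (word, batch), n in pairs:
        per_word[word][batch] += n
    return per_word

def state_cost(r, d, p):
    """Negative log binomial likelihood of r hits among d documents at rate p."""
    log_binom = math.lgamma(d + 1) - math.lgamma(r + 1) - math.lgamma(d - r + 1)
    return -(log_binom + r * math.log(p) + (d - r) * math.log(1 - p))

def reduce_bursts(r_counts, d_counts, s=2.0, gamma=1.0):
    """Reduce (one word): min-cost 0/1 state sequence via Viterbi.

    State 0 uses the word's overall rate p0; state 1 uses p1 = s * p0.
    Entering state 1 costs gamma * ln(n); leaving it is free.
    """
    n = len(d_counts)
    p0 = sum(r_counts) / sum(d_counts)
    p1 = s * p0
    if p0 == 0 or p1 >= 1:
        # Sketch limitation: no valid elevated rate, so report no burst.
        return [0] * n
    trans = gamma * math.log(n)
    cost = [0.0, float("inf")]  # the automaton starts in the baseline state
    back = []
    for r, d in zip(r_counts, d_counts):
        new_cost, choice = [], []
        for j, p in enumerate((p0, p1)):
            cands = [cost[i] + (trans if j > i else 0.0) for i in (0, 1)]
            best = 0 if cands[0] <= cands[1] else 1
            new_cost.append(cands[best] + state_cost(r, d, p))
            choice.append(best)
        cost = new_cost
        back.append(choice)
    # Backtrack from the cheaper final state.
    state = 0 if cost[0] <= cost[1] else 1
    states = []
    for choice in reversed(back):
        states.append(state)
        state = choice[state]
    return states[::-1]

def detect_bursty_words(docs, s=2.0, gamma=1.0):
    """Return {word: [batches in burst state]} for a list of (batch, words)."""
    totals = defaultdict(int)
    for batch, _ in docs:
        totals[batch] += 1
    batches = sorted(totals)
    d_counts = [totals[b] for b in batches]
    result = {}
    for word, counts in shuffle_and_sum(map_phase(docs)).items():
        r_counts = [counts.get(b, 0) for b in batches]
        states = reduce_bursts(r_counts, d_counts, s, gamma)
        if any(states):
            result[word] = [b for b, st in zip(batches, states) if st]
    return result
```

Because each word's state sequence depends only on that word's per-batch counts and the per-batch document totals, the reduce step can run independently per key, which is what makes this stage a natural fit for MapReduce. In a real job the batch totals would have to be shared with the reducers (for example as side data); that detail is omitted here.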

Key words: Bursty event detection; MapReduce; Distributed processing; LDA topic model
Received: 2014-08-04
CLC number: TP311.1
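Funding:

This work is one of the research outcomes of the National Social Science Fund of China project "Research on Library Semantic Cloud Services Based on Linked Data" (No. 12CTQ009), the National Social Science Fund of China major project "Research on a Rapid-Response Intelligence System for Emergency Decision-Making" (No. 13&ZD174), the National Natural Science Foundation of China general project "Research on Knowledge Organization Models and Applications for Knowledge Services" (No. 71273126), and the Jiangsu Social Science Fund youth project "Research on Digital Reading Promotion Based on Semantic Cloud Services" (No. 14TQC003).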

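Corresponding author: Yu Wei, ORCID: 0000-0003-1933-5380, E-mail: luckjp@163.com
Author contributions: Zhuo Keqiu designed the research plan, collected the data, conducted the experiments, analyzed the results, and drafted and revised the paper; Yu Wei proposed the research idea and revised the final version; Su Xinning revised the paper.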
Cite this article:
Zhuo Keqiu, Yu Wei, Su Xinning. Parallel Implementing Bursty Events Detection Using MapReduce [J]. New Technology of Library and Information Service, 2015, 31(2): 46-54. DOI: 10.11925/infotech.1003-3513.2015.02.07.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.02.07

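Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing 100190, China
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn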