New Technology of Library and Information Service  2015, Vol. 31 Issue (5): 57-64     https://doi.org/10.11925/infotech.1003-3513.2015.05.08
  Research Paper
Microblog Hotspot Detection Based on Semantic Analysis and Similarity Strength
Wu Ni, Zhao Pengwei, Qin Chunxiu
School of Economics and Management, Xidian University, Xi'an 710071, China
Abstract

[Objective] This paper improves hotspot detection for microblogs to address two weaknesses of traditional methods: insufficient semantic understanding and the limitations of their clustering algorithms. [Methods] Texts are represented from a semantic perspective: Information Gain and Latent Semantic Analysis are used to construct the word-document matrix. A two-step clustering scheme is then proposed, in which an improved K-means algorithm detects hotspots and an incremental clustering algorithm updates them. Similarity strength is used to select the optimal hotspots, avoiding the low accuracy of traditional methods that fix the number of hot topics before detecting them. [Results] The method achieves a recall of 91.3% and a precision of 92.9%, a better clustering result than previous methods. It can also update hotspots as new data arrive, which reduces experimental complexity. [Limitations] The experimental data cover a short time span, so the effect of the hotspot-update step is not pronounced. [Conclusions] Experimental results show that the proposed method has good accuracy.

Key words: Latent semantic analysis    Similarity strength    Two-step clustering    Hotspot detection
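The abstract does not spell out the incremental clustering step used to update hotspots, so the following is only a minimal sketch of a generic single-pass update, not the authors' algorithm. It assumes that documents are already represented as dense vectors (for example, after Latent Semantic Analysis), that the first clustering stage has produced centroids and cluster sizes, and that cosine similarity with a cut-off decides whether a new document joins an existing cluster. The function names and the threshold value 0.8 are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def incremental_update(centroids, sizes, new_docs, threshold=0.8):
    """Assign each new document to its most similar cluster, or open a new one.

    centroids: centroid vectors from the first clustering stage (updated in place)
    sizes:     number of documents already in each cluster (updated in place)
    threshold: hypothetical similarity cut-off, not a value from the paper
    Returns the cluster index chosen for each new document.
    """
    labels = []
    for doc in new_docs:
        sims = [cosine(doc, c) for c in centroids]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else -1
        if best >= 0 and sims[best] >= threshold:
            # Fold the document into the running mean of the matched centroid.
            n = sizes[best]
            centroids[best] = [(c * n + x) / (n + 1)
                               for c, x in zip(centroids[best], doc)]
            sizes[best] = n + 1
        else:
            # No cluster is similar enough: treat the document as a new topic.
            centroids.append(list(doc))
            sizes.append(1)
            best = len(centroids) - 1
        labels.append(best)
    return labels


# Toy data: two existing clusters and three newly arrived documents.
centroids = [[1.0, 0.0], [0.0, 1.0]]
sizes = [3, 2]
labels = incremental_update(centroids, sizes, [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])
print(labels, sizes)
```

In this toy run the first two documents join the existing clusters and the third, which is not close enough to either centroid, opens a new cluster. Updating only the affected centroids is what lets this kind of scheme absorb new data without re-clustering the whole collection.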
Received: 2014-11-17      Published: 2015-06-11
CLC number:  G353  
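Funding:

This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Knowledge-Map-Based Peer-to-Peer Semantic Communities and Their Knowledge Sharing" (No. 71103138), the Fundamental Research Funds for the Central Universities project "Research on Business Intelligence Models Based on User-Generated Content in the Context of Big Data" (No. BDY231414), and the industry-sponsored project "Employee Knowledge Management and Sharing System" with Beijing TRS Information Technology Co., Ltd. (No. hx0113060415).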

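Corresponding author: Wu Ni, ORCID: 0000-0002-8760-2308, E-mail: wuni_limia@sina.com
Author contributions: Wu Ni, Zhao Pengwei, Qin Chunxiu: proposed the research idea, designed the research scheme, and revised the final version of the paper. Wu Ni: conducted the experiments; collected, cleaned, and analyzed the data; drafted the paper.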
Cite this article:
Wu Ni, Zhao Pengwei, Qin Chunxiu. Microblog Hotspot Detection Based on Semantic Analysis and Similarity Strength. New Technology of Library and Information Service, 2015, 31(5): 57-64.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.05.08      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I5/57


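Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing 100190
Tel/Fax: (010) 82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn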