Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (3): 95-101    DOI: 10.11925/infotech.2096-3467.2018.0625
Research Paper
Topic Evolutionary Analysis of Short Text Based on Word Vector and BTM*
Peiyao Zhang, Dongsu Liu
School of Economics and Management, Xidian University, Xi'an 710126, China
Abstract

[Objective] This paper constructs a topic evolution method for microblogs in order to track topic development trends accurately and improve early warning of online public opinion. [Methods] First, a Skip-gram model is trained on the text collection to obtain word vectors. The microblog texts of each time slice are then fed into the Biterm Topic Model (BTM) to obtain candidate topics, and a word vector is constructed for each candidate topic along the topic dimension. Next, the K-means algorithm clusters these topic word vectors to produce merged topics, from which the topic evolution path of the text collection across time slices is established. [Results] In the experiments, the method reaches an F value of 75% for topic extraction, about 10% higher than the topic model used for comparison, which supports its feasibility. [Limitations] There is no consistent standard for measuring topic evolution, and the method was not compared with other topic evolution methods. [Conclusions] The proposed method effectively extracts topics at each stage and offers a practical approach to online public opinion analysis.
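The clustering and path-building steps described under [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: the word vectors and candidate topics below are hand-written toy values standing in for Skip-gram and BTM output, the probability-weighted average used to build a topic vector is an assumption (the abstract does not state the exact construction), and the helper names `topic_vector` and `kmeans` are hypothetical. K-means is written in plain Python so the example has no dependencies.

```python
import math

def topic_vector(top_words, word_vecs):
    """Assumed construction: probability-weighted mean of a topic's word vectors."""
    dim = len(next(iter(word_vecs.values())))
    total = sum(p for w, p in top_words if w in word_vecs)
    vec = [0.0] * dim
    for w, p in top_words:
        if w in word_vecs:
            for i, x in enumerate(word_vecs[w]):
                vec[i] += p / total * x
    return vec

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20):
    """Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy 2-dimensional word vectors (hypothetical; a Skip-gram model would supply these).
word_vecs = {
    "flood": [1.0, 0.0], "rescue": [0.9, 0.1], "rain": [0.8, 0.2],
    "match": [0.0, 1.0], "goal": [0.1, 0.9], "team": [0.2, 0.8],
}

# Toy candidate topics per time slice (hypothetical; BTM would supply these):
# (time slice, topic id, top words with probabilities).
candidates = [
    (0, "t0-a", [("flood", 0.6), ("rain", 0.4)]),
    (0, "t0-b", [("match", 0.7), ("team", 0.3)]),
    (1, "t1-a", [("rescue", 0.5), ("flood", 0.5)]),
    (1, "t1-b", [("goal", 0.6), ("match", 0.4)]),
]

vectors = [topic_vector(words, word_vecs) for _, _, words in candidates]
labels = kmeans(vectors, k=2)

# Candidate topics that share a cluster form one merged topic; ordering its
# members by time slice gives that topic's evolution path.
paths = {}
for (slice_id, topic_id, _), label in zip(candidates, labels):
    paths.setdefault(label, []).append((slice_id, topic_id))
for members in paths.values():
    members.sort()

for label, members in sorted(paths.items()):
    print(label, members)
```

In this toy setup the two disaster-related candidates fall into one cluster and the two sports-related candidates into the other, so each merged topic has one member per time slice. With real data, the number of clusters would need to be chosen and validated rather than fixed at 2.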

Keywords: Biterm Topic Model; Word Embedding; Topic Similarity; Topic Evolution
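Received: 2018-06-06
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China Youth Program project "Research on Community Detection Algorithms for Large-Scale Dynamic Social Networks" (Grant No. 71401130).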
Cite this article:
Peiyao Zhang, Dongsu Liu. Topic Evolutionary Analysis of Short Text Based on Word Vector and BTM*[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 95-101. DOI: 10.11925/infotech.2096-3467.2018.0625.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0625