Data Analysis and Knowledge Discovery, 2018, Vol. 2, Issue (9): 31-41
Research Paper
Analyzing News Topic Evolution with Convolutional Neural Networks and Topic2Vec
Xu Yuemei1, Lv Sining1, Cai Lianqiao1, Zhang Xiaoya2
1Department of Computer Science, Beijing Foreign Studies University, Beijing 100089, China
2School of International Journalism and Communication, Beijing Foreign Studies University, Beijing 100089, China

[Objective] This study examines how the content and sentiment of topics in online news reports evolve over time, in order to track the direction of media opinion. [Methods] We propose a Topic2Vec-based word vector representation that improves the semantic-space distance between news topics, and use a convolutional neural network to learn the topic-feature word matrix and cluster large numbers of news topics. For each group of matching topics we then plot content-intensity and sentiment evolution curves and identify the events and key sub-topics that the topic focuses on. [Results] Experiments on news reports about China published by CNN (Cable News Network) from 2015 to 2017 show that the method reveals how topics and their sentiment evolve over the whole time span. [Limitations] The effect of time window length on topic evolution, and mechanisms for variable-length time windows, are not fully explored. [Conclusions] The proposed model brings topics of the same class closer together in semantic space and improves topic classification accuracy by about 10% over the baseline models, which makes it feasible to analyze news topic evolution over the whole time span.

Key words: News Topic; Convolutional Neural Networks; Topic Evolution; Topic2Vec
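Received: 2018-01-18      Published: 2018-10-25
CLC number (ZTFLH): TP393
Funding: *This work is supported by the Beijing Social Science Foundation project "A Comparative Study of the Influence of 'Two Micros and One Client' (Weibo, WeChat, and Mobile News Apps) in Beijing's International Cultural Communication" (Grant No. 15JDZHC011) and the National Natural Science Foundation of China project "Research on Dynamic Optimization Models of In-network Caching and Request Routing in Information-Centric Networks" (Grant No. 61502038).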
Xu Yuemei, Lv Sining, Cai Lianqiao, Zhang Xiaoya. Analyzing News Topic Evolution with Convolutional Neural Networks and Topic2Vec*[J]. Data Analysis and Knowledge Discovery, 2018, 2(9): 31-41.

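Method | How time is introduced | Representative model | Number of topics | Advantages | Disadvantages
Method 1 | As an observable continuous variable | ToT | Fixed | Yields a continuous time distribution of topics; no need to consider [...] | Number of topics is fixed; requires topics to be distributed across all time periods
Method 2 | Discretized by time after modeling | Topic Entropy | Fixed | Captures global topic information; relatively comprehensive | Number of topics is fixed; cannot detect new topics
Method 3 | Discretized by time before modeling | ODTM | Not fixed | Number of topics can vary; new topics can be detected | Requires a suitable time granularity to be designed
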
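
The "discretize by time first" strategy in the last row of the table above can be sketched as follows: documents are grouped into time windows before any topic model is fitted, so each window is free to have its own number of topics. This is a minimal sketch and not the authors' pipeline. The monthly granularity, the helper name `bucket_by_month`, and the sample texts are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def bucket_by_month(docs):
    """Group (date, text) pairs into monthly windows keyed by (year, month).

    Monthly granularity is an assumption; the table only says that a
    suitable time granularity has to be chosen.
    """
    windows = defaultdict(list)
    for published, text in docs:
        windows[(published.year, published.month)].append(text)
    # Return the windows in chronological order so that each one can be
    # passed to a topic model in turn.
    return dict(sorted(windows.items()))


# Hypothetical sample documents, for illustration only.
docs = [
    (date(2015, 6, 2), "cruise ship capsizes on the yangtze river"),
    (date(2015, 5, 21), "surveillance aircraft flies over disputed reef"),
    (date(2015, 6, 5), "rescue teams search for passengers"),
]
windows = bucket_by_month(docs)
for key, texts in windows.items():
    print(key, len(texts))
```

A coarser window gives each topic model more documents, while a finer window tracks short-lived events more closely, which is the granularity trade-off the table lists as the drawback of this approach.
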
T5,2 | T5,3 | T10,1 | T7,4
Livelihood (spacecraft launch) | Military (South China Sea issue) | Military (South China Sea issue) | Economy (stock market)
space 0.0671 | sea 0.0215 | sea 0.0199 | market 0.0233
mission 0.0288 | south 0.0184 | island 0.0189 | economy 0.0191
shenzhou 0.0205 | navy 0.0136 | south 0.0147 | stock 0.01637
astronaut 0.0192 | military 0.0136 | build 0.0121 | growth 0.0130
Chinese 0.0091 | island 0.0132 | warn 0.0079 | month 0.0103
launch 0.0071 | flight 0.0110 | aircraft 0.0079 | rate 0.0088
opportunity 0.0071 | aircraft 0.0088 | dispute 0.0073 | currency 0.0085
center 0.0064 | dispute 0.0071 | issue 0.0068 | global 0.0075
station 0.0064 | claim 0.0066 | military 0.0068 | bank 0.0075
spaceflight 0.0052 | reef 0.0066 | surveillance 0.0063 | treasury 0.0067

T10,4 | T6,6 | T9,4 | T11,1
Economy (stock market) | Livelihood (Yangtze River shipwreck) | Livelihood (mixed) | Politics (Xi-Ma meeting)
market 0.0412 | ship 0.0563 | GDP 0.0475 | china 0.0438
stock 0.0293 | yangtze 0.0228 | dinosaur 0.0368 | taiwan 0.0421
economy 0.0208 | river 0.0196 | Sale 0.0245 | ma 0.0193
bank 0.0117 | sink 0.0149 | quarter 0.0231 | xi 0.0158
rate 0.0107 | eastern 0.0146 | smartphone 0.0223 | meeting 0.0152
government 0.0096 | cruise 0.0142 | brand 0.0192 | party 0.0123
month 0.0096 | capsize 0.0138 | democracy 0.0169 | president 0.0117
economist 0.0085 | report 0.0102 | Status 0.0138 | leader 0.0094
crash 0.0081 | rescue 0.0095 | product 0.0101 | relation 0.0088
fall 0.0061 | passenger 0.0081 | feather 0.0101 | close 0.0070

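
To illustrate why topics from different time windows can be matched, the sketch below compares three of the topics listed above using only their top-10 word probabilities. It uses plain cosine similarity over sparse word-probability vectors, which is a stand-in chosen for illustration and not the Topic2Vec representation the paper proposes. The variable names and the `cosine` helper are assumptions.

```python
import math

# Top-10 word probabilities copied from the topic tables above.
T5_3 = {"sea": 0.0215, "south": 0.0184, "navy": 0.0136, "military": 0.0136,
        "island": 0.0132, "flight": 0.0110, "aircraft": 0.0088,
        "dispute": 0.0071, "claim": 0.0066, "reef": 0.0066}
T10_1 = {"sea": 0.0199, "island": 0.0189, "south": 0.0147, "build": 0.0121,
         "warn": 0.0079, "aircraft": 0.0079, "dispute": 0.0073,
         "issue": 0.0068, "military": 0.0068, "surveillance": 0.0063}
T7_4 = {"market": 0.0233, "economy": 0.0191, "stock": 0.01637,
        "growth": 0.0130, "month": 0.0103, "rate": 0.0088,
        "currency": 0.0085, "global": 0.0075, "bank": 0.0075,
        "treasury": 0.0067}


def cosine(p, q):
    """Cosine similarity between two sparse word-probability vectors."""
    # Only words present in both topics contribute to the dot product.
    dot = sum(p[w] * q[w] for w in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q)


same_event = cosine(T5_3, T10_1)   # both labeled South China Sea issue
different = cosine(T5_3, T7_4)     # military topic vs. stock market topic
print(same_event > different)
```

The two South China Sea topics share six of their ten words, while the military and stock market topics share none, so a purely lexical measure already separates them. A measure like this cannot relate words that differ in form but are close in meaning, which is the gap an embedding-based representation is meant to close.
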
Parameter | Value
Activation function | ReLU
Dropout | 0.6
Batch size | 15
Filter sliding-window size h | 3, 4, 5
Number of training iterations | 50

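
The settings above (ReLU activation, filter window sizes 3, 4, and 5) match the shape of a convolutional text classifier with max-over-time pooling. The following pure-Python forward pass is a rough sketch of that feature extraction step under stated assumptions: one filter per window size, an embedding dimension of 8, ten rows per topic matrix, and random values in place of trained weights and real word vectors. None of these are given in the source, and training (where dropout, batch size, and iteration count apply) is not shown.

```python
import random

WINDOW_SIZES = (3, 4, 5)  # filter sliding-window sizes h, from the table
DROPOUT = 0.6             # from the table; used only during training
BATCH_SIZE = 15           # from the table; used only during training
ITERATIONS = 50           # from the table; used only during training


def relu(x):
    return x if x > 0.0 else 0.0


def conv_max_pool(matrix, filt, bias=0.0):
    """Slide one h x d filter over an n x d matrix, apply ReLU to each
    window response, and return the maximum (max-over-time pooling)."""
    h = len(filt)
    n = len(matrix)
    features = []
    for i in range(n - h + 1):
        total = bias
        for row, filt_row in zip(matrix[i:i + h], filt):
            total += sum(x * w for x, w in zip(row, filt_row))
        features.append(relu(total))
    return max(features)


def extract_features(matrix, filters):
    """Return one pooled feature per filter, concatenated into a list."""
    return [conv_max_pool(matrix, f) for f in filters]


rng = random.Random(0)
n_words, dim = 10, 8  # both values are assumptions for this sketch
# Random stand-ins for a topic-feature word matrix and for learned filters.
topic_matrix = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_words)]
filters = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(h)]
           for h in WINDOW_SIZES]
features = extract_features(topic_matrix, filters)
print(len(features))
```

With an input of n rows, a filter of height h produces n - h + 1 window responses before pooling, so the three window sizes see 8, 7, and 6 positions of a ten-row matrix. In a full model the pooled features would feed a classification layer.
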
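Test set split | Model | 4-class accuracy | 3-class accuracy | 2-class accuracy
Random split | Word2Vec | 60.67% | 68.89% | 83.33%
Random split | SVM-LDA | 57.98% | 63.54% | 82.76%
Random split | Topic2Vec | 73.33% | 82.22% | 95.00%
Split by time | Word2Vec | 53.33% | 54.55% | 85.71%
Split by time | SVM-LDA | 56.79% | 64.23% | 83.47%
Split by time | Topic2Vec | 66.67% | 72.72% | 100.00%
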
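
As a quick consistency check on the improvement of about 10% stated in the abstract, the short script below computes the percentage-point gain of Topic2Vec over the stronger of the two baselines for every task and split. The numbers are copied from the table above; the dictionary layout and the `gains` helper are assumptions made for this illustration.

```python
# Accuracies (%) copied from the table above: (4-class, 3-class, 2-class).
results = {
    "random": {
        "Word2Vec": (60.67, 68.89, 83.33),
        "SVM-LDA": (57.98, 63.54, 82.76),
        "Topic2Vec": (73.33, 82.22, 95.00),
    },
    "by_time": {
        "Word2Vec": (53.33, 54.55, 85.71),
        "SVM-LDA": (56.79, 64.23, 83.47),
        "Topic2Vec": (66.67, 72.72, 100.00),
    },
}


def gains(split):
    """Percentage-point gain of Topic2Vec over the best baseline per task."""
    rows = results[split]
    best_baseline = [max(a, b) for a, b in zip(rows["Word2Vec"], rows["SVM-LDA"])]
    return [round(t - b, 2) for t, b in zip(rows["Topic2Vec"], best_baseline)]


for split in results:
    print(split, gains(split))
```

The gains fall between roughly 8.5 and 14.3 percentage points across the six settings, which is in line with the figure quoted in the abstract.
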