Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (4): 1-14    DOI: 10.11925/infotech.2096-3467.2019.0511
Research on Deep Learning Based Topic Representation of Hot Events
Yu Chuanming1, Yuan Sai2, Zhu Xingyu1, Lin Hongjun1, Zhang Puliang1, An Lu3
1 School of Information and Security Engineering, Zhongnan University of Economics and Law, Wuhan 430073, China
2 School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073, China
3 School of Information Management, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This study explores how to learn topic representations for hot events and compares the performance of several topic representation models on topic classification and topic relevance modeling. [Methods] Building on the LDA2Vec method, we propose W-LDA2Vec, a topic representation learning model. The model jointly trains the initial document and word vectors and then predicts the context vectors of the central words, which yields word representations that carry topic information and topic representations that carry context information. [Results] In the hot-event topic classification task, our model achieved the highest F1 value, 0.893, which is 0.314, 0.057, 0.022 and 0.013 higher than those of the four baseline models LDA, Word2Vec, TEWV and Doc2Vec, respectively. In the hot-event topic relevance modeling task with 10 topics, our model achieved a correlation score of 0.4625, which is 0.0678 higher than that of the LDA model. [Limitations] The experimental corpus is limited to Chinese and English. [Conclusions] By embedding topic information in word and document representations, our model effectively improves the performance of topic classification and relevance modeling.

Key words: Knowledge Representation; Topic Representation; Topic Model; Hot Events; Deep Learning
Received: 14 May 2019      Published: 01 June 2020
ZTFLH:  TP393  
Corresponding Authors: Yu Chuanming     E-mail: yucm@zuel.edu.cn

Cite this article:

Yu Chuanming, Yuan Sai, Zhu Xingyu, Lin Hongjun, Zhang Puliang, An Lu. Research on Deep Learning Based Topic Representation of Hot Events. Data Analysis and Knowledge Discovery, 2020, 4(4): 1-14.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0511     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I4/1

The Framework of Topic Representation for Hot Events
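The Methods statement in the abstract builds on LDA2Vec. As a rough illustration only, the sketch below shows how LDA2Vec composes a context vector: a document's topic weights are turned into proportions with a softmax, the proportions mix the topic vectors into a document vector, and that document vector is added to the pivot word vector. The function names and the toy numbers are hypothetical, and the specific changes that W-LDA2Vec makes to this scheme are not described on this page.

```python
import math


def softmax(weights):
    """Turn unnormalized topic weights into proportions that sum to 1."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def document_vector(doc_weights, topic_vectors):
    """Mix topic vectors using the document's topic proportions."""
    proportions = softmax(doc_weights)
    dim = len(topic_vectors[0])
    return [
        sum(p * topic[i] for p, topic in zip(proportions, topic_vectors))
        for i in range(dim)
    ]


def context_vector(word_vector, doc_weights, topic_vectors):
    """LDA2Vec-style context vector: pivot word vector plus document vector."""
    doc_vec = document_vector(doc_weights, topic_vectors)
    return [w + d for w, d in zip(word_vector, doc_vec)]


# Toy values (hypothetical): two topics in a 2-dimensional embedding space.
topic_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_weights = [0.0, 0.0]   # equal weights give equal topic proportions
pivot_word = [0.2, -0.1]

ctx = context_vector(pivot_word, doc_weights, topic_vectors)
print(ctx)
```

In LDA2Vec the context vector is then used to predict nearby words with a skip-gram negative-sampling objective, so word, topic, and document parameters are learned together.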
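Event No. | News event | Chinese corpus: total items | Chinese corpus: avg. words | English corpus: total items | English corpus: avg. words
1 | Brexit | 10,000 | 90 | 10,000 | 168
2 | North Korean nuclear issue | 10,000 | 80 | 10,000 | 169
3 | Belt and Road Initiative | 10,000 | 92 | 10,000 | 166
4 | China-US trade war | 53,880 | 91 | 60,855 | 175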
Experimental Data Sets
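Algorithm / model | Parameter | Value
Word2Vec | Word vector dimension | 100-400
 | Model | Skip-gram
 | Minimum count | 10
 | Iterations | 10
Doc2Vec | Feature dimension | 100-400
 | Model | PV-DM and PV-DBOW
 | Minimum count | 10
 | Epochs | 10
LDA | Number of topics | 100-400
TEWV | Number of topics | 10, 20, 30
 | Word vector dimension | 100-400
 | Sliding window | 15
 | Training iterations | 15
W-LDA2Vec | Number of topics | 3, 10, 20, 30
 | alpha | 0.9
 | Word vector dimension | 100-400
 | Model | Skip-gram
 | Negative samples | 15
 | Minimum count | 20
 | Maximum count | 12,000
 | Sliding window | 5
 | Epochs | 100
SVM classifier | Regularization | L1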
Parameter Configurations of the Comparative Experiments
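For readers who want to reproduce the sweep, the W-LDA2Vec settings in the table above can be written down as a small configuration grid. This is a sketch under stated assumptions: the class and field names are hypothetical, and the step of 50 across the 100-400 dimension range is a guess (the page gives only the range).

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class WLda2VecConfig:
    """One W-LDA2Vec run; field names are illustrative, values come from the table."""
    n_topics: int
    embedding_dim: int
    alpha: float = 0.9
    negative_samples: int = 15
    min_count: int = 20
    max_count: int = 12_000
    window: int = 5
    epochs: int = 100


TOPIC_COUNTS = (3, 10, 20, 30)
# Assumption: the 100-400 range is swept in steps of 50.
EMBEDDING_DIMS = tuple(range(100, 401, 50))


def build_grid():
    """Enumerate every (topic count, dimension) combination as a config."""
    return [
        WLda2VecConfig(n_topics=k, embedding_dim=d)
        for k, d in product(TOPIC_COUNTS, EMBEDDING_DIMS)
    ]


grid = build_grid()
print(len(grid), grid[0])
```

Freezing the dataclass keeps each configuration hashable, so it can serve as a dictionary key when collecting per-run scores.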
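W-LDA2Vec model (extended) parameter | Value
Maximum text length | 500
Minimum comments per author | 5
Word vector dimension | 256
Number of news topics | 40
Number of author topics | Chinese: 6; English: 9
Batch size | Chinese: 6,144; English: 4,096
Epochs | 120
Negative samples | 15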
Parameter Configurations of the Interpretability Experiments
Variations of F1 with Dimensions for Different Models
Model (number of topics, dimension) | Precision | Recall | F1
LDA (/, 250) | 0.575 | 0.583 | 0.579
Word2Vec (/, 350) | 0.893 | 0.786 | 0.836
TEWV (10, 400) | 0.892 | 0.851 | 0.871
Doc2Vec (/, 250) | 0.907 | 0.854 | 0.880
W-LDA2Vec (3, 250) | 0.926 | 0.861 | 0.893
Best Performances of Different Models
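The F1 column above can be cross-checked against the precision and recall columns, and the margins quoted in the abstract can be recomputed from it. The sketch below assumes F1 is the usual harmonic mean of the reported precision and recall; the page does not say whether the scores are averaged across classes, so small third-decimal differences are expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
REPORTED = {
    "LDA": (0.575, 0.583, 0.579),
    "Word2Vec": (0.893, 0.786, 0.836),
    "TEWV": (0.892, 0.851, 0.871),
    "Doc2Vec": (0.907, 0.854, 0.880),
    "W-LDA2Vec": (0.926, 0.861, 0.893),
}


def margins_over_baselines(best="W-LDA2Vec"):
    """Difference between the best model's reported F1 and each baseline's."""
    best_f1 = REPORTED[best][2]
    return {
        model: round(best_f1 - f1, 3)
        for model, (_, _, f1) in REPORTED.items()
        if model != best
    }


for model, (p, r, reported) in REPORTED.items():
    print(f"{model:10s} recomputed={f1_score(p, r):.3f} reported={reported:.3f}")
print(margins_over_baselines())
```

All recomputed values agree with the reported ones to within 0.001. The W-LDA2Vec row differs in the third decimal when recomputed from the rounded precision and recall, which is plausibly a rounding effect.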
Document Distribution of Five Different Models
Number of topics | LDA | TEWV | W-LDA2Vec
10 | 0.3947 | 0.4975 | 0.4625
20 | 0.3894 | 0.4263 | 0.4223
30 | 0.3828 | 0.4549 | 0.4215
Comparison of Three Topic Representation Models in Topic Relevance Tasks
Query | Word2Vec result (similarity) | TEWV result (similarity) | W-LDA2Vec result (similarity)
union+european-exit | eu (1.02) | brussels (1.01) | britain (0.99)
trump+twitter-obama | facebook (0.93) | facebook (0.89) | facebook (0.98)
russia+japan-usa | allies (1.03) | countries (0.99) | china (0.97)
north+test | testing (0.81) | tests (0.52) | korea (0.78)
weapons+missile | missiles (0.81) | missiles (0.62) | nuclear (0.76)
Comparison of Word Vectors from Three Models in Word Analogy Tasks
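The queries above combine word vectors with `+` and `-`. One common way to answer such a query is the additive scheme often called 3CosAdd: score each candidate by the sum of its cosine similarities to the added words minus its similarities to the subtracted words. A sum of cosines can exceed 1, which would be consistent with values such as 1.02 in the table, but the page does not state which scoring function the authors used, so treat this as an assumption. The toy embeddings and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def parse_query(query):
    """Split a query such as 'a+b-c' into (['a', 'b'], ['c'])."""
    positives, negatives = [], []
    sign, token = "+", ""
    for ch in query + "+":          # trailing '+' flushes the last token
        if ch in "+-":
            if token:
                (positives if sign == "+" else negatives).append(token)
            sign, token = ch, ""
        else:
            token += ch
    return positives, negatives


def answer(query, embeddings):
    """Return the best-scoring word outside the query and its additive score."""
    positives, negatives = parse_query(query)
    best_word, best_score = None, float("-inf")
    for word, vec in embeddings.items():
        if word in positives or word in negatives:
            continue
        score = sum(cosine(vec, embeddings[p]) for p in positives)
        score -= sum(cosine(vec, embeddings[n]) for n in negatives)
        if score > best_score:
            best_word, best_score = word, score
    return best_word, best_score


# Toy 3-dimensional embeddings (hypothetical values, not from the paper).
TOY = {
    "north": [1.0, 0.0, 0.0],
    "test": [0.0, 1.0, 0.0],
    "korea": [0.7, 0.7, 0.1],
    "trade": [0.0, 0.0, 1.0],
}

word, score = answer("north+test", TOY)
print(word, round(score, 3))
```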
Topic-Word Distributions of the Chinese Topic of No. 27
Topic-Word Distributions of the English Topic of No. 13
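No. | Topic category | No. | Topic category | No. | Topic category | No. | Topic category
1 | International public opinion | 11 | Domestic agricultural and sideline products | 21 | China's macroeconomy | 31 | Manufacturing and consumption
2 | Assessment of the trade war | 12 | US automobile manufacturing | 22 | Types of economic indicators | 32 | China's tariff reductions
3 | China high-level forum | 13 | China-US negotiations | 23 | Global economic assessment | 33 | Agricultural supply and demand
4 | International forum | 14 | Steel and aluminum tariffs on Canada, Mexico and the EU | 24 | Domestic core technologies | 34 | Responses to steel and aluminum tariffs
5 | Impact of lira depreciation | 15 | Falling international oil prices | 25 | Company founders | 35 | Spillover to the Korean and Japanese economies
6 | Attendance at related meetings | 16 | Switching source countries for agricultural purchases | 26 | WTO complaints | 36 | US manufacturing and retail
7 | Views on China-US relations | 17 | UK-EU customs union | 27 | Stance of the American Chamber of Commerce | 37 | China's total economic output
8 | US oil industry | 18 | Trade war data | 28 | Anti-cancer drugs and free trade zones | 38 | National Bureau of Statistics announcements
9 | Inclusion of A-shares in MSCI | 19 | International stock markets | 29 | US agricultural futures | 39 | Agricultural import and export value
10 | Foreign exchange market | 20 | China's fiscal policy | 30 | Ministry of Commerce statements | 40 | China's stock market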
Summaries of Chinese Topics
No. | Topic category | No. | Topic category | No. | Topic category | No. | Topic category
1 | Iranian natural gas | 11 | Investor | 21 | Port shipping | 31 | Reverse globalization
2 | Civil litigation | 12 | Culture research | 22 | Technical complaint | 32 | Chemical industry
3 | Renewable energy | 13 | Cultural exchange | 23 | Cross-border ecommerce | 33 | Financial market
4 | Drug regulation | 14 | Oil production | 24 | North-South talks | 34 | Research institute
5 | Diplomatic ambassador | 15 | Stock market volatility | 25 | Film characters | 35 | Economist
6 | Dialogue negotiation | 16 | Monetary policy | 26 | Digital currency | 36 | Global stock market
7 | Regional economy | 17 | Canadian and Mexican agriculture | 27 | Military defense | 37 | Communication sanctions
8 | North Korea denuclearization | 18 | WTO complaint | 28 | Middle Eastern religion | 38 | House vote
9 | Military conflict | 19 | American agriculture | 29 | Economic data | 39 | Hillary lost
10 | Espionage case | 20 | Historical economic index | 30 | Trade negotiation | 40 | Trend forecast
Summaries of English Topics