Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (2): 1-13    DOI: 10.11925/infotech.2096-3467.2020.1025
Generating News Clues with Biterm Topic Model
Zhao Tianzi1, Duan Liang1, Yue Kun1, Qiao Shaojie2,3, Ma Zijuan1
1School of Information Science & Engineering, Yunnan University, Kunming 650500, China
2School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
3Sichuan Key Laboratory of Software Automatic Generation and Intelligent Service, Chengdu University of Information Technology, Chengdu 610225, China
Abstract  

[Objective] This paper modifies the biterm topic model to improve the quality of extracted news clues. [Methods] We constructed the News-IBTM model, which extends IBTM (Incremental Biterm Topic Model) with a dynamic sliding window that narrows the scope from which biterms (word pairs) are extracted. We then used this model to extract topics and topic-word distributions from news and to infer the document-topic distributions. Finally, we used the JS (Jensen-Shannon) divergence to measure the difference between document-topic distributions and generate news clues. [Results] We evaluated News-IBTM on news from People's Daily Online and Weibo. The proposed model outperformed existing models in perplexity, accuracy and efficiency. [Limitations] The accuracy of the News-IBTM algorithm needs further improvement. [Conclusions] The proposed method can effectively extract high-quality news topics and clues.
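
The JS divergence step can be illustrated with a minimal sketch. It assumes each document-topic distribution is a plain list of probabilities over the same K topics; the function names and the example vectors are hypothetical, and the rule the paper uses to turn divergence scores into news clues is not given here, so the sketch only computes the divergence itself.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions.

    With base-2 logarithms the value lies in [0, 1]: 0 for identical
    distributions and 1 for distributions with disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical document-topic distributions over K = 3 topics.
theta_a = [0.7, 0.2, 0.1]
theta_b = [0.6, 0.3, 0.1]
theta_c = [0.1, 0.1, 0.8]

# A smaller value means the two documents have more similar topic mixtures.
print(js_divergence(theta_a, theta_b))
print(js_divergence(theta_a, theta_c))
```

Unlike the KL divergence, the JS divergence is symmetric and always finite, which makes it convenient for comparing pairs of documents in either order.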

Key words: News Events; News Clues Generation; Topic Model; Jensen-Shannon Divergence
Received: 20 October 2020; Published: 11 March 2021
ZTFLH: TP391
Fund: National Natural Science Foundation of China (U1802271); Research Foundation of Educational Department of Yunnan Province (2020Y0010); China Postdoctoral Science Foundation (2020M673310)
Corresponding Author: Duan Liang, ORCID: 0000-0001-9473-2533, E-mail: duanl@ynu.edu.cn

Cite this article:

Zhao Tianzi, Duan Liang, Yue Kun, Qiao Shaojie, Ma Zijuan. Generating News Clues with Biterm Topic Model. Data Analysis and Knowledge Discovery, 2021, 5(2): 1-13.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.1025     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I2/1

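Symbol   Meaning
θ        Topic distribution
φ        Topic-word distribution
α        Prior parameter of the topic distribution
β        Prior parameter of the topic-word distribution
z_k      The k-th topic
d_i      The i-th news sub-event
c_j      The j-th news clue
w_s      The s-th word
t        The t-th time slice
T        News publication time
K        Total number of topics
N_W      Total number of words in the news data
N_B      Total number of biterms
N_D      Total number of news documents
Notations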
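
The biterm count N_B depends on how widely word pairs are drawn from each document. The sketch below extracts biterms with a fixed-size sliding window; this is an assumption made for illustration, since the rule News-IBTM uses to adjust its dynamic window is not described on this page. The function name `extract_biterms` and the `window` parameter are hypothetical.

```python
def extract_biterms(tokens, window=3):
    """Return unordered word pairs that co-occur within `window` consecutive tokens.

    Each pair is stored as a sorted tuple so that (a, b) and (b, a)
    count as the same biterm.
    """
    if window < 2:
        raise ValueError("window must cover at least two tokens")
    biterms = []
    for i, word in enumerate(tokens):
        # Pair the current token with the tokens that follow it inside the window.
        for j in range(i + 1, min(i + window, len(tokens))):
            biterms.append(tuple(sorted((word, tokens[j]))))
    return biterms


# Hypothetical tokenized document.
doc = ["flood", "rescue", "team", "arrives"]

# A narrow window yields fewer biterms than pairing every word with every other word.
print(len(extract_biterms(doc, window=3)))
print(len(extract_biterms(doc, window=len(doc))))
```

Narrowing the window reduces N_B for long documents, which is the effect the abstract attributes to the sliding window.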
Graphical Representation of News-IBTM
Dataset                 Test type    Documents   Length range (characters)   Average length (characters)
People's Daily Online   Long text    8,772       200-4,000                   1,200
Weibo                   Short text   32,502      20-500                      185
Datasets
Perplexity by Varying the Number of Topics
Perplexity by Varying the Number of News
Accuracy by Varying the Number of Topics
Accuracy by Varying the Number of News
Accuracy of News-IBTM with Different Time Slices
Execution Time of News-IBTM by Varying the Number of News
Execution Time by Varying the Number of Topics
Execution Time of News-IBTM in a Single Time Slice
Visualization of News Clues