Data Analysis and Knowledge Discovery, 2021, Vol. 5, Issue 2: 1-13. DOI: 10.11925/infotech.2096-3467.2020.1025
Generating News Clues with Biterm Topic Model
Zhao Tianzi [1], Duan Liang [1], Yue Kun [1], Qiao Shaojie [2,3], Ma Zijuan [1]
[1] School of Information Science & Engineering, Yunnan University, Kunming 650500, China
[2] School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
[3] Sichuan Key Laboratory of Software Automatic Generation and Intelligent Service, Chengdu University of Information Technology, Chengdu 610225, China
[Objective] This paper modifies the topic model to improve the quality of extracted news clues. [Methods] We constructed News-IBTM, a model based on the Incremental Biterm Topic Model (IBTM) with a dynamic sliding window that narrows the scope from which biterms (unordered word pairs) are extracted. We then used this model to extract topics and topic-word distributions from news and to infer the document-topic distributions. Finally, we used the Jensen-Shannon (JS) divergence to measure the difference between document-topic distributions and generate news clues. [Results] We evaluated News-IBTM on news from People’s Daily Online and Weibo. The proposed model outperformed existing ones in perplexity, accuracy, and efficiency. [Limitations] The accuracy of the News-IBTM algorithm needs further improvement. [Conclusions] The proposed method can effectively extract high-quality news topics and clues.
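The two computational steps named in the abstract, biterm extraction within a sliding window and comparison of document-topic distributions by JS divergence, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the fixed `window` size (the abstract describes a dynamic window but does not give its rule), and the example tokens are assumptions made for illustration only.

```python
import math


def extract_biterms(tokens, window=3):
    """Return unordered word pairs (biterms) that co-occur within `window` tokens.

    Assumption: a fixed window size; the dynamic rule used by News-IBTM
    is not specified in the abstract.
    """
    biterms = []
    for i in range(len(tokens)):
        # Pair token i with the tokens that follow it inside the window.
        for j in range(i + 1, min(i + window, len(tokens))):
            biterms.append(tuple(sorted((tokens[i], tokens[j]))))
    return biterms


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    With base-2 logarithms the value lies in [0, 1]; 0 means identical.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical tokenized headline and document-topic distributions.
tokens = ["flood", "rescue", "team", "arrives"]
pairs = extract_biterms(tokens, window=3)

theta_a = [0.7, 0.2, 0.1]
theta_b = [0.6, 0.3, 0.1]
theta_c = [0.1, 0.1, 0.8]
close = js_divergence(theta_a, theta_b)
far = js_divergence(theta_a, theta_c)
```

In this sketch a smaller divergence (`close`) indicates that two documents share a similar topic mixture, while a larger one (`far`) indicates different topics. How the divergence is thresholded to link documents into a news clue is not stated in the abstract and is left open here.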

Key words: News Events; News Clues Generation; Topic Model; Jensen-Shannon Divergence
Received: 20 October 2020; Published: 11 March 2021
Chinese Library Classification (ZTFLH): TP391
Fund: National Natural Science Foundation of China (U1802271); Research Foundation of Educational Department of Yunnan Province (2020Y0010); China Postdoctoral Science Foundation (2020M673310)
Corresponding Author: Duan Liang, ORCID: 0000-0001-9473-2533

Zhao Tianzi, Duan Liang, Yue Kun, Qiao Shaojie, Ma Zijuan. Generating News Clues with Biterm Topic Model. Data Analysis and Knowledge Discovery, 2021, 5(2): 1-13.

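Symbol    Meaning
θ         Topic distribution
φ         Topic-word distribution
α         Prior parameter of the topic distribution
β         Prior parameter of the topic-word distribution
z_k       The k-th topic
d_i       The i-th news sub-event
c_j       The j-th news clue
w_s       The s-th word
t         The t-th time slice
T         News publication time
K         Total number of topics
N_W       Total number of words in the news data
N_B       Total number of biterms
N_D       Total number of news documents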
Graphical Representation of News-IBTM
Dataset                  Test type     Documents    Length range (characters)    Average length (characters)
People's Daily Online    Long text     8,772        200~4,000                    1,200
Weibo                    Short text    32,502       20~500                       185
Perplexity by Varying the Number of Topics
Perplexity by Varying the Number of News
Accuracy by Varying the Number of Topics
Accuracy by Varying the Number of News
Accuracy of News-IBTM with Different Time Slices
Execution Time of News-IBTM by Varying the Number of News
Execution Time by Varying the Number of Topics
Execution Time of News-IBTM in a Single Time Slice
Visualization of News Clues
