Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (10): 103-113    DOI: 10.11925/infotech.2096-3467.2021.1170
Detecting Topics of Online News with Shared Nearest Neighbours and Markov Clustering
Wu Zhenfeng1, Lan Tian1, Wang Mengmeng2, Pu Mo1, Zhang Yu1, Liu Zhihui1, He Yanqing1
1Institute of Scientific and Technical Information of China, Beijing 100038, China
2School of Economics, Renmin University of China, Beijing 100872, China
Abstract  

[Objective] This paper proposes a topic detection method for online news that makes more effective use of the internal structure of the data. [Methods] First, we measured the association strength between news items by the number and rank of their shared nearest neighbours. Second, we built a shared-nearest-neighbour graph from these strengths, which better exploits the internal structure of the data. Finally, we detected topics through dimension reduction, selection of the optimal number of topics, Markov clustering, and automatic topic description based on closeness centrality. [Results] On two online news data sets, the proposed method reached ARI values of 0.86 and 0.97, whereas the LDA, K-means, and GMM models all scored below 0.75 and 0.90, respectively. [Limitations] The method still needs to be evaluated on data sets from other fields and on multilingual data. [Conclusions] The proposed method effectively detects topics in online news and offers a new direction for future research.
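
The core of the pipeline described in the abstract, a shared-nearest-neighbour graph followed by Markov clustering, can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: edge weights count shared neighbours only (the rank weighting is omitted), distances are Euclidean, the inflation value of 2.0 is a commonly used default, and the names snn_graph, markov_cluster, and connected_labels are hypothetical.

```python
import numpy as np


def snn_graph(X, k):
    """Weight each pair (i, j) by how many of their k nearest neighbours coincide.

    Simplification (assumption): only the number of shared neighbours is used;
    the rank-based weighting mentioned in the abstract is left out.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances between all rows of X
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbour
    knn = [set(np.argsort(d[i])[:k].tolist()) for i in range(n)]
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = len(knn[i] & knn[j])
    return W


def connected_labels(M, eps=1e-6):
    """Label nodes by connected component of the non-zero pattern of M."""
    adj = M > eps
    sym = adj | adj.T
    labels = [-1] * len(M)
    current = 0
    for start in range(len(M)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(sym[u]):
                if labels[v] == -1:
                    labels[v] = current
                    stack.append(int(v))
        current += 1
    return labels


def markov_cluster(W, inflation=2.0, max_iter=100, tol=1e-8):
    """Basic Markov clustering: alternate expansion and inflation until stable."""
    M = W + np.eye(len(W))                 # self-loops keep every column non-zero
    M = M / M.sum(axis=0, keepdims=True)   # column-stochastic matrix
    for _ in range(max_iter):
        prev = M
        M = M @ M                          # expansion: two-step random walks
        M = M ** inflation                 # inflation: favour strong transitions
        M = M / M.sum(axis=0, keepdims=True)
        if np.abs(M - prev).max() < tol:
            break
    # Nodes that still exchange flow in the limit matrix form one cluster
    return connected_labels(M)


# Toy input: two well-separated groups of four points each
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
labels = markov_cluster(snn_graph(X, k=3))
print(labels)
```

In a realistic setting X would hold dimension-reduced document vectors rather than 2-D points, and a sparse matrix representation would likely be needed for larger collections.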

Key words: Shared Nearest Neighbour; Markov Clustering; Online News; Topic Detection
Received: 15 October 2021      Published: 16 November 2022
CLC number (ZTFLH): TP391; G202
Fund: National Key R&D Program of China (2019YFA0707201); Key Work Program of Institute of Scientific and Technical Information of China (ZD2021-17); Key Work Program of Institute of Scientific and Technical Information of China (ZD2022-01)
Corresponding Author: He Yanqing, ORCID: 0000-0002-8791-1581, E-mail: heyq@istic.ac.cn

Cite this article:

Wu Zhenfeng, Lan Tian, Wang Mengmeng, Pu Mo, Zhang Yu, Liu Zhihui, He Yanqing. Detecting Topics of Online News with Shared Nearest Neighbours and Markov Clustering. Data Analysis and Knowledge Discovery, 2022, 6(10): 103-113.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.1170     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I10/103

The Flow of Topic Detection Method
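Data set       Topic no.   Topic                                            Number of news items
General news   1           Huawei sanctioned                                185
               2           Mercedes-Benz oil leak                           31
               3           Boeing 737 crash                                 42
               4           Notre-Dame de Paris fire                         44
               5           Visual China copyright controversy               100
               6           Sri Lanka serial bombings                        93
               7           Conference on Dialogue of Asian Civilizations    176
               8           Brexit                                           57
               9           Zhai Tianlin academic fraud                      102
               10          China-US trade war                               232
Sports news    1           NBA Rockets                                      150
               2           Chinese Super League: Shide vs. Shenhua          37
               3           Chinese Super League: Luneng vs. Yatai           35
               4           Women's World Cup draw                           79
               5           Women's football matches                         59
               6           Snooker                                          65
               7           Real Madrid vs. Valencia                         27
Topics and Number of News Items in the Two News Data Sets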
The Variation of Singular Value with Dimensionality
The KL Index Value
Perplexity Index Value
The Impact of Dimension Reduction on Topic Detection Results
The Influence of the Number of Shared Nearest Neighbors on Topic Detection Results
Data set       Metric   snnMCL   LDA    K-Means   GMM
General news   ARI      0.86     0.70   0.74      0.72
               NMI      0.90     0.77   0.83      0.81
               Purity   0.95     0.78   0.89      0.86
               CA       0.95     0.77   0.85      0.85
Sports news    ARI      0.97     0.76   0.87      0.69
               NMI      0.95     0.71   0.86      0.76
               Purity   0.98     0.81   0.87      0.81
               CA       0.98     0.81   0.86      0.76
Results of Four Topic Detection Methods
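
Two of the metrics reported in the table above, ARI and Purity, can be computed with a short standard-library sketch. The function names and the toy label lists are hypothetical and only illustrate the definitions; this is not the evaluation code behind the reported figures.

```python
from collections import Counter
from math import comb


def purity(true_labels, pred_labels):
    """Share of items that carry the majority true label of their predicted cluster."""
    by_cluster = {}
    for t, p in zip(true_labels, pred_labels):
        by_cluster.setdefault(p, Counter())[t] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(true_labels)


def adjusted_rand_index(true_labels, pred_labels):
    """Adjusted Rand index computed from the contingency table of two partitions."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))
    sum_cells = sum(comb(v, 2) for v in cells.values())
    sum_true = sum(comb(v, 2) for v in Counter(true_labels).values())
    sum_pred = sum(comb(v, 2) for v in Counter(pred_labels).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy example (hypothetical labels): one item of topic 0 lands in the wrong cluster
truth = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]
print(purity(truth, predicted), adjusted_rand_index(truth, predicted))
```

Both functions are invariant to how cluster ids are numbered, so predicted labels need not be aligned with the true topic numbers beforehand.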
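Data set       Topic no.   Actual topic description                         Automatic topic description
General news   1           Huawei sanctioned                                Huawei, Battle of Shangganling, chip, kick-off, uses Android
               2           Mercedes-Benz oil leak                           Mercedes-Benz, service fee, oil leak, rights protection, car owner
               3           Boeing 737 crash                                 Boeing, MAX, software, system failure, sensor
               4           Notre-Dame de Paris fire                         Notre-Dame de Paris, blaze, restoration, fire situation, cancellation
               5           Visual China copyright controversy               Visual, copyright, China, apology, profiteering
               6           Sri Lanka serial bombings                        Sri Lanka, explosion incident, citizens, bombing case, out of contact
               7           Conference on Dialogue of Asian Civilizations    civilization, Asia, don't miss, night reading, dialogue
               8           Brexit                                           acquisition, SoftBank, ARM, Brexit, Theresa May
               9           Zhai Tianlin academic fraud                      Zhai Tianlin, academic, incident, misconduct, entertainment industry
               10          China-US trade war                               courtesy before force, People's Daily, commentator, gentleman, countermeasures
Sports news    1           NBA Rockets                                      playoffs, Jazz, Rockets, pictures and text, Yao Ming
               2           Chinese Super League: Shide vs. Shenhua          Shenhua, Shide, 吉梅, Wang Peng, a failed imitation (idiom)
               3           Chinese Super League: Luneng vs. Yatai           referee, penalty decision, 科维奇, favourite pupil, foreign
               4           Women's World Cup draw                           women's football, World Cup, schedule, Denmark, draw
               5           Women's football matches                         women's football, combined team, pictures and text, star, Zhou Gaoping
               6           Snooker                                          多特, upset, eliminated, 奥沙利, 小晖
               7           Real Madrid vs. Valencia                         Real Madrid, Valencia, pictures and text, heartthrob, celebration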
Automatic Topic Description
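
One plausible reading of topic description based on closeness centrality is to link the words that co-occur within a topic's documents and keep the most central ones. The sketch below follows that reading as an assumption; the paper's actual graph construction is not given on this page, and describe_topic, closeness, and the toy documents are hypothetical.

```python
from collections import deque
from itertools import combinations


def closeness(graph):
    """Closeness centrality per node: reachable nodes divided by total BFS distance."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


def describe_topic(docs, top_n=5):
    """Return the top_n most central words of a word co-occurrence graph.

    docs is a list of token lists, one per news item in the topic. Two words
    are linked when they appear in the same item (an assumed construction).
    """
    graph = {}
    for words in docs:
        unique = set(words)
        for w in unique:
            graph.setdefault(w, set())
        for a, b in combinations(unique, 2):
            graph[a].add(b)
            graph[b].add(a)
    scores = closeness(graph)
    # Sort by descending centrality, breaking ties alphabetically
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Toy topic (hypothetical tokens): "huawei" co-occurs with every other word
toy_docs = [["huawei", "chip"], ["huawei", "android"], ["huawei", "sanction"]]
print(describe_topic(toy_docs, top_n=2))
```

Because the word that co-occurs with every other word sits at distance one from all of them, it receives the highest score and leads the description.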
Topic Scatterplot