Detecting Topics of Online News with Shared Nearest Neighbours and Markov Clustering<sup>*</sup>

Data Analysis and Knowledge Discovery, 2022, Vol. 6, Issue (10): 103-113
https://doi.org/10.11925/infotech.2096-3467.2021.1170

Research Paper

Wu Zhenfeng¹, Lan Tian¹, Wang Mengmeng², Pu Mo¹, Zhang Yu¹, Liu Zhihui¹, He Yanqing¹

¹Institute of Scientific and Technical Information of China, Beijing 100038, China
²School of Economics, Renmin University of China, Beijing 100872, China


Abstract

[Objective] Existing topic detection methods make insufficient use of the intrinsic structure of the data. This paper proposes a topic detection method for online news based on shared nearest neighbours and Markov clustering to address this problem. [Methods] First, we measured the association strength between news items using the number and rank of their shared nearest neighbours, and built a shared nearest neighbour graph from these strengths so that the intrinsic structure of the data is exploited more fully. Then, we improved detection performance with dimension reduction, a decision procedure for the optimal number of topics, Markov clustering, and automatic topic description based on closeness centrality. [Results] On two online news data sets, the proposed method achieved ARI values of 0.86 and 0.97, respectively, whereas the LDA, K-Means, and GMM baselines all scored below 0.75 and 0.90 on the same two data sets. [Limitations] The method has not yet been validated on data sets from other domains or on multilingual data sets. [Conclusions] The proposed method effectively improves topic detection for online news and provides a useful reference for research on key topic detection techniques.

Keywords: Shared Nearest Neighbour; Markov Clustering; Online News; Topic Detection
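To make the two core steps named in the abstract more concrete, the following is a minimal sketch of building a shared nearest neighbour graph and running Markov clustering on it. It is not the authors' implementation. The abstract does not give the exact weighting formula, so the sketch assumes an edge weight equal to the fraction of neighbours two points share and ignores neighbour rank. The toy 2-D points stand in for news vectors, and all function names, the value of `k`, the inflation parameter, and the 1e-6 threshold are hypothetical choices made for illustration.

```python
import math


def knn_sets(points, k):
    """For each point, return the set of its k nearest neighbours plus itself."""
    sets = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        sets.append(set(others[:k]) | {i})
    return sets


def snn_graph(points, k):
    """Assumed weighting: |N(i) & N(j)| / (k + 1); rank is ignored here."""
    nbrs = knn_sets(points, k)
    n = len(points)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            W[i][j] = W[j][i] = len(nbrs[i] & nbrs[j]) / (k + 1)
    return W


def normalize_columns(M):
    """Rescale every column so that it sums to 1."""
    n = len(M)
    sums = [sum(M[i][j] for i in range(n)) for j in range(n)]
    return [[M[i][j] / sums[j] for j in range(n)] for i in range(n)]


def markov_clustering(W, inflation=2.0, iterations=50):
    """Alternate expansion (matrix square) and inflation (elementwise power)."""
    n = len(W)
    # Add self-loops, then turn the weights into column-stochastic flow.
    M = normalize_columns(
        [[W[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    )
    for _ in range(iterations):
        M = [
            [sum(M[i][t] * M[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)
        ]
        M = normalize_columns([[x ** inflation for x in row] for row in M])
    # Nodes that still exchange flow belong together; merge them with union-find.
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(n):
            if M[i][j] > 1e-6:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


# Two well-separated toy groups of five points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
labels = markov_clustering(snn_graph(points, k=3))
print(labels)
```

Because every neighbour set stays inside its own group, the graph has no edges between the two groups, so Markov clustering cannot assign points from different groups to the same cluster. Whether each group ends up as one cluster or several depends on the inflation parameter.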

Received: 2021-10-15; Published: 2022-11-16

CLC Number (ZTFLH): TP391; G202

Funding: National Key R&D Program of China (2019YFA0707201); Key Work Projects of the Institute of Scientific and Technical Information of China (ZD2021-17, ZD2022-01)

Corresponding author: He Yanqing, ORCID: 0000-0002-8791-1581, E-mail: heyq@istic.ac.cn

Cite this article:

Wu Zhenfeng, Lan Tian, Wang Mengmeng, Pu Mo, Zhang Yu, Liu Zhihui, He Yanqing. Detecting Topics of Online News with Shared Nearest Neighbours and Markov Clustering. Data Analysis and Knowledge Discovery, 2022, 6(10): 103-113.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.1170 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I10/103

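Fig.1 Workflow of the topic detection method

Table 1 Topics and their counts in the two online news data sets

Fig.2 Singular values as a function of the number of dimensions

Fig.3 KL index values

Fig.4 Perplexity values

Fig.5 Effect of dimension reduction on topic detection results

Fig.6 Effect of the number of shared nearest neighbours on topic detection results

Table 2 Experimental results of the four topic detection methods

Table 3 Automatic topic descriptions

Fig.7 Scatter plot of topics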
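
The abstract and Table 2 compare methods by ARI, the Adjusted Rand Index, which scores the agreement between a predicted clustering and reference labels: 1 means identical partitions, and values near 0 mean chance-level agreement. The sketch below computes it from the standard pair-counting definition. It is an illustrative helper, not code from the paper; the function name and the toy label lists are hypothetical, and it assumes at least two items.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Pair-counting ARI for two label lists of equal length (n >= 2)."""
    n = len(truth)
    # Pairs of items that fall together in both partitions.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs that fall together in each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(truth).values())
    sum_b = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Same grouping under different label names: perfect agreement.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Every true pair is split by the prediction: worse than chance.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

ARI is invariant to how clusters are named, which is why it suits comparing unsupervised outputs such as those of LDA, K-Means, and GMM against labelled topics.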

