Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (10): 103-113     https://doi.org/10.11925/infotech.2096-3467.2021.1170
Research Paper
Detecting Topics of Online News with Shared Nearest Neighbours and Markov Clustering
Wu Zhenfeng 1, Lan Tian 1, Wang Mengmeng 2, Pu Mo 1, Zhang Yu 1, Liu Zhihui 1, He Yanqing 1
1 Institute of Scientific and Technical Information of China, Beijing 100038, China
2 School of Economics, Renmin University of China, Beijing 100872, China
Abstract

[Objective] Existing topic detection methods make insufficient use of the internal structure of the data. This paper proposes a method for detecting topics in online news based on shared nearest neighbours and Markov clustering. [Methods] The association strength between news items is characterized by combining the number of their shared nearest neighbours with the ranks of those neighbours. A shared nearest neighbour graph is built from these strengths so that the internal structure of the data is better exploited. Dimensionality reduction, selection of the optimal number of topics, Markov clustering, and automatic topic description based on closeness centrality are then applied to improve detection. [Results] On two online news datasets, the proposed method achieved the highest ARI values, 0.86 and 0.97. The compared methods (LDA, K-Means, and GMM) all scored below 0.75 on the first dataset and below 0.90 on the second. [Limitations] The method has not yet been validated on datasets from other domains or on multilingual datasets. [Conclusions] The proposed method effectively improves topic detection for online news and offers a useful reference for research on key topic detection techniques.

Keywords: Shared Nearest Neighbour; Markov Clustering; Online News; Topic Detection
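
The two core steps named in the abstract, building a shared nearest neighbour graph and running Markov clustering on it, can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the abstract says the edge strength combines the number and rank of shared neighbours but gives no formula, so the sketch substitutes a plain Jaccard overlap of neighbour sets. The function names, the neighbourhood size `k`, and the `inflation` value are all assumptions made for the example.

```python
from math import sqrt


def knn_lists(points, k):
    """For each point, return the indices of its k nearest neighbours (Euclidean)."""
    neighbours = []
    for i, p in enumerate(points):
        dists = sorted(
            (sqrt(sum((a - b) ** 2 for a, b in zip(p, q))), j)
            for j, q in enumerate(points) if j != i
        )
        neighbours.append([j for _, j in dists[:k]])
    return neighbours


def snn_graph(neighbours):
    """Edge weight = Jaccard overlap of neighbour sets (assumed stand-in weighting)."""
    n = len(neighbours)
    sets = [set(nb) | {i} for i, nb in enumerate(neighbours)]  # include the point itself
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = len(sets[i] & sets[j])
            if shared:
                w[i][j] = w[j][i] = shared / len(sets[i] | sets[j])
    return w


def normalise_columns(m):
    """Scale every column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        s = sum(m[i][j] for i in range(n))
        if s > 0:
            for i in range(n):
                m[i][j] /= s
    return m


def markov_cluster(weights, inflation=2.0, iterations=50, tol=1e-9):
    """Basic Markov clustering: alternate expansion and inflation until stable."""
    n = len(weights)
    # Add self-loops, then turn the weights into transition probabilities.
    m = normalise_columns(
        [[weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    )
    for _ in range(iterations):
        # Expansion: matrix square, i.e. two-step random walks.
        expanded = [
            [sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        # Inflation: element-wise power, then renormalise each column.
        inflated = normalise_columns([[v ** inflation for v in row] for row in expanded])
        delta = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if delta < tol:
            break
    # Each non-empty row lists the members of the cluster attracted by that row's node.
    clusters = {frozenset(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(sorted(c) for c in clusters if c)


# Two well-separated toy groups standing in for document vectors.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
clusters = markov_cluster(snn_graph(knn_lists(points, k=3)))
print(clusters)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

On real news vectors the graph would normally be connected, and the inflation parameter then controls how finely the random walk splits it into topics.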
Received: 2021-10-15      Published: 2022-11-16
Chinese Library Classification (ZTFLH):  TP391; G202
Funding: National Key R&D Program of China (2019YFA0707201); Key Work Projects of the Institute of Scientific and Technical Information of China (ZD2021-17, ZD2022-01)
Corresponding author: He Yanqing, ORCID: 0000-0002-8791-1581      E-mail: heyq@istic.ac.cn
Cite this article:
Wu Zhenfeng, Lan Tian, Wang Mengmeng, Pu Mo, Zhang Yu, Liu Zhihui, He Yanqing. Detecting Topics of Online News with Shared Nearest Neighbours and Markov Clustering. Data Analysis and Knowledge Discovery, 2022, 6(10): 103-113.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.1170      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I10/103
Fig.1  Workflow of the topic detection method
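Dataset       No.  Topic                                         News items
General news  1    Huawei sanctions                              185
General news  2    Mercedes-Benz oil leak                        31
General news  3    Boeing 737 crash                              42
General news  4    Notre-Dame de Paris fire                      44
General news  5    Visual China copyright controversy            100
General news  6    Sri Lanka serial bombings                     93
General news  7    Conference on Dialogue of Asian Civilizations 176
General news  8    Brexit                                        57
General news  9    Zhai Tianlin academic fraud                   102
General news  10   China-US trade war                            232
Sports news   1    NBA Rockets                                   150
Sports news   2    Chinese Super League: Shide vs. Shenhua       37
Sports news   3    Chinese Super League: Luneng vs. Yatai        35
Sports news   4    Women's World Cup draw                        79
Sports news   5    Women's football matches                      59
Sports news   6    Snooker                                       65
Sports news   7    Real Madrid vs. Valencia                      27
Table 1  Topics of the two online news datasets and their news counts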
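Fig.2  Singular values versus number of dimensions
Fig.3  KL index values
Fig.4  Perplexity values
Fig.5  Effect of dimensionality reduction on topic detection results
Fig.6  Effect of the number of shared nearest neighbours on topic detection results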
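
Fig.3 reports KL index values used when choosing the number of topics. The page does not spell out the abbreviation. Assuming it refers to the Krzanowski-Lai criterion (Krzanowski and Lai, 1988), the index can be computed from the within-cluster sum of squares W_k obtained for each candidate k, with DIFF(k) = (k-1)^(2/p) W_(k-1) - k^(2/p) W_k and KL(k) = |DIFF(k) / DIFF(k+1)|, where p is the feature dimension. The sketch below is a hedged illustration: the function name `kl_index` and the W_k values are made up for the example and are not taken from the paper.

```python
def kl_index(wss, p):
    """Krzanowski-Lai index for every k whose neighbours k-1 and k+1 are available.

    wss maps a cluster count k to the pooled within-cluster sum of squares W_k;
    p is the number of feature dimensions.
    """
    def diff(k):
        return (k - 1) ** (2 / p) * wss[k - 1] - k ** (2 / p) * wss[k]

    return {
        k: abs(diff(k) / diff(k + 1))
        for k in sorted(wss)
        if k - 1 in wss and k + 1 in wss and diff(k + 1) != 0
    }


# Illustrative (invented) W_k values with a sharp drop at k = 3.
wss = {1: 100.0, 2: 60.0, 3: 10.0, 4: 8.0, 5: 7.0}
scores = kl_index(wss, p=2)
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

The candidate k with the largest index is taken as the suggested number of clusters, which here is the k after which W_k stops falling quickly.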
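Dataset       Metric  snnMCL  LDA   K-Means  GMM
General news  ARI     0.86    0.70  0.74     0.72
General news  NMI     0.90    0.77  0.83     0.81
General news  Purity  0.95    0.78  0.89     0.86
General news  CA      0.95    0.77  0.85     0.85
Sports news   ARI     0.97    0.76  0.87     0.69
Sports news   NMI     0.95    0.71  0.86     0.76
Sports news   Purity  0.98    0.81  0.87     0.81
Sports news   CA      0.98    0.81  0.86     0.76
Table 2  Experimental results of the four topic detection methods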
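
Two of the metrics in Table 2 are simple enough to compute by hand, which helps when reading the scores. The sketch below implements the standard adjusted Rand index (ARI) and purity from their textbook definitions in plain Python. It is an illustration only: the function names and the toy labels are assumptions, and the page does not say which implementation the authors used.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """ARI: pair-counting agreement between two labelings, corrected for chance."""
    n = len(truth)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def purity(truth, pred):
    """Purity: share of items that belong to the majority true topic of their cluster."""
    by_cluster = {}
    for t, p in zip(truth, pred):
        by_cluster.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in by_cluster.values()) / len(truth)


# Toy labels: one item of topic 0 was placed in the wrong cluster.
truth = [0, 0, 0, 1, 1, 1]
pred = [0, 0, 1, 1, 1, 1]
print(round(adjusted_rand_index(truth, pred), 3))  # 0.324
print(round(purity(truth, pred), 3))               # 0.833
```

A single misplaced item out of six already pulls ARI down to about 0.32 while purity stays at about 0.83, which is why ARI tends to separate the methods more sharply than purity.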
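Dataset       No.  Actual topic                                   Automatic topic description
General news  1    Huawei sanctions                               Huawei, Battle of Shangganling, chip, curtain-raising, use Android
General news  2    Mercedes-Benz oil leak                         Mercedes-Benz, service fee, oil leak, rights protection, car owner
General news  3    Boeing 737 crash                               Boeing, MAX, software, system failure, sensor
General news  4    Notre-Dame de Paris fire                       Notre-Dame de Paris, blaze, restoration, fire situation, booking cancellation
General news  5    Visual China copyright controversy             Visual, copyright, China, apology, profiteering
General news  6    Sri Lanka serial bombings                      Sri Lanka, explosions, citizens, bombing case, lost contact
General news  7    Conference on Dialogue of Asian Civilizations  civilization, Asia, don't miss, night reading, dialogue
General news  8    Brexit                                         acquisition, SoftBank, ARM, Brexit, Theresa May
General news  9    Zhai Tianlin academic fraud                    Zhai Tianlin, academic, incident, misconduct, entertainment industry
General news  10   China-US trade war                             courtesy before force, People's Daily, commentator, gentleman, countermeasure
Sports news   1    NBA Rockets                                    playoffs, Jazz, Rockets, photo report, Yao Ming
Sports news   2    Chinese Super League: Shide vs. Shenhua        Shenhua, Shide, Jimei (吉梅), Wang Peng, botched imitation (画虎不成反类犬)
Sports news   3    Chinese Super League: Luneng vs. Yatai         referee, penalty call, Kovic (科维奇), protégé, foreign
Sports news   4    Women's World Cup draw                         women's football, World Cup, schedule, Denmark, draw
Sports news   5    Women's football matches                       women's football, combined team, photo report, star, Zhou Gaoping
Sports news   6    Snooker                                        Dott, upset, eliminated, Aoshali (奥沙利), Xiaohui (小晖)
Sports news   7    Real Madrid vs. Valencia                       Real Madrid, Valencia, photo report, heartthrob, celebration
Table 3  Automatic topic descriptions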
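
The descriptions in Table 3 are produced by ranking items with closeness centrality. The page does not state which graph the centrality is computed on, so the following is only a sketch of the measure itself: it assumes an unweighted, undirected graph given as an adjacency dictionary and scores each node as the number of reachable nodes divided by the sum of shortest-path distances to them. The node names and edges are hypothetical placeholders, not data from the paper.

```python
from collections import deque


def closeness_centrality(graph):
    """Closeness of each node: reachable node count / sum of shortest-path lengths."""
    scores = {}
    for source in graph:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


# Hypothetical graph inside one topic: term_b is linked to every other term.
graph = {
    "term_a": ["term_b"],
    "term_b": ["term_a", "term_c", "term_d"],
    "term_c": ["term_b"],
    "term_d": ["term_b"],
}
scores = closeness_centrality(graph)
top = sorted(scores, key=scores.get, reverse=True)
print(top[0])  # term_b
```

Taking the few highest-scoring nodes of each topic would give a five-item description like the rows of Table 3, under the assumption that the nodes are candidate terms.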
Fig.7  Scatter plot of topics