Detecting Topics of Online News with Shared Nearest Neighbours and Markov Clustering
Wu Zhenfeng¹, Lan Tian¹, Wang Mengmeng², Pu Mo¹, Zhang Yu¹, Liu Zhihui¹, He Yanqing¹
¹ Institute of Scientific and Technical Information of China, Beijing 100038, China
² School of Economics, Renmin University of China, Beijing 100872, China
Abstract [Objective] This paper proposes a topic detection method for online news that makes fuller use of the internal structure of the data. [Methods] First, we measured the association strength between news items by the number and rank of their shared nearest neighbors. Then, we built a shared-nearest-neighbor graph from these associations to better exploit the internal structure of the data. Finally, we detected topics through dimension reduction, selection of the optimal number of topics, Markov clustering, and automatic topic description based on closeness centrality. [Results] On two online news data sets, the proposed method reached adjusted Rand index (ARI) values of up to 0.86 and 0.97, whereas the ARI values of the LDA, K-means, and GMM models were all below 0.75 and 0.90, respectively. [Limitations] The method still needs to be evaluated on data sets from other fields and on multilingual data. [Conclusions] The proposed method can effectively detect the topics of online news and offers a new direction for future research.
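The two core steps named in the abstract, a shared-nearest-neighbor graph followed by Markov clustering, can be illustrated with a minimal sketch. It is not the authors' implementation: the weight formula in `snn_weight`, the toy 2-D points standing in for news vectors, and all function names and parameter values are assumptions chosen for illustration. Dimension reduction, selection of the number of topics, and the closeness-centrality topic description are left out.

```python
import math

def knn_lists(points, k):
    """For each point, list the indices of its k closest points (itself first)."""
    lists = []
    for p in points:
        order = sorted(range(len(points)), key=lambda j: math.dist(p, points[j]))
        lists.append(order[:k])
    return lists

def snn_weight(nn_i, nn_j):
    """Assumed weight: grows with the number of shared neighbors and with
    how highly both points rank them (rank 0 is the closest)."""
    k = len(nn_i)
    rank_j = {s: r for r, s in enumerate(nn_j)}
    return sum(k - (r + rank_j[s]) / 2 for r, s in enumerate(nn_i) if s in rank_j)

def snn_graph(points, k):
    """Dense weight matrix of the shared-nearest-neighbor graph."""
    nn = knn_lists(points, k)
    n = len(points)
    return [[snn_weight(nn[i], nn[j]) for j in range(n)] for i in range(n)]

def normalize_columns(m):
    """Scale every column so that it sums to 1 (a column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def markov_clustering(weights, inflation=2.0, iterations=50):
    """Basic Markov clustering: alternate expansion and inflation."""
    n = len(weights)
    m = normalize_columns(weights)
    for _ in range(iterations):
        # Expansion: square the matrix, i.e. take two random-walk steps.
        m = [[sum(m[i][t] * m[t][j] for t in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize the columns.
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    # Each row that keeps non-negligible mass defines one cluster.
    clusters = []
    for row in m:
        members = frozenset(j for j, v in enumerate(row) if v > 1e-6)
        if members and members not in clusters:
            clusters.append(members)
    return sorted(sorted(c) for c in clusters)

# Toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = markov_clustering(snn_graph(points, k=3))
print(clusters)
```

On this toy input the two groups share no neighbors, so the graph splits into two components and each is returned as one cluster. Real news vectors would produce a connected graph, where the inflation parameter controls how finely the topics are split.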
Received: 15 October 2021
Published: 16 November 2022
Funding: National Key R&D Program of China (2019YFA0707201); Key Work Program of the Institute of Scientific and Technical Information of China (ZD2021-17); Key Work Program of the Institute of Scientific and Technical Information of China (ZD2022-01)
Corresponding author: He Yanqing, ORCID: 0000-0002-8791-1581
E-mail: heyq@istic.ac.cn