[Objective] This paper proposes an adaptive method for determining the optimal number of topics for the LDA model, with the aim of identifying news topics effectively. [Methods] First, we extracted semantic and time-series information from news texts and constructed the corresponding feature vectors. Then, we used the Co-DPSC algorithm to co-train the two views and obtained a semantic feature matrix that incorporates timing effects. Finally, after reducing the dimension of this matrix, we applied density peak clustering to its rows to generate the optimal number of topics. [Results] The precision and F value of the proposed model improved by 35.09% and 15.39%, respectively. [Limitations] We only clustered keywords from news, and the new model still needs to be examined with datasets from other fields. [Conclusions] The proposed method can provide a better number of topics for the LDA model.
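The final step, density peak clustering over the rows of the reduced matrix, can be illustrated with a minimal sketch. It is not the authors' implementation: the Co-DPSC co-training and the dimension reduction are not reproduced, the function names `density_peak_centers` and `estimate_topic_number` are hypothetical, and the rule for picking peaks (a row counts as a center when its delta exceeds the mean by `n_std` standard deviations) is an assumption that stands in for reading the decision graph by eye. The cutoff percentile `dc_percentile` is likewise an assumed default.

```python
import numpy as np


def density_peak_centers(X, dc_percentile=2.0, n_std=2.0):
    """Return indices of the rows of X selected as density peaks.

    X is assumed to be an (n_rows, n_features) array, e.g. a reduced
    feature matrix with one row per keyword.
    """
    n = X.shape[0]

    # Pairwise Euclidean distances between rows.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Cutoff distance d_c: a low percentile of all pairwise distances.
    dc = np.percentile(dist[np.triu_indices(n, k=1)], dc_percentile)

    # Local density rho_i with a Gaussian kernel (self term removed).
    rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0

    # delta_i: distance to the nearest row that ranks higher in density
    # (ties are broken by sort order). The densest row gets its largest
    # distance to any other row.
    order = np.argsort(-rho, kind="stable")
    delta = np.empty(n)
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        delta[i] = dist[i, order[:rank]].min()

    # Assumed selection rule (not from the paper): keep rows whose delta
    # is unusually large. Isolated low-density outliers would also pass
    # this rule; the original method inspects rho as well.
    threshold = delta.mean() + n_std * delta.std()
    return np.flatnonzero(delta > threshold)


def estimate_topic_number(X, **kwargs):
    """Number of density peaks, used here as the candidate topic count."""
    return len(density_peak_centers(X, **kwargs))
```

Under these assumptions, the value returned by `estimate_topic_number` would be passed to the LDA model as its number of topics.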