Selecting Optimal LDA Numbers to Identify News Topics
Yang Yang, Jiang Kaizhong, Yuan Mingjun, Hui Lanxin
School of Mathematics and Statistics, Shanghai University of Engineering Science, Shanghai 201620, China

Abstract [Objective] This paper proposes an adaptive method for determining the optimal number of topics for the LDA model, with the aim of identifying news topics effectively. [Methods] First, we extracted data from news texts along two views, semantics and time series, and constructed the corresponding feature vectors. Then, we used the Co-DPSC algorithm to co-train the two views and obtained a semantic feature matrix that incorporates timing effects. Finally, after reducing the dimension of this matrix, we applied density peak clustering to its rows, which yielded the optimal number of topics. [Results] The precision and F value of the proposed model improved by 35.09% and 15.39%, respectively. [Limitations] We only clustered keywords from news, and the new model still needs to be examined with datasets from other fields. [Conclusions] The proposed method can provide a better number of topics for the LDA model.
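The final step of the abstract, reading a topic number off density peak clustering, can be illustrated with a minimal sketch. It is not the authors' implementation: the Co-DPSC co-training and the dimension reduction are not reproduced, the function name `count_density_peaks`, the cutoff `dc`, and the threshold `delta_min` are hypothetical, and the rule "count the rows whose separation exceeds `delta_min`" is an assumed stand-in for whatever adaptive selection the paper uses. The toy matrix `X` merely plays the role of the reduced feature matrix.

```python
import numpy as np


def count_density_peaks(X, dc, delta_min):
    """Count density peaks among the rows of X.

    Follows the general idea of density peak clustering: a peak is a
    point that is far from every point of higher local density.
    dc and delta_min are illustrative parameters (assumptions).
    """
    # Pairwise Euclidean distances between rows.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Gaussian-kernel local density; subtract 1 to drop each point's self-term.
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
    # Visit points from densest to sparsest; a stable sort breaks ties by index.
    order = np.argsort(-rho, kind="stable")
    delta = np.empty(len(X))
    # The densest point gets its largest distance to any other point.
    delta[order[0]] = d[order[0]].max()
    # Every other point gets its distance to the nearest denser point.
    for rank in range(1, len(order)):
        i = order[rank]
        delta[i] = d[i, order[:rank]].min()
    # Assumed selection rule: a peak is any point with delta above delta_min.
    return int((delta > delta_min).sum())


# Toy stand-in for the reduced feature matrix: three well-separated groups,
# each made of one central row plus eight rows on a small ring around it.
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
ring = 0.5 * np.column_stack([np.cos(angles), np.sin(angles)])
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([np.vstack([c, c + ring]) for c in centers])

k = count_density_peaks(X, dc=0.5, delta_min=5.0)  # three groups, so k == 3
```

In this sketch the count `k` would then be passed to an LDA implementation as its number of topics. On real data both `dc` and `delta_min` would need tuning, and isolated low-density rows may also need to be filtered out before counting.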

Received: 14 February 2022
Published: 13 January 2023
Fund: National Statistical Science Research Project of China (2020LY080)
Corresponding Author: Yang Yang, E-mail: yy_5ten8@126.com