[Objective] The traditional LDA model requires the number of topics to be specified in advance. To address this, the paper proposes an adaptive method for determining the number of topics in news topic recognition.
[Methods] News data are represented under two views, semantics and time series, and a feature vector is extracted for each view. The Co-DPSC algorithm co-trains the two views to produce a semantic feature matrix that incorporates temporal effects. After dimensionality reduction, the rows of this matrix are clustered by density peak clustering, and the resulting number of clusters is taken as the optimal number of topics.
[Results] Experiments show that considering both semantic and temporal factors improves the precision and F value of the selected number of topics: precision increases by 35.09% and the F value by 15.39%.
[Limitations] Clustering is performed on a keyword set, so the keyword extraction method affects both clustering quality and running time to some extent. Because the method requires both textual and temporal elements, its applicability to other types of data is limited.
[Conclusions] Experiments show that, by combining the timeliness and the content of news data when categorizing news, the method improves the accuracy of the selected optimal number of topics to some extent.
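The last step described under [Methods], counting clusters by density peak clustering over the rows of a reduced matrix, can be illustrated with a minimal sketch. This is generic density peak clustering only, not the Co-DPSC co-training procedure, whose details the abstract does not give. The function name `count_density_peaks`, the Gaussian density kernel, the percentile rule for the cutoff distance, and the mean-plus-three-standard-deviations rule for picking peaks are all assumptions made for illustration.

```python
import numpy as np

def count_density_peaks(X, dc_percentile=2.0, n_sigma=3.0):
    """Estimate a cluster count as the number of density peaks among the rows of X.

    Hypothetical helper: the cutoff and peak-selection heuristics are assumptions,
    not the paper's settings.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # Pairwise Euclidean distances between rows.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Cutoff distance d_c: a low percentile of all pairwise distances (assumed heuristic).
    upper = np.triu_indices(n, k=1)
    dc = max(np.percentile(dist[upper], dc_percentile), np.finfo(float).eps)

    # Local density rho_i with a Gaussian kernel; subtract 1 to drop the self term.
    rho = np.exp(-(dist / dc) ** 2).sum(axis=1) - 1.0

    # delta_i: distance to the nearest row of higher density.
    # The densest row gets its largest distance to any other row.
    order = np.argsort(-rho)
    delta = np.empty(n)
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        delta[i] = dist[i, order[:rank]].min()

    # Rows with unusually large rho * delta are treated as cluster centers (assumed rule).
    gamma = rho * delta
    threshold = gamma.mean() + n_sigma * gamma.std()
    return int((gamma > threshold).sum())


def make_grid_clusters(origins, side=7):
    """Build toy data: one side-by-side integer grid of points per origin."""
    ticks = np.arange(side)
    grid = np.array([(x, y) for x in ticks for y in ticks], dtype=float)
    return np.vstack([grid + np.asarray(o, dtype=float) for o in origins])


if __name__ == "__main__":
    toy = make_grid_clusters([(0, 0), (20, 0), (0, 20)])
    print("estimated number of clusters:", count_density_peaks(toy))
```

In the setting of the abstract, `X` would stand for the dimension-reduced semantic feature matrix and the returned count would play the role of the number of topics passed to LDA. The outlier rule used here only works when cluster centers are a small fraction of all rows, so it should be read as a placeholder for whatever selection rule the paper uses.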
杨洋, 江开忠, 原明君, 惠岚昕. 新闻话题识别中LDA最优主题数选取研究[J]. 数据分析与知识发现, DOI: 10.11925/infotech.2096-3467.2022-0115.
Yang Yang, Jiang Kaizhong, Yuan Mingjun, Hui Lanxin. Research on the selection of LDA optimal topic number in news topic identification [J]. Data Analysis and Knowledge Discovery, DOI: 10.11925/infotech.2096-3467.2022-0115.