[Objective] This paper proposes K-wrLDA, a model based on adaptive clustering, to improve the topic identification ability of the traditional LDA model and to determine the optimal number of topics. [Methods] First, we used the LDA and word2vec models to construct the T-WV matrix, which contains the probability information and the semantic relevance of the topic words. Then, we selected the number of topics based on an evaluation of clustering quality and the pseudo-F statistic. Finally, we compared the topic identification results of the proposed model with those of the traditional models. [Results] The optimal number of topics for the proposed model was 33, and its perplexity was lower than that of the traditional models. [Limitations] The sample size needs to be expanded. [Conclusions] The proposed model identifies topics better than the traditional LDA model and can also determine the optimal number of topics. It may be applied to large corpora in various fields.
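The abstract does not give the formula for the pseudo-F statistic, so the following is only a minimal sketch that assumes the standard Calinski-Harabasz definition: between-cluster variance divided by within-cluster variance, each scaled by its degrees of freedom. The function name `pseudo_f` and the toy points are hypothetical. In the paper's setting the clustered vectors would come from the T-WV matrix rather than from 2-D points.

```python
def pseudo_f(points, labels):
    """Pseudo-F statistic for a clustering (assumed Calinski-Harabasz form).

    points: list of equal-length numeric tuples
    labels: cluster label for each point
    """
    n = len(points)
    dim = len(points[0])

    # Group points by cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or k >= n:
        raise ValueError("pseudo-F needs 2 <= k < n clusters")

    # Centroid of the whole data set.
    grand = [sum(p[d] for p in points) / n for d in range(dim)]

    ssb = 0.0  # between-cluster sum of squares
    ssw = 0.0  # within-cluster sum of squares
    for members in clusters.values():
        size = len(members)
        centroid = [sum(p[d] for p in members) / size for d in range(dim)]
        ssb += size * sum((c - g) ** 2 for c, g in zip(centroid, grand))
        ssw += sum(
            sum((x - c) ** 2 for x, c in zip(p, centroid)) for p in members
        )

    if ssw == 0.0:
        return float("inf")  # perfectly tight clusters
    return (ssb / (k - 1)) / (ssw / (n - k))


# Toy data: two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
good_labels = [0, 0, 1, 1]  # matches the visible grouping
bad_labels = [0, 1, 0, 1]   # mixes the two groups

good_score = pseudo_f(points, good_labels)
bad_score = pseudo_f(points, bad_labels)
```

A larger pseudo-F value indicates tighter, better separated clusters. Under this assumption, candidate topic numbers could be compared by clustering for each candidate and preferring the one with the highest score.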
Wang Tingting, Han Man, Wang Yu. Optimizing LDA Model with Various Topic Numbers: Case Study of Scientific Literature[J]. Data Analysis and Knowledge Discovery, 2018, 2(1): 29-40.