Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (1): 29-40    DOI: 10.11925/infotech.2096-3467.2017.0715
Optimizing LDA Model with Various Topic Numbers: Case Study of Scientific Literature
Tingting Wang1,2, Man Han1,2, Yu Wang1
1(School of Statistics, Huaqiao University, Xiamen 361021, China)
2(Center for Modern Applied Statistics and Large Data Research, Huaqiao University, Xiamen 361021, China)
[Objective] This paper proposes a K-wrLDA model based on adaptive clustering, aiming to improve the topic identification ability of the traditional LDA model and to determine the optimal number of topics. [Methods] First, we used the LDA and word2vec models to construct a T-WV matrix containing both the probability information and the semantic relevance of the topic words. Then, we selected the number of topics based on an evaluation of clustering quality using the pseudo-F statistic. Finally, we compared the topic identification results of the proposed model with those of the traditional LDA model. [Results] The optimal number of topics for the proposed model was 33, and the model achieved lower perplexity than the traditional LDA model. [Limitations] The sample size needs to be expanded. [Conclusions] The proposed model identifies topics better than the traditional LDA model and can also determine the optimal number of topics. The new model may be applied to large corpora in various fields.
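
The abstract names the pseudo-F statistic as the criterion for choosing the number of topics but does not give its formula. As a rough illustration only, the sketch below assumes the standard Calinski-Harabasz form, F = (SSB / (k - 1)) / (SSW / (n - k)), where SSB and SSW are the between-cluster and within-cluster sums of squares, n is the number of vectors, and k is the number of clusters. The function name pseudo_f and the toy vectors are hypothetical; in the paper's setting the rows would presumably come from the T-WV matrix, and the authors' exact procedure may differ.

```python
from typing import Dict, List, Sequence


def pseudo_f(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Calinski-Harabasz pseudo-F: (SSB / (k - 1)) / (SSW / (n - k)).

    Assumed standard form; not taken from the paper itself.
    """
    n = len(points)
    dim = len(points[0])

    # Group the vectors by their cluster label.
    groups: Dict[int, List[Sequence[float]]] = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    k = len(groups)
    if k < 2 or k >= n:
        raise ValueError("pseudo-F needs 2 <= k < n clusters")

    def centroid(rows: Sequence[Sequence[float]]) -> List[float]:
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    ssb = 0.0  # between-cluster sum of squares
    ssw = 0.0  # within-cluster sum of squares
    for rows in groups.values():
        c = centroid(rows)
        ssb += len(rows) * sqdist(c, overall)
        ssw += sum(sqdist(r, c) for r in rows)

    if ssw == 0.0:
        return float("inf")  # every vector sits exactly on its centroid
    return (ssb / (k - 1)) / (ssw / (n - k))


# Toy vectors (hypothetical): two tight groups far apart on the first axis.
toy = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
well_separated = pseudo_f(toy, [0, 0, 1, 1])
poorly_separated = pseudo_f(toy, [0, 1, 0, 1])
```

A partition that matches the natural grouping gives a much larger pseudo-F than one that splits the groups, which is why the statistic can be compared across candidate cluster counts and the count with the highest value preferred.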

Key words: Topic Model; Word Embedding; Adaptive Clustering; Perplexity
Received: 20 July 2017      Published: 05 February 2018

Cite this article:

Tingting Wang, Man Han, Yu Wang. Optimizing LDA Model with Various Topic Numbers: Case Study of Scientific Literature. Data Analysis and Knowledge Discovery, 2018, 2(1): 29-40.

