Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model
Guan Peng1,2, Wang Yuefen1
1 School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China
2 College of Applied Mathematics, Chaohu University, Hefei 238000, China
[Objective] This paper seeks to identify the optimal number of topics for a Latent Dirichlet Allocation (LDA) model used to analyze scientific and technical information. [Methods] First, we used topic similarity to measure the differences among latent topics. Second, we proposed a method for determining the optimal number of topics and applied it to Chinese-language literature in the field of new energy. [Results] The proposed method achieved a higher precision ratio and a higher F-score in topic extraction, which improved the performance of literature recommendation systems. [Limitations] We did not examine the new method with other datasets, such as microblog posts and XML documents. [Conclusions] The proposed method could identify more recognizable topics and improve the performance of scientific and technical literature recommendation systems.
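The idea of selecting the number of topics by how well separated the latent topics are can be illustrated with a minimal sketch. The abstract does not give the paper's exact similarity measure or selection criterion, so everything below is an assumption made for illustration: similarity is taken as one minus the Jensen-Shannon divergence between topic-word distributions, the candidate number of topics with the lowest average pairwise similarity is chosen, and the names `js_divergence`, `mean_topic_similarity`, `pick_num_topics`, and `toy_models` are hypothetical.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    # Smooth with eps so that zero probabilities do not produce log(0).
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return float(np.sum(a * np.log2(a / b)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def mean_topic_similarity(topic_word):
    """Average pairwise similarity (1 - JSD) over all pairs of topics."""
    k = len(topic_word)
    if k < 2:
        raise ValueError("need at least two topics to compare")
    sims = [1.0 - js_divergence(topic_word[i], topic_word[j])
            for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(sims))

def pick_num_topics(models):
    """models: dict mapping a candidate K to its (K x V) topic-word matrix.

    Returns the K whose topics are, on average, least similar to each other
    (an assumed criterion, not necessarily the one used in the paper).
    """
    return min(models, key=lambda k: mean_topic_similarity(models[k]))

# Toy input (made up): with K=3, two of the topics are identical,
# so the K=2 model has the better-separated topics.
toy_models = {
    2: [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]],
    3: [[0.5, 0.5, 0.0, 0.0], [0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]],
}
best_k = pick_num_topics(toy_models)
```

In practice the topic-word matrices would come from LDA models fitted at each candidate K; the toy matrices only show the shape of the input.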
Guan Peng, Wang Yuefen. Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model[J]. New Technology of Library and Information Service, 2016, 32(9): 42-50.