|
|
Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model |
Guan Peng1,2,Wang Yuefen1( ) |
1School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China 2College of Applied Mathematics, Chaohu University, Hefei 238000, China |
|
|
Abstract [Objective] This paper tries to identify the optimal number of topics for the Latent Dirichlet Allocation (LDA) model to analyze scientific and technical information. [Methods] First, we used the topic similarity to measure the differences among the latent topics. Second, we proposed a method determining the optimal topic numbers and tried to utilize this model to documents from Chinese literature in the field of new energy. [Results] The proposed method achieved higher precision ratio and higher F-score in topic extration, which improved the performance of literature recommendation systems. [Limitations] We did not examine the new mothod with other datasets, such as microblog posts and XML documents. [Conclusions] The proposed method could identify more recognizable topics and improve the performance of scientific and technical literature recommendation systems.
|
Received: 22 February 2016
Published: 19 October 2016
|
[1] | Blei D M, Ng A Y, Jordan M I.Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022. | [2] | 王萍. 基于概率主题模型的文献知识挖掘[J]. 情报学报, 2011, 30(6): 583-590. | [2] | (Wang Ping.Literature Knowledge Mining Based on Probabilistic Topic Model[J]. Journal of the China Society for Scientific and Technical Information, 2011, 30(6): 583-590.) | [3] | Hassan S U, Haddawy P.Analyzing Knowledge Flows of Scientific Literature Through Semantic Links: A Case Study in the Field of Energy[J]. Scientometrics, 2015, 103(1): 33-46. | [4] | Liang H, Fang L.Topic Discovery and Trend Analysis in Scientific Literature Based on Topic Model[J]. Journal of Chinese Information Processing, 2012, 26(2): 109-115. | [5] | 范云满, 马建霞. 基于LDA与新兴主题特征分析的新兴主题探测研究[J].情报学报, 2014, 33(7): 698-711. | [5] | (Fan Yunman, Ma Jianxia.Detection of Emerging Topics Based on LDA and Feature Analysis of Emerging Topics[J]. Journal of the China Society for Scientific and Technical Information, 2014, 33(7): 698-711.) | [6] | He Q, Chen B, Pei J, et al.Detecting Topic Evolution in Scientific Literature: How Can Citations Help? [C]. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, 2009: 957-966. | [7] | AlSumait L, Barbará D, Domeniconi C. On-line LDA: Adaptive Topic Models for Mining Text Streams with Applications to Topic Detection and Tracking [C]. In: Proceedings of the 8th IEEE International Conference on Data Mining. 2008. | [8] | 刘彤, 杨冠灿, 蒋继娅, 等.基于多重关系的专利网络演化特征与动态分析——以锂离子电池领域为例[J]. 情报学报, 2014, 33(12): 1288-1301. | [8] | (Liu Tong, Yang Guancan, Jiang Jiya, et al.Research on the Evolution and Dynamic Analysis of Multi-relation Integrated Patent Network: A Case Study on Lithiumion Battery[J]. Journal of the China Society for Scientific and Technical Information, 2014, 33(12): 1288-1301.) | [9] | 贺亮, 李芳. 科技文献话题演化研究[J]. 现代图书情报技术, 2012(4): 61-67. | [9] | (He Liang, Li Fang.Topic Evolution in Scientific Literature[J]. New Technology of Library and Information Service, 2012(4): 61-67.) | [10] | Wu Q Q, Zhang C D, Hong Q Q, et al.Topic Evolution Based on LDA and HMM and Its Application in Stem Cell Research[J]. Journal of Information Science, 2014, 40(5): 611-620. | [11] | Gerrish S, Blei D M.A Language-based Approach to Measuring Scholarly Impact [C]. In: Proceedings of the 27th International Conference on Machine Learning. 2010. | [12] | Dhillon I S, Modha D S.Concept Decompositions for Large Sparse Text Data Using Clustering[J]. Machine Learning, 2001, 42(1-2): 143-175. | [13] | 王李冬, 魏宝刚, 袁杰. 基于概率主题模型的文档聚类[J]. 电子学报, 2012, 40(11): 2346-2350. | [13] | (Wang Lidong, Wei Baogang, Yuan Jie.Document Clustering Based on Probabilistic Topic Model[J]. Acta Electronica Sinica, 2012, 40(11): 2346-2350.) | [14] | Lee H, Kihm J, Choo J, et al.iVisClustering: An Interactive Visual Document Clustering via Topic Modeling[J]. Computer Graphics Forum, 2012, 31(3): 1155-1164. | [15] | Kabán A, Girolami M A.A Dynamic Probabilistic Model to Visualise Topic Evolution in Text Streams[J]. Journal of Intelligent Information Systems, 2002, 18(2-3): 107-125. | [16] | Chua F C T, Lauw H W, Lim E P. Generative Models for Item Adoptions Using Social Correlation[J]. IEEE Transactions on Knowledge and Data Engineering, 2013, 25(9): 2036-2048. | [17] | 张晗, 徐硕, 乔晓东, 等. 融合科技文献内外部特征的主题模型发展综述[J]. 情报学报, 2014, 33(10): 1108-1120. | [17] | (Zhang Han, Xu Shuo, Qiao Xiaodong, et al.Review on Topic Models Integrating Intra- and Extra- Features of Scientific and Technical Literature[J]. Journal of the China Society for Scientific and Technical Information, 2014, 33(10): 1108-1120.) | [18] | Teh Y, Jordan M, Beal M, et al.Hierarchical Dirichlet Processes[J]. Journal of the American Statistical Association, 2007, 101(476): 1566-1581. | [19] | Griffiths T L, Steyvers M.Finding Scientific Topics[J]. Proceedings of the National Academy of Sciences of the United States of America, 2004, 101(S1): 5228-5235. | [20] | Arun R, Suresh V, Veni Madhavan C E, et al. On Finding the Natural Number of Topics with Latent Dirichlet Allocation: Some Observations [A]. //Advances in Knowledge Discovery and Data Mining[M]. Springer Berlin Heidelberg, 2010. | [21] | 曹娟, 张勇东, 李锦涛, 等.一种基于密度的自适应最优LDA模型选择方法[J]. 计算机学报, 2008, 31(10): 1780-1787. | [21] | (Cao Juan, Zhang Yongdong, Li Jintao, et al.A Method of Adaptively Selecting Bast LDA Model Based on Density[J]. Chinese Journal of Computers, 2008, 31(10): 1780-1787.) | [22] | Grossman D A.Information Retrieval: Algorithms and Heuristics[M]. Springer Science & Business Media, 2004. | [23] | Duda R O, Hart P E, Stork D G.Pattern Classification[M]. John Wiley & Sons, 2012. | [24] | Lin J.Divergence Measures Based on Shannon Entropy[J]. IEEE Transactions on Information Theory, 199l, 37(1): 145-151. | [25] | Sun J Y. jieba0.37 [EB/OL]. [2015-10-08]. . | [26] | RehurekR. gensim 0.10.2 [EB/OL]. [2014-12-11]. . |
|
|
Viewed |
|
|
|
Full text
|
|
|
|
|
Abstract
|
|
|
|
|
Cited |
|
|
|
|
|
Shared |
|
|
|
|
|
Discussed |
|
|
|
|