New Technology of Library and Information Service  2016, Vol. 32 Issue (9): 42-50    DOI: 10.11925/infotech.1003-3513.2016.09.05
Original Article
Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model
Guan Peng1,2, Wang Yuefen1
1School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China
2College of Applied Mathematics, Chaohu University, Hefei 238000, China
Abstract  

[Objective] This paper seeks to identify the optimal number of topics for the Latent Dirichlet Allocation (LDA) model when analyzing scientific and technical information. [Methods] First, we used topic similarity to measure the differences among latent topics. Second, we proposed a method for determining the optimal number of topics and applied it to Chinese-language literature in the field of new energy. [Results] The proposed method achieved a higher precision ratio and a higher F-score in topic extraction, which improved the performance of literature recommendation systems. [Limitations] We did not examine the new method with other datasets, such as microblog posts and XML documents. [Conclusions] The proposed method can identify more recognizable topics and improve the performance of scientific and technical literature recommendation systems.
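
The idea of using topic similarity to choose the number of topics can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not give the exact similarity measure or selection criterion, so the sketch assumes Jensen-Shannon divergence between topic-word distributions and simply prefers the candidate topic number whose topics are least similar on average. The function names and the toy distributions are hypothetical.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_topic_similarity(topics):
    """Average pairwise similarity (1 - JS divergence) over all topic pairs."""
    pairs = list(combinations(topics, 2))
    return sum(1.0 - js_divergence(p, q) for p, q in pairs) / len(pairs)


def pick_topic_number(candidates):
    """candidates maps K to a list of K topic-word distributions.

    Assumed criterion: return the K with the lowest mean topic similarity.
    """
    return min(candidates, key=lambda k: mean_topic_similarity(candidates[k]))


# Toy topic-word distributions over a four-word vocabulary (illustrative only).
toy = {
    2: [[0.9, 0.1, 0.0, 0.0], [0.0, 0.0, 0.1, 0.9]],
    3: [[0.9, 0.1, 0.0, 0.0], [0.8, 0.2, 0.0, 0.0], [0.0, 0.0, 0.1, 0.9]],
}
best_k = pick_topic_number(toy)
```

In practice the distributions would come from LDA models trained with different topic numbers, and a similarity score of this kind would likely be weighed together with perplexity, which the keywords list, rather than used alone.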

Key words: LDA topic model; Similarity; Perplexity; Analysis of Scientific and Technical Information
Received: 22 February 2016      Published: 19 October 2016

Cite this article:

Guan Peng, Wang Yuefen. Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model. New Technology of Library and Information Service, 2016, 32(9): 42-50.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.09.05     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I9/42

  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn