New Technology of Library and Information Service, 2016, Vol. 32, Issue 9: 42-50     https://doi.org/10.11925/infotech.1003-3513.2016.09.05
Research Paper
Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model*
Guan Peng1,2, Wang Yuefen1
1School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China
2College of Applied Mathematics, Chaohu University, Hefei 238000, China
Abstract

[Objective] This paper aims to determine the optimal number of topics for the Latent Dirichlet Allocation (LDA) topic model in scientific and technical information analysis. [Methods] We use topic similarity to measure the differences among latent topics and combine it with perplexity to propose a method for determining the optimal number of LDA topics. The method accounts for both the quality of topic extraction and the model's ability to generalize to new documents. [Results] Using Chinese scientific and technical literature in the field of new energy as the dataset, the experiments show that, compared with using perplexity alone, the proposed method achieves higher topic-extraction precision (91.67%), a higher F-score (86.27%), and higher literature recommendation accuracy (71.25%). [Limitations] We did not validate the new method on other types of datasets, such as microblog short texts and XML documents. [Conclusions] The proposed method can effectively extract highly recognizable topics from scientific and technical literature datasets and can improve the performance of literature recommendation.
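
The abstract does not state how topic similarity and perplexity are combined, so the following is only a minimal sketch of the general idea, not the authors' formula. It assumes that topic difference is measured by the mean pairwise Jensen-Shannon divergence between topic-word distributions, and that the two signals are combined as a ratio (perplexity divided by mean divergence), with the smallest ratio winning. The function names, the ratio, and the toy numbers are all hypothetical.

```python
import math
from itertools import combinations


def perplexity(token_log_probs):
    """Perplexity from the natural-log probability of each held-out token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a zero numerator contribute nothing to the sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_topic_divergence(topics):
    """Average pairwise JS divergence over all topic-word distributions."""
    pairs = list(combinations(topics, 2))
    return sum(js_divergence(p, q) for p, q in pairs) / len(pairs)


def combined_score(held_out_perplexity, topics):
    """Hypothetical score (an assumption, not the paper's formula):
    low perplexity and well-separated topics both lower the score."""
    divergence = mean_topic_divergence(topics)
    return math.inf if divergence == 0 else held_out_perplexity / divergence


def pick_topic_number(candidates):
    """candidates maps K -> (held-out perplexity, topic-word distributions)."""
    return min(candidates, key=lambda k: combined_score(*candidates[k]))


# Toy data (made up): K=3 has lower perplexity but two near-duplicate topics.
toy = {
    2: (120.0, [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9]]),
    3: (110.0, [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.85, 0.15, 0.0]]),
}
print(pick_topic_number(toy))
```

In the toy data, perplexity alone would favor K=3, while the combined score penalizes the near-duplicate topic. In practice the perplexity and the topic-word distributions would come from a fitted LDA model evaluated on held-out documents.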

Key words: LDA; Topic model; Similarity; Perplexity; Analysis of Scientific and Technical Information
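Received: 2016-02-22      Published: 2016-10-19
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "Growth of Scientific Literature Dissemination Networks in New Research Fields and Its Influence on Dissemination Effects" (Grant No. 71373124), the National Social Science Fund of China key project "Social Public Opinion and Decision Support Methodology in the Big Data Environment" (Grant No. 14AZD084), and the Jiangsu Universities Key Research Base of Philosophy and Social Sciences (cultivation site) "Social Computing and Public Opinion Analysis".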
Cite this article:
Guan Peng, Wang Yuefen. Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model[J]. New Technology of Library and Information Service, 2016, 32(9): 42-50.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.09.05      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I9/42