New Technology of Library and Information Service  2016, Vol. 32 Issue (9): 42-50    DOI: 10.11925/infotech.1003-3513.2016.09.05
Research Paper
Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model*
Guan Peng1,2, Wang Yuefen1
1School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China
2College of Applied Mathematics, Chaohu University, Hefei 238000, China
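Abstract

[Objective] To effectively determine the optimal number of topics for the LDA topic model in sci-tech information analysis. [Methods] Topic similarity is used to measure the differences among latent topics and is combined with perplexity to propose a method for determining the optimal number of LDA topics. The method considers both the quality of topic extraction and the model's ability to generalize to new documents. [Results] Scientific and technical literature from China's new energy field was collected as the dataset. The empirical results show that, compared with using perplexity alone, the proposed method achieves higher topic extraction precision (91.67%), F-score (86.27%), and sci-tech literature recommendation precision (71.25%). [Limitations] The new method was not validated on other types of datasets, such as microblog short texts and XML documents. [Conclusions] The method can effectively extract highly recognizable topics from sci-tech literature datasets and can improve sci-tech literature recommendation.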
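The idea of weighing perplexity against topic similarity can be illustrated with a minimal Python sketch. The abstract does not give the authors' formulas, so everything specific here is an assumption made for illustration: cosine similarity stands in for the topic similarity measure, the two quantities are combined as a simple product (lower is better), and the function names, toy corpus, and hand-written distributions are all hypothetical.

```python
import math
from itertools import combinations


def perplexity(docs, doc_topic, topic_word):
    """Perplexity of the documents: exp(-total log-likelihood / token count).

    docs[d] is a list of word ids, doc_topic[d][k] = p(topic k | doc d),
    topic_word[k][w] = p(word w | topic k). Assumes every p(w | d) > 0.
    """
    log_lik, n_tokens = 0.0, 0
    for d, words in enumerate(docs):
        for w in words:
            p = sum(doc_topic[d][k] * topic_word[k][w]
                    for k in range(len(topic_word)))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


def cosine(u, v):
    """Cosine similarity of two topic-word vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_topic_similarity(topic_word):
    """Average pairwise similarity; high values mean redundant topics."""
    pairs = list(combinations(topic_word, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def selection_score(docs, doc_topic, topic_word):
    """Assumed combination (not the paper's formula): lower is better."""
    return perplexity(docs, doc_topic, topic_word) * mean_topic_similarity(topic_word)


# Toy corpus over a 4-word vocabulary (word ids 0..3).
docs = [[0, 1, 0, 1], [2, 3, 2, 3]]

# Hypothetical fitted models, keyed by number of topics K.
candidates = {
    2: ([[0.9, 0.1], [0.1, 0.9]],
        [[0.45, 0.45, 0.05, 0.05], [0.05, 0.05, 0.45, 0.45]]),
    3: ([[0.5, 0.4, 0.1], [0.05, 0.05, 0.9]],
        [[0.45, 0.45, 0.05, 0.05], [0.40, 0.50, 0.05, 0.05],
         [0.05, 0.05, 0.45, 0.45]]),
}

best_k = min(candidates, key=lambda k: selection_score(docs, *candidates[k]))
print(best_k)
```

In this toy setup the two candidates have nearly identical perplexity, so perplexity alone barely separates them. The three-topic model contains two almost duplicate topics, which raises its mean similarity and makes the combined score prefer K = 2. In practice the distributions would come from models trained at each candidate K and evaluated on held-out documents.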

Keywords: LDA topic model; similarity; perplexity; sci-tech information analysis
Abstract

[Objective] This paper aims to identify the optimal number of topics for the Latent Dirichlet Allocation (LDA) model when analyzing scientific and technical information. [Methods] First, we used topic similarity to measure the differences among the latent topics. Second, we proposed a method for determining the optimal number of topics and applied it to Chinese literature in the field of new energy. [Results] The proposed method achieved a higher precision ratio and a higher F-score in topic extraction, and it improved the performance of literature recommendation systems. [Limitations] We did not examine the new method with other datasets, such as microblog posts and XML documents. [Conclusions] The proposed method could identify more recognizable topics and improve the performance of scientific and technical literature recommendation systems.

Key words: LDA Topic Model    Similarity    Perplexity    Analysis of Scientific and Technical Information
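Received: 2016-02-22
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "Growth of Scientific Literature Dissemination Networks in New Research Fields and Its Influence on Dissemination Effects" (Grant No. 71373124), the National Social Science Fund of China Key Project "Research on the Methodology System for Public Opinion and Decision Support in the Big Data Environment" (Grant No. 14AZD084), and the Jiangsu Universities Key Research Base of Philosophy and Social Sciences (cultivation site) "Social Computing and Public Opinion Analysis".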
Cite this article:
Guan Peng, Wang Yuefen. Identifying Optimal Topic Numbers from Sci-Tech Information with LDA Model[J]. New Technology of Library and Information Service, 2016, 32(9): 42-50. DOI: 10.11925/infotech.1003-3513.2016.09.05.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.09.05