New Technology of Library and Information Service  2016, Vol. 32 Issue (11): 20-26     https://doi.org/10.11925/infotech.1003-3513.2016.11.03
Research Paper
Extracting Topics of Computer Science Literature with LDA Model
Yang Haixia, Gao Baojun, Sun Hanlin
Economics and Management School, Wuhan University, Wuhan 430072, China
Abstract

[Objective] This paper applies text mining to automatically extract research topics from a large body of scientific literature and to detect their research trends. [Methods] The corpus consists of computer science papers published in the top ten journals listed under "TP Automation Technology, Computer Technology" in A Guide to the Core Journals of China (2014 edition). We fit an LDA topic model, take each paper's publication date into account, extract representative topics, and analyze how the topics evolve based on topic strength (topic prevalence). [Results] We extracted 18 topics from 29,621 computer science papers; 7 topics show rising topic strength and 6 show declining topic strength. [Limitations] The study covers only the top ten Chinese journals in the field, which is a fairly narrow range, and it does not consider literature from computer science journals published outside China. [Conclusions] The method can mine the topics of computer science journal literature in depth, helping researchers in the field follow topic evolution and identify emerging research topics.

Key words: Computer science    LDA    Topic mining    Topic prevalence    Document cluster
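The abstract describes tracking topic evolution through topic strength over publication time. A minimal sketch of that idea follows, under stated assumptions: topic strength is taken here to be the mean document-topic proportion among the papers published in a given year, and the trend is taken to be the least-squares slope of that strength against the year. The paper's own definitions may differ. The function names `topic_strength_by_year` and `trend_slope` and the toy data are hypothetical, and the document-topic proportions are assumed to come from an already fitted LDA model.

```python
from collections import defaultdict


def topic_strength_by_year(doc_topics, years):
    """Average each topic's proportion over the documents published in each year.

    doc_topics: one list of topic proportions per document (assumed LDA output).
    years: publication year of each document, in the same order.
    Returns {year: [mean proportion of topic 0, topic 1, ...]}, sorted by year.
    """
    sums = {}
    counts = defaultdict(int)
    for theta, year in zip(doc_topics, years):
        if year not in sums:
            sums[year] = [0.0] * len(theta)
        for k, p in enumerate(theta):
            sums[year][k] += p
        counts[year] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sorted(sums)}


def trend_slope(strength, topic):
    """Least-squares slope of one topic's yearly strength against the year.

    A positive slope suggests a rising topic, a negative slope a declining one.
    """
    xs = list(strength)
    ys = [strength[y][topic] for y in xs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0  # a single year gives no trend
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy data (made up for illustration): 5 documents, 2 topics, 3 years.
doc_topics = [[0.8, 0.2], [0.6, 0.4], [0.5, 0.5], [0.3, 0.7], [0.2, 0.8]]
years = [2010, 2010, 2011, 2012, 2012]

strength = topic_strength_by_year(doc_topics, years)
for k in range(2):
    slope = trend_slope(strength, k)
    label = "rising" if slope > 0 else "declining" if slope < 0 else "flat"
    print(f"topic {k}: slope {slope:+.3f} ({label})")
```

In this toy corpus topic 0 loses strength and topic 1 gains it. On real data, a threshold or significance test on the slope would be needed before labelling a topic as rising or declining; that choice is left open here.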
Received: 2016-06-02      Published: 2016-12-20
Cite this article:
Yang Haixia, Gao Baojun, Sun Hanlin. Extracting Topics of Computer Science Literature with LDA Model[J]. New Technology of Library and Information Service, 2016, 32(11): 20-26.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.11.03      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I11/20