Data Analysis and Knowledge Discovery, 2022, Vol. 6, Issue 7: 44-55. https://doi.org/10.11925/infotech.2096-3467.2021.1296
Research Paper
Subject Topic Mining and Evolution Analysis with Multi-Source Data*
Li Hui, Hu Jixia, Tong Zhiying
School of Economics and Management, Xidian University, Xi’an 710126, China
Abstract

[Objective] This paper mines how research topics in a discipline evolve over time, helping researchers quickly grasp the current state and trends of their field. [Methods] After merging multi-source data, we partitioned the domain's research topics by time period and calculated topic importance from topic popularity, density, and closeness centrality. We then used semantic similarity to identify related topics in adjacent time periods, and combined fluctuations in topic importance with topic similarity to determine evolution types and identify evolution paths. [Results] Taking artificial intelligence as the test domain, we analyzed how research topics changed over the past 20 years and obtained the popular research topics and main evolution paths for four time periods; topics clearly merged and split between periods. [Limitations] The topic naming rules are not yet sufficiently rigorous, and because the artificial intelligence industry is still booming, the data used cannot show the full life cycle of the field. [Conclusions] Topic evolution analysis on multi-source data can effectively reveal how a discipline develops; the more important a topic is, the stronger its capacity to evolve.

Keywords: Topic Evolution; LDA; Topic Similarity; Evolutionary Type; Multi-Source Data
Received: 2021-11-13      Published: 2022-08-24
CLC number: G254
Funding: *This work is one of the research outputs of a National Natural Science Foundation of China project (71203173).
Corresponding author: Li Hui, ORCID: 0000-0002-3468-5170, E-mail: lihui@xidian.edu.cn
Cite this article:
Li Hui, Hu Jixia, Tong Zhiying. Subject Topic Mining and Evolution Analysis with Multi-Source Data[J]. Data Analysis and Knowledge Discovery, 2022, 6(7): 44-55.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.1296      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I7/44
Fig.1  Workflow for analyzing subject topic evolution paths
Fig.2  Framework of the pointer-generator network model
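| Evolution type | Condition | Description |
|---|---|---|
| Growth | $\mathrm{sim}(T_i^{t-1}, T_j^{t}) \geq \rho \,\wedge\, \mathrm{flu}(T_i^{t-1}, T_j^{t}) > \lambda$ | Research popularity of the current topic rises |
| Continuation | $\mathrm{sim}(T_i^{t-1}, T_j^{t}) \geq \rho \,\wedge\, -\lambda \leq \mathrm{flu}(T_i^{t-1}, T_j^{t}) \leq \lambda$ | Research popularity of the current topic is largely unchanged |
| Decay | $\mathrm{sim}(T_i^{t-1}, T_j^{t}) \geq \rho \,\wedge\, \mathrm{flu}(T_i^{t-1}, T_j^{t}) < -\lambda$ | Research popularity of the current topic falls |
| Emergence | For every topic $T^{t-1}$ in the previous period, $\mathrm{sim}(T^{t-1}, T_j^{t}) < \rho$ | The current topic did not exist in the previous period |
| Fusion | $\mathrm{sim}(T_i^{t-1}, T_j^{t}) \geq \rho \,\wedge\, \mathrm{sim}(T_k^{t-1}, T_j^{t}) \geq \rho$ | The current topic results from merging several topics of the previous period |
| Split | $\mathrm{sim}(T_i^{t-1}, T_j^{t}) \geq \rho \,\wedge\, \mathrm{sim}(T_i^{t-1}, T_k^{t}) \geq \rho$ | A topic of the previous period splits into several topics in the current period |
| Extinction | For every topic $T^{t}$ in the current period, $\mathrm{sim}(T_i^{t-1}, T^{t}) < \rho$ | A topic of the previous period disappears in the current period |

Table 1  Conditions for determining topic content evolution types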
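The rules in Table 1 can be read as threshold tests on two matrices. The following is a minimal Python sketch, not the authors' implementation: it assumes the similarity and fluctuation values have already been computed, and the function name `classify_evolution`, the matrix layout, and the example thresholds are all hypothetical, since this page does not state how sim and flu are computed or which values of ρ and λ were used.

```python
def classify_evolution(sim, flu, rho, lam):
    """Apply the Table 1 rules to two adjacent time periods.

    sim[i][j] and flu[i][j] hold the similarity and the importance
    fluctuation between topic i of period t-1 and topic j of period t.
    rho and lam are the similarity and fluctuation thresholds
    (assumed inputs; their values are not given on this page).
    """
    n_prev, n_cur = len(sim), len(sim[0])

    # Growth / continuation / decay for every linked topic pair.
    links = {}
    for i in range(n_prev):
        for j in range(n_cur):
            if sim[i][j] >= rho:
                if flu[i][j] > lam:
                    links[(i, j)] = "growth"
                elif flu[i][j] < -lam:
                    links[(i, j)] = "decay"
                else:
                    links[(i, j)] = "continuation"

    # Number of linked partners for each topic on either side.
    preds = [sum(sim[i][j] >= rho for i in range(n_prev)) for j in range(n_cur)]
    succs = [sum(sim[i][j] >= rho for j in range(n_cur)) for i in range(n_prev)]

    return {
        "links": links,
        "emergence": [j for j in range(n_cur) if preds[j] == 0],
        "fusion": [j for j in range(n_cur) if preds[j] >= 2],
        "split": [i for i in range(n_prev) if succs[i] >= 2],
        "extinction": [i for i in range(n_prev) if succs[i] == 0],
    }
```

In this sketch a topic pair can carry a growth, continuation, or decay label and also take part in a fusion or split, because Table 1 states the conditions independently of one another.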
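| Data type | Search period | Database | Search query | Source category | Document type | Results (records) |
|---|---|---|---|---|---|---|
| English papers | 2001/01/01-2020/12/21 | Web of Science | TS=(“Artificial Intelligence” OR AI) | Web of Science Core Collection | Papers | 29 030 |
| English patents | 2001/01/01-2020/12/18 | incoPat | (TIAB=(Artificial Intelligence)) AND (AD=[20010101 to 20201218]) | United States patents, World Intellectual Property Organization, European Patent Office | Invention applications, granted inventions, designs, other | 7 865 |
| English web pages | 2001/01/01-2020/12/23 | Artificial intelligence \| MIT News | - | News articles | Web pages | 500 |

Table 2  Data sources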
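| Parameter | Description |
|---|---|
| α | Dirichlet prior on the document-topic distribution, α = 50/K |
| β | Dirichlet prior on the topic-word distribution, β = 0.02 |
| K | Optimal number of topics, K = 60 |
| niters | Number of Gibbs sampling iterations, niters = 1 000 |
| twords | Number of feature words per topic, twords = 30 |

Table 3  LDA model parameters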
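To show where each parameter in Table 3 enters the model, here is a minimal pure-Python sketch of collapsed Gibbs sampling for LDA. It is an illustration under stated assumptions rather than the tool used in the paper, which this page does not name: the function `lda_gibbs` and its return values are hypothetical, and only the defaults (α = 50/K, β = 0.02, niters = 1000, twords = 30) come from Table 3.

```python
import random


def lda_gibbs(docs, K, alpha=None, beta=0.02, niters=1000, twords=30, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    docs is a list of token lists. Returns (top_words, nkw), where
    top_words[k] lists up to `twords` words for topic k and nkw holds
    the final topic-word counts.
    """
    rng = random.Random(seed)
    if alpha is None:
        alpha = 50.0 / K  # prior setting listed in Table 3
    vocab = sorted({w for doc in docs for w in doc})
    wid = {w: v for v, w in enumerate(vocab)}
    V = len(vocab)

    ndk = [[0] * K for _ in docs]      # document-topic counts
    nkw = [[0] * V for _ in range(K)]  # topic-word counts
    nk = [0] * K                       # total tokens assigned to each topic
    z = []                             # topic assignment of every token

    # Random initial assignment.
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][wid[w]] += 1
            nk[k] += 1
        z.append(zd)

    for _ in range(niters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v, k = wid[w], z[d][n]
                # Remove the token, then resample its topic from the
                # conditional distribution given all other assignments.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][j] + alpha) * (nkw[j][v] + beta) / (nk[j] + V * beta)
                    for j in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][n] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    top_words = []
    for k in range(K):
        order = sorted(range(V), key=lambda v: -nkw[k][v])[:twords]
        top_words.append([vocab[v] for v in order])
    return top_words, nkw
```

With the settings in Table 3 the call would be `lda_gibbs(docs, K=60)`. A pure-Python sampler of this kind is slow on tens of thousands of documents, so it is meant only to make the roles of α, β, K, niters, and twords concrete.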
Fig.3  t-SNE visualization of the "year-topic" matrix after dimensionality reduction
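| Time period | Time range | Number of documents |
|---|---|---|
| 1 | 2001-2006 | 2 615 |
| 2 | 2007-2011 | 2 825 |
| 3 | 2012-2016 | 5 127 |
| 4 | 2017-2020 | 25 143 |

Table 4  Time period partitioning results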
Fig.4  Perplexity curves for each time period
| Time period | Topics and topic IDs |
|---|---|
| 2001-2006 | 1_T0 planning technology; 1_T1 behavior simulation; 1_T2 decision-making; 1_T3 medical diagnosis; 1_T4 pattern recognition; 1_T5 case-based reasoning; 1_T6 genetic algorithm; 1_T7 decision support system; 1_T8 machine-learning; 1_T9 fuzzy-logic; 1_T10 neural network; 1_T11 robot; 1_T12 intelligent agent; 1_T13 process control; 1_T14 mathematical computing |
| 2007-2011 | 2_T0 forecast; 2_T1 data mining; 2_T2 machine learning; 2_T3 logic reasoning; 2_T4 decision making; 2_T5 feature extraction; 2_T6 support vector machine; 2_T7 genetic algorithm; 2_T8 intelligent agent; 2_T9 neural network prediction; 2_T10 process control; 2_T11 information retrieval; 2_T12 robotics; 2_T13 artificial neural network; 2_T14 human |
| 2012-2016 | 3_T0 risk evaluation; 3_T1 forecast; 3_T2 resource management; 3_T3 case-based reasoning; 3_T4 random forest; 3_T5 neural network; 3_T6 mathematical model; 3_T7 hierarchical database; 3_T8 big data; 3_T9 classification; 3_T10 decision trees; 3_T11 fuzzy logic; 3_T12 neuro-fuzzy inference system; 3_T13 simulation; 3_T14 industrial robots; 3_T15 intelligent voice; 3_T16 computational complexity; 3_T17 intelligent agent; 3_T18 human-computer interaction; 3_T19 image processing; 3_T20 optimization algorithm; 3_T21 real-time monitoring; 3_T22 field application; 3_T23 robot |
| 2017-2020 | 4_T0 smart wearable device; 4_T1 particle swarm optimization; 4_T2 evolutionary computation; 4_T3 intelligent manufacturing; 4_T4 internet of things; 4_T5 virtual reality; 4_T6 remote diagnosis; 4_T7 neuro-fuzzy inference system; 4_T8 security; 4_T9 feature extraction; 4_T10 motion control; 4_T11 medical image processing; 4_T12 intelligent life; 4_T13 transfer learning; 4_T14 computational complexity; 4_T15 object tracking; 4_T16 big data; 4_T17 social network; 4_T18 representation learning; 4_T19 neural network; 4_T20 reinforcement learning; 4_T21 machine learning; 4_T22 information retrieval; 4_T23 machine translation; 4_T24 disease diagnosis; 4_T25 knowledge graph; 4_T26 decision-making; 4_T27 forecasting; 4_T28 computational complexity; 4_T29 natural language processing; 4_T30 distributed system; 4_T31 time series model; 4_T32 deep neural network; 4_T33 robotics; 4_T34 anomaly detection; 4_T35 question & answering; 4_T36 3D; 4_T37 ontology; 4_T38 pattern recognition; 4_T39 fuzzy evaluation; 4_T40 information retrieval; 4_T41 clustering |

Table 5  Topic identification results for each time period
Fig.5  Topic evolution paths
Fig.6  Evolution of paths with high topic importance