Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (7): 44-55    DOI: 10.11925/infotech.2096-3467.2021.1296
Original article
Subject Topic Mining and Evolution Analysis with Multi-Source Data
Li Hui, Hu Jixia, Tong Zhiying
School of Economics and Management, Xidian University, Xi’an 710126, China

[Objective] This paper examines how research topics evolve, to help researchers quickly identify the status quo and trends in their fields. [Methods] First, we merged multi-source datasets and divided the domain's research topics by time period. Second, we calculated each topic's importance from its popularity, density, and closeness centrality. Third, we used semantic similarity between topics to identify related topics in adjacent time periods. Finally, we combined the fluctuation in topic importance with topic similarity to determine the evolution types and paths. [Results] We tested the model on papers on artificial intelligence and analyzed how topics changed over the past 20 years. We identified the popular research topics and their evolution paths, which showed clear topic fusion and splitting across the four time periods. [Limitations] The topic naming rules could be more effective, and the study could not cover the whole life cycle of artificial intelligence research, which is still booming. [Conclusions] The proposed model can effectively reveal the evolution of research topics.
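
The abstract does not state how topic semantic similarity is computed. As a minimal sketch, assuming each topic is represented by a word-weight distribution and that similarity is measured by cosine similarity (both are assumptions, as are the names `cosine_similarity`, `link_topics`, and the threshold argument `rho`), related topics in adjacent time periods could be linked as follows. The topic IDs and weights are toy values, not results from the paper.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two sparse word-weight dicts."""
    dot = sum(weight * q.get(word, 0.0) for word, weight in p.items())
    norm_p = math.sqrt(sum(w * w for w in p.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def link_topics(prev_topics, curr_topics, rho):
    """Return (prev_id, curr_id, sim) triples whose similarity is >= rho."""
    links = []
    for prev_id, prev_dist in prev_topics.items():
        for curr_id, curr_dist in curr_topics.items():
            sim = cosine_similarity(prev_dist, curr_dist)
            if sim >= rho:
                links.append((prev_id, curr_id, sim))
    return links


# Toy word-weight distributions (hypothetical, for illustration only).
prev_topics = {"A": {"neural": 0.5, "network": 0.4, "training": 0.1}}
curr_topics = {
    "X": {"neural": 0.4, "network": 0.4, "artificial": 0.2},
    "Y": {"data": 0.6, "mining": 0.4},
}
links = link_topics(prev_topics, curr_topics, rho=0.5)
```

Only the pair ("A", "X") shares vocabulary, so it is the only link returned. The threshold of 0.5 is arbitrary here.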

Keywords: Topic Evolution; LDA; Topic Similarity; Evolutionary Type; Multi-Source Data
Received: 13 November 2021      Published: 24 August 2022
ZTFLH:  G254  
Fund: National Natural Science Foundation of China (71203173)
Corresponding Author: Li Hui, ORCID: 0000-0002-3468-5170

Cite this article:

Li Hui, Hu Jixia, Tong Zhiying. Subject Topic Mining and Evolution Analysis with Multi-Source Data. Data Analysis and Knowledge Discovery, 2022, 6(7): 44-55.

Flow Chart of Subject Evolution Path Analysis
Pointer Generation Network Model
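Evolution type | Criterion | Description
Growth | $\mathrm{sim}(T_i^{t-1}, T_j^{t}) \geq \rho$ and $\mathrm{flu}(T_i^{t-1}, T_j^{t}) > \lambda$ | Research popularity of the current topic rises
Continuation | $\mathrm{sim}(T_i^{t-1}, T_j^{t}) \geq \rho$ and $-\lambda \leq \mathrm{flu}(T_i^{t-1}, T_j^{t}) \leq \lambda$ | Research popularity of the current topic is essentially unchanged
Decline | $\mathrm{sim}(T_i^{t-1}, T_j^{t}) \geq \rho$ and $\mathrm{flu}(T_i^{t-1}, T_j^{t}) < -\lambda$ | Research popularity of the current topic falls
Emergence | For every topic $T^{t-1}$ in the previous period, $\mathrm{sim}(T^{t-1}, T_j^{t}) < \rho$ | The current topic did not exist in the previous period
Fusion | $\mathrm{sim}(T_i^{t-1}, T_j^{t}) \geq \rho$ and $\mathrm{sim}(T_k^{t-1}, T_j^{t}) \geq \rho$ | The current topic results from the merging of several topics from the previous period
Split | $\mathrm{sim}(T_i^{t-1}, T_j^{t}) \geq \rho$ and $\mathrm{sim}(T_i^{t-1}, T_k^{t}) \geq \rho$ | A topic from the previous period splits into several topics in the current period
Death | For every topic $T^{t}$ in the current period, $\mathrm{sim}(T_i^{t-1}, T^{t}) < \rho$ | A topic from the previous period disappears in the current period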
Criteria for Determining the Evolution Type of Topic Content
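
The criteria in this table can be sketched as a small rule-based classifier. This is an illustrative sketch rather than the authors' implementation: it assumes the similarity and importance-fluctuation values are already available as dictionaries keyed by (previous topic, current topic), that the lost comparison operators on the similarity threshold are "greater than or equal to", and that fusion and split involve two distinct topics. The function name `classify_evolution`, the label strings, and all numeric values are hypothetical.

```python
def classify_evolution(sim, flu, prev_ids, curr_ids, rho, lam):
    """Label topic evolution between two adjacent periods.

    sim, flu: dicts keyed by (prev_id, curr_id).
    rho: similarity threshold; lam: fluctuation threshold.
    """
    related = {
        (i, j)
        for i in prev_ids
        for j in curr_ids
        if sim.get((i, j), 0.0) >= rho
    }
    result = {"pairs": {}, "emergence": [], "death": [], "fusion": [], "split": []}

    # Growth / continuation / decline for each related pair.
    for i, j in sorted(related):
        change = flu[(i, j)]
        if change > lam:
            result["pairs"][(i, j)] = "growth"
        elif change < -lam:
            result["pairs"][(i, j)] = "decline"
        else:
            result["pairs"][(i, j)] = "continuation"

    # Emergence and fusion are judged per current topic.
    for j in curr_ids:
        sources = [i for i in prev_ids if (i, j) in related]
        if not sources:
            result["emergence"].append(j)
        elif len(sources) >= 2:
            result["fusion"].append(j)

    # Death and split are judged per previous topic.
    for i in prev_ids:
        targets = [j for j in curr_ids if (i, j) in related]
        if not targets:
            result["death"].append(i)
        elif len(targets) >= 2:
            result["split"].append(i)
    return result


# Toy values (hypothetical): A and B merge into X, B also feeds Y.
sim = {("A", "X"): 0.8, ("B", "X"): 0.7, ("B", "Y"): 0.6}
flu = {("A", "X"): 0.3, ("B", "X"): 0.0, ("B", "Y"): -0.4}
result = classify_evolution(
    sim, flu, prev_ids=["A", "B", "C"], curr_ids=["X", "Y", "Z"], rho=0.5, lam=0.1
)
```

In this toy run, X is a fusion of A and B, B splits into X and Y, C dies, and Z is newly emerged. A single pair can carry a growth, continuation, or decline label at the same time as its topics take part in a fusion or split.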
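Data type | Search time range | Database source | Search expression | Source category | Document type | Results (records)
English papers | 2001/01/01-2020/12/21 | Web of Science | TS=(“Artificial Intelligence” OR AI) | Web of Science | Papers | 29,030
English patents | 2001/01/01-2020/12/18 | incoPat | (TIAB=(Artificial Intelligence)) AND (AD=[20010101 to 20201218]) | US patents, World Intellectual Property Organization, European Patent Office | Invention applications, invention grants | 7,865
English web pages | 2001/01/01-2020/12/23 | Artificial intelligence (MIT News) | - | News articles | Web pages | 500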
Data Source
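Model parameter | Description
α | Dirichlet prior of the document collection over latent topics, α=50/K
β | Dirichlet prior of latent topics over the feature-word set, β=0.02
K | Optimal number of topics, 60
niters | Number of Gibbs sampling iterations, niters=1,000
twords | Number of feature words per topic, twords=30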
Description of LDA Model Parameters
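
To show where these parameters enter the model, here is a minimal collapsed Gibbs sampler for LDA in plain Python. It is a teaching sketch under stated assumptions, not the tool the authors used: the function name `lda_gibbs`, the `seed` argument, and the toy corpus are hypothetical, and the defaults simply mirror the values in the table (α=50/K, β=0.02, K=60, niters=1,000, twords=30).

```python
import random


def lda_gibbs(docs, K=60, beta=0.02, niters=1000, twords=30, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    alpha = 50.0 / K  # document-topic prior, alpha = 50/K as in the table
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    word_id = {w: v for v, w in enumerate(vocab)}

    ndk = [[0] * K for _ in docs]      # document-topic counts
    nkw = [[0] * V for _ in range(K)]  # topic-word counts
    nk = [0] * K                       # total tokens per topic
    z = []                             # topic assignment of every token

    # Random initial assignment.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(K)
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][word_id[w]] += 1
            nk[k] += 1

    for _ in range(niters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                k = z[d][n]
                # Remove the token, then resample its topic.
                ndk[d][k] -= 1
                nkw[k][v] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][v] + beta) / (nk[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][n] = k
                ndk[d][k] += 1
                nkw[k][v] += 1
                nk[k] += 1

    # Keep the twords highest-count words of each topic.
    top_words = []
    for k in range(K):
        ranked = sorted(range(V), key=lambda v: nkw[k][v], reverse=True)
        top_words.append([vocab[v] for v in ranked[:twords]])
    return top_words, nkw


# Toy corpus with small settings so the example runs instantly.
toy_docs = [
    ["neural", "network", "training"],
    ["robot", "control", "motion"],
    ["neural", "network", "robot"],
]
top_words, nkw = lda_gibbs(toy_docs, K=2, niters=50, twords=3)
```

With the table's settings, one would call `lda_gibbs(docs)` on the tokenized corpus and leave the defaults in place. A pure-Python sampler is slow on tens of thousands of documents, so it is meant only to illustrate the role of each parameter.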
TSNE Dimension Reduction Visualization for “Year-Topic” Matrix
Time period | Time range | Number of documents
1 | 2001-2006 | 2,615
2 | 2007-2011 | 2,825
3 | 2012-2016 | 5,127
4 | 2017-2020 | 25,143
Time Segmentation Result
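
Once the period boundaries in this table are fixed, assigning documents to periods is a simple lookup. The sketch below assumes each record is a (publication year, text) pair. The names `PERIODS`, `period_of`, and `split_by_period` and the sample records are hypothetical.

```python
# (period number, first year, last year), taken from the table above.
PERIODS = [(1, 2001, 2006), (2, 2007, 2011), (3, 2012, 2016), (4, 2017, 2020)]


def period_of(year):
    """Return the period number for a year, or None if it is out of range."""
    for period, start, end in PERIODS:
        if start <= year <= end:
            return period
    return None


def split_by_period(records):
    """Group (year, text) records into one list of texts per period."""
    buckets = {period: [] for period, _, _ in PERIODS}
    for year, text in records:
        period = period_of(year)
        if period is not None:
            buckets[period].append(text)
    return buckets


# Hypothetical records for illustration.
records = [(2003, "doc a"), (2011, "doc b"), (2019, "doc c"), (1999, "doc d")]
buckets = split_by_period(records)
```

Records outside 2001-2020 are dropped here. Each bucket would then be modeled separately to obtain the topics of that period.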
Perplexity Curve for Each Time Period
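
A perplexity curve is commonly used to choose the number of topics, with lower perplexity indicating a better fit on the evaluated documents. The page does not give the formula used, so the sketch below assumes the standard definition, the exponential of the negative average per-token log-likelihood, and assumes that the topic number with the lowest perplexity is selected. The function names and the numbers in `curve` are hypothetical.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum of document log-likelihoods) / (total number of tokens))."""
    total_tokens = sum(doc_lengths)
    return math.exp(-sum(doc_log_likelihoods) / total_tokens)


def best_topic_number(perplexity_by_k):
    """Return the candidate topic number with the lowest perplexity."""
    return min(perplexity_by_k, key=perplexity_by_k.get)


# Sanity check: a model that spreads probability uniformly over 4 words
# gives every token log-likelihood log(0.25), so the perplexity is 4.
uniform = perplexity([10 * math.log(0.25)], [10])

# Hypothetical curve for one time period.
curve = {10: 950.0, 15: 870.0, 20: 905.0}
chosen_k = best_topic_number(curve)
```

In practice the log-likelihoods would come from a model fitted for each candidate topic number in each time period, and the curve would be inspected rather than trusted blindly, since perplexity often keeps falling slowly as topics are added.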
Topics and topic IDs by time period (ID prefix = time period)
1_T0 planning technology | 1_T1 behavior simulation | 1_T2 decision-making | 1_T3 medical diagnosis | 1_T4 pattern recognition | 1_T5 case-based reasoning | 1_T6 genetic algorithm | 1_T7 decision support system | 1_T8 machine-learning | 1_T9 fuzzy-logic | 1_T10 neural network | 1_T11 robot | 1_T12 intelligent agent | 1_T13 process control | 1_T14 mathematical computing
2_T0 forecast | 2_T1 data mining | 2_T2 machine learning | 2_T3 logic reasoning | 2_T4 decision making | 2_T5 feature extraction | 2_T6 support vector machine | 2_T7 genetic algorithm | 2_T8 intelligent agent | 2_T9 neural network prediction | 2_T10 process control | 2_T11 information retrieval | 2_T12 robotics | 2_T13 artificial neural network | 2_T14 human
3_T0 risk evaluation | 3_T1 forecast | 3_T2 resource management | 3_T3 case-based reasoning | 3_T4 random forest | 3_T5 neural network | 3_T6 mathematical model | 3_T7 hierarchical database | 3_T8 big data | 3_T9 classification | 3_T10 decision trees | 3_T11 fuzzy logic | 3_T12 neuro-fuzzy inference system | 3_T13 simulation | 3_T14 industrial robots | 3_T15 intelligent voice | 3_T16 computational complexity | 3_T17 intelligent agent | 3_T18 human-computer interaction | 3_T19 image processing | 3_T20 optimization algorithm | 3_T21 real-time monitoring | 3_T22 field application | 3_T23 robot
4_T0 smart wearable device | 4_T1 particle swarm optimization | 4_T2 evolutionary computation | 4_T3 intelligent manufacturing | 4_T4 internet of things | 4_T5 virtual reality | 4_T6 remote diagnosis | 4_T7 neuro-fuzzy inference system | 4_T8 security | 4_T9 feature extraction | 4_T10 motion control | 4_T11 medical image processing | 4_T12 intelligent life | 4_T13 transfer learning | 4_T14 computational complexity | 4_T15 object tracking | 4_T16 big data | 4_T17 social network | 4_T18 representation learning | 4_T19 neural network | 4_T20 reinforcement learning | 4_T21 machine learning | 4_T22 information retrieval | 4_T23 machine translation | 4_T24 disease diagnosis | 4_T25 knowledge graph | 4_T26 decision-making | 4_T27 forecasting | 4_T28 computational complexity | 4_T29 natural language processing | 4_T30 distributed system | 4_T31 time series model | 4_T32 deep neural network | 4_T33 robotics | 4_T34 anomaly detection | 4_T35 question & answering | 4_T36 3D | 4_T37 ontology | 4_T38 pattern recognition | 4_T39 fuzzy evaluation | 4_T40 information retrieval | 4_T41 clustering
Topic Recognition Results in Each Time Period
Topic Evolution Path
Path Evolution Diagram of Higher Topic Importance