Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (6): 22-34     https://doi.org/10.11925/infotech.2096-3467.2019.1155
Research Paper
Evolution Analysis of Hot Topics with Trend-Prediction*
Yue Lixin1, Liu Ziqiang2,3, Hu Zhengyin2,3
1School of Information Resource Management, Renmin University of China, Beijing 100872, China
2Chengdu Library of Chinese Academy of Sciences, Chengdu 610041, China
3Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
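Abstract (translated from Chinese)

[Objective] To build sound mathematical and content-prediction models at two levels, external quantitative features and internal textual features, and use them to predict the evolution trends of hot research topics. [Methods] Topics are identified with the LDA model and assembled into topic time series; hot topics are then determined by combining mean values with linear regression fitting. The ARIMA model and the Word2Vec model are used to predict hot-topic trends at two levels: topic intensity and topic content. [Results] An empirical study of the stem cell field in the United States identified five hot topics (hematopoietic stem cell transplantation technology, cancer stem cells and stem cell inhibition, induced differentiation of stem cells, derived gamete technology, and hematopoietic stem cells) and predicted their development trends. [Limitations] The Word2Vec-based analysis of topic content trends works mainly at the level of individual words, so interpretation may be ambiguous. [Conclusions] Compared with trend-prediction methods that rely mainly on manual interpretation, the proposed method can improve the efficiency and rigor of predictive analysis to some extent.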

Keywords: trend prediction; hot topics; ARIMA model; Word2Vec model; topic evolution
Abstract

[Objective] This paper constructs mathematical and content-prediction models based on the external (quantitative) and internal (textual) characteristics of academic articles, aiming to predict the evolution of trending research topics. [Methods] We identified topics with the LDA model and constructed their time series. We then determined the hot topics by combining mean values with linear regression fitting. Finally, we predicted the trends of the hot topics with the ARIMA and Word2Vec models, in terms of topic intensity and topic content respectively. [Results] We conducted an empirical study of stem cell research in the United States, identified five hot topics, and predicted their development trends. [Limitations] Interpretation may be ambiguous, because the Word2Vec-based analysis of topic content trends works mainly on single words. [Conclusions] Compared with methods based mainly on manual interpretation, the proposed method can, to some extent, make trend prediction more efficient and more rigorous.

Key words: Trend Prediction; Hot Topics; ARIMA Model; Word2Vec; Topic Evolution
Received: 2019-10-22      Published: 2020-07-07
CLC number (ZTFLH):  G350
Funding: *This work is one of the research outcomes of the project "Research Informatization Application for Knowledge Discovery in the Stem Cell Field", part of the Research Informatization Application Engineering program under the Chinese Academy of Sciences "13th Five-Year" Informatization Special Project (XXH13506-203).
Corresponding author: Liu Ziqiang     E-mail: 1224615932@qq.com
Cite this article:
Yue Lixin, Liu Ziqiang, Hu Zhengyin. Evolution Analysis of Hot Topics with Trend-Prediction[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 22-34.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.1155      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I6/22
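Model        Autocorrelation function (ACF)    Partial autocorrelation function (PACF)
AR(p)        Tails off                         Cuts off after lag p
MA(q)        Cuts off after lag q              Tails off
ARMA(p, q)   Tails off after lag q             Tails off after lag p
Table 1  Model parameter determination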
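The patterns in Table 1 can be checked numerically. The following is a minimal sketch, not the authors' code: it simulates a hypothetical AR(1) series (the coefficient 0.7, the series length, and all function names are assumptions chosen for illustration) and computes the sample ACF and the sample PACF, the latter with the Durbin-Levinson recursion. For such a series the ACF should decay gradually while the PACF should be close to zero after lag 1.

```python
import random

def sample_acf(x, max_lag):
    """Sample autocorrelations rho_0 .. rho_max_lag of the series x."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
            for k in range(max_lag + 1)]

def sample_pacf(x, max_lag):
    """Sample partial autocorrelations via the Durbin-Levinson recursion."""
    rho = sample_acf(x, max_lag)
    pacf = [1.0]
    prev = []  # prev[j] holds phi_{k-1, j+1}
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the order-k coefficients from the order-(k-1) ones.
        prev = [prev[j] - phi_kk * prev[k - 2 - j] for j in range(k - 1)] + [phi_kk]
        pacf.append(phi_kk)
    return pacf

def simulate_ar1(phi, n, seed=0):
    """Hypothetical AR(1) process x_t = phi * x_{t-1} + e_t with Gaussian noise."""
    rng = random.Random(seed)
    x, last = [], 0.0
    for _ in range(n):
        last = phi * last + rng.gauss(0.0, 1.0)
        x.append(last)
    return x

series = simulate_ar1(phi=0.7, n=2000)
acf = sample_acf(series, 5)
pacf = sample_pacf(series, 5)
for lag in range(1, 6):
    print(f"lag {lag}: ACF={acf[lag]:+.3f}  PACF={pacf[lag]:+.3f}")
```

In practice a library routine would normally be used for this; the hand-written recursion only keeps the sketch free of dependencies.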
Fig.1  Schematic of the CBOW and Skip-Gram models[29]
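Since the figure itself is not reproduced here, the difference between the two architectures can be illustrated by the training examples each one consumes: Skip-Gram predicts each context word from the center word, while CBOW predicts the center word from its surrounding context. The sketch below only generates those examples for a toy token list; the tokens, the window size, and the function names are assumptions for illustration, and no embedding is trained.

```python
def skipgram_pairs(tokens, window):
    """(center, context) pairs: Skip-Gram predicts each context word from the center."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """(context words, center) examples: CBOW predicts the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, center))
    return examples

# Hypothetical toy sentence, not taken from the paper's corpus.
tokens = ["hematopoietic", "stem", "cell", "transplantation"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```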
Fig.2  Annual distribution of the number of papers
Fig.3  Determining the optimal number of topics
Topic No.   Topic words
Topic1      acute|intestinal|hematopoietic|kinase|term|epithelial|promote|transplantation|expansion|marrow
Topic2      pathway|cancer|embryonic|hematopoietic|virus|cell|aldehyde|signature|maintenance|gland
Topic3      regulation|marrow|hematopoietic|embryonic|biology|inhibitor|cancer|pluripotent|rescue|inhibits
Topic4      resistance|cell|effect|hematopoietic|cancer|imaging|transplantation|long|bioactive|colorectal
Topic5      cell|cancer|breast|pancreatic|new|hematopoietic|targeting|transplantation|pluripotency|embryonic
……          ……
Table 2  Research topics in the US stem cell field (partial list)
Fig.4  Topic time series in the stem cell field (2000-2018)
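The abstract states that hot topics are determined from the topic time series by combining mean values with linear regression fitting. One possible reading of that rule is sketched below. It assumes, and this is not stated on this page, that a topic counts as hot when its mean intensity is at least the average over all topics and its fitted linear slope is positive. The topic names, intensity values, and function names are hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx

def select_hot_topics(years, intensity_by_topic):
    """Assumed rule: mean intensity >= cross-topic average AND positive trend."""
    means = {name: sum(vals) / len(vals) for name, vals in intensity_by_topic.items()}
    threshold = sum(means.values()) / len(means)
    hot = []
    for name, vals in intensity_by_topic.items():
        if means[name] >= threshold and ols_slope(years, vals) > 0:
            hot.append(name)
    return hot

# Hypothetical topic intensities per year (not data from the paper).
years = [2000, 2001, 2002, 2003, 2004]
intensity_by_topic = {
    "TopicA": [0.02, 0.03, 0.04, 0.05, 0.06],  # strong and rising
    "TopicB": [0.06, 0.05, 0.04, 0.03, 0.02],  # strong but declining
    "TopicC": [0.01, 0.01, 0.02, 0.01, 0.02],  # rising but weak
}
print(select_hot_topics(years, intensity_by_topic))
```

Other combinations of the two criteria (for example a fixed intensity threshold, or a significance test on the slope) would fit the same description; the thresholds here are placeholders.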
ARIMA(p, d, q)    BIC        ARIMA(p, d, q)    BIC
ARIMA(0, 0, 1)    -77.88     ARIMA(1, 2, 0)    -111.36
ARIMA(0, 0, 2)    -80.42     ARIMA(1, 2, 1)    -113.48
ARIMA(0, 1, 1)    -127.98    ARIMA(1, 2, 2)    -107.37
ARIMA(0, 1, 2)    -119.46    ARIMA(2, 0, 0)    -133.28
ARIMA(0, 2, 1)    -110.23    ARIMA(2, 0, 1)    -140.00
ARIMA(0, 2, 2)    -109.99    ARIMA(2, 0, 2)    -125.84
ARIMA(1, 0, 0)    -136.18    ARIMA(2, 1, 0)    -130.73
ARIMA(1, 0, 1)    -139.63    ARIMA(2, 1, 1)    -127.64
ARIMA(1, 0, 2)    -132.44    ARIMA(2, 1, 2)    -116.59
ARIMA(1, 1, 0)    -136.13    ARIMA(2, 2, 0)    -112.62
ARIMA(1, 1, 1)    -129.06    ARIMA(2, 2, 1)    -115.63
ARIMA(1, 1, 2)    -117.46    ARIMA(2, 2, 2)    -103.39
Table 3  Model parameter determination
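A grid of candidate orders like the one in Table 3 is usually read with the minimum-BIC criterion: the order with the lowest BIC is preferred. The sketch below applies that criterion to the values transcribed from Table 3. Which order the authors actually adopted is not stated on this page, so the result should be read only as what the usual criterion would pick; the variable and function names are hypothetical.

```python
# BIC values transcribed from Table 3, keyed by the ARIMA order (p, d, q).
bic_by_order = {
    (0, 0, 1): -77.88,  (0, 0, 2): -80.42,  (0, 1, 1): -127.98,
    (0, 1, 2): -119.46, (0, 2, 1): -110.23, (0, 2, 2): -109.99,
    (1, 0, 0): -136.18, (1, 0, 1): -139.63, (1, 0, 2): -132.44,
    (1, 1, 0): -136.13, (1, 1, 1): -129.06, (1, 1, 2): -117.46,
    (1, 2, 0): -111.36, (1, 2, 1): -113.48, (1, 2, 2): -107.37,
    (2, 0, 0): -133.28, (2, 0, 1): -140.00, (2, 0, 2): -125.84,
    (2, 1, 0): -130.73, (2, 1, 1): -127.64, (2, 1, 2): -116.59,
    (2, 2, 0): -112.62, (2, 2, 1): -115.63, (2, 2, 2): -103.39,
}

def select_min_bic(table):
    """Return the (order, bic) pair with the smallest BIC."""
    return min(table.items(), key=lambda item: item[1])

best_order, best_bic = select_min_bic(bic_by_order)
print(f"Lowest BIC: ARIMA{best_order} with BIC = {best_bic:.2f}")
```

The runner-up is close in BIC, so a residual check such as the one summarized in Fig.5 is still needed before settling on an order.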
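Fig.5  Model diagnostic results
Fig.6  Forecast of hot-topic intensity trends
Fig.7  Forecast of hot-topic content trends
Fig.8  Research hotspot map of the stem cell field based on VOSviewer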
References
[1] Liu Xiaoping, Leng Fuhai, Li Zexia. Methods and Approaches of International S&T Front Analysis[J]. Library and Information Service, 2012, 56(12): 60-65. (in Chinese)
[2] Liu Ziqiang, Wang Xiaoyue, Bai Rujiang. Research on Visualization Analysis Method of Discipline Topics Evolution from the Perspective of Multi-Dimensions: A Case Study of the Big Data in the Field of Library and Information Science in China[J]. Journal of Library Science in China, 2016, 42(6): 67-84. (in Chinese)
[3] Jing Fachong, Li Chenying, Han Mingjie, et al. Topic Analysis of Projects from Emerging Frontiers Division of NSF’s Directorate for Biological Science Based on Text Mining[J]. Journal of Modern Information, 2014, 34(12): 107-112. (in Chinese)
[4] Liu Ziqiang, Wang Xiaoyue, Bai Rujiang. Research on the Forecasting Method of Research Hotspots Analysis Based on Time Series Model[J]. Information Studies: Theory & Application, 2016, 39(5): 27-33. (in Chinese)
[5] Xu Xiaoyang, Zheng Yanning, Liu Zhihui. Study on the Method of Identifying Research Fronts Based on Scientific Papers and Patents[J]. Library and Information Service, 2016, 60(24): 97-106. (in Chinese)
[6] Yu G, Wang M Y, Yu D R. Characterizing Knowledge Diffusion of Nanoscience & Nanotechnology by Citation Analysis[J]. Scientometrics, 2010, 84: 81-97. doi: 10.1007/s11192-009-0090-2
[7] Hou Jianhua, Wang Zhongyu. The Measurement of Knowledge Flow in Research Subject with an Empirical Analysis: Taking H-index Study as an Example[J]. Library and Information Service, 2017, 61(10): 87-93. (in Chinese)
[8] Bai Rujiang, Leng Fuhai. Knowledge Innovational Evolution Analysis Based on k-clique Community Network[J]. Library and Information Service, 2013, 57(17): 86-94. (in Chinese)
[9] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003, 3: 993-1022.
[10] Blei D M, Lafferty J. Dynamic Topic Models[C]// Proceedings of the 23rd International Conference on Machine Learning. New York: ACM, 2006: 113-120.
[11] Fan Yunman, Ma Jianxia. Detection of Emerging Topics Based on LDA and Feature Analysis of Emerging Topics[J]. Journal of the China Society for Scientific and Technical Information, 2014, 33(7): 698-711. (in Chinese)
[12] Wang Xiaoyue, Liu Ziqiang, Bai Rujiang, et al. The Method of Research Front Topic Detection Based on the Fund Project Data[J]. Library and Information Service, 2017, 61(13): 87-98. (in Chinese)
[13] Rosvall M, Bergstrom C T. Mapping Change in Large Networks[J]. PLoS ONE, 2010, 5(1): e8694. doi: 10.1371/journal.pone.0008694; pmid: 20111700
[14] Wang Xiaoguang, Cheng Qikai. Analysis on Evolution of Research Topics in a Discipline Based on NEViewer[J]. Journal of the China Society for Scientific and Technical Information, 2013, 32(9): 900-911. (in Chinese)
[15] Yan E. Research Dynamics, Impact, and Dissemination: A Topic-Level Analysis[J]. Journal of the Association for Information Science and Technology, 2015, 66(11): 2357-2372. doi: 10.1002/asi.2015.66.issue-11
[16] Zhou Yuan, Zhang Chao, Tang Jie, et al. Intelligent Identification of Field Development Trajectory Based on Topic Evolution: A Case Study of Artificial Intelligence[J]. Library and Information Service, 2018, 62(14): 62-71. (in Chinese)
[17] Jaccard P. The Distribution of Flora in the Alpine Zone[J]. New Phytologist, 1912, 11(2): 37-50. doi: 10.1111/nph.1912.11.issue-2
[18] Qi Yashuang, Zhu Na, Zhai Yujia. A Comparative Study on Topic Heats Evolution in the Field of Information Science Between the Domestic and Foreign Research Based on DTM[J]. Library and Information Service, 2016, 60(16): 99-109. (in Chinese)
[19] Chen Wei, Lin Chaoran, Li Jinqiu, et al. Analysis of the Evolutionary Trend of Technical Topics in Patents Based on LDA and HMM: Taking Marine Diesel Engine Technology as an Example[J]. Journal of the China Society for Scientific and Technical Information, 2018, 37(7): 732-741. (in Chinese)
[20] Li Jing, Xu Lulu, Zhao Sujun. Prediction and Visualization of Emerging Topics of Fund Sponsored Projects Based on Time Series Analysis and SVM Model[J]. Information Studies: Theory & Application, 2019, 42(1): 118-123, 152. (in Chinese)
[21] Liu Ziqiang, Wang Xiaoyue, Bai Rujiang. Research on the Discipline Topic Evolution Analysis Method of Semantic Classification: A Case Study of Big Data in the Field of Library and Information Science in China[J]. Library and Information Service, 2016, 60(15): 76-85, 93. (in Chinese)
[22] Guan Peng, Wang Yuefen, Fu Zhu. Analyzing Topic Semantic Evolution with LDA: Case Study of Lithium Ion Batteries[J]. Data Analysis and Knowledge Discovery, 2019, 3(7): 61-72. (in Chinese)
[23] Shen Wenjuan, Li Mingshi, Huang Chengquan. Review of Remote Sensing Algorithms for Monitoring Forest Disturbance from Time Series and Multi-source Data Fusion[J]. Journal of Remote Sensing, 2018, 22(6): 1005-1022. (in Chinese)
[24] Zhang Wenqiu, Fang Lei, Yang Jian, et al. Reconstruction of Stand-replacement Disturbance and Stand Age of Chinese Fir Plantation Based on a Landsat Time Series in Huitong County, Hunan[J]. Chinese Journal of Ecology, 2018, 37(11): 3467-3479. (in Chinese)
[25] Yang Binqing, Zhang Xilin. Forecast of Price of Rare Earths Neodymium Oxide and Dysprosium Oxide Based on ARIMA Time Series Model[J]. Journal of the Chinese Society of Rare Earths, 2017, 35(5): 680-686. (in Chinese)
[26] Zhang Meiying, He Jie. Summary on Time Series Forecasting Model[J]. Mathematics in Practice and Theory, 2011, 41(18): 189-195. (in Chinese)
[27] Yue Lixin, Zhou Xiaoying, Chen Yini. Thematic Trend Prediction of Information Architecture Based on the ARIMA Model[J]. Documentation, Information & Knowledge, 2019(5): 54-63. (in Chinese)
[28] Zhou Lian. Exploration of the Working Principle and Application of Word2vec[J]. Sci-Tech Information Development & Economy, 2015, 25(2): 145-148. (in Chinese)
[29] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and Their Compositionality[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013: 3111-3119.
[30] Hu Zhigang, Lin Gege, Sun Taian, et al. Research on Spotlights Analysis for Different Regions in China by VOSviewer[J]. Science and Management, 2017, 37(4): 44-51, 79. (in Chinese)
[31] Ji Lijun. Analysis on Information Literacy Hotspots at Home and Abroad Between 2016 and 2018 with VOSviewer[J]. Contemporary Library, 2019(3): 23-28. (in Chinese)
[32] Hou Haiyan, Guo Fangqi, Sun Taian, et al. Analysis of the Domestic and International Research Situation of Biotechnology in Shandong Province by VOSviewer[J]. Science and Management, 2018, 38(2): 25-33. (in Chinese)
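Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing; Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn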