Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (1): 64-75    DOI: 10.11925/infotech.2096-3467.2017.1114
Original Article
Analyzing Topic Evolution with Topic Filtering and Relevance
Qu Jiabin1,2, Ou Shiyan1
1(School of Information Management, Nanjing University, Nanjing 210023, China)
2(Yantai University Library, Yantai 264005, China)
Abstract  

[Objective] Many of the topics identified by the LDA model are irrelevant, which reduces the accuracy of topic evolution analysis. This paper constructs topic evolution paths and analyzes topic evolution by filtering out noise topics and calculating the relevance between topics. [Methods] First, we filtered out irrelevant topics based on their probability of appearing in all documents and on the word propensity distribution of each topic. Then, we calculated the Jensen-Shannon divergence to identify related topics. Finally, we constructed the topic evolution paths based on the correlation between topics. [Results] We examined the effectiveness of the proposed method with scientific literature on "machine learning", which yielded five evolution paths: rebirth, extinction, succession, division and merger. [Limitations] The threshold values were estimated and therefore involve some subjective judgment. [Conclusions] The proposed method avoids the interference of noise topics and identifies relevant topics in adjacent time intervals, which helps reveal the evolution of discipline topics more accurately.
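
The relevance step in the abstract relies on the Jensen-Shannon divergence between two topics. The following is a minimal sketch of the standard definition, not the authors' implementation; it assumes each topic is given as a probability distribution over the same vocabulary, and the three example distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical topic-word distributions over a shared four-word vocabulary.
topic_a = [0.5, 0.3, 0.1, 0.1]
topic_b = [0.4, 0.4, 0.1, 0.1]
topic_c = [0.0, 0.0, 0.5, 0.5]

near = js_divergence(topic_a, topic_b)  # similar topics give a small value
far = js_divergence(topic_a, topic_c)   # dissimilar topics give a larger value
```

With base-2 logarithms the divergence is symmetric and lies between 0 (identical distributions) and 1 (distributions with no shared words), so a smaller value indicates more closely related topics.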

Key words: Discipline Topics Evolution; Topic Filtering; LDA Topic Model; Evolution Analysis
Received: 07 November 2017      Published: 05 February 2018
ZTFLH:  TP393  

Cite this article:

Qu Jiabin, Ou Shiyan. Analyzing Topic Evolution with Topic Filtering and Relevance. Data Analysis and Knowledge Discovery, 2018, 2(1): 64-75.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.1114     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I1/64

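Words in a semantically clear topic | Word distribution probability | Words in a semantically unclear topic | Word distribution probability
multi-class classification | 0.03297 | natural language processing | 0.02686
support vector machine | 0.02890 | texture | 0.02533
sample | 0.02484 | corpus | 0.02488
incremental learning | 0.02022 | phrase | 0.02450
classification | 0.02004 | human face | 0.02419
test | 0.01625 | model | 0.02374
KNN | 0.01576 | noun | 0.02389
Information entropy | 0.18 | Information entropy | 0.22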
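
The entropy row above summarizes how evenly a topic spreads probability over its words. As a rough illustration, the sketch below computes the standard Shannon entropy over only the seven words listed for each topic, renormalized to sum to 1. This is an assumption for illustration: the table does not state how its values of 0.18 and 0.22 were obtained, so the absolute numbers here will differ, and only the ordering is comparable.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits of a distribution, after renormalizing it to sum to 1."""
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)

# Probabilities of the seven words listed for each topic in the table above.
clear_topic = [0.03297, 0.02890, 0.02484, 0.02022, 0.02004, 0.01625, 0.01576]
unclear_topic = [0.02686, 0.02533, 0.02488, 0.02450, 0.02419, 0.02374, 0.02389]

h_clear = shannon_entropy(clear_topic)
h_unclear = shannon_entropy(unclear_topic)
```

The semantically unclear topic has the flatter word distribution, so its entropy is higher, which matches the direction of the values reported in the table.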
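
Topic No. | Topic | Topic words
T 4 | Pattern recognition | rough set theory, classification, pattern recognition, remote sensing, construction, evaluation, benchmark, function, loss, adaptability
T 8 | Support vector machine | multi-class classification, fuzzy support vector machine, sample, incremental learning, classification, test, KNN, function, recall, sample
T 21 | Fault diagnosis | diagnosis, minimization, empirical, risk, change, sample, support vector machine, high-dimensional, linear, enterprise
T 35 | Image recognition | image, feature, kernel function, support vector machine, recognition, algorithm, classifier, detection, filtering, space
T 64 | Agent system | Agent, service, reinforcement learning, guidance, mathematical model, function, annotation, simulation, strategy, branch
T 53 | Naive Bayes | intelligence, system, naive Bayes, learning, efficiency, decision, environment, knowledge, rule, prior
T 23 | / | natural language, texture, corpus, phrase, human face, processing, model, noun, statistics, information
T 26 | / | random, fuzziness, programming, graduation, statistical analysis, final, incomplete data, handling, official document, knowledge
T 48 | / | correctness, query, purpose, process pattern, subjective, indicator, place name, dam, teaching quality, identifier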

Time window | Number of documents | Topics identified | Topics after filtering
2007-2008 | 996 | 80 | 31
2009-2010 | 973 | 95 | 33
2011-2012 | 989 | 76 | 29
2013-2014 | 1,089 | 88 | 35
2015-2016 | 1,886 | 115 | 52
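
To make the effect of filtering easier to compare across windows, the counts in the table above can be turned into the share of identified topics that survive filtering. This is simple arithmetic on the reported counts; the variable names are illustrative.

```python
# (topics identified, topics after filtering) per time window, from the table above.
windows = {
    "2007-2008": (80, 31),
    "2009-2010": (95, 33),
    "2011-2012": (76, 29),
    "2013-2014": (88, 35),
    "2015-2016": (115, 52),
}

# Share of identified topics kept after filtering, per window.
retained = {w: kept / found for w, (found, kept) in windows.items()}

for window, share in retained.items():
    print(f"{window}: {share:.1%} of topics kept")
```

In every window fewer than half of the identified topics are kept, with the largest share in 2015-2016.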

Topic | $T_{11-12}^{26}$ | $T_{11-12}^{68}$ | $T_{11-12}^{13}$ | $T_{11-12}^{7}$
$T_{13-14}^{27}$ | 0.00052 | 0.00031 | 0.00105 | 0.00131
$T_{13-14}^{31}$ | 0.00333 | 0.02381 | 0.00093 | 0.00601
$T_{13-14}^{66}$ | 0.00493 | 0.00512 | 0.00104 | 0.00079
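
Assuming the values in the table above are Jensen-Shannon divergences between 2013-2014 topics (rows) and 2011-2012 topics (columns), related topics can be picked out by keeping the pairs whose divergence falls below a threshold. The sketch below is illustrative only: the cut-off of 0.001 is a hypothetical value chosen for this example, not the threshold used in the paper, and treating a topic with several predecessors as a merger is likewise an assumption.

```python
# Divergence values copied from the table above.
# Keys are 2013-2014 topic numbers; list positions follow the 2011-2012 topics.
earlier = [26, 68, 13, 7]
js = {
    27: [0.00052, 0.00031, 0.00105, 0.00131],
    31: [0.00333, 0.02381, 0.00093, 0.00601],
    66: [0.00493, 0.00512, 0.00104, 0.00079],
}
THRESHOLD = 0.001  # hypothetical cut-off, not taken from the paper

def related_topics(js, earlier, threshold):
    """Map each later topic to the earlier topics whose divergence is below the threshold."""
    return {
        later: [e for e, d in zip(earlier, row) if d < threshold]
        for later, row in js.items()
    }

links = related_topics(js, earlier, THRESHOLD)

# Assumed rule: a later topic with more than one related predecessor is a merger.
mergers = [later for later, sources in links.items() if len(sources) > 1]
```

Under this assumed threshold, topic 27 of 2013-2014 is linked to both topics 26 and 68 of 2011-2012, while topics 31 and 66 each have a single predecessor. A different threshold would change these links, which is the subjectivity the abstract lists as a limitation.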
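
Topic ID | Topic | Topic words
$T_{07-08}^{35}$ | Image recognition | image, feature, kernel function, support vector machine, recognition, algorithm, classifier, detection, filtering, space
$T_{09-10}^{32}$ | Image recognition | image, human brain, edge, classification rule, feature, operation, artificial intelligence, recognition, image retrieval, construction
$T_{11-12}^{26}$ | Image feature extraction and annotation | image annotation, recognition, feature, detection, extraction, classification, model, mechanism, recognition method, effect
$T_{11-12}^{68}$ | Image feature annotation | image, annotation, image feature, semantic description, semantics, SOM algorithm, computer vision, automatic, region, component
$T_{13-14}^{27}$ | Image annotation and recognition | image, annotation, feature, detection, recognition, feature extraction, human face, algorithm, target, space
$T_{15-16}^{41}$ | Image recognition based on deep learning | deep learning, image processing, vision, action recognition, multi-layer, domain, computer vision, low-level, high-level, learning
$T_{15-16}^{87}$ | Applications of image recognition | image processing, artificial intelligence, accuracy, feature extraction, function, representation, image classification, sequence, DNA, two-dimensional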