Data Analysis and Knowledge Discovery, 2019, Vol. 3, Issue 1: 46-54. https://doi.org/10.11925/infotech.2096-3467.2018.1365
Mining Innovative Topics Based on Deep Learning
Changlei Fu (1), Li Qian (1, 2), Huaping Zhang (3), Huaming Zhao (1), Jing Xie (1, 2)
(1) National Science Library, Chinese Academy of Sciences, Beijing 100190, China
(2) Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
(3) School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
Abstract

[Objective] This paper aims to mine innovative topics from massive volumes of text data. [Methods] Starting from scholarly knowledge graph data, we scored each knowledge point on three indicators, "popularity", "novelty" and "authority", and selected those with higher weights as innovation seeds. We then computed the knowledge associations of the innovation seeds along the paths of the knowledge graph. Finally, the results were fed into a deep learning model trained on a large volume of sci-tech papers to generate innovative topics. The model is a Sequence to Sequence model built from bidirectional LSTM layers. [Results] We used Chinese sci-tech papers in the field of artificial intelligence as the experimental data. The mined topics were judged manually by experts, and the average innovation score was 6.52. [Limitations] The knowledge richness and connectivity of the knowledge graph are currently limited, and the quality and size of the training set need further improvement. [Conclusions] The proposed model mines innovative topics from text data, but the overall performance of the innovative topic identification model still needs further refinement and optimization.

Keywords: Innovative Topic; Deep Learning; Seq2Seq; Intelligent Mining
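The first two steps of the method (selecting innovation seeds by weight, then computing associations along knowledge graph paths) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the equal-weight combination of min-max normalized indicators, the hop-limited breadth-first search used as a stand-in for "knowledge association", the function names `select_seeds` and `related_points`, and the toy data. The Bi-LSTM Sequence to Sequence generation step is not sketched.

```python
from collections import deque


def min_max(values):
    """Scale a list of numbers to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def select_seeds(points, top_k=2, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Rank knowledge points by a weighted sum of three normalized indicators.

    points maps a name to (popularity, novelty, authority).
    The linear combination and its weights are assumptions, not the paper's formula.
    """
    names = list(points)
    columns = [min_max([points[n][i] for n in names]) for i in range(3)]
    scores = {
        n: sum(w * columns[i][j] for i, w in enumerate(weights))
        for j, n in enumerate(names)
    }
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(names, key=lambda n: (-scores[n], n))
    return ranked[:top_k], scores


def related_points(graph, seed, max_hops=2):
    """Return {node: hop distance} for nodes within max_hops of the seed (BFS)."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if dist[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    del dist[seed]
    return dist


# Hypothetical toy data: (popularity, novelty, authority) per knowledge point.
points = {
    "knowledge graph": (120, 0.4, 0.9),
    "graph neural network": (80, 0.9, 0.7),
    "expert system": (30, 0.1, 0.8),
    "capsule network": (20, 0.8, 0.3),
}
# Hypothetical adjacency list standing in for a scholarly knowledge graph.
graph = {
    "knowledge graph": ["entity linking", "graph neural network"],
    "graph neural network": ["knowledge graph", "link prediction"],
    "entity linking": ["knowledge graph", "question answering"],
    "link prediction": ["graph neural network", "recommendation"],
}

seeds, scores = select_seeds(points, top_k=2)
related = related_points(graph, seeds[0], max_hops=2)
print(seeds)
print(related)
```

In a full pipeline, each seed together with its associated knowledge points would form the input sequence for the generation model. How the paper actually encodes that input is not stated in the abstract.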
Received: 2018-12-04      Published: 2019-03-04
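Funding: *This work is one of the research outcomes of the Youth Innovation Promotion Association of the Chinese Academy of Sciences (Grant No. 院1721) and of the project "Research and Development of an Innovative Idea Topic Generation Robot" (Grant No. JW1701).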
Cite this article:
Changlei Fu, Li Qian, Huaping Zhang, Huaming Zhao, Jing Xie. Mining Innovative Topics Based on Deep Learning[J]. Data Analysis and Knowledge Discovery, 2019, 3(1): 46-54.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1365      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I1/46