Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (7): 52-60    DOI: 10.11925/infotech.2096-3467.2018.0914
Extracting Keywords with Topic Embedding and Network Structure Analysis
Qingtian Zeng1,2, Xiaohui Hu2, Chao Li1,3
1(College of Electronic Information Engineering, Shandong University of Science and Technology, Qingdao 266590, China)
2(College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China)
3(Key Laboratory of Embedded System and Service Computing (Tongji University), Ministry of Education, Shanghai 201804, China)
Abstract  

[Objective] This paper proposes a new model for extracting topic keywords, aiming to detect low-frequency words with high topic relevance. [Methods] We designed a topic keyword extraction method that integrates topic embedding with network structure analysis. First, we extracted a preliminary set of topic keywords with the LDA model and trained word vectors with the Word2Vec model. Then, we built a network based on word vector similarity and identified the final topic keywords through network structure analysis. [Results] The new method improved the average similarity between topic keywords by 14.75%, and it extracted low-frequency keywords with high topic relevance more effectively than the LDA model. [Limitations] The sample size needs to be expanded, and the word segmentation process still requires manual adjustment. More research is needed to analyze the topic keywords quantitatively. [Conclusions] Our method helps improve abstracting and public opinion analysis.

Key words: Network Structure Analysis; Word Embeddings; Topic Model; Keywords Extraction; Representation Learning
Received: 19 August 2018      Published: 06 September 2019
CLC Number: TP393 G35
Corresponding Authors: Chao Li     E-mail: 1008lichao@163.com

Cite this article:

Qingtian Zeng, Xiaohui Hu, Chao Li. Extracting Keywords with Topic Embedding and Network Structure Analysis. Data Analysis and Knowledge Discovery, 2019, 3(7): 52-60.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0914     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I7/52

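Parameter | Description | Default
-sentence | Corpus used for training | (none)
-size | Dimensionality of the word vectors | 100
-window | Size of the sliding window used in training | 5
-min_count | Minimum word count | 5
-negative | Number of "noise words" | 5
-hs | Selects the training algorithm | 0
-sg | Selects the model used | 0
-workers | Number of worker threads | 3
-sample | Sampling threshold | 1e-3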
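
The defaults in the table above can be captured in a small configuration object. The following is a minimal sketch, not the authors' code: the class `Word2VecParams` and the helper `to_kwargs` are hypothetical names introduced here for illustration. The parameter names happen to match the keyword arguments of the gensim `Word2Vec` constructor in versions before 4.0 (where `sg=0` selects CBOW and `hs=0` with `negative > 0` selects negative sampling), but the source does not say which implementation was used, so treat that reading as an assumption.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Word2VecParams:
    """Hypothetical container holding the defaults listed in the table."""
    size: int = 100        # dimensionality of the word vectors
    window: int = 5        # sliding window size
    min_count: int = 5     # minimum word count
    negative: int = 5      # number of "noise words"
    hs: int = 0            # training algorithm switch
    sg: int = 0            # model switch
    workers: int = 3       # worker threads
    sample: float = 1e-3   # sampling threshold


def to_kwargs(params, **overrides):
    """Return the parameters as a dict, applying any overrides.

    Unknown override names raise KeyError so that typos are caught early.
    """
    kwargs = asdict(params)
    unknown = set(overrides) - set(kwargs)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    kwargs.update(overrides)
    return kwargs


default_kwargs = to_kwargs(Word2VecParams())
skipgram_kwargs = to_kwargs(Word2VecParams(), sg=1, size=200)

# Assumption, not stated in the source: with gensim < 4.0 these could be
# passed as Word2Vec(corpus, **default_kwargs). gensim 4.0 renamed `size`
# to `vector_size`, so the key would need to be renamed there.
```

Keeping the defaults in one frozen object makes it easy to record exactly which settings an experiment deviated from.
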
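Stop word type | Examples
Punctuation and other special symbols | , 、 : 《 》 etc.
Year, month, and date expressions | 2016年 (the year 2016), 3月 (March), etc.
Single characters left after word segmentation | 人 (person), 区 (district), 校 (school), 期 (period), etc.
Frequent words with no substantive meaning | 通知 (notice), 关于 (regarding), 做好 (do well), 组织 (organize), etc.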
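
The four stop word types above translate naturally into a token filter applied after word segmentation. The sketch below is illustrative only: the function `filter_tokens`, the date pattern, and the small `FREQUENT_EMPTY_WORDS` set (seeded with the four examples from the table) are assumptions made here, since the source does not give the actual rules or the full stop word list.

```python
import re
import unicodedata

# Example entries taken from the table; the full list is not given in the source.
FREQUENT_EMPTY_WORDS = {"通知", "关于", "做好", "组织"}

# Assumed pattern for date tokens such as "2016年" or "3月".
DATE_PATTERN = re.compile(r"^\d+[年月日]$")


def is_symbol(token):
    """True if every character is punctuation, a symbol, or a separator."""
    return all(unicodedata.category(ch)[0] in "PSZ" for ch in token)


def filter_tokens(tokens):
    """Drop tokens that fall into any of the four stop word types."""
    kept = []
    for token in tokens:
        if is_symbol(token):                # punctuation and special symbols
            continue
        if DATE_PATTERN.match(token):       # year / month / day expressions
            continue
        if len(token) == 1:                 # single characters after segmentation
            continue
        if token in FREQUENT_EMPTY_WORDS:   # frequent words without real meaning
            continue
        kept.append(token)
    return kept


segmented = ["关于", "国家奖学金", "2016年", "、", "区", "停电", "3月", "《"]
cleaned = filter_tokens(segmented)
```

In this toy input only 国家奖学金 and 停电 survive; every other token matches one of the four types.
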
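Keyword | 教学 (teaching) | 停电 (power outage) | SCI | 国家奖学金 (National Scholarship)
1 | 实习 (internship) | 停水 (water outage) | EI | 省政府奖学金 (Provincial Government Scholarship)
2 | 培养 (training) | 封闭 (closure) | 收录 (indexed) | 国家励志奖学金 (National Encouragement Scholarship)
3 | 课程 (course) | 停暖 (heating outage) | SSCI | 国家助学金 (National Grant)
4 | 课堂 (classroom) | 楼房 (building) | CSSCI | 奖学金 (scholarship)
5 | 立项 (project approval) | 供水 (water supply) | 索引 (index) | 上海创立奖学金 (Shanghai Chuangli Scholarship)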
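
A ranked list of neighbours like the one above can be produced by comparing word vectors with cosine similarity, and the same similarities can serve as edge weights in a word network. The sketch below shows both steps under loudly stated assumptions: the three-dimensional vectors are made-up toy values, not trained embeddings; the function names and the 0.8 threshold are hypothetical; and weighted degree is used only as a simple stand-in, because the source does not specify which network structure measure the authors applied.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(word, vectors, topn=5):
    """Return the topn (word, similarity) pairs closest to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]


def build_network(vectors, threshold):
    """Undirected weighted graph: connect words whose similarity >= threshold."""
    words = sorted(vectors)
    edges = {w: {} for w in words}
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            sim = cosine(vectors[a], vectors[b])
            if sim >= threshold:
                edges[a][b] = sim
                edges[b][a] = sim
    return edges


def weighted_degree(edges):
    """Sum of edge weights per node (a stand-in network measure)."""
    return {w: sum(nbrs.values()) for w, nbrs in edges.items()}


# Toy vectors invented for illustration; real ones would come from Word2Vec.
toy_vectors = {
    "停电": [0.9, 0.1, 0.0],
    "停水": [0.8, 0.2, 0.0],
    "停暖": [0.7, 0.3, 0.1],
    "教学": [0.0, 0.2, 0.9],
    "课程": [0.1, 0.1, 0.8],
}

neighbours = most_similar("停电", toy_vectors, topn=2)
network = build_network(toy_vectors, threshold=0.8)
centrality = weighted_degree(network)
```

With these toy values the outage-related words form one connected group and the teaching-related words another, which is the kind of structure a network analysis step could exploit to keep low-frequency but topically close words.
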