Extracting Keywords with Topic Embedding and Network Structure Analysis
Qingtian Zeng1,2, Xiaohui Hu2, Chao Li1,3
1(College of Electronic Information Engineering, Shandong University of Science and Technology, Qingdao 266590, China) 2(College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China) 3(Key Laboratory of Embedded System and Service Computing (Tongji University), Ministry of Education, Shanghai 201804, China)
[Objective] This paper proposes a new model for extracting topic keywords, with the aim of detecting low-frequency words that are highly relevant to a topic. [Methods] We designed a topic keyword extraction method that integrates topic embedding with network structure analysis. First, we extracted a preliminary set of topic keywords with the LDA model. Second, we trained word vectors with the Word2Vec model. Third, we built a network based on word vector similarity and identified the final topic keywords through network structure analysis. [Results] The new method improved the average similarity between topic keywords by 14.75%. It also extracted low-frequency keywords with high topic relevance more effectively than the LDA model. [Limitations] The sample size needs to be expanded, and the word segmentation process still requires manual adjustment. More research is needed to analyze the topic keywords quantitatively. [Conclusions] The proposed method can improve automatic abstracting and public opinion analysis.
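The third step of the method can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how edges are formed or which network measure is used, so the similarity threshold (0.5) and the weighted degree ranking below are assumptions, and the toy vectors merely stand in for Word2Vec vectors of LDA candidate keywords. All function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_similarity_network(vectors, threshold=0.5):
    """Link two candidate words when their similarity reaches the threshold.

    The threshold value is an assumption made for this sketch.
    """
    edges = {}
    for w1, w2 in combinations(sorted(vectors), 2):
        sim = cosine(vectors[w1], vectors[w2])
        if sim >= threshold:
            edges[(w1, w2)] = sim
    return edges


def rank_by_weighted_degree(vectors, edges):
    """Rank words by the sum of their edge weights (an assumed measure)."""
    score = {w: 0.0 for w in vectors}
    for (w1, w2), sim in edges.items():
        score[w1] += sim
        score[w2] += sim
    # Highest score first; ties are broken alphabetically.
    return sorted(score, key=lambda w: (-score[w], w))


# Toy data: stand-ins for Word2Vec vectors of preliminary LDA keywords.
candidates = {
    "election": [0.90, 0.10, 0.00],
    "vote": [0.80, 0.20, 0.10],
    "ballot": [0.85, 0.15, 0.05],
    "weather": [0.00, 0.10, 0.90],
}

edges = build_similarity_network(candidates, threshold=0.5)
ranking = rank_by_weighted_degree(candidates, edges)
top_keywords = ranking[:3]
print(top_keywords)
```

In this toy example the three semantically close words form a connected cluster and rank above the isolated word, regardless of how often any of them occurs. This is the intuition behind recovering low-frequency but topically relevant keywords from the network structure.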