Topic Recognition and Key-Phrase Extraction with Phrase Representation Learning
Zhang Jinzhu1,2, Yu Wenqian1
1School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China
2Jiangsu Province Social Public Safety Science and Technology Collaborative Innovation Center, Nanjing 210094, China
[Objective] This paper designs a topic recognition and key-phrase extraction method based on phrase representation learning, aiming to address topic recognition from a more specific perspective. [Methods] First, we constructed sequences of the phrases extracted with dependency syntax analysis. Second, we modified a word representation learning model to learn semantic vectors for phrases. Third, we developed a topic recognition method based on clustering these vectors. Fourth, we constructed topic-phrase sequences from the phrases and their corresponding topic category numbers. Finally, we proposed a Topic-Phrase to Vector (TP2Vec) model to extract topic-related phrases. [Results] Compared with the LDA model, the proposed model reduced the average similarity among topics by up to 0.27. The extracted representative words were semantically related to the topics, and the results were more readable and interpretable. [Limitations] More research is needed to examine the proposed method with data sets from other fields. [Conclusions] The proposed method can effectively identify research topics and related phrases, and might be applied to other fields.
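The steps in [Methods] and the metric in [Results] can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-dimensional phrase vectors, the fixed centroids (standing in for the output of a clustering step), the `TOPIC_k` tag format, and all function names are assumptions made only for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def assign_topics(phrase_vectors, centroids):
    """Map each phrase to the index of its most similar centroid.

    The centroids are assumed to come from a separate clustering step.
    """
    return {
        phrase: max(range(len(centroids)),
                    key=lambda k: cosine(vec, centroids[k]))
        for phrase, vec in phrase_vectors.items()
    }


def topic_phrase_sequence(phrases, topic_of):
    """Pair each phrase with its topic number in one flat sequence.

    The TOPIC_k tag format is hypothetical, chosen for illustration.
    """
    sequence = []
    for phrase in phrases:
        sequence.append(f"TOPIC_{topic_of[phrase]}")
        sequence.append(phrase)
    return sequence


def average_topic_similarity(centroids):
    """Mean pairwise cosine similarity among topic centroids."""
    pairs = list(combinations(centroids, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy phrase vectors (assumed values, not taken from the paper).
phrase_vectors = {
    "topic_model": [1.0, 0.1],
    "word_embedding": [0.9, 0.2],
    "citation_network": [0.1, 1.0],
    "co_word_analysis": [0.2, 0.9],
}
centroids = [[1.0, 0.0], [0.0, 1.0]]

topic_of = assign_topics(phrase_vectors, centroids)
sequence = topic_phrase_sequence(["topic_model", "citation_network"], topic_of)
avg_sim = average_topic_similarity(centroids)
```

A lower `average_topic_similarity` value means the topics are better separated, which is the sense in which the abstract reports a reduction relative to LDA.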