Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (2): 50-60    DOI: 10.11925/infotech.2096-3467.2020.0060
Topic Recognition and Key-Phrase Extraction with Phrase Representation Learning
Zhang Jinzhu1,2, Yu Wenqian1
1School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China
2Jiangsu Province Social Public Safety Science and Technology Collaborative Innovation Center, Nanjing 210094, China
Abstract  

[Objective] This paper proposes a topic recognition and key-phrase extraction method based on phrase representation learning, aiming to characterize topics from a more specific, phrase-level perspective. [Methods] First, we constructed sequences of the extracted phrases using dependency syntax analysis. Second, we modified a word representation learning model to produce semantic vectors for phrases. Third, we developed a topic recognition method based on clustering these vectors. Fourth, we constructed topic-phrase sequences from the phrases and their corresponding topic category numbers. Finally, we proposed a Topic-Phrase to Vector (TP2Vec) model to extract topic-related phrases. [Results] Compared with the LDA model, the proposed model reduced the average similarity among topics by up to 0.27. The extracted representative words were semantically related to the topics, and the results were more readable and interpretable. [Limitations] More research is needed to examine the proposed method with data sets from other fields. [Conclusions] The proposed method effectively identifies research topics and their related phrases, and might be applied to other fields.
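
The fourth step of the method, pairing each phrase with its topic category number, can be illustrated with a minimal sketch. The abstract does not give the exact sequence format used to train TP2Vec, so the (phrase, topic number) pair layout, the function name `build_topic_phrase_sequence`, and the sample topic numbers below are assumptions for illustration only.

```python
def build_topic_phrase_sequence(phrase_sequence, phrase_to_topic):
    """Pair every phrase with the topic (cluster) number it was assigned to.

    Assumption: the training input is a list of (phrase, topic number) pairs;
    the abstract does not specify the real format.
    """
    return [
        (phrase, phrase_to_topic[phrase])
        for phrase in phrase_sequence
        if phrase in phrase_to_topic  # skip phrases that were never clustered
    ]


# Illustrative input: phrases in document order and made-up topic numbers.
phrase_sequence = ["search engine", "retrieval performance", "impact factor"]
phrase_to_topic = {
    "search engine": 4,
    "retrieval performance": 4,
    "impact factor": 5,
}

topic_phrase_sequence = build_topic_phrase_sequence(phrase_sequence, phrase_to_topic)
print(topic_phrase_sequence)
```

Keeping the phrase and its topic number side by side is one simple way to let a later embedding model see both in the same context.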

Keywords: Topic Recognition; Topic Key-Phrase; Representation Learning; Semantic Vector
Received: 15 January 2020      Published: 11 March 2021
ZTFLH:  G350  
Fund: National Natural Science Foundation of China (71974095); National Social Science Fund of Jiangsu Province (17TQC003); National Natural Science Foundation of China (71503125)
Corresponding Authors: Zhang Jinzhu ORCID:0000-0001-7581-1850     E-mail: zhangjinzhu@njust.edu.cn

Cite this article:

Zhang Jinzhu, Yu Wenqian. Topic Recognition and Key-Phrase Extraction with Phrase Representation Learning. Data Analysis and Knowledge Discovery, 2021, 5(2): 50-60.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0060     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I2/50

[Figure: Generation of Phrase Sequence]
[Figure: Formulation of TP2Vec]
[Figure: Clusters and Visualization Based on K-Means]
Cluster   Number of words  Examples of high-frequency representative words
Cluster1  1,208            web; internet; data; social science; information science
Cluster2  1,046            public library; scientometric analysis; paper analysis; paper study; comparative study
Cluster3  887              scientific field; scientific literature; scientific collaboration; computer science; scientific research
Cluster4  774              IR; search engine; information system; WOS; information retrieval system
Cluster5  610              natural science; bibliometric; academic research; scientific discipline; SSCI
Table: Clustering Result Based on K-Means
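
As a rough illustration of how phrase vectors can be grouped into clusters like those above, here is a minimal plain-Python K-Means (Lloyd's algorithm). It is a sketch only: the two-dimensional vectors are made-up toy values, and the function name `kmeans`, the seed, and the iteration count are assumptions rather than the authors' settings.

```python
import math
import random


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's K-Means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    # Start from k distinct input vectors chosen at random.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels


# Toy phrase vectors (hypothetical values, not taken from the paper).
phrase_vectors = {
    "search engine": [0.9, 0.1],
    "information retrieval system": [0.8, 0.2],
    "impact factor": [0.1, 0.9],
    "citation analysis": [0.2, 0.8],
}
phrases = list(phrase_vectors)
labels = kmeans([phrase_vectors[p] for p in phrases], k=2)

clusters = {}
for phrase, label in zip(phrases, labels):
    clusters.setdefault(label, []).append(phrase)
print(clusters)
```

The cluster indices themselves are arbitrary; only the grouping of phrases is meaningful.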
Model   Topic   Topic representative words or phrases
LDA     Topic1  network; analysis; technology; knowledge; method
LDA     Topic2  citation; journal; paper; article; patent
LDA     Topic3  study; search; system; user; result
LDA     Topic4  science; country; publication; paper; collaboration
LDA     Topic5  document; system; retrieval; method; model
TP2Vec  Topic1  information science; library science; Lotka’s law; Zipf’s law; Bradford’s law
TP2Vec  Topic2  natural language process; similarity measure; relation extraction; SVM; K-Means
TP2Vec  Topic3  scientific community; collaboration network; scientific communication; collaboration pattern; co-authorship network
TP2Vec  Topic4  information retrieval; search engine; information retrieval system; retrieval performance; search tactics
TP2Vec  Topic5  bibliometric analysis; impact factor; webometrics; h-index; citation analysis
Table: Key-phrase Comparison Between LDA and TP2Vec
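
One plausible way to pick a topic's key-phrases, once topics and phrases are embedded in the same vector space, is to rank phrases by cosine similarity to the topic vector. This page does not state how TP2Vec actually selects phrases, so the ranking rule, the helper names `cosine` and `top_phrases`, and the toy vectors below are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_phrases(topic_vector, phrase_vectors, n):
    """Return the n phrases whose vectors lie closest to the topic vector."""
    ranked = sorted(
        phrase_vectors,
        key=lambda phrase: cosine(topic_vector, phrase_vectors[phrase]),
        reverse=True,
    )
    return ranked[:n]


# Toy vectors (hypothetical values, not taken from the paper).
topic_vector = [1.0, 0.0]
phrase_vectors = {
    "h-index": [0.1, 0.9],
    "search engine": [0.9, 0.1],
    "retrieval performance": [0.7, 0.3],
}

key_phrases = top_phrases(topic_vector, phrase_vectors, n=2)
print(key_phrases)
```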
[Figure: Key-phrase Visualization of LDA and TP2Vec]
Model   Top 10  Top 20  Top 30  Top 40  Top 50  Top 60  Top 70  Top 80  Top 90  Top 100
LDA     0.310   0.400   0.427   0.461   0.474   0.481   0.480   0.496   0.515   0.533
TP2Vec  0.100   0.128   0.195   0.245   0.267   0.307   0.308   0.316   0.325   0.378
Table: Average Similarity Among Topics as the Number of Key-phrases Varies
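
The sketch below does two things. First, it shows one way an "average similarity among topics" could be computed, as the mean cosine similarity over all pairs of topic vectors; the page does not say which similarity measure the authors used, so that choice and the function names are assumptions. Second, it recomputes the gap between LDA and TP2Vec from the values in the table above, which is where the "up to 0.27" figure in the abstract appears to come from.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_topic_similarity(topic_vectors):
    """Mean cosine similarity over every unordered pair of topic vectors."""
    pairs = list(combinations(topic_vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Values copied from the table above (top-N key-phrases per topic).
top_n = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
lda = [0.310, 0.400, 0.427, 0.461, 0.474, 0.481, 0.480, 0.496, 0.515, 0.533]
tp2vec = [0.100, 0.128, 0.195, 0.245, 0.267, 0.307, 0.308, 0.316, 0.325, 0.378]

# Reduction in average inter-topic similarity at each cut-off.
gaps = [round(a - b, 3) for a, b in zip(lda, tp2vec)]
largest_gap = max(gaps)
largest_gap_at = top_n[gaps.index(largest_gap)]
print(largest_gap, "at top", largest_gap_at)
```

A lower average similarity means the topics overlap less, which is why the reduction is reported as an improvement over LDA.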