Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (2): 50-60     https://doi.org/10.11925/infotech.2096-3467.2020.0060
Research Paper
Topic Recognition and Key-Phrase Extraction with Phrase Representation Learning
Zhang Jinzhu (1, 2), Yu Wenqian (1)
(1) School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China
(2) Jiangsu Province Social Public Safety Science and Technology Collaborative Innovation Center, Nanjing 210094, China
Abstract

[Objective] This paper proposes a method for topic recognition and topic key-phrase extraction based on phrase representation learning, exploiting the fact that phrases are more specific and more representative than single words. [Methods] First, we extracted phrases with dependency parsing and built phrase sequences. Second, treating each phrase sequence as a word sequence, we extended a word representation learning model into a phrase representation learning model to obtain semantic vectors for phrases. Third, we clustered the phrase vectors to recognize topics. Fourth, we paired each phrase with the topic category number assigned by clustering to build topic-phrase sequences. Finally, we proposed the Topic-Phrase to Vector (TP2Vec) model, which represents topics and phrases in the same vector space, computes their similarity, and extracts the phrases most related to each topic as its representative terms. [Results] Compared with the LDA model, the proposed method reduced the average similarity among topics by up to 0.27, indicating more distinguishable topics. The extracted representative phrases were semantically related to their topics, specific, and distinctive, which made the results more readable and interpretable. [Limitations] The method still needs to be validated in other fields and on other data sets. [Conclusions] The proposed method achieves better results in recognizing research topics and extracting their representative phrases, and may be extended to other fields.
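The sequence-building steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a hand-written phrase list stands in for the dependency-parsing step, and the function names, the underscore-joined tokens, the `TOPIC_n` labels, and the choice to drop tokens that have no topic are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]


def to_phrase_sequence(tokens: List[str], phrases: List[Phrase]) -> List[str]:
    """Merge known multi-word phrases into single tokens, longest match first."""
    by_length = sorted((p for p in phrases if p), key=len, reverse=True)
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        for phrase in by_length:
            if tuple(tokens[i:i + len(phrase)]) == phrase:
                merged.append("_".join(phrase))
                i += len(phrase)
                break
        else:
            # No phrase starts at position i: keep the plain word.
            merged.append(tokens[i])
            i += 1
    return merged


def to_topic_phrase_sequence(
    seq: List[str], topic_of: Dict[str, int]
) -> List[Tuple[str, str]]:
    """Pair each clustered phrase with its topic label (tokens without a topic are dropped)."""
    return [(tok, f"TOPIC_{topic_of[tok]}") for tok in seq if tok in topic_of]


# Hypothetical input: the phrase list would come from dependency parsing.
tokens = "we evaluate an information retrieval system with citation analysis".split()
phrases = [
    ("information", "retrieval"),
    ("information", "retrieval", "system"),
    ("citation", "analysis"),
]
seq = to_phrase_sequence(tokens, phrases)

# Hypothetical cluster assignments produced by the clustering step.
topic_of = {"information_retrieval_system": 4, "citation_analysis": 5}
tp_seq = to_topic_phrase_sequence(seq, topic_of)
print(seq)
print(tp_seq)
```

Because each phrase becomes one token, `seq` can be passed unchanged to any word-embedding trainer, which is the sense in which a word representation model is extended to phrases. The pairs in `tp_seq` are one possible encoding of a topic-phrase sequence.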

Keywords: Topic Recognition; Topic Key-Phrase; Representation Learning; Semantic Vector
Received: 2020-01-15      Published: 2021-03-11
CLC Number:  G350
Funding: *National Natural Science Foundation of China (71974095); Social Science Foundation of Jiangsu Province (17TQC003); National Natural Science Foundation of China, Youth Program (71503125)
Corresponding author: Zhang Jinzhu     ORCID: 0000-0001-7581-1850     E-mail: zhangjinzhu@njust.edu.cn
Cite this article:
Zhang Jinzhu, Yu Wenqian. Topic Recognition and Key-Phrase Extraction with Phrase Representation Learning[J]. Data Analysis and Knowledge Discovery, 2021, 5(2): 50-60.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0060      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I2/50
Fig. 1  Phrase sequence generation
Fig. 2  Representation of the TP2Vec model
Fig. 3  K-Means clustering and visualization
Cluster    Terms   Example high-frequency representative terms
Cluster1   1,208   web; internet; data; social science; information science
Cluster2   1,046   public library; scientometric analysis; paper analysis; paper study; comparative study
Cluster3   887     scientific field; scientific literature; scientific collaboration; computer science; scientific research
Cluster4   774     IR; search engine; information system; WOS; information retrieval system
Cluster5   610     natural science; bibliometric; academic research; scientific discipline; SSCI
Table 1  K-Means clustering results
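A clustering step like the one that produced Table 1 can be sketched in plain Python. This is only an illustration under stated assumptions: the 2-D vectors are invented toy values rather than learned phrase embeddings, the centroids are initialised from the first k points to keep the run deterministic, and the source does not say which K-Means implementation or settings were used.

```python
import math
from typing import Dict, List, Sequence, Tuple


def kmeans(points: Sequence[Tuple[float, ...]], k: int, iters: int = 20) -> List[int]:
    """Plain K-Means: returns one cluster label per point."""
    # Assumption: deterministic initialisation from the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(xs) / len(members) for xs in zip(*members)]
    return labels


# Hypothetical 2-D phrase vectors; real phrase embeddings have many more dimensions.
vectors: Dict[str, Tuple[float, float]] = {
    "search engine": (0.9, 0.1),
    "scientific collaboration": (0.1, 0.9),
    "information system": (0.8, 0.2),
    "scientific research": (0.2, 0.8),
}
labels = kmeans(list(vectors.values()), k=2)
clusters = dict(zip(vectors, labels))
print(clusters)
```

The cluster label attached to each phrase is what the later topic-phrase step uses as the topic category number.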
Model    Topic    Topic representative terms
LDA      Topic1   network; analysis; technology; knowledge; method
LDA      Topic2   citation; journal; paper; article; patent
LDA      Topic3   study; search; system; user; result
LDA      Topic4   science; country; publication; paper; collaboration
LDA      Topic5   document; system; retrieval; method; model
TP2Vec   Topic1   information science; library science; Lotka’s law; Zipf’s law; Bradford’s law
TP2Vec   Topic2   natural language process; similarity measure; relation extraction; SVM; K-Means
TP2Vec   Topic3   scientific community; collaboration network; scientific communication; collaboration pattern; co-authorship network
TP2Vec   Topic4   information retrieval; search engine; information retrieval system; retrieval performance; search tactics
TP2Vec   Topic5   bibliometric analysis; impact factor; webometrics; h-index; citation analysis
Table 2  Comparison of topic representative terms from LDA and TP2Vec
Fig. 4  Visualization of topic representative terms from LDA and TP2Vec
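Once topics and phrases share one vector space, representative phrases for a topic can be chosen by similarity to the topic vector. The sketch below shows that ranking step only. The vectors are made-up 2-D values, and cosine similarity and the helper names are assumptions, since the source states only that similarity is computed in the shared space.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_phrases(
    topic_vec: Sequence[float], phrase_vecs: Dict[str, Sequence[float]], n: int
) -> List[str]:
    """Return the n phrases most similar to the topic vector."""
    ranked = sorted(
        phrase_vecs, key=lambda p: cosine(topic_vec, phrase_vecs[p]), reverse=True
    )
    return ranked[:n]


# Hypothetical vectors standing in for learned topic and phrase embeddings.
topic_vec = (1.0, 0.2)
phrase_vecs = {
    "citation analysis": (0.1, 0.9),
    "search engine": (0.8, 0.3),
    "information retrieval": (0.9, 0.1),
}
best = top_phrases(topic_vec, phrase_vecs, n=2)
print(best)
```

With these toy values the two retrieval-related phrases rank ahead of "citation analysis", mirroring how Topic4 in Table 2 gathers retrieval phrases.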
Model    Top 10  Top 20  Top 30  Top 40  Top 50  Top 60  Top 70  Top 80  Top 90  Top 100
LDA      0.310   0.400   0.427   0.461   0.474   0.481   0.480   0.496   0.515   0.533
TP2Vec   0.100   0.128   0.195   0.245   0.267   0.307   0.308   0.316   0.325   0.378
Table 3  Average similarity among topics as the number of topic representative terms varies
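The "up to 0.27" reduction reported in the abstract can be checked directly against Table 3. The short calculation below uses only the values in the table; the variable names are illustrative.

```python
# Values copied from Table 3.
cutoffs = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
lda = [0.310, 0.400, 0.427, 0.461, 0.474, 0.481, 0.480, 0.496, 0.515, 0.533]
tp2vec = [0.100, 0.128, 0.195, 0.245, 0.267, 0.307, 0.308, 0.316, 0.325, 0.378]

# Reduction in average inter-topic similarity at each cutoff (LDA minus TP2Vec).
gaps = {n: round(a - b, 3) for n, a, b in zip(cutoffs, lda, tp2vec)}
best_n = max(gaps, key=gaps.get)

for n, gap in gaps.items():
    print(f"top {n:>3}: {gap:.3f}")
print(f"largest reduction: {gaps[best_n]:.3f} at top {best_n}")
```

The largest gap is 0.272, reached with the top 20 terms, and TP2Vec is lower than LDA at every cutoff.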