Extracting Keywords Based on Sememe Similarity
Yan Qiang1,2, Zhang Xiaoyan2, Zhou Simin2
1 School of Modern Post (School of Automation), Beijing University of Posts and Telecommunications, Beijing 100876, China
2 School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract [Objective] This study introduces word semantics into the TextRank algorithm to improve keyword extraction. [Methods] First, we used semantic information from HowNet to calculate the similarity between words. Then, we constructed a semantic graph and matrix from word pairs whose similarity passed a threshold. Finally, we weighted the semantic matrix and the co-occurrence matrix to obtain the transition probability matrix. [Results] On short texts, the improved algorithm outperformed TextRank, TF-IDF and LDA, increasing F-scores by 6.6%, 9.0% and 10.3%, respectively. On long texts, its results were inferior to TF-IDF but close to TextRank. [Limitations] The word segmentation program could not reliably identify compound words, new words and entities, so some extracted keywords were incomplete, which reduced F-scores. The semantic similarity algorithm could also be improved. [Conclusions] The proposed method effectively extracts keywords from short texts by combining co-occurrence and semantic relations between words.
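The three steps in [Methods] can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: the function name `rank_keywords`, the weight `alpha`, the `threshold`, `window` and `damping` values, the row normalization, and the toy similarity score, which stands in for a HowNet sememe similarity and is not real HowNet output.

```python
from itertools import combinations


def row_normalize(matrix):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([v / total if total else 0.0 for v in row])
    return result


def rank_keywords(tokens, similarity, threshold=0.5, alpha=0.5,
                  window=2, damping=0.85, iterations=50):
    """Rank words with a TextRank-style walk over a blended matrix.

    All parameter names and defaults are illustrative assumptions.
    `similarity` maps word pairs to scores in [0, 1].
    """
    vocab = sorted(set(tokens))
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)

    # Co-occurrence matrix: count pairs inside a sliding window
    # (window=2 links each token to the one that follows it).
    cooc = [[0.0] * n for _ in range(n)]
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                a, b = index[word], index[other]
                cooc[a][b] += 1.0
                cooc[b][a] += 1.0

    # Semantic matrix: keep only pairs whose similarity passes the threshold.
    sem = [[0.0] * n for _ in range(n)]
    for u, v in combinations(vocab, 2):
        score = similarity.get((u, v), similarity.get((v, u), 0.0))
        if score >= threshold:
            sem[index[u]][index[v]] = sem[index[v]][index[u]] = score

    # Weighted combination of the two row-normalized matrices,
    # renormalized so each non-empty row is a probability distribution.
    cooc_n, sem_n = row_normalize(cooc), row_normalize(sem)
    trans = row_normalize(
        [[alpha * cooc_n[i][j] + (1 - alpha) * sem_n[i][j]
          for j in range(n)] for i in range(n)])

    # Power iteration with a damping factor, as in PageRank/TextRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[i] * trans[i][j] for i in range(n))
                  for j in range(n)]
    return sorted(zip(vocab, scores), key=lambda p: p[1], reverse=True)


# Toy input: a pre-segmented text and one hand-made similarity score.
tokens = ["keyword", "extraction", "graph", "keyword", "ranking"]
toy_similarity = {("extraction", "ranking"): 0.8}
ranking = rank_keywords(tokens, toy_similarity)
```

Setting `alpha=1.0` in this sketch ignores the semantic matrix and reduces the walk to a plain co-occurrence TextRank, which makes it easy to compare the two variants on the same text.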
Received: 31 July 2020
Published: 24 November 2020
Fund: National Social Science Fund of China (17AGL026); BUPT Excellent Ph.D. Students Foundation (CX2019128)
Corresponding Author: Yan Qiang, E-mail: yan@bupt.edu.cn