Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (4): 80-89    DOI: 10.11925/infotech.2096-3467.2020.0748
Extracting Keywords Based on Sememe Similarity
Yan Qiang (1, 2), Zhang Xiaoyan (2), Zhou Simin (2)
(1) School of Modern Post (School of Automation), Beijing University of Posts and Telecommunications, Beijing 100876, China
(2) School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract  

[Objective] This study incorporates word semantics into the TextRank algorithm to improve the performance of keyword extraction. [Methods] First, we used semantic information from HowNet to calculate the similarity between words. Then, we constructed a semantic graph and matrix from the word pairs whose similarity passed a threshold. Finally, we weighted the semantic matrix and the co-occurrence matrix to obtain the transition probability matrix. [Results] On short texts, the improved algorithm outperformed TextRank, TF-IDF and LDA, raising the F-score by 6.6%, 9.0% and 10.3% respectively, measured relative to each baseline. On long texts, its results were inferior to TF-IDF but close to TextRank. [Limitations] The word segmentation program could not reliably identify compound words, new words and named entities, which produced incomplete keywords and lowered the F-scores. The semantic similarity algorithm could also be improved. [Conclusions] By combining the co-occurrence and semantic relations of words, the proposed method extracts keywords from short texts effectively.
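The three steps in [Methods] can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the default values of `threshold`, `weight` and `damping`, and the choice to apply `weight` to the semantic matrix are all assumptions, and this page does not state which of the paper's parameters λ and η plays which role. The word similarity matrix is taken as given here; in the paper it is computed from HowNet.

```python
def row_normalize(matrix):
    """Scale each row to sum to 1; an all-zero row becomes uniform."""
    n = len(matrix)
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else [1.0 / n] * n)
    return out


def semantic_textrank(cooc, sim, threshold=0.5, weight=0.4,
                      damping=0.85, tol=1e-8, max_iter=200):
    """Rank words by a random walk over a blended transition matrix.

    cooc: square co-occurrence count matrix (list of lists).
    sim:  square word-similarity matrix with values in [0, 1].
    All default parameter values are hypothetical.
    """
    n = len(cooc)
    # Keep only word pairs whose similarity passes the threshold (no self-loops).
    sem = [[sim[i][j] if i != j and sim[i][j] >= threshold else 0.0
            for j in range(n)] for i in range(n)]
    sem_p, cooc_p = row_normalize(sem), row_normalize(cooc)
    # A convex combination of two row-stochastic matrices is row-stochastic,
    # so the result is a valid transition probability matrix.
    trans = [[weight * sem_p[i][j] + (1 - weight) * cooc_p[i][j]
              for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(max_iter):
        # Standard damped power iteration, as in PageRank and TextRank.
        new = [(1 - damping) / n
               + damping * sum(scores[i] * trans[i][j] for i in range(n))
               for j in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores


# Toy example with three words; the numbers are made up for illustration.
cooc = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
sim = [[1.0, 0.2, 0.9], [0.2, 1.0, 0.1], [0.9, 0.1, 1.0]]
print(semantic_textrank(cooc, sim))
```

Setting `weight=0` in this sketch reduces it to a walk over the co-occurrence graph alone, which is a convenient sanity check against plain TextRank.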

Key words: TextRank Extraction; Sememe; Word Similarity
Received: 31 July 2020; Published: 24 November 2020
ZTFLH: TP393
Fund: National Social Science Fund of China (17AGL026); BUPT Excellent Ph.D. Students Foundation (CX2019128)
Corresponding Author: Yan Qiang, E-mail: yan@bupt.edu.cn

Cite this article:

Yan Qiang, Zhang Xiaoyan, Zhou Simin. Extracting Keywords Based on Sememe Similarity. Data Analysis and Knowledge Discovery, 2021, 5(4): 80-89.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0748     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I4/80

The Sememe Tree of “Lianxiang”
Research Framework
Word Graph under Different Threshold Values of Word Similarity
Algorithm            λ      η          P       R       F
Semantic+TextRank    0.05   0.4        0.368   0.337   0.352
Semantic+TextRank    0.1    0.4        0.367   0.335   0.350
Semantic+TextRank    0.15   0.2, 0.4   0.364   0.333   0.348
Semantic+TextRank    0.2    0.1        0.361   0.330   0.345
Semantic+TextRank    0.25   0.2        0.358   0.327   0.342
Semantic+TextRank    0.3    0.1        0.355   0.325   0.339
TextRank             -      -          0.355   0.325   0.339
TF-IDF               -      -          0.369   0.337   0.352
LDA                  -      -          0.314   0.287   0.300
Algorithm Performance of Keyword Extraction
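The F column in the table above is consistent with the balanced F-score, the harmonic mean of precision (P) and recall (R). The page does not state the formula, so treating F as F1 is an assumption; the short check below recomputes F from the reported P and R for four of the rows.

```python
def f_score(precision, recall):
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the performance table.
reported = {
    "Semantic+TextRank (λ=0.05, η=0.4)": (0.368, 0.337),
    "TextRank": (0.355, 0.325),
    "TF-IDF": (0.369, 0.337),
    "LDA": (0.314, 0.287),
}

for name, (p, r) in reported.items():
    print(f"{name}: F = {f_score(p, r):.3f}")
```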
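Document (content or ID): #NBA restart plan released# 22 teams return, play begins August 1! League sources say the restart plan is for each team to play 8 regular-season games according to the original schedule. Games against teams not taking part in the restart, such as the Hawks, Bulls and Pistons, are skipped automatically. After the regular season, the 8th and 9th seeds play a qualifying series. The 9th seed may take part only if it trails the 8th seed by no more than 4 wins. In the qualifying series, the 9th seed must beat the 8th seed in two consecutive games to advance, while the 8th seed needs only one win.
Manually annotated keywords: 复赛 (restart), 计划 (plan), 打完 (finish playing), 常规赛 (regular season), 晋级 (advance)
Semantic+TextRank: 复赛 (restart), 打完 (finish playing), 参加 (take part), 晋级 (advance), 老鹰 (Hawks)
TextRank: 复赛 (restart), 参加 (take part), 打完 (finish playing), 老鹰 (Hawks), 消息人士 (source)
TF-IDF: 复赛 (restart), 下一场 (next game), 公牛 (Bulls), 活塞 (Pistons), 老鹰 (Hawks)
LDA: 复赛 (restart), 中国 (China), 资格赛 (qualifying series), 参加 (take part), 打完 (finish playing)

Document (content or ID): 1869
Manually annotated keywords: 墓葬 (tomb), 成都 (Chengdu), 古墓 (ancient tomb), 文物 (cultural relic), 考古 (archaeology)
Semantic+TextRank: 墓葬 (tomb), 一直 (all along), 6 000, 时期 (period), 考古 (archaeology)
TextRank: 墓葬 (tomb), 一直 (all along), 6 000, 时期 (period), 汉晋 (Han-Jin)
TF-IDF: 都城 (capital city), 汉晋 (Han-Jin), 墓葬 (tomb), 时期 (period), 考古 (archaeology)
LDA: 墓葬 (tomb), 时期 (period), 都城 (capital city), 汉晋 (Han-Jin), 海浪 (sea waves)
Examples of Keyword Extraction Results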
Precision, Recall and F-score Curve under Different Values of λ and η
Extraction Result of Different Length and Topics
Algorithm            λ      η      P       R       F
Semantic+TextRank    0.05   0.4    0.366   0.341   0.353
TextRank             -      -      0.343   0.320   0.331
TF-IDF               -      -      0.335   0.313   0.324
LDA                  -      -      0.331   0.309   0.320
Improvement of Keyword Extraction on Short Text
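The gains of 6.6%, 9.0% and 10.3% quoted in the abstract match the relative increase in F-score over each baseline in the short-text table above. The page does not say how the percentages were derived, so the formula below is an inference from the numbers; the function name is illustrative.

```python
def relative_gain(new, baseline):
    """Percentage increase of `new` over `baseline`."""
    return (new - baseline) / baseline * 100


# F-scores copied from the short-text table.
improved_f = 0.353
baselines = {"TextRank": 0.331, "TF-IDF": 0.324, "LDA": 0.320}

for name, f in baselines.items():
    print(f"vs {name}: +{relative_gain(improved_f, f):.1f}%")
```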
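Document ID   Extracted by algorithm   Correct segmentation
154           台积                     台积电 (TSMC; organization name)
535           张家                     张家城 (Zhang Jiacheng; person name)
678           莱因, 克尔               莱因克尔 (Lineker; person name)
1019          名医药                   未名医药 (Weiming Yiyao; organization name)
1040          联社                     财联社 (Cailianshe; organization name)
1352          龙磁, 科技               龙磁科技 (Longci Technology; organization name)
1517          麒麟                     郭麒麟 (Guo Qilin; person name)
Examples of Invalid Keywords Extraction Due to Wrong Segmentation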