1. School of Modern Post (School of Automation), Beijing University of Posts and Telecommunications, Beijing 100876, China; 2. School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China
[Objective] This study introduces word semantics into the TextRank algorithm to improve keyword extraction performance. [Methods] First, we used semantic information from HowNet to calculate the similarity between words. Then, we constructed a semantic graph and its matrix from word pairs whose similarity passed a threshold. Finally, we weighted the semantic matrix and the co-occurrence matrix to obtain the transition probability matrix. [Results] On short texts, the improved algorithm outperformed TextRank, TF-IDF and LDA, increasing the F-scores by 6.6%, 9.0% and 10.3% respectively. On long texts, its results were inferior to TF-IDF but close to TextRank. [Limitations] The word segmentation program could not effectively identify compound words, new words and entities, which led to incomplete keywords and lower F-scores. In addition, the semantic similarity algorithm could be improved. [Conclusions] The proposed method effectively extracts keywords from short texts by using both the co-occurrence and semantic relations of words.
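The three steps in [Methods] can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `similarity` function is a toy stand-in for a HowNet sememe-based measure, and the mixing weight `alpha`, the similarity `threshold`, the co-occurrence `window` and the `damping` factor are assumed values chosen only for illustration. The ranking step uses the normalized PageRank form, which may differ in detail from the paper's iteration.

```python
def cooccurrence_matrix(tokens, vocab, window=2):
    """Symmetric co-occurrence counts for words within `window` positions."""
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    m = [[0.0] * n for _ in range(n)]
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            a, b = index[w], index[tokens[j]]
            if a != b:
                m[a][b] += 1.0
                m[b][a] += 1.0
    return m


def semantic_matrix(vocab, similarity, threshold=0.5):
    """Keep only word pairs whose similarity passes the threshold."""
    n = len(vocab)
    m = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            s = similarity(vocab[a], vocab[b])
            if s >= threshold:
                m[a][b] = m[b][a] = s
    return m


def row_normalize(m):
    """Scale each non-empty row to sum to 1; all-zero rows stay zero."""
    out = []
    for row in m:
        total = sum(row)
        out.append([x / total for x in row] if total > 0 else [0.0] * len(row))
    return out


def transition_matrix(cooc, sem, alpha=0.5):
    """Weighted mix of the two normalized matrices (alpha is an assumption)."""
    c, s = row_normalize(cooc), row_normalize(sem)
    n = len(c)
    mixed = [[alpha * c[i][j] + (1 - alpha) * s[i][j] for j in range(n)]
             for i in range(n)]
    return row_normalize(mixed)


def rank(p, damping=0.85, iterations=50):
    """Power iteration over the transition matrix (normalized PageRank form)."""
    n = len(p)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[i] * p[i][j] for i in range(n))
                  for j in range(n)]
    return scores


# Toy input: a pre-segmented text and a hypothetical similarity lookup.
tokens = ["keyword", "extraction", "graph", "keyword", "ranking", "graph"]
vocab = sorted(set(tokens))
toy_sim = {frozenset(("extraction", "ranking")): 0.8}


def similarity(a, b):
    return toy_sim.get(frozenset((a, b)), 0.0)


P = transition_matrix(cooccurrence_matrix(tokens, vocab),
                      semantic_matrix(vocab, similarity))
scores = rank(P)
top = sorted(zip(vocab, scores), key=lambda pair: -pair[1])
```

Here `top` lists the candidate words in descending score order. With a real sememe-based similarity in place of `toy_sim`, semantically related words that never co-occur within the window would still reinforce each other, which is the effect the abstract attributes to the method on short texts.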
Yan Qiang, Zhang Xiaoyan, Zhou Simin. Extracting Keywords Based on Sememe Similarity[J]. Data Analysis and Knowledge Discovery, 2021, 5(4): 80-89.