Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (4): 80-89    DOI: 10.11925/infotech.2096-3467.2020.0748
Extracting Keywords Based on Sememe Similarity
Yan Qiang1,2, Zhang Xiaoyan2, Zhou Simin2
1School of Modern Post (School of Automation), Beijing University of Posts and Telecommunications, Beijing 100876, China
2School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China
[Objective] This study incorporates word semantics into the TextRank algorithm to improve the performance of keyword extraction. [Methods] First, we used semantic information from HowNet to calculate the similarity between words. Then, we constructed a semantic graph and matrix for the words whose similarity passed a threshold. Finally, we weighted the semantic matrix and the co-occurrence matrix to obtain the transition probability matrix. [Results] On short texts, the improved algorithm outperformed TextRank, TF-IDF and LDA, increasing the F-scores by 6.6%, 9.0% and 10.3%, respectively. On long texts, its results were inferior to those of TF-IDF but close to those of TextRank. [Limitations] The word segmentation program could not effectively identify compound words, new words and entities, which led to incomplete keywords and lower F-scores. In addition, the semantic similarity algorithm could be improved. [Conclusions] The proposed method effectively extracts keywords from short texts by using the co-occurrence and semantic relations of words.
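
The Methods steps above (threshold the similarities, weight the semantic and co-occurrence matrices, rank with the resulting transition matrix) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting formula, so the linear mix controlled by `alpha`, the `threshold` and `damping` values, the uniform fallback for words with no neighbours, and the toy word list and matrices are all assumptions made for illustration. In practice the similarity values would come from a HowNet sememe-based measure, which is not implemented here.

```python
def row_normalize(matrix):
    """Scale each row to sum to 1; an all-zero row becomes uniform (assumption)."""
    n = len(matrix)
    out = []
    for row in matrix:
        total = sum(row)
        if total > 0:
            out.append([v / total for v in row])
        else:
            out.append([1.0 / n] * n)
    return out


def transition_matrix(cooccurrence, similarity, threshold=0.5, alpha=0.5):
    """Mix co-occurrence and thresholded semantic edges into one transition matrix.

    alpha weights the co-occurrence part, (1 - alpha) the semantic part.
    Both the linear mix and the default values are illustrative assumptions.
    """
    n = len(cooccurrence)
    # Keep only semantic edges that pass the similarity threshold (no self-loops).
    semantic = [
        [similarity[i][j] if i != j and similarity[i][j] >= threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    co = row_normalize(cooccurrence)
    sem = row_normalize(semantic)
    return [
        [alpha * co[i][j] + (1 - alpha) * sem[i][j] for j in range(n)]
        for i in range(n)
    ]


def rank_words(words, transition, damping=0.85, iterations=100, tol=1e-9):
    """TextRank-style power iteration over a row-stochastic transition matrix."""
    n = len(words)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy input: four candidate words from one short text.
words = ["keyword", "extraction", "sememe", "similarity"]
cooccurrence = [  # symmetric co-occurrence counts within a sliding window
    [0, 3, 1, 0],
    [3, 0, 0, 1],
    [1, 0, 0, 2],
    [0, 1, 2, 0],
]
similarity = [  # made-up similarity scores in [0, 1]
    [1.0, 0.2, 0.6, 0.3],
    [0.2, 1.0, 0.1, 0.4],
    [0.6, 0.1, 1.0, 0.7],
    [0.3, 0.4, 0.7, 1.0],
]

transition = transition_matrix(cooccurrence, similarity, threshold=0.5, alpha=0.5)
ranking = rank_words(words, transition)
for word, score in ranking:
    print(f"{word}\t{score:.4f}")
```

Because each input matrix is row-normalized before mixing, every row of the combined matrix sums to 1, so the scores remain a probability distribution throughout the iteration. Setting `alpha=1.0` ignores the semantic edges entirely and ranks by co-occurrence alone.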

Key words: TextRank Extraction, Sememe, Word Similarity
Received: 31 July 2020; Published: 24 November 2020
ZTFLH: TP393
Fund: National Social Science Fund of China (17AGL026); BUPT Excellent Ph.D. Students Foundation (CX2019128)
Corresponding Author: Yan Qiang, E-mail: yan@bupt.edu.cn
