Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (4): 80-89    DOI: 10.11925/infotech.2096-3467.2020.0748
Extracting Keywords Based on Sememe Similarity
Yan Qiang1,2, Zhang Xiaoyan2, Zhou Simin2
1School of Modern Post (School of Automation), Beijing University of Posts and Telecommunications, Beijing 100876, China
2School of Economics and Management, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract

[Objective] This study introduces word semantics into the TextRank algorithm, aiming to improve the performance of keyword extraction. [Methods] First, we used semantic information from HowNet to calculate the similarity between words. Then, we constructed a semantic graph and matrix from the word pairs whose similarity passed a threshold. Finally, the semantic matrix and the co-occurrence matrix were weighted to obtain the transition probability matrix. [Results] On short texts, the improved algorithm outperformed TextRank, TF-IDF and LDA, increasing F-scores by 6.6%, 9.0% and 10.3%, respectively. On long texts, its results were inferior to TF-IDF but close to TextRank. [Limitations] The segmentation program could not effectively identify compound words, new words and entities, which led to incomplete keywords and reduced F-scores. In addition, the semantic similarity algorithm could be improved. [Conclusions] The proposed method effectively extracts keywords from short texts by using both the co-occurrence and the semantic relations of words.
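
The weighting step described under [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact weighting formula, so the sketch assumes a convex combination of the two row-normalized matrices, followed by a standard damped random walk. The similarity scores are passed in as a plain dictionary standing in for HowNet-based similarity, and the names `weight`, `threshold`, `window` and `damping` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def cooccurrence_matrix(tokens: Sequence[str], vocab: List[str], window: int = 2) -> Matrix:
    """Symmetric co-occurrence counts for words at most `window` positions apart."""
    index = {w: i for i, w in enumerate(vocab)}
    n = len(vocab)
    m = [[0.0] * n for _ in range(n)]
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            a, b = index[w], index[tokens[j]]
            if a != b:
                m[a][b] += 1.0
                m[b][a] += 1.0
    return m


def semantic_matrix(vocab: List[str], similarity: Dict[Tuple[str, str], float],
                    threshold: float) -> Matrix:
    """Keep a similarity score only if it passes the threshold (assumed rule: >=)."""
    n = len(vocab)
    m = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            pair, reverse = (vocab[a], vocab[b]), (vocab[b], vocab[a])
            s = similarity.get(pair, similarity.get(reverse, 0.0))
            if s >= threshold:
                m[a][b] = s
    return m


def row_normalize(m: Matrix) -> Matrix:
    """Scale each row to sum to 1; an all-zero row becomes uniform."""
    n = len(m)
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total for v in row] if total > 0 else [1.0 / n] * n)
    return out


def transition_matrix(cooc: Matrix, sem: Matrix, weight: float) -> Matrix:
    """Assumed weighting: (1 - weight) * co-occurrence + weight * semantic."""
    c, s = row_normalize(cooc), row_normalize(sem)
    n = len(c)
    return [[(1 - weight) * c[i][j] + weight * s[i][j] for j in range(n)]
            for i in range(n)]


def rank(p: Matrix, damping: float = 0.85, iterations: int = 50) -> List[float]:
    """Damped power iteration over the row-stochastic transition matrix."""
    n = len(p)
    score = [1.0 / n] * n
    for _ in range(iterations):
        score = [(1 - damping) / n
                 + damping * sum(score[i] * p[i][j] for i in range(n))
                 for j in range(n)]
    return score


# Toy input: made-up tokens and made-up similarity scores.
tokens = ["phone", "screen", "battery", "phone", "display"]
vocab = sorted(set(tokens))
similarity = {("screen", "display"): 0.9, ("phone", "battery"): 0.3}

P = transition_matrix(cooccurrence_matrix(tokens, vocab),
                      semantic_matrix(vocab, similarity, threshold=0.5),
                      weight=0.5)
scores = rank(P)
keywords = sorted(zip(vocab, scores), key=lambda kv: kv[1], reverse=True)
```

Normalizing each matrix before mixing keeps every row of the combined matrix summing to 1, so it remains a valid transition probability matrix for any weight between 0 and 1.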

Key words: TextRank Extraction, Sememe, Word Similarity
Received: 31 July 2020      Published: 24 November 2020
 ZTFLH: TP393
Fund: National Social Science Fund of China (17AGL026); BUPT Excellent Ph.D. Students Foundation (CX2019128)
Corresponding Authors: Yan Qiang     E-mail: yan@bupt.edu.cn
Figures and tables:
The Sememe Tree of “Lianxiang”
Research Framework
Word Graph under Different Threshold Values of Word Similarity
Algorithm Performance of Keyword Extraction
Examples of Keyword Extraction Results
Precision, Recall and F-score Curve under Different Values of λ and η
Extraction Result of Different Length and Topics
Improvement of Keyword Extraction on Short Text
Examples of Invalid Keywords Extraction Due to Wrong Segmentation