Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (4): 1-8    DOI: 10.11925/infotech.2096-3467.2017.04.01
Recognizing Core Topic Sentences with Improved TextRank Algorithm Based on WMD Semantic Similarity
Wang Zixuan1,2, Le Xiaoqiu1, He Yuanbiao1
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2University of Chinese Academy of Sciences, Beijing 100049, China
[Objective] This paper aims to automatically recognize the key sentences that describe the research topics of scientific papers. [Methods] First, we organized sentence sets using paper sections as the unit. Second, we calculated the Word Mover's Distance (WMD) between sentences using word embeddings trained on domain text. Third, we optimized the iterative process of the TextRank algorithm and used external features to adjust the sentence weights. Finally, we identified the core topic sentences by ranking the sentences in descending order of weight. [Results] We evaluated the proposed method on scientific papers on climate change and compared it with the traditional TextRank algorithm. The recognition performance (F-value) was about 5% higher than that of the TextRank algorithm. [Limitations] The extraction of sentence features needs to be improved, and the word embedding training and related parameters of the proposed method need further optimization. [Conclusions] The improved TextRank algorithm can effectively recognize the core sentences within individual sections of a scientific paper. With weights adjusted by external features, it can also recognize the core topic sentences of the paper as a whole.

Key words: WMD; TextRank; Semantic Similarity; Topic Sentence Recognition; External Features
Received: 19 January 2017      Published: 24 May 2017
Wang Zixuan, Le Xiaoqiu, He Yuanbiao. Recognizing Core Topic Sentences with Improved TextRank Algorithm Based on WMD Semantic Similarity. Data Analysis and Knowledge Discovery, 2017, 1(4): 1-8.
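
The pipeline summarized in the abstract (sentence distances from word embeddings, TextRank iteration, weight adjustment by external features, descending ranking) can be illustrated with a minimal sketch. It is not the authors' implementation. It substitutes a relaxed WMD (a cheap lower bound on the true Word Mover's Distance) for exact WMD, uses the standard TextRank iteration rather than the optimized one described in the paper, and treats external features as hypothetical multiplicative weights. The toy embeddings, the similarity transform 1 / (1 + distance), and all function names are assumptions made for illustration.

```python
import math

def relaxed_wmd(a, b, emb):
    """Relaxed WMD between two tokenized sentences (lower bound on exact WMD)."""
    def one_way(src, dst):
        # Each source word moves all of its mass to its nearest destination word.
        return sum(min(math.dist(emb[w], emb[v]) for v in dst) for w in src) / len(src)
    return max(one_way(a, b), one_way(b, a))

def textrank(sim, damping=0.85, tol=1e-6, max_iter=100):
    """Standard weighted TextRank over a symmetric similarity matrix."""
    n = len(sim)
    scores = [1.0 / n] * n
    # Total outgoing edge weight of each sentence, excluding self-loops.
    out_sum = [sum(sim[j][k] for k in range(n) if k != j) for j in range(n)]
    for _ in range(max_iter):
        new = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out_sum[j] * scores[j]
                for j in range(n) if j != i and out_sum[j] > 0
            )
            for i in range(n)
        ]
        converged = max(abs(x - y) for x, y in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores

def rank_sentences(sentences, emb, feature_weights=None):
    """Return sentence indices sorted by descending (adjusted) TextRank weight."""
    n = len(sentences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = relaxed_wmd(sentences[i], sentences[j], emb)
            # Assumed transform: smaller distance means higher similarity.
            sim[i][j] = sim[j][i] = 1.0 / (1.0 + d)
    scores = textrank(sim)
    if feature_weights is not None:
        # Hypothetical external-feature adjustment: one multiplier per sentence.
        scores = [s * w for s, w in zip(scores, feature_weights)]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

# Toy 2-dimensional embeddings and one "section" of three tokenized sentences.
EMB = {
    "climate": (1.0, 0.0), "warming": (0.9, 0.1),
    "model": (0.0, 1.0), "data": (0.1, 0.9),
}
SECTION = [["climate", "warming"], ["warming", "climate", "model"], ["model", "data"]]

if __name__ == "__main__":
    print(rank_sentences(SECTION, EMB))
    print(rank_sentences(SECTION, EMB, feature_weights=[1.0, 1.0, 2.0]))
```

In practice the relaxed distance would be replaced by an exact WMD solver over embeddings trained on the target domain, and the per-sentence multipliers by whatever external features are available.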

Method                                                          Precision  Recall   F1
TextRank                                                        24.88%     22.94%   23.87%
WMD                                                             22.89%     21.10%   21.96%
WMD+TextRank                                                    23.38%     21.56%   22.43%
Proposed method (WMD+TextRank + external feature optimization)  27.11%     25.00%   26.01%

Method                                                          Precision  Recall   F1
TextRank                                                        25.05%     38.59%   30.37%
WMD                                                             20.24%     31.17%   24.54%
WMD+TextRank                                                    27.66%     42.59%   33.54%
Proposed method (WMD+TextRank + external feature optimization)  29.06%     44.75%   35.24%
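
The F1 values reported above are the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a short sanity check, the sketch below recomputes F1 from the reported precision and recall. The function name and the tolerance are illustrative assumptions, and small discrepancies in the last digit are expected because the published inputs are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) for the proposed method in each table.
REPORTED = [
    (27.11, 25.00, 26.01),
    (29.06, 44.75, 35.24),
]

if __name__ == "__main__":
    for p, r, reported in REPORTED:
        computed = f1_score(p, r)
        # Allow a small tolerance for rounding in the published figures.
        status = "ok" if abs(computed - reported) < 0.02 else "mismatch"
        print(f"P={p:.2f} R={r:.2f} F1={computed:.2f} reported={reported:.2f} {status}")
```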
