[Objective] This paper aims to automatically identify the key sentences that describe the research topics of scientific papers. [Methods] First, we organized sentences into sets, using paper sections as the unit. Second, we calculated the Word Mover's Distance (WMD) between sentences using word embeddings trained on a domain corpus. Third, we optimized the iterative process of the TextRank algorithm and used external features to adjust sentence weights. Finally, we ranked sentences in descending order of weight to identify the core topic sentences. [Results] We evaluated the proposed method on scientific papers about climate change and compared it with the traditional TextRank algorithm. Its recognition performance (F-value) was about 5% higher than that of TextRank. [Limitations] The extraction of sentence features needs improvement, and the word embedding training and related parameters of the proposed method need further optimization. [Conclusions] The improved TextRank algorithm can effectively identify the core sentences within sections of scientific papers. With weights adjusted by external features, it can also identify the core topic sentences of a whole paper.
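The pipeline described in [Methods] can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions. The full WMD is replaced by the relaxed lower bound from Kusner et al., in which each word moves entirely to its nearest counterpart. Distance is converted to similarity as 1 / (1 + distance). The iteration is the standard TextRank update rather than the optimized one. External features are applied as simple multiplicative weights. All function names and the toy embeddings are hypothetical.

```python
import math

def relaxed_wmd(sent_a, sent_b, emb):
    """Relaxed WMD lower bound (assumption: stands in for the full WMD).
    Each word travels entirely to its nearest word in the other sentence."""
    def one_way(src, dst):
        return sum(min(math.dist(emb[w], emb[v]) for v in dst) for w in src) / len(src)
    return max(one_way(sent_a, sent_b), one_way(sent_b, sent_a))

def textrank(sim, damping=0.85, tol=1e-6, max_iter=100):
    """Standard weighted TextRank iteration over a symmetric similarity matrix."""
    n = len(sim)
    scores = [1.0] * n
    out_sum = [sum(sim[j][k] for k in range(n) if k != j) for j in range(n)]
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = sum(sim[j][i] / out_sum[j] * scores[j]
                       for j in range(n) if j != i and out_sum[j] > 0)
            new_scores.append((1 - damping) + damping * rank)
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores

def rank_sentences(sentences, emb, feature_weights):
    """Return sentence indices ordered by feature-adjusted TextRank score, highest first."""
    n = len(sentences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = relaxed_wmd(sentences[i], sentences[j], emb)
            sim[i][j] = sim[j][i] = 1.0 / (1.0 + dist)  # assumed distance-to-similarity mapping
    scores = textrank(sim)
    # Assumption: external features enter as multiplicative weights on the scores.
    adjusted = [s * w for s, w in zip(scores, feature_weights)]
    return sorted(range(n), key=lambda i: adjusted[i], reverse=True)

# Toy data (hypothetical 2-D embeddings; real ones would be trained on a domain corpus).
toy_emb = {
    "climate": (1.0, 0.0), "warming": (0.9, 0.1), "change": (1.0, 0.1),
    "stock": (0.0, 1.0), "market": (0.1, 0.9),
}
toy_section = [["climate", "warming"], ["climate", "change"], ["stock", "market"]]
order = rank_sentences(toy_section, toy_emb, feature_weights=[1.0, 1.0, 1.0])
```

In this toy section the off-topic third sentence is far from the other two, so it receives the lowest score when all feature weights are equal. Raising its feature weight enough moves it to the top, which shows how external features can override the graph-based ranking.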