Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (4): 1-8    DOI: 10.11925/infotech.2096-3467.2017.04.01
Original Article
Recognizing Core Topic Sentences with Improved TextRank Algorithm Based on WMD Semantic Similarity
Wang Zixuan1,2, Le Xiaoqiu1, He Yuanbiao1
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2University of Chinese Academy of Sciences, Beijing 100049, China

[Objective] This paper aims to automatically recognize the key sentences that describe the research topics of scientific papers. [Methods] First, we organized sentence sets using paper sections as the unit. Second, we calculated the WMD (Word Mover's Distance) between sentences using domain-trained word embeddings. Third, we optimized the iterative process of the TextRank algorithm and used external features to adjust sentence weights. Finally, we identified the core topic sentences by ranking sentences in descending order of weight. [Results] We evaluated the proposed method on scientific papers about climate change and compared it with the traditional TextRank algorithm. The recognition performance (F-value) was about 5 percentage points higher than that of the TextRank algorithm. [Limitations] The extraction of sentence features needs to be improved, and the word embedding training and related parameters of the proposed method need further optimization. [Conclusions] The improved TextRank algorithm can effectively recognize the core sentences within sections of scientific papers. With weights adjusted by external features, it can also recognize the core topic sentences of a whole paper.
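
The [Methods] steps above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: exact WMD is replaced by the relaxed WMD lower bound (each word moves all its mass to its nearest word in the other sentence) so the example needs no optimal-transport solver; distance is turned into similarity with 1 / (1 + d); the damping factor is the conventional 0.85; and external features enter as a multiplicative weight per sentence. All function names, the toy embeddings, and the `feature_weights` parameter are hypothetical.

```python
import math
from collections import Counter


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nbow(tokens, embeddings):
    """Normalized bag-of-words weights over tokens that have an embedding."""
    counts = Counter(t for t in tokens if t in embeddings)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def relaxed_wmd(tokens_a, tokens_b, embeddings):
    """Relaxed WMD (assumption: stands in for exact WMD as a lower bound)."""
    a, b = nbow(tokens_a, embeddings), nbow(tokens_b, embeddings)
    if not a or not b:
        return float("inf")

    def one_way(src, dst):
        # Each source word sends all of its mass to the closest target word.
        return sum(
            w * min(euclidean(embeddings[s], embeddings[d]) for d in dst)
            for s, w in src.items()
        )

    return max(one_way(a, b), one_way(b, a))


def textrank(similarity, damping=0.85, tol=1e-6, max_iter=100):
    """Weighted TextRank over a symmetric sentence-similarity matrix."""
    n = len(similarity)
    scores = [1.0] * n
    out_sum = [sum(similarity[i][j] for j in range(n) if j != i) for i in range(n)]
    for _ in range(max_iter):
        new = []
        for i in range(n):
            rank = sum(
                similarity[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if j != i and out_sum[j] > 0
            )
            new.append((1 - damping) + damping * rank)
        converged = max(abs(x - y) for x, y in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores


def rank_sentences(sentences, embeddings, feature_weights=None):
    """Return sentence indices sorted by descending (optionally adjusted) weight."""
    n = len(sentences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = relaxed_wmd(sentences[i], sentences[j], embeddings)
            sim[i][j] = sim[j][i] = 1.0 / (1.0 + d)  # assumed distance-to-similarity map
    scores = textrank(sim)
    if feature_weights is not None:
        # Hypothetical external-feature adjustment: one multiplier per sentence.
        scores = [s * w for s, w in zip(scores, feature_weights)]
    return sorted(range(n), key=lambda i: scores[i], reverse=True)


# Toy 2-D embeddings and tokenized sentences, invented for illustration only.
toy_embeddings = {
    "climate": (1.0, 0.0),
    "warming": (0.9, 0.1),
    "carbon": (0.8, 0.2),
    "model": (0.0, 1.0),
}
toy_sentences = [["climate", "warming"], ["warming", "carbon"], ["model"]]
ranking = rank_sentences(toy_sentences, toy_embeddings)
```

In this toy input the third sentence is semantically far from the other two, so it ends up last in `ranking`; passing a large `feature_weights` value for it would move it to the front, which is the role the abstract assigns to external features.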

Key words: WMD; TextRank; Semantic Similarity; Topic Sentence Recognition; External Features
Received: 19 January 2017      Published: 24 May 2017
ZTFLH:  TP393  

Cite this article:

Wang Zixuan, Le Xiaoqiu, He Yuanbiao. Recognizing Core Topic Sentences with Improved TextRank Algorithm Based on WMD Semantic Similarity. Data Analysis and Knowledge Discovery, 2017, 1(4): 1-8.


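| Method | Precision | Recall | F1 |
| TextRank | 24.88% | 22.94% | 23.87% |
| WMD | 22.89% | 21.10% | 21.96% |
| WMD+TextRank | 23.38% | 21.56% | 22.43% |
|  | 27.11% | 25% | 26.01% |

| Method | Precision | Recall | F1 |
| TextRank | 25.05% | 38.59% | 30.37% |
| WMD | 20.24% | 31.17% | 24.54% |
| WMD+TextRank | 27.66% | 42.59% | 33.54% |
| Proposed method (WMD+TextRank + external feature optimization) | 29.06% | 44.75% | 35.24% |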
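
The F1 values in the tables above are consistent with the standard harmonic mean of precision and recall. The short check below assumes that definition (the page itself does not state the formula); `f1_score` is a hypothetical helper name, and the inputs are the percentages reported for the second table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Percentages taken from the second table.
f1_proposed = round(f1_score(29.06, 44.75), 2)   # 35.24
f1_wmd_textrank = round(f1_score(27.66, 42.59), 2)  # 33.54

# Gap between the reported F1 of the proposed method and of TextRank,
# in percentage points.
gap_points = round(35.24 - 30.37, 2)  # 4.87
```

The gap of 4.87 percentage points matches the "about 5%" improvement over TextRank stated in the abstract.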