Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (4): 1-8     https://doi.org/10.11925/infotech.2096-3467.2017.04.01
Research Paper
Recognizing Core Topic Sentences with Improved TextRank Algorithm Based on WMD Semantic Similarity
Wang Zixuan1,2, Le Xiaoqiu1, He Yuanbiao1
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2University of Chinese Academy of Sciences, Beijing 100049, China
Abstract

[Objective] This paper aims to automatically identify the key sentences that describe the research topics of scientific papers. [Methods] Sentence sets are organized by paper section. Domain-specific word embeddings are trained and used to compute the WMD distance between sentences, from which the corresponding semantic similarity is derived. The iterative process of the TextRank algorithm is optimized, external features are used to adjust the resulting sentence weights, and the key topic sentences are selected in descending order of weight. [Results] Using scientific papers in the climate change domain as experimental data and manually annotated results as the baseline, the proposed algorithm was compared with the traditional TextRank algorithm. Preliminary results show that the recognition performance (F-value) of the proposed method is about 5% higher than that of the traditional TextRank algorithm. [Limitations] The extraction of sentence features needs to be improved, and the word embedding training and the related parameters of the method need further optimization. [Conclusions] The improved TextRank algorithm, which is based on domain word embeddings and incorporates WMD semantic similarity, identifies the central sentences within the sections of a scientific paper reasonably well. After the weights are adjusted with external features, it can identify the core topic sentences of a whole paper reasonably well.

Keywords: WMD    TextRank    Semantic Similarity    Topic Sentence Recognition    External Features
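The abstract describes ranking sentences with TextRank over a graph whose edge weights come from WMD-based semantic similarity. The following is a minimal sketch of that idea, not the authors' implementation. The conversion `similarity = 1 / (1 + distance)`, the damping factor 0.85, the function names, and the distance values are all assumptions made for illustration. The page does not specify the paper's optimized iteration or its external-feature adjustment, so neither is modeled here.

```python
def distance_to_similarity(distances):
    # Assumed conversion (not given in the source): similarity = 1 / (1 + distance).
    return [[1.0 / (1.0 + d) for d in row] for row in distances]


def textrank(similarity, damping=0.85, tol=1e-6, max_iter=200):
    """Weighted TextRank over a symmetric sentence-similarity matrix."""
    n = len(similarity)
    # Total edge weight leaving each sentence, ignoring self-loops.
    out_weight = [
        sum(similarity[j][k] for k in range(n) if k != j) for j in range(n)
    ]
    scores = [1.0] * n
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j != i and out_weight[j] > 0:
                    # Sentence j passes on its score in proportion to edge weight.
                    rank += similarity[j][i] / out_weight[j] * scores[j]
            new_scores.append((1 - damping) + damping * rank)
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break
    return scores


# Hypothetical WMD distances among four sentences of one section.
wmd = [
    [0.0, 0.8, 1.0, 2.5],
    [0.8, 0.0, 0.9, 2.4],
    [1.0, 0.9, 0.0, 2.6],
    [2.5, 2.4, 2.6, 0.0],
]
scores = textrank(distance_to_similarity(wmd))
# Sentence indices in descending order of weight.
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

In this toy input, the fourth sentence is far from all the others and therefore ends up at the bottom of the ranking, while the sentence closest to the rest of the section ranks first.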
Received: 2017-01-19      Published: 2017-05-24
CLC Number (ZTFLH):  TP393
Cite this article:
Wang Zixuan, Le Xiaoqiu, He Yuanbiao. Recognizing Core Topic Sentences with Improved TextRank Algorithm Based on WMD Semantic Similarity[J]. Data Analysis and Knowledge Discovery, 2017, 1(4): 1-8.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.04.01      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I4/1
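[Figure: Basic workflow of the proposed core topic sentence recognition method]
[Figure: Computation of the WMD distance between sentences]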
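The sentence-level WMD computation named in the caption above can be illustrated with a small sketch. Exact WMD requires solving an optimal-transport problem, so the code below deliberately swaps in the simpler relaxed variant (each word sends all of its mass to its nearest counterpart), which gives a lower bound on WMD. It is not necessarily what the authors computed. The function names and the 2-D toy embeddings are invented for illustration; a real setup would use trained domain word embeddings.

```python
import math
from collections import Counter


def nbow(tokens):
    """Normalized bag-of-words: each word's share of the sentence's mass."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(source, target, vectors):
    # Each source word moves all of its mass to its nearest target word.
    return sum(
        mass * min(math.dist(vectors[word], vectors[other]) for other in target)
        for word, mass in source.items()
    )


def relaxed_wmd(tokens_a, tokens_b, vectors):
    """Relaxed WMD: the tighter of the two one-sided lower bounds."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    return max(one_sided_cost(a, b, vectors), one_sided_cost(b, a, vectors))


# Toy 2-D embeddings, invented for illustration only.
toy_vectors = {
    "warming": (1.0, 0.0),
    "heating": (0.9, 0.1),
    "ocean": (0.0, 1.0),
    "sea": (0.1, 0.9),
    "algorithm": (-1.0, -1.0),
}
close = relaxed_wmd(["ocean", "warming"], ["sea", "heating"], toy_vectors)
far = relaxed_wmd(["ocean", "warming"], ["algorithm"], toy_vectors)
identical = relaxed_wmd(["ocean", "warming"], ["warming", "ocean"], toy_vectors)
print(close, far, identical)
```

Sentences built from near-synonyms get a small distance even though they share no words, which is the property that makes WMD attractive as a basis for semantic similarity.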
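| Method | Precision | Recall | F1 |
|---|---|---|---|
| TextRank | 24.88% | 22.94% | 23.87% |
| WMD | 22.89% | 21.10% | 21.96% |
| WMD+TextRank | 23.38% | 21.56% | 22.43% |
| Proposed method (WMD+TextRank + external feature optimization) | 27.11% | 25% | 26.01% |

Table: Comparison of the experimental results of the four algorithms in the climate change domain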
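| Method | Precision | Recall | F1 |
|---|---|---|---|
| TextRank | 25.05% | 38.59% | 30.37% |
| WMD | 20.24% | 31.17% | 24.54% |
| WMD+TextRank | 27.66% | 42.59% | 33.54% |
| Proposed method (WMD+TextRank + external feature optimization) | 29.06% | 44.75% | 35.24% |

Table: Comparison of the experimental results of the four algorithms in the computer science domain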
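The F1 values in the two tables above can be checked against the standard definition, F1 = 2PR / (P + R). The short sketch below recomputes F1 from the reported precision and recall. The row labels and the helper name `f1_score` are chosen here for illustration, and the numbers are copied from the tables in the order they appear.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the two tables in order.
rows = {
    "first table / TextRank": (24.88, 22.94, 23.87),
    "first table / WMD": (22.89, 21.10, 21.96),
    "first table / WMD+TextRank": (23.38, 21.56, 22.43),
    "first table / proposed": (27.11, 25.00, 26.01),
    "second table / TextRank": (25.05, 38.59, 30.37),
    "second table / WMD": (20.24, 31.17, 24.54),
    "second table / WMD+TextRank": (27.66, 42.59, 33.54),
    "second table / proposed": (29.06, 44.75, 35.24),
}
# Absolute gap between the recomputed and the reported F1, per row.
gaps = {
    name: abs(f1_score(p, r) - reported)
    for name, (p, r, reported) in rows.items()
}
for name, gap in gaps.items():
    print(f"{name}: difference from reported F1 = {gap:.3f} points")
```

Every recomputed value agrees with the reported F1 to within about 0.01 percentage points, a gap small enough to come from rounding the published precision and recall.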