Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (11): 16-23    DOI: 10.11925/infotech.2096-3467.2019.0045
Factors Affecting Rhetorical Move Recognition with SVM Model
Liangping Ding (1,2), Zhixiong Zhang (1,2,3; corresponding author), Huan Liu (1,2)
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China

[Objective] This paper examines how sample size, the N-gram range, stop-word removal, and term-frequency weighting affect the automatic recognition of rhetorical moves in scientific papers, with the aim of improving methods based on the support vector machine (SVM) model. [Methods] We retrieved a total of 1.1 million labeled moves from 720,000 structured abstracts of scientific papers as experimental data and built an SVM model for move recognition. Following the single-variable principle, we varied one factor at a time (sample size, N-gram range, stop-word removal, or term-frequency weighting) while holding the others fixed, and measured the effect on the model's performance. [Results] The model performed best with a sample size of 600,000 abstracts, an N-gram range of [1,2] (unigrams and bigrams), stop words retained, and TF-IDF weighting. [Limitations] We examined the model only on structured abstracts, so the results may not be directly comparable with those of other studies. [Conclusions] Sample size and fine-grained feature choices have a significant impact on the performance of traditional machine learning models.
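The single-variable design described in the Methods can be sketched as a small experiment grid. This is only an illustration: the factor values are taken from the abstract and the result tables, while the choice of baseline, the dictionary layout, and the function name `single_variable_runs` are assumptions rather than the authors' code.

```python
# Factor values come from the abstract and result tables; the baseline and all
# names here are illustrative assumptions, not the authors' implementation.
BASELINE = {
    "sample_size": 10_000,
    "ngram_range": (1, 2),
    "remove_stop_words": False,
    "weighting": "TF",
}

ALTERNATIVES = {
    "sample_size": [10_000, 50_000, 600_000],
    "ngram_range": [(1, 3), (1, 2), (2, 2), (2, 3)],
    "remove_stop_words": [True, False],
    "weighting": ["TF", "TF-IDF"],
}


def single_variable_runs(baseline, alternatives):
    """Yield (factor, config) pairs that differ from the baseline in one factor only."""
    for factor, values in alternatives.items():
        for value in values:
            config = dict(baseline)   # copy so the baseline is never mutated
            config[factor] = value    # change exactly one factor
            yield factor, config


runs = list(single_variable_runs(BASELINE, ALTERNATIVES))
for factor, config in runs:
    print(factor, config)
```

Each configuration would then be passed to a training and evaluation routine; because only one factor changes per run, any difference in F1 can be attributed to that factor.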

Keywords: Move Recognition; Support Vector Machine; Structured Abstracts
Received: 10 January 2019      Published: 18 December 2019
ZTFLH:  TP393  
Corresponding Author: Zhixiong Zhang

Cite this article:

Liangping Ding, Zhixiong Zhang, Huan Liu. Factors Affecting Rhetorical Move Recognition with SVM Model. Data Analysis and Knowledge Discovery, 2019, 3(11): 16-23.


Author | Model | Move types | Performance
Teufel S, et al[7] (2002) | NBM | 7 categories | Accuracy 44%; recall on the Purpose move 65%
Ruch P, et al[8] (2007) | NBM | OMRC | F1 on the Conclusion move 85%
Wu J, et al[10] (2006) | HMM | BOMRC | Accuracy 80.54%
Lin J, et al[9] (2006) | HMM | IMRC | F1 per move: 88.5%, 84.3%, 89.8%, 89.7%
McKnight L, et al[12] (2003) | SVM | IMRC | F1 per move: 89.2%, 82.0%, 82.1%, 89.5%
Shimbo M, et al[11] (2003) | SVM | BOMRC | Accuracy 91.9%
Ito T, et al[14] (2004) | TSVM | COMRC | F1 per move: 66.0%, 51%, 49.3%, 72.9%, 67.7%
Yamamoto Y, et al[13] (2005) | SVM | IMRC | F1 per move: 91.3%, 83.6%, 87.2%, 89.8%
Hirohata K, et al[2] (2008) | CRF | OMRC | Accuracy 95.5%
Kim S N, et al[15] (2010) | CRF | PICO | Average F1 over all moves 80.9%
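
Author | Training sample size | N-gram range | Stop words removed | Term-frequency weighting
McKnight L, et al[12] | 7,253 | 1 | Not explicitly stated | TF (raw term-frequency count)
Shimbo M, et al[11] | 10,000 | [1,2] | Not explicitly stated | TF
Ito T, et al[14] | 4,185 | 1 | Not explicitly stated | TF
Yamamoto Y, et al[13] | 8,383 | 1 | | TF-IDF
Hirohata K, et al[2] | 51,000 | [1,2] | | Chi-square value
Kim S N, et al[15] | 1,000 | [1,2] | Not explicitly stated | TF
Ruch P, et al[8] | 12,000 | [1,3] | | Chi-square value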

Important terms by move:
Background | Purpose | Method | Result | Conclusion
have | to | Be | be | May
be | evaluate | Use | to | Be
be | purpose | In | have | That
know | determine | Measure | significantly | Should
aim | report | Perform | respectively | Can
recently | this | By | show | Suggest
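Many of the important terms listed above are function words, which helps explain why removing stop words hurts performance. The sketch below filters those terms through a small stop list and shows how few cue words survive. The stop list is a hypothetical example chosen for illustration, not the list used in the paper, and the terms are lowercased here for matching.

```python
# A tiny illustrative stop list (an assumption; not the list used in the paper).
STOP_WORDS = {"be", "have", "to", "this", "in", "by", "that", "may", "should", "can"}

# Important terms per move, copied from the table above and lowercased.
IMPORTANT_TERMS = {
    "Background": ["have", "be", "be", "know", "aim", "recently"],
    "Purpose": ["to", "evaluate", "purpose", "determine", "report", "this"],
    "Method": ["be", "use", "in", "measure", "perform", "by"],
    "Result": ["be", "to", "have", "significantly", "respectively", "show"],
    "Conclusion": ["may", "be", "that", "should", "can", "suggest"],
}


def drop_stop_words(terms, stop_words):
    """Return the terms that are not in the stop list, preserving order."""
    return [t for t in terms if t not in stop_words]


survivors = {
    move: drop_stop_words(terms, STOP_WORDS)
    for move, terms in IMPORTANT_TERMS.items()
}

for move, kept in survivors.items():
    print(f"{move}: kept {len(kept)} of {len(IMPORTANT_TERMS[move])} -> {kept}")
```

Under this assumed stop list, the Conclusion move keeps only one of its six cue terms, so a classifier trained after stop-word removal would lose most of the signal that distinguishes that move.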
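
Training sample size | N-gram range | Stop words removed | Term-frequency weighting | Precision (%) | Recall (%) | F1 (%)
10,000 | unigrams, bigrams | No | TF | 86.75 | 86.75 | 86.75
50,000 | unigrams, bigrams | No | TF | 88.25 | 88.00 | 88.12
600,000 | unigrams, bigrams | No | TF | 90.25 | 91.75 | 91.00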
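
Training sample size | N-gram range | Stop words removed | Term-frequency weighting | Precision (%) | Recall (%) | F1 (%)
10,000 | unigrams, bigrams | No | TF-IDF | 88.75 | 89.00 | 88.87
50,000 | unigrams, bigrams | No | TF-IDF | 90.25 | 90.00 | 90.12
600,000 | unigrams, bigrams | No | TF-IDF | 93.50 | 93.50 | 93.50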
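The F1 values reported above are consistent with the usual harmonic mean of precision and recall. A minimal check, assuming that standard definition (the paper's own evaluation script is not shown in this page), recomputes F1 for the sample-size experiments:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same unit, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (sample size, weighting, precision, recall, reported F1) from the tables above.
ROWS = [
    (10_000, "TF", 86.75, 86.75, 86.75),
    (50_000, "TF", 88.25, 88.00, 88.12),
    (600_000, "TF", 90.25, 91.75, 91.00),
    (10_000, "TF-IDF", 88.75, 89.00, 88.87),
    (50_000, "TF-IDF", 90.25, 90.00, 90.12),
    (600_000, "TF-IDF", 93.50, 93.50, 93.50),
]

# Largest gap between the recomputed and the reported F1 across all rows.
max_gap = max(abs(f1_score(p, r) - reported) for _, _, p, r, reported in ROWS)

for size, weighting, p, r, reported in ROWS:
    print(size, weighting, round(f1_score(p, r), 4), reported)
```

The recomputed values differ from the reported ones by less than 0.01 percentage points, which is within the rounding of the two-decimal figures.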
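
N-gram range | Training sample size | Stop words removed | Term-frequency weighting | Precision (%) | Recall (%) | F1 (%)
[1,3] | 10,000 | No | TF | 80.50 | 79.50 | 80.00
[1,2] | 10,000 | No | TF | 86.75 | 86.75 | 86.75
[2,2] | 10,000 | No | TF | 84.50 | 84.25 | 84.37
[2,3] | 10,000 | No | TF | 84.00 | 83.50 | 83.75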
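To make the N-gram ranges concrete, the sketch below enumerates the features each range would produce for one short token sequence. It assumes that a range [lo, hi] means "all n-grams with lo <= n <= hi", which matches the unigram-plus-bigram reading of [1,2] used in this paper; the function name and the example sentence are hypothetical.

```python
def ngrams(tokens, lo, hi):
    """Return all n-grams with lo <= n <= hi as space-joined strings."""
    features = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features


tokens = "the aim of this study".split()

# The four ranges compared in the table above.
for lo, hi in [(1, 3), (1, 2), (2, 2), (2, 3)]:
    feats = ngrams(tokens, lo, hi)
    print((lo, hi), len(feats), feats)
```

Wider ranges add longer and sparser features: for this five-token example, [1,2] yields 9 features while [1,3] yields 12. That growth in sparse features is one plausible reason why [1,3] scored lower than [1,2] with only 10,000 training samples.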
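
Stop words removed | Training sample size | N-gram range | Term-frequency weighting | Precision (%) | Recall (%) | F1 (%)
Yes | 10,000 | unigrams, bigrams | TF | 80.50 | 79.75 | 80.12
No | 10,000 | unigrams, bigrams | TF | 86.75 | 86.75 | 86.75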
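
Stop words removed | Training sample size | N-gram range | Term-frequency weighting | Precision (%) | Recall (%) | F1 (%)
Yes | 10,000 | unigrams, bigrams | TF-IDF | 82.00 | 82.00 | 82.00
No | 10,000 | unigrams, bigrams | TF-IDF | 88.75 | 89.00 | 88.87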
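
Term-frequency weighting | Training sample size | N-gram range | Stop words removed | Precision (%) | Recall (%) | F1 (%)
TF | 10,000 | unigrams, bigrams | No | 86.75 | 86.75 | 86.75
TF-IDF | 10,000 | unigrams, bigrams | No | 88.75 | 89.00 | 88.87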
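
Term-frequency weighting | Training sample size | N-gram range | Stop words removed | Precision (%) | Recall (%) | F1 (%)
TF | 50,000 | unigrams, bigrams | No | 88.25 | 88.00 | 88.12
TF-IDF | 50,000 | unigrams, bigrams | No | 90.25 | 90.00 | 90.12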
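
Term-frequency weighting | Training sample size | N-gram range | Stop words removed | Precision (%) | Recall (%) | F1 (%)
TF | 600,000 | unigrams, bigrams | No | 90.25 | 91.75 | 91.00
TF-IDF | 600,000 | unigrams, bigrams | No | 93.50 | 93.50 | 93.50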
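The difference between the two weighting schemes compared above can be illustrated with a few lines of code. TF uses raw counts, while TF-IDF scales each count by how rare the term is across documents. The IDF formula below, ln(N/df) + 1, is one common variant and is an assumption here, since the exact formula used in the paper is not given on this page; the toy sentences are also hypothetical.

```python
import math
from collections import Counter


def tf_vector(tokens):
    """Raw term-frequency counts for one document."""
    return dict(Counter(tokens))


def idf_table(docs):
    """IDF per term as ln(N / df) + 1 (one common variant; an assumption here)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Term frequency multiplied by IDF; unseen terms get weight 0."""
    return {term: count * idf.get(term, 0.0) for term, count in Counter(tokens).items()}


docs = [
    "we aim to evaluate the method".split(),
    "we measure the outcome".split(),
    "we suggest the method may help".split(),
]

idf = idf_table(docs)
print(tf_vector(docs[0]))
print(tfidf_vector(docs[0], idf))
```

In this toy corpus "we" and "the" occur in every document and keep a weight of 1, while "aim", which occurs in only one document, is weighted up. Stop words can therefore stay in the feature set without dominating it, which fits the finding that retaining stop words and using TF-IDF gave the best results.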
[1] Swales J. Research Genres: Explorations and Applications[M]. Cambridge University Press, 2004: 228-229.
[2] Hirohata K, Okazaki N, Ananiadou S, et al. Identifying Sections in Scientific Abstracts Using Conditional Random Fields[C]// Proceedings of the 3rd International Joint Conference on Natural Language Processing. 2008.
[3] American National Standards Institute (ANSI Z39.14-1979). American National Standard for Writing Abstracts[S]. New York: American National Standards Institute, 1979.
[4] Swales J. Genre Analysis: English in Academic and Research Settings[M]. Cambridge University Press, 1990.
[5] Nwogu K N. The Medical Research Paper: Structure and Functions[J]. English for Specific Purposes, 1997, 16(2): 119-138.
[6] Dos Santos M B. The Textual Organization of Research Paper Abstracts in Applied Linguistics[J]. Text-Interdisciplinary Journal for the Study of Discourse, 1996, 16(4): 481-500.
[7] Teufel S, Moens M. Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status[J]. Computational Linguistics, 2002, 28(4): 409-445.
doi: 10.1162/089120102762671936
[8] Ruch P, Boyer C, Chichester C, et al. Using Argumentation to Extract Key Sentences from Biomedical Abstracts[J]. International Journal of Medical Informatics, 2007, 76(2-3): 195-200.
doi: 10.1016/j.ijmedinf.2006.05.002 pmid: 16815739
[9] Lin J, Karakos D, Demner-Fushman D, et al. Generative Content Models for Structural Analysis of Medical Abstracts[C]// Proceedings of the HLT-NAACL BioNLP Workshop on Linking Natural Language and Biology. Association for Computational Linguistics, 2006: 65-72.
[10] Wu J C, Chang Y C, Liou H C, et al. Computational Analysis of Move Structures in Academic Abstracts[C]// Proceedings of the COLING/ACL 2006 on Interactive Presentation Sessions. Association for Computational Linguistics, 2006: 41-44.
[11] Shimbo M, Yamasaki T, Matsumoto Y. Using Sectioning Information for Text Retrieval: A Case Study with the Medline Abstracts[C]// Proceedings of the 2nd International Workshop on Active Mining. 2003.
[12] McKnight L, Srinivasan P. Categorization of Sentence Types in Medical Abstracts[C]// Proceedings of AMIA Annual Symposium. American Medical Informatics Association, 2003.
[13] Yamamoto Y, Takagi T. A Sentence Classification System for Multi-Document Summarization in the Biomedical Domain[C]// Proceedings of the 2005 International Workshop on Biomedical Data Engineering. 2005: 90-95.
[14] Ito T, Shimbo M, Yamasaki T, et al. Semi-Supervised Sentence Classification for Medline Documents[J]. Methods, 2004, 138: 141-146.
[15] Kim S N, Martinez D, Cavedon L, et al. Automatic Classification of Sentences to Support Evidence Based Medicine[J]. BMC Bioinformatics, 2011, 12(2): Article No. S5.
[16] Vapnik V. The Nature of Statistical Learning Theory[M]. Springer Science & Business Media, 2013.
[17] Joachims T. Text Categorization with Support Vector Machines: Learning with Many Relevant Features[C]// Proceedings of the 10th European Conference on Machine Learning. 1998: 137-142.
[18] Kivinen J, Warmuth M K, Auer P. The Perceptron Algorithm vs. Winnow: Linear vs. Logarithmic Mistake Bounds When Few Input Variables are Relevant[C]// Proceedings of the Conference on Computational Learning Theory. 1995.