Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (11): 16-23    DOI: 10.11925/infotech.2096-3467.2019.0045
Factors Affecting Rhetorical Move Recognition with SVM Model
Liangping Ding1,2, Zhixiong Zhang1,2,3, Huan Liu1,2
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China
[Objective] This paper examines how sample size, the N value of N-grams, stop-word removal, and term-frequency weighting affect the automatic recognition of rhetorical moves in scientific papers, with the aim of improving the abstracting method based on the support vector machine (SVM) model. [Methods] We retrieved a total of 1.1 million labeled moves from 720,000 structured abstracts of scientific papers as experimental data and built an SVM model for move recognition. Following the single-variable principle, we varied one factor at a time (sample size, N value, stop-word removal, or term-frequency weighting method) while holding the others fixed, and measured the effect on the model's performance. [Results] The model performed best with a sample size of 600,000 abstracts, an N-gram range of [1,2], stop words retained, and TF-IDF weighting. [Limitations] We examined the model only on structured abstracts, so the results might not be directly comparable with those of other studies. [Conclusions] Sample size and some fine-grained feature settings have a significant impact on the performance of traditional machine learning models.
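The control-variable design described in the abstract can be sketched as a small configuration generator. This is only an illustration: the factor levels are copied from the result tables on this page, while the `Config` class, the function name, and the choice of baseline (in particular, stop words retained) are assumptions made for the example, not the authors' code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Config:
    sample_size: int
    ngram_range: tuple
    remove_stop_words: bool
    weighting: str


# Assumed baseline: the setting that recurs across the result tables
# (10,000 samples, unigrams + bigrams, TF); stop words kept is an assumption.
BASELINE = Config(10_000, (1, 2), False, "TF")

# Factor levels as they appear in the result tables.
LEVELS = {
    "sample_size": [10_000, 50_000, 600_000],
    "ngram_range": [(1, 3), (1, 2), (2, 2), (2, 3)],
    "remove_stop_words": [True, False],
    "weighting": ["TF", "TF-IDF"],
}


def one_factor_at_a_time(baseline, levels):
    """Return the baseline plus every config that differs from it in one factor."""
    configs = [baseline]
    for factor, values in levels.items():
        for value in values:
            candidate = replace(baseline, **{factor: value})
            if candidate not in configs:
                configs.append(candidate)
    return configs


CONFIGS = one_factor_at_a_time(BASELINE, LEVELS)
for cfg in CONFIGS:
    print(cfg)
```

Each generated configuration would correspond to one training and evaluation run. The tables also cross some factors (for example, weighting method at each sample size), which a generator like this would need to be extended to cover.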

Key words: Move Recognition; Support Vector Machine; Structured Abstracts
Received: 10 January 2019      Published: 18 December 2019
Corresponding Authors: Zhixiong Zhang

Liangping Ding, Zhixiong Zhang, Huan Liu. Factors Affecting Rhetorical Move Recognition with SVM Model. Data Analysis and Knowledge Discovery, 2019, 3(11): 16-23.

Authors | Model | Move types | Performance
Teufel S, et al. [7] (2002) | NBM | 7 categories | Accuracy 44%; recall for the purpose move 65%
Ruch P, et al. [8] (2007) | NBM | OMRC | F1 for the conclusion move 85%
Wu J, et al. [10] (2006) | HMM | BOMRC | Accuracy 80.54%
Lin J, et al. [9] (2006) | HMM | IMRC | F1 per move: 88.5%, 84.3%, 89.8%, 89.7%
McKnight L, et al. [12] (2003) | SVM | IMRC | F1 per move: 89.2%, 82.0%, 82.1%, 89.5%
Shimbo M, et al. [11] (2003) | SVM | BOMRC | Accuracy 91.9%
Ito T, et al. [14] (2004) | TSVM | COMRC | F1 per move: 66.0%, 51%, 49.3%, 72.9%, 67.7%
Yamamoto Y, et al. [13] (2005) | SVM | IMRC | F1 per move: 91.3%, 83.6%, 87.2%, 89.8%
Hirohata K, et al. [2] (2008) | CRF | OMRC | Accuracy 95.5%
Kim S N, et al. [15] (2010) | CRF | PICO | Average F1 over all moves 80.9%
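Authors | Training sample size | N-gram range | Stop words removed | Term-frequency weighting
McKnight L, et al. [12] | 7,253 | 1 | Not explicitly stated | TF (simple term-frequency count)
Shimbo M, et al. [11] | 10,000 | [1,2] | Not explicitly stated | TF
Ito T, et al. [14] | 4,185 | 1 | Not explicitly stated | TF
Yamamoto Y, et al. [13] | 8,383 | 1 |  | TF-IDF
Hirohata K, et al. [2] | 51,000 | [1,2] |  | Chi-square value
Kim S N, et al. [15] | 1,000 | [1,2] | Not explicitly stated | TF
Ruch P, et al. [8] | 12,000 | [1,3] |  | Chi-square value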
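Important terms, Background move | Important terms, Purpose move | Important terms, Method move | Important terms, Result move | Important terms, Conclusion move
have | to | Be | be | May
be | evaluate | Use | to | Be
be | purpose | In | have | That
know | determine | Measure | significantly | Should
aim | report | Perform | respectively | Can
recently | this | By | show | Suggest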
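Training sample size | N-gram range | Stop words removed | Term-frequency weighting | Precision (%) | Recall (%) | F1 (%)
10,000 | Unigrams and bigrams | No | TF | 86.75 | 86.75 | 86.75
50,000 | Unigrams and bigrams | No | TF | 88.25 | 88.00 | 88.12
600,000 | Unigrams and bigrams | No | TF | 90.25 | 91.75 | 91.00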
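Training sample size | N-gram range | Stop words removed | Term-frequency weighting | Precision (%) | Recall (%) | F1 (%)
10,000 | Unigrams and bigrams | No | TF-IDF | 88.75 | 89.00 | 88.87
50,000 | Unigrams and bigrams | No | TF-IDF | 90.25 | 90.00 | 90.12
600,000 | Unigrams and bigrams | No | TF-IDF | 93.50 | 93.50 | 93.50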
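N-gram range | Training sample size | Stop words removed | Term-frequency weighting | Precision (%) | Recall (%) | F1 (%)
[1,3] | 10,000 | No | TF | 80.50 | 79.50 | 80.00
[1,2] | 10,000 | No | TF | 86.75 | 86.75 | 86.75
[2,2] | 10,000 | No | TF | 84.50 | 84.25 | 84.37
[2,3] | 10,000 | No | TF | 84.00 | 83.50 | 83.75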
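To make the N-gram ranges in the table concrete, the sketch below extracts word N-grams for a range [n_min, n_max]. The function name and the sample sentence are hypothetical; the paper's own tokenization is not described on this page.

```python
def extract_ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, shortest first."""
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


tokens = "the aim of this study".split()

unigrams_bigrams = extract_ngrams(tokens, 1, 2)  # range [1,2]
bigrams_only = extract_ngrams(tokens, 2, 2)      # range [2,2]

print(unigrams_bigrams)
print(bigrams_only)
```

A range of [1,2] keeps single words alongside two-word phrases such as "the aim", whereas [2,2] drops the single words entirely.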
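Stop words removed | Training sample size | N-gram range | Term-frequency weighting | Precision (%) | Recall (%) | F1 (%)
Yes | 10,000 | Unigrams and bigrams | TF | 80.50 | 79.75 | 80.12
No | 10,000 | Unigrams and bigrams | TF | 86.75 | 86.75 | 86.75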
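Stop words removed | Training sample size | N-gram range | Term-frequency weighting | Precision (%) | Recall (%) | F1 (%)
Yes | 10,000 | Unigrams and bigrams | TF-IDF | 82.00 | 82.00 | 82.00
No | 10,000 | Unigrams and bigrams | TF-IDF | 88.75 | 89.00 | 88.87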
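One possible reading of the stop-word results is that many of the terms listed as important for each move are function words that a typical stop list would delete. The sketch below checks this using the terms from the important-term table above (lowercased); the stop list is a small hypothetical one written for this example, not the list used in the paper.

```python
# Terms copied from the important-term table, lowercased.
IMPORTANT_TERMS = {
    "Background": ["have", "be", "be", "know", "aim", "recently"],
    "Purpose": ["to", "evaluate", "purpose", "determine", "report", "this"],
    "Method": ["be", "use", "in", "measure", "perform", "by"],
    "Result": ["be", "to", "have", "significantly", "respectively", "show"],
    "Conclusion": ["may", "be", "that", "should", "can", "suggest"],
}

# Hypothetical stop list for illustration only.
STOP_WORDS = {"be", "have", "to", "this", "in", "by", "that", "may", "should", "can"}


def lost_terms(terms, stop_words):
    """Return the distinct terms a stop-word filter would delete, in first-seen order."""
    seen = []
    for term in terms:
        if term in stop_words and term not in seen:
            seen.append(term)
    return seen


LOST = {move: lost_terms(terms, STOP_WORDS) for move, terms in IMPORTANT_TERMS.items()}

for move, terms in LOST.items():
    print(move, terms)
```

Under this assumed stop list, every move loses at least two of its listed terms, and the Conclusion move loses its modal verbs ("may", "should", "can").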
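Term-frequency weighting | Training sample size | N-gram range | Stop words removed | Precision (%) | Recall (%) | F1 (%)
TF | 10,000 | Unigrams and bigrams | No | 86.75 | 86.75 | 86.75
TF-IDF | 10,000 | Unigrams and bigrams | No | 88.75 | 89.00 | 88.87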
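The difference between the two weighting schemes compared here can be illustrated with a minimal sketch. It uses the plain textbook formula tf × ln(N / df); the paper does not state which TF-IDF variant it used, and libraries often apply smoothed versions, so the formula, function names, and toy documents below are all assumptions.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Raw term counts per document (the 'TF' setting)."""
    return [Counter(doc) for doc in docs]


def tfidf_vectors(docs):
    """Term counts scaled by ln(N / df), an unsmoothed textbook IDF."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    return [{t: f * idf[t] for t, f in Counter(doc).items()} for doc in docs]


docs = [
    "we aim to evaluate the method".split(),
    "we measure the effect".split(),
    "results suggest the method may help".split(),
]

tf = tf_vectors(docs)
tfidf = tfidf_vectors(docs)

print(tf[0]["the"], tfidf[0]["the"])  # occurs in every document
print(tf[0]["aim"], tfidf[0]["aim"])  # occurs in one document only
```

With raw TF, "the" and "aim" receive the same weight in the first document. With this TF-IDF variant, "the" is weighted down to zero because it occurs in every document, while "aim" keeps the largest weight.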
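Term-frequency weighting | Training sample size | N-gram range | Stop words removed | Precision (%) | Recall (%) | F1 (%)
TF | 50,000 | Unigrams and bigrams | No | 88.25 | 88.00 | 88.12
TF-IDF | 50,000 | Unigrams and bigrams | No | 90.25 | 90.00 | 90.12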
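Term-frequency weighting | Training sample size | N-gram range | Stop words removed | Precision (%) | Recall (%) | F1 (%)
TF | 600,000 | Unigrams and bigrams | No | 90.25 | 91.75 | 91.00
TF-IDF | 600,000 | Unigrams and bigrams | No | 93.50 | 93.50 | 93.50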
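As a quick consistency check on the weighting tables, the sketch below recomputes F1 as the harmonic mean of the reported precision and recall, then computes the F1 gain from TF to TF-IDF at each sample size. The numbers are copied from the tables; how the paper aggregated F1 across moves is not stated on this page, so small rounding differences are to be expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) per weighting method and sample size.
ROWS = {
    10_000: {"TF": (86.75, 86.75, 86.75), "TF-IDF": (88.75, 89.00, 88.87)},
    50_000: {"TF": (88.25, 88.00, 88.12), "TF-IDF": (90.25, 90.00, 90.12)},
    600_000: {"TF": (90.25, 91.75, 91.00), "TF-IDF": (93.50, 93.50, 93.50)},
}

# Largest gap between the recomputed and the reported F1.
max_gap = max(
    abs(f1_score(p, r) - reported)
    for row in ROWS.values()
    for p, r, reported in row.values()
)

# F1 improvement of TF-IDF over TF, in percentage points.
gains = {
    size: round(row["TF-IDF"][2] - row["TF"][2], 2)
    for size, row in ROWS.items()
}

print(f"largest recomputation gap: {max_gap:.4f}")
print(gains)
```

The recomputed values stay within 0.01 of the reported F1 figures, and TF-IDF gains between 2.00 and 2.50 points over TF at every sample size.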
