1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Library, Information and Archives Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China
[Objective] This paper examines how sample size, the N value of the N-gram, stop words, and the word-frequency weighting method affect the automatic recognition of rhetorical moves in scientific papers, with the aim of improving the abstracting method based on the support vector machine (SVM) model.
[Methods] We extracted 1.1 million labeled moves from 720,000 structured abstracts of scientific papers as experimental data and built an SVM model for move recognition. Using a controlled-variable design, we changed one factor at a time (sample size, N value, stop-word removal, and word-frequency weighting method) and analyzed its impact on the model’s performance.
[Results] The model performed best with a sample size of 600,000 abstracts, an N value of [1,2], stop words retained, and TF-IDF used to weight word frequency.
[Limitations] We evaluated the model only on structured abstracts, so the results may not be directly comparable with those of other studies.
[Conclusions] Sample size and some fine-grained features have significant impacts on the performance of traditional machine learning models.
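The feature-side factors named in the abstract (N value, stop-word handling, and weighting method) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `ngrams` and `featurize`, the toy stop-word list, the whitespace tokenizer, and the smoothed IDF variant are all assumptions made for illustration, and the SVM classifier itself is left out.

```python
import math
from collections import Counter

# Hypothetical toy stop-word list; the list used in the paper is not given here.
STOP_WORDS = {"the", "of", "a", "we", "to", "in"}


def ngrams(tokens, n_range=(1, 2)):
    """Return all n-grams whose length n lies in the inclusive range n_range."""
    lo, hi = n_range
    return [" ".join(tokens[i:i + n])
            for n in range(lo, hi + 1)
            for i in range(len(tokens) - n + 1)]


def featurize(sentences, n_range=(1, 2), remove_stop_words=False, weighting="tfidf"):
    """Map each sentence to a dict of n-gram weights (raw counts or TF-IDF)."""
    docs = []
    for sentence in sentences:
        tokens = sentence.lower().split()  # assumed whitespace tokenization
        if remove_stop_words:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        docs.append(Counter(ngrams(tokens, n_range)))

    if weighting == "tf":
        return [dict(d) for d in docs]

    # Assumed smoothed IDF variant: log((1 + N) / (1 + df)) + 1.
    n_docs = len(docs)
    df = Counter(term for d in docs for term in d)
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    return [{term: tf * idf[term] for term, tf in d.items()} for d in docs]


# Configuration reported as best in the abstract: N value [1,2],
# stop words kept, TF-IDF weighting.
sentences = ["we propose a new method", "the results show the method works"]
features = featurize(sentences, n_range=(1, 2),
                     remove_stop_words=False, weighting="tfidf")
```

Changing one keyword argument at a time while holding the others fixed mirrors the controlled-variable design described in the Methods; the resulting feature dictionaries would then be passed to a classifier such as a linear SVM.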