New Technology of Library and Information Service  2014, Vol. 30 Issue (7): 34-40    DOI: 10.11925/infotech.1003-3513.2014.07.05
Application of Machine Learning with Limited Corpus to Identify Structure of Scientific Abstracts Automatically
Bai Guangzu1,3, He Yuanbiao2,3, Ma Jianxia1, Liu Jianhuaz2,3, Zou Yimin4
1. Lanzhou Library, Chinese Academy of Sciences, Lanzhou 730000, China;
2. National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
3. University of Chinese Academy of Sciences, Beijing 100049, China;
4. College of Economics & Management, Zhejiang Normal University, Jinhua 321004, China
Abstract  

[Objective] This study aims to identify the structural elements of scientific abstracts automatically by classifying abstract sentences with machine learning trained on a limited number of samples. [Methods] The paper designs a variety of text features to represent abstract sentences, extracts these features using natural language processing techniques, uses them to train Naive Bayes and Support Vector Machine models, and then applies the trained models to identify abstract structure automatically. [Results] Experiments show that the method achieves recognition accuracy comparable to, or even better than, that of previous methods while using a smaller training corpus. [Limitations] Sentences labeled "METHOD" often lack feature words and core verbs, which results in lower recognition accuracy for this class. [Conclusions] The method is an effective way to recognize the structure of academic abstracts automatically from a limited corpus.

Key words: Science abstract; Structure identifying; Machine learning
Received: 08 October 2013      Published: 20 October 2014
CLC number: G356.7
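
The sentence-classification idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature set (lowercase word tokens plus a coarse sentence-position bucket), the label names other than "METHOD", and the toy training sentences are all assumptions made for illustration, and only a hand-written Naive Bayes classifier is shown, not the Support Vector Machine.

```python
import math
import re
from collections import Counter, defaultdict


def features(sentence, index, total):
    """Hypothetical features: word tokens plus a coarse position bucket."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    # Bucket 0, 1, or 2 depending on where the sentence sits in the abstract.
    bucket = min(2, 3 * index // total)
    return tokens + [f"__pos{bucket}__"]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(samples, labels):
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        n = sum(self.class_counts.values())
        v = len(self.vocab)
        known = [f for f in feats if f in self.vocab]  # ignore unseen features
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.feature_counts[label].values())
            score = math.log(count / n)
            for f in known:
                score += math.log((self.feature_counts[label][f] + 1) / (total + v))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training abstracts, invented for illustration (not the paper's corpus).
TRAIN = [
    [("This study aims to classify abstract sentences.", "OBJECTIVE"),
     ("We extract text features and train a classifier.", "METHOD"),
     ("Experiments show that accuracy improves.", "RESULT"),
     ("The approach is effective with limited data.", "CONCLUSION")],
    [("The purpose of this paper is to identify abstract structure.", "OBJECTIVE"),
     ("We train a model on parsed sentences.", "METHOD"),
     ("Results show higher accuracy than the baseline.", "RESULT"),
     ("We conclude that the method is effective.", "CONCLUSION")],
]

samples, labels = [], []
for abstract in TRAIN:
    for i, (sentence, label) in enumerate(abstract):
        samples.append(features(sentence, i, len(abstract)))
        labels.append(label)

model = NaiveBayes().fit(samples, labels)


def classify_abstract(model, sentences):
    """Assign one structural label to each sentence of an unlabeled abstract."""
    return [model.predict(features(s, i, len(sentences)))
            for i, s in enumerate(sentences)]


TEST = [
    "This paper aims to label sentences.",
    "We train a classifier on text features.",
    "Experiments show good accuracy.",
    "The method is effective.",
]
predicted = classify_abstract(model, TEST)
print(list(zip(predicted, TEST)))
```

With so few training sentences the classifier leans heavily on cue words such as "aims" or "show", which is consistent with the limitation noted in the abstract: sentences that lack distinctive feature words are harder to label.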

Cite this article:

Bai Guangzu, He Yuanbiao, Ma Jianxia, Liu Jianhuaz, Zou Yimin. Application of Machine Learning with Limited Corpus to Identify Structure of Scientific Abstracts Automatically. New Technology of Library and Information Service, 2014, 30(7): 34-40.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.07.05     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I7/34

