New Technology of Library and Information Service  2014, Vol. 30 Issue (7): 34-40    DOI: 10.11925/infotech.1003-3513.2014.07.05
Application of Machine Learning with Limited Corpus to Identify Structure of Scientific Abstracts Automatically
Bai Guangzu1,3, He Yuanbiao2,3, Ma Jianxia1, Liu Jianhuaz2,3, Zou Yimin4
1. Lanzhou Library, Chinese Academy of Sciences, Lanzhou 730000, China;
2. National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
3. University of Chinese Academy of Sciences, Beijing 100049, China;
4. College of Economics & Management, Zhejiang Normal University, Jinhua 321004, China
Abstract  

[Objective] This study aims to automatically identify the structural elements of scientific abstracts by classifying abstract sentences with machine learning models trained on a limited number of samples. [Methods] We design a variety of text features to represent abstract sentences, extract these features from academic abstracts with natural language processing techniques, use them to train Naive Bayes and Support Vector Machine models, and then apply the trained models to identify the structure of abstracts automatically. [Results] Experiments show that the method achieves recognition accuracy comparable to, or even better than, that of previous methods while using a smaller training corpus. [Limitations] Sentences with the "METHOD" class label often lack feature words and core verbs, which lowers recognition accuracy for these sentences. [Conclusions] The method is an effective approach to automatically recognizing the structure of academic abstracts with a limited corpus.
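The sentence-classification idea in the Methods statement can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only a Naive Bayes classifier (no Support Vector Machine), and the features used here (lowercased words plus a coarse sentence-position bucket), the toy training sentences, the label names other than "METHOD", and all function and class names are assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def extract_features(sentence, index, total):
    """Represent one abstract sentence as a list of feature strings."""
    feats = ["w=" + tok for tok in re.findall(r"[a-z]+", sentence.lower())]
    # Assumed feature: relative position in the abstract, bucketed into thirds.
    feats.append("pos=%d" % min(2, (3 * index) // total))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, samples):
        for feats, label in samples:
            self.label_counts[label] += 1
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.label_counts.values())
        known = [f for f in feats if f in self.vocab]  # skip unseen features
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            counts = self.feat_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n / total)  # log prior
            for f in known:
                score += math.log((counts[f] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def build_samples(abstracts):
    """Flatten labeled abstracts into (features, label) training pairs."""
    samples = []
    for abstract in abstracts:
        for i, (sentence, label) in enumerate(abstract):
            samples.append((extract_features(sentence, i, len(abstract)), label))
    return samples


def classify(model, sentence, index, total):
    return model.predict(extract_features(sentence, index, total))


# Hypothetical toy corpus; a real experiment would use annotated abstracts.
TRAIN = [
    [("This study aims to classify abstract sentences.", "OBJECTIVE"),
     ("We extract lexical features and train a classifier.", "METHOD"),
     ("Experiments show that accuracy improves.", "RESULT"),
     ("The approach is effective with limited data.", "CONCLUSION")],
    [("The paper aims to detect abstract structure.", "OBJECTIVE"),
     ("We train a model using sentence features.", "METHOD"),
     ("Results show higher accuracy than the baseline.", "RESULT"),
     ("The method is an effective solution.", "CONCLUSION")],
]

model = NaiveBayes().fit(build_samples(TRAIN))
print(classify(model, "This study aims to identify structure.", 0, 4))
print(classify(model, "Experiments show higher accuracy.", 2, 4))
```

With so little training data the predictions depend heavily on a few cue words, which is consistent with the limitation noted above for sentences that lack distinctive feature words.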

Key words: Science abstract; Structure identifying; Machine learning
Received: 08 October 2013      Published: 20 October 2014
CLC Number: G356.7

Cite this article:

Bai Guangzu, He Yuanbiao, Ma Jianxia, Liu Jianhuaz, Zou Yimin. Application of Machine Learning with Limited Corpus to Identify Structure of Scientific Abstracts Automatically. New Technology of Library and Information Service, 2014, 30(7): 34-40.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.07.05     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I7/34

