Application of Machine Learning with Limited Corpus to Identify Structure of Scientific Abstracts Automatically

doi:10.11925/infotech.1003-3513.2014.07.05

New Technology of Library and Information Service (现代图书情报技术)

2014, Vol. 30, Issue 7: 34-40, https://doi.org/10.11925/infotech.1003-3513.2014.07.05

Knowledge Organization and Knowledge Management

Bai Guangzu (白光祖)^1,3, He Yuanbiao (何远标)^2,3, Ma Jianxia (马建霞)^1, Liu Jianhua (刘建华)^2,3, Zou Yimin (邹益民)^4

1. Lanzhou Library, Chinese Academy of Sciences, Lanzhou 730000, China;
2. National Science Library, Chinese Academy of Sciences, Beijing 100190, China;
3. University of Chinese Academy of Sciences, Beijing 100049, China;
4. College of Economics & Management, Zhejiang Normal University, Jinhua 321004, China

Abstract

[Objective] This study aims to identify the structure of scientific abstracts automatically by classifying abstract sentences with machine learning trained on a small sample. [Methods] The paper designs several text-representation features for abstract sentences, extracts these features automatically with natural language processing techniques, uses them to train Naive Bayes and Support Vector Machine models, and applies the trained models to identify abstract structure automatically. [Results] Experiments show that, compared with similar methods, the approach achieves good recognition accuracy with a smaller training corpus. [Limitations] Sentences labeled "METHOD" lack stable class-specific feature words and core verbs, so recognition accuracy for this class is relatively low. [Conclusions] The proposed method is an effective approach to the automatic recognition of abstract structure when only a small sample is available.

Key words: Science abstract; Structure identifying; Machine learning
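
The pipeline summarized in the abstract (represent each sentence by features, train a classifier, then label the sentences of a new abstract) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not list the actual features, so the bag-of-words and position-bucket features, the `featurize` and `NaiveBayes` names, the category labels other than "METHOD", and the toy training sentences are all assumptions made for illustration. Only the Naive Bayes variant is shown; the Support Vector Machine model is not sketched here.

```python
import math
from collections import Counter, defaultdict


def featurize(sentence, index, total):
    """Turn one abstract sentence into feature tokens.

    Assumed features: lowercased words plus a coarse position bucket
    (first, middle, or last third of the abstract).
    """
    words = [w.strip(".,;:()\"'").lower() for w in sentence.split()]
    words = [w for w in words if w]
    bucket = min(2, (3 * index) // total)
    return words + [f"__pos{bucket}__"]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(samples, labels):
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            denom = sum(self.feature_counts[label].values()) + vocab_size
            for f in feats:
                if f not in self.vocab:
                    continue  # ignore features never seen in training
                score += math.log((self.feature_counts[label][f] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training corpus (hypothetical sentences, one label per sentence).
TAGS = ["OBJECTIVE", "METHOD", "RESULT", "CONCLUSION"]
TRAIN = [
    ["This study aims to identify abstract structure.",
     "We design features and train a classifier.",
     "Experiments show improved accuracy.",
     "The method is effective with limited data."],
    ["This paper aims to classify sentences.",
     "We extract features and train a model.",
     "Results show higher accuracy.",
     "The approach is effective."],
]

samples, labels = [], []
for sentences in TRAIN:
    for i, (sentence, tag) in enumerate(zip(sentences, TAGS)):
        samples.append(featurize(sentence, i, len(sentences)))
        labels.append(tag)

model = NaiveBayes().fit(samples, labels)
print(model.predict(featurize("This work aims to label sentences.", 0, 4)))
```

With so little data the classifier leans heavily on a few cue words such as "aims" or "show", which is consistent with the limitation stated in the abstract: a category without stable cue words is harder to recognize.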

Received: 2013-10-08; Published: 2014-10-20

CLC number: G356.7

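Funding:

This work is one of the research outcomes of the Chinese Academy of Sciences "West Light" Joint Scholar Project "Research on Technological Innovation Competition and Development of Strategic Emerging Industries in Gansu Province Based on Computational Intelligence Methods" (Project No. Y200201001).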

Corresponding author: Bai Guangzu, E-mail: baigz@llas.ac.cn

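Author contributions: Bai Guangzu: proposed the research question, carried out the research, acquired and analyzed the main data, and drafted and revised the paper; Bai Guangzu and He Yuanbiao: designed the research plan; Ma Jianxia, Liu Jianhua, and Zou Yimin: designed the implementation plan.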

Cite this article:

白光祖, 何远标, 马建霞, 刘建华, 邹益民. 利用小样本量机器学习实现学术文摘结构的自动识别[J]. 现代图书情报技术, 2014, 30(7): 34-40.
Bai Guangzu, He Yuanbiao, Ma Jianxia, Liu Jianhua, Zou Yimin. Application of Machine Learning with Limited Corpus to Identify Structure of Scientific Abstracts Automatically[J]. New Technology of Library and Information Service, 2014, 30(7): 34-40.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.07.05 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2014/V30/I7/34

