Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (8): 53-61    DOI: 10.11925/infotech.2096-3467.2018.1198
Research Paper
Sentence Function Recognition Based on Active Learning *
Guo Chen¹,², Tianxiang Xu¹
¹ School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China
² Jiangsu Science and Technology Collaborative Innovation Center of Social Public Safety, Nanjing 210094, China
Abstract

[Objective] To reduce the dependence on manually labeled corpora, this paper explores active learning, combining readily available structured abstracts with a small amount of targeted manual annotation, to obtain a sentence function classifier that generalizes better. The classifier identifies the functional type of sentences in the literature (for example, whether a sentence states the objective, methods, or conclusions of a study). [Methods] Function sentences from structured abstracts served as the initial corpus for training three initial classifiers: SVM, CNN, and Bi-LSTM. Active learning then proceeded as follows: the classifiers predicted the functions of a large number of unlabeled sentences from ordinary (unstructured) abstracts; samples with high uncertainty were automatically selected and submitted for manual annotation; and the annotated samples were used to refine the initial classifiers. This process was iterated to improve the classifiers' generalization in the new task setting. [Results] Experiments on a Library and Information Science literature collection show that active learning yields good sentence function classification. Precision, recall, and F1 reached 84.65%, 84.49%, and 84.57%, which are 3.25, 3.24, and 3.25 percentage points higher than before active learning. [Limitations] To avoid a heavy manual annotation workload, only five iterations were run. [Conclusions] Active learning is good at discovering the differences between the unlabeled corpus of a new task setting and the existing training corpus, which lowers manual labeling costs in a targeted way while improving the generalization of the base model. The method could be extended to sentence function recognition in other settings, such as citations and full text.

Keywords: Structured Abstract; Sentence Function Recognition; Active Learning; Short Text Classification
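
The selection step described in the abstract, picking high-uncertainty samples from the unlabeled pool for manual annotation and then retraining, can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The page does not say which uncertainty measure or batch size was used, so prediction entropy, `batch_size`, and the helper names `entropy`, `select_uncertain`, and `active_learning` are assumptions, as are the placeholder callables `fit`, `predict_proba`, and `annotate`, which stand in for a classifier and a human annotator. Only the default of five rounds comes from the abstract.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_uncertain(prob_rows, k):
    """Return indices of the k samples with the highest prediction entropy."""
    ranked = sorted(range(len(prob_rows)),
                    key=lambda i: entropy(prob_rows[i]),
                    reverse=True)
    return ranked[:k]


def active_learning(labeled, pool, fit, predict_proba, annotate,
                    rounds=5, batch_size=2):
    """Generic pool-based loop: predict, pick uncertain samples, annotate, retrain.

    fit, predict_proba and annotate are caller-supplied placeholders
    (assumptions of this sketch), not functions from any specific library.
    """
    labeled, pool = list(labeled), list(pool)
    model = fit(labeled)
    for _ in range(rounds):
        if not pool:
            break
        probs = [predict_proba(model, x) for x in pool]
        picked = select_uncertain(probs, batch_size)
        # Ask the annotator for labels of the most uncertain samples.
        labeled += [(pool[i], annotate(pool[i])) for i in picked]
        chosen = set(picked)
        pool = [x for i, x in enumerate(pool) if i not in chosen]
        model = fit(labeled)  # retrain on the enlarged labeled set
    return model, labeled


# Made-up class probabilities for four unlabeled sentences over three
# function types (objective, methods, results).
pool_probs = [
    [0.96, 0.02, 0.02],
    [0.40, 0.35, 0.25],
    [0.10, 0.85, 0.05],
    [0.34, 0.33, 0.33],
]
to_annotate = select_uncertain(pool_probs, k=2)
print(to_annotate)  # [3, 1]
```

The two near-uniform predictions are selected first, which is the intended behavior: confident predictions are left alone and annotation effort goes to sentences the current classifier cannot place.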
Received: 2018-10-29
CLC Number: TP391
Funding: *This work is one of the research outcomes of the National Social Science Fund of China Youth Project "Semantic Mining and Knowledge Evolution of Scientific and Technical Vocabulary from the Perspective of Domain Analysis" (16CTQ024).
Corresponding author: Guo Chen     E-mail: delphi1987@qq.com
Cite this article:
Guo Chen, Tianxiang Xu. Sentence Function Recognition Based on Active Learning *[J]. Data Analysis and Knowledge Discovery, 2019, 3(8): 53-61. DOI: 10.11925/infotech.2096-3467.2018.1198.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1198
Fig. 1  Active learning workflow[29]
Fig. 2  Workflow of active-learning-based sentence function recognition for ordinary abstracts
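| Function type | Structured-abstract labels |
|---|---|
| Objective/Significance | Objective/Significance (目的/意义); Objective (目的) |
| Methods/Process | Methods/Process (方法/过程); Process/Methods (过程/方法); Methods/Content (方法/内容); Methods (方法); Process (过程) |
| Results/Conclusions | Conclusions/Results (结论/结果); Results/Conclusions (结果/结论); Results (结果); Conclusions (结论) |
| Limitations | Limitations (局限) |
| Application background | Application background (应用背景) |
| Literature scope | Literature scope (文献范围) |
Table 1  Extraction of structured-abstract labels and their grouping by function type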
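
The grouping in Table 1 amounts to a lookup from raw section labels to function types. A minimal Python sketch of how such a lookup could turn a structured abstract into labeled training text is given below. The dictionary contents come from Table 1 (keys are the original Chinese labels). The 【label】 bracket format is assumed from this article's own abstract, and the names `LABEL_TO_FUNCTION`, `SEGMENT`, and `label_segments` are hypothetical; the authors' actual extraction procedure is not described on this page.

```python
import re

# Raw labels from Table 1 (keys) mapped to their function type (values).
LABEL_TO_FUNCTION = {
    "目的/意义": "objective/significance",
    "目的": "objective/significance",
    "方法/过程": "methods/process",
    "过程/方法": "methods/process",
    "方法/内容": "methods/process",
    "方法": "methods/process",
    "过程": "methods/process",
    "结论/结果": "results/conclusions",
    "结果/结论": "results/conclusions",
    "结果": "results/conclusions",
    "结论": "results/conclusions",
    "局限": "limitations",
    "应用背景": "application background",
    "文献范围": "literature scope",
}

# Assumed layout: each segment starts with a label in 【】 brackets.
SEGMENT = re.compile(r"【([^】]+)】([^【]*)")


def label_segments(abstract):
    """Split a structured abstract into (function type, text) pairs.

    Segments whose label is not listed in Table 1 are skipped.
    """
    pairs = []
    for label, text in SEGMENT.findall(abstract):
        function = LABEL_TO_FUNCTION.get(label.strip())
        if function is not None and text.strip():
            pairs.append((function, text.strip()))
    return pairs


example = "【目的】探索主动学习方法。【方法】训练三种分类器。【局限】仅做了5次迭代。"
for function, text in label_segments(example):
    print(function, "|", text)
```

Normalizing label variants such as 方法/过程 and 过程/方法 to one function type keeps differently worded section headings from splitting into separate training classes.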
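Fig. 3  Classification accuracy with different feature selection methods
| Category | Feature words |
|---|---|
| Objective/Significance | 重要 (important), 旨在 (aims to), 意义 (significance), 以期 (in order to), 问题 (problem) |
| Methods/Process | 进行 (carry out), 分析 (analyze), 提供 (provide), 采用 (adopt), 通过 (by means of) |
| Results/Conclusions | 结果表明 (results show), 发现 (find), 能够 (can), 表明 (indicate), 结果显示 (results indicate) |
Table 2  Examples of feature selection results for each category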
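| Fold | SVM P | SVM R | SVM F1 | CNN P | CNN R | CNN F1 | Bi-LSTM P | Bi-LSTM R | Bi-LSTM F1 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 91.66 | 91.12 | 91.39 | 92.75 | 92.45 | 92.60 | 91.93 | 92.07 | 92.00 |
| 2 | 91.80 | 91.62 | 91.71 | 92.56 | 92.51 | 92.53 | 93.01 | 93.20 | 93.10 |
| 3 | 91.21 | 91.12 | 91.16 | 92.73 | 92.60 | 92.66 | 93.12 | 93.07 | 93.09 |
| 4 | 90.96 | 90.77 | 90.86 | 89.50 | 91.41 | 90.44 | 93.68 | 93.48 | 93.58 |
| 5 | 92.39 | 92.21 | 92.29 | 92.54 | 92.52 | 92.53 | 94.35 | 94.19 | 94.27 |
| 6 | 91.19 | 91.05 | 91.11 | 90.30 | 90.36 | 90.32 | 93.35 | 93.38 | 93.36 |
| 7 | 90.10 | 90.62 | 90.35 | 93.23 | 93.18 | 93.20 | 93.81 | 93.97 | 93.89 |
| 8 | 91.87 | 90.68 | 91.27 | 93.41 | 93.21 | 93.31 | 93.23 | 93.52 | 93.37 |
| 9 | 91.39 | 91.36 | 91.37 | 92.11 | 91.12 | 92.11 | 92.23 | 92.13 | 92.18 |
| 10 | 89.88 | 89.91 | 89.89 | 91.01 | 91.11 | 91.06 | 91.68 | 91.48 | 91.58 |
| Mean | 91.24 | 91.05 | 91.14 | 92.01 | 92.05 | 92.03 | 93.04 | 93.05 | 93.04 |
Table 3  Ten-fold cross-validation results for sentence function classification on structured abstracts (%)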
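
The F1 columns in Table 3 appear to be the harmonic mean of the listed precision and recall. That is an inference from the numbers, not something stated on this page. A minimal Python sketch that recomputes F1 for a few rows copied from the table (the function name `f1_score` and the choice of rows are illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, listed F1) copied from Table 3.
rows = {
    "SVM, fold 1": (91.66, 91.12, 91.39),
    "Bi-LSTM, fold 5": (94.35, 94.19, 94.27),
    "CNN, fold 9": (92.11, 91.12, 92.11),
}
for name, (p, r, listed) in rows.items():
    print(f"{name}: recomputed {f1_score(p, r):.2f}, listed {listed:.2f}")
```

With the values as printed, the first two rows reproduce the listed F1, while CNN fold 9 recomputes to 91.61 rather than the listed 92.11. One of the three figures in that row may therefore be a typesetting error in the source table.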
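| Method | P | R | F1 |
|---|---|---|---|
| SVM | 81.62 | 81.19 | 81.40 |
| CNN | 81.21 | 81.12 | 81.16 |
| Bi-LSTM | 81.40 | 81.25 | 81.32 |
Table 4  Test results of function sentence recognition on ordinary abstracts (%)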
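| Iteration | SVM P | SVM R | SVM F1 | CNN P | CNN R | CNN F1 | Bi-LSTM P | Bi-LSTM R | Bi-LSTM F1 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 82.94 | 81.21 | 82.07 | 81.80 | 82.57 | 82.18 | 83.07 | 82.22 | 82.64 |
| 2 | 83.14 | 83.18 | 83.16 | 82.87 | 82.70 | 82.78 | 83.90 | 83.80 | 83.85 |
| 3 | 83.46 | 83.46 | 83.46 | 82.85 | 82.70 | 82.77 | 83.70 | 83.32 | 83.51 |
| 4 | 83.37 | 83.39 | 83.38 | 83.38 | 83.18 | 83.28 | 84.29 | 83.94 | 84.11 |
| 5 | 83.31 | 83.32 | 83.31 | 83.94 | 83.80 | 83.87 | 84.65 | 84.49 | 84.57 |
Table 5  Test results of active-learning-based function sentence recognition on ordinary abstracts (%)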
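
The improvements quoted in the abstract (3.25, 3.24, and 3.25 points) appear to correspond to the Bi-LSTM rows: the fifth iteration in Table 5 minus the Bi-LSTM row in Table 4. A short Python check of that reading, using only figures copied from the two tables (the variable names are illustrative):

```python
# Bi-LSTM scores in percent: Table 4 (before active learning) and
# iteration 5 in Table 5 (after active learning).
before = {"P": 81.40, "R": 81.25, "F1": 81.32}
after = {"P": 84.65, "R": 84.49, "F1": 84.57}

# Differences are in percentage points, rounded to two decimals to
# hide floating-point noise.
gains = {metric: round(after[metric] - before[metric], 2) for metric in before}
print(gains)  # {'P': 3.25, 'R': 3.24, 'F1': 3.25}
```

The gains are absolute differences in percentage points, not relative increases.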
Fig. 4  Active learning performance of each classifier at different numbers of iterations
[1] Lu Wei, Huang Yong, Cheng Qikai. The Structure Function of Academic Text and Its Classification[J]. Journal of the China Society for Scientific and Technical Information, 2014,33(9):979-985. (in Chinese)
[2] Tang Xiaobo, Xiao Lu. Research of Micro-Blog Topics Mining Based on Sentence Granularity[J]. Journal of the China Society for Scientific and Technical Information, 2014,33(6):623-632. (in Chinese)
[3] Duan Ping. How to Write English Informative Abstract in Paper for Special Science and Technology[J]. College English, 2000(12):51-52. (in Chinese)
[4] Zheng Yanning, Hua Bolin. An Analysis of the Application of Sentence-Level Knowledge Extraction in Information Science[J]. Information Studies: Theory & Application, 2011,34(12):1-4. (in Chinese)
[5] Wang Wenjuan, Ma Jianxia, Chen Chun, et al. A Review of Citation Context Classifications and Implementation Methods[J]. Library and Information Service, 2016,60(6):118-127. (in Chinese)
[6] Liu Kang, Qian Xu, Wang Ziqiang. Survey on Active Learning Algorithms[J]. Computer Engineering and Applications, 2012,48(34):1-4, 22. (in Chinese)
[7] Li Xiangdong, Cao Huan, Ding Cong, et al. Short-text Classification Based on HowNet and Domain Keyword Set Extension[J]. New Technology of Library and Information Service, 2015(2):31-38. (in Chinese)
[8] Fan X, Hu H. Utilizing High-quality Feature Extension Mode to Classify Chinese Short-text[J]. Journal of Networks, 2010,5(12):1417-1425.
[9] Kim K, Chung B S, Choi Y, et al. Language Independent Semantic Kernels for Short-Text Classification[J]. Expert Systems with Applications, 2014,41(2):735-743.
[10] Chen M, Jin X, Shen D. Short Text Classification Improved by Learning Multi-Granularity Topics[C]// Proceedings of the 22nd International Joint Conference on Artificial Intelligence. AAAI Press, 2011: 1776-1781.
[11] Dai Z, Sun A, Liu X Y. Crest: Cluster-based Representation Enrichment for Short Text Classification[C]// Proceedings of the 2013 Pacific-Asia Conference on Knowledge Discovery and Data Mining. 2013: 256-267.
[12] Young T, Hazarika D, Poria S, et al. Recent Trends in Deep Learning Based Natural Language Processing[J]. IEEE Computational Intelligence Magazine, 2018,13(3):55-75.
[13] Wu Peng, Ying Yang, Shen Si. Negative Emotions of Online Users' Analysis Based on Bidirectional Long Short-Term Memory[J]. Journal of the China Society for Scientific and Technical Information, 2018,37(8):845-853. (in Chinese)
[14] Wang Dongbo, Gao Ruiqing, Shen Si, et al. Deep Learning-Based Classification of Pre-Qin Classics Questions[J]. Journal of the China Society for Scientific and Technical Information, 2018,37(11):1114-1122. (in Chinese)
[15] Wang Shengyu, Zeng Biqing, Shang Qi, et al. Word Attention-based Convolutional Neural Networks for Sentiment Analysis[J]. Journal of Chinese Information Processing, 2018,32(9):123-131. (in Chinese)
[16] Teufel S, Siddharthan A, Dan T. Automatic Classification of Citation Function[C]// Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2006: 103-110.
[17] Dong C, Schäfer U. Ensemble-style Self-training on Citation Classification[C]// Proceedings of the 5th International Joint Conference on Natural Language Processing. 2011: 623-631.
[18] Teufel S, Moens M. Summarizing Scientific Articles: Experiments with Relevance and Rhetorical Status[J]. Computational Linguistics, 2002,28(4):409-445.
[19] Abu-Jbara A, Radev D. Coherent Citation-Based Summarization of Scientific Papers[C]// Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. 2011: 500-509.
[20] Xu Hongbo. Citation-Context Based Academic Literature Summarization Method[D]. Xi'an: Northwest A&F University, 2017. (in Chinese)
[21] McKnight L, Srinivasan P. Categorization of Sentence Types in Medical Abstracts[J]. AMIA Annual Symposium Proceedings, 2003: 440-444.
[22] Hua Xiuli, Xu Fan, Wang Zhongqing, et al. Fine-grained Classification Method for Abstract Sentence of Scientific Paper[J]. Computer Engineering, 2012,38(14):138-140. doi: 10.3969/j.issn.1000-3428.2012.14.041 (in Chinese)
[23] Wang Dongbo, Lu Haoxiang, Zhou Xin, et al. A Comparative Study of Model Performances Facing Abstract Structure Function[J]. Library and Information Service, 2018,62(12):84-90. (in Chinese)
[24] Karlos S, Fazakis N, Kalleris K, et al. An Incremental Self-Trained Ensemble Algorithm[C]// Proceedings of the 2018 IEEE Conference on Evolving & Adaptive Intelligent Systems. IEEE, 2018: 1-8.
[25] Zhao Hong, Wang Fang. A Deep Learning Model and Self-Training Algorithm for Theoretical Terms Extraction[J]. Journal of the China Society for Scientific and Technical Information, 2018,37(9):923-938. (in Chinese)
[26] Pan S J, Yang Q. A Survey on Transfer Learning[J]. IEEE Transactions on Knowledge & Data Engineering, 2009,22(10):1345-1359.
[27] Zhou Qingqing, Zhang Chengzhi. Microblog Emotion Classification Based on Transfer Learning: A Case Study of Microblogs About H7N9[J]. Journal of the China Society for Scientific and Technical Information, 2016,35(4):339-348. (in Chinese)
[28] Cohn D A, Ghahramani Z, Jordan M I. Active Learning with Statistical Models[J]. Journal of Artificial Intelligence Research, 1996,4(1):705-712.
[29] Active Learning[OL]. [2018-12-27]. (in Chinese)
[30] Yamamoto Y, Takagi T. A Sentence Classification System for Multi Biomedical Literature Summarization[C]// Proceedings of the 21st International Conference on Data Engineering Workshops. IEEE, 2005: 1163.
[31] Chen Tao, Xie Yangqun. Literature Review of Feature Dimension Reduction in Text Categorization[J]. Journal of the China Society for Scientific and Technical Information, 2005,24(6):690-695. (in Chinese)
[32] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and Their Compositionality[C]// Proceedings of the Neural Information Processing Systems 2013. 2013: 3111-3119.