New Technology of Library and Information Service  2016, Vol. 32 Issue (3): 67-73    DOI: 10.11925/infotech.1003-3513.2016.03.09
Research Paper
A Scientific Research Object Labeling System Based on Active Learning*
He Huixin, Liu Lijuan
Tongfang Knowledge Network Technology Co., Ltd. (Beijing), Beijing 100192, China
Abstract

[Objective] This study aims to identify research object attribute instances in paper titles and to maximize recognition accuracy using only a small number of labeled samples. [Methods] We first analyzed the grammatical features of research objects in scientific literature. We then applied a conditional random field (CRF) sequence labeling algorithm, trained on a small number of samples, to recognize and extract research objects. Finally, we introduced an iterative labeling system based on active learning over unlabeled data to improve recognition accuracy. [Results] The proposed method makes efficient use of unlabeled data and raises the labeling accuracy of research object recognition to 78.3%. [Limitations] The running efficiency of the algorithm needs further optimization. [Conclusions] The method identifies research object attribute instances in scientific literature well, laying a foundation for further mining of the knowledge systems and structures in scientific literature.

Key words: Scientific literature; Research objects; Conditional random fields; Iterative labeling system; Active learning
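The abstract does not state the query strategy or the feature set, so the following is only a minimal sketch of a pool-based active learning loop, assuming least-confidence sampling. A per-token frequency tagger stands in for the conditional random field, and every name here (`train`, `confidence`, `active_learning`, the `OBJ`/`O` tag set) is hypothetical rather than taken from the paper.

```python
from collections import Counter, defaultdict

LABELS = ("OBJ", "O")  # hypothetical tag set: research-object token vs. other


def train(labeled):
    """Count token/label co-occurrences (a simple stand-in for CRF training)."""
    counts = defaultdict(Counter)
    for tokens, tags in labeled:
        for token, tag in zip(tokens, tags):
            counts[token][tag] += 1
    return counts


def token_probs(counts, token):
    """Add-one smoothed label distribution for one token."""
    seen = counts.get(token, Counter())
    total = sum(seen[label] + 1 for label in LABELS)
    return {label: (seen[label] + 1) / total for label in LABELS}


def confidence(counts, tokens):
    """Mean of the per-token maximum label probability for one title."""
    if not tokens:
        return 1.0  # nothing to learn from an empty title
    best = [max(token_probs(counts, t).values()) for t in tokens]
    return sum(best) / len(best)


def active_learning(labeled, pool, oracle, rounds, batch_size):
    """Repeatedly retrain, then send the least confident titles to the annotator."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)
        # Least-confidence sampling: lowest-scoring titles are labeled first.
        pool.sort(key=lambda tokens: confidence(model, tokens))
        batch, pool = pool[:batch_size], pool[batch_size:]
        labeled.extend((tokens, oracle(tokens)) for tokens in batch)
    return train(labeled), labeled, pool
```

Here `oracle` represents the human annotator: it receives a tokenized title and returns its tag sequence. In a fuller version the stand-in tagger would be replaced by a CRF, and the confidence score by the probability of the CRF's best label sequence.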
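Received: 2015-10-13
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Hypernetwork Methods for Early Warning of Mass Emergencies" (Grant No. 71473034).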
Cite this article:
He Huixin, Liu Lijuan. A Scientific Research Object Labeling System Based on Active Learning[J]. New Technology of Library and Information Service, 2016, 32(3): 67-73. DOI: 10.11925/infotech.1003-3513.2016.03.09.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.03.09