Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (1): 89-98    DOI: 10.11925/infotech.2096-3467.2019.0869
Research Paper
Automatic Identification of Term Citation Object with Feature Fusion
Na Ma1,2, Zhixiong Zhang1,2,3,4, Pengmin Wu5
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China
4Hubei Key Laboratory of Big Data in Science and Technology, Wuhan 430071, China
5Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] This paper explores methods for automatically identifying term citation objects in scientific papers, using feature fusion and a pseudo-label noise reduction strategy. [Methods] First, we cast the identification of term citation objects as a sequence labeling problem. Then, we fused two classes of features of term citation objects, linguistic and heuristic, at the input layer of a BiLSTM-CNN-CRF model to strengthen their feature representation. Finally, we designed a pseudo-label learning mechanism with noise reduction and used semi-supervised learning to examine how different feature combinations affect identification performance. [Results] The best F1 value of our method reached 0.6018, about 8 percentage points higher than that of the BERT model (0.5194). [Limitations] The experimental data covered only computer science papers, so the portability of the method to other fields remains to be verified. [Conclusions] The deep learning method based on feature fusion performs well in identifying term citation objects, and pseudo-label learning addresses the shortage of annotated citation object data. Together they offer an effective approach to the automatic identification of term citation objects.

Keywords: Citation Object Identification; Feature Fusion; Pseudo-Label Learning; BiLSTM-CNN-CRF
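To make the sequence labeling and feature fusion idea concrete, the following is a minimal sketch of how per-token features might be concatenated before they reach the input layer of a tagger. Everything in it is an assumption for illustration: the toy embedding function, the three-tag POS set, the citation-marker flag and distance features, the `B-TERM` label, and the example tags are hypothetical and are not taken from the paper's feature definitions or annotation.

```python
import re
from typing import List

POS_TAGS = ["NOUN", "VERB", "OTHER"]  # toy tag set (assumption)
WORD_DIM = 4                           # toy embedding width (assumption)


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def toy_word_vector(token: str, dim: int = WORD_DIM) -> List[float]:
    """Deterministic stand-in for a pretrained word embedding lookup."""
    seed = sum(ord(ch) for ch in token)
    return [((seed * (i + 1)) % 7) / 7.0 for i in range(dim)]


def fuse_features(tokens: List[str], pos_tags: List[str]) -> List[List[float]]:
    """Concatenate word, POS, citation-marker and distance features per token."""
    ref_positions = [i for i, tok in enumerate(tokens) if re.fullmatch(r"REF\d+", tok)]
    fused = []
    for i, (tok, pos) in enumerate(zip(tokens, pos_tags)):
        is_ref = 1.0 if i in ref_positions else 0.0
        # Normalised distance to the nearest citation marker (1.0 if there is none).
        nearest = min((abs(i - r) for r in ref_positions), default=len(tokens))
        distance = nearest / len(tokens)
        fused.append(
            toy_word_vector(tok)
            + one_hot(POS_TAGS.index(pos), len(POS_TAGS))
            + [is_ref, distance]
        )
    return fused


tokens = ["using", "the", "GIZA++", "software", "REF11"]
pos_tags = ["VERB", "OTHER", "NOUN", "NOUN", "OTHER"]
bio_tags = ["O", "O", "B-TERM", "O", "O"]  # illustrative labels, not the paper's annotation

fused = fuse_features(tokens, pos_tags)
print(len(fused), len(fused[0]))  # one 9-dimensional vector per token
```

Each token ends up with one fused vector and one BIO label, which is the shape of input and target a sequence tagger would consume. In a real model the hand-built vectors would be replaced by learned embeddings.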
Received: 2019-07-23
CLC number: TP391
Funding: This work is one of the research outcomes of the Chinese Academy of Sciences project "Application Demonstration of Rich Semantic Retrieval for Scientific and Technical Literature" (Grant No. 院1734).
Corresponding author: Zhixiong Zhang     E-mail: zhangzhx@mail.las.ac.cn
Cite this article:
Na Ma, Zhixiong Zhang, Pengmin Wu. Automatic Identification of Term Citation Object with Feature Fusion[J]. Data Analysis and Knowledge Discovery, 2020, 4(1): 89-98. DOI: 10.11925/infotech.2096-3467.2019.0869.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0869
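Figure 1  BiLSTM-CNN-CRF model
Figure 2  Character-level features trained with a CNN
Figure 3  Feature representation of term citation objects
Figure 4  Pseudo-label learning workflow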
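As a rough illustration of a pseudo-label learning loop of the kind named in the last caption, the sketch below trains on labeled data, labels an unlabeled pool, and keeps only confident predictions before retraining. The details are assumptions: this page does not describe the noise reduction mechanism, so a simple confidence threshold stands in for it, and a one-dimensional nearest-centroid classifier stands in for the BiLSTM-CNN-CRF tagger. All function names and the threshold value are hypothetical.

```python
from typing import Dict, List, Tuple

Example = Tuple[float, int]  # (feature value, class label)


def fit_centroids(data: List[Example]) -> Dict[int, float]:
    """Train the stand-in model: one mean feature value per class."""
    sums: Dict[int, float] = {}
    counts: Dict[int, int] = {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids: Dict[int, float], x: float) -> Tuple[int, float]:
    """Return (label, confidence); needs at least two classes."""
    ranked = sorted(centroids.items(), key=lambda kv: abs(x - kv[1]))
    (best_label, best_c), (_, second_c) = ranked[0], ranked[1]
    d_best, d_second = abs(x - best_c), abs(x - second_c)
    total = d_best + d_second
    confidence = 0.5 if total == 0 else d_second / total
    return best_label, confidence


def pseudo_label_training(labeled: List[Example], unlabeled: List[float],
                          threshold: float = 0.8, rounds: int = 3):
    """Grow the training set with confident pseudo-labels only."""
    train, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        centroids = fit_centroids(train)
        confident, remaining = [], []
        for x in pool:
            label, confidence = predict(centroids, x)
            if confidence >= threshold:      # assumed noise filter
                confident.append((x, label))
            else:
                remaining.append(x)
        if not confident:                    # nothing new to learn from
            break
        train.extend(confident)
        pool = remaining
    return fit_centroids(train), train, pool


labeled = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]
unlabeled = [0.5, 9.5, 5.0, 2.0]
model, train, pool = pseudo_label_training(labeled, unlabeled)
print(len(train), pool)
```

The ambiguous point at 5.0 never clears the threshold and stays in the unlabeled pool, which is the intended effect of filtering: low-confidence pseudo-labels are kept out of the training set so they cannot reinforce their own errors.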
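Training parameter               Value
LSTM layers                      2
Number of neural units           100
Learning rate                    0.015
Dropout rate                     0.5
Loss function                    Cross-entropy loss
Batch_size                       10
Optimizer                        Adam
L2 (weight decay rate)           1.0e-8
Char_max_len                     20
Convolution kernel size          3×3
Number of convolution kernels    30
Maximum sentence length          300
Table 1  Parameter settings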
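For readers who want to reuse these settings, one possible way to carry them in code is a small immutable configuration object. This is only a sketch: the field names and the `pad_chars` helper are hypothetical choices made here, while the values are copied from Table 1.

```python
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrainingConfig:
    """Values from Table 1; field names are illustrative assumptions."""
    lstm_layers: int = 2
    hidden_units: int = 100
    learning_rate: float = 0.015
    dropout: float = 0.5
    loss: str = "cross_entropy"
    batch_size: int = 10
    optimizer: str = "adam"
    weight_decay: float = 1.0e-8
    char_max_len: int = 20
    conv_kernel_size: Tuple[int, int] = (3, 3)
    num_conv_kernels: int = 30
    max_sentence_len: int = 300


def pad_chars(token: str, max_len: int, pad: str = "\0") -> str:
    """Truncate or right-pad a token to exactly max_len characters."""
    return token[:max_len].ljust(max_len, pad)


config = TrainingConfig()
padded = pad_chars("lemmatisation", config.char_max_len)
print(asdict(config)["learning_rate"], len(padded))
```

Freezing the dataclass prevents accidental changes to hyperparameters during a run, and `asdict` makes it easy to log the full setting alongside results.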
Model (features)               Precision    Recall    F1
BiLSTM-CNN-CRF(Baseline)       25.57%       8.17%     12.38%
BiLSTM-CNN-CRF(POS)            30.43%       15.26%    20.33%
BiLSTM-CNN-CRF(POS+REF)        60.42%       49.15%    54.21%
BiLSTM-CNN-CRF(POS+DIS)        61.18%       51.07%    55.67%
BiLSTM-CNN-CRF(REF+DIS)        61.71%       56.02%    58.73%
BiLSTM-CNN-CRF(POS+REF+DIS)    62.96%       57.63%    60.18%
BERT                           52.13%       51.55%    51.94%
Table 2  Test-set results for each feature combination
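The F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). A short sketch of that calculation, applied to two rows of Table 2, is given here. The function name is a choice made for this example, and the page does not state how the scores were averaged over entities, so treat this as the standard formula rather than the authors' exact evaluation script.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %) copied from Table 2
rows = {
    "BiLSTM-CNN-CRF(Baseline)": (25.57, 8.17),
    "BiLSTM-CNN-CRF(POS+REF+DIS)": (62.96, 57.63),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}%")
# BiLSTM-CNN-CRF(Baseline): F1 = 12.38%
# BiLSTM-CNN-CRF(POS+REF+DIS): F1 = 60.18%
```

Because the harmonic mean is dominated by the smaller value, the baseline's low recall (8.17%) pulls its F1 far below its precision.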
Prediction model: BiLSTM-CNN-CRF(POS+REF+DIS)
Prediction results:
We have adopted the Conditional Maximum Entropy (MaxEnt) modeling paradigm as outlined in REF3 and REF19.
To quickly (and approximately) evaluate this phenomenon, we trained the statistical IBM word-alignment model 4 REF7, using the GIZA ++ software REF11 for the following language pairs: Chinese-English, Italian-English, and Dutch-English, using the IWSLT-2006 corpus REF23 for the first two language pairs, and the Europarl corpus REF9 for the last one.
In computational linguistic literature, much effort has been devoted to phonetic transliteration, such as English-Arabic, English-Chinese REF5, English-Japanese REF6 and English-Korean.
Tokenisation, species word identification and chunking were implemented in-house using the LTXML2 tools REF4, whilst abbreviation extraction used the Schwartz and Hearst abbreviation extractor REF9 and lemmatisation used morpha REF12.
Table 3  Examples of differences between annotated and predicted results