Automatic Identification of Term Citation Object with Feature Fusion
Na Ma 1,2, Zhixiong Zhang 1,2,3,4 (corresponding author), Pengmin Wu 5
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China
4 Hubei Key Laboratory of Big Data in Science and Technology, Wuhan 430071, China
5 Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
[Objective] This paper explores a method for automatically identifying term citation objects in scientific papers, combining feature fusion with a pseudo-label noise-reduction strategy. [Methods] First, we cast the identification of term citation objects as a sequence labeling task. Then, we fused linguistic and heuristic features of term citation objects in the input layer of a BiLSTM-CNN-CRF model, which enhanced their feature representations. Finally, we designed a pseudo-label learning noise-reduction mechanism and compared the performance of different models. [Results] The best F1 score of our method reached 0.6018, which was 8% higher than that of the BERT model. [Limitations] The experimental data were collected from computer science articles, so our model still needs to be examined on data from other fields. [Conclusions] The proposed method can effectively identify term citation objects.
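The [Methods] steps above can be illustrated with a minimal sketch (not the authors' code): the task is cast as BIO sequence labeling, and a simple heuristic feature, such as a token's proximity to a citation marker, is computed per token so it could be concatenated with word embeddings at the model's input layer. The tag names, span indices, window size, and the `REF`-prefix marker convention here are illustrative assumptions.

```python
def to_bio_labels(tokens, span):
    """Label tokens with B/I inside the half-open (start, end) object span, O elsewhere."""
    start, end = span
    labels = []
    for i, _ in enumerate(tokens):
        if i == start:
            labels.append("B-OBJ")
        elif start < i < end:
            labels.append("I-OBJ")
        else:
            labels.append("O")
    return labels

def citation_proximity_feature(tokens, window=3):
    """Binary feature: 1 if a token lies within `window` tokens of a REF-style citation marker."""
    ref_positions = [i for i, t in enumerate(tokens) if t.startswith("REF")]
    return [
        int(any(abs(i - p) <= window for p in ref_positions))
        for i in range(len(tokens))
    ]

# Toy sentence adapted from the corpus examples below; the object span
# "IBM word-alignment model 4" is annotated by hand for illustration.
tokens = ["we", "trained", "the", "IBM", "word-alignment", "model", "4", "REF7"]
labels = to_bio_labels(tokens, (3, 7))
feats = citation_proximity_feature(tokens)
```

In the actual model, a per-token feature vector like `feats` would be embedded or used directly and concatenated with the word and character representations before the BiLSTM layer.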
马娜, 张智雄, 吴朋民. 基于特征融合的术语型引用对象自动识别方法研究[J]. 数据分析与知识发现, 2020, 4(1): 89-98.
Na Ma,Zhixiong Zhang,Pengmin Wu. Automatic Identification of Term Citation Object with Feature Fusion. Data Analysis and Knowledge Discovery, 2020, 4(1): 89-98.
We have adopted the Conditional Maximum Entropy (MaxEnt) modeling paradigm as outlined in REF3 and REF19.
To quickly (and approximately) evaluate this phenomenon, we trained the statistical IBM word-alignment model 4 REF7, using the GIZA++ software REF11, for the following language pairs: Chinese-English, Italian-English, and Dutch-English, using the IWSLT-2006 corpus REF23 for the first two language pairs, and the Europarl corpus REF9 for the last one.
In computational linguistic literature, much effort has been devoted to phonetic transliteration, such as English-Arabic, English-Chinese REF5, English-Japanese REF6 and English-Korean.
Tokenisation, species word identification and chunking were implemented in-house using the LTXML2 tools REF4, whilst abbreviation extraction used the Schwartz and Hearst abbreviation extractor REF9 and lemmatisation used morpha REF12.
Ding Y, Zhang G, Chambers T, et al. Content-based Citation Analysis: The Next Generation of Citation Analysis[J]. Journal of the Association for Information Science and Technology, 2014,65(9):1820-1833.
Zhao Rongying, Zeng Xianqin, Chen Bikun. Citation in Full-text: The Development of Citation Analysis[J]. Library & Information Service, 2014,58(9):129-135.
Small H G. Cited Documents as Concept Symbols[J]. Social Studies of Science, 1978,8(3):327-340.
Qazvinian V, Radev D R. Scientific Paper Summarization Using Citation Summary Networks [C]// Proceedings of the 22nd International Conference on Computational Linguistics, Manchester. Association for Computational Linguistics, 2008: 689-696.
Qazvinian V, Radev D R, Ozgur A. Citation Summarization Through Keyphrase Extraction [C]// Proceedings of the 23rd International Conference on Computational Linguistics, Beijing. Association for Computational Linguistics, 2010: 895-903.
Jha R, Finegan-Dollak C, King B, et al. Content Models for Survey Generation: A Factoid-Based Evaluation [C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing. Association for Computational Linguistics, 2015,1:441-450.
Anderson M H, Sun P Y T. What Have Scholars Retrieved from Walsh and Ungson (1991)? A Citation Context Study[J]. Management Learning, 2010,41(2):131-145.
Radoulov R. Exploring Automatic Citation Classification[D]. Waterloo: University of Waterloo, 2008.
许德山. 科技论文引用中的观点倾向分析[D]. 北京: 中国科学院文献情报中心, 2012.
(Xu Deshan. Sentiment Orientation Analysis for Evaluation Information of Citation on Scientific & Technical Paper[D]. Beijing: National Science Library, Chinese Academy of Sciences, 2012.)
Khalid A, Khan F A, Imran M, et al. Reference Terms Identification of Cited Articles as Topics from Citation Contexts[J]. Computers and Electrical Engineering, 2019,74:569-580.
Ma X, Hovy E . End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF [C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany. Association for Computational Linguistics, 2016: 1064-1074.
Bengio Y, Ducharme R, Vincent P, et al. A Neural Probabilistic Language Model[J]. Journal of Machine Learning Research, 2003,3:1137-1155.
Santos C D, Zadrozny B. Learning Character-Level Representations for Part-of-Speech Tagging [C]// Proceedings of the 31st International Conference on Machine Learning, Beijing. Association for Computational Linguistics, 2014: 1818-1826.
Rei M, Crichton G K O, Pyysalo S. Attending to Characters in Neural Sequence Labeling Models [C]// Proceedings of the 26th International Conference on Computational Linguistics, Osaka, Japan. Association for Computational Linguistics, 2016: 309-318.
Zhao Hong, Wang Fang. A Deep Learning Model and Self-Training Algorithm for Theoretical Terms Extraction[J]. Journal of the China Society for Scientific and Technical Information, 2018,37(9):923-938.
Zhang Z Y, Han X, Liu Z Y, et al. ERNIE: Enhanced Language Representation with Informative Entities[OL]. arXiv Preprint. arXiv:1905.07129.
Shen Y Y, Yun H, Lipton Z C, et al. Deep Active Learning for Named Entity Recognition [C]// Proceedings of the 2nd Workshop on Representation Learning for NLP, Vancouver, Canada. Association for Computational Linguistics, 2017: 252-256.
Ye Z X, Ling Z H. Hybrid Semi-Markov CRF for Neural Sequence Labeling[OL]. arXiv Preprint. arXiv:1805.03838.
Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint. arXiv:1810.04805.
Bikel D M, Miller S, Schwartz R, et al. Nymble: A High-Performance Learning Name-finder [C]// Proceedings of the 5th Conference on Applied Natural Language Processing, Washington. Association for Computational Linguistics, 1997: 194-201.
Lafferty J, McCallum A, Pereira F C N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data [C]// Proceedings of the 18th International Conference on Machine Learning, Williamstown, USA. Morgan Kaufmann Publishers Inc, 2001: 282-289.
Ma C, Zheng H F, Xie P, et al. DM_NLP at SemEval-2018 Task 8: Neural Sequence Labeling with Linguistic Features [C]// Proceedings of the 12th International Workshop on Semantic Evaluation, New Orleans, USA. Association for Computational Linguistics, 2018: 707-711.
Pennington J, Socher R, Manning C. Glove: Global Vectors for Word Representation [C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar. Association for Computational Linguistics, 2014: 1532-1543.
Lee D H. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks [C]// Proceedings of the 30th International Conference on Machine Learning, Atlanta, USA. 2013.
Li Z, Ko B S, Choi H J. Naive Semi-supervised Deep Learning Using Pseudo-label[J]. Peer-to-Peer Networking and Applications, 2019,12(5):1358-1368.
Dempster A P, Laird N M, Rubin D B. Maximum Likelihood from Incomplete Data via the EM Algorithm[J]. Journal of the Royal Statistical Society: Series B, 1977,39(1):1-38.
Radev D R, Muthukrishnan P, Qazvinian V, et al. The ACL Anthology Network Corpus[J]. Language Resources and Evaluation, 2013,47(4):919-944.
Manning C, Surdeanu M, Bauer J, et al. The Stanford CoreNLP Natural Language Processing Toolkit [C]// Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, USA. Association for Computational Linguistics, 2014: 55-60.
Sang E F, De Meulder F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition[OL]. arXiv Preprint. arXiv:cs/0306050.