Automatically Detecting and Tagging Foreign Language Citation Metadata
Jiang Lin1,2, Wang Dongbo3
1School of Information Management, Nanjing University, Nanjing 210023, China
2Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
3College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
[Objective] This paper proposes a new method to automatically extract bibliographic metadata with the help of semantic knowledge and machine learning technologies. [Methods] We used a neural network model to create word vectors from manually segmented citation data and found that metadata of the same type are relatively concentrated in particular regions of the vector space. Based on this observation, we proposed a new SVM classification algorithm to classify and annotate bibliographic metadata automatically. [Results] The proposed method achieved high recall and precision on citation data, especially for citations in various languages and citations containing abbreviations. [Limitations] The fine-grained extraction of time-related content could be improved. [Conclusions] The proposed method can effectively detect and tag bibliographic metadata, and it improves the system’s compatibility and fault tolerance.
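The pipeline outlined in the abstract (word vectors for citation fields, followed by SVM classification) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the 3-dimensional `TOY_VECTORS` are hand-assigned rather than learned by a neural network model, the three labels (`author`, `venue`, `year`) are a simplified tag set, and the classifier is a plain one-vs-rest linear SVM trained by batch subgradient descent, not the new SVM algorithm the paper proposes. The helper names `field_vector`, `train_linear_svm`, `train_one_vs_rest`, and `tag_field` are hypothetical.

```python
# Toy 3-d "word vectors" (hypothetical, hand-assigned values). In the paper,
# vectors are learned by a neural network model from manually split citations.
TOY_VECTORS = {
    "jiang": (0.9, 0.1, 0.0), "wang": (0.8, 0.0, 0.1),
    "lee": (0.9, 0.0, 0.1), "peng": (1.0, 0.1, 0.0),
    "proceedings": (0.1, 0.9, 0.0), "conference": (0.0, 0.8, 0.1),
    "journal": (0.1, 1.0, 0.1), "systems": (0.2, 0.8, 0.0),
    "2003": (0.0, 0.1, 0.9), "2007": (0.1, 0.0, 1.0), "2004": (0.0, 0.0, 0.9),
}


def field_vector(text):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vecs = [TOY_VECTORS[t] for t in text.lower().split() if t in TOY_VECTORS]
    if not vecs:
        return (0.0, 0.0, 0.0)
    return tuple(sum(column) / len(vecs) for column in zip(*vecs))


def score(w, b, x):
    """Signed distance-like score of x under the linear model (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(xs, ys, lam=0.01, lr=0.1, epochs=300):
    """Batch subgradient descent on the L2-regularized hinge loss (ys are +1/-1)."""
    w, b = [0.0] * len(xs[0]), 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w, grad_b = [lam * wi for wi in w], 0.0
        for x, y in zip(xs, ys):
            if y * score(w, b, x) < 1:  # sample violates the margin
                grad_w = [g - y * xi / n for g, xi in zip(grad_w, x)]
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def train_one_vs_rest(samples):
    """Train one binary SVM per label: that label against all others."""
    xs = [field_vector(text) for text, _ in samples]
    labels = sorted({label for _, label in samples})
    return {
        lab: train_linear_svm(xs, [1 if l == lab else -1 for _, l in samples])
        for lab in labels
    }


def tag_field(models, text):
    """Return the label whose SVM gives the highest score for this field."""
    x = field_vector(text)
    return max(models, key=lambda lab: score(*models[lab], x))


TRAIN = [
    ("Jiang Wang", "author"), ("Lee Peng", "author"), ("Wang Lee", "author"),
    ("Proceedings Conference", "venue"), ("Journal Systems", "venue"),
    ("Conference Systems", "venue"),
    ("2003", "year"), ("2007", "year"), ("2004", "year"),
]

models = train_one_vs_rest(TRAIN)
for field in ("Peng Jiang", "Proceedings Journal", "2007"):
    print(field, "->", tag_field(models, field))
```

Because fields of the same type sit close together in the toy vector space, a linear decision boundary per label is enough to separate them here; real citation data would need learned vectors, a richer tag set, and held-out evaluation.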