Please wait a minute...
Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (1): 47-54    DOI: 10.11925/infotech.2096-3467.2017.01.06
Orginal Article Current Issue | Archive | Adv Search |
Automatically Detecting and Tagging Foreign Language Citation Metadata
Lin Jiang1,2(),Dongbo Wang3
1School of Information Management, Nanjing University, Nanjing 210023, China
2Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
3College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
Download: PDF(1285 KB)   HTML ( 46
Export: BibTeX | EndNote (RIS)      

[Objective]This paper proposes a new method to automatically extract bibliographic metadata, with the help of semantic knowledge and machine learning technologies. [Methods] We used the neural network model to create word vectors from manually split data, and then found that same type of metadata is relatively concentrated at certain locations in the vector space. Thus, we proposed a new SVM classification algorithm to classify and annotate the bibliographic metadata automatically. [Results] The proposed method achieved high recall and precision rates with citation data, especially for citations with various languages and abbreviations. [Limitations] The fine-grained extraction of the time related content could be improved. [Conclusions] The proposed method could effectively detect and tag bibliographic metadata, and improve the system’s compatibility and fault tolerance ability.

Key wordsBibliographic Metadata      Metadata Extraction      Machine Learning      Neural Network     
Received: 18 August 2016      Published: 22 February 2017

Cite this article:

Lin Jiang,Dongbo Wang. Automatically Detecting and Tagging Foreign Language Citation Metadata. Data Analysis and Knowledge Discovery, 2017, 1(1): 47-54.

URL:     OR

[1] 蒋新. 英美学术文献的几种主要引文方式[J]. 图书与情报, 2003(3): 26-30.
[1] (Jiang Xin.Several Main Quotation Ways in British-American Academic Documents[J]. Library and Information, 2003(3): 26-30.)
[2] Wei W, King I, Lee J H M. Bibliographic Attributes Extraction with Layer-upon-Layer Tagging[C]//Proceedings of the 9th International Conference on Document Analysis and Recognition. IEEE, 2007, 2: 804-808.
[3] Besagni D, Bela?d A, Benet N.A Segmentation Method for Bibliographic References by Contextual Tagging of Fields[C]//Proceedings of the 7th International Conference on Document Analysis and Recognition. IEEE, 2003: 384-388.
[4] 李朝光, 张铭, 邓志鸿, 等. 论文元数据信息的自动抽取[J]. 计算机工程与应用, 2002, 38(21): 189-191, 235.
[4] (Li Chaoguang, Zhang Ming, Deng Zhihong, et al.Automatic Metadata Extraction for Scientific Documents[J]. Computer Engineering and Applications, 2002, 38(21): 189-191, 235.)
[5] Day M Y, Tsai R T H, Sung C L, et al. Reference Metadata Extraction Using a Hierarchical Knowledge Representation Framework[J]. Decision Support Systems, 2007, 43(1): 152-167.
[6] Cortez E, da Silva A S, Gon?alves M A, et al. FLUX-CIM: Flexible Unsupervised Extraction of Citation Metadata[C]//Proceedings of the 7th ACM/IEEE Joint Conference on Digital Libraries. ACM, 2007: 215-224.
[7] Huang I A, Ho J M, Kao H Y, et al.Extracting Citation Metadata from Online Publication Lists Using BLAST[C]// Proceedings of the 8th Pacific-Asia Conference, PAKDD 2004. Springer Berlin Heidelberg, 2004: 539-548.
[8] Chen C C, Yang K H, Kao H Y, et al.BibPro: A Citation Parser Based on Sequence Alignment Techniques[C]// Proceedings of the 22nd International Conference on Advanced Information Networking and Applications- Workshops (AINAW 2008). IEEE, 2008: 1175-1180.
[9] Han H, Giles C L, Manavoglu E, et al.Automatic Document Metadata Extraction Using Support Vector Machines[C]// Proceedings of the 2003 Joint Conference on Digital Libraries. IEEE, 2003: 37-48.
[10] Peng F, McCallum A. Accurate Information Extraction from Research Papers Using Conditional Random Fields[C] // Proceedings of the Human Language Technology Conference of the North American Chapter of the Association-for- Computational-Linguistics. 2004:329-336.
[11] Yu J, Fan X.Metadata Extraction from Chinese Research Papers Based on Conditional Random Fields[C]//Proceedings of the 4th International Conference on Fuzzy Systems and Knowledge Discovery. IEEE, 2007, 1: 497-501.
[12] Mikolov T, Le Q V, Sutskever I. Exploiting Similarities Among Languages for Machine Translation [OL]. arXiv Preprint.arXiv:1309.4168, 2013.
[13] Mikolov T. Word2Vec Code [EB/OL]. [2015-09-18]. .
[14] 周练. Word2Vec 的工作原理及应用探究[J]. 科技情报开发与经济, 2015 (2): 145-148.
[14] (Zhou Lian.Exploration of the Working Principle and Application of Word2Vec[J]. Sci-Tech Information Development & Economy, 2015 (2): 145-148.)
[15] Stitson M O, Weston J A E, et al. Theory of Support Vector Machines [R]. Technical Report, CSD-TR-96-17, London: University of London, 1996.
[16] Lafferty J, McCallum A, Pereira F C N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data [EB/OL]. [2016-07-15]. .
[1] Jiahui Hu,An Fang,Wanqing Zhao,Chenliu Yang,Huiling Ren. Annotating Chinese E-Medical Record for Knowledge Discovery[J]. 数据分析与知识发现, 2019, 3(7): 123-132.
[2] Zhenyu He,Xiangxiang Dong,Qinghua Zhu. Classifying Baidu Encyclopedia Entries with User Behaviors[J]. 数据分析与知识发现, 2019, 3(6): 117-122.
[3] Kan Liu,Lu Chen. Deep Neural Network Learning for Medical Triage[J]. 数据分析与知识发现, 2019, 3(6): 99-108.
[4] Wancheng Chen,Haoran Dai,Yinghan Jin. Appraising Home Prices with HEDONIC Model: Case Study of Seattle, U.S.[J]. 数据分析与知识发现, 2019, 3(5): 19-26.
[5] Jinzhu Zhang,Yiming Hu. Extracting Titles from Scientific References in Patents with Fusion of Representation Learning and Machine Learning[J]. 数据分析与知识发现, 2019, 3(5): 68-76.
[6] Zhiqiang Liu,Yuncheng Du,Shuicai Shi. Extraction of Key Information in Web News Based on Improved Hidden Markov Model[J]. 数据分析与知识发现, 2019, 3(3): 120-128.
[7] Hongxia Xu,Chunwang Li. Review of Knowledge Extraction of Scientific Literature[J]. 数据分析与知识发现, 2019, 3(3): 14-24.
[8] Zixuan Zhang,Hao Wang,Liping Zhu,Sanhong eng. Identifying Risks of HS Codes by China Customs[J]. 数据分析与知识发现, 2019, 3(1): 72-84.
[9] Lina Liu,Jiayin Qi,Zhenping Zhang,Dan Zeng. Analyzing Impacts of Brand Reputation on Online Sales Based on Massive Commodity Reviews and Brand[J]. 数据分析与知识发现, 2018, 2(9): 10-21.
[10] Yuemei Xu,Sining Lv,Lianqiao Cai,Xiaoya Zhang. Analyzing News Topic Evolution with Convolutional Neural Networks and Topic2Vec[J]. 数据分析与知识发现, 2018, 2(9): 31-41.
[11] Xiaoyu Ma,Han Zhang,Yuhong Zhao. Building Childhood Asthma Prediction Model with Artificial Neural Network and BRFSS Database[J]. 数据分析与知识发现, 2018, 2(8): 10-15.
[12] Longjia Jia,Bangzuo Zhang. Classifying Topics of Internet Public Opinion from College Students: Case Study of Sina Weibo[J]. 数据分析与知识发现, 2018, 2(7): 55-62.
[13] Wei Lu,Mengqi Luo,Heng Ding,Xin Li. Image Annotation Tags by Deep Learning and Real Users: A Comparative Study[J]. 数据分析与知识发现, 2018, 2(5): 1-10.
[14] Li Wang,Lixue Zou,Xiwen Liu. Visualizing Document Correlation Based on LDA Model[J]. 数据分析与知识发现, 2018, 2(3): 98-106.
[15] Xinyue Fan,Lei Cui. Predicting Antineoplastic Drug Targets Based on Network Properties[J]. 数据分析与知识发现, 2018, 2(12): 98-108.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938