Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (1): 47-54    DOI: 10.11925/infotech.2096-3467.2017.01.06
Original Article
Automatically Detecting and Tagging Foreign Language Citation Metadata
Jiang Lin1,2, Wang Dongbo3
1School of Information Management, Nanjing University, Nanjing 210023, China
2Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
3College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
Abstract  

[Objective] This paper proposes a new method for automatically extracting bibliographic metadata with the help of semantic knowledge and machine learning. [Methods] We used a neural network model to create word vectors from manually segmented data and found that metadata of the same type is relatively concentrated in certain regions of the vector space. Based on this observation, we proposed a new SVM classification algorithm to classify and annotate bibliographic metadata automatically. [Results] The proposed method achieved high recall and precision on citation data, especially for citations in various languages and citations containing abbreviations. [Limitations] Fine-grained extraction of time-related content could be improved. [Conclusions] The proposed method can effectively detect and tag bibliographic metadata and improves the system’s compatibility and fault tolerance.

Key words: Bibliographic Metadata; Metadata Extraction; Machine Learning; Neural Network
Received: 18 August 2016      Published: 22 February 2017
CLC number (ZTFLH): G254

Cite this article:

Jiang Lin,Wang Dongbo. Automatically Detecting and Tagging Foreign Language Citation Metadata. Data Analysis and Knowledge Discovery, 2017, 1(1): 47-54.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.01.06     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I1/47

Class label  Category
1            Author name
2            Article title
3            Journal or book title
4            Place
5            Publisher
6            Publication date and page numbers
Unit content                        Dist. to cluster 1  Dist. to cluster 2  Dist. to cluster 3  Dist. to cluster 4  Dist. to cluster 5  Dist. to cluster 6  Position feature of split unit
Chatterjee                          169.70              172.06              140.57              101.79              53.43               138.36              0.17
S*                                  57.93               55.77               86.09               124.75              174.15              89.56               0.33
Regression and Analysis by Example  17.64               17.11               18.00               56.29               106.29              20.70               0.50
John Wiley & Sons Inc               110.96              113.44              81.81               43.00               13.34               80.03               0.67
2000                                164.11              166.58              135.09              96.33               48.81               132.70              0.83
248                                 168.45              170.95              139.48              100.74              52.93               137.23              1.00
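
Each row above can be read as a seven-element feature vector: six cluster distances followed by a positional feature. The source does not say how the positional feature is computed. The six values in the table (0.17, 0.33, 0.50, 0.67, 0.83, 1.00) are consistent with the unit's 1-based index divided by the number of units, and the sketch below assumes exactly that. The function names are hypothetical, and the distances are copied from the table rather than recomputed from word vectors.

```python
# Sketch only. Assumption: positional feature = (1-based index) / (number of units).
# This matches the table values but is not stated in the source.

def position_feature(index, total):
    """Relative position of a split unit within its citation, rounded to 2 places."""
    return round(index / total, 2)


def build_feature_vector(cluster_distances, index, total):
    """Concatenate the six cluster distances with the positional feature."""
    return list(cluster_distances) + [position_feature(index, total)]


# Distances copied from the table for the first and fourth units.
chatterjee = [169.70, 172.06, 140.57, 101.79, 53.43, 138.36]
wiley = [110.96, 113.44, 81.81, 43.00, 13.34, 80.03]

positions = [position_feature(i, 6) for i in range(1, 7)]
features_chatterjee = build_feature_vector(chatterjee, 1, 6)
features_wiley = build_feature_vector(wiley, 4, 6)
```

In the table, "Chatterjee", "2000", and "248" all have their smallest distance to cluster 5, so the nearest cluster alone does not separate these units. The positional feature adds information that a classifier such as the SVM mentioned in the abstract could use.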
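Tag  Meaning
B    Begin: start of a publisher name
C    Continue: the name continues and is not yet finished
E    End: end of a publisher name
SW   Single Word: a publisher name consisting of one word
N    Not: a word that is not part of a publisher name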
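
As a minimal sketch of how the B/C/E/SW/N scheme above could be applied when preparing training data, the following assigns tags to a token list given the positions of known publisher names. The function name and the `(start, end)` span format are hypothetical and not taken from the source.

```python
def tag_tokens(tokens, publisher_spans):
    """Assign B/C/E/SW/N tags to tokens.

    publisher_spans is a list of (start, end) token indices, end inclusive.
    This span format is an assumption made for illustration.
    """
    tags = ["N"] * len(tokens)
    for start, end in publisher_spans:
        if start == end:
            tags[start] = "SW"  # single-word publisher name
        else:
            tags[start] = "B"
            for i in range(start + 1, end):
                tags[i] = "C"
            tags[end] = "E"
    return tags


multi_word = tag_tokens(
    ["McGraw", "-", "Hill", "Book", "Company", ",", "1982"], [(0, 4)]
)
single_word = tag_tokens(["Springer", ",", "2004"], [(0, 0)])
```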
Token        POS tag  Sequence tag
Ollman       NNP      N
,            ,        N
Bertell      NNP      N
Left         VBN      N
Academy      NNP      N
-            :        N
Marxist      JJ       N
Scholarship  NN       N
on           IN       N
American     JJ       N
Campuses     NNS      N
.            .        N
McGraw       NNP      B
-            :        C
Hill         NNP      C
Book         NN       C
Company      NN       E
,            ,        N
1982         CD       N
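
Once a sequence has been tagged as in the example above, the publisher name can be recovered by collecting tokens from a B tag through the matching E tag, or from a single SW tag. The sketch below is one way to do that and is not taken from the source. The function name is hypothetical, and joining tokens with single spaces is a simplification that does not restore the original spacing around the hyphen.

```python
def extract_publishers(tokens, tags):
    """Collect publisher names from B/C/E/SW/N sequence tags."""
    names, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "SW":
            names.append(token)
            current = []
        elif tag == "B":
            current = [token]
        elif tag == "C" and current:
            current.append(token)
        elif tag == "E" and current:
            current.append(token)
            names.append(" ".join(current))
            current = []
        else:
            current = []  # N, or a stray C/E with no open span
    return names


# The last seven tokens of the tagged example, with their sequence tags.
tokens = ["McGraw", "-", "Hill", "Book", "Company", ",", "1982"]
tags = ["B", "C", "C", "C", "E", "N", "N"]
publishers = extract_publishers(tokens, tags)
```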