Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (1): 47-54    DOI: 10.11925/infotech.2096-3467.2017.01.06
Original Article
Automatically Detecting and Tagging Foreign Language Citation Metadata
Jiang Lin1,2, Wang Dongbo3
1School of Information Management, Nanjing University, Nanjing 210023, China
2Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
3College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China
Abstract  

[Objective] This paper proposes a new method for automatically extracting bibliographic metadata with the help of semantic knowledge and machine learning techniques. [Methods] We used a neural network model to create word vectors from manually segmented data and found that metadata of the same type is relatively concentrated in certain regions of the vector space. Based on this observation, we proposed a new SVM classification algorithm to classify and annotate bibliographic metadata automatically. [Results] The proposed method achieved high recall and precision on citation data, especially for citations in various languages and citations containing abbreviations. [Limitations] The fine-grained extraction of time-related content could be improved. [Conclusions] The proposed method can effectively detect and tag bibliographic metadata, and it improves the system’s compatibility and fault tolerance.

Key words: Bibliographic Metadata; Metadata Extraction; Machine Learning; Neural Network
Received: 18 August 2016      Published: 22 February 2017
ZTFLH:  G254  

Cite this article:

Jiang Lin,Wang Dongbo. Automatically Detecting and Tagging Foreign Language Citation Metadata. Data Analysis and Knowledge Discovery, 2017, 1(1): 47-54.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.01.06     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I1/47

Class label   Class represented
1             Author name
2             Document title
3             Journal title or book title
4             Place
5             Publisher
6             Publication date and page numbers
Unit content                         Distance to cluster                                     Position feature
                                     1        2        3        4        5        6          of the split unit
Chatterjee                           169.70   172.06   140.57   101.79   53.43    138.36     0.17
S*                                   57.93    55.77    86.09    124.75   174.15   89.56      0.33
Regression and Analysis by Example   17.64    17.11    18.00    56.29    106.29   20.70      0.50
John Wiley & Sons Inc                110.96   113.44   81.81    43.00    13.34    80.03      0.67
2000                                 164.11   166.58   135.09   96.33    48.81    132.70     0.83
248                                  168.45   170.95   139.48   100.74   52.93    137.23     1.00
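As a rough illustration of how a row of this table could become classifier input, the sketch below assembles a seven-dimensional feature vector (six cluster distances plus the position feature) for each unit. It assumes the position feature is the unit's 1-based index divided by the number of units, which matches the tabulated values (1/6 ≈ 0.17 up to 6/6 = 1.00) but is not stated explicitly on this page. The names `position_feature`, `build_features` and `nearest_cluster` are hypothetical, and the SVM training step itself is omitted.

```python
from typing import List, Tuple

# Distances to clusters 1-6, copied from the table.
UNITS: List[Tuple[str, List[float]]] = [
    ("Chatterjee", [169.70, 172.06, 140.57, 101.79, 53.43, 138.36]),
    ("S*", [57.93, 55.77, 86.09, 124.75, 174.15, 89.56]),
    ("Regression and Analysis by Example", [17.64, 17.11, 18.00, 56.29, 106.29, 20.70]),
    ("John Wiley & Sons Inc", [110.96, 113.44, 81.81, 43.00, 13.34, 80.03]),
    ("2000", [164.11, 166.58, 135.09, 96.33, 48.81, 132.70]),
    ("248", [168.45, 170.95, 139.48, 100.74, 52.93, 137.23]),
]


def position_feature(index: int, total: int) -> float:
    """Relative position of a unit (assumption: 1-based index / unit count)."""
    return round((index + 1) / total, 2)


def build_features(units: List[Tuple[str, List[float]]]) -> List[List[float]]:
    """Return one [d1, ..., d6, position] vector per split unit."""
    total = len(units)
    return [dists + [position_feature(i, total)]
            for i, (_, dists) in enumerate(units)]


def nearest_cluster(dists: List[float]) -> int:
    """1-based number of the cluster with the smallest distance."""
    return dists.index(min(dists)) + 1


features = build_features(UNITS)
for (text, dists), vec in zip(UNITS, features):
    print(f"{text!r}: nearest cluster {nearest_cluster(dists)}, features {vec}")
```

In these six rows the smallest distance picks cluster 2 for the title unit and cluster 5 for the publisher unit, but also cluster 5 for "Chatterjee", "2000" and "248". If the cluster numbers correspond to the class labels, the distances alone do not separate every unit, which is consistent with passing them to a classifier together with the position feature rather than using them directly as a decision rule.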
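Tag   Meaning
B     Begin: start of a publisher name
C     Continue: the name continues and is not yet complete
E     End: end of a publisher name
SW    Single Word: a publisher name consisting of a single word
N     Not: a token that is not part of a publisher name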
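A minimal sketch of how this tagging scheme could be applied to a tokenized citation, assuming the publisher-name span is already known as a half-open token range. The function `tag_publisher` and its parameters are hypothetical names chosen for illustration, and the single-word example is made up to exercise the SW tag.

```python
from typing import List


def tag_publisher(tokens: List[str], start: int, end: int) -> List[str]:
    """Label tokens[start:end] as a publisher name with B/C/E/SW; others get N.

    `start` and `end` form a half-open range; pass start == end for no publisher.
    """
    tags = ["N"] * len(tokens)
    length = end - start
    if length == 1:
        tags[start] = "SW"              # single-word publisher name
    elif length > 1:
        tags[start] = "B"               # first token of the name
        for i in range(start + 1, end - 1):
            tags[i] = "C"               # interior tokens, name not finished
        tags[end - 1] = "E"             # last token of the name
    return tags


tokens = ["McGraw", "-", "Hill", "Book", "Company", ",", "1982"]
print(tag_publisher(tokens, 0, 5))
# ['B', 'C', 'C', 'C', 'E', 'N', 'N']

# Illustrative single-word case (hypothetical input).
print(tag_publisher(["Springer", ",", "2004"], 0, 1))
# ['SW', 'N', 'N']
```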
Token         POS tag   Sequence label
Ollman        NNP       N
,             ,         N
Bertell       NNP       N
Left          VBN       N
Academy       NNP       N
-             :         N
Marxist       JJ        N
Scholarship   NN        N
on            IN        N
American      JJ        N
Campuses      NNS       N
.             .         N
McGraw        NNP       B
-             :         C
Hill          NNP       C
Book          NN        C
Company       NN        E
,             ,         N
1982          CD        N
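A labeled sequence like this one can be decoded back into publisher names. The sketch below is one possible decoder, not the authors' code: the function name `extract_publishers` is hypothetical, and it simply joins tokens with spaces, so the hyphen in the example comes back as a separate token. A C or E tag without a preceding B is treated as malformed and dropped.

```python
from typing import List


def extract_publishers(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect publisher names from tokens labeled with B/C/E/SW/N tags."""
    names: List[str] = []
    current: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "SW":
            names.append(token)          # complete one-word name
            current = []
        elif tag == "B":
            current = [token]            # open a new span
        elif tag == "C" and current:
            current.append(token)        # extend the open span
        elif tag == "E" and current:
            current.append(token)        # close the span and emit it
            names.append(" ".join(current))
            current = []
        else:
            current = []                 # N, or C/E with no open span
    return names


tokens = ["Ollman", ",", "Bertell", "Left", "Academy", "-", "Marxist",
          "Scholarship", "on", "American", "Campuses", ".", "McGraw", "-",
          "Hill", "Book", "Company", ",", "1982"]
tags = ["N"] * 12 + ["B", "C", "C", "C", "E"] + ["N", "N"]
print(extract_publishers(tokens, tags))
# ['McGraw - Hill Book Company']
```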