New Technology of Library and Information Service  2009, Vol. 3 Issue (2): 102-106    DOI: 10.11925/infotech.1003-3513.2009.02.17
Automatic Extraction of Semantic Metadata from PDF Research Papers
Zhang Xiuxiu   Ma Jianxia
(The Lanzhou Branch of National Science Library, Chinese Academy of Sciences, Lanzhou 730000, China)

This paper analyzes the content streams of PDF files on the basis of the PDF file structure, and automatically extracts semantic metadata from research papers through rule-based matching and format-based locating. Experimental results show that the method effectively extracts important semantic metadata such as title and author.
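The abstract does not give the actual rules, so the following is only a minimal sketch of the general idea, under stated assumptions: the content stream is already uncompressed, text is shown only with the simple `Tj` operator, and fonts are selected with `Tf`. The heuristics used here (the title is the text in the largest font; the author line is the first later line matching a name pattern) and all function names are hypothetical illustrations, not the authors' method. Real PDF files usually compress their streams and use more operators and font encodings than this sketch handles.

```python
import re

# Matches either a font selection ("/F1 18 Tf") or a simple text-showing
# operation ("(some text) Tj") in an uncompressed content stream.
TOKEN = re.compile(r"/(\w+)\s+([\d.]+)\s+Tf|\(((?:[^()\\]|\\.)*)\)\s*Tj")

# Hypothetical rule: a comma-separated list of capitalized names.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
NAME_LIST = re.compile(rf"^{NAME}(?:\s*,\s*{NAME})*$")


def text_runs(stream):
    """Return (font_size, text) pairs in the order they appear."""
    size = 0.0
    runs = []
    for m in TOKEN.finditer(stream):
        if m.group(1):  # a Tf operator: remember the current font size
            size = float(m.group(2))
        else:  # a Tj operator: record the text with the current size
            # Simplification: only escaped parentheses/backslashes are unescaped.
            text = re.sub(r"\\([()\\])", r"\1", m.group(3))
            runs.append((size, text))
    return runs


def extract_title_and_authors(runs):
    """Apply the two assumed heuristics to a list of (size, text) runs."""
    # Format-based locating: the title is the text set in the largest font.
    max_size = max(size for size, _ in runs)
    title_idx = [i for i, (size, _) in enumerate(runs) if size == max_size]
    title = " ".join(runs[i][1] for i in title_idx)

    # Rule-based matching: the first run after the title that looks like
    # a list of personal names is taken as the author line.
    authors = []
    for _, text in runs[title_idx[-1] + 1:]:
        if NAME_LIST.match(text):
            authors = [name.strip() for name in text.split(",")]
            break
    return title, authors


# Hand-written sample stream (an assumption, not taken from a real file).
SAMPLE = r"""
BT /F1 18 Tf (Automatic Extraction of Semantic Metadata) Tj ET
BT /F1 18 Tf (from PDF Research Papers) Tj ET
BT /F2 11 Tf (Zhang Xiuxiu, Ma Jianxia) Tj ET
BT /F2 9 Tf (This paper analyzes content streams \(uncompressed\).) Tj ET
"""

if __name__ == "__main__":
    title, authors = extract_title_and_authors(text_runs(SAMPLE))
    print(title)
    print(authors)
```

Splitting the work into `text_runs` and `extract_title_and_authors` keeps the stream parsing separate from the extraction rules, so the rules could be changed or extended (for example, to match affiliations or keywords) without touching the parser.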

Key words: PDF; Research paper; Semantic metadata; Automatic extraction
Received: 03 November 2008      Published: 25 February 2009


Corresponding Author: Zhang Xiuxiu

Cite this article:

Zhang Xiuxiu, Ma Jianxia. Automatic Extraction of Semantic Metadata from PDF Research Papers. New Technology of Library and Information Service, 2009, 3(2): 102-106.


