Automatic Extraction of Semantic Metadata from PDF Research Papers

doi:10.11925/infotech.1003-3513.2009.02.17

New Technology of Library and Information Service

2009, Vol. 3, Issue (2): 102-106


Zhang Xiuxiu, Ma Jianxia

(The Lanzhou Branch of National Science Library, Chinese Academy of Sciences, Lanzhou 730000, China)


Abstract

This paper analyzes the content streams of PDF files based on the PDF file structure, and automatically extracts semantic metadata from research papers by rule-based matching and format-based locating. Experimental results show that the method effectively extracts important semantic metadata such as title and author.
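The abstract does not list the actual rules, so the following is only a minimal sketch of how rule-based matching and format-based locating might be combined. It assumes text spans with font size and vertical position have already been recovered from the first page's content stream. The `Span` structure, the "largest font is the title" heuristic, the "first line below the title is the author line" heuristic, and the e-mail pattern are illustrative assumptions, not the authors' published method.

```python
import re
from dataclasses import dataclass


@dataclass
class Span:
    """One run of text with layout attributes (hypothetical structure)."""
    text: str
    font_size: float  # e.g. the size operand of a Tf operator
    y: float          # vertical position; larger means higher on the page


# Rule-based matching: a simple pattern for e-mail addresses (assumed rule).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_metadata(spans):
    """Return title, authors, and e-mails found among first-page spans."""
    if not spans:
        return {"title": "", "authors": [], "emails": []}

    # Format-based locating: assume the title uses the largest font size.
    largest = max(s.font_size for s in spans)
    title_spans = sorted(
        (s for s in spans if abs(s.font_size - largest) < 0.01),
        key=lambda s: -s.y,
    )
    title = " ".join(s.text.strip() for s in title_spans)

    # Assume the author line is the first span below the title block.
    title_bottom = min(s.y for s in title_spans)
    below = sorted((s for s in spans if s.y < title_bottom), key=lambda s: -s.y)
    authors = []
    if below:
        parts = re.split(r"\s*[,;]\s*", below[0].text.strip())
        authors = [p for p in parts if p]

    # Apply the e-mail rule to every span on the page.
    emails = [m for s in spans for m in EMAIL_RE.findall(s.text)]
    return {"title": title, "authors": authors, "emails": emails}


# Hand-made spans standing in for a parsed first page.
first_page = [
    Span("New Technology of Library and Information Service", 9, 800),
    Span("Automatic Extraction of Semantic Metadata", 18, 740),
    Span("from PDF Research Papers", 18, 718),
    Span("Zhang Xiuxiu, Ma Jianxia", 12, 690),
    Span("(The Lanzhou Branch of National Science Library)", 10, 670),
    Span("E-mail: zhangxx@llas.ac.cn", 9, 100),
]
print(extract_metadata(first_page))
```

Heuristics of this kind depend on the layout conventions of a given journal, so a practical system would likely need a separate rule set per document template.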

Key words: PDF; Research paper; Semantic metadata; Automatic extraction

Received: 03 November 2008 Published: 25 February 2009

TP391.43

Corresponding Authors: Zhang Xiuxiu E-mail: zhangxx@llas.ac.cn

About authors: Zhang Xiuxiu, Ma Jianxia


Cite this article:

Zhang Xiuxiu, Ma Jianxia. Automatic Extraction of Semantic Metadata from PDF Research Papers. New Technology of Library and Information Service, 2009, 3(2): 102-106.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2009.02.17 OR https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2009/V3/I2/102


