Automatic Extraction of Semantic Metadata from PDF Research Papers
Zhang Xiuxiu, Ma Jianxia
(The Lanzhou Branch of National Science Library, Chinese Academy of Sciences, Lanzhou 730000, China)
Abstract: This paper analyzes the content streams of PDF files based on the PDF file structure, and automatically extracts semantic metadata from research papers by means of rule-based matching and format-based locating. Experimental results show that the method effectively extracts important semantic metadata such as the title and authors.
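As a rough illustration of the two techniques named in the abstract, the following minimal sketch assumes the PDF content stream has already been decoded into text spans that carry a font size. The `TextSpan` type, the "largest font size is the title" heuristic, the e-mail pattern, and the sample font sizes are all hypothetical choices made for this example; the abstract does not state the paper's actual rules.

```python
import re
from dataclasses import dataclass


@dataclass
class TextSpan:
    """A run of first-page text plus the font size it was drawn in (hypothetical type)."""
    text: str
    font_size: float


# Assumed rule: a simple e-mail pattern, not the paper's actual rule set.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def locate_title(spans):
    """Format-based locating: join the spans set in the largest font size."""
    if not spans:
        return ""
    largest = max(s.font_size for s in spans)
    return " ".join(s.text.strip() for s in spans if s.font_size == largest)


def match_emails(spans):
    """Rule-based matching: collect substrings that match the e-mail pattern."""
    return [m for s in spans for m in EMAIL_RE.findall(s.text)]


# Sample spans; the font sizes are invented for illustration.
spans = [
    TextSpan("Automatic Extraction of Semantic Metadata", 18.0),
    TextSpan("from PDF Research Papers", 18.0),
    TextSpan("Zhang Xiuxiu  Ma Jianxia", 12.0),
    TextSpan("E-mail: zhangxx@llas.ac.cn", 9.0),
]

print(locate_title(spans))
print(match_emails(spans))
```

A real extractor would first have to recover the spans and their font sizes from the PDF text operators, which this sketch deliberately leaves out.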
Received: 03 November 2008
Published: 25 February 2009
Corresponding Author:
Zhang Xiuxiu
E-mail: zhangxx@llas.ac.cn
About the authors: Zhang Xiuxiu, Ma Jianxia