This paper analyzes the content streams of PDF files on the basis of the PDF file structure, and automatically extracts semantic metadata from research papers by combining rule-based matching with format-based locating. Experimental results show that the method effectively extracts important semantic metadata such as title and author.
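The abstract does not spell out the concrete rules, so the following is only a minimal sketch of how format-based locating and rule-based matching could be combined, not the authors' implementation. It assumes an uncompressed content stream that uses only the standard `Tf` (set font and size) and `Tj` (show text) operators. The largest-font heuristic for the title, the `AUTHOR_RULE` pattern, and all function names are hypothetical choices made for illustration.

```python
import re

# Matches either "/FontName size Tf" (set font) or "(text) Tj" (show text).
# Real content streams are usually compressed and use further operators
# (TJ, Td, Tm, ...); this sketch handles only the simplest case.
TOKEN = re.compile(r"/(\w+)\s+([\d.]+)\s+Tf|\(((?:[^()\\]|\\.)*)\)\s*Tj")

# Hypothetical rule: an author line is a comma-separated list of names,
# each name being two or more capitalized words.
NAME = r"[A-Z][A-Za-z.\-]*(?: [A-Z][A-Za-z.\-]*)+"
AUTHOR_RULE = re.compile(rf"^{NAME}(?:\s*,\s*{NAME})*$")


def text_runs(stream):
    """Return (font_size, text) pairs in the order they appear."""
    size = 0.0
    runs = []
    for m in TOKEN.finditer(stream):
        if m.group(1) is not None:      # a Tf operator: remember the size
            size = float(m.group(2))
        else:                           # a Tj operator: record the text
            runs.append((size, m.group(3)))
    return runs


def extract_title_and_authors(stream):
    """Guess the title and author line from one page's content stream."""
    runs = text_runs(stream)
    if not runs:
        return None, None
    # Format-based locating (assumption): the title uses the largest font.
    max_size = max(size for size, _ in runs)
    title_idx = [i for i, (size, _) in enumerate(runs) if size == max_size]
    title = " ".join(runs[i][1] for i in title_idx)
    # Rule-based matching (assumption): the first run after the title
    # that looks like a list of names is taken as the author line.
    authors = None
    for _, text in runs[title_idx[-1] + 1:]:
        if AUTHOR_RULE.match(text.strip()):
            authors = text.strip()
            break
    return title, authors


# Hand-written sample stream, not taken from a real paper's PDF.
SAMPLE = (
    "BT /F1 18 Tf 72 720 Td (Automatic Extraction of Semantic Metadata) Tj ET "
    "BT /F2 11 Tf 72 690 Td (Zhang Xiuxiu, Ma Jianxia) Tj ET "
    "BT /F2 10 Tf 72 660 Td (This paper analyzes content streams.) Tj ET"
)

title, authors = extract_title_and_authors(SAMPLE)
```

A practical extractor would also need to decompress streams, decode font encodings, and use text positions, none of which this sketch attempts.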
Zhang Xiuxiu, Ma Jianxia. Automatic Extraction of Semantic Metadata from PDF Research Papers[J]. New Technology of Library and Information Service, 2009, 3(2): 102-106.