Research on PDF Documents Information Extraction System Based on XML
Song Yanjuan1, Zhang Wende2
1 (College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350002, China)
2 (Library of Fuzhou University, Fuzhou 350002, China)
Abstract This article first designs a DTD for scientific and technical articles and then analyzes the structure of PDF documents. On that basis, it describes the design of a PDF information extraction system that uses this DTD as a template to convert a scientific or technical article in PDF format into a valid XML document.
Received: 23 May 2005
Published: 25 September 2005
Corresponding Author:
Zhang Wende
E-mail: zhangwd@fzu.edu.cn
About the authors: Song Yanjuan, Zhang Wende