New Technology of Library and Information Service  2005, Vol. 21 Issue (9): 10-13    DOI: 10.11925/infotech.1003-3513.2005.09.03
Research on PDF Documents Information Extraction System Based on XML
Song Yanjuan1, Zhang Wende2
1(College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350002, China)
2(Library of Fuzhou University, Fuzhou 350002, China)

This article is organized as follows. First, we design a DTD for scientific and technical articles. Second, we analyze the structure of PDF documents. On that basis, we describe the design of a PDF information extraction system that uses this DTD as a template to convert a scientific or technical article in PDF format into a valid XML document.
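As a rough illustration of the final step described in the abstract, the sketch below serializes fields that have already been extracted from a PDF into an XML document that follows a small DTD. The DTD, its element names, and the `fields_to_xml` helper are hypothetical and are not taken from the article; PDF parsing itself is not shown, and the standard-library parser used here checks well-formedness only, not validity against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD for a scientific article; element names are illustrative only
# and are not the DTD designed in the article.
ARTICLE_DTD = """<!DOCTYPE article [
<!ELEMENT article (title, author+, abstract, keyword*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT abstract (#PCDATA)>
<!ELEMENT keyword (#PCDATA)>
]>"""


def fields_to_xml(fields):
    """Emit already-extracted fields in the order the DTD's content model requires."""
    root = ET.Element("article")
    ET.SubElement(root, "title").text = fields["title"]
    for name in fields["authors"]:
        ET.SubElement(root, "author").text = name
    ET.SubElement(root, "abstract").text = fields["abstract"]
    for keyword in fields.get("keywords", []):
        ET.SubElement(root, "keyword").text = keyword
    body = ET.tostring(root, encoding="unicode")
    # ElementTree does not validate; checking validity against the DTD
    # would need a separate validating parser.
    return '<?xml version="1.0"?>\n' + ARTICLE_DTD + "\n" + body


if __name__ == "__main__":
    print(fields_to_xml({
        "title": "Research on PDF Documents Information Extraction System Based on XML",
        "authors": ["Song Yanjuan", "Zhang Wende"],
        "abstract": "Placeholder abstract text.",
        "keywords": ["Information Extraction", "PDF", "XML"],
    }))
```

Keeping the DTD in the internal subset makes the output self-describing, so any validating parser can check it without a separate schema file.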

Key words: Information Extraction; PDF; XML
Received: 23 May 2005      Published: 25 September 2005


Corresponding Authors: Zhang Wende     E-mail: zhangwd @
About authors: Song Yanjuan, Zhang Wende

Cite this article:

Song Yanjuan, Zhang Wende. Research on PDF Documents Information Extraction System Based on XML. New Technology of Library and Information Service, 2005, 21(9): 10-13.


