New Technology of Library and Information Service  2005, Vol. 21 Issue (9): 10-13    DOI: 10.11925/infotech.1003-3513.2005.09.03
Research on PDF Documents Information Extraction System Based on XML
Song Yanjuan1, Zhang Wende2
1(College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350002, China)
2(Library of Fuzhou University, Fuzhou 350002, China)
Abstract  

This article is organized as follows. First, we design a DTD for scientific and technical articles. Second, we analyze the structure of PDF documents. On that basis, we describe the design of a PDF information extraction system that uses this DTD as a template to convert a scientific or technical article in PDF format into a valid XML document.
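The DTD-as-template idea can be illustrated with a minimal Python sketch. It is not the authors' system: the DTD below, the element names, and the `build_article_xml` helper are all hypothetical, and the sketch assumes the text fields have already been extracted from the PDF by some other step.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD for a scientific article (an assumption for illustration;
# the DTD designed in the paper is not reproduced on this page).
ARTICLE_DTD = """<!DOCTYPE article [
<!ELEMENT article (title, author+, abstract, keyword*, section+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT abstract (#PCDATA)>
<!ELEMENT keyword (#PCDATA)>
<!ELEMENT section (#PCDATA)>
]>"""


def build_article_xml(fields: dict) -> str:
    """Map already-extracted text fields onto the element order the DTD requires."""
    root = ET.Element("article")
    ET.SubElement(root, "title").text = fields["title"]
    for name in fields["authors"]:
        ET.SubElement(root, "author").text = name
    ET.SubElement(root, "abstract").text = fields["abstract"]
    for keyword in fields.get("keywords", []):
        ET.SubElement(root, "keyword").text = keyword
    for body in fields["sections"]:
        ET.SubElement(root, "section").text = body
    # Prepend the XML declaration and the internal DTD subset to the serialized tree.
    body_xml = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0"?>\n' + ARTICLE_DTD + "\n" + body_xml
```

The standard-library parser does not validate against a DTD, so this sketch only emits elements in the order the DTD declares. Checking that the result is actually valid would need a validating parser, for example the `etree.DTD` class in the third-party lxml package.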

Key words: Information Extraction; PDF; XML
Received: 23 May 2005      Published: 25 September 2005
CLC number: TP392

Corresponding author: Zhang Wende     E-mail: zhangwd@fzu.edu.cn
About the authors: Song Yanjuan, Zhang Wende

Cite this article:

Song Yanjuan, Zhang Wende. Research on PDF Documents Information Extraction System Based on XML. New Technology of Library and Information Service, 2005, 21(9): 10-13.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2005.09.03     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2005/V21/I9/10

