Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (8): 34-44    DOI: 10.11925/infotech.2096-3467.2021.0164
Extracting PDF Tables Based on Word Vectors
Zhang Jiandong1, Chen Shiji2, Xu Xiaoting1, Zuo Wenge1
1China Agricultural University Library, Beijing 100193, China
2Chinese Academy of Science and Education Evaluation (CASEE), Hangzhou Dianzi University, Hangzhou 310018, China
Abstract

[Objective] This paper aims to reduce the manual annotation required to extract tables with complex headers from PDF documents. [Methods] First, we identified the table cell structure from line segments and represented the cell contents with word vectors. Then, we calculated the word vector similarity of the table content in each row. Finally, we separated the table headers from the contents. [Results] We evaluated our method on a self-built PDF table data set. The F1 value for table information extraction was 98.07%, and the F1 value for table content division exceeded 99%. These results are close to those of deep learning text classification models, which require a large annotated corpus. [Limitations] Our method can only extract relational tables and cannot be applied to scanned PDF documents. [Conclusions] The proposed method can automatically extract PDF tables with complex headers.
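
The header/content separation step described under [Methods] can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each row is represented by the mean of its cell vectors and that the header ends at the pair of adjacent rows with the lowest cosine similarity; both choices are assumptions made for illustration. The function `toy_embed` is a hypothetical stand-in for a trained word vector model, and `row_vector` and `split_header` are likewise illustrative names.

```python
import math


def toy_embed(text):
    """Hypothetical stand-in for a word-vector model (assumption):
    maps cell text to [digit ratio, letter ratio, other ratio]."""
    n = max(len(text), 1)
    digits = sum(c.isdigit() for c in text) / n
    letters = sum(c.isalpha() for c in text) / n
    return [digits, letters, 1.0 - digits - letters]


def row_vector(cells, embed=toy_embed):
    """Represent a table row as the mean of its cell vectors."""
    vecs = [embed(cell) for cell in cells]
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def split_header(rows, embed=toy_embed):
    """Return (index of the first content row, adjacent-row similarities).

    Assumed rule: the header/content boundary lies between the two
    adjacent rows whose vectors are least similar.
    """
    vecs = [row_vector(r, embed) for r in rows]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    boundary = min(range(len(sims)), key=sims.__getitem__) + 1
    return boundary, sims


# Made-up example table with a two-row header.
rows = [
    ["Region", "Yield", "Yield"],
    ["Name", "Spring", "Autumn"],
    ["North", "512", "530"],
    ["South", "498", "505"],
]
boundary, sims = split_header(rows)
header, content = rows[:boundary], rows[boundary:]
print(len(header), len(content))
```

In practice, `toy_embed` would be replaced by vectors from a trained embedding model, and the boundary rule might need a similarity threshold rather than a simple minimum.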

Key words: Table Extraction; PDF; Word Vector
Received: 20 February 2021      Published: 15 September 2021
 ZTFLH: G350
Fund: National Social Science Fund of China (19ZDA348)
Corresponding Authors: Zuo Wenge     ORCID: 0000-0002-9685-0629     E-mail: zuowg@cau.edu.cn