Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (8): 34-44    DOI: 10.11925/infotech.2096-3467.2021.0164
Extracting PDF Tables Based on Word Vectors
Zhang Jiandong1, Chen Shiji2, Xu Xiaoting1, Zuo Wenge1
1China Agricultural University Library, Beijing 100193, China
2Chinese Academy of Science and Education Evaluation (CASEE), Hangzhou Dianzi University, Hangzhou 310018, China
[Objective] This paper aims to reduce the manual annotation required to extract tables with complex headers from PDF documents. [Methods] First, we identified the table cell structure from line segments and represented the cell contents with word vectors. Then, we calculated the word vector similarity of the table content in each row. Finally, we separated the table headers from the table contents. [Results] We evaluated our method on a self-built PDF table data set. The F1 value for table information extraction was 98.07%, and the F1 value for table content division exceeded 99%. These results are close to those of a deep learning text classification model that requires a large annotated corpus. [Limitations] Our method can only extract relational tables and cannot be applied to scanned PDF documents. [Conclusions] The proposed method can automatically extract PDF tables with complex headers.
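
The header/content separation summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each row is represented by the mean of its cell vectors and that the boundary lies at the adjacent row pair with the lowest cosine similarity. The function names (`cosine`, `row_vector`, `split_header`) and the two-dimensional toy vectors are hypothetical; a real pipeline would use pretrained word vectors such as those compared in the results.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def row_vector(cell_vectors):
    """Represent a row by the element-wise mean of its cell vectors (an assumption)."""
    n = len(cell_vectors)
    return [sum(column) / n for column in zip(*cell_vectors)]


def split_header(rows):
    """Return the index of the first content row.

    Assumption: the header/content boundary is the adjacent row pair
    with the lowest cosine similarity.
    """
    if len(rows) < 2:
        raise ValueError("need at least two rows to find a boundary")
    vectors = [row_vector(row) for row in rows]
    sims = [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    boundary = min(range(len(sims)), key=sims.__getitem__)
    return boundary + 1


# Hypothetical toy data: two header rows followed by two content rows.
toy_rows = [
    [[1.0, 0.1], [0.9, 0.2]],  # header row 1
    [[0.8, 0.1], [1.0, 0.0]],  # header row 2
    [[0.1, 1.0], [0.0, 0.9]],  # content row 1
    [[0.2, 0.8], [0.1, 1.0]],  # content row 2
]
first_content = split_header(toy_rows)
print(first_content)
```

With these toy vectors the two header rows are similar to each other, as are the two content rows, so the lowest similarity falls between the second and third rows.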

Key words: Table Extraction; PDF; Word Vector
Received: 20 February 2021      Published: 15 September 2021
ZTFLH:  G350  
Fund:National Social Science Fund of China(19ZDA348)
Corresponding Authors: Zuo Wenge ORCID:0000-0002-9685-0629     E-mail:

Cite this article:

Zhang Jiandong, Chen Shiji, Xu Xiaoting, Zuo Wenge. Extracting PDF Tables Based on Word Vectors. Data Analysis and Knowledge Discovery, 2021, 5(8): 34-44.

PDF Table Extraction Process Based on Word Vector
Sample Diagram of Complex Header Table
Flow Chart of Content Division and Processing
Example of Processed Table Header
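Document source                         Number of PDFs   Number of tables
Department of Social Sciences           257              7,085
Department of Higher Education          17               7,388
Department of Science and Technology    15               193
Total                                   289              14,666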
PDF Document Sources and Quantities
Table type                     Total tables   Accuracy (our method)   Accuracy (Pdfplumber)
All tables                     14,666         98.13%                  97.77%
Ordinary tables                11,733         99.02%                  99.14%
Tables with missing headers    2,279          95.35%                  95.70%
Tables with complex headers    677            89.22%                  80.06%
Table Detection and Element Identification Result
Example of Non-relational Tables
Example of Missing Line Segment Table
Example of Sparse Table
Table type                     Word vector type   Number correct   R        P        F1
All tables                     BERT               14,328           97.70%   97.70%   97.70%
All tables                     Mixed-large        14,378           98.04%   98.04%   98.04%
All tables                     Baidubaike         14,383           98.07%   98.07%   98.07%
Ordinary tables                BERT               11,569           98.60%   98.70%   98.65%
Ordinary tables                Mixed-large        11,607           98.93%   98.99%   98.96%
Ordinary tables                Baidubaike         11,610           98.95%   98.99%   98.97%
Tables with missing headers    BERT               2,167            95.09%   92.69%   93.87%
Tables with missing headers    Mixed-large        2,171            95.26%   93.14%   94.19%
Tables with missing headers    Baidubaike         2,173            95.35%   93.22%   94.27%
Tables with complex headers    BERT               594              87.74%   85.96%   86.84%
Tables with complex headers    Mixed-large        602              88.92%   88.27%   88.59%
Tables with complex headers    Baidubaike         602              88.92%   88.79%   88.86%
Table Extraction Result Statistics
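
As a reading aid for the R, P and F1 columns, the following sketch recomputes F1 as the harmonic mean of precision and recall, which is the standard definition; the source does not state its formula, so this is an assumption. The function name `f1_score` is hypothetical. Because the reported P and R are already rounded, recomputed values for some rows may differ from the reported F1 by about 0.01 percentage points.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values for tables with missing headers, BERT word vectors.
p, r = 0.9269, 0.9509
f1_percent = round(f1_score(p, r) * 100, 2)
print(f1_percent)  # 93.87, matching the reported F1 for that row
```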
Data Set Example
Content Division Flow Chart
Algorithm     Class     P        R        F1
Transformer   Header    99.97%   99.99%   99.98%
Transformer   Content   99.68%   99.37%   99.52%
Our method    Header    99.80%   99.17%   99.48%
Our method    Content   99.96%   99.99%   99.97%
Comparison of Table Content Division Algorithms
