Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (8): 34-44    DOI: 10.11925/infotech.2096-3467.2021.0164
Extracting PDF Tables Based on Word Vectors
Zhang Jiandong1, Chen Shiji2, Xu Xiaoting1, Zuo Wenge1
1China Agricultural University Library, Beijing 100193, China
2Chinese Academy of Science and Education Evaluation (CASEE), Hangzhou Dianzi University, Hangzhou 310018, China
Abstract  

[Objective] This paper aims to reduce the manual annotation needed to extract tables with complicated headers from PDF documents. [Methods] First, we identified the table cell structure from line segments and represented the cell contents with word vectors. Then, we calculated the word vector similarity of the table content in each row. Finally, we separated the table headers from the contents. [Results] We evaluated our method on a self-built PDF table data set. The F1 value for table information extraction was 98.07%, and the F1 value for table content division exceeded 99%. These results are close to those of a deep learning text classification model that requires a large annotated corpus. [Limitations] Our method can only extract relational tables and cannot be applied to scanned PDF documents. [Conclusions] The proposed method can automatically extract PDF tables with complicated headers.
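
The abstract does not spell out how row similarity leads to the header/content split, so the following is only a minimal sketch under stated assumptions: each row is represented by the average of its cell vectors, and the boundary is placed where the cosine similarity between adjacent rows is lowest. The toy three-dimensional vectors, the `TOY_VECTORS` lookup, and the function names are hypothetical and stand in for pretrained word vectors; this is not the authors' implementation.

```python
import math

# Hypothetical toy embeddings; a real system would load pretrained word vectors.
TOY_VECTORS = {
    "name":   [0.90, 0.10, 0.00],
    "unit":   [0.80, 0.20, 0.10],
    "amount": [0.85, 0.15, 0.05],
    "alice":  [0.10, 0.90, 0.20],
    "bob":    [0.15, 0.85, 0.25],
    "lab":    [0.20, 0.80, 0.30],
    "office": [0.10, 0.80, 0.35],
    "12":     [0.05, 0.70, 0.60],
    "30":     [0.10, 0.75, 0.55],
}


def row_vector(cells):
    """Average the vectors of a row's cells (an assumed row representation)."""
    vecs = [TOY_VECTORS[cell] for cell in cells]
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def split_header(rows):
    """Return the index of the first content row.

    Illustrative heuristic: the header/content boundary is placed between
    the two adjacent rows whose vectors are least similar.
    """
    vecs = [row_vector(row) for row in rows]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    return sims.index(min(sims)) + 1


rows = [
    ["name", "unit", "amount"],   # header-like row
    ["alice", "lab", "12"],       # content rows
    ["bob", "office", "30"],
]
boundary = split_header(rows)
header, content = rows[:boundary], rows[boundary:]
```

In this toy example the two content rows are nearly parallel in vector space while the header row points elsewhere, so the lowest adjacent similarity falls between rows 0 and 1. A multi-row complicated header would need a more careful rule than a single minimum.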

Key words: Table Extraction; PDF; Word Vector
Received: 20 February 2021      Published: 15 September 2021
ZTFLH:  G350  
Fund: National Social Science Fund of China (19ZDA348)
Corresponding Author: Zuo Wenge, ORCID: 0000-0002-9685-0629, E-mail: zuowg@cau.edu.cn

Cite this article:

Zhang Jiandong, Chen Shiji, Xu Xiaoting, Zuo Wenge. Extracting PDF Tables Based on Word Vectors. Data Analysis and Knowledge Discovery, 2021, 5(8): 34-44.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0164     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I8/34

PDF Table Extraction Process Based on Word Vector
Sample Diagram of Complex Header Table
Flow Chart of Content Division and Processing
Example of Processed Table Header
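Document source | Number of PDFs | Number of tables
Department of Social Sciences | 257 | 7,085
Department of Higher Education | 17 | 7,388
Department of Science and Technology | 15 | 193
Total | 289 | 14,666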
PDF Document Sources and Quantities
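Table type | Total tables | Accuracy (proposed method) | Accuracy (Pdfplumber)
All tables | 14,666 | 98.13% | 97.77%
Ordinary tables | 11,733 | 99.02% | 99.14%
Tables with missing headers | 2,279 | 95.35% | 95.70%
Tables with complex headers | 677 | 89.22% | 80.06%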
Table Detection and Element Identification Result
Example of Non-relational Tables
Example of Missing Line Segment Table
Example of Sparse Table
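Table type | Word vector type | Correct | R | P | F1
All tables | BERT | 14,328 | 97.70% | 97.70% | 97.70%
All tables | Mixed-large | 14,378 | 98.04% | 98.04% | 98.04%
All tables | Baidubaike | 14,383 | 98.07% | 98.07% | 98.07%
Ordinary tables | BERT | 11,569 | 98.60% | 98.70% | 98.65%
Ordinary tables | Mixed-large | 11,607 | 98.93% | 98.99% | 98.96%
Ordinary tables | Baidubaike | 11,610 | 98.95% | 98.99% | 98.97%
Tables with missing headers | BERT | 2,167 | 95.09% | 92.69% | 93.87%
Tables with missing headers | Mixed-large | 2,171 | 95.26% | 93.14% | 94.19%
Tables with missing headers | Baidubaike | 2,173 | 95.35% | 93.22% | 94.27%
Tables with complex headers | BERT | 594 | 87.74% | 85.96% | 86.84%
Tables with complex headers | Mixed-large | 602 | 88.92% | 88.27% | 88.59%
Tables with complex headers | Baidubaike | 602 | 88.92% | 88.79% | 88.86%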
Table Extraction Result Statistics
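
The page does not define its metrics, so as a reading aid the sketch below assumes the standard definition of F1 as the harmonic mean of precision (P) and recall (R). The function name `f1_score` is illustrative. The example recomputes the BERT row for tables with missing headers from its reported P and R.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported P and R for BERT vectors on tables with missing headers.
f1 = f1_score(0.9269, 0.9509)
print(f"{f1:.2%}")  # prints 93.87%
```

Because the reported P and R are themselves rounded, recomputing other rows this way can differ from the published F1 in the last digit.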
Data Set Example
Content Division Flow Chart
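Algorithm | Class | P | R | F1
Transformer | Header | 99.97% | 99.99% | 99.98%
Transformer | Content | 99.68% | 99.37% | 99.52%
Proposed method | Header | 99.80% | 99.17% | 99.48%
Proposed method | Content | 99.96% | 99.99% | 99.97%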
Comparison of Table Content Division Algorithms