Extracting PDF Tables Based on Word Vectors
Zhang Jiandong1, Chen Shiji2, Xu Xiaoting1, Zuo Wenge1
1 China Agricultural University Library, Beijing 100193, China
2 Chinese Academy of Science and Education Evaluation (CASEE), Hangzhou Dianzi University, Hangzhou 310018, China
Abstract [Objective] This paper aims to reduce the manual annotation required to extract tables with complicated headers from PDF documents. [Methods] First, we identified the cell structure of each table from its line segments and represented the cell contents with word vectors. Then, we calculated the word vector similarity of the table content in each row. Finally, we separated the table headers from the table contents. [Results] We evaluated our method on a self-built PDF table data set. The F1 value for table information extraction was 98.07%, and the F1 value for table content division exceeded 99%. These results are close to those of deep learning text classification models, which require a large annotated corpus. [Limitations] Our method can only extract relational tables and cannot be applied to scanned PDF documents. [Conclusions] The proposed method can automatically extract PDF tables with complicated headers.
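The header/content separation step can be illustrated with a minimal sketch. The abstract does not state the exact decision rule, so the code below makes several assumptions of its own: the embeddings are hypothetical three-dimensional toy vectors rather than trained word vectors, each row is represented by the average of its cell vectors, and the header boundary is placed at the pair of adjacent rows with the lowest cosine similarity. All names (TOY_VECTORS, token_of, row_vector, split_header, SAMPLE_ROWS) are illustrative and do not come from the paper.

```python
import math

# Hypothetical toy embeddings (assumption): a real system would load
# word vectors trained on a large corpus instead of this tiny table.
TOY_VECTORS = {
    "year": (0.9, 0.1, 0.0),
    "region": (0.8, 0.2, 0.1),
    "output": (0.85, 0.15, 0.05),
    "<num>": (0.0, 0.1, 0.9),    # placeholder token for numeric cells
    "<name>": (0.1, 0.9, 0.3),   # placeholder token for unknown words
}


def token_of(cell):
    """Map a cell string to a vocabulary token (assumed normalisation)."""
    text = cell.strip().lower()
    if text.replace(".", "", 1).isdigit():
        return "<num>"
    return text if text in TOY_VECTORS else "<name>"


def row_vector(row):
    """Represent a row as the average of its cell vectors."""
    vectors = [TOY_VECTORS[token_of(cell)] for cell in row]
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(3))


def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_header(rows):
    """Return the index of the first content row.

    Assumed rule: the boundary lies between the two adjacent rows
    whose row vectors are least similar.
    """
    vectors = [row_vector(row) for row in rows]
    sims = [cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    boundary = min(range(len(sims)), key=sims.__getitem__)
    return boundary + 1


# A small table with a two-row header followed by two content rows.
SAMPLE_ROWS = [
    ["Region", "Output", "Output"],
    ["Region", "Year", "Year"],
    ["Beijing", "12.5", "13.1"],
    ["Hangzhou", "9.8", "10.4"],
]

first_content_row = split_header(SAMPLE_ROWS)
header_rows = SAMPLE_ROWS[:first_content_row]
content_rows = SAMPLE_ROWS[first_content_row:]
```

In this sketch the two header rows have similar average vectors, as do the two content rows, so the lowest adjacent similarity falls between the second and third rows. The rule assumes a table has at least two rows and exactly one header/content boundary.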
Received: 20 February 2021
Published: 15 September 2021
Fund: National Social Science Fund of China (19ZDA348)
Corresponding Authors:
Zuo Wenge ORCID:0000-0002-9685-0629
E-mail: zuowg@cau.edu.cn