1 China Agricultural University Library, Beijing 100193, China; 2 Chinese Academy of Science and Education Evaluation (CASEE), Hangzhou Dianzi University, Hangzhou 310018, China
[Objective] This paper aims to reduce the manual annotation required to extract tables with complicated headers from PDF documents. [Methods] First, we identified the table cell structure from line segments and represented the cell contents with word vectors. Then, we calculated the word-vector similarity of the table content in each row. Finally, we separated the table headers from the table contents. [Results] We evaluated the method on a self-built PDF table data set. The F1 value for table information extraction was 98.07%, and the F1 value for table content division exceeded 99%. These results are close to those of deep learning text classification models, which require a large annotated corpus. [Limitations] The method can only extract relational tables and cannot be applied to scanned PDF documents. [Conclusions] The proposed method can automatically extract PDF tables with complicated headers.
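The header/content separation step described in [Methods] can be illustrated with a minimal sketch. The abstract does not state which word-vector model or which decision rule the authors used, so everything below is an assumption made for illustration: `cell_vector` is a toy character-class vector standing in for a trained word vector, and `split_header` places the boundary at the pair of adjacent rows with the lowest cosine similarity. All function names are hypothetical.

```python
import math


def cell_vector(text):
    """Toy stand-in for a word vector (assumption, not the paper's model):
    the share of digits, letters, and other characters in the cell."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return [0.0, 0.0, 0.0]
    n = len(chars)
    digits = sum(c.isdigit() for c in chars)
    letters = sum(c.isalpha() for c in chars)
    return [digits / n, letters / n, (n - digits - letters) / n]


def row_vector(cells):
    """Average the cell vectors of one table row."""
    vecs = [cell_vector(c) for c in cells]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(3)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def split_header(rows):
    """Return the index of the first content row.

    Assumed rule: the header/content boundary lies between the two
    adjacent rows whose vectors are least similar.
    """
    if len(rows) < 2:
        return 0  # too few rows to detect a boundary
    vecs = [row_vector(r) for r in rows]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    return min(range(len(sims)), key=sims.__getitem__) + 1


# A small table with a two-row header, already split into cells.
table = [
    ["Region", "Grain output", "Grain output"],
    ["", "Wheat", "Rice"],
    ["North", "120.5", "98.3"],
    ["South", "87.1", "143.0"],
]

first_content_row = split_header(table)
print("header rows:", table[:first_content_row])
print("content rows:", table[first_content_row:])
```

In this toy setting the two header rows contain only letters and the content rows are dominated by digits, so the similarity drops between them. A real implementation would replace `cell_vector` with vectors from a trained embedding model and would need a rule that also handles headers containing numbers, such as years.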