Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (8): 34-44     https://doi.org/10.11925/infotech.2096-3467.2021.0164
Research Paper
Extracting PDF Tables Based on Word Vectors*
Zhang Jiandong1, Chen Shiji2, Xu Xiaoting1, Zuo Wenge1
1 China Agricultural University Library, Beijing 100193, China
2 Chinese Academy of Science and Education Evaluation (CASEE), Hangzhou Dianzi University, Hangzhou 310018, China
Abstract

[Objective] This paper aims to reduce the heavy manual annotation needed to extract tables with complex headers from PDF documents. [Methods] We first used ruling-line information to detect tables and reconstruct their cell structure. We then represented the cell text with word vectors and calculated the cosine similarity between the contents of table rows. Finally, we used this similarity to identify the row that separates the table header from the table content. [Results] We tested the method on a self-built PDF table dataset. The F1 value for table information extraction was 98.07%, and the F1 value for table content division exceeded 99%, which is close to the performance of a deep learning text classification model that requires a large annotated corpus. [Limitations] The method can only extract relational tables and cannot be applied to scanned PDF documents. [Conclusions] The proposed method can, to some extent, automate the extraction of PDF tables with complex headers.

Key words: Table Extraction    PDF    Word Vector
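The header/content split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy three-dimensional vectors, the mean pooling of cell vectors into a row vector, and the rule "split at the adjacent row pair with the lowest cosine similarity" are all assumptions made for illustration, and the names `TOY_VECTORS`, `row_vector` and `find_header_boundary` are hypothetical.

```python
import math

# Toy 3-dimensional vectors standing in for pretrained word embeddings (assumption).
TOY_VECTORS = {
    "name":   [0.9, 0.1, 0.0],
    "unit":   [0.8, 0.2, 0.1],
    "amount": [0.7, 0.3, 0.0],
    "alice":  [0.1, 0.9, 0.2],
    "bob":    [0.2, 0.8, 0.3],
    "lab":    [0.1, 0.7, 0.4],
    "<num>":  [0.0, 0.3, 0.9],
}


def row_vector(row):
    """Average the vectors of all known cells in a row (one simple pooling choice)."""
    vectors = []
    for cell in row:
        # Map plain numbers to a shared placeholder token; lowercase everything else.
        token = "<num>" if cell.replace(".", "", 1).isdigit() else cell.lower()
        if token in TOY_VECTORS:
            vectors.append(TOY_VECTORS[token])
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_header_boundary(rows):
    """Return the index of the first content row.

    Assumption: the boundary is the adjacent row pair with the lowest similarity.
    With fewer than two rows there is nothing to compare, so 0 is returned.
    """
    if len(rows) < 2:
        return 0
    vecs = [row_vector(r) for r in rows]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]
    return sims.index(min(sims)) + 1


rows = [
    ["Name", "Amount", "Amount"],  # first header row (merged cell repeated)
    ["Name", "Unit", "Amount"],    # second header row
    ["Alice", "Lab", "120"],
    ["Bob", "Lab", "95"],
]
split = find_header_boundary(rows)
header_rows, content_rows = rows[:split], rows[split:]
```

In this toy table the two header rows are similar to each other, as are the two content rows, so the lowest adjacent similarity falls between row 1 and row 2 and the split index is 2. A real setup would load pretrained embeddings and might compare rows in a different way.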
Received: 2021-02-20      Published: 2021-09-15
CLC number (ZTFLH):  G350
Funding: *Major Program of the National Social Science Fund of China (19ZDA348)
Corresponding author: Zuo Wenge ORCID: 0000-0002-9685-0629     E-mail: zuowg@cau.edu.cn
Cite this article:
Zhang Jiandong, Chen Shiji, Xu Xiaoting, Zuo Wenge. Extracting PDF Tables Based on Word Vectors[J]. Data Analysis and Knowledge Discovery, 2021, 5(8): 34-44.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0164      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I8/34
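Fig.1  Workflow of PDF table extraction based on word vectors
Fig.2  Example of a table with a complex header
Fig.3  Content division and processing workflow
Fig.4  Example of a converted header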
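Document source | Number of PDFs | Number of tables
Department of Social Sciences | 257 | 7 085
Department of Higher Education | 17 | 7 388
Department of Science and Technology | 15 | 193
Total | 289 | 14 666
Table 1  Sources and numbers of PDF documents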
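Table type | Total tables | Accuracy (proposed method) | Accuracy (Pdfplumber)
All tables | 14 666 | 98.13% | 97.77%
Regular tables | 11 733 | 99.02% | 99.14%
Tables with missing headers | 2 279 | 95.35% | 95.70%
Tables with complex headers | 677 | 89.22% | 80.06%
Table 2  Results of table detection and element recognition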
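For readers who want to try the Pdfplumber baseline from Table 2, the following sketch shows one way to pull ruled tables out of a text-based PDF with the pdfplumber library. The settings used in the paper are not given on this page, so the `"lines"` strategies below are an assumption, and the helper names `clean_table` and `extract_ruled_tables` and the file name `report.pdf` are hypothetical.

```python
def clean_table(rows):
    """Replace empty cells (None) with '' and strip surrounding whitespace."""
    return [[(cell or "").strip() for cell in row] for row in rows]


def extract_ruled_tables(pdf_path):
    """Return every table that pdfplumber detects from ruling lines in a PDF."""
    # Third-party dependency, imported lazily so clean_table works without it.
    import pdfplumber

    # Assumed settings: use drawn lines for both column and row boundaries.
    settings = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables(table_settings=settings):
                tables.append(clean_table(table))
    return tables


# Example call (requires pdfplumber and a real, non-scanned PDF):
# tables = extract_ruled_tables("report.pdf")
```

Each returned table is a list of rows and each row is a list of cell strings, which is the input shape that a later header/content split would need.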
Fig.5  Example of a non-relational table
Fig.6  Example of a table with missing line segments
Fig.7  Example of a sparse table
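Table type | Word vector type | Correct | R | P | F1
All tables | BERT | 14 328 | 97.70% | 97.70% | 97.70%
All tables | Mixed-large | 14 378 | 98.04% | 98.04% | 98.04%
All tables | Baidubaike | 14 383 | 98.07% | 98.07% | 98.07%
Regular tables | BERT | 11 569 | 98.60% | 98.70% | 98.65%
Regular tables | Mixed-large | 11 607 | 98.93% | 98.99% | 98.96%
Regular tables | Baidubaike | 11 610 | 98.95% | 98.99% | 98.97%
Tables with missing headers | BERT | 2 167 | 95.09% | 92.69% | 93.87%
Tables with missing headers | Mixed-large | 2 171 | 95.26% | 93.14% | 94.19%
Tables with missing headers | Baidubaike | 2 173 | 95.35% | 93.22% | 94.27%
Tables with complex headers | BERT | 594 | 87.74% | 85.96% | 86.84%
Tables with complex headers | Mixed-large | 602 | 88.92% | 88.27% | 88.59%
Tables with complex headers | Baidubaike | 602 | 88.92% | 88.79% | 88.86%
Table 3  Table extraction results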
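The relationship between the columns of Table 3 can be checked with a short calculation. The sketch assumes that F1 is the usual harmonic mean of precision and recall, and that recall equals the number of correct tables divided by the total for that table type in Table 2. The second assumption is inferred from the reported numbers and is not stated on this page.

```python
def recall(correct, total):
    """Share of all tables of a given type that were extracted correctly."""
    return correct / total


def f1_score(precision, recall_value):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall_value == 0:
        return 0.0
    return 2 * precision * recall_value / (precision + recall_value)


# Tables with missing headers, Baidubaike vectors: 2 173 correct out of 2 279.
r = recall(2173, 2279)
f1 = f1_score(0.9322, 0.9535)
print(f"R = {r:.2%}, F1 = {f1:.2%}")  # prints "R = 95.35%, F1 = 94.27%"
```

Both values match the corresponding row of Table 3. Rows whose F1 sits close to a rounding boundary may differ in the last digit when recomputed from the rounded P and R.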
Fig.8  Example from the dataset
Fig.9  Content division workflow
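Algorithm | Class | P | R | F1
Transformer | Header | 99.97% | 99.99% | 99.98%
Transformer | Content | 99.68% | 99.37% | 99.52%
Proposed method | Header | 99.80% | 99.17% | 99.48%
Proposed method | Content | 99.96% | 99.99% | 99.97%
Table 4  Comparison of table content division algorithms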