Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (1): 140-149     https://doi.org/10.11925/infotech.2096-3467.2020.0630
Research Paper
Locating Academic Literature Figures and Tables with Geometric Object Clustering
Yu Fengchang, Cheng Qikai, Lu Wei
School of Information Management, Wuhan University, Wuhan 430072, China
Abstract

[Objective] This paper addresses the low recall of figure and table localization in academic literature. [Methods] We extract geometric objects from the PDF files of academic papers and obtain prior information on the extent of figures and tables from two perspectives: analysis of the underlying PDF encoding and image understanding. We then merge the geometric objects with K-means clustering and reconstruct the text content of figures and tables with a heuristic algorithm, which together determine where figures and tables are located in the document. [Results] On the experimental dataset, the proposed algorithm achieved a precision of 0.915 and a recall of 0.918. Its precision is close to that of the current state-of-the-art algorithm, while its recall is higher by 0.193, a relative improvement of 26.6%. [Limitations] Complex layouts and non-standard use of document symbols introduce some errors. The determination of the clustering K value and the algorithm for filtering interfering text leave room for improvement. [Conclusions] The algorithm does not depend on a specific layout style, makes full use of the visual and encoding characteristics of PDF academic literature, and effectively improves the recall of figure and table localization.

Key words: Academic Literature; Figures/Tables Localization; Clustering
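The merging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each geometric object has already been reduced to an axis-aligned bounding box, runs plain Lloyd-style K-means on the box centres, and uses hypothetical seed points as a stand-in for the prior information on figure/table extent. All names (`Box`, `kmeans_labels`, `merge_clusters`) and all coordinates are illustrative assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed page coordinates
Point = Tuple[float, float]


def center(box: Box) -> Point:
    """Return the centre point of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def kmeans_labels(boxes: List[Box], seeds: List[Point], max_iter: int = 50) -> List[int]:
    """Cluster box centres with Lloyd-style K-means; K is len(seeds)."""
    points = [center(b) for b in boxes]
    centroids = list(seeds)
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: (p[0] - centroids[k][0]) ** 2 + (p[1] - centroids[k][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels


def merge_clusters(boxes: List[Box], labels: List[int], k: int) -> List[Box]:
    """Merge the boxes of each cluster into one enclosing bounding box."""
    merged = []
    for c in range(k):
        members = [b for b, lab in zip(boxes, labels) if lab == c]
        if members:
            merged.append((min(b[0] for b in members), min(b[1] for b in members),
                           max(b[2] for b in members), max(b[3] for b in members)))
    return merged


# Illustrative data: three objects near one candidate region, two near another.
boxes = [(50.0, 100.0, 150.0, 120.0), (60.0, 130.0, 140.0, 200.0), (55.0, 210.0, 150.0, 230.0),
         (300.0, 400.0, 450.0, 420.0), (310.0, 430.0, 440.0, 500.0)]
seeds = [(100.0, 150.0), (375.0, 450.0)]  # hypothetical prior positions, one per region

labels = kmeans_labels(boxes, seeds)
regions = merge_clusters(boxes, labels, len(seeds))
print(regions)
```

In this sketch the number of seeds fixes K, which mirrors the limitation noted in the abstract: a wrong K splits one figure into several regions or fuses neighbouring ones.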
Received: 2020-07-01      Published: 2020-10-29
Chinese Library Classification (ZTFLH):  TP393
Corresponding author: Lu Wei     E-mail: weilu@whu.edu.cn
Cite this article:
Yu Fengchang, Cheng Qikai, Lu Wei. Locating Academic Literature Figures and Tables with Geometric Object Clustering[J]. Data Analysis and Knowledge Discovery, 2021, 5(1): 140-149.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0630      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I1/140
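Fig.1  Flowchart of the proposed method
Fig.2  Illustration of incorporating document prior knowledge
Fig.3  Clustering the objects in Fig.2(e) with K-means
Fig.4  Locating text blocks in the edge regions of figures and tables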
Algorithm             Precision   Recall   F1
PDFFigures 2.0        0.950       0.725    0.822
Proposed algorithm    0.915       0.918    0.916
Table 1  Comparison of algorithm performance
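The F1 column and the recall improvement quoted in the abstract can be rechecked from the precision and recall values in Table 1. The short sketch below assumes the standard harmonic-mean definition of F1; the function name `f1_score` and the variable names are illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in Table 1.
baseline_p, baseline_r = 0.950, 0.725   # PDFFigures 2.0
proposed_p, proposed_r = 0.915, 0.918   # proposed algorithm

baseline_f1 = f1_score(baseline_p, baseline_r)
proposed_f1 = f1_score(proposed_p, proposed_r)

# Absolute and relative recall gain over the baseline.
recall_gain = proposed_r - baseline_r
relative_gain_pct = recall_gain / baseline_r * 100

print(round(baseline_f1, 3), round(proposed_f1, 3))       # 0.822 0.916
print(round(recall_gain, 3), round(relative_gain_pct, 1))  # 0.193 26.6
```

The recomputed values agree with the table and with the 26.6% relative improvement stated in the abstract, which is measured against the baseline recall of 0.725.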
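Fig.5  Interference of non-figure/table geometric objects with localization results
Fig.6  Error caused by hyphens that are not encoded as text symbols
Fig.7  Incorrect K value in K-means clustering
Fig.8  Text filtering error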