Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (1): 140-149    DOI: 10.11925/infotech.2096-3467.2020.0630
Locating Academic Literature Figures and Tables with Geometric Object Clustering
Yu Fengchang, Cheng Qikai, Lu Wei
School of Information Management, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This paper aims to improve the recall of figure and table localization in academic literature. [Methods] First, we extracted geometric objects from the PDF files of the literature. Then, we obtained prior information on the extents of figures/tables through analysis of the underlying PDF encoding and through image understanding. Third, we merged the geometric objects using K-means clustering. Finally, we reconstructed the text content with a heuristic algorithm to determine the locations of figures/tables. [Results] On the experimental dataset, the proposed algorithm achieved a precision of 0.915 and a recall of 0.918. Its precision is close to that of state-of-the-art algorithms, and its recall is 0.193 higher (a 26.6% relative improvement over existing ones). [Limitations] Documents with complex layouts or irregular use of symbols cause errors. The choice of the clustering k value and the text-filtering algorithm could be improved. [Conclusions] The proposed algorithm effectively increases the recall of figure and table localization in academic literature.
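The merging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each geometric object is reduced to an axis-aligned bounding box, clusters the box centers with plain Lloyd's K-means, and takes the union box of each cluster as a candidate figure/table region. The names `Box`, `kmeans_merge`, and `seeds` are hypothetical, and the caller-supplied seeds merely stand in for the prior information on figure/table extents; the abstract does not say how k or the initial centroids are chosen.

```python
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), hypothetical layout
Point = Tuple[float, float]


def center(box: Box) -> Point:
    """Return the center point of a bounding box."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)


def kmeans_merge(boxes: List[Box], seeds: List[Point], iters: int = 20) -> List[Box]:
    """Cluster box centers with Lloyd's K-means; return one union box per cluster.

    `seeds` are the initial centroids (an assumption standing in for prior
    information about where figures/tables are expected).
    """
    points = [center(b) for b in boxes]
    centroids = list(seeds)
    labels: Optional[List[int]] = None
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(
                range(len(centroids)),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                + (p[1] - centroids[j][1]) ** 2,
            )
            for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    merged: List[Box] = []
    for j in range(len(centroids)):
        members = [b for b, lab in zip(boxes, labels or []) if lab == j]
        if members:  # skip empty clusters
            merged.append((
                min(b[0] for b in members), min(b[1] for b in members),
                max(b[2] for b in members), max(b[3] for b in members),
            ))
    return merged


# Two loose groups of objects on a page (made-up coordinates).
boxes = [(10, 10, 20, 20), (22, 12, 40, 30), (15, 25, 35, 40),
         (200, 300, 220, 320), (225, 305, 260, 340)]
regions = kmeans_merge(boxes, seeds=[(0, 0), (300, 400)])
print(regions)
```

With these made-up inputs, the first three boxes merge into one region and the last two into another. A real pipeline would still need the text reconstruction step the abstract mentions before reporting final locations.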

Key words: Academic Literature; Figures/Tables Localization; Clustering
Received: 01 July 2020      Published: 29 October 2020
ZTFLH:  TP393  
Corresponding Author: Lu Wei     E-mail: weilu@whu.edu.cn

Cite this article:

Yu Fengchang, Cheng Qikai, Lu Wei. Locating Academic Literature Figures and Tables with Geometric Object Clustering. Data Analysis and Knowledge Discovery, 2021, 5(1): 140-149.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0630     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I1/140

Flow Chart of the Proposed Method
A Diagram of Adding a Priori from Document
Schematic Diagram of Using K-means to Cluster the Objects in Fig. 2(e)
Diagram of Text Block Positioning in the Edge Area of the Figure
Algorithm            Precision   Recall   F1
PDFFigures 2.0       0.950       0.725    0.822
Proposed algorithm   0.915       0.918    0.916
Algorithm Performance
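The F1 values and the recall improvement quoted in the abstract follow directly from the reported precision and recall. The short check below assumes the standard definition F1 = 2PR / (P + R); the helper name `f1_score` and the `results` dictionary are illustrative only, and the numbers are those reported above.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported in the performance comparison.
results = {
    "PDFFigures 2.0": (0.950, 0.725),
    "Proposed algorithm": (0.915, 0.918),
}

for name, (p, r) in results.items():
    print(f"{name}: F1 = {f1_score(p, r):.3f}")

# Absolute and relative recall gain of the proposed algorithm over the baseline.
baseline_recall = results["PDFFigures 2.0"][1]
recall_gain = results["Proposed algorithm"][1] - baseline_recall
relative_gain = recall_gain / baseline_recall
print(f"Recall gain: {recall_gain:.3f} ({relative_gain:.1%} relative)")
```

The relative figure divides the absolute gain of 0.193 by the baseline recall of 0.725, which is how the 26.6% in the abstract arises.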
Interference of Non-graph Geometric Elements with Localization Results
The Errors Caused by Hyphens without Using Text Symbols
Error of K Value in K-means Clustering
Diagram of Text Filtering Errors