[Objective] This paper aims to improve the recall of extracting figures and tables from academic literature. [Methods] First, we extracted geometric objects from the PDF files of the literature. Second, we obtained a priori information on the scopes of figures/tables from two perspectives: underlying code analysis and image understanding. Third, we merged the geometric objects using K-means clustering. Finally, we reconstructed the text content with a heuristic algorithm to determine the locations of figures/tables. [Results] On the experimental dataset, the proposed algorithm achieved a precision of 0.915 and a recall of 0.918. The precision is close to that of state-of-the-art algorithms, while the recall is improved by 0.193 (26.6% higher than existing methods). [Limitations] Documents with complex layouts or irregular use of symbols may cause errors. The determination of the clustering value k and the text-filtering algorithm could be improved. [Conclusions] The proposed algorithm effectively increases the recall of extracting figures/tables from academic literature.
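The merging step in [Methods] (grouping geometric objects with K-means, then treating each cluster as one figure/table region) can be sketched as below. This is a minimal illustration, not the paper's implementation: the bounding-box representation, the deterministic seeding, and the fixed k are all assumptions for the example, whereas the paper derives k from the a priori scope information.

```python
# Sketch: cluster PDF geometric objects by their bounding-box centers with
# a plain K-means pass, then merge each cluster into one enclosing box.
# Boxes are (x0, y0, x1, y1) tuples; all names here are illustrative.

def center(box):
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)

def kmeans(points, k, iters=50):
    # Deterministic seeding for the sketch: first k points as centroids.
    cents = [points[i] for i in range(k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: (p[0] - cents[c][0]) ** 2 + (p[1] - cents[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                cents[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return assign

def merge_boxes(boxes, k):
    # Merge every cluster of objects into one enclosing bounding box,
    # i.e. a candidate figure/table region.
    assign = kmeans([center(b) for b in boxes], k)
    merged = []
    for c in range(k):
        group = [boxes[i] for i in range(len(boxes)) if assign[i] == c]
        if group:
            merged.append((
                min(b[0] for b in group), min(b[1] for b in group),
                max(b[2] for b in group), max(b[3] for b in group),
            ))
    return merged
```

For example, four objects forming two well-separated groups merge into two regions: `merge_boxes([(0, 0, 10, 10), (5, 5, 15, 15), (100, 100, 110, 110), (105, 105, 115, 115)], 2)` yields one box per group.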
[1]
(Hu Rong, Tang Zhengui, Zhao Yuxiang, et al. Integrated Framework and Visual Knowledgometrics Exploration for Analyzing Visual Resources in Academic Literature[J]. Journal of the China Society for Scientific and Technical Information, 2017,36(2):141-151.)
[2]
Landhuis E. Scientific Literature: Information Overload[J]. Nature, 2016,535(7612):457-458.
doi: 10.1038/nj7612-457a
pmid: 27453968
[3]
Choudhury S R, Wang S T, Giles C L. Scalable Algorithms for Scholarly Figure Mining and Semantics[C]// Proceedings of the ACM SIGMOD International Conference on Management of Data. 2016: 1-6.
[4]
Perez-Arriaga M O, Estrada T, Abad-Mota S. TAO: System for Table Detection and Extraction from PDF Documents[C]// Proceedings of the 29th International Florida Artificial Intelligence Research Society Conference. 2016: 591-596.
[5]
Corrêa A S, Zander P-O. Unleashing Tabular Content to Open Data[C]// Proceedings of the 18th Annual International Conference on Digital Government Research. 2017: 54-63.
[6]
Siegel N, Horvitz Z, Levin R, et al. Figureseer: Parsing Result-Figures in Research Papers[C]// Proceedings of 2016 European Conference on Computer Vision. 2016: 664-680.
[7]
Al-Zaidy R A, Giles C L. A Machine Learning Approach for Semantic Structuring of Scientific Charts in Scholarly Documents[C]// Proceedings of the 31st AAAI Conference on Artificial Intelligence. 2017: 4644-4649.
[8]
(Zhang Zhongyi, Fang Mei. Feature Analysis on the Unobserved Academic Misconduct of Scientific Papers[J]. Chinese Journal of Scientific and Technical Periodicals, 2019,30(1):24-28.)
[9]
Ma Y X, Tung A K H, Wang W, et al. ScatterNet: A Deep Subjective Similarity Model for Visual Analysis of Scatterplots[J]. IEEE Transactions on Visualization and Computer Graphics, 2020,26(3):1562-1576.
doi: 10.1109/TVCG.2018.2875702
pmid: 30334762
[10]
(Zhang Jing. Comparative Analysis of Figshare Platform and CNKI Academic Picture Library[J]. Science-Technology & Publication, 2015(1):63-66.)
[11]
Yu C N, Levy C C, Saniee I. Convolutional Neural Networks for Figure Extraction in Historical Technical Documents[C]// Proceedings of the International Conference on Document Analysis and Recognition. 2017.
[12]
Cliche M, Rosenberg D, Madeka D, et al. Scatteract: Automated Extraction of Data from Scatter Plots[C]// Proceedings of Joint European Conference on Machine Learning and Knowledge Discovery in Databases. 2017.
[13]
Amin A, Shiu R. Page Segmentation and Classification Utilizing Bottom-Up Approach[J]. International Journal of Image and Graphics, 2001,1(2):345-361.
[14]
Chen K, Seuret M, Liwicki M, et al. Page Segmentation of Historical Document Images with Convolutional Autoencoders[C]// Proceedings of the International Conference on Document Analysis and Recognition. 2015.
[15]
Simon A, Pret J-C, Johnson A P. A Fast Algorithm for Bottom-Up Document Layout Analysis[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997,19(3):273-277.
[16]
Ha J, Haralick R M, Phillips I T. Recursive X-Y Cut Using Bounding Boxes of Connected Components[C]// Proceedings of the 3rd International Conference on Document Analysis and Recognition. 1995: 952.
[17]
Choudhury S R, Mitra P, Giles C L. Automatic Extraction of Figures from Scholarly Documents[C]// Proceedings of the 2015 ACM Symposium on Document Engineering. 2015.
[18]
Li P Y, Jiang X Y, Shatkay H. Figure and Caption Extraction from Biomedical Documents[J]. Bioinformatics, 2019,35(21):4381-4388.
pmid: 30949681
[19]
Clark C, Divvala S. Looking Beyond Text: Extracting Figures, Tables, and Captions from Computer Science Papers[C]// Proceedings of AAAI 2015 Workshop on Scholarly Big Data. 2015.
[20]
Clark C, Divvala S. PDFFigures 2.0: Mining Figures from Research Papers[C]// Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries. 2016: 143-152.
[21]
Jang H, Chae Y, Lee S, et al. Automatic Object Extraction from Electronic Documents Using Deep Neural Network[J]. KIPS Transactions on Software and Data Engineering, 2018,7(11):411-418.
[22]
Rahman M M, Finin T. Understanding and Representing the Semantics of Large Structured Documents[OL]. arXiv Preprint, arXiv:1807.09842.
[23]
Hansen M, Pomp A, Erki K, et al. Data-Driven Recognition and Extraction of PDF Document Elements[J]. Technologies, 2019,7(3):65.
[24]
Barnes D G, Vidiassov M, Ruthensteiner B, et al. Embedding and Publishing Interactive, 3-Dimensional, Scientific Figures in Portable Document Format (PDF) Files[J]. PLoS One, 2013,8(9):e69446.