[Objective] This study proposes a size-adaptive template matching algorithm for quickly constructing large-scale datasets that annotate the positions of figures and tables in academic literature. [Methods] First, we retrieved documents with figure/table images from the PubMed Open Access database and parsed their contents. Then, we paired document pages with the figure/table images and extracted features from both. Finally, we identified the figure/table positions from the matched feature points. [Results] The proposed method achieved a precision of 98.87% and an F1 value of 97.44%. [Limitations] We used only simple keywords to pair literature pages with figure/table images. [Conclusions] The proposed algorithm can quickly construct datasets of figure and table positions in academic literature.
Yu Fengchang, Lu Wei. Constructing Data Set for Location Annotations of Academic Literature Figures and Tables[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 35-42.
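The matching step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes grayscale images stored as nested lists, interprets "size-adaptive" as trying several candidate scales, and swaps the feature-point matching for an exhaustive normalized cross-correlation search. The names `resize_nearest`, `ncc`, `locate_figure`, and the `scales` parameter are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def resize_nearest(img: Image, new_h: int, new_w: int) -> Image:
    """Resize with nearest-neighbour sampling (kept simple for illustration)."""
    h, w = len(img), len(img[0])
    return [[img[r * h // new_h][c * w // new_w] for c in range(new_w)]
            for r in range(new_h)]


def ncc(page: Image, tmpl: Image, top: int, left: int) -> float:
    """Normalized cross-correlation of tmpl with the page window at (top, left)."""
    th, tw = len(tmpl), len(tmpl[0])
    n = th * tw
    win = [page[top + r][left + c] for r in range(th) for c in range(tw)]
    tpl = [v for row in tmpl for v in row]
    mean_w, mean_t = sum(win) / n, sum(tpl) / n
    num = sum((a - mean_w) * (b - mean_t) for a, b in zip(win, tpl))
    den = (sum((a - mean_w) ** 2 for a in win)
           * sum((b - mean_t) ** 2 for b in tpl)) ** 0.5
    return num / den if den > 0 else 0.0  # flat windows carry no signal


def locate_figure(page: Image, figure: Image, scales: Sequence[float]
                  ) -> Optional[Tuple[float, int, int, int, int]]:
    """Return (score, top, left, height, width) of the best match over all scales."""
    best = None
    ph, pw = len(page), len(page[0])
    fh, fw = len(figure), len(figure[0])
    for s in scales:
        th, tw = max(1, round(fh * s)), max(1, round(fw * s))
        if th > ph or tw > pw:
            continue  # the rescaled figure would not fit on the page
        tmpl = resize_nearest(figure, th, tw)
        for top in range(ph - th + 1):
            for left in range(pw - tw + 1):
                score = ncc(page, tmpl, top, left)
                if best is None or score > best[0]:
                    best = (score, top, left, th, tw)
    return best


# Toy data (assumed): a 3x3 "figure" rendered at twice its size on a 12x12 page.
figure = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
page = [[0.0] * 12 for _ in range(12)]
for r in range(6):
    for c in range(6):
        page[4 + r][3 + c] = figure[r // 2][c // 2]

match = locate_figure(page, figure, scales=[1.0, 2.0])
print(match)
```

The returned tuple gives the bounding box of the figure on the page, which is the kind of position annotation the dataset records. The exhaustive search is only practical for toy inputs; real page images would call for an optimized routine.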