Constructing Data Set for Location Annotations of Academic Literature Figures and Tables

doi:10.11925/infotech.2096-3467.2019.1330

Data Analysis and Knowledge Discovery, 2020, Vol. 4, Issue (6): 35-42

Yu Fengchang, Lu Wei

School of Information Management, Wuhan University, Wuhan 430072, China

Abstract

[Objective] This study proposes a size-adaptive template matching algorithm for quickly constructing large-scale data sets that annotate the positions of figures and tables in academic literature. [Methods] First, we retrieved documents with figure/table images from the PubMed Open Access database and parsed their contents. Then, we extracted features from the document pages and the figure/table pictures and matched them. Finally, we identified the figure/table positions based on the matched feature points. [Results] The proposed method’s precision and F1 value reached 98.87% and 97.44%, respectively. [Limitations] We only used simple keywords to match literature pages and figure/table pictures. [Conclusions] The proposed algorithm can quickly construct data sets of figure and table positions in academic literature.
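The last step of the method, locating a figure or table on a page from matched feature points, can be illustrated with a minimal sketch. This page does not give the algorithm's details, so everything below is an assumption made for illustration: the function name `locate_template`, the input format (pairs of template and page coordinates produced by some upstream feature matcher), and the use of medians to estimate a single scale factor and offset. It should not be read as the authors' implementation.

```python
from itertools import combinations
from math import hypot
from statistics import median


def locate_template(matches, template_size):
    """Estimate where a template image sits on a page (hypothetical helper).

    matches: list of ((tx, ty), (px, py)) pairs mapping a template point
             to its matched page point.
    template_size: (width, height) of the template image in pixels.
    Returns (left, top, right, bottom) in page coordinates.
    """
    if len(matches) < 2:
        raise ValueError("need at least two matched points")

    # Size adaptation (assumed approach): the scale is the median ratio of
    # pairwise distances measured on the page and on the template.
    ratios = []
    for (t1, p1), (t2, p2) in combinations(matches, 2):
        d_template = hypot(t1[0] - t2[0], t1[1] - t2[1])
        d_page = hypot(p1[0] - p2[0], p1[1] - p2[1])
        if d_template > 0:
            ratios.append(d_page / d_template)
    if not ratios:
        raise ValueError("template points are all identical")
    scale = median(ratios)

    # Offset: median position of the template origin implied by each match.
    left = median(p[0] - scale * t[0] for t, p in matches)
    top = median(p[1] - scale * t[1] for t, p in matches)

    width, height = template_size
    return (left, top, left + scale * width, top + scale * height)
```

Medians are used here so that a few wrong matches do not shift the estimated box; a real pipeline might instead fit the transform with a robust estimator such as RANSAC.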

Key words: Data Set Annotation; Template Matching; Academic Literature

Received: 13 December 2019 Published: 23 April 2020

ZTFLH: TP393

Corresponding Authors: Lu Wei E-mail: weilu@whu.edu.cn

Cite this article:

Yu Fengchang, Lu Wei. Constructing Data Set for Location Annotations of Academic Literature Figures and Tables. Data Analysis and Knowledge Discovery, 2020, 4(6): 35-42.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.1330 OR https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I6/35

A Schematic Diagram of Annotation Mode

Flow Chart of Size Adaptive Template Matching Algorithm

Examples of Figure Annotation

Results of Annotation Experiment

An Example of Error in Correspondence Between Template and Paper Page

An Example of Matching Error Caused by Lack of Corner Features
