Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (6): 35-42     https://doi.org/10.11925/infotech.2096-3467.2019.1330
Research Paper
Constructing Data Set for Location Annotations of Academic Literature Figures and Tables
Yu Fengchang, Lu Wei
School of Information Management, Wuhan University, Wuhan 430072, China
Abstract

[Objective] This study proposes a size-adaptive template matching algorithm for academic literature, used to quickly construct a large-scale data set of figure and table position annotations. [Methods] The PubMed Open Access data set provides each document together with image files of its figures and tables. We parsed the document contents, matched document pages to the corresponding figure/table image files, extracted features from both the pages and the image files, and matched the feature points to locate each figure or table on the page. [Results] In an annotation experiment on the test data set, the proposed method achieved a precision of 98.87% and an F1 value of 97.44%. [Limitations] The algorithm that matches document pages to figure/table image files relies only on simple keyword matching, so its performance still has room for improvement. [Conclusions] The proposed algorithm can quickly construct a data set of figure and table positions in academic literature, saving substantial labor and time.

Key words: Data Set Annotation; Template Matching; Academic Literature
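
The last step described under [Methods], locating a figure or table on a page from matched feature points, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes the keypoint matches between the figure image (the template) and the page image are already available, and that the figure appears on the page with a uniform scale change and no rotation. The function name `locate_template` and its parameters are hypothetical.

```python
from statistics import median


def locate_template(matches, template_size):
    """Estimate where a figure image (the template) sits on a page.

    matches: list of ((tx, ty), (px, py)) pairs, each a template keypoint
             and the page keypoint it was matched to (assumed given).
    template_size: (width, height) of the template in pixels.
    Returns (left, top, right, bottom) in page coordinates.
    Assumption: uniform scale plus translation, no rotation.
    """
    if len(matches) < 2:
        raise ValueError("need at least two matched keypoints")

    # Scale: ratio of page distance to template distance for every pair
    # of matches; the median tolerates a minority of wrong matches.
    ratios = []
    for i in range(len(matches)):
        for j in range(i + 1, len(matches)):
            (t1, p1), (t2, p2) = matches[i], matches[j]
            dt = ((t1[0] - t2[0]) ** 2 + (t1[1] - t2[1]) ** 2) ** 0.5
            dp = ((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2) ** 0.5
            if dt > 0:
                ratios.append(dp / dt)
    if not ratios:
        raise ValueError("matched template keypoints are all identical")
    scale = median(ratios)

    # Translation: where the template's top-left corner lands on the page.
    left = median(p[0] - scale * t[0] for t, p in matches)
    top = median(p[1] - scale * t[1] for t, p in matches)

    width, height = template_size
    return (left, top, left + scale * width, top + scale * height)
```

Because the scale is estimated from the matches rather than fixed in advance, the same routine handles a figure file that is larger or smaller than its rendering on the page, which is the sense in which this sketch is size-adaptive. Medians are used here only as a simple way to limit the influence of a few wrong matches.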
Received: 2019-12-13      Published: 2020-04-23
Chinese Library Classification (ZTFLH):  TP393
Corresponding author: Lu Wei     E-mail: weilu@whu.edu.cn
Cite this article:
Yu Fengchang, Lu Wei. Constructing Data Set for Location Annotations of Academic Literature Figures and Tables[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 35-42.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.1330      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I6/35
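Fig. 1  Illustration of the annotation scheme
Fig. 2  Flowchart of the size-adaptive template matching algorithm
Fig. 3  Examples of figure and table annotations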
Metric    Precision    Recall    F1
Result    98.87%       96.06%    97.44%
Table 1  Results of the annotation experiment
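
As a quick consistency check on Table 1, the F1 value can be recomputed from the reported precision and recall using the standard definition of F1 as their harmonic mean. The helper name `f1_score` is only illustrative, and the snippet says nothing about how the authors counted correct annotations.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in Table 1, in percent.
f1 = f1_score(98.87, 96.06)
print(round(f1, 2))  # prints 97.44
```

The recomputed value agrees with the F1 reported in the table.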
Fig. 4  Incorrect correspondence between template and document page
Fig. 5  Matching error caused by too few corner features in the image
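Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing  Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn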