Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (6): 38-49    DOI: 10.11925/infotech.2096-3467.2022.0662
Review of Detection Methods for Scientific Data Citations
Zhou Jiayin, Qian Qing, Tang Mingkun, Wu Sizhu
Institute of Medical Information, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing 100020, China
Abstract  

[Objective] This paper analyzes the characteristics of existing data citation practices, summarizes the methods used to detect data citations, and discusses the current state of research and future trends. [Methods] Existing data citation detection methods fall into three categories: rule-based recognition, supervised machine learning, and semi-supervised machine learning. For each category, we review its principles, characteristics, known problems, performance, and applications. [Results] Existing techniques concentrate on supervised machine learning. Detecting data citations with the help of citing behavior and extracting data citation elements are the future directions. [Limitations] This paper summarizes the characteristics of data citations and of existing recognition algorithms but does not elaborate on the technical details of those algorithms. [Conclusions] Data citation detection still faces problems that need further work, such as restriction to particular research fields, a lack of methodological diversity, and insufficient consideration of data citation characteristics.

Keywords: Scientific Data; Data Citation; Data Sharing; Citation Identification
Received: 27 June 2022      Published: 09 August 2023
ZTFLH: G203; G301
Fund: Medical and Health Science and Technology Innovation Project of Chinese Academy of Medical Sciences (2021-I2M-1-057)
Corresponding Author: Wu Sizhu, ORCID: 0000-0003-4540-9910, E-mail: wu.sizhu@imicams.ac.cn

Cite this article:

Zhou Jiayin, Qian Qing, Tang Mingkun, Wu Sizhu. Review of Detection Methods for Scientific Data Citations. Data Analysis and Knowledge Discovery, 2023, 7(6): 38-49.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.0662     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2023/V7/I6/38

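| Publisher | Journal | Data availability statement | Required data citation elements | Required location of data citation information |
|---|---|---|---|---|
| Springer Nature | Humanities and Social Science Communications[12] | Mandatory; peer reviewers are encouraged to check the manuscript's data availability statement | At least 4 items: authors, dataset title, publisher (data repository name), and persistent identifier; the unique data identifier should be given as a full URL | Reference list |
| Springer Nature | Scientific Data[13] | Mandatory; peer reviewers will check the manuscript's data availability statement | At least 4 items: authors, dataset title, publisher (data repository name), and persistent identifier; the unique data identifier should be given as a full URL | Reference list |
| SAGE | Technology in Cancer Research & Treatment[14] | Mandatory | 4 items: authors, dataset title, publisher, and persistent identifier | Main text |
| Taylor & Francis | Annals of Medicine[15] | Mandatory | At least 4 items: authors, dataset title, publisher, and persistent identifier | Data availability statement section |
| Elsevier | Med[16] | Mandatory | 5 items: authors, title, publisher, version, and persistent identifier | Data availability statement section |
| Wiley | Journal of Chemometrics[17] | Mandatory | 6 items: authors, year, dataset title, publisher, version, and persistent identifier | Reference list |

Journal Data Citation Policies of Five Major Publishers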
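| Citation status | Completeness of citation elements | Location of detection target | Features of detection target | Example | Method |
|---|---|---|---|---|---|
| Formal citation | All citation elements present, including key elements such as authors, dataset title, unique data identifier, resolvable address, and data publication source | Reference list, data availability statement, appendix | URL, unique data identifier, dataset name, etc. | "Elliott, Joshua (2013): Simulated county- and state-level maize yields, 1979-2012. Version 1. Figshare. http://dx.doi.org/10.6084/m9.figshare.501263"; "Southwest University (2006): Sea Island cotton (海岛棉), National Natural Science and Technology Resources e-Platform (国家自然科技资源e平台), doi:10.3416/db.ninr.2151C0001A00014341" | Rule-based recognition |
| Informal citation | Citation elements missing; locations scattered | Main text, tables, figures, reference list, acknowledgments, appendix, footnotes, notes | Dataset name, data repository name, URL, etc. | "Elliott's Maize Yield Data (2013). Data accessed from Figshare [June 15, 2015]"; "We thank the U.S. National Snow and Ice Data Center (NSIDC) for providing the ICESat data" | Rule-based recognition, machine learning algorithms |
| Not cited | No citation, or citing behavior without any key citation element | Main text | Cue words and phrases such as "感谢" (thanks) and "来源" (source) | The data in the article come from an offline questionnaire survey | Machine learning algorithms |

Data Citation Features Analysis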
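The three citation statuses in the table above can be approximated with a very rough heuristic. The following is a minimal sketch, not a method from the reviewed literature: it assumes that a DOI signals a formal citation and that a bare URL or a known repository name signals an informal one. The `REPOSITORIES` tuple and the function name `citation_status` are hypothetical and purely illustrative.

```python
import re

# Assumption: a DOI is treated as the persistent identifier of a formal citation.
DOI = re.compile(r'\b10\.\d{4,9}/[^\s"”;,]+')
URL = re.compile(r"https?://\S+|ftp://\S+")
# Illustrative repository names only; a real lexicon would be far larger.
REPOSITORIES = ("Figshare", "NSIDC", "Dryad", "Zenodo")


def citation_status(text):
    """Return 'formal', 'informal', or 'uncited' for one passage (heuristic)."""
    if DOI.search(text):
        return "formal"
    lowered = text.lower()
    if URL.search(text) or any(name.lower() in lowered for name in REPOSITORIES):
        return "informal"
    return "uncited"


examples = [
    "Elliott, Joshua (2013): Simulated county- and state-level maize yields, "
    "1979-2012. Version 1. Figshare. http://dx.doi.org/10.6084/m9.figshare.501263",
    "Elliott's Maize Yield Data (2013). Data accessed from Figshare [June 15, 2015]",
    "The data were collected through an offline questionnaire survey.",
]
for passage in examples:
    print(citation_status(passage), "<-", passage[:40])
```

A heuristic like this ignores the other citation elements (authors, title, publisher), so it would over-report formal citations wherever a DOI points to an article rather than a dataset.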
Database: accession-number regular expressions (alternatives separated by semicolons)

ENA: [A-Z][0-9]{5}; [A-Z]{2}[0-9]{6}; [A-Z]{3}[0-9]{5}; [A-Z]{4}[0-9]{8,10}; [A-Z]{5}[0-9]{7}
ENCODE: ENCSR000[A-Z]{3}
UniProt: [A-NR-Z][0-9][A-Z][A-Z0-9][A-Z0-9][0-9]; [OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9]
SRA: ERX[0-9]{6}; SR[A-Z][0-9]{6,7}
PDBe: [0-9][A-Z0-9]{3}
ENA: ER(A|P|R)[0-9]{6}
InterPro: IPR[0-9]{6}
KEGG: sce[0-9]{5}
Pfam: PF(AM)?[0-9]{5}
BioProject: PRJ[A-Z]{2}[0-9]{4,6}
ArrayExpress: E-[A-Z]{4}-[0-9]+
BioSample: SAMN[0-9]{8}
OMIM: [0-9]{6}
GEO: GSE\d+; GSM\d+; GPL[0-9]{4,5}
Ensembl: ENS[A-Z]*G[0-9]{11}+
Assembly: GCA_\d{9}(?:[.]\d+)?; GCF_\d{9}(?:[.]\d+)?
RefSeq: (AC|AP|NC|NG|NM|NP|NR|NT|NW|NZ|XM|XP|XR|YP|ZP|NS)_([A-Z]{4})*[0-9]{6,9}(?:[.][0-9]+)?
GenBank: [A-Z]{3}[0-9]{5}(\.[0-9]{1})?; [A-Z]{2}[0-9]{6}(\.[0-9]{1,2})?; [A-Z]{4}[0-9]{8,9}(\.[0-9]{1})?; L[0-9]{5}(\.[0-9]{1})?; X[0-9]{5}(\.[0-9]{1})?; N[A-Z]{1}_[0-9]{6,9}(\.[0-9]{1})?; XP_[0-9]G[0-9]{5}
RefSNP: RS[0-9]{5,9}
ClinVar: SCV\d{9}(?:[.]\d+)?; RCV\d{9}(?:[.]\d+)?; VCV\d{9}(?:[.]\d+)?

Regular Expressions for Accession Numbers of Common Biological Databases
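Accession-number patterns such as those in the table above are typically applied by scanning article text with each expression in turn. The sketch below uses a small subset of the listed patterns with Python's standard `re` module. The word-boundary anchors (`\b`), the dictionary name `ACCESSION_PATTERNS`, the function `find_accessions`, and the sample sentence are assumptions added for illustration and are not part of the original table.

```python
import re

# Subset of the table's patterns, written without the stray spaces left by extraction.
# The \b anchors are an added assumption to avoid matching inside longer tokens.
ACCESSION_PATTERNS = {
    "GEO": r"\bGSE\d+\b|\bGSM\d+\b|\bGPL[0-9]{4,5}\b",
    "BioProject": r"\bPRJ[A-Z]{2}[0-9]{4,6}\b",
    "BioSample": r"\bSAMN[0-9]{8}\b",
    "InterPro": r"\bIPR[0-9]{6}\b",
    "ArrayExpress": r"\bE-[A-Z]{4}-[0-9]+\b",
    "ClinVar": r"\b(?:SCV|RCV|VCV)\d{9}(?:[.]\d+)?\b",
}
COMPILED = {db: re.compile(pattern) for db, pattern in ACCESSION_PATTERNS.items()}


def find_accessions(text):
    """Return a sorted list of (database, accession) pairs found in the text."""
    hits = set()
    for db, pattern in COMPILED.items():
        for match in pattern.finditer(text):
            hits.add((db, match.group(0)))
    return sorted(hits)


# Made-up sentence for demonstration; the accession numbers are not real records.
sample = ("Raw reads were deposited in GEO under accession GSE12345 "
          "(BioProject PRJNA123456; BioSample SAMN01234567).")
print(find_accessions(sample))
```

Short, generic patterns such as the six-digit OMIM expression would match many unrelated numbers, so in practice they probably need to be combined with a nearby database name or cue word before a hit is accepted.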
| Type | Cue words and phrases |
|---|---|
| URL | .edu, ftp://*, .gov, .com |
| Institution names | National Institutes of Health, NIH |
| Commercial databases | commercial, Inc., laboratories, Ltd. |
| High-frequency verbs | accession, available [at,from]*, obtained from, purchased from, stored, deposited, donated, download, added, archived, assigned, posted, entered, imported, included, inserted, loaded, lodged, placed, provided, registered, submitted, uploaded_to |
| Common nouns | dataset*, repository, [S,s]uppl, [S,s]upplemental, survey, sample sets, publicly available, database, Data availability statement, Dataset on Github, DATA_JOURNAL_DOIS, DATA_AVAILABILITY |

Common Characteristic Words and Sentences
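Cue words like those listed above are commonly used in rule-based detection to flag candidate sentences for closer inspection. The following is a minimal sketch under stated assumptions: the lexicon is a partial, hand-written rendering of the table, the regular-expression spellings are my own, and the rule "at least two cue types in one sentence" is a hypothetical threshold rather than one taken from the reviewed studies.

```python
import re

# Partial cue lexicon adapted from the table; the regex spellings are assumptions.
CUE_PATTERNS = {
    "url": r"ftp://\S+|\S+\.(?:edu|gov|com)\b",
    "verb": (r"\b(?:deposited|archived|submitted|download(?:ed)?|obtained from|"
             r"purchased from|available (?:at|from))\b"),
    "noun": r"\b(?:datasets?|repository|suppl\w*|publicly available|database)\b",
}
COMPILED = {name: re.compile(p, re.IGNORECASE) for name, p in CUE_PATTERNS.items()}


def cue_types(sentence):
    """Return the set of cue types (url, verb, noun) that occur in a sentence."""
    return {name for name, pattern in COMPILED.items() if pattern.search(sentence)}


def candidate_sentences(text, min_types=2):
    """Naively split text into sentences and keep those with enough cue types."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(cue_types(s)) >= min_types]


# Made-up passage for demonstration.
passage = ("We surveyed 200 farmers in person. "
           "The sequencing dataset was deposited in a public repository "
           "and is available from ftp://example.org/data.")
for sentence in candidate_sentences(passage):
    print(sentence)
```

Requiring more than one cue type trades recall for precision: a sentence that only says "database" is skipped, while one that pairs a deposit verb with a data noun is kept.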
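| Method | Advantages | Disadvantages | Scope of application | Optimization direction |
|---|---|---|---|---|
| Rule-based recognition | Accurate; high accuracy within a specific domain | Low recall; cannot discover new rules or patterns; domain-limited; matching features must be defined by experts, which is time-consuming and laborious; transfers poorly to corpora from new domains | A single domain; journals with well-standardized data availability statements; articles that cite data with unique data identifiers | Update the feature dictionary promptly as journals publish data citation policies; expand the feature library of unique data identifiers |
| Statistical machine learning | High accuracy; good generalization | Requires manual annotation of the start of each data citation sentence and of every citation element; sensitive to sample size | Datasets with small citation samples and low dimensionality; cannot extract directly from full text | Locate likely citation positions in the article first, then recognize locally |
| Deep learning | Strong learning ability; higher accuracy; highly transferable | Complex model design; high hardware requirements; heavy computation; cumbersome to deploy | Datasets with large citation samples and complex features; cannot extract directly from full text | |
| Semi-supervised machine learning | No manual annotation needed; features are selected and extracted automatically | Needs a large number of samples; complex; weaker performance | Datasets with large citation samples and fairly distinct features | Optimize the algorithm by adjusting the seed cue-word selection strategy |

Comparison of Data Citation Identification Methods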
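The semi-supervised row above refers to seed-based learning, in which a few known dataset names are used to discover the contexts they appear in, and those contexts are then used to find new names. The sketch below illustrates that general bootstrapping idea only; it is not the algorithm of any cited study. The function name `bootstrap`, the two-word left-context window, and the sample sentences are all assumptions made for illustration.

```python
import re


def bootstrap(sentences, seeds, rounds=2, window=2):
    """Grow a set of dataset names from seed names via shared left contexts."""
    known = set(seeds)
    for _ in range(rounds):
        # Step 1: collect the `window` words that precede each known name.
        contexts = set()
        for sentence in sentences:
            for name in known:
                for match in re.finditer(re.escape(name), sentence):
                    left = sentence[:match.start()].split()[-window:]
                    if len(left) == window:
                        contexts.add(" ".join(left))
        # Step 2: wherever a learned context recurs, take the next
        # capitalized token as a candidate dataset name.
        found = set()
        for sentence in sentences:
            for context in contexts:
                pattern = re.escape(context) + r"\s+([A-Z][A-Za-z0-9-]+)"
                for match in re.finditer(pattern, sentence):
                    found.add(match.group(1))
        if found <= known:  # nothing new was learned, so stop early
            break
        known |= found
    return known


# Made-up sentences for demonstration.
corpus = [
    "We used data from ALLBUS for the analysis.",
    "The models rely on data from SOEP collected in 2010.",
    "Respondents came from Germany.",
]
print(sorted(bootstrap(corpus, {"ALLBUS"})))
```

Because every new name can introduce new contexts, a poor seed or an overly general context (for example a single word such as "from") lets errors spread quickly, which is one reason the seed selection strategy matters.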
[1] UNESCO. UNESCO Recommendation on Open Science[EB/OL]. [2022-06-01]. https://unesdoc.unesco.org/ark:/48223/pf0000379949.
[2] Kong Lihua, Xi Yan, Jiang Lulu. Open Sharing and Publishing Policies for Research Data of Scientific Journals[J]. Chinese Journal of Scientific and Technical Periodicals, 2022, 33(2): 192-199.
doi: 10.11946/cjstp.202106300526
[3] Springer Nature. Research Data Policies[EB/OL]. [2022-05-31]. https://www.springernature.com/gp/authors.
[4] Parsons M A, Duerr R E, Jones M B. The History and Future of Data Citation in Practice[J]. Data Science Journal, 2019, 18(1): 52.
doi: 10.5334/dsj-2019-052
[5] FORCE 11. Joint Declaration of Data Citation Principles[EB/OL]. [2022-09-09]. https://force11.org/info/joint-declaration-of-data-citation-principles-final/.
[6] Vasilevsky N A, Minnier J, Haendel M A, et al. Reproducible and Reusable Research: Are Journal Data Sharing Policies Meeting the Mark?[J]. PeerJ, 2017, 5: e3208.
doi: 10.7717/peerj.3208
[7] Qiu Junping, Xiao Boxuan, Xu Zhongyang, et al. Multi-Dimensional Analysis of Data Citation in the Field of Library and Information Science at Home and Abroad[J]. Information Studies: Theory & Application, 2022, 45(9): 44-50.
doi: 10.16353/j.cnki.1000-7490.2022.09.007
[8] USGS. Data Citation[EB/OL]. [2022-06-01]. https://www.usgs.gov/data-management/data-citation.
[9] Springer Nature. Data Available Statement[EB/OL]. [2022-09-09]. https://www.springernature.com/gp/authors/research-data-policy/data-availability-statements/12330880.
[10] Web of Science. Recommended Practices to Promote Scholarly Data Citation and Tracking[EB/OL]. [2022-06-13]. https://clarivate.com/webofsciencegroup/wp-content/uploads/sites/2/2019/08/Crv_WOS_Whitepaper_DCI_web.pdf.
[11] DataCite. Why is It So Important to Cite Data?[EB/OL]. [2022-06-13]. https://datacite.org/cite-your-data.html.
[12] Humanities and Social Science Communications. Availability of Materials and Data[EB/OL]. [2022-09-13]. https://www.nature.com/palcomms/journal-policies/editorial-and-publishing-policies#Availability%20of%20materials%20and%20data.
[13] Scientific Data. Data Policies[EB/OL]. [2022-09-13]. https://www.nature.com/sdata/policies/data-policies.
[14] SAGE Journals. Submit Paper[EB/OL]. [2022-09-09]. https://journals.sagepub.com/author-instructions/TCT#ResearchData.
[15] Annals of Medicine. Instructions for Authors[EB/OL]. [2022-09-09]. https://www.tandfonline.com/action/authorSubmission?show=instructions&journalCode=iann20#dsp.
[16] Med. Information for Authors[EB/OL]. [2023-04-15]. https://www.cell.com/med/authors.
[17] Wiley’s Data Citation Policy[EB/OL]. [2023-04-15]. https://authorservices.wiley.com/author-resources/Journal-Authors/open-access/data-sharing-citation/data-citation-policy.html.
[18] GB/T 35294-2017, Information Technology: Scientific Data Citation[S]. Beijing: China Quality Inspection Press, 2017.
[19] Yang Ning, Zhang Zhiqiang. Research on the Use Characteristics of Scientific Datasets Combined with Quantitative Analysis and Content Analysis[J]. Library and Information Service, 2022, 66(10): 122-130.
doi: 10.13266/j.issn.0252-3116.2022.010.011
[20] Sun Yuwei, Cheng Ying, Xie Juan. A Review on the Data Reuse Behavior of Scholars: System Review and Meta Synthesis[J]. Journal of Library Science in China, 2019, 45(3): 110-130.
[21] Zai Bingxin. Research on Research Data Security During the Process of Data Sharing: A Case Study of University Research Data Sharing Policy in Australia[J]. New Century Library, 2022(1): 61-68.
[22] Grechkin M, Poon H, Howe B. Wide-Open: Accelerating Public Data Release by Automating Detection of Overdue Datasets[J]. PLoS Biology, 2017, 15(6): e2002477.
doi: 10.1371/journal.pbio.2002477
[23] Goldstein J C, Mayernik M S, Ramapriyan H K. Identifiers for Earth Science Data Sets: Where We Have Been and Where We Need to Go[J]. Data Science Journal, 2017, 16: 23.
doi: 10.5334/dsj-2017-023
[24] re3data[EB/OL]. [2022-06-14]. https://www.re3data.org/search?subjects%5B%5D=2%20Life%20Sciences.
[25] Kafkas Ş, Kim J H, Pi X J, et al. Database Citation in Supplementary Data Linked to Europe PubMed Central Full Text Biomedical Articles[J]. Journal of Biomedical Semantics, 2015, 6.
doi: 10.1186/2041-1480-6-1
[26] Jiao Hong, Yang Bo, Zhou Qi. Research on Characteristics of Scientific Datasets Reuse in the Field of Biomedicine[J]. Information Studies: Theory & Application, 2021, 44(9): 90-96.
doi: 10.16353/j.cnki.1000-7490.2021.09.013
[27] Womack R P. Research Data in Core Journals in Biology, Chemistry, Mathematics, and Physics[J]. PLoS One, 2015, 10(12): e0143460.
doi: 10.1371/journal.pone.0143460
[28] Ghavimi B, Mayr P, Lange C, et al. A Semi-Automatic Approach for Detecting Dataset References in Social Science Texts[J]. Information Services & Use, 2017, 36(3-4): 171-187.
[29] Park H, You S, Wolfram D. Informal Data Citation for Data Sharing and Reuse is More Common than Formal Data Citation in Biomedical Fields[J]. Journal of the Association for Information Science and Technology, 2018, 69(11): 1346-1354.
doi: 10.1002/asi.2018.69.issue-11
[30] Riedel N, Kip M, Bobrov E. ODDPub—A Text-Mining Algorithm to Detect Data Sharing in Biomedical Publications[J]. Data Science Journal, 2020, 19(1): 42.
doi: 10.5334/dsj-2020-042
[31] Piwowar H, Chapman W. Identifying Data Sharing in Biomedical Literature[J]. Nature Precedings, 2008.
doi: 10.1038/npre.2008.1721.1
[32] Névéol A, Wilbur W J, Lu Z Y. Extraction of Data Deposition Statements from the Literature: A Method for Automatically Tracking Research Results[J]. Bioinformatics, 2011, 27(23): 3306-3312.
doi: 10.1093/bioinformatics/btr573 pmid: 21998156
[33] Zhao Jiajun. Research on Data Citation Identification in Scientific Literature[D]. Nanjing: Nanjing Agricultural University, 2019.
[34] Colavizza G, Hrynaszkiewicz I, Staden I, et al. The Citation Advantage of Linking Publications to Research Data[J]. PLoS One, 2020, 15(4): e0230416.
doi: 10.1371/journal.pone.0230416
[35] Yang Ning, Zhang Zhiqiang. Research on the Method of Formal Citation Recognition of Scientific Data Based on Machine Learning[J]. Journal of Intelligence, 2022, 41(2): 182-189.
[36] Goodfellow I, Bengio Y, Courville A. Deep Learning[M]. Cambridge: MIT Press, 2016.
[37] Yang Ning, Zhang Zhiqiang. Research on Formal Citation Recognition Method of Scientific Data Fused with Full-Text Information[J]. Information Studies: Theory & Application, 2022, 45(2): 191-197.
[38] Hou L L, Zhang J, Wu O, et al. Method and Dataset Entity Mining in Scientific Literature: A CNN + BiLSTM Model with Self-Attention[J]. Knowledge-Based Systems, 2022, 235: 107621.
doi: 10.1016/j.knosys.2021.107621
[39] Boland K, Ritze D, Eckert K, et al. Identifying References to Datasets in Publications[C]// Proceedings of the 2012 International Conference on Theory and Practice of Digital Libraries. Berlin, Heidelberg: Springer, 2012: 150-161.
[40] Zhang Qiuzi. Automatic Data Usage Identification in Scientific Articles: An Example from Computer Science[D]. Wuhan: Wuhan University, 2017.
[41] Groth P, Cousijn H, Clark T, et al. FAIR Data Reuse—The Path Through Data Citation[J]. Data Intelligence, 2020, 2(1/2): 78-86.
doi: 10.1162/dint_a_00030
[42] Smith L M, Kearney T D, Rutherford C, et al. Data Identification, Citation and Tracking Best Practices: A White Paper from the Observatory Best Practices/Lessons Learned Series[R]. Washington, DC: Consortium for Ocean Leadership, 2019.
doi: 10.25607/OBP-505
[43] Shi Yali. Analysis on the Key Issues in the Implementation of Scientific Data Citation Standards[J]. Journal of Modern Information, 2019, 39(4): 34-41.
doi: 10.3969/j.issn.1008-0821.2019.04.004
[44] Force M M, Robinson N J. Encouraging Data Citation and Discovery with the Data Citation Index[J]. Journal of Computer-Aided Molecular Design, 2014, 28(10): 1043-1048.
doi: 10.1007/s10822-014-9768-5 pmid: 24980647
[45] Robinson-García N, Jiménez-Contreras E, Torres-Salinas D. Analyzing Data Citation Practices Using the Data Citation Index[J]. Journal of the Association for Information Science and Technology, 2016, 67(12): 2964-2975.
doi: 10.1002/asi.2016.67.issue-12
[46] Clarivate. Data Citation Index[EB/OL]. [2022-06-16]. https://clarivate.com/webofsciencegroup/solutions/webofscience-data-citation-index/.
[47] Buneman P, Dosso D, Lissandrini M, et al. Data Citation and the Citation Graph[J]. Quantitative Science Studies, 2021, 2(4): 1399-1422.
doi: 10.1162/qss_a_00166
[48] Tu Zhifang, Liu Ziheng. Advances in Evaluation of Research Data Management Services at Home and Abroad: Research and Practice[J]. Library Development, 2021(2): 108-117.
[49] Digital Science, Fane B, Ayris P, et al. The State of Open Data Report 2019[R/OL]. [2019-10-24]. https://doi.org/10.6084/m9.figshare.9980783.v2.
[50] Vines T H, Andrew R L, Bock D G, et al. Mandated Data Archiving Greatly Improves Access to Research Data[J]. The FASEB Journal, 2013, 27(4): 1304-1308.
doi: 10.1096/fsb2.v27.4
[51] Digital Science, Simons N, Goodey G, et al. The State of Open Data 2021[R/OL]. [2021-11-30]. https://doi.org/10.6084/m9.figshare.17061347.v1.
[52] Cho J. Study About Research Data Citation Based on DCI (Data Citation Index)[J]. Journal of the Korean Society for Library and Information Science, 2016, 50(1): 189-207.
doi: 10.4275/KSLIS.2016.50.1.189