[Objective] This paper analyzes the characteristics of existing data citation practices, summarizes the methods used to recognize data citations, and discusses current research and future development trends. [Methods] We divide the existing data citation detection methods into three categories: rule-based recognition, supervised machine learning, and semi-supervised machine learning. For each category, we review its principles, characteristics, existing problems, performance, and applications. [Results] Existing techniques concentrate on supervised machine learning algorithms. Future work is likely to focus on detecting data citations with the help of citing behaviors and on extracting the elements of data citations. [Limitations] This paper summarizes the characteristics of data citations and the existing recognition algorithms but does not elaborate on the technical details of these algorithms. [Conclusions] Data citation detection still faces several problems, including limited coverage of research fields, a lack of diversity in methods, and insufficient consideration of the characteristics of data citations, all of which need further optimization.