Review of Semantic Representation of Experimental Protocols at Process-Level
Fu Yun1,2, Liu Xiwen1,2, Zhu Liya1, Han Tao1,2
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Information Resources Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
[Objective] This paper reviews research progress on the process-level semantic representation of experimental protocols, aiming to identify the key issues to be addressed and the development trends. [Coverage] We used related topics to retrieve relevant literature from Web of Science, arXiv, Engineering Village, CNKI, Wanfang, and VIP. We also examined the submission requirements and evaluation principles of renowned journals regarding experimental protocols. [Methods] First, we defined the concepts of experimental protocols and their process-level semantic representation. Then, we examined the representation methods, the extraction of representation elements, and the application of representation data. [Results] Research on process-level semantic representation is at an early stage of development. The representation frameworks are not unified, and their elements differ. Experimental protocols are mainly written in natural language, which makes it difficult to extract representation elements automatically. Some studies have explored applications of process-level semantic representation data, but many knowledge gaps remain to be filled. [Limitations] This paper does not thoroughly discuss the technical details of extracting representation elements from the literature or of data application methods. [Conclusions] A unified representation framework with more complete elements should be established by integrating various representation methods. Automatic extraction methods based on advanced intelligent technology, and applications of process-level semantic representation data, should also be explored.
Fu Yun, Liu Xiwen, Zhu Liya, Han Tao. Review of Semantic Representation of Experimental Protocols at Process-Level[J]. Data Analysis and Knowledge Discovery, 2023, 7(8): 1-16.