Identifying Subtypes of Clinical Trial Diseases with BERT-TextCNN

Yang Lin¹, Huang Xiaoshuo¹, Wang Jiayang¹, Ding Lingling²,³, Li Zixiao²,³, Li Jiao¹

¹ Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100020, China
² China National Clinical Research Center for Neurological Diseases, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China
³ Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China

Abstract [Objective] This study develops a BERT-TextCNN-based method for identifying disease subtypes, with the aim of facilitating cohort selection for clinical trials. [Methods] We cast disease subtype identification as a single-label classification task and built a BERT-TextCNN classifier for it. We then evaluated the model on stroke clinical trial data from ClinicalTrials.gov. [Results] The BERT-TextCNN model based on the LP method achieved the best weighted macro-average F1 value, 0.9053, and identified the stroke subtypes of clinical trial participants. [Limitations] The model still needs to be evaluated on other diseases and data sets. [Conclusions] The proposed method could be an effective approach to identifying complex disease subtypes.
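
The abstract does not expand "LP" or give implementation details. As a minimal sketch, assuming LP refers to the label powerset transformation (which maps each distinct combination of labels to one class, so a multi-label problem becomes a single-label one), the encoding step and a support-weighted F1 score could look like the following. The function names and the example subtype labels are hypothetical, and the BERT-TextCNN classifier itself is not shown.

```python
from collections import Counter


def label_powerset_encode(label_sets):
    """Map each distinct label combination to a single class id."""
    combo_to_class = {}
    class_ids = []
    for labels in label_sets:
        combo = frozenset(labels)
        if combo not in combo_to_class:
            combo_to_class[combo] = len(combo_to_class)
        class_ids.append(combo_to_class[combo])
    return class_ids, combo_to_class


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighting each class by its support in y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Hypothetical subtype annotations for four trials (illustrative only).
trials = [
    {"ischemic stroke"},
    {"ischemic stroke", "transient ischemic attack"},
    {"intracerebral hemorrhage"},
    {"ischemic stroke"},
]
class_ids, mapping = label_powerset_encode(trials)
score = weighted_f1(class_ids, [0, 0, 2, 0])
```

Under this reading, any single-label classifier can be trained on `class_ids`, and a predicted class id is mapped back to its label combination through `mapping`. The trade-off is that only label combinations seen during training can ever be predicted.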

Received: 16 July 2021
Published: 12 May 2022

Fund: Natural Science Foundation of Beijing, China (Z200016)
Corresponding Authors: Li Zixiao, ORCID: 0000-0002-4713-5418; Li Jiao, ORCID: 0000-0001-6391-8343
E-mail: lizixiao2008@hotmail.com; li.jiao@imicams.ac.cn