Identifying Subtypes of Clinical Trial Diseases with BERT-TextCNN
Yang Lin1, Huang Xiaoshuo1, Wang Jiayang1, Ding Lingling2,3, Li Zixiao2,3, Li Jiao1
1 Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100020, China
2 China National Clinical Research Center for Neurological Diseases, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China
3 Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China
[Objective] This study develops a BERT-TextCNN-based method for identifying disease subtypes, which could facilitate cohort selection for clinical trials. [Methods] We transformed disease subtype identification into a single-label classification task and built a BERT-TextCNN model to solve it. We then evaluated the model on stroke clinical trial data from ClinicalTrials.gov. [Results] The BERT-TextCNN model based on the label powerset (LP) method achieved the best weighted macro-average F1 score of 0.9053 and identified the stroke subtypes of clinical trial participants. [Limitations] The model still needs to be evaluated on other diseases and data sets. [Conclusions] The proposed method could be an effective approach to identifying complex disease subtypes.
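The conversion to a single-label task mentioned in the abstract can be illustrated with a minimal sketch. It assumes that "LP" refers to the standard label powerset transformation, in which every distinct combination of subtype labels becomes one class. The function names and the example subtype annotations are hypothetical and are not taken from the paper's data or code.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


def label_powerset_encode(
    label_sets: Iterable[Iterable[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Map each distinct label combination to a single class id."""
    combo_to_class: Dict[FrozenSet[str], int] = {}
    encoded: List[int] = []
    for labels in label_sets:
        combo = frozenset(labels)
        # A new combination gets the next unused class id.
        if combo not in combo_to_class:
            combo_to_class[combo] = len(combo_to_class)
        encoded.append(combo_to_class[combo])
    return encoded, combo_to_class


def label_powerset_decode(
    class_ids: Iterable[int],
    combo_to_class: Dict[FrozenSet[str], int],
) -> List[List[str]]:
    """Turn predicted class ids back into sorted lists of subtype labels."""
    class_to_combo = {cid: combo for combo, cid in combo_to_class.items()}
    return [sorted(class_to_combo[cid]) for cid in class_ids]


# Hypothetical multi-label annotations, one list of subtypes per trial.
trials = [
    ["ischemic stroke"],
    ["ischemic stroke", "intracerebral hemorrhage"],
    ["subarachnoid hemorrhage"],
    ["ischemic stroke"],
]

class_ids, mapping = label_powerset_encode(trials)
decoded = label_powerset_decode(class_ids, mapping)
```

After this step, any single-label classifier (in the paper's case, BERT-TextCNN) can be trained on `class_ids`, and its predictions can be decoded back into subtype sets. One known trade-off of this transformation is that label combinations absent from the training data cannot be predicted.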
Yang Lin, Huang Xiaoshuo, Wang Jiayang, Ding Lingling, Li Zixiao, Li Jiao. Identifying Subtypes of Clinical Trial Diseases with BERT-TextCNN[J]. Data Analysis and Knowledge Discovery, 2022, 6(4): 69-81.