Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (4): 69-81    DOI: 10.11925/infotech.2096-3467.2021.0712
Identifying Subtypes of Clinical Trial Diseases with BERT-TextCNN
Yang Lin1, Huang Xiaoshuo1, Wang Jiayang1, Ding Lingling2,3, Li Zixiao2,3, Li Jiao1
1Institute of Medical Information/Medical Library, Chinese Academy of Medical Science & Peking Union Medical College, Beijing 100020, China
2China National Clinical Research Center for Neurological Diseases, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China
3Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China
Abstract  

[Objective] This study develops a BERT-TextCNN-based method for identifying disease subtypes, which could facilitate cohort selection for clinical trials. [Methods] We recast disease subtype identification as a single-label classification task and solved it with BERT-TextCNN. We then evaluated the model on stroke clinical trial data from ClinicalTrials.gov. [Results] BERT-TextCNN combined with the Label Powerset (LP) method achieved the best weighted macro-average F1 value, 0.9053, and identified the stroke subtypes of clinical trial participants. [Limitations] The model still needs to be evaluated on other diseases and data sets. [Conclusions] The proposed method could be an effective approach to identifying complex disease subtypes.

Key words: Clinical Trial; Text Classification; BERT-TextCNN; Stroke; Disease Subtype
Received: 16 July 2021      Published: 12 May 2022
ZTFLH:  TP391  
Fund: Natural Science Foundation of Beijing, China (Z200016)
Corresponding Authors: Li Zixiao, ORCID: 0000-0002-4713-5418; Li Jiao, ORCID: 0000-0001-6391-8343. E-mail: lizixiao2008@hotmail.com; li.jiao@imicams.ac.cn

Cite this article:

Yang Lin, Huang Xiaoshuo, Wang Jiayang, Ding Lingling, Li Zixiao, Li Jiao. Identifying Subtypes of Clinical Trial Diseases with BERT-TextCNN. Data Analysis and Knowledge Discovery, 2022, 6(4): 69-81.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0712     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I4/69

Workflow of Clinical Trial Disease Subtype Identification
Example Annotations of Inclusion Criteria
Problem Transformation Methods for Clinical Trial Stroke Subtype Identification
Clinical Trial Stroke Subtype Identification Based on BERT-TextCNN
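The results tables on this page compare two problem transformation methods, Label Powerset and Binary Relevance. The following is a minimal sketch of how each method might turn multi-label subtype annotations into training targets. The function names and the toy label sets are illustrative assumptions, not the authors' code or data.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

# The four stroke subtypes that appear in the tables on this page.
SUBTYPES = ("IS", "ICH", "SAH", "TIA")


def label_powerset(
    label_sets: Iterable[Set[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Map every distinct label set to one class id (single-label targets)."""
    class_ids: Dict[FrozenSet[str], int] = {}
    targets: List[int] = []
    for labels in label_sets:
        key = frozenset(labels)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # ids are assigned in order of first appearance
        targets.append(class_ids[key])
    return targets, class_ids


def binary_relevance(
    label_sets: List[Set[str]], subtypes: Tuple[str, ...] = SUBTYPES
) -> Dict[str, List[int]]:
    """Build one 0/1 target list per subtype, for independent binary classifiers."""
    return {s: [1 if s in labels else 0 for labels in label_sets] for s in subtypes}


# Hypothetical annotations: the subtypes each inclusion criterion admits.
toy = [{"IS"}, {"IS", "ICH"}, {"SAH"}, {"ICH", "IS"}]

lp_targets, lp_classes = label_powerset(toy)
br_targets = binary_relevance(toy)

print(lp_targets)        # one class id per criterion
print(br_targets["IS"])  # 0/1 target for the IS classifier
```

Under Label Powerset a single classifier predicts the whole combination at once, while Binary Relevance trains one classifier per subtype and assembles the combination from their independent outputs.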
Model          Training class               Epochs  Batch size  Learning rate  Padding length
GloVe-TextCNN  GloVe-TextCNN_LabelPowerset  20      24          5e-3           250
GloVe-TextCNN  GloVe-TextCNN_IS             20      24          5e-3           300
GloVe-TextCNN  GloVe-TextCNN_ICH            20      24          5e-3           300
GloVe-TextCNN  GloVe-TextCNN_SAH            20      16          5e-3           250
GloVe-TextCNN  GloVe-TextCNN_TIA            20      16          5e-3           250
BERT           BERT_LabelPowerset           5       16          1e-5           250
BERT           BERT_IS                      5       24          9e-6           150
BERT           BERT_ICH                     5       16          9e-6           150
BERT           BERT_SAH                     5       24          1e-5           250
BERT           BERT_TIA                     5       16          1e-5           250
BERT-TextCNN   BERT-TextCNN_LabelPowerset   5       16          1e-5           250
BERT-TextCNN   BERT-TextCNN_IS              5       16          1e-5           250
BERT-TextCNN   BERT-TextCNN_ICH             5       24          9e-6           250
BERT-TextCNN   BERT-TextCNN_SAH             5       32          1e-5           150
BERT-TextCNN   BERT-TextCNN_TIA             5       16          1e-5           250
Model Parameters
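Clinical trial attribute   Type                      Number of trials
Recruitment status         Completed                 1,208
Recruitment status         Recruiting                552
Recruitment status         Unknown Status            393
Recruitment status         Not yet Recruiting        237
Recruitment status         Terminated                188
Recruitment status         Active, not Recruiting    70
Recruitment status         Withdrawn                 62
Recruitment status         Suspended                 25
Recruitment status         Enrolling by Invitation   20
Intervention               Device                    790
Intervention               Drug                      773
Intervention               Other                     501
Intervention               Behavioral                340
Intervention               Procedure                 213
Intervention               Biological                67
Intervention               Diagnostic Test           23
Intervention               Dietary Supplement        20
Intervention               Combination Product       13
Intervention               Genetic                   8
Intervention               Radiation                 7
Inclusion criteria length  (0, 50]                   2,122
Inclusion criteria length  (50, 100]                 346
Inclusion criteria length  (100, 150]                136
Inclusion criteria length  (150, 200]                68
Inclusion criteria length  (200, 250]                38
Inclusion criteria length  (250, 300]                19
Inclusion criteria length  (350, 400]                9
Inclusion criteria length  (300, 350]                7
Inclusion criteria length  (450, 500]                6
Inclusion criteria length  (500, 600]                3
Inclusion criteria length  (400, 450]                1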
Distribution of Stroke Clinical Trials
No.    Eligible stroke subtypes  Training set  Test set  Total
1      IS                        697           310       1,007
2      IS + ICH                  405           162       567
3      IS + ICH + SAH            380           166       546
4      IS + TIA                  165           66        231
5      SAH                       120           53        173
6      ICH                       103           53        156
7      IS + ICH + TIA            15            11        26
8      TIA                       11            8         19
9      ICH + SAH                 12            3         15
10     IS + ICH + SAH + TIA      10            2         12
11     ICH + TIA                 1             0         1
12     IS + SAH                  1             1         2
Total                            1,920         835       2,755
Distribution of Stroke Subtype Inclusion Criteria
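The label-set counts in the table above can be regrouped per subtype, which is roughly the view a Binary Relevance classifier has of the same data. The sketch below does only that arithmetic, using the Total column. The dictionary keys are normalized label-set names and the helper name is an illustrative assumption.

```python
from collections import Counter
from typing import Dict

# Total column of the distribution table, keyed by label set.
LABELSET_TOTALS: Dict[str, int] = {
    "IS": 1007,
    "IS+ICH": 567,
    "IS+ICH+SAH": 546,
    "IS+TIA": 231,
    "SAH": 173,
    "ICH": 156,
    "IS+ICH+TIA": 26,
    "TIA": 19,
    "ICH+SAH": 15,
    "IS+ICH+SAH+TIA": 12,
    "ICH+TIA": 1,
    "IS+SAH": 2,
}


def per_subtype_counts(labelset_totals: Dict[str, int]) -> Counter:
    """Count how many inclusion criteria admit each individual subtype."""
    counts: Counter = Counter()
    for labelset, n in labelset_totals.items():
        for subtype in labelset.split("+"):
            counts[subtype] += n
    return counts


counts = per_subtype_counts(LABELSET_TOTALS)
total_criteria = sum(LABELSET_TOTALS.values())

for subtype in ("IS", "ICH", "SAH", "TIA"):
    share = counts[subtype] / total_criteria
    print(f"{subtype}: {counts[subtype]} criteria ({share:.1%} of {total_criteria})")
```

The per-subtype counts sum to more than the number of criteria because one criterion can admit several subtypes.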
Model          Problem transformation method  Weighted macro-average precision  Weighted macro-average recall  Weighted macro-average F1  Hamming loss
GloVe-TextCNN  Label Powerset                 0.8229                            0.8413                         0.8306                     0.0593
GloVe-TextCNN  Binary Relevance               0.8021                            0.7892                         0.7817                     0.0659
BERT           Label Powerset                 0.9051                            0.9050                         0.9037                     0.0299
BERT           Binary Relevance               0.8898                            0.8799                         0.8837                     0.0359
BERT-TextCNN   Label Powerset                 0.9057                            0.9062                         0.9053                     0.0311
BERT-TextCNN   Binary Relevance               0.8972                            0.8897                         0.8917                     0.0311
Overall Performance of Clinical Trial Stroke Subtype Identification
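For readers unfamiliar with the two summary metrics in the table above, here is a minimal sketch of how they are commonly computed. It assumes that "weighted macro-average" means the support-weighted mean of per-class scores (the convention of scikit-learn's `average="weighted"`), which this page does not confirm. The predictions are toy values, not the paper's data.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple


def per_class_prf(
    y_true: Sequence[str], y_pred: Sequence[str]
) -> Dict[str, Tuple[float, float, float]]:
    """Precision, recall and F1 for every class seen in y_true or y_pred."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[c] = (precision, recall, f1)
    return scores


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of per-class F1, weighted by each class's share of y_true."""
    support = Counter(y_true)
    scores = per_class_prf(y_true, y_pred)
    return sum(support[c] / len(y_true) * scores[c][2] for c in support)


def hamming_loss(
    true_sets: List[Set[str]], pred_sets: List[Set[str]], labels: Sequence[str]
) -> float:
    """Fraction of (sample, label) positions where truth and prediction disagree."""
    mismatches = sum(
        (label in t) != (label in p)
        for t, p in zip(true_sets, pred_sets)
        for label in labels
    )
    return mismatches / (len(true_sets) * len(labels))


# Toy label-set classes, written the way the tables write them.
y_true = ["IS", "IS", "IS+ICH", "SAH"]
y_pred = ["IS", "IS+ICH", "IS+ICH", "SAH"]

wf1 = weighted_f1(y_true, y_pred)
hl = hamming_loss(
    [set(y.split("+")) for y in y_true],
    [set(y.split("+")) for y in y_pred],
    labels=("IS", "ICH", "SAH", "TIA"),
)
print(round(wf1, 4), hl)
```

The weighted F1 treats each label combination as one class, so a partially correct combination counts as a miss, whereas Hamming loss gives credit for every individual subtype that is right. This is one reason the two columns need not rank the models identically.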
LP: Label Powerset method; BR: binary relevance method.
No.  Eligible stroke subtypes  LP precision  LP recall  LP F1   BR precision  BR recall  BR F1
1    IS                        0.9607        0.9452     0.9528  0.9695        0.9408     0.9549
2    IS+ICH                    0.8382        0.9062     0.8709  0.7821        0.8974     0.8358
3    IS+ICH+SAH                0.8961        0.8313     0.8625  0.8993        0.8272     0.8617
4    IS+TIA                    0.9104        0.9242     0.9173  0.8571        0.8438     0.8504
5    SAH                       0.9286        0.9811     0.9541  0.9615        0.9434     0.9524
6    ICH                       0.9091        0.9615     0.9346  1.0000        0.9423     0.9703
7    IS+ICH+TIA                0.8000        0.7273     0.7619  0.5000        0.6364     0.5600
8    TIA                       0.7500        0.7500     0.7500  0.6667        0.5000     0.5714
9    ICH+SAH                   0.5000        0.3333     0.4000  0.2500        0.3333     0.2857
10   IS+ICH+SAH+TIA            0.0000        0.0000     0.0000  0.0000        0.0000     0.0000
11   ICH+TIA                   0.0000        0.0000     0.0000  0.0000        0.0000     0.0000
12   IS+SAH                    0.0000        0.0000     0.0000  0.2500        1.0000     0.4000
Performance of Clinical Trial Stroke Subtype Identification Based on BERT-TextCNN