Data Analysis and Knowledge Discovery, 2022, Vol. 6, Issue 4: 69-81     https://doi.org/10.11925/infotech.2096-3467.2021.0712
Research Paper
Identifying Subtypes of Clinical Trial Diseases with BERT-TextCNN*
Yang Lin1, Huang Xiaoshuo1, Wang Jiayang1, Ding Lingling2,3, Li Zixiao2,3, Li Jiao1
1 Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing 100020, China
2 China National Clinical Research Center for Neurological Diseases, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China
3 Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China
Abstract

[Objective] To support recruitment for clinical trials of complex diseases, this study proposes a BERT-TextCNN-based method for identifying disease subtypes in clinical trials, which helps identify the subject populations with specific subtypes of a complex disease. [Methods] We transformed disease subtype identification in clinical trials into a single-label classification problem and applied a BERT-TextCNN single-label classifier. Using stroke as the example disease, we validated the approach on clinical trial data from ClinicalTrials.gov. [Results] The BERT-TextCNN model combined with the Label Powerset (LP) method performed best, with a weighted macro-average F1 of 0.9053, and can effectively determine which stroke subtypes a given stroke trial may enroll. [Limitations] The feasibility of the method for other single diseases and its validity on external data sets have not yet been studied. [Conclusions] The proposed method can effectively address the problem of accurately identifying complex disease subtypes from inclusion criteria.

Keywords: Clinical Trial; Text Classification; BERT-TextCNN; Stroke; Disease Subtype
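Received: 2021-07-16      Published: 2022-05-12
CLC number: TP391
Funding: *Key Research Project of the Beijing Natural Science Foundation (Z200016)
Corresponding authors: Li Zixiao, ORCID: 0000-0002-4713-5418; Li Jiao, ORCID: 0000-0001-6391-8343     E-mail: lizixiao2008@hotmail.com; li.jiao@imicams.ac.cn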
Cite this article:
Yang Lin, Huang Xiaoshuo, Wang Jiayang, Ding Lingling, Li Zixiao, Li Jiao. Identifying Subtypes of Clinical Trial Diseases with BERT-TextCNN. Data Analysis and Knowledge Discovery, 2022, 6(4): 69-81.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0712      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I4/69
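Fig.1  Workflow for identifying disease subtypes in clinical trials
Fig.2  Annotation example for inclusion criteria
Fig.3  Problem transformation methods for stroke subtype identification in clinical trials
Fig.4  BERT-TextCNN-based stroke subtype identification in clinical trials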
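Fig. 4 names a classifier that places a TextCNN on top of BERT. The sketch below illustrates only the general shape of a TextCNN head, not the authors' implementation: random toy vectors stand in for BERT's token embeddings, and the filter widths, filter counts, dimensions, and all function names are illustrative assumptions.

```python
import math
import random

def conv_relu_maxpool(embeddings, kernel, bias):
    """Slide one filter over the token axis, apply ReLU, then max-over-time pooling.

    embeddings: list of seq_len vectors, each of size dim
    kernel:     list of k vectors of size dim (a filter spanning k tokens)
    """
    k = len(kernel)
    best = 0.0  # ReLU output is never below zero
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        act = bias + sum(w * x
                         for kv, xv in zip(kernel, window)
                         for w, x in zip(kv, xv))
        best = max(best, act)
    return best

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def textcnn_head(embeddings, filters, out_weights, out_bias):
    """filters: list of (kernel, bias); out_weights: num_classes x num_filters."""
    pooled = [conv_relu_maxpool(embeddings, k, b) for k, b in filters]
    logits = [sum(w * p for w, p in zip(row, pooled)) + b
              for row, b in zip(out_weights, out_bias)]
    return softmax(logits)

# Toy stand-in for encoder output: 10 tokens, 8 dimensions (assumed sizes).
rng = random.Random(0)
seq_len, dim, num_classes = 10, 8, 12
embeddings = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]

# Two filters for each window size 2, 3 and 4 (illustrative choice).
filters = [([[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(k)], 0.0)
           for k in (2, 3, 4) for _ in range(2)]
out_weights = [[rng.uniform(-0.5, 0.5) for _ in range(len(filters))]
               for _ in range(num_classes)]
out_bias = [0.0] * num_classes

probs = textcnn_head(embeddings, filters, out_weights, out_bias)
predicted_class = max(range(num_classes), key=probs.__getitem__)
```

With 12 output classes the head corresponds to a single-label setup with one class per subtype combination; a two-class output would correspond to one per-subtype binary classifier.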
Model          Training category            Epochs  Batch size  Learning rate  Padding length
GloVe-TextCNN  GloVe-TextCNN_LabelPowerset  20      24          5e-3           250
GloVe-TextCNN  GloVe-TextCNN_IS             20      24          5e-3           300
GloVe-TextCNN  GloVe-TextCNN_ICH            20      24          5e-3           300
GloVe-TextCNN  GloVe-TextCNN_SAH            20      16          5e-3           250
GloVe-TextCNN  GloVe-TextCNN_TIA            20      16          5e-3           250
BERT           BERT_LabelPowerset           5       16          1e-5           250
BERT           BERT_IS                      5       24          9e-6           150
BERT           BERT_ICH                     5       16          9e-6           150
BERT           BERT_SAH                     5       24          1e-5           250
BERT           BERT_TIA                     5       16          1e-5           250
BERT-TextCNN   BERT-TextCNN_LabelPowerset   5       16          1e-5           250
BERT-TextCNN   BERT-TextCNN_IS              5       16          1e-5           250
BERT-TextCNN   BERT-TextCNN_ICH             5       24          9e-6           250
BERT-TextCNN   BERT-TextCNN_SAH             5       32          1e-5           150
BERT-TextCNN   BERT-TextCNN_TIA             5       16          1e-5           250
Table 1  Model parameter settings
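As a rough illustration of how the settings in Table 1 could be carried into a training script, the sketch below stores one row as a configuration object and applies the padding length to a token sequence. The class and function names, the padding id of 0, and the choice to keep the last partial batch are assumptions; the page does not say whether the padding length is counted in words or in subword tokens.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    name: str
    epochs: int
    batch_size: int
    learning_rate: float
    pad_length: int

# Values copied from the BERT-TextCNN_LabelPowerset row of Table 1.
bert_textcnn_lp = TrainConfig("BERT-TextCNN_LabelPowerset", 5, 16, 1e-5, 250)

def pad_or_truncate(token_ids, length, pad_id=0):
    """Force a token-id sequence to exactly `length` items (pad_id is assumed)."""
    return token_ids[:length] + [pad_id] * max(0, length - len(token_ids))

def optimizer_steps(num_examples, cfg):
    """Number of mini-batch updates, assuming the last partial batch is kept."""
    return math.ceil(num_examples / cfg.batch_size) * cfg.epochs

# 1,920 is the training-set size reported in Table 3.
steps = optimizer_steps(1920, bert_textcnn_lp)
padded = pad_or_truncate([101, 2054, 102], bert_textcnn_lp.pad_length)
```

Under these assumptions the Label Powerset configuration performs 120 updates per epoch, or 600 over its 5 epochs.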
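Trial attribute            Category                 Number of trials
Recruitment status         Completed                1,208
Recruitment status         Recruiting               552
Recruitment status         Unknown Status           393
Recruitment status         Not yet Recruiting       237
Recruitment status         Terminated               188
Recruitment status         Active, not Recruiting   70
Recruitment status         Withdrawn                62
Recruitment status         Suspended                25
Recruitment status         Enrolling by Invitation  20
Intervention               Device                   790
Intervention               Drug                     773
Intervention               Other                    501
Intervention               Behavioral               340
Intervention               Procedure                213
Intervention               Biological               67
Intervention               Diagnostic Test          23
Intervention               Dietary Supplement       20
Intervention               Combination Product      13
Intervention               Genetic                  8
Intervention               Radiation                7
Inclusion criteria length  (0, 50]                  2,122
Inclusion criteria length  (50, 100]                346
Inclusion criteria length  (100, 150]               136
Inclusion criteria length  (150, 200]               68
Inclusion criteria length  (200, 250]               38
Inclusion criteria length  (250, 300]               19
Inclusion criteria length  (350, 400]               9
Inclusion criteria length  (300, 350]               7
Inclusion criteria length  (450, 500]               6
Inclusion criteria length  (500, 600]               3
Inclusion criteria length  (400, 450]               1
Table 2  Distribution of registered stroke clinical trial data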
No.    Eligible stroke subtypes  Training set  Test set  Total
1      IS                        697           310       1,007
2      IS + ICH                  405           162       567
3      IS + ICH + SAH            380           166       546
4      IS + TIA                  165           66        231
5      SAH                       120           53        173
6      ICH                       103           53        156
7      IS + ICH + TIA            15            11        26
8      TIA                       11            8         19
9      ICH + SAH                 12            3         15
10     IS + ICH + SAH + TIA      10            2         12
11     ICH + TIA                 1             0         1
12     IS + SAH                  1             1         2
Total                            1,920         835       2,755
Table 3  Distribution of inclusion criteria by stroke subtype
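Table 3 lists 12 distinct subtype combinations, while Table 1 trains either one LabelPowerset model or one model per subtype (IS, ICH, SAH, TIA). A minimal sketch of the two standard problem transformations follows; it assumes that each trial is annotated with a set of subtype labels, and the function names are illustrative rather than taken from the paper.

```python
SUBTYPES = ("IS", "ICH", "SAH", "TIA")

def label_powerset(label_sets):
    """Map each distinct subtype combination to one class id.

    The multi-label task becomes a single-label, multi-class task.
    """
    classes = {}
    targets = []
    for labels in label_sets:
        key = frozenset(labels)
        targets.append(classes.setdefault(key, len(classes)))
    return targets, classes

def binary_relevance(label_sets, subtypes=SUBTYPES):
    """Build one independent 0/1 target vector per subtype."""
    return {s: [int(s in labels) for labels in label_sets] for s in subtypes}

# Hypothetical annotations for five trials.
trials = [{"IS"}, {"IS", "ICH"}, {"IS", "TIA"}, {"IS", "ICH"}, {"SAH"}]

lp_targets, lp_classes = label_powerset(trials)
br_targets = binary_relevance(trials)
```

Label Powerset keeps co-occurring subtypes together as one class, so a rare combination such as ICH + TIA (a single training trial in Table 3) gets a class with almost no examples. Binary relevance avoids that sparsity but treats each subtype independently.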
Model          Problem transformation  Weighted macro-avg precision  Weighted macro-avg recall  Weighted macro-avg F1  Hamming loss
GloVe-TextCNN  Label Powerset          0.8229                        0.8413                     0.8306                 0.0593
GloVe-TextCNN  Binary Relevance        0.8021                        0.7892                     0.7817                 0.0659
BERT           Label Powerset          0.9051                        0.9050                     0.9037                 0.0299
BERT           Binary Relevance        0.8898                        0.8799                     0.8837                 0.0359
BERT-TextCNN   Label Powerset          0.9057                        0.9062                     0.9053                 0.0311
BERT-TextCNN   Binary Relevance        0.8972                        0.8897                     0.8917                 0.0311
Table 4  Overall performance of stroke subtype identification in clinical trials
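Hamming loss, the last column of Table 4, is conventionally the fraction of individual label decisions that are wrong. The sketch below assumes that definition over the four subtypes; the example annotations and predictions are made up for illustration.

```python
SUBTYPES = ("IS", "ICH", "SAH", "TIA")

def to_indicator(labels):
    """Turn a set of subtype names into a 0/1 vector in SUBTYPES order."""
    return [int(s in labels) for s in SUBTYPES]

def hamming_loss(true_sets, pred_sets):
    """Fraction of (trial, subtype) decisions that disagree with the annotation."""
    wrong = 0
    for truth, pred in zip(true_sets, pred_sets):
        wrong += sum(a != b for a, b in zip(to_indicator(truth), to_indicator(pred)))
    return wrong / (len(true_sets) * len(SUBTYPES))

# Hypothetical gold labels and predictions for four trials.
truth = [{"IS"}, {"IS", "ICH"}, {"SAH"}, {"IS", "TIA"}]
preds = [{"IS"}, {"IS"}, {"SAH"}, {"IS", "ICH"}]

loss = hamming_loss(truth, preds)  # 3 wrong decisions out of 16
```

Unlike exact-match accuracy, this measure gives partial credit: predicting IS for an IS + ICH trial counts as one error out of four decisions rather than a complete miss.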
No.  Eligible stroke subtypes  LP precision  LP recall  LP F1   BR precision  BR recall  BR F1
1    IS                        0.9607        0.9452     0.9528  0.9695        0.9408     0.9549
2    IS+ICH                    0.8382        0.9062     0.8709  0.7821        0.8974     0.8358
3    IS+ICH+SAH                0.8961        0.8313     0.8625  0.8993        0.8272     0.8617
4    IS+TIA                    0.9104        0.9242     0.9173  0.8571        0.8438     0.8504
5    SAH                       0.9286        0.9811     0.9541  0.9615        0.9434     0.9524
6    ICH                       0.9091        0.9615     0.9346  1.0000        0.9423     0.9703
7    IS+ICH+TIA                0.8000        0.7273     0.7619  0.5000        0.6364     0.5600
8    TIA                       0.7500        0.7500     0.7500  0.6667        0.5000     0.5714
9    ICH+SAH                   0.5000        0.3333     0.4000  0.2500        0.3333     0.2857
10   IS+ICH+SAH+TIA            0.0000        0.0000     0.0000  0.0000        0.0000     0.0000
11   ICH+TIA                   0.0000        0.0000     0.0000  0.0000        0.0000     0.0000
12   IS+SAH                    0.0000        0.0000     0.0000  0.2500        1.0000     0.4000
Table 5  Stroke subtype identification performance of the BERT-TextCNN model in clinical trials (LP: Label Powerset method; BR: binary relevance method)
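The per-class scores in Table 5 can be related to the overall figure in Table 4 if the "weighted macro-average" is read as a support-weighted mean of per-class scores. That reading is an assumption, not something the page states. The sketch below combines the LP-method F1 column of Table 5 with the test-set sizes from Table 3; those sizes may differ slightly from the supports actually used in the evaluation, so the result should only be expected to approximate the reported value.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def support_weighted_mean(scores, supports):
    """Average per-class scores, weighting each class by its number of test samples."""
    total = sum(supports)
    return sum(s * n for s, n in zip(scores, supports)) / total

# LP-method F1 per subtype combination (Table 5) and test-set sizes (Table 3),
# both listed in row order 1 to 12.
lp_f1 = [0.9528, 0.8709, 0.8625, 0.9173, 0.9541, 0.9346,
         0.7619, 0.7500, 0.4000, 0.0000, 0.0000, 0.0000]
test_sizes = [310, 162, 166, 66, 53, 53, 11, 8, 3, 2, 0, 1]

weighted_f1 = support_weighted_mean(lp_f1, test_sizes)
print(round(weighted_f1, 4))
```

Under this reading the result comes out at roughly 0.905, in line with the 0.9053 reported for BERT-TextCNN with Label Powerset in Table 4. It also shows why the zero scores on rows 10 to 12 barely affect the overall figure: those combinations account for only 3 of the 835 test trials.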