Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (11): 145-152     https://doi.org/10.11925/infotech.2096-3467.2021.0671
Research Paper
Improving PubMedBERT for CID-Entity-Relation Classification Using Text-CNN
Dong Miao 1,4, Su Zhongqi 2, Zhou Xiaobei 3, Lan Xue 4, Cui Zhigang 5, Cui Lei 4 (corresponding author)
1 Financial Section, China Medical University, Shenyang 110122, China
2 China Medical University Library, Shenyang 110122, China
3 Institute of Health Sciences, China Medical University, Shenyang 110122, China
4 School of Health Management, China Medical University, Shenyang 110122, China
5 Nursing School, China Medical University, Shenyang 110122, China
Abstract

[Objective] This paper aims to improve the performance of PubMedBERT on chemical-induced disease (CID) entity relation classification. [Methods] We propose an entity relation classification method that combines PubMedBERT with Text-CNN. The entity pair and the text are combined into a sentence pair and used as input. The PubMedBERT pre-trained model encodes the CID-related text to obtain global features, and Text-CNN then captures important local information in the text to decide whether the entity pair has a CID relation. [Results] On the BioCreative V CDR dataset, the method reached a precision, recall and F1 value of 78.3%, 73.5% and 75.8% respectively, at least 3.1, 1.5 and 3.3 percentage points higher than the other methods. [Limitations] Only CID text corpora were considered, and performance on clinical and other corpora remains to be tested. [Conclusions] The method captures the features of CID texts and improves entity relation classification.
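
The Text-CNN step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the token vectors stand in for PubMedBERT outputs, and the filter widths, weights and the sigmoid output layer are toy assumptions chosen only to show convolution over token windows followed by max-over-time pooling.

```python
import math


def conv_relu_maxpool(token_vecs, kernel, bias):
    """Slide one filter over the token sequence, apply ReLU, keep the maximum."""
    width = len(kernel)
    best = 0.0  # ReLU output is never below 0
    for start in range(len(token_vecs) - width + 1):
        window = token_vecs[start:start + width]
        activation = bias + sum(
            w * x
            for kernel_row, token_row in zip(kernel, window)
            for w, x in zip(kernel_row, token_row)
        )
        best = max(best, activation)
    return best


def text_cnn_score(token_vecs, filters, out_weights, out_bias):
    """Pool one feature per filter, then map the features to a probability."""
    feats = [conv_relu_maxpool(token_vecs, k, b) for k, b in filters]
    logit = out_bias + sum(w * f for w, f in zip(out_weights, feats))
    return 1.0 / (1.0 + math.exp(-logit))


# Toy stand-in for encoder output: 4 tokens, 2 dimensions each (assumed values).
token_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Two hypothetical filters: (kernel of shape width x dim, bias).
filters = [
    ([[1.0, 0.0], [0.0, 1.0]], 0.0),               # width 2
    ([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], -5.0),  # width 3
]
score = text_cnn_score(token_vecs, filters, out_weights=[1.0, 1.0], out_bias=-2.0)
print(score)
```

In a real model the filters and output layer would be learned jointly with the encoder, and several filter widths would each contribute many feature maps.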

Keywords: CID Entity Relation Classification; PubMedBERT; Text-CNN; Sentence Pair
Received: 2021-07-06      Published: 2021-12-23
Chinese Library Classification (ZTFLH):  TP391
Corresponding author: Cui Lei, ORCID: 0000-0001-9479-8225     E-mail: lcui@cmu.edu.cn
Cite this article:
Dong Miao, Su Zhongqi, Zhou Xiaobei, Lan Xue, Cui Zhigang, Cui Lei. Improving PubMedBERT for CID-Entity-Relation Classification Using Text-CNN[J]. Data Analysis and Knowledge Discovery, 2021, 5(11): 145-152.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0671      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I11/145
Fig.1  Model architecture
Fig.2  The BC5CDR5 corpus (PMID: 354896)
Dataset      Documents  Chemical mentions  Chemical IDs  Disease mentions  Disease IDs  CID relations
Training     500        5203               1467          4182              1965         1038
Development  500        5347               1507          4244              1865         1012
Test         500        5385               1435          4424              1988         1066
Table 1  Summary of the BC5CDR5 corpus
ID           e1         e2         sentence
3693336_1    triazolo   manic      triazolam-induced brief episodes of secondary mania in a depressed patient. large doses of triazolam repeatedly induced brief episodes of mania in a depressed elderly woman. features of organic mental disorder (delirium) were not present. manic excitement was coincident with the duration of action of triazolam. the possible contribution of the triazolo group to changes in affective status is discussed
Table 2  A positive sample from BC5CDR5
ID           e1         e2         sentence
3693336_2    triazolo   depressed  triazolam-induced brief episodes of secondary mania in a depressed patient. large doses of triazolam repeatedly induced brief episodes of mania in a depressed elderly woman. features of organic mental disorder (delirium) were not present. manic excitement was coincident with the duration of action of triazolam. the possible contribution of the triazolo group to changes in affective status is discussed
Table 3  A negative sample from BC5CDR5
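
The two samples above share the same text and differ only in the entity pair. A minimal sketch of turning one such row into a sentence-pair input follows. The exact layout is an assumption: the source only states that the entity pair and the text form a sentence pair, so joining the two mentions with a space, the `[CLS]`/`[SEP]` markers, and the helper name `build_sentence_pair` are all illustrative. The text is shortened to its first sentence for readability.

```python
def build_sentence_pair(e1, e2, text, cls="[CLS]", sep="[SEP]"):
    """Join the entity pair (segment A) and the text (segment B) BERT-style.

    Assumed layout: "[CLS] e1 e2 [SEP] text [SEP]".
    """
    entity_segment = f"{e1} {e2}"
    return f"{cls} {entity_segment} {sep} {text} {sep}"


# First sentence of the abstract shown in the samples above (shortened here).
text = "triazolam-induced brief episodes of secondary mania in a depressed patient."

# (ID, e1, e2, label): 1 marks the positive sample, 0 the negative sample.
samples = [
    ("3693336_1", "triazolo", "manic", 1),
    ("3693336_2", "triazolo", "depressed", 0),
]

for sample_id, e1, e2, label in samples:
    print(sample_id, label, build_sentence_pair(e1, e2, text))
```

Because the text segment is identical across samples from one abstract, the entity segment is what lets the classifier tell the candidate pairs apart.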
Method                                   Precision  Recall  F1
Best Approach of BioCreative V CDR[24]   55.6%      58.4%   57.0%
LSTM-based[20]                           64.9%      49.3%   56.0%
CNN-based[15]                            60.9%      59.5%   60.2%
BERT Original                            70.1%      67.7%   65.6%
BERT+Text-CNN                            71.2%      68.3%   69.7%
ClinicalBERT                             70.5%      69.3%   69.8%
ClinicalBERT+Text-CNN                    70.9%      70.0%   70.4%
BioBERT                                  72.0%      70.3%   71.1%
BioBERT+Text-CNN                         73.1%      72.0%   72.5%
PubMedBERT                               75.2%      69.1%   72.0%
PubMedBERT+Text-CNN                      78.3%      73.5%   75.8%
Table 4  Comparison of model results on the BC5CDR5 corpus
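
As a worked check on how the columns relate, the sketch below recomputes F1 as the harmonic mean of precision and recall for three rows of the table, and derives the margin of PubMedBERT+Text-CNN over the best competing value in each column. It assumes the reported F1 is the standard balanced F1; the function and variable names are illustrative.

```python
def f1_score(precision, recall):
    """Balanced F1: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from the table above.
rows = {
    "PubMedBERT": (75.2, 69.1, 72.0),
    "BioBERT+Text-CNN": (73.1, 72.0, 72.5),
    "PubMedBERT+Text-CNN": (78.3, 73.5, 75.8),
}

for name, (p, r, reported) in rows.items():
    print(name, round(f1_score(p, r), 1), reported)

# Margin of the proposed model over the best other value per column,
# in percentage points. These two rows hold the best competing values.
proposed = rows["PubMedBERT+Text-CNN"]
others = [v for k, v in rows.items() if k != "PubMedBERT+Text-CNN"]
margins = tuple(
    round(proposed[i] - max(o[i] for o in others), 1) for i in range(3)
)
print(margins)
```

The margins printed by the last line correspond to the minimum improvements quoted in the abstract.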
Method                      Precision  Recall  F1
PubMed Embedding+Text-CNN   62.7%      56.3%   59.3%
GloVe Embedding+Text-CNN    60.4%      54.6%   57.4%
PubMedBERT+Text-CNN         78.3%      73.5%   75.8%
Table 5  Comparison of the pre-trained model with word embedding models
[1] Dogan R I, Murray G C, Névéol A, et al. Understanding PubMed® User Search Behavior Through Log Analysis[J/OL]. Database, 2009. https://doi.org/10.1093/database/bap018.
[2] Lu Z Y. PubMed and Beyond: A Survey of Web Tools for Searching Biomedical Literature[J/OL]. Database, 2011. https://doi.org/10.1093/database/baq036.
[3] Kang N, Singh B, Bui C, et al. Knowledge-based Extraction of Adverse Drug Events from Biomedical Text[J]. BMC Bioinformatics, 2014, 15(1): Article No. 64.
[4] Davis A P, Grondin C J, Johnson R J, et al. The Comparative Toxicogenomics Database: Update 2017[J]. Nucleic Acids Research, 2017, 45:D972-D978. doi: 10.1093/nar/gkw838
[5] PharmGKB[EB/OL]. [2021-07-16]. https://www.pharmgkb.org/.
[6] Zhou D Y, Zhong D Y, He Y L. Biomedical Relation Extraction: From Binary to Complex[J]. Computational and Mathematical Methods in Medicine, 2014: Article ID 298473.
[7] Kim Y. Convolutional Neural Networks for Sentence Classification[C]// Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2014: 1746-1751.
[8] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, 1:4171-4186.
[9] Gu Y, Tinn R, Cheng H, et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing[OL]. arXiv Preprint, arXiv: 2007.15779.
[10] Abacha A B, Zweigenbaum P. Automatic Extraction of Semantic Relations Between Medical Entities: Application to the Treatment Relation[C]// Proceedings of the 4th International Symposium for Semantic Mining in Biomedicine, Cambridge, United Kingdom. 2010.
[11] Li H, Tang B, Chen Q, et al. HITSZ_CDR: An End-to-End Chemical and Disease Relation Extraction System for BioCreative V[J/OL]. Database, 2016. https://doi.org/10.1093/database/baw077.
[12] Peng Y F, Wei C H, Lu Z Y. Improving Chemical Disease Relation Extraction with Rich Features and Weakly Labeled Data[J]. Journal of Cheminformatics, 2016, 8: Article No.53.
[13] Giles C B, Wren J D. Large-scale Directional Relationship Extraction and Resolution[J]. BMC Bioinformatics, 2008, 9: Article No.S11.
[14] Alam F, Corazza A, Lavelli A, et al. A Knowledge-poor Approach to Chemical-Disease Relation Extraction[J/OL]. Database, 2016. https://doi.org/10.1093/database/baw071.
[15] Gu J H, Sun F Q, Qian L H, et al. Chemical-induced Disease Relation Extraction via Convolutional Neural Network[J/OL]. Database, 2017. https://doi.org/10.1093/database/bax024.
[16] Zhou H W, Lang C K, Liu Z, et al. Knowledge-guided Convolutional Networks for Chemical-Disease Relation Extraction[J]. BMC Bioinformatics, 2019, 20: Article No.260.
[17] Gu J H, Sun F Q, Qian L H, et al. Chemical-induced Disease Relation Extraction via Attention-based Distant Supervision[J]. BMC Bioinformatics, 2019, 20: Article No.403.
[18] Li Z H, Yang Z H, Xiang Y, et al. Exploiting Sequence Labeling Framework to Extract Document-level Relations from Biomedical Texts[J]. BMC Bioinformatics, 2020, 21. doi: 10.1186/s12859-020-3457-2
[19] Mitra S, Saha S, Hasanuzzaman M. A Multi-view Deep Neural Network Model for Chemical-Disease Relation Extraction from Imbalanced Datasets[J]. IEEE Journal of Biomedical and Health Informatics, 2020, 24(11):3315-3325. doi: 10.1109/JBHI.6221020
[20] Zhou H W, Deng H J, Chen L, et al. Exploiting Syntactic and Semantics Information for Chemical-Disease Relation Extraction[J/OL]. Database, 2016. https://doi.org/10.1093/database/baw048.
[21] Lee J, Yoon W, Kim S, et al. BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining[J]. Bioinformatics, 2020, 36(4):1234-1240.
[22] Alsentzer E, Murphy J R, Boag W, et al. Publicly Available Clinical BERT Embeddings[OL]. arXiv Preprint, arXiv: 1904.03323.
[23] Li J, Sun Y P, Johnson R J, et al. BioCreative V CDR Task Corpus: A Resource for Chemical Disease Relation Extraction[J/OL]. Database, 2016. https://doi.org/10.1093/database/baw068.
[24] Bowman S R, Gauthier J, Rastogi A, et al. A Fast Unified Model for Parsing and Sentence Understanding[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. 2016, 3:1466-1477.
[25] Liao Kaiji, Huang Qiongying, Xi Yunjiang. Research on the Construction of Knowledge Graph of Q&A Text in Online Medical Community[J]. Information Science, 2021, 39(3):51-59, 75. (in Chinese)
[26] Huang Mengxing, Li Menglong, Han Huirui. Research on Entity Recognition and Knowledge Graph Construction Based on Electronic Medical Record[J]. Application Research of Computers, 2019, 36(12):3735-3739. (in Chinese)
[27] Li Dongqi, Li Mingxin, Zhang Xiao. Research on Open Domain Question Answering Based on Knowledge Base[J]. Computer Knowledge and Technology, 2020, 16(36):179-181. (in Chinese)
[28] Gao Man, Cui Lei. Steps and Tools for Drug Repositioning Using Text Mining[J]. Chinese Journal of Medical Library and Information Science, 2017, 26(3):6-9. (in Chinese)
[29] Sui Mingshuang, Cui Lei. Using Text Mining to Find the Side Effects of Drugs[J]. Chinese Journal of Medical Library and Information Science, 2015, 24(11):67-72. (in Chinese)
[30] Wang Xiuyan, Cui Lei. A Hybrid Method to Extract Semantic Relation of Biomedical Entities[J]. New Technology of Library and Information Service, 2013(3):77-82. (in Chinese)
[31] Wang Kejian, Shi Leming, He Lin, et al. New Opportunities for Drug Research and Development in China: Systematic Drug Repositioning Based on Big Data of Medicine[J]. Science Bulletin, 2014, 59(18):1790-1796. (in Chinese)