Improving PubMedBERT for CID-Entity-Relation Classification Using Text-CNN

doi:10.11925/infotech.2096-3467.2021.0671

Data Analysis and Knowledge Discovery

2021, Vol. 5, Issue 11: 145-152

Research Paper


Dong Miao¹,⁴, Su Zhongqi², Zhou Xiaobei³, Lan Xue⁴, Cui Zhigang⁵, Cui Lei⁴

¹Financial Section, China Medical University, Shenyang 110122, China
²China Medical University Library, Shenyang 110122, China
³Institute of Health Sciences, China Medical University, Shenyang 110122, China
⁴School of Health Management, China Medical University, Shenyang 110122, China
⁵Nursing School, China Medical University, Shenyang 110122, China


Abstract

[Objective] This paper aims to improve the performance of PubMedBERT on chemical-induced disease (CID) entity relation classification. [Methods] We propose an entity relation classification method that combines PubMedBERT with Text-CNN. The entity pair and the text are combined into a sentence pair and used as input. The pre-trained PubMedBERT model encodes the CID-related text to obtain global features, and Text-CNN then captures important local information in the text to decide whether the entity pair has a CID relation. [Results] On the BioCreative V CDR dataset, the method reached a precision, recall and F1 value of 78.3%, 73.5% and 75.8% respectively, which are at least 3.1%, 1.5% and 3.3% higher than the other methods compared. [Limitations] Only CID text corpora were examined; performance on clinical and other corpora remains to be tested. [Conclusions] The method captures the features of CID texts and improves entity relation classification.

Keywords: CID Entity Relation Classification, PubMedBERT, Text-CNN, Sentence Pair
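The pipeline described in the abstract can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the `[CLS]`/`[SEP]` layout follows the usual BERT sentence-pair convention and is an assumption here, and the function names, the example sentence, the toy vectors and the filter weights are all hypothetical. In the paper the token vectors would come from PubMedBERT; here they are hand-written so that the Text-CNN step (convolution over windows of tokens followed by max-over-time pooling) can be traced by hand.

```python
from typing import List

Vector = List[float]


def build_sentence_pair(chemical: str, disease: str, text: str) -> List[str]:
    """Assumed layout: [CLS] chemical disease [SEP] text tokens [SEP]."""
    return ["[CLS]", chemical, disease, "[SEP]", *text.split(), "[SEP]"]


def conv_max_pool(hidden: List[Vector], kernel: List[Vector]) -> float:
    """Slide one filter over consecutive token vectors, apply ReLU,
    and keep the strongest response (max-over-time pooling).
    Assumes the sequence is at least as long as the filter width."""
    width = len(kernel)
    responses = []
    for start in range(len(hidden) - width + 1):
        window = hidden[start:start + width]
        score = sum(
            w * h
            for k_row, h_row in zip(kernel, window)
            for w, h in zip(k_row, h_row)
        )
        responses.append(max(0.0, score))  # ReLU
    return max(responses)


def text_cnn_features(hidden: List[Vector], kernels: List[List[Vector]]) -> Vector:
    """One pooled feature per filter; filters may have different widths."""
    return [conv_max_pool(hidden, kernel) for kernel in kernels]


# Made-up example pair and sentence, for illustration only.
tokens = build_sentence_pair("lithium", "tremor", "Lithium carbonate may cause tremor")

# Toy 2-dimensional hidden states, one per token (stand-in for encoder output).
hidden = [[0.1 * i, 1.0 if i % 2 else -1.0] for i in range(len(tokens))]

kernels = [
    [[1.0, 0.0], [1.0, 0.0]],               # width-2 filter on dimension 0
    [[0.0, 1.0], [0.0, -1.0], [0.0, 1.0]],  # width-3 filter on dimension 1
]

features = text_cnn_features(hidden, kernels)
print(len(tokens), features)
```

In a complete model, the pooled features from filters of several widths would typically be concatenated and passed to a classification layer that outputs whether the entity pair has a CID relation.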

Received: 2021-07-06 Published: 2021-12-23

CLC number: TP391

Corresponding author: Cui Lei, ORCID: 0000-0001-9479-8225, E-mail: lcui@cmu.edu.cn

Cite this article:

Dong Miao, Su Zhongqi, Zhou Xiaobei, Lan Xue, Cui Zhigang, Cui Lei. Improving PubMedBERT for CID-Entity-Relation Classification Using Text-CNN. Data Analysis and Knowledge Discovery, 2021, 5(11): 145-152.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0671 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I11/145

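Fig.1 Model architecture

Fig.2 BC5CDR5 corpus (PMID: 354896)

Table 1 Summary of the BC5CDR5 corpus

Table 2 Positive samples from BC5CDR5

Table 3 Negative samples from BC5CDR5

Table 4 Comparison of model results on the BC5CDR5 corpus

Table 5 Comparison of pre-trained models and word embedding models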
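As a quick consistency check on the figures reported in the abstract, F1 is the harmonic mean of precision and recall. The helper name below is illustrative only; the inputs are the reported 78.3% precision and 73.5% recall.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_f1 = round(f1_score(78.3, 73.5), 1)
print(reported_f1)  # 75.8, which agrees with the reported F1 value
```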

