Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 242-250     https://doi.org/10.11925/infotech.2096-3467.2021.0951
Identifying Named Entities of Chinese Electronic Medical Records Based on RoBERTa-wwm Dynamic Fusion Model*
Zhang Yunqiu, Wang Yang, Li Bocheng
School of Public Health, Jilin University, Changchun 130021, China
Abstract

[Objective] This paper proposes an entity recognition model based on dynamic fusion of RoBERTa-wwm layers, aiming to improve named entity recognition in Chinese electronic medical records. [Methods] The semantic representations generated by each Transformer layer of the pre-trained language model RoBERTa-wwm are dynamically fused, and the fused representation is fed into a bidirectional long short-term memory (BiLSTM) network and a conditional random field (CRF) module to recognize the entities in the records. [Results] On the dataset of the 2017 China Conference on Knowledge Graph and Semantic Computing (CCKS 2017) and on a self-annotated electronic medical record dataset, the model reached F1 values of 94.08% and 90.08%, which are 0.23 and 0.39 percentage points higher than the RoBERTa-wwm-BiLSTM-CRF model. [Limitations] The RoBERTa-wwm model used in this paper was pre-trained on a non-medical corpus. [Conclusions] Dynamic fusion of the semantic layers makes better use of the different information held in each encoding layer and improves the downstream entity recognition task.

Key words: Electronic Medical Record; Named Entity Recognition; RoBERTa-wwm; Dynamic Fusion
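
The fusion step described in the abstract can be illustrated with a small sketch. The abstract does not give the fusion formula, so the code below assumes one common form: a softmax-normalized weighted sum over layer outputs, with one learnable scalar score per layer. The names `softmax`, `fuse_layers` and `layer_scores` are hypothetical, and the toy vectors stand in for real RoBERTa-wwm hidden states.

```python
import math

def softmax(scores):
    """Turn arbitrary per-layer scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_outputs, layer_scores):
    """Weighted sum of per-layer token representations.

    layer_outputs: list over layers of [seq_len][hidden] float vectors.
    layer_scores:  one scalar per layer (learned during training in a real
                   model; fixed here). ASSUMPTION: softmax weighting, which
                   may differ from the paper's exact formulation.
    """
    weights = softmax(layer_scores)
    seq_len = len(layer_outputs[0])
    hidden = len(layer_outputs[0][0])
    fused = [[0.0] * hidden for _ in range(seq_len)]
    for weight, layer in zip(weights, layer_outputs):
        for t in range(seq_len):
            for d in range(hidden):
                fused[t][d] += weight * layer[t][d]
    return fused

# Toy input: 3 "layers", 2 tokens, hidden size 2.
layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
fused_equal = fuse_layers(layers, [0.0, 0.0, 0.0])   # every layer weighted 1/3
fused_top = fuse_layers(layers, [0.0, 0.0, 50.0])    # last layer dominates
```

With equal scores the result is the plain average of the layers. As one score grows, the fused output approaches that single layer, which is how such a scheme can fall back to using only the top layer when that works best. In the full model the fused sequence would then be passed to the BiLSTM and CRF modules.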
Received: 2021-08-31      Published: 2022-01-07
CLC number:  TP391
Funding: *This work is one of the research outcomes of the Humanities and Social Sciences Planning Project of the Ministry of Education (18YJA870017), the Jilin Province Social Science Fund Project (2019B59), and the Jilin University Graduate Innovation Fund Project (101832020CX279).
Corresponding author: Zhang Yunqiu, ORCID: 0000-0002-9790-9581     E-mail: yunqiu@jlu.edu.cn
Cite this article:
Zhang Yunqiu, Wang Yang, Li Bocheng. Identifying Named Entities of Chinese Electronic Medical Records Based on RoBERTa-wwm Dynamic Fusion Model. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 242-250.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0951      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I2/3/242
Fig.1  Framework of the RoBERTa-wwm dynamic fusion model
Fig.2  Example of whole word masking in RoBERTa-wwm
Fig.3  Dynamic weight fusion
Fig.4  Structure of an LSTM unit
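
Since the figure for whole word masking is not reproduced here, the idea can be sketched in code: when a word is chosen for masking, every character of that word is masked together, instead of masking characters independently. The sketch assumes the text has already been segmented into words by some external tool. It also omits details of real pre-training, such as occasionally keeping the token or replacing it with a random one. The function name `whole_word_mask` and the example sentence are hypothetical.

```python
import random

def whole_word_mask(words, mask_prob=0.15, mask_token="[MASK]", rng=None):
    """Mask all characters of a selected word at once.

    words: list of already-segmented words (ASSUMPTION: segmentation is
           supplied by an external tool, not shown here).
    Returns a character-level token list in which each word is either
    fully masked or fully intact.
    """
    rng = rng or random.Random()
    tokens = []
    for word in words:
        mask_this_word = rng.random() < mask_prob  # one decision per word
        for char in word:
            tokens.append(mask_token if mask_this_word else char)
    return tokens

# Hypothetical segmented sentence: "patient" / "presents with" / "abdominal pain".
sentence = ["患者", "出现", "腹痛"]
all_masked = whole_word_mask(sentence, mask_prob=1.0)
none_masked = whole_word_mask(sentence, mask_prob=0.0)
sampled = whole_word_mask(sentence, mask_prob=0.5, rng=random.Random(0))
```

Because the masking decision is made once per word, a two-character term such as 腹痛 is never left half visible, which pushes the model to predict the whole word from its context.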
Dataset        Disease/diagnosis   Symptoms/signs   Examination/tests   Body part   Treatment
300 records    1,209               7,538            9,995               9,844       1,470
Share          4.02%               25.08%           33.25%              32.75%      4.89%
Table 1  Entity distribution in the CCKS2017 Shared Task 2 dataset
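
The percentage row of Table 1 follows directly from the entity counts. The short check below recomputes each share from the counts in the table; the variable names are illustrative only.

```python
# Entity counts copied from Table 1.
entity_counts = {
    "Disease/diagnosis": 1209,
    "Symptoms/signs": 7538,
    "Examination/tests": 9995,
    "Body part": 9844,
    "Treatment": 1470,
}

total_entities = sum(entity_counts.values())

# Share of each entity type, in percent, rounded to two decimals as in the table.
entity_shares = {
    name: round(100 * count / total_entities, 2)
    for name, count in entity_counts.items()
}

for name, share in entity_shares.items():
    print(f"{name}: {share:.2f}%")
```

The counts are strongly imbalanced: examination/tests and body part entities each make up about a third of the annotations, while disease/diagnosis and treatment together account for under 9%.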
Parameter              Value
Dropout                0.5
Hidden layer size      768
Optimizer              Adam
Learning rate          0.0001
Batch_size             32
Decay rate             0.8
LSTM_dim               256
Epoch                  24
Max_seq_len            150
Step_size              2,000
Table 2  Model parameter settings
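
The settings in Table 2 can be collected in a small configuration object, sketched below. The table does not say how the decay rate and step size are combined, so `decayed_lr` assumes a staircase exponential schedule in which the learning rate is multiplied by the decay rate once every `step_size` training steps. This is a guess, and the field and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Values copied from Table 2; field names are illustrative."""
    dropout: float = 0.5
    hidden_dim: int = 768
    optimizer: str = "Adam"
    learning_rate: float = 1e-4
    batch_size: int = 32
    decay_rate: float = 0.8
    lstm_dim: int = 256
    epochs: int = 24
    max_seq_len: int = 150
    step_size: int = 2000

def decayed_lr(cfg, global_step):
    """Learning rate at a given training step.

    ASSUMPTION: staircase exponential decay, i.e. multiply by decay_rate
    once every step_size steps. The source only lists the two values.
    """
    return cfg.learning_rate * cfg.decay_rate ** (global_step // cfg.step_size)

config = TrainConfig()
schedule = [decayed_lr(config, step) for step in (0, 1999, 2000, 4000)]
```

Under this assumed schedule the rate stays at 0.0001 for the first 2,000 steps, then drops to 80% of its previous value at each further multiple of 2,000.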
Model                      P/%      R/%      F1/%
BiLSTM-CRF                 86.63    87.84    87.23
BERT-BiLSTM-CRF            93.10    94.27    93.68
RoBERTa-wwm-BiLSTM-CRF     93.25    94.45    93.85
Proposed model             93.42    94.73    94.08
Table 3  Evaluation results of each model on the CCKS2017 dataset
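
F1 is the harmonic mean of precision (P) and recall (R), so the F1 column of Table 3 can be cross-checked against the other two columns. The sketch below does this. Small differences in the second decimal are expected, because the published P and R values are already rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units in and out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (P, R, reported F1) in percent, copied from Table 3.
table3 = {
    "BiLSTM-CRF": (86.63, 87.84, 87.23),
    "BERT-BiLSTM-CRF": (93.10, 94.27, 93.68),
    "RoBERTa-wwm-BiLSTM-CRF": (93.25, 94.45, 93.85),
    "Proposed model": (93.42, 94.73, 94.08),
}

# Absolute gap between recomputed and reported F1 for each model.
f1_gaps = {
    model: abs(f1_score(p, r) - reported)
    for model, (p, r, reported) in table3.items()
}
```

Every recomputed value agrees with the reported F1 to within 0.02, which is consistent with rounding.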
Model                      P/%      R/%      F1/%
BiLSTM-CRF                 83.07    83.79    83.43
BERT-BiLSTM-CRF            88.93    89.57    89.24
RoBERTa-wwm-BiLSTM-CRF     89.62    89.75    89.69
Proposed model             89.77    90.39    90.08
Table 4  Evaluation results of each model on the self-annotated dataset
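
The improvements quoted in the abstract are differences between F1 scores, so they are percentage points, not relative percentages. The sketch below computes both from the F1 values in Tables 3 and 4, taking RoBERTa-wwm-BiLSTM-CRF as the baseline. The helper name `gains` is hypothetical.

```python
# Reported F1 (percent) for the baseline RoBERTa-wwm-BiLSTM-CRF and the
# proposed model, copied from Tables 3 and 4.
reported_f1 = {
    "CCKS2017": {"baseline": 93.85, "proposed": 94.08},
    "self-annotated": {"baseline": 89.69, "proposed": 90.08},
}

def gains(baseline, proposed):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = proposed - baseline
    relative = 100 * absolute / baseline
    return round(absolute, 2), round(relative, 2)

gain_summary = {
    dataset: gains(scores["baseline"], scores["proposed"])
    for dataset, scores in reported_f1.items()
}
```

The absolute gains are 0.23 and 0.39 points, matching the abstract. The corresponding relative gains come out slightly larger, at about 0.25% and 0.43%.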