Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 242-250    DOI: 10.11925/infotech.2096-3467.2021.0951
Identifying Named Entities of Chinese Electronic Medical Records Based on RoBERTa-wwm Dynamic Fusion Model
Zhang Yunqiu, Wang Yang, Li Bocheng
School of Public Health, Jilin University, Changchun 130021, China
Abstract  

[Objective] This paper proposes a named entity recognition model based on RoBERTa-wwm dynamic fusion to improve entity recognition in Chinese electronic medical records. [Methods] First, we fused the semantic representations generated by each Transformer layer of the pre-trained language model RoBERTa-wwm. Then, we fed the fused representation into a bi-directional long short-term memory (BiLSTM) network and a conditional random field (CRF) module to recognize the entities in the electronic medical records. [Results] We evaluated the model on the dataset of the 2017 China Conference on Knowledge Graph and Semantic Computing (CCKS 2017) and on a self-annotated set of electronic medical records. The F1 scores reached 94.08% and 90.08%, respectively, which are 0.23 and 0.39 percentage points higher than those of the RoBERTa-wwm-BiLSTM-CRF model. [Limitations] The RoBERTa-wwm model used in this paper was pre-trained on a non-medical corpus. [Conclusions] The proposed method improves entity recognition results on Chinese electronic medical records.

Key words: Electronic Medical Record; Named Entity Recognition; RoBERTa-wwm; Dynamic Fusion
Received: 31 August 2021      Published: 07 January 2022
CLC Number (ZTFLH): TP391
Fund: Humanities and Social Science Foundation of Ministry of Education (18YJA870017); Jilin Province Social Science Foundation (2019B59); Graduate Innovation Foundation of Jilin University (101832020CX279)
Corresponding Author: Zhang Yunqiu, ORCID: 0000-0002-9790-9581, E-mail: yunqiu@jlu.edu.cn

Cite this article:

Zhang Yunqiu, Wang Yang, Li Bocheng. Identifying Named Entities of Chinese Electronic Medical Records Based on RoBERTa-wwm Dynamic Fusion Model. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 242-250.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0951     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I2/3/242

Figure: RoBERTa-wwm Dynamic Fusion Model
Figure: Example of Whole Word Masking in RoBERTa-wwm
Figure: Dynamic Weight Fusion
Figure: LSTM Unit Structure
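
The abstract describes merging the semantic representations produced by each Transformer layer of RoBERTa-wwm before they enter the BiLSTM-CRF. The sketch below illustrates one common way such a fusion can work, assuming one learnable scalar score per layer that is normalised with softmax and used to take a weighted sum of the layer outputs. The function names (`softmax`, `fuse_layers`), the toy vectors, and the softmax weighting itself are assumptions for illustration; the paper's exact formulation may differ.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Turn raw per-layer scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layers(layer_outputs: List[List[Vector]],
                layer_scores: List[float]) -> List[Vector]:
    """Weighted sum over layers.

    layer_outputs[l][t] is the hidden vector of token t at layer l.
    layer_scores[l] is a hypothetical learnable scalar for layer l.
    """
    weights = softmax(layer_scores)
    num_tokens = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for weight, layer in zip(weights, layer_outputs):
        for t, vec in enumerate(layer):
            for d, value in enumerate(vec):
                fused[t][d] += weight * value
    return fused


# Toy input: 3 layers, 2 tokens, hidden size 2 (a real model would have
# 12 layers and hidden size 768).
layers = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [1.0, 1.0]],
    [[3.0, 6.0], [2.0, 5.0]],
]
equal_scores = [0.0, 0.0, 0.0]  # equal scores give a plain average
fused = fuse_layers(layers, equal_scores)
print(fused)
```

During training, the per-layer scores would be updated together with the rest of the network, so the weighting adapts to the task rather than relying on the last layer only.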
Table: Entity Distribution in the CCKS 2017 Shared Task 2 Dataset
Dataset        Disease/Diagnosis   Symptom/Sign   Examination/Test   Body Part   Treatment
300 records    1,209               7,538          9,995              9,844       1,470
Share          4.02%               25.08%         33.25%             32.75%      4.89%
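
The percentage row of the entity distribution can be reproduced from the raw counts. The short check below is only an arithmetic illustration; the English category labels are translations and the variable names are assumptions.

```python
# Entity counts per category in the 300 CCKS 2017 records, as reported.
counts = {
    "Disease/Diagnosis": 1209,
    "Symptom/Sign": 7538,
    "Examination/Test": 9995,
    "Body Part": 9844,
    "Treatment": 1470,
}

total = sum(counts.values())

# Share of each category as a percentage, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

print("total entities:", total)
for label, share in shares.items():
    print(f"{label}: {share}%")
```

The counts show a strong imbalance: disease/diagnosis and treatment entities together account for less than 9% of all annotated entities.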
Table: Parameter Settings
Parameter                 Value
Dropout                   0.5
Hidden layer dimension    768
Optimizer                 Adam
Learning rate             0.0001
Batch_size                32
Decay rate                0.8
LSTM_dim                  256
Epoch                     24
Max_seq_len               150
Step_size                 2,000
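
The settings above can be collected in a single configuration object. The sketch below is a hypothetical container, not the authors' code: the class name `TrainConfig`, its field names, and the helper `decayed_lr` are assumptions. In particular, the page does not say how the decay rate and step size are combined, so the staircase exponential schedule shown here is only one plausible reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as listed in the parameter settings table."""
    dropout: float = 0.5
    hidden_dim: int = 768        # hidden layer dimension
    optimizer: str = "Adam"
    learning_rate: float = 1e-4
    batch_size: int = 32
    decay_rate: float = 0.8
    lstm_dim: int = 256
    epochs: int = 24
    max_seq_len: int = 150
    step_size: int = 2000


def decayed_lr(cfg: TrainConfig, global_step: int) -> float:
    """Assumed schedule: multiply the rate by decay_rate every step_size steps."""
    return cfg.learning_rate * cfg.decay_rate ** (global_step // cfg.step_size)


cfg = TrainConfig()
for step in (0, 1999, 2000, 4000):
    print(step, decayed_lr(cfg, step))
```

Under this assumed schedule the learning rate stays at 0.0001 for the first 2,000 steps and then drops by 20% at each subsequent 2,000-step boundary.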
Table: Evaluation Results of Each Model on the CCKS 2017 Dataset
Model                     P/%      R/%      F1/%
BiLSTM-CRF                86.63    87.84    87.23
BERT-BiLSTM-CRF           93.10    94.27    93.68
RoBERTa-wwm-BiLSTM-CRF    93.25    94.45    93.85
Proposed model            93.42    94.73    94.08
Table: Evaluation Results of Each Model on the Self-Labeled Dataset
Model                     P/%      R/%      F1/%
BiLSTM-CRF                83.07    83.79    83.43
BERT-BiLSTM-CRF           88.93    89.57    89.24
RoBERTa-wwm-BiLSTM-CRF    89.62    89.75    89.69
Proposed model            89.77    90.39    90.08
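
The F1 column in both result tables is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The check below recomputes F1 from the reported P and R and derives the gains over RoBERTa-wwm-BiLSTM-CRF quoted in the abstract. Because the published P and R are already rounded to two decimals, a recomputed F1 can differ from the reported one by about 0.01. The dictionary layout and names are assumptions; "Proposed model" stands for the paper's own model.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) in percent, copied from the two result tables.
results = {
    "CCKS 2017": {
        "BiLSTM-CRF": (86.63, 87.84, 87.23),
        "BERT-BiLSTM-CRF": (93.10, 94.27, 93.68),
        "RoBERTa-wwm-BiLSTM-CRF": (93.25, 94.45, 93.85),
        "Proposed model": (93.42, 94.73, 94.08),
    },
    "Self-labeled": {
        "BiLSTM-CRF": (83.07, 83.79, 83.43),
        "BERT-BiLSTM-CRF": (88.93, 89.57, 89.24),
        "RoBERTa-wwm-BiLSTM-CRF": (89.62, 89.75, 89.69),
        "Proposed model": (89.77, 90.39, 90.08),
    },
}

# Largest difference between recomputed and reported F1 over all rows.
max_gap = max(
    abs(f1_score(p, r) - f1)
    for models in results.values()
    for (p, r, f1) in models.values()
)

# Gain of the proposed model over the strongest baseline, in percentage points.
gains = {
    dataset: round(
        models["Proposed model"][2] - models["RoBERTa-wwm-BiLSTM-CRF"][2], 2
    )
    for dataset, models in results.items()
}

print("max |recomputed - reported| F1:", round(max_gap, 4))
print("gain over RoBERTa-wwm-BiLSTM-CRF:", gains)
```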