Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (10): 124-133    DOI: 10.11925/infotech.2096-3467.2020.0167
A BiLSTM-CRF Model for Protected Health Information in Chinese
Liu Jingru¹, Song Yang¹, Jia Rui²,³, Zhang Yipeng¹, Luo Yong²,⁴, Ma Jingdong¹
¹ School of Medical and Health Management, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
² Sichuan Province Electronic Medical Record Engineering Technology Research Center, Chengdu 610041, China
³ School of Public Health, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
⁴ Sichuan Jiuzhen Technology Co., Ltd., Chengdu 610041, China
Abstract  

[Objective] This paper proposes an automated scheme, based on the BiLSTM-CRF model, for identifying protected health information (PHI) in unstructured clinical records and removing it, with the aim of protecting patient privacy.
[Methods] Experimental data were collected from the discharge summaries of a health information platform. Based on the 18 PHI identifiers specified by HIPAA, we defined 7 PHI categories and 15 PHI types, and used the BiLSTM-CRF model to identify PHI in the unstructured clinical records.
[Results] Across all entity categories, precision, recall, and F value were 98.66%, 99.36%, and 99.01%, respectively; the labeling errors were also summarized and analyzed.
[Limitations] The corpus characteristics need to be improved, and the quality of the clinical text after automatic PHI recognition was not evaluated.
[Conclusions] The BiLSTM-CRF model can recognize named entities automatically without feature engineering, which promotes the sharing and use of clinical information.

Keywords: Chinese Clinical Text; Protected Health Information; Long Short-Term Memory; Private Information; Named Entity Recognition
Received: 06 March 2020      Published: 09 November 2020
CLC Number: TP391
Corresponding Authors: Ma Jingdong     E-mail: jdma@hust.edu.cn

Cite this article:

Liu Jingru, Song Yang, Jia Rui, Zhang Yipeng, Luo Yong, Ma Jingdong. A BiLSTM-CRF Model for Protected Health Information in Chinese. Data Analysis and Knowledge Discovery, 2020, 4(10): 124-133.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0167     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I10/124

Flow Chart of Protected Health Information Identification
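| Category | Entity type | Entity label |
|---|---|---|
| Name | Patient | NAM_PATIENT |
| | Doctor | NAM_DOCTOR |
| Geographic location | Province | LOC_PROVINCE |
| | City | LOC_CITY |
| | County/District | LOC_COUNTY |
| | Street | LOC_STREET |
| Date | Date (year excluded; month and day only) | DAT |
| Age | Age (>89) | AGE |
| Medical institution | Medical institution name | ORG |
| Contact information | Telephone | CON_TELEPHONE |
| Number | Admission number | NUM_ADMISSION |
| | Pathology number | NUM_PATHOLOGY |
| | X-ray number | NUM_X-RAY |
| | Ultrasound number | NUM_B |
| | Doctor number | NUM-DOCTOR |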
Types of Protected Privacy Information
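
To illustrate how the entity labels in the table could drive de-identification, the sketch below replaces labeled character spans with a placeholder carrying the label. The function names `redact` and `span_of`, the `(start, end, label)` span format, and the sample sentence are assumptions made for illustration; they are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end_exclusive, label)
Span = Tuple[int, int, str]


def span_of(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of `phrase` and attach a label to it."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def redact(text: str, spans: List[Span]) -> str:
    """Replace each labeled span with a [LABEL] placeholder."""
    out = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        out.append(text[cursor:start])  # keep the text between entities
        out.append(f"[{label}]")        # substitute the entity itself
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Invented sample sentence; in practice the spans would come from the tagger.
sample = "Patient Zhang San was seen by Dr. Li Si on 03-06."
spans = [
    span_of(sample, "Zhang San", "NAM_PATIENT"),
    span_of(sample, "Li Si", "NAM_DOCTOR"),
    span_of(sample, "03-06", "DAT"),
]
redacted = redact(sample, spans)
print(redacted)
```

Keeping the label in the placeholder preserves the kind of information that was removed, which may help downstream readers of the de-identified text.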
Basic Structure of BiLSTM-CRF Model
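
In a BiLSTM-CRF tagger, the BiLSTM typically produces a score for each label at each token, and the CRF layer selects the label sequence with the highest total score using learned label-to-label transition scores. A minimal sketch of that decoding step (Viterbi) is given below; the toy scores, the two-label set, and the function name `viterbi` are assumptions for illustration and are not values from the paper.

```python
from typing import List


def viterbi(emissions: List[List[float]], transitions: List[List[float]]) -> List[int]:
    """Return the label index sequence with the highest total score.

    emissions[t][j]:   score of label j at position t (BiLSTM output in a real model)
    transitions[i][j]: score of moving from label i to label j (CRF parameters)
    """
    n_labels = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backptr = []                # backptr[t][j]: best previous label for label j
    for emit in emissions[1:]:
        new_score = []
        ptrs = []
        for j in range(n_labels):
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            ptrs.append(best_i)
        score = new_score
        backptr.append(ptrs)
    # Trace the best path backwards from the best final label.
    last = max(range(n_labels), key=lambda j: score[j])
    path = [last]
    for ptrs in reversed(backptr):
        last = ptrs[last]
        path.append(last)
    path.reverse()
    return path


# Toy example with two labels: 0 = "O" (outside), 1 = "DAT".
toy_emissions = [[2.0, 1.0], [0.5, 1.0], [0.4, 1.5]]
toy_transitions = [[0.5, -0.5], [-0.5, 1.0]]
best_path = viterbi(toy_emissions, toy_transitions)
greedy_path = [max(range(2), key=lambda j: e[j]) for e in toy_emissions]
print(best_path, greedy_path)
```

In this toy case the per-token argmax (`greedy_path`) and the Viterbi path differ at the first token, which shows how transition scores can override a locally preferred label.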
Unigram Feature Template Based on the Symbolic and Part of Speech Features of Words
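| Entity type | Entity label | Training set: count | Training set: % | Validation set: count | Validation set: % | Test set: count | Test set: % |
|---|---|---|---|---|---|---|---|
| Doctor | NAM_DOCTOR | 5,207 | 12.45 | 755 | 12.65 | 1,558 | 12.92 |
| Patient | NAM_PATIENT | 2,875 | 6.87 | 393 | 6.59 | 854 | 7.08 |
| City | LOC_CITY | 444 | 1.06 | 60 | 1.01 | 107 | 0.89 |
| County/District | LOC_COUNTY | 1,078 | 2.58 | 146 | 2.45 | 322 | 2.67 |
| Province | LOC_PROVINCE | 458 | 1.10 | 66 | 1.11 | 114 | 0.95 |
| Street | LOC_STREET | 1,255 | 3.00 | 180 | 3.02 | 372 | 3.09 |
| Date | DAT | 19,935 | 47.67 | 2,872 | 48.12 | 5,629 | 46.69 |
| Age | AGE | 48 | 0.11 | 2 | 0.03 | 15 | 0.12 |
| Medical institution name | ORG | 8,994 | 21.51 | 1,279 | 21.43 | 2,604 | 21.60 |
| Telephone | CON_TELEPHONE | 329 | 0.79 | 39 | 0.65 | 104 | 0.86 |
| Admission number | NUM_ADMISSION | 387 | 0.93 | 59 | 0.99 | 108 | 0.90 |
| Ultrasound number | NUM_B | 15 | 0.04 | 0 | 0.00 | 4 | 0.03 |
| Pathology number | NUM_PATHOLOGY | 38 | 0.09 | 2 | 0.03 | 14 | 0.12 |
| X-ray number | NUM_X-RAY | 26 | 0.06 | 7 | 0.12 | 13 | 0.11 |
| Doctor number | NUM-DOCTOR | 730 | 1.75 | 108 | 1.81 | 238 | 1.97 |
| Total | | 41,819 | 100 | 5,968 | 100 | 12,056 | 100 |
Distribution of Entities in Training, Validation and Test Sets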
Impact of Dropout on Model Performance
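| Name | Parameter | Value |
|---|---|---|
| Experimental environment | Operating system | Windows 10 |
| | Programming language | Python 3.6 |
| | Word segmentation tool | Jieba 0.37 |
| BiLSTM model parameters | Hidden layer size | 100 |
| | Learning rate | 0.001 |
| | L2 regularization coefficient | 0.001 |
| | Batch size | 32 |
| | Dropout | 0.5 |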
Experimental Environment and Parameter Setting
| Model | Precision (%) | Recall (%) | F value (%) |
|---|---|---|---|
| CRF | 95.93 | 94.26 | 95.08 |
| BiLSTM | 98.16 | 98.13 | 98.14 |
| BiLSTM-CRF | 98.91 | 98.00 | 98.45 |
Evaluation Results of CRF, BiLSTM and BiLSTM-CRF
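
The F values reported for the three models are consistent with the harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below recomputes F from the reported precision and recall; the function name `f_value` is an assumption, and the recomputed figures can differ from the reported ones by about 0.01 because the inputs are already rounded.

```python
def f_value(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F) in percent, copied from the results table.
reported = {
    "CRF": (95.93, 94.26, 95.08),
    "BiLSTM": (98.16, 98.13, 98.14),
    "BiLSTM-CRF": (98.91, 98.00, 98.45),
}

for model, (p, r, f) in reported.items():
    print(f"{model}: recomputed F = {f_value(p, r):.2f}, reported F = {f:.2f}")
```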
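| Entity type | Entity label | Precision (%) | Recall (%) | F value (%) |
|---|---|---|---|---|
| Doctor | NAM_DOCTOR | 98.60 | 99.59 | 99.09 |
| Patient | NAM_PATIENT | 98.29 | 98.06 | 98.17 |
| City | LOC_CITY | 98.34 | 96.42 | 97.37 |
| County/District | LOC_COUNTY | 98.86 | 97.34 | 98.09 |
| Province | LOC_PROVINCE | 99.12 | 99.41 | 99.27 |
| Street | LOC_STREET | 99.65 | 99.73 | 99.69 |
| Date | DAT | 98.88 | 99.52 | 99.20 |
| Age | AGE | 73.68 | 95.45 | 83.17 |
| Medical institution name | ORG | 98.46 | 99.39 | 98.92 |
| Telephone | CON_TELEPHONE | 98.57 | 100.00 | 99.28 |
| Admission number | NUM_ADMISSION | 93.19 | 95.50 | 94.33 |
| Ultrasound number | NUM_B | 100 | 100 | 100 |
| Pathology number | NUM_PATHOLOGY | 100 | 87.39 | 93.27 |
| X-ray number | NUM_X-RAY | 64.71 | 64.71 | 64.71 |
| Doctor number | NUM-DOCTOR | 98.43 | 99.65 | 99.04 |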
Evaluation Results of Each Entity Type of BiLSTM-CRF Model
| Error type | Type error | Boundary error | False negative | False positive |
|---|---|---|---|---|
| Proportion (%) | 9.80 | 15.03 | 66.67 | 8.50 |
BiLSTM-CRF Model Error Types
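
The four error types can be made concrete by comparing predicted entity spans with gold-standard spans. The definitions used in the sketch below are one plausible reading and an assumption here, since the paper's exact criteria are not given on this page: a type error has the same boundaries but a different label, a boundary error overlaps a gold entity with different boundaries, a false negative is a gold entity with no overlapping prediction, and a false positive is a prediction with no overlapping gold entity. The function names and sample spans are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical span format: (start, end_exclusive, label)
Span = Tuple[int, int, str]


def overlaps(a: Span, b: Span) -> bool:
    """True if the two character ranges share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def classify_errors(gold: List[Span], pred: List[Span]) -> Counter:
    """Count error types under the assumed definitions described above."""
    errors = Counter()
    pred_set = set(pred)
    for g in gold:
        if g in pred_set:
            continue  # exact match on boundaries and label: not an error
        overlapping = [p for p in pred if overlaps(g, p)]
        if not overlapping:
            errors["false_negative"] += 1
        elif any(p[:2] == g[:2] for p in overlapping):
            errors["type_error"] += 1
        else:
            errors["boundary_error"] += 1
    for p in pred:
        if not any(overlaps(p, g) for g in gold):
            errors["false_positive"] += 1
    return errors


# Invented spans, one per outcome, to exercise each branch.
gold_spans = [(0, 3, "NAM_PATIENT"), (5, 9, "ORG"), (12, 17, "DAT"), (20, 24, "NUM_X-RAY")]
pred_spans = [(0, 3, "NAM_DOCTOR"), (5, 8, "ORG"), (12, 17, "DAT"), (30, 33, "AGE")]
error_counts = classify_errors(gold_spans, pred_spans)
print(dict(error_counts))
```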