Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (10): 124-133    DOI: 10.11925/infotech.2096-3467.2020.0167
A BiLSTM-CRF Model for Protected Health Information in Chinese
Liu Jingru¹, Song Yang¹, Jia Rui²˒³, Zhang Yipeng¹, Luo Yong²˒⁴, Ma Jingdong¹
¹ School of Medical and Health Management, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
² Sichuan Province Electronic Medical Record Engineering Technology Research Center, Chengdu 610041, China
³ School of Public Health, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
⁴ Sichuan Jiuzhen Technology Co., Ltd., Chengdu 610041, China
[Objective] This paper proposes an automated scheme, based on the BiLSTM-CRF model, for identifying protected health information (PHI) in unstructured clinical records and removing it, with the aim of protecting patient privacy.
[Methods] We collected discharge summaries from a health information platform as experimental data. Based on the 18 PHI identifiers specified by HIPAA, we defined 7 PHI categories and 15 PHI types, and used the BiLSTM-CRF model to identify PHI in unstructured clinical records.
[Results] Over all entity categories, precision, recall, and F value were 98.66%, 99.36%, and 99.01%, respectively. Mislabeled entities were summarized and analyzed.
[Limitations] The corpus features need further improvement, and the quality of the clinical text after automatic PHI recognition was not evaluated.
[Conclusions] The BiLSTM-CRF model can recognize named entities automatically without feature engineering, which facilitates the sharing and use of clinical information.

Key words: Chinese Clinical Text; Protected Health Information; Long Short-Term Memory; Private Information; Named Entity Recognition
Received: 06 March 2020      Published: 09 November 2020
CLC number (ZTFLH): TP391
Corresponding Author: Ma Jingdong

Cite this article:

Liu Jingru, Song Yang, Jia Rui, Zhang Yipeng, Luo Yong, Ma Jingdong. A BiLSTM-CRF Model for Protected Health Information in Chinese. Data Analysis and Knowledge Discovery, 2020, 4(10): 124-133.

Flow Chart of Protected Health Information Identification
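| Category | Entity type | Entity label |
| --- | --- | --- |
| Name | Patient | NAM_PATIENT |
| Date | Date (month and day only; year excluded) | DAT |
| Age | Age (>89) | AGE |
| Medical institution | Medical institution name | ORG |
| Number | Admission number | NUM_ADMISSION |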
Types of Protected Privacy Information
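Once a tagger has located PHI spans, entity labels like those in the table can stand in for the original text. The following is a minimal sketch of that replacement step, not the authors' implementation: the function name `mask_phi`, the span format, and the example sentence (with an invented placeholder name) are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]

def mask_phi(text: str, spans: List[Span]) -> str:
    """Replace each labelled span with a <LABEL> placeholder."""
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if start < cursor:
            raise ValueError("overlapping spans are not supported")
        pieces.append(text[cursor:start])  # keep the text before the span
        pieces.append(f"<{label}>")        # substitute the entity label
        cursor = end
    pieces.append(text[cursor:])           # keep the trailing text
    return "".join(pieces)

# Invented example sentence: "Patient Zhang San was admitted on 5 March".
example = "患者张三于3月5日入院"
spans = [(2, 4, "NAM_PATIENT"), (5, 9, "DAT")]
masked = mask_phi(example, spans)
```

Replacing spans with their labels rather than deleting them keeps the sentence structure readable, which matters if the de-identified text is to be reused.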
Basic Structure of BiLSTM-CRF Model
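In a BiLSTM-CRF tagger, the BiLSTM produces a score for every tag at every token, and the CRF layer adds tag-to-tag transition scores; the best tag sequence is then usually found with Viterbi decoding. The sketch below shows that decoding step in plain Python. The BIO tag set, the scores, and the function name are illustrative assumptions; the source does not state the tagging scheme or give any scores.

```python
from typing import List

def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]   : score of tag j at position t (e.g. from a BiLSTM)
    transitions[i][j] : score of moving from tag i to tag j (CRF layer)
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_tags):
            # Best previous tag for ending in tag j at this position.
            best_i = max(range(n_tags),
                         key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(backpointers):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path

tags = ["O", "B-DAT", "I-DAT"]           # assumed BIO scheme
transitions = [[0.0, 0.0, -10.0],        # O -> I-DAT is heavily penalised
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]
emissions = [[2.0, 0.0, 0.0],
             [0.0, 1.0, 1.5],            # token-wise argmax would pick I-DAT
             [0.0, 0.0, 2.0]]
decoded = [tags[i] for i in viterbi_decode(emissions, transitions)]
```

In this toy input, choosing the best tag per token independently would yield the invalid sequence O, I-DAT, I-DAT; the transition penalty steers decoding to a well-formed one, which is the usual motivation for placing a CRF layer on top of the BiLSTM.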
Unigram Feature Template Based on the Symbolic and Part of Speech Features of Words
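| Entity type | Entity label | Training set: count | Training set: share (%) | Validation set: count | Validation set: share (%) | Test set: count | Test set: share (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Doctor | NAM_DOCTOR | 5,207 | 12.45 | 755 | 12.65 | 1,558 | 12.92 |
| Patient | NAM_PATIENT | 2,875 | 6.87 | 393 | 6.59 | 854 | 7.08 |
| City | LOC_CITY | 444 | 1.06 | 60 | 1.01 | 107 | 0.89 |
| District/county | LOC_COUNTY | 1,078 | 2.58 | 146 | 2.45 | 322 | 2.67 |
| Province | LOC_PROVINCE | 458 | 1.10 | 66 | 1.11 | 114 | 0.95 |
| Street | LOC_STREET | 1,255 | 3.00 | 180 | 3.02 | 372 | 3.09 |
| Date | DAT | 19,935 | 47.67 | 2,872 | 48.12 | 5,629 | 46.69 |
| Age | AGE | 48 | 0.11 | 2 | 0.03 | 15 | 0.12 |
| Medical institution name | ORG | 8,994 | 21.51 | 1,279 | 21.43 | 2,604 | 21.60 |
| Telephone | CON_TELEPHONE | 329 | 0.79 | 39 | 0.65 | 104 | 0.86 |
| Admission number | NUM_ADMISSION | 387 | 0.93 | 59 | 0.99 | 108 | 0.90 |
| Ultrasound number | NUM_B | 15 | 0.04 | 0 | 0.00 | 4 | 0.03 |
| Pathology number | NUM_PATHOLOGY | 38 | 0.09 | 2 | 0.03 | 14 | 0.12 |
| X-ray number | NUM_X-RAY | 26 | 0.06 | 7 | 0.12 | 13 | 0.11 |
| Doctor number | NUM-DOCTOR | 730 | 1.75 | 108 | 1.81 | 238 | 1.97 |
| Total | | 41,819 | 100 | 5,968 | 100 | 12,056 | 100 |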
Distribution of Entities in the Training, Validation and Test Sets
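The share columns follow directly from the counts. As a quick illustration of how skewed the label distribution is, the sketch below recomputes the training-set shares from the counts in the table; the variable names are arbitrary.

```python
# Training-set entity counts copied from the table above.
train_counts = {
    "NAM_DOCTOR": 5207, "NAM_PATIENT": 2875, "LOC_CITY": 444,
    "LOC_COUNTY": 1078, "LOC_PROVINCE": 458, "LOC_STREET": 1255,
    "DAT": 19935, "AGE": 48, "ORG": 8994, "CON_TELEPHONE": 329,
    "NUM_ADMISSION": 387, "NUM_B": 15, "NUM_PATHOLOGY": 38,
    "NUM_X-RAY": 26, "NUM-DOCTOR": 730,
}

total = sum(train_counts.values())

# Share of each label, in percent, rounded as in the table.
shares = {label: round(100 * n / total, 2) for label, n in train_counts.items()}

# Labels with very few examples are the ones most likely to be learned poorly.
rarest = min(train_counts, key=train_counts.get)
```

Dates and institution names together make up roughly two thirds of the training entities, while several number types have fewer than 50 examples each.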
Impact of Dropout on Model Performance
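| Name | Parameter | Value |
| --- | --- | --- |
| Experimental environment | Operating system | Windows 10 |
| Experimental environment | Programming language | Python 3.6 |
| Experimental environment | Word segmentation tool | Jieba 0.37 |
| BiLSTM model parameters | Hidden layer size | 100 |
| BiLSTM model parameters | Learning rate | 0.001 |
| BiLSTM model parameters | L2 regularization coefficient | 0.001 |
| BiLSTM model parameters | Batch size | 32 |
| BiLSTM model parameters | Dropout | 0.5 |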
Experimental Environment and Parameter Setting
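For anyone trying to reproduce the setup, the reported hyperparameters can be gathered in a single immutable configuration object. This is only a sketch: the class and field names are hypothetical, and the source does not say which framework or optimizer was used, nor whether the dropout value is a drop rate or a keep probability.

```python
from typing import NamedTuple

class TrainingConfig(NamedTuple):
    """Hyperparameters as reported in the table; field names are hypothetical."""
    hidden_size: int = 100          # hidden layer size
    learning_rate: float = 0.001
    l2_coefficient: float = 0.001   # L2 regularization coefficient
    batch_size: int = 32
    dropout: float = 0.5            # interpretation (drop vs. keep) not stated

config = TrainingConfig()

# Variants for an ablation can be derived without mutating the baseline.
no_dropout = config._replace(dropout=0.0)
```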
| Model | Precision (%) | Recall (%) | F value (%) |
| --- | --- | --- | --- |
| CRF | 95.93 | 94.26 | 95.08 |
| BiLSTM | 98.16 | 98.13 | 98.14 |
| BiLSTM-CRF | 98.91 | 98.00 | 98.45 |
Evaluation Results of CRF, BiLSTM and BiLSTM-CRF
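The source does not define the F value, but the reported figures are consistent with the usual F1 score, the harmonic mean of precision and recall. The sketch below recomputes it from the first two columns under that assumption. Small differences are expected, since the published precision and recall are themselves rounded.

```python
def f_value(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed to be the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F value), in percent, from the table above.
reported = {
    "CRF": (95.93, 94.26, 95.08),
    "BiLSTM": (98.16, 98.13, 98.14),
    "BiLSTM-CRF": (98.91, 98.00, 98.45),
}

recomputed = {model: f_value(p, r) for model, (p, r, _) in reported.items()}

# Largest gap between the recomputed and the reported F value.
max_gap = max(abs(recomputed[m] - reported[m][2]) for m in reported)
```

Because the harmonic mean is dominated by the smaller of its two inputs, a model cannot reach a high F value by trading recall for precision or vice versa.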
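| Entity type | Entity label | Precision (%) | Recall (%) | F value (%) |
| --- | --- | --- | --- | --- |
| Doctor | NAM_DOCTOR | 98.60 | 99.59 | 99.09 |
| Patient | NAM_PATIENT | 98.29 | 98.06 | 98.17 |
| City | LOC_CITY | 98.34 | 96.42 | 97.37 |
| District/county | LOC_COUNTY | 98.86 | 97.34 | 98.09 |
| Province | LOC_PROVINCE | 99.12 | 99.41 | 99.27 |
| Street | LOC_STREET | 99.65 | 99.73 | 99.69 |
| Date | DAT | 98.88 | 99.52 | 99.20 |
| Age | AGE | 73.68 | 95.45 | 83.17 |
| Medical institution name | ORG | 98.46 | 99.39 | 98.92 |
| Telephone | CON_TELEPHONE | 98.57 | 100.00 | 99.28 |
| Admission number | NUM_ADMISSION | 93.19 | 95.50 | 94.33 |
| Ultrasound number | NUM_B | 100 | 100 | 100 |
| Pathology number | NUM_PATHOLOGY | 100 | 87.39 | 93.27 |
| X-ray number | NUM_X-RAY | 64.71 | 64.71 | 64.71 |
| Doctor number | NUM-DOCTOR | 98.43 | 99.65 | 99.04 |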
Evaluation Results of Each Entity Type of BiLSTM-CRF Model
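Per-type scores of this kind are commonly computed at the entity level, counting a prediction as correct only when both its boundaries and its label match a gold entity exactly. The sketch below shows that calculation on invented spans; the exact-match criterion, the span format, and the function name are assumptions, since the source does not describe its scoring procedure.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# Hypothetical span format: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]

def per_label_prf(gold: Set[Span],
                  pred: Set[Span]) -> Dict[str, Tuple[float, float, float]]:
    """Exact-match precision, recall and F1 for each entity label."""
    true_pos = Counter(label for (_, _, label) in gold & pred)
    n_gold = Counter(label for (_, _, label) in gold)
    n_pred = Counter(label for (_, _, label) in pred)
    results = {}
    for label in sorted(set(n_gold) | set(n_pred)):
        p = true_pos[label] / n_pred[label] if n_pred[label] else 0.0
        r = true_pos[label] / n_gold[label] if n_gold[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        results[label] = (p, r, f)
    return results

# Invented spans: the second DAT prediction has the wrong right boundary.
gold = {(0, 2, "NAM_PATIENT"), (5, 9, "DAT"), (12, 16, "DAT")}
pred = {(0, 2, "NAM_PATIENT"), (5, 9, "DAT"), (12, 15, "DAT")}
scores = per_label_prf(gold, pred)
```

With so few gold examples for types such as AGE or NUM_X-RAY, a handful of mistakes moves these ratios by many points, which is one plausible reading of the low scores for the rare types in the table.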
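| Error type | Type error | Boundary error | False negative | False positive |
| --- | --- | --- | --- | --- |
| Share (%) | 9.80 | 15.03 | 66.67 | 8.50 |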
BiLSTM-CRF Model Error Types
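The four error categories can be told apart mechanically by comparing predicted spans with gold spans. The sketch below uses one plausible set of definitions, which are assumptions rather than the authors' stated criteria: a type error has the right boundaries but the wrong label, a boundary error overlaps a gold entity without matching its boundaries, a false positive overlaps no gold entity, and a false negative is a gold entity that no prediction overlaps.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical span format: (start, end, label), with end exclusive.
Span = Tuple[int, int, str]

def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one character position."""
    return a[0] < b[1] and b[0] < a[1]

def classify_errors(gold: List[Span], pred: List[Span]) -> Counter:
    """Count errors by category under the assumed definitions above."""
    errors = Counter()
    gold_set, pred_set = set(gold), set(pred)
    for p in pred:
        if p in gold_set:
            continue  # exact match, not an error
        partners = [g for g in gold if overlaps(g, p)]
        if not partners:
            errors["false_positive"] += 1
        elif any(g[:2] == p[:2] for g in partners):
            errors["type"] += 1      # same boundaries, different label
        else:
            errors["boundary"] += 1  # overlapping, boundaries differ
    for g in gold:
        if g not in pred_set and not any(overlaps(g, p) for p in pred):
            errors["false_negative"] += 1
    return errors

# Invented spans covering each category once (false negatives twice).
gold = [(0, 2, "NAM_PATIENT"), (5, 9, "DAT"),
        (12, 16, "ORG"), (20, 24, "NUM_ADMISSION")]
pred = [(0, 2, "NAM_DOCTOR"), (5, 8, "DAT"), (30, 32, "AGE")]
errors = classify_errors(gold, pred)
```

For de-identification, false negatives are the costly category, because each one is a piece of PHI left in the released text.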
