Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (10): 124-133     https://doi.org/10.11925/infotech.2096-3467.2020.0167
Research Paper
A BiLSTM-CRF Model for Protected Health Information in Chinese*
Liu Jingru¹, Song Yang¹, Jia Rui²,³, Zhang Yipeng¹, Luo Yong²,⁴, Ma Jingdong¹ (corresponding author)
¹ School of Medical and Health Management, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China
² Sichuan Province Electronic Medical Record Engineering Technology Research Center, Chengdu 610041, China
³ School of Public Health, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
⁴ Sichuan Jiuzhen Technology Co., Ltd., Chengdu 610041, China
Abstract

[Objective] To protect private information in clinical text, this paper proposes an automated scheme, based on the BiLSTM-CRF model, for identifying protected health information (PHI) in unstructured clinical records so that it can be removed. [Methods] Discharge summaries from the electronic health records of a regional health information platform served as the experimental data. Starting from the 18 PHI items specified by the Health Insurance Portability and Accountability Act (HIPAA) and taking the characteristics of the data into account, we defined 7 PHI categories comprising 15 PHI types. A BiLSTM-CRF model was then used to identify PHI in the unstructured clinical records. [Results] Across all entity categories, precision, recall, and F-score reached 98.66%, 99.36%, and 99.01%, respectively. Incorrectly predicted labels were summarized and analyzed. [Limitations] Optimization of model performance using corpus-specific features remains incomplete, and the quality of the clinical text after automatic PHI identification was not evaluated. [Conclusions] The BiLSTM-CRF model recognized named entities automatically without feature engineering, which can help promote the sharing and use of clinical information.

Key words: Chinese Clinical Text; Protected Health Information; Long Short-Term Memory; Private Information; Named Entity Recognition
Received: 2020-03-06      Published: 2020-11-09
CLC number (ZTFLH):  TP391
Funding: *This work is one of the research outcomes of the Sichuan Science and Technology Program Key R&D Fund project "Research and Application of Desensitization Technology for Mining Massive Health Data" (2018GZ0201).
Corresponding author: Ma Jingdong     E-mail: jdma@hust.edu.cn
Cite this article:
Liu Jingru, Song Yang, Jia Rui, Zhang Yipeng, Luo Yong, Ma Jingdong. A BiLSTM-CRF Model for Protected Health Information in Chinese. Data Analysis and Knowledge Discovery, 2020, 4(10): 124-133.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0167      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2020/V4/I10/124
Fig.1  Flowchart of protected health information identification
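Category              Entity type                                 Entity label
Name                  Patient                                     NAM_PATIENT
                      Doctor                                      NAM_DOCTOR
Geographic location   Province                                    LOC_PROVINCE
                      City                                        LOC_CITY
                      County/district                             LOC_COUNTY
                      Street                                      LOC_STREET
Date                  Date (month and day only; year excluded)    DAT
Age                   Age (>89)                                   AGE
Medical institution   Medical institution name                    ORG
Contact information   Telephone                                   CON_TELEPHONE
Number                Admission number                            NUM_ADMISSION
                      Pathology number                            NUM_PATHOLOGY
                      X-ray number                                NUM_X-RAY
                      Ultrasound number                           NUM_B
                      Doctor number                               NUM-DOCTOR
Table 1  Types of protected private information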
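As a minimal sketch of how the label set in Table 1 could drive de-identification, the Python below maps each entity label to its category and replaces recognized spans with a placeholder. The `redact` helper, the span format (character offsets, end exclusive), and the sample sentence are hypothetical illustrations, not taken from the paper.

```python
# Label taxonomy copied from Table 1: category -> entity labels.
PHI_LABELS = {
    "Name": ["NAM_PATIENT", "NAM_DOCTOR"],
    "Geographic location": ["LOC_PROVINCE", "LOC_CITY", "LOC_COUNTY", "LOC_STREET"],
    "Date": ["DAT"],
    "Age": ["AGE"],
    "Medical institution": ["ORG"],
    "Contact information": ["CON_TELEPHONE"],
    "Number": ["NUM_ADMISSION", "NUM_PATHOLOGY", "NUM_X-RAY", "NUM_B", "NUM-DOCTOR"],
}

# Reverse lookup: entity label -> category.
LABEL_TO_CATEGORY = {
    label: category for category, labels in PHI_LABELS.items() for label in labels
}


def redact(text, spans):
    """Replace each (start, end, label) span with a [LABEL] placeholder.

    Spans stand in for model output (hypothetical format): character
    offsets, end exclusive, assumed not to overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        if label not in LABEL_TO_CATEGORY:
            raise ValueError(f"unknown PHI label: {label}")
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Invented example sentence; the offsets mark "Zhang San" and "03-06".
sample = "Patient Zhang San was admitted on 03-06."
spans = [(8, 17, "NAM_PATIENT"), (34, 39, "DAT")]
redacted = redact(sample, spans)
```

Replacing a span with its label rather than deleting it keeps the sentence readable, which matters if the de-identified text is to be reused downstream.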
Fig.2  Basic structure of the BiLSTM-CRF model [15]
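Fig.2 pairs a BiLSTM encoder with a CRF output layer. To make the role of the CRF layer concrete, the sketch below runs Viterbi decoding over per-token emission scores (which the BiLSTM would supply in the real model) plus label-to-label transition scores. The scores, the three-label BIO-style tag set, and the function name `viterbi_decode` are illustrative assumptions; this page does not state the tagging scheme or the authors' implementation.

```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label path and its score.

    emissions[t][j]   : score of label j at token t (BiLSTM output in the real model)
    transitions[i][j] : score of moving from label i to label j (CRF parameters)
    """
    n_labels = len(emissions[0])
    # best[j] = best score of any path that ends in label j at the current token
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        new_best, pointers = [], []
        for j in range(n_labels):
            # Choose the previous label that maximizes the path score into j.
            i_best = max(range(n_labels), key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[i_best] + transitions[i_best][j] + emissions[t][j])
            pointers.append(i_best)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final label.
    last = max(range(n_labels), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy example with three hypothetical labels.
labels = ["O", "B-DAT", "I-DAT"]
emissions = [
    [2.0, 0.5, 0.1],   # token 0 looks like O
    [0.3, 1.5, 1.4],   # token 1 could begin or continue a date
    [0.2, 0.4, 1.8],   # token 2 looks like the inside of a date
]
transitions = [
    [0.5, 0.2, -5.0],  # O -> I-DAT is strongly penalized
    [0.0, -1.0, 1.0],  # B-DAT -> I-DAT is encouraged
    [0.1, -1.0, 0.5],
]
path, score = viterbi_decode(emissions, transitions)
decoded = [labels[i] for i in path]
```

The transition scores are what let the CRF layer rule out label sequences that a per-token classifier could emit, such as an inside tag directly after `O`.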
Fig.3  Unigram feature templates for symbol features and part-of-speech features
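The templates themselves are in Fig.3 and are not reproduced on this page, so the sketch below only illustrates the general idea of a Unigram template. It assumes a CRF++-style notation in which `%x[row,col]` selects a feature column of the token at a relative offset. The template strings, the column layout (character, symbol feature, part-of-speech tag), the padding strings, and the `expand` helper are all hypothetical.

```python
import re

# One row per token: (character, symbol feature, part-of-speech feature).
# The column layout and the feature values are assumptions for illustration.
rows = [
    ("2", "DIGIT", "m"),
    ("月", "HANZI", "m"),
    ("3", "DIGIT", "m"),
    ("日", "HANZI", "m"),
]

# Hypothetical unigram templates in CRF++-style notation.
templates = [
    "U00:%x[0,0]",           # current character
    "U01:%x[-1,1]/%x[0,1]",  # previous and current symbol feature
    "U02:%x[0,2]",           # current part-of-speech tag
]

MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand(template, rows, pos):
    """Fill one template for the token at index pos."""
    def fill(match):
        row = pos + int(match.group(1))
        col = int(match.group(2))
        if row < 0:
            return f"_B{row}"                     # padding before the sentence
        if row >= len(rows):
            return f"_B+{row - len(rows) + 1}"    # padding after the sentence
        return rows[row][col]
    return MACRO.sub(fill, template)


features_at_0 = [expand(t, rows, 0) for t in templates]
features_at_1 = [expand(t, rows, 1) for t in templates]
```

Each expanded string becomes one binary feature for the CRF, so a handful of templates yields a large, sparse feature set. This hand-written step is the feature engineering that the BiLSTM-CRF model does without.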
Entity type               Entity label   Training set      Validation set    Test set
                                         Count    Share/%  Count    Share/%  Count    Share/%
Doctor                    NAM_DOCTOR     5,207    12.45    755      12.65    1,558    12.92
Patient                   NAM_PATIENT    2,875    6.87     393      6.59     854      7.08
City                      LOC_CITY       444      1.06     60       1.01     107      0.89
County/district           LOC_COUNTY     1,078    2.58     146      2.45     322      2.67
Province                  LOC_PROVINCE   458      1.10     66       1.11     114      0.95
Street                    LOC_STREET     1,255    3.00     180      3.02     372      3.09
Date                      DAT            19,935   47.67    2,872    48.12    5,629    46.69
Age                       AGE            48       0.11     2        0.03     15       0.12
Medical institution name  ORG            8,994    21.51    1,279    21.43    2,604    21.60
Telephone                 CON_TELEPHONE  329      0.79     39       0.65     104      0.86
Admission number          NUM_ADMISSION  387      0.93     59       0.99     108      0.90
Ultrasound number         NUM_B          15       0.04     0        0.00     4        0.03
Pathology number          NUM_PATHOLOGY  38       0.09     2        0.03     14       0.12
X-ray number              NUM_X-RAY      26       0.06     7        0.12     13       0.11
Doctor number             NUM-DOCTOR     730      1.75     108      1.81     238      1.97
Total                                    41,819   100      5,968    100      12,056   100
Table 2  Distribution of each entity type across the training, validation, and test sets
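The share columns in Table 2 follow directly from the counts. As a quick check, the snippet below recomputes the training-set shares and lists the labels with fewer than 50 training examples. The counts are copied from the table; the variable names and the threshold of 50 are arbitrary choices made for this illustration.

```python
# Training-set entity counts copied from Table 2.
train_counts = {
    "NAM_DOCTOR": 5207,
    "NAM_PATIENT": 2875,
    "LOC_CITY": 444,
    "LOC_COUNTY": 1078,
    "LOC_PROVINCE": 458,
    "LOC_STREET": 1255,
    "DAT": 19935,
    "AGE": 48,
    "ORG": 8994,
    "CON_TELEPHONE": 329,
    "NUM_ADMISSION": 387,
    "NUM_B": 15,
    "NUM_PATHOLOGY": 38,
    "NUM_X-RAY": 26,
    "NUM-DOCTOR": 730,
}

total = sum(train_counts.values())

# Share of each label in percent, rounded as in the table.
shares = {label: round(100 * n / total, 2) for label, n in train_counts.items()}

# Labels with very few training examples (threshold chosen for illustration).
rare_labels = sorted(label for label, n in train_counts.items() if n < 50)
```

The distribution is heavily skewed: dates, institution names, and doctor names account for most of the training entities, while four labels have fewer than 50 examples each.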
Fig.4  Effect of Dropout on model performance
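Name                       Parameter                    Value
Experimental environment   Operating system             Windows 10
                           Programming language         Python 3.6
                           Word segmentation tool       Jieba 0.37
BiLSTM model parameters    Hidden layer size            100
                           Learning rate                0.001
                           L2 regularization coefficient  0.001
                           Batch size                   32
                           Dropout                      0.5
Table 3  Experimental environment and parameter settings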
Model        Precision  Recall  F-score
CRF          95.93      94.26   95.08
BiLSTM       98.16      98.13   98.14
BiLSTM-CRF   98.91      98.00   98.45
Table 4  Evaluation results of CRF, BiLSTM, and BiLSTM-CRF (%)
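The F-score column in Table 4 is consistent with the usual harmonic mean of precision and recall, F = 2PR / (P + R). The snippet below recomputes it from the reported precision and recall; the helper name `f_score` is ours, and small differences in the last digit are expected because the inputs are already rounded.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (same unit as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) copied from Table 4, in percent.
table4 = {
    "CRF": (95.93, 94.26, 95.08),
    "BiLSTM": (98.16, 98.13, 98.14),
    "BiLSTM-CRF": (98.91, 98.00, 98.45),
}

# Absolute gap between the recomputed and the reported F-score per model.
gaps = {
    model: abs(f_score(p, r) - reported)
    for model, (p, r, reported) in table4.items()
}
```

Every gap stays within rounding error, which supports reading the first metric column as precision rather than overall accuracy.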
Entity type               Entity label   Precision  Recall   F-score
Doctor                    NAM_DOCTOR     98.60      99.59    99.09
Patient                   NAM_PATIENT    98.29      98.06    98.17
City                      LOC_CITY       98.34      96.42    97.37
County/district           LOC_COUNTY     98.86      97.34    98.09
Province                  LOC_PROVINCE   99.12      99.41    99.27
Street                    LOC_STREET     99.65      99.73    99.69
Date                      DAT            98.88      99.52    99.20
Age                       AGE            73.68      95.45    83.17
Medical institution name  ORG            98.46      99.39    98.92
Telephone                 CON_TELEPHONE  98.57      100.00   99.28
Admission number          NUM_ADMISSION  93.19      95.50    94.33
Ultrasound number         NUM_B          100        100      100
Pathology number          NUM_PATHOLOGY  100        87.39    93.27
X-ray number              NUM_X-RAY      64.71      64.71    64.71
Doctor number             NUM-DOCTOR     98.43      99.65    99.04
Table 5  Evaluation results of the BiLSTM-CRF model for each entity type (%)
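Per-label figures like those in Table 5 are typically computed at the entity level. The sketch below shows one common way to do this, assuming exact span matching: a prediction counts as correct only if its start, end, and label all match a gold entity. The matching criterion, the span format, and the toy data are assumptions; the page does not specify the paper's scoring procedure.

```python
from collections import Counter


def evaluate(gold, pred):
    """Per-label (precision, recall, F-score) under exact span matching.

    gold, pred: sets of (start, end, label) tuples.
    """
    true_pos = Counter(label for (_, _, label) in gold & pred)
    n_gold = Counter(label for (_, _, label) in gold)
    n_pred = Counter(label for (_, _, label) in pred)
    results = {}
    for label in sorted(set(n_gold) | set(n_pred)):
        p = true_pos[label] / n_pred[label] if n_pred[label] else 0.0
        r = true_pos[label] / n_gold[label] if n_gold[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        results[label] = (p, r, f)
    return results


# Invented gold and predicted spans for illustration.
gold = {
    (0, 3, "NAM_PATIENT"),
    (10, 15, "DAT"),
    (20, 26, "DAT"),
    (30, 38, "NUM_X-RAY"),
}
pred = {
    (0, 3, "NAM_PATIENT"),
    (10, 15, "DAT"),
    (20, 25, "DAT"),            # boundary is off by one character
    (30, 38, "NUM_ADMISSION"),  # right span, wrong label
}
scores = evaluate(gold, pred)
```

With so few gold examples for a label such as `NUM_X-RAY`, a single mislabeled span moves the per-label score sharply, so the rare-class rows deserve more caution than the frequent ones.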
Error type   Type error   Boundary error   False negative   False positive
Share        9.80         15.03            66.67            8.50
Table 6  Composition of BiLSTM-CRF model errors by type (%)
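The four error categories in Table 6 can be illustrated with a simple span comparison. The definitions used below are our working assumptions, since this page does not give the paper's exact criteria: a type error has the right boundaries but the wrong label, a boundary error overlaps a gold entity without matching its boundaries, a false positive overlaps no gold entity, and a false negative is a gold entity that no prediction overlaps.

```python
def overlaps(a, b):
    """True if two (start, end, ...) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def classify_errors(gold, pred):
    """Count errors by category under the assumed definitions above."""
    counts = {"type": 0, "boundary": 0, "false_negative": 0, "false_positive": 0}
    gold_bounds = {(start, end) for start, end, _ in gold}
    for p in pred:
        if p in gold:
            continue                          # exact match, not an error
        if (p[0], p[1]) in gold_bounds:
            counts["type"] += 1               # same span, different label
        elif any(overlaps(p, g) for g in gold):
            counts["boundary"] += 1           # partial overlap with a gold span
        else:
            counts["false_positive"] += 1     # nothing there in the gold data
    for g in gold:
        if not any(overlaps(g, p) for p in pred):
            counts["false_negative"] += 1     # gold entity missed entirely
    return counts


# Invented gold and predicted spans for illustration.
gold = {
    (0, 3, "NAM_PATIENT"),
    (10, 15, "DAT"),
    (20, 26, "DAT"),
    (30, 38, "NUM_X-RAY"),
    (40, 45, "ORG"),
}
pred = {
    (0, 3, "NAM_PATIENT"),  # correct
    (10, 15, "AGE"),        # type error
    (20, 25, "DAT"),        # boundary error
    (50, 55, "DAT"),        # false positive
}
errors = classify_errors(gold, pred)
total_errors = sum(errors.values())
error_shares = {kind: 100 * n / total_errors for kind, n in errors.items()}
```

For de-identification, false negatives are the costliest category because a missed entity leaves PHI in the released text.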
[1] Demner-Fushman D, Chapman W W, McDonald C J. What Can Natural Language Processing do for Clinical Decision Support?[J]. Journal of Biomedical Informatics, 2009,42(5):760-772.
doi: 10.1016/j.jbi.2009.08.007 pmid: 19683066
[2] Wagholikar K B, Maclaughlin K L, Henry M R, et al. Clinical Decision Support with Automated Text Processing for Cervical Cancer Screening[J]. Journal of the American Medical Informatics Association, 2012,19(5):833-839.
doi: 10.1136/amiajnl-2012-000820 pmid: 22542812
[3] Weng C H, Wu X Y, Luo Z H, et al. EliXR: An Approach to Eligibility Criteria Extraction and Representation[J]. Journal of the American Medical Informatics Association, 2011(S1):116-124.
[4] Stubbs A, Uzuner O. Annotating Longitudinal Clinical Narratives for De-identification: The 2014 i2b2/UTHealth Corpus[J]. Journal of Biomedical Informatics, 2015,58:S20-S29.
doi: 10.1016/j.jbi.2015.07.020 pmid: 26319540
[5] Tucker K, Branson J, Dilleen M, et al. Protecting Patient Privacy When Sharing Patient-Level Data from Clinical Trials[J]. BMC Medical Research Methodology, 2016, 16(S1): Article 77.
[6] McGraw D. Building Public Trust in Uses of Health Insurance Portability and Accountability Act De-Identified Data[J]. Journal of the American Medical Informatics Association, 2013,20(1):29-34.
doi: 10.1136/amiajnl-2012-000936 pmid: 22735615
[7] Dernoncourt F, Lee J Y, Uzuner O, et al. De-Identification of Patient Notes with Recurrent Neural Networks[J]. Journal of the American Medical Informatics Association, 2017,24(3):596-606.
doi: 10.1093/jamia/ocw156 pmid: 28040687
[8] Jian Z, Guo X S, Liu S J, et al. A Cascaded Approach for Chinese Clinical Text De-Identification with Less Annotation Effort[J]. Journal of Biomedical Informatics, 2017,73:76-83.
doi: 10.1016/j.jbi.2017.07.017 pmid: 28756160
[9] Meystre S M, Friedlin F J, South B R, et al. Automatic De-Identification of Textual Documents in the Electronic Health Record: A Review of Recent Research[J]. BMC Medical Research Methodology, 2010,10(1):1-16.
doi: 10.1186/1471-2288-10-1
[10] Sundheim B M. Named Entity Task Definition, Version 2.1[C]//Proceedings of the 6th Message Understanding Conference. 1995.
[11] Chen L, Yang J J, Wang Q. Privacy-Preserving Data Publishing for Free Text Chinese Electronic Medical Records[C]//Proceedings of the 2012 IEEE 36th Annual Computer Software and Applications Conference. 2012: 567-572.
[12] Ford E, Carrol J A, Smith H E, et al. Extracting Information from the Text of Electronic Medical Records to Improve Case Detection: A Systematic Review[J]. Journal of the American Medical Informatics Association, 2016,23(5):1007-1015.
doi: 10.1093/jamia/ocv180 pmid: 26911811
[13] Han Xu. Research on Key Technologies of Text Feature Representation Based on Neural Network[D]. Beijing: Beijing University of Posts and Telecommunications, 2019. (in Chinese)
[14] Gu Yi. Research on Complex Chinese Named Entity Recognition Based on BiLSTM-CRF[D]. Nanjing: Nanjing University, 2019. (in Chinese)
[15] Ji B, Liu R, Li S S, et al. A Hybrid Approach for Named Entity Recognition in Chinese Electronic Medical Record[J]. BMC Medical Informatics and Decision Making, 2019,19(S2):64.
doi: 10.1186/s12911-019-0767-2
[16] Chen Shudong, Ouyang Xiaoye. Overview of Named Entity Recognition Technology[J]. Radio Communications Technology, 2020,46(3):251-260. (in Chinese)
[17] Yang X, Bian J, Gong Y, et al. MADEx: A System for Detecting Medications, Adverse Drug Events, and Their Relations from Clinical Notes[J]. Drug Safety, 2019,42(1):123-133.
doi: 10.1007/s40264-018-0761-0 pmid: 30600484
[18] Shen Zhan. Named Entity Recognition for Chinese Electronic Record with Neural Network[D]. Beijing: Beijing University of Posts and Telecommunications, 2018. (in Chinese)
[19] Ji B, Liu R, Li S S, et al. A BiLSTM-CRF Method to Chinese Electronic Medical Record Named Entity Recognition[C]//Proceedings of the 2018 International Conference on Algorithms, Computing and Artificial Intelligence. ACM, 2018.
[20] Ji B, Li S S, Yu J, et al. Research on Chinese Medical Named Entity Recognition Based on Collaborative Cooperation of Multiple Neural Network Models[J]. Journal of Biomedical Informatics, 2020,104:103395.
doi: 10.1016/j.jbi.2020.103395 pmid: 32109551
[21] Pan Cuiran, Wang Qinghua, Tang Buzhou, et al. Chinese Electronic Medical Record Named Entity Recognition Based on Sentence-Level Lattice-Long Short-Term Memory Neural Network[J]. Academic Journal of Second Military Medical University, 2019,40(5):497-506. (in Chinese)
[22] Cao Chunping, Guan Pengju. Clinical Text Named Entity Recognition Based on E-CNN and BLSTM-CRF[J]. Computer Application Research, 2019,36(12):3748-3751. (in Chinese)
[23] Luo L, Yang Z H, Yang P, et al. An Attention-based BiLSTM-CRF Approach to Document-level Chemical Named Entity Recognition[J]. Bioinformatics, 2018,34(8):1381-1388.
doi: 10.1093/bioinformatics/btx761 pmid: 29186323
[24] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997,9(8):1735-1780.
doi: 10.1162/neco.1997.9.8.1735 pmid: 9377276
[25] Du L, Xia C, Deng Z, et al. A Machine Learning Based Approach to Identify Protected Health Information in Chinese Clinical Text[J]. International Journal of Medical Informatics, 2018,116:24-32.
doi: 10.1016/j.ijmedinf.2018.05.010 pmid: 29887232
[26] Du Liting. Research on Clinical Text Data Information Mining De-Identification Technology[D]. Wuhan: Huazhong University of Science and Technology, 2018. (in Chinese)
[27] Wu Hui, Lv Li, Yu Bihui. Chinese Named Entity Recognition Based on Transfer Learning and BiLSTM-CRF[J]. Journal of Chinese Computer Systems, 2019,40(6):1142-1147. (in Chinese)
[28] Li X Q, Shi T Y, Li P, et al. BiLSTM-CRF Model for Named Entity Recognition in Railway Accident and Fault Analysis Report[C]//Proceedings of the Asia-Pacific Conference on Intelligent Medical 2018 & International Conference on Transportation and Traffic Engineering 2018. 2018:1-5.
[29] Arellano A M, Dai W R, Wang S, et al. Privacy Policy and Technology in Biomedical Data Science[J]. Annual Review of Biomedical Data Science, 2018,1:115-129.