Data Analysis and Knowledge Discovery, 2019, Vol. 3, Issue 7: 123-132. DOI: 10.11925/infotech.2096-3467.2018.1454
Application Paper
Annotating Chinese E-Medical Record for Knowledge Discovery *
Jiahui Hu, An Fang, Wanqing Zhao, Chenliu Yang, Huiling Ren
Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China
Abstract

[Objective] This paper studies an annotation method for Chinese electronic medical records, aiming to improve the analysis and processing of clinical texts and to promote clinical knowledge discovery. [Methods] We proposed an annotation approach for Chinese electronic medical records and built a visual interactive platform. Then, based on the character and word features of the record text, we carried out an empirical study of clinical named entity recognition using natural language processing and machine learning methods. [Results] A total of 700 annotated records were obtained. The overall F value of the Pipeline-based annotation method reached 0.8772, a 32.9% improvement over named entity recognition based on the originally annotated medical record dataset. [Limitations] Because electronic medical records contain privacy-sensitive information, the experiments were conducted on open evaluation data, and the corpus size was limited. [Conclusions] The proposed annotation method and the annotation platform built in this study are suitable for clinical text processing and can promote the knowledge association of clinical medical text resources.

Key words: Chinese Electronic Medical Record; Text Annotation; Natural Language Processing; Machine Learning; Knowledge Discovery
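Received: 2018-12-24
CLC number: TP391
Funding: *This work is one of the research outcomes of the project "Research on Semantic Annotation Methods for Chinese Electronic Medical Records Oriented to Knowledge Discovery" (2018PT33005), funded by the Chinese Academy of Medical Sciences Fundamental Research Funds for Central Public Welfare Research Institutes, and of the collaborative innovation team project "Research on the Construction of a Chinese Clinical Medical Terminology System" (2017-I2M-3-014), funded by the Chinese Academy of Medical Sciences Innovation Fund for Medical Sciences.
Corresponding author: An Fang     E-mail: fang.an@imicams.ac.cn
Cite this article:
Jiahui Hu, An Fang, Wanqing Zhao, Chenliu Yang, Huiling Ren. Annotating Chinese E-Medical Record for Knowledge Discovery *[J]. Data Analysis and Knowledge Discovery, 2019, 3(7): 123-132. DOI: 10.11925/infotech.2096-3467.2018.1454.
Link to this article: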
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1454
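Figure 1  Overall approach
Figure 2  Platform architecture
Figure 3  Chinese electronic medical record annotation platform
Figure 4  Character-based CRF feature template
Figure 5  Word-based CRF feature template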
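The feature templates themselves appear only in Figures 4 and 5 and are not reproduced on this page. As a rough illustration of what a character-based CRF template typically encodes, the sketch below builds context-window features for each character of a sentence. The window size of 2, the feature names, the helper `char_window_features`, and the example sentence are all assumptions made for illustration; they are not the authors' templates or data.

```python
from typing import Dict, List

PAD = "<PAD>"  # placeholder for positions outside the sentence


def char_window_features(chars: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Return unigram and bigram context features for the character at index i."""
    def at(j: int) -> str:
        return chars[j] if 0 <= j < len(chars) else PAD

    feats: Dict[str, str] = {}
    # Unigram features: the character at each offset in [-window, window].
    for off in range(-window, window + 1):
        feats[f"c[{off}]"] = at(i + off)
    # Bigram features: adjacent character pairs inside the same window.
    for off in range(-window, window):
        feats[f"c[{off}]c[{off + 1}]"] = at(i + off) + at(i + off + 1)
    return feats


# Invented example sentence ("the patient has had abdominal pain for three days").
sentence = list("患者腹痛三天")
features = [char_window_features(sentence, i) for i in range(len(sentence))]
print(features[2])
```

A word-based variant would apply the same windowing to a segmented word sequence instead of single characters.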
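Dataset        Symptoms and signs   Examinations and tests   Treatment   Diseases and diagnoses   Body parts   Total
Training set   6,486                7,987                    853         515                      8,942        24,783
Test set       1,345                1,559                    195         207                      1,777        5,083
Table 1  Statistics of the five types of medical entities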
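No.   Code          Entity category
1     B-SYMPTOM     Symptoms and signs
2     E-SYMPTOM     Symptoms and signs
3     M-SYMPTOM     Symptoms and signs
4     S-SYMPTOM     Symptoms and signs
5     B-CHECK       Examinations and tests
6     E-CHECK       Examinations and tests
7     M-CHECK       Examinations and tests
8     S-CHECK       Examinations and tests
9     B-TREATMENT   Treatment
10    E-TREATMENT   Treatment
11    M-TREATMENT   Treatment
12    S-TREATMENT   Treatment
13    B-DISEASE     Diseases and diagnoses
14    E-DISEASE     Diseases and diagnoses
15    M-DISEASE     Diseases and diagnoses
16    S-DISEASE     Diseases and diagnoses
17    B-BODY        Body parts
18    E-BODY        Body parts
19    M-BODY        Body parts
20    S-BODY        Body parts
21    O             Non-medical entity
Table 2  BEMS encoding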
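To make the tag set in Table 2 concrete, the following sketch converts character-offset entity spans into one tag per character. It assumes the usual reading of the prefixes (B = first character, M = middle character, E = last character, S = single-character entity), which this page does not spell out. The function name `bems_encode`, the span format, and the example strings are hypothetical.

```python
from typing import List, Tuple

# (start, end_exclusive, category) over the character sequence
Span = Tuple[int, int, str]


def bems_encode(text: str, spans: List[Span]) -> List[str]:
    """Assign one Table 2 tag to every character; spans must not overlap."""
    tags = ["O"] * len(text)                  # default: non-medical entity
    for start, end, category in spans:
        if end - start == 1:
            tags[start] = f"S-{category}"     # single-character entity
            continue
        tags[start] = f"B-{category}"         # first character
        tags[end - 1] = f"E-{category}"       # last character
        for k in range(start + 1, end - 1):
            tags[k] = f"M-{category}"         # characters in between
    return tags


# Invented examples, not taken from the evaluation corpus.
two_char = bems_encode("患者腹痛三天", [(2, 4, "SYMPTOM")])
three_char = bems_encode("高血压", [(0, 3, "DISEASE")])
one_char = bems_encode("胃", [(0, 1, "BODY")])
print(two_char, three_char, one_char)
```

With five entity categories and four position prefixes, plus `O`, this yields the 21 labels listed in Table 2.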
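Metric    Symptoms and signs   Examinations and tests   Treatment   Diseases and diagnoses   Body parts   Overall
P         0.9898               0.9554                   0.9588      0.9703                   0.9237       0.9531
R         0.9864               0.8233                   0.9555      0.9515                   0.9358       0.9138
F value   0.9881               0.8845                   0.9571      0.9608                   0.9297       0.9331
Table 3  Experimental results on the training set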
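Metric    Symptoms and signs   Examinations and tests   Treatment   Diseases and diagnoses   Body parts   Overall
P         0.9439               0.9091                   0.7945      0.7772                   0.8419       0.8860
R         0.9636               0.7505                   0.5949      0.6908                   0.8149       0.8210
F value   0.9536               0.8222                   0.6804      0.7315                   0.8281       0.8522
Table 4  Experimental results on the test set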
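The F values in Tables 3 and 4 are consistent with the standard harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below recomputes F from the P and R rows of Table 4 and prints it next to the reported value. The function name `f_value` is an assumption, and small differences in the last digit are expected because the published P and R are already rounded.

```python
def f_value(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) copied from Table 4
table4 = {
    "Symptoms and signs": (0.9439, 0.9636, 0.9536),
    "Examinations and tests": (0.9091, 0.7505, 0.8222),
    "Treatment": (0.7945, 0.5949, 0.6804),
    "Diseases and diagnoses": (0.7772, 0.6908, 0.7315),
    "Body parts": (0.8419, 0.8149, 0.8281),
    "Overall": (0.8860, 0.8210, 0.8522),
}

for name, (p, r, reported) in table4.items():
    print(f"{name}: computed {f_value(p, r):.4f}, reported {reported:.4f}")
```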
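Measure   Symptoms and signs   Examinations and tests   Treatment   Diseases and diagnoses   Body parts
Overlap   0.6148               0.3263                   0.1181      0.1803                   0.2423
Table 5  Entity overlap between the test set and the training set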
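This page does not state how the overlap in Table 5 is computed. One plausible reading, assumed here, is the share of distinct test-set entity strings in a category that also occur in the training set for the same category. The sketch below implements that reading on invented toy data; the function name `overlap_by_category` and the mention format are hypothetical, and the authors' definition may differ (for example, it could count mentions rather than distinct strings).

```python
from typing import Dict, Iterable, Set, Tuple

Mention = Tuple[str, str]  # (entity text, category)


def overlap_by_category(train: Iterable[Mention], test: Iterable[Mention]) -> Dict[str, float]:
    """Fraction of distinct test entities per category that also appear in training."""
    train_sets: Dict[str, Set[str]] = {}
    test_sets: Dict[str, Set[str]] = {}
    for text, cat in train:
        train_sets.setdefault(cat, set()).add(text)
    for text, cat in test:
        test_sets.setdefault(cat, set()).add(text)
    return {
        cat: len(ents & train_sets.get(cat, set())) / len(ents)
        for cat, ents in test_sets.items()
    }


# Invented toy mentions, not corpus data.
train = [("腹痛", "SYMPTOM"), ("发热", "SYMPTOM"), ("胃", "BODY")]
test = [("腹痛", "SYMPTOM"), ("咳嗽", "SYMPTOM"), ("肝", "BODY"), ("腹痛", "SYMPTOM")]
result = overlap_by_category(train, test)
print(result)
```

In Tables 4 and 5, the two categories with the lowest overlap, treatment (0.1181) and diseases and diagnoses (0.1803), are also the two with the lowest test-set F values (0.6804 and 0.7315).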
Figure 6  F values on different datasets