Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (7): 123-132     https://doi.org/10.11925/infotech.2096-3467.2018.1454
Application Paper
Annotating Chinese E-Medical Record for Knowledge Discovery *
Jiahui Hu, An Fang, Wanqing Zhao, Chenliu Yang, Huiling Ren
Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China
Abstract

[Objective] This paper studies an annotation method for Chinese electronic medical records, aiming to improve the analysis and processing of clinical texts and to promote clinical knowledge discovery. [Methods] We propose an annotation approach for Chinese electronic medical records and build a visual interactive platform. Based on the character and word features of the record texts, we combine natural language processing and machine learning methods in an empirical study of clinical named entity recognition. [Results] A total of 700 annotated records were obtained. The overall F value of the Pipeline-based annotation method reached 0.8772, a 32.9% improvement in named entity recognition over the result based on the originally annotated record dataset. [Limitations] Because electronic medical records contain privacy-sensitive information, the experiments were conducted on open evaluation data, and the corpus size was limited. [Conclusions] The proposed annotation method and the annotation platform built in this study are suitable for clinical text processing and can promote the knowledge association of clinical medical text resources.

Key words: Chinese Electronic Medical Record    Text Annotation    Natural Language Processing    Machine Learning    Knowledge Discovery
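Received: 2018-12-24      Published: 2019-09-06
Chinese Library Classification (ZTFLH):  TP391
Funding: *This work is one of the research outcomes of the Chinese Academy of Medical Sciences Fundamental Research Funds for Central Public Welfare Research Institutes project "Research on Semantic Annotation Methods for Chinese Electronic Medical Records for Knowledge Discovery" (2018PT33005) and the Chinese Academy of Medical Sciences Innovation Fund for Medical Sciences collaborative innovation team project "Research on the Construction of a Chinese Clinical Medical Terminology System" (2017-I2M-3-014).
Corresponding author: An Fang     E-mail: fang.an@imicams.ac.cn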
Cite this article:
Jiahui Hu, An Fang, Wanqing Zhao, Chenliu Yang, Huiling Ren. Annotating Chinese E-Medical Record for Knowledge Discovery. Data Analysis and Knowledge Discovery, 2019, 3(7): 123-132.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1454      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I7/123
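  Overall approach
  Platform architecture
  Chinese electronic medical record annotation platform
  Character-based CRF feature template
  Word-based CRF feature template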
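The two feature templates named above are not reproduced on this page, so the following is only a minimal sketch of what a context-window template for a CRF tagger commonly looks like, not the authors' actual template. The window size, the feature names (`u[...]`, `b[...]`), the `<PAD>` symbol, and the sample text are all assumptions made for illustration. The same function can be applied to a list of characters (character-based variant) or to a list of segmented words (word-based variant, with segmentation done elsewhere).

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sequence


def window_features(tokens: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Unigram and adjacent-bigram features in a +/- `window` context around position i."""

    def at(j: int) -> str:
        # Return the token at j, or the padding symbol when j is out of range.
        return tokens[j] if 0 <= j < len(tokens) else PAD

    feats = {"bias": "1"}
    # Unigram features: the token at each relative offset.
    for off in range(-window, window + 1):
        feats[f"u[{off}]"] = at(i + off)
    # Bigram features: each pair of neighbouring tokens inside the window.
    for off in range(-window, window):
        feats[f"b[{off},{off + 1}]"] = at(i + off) + "/" + at(i + off + 1)
    return feats


def sequence_features(tokens: List[str], window: int = 2) -> List[Dict[str, str]]:
    """Feature dictionaries for every position of one sequence."""
    return [window_features(tokens, i, window) for i in range(len(tokens))]


# Character-based usage: split an illustrative (invented) phrase into characters.
chars = list("患者腹痛")
char_feats = sequence_features(chars)
print(char_feats[2])
```

Feature dictionaries of this shape are the usual input to dictionary-based CRF toolkits; which toolkit and template the study used is not stated here.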
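Dataset        Symptoms and signs   Examinations and tests   Treatment   Diseases and diagnoses   Body parts   Total
Training set   6486                 7987                     853         515                      8942         24783
Test set       1345                 1559                     195         207                      1777         5083
  Statistics of the 5 categories of medical entities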
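No.   Code          Entity category
1     B-SYMPTOM     Symptoms and signs
2     E-SYMPTOM     Symptoms and signs
3     M-SYMPTOM     Symptoms and signs
4     S-SYMPTOM     Symptoms and signs
5     B-CHECK       Examinations and tests
6     E-CHECK       Examinations and tests
7     M-CHECK       Examinations and tests
8     S-CHECK       Examinations and tests
9     B-TREATMENT   Treatment
10    E-TREATMENT   Treatment
11    M-TREATMENT   Treatment
12    S-TREATMENT   Treatment
13    B-DISEASE     Diseases and diagnoses
14    E-DISEASE     Diseases and diagnoses
15    M-DISEASE     Diseases and diagnoses
16    S-DISEASE     Diseases and diagnoses
17    B-BODY        Body parts
18    E-BODY        Body parts
19    M-BODY        Body parts
20    S-BODY        Body parts
21    O             Non-medical entity
  BEMS encoding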
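As an illustration of how the 21 codes above can be assigned, the sketch below converts character-offset entity spans into one tag per character. It assumes the usual reading of the scheme (B = first character of an entity, M = middle, E = last, S = single-character entity, O = outside any entity); the function name, the span format, and the sample sentence with its annotations are hypothetical and not taken from the corpus.

```python
from typing import List, Tuple

# (start, end_exclusive, category), with category one of
# SYMPTOM, CHECK, TREATMENT, DISEASE, BODY as in the table above.
Span = Tuple[int, int, str]


def bems_tags(text: str, spans: List[Span]) -> List[str]:
    """Assign one BEMS tag to each character of `text` from non-overlapping spans."""
    tags = ["O"] * len(text)
    for start, end, category in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span out of range: {(start, end, category)}")
        if end - start == 1:
            # A one-character entity gets the single tag.
            tags[start] = f"S-{category}"
        else:
            tags[start] = f"B-{category}"
            for k in range(start + 1, end - 1):
                tags[k] = f"M-{category}"
            tags[end - 1] = f"E-{category}"
    return tags


# Invented example: "头痛" as a symptom, "血常规" as an examination.
text = "头痛,查血常规"
spans = [(0, 2, "SYMPTOM"), (4, 7, "CHECK")]
tags = bems_tags(text, spans)
for ch, tag in zip(text, tags):
    print(ch, tag)
```

With five categories and four positional prefixes plus `O`, this yields exactly the 21 labels listed in the table.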
          Symptoms and signs   Examinations and tests   Treatment   Diseases and diagnoses   Body parts   Overall
P         0.9898               0.9554                   0.9588      0.9703                   0.9237       0.9531
R         0.9864               0.8233                   0.9555      0.9515                   0.9358       0.9138
F value   0.9881               0.8845                   0.9571      0.9608                   0.9297       0.9331
  Experimental results on the training set
          Symptoms and signs   Examinations and tests   Treatment   Diseases and diagnoses   Body parts   Overall
P         0.9439               0.9091                   0.7945      0.7772                   0.8419       0.8860
R         0.9636               0.7505                   0.5949      0.6908                   0.8149       0.8210
F value   0.9536               0.8222                   0.6804      0.7315                   0.8281       0.8522
  Experimental results on the test set
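The page does not define the F value, but the reported figures are consistent with the balanced F measure, F = 2PR / (P + R). The short check below, written under that assumption, recomputes F from the test-set P and R values in the table above and compares the result with the reported F value; the dictionary keys reuse the category codes from the BEMS table, and small differences are expected because P and R are themselves rounded to four decimals.

```python
def f_value(precision: float, recall: float) -> float:
    """Balanced F measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F value) copied from the test-set results table.
test_set = {
    "SYMPTOM": (0.9439, 0.9636, 0.9536),
    "CHECK": (0.9091, 0.7505, 0.8222),
    "TREATMENT": (0.7945, 0.5949, 0.6804),
    "DISEASE": (0.7772, 0.6908, 0.7315),
    "BODY": (0.8419, 0.8149, 0.8281),
    "OVERALL": (0.8860, 0.8210, 0.8522),
}

# Absolute difference between the recomputed and the reported F value.
differences = {
    name: abs(f_value(p, r) - reported)
    for name, (p, r, reported) in test_set.items()
}
for name, diff in differences.items():
    print(f"{name}: difference from reported F = {diff:.5f}")
```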
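          Symptoms and signs   Examinations and tests   Treatment   Diseases and diagnoses   Body parts
Overlap   0.6148               0.3263                   0.1181      0.1803                   0.2423
  Entity overlap between the test set and the training set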
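How the overlap was computed is not stated on this page. One plausible reading, used here purely as an assumption, is the share of distinct test-set entity strings of a category that also occur among the training-set entities of that category. The sketch below implements that reading on invented toy data; it does not reproduce the figures in the table.

```python
from typing import Iterable


def entity_overlap(train_entities: Iterable[str], test_entities: Iterable[str]) -> float:
    """Fraction of distinct test-set entities that also appear in the training set.

    This definition is an assumption made for illustration, not the paper's
    stated formula.
    """
    train = set(train_entities)
    test = set(test_entities)
    if not test:
        return 0.0
    return len(test & train) / len(test)


# Invented toy entity lists for a single category.
train_symptoms = ["头痛", "腹痛", "发热"]
test_symptoms = ["腹痛", "咳嗽", "发热", "胸闷"]
overlap = entity_overlap(train_symptoms, test_symptoms)
print(overlap)
```

Under this reading, a low overlap means most test-set entities were never seen during training, which is one way to interpret the lower test-set scores for the treatment and disease categories.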
  F values on different datasets