Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (7): 123-132    DOI: 10.11925/infotech.2096-3467.2018.1454
Annotating Chinese E-Medical Record for Knowledge Discovery
Jiahui Hu, An Fang, Wanqing Zhao, Chenliu Yang, Huiling Ren
Institute of Medical Information, Chinese Academy of Medical Sciences, Beijing 100020, China

[Objective] This paper studies an annotation method for Chinese electronic medical records, aiming to improve the processing of massive clinical texts and to support clinical knowledge discovery. [Methods] First, we proposed an annotation method for Chinese electronic medical records and built a visual, interactive platform. Then, based on the word and phrase features of these records, we identified medical named entities with natural language processing and machine learning approaches. [Results] A total of 700 annotated records were obtained, and the overall F value of the pipeline-based annotation method reached 0.8772, which was 32.9% higher than that based on the original medical records. [Limitations] Because electronic medical records contain sensitive private information, this study was conducted on an open dataset, and the corpus size was limited. [Conclusions] The annotation method and platform for Chinese electronic medical records constructed in this study can effectively process clinical texts and support the association of medical knowledge.

Key words: Chinese Electronic Medical Record; Text Annotation; Natural Language Processing; Machine Learning; Knowledge Discovery
Received: 24 December 2018      Published: 06 September 2019
CLC number: TP391
Corresponding Author: An Fang

Cite this article:

Jiahui Hu, An Fang, Wanqing Zhao, Chenliu Yang, Huiling Ren. Annotating Chinese E-Medical Record for Knowledge Discovery. Data Analysis and Knowledge Discovery, 2019, 3(7): 123-132.


Dataset       Symptoms and signs  Examinations and tests  Treatment  Diseases and diagnoses  Body parts  Total
Training set  6,486               7,987                   853        515                     8,942       24,783
Test set      1,345               1,559                   195        207                     1,777       5,083

No.  Code       Entity category
1    B-SYMPTOM  Symptoms and signs
5    B-CHECK    Examinations and tests
13   B-DISEASE  Diseases and diagnoses
17   B-BODY     Body parts
21   O          Non-medical entity

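
The codes listed above step by four between entity categories (1, 5, 13, 17) and end with 21 for the non-entity label, which is consistent with four position tags per category. The sketch below rebuilds such a numbering under that reading. The position prefixes I, E and S and the category name TREATMENT do not appear in the table and are assumptions made only for illustration.

```python
# Entity categories in the order implied by the code table.
# "TREATMENT" is an assumed name: the table fragment lists no treatment row.
ENTITY_TYPES = ["SYMPTOM", "CHECK", "TREATMENT", "DISEASE", "BODY"]

# Assumed position prefixes (begin, inside, end, single).
# Only the B- rows are visible in the table.
POSITIONS = ["B", "I", "E", "S"]


def build_label_codes(entity_types, positions):
    """Assign consecutive integer codes to every position/category label.

    Codes start at 1 and the non-entity label "O" receives the last code.
    """
    codes = {}
    next_code = 1
    for entity_type in entity_types:
        for position in positions:
            codes[f"{position}-{entity_type}"] = next_code
            next_code += 1
    codes["O"] = next_code
    return codes


LABEL_CODES = build_label_codes(ENTITY_TYPES, POSITIONS)
```

Under these assumptions, LABEL_CODES maps B-SYMPTOM, B-CHECK, B-DISEASE, B-BODY and O to the same codes as the table; the remaining entries are hypothetical.
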
Metric   Symptoms and signs  Examinations and tests  Treatment  Diseases and diagnoses  Body parts  Overall
P        0.9898              0.9554                  0.9588     0.9703                  0.9237      0.9531
R        0.9864              0.8233                  0.9555     0.9515                  0.9358      0.9138
F value  0.9881              0.8845                  0.9571     0.9608                  0.9297      0.9331

Metric   Symptoms and signs  Examinations and tests  Treatment  Diseases and diagnoses  Body parts  Overall
P        0.9439              0.9091                  0.7945     0.7772                  0.8419      0.8860
R        0.9636              0.7505                  0.5949     0.6908                  0.8149      0.8210
F value  0.9536              0.8222                  0.6804     0.7315                  0.8281      0.8522

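
The F values in the table above agree with the balanced F-measure, the harmonic mean of precision (P) and recall (R). The page itself does not state the formula, so the sketch below is a check under that assumption; the function name f_score is illustrative.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# First column of the table above: P = 0.9439, R = 0.9636
first_column_f = f_score(0.9439, 0.9636)

# Last column of the table above: P = 0.8860, R = 0.8210
last_column_f = f_score(0.8860, 0.8210)
```

Both results match the reported 0.9536 and 0.8522 to within the rounding of the published P and R values.
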
Metric   Symptoms and signs  Examinations and tests  Treatment  Diseases and diagnoses  Body parts
Overlap  0.6148              0.3263                  0.1181     0.1803                  0.2423