Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (8): 110-121    DOI: 10.11925/infotech.2096-3467.2021.1167
Text Semantic Representation with Structure-Function and Entity Recognition: Case Study of Medical Records
Hu Jiming1,2, Qian Wei1,2, Wen Peng3, Lv Xiaoguang4
1School of Information Management, Wuhan University, Wuhan 430072, China
2Information Retrieval and Knowledge Mining Laboratory, Wuhan University, Wuhan 430072, China
3School of Marxism, Wuhan University, Wuhan 430072, China
4Renmin Hospital of Wuhan University, Wuhan 430060, China

[Objective] This paper aims to improve the accuracy of text representation and mining by exploiting the structural and functional information in Chinese medical records. [Methods] First, we propose a semantic representation strategy for Chinese medical record texts based on their structure-function features. Then, we use a BiLSTM-CRF model to recognize named entities, introducing structure information at the word vector level. Finally, we use a TextCNN model to extract local context features, yielding a vector representation with richer semantic content. [Results] The precision, recall and F-measure of the new model reached 93.20%, 95.19% and 94.19%, respectively, and the classification accuracy reached 92.12%. [Limitations] Future research is needed to evaluate the model on more texts and to refine the structure recognition process. [Conclusions] The proposed method effectively improves the accuracy of named entity recognition and enriches the semantic representation of the texts.

Key words: Chinese Medical Records; Text Structure and Function; Named Entity Recognition; Text Semantic Representation; BiLSTM-CRF Model
Received: 14 October 2021      Published: 23 September 2022
ZTFLH:  TP391  
Fund: National Natural Science Foundation of China (71874125); Young Top-notch Talent Cultivation Program of Hubei Province
Corresponding Author: Wen Peng, ORCID: 0000-0002-0278-7391

Cite this article:

Hu Jiming, Qian Wei, Wen Peng, Lv Xiaoguang. Text Semantic Representation with Structure-Function and Entity Recognition: Case Study of Medical Records. Data Analysis and Knowledge Discovery, 2022, 6(8): 110-121.


Scholar | Research perspective | Research approach
Lu et al. [33] | Text blocks |
This paper | Structure-function |
Research Methods of Text Representation Based on Structure Information
Text Representation Framework of Medical Records Based on Structure Function and Entity Recognition
Named Entity Recognition Model Based on Structure Function (CSF-BiLSTM-CRF)
TextCNN Text Representation Model
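No. | Structural module | Connotative functions
1 | Admission status | Chief complaint, past medical history, physical examination findings, main auxiliary examinations
2 | Admission diagnosis | Disease
3 | Course of treatment | Admission examinations, treatment methods, drugs, pathological examination
4 | Discharge status | Chief complaint, physical examination findings
5 | Discharge diagnosis | Disease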
The Text Structure and Connotative Functions of Chinese Medical Records
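
The abstract states that structure information is introduced at the word vector level, but this page does not say how. One plausible reading, shown here only as a minimal sketch, is to append a one-hot indicator of the structural module to each word vector. The concatenation scheme, the English section identifiers, and the names `section_one_hot` and `structure_aware_vector` are all assumptions for illustration and are not taken from the paper.

```python
# Hypothetical identifiers for the five structural modules in the table above.
SECTIONS = [
    "admission_status",
    "admission_diagnosis",
    "course_of_treatment",
    "discharge_status",
    "discharge_diagnosis",
]


def section_one_hot(section):
    """Return a one-hot vector marking which structural module a token came from."""
    if section not in SECTIONS:
        raise ValueError(f"unknown section: {section}")
    return [1.0 if name == section else 0.0 for name in SECTIONS]


def structure_aware_vector(word_vector, section):
    """Concatenate a word vector with its section indicator (assumed fusion scheme)."""
    return list(word_vector) + section_one_hot(section)


# Toy 3-dimensional word vector; a real embedding would be much wider.
toy_vector = [0.12, -0.40, 0.33]
combined = structure_aware_vector(toy_vector, "course_of_treatment")
print(combined)
```

With this scheme, the same word receives a different input vector depending on the module it appears in, which is one way a tagger could tell a drug mentioned in the past history from a drug given during treatment.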
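Entity type | Definition | Examples | Tag
Symptom | Symptoms subjectively described by the patient, found in the chief complaint | Abdominal pain, vomiting, abdominal distension | SYMPTOM
Body part | Anatomical parts or organs of the body | Abdomen, stomach, liver | BODY
Laboratory tests and examinations | Laboratory tests mainly refer to blood, stool and urine indicators; examinations mainly refer to imaging and nuclear medicine results | T (body temperature), gastroscopy, CT | TEST&
Disease | Medical names and abbreviations of diseases, found in the past disease history and in the admission and discharge diagnoses | Gastric cancer, ulcer, hypertension | DISEASE
Sign | Objective abnormal findings on physical examination | Tenderness, rebound tenderness, respiration | SIGN
Treatment | Hemostasis, nutritional support, and names of specific surgical procedures | Chemotherapy, surgery, nutrition | TREATMENT
Drug | Drug names, found in the past disease history, drug allergy history and course of treatment | Oxaliplatin, tegafur-gimeracil-oteracil (替吉奥), Weikangda (维康达) | DRUG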
The Entity Type of Chinese Medical Record
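
A sequence tagger such as BiLSTM-CRF typically emits one label per character, which then has to be grouped into entity spans. The sketch below assumes a character-level BIO scheme (`B-`, `I-`, `O`); the page does not state which labeling scheme the authors used, and the example sentence and the function name `extract_entities` are hypothetical. Only the tag names `SYMPTOM` and `DISEASE` come from the table above.

```python
def extract_entities(chars, tags):
    """Group character-level BIO tags into (text, entity_type) spans."""
    entities = []
    current_chars, current_type = [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity still open.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [ch], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_chars.append(ch)
        else:
            # "O" or an inconsistent "I-" tag closes the open entity.
            if current_chars:
                entities.append(("".join(current_chars), current_type))
            current_chars, current_type = [], None
    if current_chars:
        entities.append(("".join(current_chars), current_type))
    return entities


# Hypothetical sentence: "abdominal pain for three days, diagnosed with gastric cancer".
chars = list("腹痛三天，确诊胃癌")
tags = ["B-SYMPTOM", "I-SYMPTOM", "O", "O", "O", "O", "O", "B-DISEASE", "I-DISEASE"]
entities = extract_entities(chars, tags)
print(entities)
```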
Parameter | Value
Initial learning rate | 1.0
Dropout | 0.5
Hidden layer size | 300
Number of iterations | 50
Batch_size | 32
The Parameter Settings of CSF-BiLSTM-CRF Model
Model | P/% | R/% | F/%
HMM | 86.02 | 73.52 | 79.28
CRF | 82.17 | 85.88 | 83.99
BiLSTM | 81.42 | 78.21 | 79.78
BiLSTM-CRF | 92.39 | 92.51 | 92.48
CSF-BiLSTM-CRF | 93.20 | 95.19 | 94.19
Entity Recognition Results of Different Models
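
The F column can be cross-checked against the P and R columns. The sketch below assumes that F is the balanced F-measure, the harmonic mean of precision and recall; the page does not define it explicitly. The numbers are copied from the table above, and small differences in the last digit are expected because P and R are themselves rounded.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F) in percent, copied from the results table.
reported = {
    "HMM": (86.02, 73.52, 79.28),
    "CRF": (82.17, 85.88, 83.99),
    "BiLSTM": (81.42, 78.21, 79.78),
    "BiLSTM-CRF": (92.39, 92.51, 92.48),
    "CSF-BiLSTM-CRF": (93.20, 95.19, 94.19),
}

recomputed = {model: f_measure(p, r) for model, (p, r, _) in reported.items()}
for model, (_, _, f_reported) in reported.items():
    print(f"{model}: recomputed F = {recomputed[model]:.2f}, reported F = {f_reported:.2f}")
```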
Parameter | Value
Text dimension | 800
Word dimension | 100
Convolution kernel sizes | 3, 4, 5
Dropout | 0.5
Batch_size | 64
Number of iterations | 50
Parameter Settings of TextCNN Model
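
To make these settings concrete, the following pure-Python sketch applies one convolution filter per kernel size and then max-over-time pooling, the core operation of a TextCNN. It assumes that "text dimension 800" means a sequence length of 800 tokens and "word dimension 100" means 100-dimensional word vectors; that reading, the constant toy weights, and the name `conv_max_pool` are assumptions, and a real model would use learned filters in a deep learning framework.

```python
def conv_max_pool(embeddings, kernel):
    """Slide one filter over the token sequence, apply ReLU, keep the maximum.

    Returns the pooled feature and the number of window positions.
    """
    k = len(kernel)  # number of consecutive tokens covered by the filter
    activations = []
    for start in range(len(embeddings) - k + 1):
        window = embeddings[start:start + k]
        total = sum(
            w * x
            for kernel_row, token_vec in zip(kernel, window)
            for w, x in zip(kernel_row, token_vec)
        )
        activations.append(max(0.0, total))  # ReLU
    return max(activations), len(activations)


SEQ_LEN, EMB_DIM = 800, 100   # assumed reading of the table values
KERNEL_SIZES = (3, 4, 5)

# Toy input and filters with constant values, for shape checking only.
embeddings = [[0.01] * EMB_DIM for _ in range(SEQ_LEN)]
positions_by_kernel = {}
features = []
for k in KERNEL_SIZES:
    kernel = [[0.1] * EMB_DIM for _ in range(k)]
    feature, positions = conv_max_pool(embeddings, kernel)
    positions_by_kernel[k] = positions
    features.append(feature)

print(positions_by_kernel)
print(len(features))
```

Each kernel size k yields SEQ_LEN - k + 1 window positions, and pooling reduces them to a single value per filter, so the pooled feature vector has a fixed length regardless of the text length.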
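No. | Text representation method | Acc/% | Class | P/% | R/% | F/%
1 | Doc2Vec + structure (Baseline) | 74.55 | Adenocarcinoma | 72.58 | 64.29 | 68.18
1 | Doc2Vec + structure (Baseline) | 74.55 | Gastric cancer | 75.73 | 82.11 | 78.79
2 | Text vector only | 55.76 | Adenocarcinoma | 58.57 | 48.24 | 52.90
2 | Text vector only | 55.76 | Gastric cancer | 53.68 | 63.75 | 58.29
3 | Text vector + entity structure information | 56.36 | Adenocarcinoma | 58.90 | 50.59 | 54.43
3 | Text vector + entity structure information | 56.36 | Gastric cancer | 54.35 | 62.50 | 58.14
4 | Text vector only (TextCNN) | 87.27 | Adenocarcinoma | 84.81 | 88.16 | 86.45
4 | Text vector only (TextCNN) | 87.27 | Gastric cancer | 89.53 | 86.52 | 88.00
5 | Text vector + ordinary entities (TextCNN) | 90.30 | Adenocarcinoma | 90.54 | 88.16 | 89.33
5 | Text vector + ordinary entities (TextCNN) | 90.30 | Gastric cancer | 90.11 | 92.13 | 91.11
6 | Text vector + entity structure information (TextCNN) | 92.12 | Adenocarcinoma | 95.00 | 89.41 | 92.12
6 | Text vector + entity structure information (TextCNN) | 92.12 | Gastric cancer | 89.41 | 95.00 | 92.12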
Classification Results Under Different Text Representation Methods
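
The Acc column is a single figure per method, while P and R are reported per class. The sketch below shows how both kinds of figure are usually derived from one confusion matrix. The counts are hypothetical and chosen only for illustration; they are not the paper's data, and the function name `accuracy_and_per_class` is likewise an assumption.

```python
def accuracy_and_per_class(confusion, labels):
    """Compute overall accuracy and per-class (precision, recall).

    `confusion` maps (actual_label, predicted_label) to a count.
    """
    total = sum(confusion.values())
    correct = sum(confusion.get((label, label), 0) for label in labels)
    per_class = {}
    for label in labels:
        tp = confusion.get((label, label), 0)
        predicted = sum(confusion.get((other, label), 0) for other in labels)
        actual = sum(confusion.get((label, other), 0) for other in labels)
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        per_class[label] = (precision, recall)
    return correct / total, per_class


labels = ["adenocarcinoma", "gastric_cancer"]
# Hypothetical counts, not taken from the paper.
confusion = {
    ("adenocarcinoma", "adenocarcinoma"): 40,
    ("adenocarcinoma", "gastric_cancer"): 10,
    ("gastric_cancer", "adenocarcinoma"): 5,
    ("gastric_cancer", "gastric_cancer"): 45,
}
accuracy, per_class = accuracy_and_per_class(confusion, labels)
print(f"accuracy = {accuracy:.4f}")
for label, (p, r) in per_class.items():
    print(f"{label}: P = {p:.4f}, R = {r:.4f}")
```

In a two-class setting, every error for one class is also an error for the other, so a misclassified adenocarcinoma record lowers adenocarcinoma recall and gastric cancer precision at the same time.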
[1] Du Lin, Cao Dong, Lin Shuyuan, et al. Extraction and Automatic Classification of TCM Medical Records Based on Attention Mechanism of BERT and Bi-LSTM[J]. Computer Science, 2020, 47(S2): 416-420.
[2] Chinese Information Processing Development Report (2016)[R]. Beijing: Chinese Information Processing Society of China, 2016.
[3] Zhou Zhaotao, Bu Dongbo, Cheng Xueqi. Towards Graph-Based Text Representation[J]. Journal of Chinese Information Processing, 2005, 19(2): 36-43.
[4] Wang Qian, Zeng Jin, Liu Jiawei, et al. Structure Function Recognition of Academic Text Paragraph Based on Deep Learning[J]. Information Science, 2020, 38(3): 64-69.
[5] Ribeiro S, Yao J T, Rezende D A. Discovering IMRaD Structure with Different Classifiers[C]// Proceedings of the 2018 IEEE International Conference on Big Knowledge. 2018: 200-204.
[6] General Administration of Quality Supervision, Inspection and Quarantine of the People’s Republic of China, Standardization Administration of the People’s Republic of China. Format Specification for Electronic Official Document of Party and Government Organs, Part 1: Official Document Structure: GB/T 33476.1-2016[S]. Beijing: Standards Press of China, 2016.
[7] Li Fanshu, Yao Dengfeng. Text Representation and Language Model in Natural Language Processing[C]// Proceedings of the 24th Annual Conference on New Network Technologies and Applications. 2020.
[8] Zhang Y, Jin R, Zhou Z H. Understanding Bag-of-Words Model: A Statistical Framework[J]. International Journal of Machine Learning and Cybernetics, 2010, 1(1-4): 43-52.
doi: 10.1007/s13042-010-0001-0
[9] Salton G, Wong A, Yang C S. A Vector Space Model for Automatic Indexing[J]. Communications of the ACM, 1975, 18(11): 613-620.
doi: 10.1145/361219.361220
[10] McMahon J, Smith F J. A Review of Statistical Language Processing Techniques[J]. Artificial Intelligence Review, 1998, 12: 347-391.
doi: 10.1023/A:1006517723917
[11] Bengio Y, Ducharme R, Vincent P, et al. A Neural Probabilistic Language Model[J]. The Journal of Machine Learning Research, 2003, 3:1137-1155.
[12] Mikolov T, Chen K, Corrado G, et al. Efficient Estimation of Word Representations in Vector Space[OL]. arXiv Preprint, arXiv: 1301.3781.
[13] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and Their Compositionality[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. 2013: 3111-3119.
[14] Devlin J, Chang M, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 17th Conference of the North American Chapter of the Association for Computational Linguistics. 2019: 4171-4186.
[15] Le Q V, Mikolov T. Distributed Representations of Sentences and Documents[C]// Proceedings of the 31st International Conference on Machine Learning. 2014: 1188-1196.
[16] Zhou C T, Sun C L, Liu Z Y, et al. A C-LSTM Neural Network for Text Classification[OL]. arXiv Preprint, arXiv: 1511.08630.
[17] Shen D H, Min M R, Li Y T, et al. Learning Context-Sensitive Convolutional Filters for Text Processing[C]// Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018: 1839-1848.
[18] Wu Hanyu, Yan Jiang, Huang Shaobin, et al. CNN_BiLSTM_Attention Hybrid Model for Text Classification[J]. Computer Science, 2020, 47(S2): 23-27, 34.
[19] Pa T L, Kumari M, Singh T, et al. Semantic Representations in Text Data[J]. International Journal of Grid and Distributed Computing, 2018, 11(9): 65-80.
[20] Nie Weimin, Chen Yongzhou, Ma Jing. A Text Vector Representation Model Merging Multi-granularity Information[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 45-52.
[21] Yu Yan, Chen Lei, Jiang Jinde, et al. Measuring Patent Similarity with Word Embedding and Statistical Features[J]. Data Analysis and Knowledge Discovery, 2019, 3(9): 53-59.
[22] Liu W F, Liu P Y, Yang Y Z, et al. A Embedding Model for Text Classification[J]. Expert Systems, 2019, 36(6): e12460.
[23] Jiang Z L, Gao S, Chen L C. Study on Text Representation Method Based on Deep Learning and Topic Information[J]. Computing, 2020, 102(3): 623-642.
doi: 10.1007/s00607-019-00755-y
[24] Yang Chunxia, Wu Jiajun, Li Xinxu. Text Classification Model Based on Recurrent Neural Network with Entity Information[J]. Journal of Chinese Computer Systems, 2020, 41(12): 2516-2521.
[25] Huang Lu, Zhou Enguo, Li Daifeng. Text Representation Learning Model Based on Attention Mechanism with Task-Specific Information[J]. Data Analysis and Knowledge Discovery, 2020, 4(9): 111-122.
[26] Qin Chenglei, Zhang Chengzhi. Recognizing Structure Functions of Academic Articles with Hierarchical Attention Network[J]. Data Analysis and Knowledge Discovery, 2020, 4(11): 26-42.
[27] Lu Wei, Huang Yong, Cheng Qikai. The Structure Function of Academic Text and Its Classification[J]. Journal of the China Society for Scientific and Technical Information, 2014, 33(9): 979-985.
[28] Huang Yong, Lu Wei, Cheng Qikai. The Structure Function Recognition of Academic Text: Chapter Content Based Recognition[J]. Journal of the China Society for Scientific and Technical Information, 2016, 35(3): 293-300.
[29] Huang Yong, Lu Wei, Cheng Qikai, et al. The Structure Function Recognition of Academic Text: Paragraph-Based Recognition[J]. Journal of the China Society for Scientific and Technical Information, 2016, 35(5): 530-538.
[30] Hu Jiming, Qian Wei, Li Yuwei, et al. Topic Mining and Structured Parse of Policy Text Based on LDA2Vec[J]. Information Science, 2021, 39(10): 11-17.
[31] Laddha A, Joshi S, Shaikh S, et al. Joint Distributed Representation of Text and Structure of Semi-Structured Documents[C]// Proceedings of the 29th on Hypertext and Social Media. 2018: 25-32.
[32] Che Lei, Yang Xiaoping, Wang Liang, et al. Text Structure Oriented Hybrid Hierarchical Attention Networks for Topic Classification[J]. Journal of Chinese Information Processing, 2019, 33(5): 93-102, 112.
[33] Lu Y H, Zhai Y Y, Luo J Y, et al. MLPV: Text Representation of Scientific Papers Based on Structural Information and Doc2Vec[J]. American Journal of Information Science and Technology, 2019, 3(3): 62.
doi: 10.11648/j.ajist.20190303.12
[34] Sun Zhen, Wang Huilin. Overview on the Advance of the Research on Named Entity Recognition[J]. New Technology of Library and Information Service, 2010(6): 42-47.
[35] Goyal A, Gupta V, Kumar M. Recent Named Entity Recognition and Classification Techniques: A Systematic Review[J]. Computer Science Review, 2018, 29: 21-43.
doi: 10.1016/j.cosrev.2018.06.001
[36] Wang Ruojia, Wei Siyi, Wang Jimin. Applied Research on Named Entity Recognition in Chinese Electronic Medical Record Based on BiLSTM-CRF Model[J]. Journal of Library and Data, 2019, 1(2): 53-66.
[37] Lafferty J D, McCallum A, Pereira F C N. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data[C]// Proceedings of the 18th International Conference on Machine Learning. 2001: 282-289.
[38] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
pmid: 9377276
[39] Yi Shixiang, Yin Hongpeng, Zheng Hengyi. Public Security Event Trigger Identification Based on Bidirectional LSTM[J]. Chinese Journal of Engineering, 2019, 41(9): 1201-1207.
[40] Yu Chuanming, Wang Manyi, Lin Hongjun, et al. A Comparative Study of Word Representation Models Based on Deep Learning[J]. Data Analysis and Knowledge Discovery, 2020, 4(8): 28-40.
[41] Zhang J, Chang D. Semi-Supervised Patient Similarity Clustering Algorithm Based on Electronic Medical Records[J]. IEEE Access, 2019, 7: 90705-90714.
doi: 10.1109/ACCESS.2019.2923333
[42] Kim Y. Convolutional Neural Networks for Sentence Classification[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 1746-1751.
[43] Roberts A, Gaizauskas R, Hepple M, et al. Building a Semantically Annotated Corpus of Clinical Texts[J]. Journal of Biomedical Informatics, 2009, 42(5): 950-966.
doi: 10.1016/j.jbi.2008.12.013 pmid: 19535011
[44] China Conference on Knowledge Graph and Semantic Computing. CCKS 2020: Medical Entity and Event Extraction for Chinese Electronic Medical Records (1) Medical Named Entity Recognition[EB/OL]. [2021-04-10].
[45] Wang Lulu, Aishan Wumaier, Tuergen Yibulayin, et al. Uyghur Named Entity Recognition Based on Deep Neural Network[J]. Journal of Chinese Information Processing, 2019, 33(3): 64-70.
[46] Chen Peixin. The Research of Semantic Vector Representations and Modeling Approaches for Text[D]. Hefei: University of Science and Technology of China, 2018.
[47] Jieba[EB/OL]. [2020-08-25].
[48] Řehůřek R. Word2Vec Embeddings[EB/OL]. [2020-08-25].
[49] Lv Lucheng, Han Tao, Zhou Jian, et al. Research on the Method of Chinese Patent Automatic Classification Based on Deep Learning[J]. Library and Information Service, 2020, 64(10): 75-85.
doi: 10.13266/j.issn.0252-3116.2020.10.009
[50] Hu Jiming, Zheng Xiang, Cheng Qikai, et al. Public Opinion Extraction and Focus Presentation in Government Microblog Based on BiLSTM-CRF[J]. Information Studies: Theory & Application, 2021, 44(1): 174-179, 137.
[51] Kowsari K, Meimandi K J, Heidarysafa M, et al. Text Classification Algorithms: A Survey[J]. Information, 2019, 10(4): 150-218.
doi: 10.3390/info10040150