Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (6): 144-157    DOI: 10.11925/infotech.2096-3467.2023.0931
Standardization of Chinese Medical Terminology Based on Multi-Strategy Comparison Learning
Yue Chonghao1,2, Zhang Jian3, Wu Yirong1,2, Li Xiaolong1,4, Hua Sheng1,2, Tong Shunhang1,2, Sun Shuifa1,3
1Yichang Key Laboratory of Intelligent Medicine, Yichang 443002, China
2College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China
3School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121, China
4College of Economics & Management, China Three Gorges University, Yichang 443002, China
Abstract  

[Objective] To address the challenges of short texts, high similarity, and single versus multiple entailment in Chinese medical terminology normalization, this paper proposes a recall-ranking-quantity prediction framework that fuses multi-strategy contrastive learning. [Methods] First, we combined text statistical features with deep semantic features to recall candidate entities and built the candidate set from their similarity scores. Second, for candidate ranking, we used a pre-trained model with contrastive learning strategies to train vector representations of the original terms, the standard entities, and the recalled candidate entities, then re-ranked the candidates by cosine similarity. Third, we updated the vector representations of the original terms through multi-head attention and predicted how many standard entities each original term maps to. Finally, we combined the similarity scores from candidate recall and ranking and selected standard entities according to the predicted quantity. [Results] We evaluated the framework on Yidu-N7k, a Chinese medical terminology normalization dataset. Compared with statistical models and mainstream deep learning models, it achieved an accuracy of 92.17%, at least 0.94 percentage points higher than the pre-trained binary classification baseline model. On a dataset built from 150 expert-labeled mammography examination reports for female breast cancer, its accuracy reached 97.85%, the best among the compared models. [Limitations] The experiments were conducted only on medical datasets; effectiveness in other domains needs further exploration. [Conclusions] Multi-strategy candidate recall takes text information into account comprehensively, which addresses the short-text challenge. Contrastive-learning candidate ranking captures subtle textual differences, which addresses the high-similarity challenge. Quantity prediction with multi-head attention enhances the vector representations, which addresses the single- and multiple-entailment challenge. The proposed method may help advance medical information mining and clinical research.
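
The recall, ranking, and quantity-prediction stages described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: character-bigram Jaccard overlap stands in for the statistical recall features, `rank_scores` is a hypothetical stand-in for the cosine similarities a contrastively trained encoder would produce, `predicted_count` stands in for the output of the quantity-prediction module, and the equal-weight fusion (`alpha = 0.5`) is not taken from the paper.

```python
def char_bigrams(text):
    """Return the set of character bigrams of a term (the whole term if shorter)."""
    return {text[i:i + 2] for i in range(len(text) - 1)} or {text}


def jaccard(a, b):
    """Jaccard overlap between the bigram sets of two strings."""
    sa, sb = char_bigrams(a), char_bigrams(b)
    return len(sa & sb) / len(sa | sb)


def recall_candidates(term, standard_entities, top_n):
    """Stage 1 (recall): keep the top_n standard entities by surface similarity."""
    scored = [(entity, jaccard(term, entity)) for entity in standard_entities]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def select_entities(candidates, rank_scores, predicted_count, alpha=0.5):
    """Later stages: fuse recall and ranking scores, keep predicted_count entities."""
    fused = [
        (entity, alpha * recall_score + (1 - alpha) * rank_scores[entity])
        for entity, recall_score in candidates
    ]
    fused.sort(key=lambda pair: pair[1], reverse=True)
    return [entity for entity, _ in fused[:predicted_count]]


# Toy vocabulary and a toy mention that entails two standard entities.
standard = ["胃切除术", "脾切除术", "肝切除术", "胆囊切除术"]
mention = "胃脾切除术"
candidates = recall_candidates(mention, standard, top_n=3)

# Hypothetical encoder similarities; a real system would compute these.
rank_scores = {"胃切除术": 0.9, "脾切除术": 0.8, "肝切除术": 0.3, "胆囊切除术": 0.2}
selected = select_entities(candidates, rank_scores, predicted_count=2)
print(selected)
```

Separating recall from selection mirrors the staged design: a cheap surface-level filter narrows the vocabulary, and the predicted quantity decides how many of the fused top-scoring candidates are returned.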

Key words: Medical Terminology Normalization; Multi-Strategy Candidate Recall; Contrastive Learning; Breast Cancer Mammography; Examination Report
Received: 21 September 2023      Published: 18 April 2024
CLC Number (ZTFLH): TP393
Fund: National Social Science Fund of China (20BTQ066)
Corresponding Author: Sun Shuifa, ORCID: 0000-0003-0933-152X, E-mail: watersun@hznu.edu.cn

Cite this article:

Yue Chonghao, Zhang Jian, Wu Yirong, Li Xiaolong, Hua Sheng, Tong Shunhang, Sun Shuifa. Standardization of Chinese Medical Terminology Based on Multi-Strategy Comparison Learning. Data Analysis and Knowledge Discovery, 2024, 8(6): 144-157.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2023.0931     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2024/V8/I6/144

Examples from the Chinese Clinical Terminology Normalization Dataset Yidu-N7k
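
| Method category | Main models | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Rule-based and string matching | Heuristic rules [15-16], string matching [17-18] | Highly interpretable | Poor portability; heavy manual effort |
| Machine learning | Support vector machine [19], k-nearest neighbors [20] | No hand-written rules needed; less manual effort | Cannot take contextual information into account; limited |
| Deep learning | Convolutional neural networks [21], recurrent neural networks [22], pre-trained models [23] | Strong representational power; can express rich semantic information | Demanding hardware requirements |

Comparison of Advantages and Disadvantages of Medical Terminology Normalization Methods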
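
| Method category | Representative literature |
| --- | --- |
| Sequence generation | Yan et al. [28] designed a generate-and-rank framework that addresses the multi-entailment problem in Chinese terminology normalization. |
| Multi-class classification | Li et al. [30] introduced EhrBERT, a modified BERT model, and demonstrated the effectiveness of BERT-based models on normalization tasks. |
| | Huang et al. [29] developed the PLM-ICD model, which combines domain-specific pre-training with a label-aware attention mechanism. |
| | Sung et al. [23] designed BIOSYN, a framework for biomedical entity representation learning that uses synonym marginalization over iteratively updated candidate entities and achieves good results in biomedical entity representation. |
| Binary classification | Liang et al. [32] treated normalization as a binary classification task and trained with active learning to reduce annotation cost. |
| | Xu et al. [33] designed a list-wise model with semantic type regularization for ranking. |
| | Chong et al. [34] built a four-module terminology normalization system that took first place in the evaluation competition of the 5th China Conference on Health Information Processing (CHIP2019). |
| | Yuan et al. [31] retrained a pre-trained model in two parts: knowledge-base-guided pre-training and synonym-aware fine-tuning. |
| | Liang et al. [7] used TF-IDF (Term Frequency-Inverse Document Frequency) and BERT to generate candidate entities in the recall stage, then re-ranked them with a keyword-aware binary classification task. |
| | Sui et al. [9] used BERT encoding to merge candidate ranking and quantity prediction into one step, scoring normalization as a binary classification task. |

Representative Literature Related to Medical Terminology Normalization Methods Based on Pre-trained Models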
Overall Model Framework
Multi-Strategy Candidate Entity Recall
Candidate Ranking for Contrastive Learning
Prediction of Multi-Head Attention Mechanisms
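
For the contrastive candidate-ranking stage named in the figure captions above, the sketch below computes an InfoNCE-style loss over cosine similarities and then re-ranks candidates by cosine similarity. The loss form, the temperature `tau = 0.05`, the placeholder names `standard_a` to `standard_c`, and the toy 3-dimensional vectors are all assumptions; this page does not specify the paper's actual training objective or encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, tau=0.05):
    """InfoNCE loss: negative log-softmax of the positive's scaled similarity."""
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, neg) / tau for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum - logits[0]


def rerank(term_vec, candidate_vecs):
    """Sort candidate names by cosine similarity to the term, best first."""
    return sorted(candidate_vecs,
                  key=lambda name: cosine(term_vec, candidate_vecs[name]),
                  reverse=True)


# Toy embeddings: the term is closest to standard_a.
term = [1.0, 0.2, 0.0]
cands = {"standard_a": [0.9, 0.3, 0.1],
         "standard_b": [0.1, 1.0, 0.0],
         "standard_c": [0.0, 0.1, 1.0]}

# Loss when the true match is the nearest vector vs. the farthest one.
loss_good = info_nce(term, cands["standard_a"], [cands["standard_b"], cands["standard_c"]])
loss_bad = info_nce(term, cands["standard_c"], [cands["standard_a"], cands["standard_b"]])
print(rerank(term, cands), loss_good < loss_bad)
```

The loss is small when the true standard entity is already the nearest vector and large when a negative is nearer, which is the pressure that pulls matching term-entity pairs together and pushes near-duplicates apart.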
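
| Category | Single-entailment: one-to-one | Multi-entailment: one-to-two | Multi-entailment: one-to-three | Multi-entailment: one-to-four | Multi-entailment: one-to-five | Multi-entailment: one-to-six | Multi-entailment: one-to-seven | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Training set | 3,801 | 148 | 34 | 16 | 0 | 0 | 1 | 4,000 |
| Validation set | 950 | 39 | 9 | 2 | 0 | 0 | 0 | 1,000 |
| Test set | 1,901 | 77 | 18 | 3 | 1 | 0 | 0 | 2,000 |

Chinese Clinical Terminology Normalization Dataset Yidu-N7k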
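
| Category | Single-entailment: one-to-one | Multi-entailment: one-to-two | Total |
| --- | --- | --- | --- |
| Training set | 875 | 6 | 881 |
| Test set | 93 | 2 | 95 |

Breast Cancer Mammography Diagnostic Terminology Normalization Dataset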
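
| Category | Model | Single-entailment accuracy (%) | Multi-entailment accuracy (%) | Accuracy (%) |
| --- | --- | --- | --- | --- |
| Statistical models | TF-IDF | 49.30 | | 46.80 |
| | Edit distance | 50.80 | | 48.30 |
| | BM25 | 65.83 | | 62.57 |
| Multi-class models | BERT | 88.60 | | 84.20 |
| | PLM-ICD*[29] | 92.64 | | 88.03 |
| Generative model | Sequence generation[28] | 91.10 | 52.40 | 89.30 |
| Binary classification models | MTCEN*[9] | 93.27 | 52.35 | 91.23 |
| | CMTN*[7] | 93.48 | 53.02 | 91.47 |
| Contrastive learning model | This paper | 94.18 | 53.69 | 92.17 |
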
Experimental Results on Yidu-N7k Dataset
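
The overall accuracy column can be cross-checked against the two per-category columns. The sketch below assumes that overall accuracy is the sample-weighted mean of single- and multi-entailment accuracy over the Yidu-N7k test split (1,901 single-entailment and 99 multi-entailment terms, taken from the dataset table above). That weighting is an assumption rather than something this page states.

```python
SINGLE_COUNT = 1901  # single-entailment test terms (one-to-one)
MULTI_COUNT = 77 + 18 + 3 + 1  # multi-entailment test terms (one-to-two .. one-to-five)


def overall_accuracy(single_acc, multi_acc):
    """Sample-weighted mean of the two per-category accuracies, in percent."""
    total = SINGLE_COUNT + MULTI_COUNT
    return (SINGLE_COUNT * single_acc + MULTI_COUNT * multi_acc) / total


# (single-entailment, multi-entailment, reported overall) from the results table
reported = {
    "MTCEN": (93.27, 52.35, 91.23),
    "CMTN": (93.48, 53.02, 91.47),
    "This paper": (94.18, 53.69, 92.17),
}

for model, (single, multi, overall) in reported.items():
    recomputed = overall_accuracy(single, multi)
    print(f"{model}: recomputed {recomputed:.2f}, reported {overall:.2f}")
```

Under this assumption the recomputed values agree with the reported ones to within about 0.02 percentage points for these three rows. For rows that list only two values (for example BERT, 88.60 and 84.20), the same formula approximately reproduces the second value from the first when multi-entailment accuracy is taken as zero.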
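
| Category | Model | Accuracy (%) |
| --- | --- | --- |
| Multi-class model | PLM-ICD*[29] | 97.48 |
| Binary classification model | MTCEN*[9] | 97.62 |
| Contrastive learning model | This paper | 97.85 |

Experimental Results on the Breast Cancer Mammography Diagnostic Terminology Normalization Dataset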

| Model | Recall (%) |
| --- | --- |
| TF-IDF* | 68.33 |
| ES database* | 96.73 |
| BERT | 96.90 |
| MTCEN* | 97.20 |
| CMTN | 98.30 |
| This paper | 98.93 |

Candidate Recall Experimental Results on Yidu-N7k Dataset
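
| Model | Single-entailment accuracy (%) | Multi-entailment accuracy (%) | Accuracy (%) |
| --- | --- | --- | --- |
| BERT | 99.12 | 63.12 | 97.43 |
| Sequence generation | 98.60 | 76.40 | 97.70 |
| MTCEN* | 99.40 | 75.84 | 98.23 |
| CMTN | 99.58 | 70.40 | 98.23 |
| This paper | 99.58 | 75.84 | 98.40 |
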
Number Prediction Experimental Results on Yidu-N7k Dataset
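
| Model | Single-entailment accuracy (%) | Multi-entailment accuracy (%) | Accuracy (%) |
| --- | --- | --- | --- |
| This paper | 94.18 | 53.69 | 92.17 |
| w/o ES database | 93.48 | 48.99 | 91.27 |
| w/o contrastive learning | 92.95 | 48.32 | 90.73 |
| w/o multi-head attention | 94.07 | 51.68 | 91.97 |
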
Ablation Experimental Results on Yidu-N7k Dataset
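
To read the ablation table at a glance, the short sketch below subtracts each ablated variant's overall accuracy from the full model's. The numbers are copied from the table; the English variant names are translations used here for illustration.

```python
full_model = 92.17  # overall accuracy of the full framework, in percent

ablations = {
    "w/o ES database": 91.27,
    "w/o contrastive learning": 90.73,
    "w/o multi-head attention": 91.97,
}

# Accuracy lost (in percentage points) when each component is removed.
drops = {name: round(full_model - acc, 2) for name, acc in ablations.items()}

# List the components from most to least costly to remove.
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: -{drop:.2f}")
```

Removing contrastive learning costs the most (1.44 points), followed by ES database recall (0.90) and multi-head attention (0.20).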
[1] Lin Y C, Lu K M, Chen Y L, et al. High-Throughput Relation Extraction Algorithm Development Associating Knowledge Articles and Electronic Health Records[OL]. arXiv Preprint, arXiv: 2009.03506.
[2] Miotto R, Li L, Kidd B A, et al. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records[J]. Scientific Reports, 2016, 6(1): 26094.
[3] Zhang N Y, Chen M S, Bi Z, et al. CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark[OL]. arXiv Preprint, arXiv: 2106.08087.
[4] Bodenreider O. The Unified Medical Language System (UMLS): Integrating Biomedical Terminology[J]. Nucleic Acids Research, 2004, 32(S1): D267-D270.
[5] Donnelly K. SNOMED-CT: The Advanced Terminology and Coding System for eHealth[J]. Studies in Health Technology and Informatics, 2006, 121: 279-290. pmid: 17095826.
[6] Jiang Jingchi, Hou Junyi, Li Xue, et al. Medical Entity Standardization Method Based on Collaborative Ensemble Learning[J]. Journal of Chinese Information Processing, 2023, 37(3): 135-142. (in Chinese)
[7] Liang M, Xue K, Ye Q, et al. A Combined Recall and Rank Framework with Online Negative Sampling for Chinese Procedure Terminology Normalization[J]. Bioinformatics, 2021, 37(20): 3610-3617. doi: 10.1093/bioinformatics/btab381. pmid: 34037691.
[8] Li L Q, Zhai Y K, Gao J H, et al. Stacking-BERT Model for Chinese Medical Procedure Entity Normalization[J]. Mathematical Biosciences and Engineering, 2023, 20(1): 1018-1036. doi: 10.3934/mbe.2023047. pmid: 36650800.
[9] Sui X H, Song K H, Zhou B H, et al. A Multi-Task Learning Framework for Chinese Medical Procedure Entity Normalization[C]// Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. 2022: 8337-8341.
[10] Miftahutdinov Z, Tutubalina E. Deep Neural Models for Medical Concept Normalization in User-Generated Texts[OL]. arXiv Preprint, arXiv: 1907.07972.
[11] Huang J M, Osorio C, Sy L W. An Empirical Evaluation of Deep Learning for ICD-9 Code Assignment Using MIMIC-III Clinical Notes[J]. Computer Methods and Programs in Biomedicine, 2019, 177: 141-153. pii: S0169-2607(18)30994-5. pmid: 31319942.
[12] Ji Z C, Wei Q, Xu H. BERT-Based Ranking for Biomedical Entity Normalization[J]. AMIA Joint Summits on Translational Science, 2020, 2020: 269-277.
[13] Gao T Y, Yao X C, Chen D Q. SimCSE: Simple Contrastive Learning of Sentence Embeddings[OL]. arXiv Preprint, arXiv: 2104.08821.
[14] Zhou Pengcheng, Wu Chuan, Lu Wei. Entity Linking Method for Short Texts with Multi-Knowledge Bases: Case Study of Wikipedia and Freebase[J]. New Technology of Library and Information Service, 2016(6): 1-11. (in Chinese)
[15] Ghiasvand O, Kate R. UWM: Disorder Mention Extraction from Clinical Text Using CRFs and Normalization Using Learned Edit Distance Patterns[C]// Proceedings of the 8th International Workshop on Semantic Evaluation. 2014: 828-832.
[16] Afzal Z, Akhondi S A, van Haagen H H H B M, et al. Biomedical Concept Recognition in French Text Using Automatic Translation of English Terms[A]// Experimental IR Meets Multilinguality, Multimodality, and Interaction[M]. Springer, 2015.
[17] D’Souza J, Ng V. Sieve-Based Entity Linking for the Biomedical Domain[C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2:Short Papers). 2015: 297-302.
[18] Leal A, Martins B, Couto F. ULisboa: Recognition and Normalization of Medical Concepts[C]// Proceedings of the 9th International Workshop on Semantic Evaluation. 2015: 406-411.
[19] Boytcheva S. Automatic Matching of ICD-10 Codes to Diagnoses in Discharge Letters[C]// Proceedings of the 2nd Workshop on Biomedical Natural Language Processing. 2011: 11-18.
[20] Larkey L S, Croft W B. Automatic Assignment of ICD9 Codes to Discharge Summaries[R]. University of Massachusetts, 1995.
[21] Luo Y, Song G J, Li P Y, et al. Multi-Task Medical Concept Normalization Using Multi-View Convolutional Neural Network[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 2018: 5868-5875.
[22] Limsopatham N, Collier N. Normalising Medical Concepts in Social Media Texts by Learning Semantic Representation[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers). 2016: 1014-1023.
[23] Sung M, Jeon H, Lee J, et al. Biomedical Entity Representations with Synonym Marginalization[OL]. arXiv Preprint, arXiv: 2005.00239.
[24] Gundersen M L, Haug P J, Pryor T A, et al. Development and Evaluation of a Computerized Admission Diagnoses Encoding System[J]. Computers and Biomedical Research, 1996, 29(5): 351-372. pmid: 8902364.
[25] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv: 1810.04805.
[26] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
[27] Kalyan K S, Sangeetha S. BertMCN: Mapping Colloquial Phrases to Standard Medical Concepts Using BERT and Highway Network[J]. Artificial Intelligence in Medicine, 2021, 112: 102008.
[28] Yan J H, Wang Y N, Xiang L, et al. A Knowledge-Driven Generative Model for Multi-Implication Chinese Medical Procedure Entity Normalization[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020: 1490-1499.
[29] Huang C W, Tsai S C, Chen Y N. PLM-ICD: Automatic ICD Coding with Pretrained Language Models[OL]. arXiv Preprint, arXiv: 2207.05289.
[30] Li F, Jin Y H, Liu W S, et al. Fine-Tuning Bidirectional Encoder Representations from Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study[J]. JMIR Medical Informatics, 2019, 7(3): e14830.
[31] Yuan H Y, Yuan Z, Yu S. Generative Biomedical Entity Linking via Knowledge Base-Guided Pre-training and Synonyms-Aware Fine-Tuning[OL]. arXiv Preprint, arXiv: 2204.05164.
[32] Liang M, Zhang Z X, Zhang J Y, et al. Lab Indicators Standardization Method for the Regional Healthcare Platform: A Case Study on Heart Failure[J]. BMC Medical Informatics and Decision Making, 2020, 20(14): 331.
[33] Xu D F, Zhang Z Y, Bethard S. A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 8452-8464.
[34] Chong Weifeng, Li Hui, Li Xue, et al. Term Normalization System Based on BERT Entailment Reasoning[J]. Journal of Chinese Information Processing, 2021, 35(5): 86-90. (in Chinese)
[35] Chen T, Kornblith S, Norouzi M, et al. A Simple Framework for Contrastive Learning of Visual Representations[C]// Proceedings of the 37th International Conference on Machine Learning. 2020: 1597-1607.
[36] Ding J T, Quan Y H, Yao Q M, et al. Simplify and Robustify Negative Sampling for Implicit Collaborative Filtering[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020: 1094-1105.
[37] Zhang Y H, Zhu H J, Wang Y L, et al. A Contrastive Framework for Learning Sentence Representations from Pairwise and Triple-Wise Perspective in Angular Space[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1:Long Papers). 2022: 4892-4903.
[38] Peng H, Xiong Y, Xiang Y, et al. Biomedical Named Entity Normalization via Interaction-Based Synonym Marginalization[J]. Journal of Biomedical Informatics, 2022, 136: 104238.
[39] Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks[OL]. arXiv Preprint, arXiv: 1908.10084.
[40] Teng F, Liu Y M, Li T R, et al. A Review on Deep Neural Networks for ICD Coding[J]. IEEE Transactions on Knowledge and Data Engineering, 2023, 35(5): 4357-4375.
[41] China Anti-Cancer Association Breast Cancer Society. Guidelines and Norms for Diagnosis and Treatment of Breast Cancer of China Anti-Cancer Association (2021 Edition)[J]. China Oncology, 2021, 31(10): 954-1040. (in Chinese)
[42] Johnson A E W, Pollard T J, Shen L, et al. MIMIC-III, a Freely Accessible Critical Care Database[J]. Scientific Data, 2016, 3: 160035.
[43] Wen Pingmei, Ye Zhiwei, Ding Wenjian, et al. Developments of Named Entity Disambiguation[J]. Data Analysis and Knowledge Discovery, 2020, 4(9): 15-25. (in Chinese)