1 Yichang Key Laboratory of Intelligent Medicine, Yichang 443002, China
2 College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China
3 School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121, China
4 College of Economics & Management, China Three Gorges University, Yichang 443002, China
[Objective] To address the challenges of short texts, high similarity, and single and multiple entailments in the standardization of Chinese medical terminology, this paper proposes a recall-ranking-quantity prediction framework based on the fusion of multi-strategy contrastive learning. [Methods] First, we integrated text statistical features and deep semantic features to retrieve candidate entities, and obtained the candidate set based on similarity scores. Second, in the candidate ranking stage, we combined the original terms, the standard entities, and the recalled candidate entities to train vector representations with pre-trained models and contrastive learning strategies, and then re-ranked the candidates by cosine similarity. Next, we updated the vector representations of the original terms through multi-head attention to predict the number of standard entities corresponding to each original term. Finally, we selected the standard entities according to the predicted quantity by integrating the similarity scores from candidate recall and candidate ranking. [Results] We evaluated the framework on the Chinese medical terminology normalization dataset Yidu-N7k. Compared with statistical models and mainstream deep learning models, the proposed framework achieved an accuracy of 92.17%, an improvement of at least 0.94% over the pre-trained binary classification baseline model. Additionally, on a dataset of 150 expert-labeled mammography examination reports for female breast cancer, the framework’s accuracy reached 97.85%, the best performance. [Limitations] The experiments were conducted only on medical datasets, and the effectiveness in other domains needs further exploration. [Conclusions] Multi-strategy candidate recall considers text information comprehensively and addresses the challenge of short texts. Contrastive learning candidate ranking captures subtle textual differences and addresses the challenge of high similarity. Quantity prediction with multi-head attention enhances the vector representations and addresses the challenge of single and multiple entailments. The proposed method has the potential to promote medical information mining and clinical research.
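The recall, ranking, and selection stages described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not the authors' implementation: `char_jaccard` stands in for the text statistical features, `toy_embed` (character-bigram counts) stands in for the pre-trained, contrastively trained encoder, the fusion weights `alpha` and `beta` are hypothetical, and the predicted quantity `n_pred` is simply passed in rather than produced by a multi-head attention model.

```python
import math
from collections import Counter


def char_jaccard(a: str, b: str) -> float:
    """Statistical feature (assumed): Jaccard overlap of the character sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def toy_embed(text: str) -> Counter:
    """Placeholder for a pre-trained encoder: character-bigram counts."""
    padded = f"^{text}$"
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recall(term, standards, k=3, alpha=0.5):
    """Stage 1: fuse statistical and semantic similarity, keep the top-k candidates."""
    q = toy_embed(term)
    scored = [
        (alpha * char_jaccard(term, s) + (1 - alpha) * cosine(q, toy_embed(s)), s)
        for s in standards
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(s, score) for score, s in scored[:k]]


def rank_and_select(term, candidates, n_pred, beta=0.5):
    """Later stages: re-score candidates by cosine similarity, fuse the result
    with the recall score, and keep as many entities as the predicted quantity."""
    q = toy_embed(term)
    fused = [
        (beta * recall_score + (1 - beta) * cosine(q, toy_embed(s)), s)
        for s, recall_score in candidates
    ]
    fused.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in fused[:n_pred]]


# Hypothetical standard vocabulary and original term, for illustration only.
standards = ["左肾切除术", "右肾切除术", "胆囊切除术", "阑尾切除术"]
candidates = recall("左肾切除", standards, k=2)
print(rank_and_select("左肾切除", candidates, n_pred=1))  # -> ['左肾切除术']
```

In this sketch a term that implies several standard entities would be handled by passing a larger `n_pred`; in the framework described above that number comes from the quantity prediction stage instead.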
Yue Chonghao, Zhang Jian, Wu Yirong, Li Xiaolong, Hua Sheng, Tong Shunhang, Sun Shuifa. Standardization of Chinese Medical Terminology Based on Multi-Strategy Comparison Learning[J]. Data Analysis and Knowledge Discovery, 2024, 8(6): 144-157.