Data Analysis and Knowledge Discovery, 2024, Vol. 8, Issue (6): 144-157     https://doi.org/10.11925/infotech.2096-3467.2023.0931
Research Paper
Standardization of Chinese Medical Terminology Based on Multi-Strategy Comparison Learning
Yue Chonghao1,2, Zhang Jian3, Wu Yirong1,2, Li Xiaolong1,4, Hua Sheng1,2, Tong Shunhang1,2, Sun Shuifa1,3
1Yichang Key Laboratory of Intelligent Medicine, Yichang 443002, China
2College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China
3School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121, China
4College of Economics & Management, China Three Gorges University, Yichang 443002, China

Abstract

[Objective] To address the challenges of short texts, high similarity, and single and multiple entailments in Chinese medical terminology normalization, this paper proposes a recall-ranking-quantity-prediction framework that fuses multiple strategies with contrastive learning. [Methods] First, we integrated text statistical features and deep semantic features to recall candidate entities, obtaining the candidate set from similarity scores. Second, for candidate ranking, we trained vector representations of the original terms, standard entities, and recalled candidates with pre-trained models and contrastive learning strategies, then re-ranked the candidates by cosine similarity. Third, we updated the vector representations of the original terms with multi-head attention to predict the number of standard entities each original term entails. Finally, we fused the similarity scores from candidate recall and candidate ranking and selected standard entities in order according to the predicted quantity. [Results] We evaluated the model on the Chinese medical terminology normalization dataset Yidu-N7k. Compared with statistical models and mainstream deep learning models, the proposed framework achieved an accuracy of 92.17%, an improvement of up to 0.94 percentage points over pre-trained binary classification baseline models. Additionally, on a self-built dataset of 150 expert-labeled female breast cancer mammography examination reports, the framework reached the best performance with an accuracy of 97.85%. [Limitations] The experiments were conducted only on medical datasets; effectiveness in other domains needs further study. [Conclusions] Multi-strategy candidate recall considers textual information comprehensively to address the short-text challenge. Contrastive-learning candidate ranking captures subtle textual differences to address the high-similarity challenge. Quantity prediction with multi-head attention enhances vector representations to address the single- and multiple-entailment challenges. The proposed method shows potential for advancing medical information mining and clinical research.
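The recall-ranking-quantity-prediction pipeline summarized in [Methods] can be sketched as follows. This is a minimal illustration, not the paper's implementation: the entity names, similarity scores, and the fusion weight `alpha` are all assumed for the example.

```python
# Sketch of the recall -> rank -> quantity-guided selection pipeline.
# Entity names, scores, and the fusion weight `alpha` are illustrative.

def fuse_scores(recall_scores, rank_scores, alpha=0.5):
    """Fuse candidate-recall and candidate-ranking similarity scores."""
    return {e: alpha * recall_scores[e] + (1 - alpha) * rank_scores[e]
            for e in recall_scores}

def select_entities(recall_scores, rank_scores, predicted_count, alpha=0.5):
    """Select the top `predicted_count` standard entities by fused score."""
    fused = fuse_scores(recall_scores, rank_scores, alpha)
    return sorted(fused, key=fused.get, reverse=True)[:predicted_count]

# Hypothetical candidates for a term predicted to entail two entities.
recall = {"entity_A": 0.90, "entity_B": 0.70, "entity_C": 0.40}
rank = {"entity_A": 0.95, "entity_B": 0.80, "entity_C": 0.30}
print(select_entities(recall, rank, predicted_count=2))  # ['entity_A', 'entity_B']
```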

Key words: Medical Terminology Normalization; Multi-Strategy Candidate Recall; Contrastive Learning; Breast Cancer Mammography Examination Report
Received: 2023-09-21      Published online: 2024-04-18
CLC Number: TP393
Fund: *National Social Science Foundation of China (20BTQ066)
Corresponding author: Sun Shuifa, ORCID: 0000-0003-0933-152X, E-mail: watersun@hznu.edu.cn
Cite this article:
Yue Chonghao, Zhang Jian, Wu Yirong, Li Xiaolong, Hua Sheng, Tong Shunhang, Sun Shuifa. Standardization of Chinese Medical Terminology Based on Multi-Strategy Comparison Learning. Data Analysis and Knowledge Discovery, 2024, 8(6): 144-157.
URL:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2023.0931      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2024/V8/I6/144
Fig. 1  An example from the Chinese medical terminology normalization dataset Yidu-N7k
Method Category | Representative Models | Advantages | Disadvantages
Rule- and string-matching-based | Heuristic rules[15-16], string matching[17-18] | Strong interpretability | Poor portability, labor-intensive
Machine-learning-based | Support vector machines[19], k-nearest neighbors[20] | No hand-crafted rules, lower labor cost | Cannot model context; limited
Deep-learning-based | Convolutional neural networks[21], recurrent neural networks[22], pre-trained models[23] | Strong representation capacity, richer semantic expression | Higher hardware requirements
Table 1  Advantages and disadvantages of conventional medical terminology normalization methods
Method Category | Representative Literature
Sequence generation | Yan et al.[28] designed a generate-and-rank framework that addresses the multi-entailment problem in Chinese normalization.
Multi-class classification | Li et al.[30] introduced EhrBERT, an improved BERT model, demonstrating the effectiveness of BERT-based models on normalization tasks.
 | Huang et al.[29] developed PLM-ICD, combining domain-specific pre-training with a label-aware attention mechanism.
 | Sung et al.[23] designed BIOSYN, a framework for biomedical entity representation learning that iterates over candidate entities via synonym marginalization, achieving strong results on biomedical entity representation.
Binary classification | Liang et al.[32] treated normalization as a binary classification task trained with active learning to reduce annotation cost.
 | Xu et al.[33] designed a listwise ranking model with semantic-type rules.
 | Chong et al.[34] built a four-module terminology normalization system that won first place in the 5th China Health Information Processing Conference shared task (CHIP2019).
 | Yuan et al.[31] re-trained a pre-trained model in two parts: knowledge-base-guided pre-training and synonym-aware fine-tuning.
 | Liang et al.[7] generated candidate entities with TF-IDF (Term Frequency-Inverse Document Frequency) and BERT in the recall stage, then re-ranked them with a keyword-aware binary classification task.
 | Sui et al.[9] used BERT encoding to merge candidate ranking and quantity prediction into one step, scoring normalization as binary classification.
Table 2  Representative pre-trained-model-based medical terminology normalization methods
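Several of the methods above (e.g. Liang et al.[7]) use TF-IDF similarity for candidate recall, as does the recall stage evaluated in Table 7. Below is a minimal pure-Python sketch of TF-IDF recall over character bigrams, which suit short Chinese medical terms; the example entities and query are hypothetical English stand-ins, and the exact weighting in the cited systems may differ.

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Character n-grams, suited to short Chinese medical terms."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(docs, n=2):
    """TF-IDF vectors over character n-grams for a small entity list."""
    tfs = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for tf in tfs for g in tf)  # document frequency per n-gram
    idf = {g: math.log(len(docs) / df[g]) + 1.0 for g in df}
    return [{g: c * idf[g] for g, c in tf.items()} for tf in tfs], idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recall_candidates(query, standard_entities, k=5, n=2):
    """Return the k standard entities most similar to the query term."""
    vecs, idf = tfidf_vectors(standard_entities, n)
    qvec = {g: c * idf.get(g, 0.0)
            for g, c in Counter(char_ngrams(query, n)).items()}
    scored = sorted(zip(standard_entities, (cosine(qvec, v) for v in vecs)),
                    key=lambda t: t[1], reverse=True)
    return [e for e, _ in scored[:k]]

# Hypothetical standard entities and query.
entities = ["left breast mass", "right breast mass", "liver cyst"]
print(recall_candidates("mass in left breast", entities, k=2))
```

In the paper's framework this statistical recall is fused with deep semantic features; this sketch covers only the statistical side.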
Fig. 2  Overall model framework
Fig. 3  Multi-strategy candidate entity recall
Fig. 4  Candidate ranking with contrastive learning
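The candidate-ranking stage trains vector representations of original terms, standard entities, and recalled candidates with a contrastive objective and re-ranks by cosine similarity. A minimal InfoNCE-style sketch under assumed inputs follows; the paper's exact loss, temperature, and negative-sampling scheme are not given here and may differ.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two dense vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def info_nce(anchor, positive, negatives, tau=0.05):
    """Contrastive loss: pull the gold standard entity (positive) toward
    the original term (anchor), push recalled hard negatives away.
    `tau` is an assumed temperature, not a value from the paper."""
    sims = [cosine_sim(anchor, positive)]
    sims += [cosine_sim(anchor, neg) for neg in negatives]
    exps = [math.exp(s / tau) for s in sims]
    return -math.log(exps[0] / sum(exps))

# Toy 2-d embeddings: the loss is near zero when the positive matches the
# anchor, and large when a negative matches it instead.
print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
print(info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]]))
```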
Fig. 5  Quantity prediction with the multi-head attention mechanism
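The quantity-prediction stage updates the original term's representation with multi-head attention and then classifies how many standard entities the term entails. A single-head sketch for brevity (the paper uses multi-head attention; all vectors and weight values below are illustrative, not trained parameters):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_pool(query, token_vecs):
    """Scaled dot-product attention (single head for brevity): a term-level
    query attends over token vectors to update the term representation."""
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d)
              for tok in token_vecs]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, token_vecs))
            for i in range(d)]

def predict_count(pooled, weight_rows):
    """Linear classification head: class i means 'entails i+1 entities'."""
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in weight_rows]
    return max(range(len(logits)), key=logits.__getitem__) + 1

# Toy 2-d example with an illustrative (untrained) weight matrix.
pooled = attention_pool([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print(predict_count(pooled, [[1.0, 0.0], [0.0, 1.0]]))  # 1
```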
Split | Single 1:1 | Multi 1:2 | Multi 1:3 | Multi 1:4 | Multi 1:5 | Multi 1:6 | Multi 1:7 | Total
Training set | 3 801 | 148 | 34 | 16 | 0 | 0 | 1 | 4 000
Validation set | 950 | 39 | 9 | 2 | 0 | 0 | 0 | 1 000
Test set | 1 901 | 77 | 18 | 3 | 1 | 0 | 0 | 2 000
Table 3  The Chinese medical terminology normalization dataset Yidu-N7k
Split | Single Entailment (1:1) | Multi-Entailment (1:2) | Total
Training set | 875 | 6 | 881
Test set | 93 | 2 | 95
Table 4  The breast cancer mammography diagnosis terminology normalization dataset
Category | Model | Single-Entailment Accuracy | Multi-Entailment Accuracy | Overall Accuracy
Statistical models | TF-IDF | 49.30 | — | 46.80
 | Edit distance | 50.80 | — | 48.30
 | BM25 | 65.83 | — | 62.57
Multi-class models | BERT | 88.60 | — | 84.20
 | PLM-ICD*[29] | 92.64 | — | 88.03
Generative models | Sequence generation[28] | 91.10 | 52.40 | 89.30
Binary classification models | MTCEN*[9] | 93.27 | 52.35 | 91.23
 | CMTN*[7] | 93.48 | 53.02 | 91.47
Contrastive learning models | Ours | 94.18 | 53.69 | 92.17
Table 5  Experimental results on the Yidu-N7k dataset (%)
Category | Model | Accuracy (%)
Multi-class model | PLM-ICD*[29] | 97.48
Binary classification model | MTCEN*[9] | 97.62
Contrastive learning model | Ours | 97.85
Table 6  Experimental results on the breast cancer mammography diagnosis terminology normalization dataset
Model | Recall (%)
TF-IDF* | 68.33
ES database* | 96.73
BERT | 96.90
MTCEN* | 97.20
CMTN | 98.30
Ours | 98.93
Table 7  Candidate recall results on the Yidu-N7k dataset
Model | Single-Entailment Accuracy | Multi-Entailment Accuracy | Overall Accuracy
BERT | 99.12 | 63.12 | 97.43
Sequence generation | 98.60 | 76.40 | 97.70
MTCEN* | 99.40 | 75.84 | 98.23
CMTN | 99.58 | 70.40 | 98.23
Ours | 99.58 | 75.84 | 98.40
Table 8  Quantity prediction results on the Yidu-N7k dataset (%)
Model | Single-Entailment Accuracy | Multi-Entailment Accuracy | Overall Accuracy
Ours (full) | 94.18 | 53.69 | 92.17
w/o ES database | 93.48 | 48.99 | 91.27
w/o contrastive learning | 92.95 | 48.32 | 90.73
w/o multi-head attention | 94.07 | 51.68 | 91.97
Table 9  Ablation results on the Yidu-N7k dataset (%)
[1] Lin Y C, Lu K M, Chen Y L, et al. High-Throughput Relation Extraction Algorithm Development Associating Knowledge Articles and Electronic Health Records[OL]. arXiv Preprint, arXiv: 2009.03506.
[2] Miotto R, Li L, Kidd B A, et al. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records[J]. Scientific Reports, 2016, 6(1): 26094.
[3] Zhang N Y, Chen M S, Bi Z, et al. CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark[OL]. arXiv Preprint, arXiv: 2106.08087.
[4] Bodenreider O. The Unified Medical Language System (UMLS): Integrating Biomedical Terminology[J]. Nucleic Acids Research, 2004, 32(S1): D267-D270.
[5] Donnelly K. SNOMED-CT: The Advanced Terminology and Coding System for eHealth[J]. Studies in Health Technology and Informatics, 2006, 121: 279-290. PMID: 17095826.
[6] Jiang Jingchi, Hou Junyi, Li Xue, et al. Medical Entity Standardization Method Based on Collaborative Ensemble Learning[J]. Journal of Chinese Information Processing, 2023, 37(3): 135-142. (in Chinese)
[7] Liang M, Xue K, Ye Q, et al. A Combined Recall and Rank Framework with Online Negative Sampling for Chinese Procedure Terminology Normalization[J]. Bioinformatics, 2021, 37(20): 3610-3617. DOI: 10.1093/bioinformatics/btab381. PMID: 34037691.
[8] Li L Q, Zhai Y K, Gao J H, et al. Stacking-BERT Model for Chinese Medical Procedure Entity Normalization[J]. Mathematical Biosciences and Engineering, 2023, 20(1): 1018-1036. DOI: 10.3934/mbe.2023047. PMID: 36650800.
[9] Sui X H, Song K H, Zhou B H, et al. A Multi-Task Learning Framework for Chinese Medical Procedure Entity Normalization[C]// Proceedings of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing. 2022: 8337-8341.
[10] Miftahutdinov Z, Tutubalina E. Deep Neural Models for Medical Concept Normalization in User-Generated Texts[OL]. arXiv Preprint, arXiv: 1907.07972.
[11] Huang J M, Osorio C, Sy L W. An Empirical Evaluation of Deep Learning for ICD-9 Code Assignment Using MIMIC-III Clinical Notes[J]. Computer Methods and Programs in Biomedicine, 2019, 177: 141-153. PMID: 31319942.
[12] Ji Z C, Wei Q, Xu H. BERT-Based Ranking for Biomedical Entity Normalization[J]. AMIA Joint Summits on Translational Science, 2020, 2020: 269-277.
[13] Gao T Y, Yao X C, Chen D Q. SimCSE: Simple Contrastive Learning of Sentence Embeddings[OL]. arXiv Preprint, arXiv: 2104.08821.
[14] Zhou Pengcheng, Wu Chuan, Lu Wei. Entity Linking Method for Short Texts with Multi-Knowledge Bases: Case Study of Wikipedia and Freebase[J]. New Technology of Library and Information Service, 2016(6): 1-11. (in Chinese)
[15] Ghiasvand O, Kate R. UWM: Disorder Mention Extraction from Clinical Text Using CRFs and Normalization Using Learned Edit Distance Patterns[C]// Proceedings of the 8th International Workshop on Semantic Evaluation. 2014: 828-832.
[16] Afzal Z, Akhondi S A, van Haagen H H H B M, et al. Biomedical Concept Recognition in French Text Using Automatic Translation of English Terms[A]// Experimental IR Meets Multilinguality, Multimodality, and Interaction[M]. Springer, 2015.
[17] D’Souza J, Ng V. Sieve-Based Entity Linking for the Biomedical Domain[C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). 2015: 297-302.
[18] Leal A, Martins B, Couto F. ULisboa: Recognition and Normalization of Medical Concepts[C]// Proceedings of the 9th International Workshop on Semantic Evaluation. 2015: 406-411.
[19] Boytcheva S. Automatic Matching of ICD-10 Codes to Diagnoses in Discharge Letters[C]// Proceedings of the 2nd Workshop on Biomedical Natural Language Processing. 2011: 11-18.
[20] Larkey L S, Croft W B. Automatic Assignment of ICD9 Codes to Discharge Summaries[R]. University of Massachusetts, 1995.
[21] Luo Y, Song G J, Li P Y, et al. Multi-Task Medical Concept Normalization Using Multi-View Convolutional Neural Network[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 2018: 5868-5875.
[22] Limsopatham N, Collier N. Normalising Medical Concepts in Social Media Texts by Learning Semantic Representation[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016: 1014-1023.
[23] Sung M, Jeon H, Lee J, et al. Biomedical Entity Representations with Synonym Marginalization[OL]. arXiv Preprint, arXiv: 2005.00239.
[24] Gundersen M L, Haug P J, Pryor T A, et al. Development and Evaluation of a Computerized Admission Diagnoses Encoding System[J]. Computers and Biomedical Research, 1996, 29(5): 351-372. PMID: 8902364.
[25] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv: 1810.04805.
[26] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. 2017: 6000-6010.
[27] Kalyan K S, Sangeetha S. BertMCN: Mapping Colloquial Phrases to Standard Medical Concepts Using BERT and Highway Network[J]. Artificial Intelligence in Medicine, 2021, 112: 102008.
[28] Yan J H, Wang Y N, Xiang L, et al. A Knowledge-Driven Generative Model for Multi-Implication Chinese Medical Procedure Entity Normalization[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020: 1490-1499.
[29] Huang C W, Tsai S C, Chen Y N. PLM-ICD: Automatic ICD Coding with Pretrained Language Models[OL]. arXiv Preprint, arXiv: 2207.05289.
[30] Li F, Jin Y H, Liu W S, et al. Fine-Tuning Bidirectional Encoder Representations from Transformers (BERT)-Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study[J]. JMIR Medical Informatics, 2019, 7(3): e14830.
[31] Yuan H Y, Yuan Z, Yu S. Generative Biomedical Entity Linking via Knowledge Base-Guided Pre-training and Synonyms-Aware Fine-Tuning[OL]. arXiv Preprint, arXiv: 2204.05164.
[32] Liang M, Zhang Z X, Zhang J Y, et al. Lab Indicators Standardization Method for the Regional Healthcare Platform: A Case Study on Heart Failure[J]. BMC Medical Informatics and Decision Making, 2020, 20(14): 331.
[33] Xu D F, Zhang Z Y, Bethard S. A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 8452-8464.
[34] Chong Weifeng, Li Hui, Li Xue, et al. Term Normalization System Based on BERT Entailment Reasoning[J]. Journal of Chinese Information Processing, 2021, 35(5): 86-90. (in Chinese)
[35] Chen T, Kornblith S, Norouzi M, et al. A Simple Framework for Contrastive Learning of Visual Representations[C]// Proceedings of the 37th International Conference on Machine Learning. 2020: 1597-1607.
[36] Ding J T, Quan Y H, Yao Q M, et al. Simplify and Robustify Negative Sampling for Implicit Collaborative Filtering[C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. 2020: 1094-1105.
[37] Zhang Y H, Zhu H J, Wang Y L, et al. A Contrastive Framework for Learning Sentence Representations from Pairwise and Triple-Wise Perspective in Angular Space[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022: 4892-4903.
[38] Peng H, Xiong Y, Xiang Y, et al. Biomedical Named Entity Normalization via Interaction-Based Synonym Marginalization[J]. Journal of Biomedical Informatics, 2022, 136: 104238.
[39] Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks[OL]. arXiv Preprint, arXiv: 1908.10084.
[40] Teng F, Liu Y M, Li T R, et al. A Review on Deep Neural Networks for ICD Coding[J]. IEEE Transactions on Knowledge and Data Engineering, 2023, 35(5): 4357-4375.
[41] China Anti-Cancer Association Breast Cancer Society. Guidelines and Norms for Diagnosis and Treatment of Breast Cancer of China Anti-Cancer Association (2021 Edition)[J]. China Oncology, 2021, 31(10): 954-1040. (in Chinese)
[42] Johnson A E W, Pollard T J, Shen L, et al. MIMIC-III, a Freely Accessible Critical Care Database[J]. Scientific Data, 2016, 3: 160035.
[43] Wen Pingmei, Ye Zhiwei, Ding Wenjian, et al. Developments of Named Entity Disambiguation[J]. Data Analysis and Knowledge Discovery, 2020, 4(9): 15-25. (in Chinese)