Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (6): 144-157
Standardization of Chinese Medical Terminology Based on Multi-Strategy Comparison Learning
Yue Chonghao1,2, Zhang Jian3, Wu Yirong1,2, Li Xiaolong1,4, Hua Sheng1,2, Tong Shunhang1,2, Sun Shuifa1,3
1Yichang Key Laboratory of Intelligent Medicine, Yichang 443002, China
2College of Computer and Information Technology, China Three Gorges University, Yichang 443002, China
3School of Information Science and Technology, Hangzhou Normal University, Hangzhou 311121, China
4College of Economics & Management, China Three Gorges University, Yichang 443002, China

[Objective] Chinese medical terminology standardization must cope with short texts, highly similar terms, and both single and multiple entailment (one original term mapping to one or to several standard entities). This paper proposes a recall-ranking-quantity prediction framework that fuses multiple strategies with contrastive learning. [Methods] First, we combined text-statistical features with deep semantic features to recall candidate entities and built the candidate set from their similarity scores. Second, in the candidate ranking stage, we trained vector representations of the original terms, the standard entities, and the recalled candidate entities with a pre-trained model and contrastive learning, then re-ranked the candidates by cosine similarity. Third, we updated the vector representation of each original term with multi-head attention to predict how many standard entities it maps to. Finally, we selected the standard entities according to the predicted quantity, combining the similarity scores from candidate recall and candidate ranking. [Results] On the Chinese medical terminology standardization dataset Yidu-N7k, the proposed framework reached an accuracy of 92.17%, outperforming the statistical models and mainstream deep learning models and improving on the pre-trained binary classification baselines by up to 0.94 percentage points. On a dataset built from 150 expert-labeled mammography examination reports for female breast cancer, the framework reached an accuracy of 97.85%, the best among the compared models. [Limitations] The experiments were conducted only on medical datasets; effectiveness in other domains remains to be explored. [Conclusions] Multi-strategy candidate recall draws on both statistical and semantic text information to cope with short texts. Candidate ranking with contrastive learning captures subtle textual differences to cope with high similarity. Quantity prediction with multi-head attention strengthens the vector representations and handles both single and multiple entailment. The proposed method could support medical information mining and clinical research.
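
The recall, ranking, and quantity-based selection steps described in the abstract can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: character-set Jaccard overlap stands in for the text-statistical features, the toy 3-dimensional vectors stand in for embeddings from a pre-trained model trained with contrastive learning, the same vectors are reused for recall and ranking, the fusion weights `alpha` and `beta` are hypothetical, and the predicted quantity is passed in directly instead of coming from a multi-head attention model.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def char_jaccard(a: str, b: str) -> float:
    """Character-set Jaccard overlap (assumed stand-in for statistical features)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def select_standard_entities(
    term: str,
    term_vec: Sequence[float],
    entity_vecs: Dict[str, Sequence[float]],
    n_predicted: int,
    top_k: int = 3,
    alpha: float = 0.5,  # hypothetical weight: statistical vs. semantic recall score
    beta: float = 0.5,   # hypothetical weight: recall score vs. ranking score
) -> List[str]:
    # Stage 1 (recall): fuse statistical and semantic similarity, keep top_k.
    recall = {
        entity: alpha * char_jaccard(term, entity)
        + (1 - alpha) * cosine(term_vec, vec)
        for entity, vec in entity_vecs.items()
    }
    candidates = sorted(recall, key=recall.get, reverse=True)[:top_k]

    # Stage 2 (ranking): score each candidate by cosine similarity to the term.
    rank = {entity: cosine(term_vec, entity_vecs[entity]) for entity in candidates}

    # Stages 3-4: combine both scores and keep as many entities as the
    # quantity predictor says (n_predicted is supplied by the caller here).
    fused = {e: beta * recall[e] + (1 - beta) * rank[e] for e in candidates}
    return sorted(candidates, key=fused.get, reverse=True)[:n_predicted]


# Toy data: all strings and vectors below are invented for illustration.
term = "lung cancer with lymph node metastasis"
term_vec = [0.7, 0.7, 0.1]
entity_vecs = {
    "lung cancer": [0.9, 0.1, 0.0],
    "lymph node metastasis": [0.1, 0.9, 0.0],
    "breast cancer": [0.5, 0.0, 0.8],
    "gastric ulcer": [0.0, 0.1, 0.9],
}

print(select_standard_entities(term, term_vec, entity_vecs, n_predicted=2))
```

With a predicted quantity of 2 the sketch returns two standard entities for the one original term, which is the multiple-entailment case; with a predicted quantity of 1 it reduces to ordinary top-1 selection.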

Key words: Medical Terminology Normalization; Multi-Strategy Candidate Recall; Contrastive Learning; Breast Cancer; Mammography Examination Report
Received: 2023-09-21      Published: 2024-04-18
CLC number (ZTFLH):  TP393
Corresponding author: Sun Shuifa, ORCID: 0000-0003-0933-152X
Yue Chonghao, Zhang Jian, Wu Yirong, Li Xiaolong, Hua Sheng, Tong Shunhang, Sun Shuifa. Standardization of Chinese Medical Terminology Based on Multi-Strategy Comparison Learning. Data Analysis and Knowledge Discovery, 2024, 8(6): 144-157.
Fig.1  Example from the Chinese medical terminology standardization dataset Yidu-N7k
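| Method category | Main models | Advantages | Disadvantages |
|---|---|---|---|
| Rule-based and string matching | Heuristic rules[15-16], string matching[17-18] | Highly interpretable | Poor portability; labor-intensive |
| Machine learning | Support vector machine[19], k-nearest neighbors[20] | No hand-written rules needed; less manual effort | Cannot take contextual information into account; limited |
| Deep learning | Convolutional neural network[21], recurrent neural… | Strong representation ability; can express rich semantic information | Demanding on hardware |

Table 1  Advantages and disadvantages of conventional medical terminology standardization methods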
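| Method category | Representative literature |
|---|---|
| Sequence generation | Yan et al.[28] designed a generate-and-rank framework that addresses the multiple-entailment problem in Chinese terminology standardization. |
| Multi-class classification | Li et al.[30] introduced EhrBERT, an improved BERT model, and demonstrated the effectiveness of BERT-based models on the standardization task. |
| Binary classification | Liang et al.[32] treated standardization as a binary classification task and trained it with active learning to reduce annotation cost. |
| | Liang et al.[7] used TF-IDF (Term Frequency-Inverse Document Frequency) and BERT to generate candidate entities in the recall stage, then re-ranked them with a keyword-aware binary classification task. |

Table 2  Representative literature on medical terminology standardization methods based on pre-trained models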
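Fig.2  Overall model framework
Fig.3  Multi-strategy candidate entity recall
Fig.4  Candidate ranking with contrastive learning
Fig.5  Quantity prediction with multi-head attention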
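| Split | Single: one-to-one | Multiple: one-to-two | Multiple: one-to-three | Multiple: one-to-four | Multiple: one-to-five | Multiple: one-to-six | Multiple: one-to-seven | Total |
|---|---|---|---|---|---|---|---|---|
| Training set | 3,801 | 148 | 34 | 16 | 0 | 0 | 1 | 4,000 |
| Validation set | 950 | 39 | 9 | 2 | 0 | 0 | 0 | 1,000 |
| Test set | 1,901 | 77 | 18 | 3 | 1 | 0 | 0 | 2,000 |

Table 3  The Chinese medical terminology standardization dataset Yidu-N7k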
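| Split | Single entailment (one-to-one) | Multiple entailment (one-to-two) | Total |
|---|---|---|---|
| Training set | 875 | 6 | 881 |
| Test set | 93 | 2 | 95 |

Table 4  Breast cancer mammography diagnostic terminology standardization dataset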
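| Category | Model | Single-entailment accuracy | Multiple-entailment accuracy | Accuracy |
|---|---|---|---|---|
| Statistical models | TF-IDF | 49.30 | | 46.80 |
| | Edit distance | 50.80 | | 48.30 |
| | BM25 | 65.83 | | 62.57 |
| Multi-class classification models | BERT | 88.60 | | 84.20 |
| | PLM-ICD*[29] | 92.64 | | 88.03 |
| Generative model | Sequence generation[28] | 91.10 | 52.40 | 89.30 |
| Binary classification models | MTCEN*[9] | 93.27 | 52.35 | 91.23 |
| | CMTN*[7] | 93.48 | 53.02 | 91.47 |
| Contrastive learning model | This paper | 94.18 | 53.69 | 92.17 |

Table 5  Experimental results on the Yidu-N7k dataset/%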
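| Category | Model | Accuracy/% |
|---|---|---|
| Multi-class classification model | PLM-ICD*[29] | 97.48 |
| Binary classification model | MTCEN*[9] | 97.62 |
| Contrastive learning model | This paper | 97.85 |

Table 6  Experimental results on the breast cancer mammography diagnostic terminology standardization dataset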
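| Model | Recall/% |
|---|---|
| TF-IDF* | 68.33 |
| ES database* | 96.73 |
| BERT | 96.90 |
| MTCEN* | 97.20 |
| CMTN | 98.30 |
| This paper | 98.93 |

Table 7  Candidate recall results on the Yidu-N7k dataset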
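| Model | Single-entailment accuracy | Multiple-entailment accuracy | Accuracy |
|---|---|---|---|
| BERT | 99.12 | 63.12 | 97.43 |
| Sequence generation | 98.60 | 76.40 | 97.70 |
| MTCEN* | 99.40 | 75.84 | 98.23 |
| CMTN | 99.58 | 70.40 | 98.23 |
| This paper | 99.58 | 75.84 | 98.40 |

Table 8  Quantity prediction results on the Yidu-N7k dataset/%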
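| Model | Single-entailment accuracy | Multiple-entailment accuracy | Accuracy |
|---|---|---|---|
| This paper | 94.18 | 53.69 | 92.17 |
| w/o ES database | 93.48 | 48.99 | 91.27 |
| w/o contrastive learning | 92.95 | 48.32 | 90.73 |
| w/o multi-head attention | 94.07 | 51.68 | 91.97 |

Table 9  Ablation results on the Yidu-N7k dataset/%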
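
To read the ablation results at a glance, the drop in each accuracy column can be computed as the full model's score minus the ablated variant's score. The short Python sketch below only does that subtraction on the values in Table 9; the variable and function names are illustrative, and the differences are percentage points with no claim about statistical significance.

```python
# Accuracy values (%) copied from Table 9, ordered as
# (single-entailment, multiple-entailment, overall).
full_model = (94.18, 53.69, 92.17)
ablations = {
    "w/o ES database": (93.48, 48.99, 91.27),
    "w/o contrastive learning": (92.95, 48.32, 90.73),
    "w/o multi-head attention": (94.07, 51.68, 91.97),
}


def drops(full, ablated):
    """Percentage-point drop per column, rounded to two decimals."""
    return tuple(round(f - a, 2) for f, a in zip(full, ablated))


for name, scores in ablations.items():
    single, multiple, overall = drops(full_model, scores)
    print(f"{name}: single -{single:.2f}, multiple -{multiple:.2f}, overall -{overall:.2f}")
```

In every variant the multiple-entailment column loses more percentage points than the single-entailment column, and removing contrastive learning gives the largest drop in all three columns.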