Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (9): 136-145     https://doi.org/10.11925/infotech.2096-3467.2022.0812
Research Paper
Recognizing Chinese Medical Literature Entities Based on Multi-Task and Transfer Learning*
Han Pu1,2, Gu Liang1, Ye Dongyu1, Chen Wenqi1
1School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
2Jiangsu Provincial Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract

[Objective] This paper uses transfer learning and multi-task learning to address the cold-start and entity boundary detection problems in Chinese medical literature entity recognition, and to further improve recognition accuracy. [Methods] First, we constructed a hybrid deep learning model, BERT-BiLSTM-IDCNN-CRF, for medical literature entity recognition. Second, we enriched the medical semantic features through instance, model, and feature transfer. Third, we used multi-task learning to construct a coarse-grained three-class classification task that helps the main entity recognition task exploit entity boundary information. Finally, we introduced a self-attention mechanism and a Highway network to capture globally important information and ease the training of the deep network, yielding the TLMT-BBIC-HS model. [Results] The TLMT-BBIC-HS model achieved an F1 value of 92.28% on the Chinese diabetes medical literature dataset, which is 15.99 and 16.44 percentage points higher than the baseline models BERT-BiLSTM-CRF and BERT-IDCNN-CRF, respectively. [Limitations] The domain adaptability of the model has not been verified. [Conclusions] The TLMT-BBIC-HS model enables the transfer and sharing of medical knowledge and is better suited to Chinese medical literature entity recognition. It can support medical and health information extraction and the construction of knowledge graphs and question answering systems.

Key words: Medical Literature Entity Extraction; Multi-Task Learning; Transfer Learning; Attention Mechanism; Highway Network
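
The abstract names a Highway network as the component that eases training of the deep network. As a rough illustration only, the sketch below implements the general highway gating formula from Srivastava et al. [19], y = T(x) * H(x) + (1 - T(x)) * x, in plain Python. The tanh transform, the weight values, and the names `highway_layer`, `affine`, and `sigmoid` are assumptions chosen for illustration; they are not taken from the TLMT-BBIC-HS implementation.

```python
import math


def sigmoid(z):
    """Logistic function, used for the transform gate."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Compute W @ x + b for a weight matrix given as nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x."""
    h = [math.tanh(v) for v in affine(w_h, b_h, x)]  # candidate transform H(x)
    t = [sigmoid(v) for v in affine(w_t, b_t, x)]    # transform gate T(x) in (0, 1)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


# Hypothetical 2-dimensional input and weights (illustrative values only).
x = [0.5, -1.0]
w_h = [[0.1, 0.2], [0.3, -0.4]]
b_h = [0.0, 0.0]
w_t = [[0.0, 0.0], [0.0, 0.0]]

# A strongly negative gate bias closes the gate, so the input is carried through.
y_carry = highway_layer(x, w_h, b_h, w_t, [-20.0, -20.0])
# A strongly positive gate bias opens the gate, so the output is close to H(x).
y_transform = highway_layer(x, w_h, b_h, w_t, [20.0, 20.0])
```

The two calls show the extremes of the gate: when it is closed the layer behaves like an identity connection, which is the property that makes deeper stacks easier to optimize.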
Received: 2022-08-03      Published: 2023-03-21
Chinese Library Classification (ZTFLH): G350; TP391
Funding: *National Social Science Fund of China (22BTQ096)
Corresponding author: Han Pu, ORCID: 0000-0001-5867-4292, E-mail: hanpu@njupt.edu.cn
Cite this article:
Han Pu, Gu Liang, Ye Dongyu, Chen Wenqi. Recognizing Chinese Medical Literature Entities Based on Multi-Task and Transfer Learning. Data Analysis and Knowledge Discovery, 2023, 7(9): 136-145.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0812      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I9/136
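| Original sentence | Task | Labeling rule |
| --- | --- | --- |
| 乏力等症状消失。 ("Fatigue and other symptoms disappeared.") | Chinese medical literature entity recognition | B-Symptom I-Symptom O O O O O O |
| | Coarse-grained three-class classification | B I O O O O O O |
Table 1  Multi-task labeling method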
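
Table 1 pairs each character's fine-grained entity tag with a coarse B/I/O tag for the auxiliary task. Assuming the coarse tag is simply the prefix of the fine-grained tag, which is what the example row suggests, the mapping can be sketched as follows. The function name `to_coarse_tags` is hypothetical and not from the paper.

```python
def to_coarse_tags(fine_tags):
    """Map fine-grained BIO tags (e.g. 'B-Symptom') to coarse B/I/O tags."""
    coarse = []
    for tag in fine_tags:
        prefix = tag.split("-", 1)[0]  # 'B-Symptom' -> 'B', 'O' -> 'O'
        if prefix not in {"B", "I", "O"}:
            raise ValueError(f"unexpected tag: {tag!r}")
        coarse.append(prefix)
    return coarse


# Example sentence and character-level tags copied from Table 1.
sentence = "乏力等症状消失。"
fine_tags = ["B-Symptom", "I-Symptom", "O", "O", "O", "O", "O", "O"]
coarse_tags = to_coarse_tags(fine_tags)

# Print one character per line with both label sequences side by side.
for char, fine, coarse in zip(sentence, fine_tags, coarse_tags):
    print(char, fine, coarse)
```

Because the coarse labels drop the entity type, the auxiliary task only has to decide where an entity starts and continues, which is the boundary signal the main task is meant to share.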
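Fig.1  Research framework
Fig.2  BERT-Base-Chinese embedding vector training
Fig.3  The TLMT-BBIC-HS model for Chinese medical literature entity recognition
Fig.4  Original and annotated documents in the dataset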
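| Parameter | BiLSTM/IDCNN | BERT | Pre-training |
| --- | --- | --- | --- |
| Embedding dimension | 100 | 768 | 768 |
| Hidden units | 128 | 768 | 768 |
| Batch size | 32 | 16 | 32 |
| Epochs | 100 | 100 | 100 |
| Clip | 5 | - | - |
| Learning rate | 0.001 | 0.001 | 2e-5 |
| Max sequence length | - | 128 | 128 |
| Dropout | 0.5 | | |
| Optimizer | Adam | | |
Table 2  Experimental parameter settings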
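| Experiment | Model | P (%) | R (%) | F1 (%) |
| --- | --- | --- | --- | --- |
| Baseline | BERT-BiLSTM-CRF | 76.12 | 76.45 | 76.29 |
| Baseline | BERT-IDCNN-CRF | 76.33 | 75.35 | 75.84 |
| Baseline | BERT-BiLSTM-IDCNN-CRF | 77.37 | 77.25 | 77.31 |
| Transfer learning: + model transfer | M-BBIC | 82.00 | 81.08 | 81.54 |
| Transfer learning: + instance transfer | I-BBIC | 82.14 | 82.20 | 82.17 |
| Transfer learning: + feature transfer | TL-BBIC | 89.75 | 90.18 | 89.96 |
| Multi-task | TLMT-BBIC | 90.69 | 91.26 | 90.97 |
| Self-attention | TLMT-BBIC-S | 91.80 | 91.55 | 91.68 |
| Highway network | TLMT-BBIC-HS | 92.00 | 92.56 | 92.28 |
Table 3  Comparison of experimental results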
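
The F1 column in Table 3 and the improvements quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes F1 is the usual harmonic mean of P and R and uses only values copied from Table 3. The helper name `f1_score` is hypothetical. Recomputing from the rounded P and R can differ from a reported F1 in the last digit for some rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from Table 3.
final_p, final_r, final_f1 = 92.00, 92.56, 92.28   # TLMT-BBIC-HS
baseline_f1 = {"BERT-BiLSTM-CRF": 76.29, "BERT-IDCNN-CRF": 75.84}

# F1 recomputed from the reported P and R of the final model.
recomputed_f1 = round(f1_score(final_p, final_r), 2)

# Absolute gain in percentage points versus each baseline.
point_gains = {name: round(final_f1 - f1, 2) for name, f1 in baseline_f1.items()}

# Relative gain in percent, which is a different quantity from percentage points.
relative_gains = {name: round(100 * (final_f1 - f1) / f1, 2)
                  for name, f1 in baseline_f1.items()}

print(recomputed_f1, point_gains, relative_gains)
```

The absolute gains come out at 15.99 and 16.44 percentage points, matching the figures in the abstract; the relative gains are larger numbers and should not be confused with them.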
Fig.5  P, R, and F1 values for the 15 entity types
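| Model | P (%) | R (%) | F1 (%) |
| --- | --- | --- | --- |
| Two-layer BiLSTM-CRF (He Chunhui et al. [26]) | - | - | 72.89 |
| Fusion Multi Feature-CNN-BiLSTM-CRF (Shang et al. [27]) | 79.47 | 76.72 | 78.07 |
| B-SABCN (Deng et al. [28]) | 78.29 | 78.03 | 78.16 |
| RoBERTa-CRF (Wang et al. [29]) | 91.18 | 91.36 | 91.27 |
Table 4  Existing research results on the Chinese diabetes medical literature dataset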
[1] Zhao Yang, Zhang Zhixiong, Liu Huan, et al. Classification of Chinese Medical Literature with BERT Model[J]. Data Analysis and Knowledge Discovery, 2020, 4(8): 41-49.
[2] Li Yueyan, Wang Hao, Deng Sanhong, et al. Research on Semantic Relevance of Medical Text Oriented to Event Ontology[J]. Journal of the China Society for Scientific and Technical Information, 2022, 41(5): 497-511.
[3] Coden A, Savova G, Sominsky I, et al. Automatically Extracting Cancer Disease Characteristics from Pathology Reports into a Disease Knowledge Representation Model[J]. Journal of Biomedical Informatics, 2009, 42(5): 937-949.
doi: 10.1016/j.jbi.2008.12.005 pmid: 19135551
[4] Jiang M, Chen Y K, Liu M, et al. A Study of Machine-Learning-Based Approaches to Extract Clinical Entities and Their Assertions from Discharge Summaries[J]. Journal of the American Medical Informatics Association, 2011, 18(5): 601-606.
doi: 10.1136/amiajnl-2011-000163 pmid: 21508414
[5] Liu Z J, Yang M, Wang X L, et al. Entity Recognition from Clinical Texts via Recurrent Neural Network[J]. BMC Medical Informatics and Decision Making, 2017, 17(Suppl 2): Article No.67.
[6] Gajendran S, Manjula D, Sugumaran V. Character Level and Word Level Embedding with Bidirectional LSTM—Dynamic Recurrent Neural Network for Biomedical Named Entity Recognition from Literature[J]. Journal of Biomedical Informatics, 2020, 112: Article No.103609.
[7] Li X Y, Zhang H, Zhou X H. Chinese Clinical Named Entity Recognition with Variant Neural Structures Based on BERT Methods[J]. Journal of Biomedical Informatics, 2020, 107: Article No.103422.
[8] Lü Jianghai, Du Junping, Zhou Nan, et al. Entity Name Recognition Method Based on Dilated Convolutional Iterative and Attention Mechanism[J]. Computer Engineering, 2021, 47(1): 58-65.
doi: 10.19678/j.issn.1000-3428.0055986
[9] Giorgi J M, Bader G D. Transfer Learning for Biomedical Named Entity Recognition with Neural Networks[J]. Bioinformatics, 2018, 34(23): 4087-4094.
doi: 10.1093/bioinformatics/bty449 pmid: 29868832
[10] Smetanin S, Komarov M. Deep Transfer Learning Baselines for Sentiment Analysis in Russian[J]. Information Processing & Management, 2021, 58(3): Article No.102484.
[11] Fu W L, Xue B, Gao X Y, et al. Transductive Transfer Learning Based Genetic Programming for Balanced and Unbalanced Document Classification Using Different Types of Features[J]. Applied Soft Computing, 2021, 103: Article No.107172.
[12] Mignone P, Pio G, D’Elia D, et al. Exploiting Transfer Learning for the Reconstruction of the Human Gene Regulatory Network[J]. Bioinformatics, 2020, 36(5): 1553-1561.
doi: 10.1093/bioinformatics/btz781 pmid: 31608946
[13] Xiong Xin, Wang Hao, Deng Sanhong. A Study on Term Extraction Model with Transfer Learning for Knowledge Graph of Local Chronicles[J]. Information Studies: Theory & Application, 2021, 44(4): 176-184.
[14] Han Pu, Zhang Zhanpeng, Zhang Wei. Chinese Disease Name Normalization Based on Multi-Task Learning and Polymorphic Semantic Features[J]. Journal of the China Society for Scientific and Technical Information, 2021, 40(11): 1234-1244.
[15] Crichton G, Pyysalo S, Chiu B, et al. A Neural Network Multi-Task Learning Approach to Biomedical Named Entity Recognition[J]. BMC Bioinformatics, 2017, 18(1): 1-14.
[16] Wu C C, Luo G, Guo C, et al. An Attention-Based Multi-Task Model for Named Entity Recognition and Intent Analysis of Chinese Online Medical Questions[J]. Journal of Biomedical Informatics, 2020, 108: Article No.103511.
[17] Aguilar G, Maharjan S, López-Monroy A P, et al. A Multi-Task Approach for Named Entity Recognition in Social Media Data[OL]. arXiv Preprint, arXiv: 1906.04135.
[18] Wang D S, Fan H J, Liu J F. Learning with Joint Cross-Document Information via Multi-Task Learning for Named Entity Recognition[J]. Information Sciences, 2021, 579: 454-467.
doi: 10.1016/j.ins.2021.08.015
[19] Srivastava R K, Greff K, Schmidhuber J. Highway Networks[OL]. arXiv Preprint, arXiv: 1505.00387.
[20] Zuo M, Zhang Y. Dataset-Aware Multi-Task Learning Approaches for Biomedical Named Entity Recognition[J]. Bioinformatics, 2020, 36(15): 4331-4338.
doi: 10.1093/bioinformatics/btaa515 pmid: 32415963
[21] Liu L Y, Shang J B, Ren X A, et al. Empower Sequence Labeling with Task-Aware Neural Language Model[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 2018, 32(1): 5253-5260.
[22] Narayanan S, Achan P, Rangan P V, et al. Unified Concept and Assertion Detection Using Contextual Multi-Task Learning in a Clinical Decision Support System[J]. Journal of Biomedical Informatics, 2021, 122: Article No.103898.
[23] Wang Dongbo, Liu Chang, Zhu Zihe, et al. Construction and Application of Pre-Trained Models of Siku Quanshu in Orientation to Digital Humanities[J]. Library Tribune, 2022, 42(6): 31-43.
[24] Aliyun. A Labeled Chinese Dataset for Diabetes[EB/OL]. [2022-06-28]. https://tianchi.aliyun.com/competition/entrance/231687/information.
[25] Aya Mohamed Abdelaty Elkased. Enhanced Drug-Drug Interaction Extraction Model from Biomedical Text Using BioBERT[D]. Harbin: Harbin Institute of Technology, 2021.
[26] He Chunhui, Wang Mengxian, He Xiaobo. Named Entity Recognition in the Field of Diabetes Based on Double-layer Bi-LSTM-CRF Model[J]. Journal of Shaoyang University (Natural Science Edition), 2020, 17(1): 21-26.
[27] Shang F J, Ran C F. An Entity Recognition Model Based on Deep Learning Fusion of Text Feature[J]. Information Processing & Management, 2022, 59(2): Article No.102841.
[28] Deng J F, Cheng L L, Wang Z W. Self-Attention-Based BiGRU and Capsule Network for Named Entity Recognition[OL]. arXiv Preprint, arXiv: 2002.00735.
[29] Wang Y, Sun Y N, Ma Z C, et al. Named Entity Recognition in Chinese Medical Literature Using Pretraining Models[J]. Scientific Programming, 2020, 2020: Article No.8812754.