Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (12): 61-69     https://doi.org/10.11925/infotech.2096-3467.2019.0684
Research Paper
Identifying Entities of Online Questions from Cancer Patients Based on Transfer Learning *
(Chinese title, translated: Named Entity Recognition in Online Questions from Liver Cancer Patients: A Transfer Learning Approach)
Meishan Chen, Chenxi Xia (corresponding author)
School of Medicine and Health Management, Huazhong University of Science and Technology, Wuhan 430073, China
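Abstract (translated from the Chinese)

[Objective] To make full use of annotated source-domain corpora and reusable pre-trained character-embedding models in order to address named entity recognition when annotated target-domain corpora are scarce. [Methods] Using online consultation texts from patients on the topics of lung cancer and liver cancer as experimental data, we propose a KNN-BERT-BiLSTM-CRF framework that combines instance transfer and model transfer, and apply it to cross-domain named entity recognition on liver cancer patient questions for which only a small amount of annotation is available. [Results] With the instance-transfer k value set to 3, the KNN-BERT-BiLSTM-CRF model achieved its best entity recognition performance, with an F value of 96.10%, an improvement of 1.98 percentage points over the model without instance transfer. [Limitations] The transfer performance of this method on target domains that differ more substantially, such as other data sources or other diseases, remains to be verified. [Conclusions] When annotated target-domain corpora are limited, cross-domain transfer learning that draws on the prior knowledge of large pre-trained models and on out-of-domain annotated corpora can improve named entity recognition performance.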

Keywords: BERT; BiLSTM; Named Entity Recognition; Transfer Learning
Abstract

[Objective] This study uses an annotated source-domain corpus together with a pre-trained model to identify entities in a corpus with limited annotation. [Methods] First, we collected online questions from patients with lung or liver cancer. We then developed a KNN-BERT-BiLSTM-CRF framework combining instance transfer and parameter transfer, which recognizes named entities from a small amount of labeled data. [Results] When the k value for instance transfer was set to 3, named entity recognition performed best: the F value was 96.10%, which is 1.98 percentage points higher than that of the model without instance transfer. [Limitations] The proposed method still needs to be examined on entities from other diseases and data sources. [Conclusions] The cross-domain transfer learning method can improve the performance of entity identification when target-domain annotation is limited.

Key words: BERT; BiLSTM; Named Entity Recognition; Transfer Learning
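
The abstract says only that instance transfer is controlled by a k value; it does not spell out the selection rule. The following is a minimal sketch of one plausible reading, assuming each sentence is already represented as a fixed-length vector and that, for every target-domain sentence, the k most similar source-domain sentences (by cosine similarity) are added to the training pool. The function names, the cosine measure, and the union rule are illustrative assumptions, not the authors' confirmed procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_source_instances(target_vecs, source_vecs, k):
    """Return sorted indices of source sentences that are among the
    k nearest neighbours of at least one target sentence.

    Assumption: neighbours are ranked by cosine similarity and the
    per-target selections are merged by set union.
    """
    selected = set()
    for t in target_vecs:
        ranked = sorted(
            range(len(source_vecs)),
            key=lambda i: cosine(t, source_vecs[i]),
            reverse=True,
        )
        selected.update(ranked[:k])
    return sorted(selected)


# Toy 2-dimensional "sentence vectors" (hypothetical values).
source = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
target = [[1.0, 0.05], [0.1, 1.0]]

chosen = select_source_instances(target, source, k=1)
print(chosen)
```

In this sketch k=0 selects no source sentences, which corresponds to training without instance transfer; larger k values pull in progressively less similar source sentences.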
Received: 2019-06-14      Published: 2019-12-25
Chinese Library Classification (ZTFLH): TP391
Funding: *This work was supported by the Independent Innovation Fund of the Fundamental Research Funds for the Central Universities, project "Research on Sentiment Analysis and Opinion Mining Methods for Social Networks" (Grant No. 0118516036).
Corresponding author: Chenxi Xia     E-mail: xcxxdy@hust.edu.cn
Cite this article:
Meishan Chen, Chenxi Xia. Identifying Entities of Online Questions from Cancer Patients Based on Transfer Learning. Data Analysis and Knowledge Discovery, 2019, 3(12): 61-69.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0684      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I12/61
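Figure: Structure of the KNN-BERT-BiLSTM-CRF model
Figure: BERT model input process
Figure: BERT model fine-tuning process
Figure: Proportion of sentence lengths in each domain's dataset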
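Table: Named entity catalog

| Entity type | Brief definition | Examples | Target-domain annotations | Source-domain annotations |
|---|---|---|---|---|
| Body part | Organs, body parts, and tissues | head, neck | 1,359 | 6,876 |
| Cellular entity | Cells, molecules, or anatomical entities at the cellular level | hemoglobin, giant cell | 130 | 398 |
| Diagnostic procedure | Tests and biopsy procedures used for diagnosis | biopsy, CT, B-mode ultrasound, iron level | 156 | 1,102 |
| Drug | Substances used for therapeutic purposes | Huachansu capsule, morphine | 259 | 1,805 |
| Measurement | A core attribute of a named entity, such as a drug dosage | 10 mg, 2% | 78 | 257 |
| Individual | Individuals (sex, age, etc.) and population groups | father, female, 16 years old | 1,188 | 2,506 |
| Problem | Diseases, symptoms, abnormalities, and complications | pain, rupture, lung cancer, tumor | 4,975 | 25,427 |
| Treatment procedure | Procedures, medicines, or devices used for treatment, as well as unspecified implantation, preventive, and surgical interventions | nephroscopic resection, implantation, chemotherapy | 1,003 | 4,169 |
| Cancer stage | A way of determining how far a cancer has developed and spread | early stage, initial stage, late stage | 1,142 | 4,304 |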
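Table: Composition of the datasets for each domain

| Name | Type | Size (sentences) | Annotation status |
|---|---|---|---|
| Source-domain dataset | Lung cancer | 11,822 | Annotated |
| Target-domain dataset | Liver cancer | 2,000 | Annotated |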
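Table: Model parameter settings

| Network layer | Parameter | Value |
|---|---|---|
| Doc2Vec | Algorithm | DM |
| Doc2Vec | Window size | 5 |
| Doc2Vec | Minimum word frequency | 5 |
| Doc2Vec | Learning rate | Decreasing from 0.025 to 0.001 |
| Doc2Vec | Vector dimension | 100 |
| BERT | Batch size | 32 |
| BERT | Learning rate | 2e-5 |
| BERT | Maximum sequence length | 128 |
| BERT | Number of epochs | 10 |
| BERT | Optimizer | Adam |
| BiLSTM | L2 regularization | 0.001 |
| BiLSTM | Number of epochs | 10 |
| BiLSTM | Dropout | 0.5 |
| Word2Vec | Algorithm | Skip-gram |
| Word2Vec | Window size | 5 |
| Word2Vec | Learning rate | Decreasing from 0.025 to 0.001 |
| Word2Vec | Minimum word frequency | 3 |
| Word2Vec | Vector dimension | 100 |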
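
For readers who want to reuse these settings, the table can be transcribed into a plain configuration dictionary. This is only a sketch: the key names are hypothetical, and the page does not say which libraries were used or how the learning rate decreases from 0.025 to 0.001, so the linear schedule below is an assumption for illustration.

```python
# Hyperparameters transcribed from the parameter table.
# Key names are hypothetical and chosen for readability only.
CONFIG = {
    "doc2vec": {"algorithm": "DM", "window": 5, "min_count": 5,
                "lr_start": 0.025, "lr_end": 0.001, "vector_size": 100},
    "bert": {"batch_size": 32, "learning_rate": 2e-5,
             "max_seq_length": 128, "epochs": 10, "optimizer": "Adam"},
    "bilstm": {"l2": 0.001, "epochs": 10, "dropout": 0.5},
    "word2vec": {"algorithm": "Skip-gram", "window": 5, "min_count": 3,
                 "lr_start": 0.025, "lr_end": 0.001, "vector_size": 100},
}


def linear_lr(step, total_steps, start, end):
    """Linearly interpolate the learning rate from start to end.

    Assumption: the decay is linear; the source only gives the endpoints.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


d2v = CONFIG["doc2vec"]
halfway = linear_lr(50, 100, d2v["lr_start"], d2v["lr_end"])
print(f"Doc2Vec learning rate halfway through training: {halfway:.3f}")
```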
Table: Comparison of model transfer experiment results

| Model | P (%) | R (%) | F (%) |
|---|---|---|---|
| Word2Vec-BiLSTM-CRF | 85.98 | 86.55 | 86.26 |
| BERT-BiLSTM-CRF | 92.91 | 95.36 | 94.12 |
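
The page does not define the F value, but the reported figures are consistent with the balanced F-measure, the harmonic mean of precision and recall. The short check below assumes that definition and recomputes F from the P and R columns of the model transfer table.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs in percent, copied from the model transfer table.
rows = {
    "Word2Vec-BiLSTM-CRF": (85.98, 86.55),
    "BERT-BiLSTM-CRF": (92.91, 95.36),
}

for name, (p, r) in rows.items():
    print(f"{name}: F = {f_measure(p, r):.2f}")
# prints:
# Word2Vec-BiLSTM-CRF: F = 86.26
# BERT-BiLSTM-CRF: F = 94.12
```

Both recomputed values match the F column of the table.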
Figure: Effect of training set size on transfer performance
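Table: Comparison of instance transfer experiment results (%)

| Model | Metric | k=0 | k=1 | k=2 | k=3 | k=4 | k=5 | k=6 |
|---|---|---|---|---|---|---|---|---|
| KNN-BERT-BiLSTM-CRF | P | 92.91 | 93.54 | 94.89 | 95.74 | 95.40 | 94.73 | 94.60 |
| KNN-BERT-BiLSTM-CRF | R | 95.36 | 95.74 | 96.51 | 96.75 | 96.24 | 96.30 | 95.68 |
| KNN-BERT-BiLSTM-CRF | F | 94.12 | 94.63 | 95.69 | 96.10 | 95.82 | 95.51 | 95.14 |
| KNN-Word2Vec-BiLSTM-CRF | P | 85.98 | 88.73 | 90.45 | 91.48 | 91.65 | 91.03 | 90.77 |
| KNN-Word2Vec-BiLSTM-CRF | R | 86.55 | 89.57 | 91.30 | 92.48 | 92.62 | 92.05 | 91.90 |
| KNN-Word2Vec-BiLSTM-CRF | F | 86.26 | 89.15 | 90.87 | 91.98 | 92.13 | 91.54 | 91.33 |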
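
The best k for each model, and the gain over the k=0 setting (no instance transfer), can be read off the F rows of the instance transfer table. The snippet below is a small sketch that does this arithmetic; the dictionary and function names are illustrative.

```python
# F values (%) for k = 0..6, copied from the instance transfer table.
F_BY_K = {
    "KNN-BERT-BiLSTM-CRF": [94.12, 94.63, 95.69, 96.10, 95.82, 95.51, 95.14],
    "KNN-Word2Vec-BiLSTM-CRF": [86.26, 89.15, 90.87, 91.98, 92.13, 91.54, 91.33],
}


def best_k(f_values):
    """Return (k with the highest F, gain in percentage points over k=0)."""
    k = max(range(len(f_values)), key=lambda i: f_values[i])
    gain = f_values[k] - f_values[0]
    return k, round(gain, 2)


for model, f_values in F_BY_K.items():
    k, gain = best_k(f_values)
    print(f"{model}: best k = {k}, gain over k=0 = {gain:.2f} points")
# prints:
# KNN-BERT-BiLSTM-CRF: best k = 3, gain over k=0 = 1.98 points
# KNN-Word2Vec-BiLSTM-CRF: best k = 4, gain over k=0 = 5.87 points
```

The 1.98-point gain at k=3 is the improvement quoted in the abstract.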
Figure: Recognition results of the KNN-BERT-BiLSTM-CRF model
Figure: Recognition results of the KNN-Word2Vec-BiLSTM-CRF model
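Table: Comparison of overall experiment results

| Model | P (%) | R (%) | F (%) |
|---|---|---|---|
| Word2Vec-BiLSTM-CRF | 85.98 | 86.55 | 86.26 |
| KNN-Word2Vec-BiLSTM-CRF (k=4) | 91.65 | 92.62 | 92.13 |
| BERT-BiLSTM-CRF | 92.91 | 95.36 | 94.12 |
| KNN-BERT-BiLSTM-CRF (k=3) | 95.47 | 96.75 | 96.10 |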