Data Analysis and Knowledge Discovery, 2019, Vol. 3, Issue 12: 61-69    DOI: 10.11925/infotech.2096-3467.2019.0684
Research Paper
Identifying Entities of Online Questions from Cancer Patients Based on Transfer Learning
Meishan Chen, Chenxi Xia
School of Medicine and Health Management, Huazhong University of Science and Technology, Wuhan 430073, China
Abstract

[Objective] This study makes full use of an annotated source-domain corpus and a reusable pre-trained character-embedding model to address named entity recognition in a target domain where annotated data is scarce. [Methods] Using online patient questions about lung cancer and liver cancer as experimental data, we propose a KNN-BERT-BiLSTM-CRF framework that combines instance transfer and model transfer, and apply it to cross-domain named entity recognition on liver-cancer question texts with only a small amount of annotation. [Results] With the k value of instance transfer set to 3, the KNN-BERT-BiLSTM-CRF model achieves its best recognition performance, with an F value of 96.10%, 1.98 percentage points higher than without instance transfer. [Limitations] The transfer performance on more divergent target domains, such as other data sources or diseases, remains to be verified. [Conclusions] When the target domain has limited annotated data, cross-domain transfer learning that draws on the prior knowledge of large pre-trained models and out-of-domain annotated corpora can improve named entity recognition.

Abstract

[Objective] This study utilizes an annotated source-domain corpus together with a pre-trained language model to identify entities in a target-domain corpus with limited annotation. [Methods] First, we collected online questions from patients with lung or liver cancer. Then we developed a KNN-BERT-BiLSTM-CRF framework combining instance transfer and parameter transfer to recognize named entities with only a small amount of labeled target-domain data. [Results] When the k value of instance transfer was set to 3, the model achieved its best named entity recognition performance, with an F value of 96.10%, 1.98 percentage points higher than the model without instance transfer. [Limitations] The proposed method still needs to be validated on more divergent target domains, such as other data sources or diseases. [Conclusions] When target-domain annotations are scarce, cross-domain transfer learning can improve the performance of entity identification.

Keywords: BERT; BiLSTM; Named Entity Recognition; Transfer Learning
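The instance-transfer step described in the abstract — selecting, for each target-domain sentence, the k most similar source-domain sentences to add to the training set — can be sketched as follows. This is a minimal illustration assuming cosine similarity over sentence vectors (such as those produced by Doc2Vec); the function name `knn_instance_transfer` is illustrative, not from the paper's code.

```python
import math
from heapq import nlargest

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_instance_transfer(target_vecs, source_vecs, k):
    """For every target-domain sentence vector, select the k most similar
    source-domain sentences; the union of selections is added to training."""
    selected = set()
    for t in target_vecs:
        sims = [(cosine(t, s), i) for i, s in enumerate(source_vecs)]
        for _, i in nlargest(k, sims):
            selected.add(i)
    return sorted(selected)
```

In practice k is tuned on held-out data; the paper reports k=3 as best for the BERT variant and k=4 for the Word2Vec variant.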
Received: 2019-06-14
CLC number: TP391
Funding: Supported by the Fundamental Research Funds for the Central Universities (Independent Innovation Fund), "Research on Sentiment Analysis and Opinion Mining Methods for Social Networks" (Grant No. 0118516036)
Corresponding author: Chenxi Xia, E-mail: xcxxdy@hust.edu.cn
Cite this article:
Meishan Chen, Chenxi Xia. Identifying Entities of Online Questions from Cancer Patients Based on Transfer Learning[J]. Data Analysis and Knowledge Discovery, 2019, 3(12): 61-69. DOI: 10.11925/infotech.2096-3467.2019.0684.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0684
Fig. 1  Architecture of the KNN-BERT-BiLSTM-CRF model
Fig. 2  Input process of the BERT model
Fig. 3  Fine-tuning process of the BERT model
Fig. 4  Sentence-length distribution of each domain dataset
Entity type | Brief definition | Examples | Target-domain annotations | Source-domain annotations
Body part | Organs, body parts, and tissues | head, neck | 1,359 | 6,876
Cellular entity | Anatomical entities at the cell or molecular level | hemoglobin, giant cell | 130 | 398
Diagnostic procedure | Tests and biopsy procedures used for diagnosis | biopsy, CT, B-ultrasound, iron level | 156 | 1,102
Drug | Substances used for therapeutic purposes | cinobufacini capsule, morphine | 259 | 1,805
Measurement | A core attribute of a named entity, e.g. drug dosage | 10 mg, 2% | 78 | 257
Individual | Persons (gender, age, etc.) and population groups | father, female, 16 years old | 1,188 | 2,506
Problem | Diseases, symptoms, abnormalities, and complications | pain, rupture, lung cancer, tumor | 4,975 | 25,427
Treatment procedure | Procedures, medicines, or devices used for treatment, including implants and unspecified preventive surgical interventions | nephroscopic resection, implantation, chemotherapy | 1,003 | 4,169
Cancer stage | The extent of cancer growth and spread | early stage, early phase, advanced stage | 1,142 | 4,304
Table 1  Named entity catalog
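Entity categories like those in Table 1 are typically converted to character-level sequence labels before being fed to the BiLSTM-CRF layer. A minimal sketch, assuming the common BIO scheme (the paper's exact tagging scheme is not shown on this page):

```python
# Character-level BIO labeling for a toy liver-cancer question:
# "肝癌晚期吃吗啡" -> Problem("肝癌"), Cancer stage("晚期"), Drug("吗啡").
sentence = ["肝", "癌", "晚", "期", "吃", "吗", "啡"]
labels   = ["B-Problem", "I-Problem", "B-Stage", "I-Stage", "O", "B-Drug", "I-Drug"]

def extract_entities(chars, tags):
    """Collect (entity_text, entity_type) spans from a BIO-labeled sequence."""
    entities, buf, etype = [], [], None
    for ch, tag in zip(chars, tags):
        if tag.startswith("B-"):
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [ch], tag[2:]
        elif tag.startswith("I-") and buf and tag[2:] == etype:
            buf.append(ch)
        else:
            if buf:
                entities.append(("".join(buf), etype))
            buf, etype = [], None
    if buf:
        entities.append(("".join(buf), etype))
    return entities
```

Decoding the labels above yields the three entity spans, which is exactly the inverse of the annotation step that produced the counts in Table 1.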
Name | Disease | Size (sentences) | Annotation
Source-domain dataset | lung cancer | 11,822 | annotated
Target-domain dataset | liver cancer | 2,000 | annotated
Table 2  Composition of the datasets
Layer | Parameter | Value
Doc2Vec | Algorithm | DM
Doc2Vec | Window size | 5
Doc2Vec | Minimum word frequency | 5
Doc2Vec | Learning rate | decayed from 0.025 to 0.001
Doc2Vec | Vector dimension | 100
BERT | Batch size | 32
BERT | Learning rate | 2e-5
BERT | Maximum sequence length | 128
BERT | Epochs | 10
BERT | Optimizer | Adam
BiLSTM | L2 regularization | 0.001
BiLSTM | Epochs | 10
BiLSTM | Dropout | 0.5
Word2Vec | Algorithm | Skip-gram
Word2Vec | Window size | 5
Word2Vec | Learning rate | decayed from 0.025 to 0.001
Word2Vec | Minimum word frequency | 3
Word2Vec | Vector dimension | 100
Table 3  Model parameter settings
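The settings in Table 3 can be gathered into a single configuration object; the following is a sketch with illustrative key names (not from the paper's code), plus the linear learning-rate decay that Word2Vec/Doc2Vec-style training commonly uses for the 0.025 → 0.001 schedule:

```python
# Hyperparameters from Table 3, collected into one dict (illustrative names).
CONFIG = {
    "doc2vec":  {"algorithm": "DM", "window": 5, "min_count": 5,
                 "lr_start": 0.025, "lr_end": 0.001, "dim": 100},
    "bert":     {"batch_size": 32, "lr": 2e-5, "max_seq_len": 128,
                 "epochs": 10, "optimizer": "Adam"},
    "bilstm":   {"l2": 0.001, "epochs": 10, "dropout": 0.5},
    "word2vec": {"algorithm": "Skip-gram", "window": 5,
                 "lr_start": 0.025, "lr_end": 0.001,
                 "min_count": 3, "dim": 100},
}

def linear_lr(step, total_steps, lr_start=0.025, lr_end=0.001):
    """Linearly decay the embedding learning rate over training,
    matching the 0.025 -> 0.001 schedule in Table 3."""
    frac = step / max(total_steps, 1)
    return lr_start + (lr_end - lr_start) * frac
```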
Model | P (%) | R (%) | F (%)
Word2Vec-BiLSTM-CRF | 85.98 | 86.55 | 86.26
BERT-BiLSTM-CRF | 92.91 | 95.36 | 94.12
Table 4  Comparison of model-transfer results
Fig. 5  Effect of training-set size on transfer performance
Model | Metric | k=0 | k=1 | k=2 | k=3 | k=4 | k=5 | k=6
KNN-BERT-BiLSTM-CRF | P | 92.91 | 93.54 | 94.89 | 95.47 | 95.40 | 94.73 | 94.60
KNN-BERT-BiLSTM-CRF | R | 95.36 | 95.74 | 96.51 | 96.75 | 96.24 | 96.30 | 95.68
KNN-BERT-BiLSTM-CRF | F | 94.12 | 94.63 | 95.69 | 96.10 | 95.82 | 95.51 | 95.14
KNN-Word2Vec-BiLSTM-CRF | P | 85.98 | 88.73 | 90.45 | 91.48 | 91.65 | 91.03 | 90.77
KNN-Word2Vec-BiLSTM-CRF | R | 86.55 | 89.57 | 91.30 | 92.48 | 92.62 | 92.05 | 91.90
KNN-Word2Vec-BiLSTM-CRF | F | 86.26 | 89.15 | 90.87 | 91.98 | 92.13 | 91.54 | 91.33
Table 5  Comparison of instance-transfer results (%)
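Each F value in Tables 4-6 is the harmonic mean of the corresponding precision and recall. A quick sanity check in Python, spot-checking the k=2 column of Table 5:

```python
def f1(precision, recall):
    """F value as the harmonic mean of precision and recall (both in %)."""
    return round(2 * precision * recall / (precision + recall), 2)

# k=2 column of Table 5:
bert_f = f1(94.89, 96.51)      # KNN-BERT-BiLSTM-CRF     -> 95.69
w2v_f = f1(90.45, 91.30)       # KNN-Word2Vec-BiLSTM-CRF -> 90.87
```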
Fig. 6  Recognition results of the KNN-BERT-BiLSTM-CRF model
Fig. 7  Recognition results of the KNN-Word2Vec-BiLSTM-CRF model
Model | P (%) | R (%) | F (%)
Word2Vec-BiLSTM-CRF | 85.98 | 86.55 | 86.26
KNN-Word2Vec-BiLSTM-CRF (k=4) | 91.65 | 92.62 | 92.13
BERT-BiLSTM-CRF | 92.91 | 95.36 | 94.12
KNN-BERT-BiLSTM-CRF (k=3) | 95.47 | 96.75 | 96.10
Table 6  Overall comparison of results