Data Analysis and Knowledge Discovery, 2023, Vol. 7, Issue (2): 38-47     https://doi.org/10.11925/infotech.2096-3467.2022.0919
Recognizing Intensity of Medical Query Intentions Based on Task Knowledge Fusion and Text Data Enhancement
Zhao Yiming 1,2,3, Pan Pei 2,3,4, Mao Jin 1,2
1Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China
2School of Information Management, Wuhan University, Wuhan 430072, China
3Big Data Institute, Wuhan University, Wuhan 430072, China
4National Demonstration Center for Experimental Library and Information Science Education, Wuhan University, Wuhan 430072, China
Abstract

[Objective] This paper proposes a method for recognizing the intensity of medical query intentions based on task knowledge fusion and text data enhancement, aiming to improve the vector representation of query text and to compensate for the small labeled dataset. [Methods] First, we used the SimBERT model to perform text data enhancement on the small task dataset. Then, we incrementally pre-trained the BERT model on a medical query text corpus to obtain the MQ-BERT (Medical-Query BERT) model fused with task knowledge. Finally, we added Bi-LSTM and other classification models on top of MQ-BERT and compared classification performance before and after text data enhancement. [Results] The F-Score of the new MQ-BERT model reached 92.22%, surpassing the best result of the MC-BERT model proposed by the Alibaba team on the same task dataset (F-Score=87.5%). With text data enhancement, classification performance improved further; the model combining MQ-BERT and Bi-LSTM achieved the best result (F-Score=95.34%), 7.84 percentage points higher than MC-BERT. [Limitations] The data selection method for the incremental pre-training process could be further optimized. [Conclusions] Task knowledge fusion and text data enhancement effectively improve the accuracy of recognizing the intensity of medical query intentions. For query intentions of different intensities, retrieval systems should present results in different ways, so as to improve query accuracy and better satisfy users' medical information needs.

Key words: Medical Information Query; Intention Intensity Recognition; Text Data Enhancement; Task Knowledge Fusion; BERT Model
Received: 2022-08-31      Published: 2023-03-28
CLC Number: TP393, G250
Funding: National Natural Science Foundation of China (71874130, 72274146); Humanities and Social Sciences Research Project of the Ministry of Education (18YJC870026)
Corresponding author: Zhao Yiming, ORCID: 0000-0001-8182-456X, E-mail: zhaoyiming@whu.edu.cn.
Cite this article:
Zhao Yiming, Pan Pei, Mao Jin. Recognizing Intensity of Medical Query Intentions Based on Task Knowledge Fusion and Text Data Enhancement. Data Analysis and Knowledge Discovery, 2023, 7(2): 38-47.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0919      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I2/38
Fig. 1  Technical roadmap
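The roadmap's middle stage, incrementally pre-training BERT on a medical-query corpus to obtain MQ-BERT, is described only in prose. As a rough illustration of what such domain-adaptive masked-language-model training can look like with the Hugging Face transformers library, a minimal sketch follows; the corpus path, base checkpoint, and hyperparameters are illustrative assumptions, not the authors' settings.

```python
# Hypothetical sketch of incremental (domain-adaptive) MLM pre-training:
# continuing BERT's masked-language-model objective on a corpus of medical
# queries to obtain an "MQ-BERT"-style checkpoint. Paths, checkpoint names
# and hyperparameters are illustrative only.
from transformers import (BertTokenizerFast, BertForMaskedLM,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

base = "bert-base-chinese"                      # assumed starting checkpoint
tokenizer = BertTokenizerFast.from_pretrained(base)
model = BertForMaskedLM.from_pretrained(base)

# One medical query per line in a plain-text file (hypothetical path).
corpus = load_dataset("text", data_files={"train": "medical_queries.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=64)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, as in standard BERT MLM pre-training.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="mq-bert", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=5e-5)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()

model.save_pretrained("mq-bert")    # later loaded as the encoder for fine-tuning
tokenizer.save_pretrained("mq-bert")
```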
Category    Count    Example query
Strong intent    1,252    眼袋按摩能消除吗 (Can under-eye bags be removed by massage?)
Weak intent    579    觉得有焦虑症 (I feel like I have an anxiety disorder)
No intent    59    唾沫喷到脸上 (Saliva was sprayed on my face)
Table 1  Examples from the cMedIC dataset
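Table 1 also shows how skewed the task data are (1,252 strong-intent queries vs. only 59 no-intent queries), so any train/test split needs to preserve the class ratios. A minimal sketch of such preparation with scikit-learn follows; the file name, tab-separated (label, query) layout, and label strings are hypothetical.

```python
# Hypothetical sketch: encode the three intent-intensity classes and make a
# stratified split so the rare "no intent" class appears in both partitions.
# The TSV file name and its two-column (label, query) layout are assumptions.
import csv
from sklearn.model_selection import train_test_split

LABELS = {"strong": 2, "weak": 1, "none": 0}    # intent-intensity classes

queries, labels = [], []
with open("cmedic_queries.tsv", encoding="utf-8") as f:
    for label, query in csv.reader(f, delimiter="\t"):
        queries.append(query)
        labels.append(LABELS[label])

train_q, test_q, train_y, test_y = train_test_split(
    queries, labels, test_size=0.2, random_state=42, stratify=labels)

print(f"train: {len(train_q)} queries, test: {len(test_q)} queries")
```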
Category    Original query    SimBERT-augmented query
Strong intent    正常的孩子一般什么时候说话 (When do children normally start to talk?)    孩子什么时候才能说话 (When will a child be able to talk?)
Weak intent    晚上睡不踏实老做梦 (I sleep poorly at night and keep dreaming)    睡觉不踏实总是做梦 (I sleep restlessly and am always dreaming)
No intent    艾滋病长效药降价免费 (Long-acting HIV drug price cut, free)    长效药降价了艾滋病 (The long-acting drug got cheaper, HIV)
Table 2  Examples of SimBERT-augmented data
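The augmented sentences in Table 2 keep the intent label of the original query. SimBERT itself ships with Su Jianlin's bert4keras toolkit and is not reproduced here; the sketch below only illustrates the surrounding label-preserving augmentation logic in a generic way, keeping candidate paraphrases from any generator whose sentence embedding stays close to the original query. The checkpoint name, similarity threshold, and the generate_candidates stub are assumptions.

```python
# Hypothetical sketch of label-preserving augmentation with similarity filtering.
# SimBERT generation itself (bert4keras) is not reproduced; any paraphrase
# generator can be plugged into `generate_candidates`. Checkpoint and threshold
# are illustrative assumptions.
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese").eval()

def embed(texts):
    """Mean-pooled token embeddings as crude sentence vectors."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=64,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)              # (B, H)

def generate_candidates(query, n=5):
    # Placeholder for a SimBERT-style paraphrase generator.
    raise NotImplementedError

def augment(query, label, threshold=0.85):
    """Return (paraphrase, label) pairs that stay close to the original query."""
    candidates = generate_candidates(query)
    vectors = embed([query] + candidates)
    sims = torch.nn.functional.cosine_similarity(vectors[:1], vectors[1:])
    return [(c, label) for c, s in zip(candidates, sims) if s.item() >= threshold]
```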
Fine-tuning / classification model    batch_size    learning rate    epochs    dropout
Finetune 64 1e-5 10 /
Bi-LSTM 64 1e-5 10 0.2
Bi-GRU 32 1e-5 15 0.2
Bi-LSTM+ATT 32 1e-5 15 0.1
Bi-GRU+ATT 64 1e-5 15 0.2
Table 3  Best parameters of each model before text data enhancement
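For orientation, the following is a minimal PyTorch sketch of the kind of architecture these settings configure: a BERT encoder followed by a bidirectional LSTM and a linear layer over the three intensity classes, using the Table 3 values for the Bi-LSTM row (batch_size=64, learning rate=1e-5, 10 epochs, dropout=0.2). It is not the authors' released implementation, and the mq-bert checkpoint path refers to the hypothetical output of the pre-training sketch above.

```python
# Minimal sketch (not the authors' code): MQ-BERT encoder + Bi-LSTM classifier
# over the three intent-intensity classes, configured with the Table 3 values
# for the Bi-LSTM row (batch_size=64, lr=1e-5, epochs=10, dropout=0.2).
import torch
import torch.nn as nn
from transformers import BertModel

class BertBiLSTMClassifier(nn.Module):
    def __init__(self, checkpoint="mq-bert", hidden=256, num_classes=3, dropout=0.2):
        super().__init__()
        self.bert = BertModel.from_pretrained(checkpoint)
        self.lstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.lstm(hidden)          # (B, T, 2*hidden)
        pooled = lstm_out[:, 0]                  # representation at the [CLS] position
        return self.fc(self.dropout(pooled))     # (B, num_classes)

model = BertBiLSTMClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # Table 3 learning rate
criterion = nn.CrossEntropyLoss()
# Training would then loop for 10 epochs over batches of 64 tokenized queries.
```

The Bi-GRU rows would swap nn.GRU for nn.LSTM, and the +ATT variants would replace the [CLS]-position pooling with an attention-weighted sum over lstm_out.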
Fine-tuning / classification model    batch_size    learning rate    epochs    dropout
Finetune 64 1e-5 20 /
Bi-LSTM 64 1e-5 10 0.2
Bi-GRU 64 1e-5 15 0.2
Bi-LSTM+ATT 32 2e-5 15 0.1
Bi-GRU+ATT 64 2e-5 10 0.1
Table 4  Best parameters of each model after text data enhancement
Pre-trained model    Fine-tuning / classification model    Text data enhancement    Precision    Recall    F-Score
MQ-BERT    Finetune    No    92.25%    92.19%    92.22%
MQ-BERT    Finetune    Yes    93.82%    93.75%    93.78%
MQ-BERT    Bi-LSTM    No    90.77%    93.75%    92.24%
MQ-BERT    Bi-LSTM    Yes    95.38%    95.31%    95.34%
MQ-BERT    Bi-GRU    No    89.23%    92.19%    90.69%
MQ-BERT    Bi-GRU    Yes    89.23%    92.19%    90.69%
MQ-BERT    Bi-LSTM+ATT    No    91.03%    93.75%    92.37%
MQ-BERT    Bi-LSTM+ATT    Yes    93.91%    93.75%    93.83%
MQ-BERT    Bi-GRU+ATT    No    90.58%    90.62%    90.60%
MQ-BERT    Bi-GRU+ATT    Yes    93.91%    93.75%    93.83%
Table 5  Performance of each model before and after text data enhancement
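The paper does not state how the per-class Precision/Recall/F-Score values in Table 5 are averaged; a weighted average is one plausible reading. A small sketch of scoring a prediction run with scikit-learn under that assumption:

```python
# Hypothetical sketch of scoring a run as in Table 5; whether the paper uses
# weighted or macro averaging is not stated, so "weighted" here is an assumption.
from sklearn.metrics import precision_recall_fscore_support

y_true = [2, 2, 1, 0, 1, 2]          # gold intent-intensity labels (toy example)
y_pred = [2, 1, 1, 0, 1, 2]          # model predictions (toy example)

precision, recall, f_score, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"Precision={precision:.2%}  Recall={recall:.2%}  F-Score={f_score:.2%}")
```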
References
[1] Huanqiu.com. Baidu Big Data: Health Search Demand Increased by 207%, AI Helps Accelerate the Implementation of Health Science Popularization[EB/OL]. [2022-05-18]. https://tech.huanqiu.com/article/44PFzGyVmgO.
[2] Zhang Lu, Peng Xueying, Chen Jing. Intentions of Health Information Seeking in Public Health Emergency[J]. Information Science, 2022, 40(10): 51-59.
[3] Broder A. A Taxonomy of Web Search[J]. ACM SIGIR Forum, 2002, 36(2): 3-10. DOI: 10.1145/792550.792552.
[4] Sushmita S, Piwowarski B, Lalmas M. Dynamics of Genre and Domain Intents[C]// Proceedings of the 6th Asia Information Retrieval Societies Conference on Information Retrieval Technology. Berlin: Springer, 2010: 399-409.
[5] Gui Sisi, Lu Wei, Zhang Xiaojuan. Temporal Intent Classification with Query Expression Feature[J]. Data Analysis and Knowledge Discovery, 2019, 3(3): 66-75.
[6] Zhang N Y, Jia Q H, Yin K P, et al. Conceptualized Representation Learning for Chinese Biomedical Text Mining[OL]. arXiv Preprint, arXiv: 2008.10813.
[7] Segev E, Ahituv N. Popular Searches in Google and Yahoo!: A “Digital Divide” in Information Uses?[J]. The Information Society, 2010, 26(1): 17-37. DOI: 10.1080/01972240903423477.
[8] Kanhabua N, Nørvåg K. Determining Time of Queries for Re-Ranking Search Results[C]// Proceedings of the 14th International Conference on Theory and Practice of Digital Libraries. Berlin: Springer, 2010: 261-272.
[9] Ross N C M, Wolfram D. End User Searching on the Internet: An Analysis of Term Pair Topics Submitted to the Excite Search Engine[J]. Journal of the American Society for Information Science, 2000, 51(10): 949-958.
[10] Lu Wei, Zhou Hongxia, Zhang Xiaojuan. Review of Research on Query Intent[J]. Journal of Library Science in China, 2013, 39(1): 100-111.
[11] Yang Z H, Gong J Y, Liu C Y, et al. iExplore: Accelerating Exploratory Data Analysis by Predicting User Intention[C]// Proceedings of the 2018 International Conference on Database Systems for Advanced Applications. Cham: Springer, 2018: 149-165.
[12] Chen T, Yin H Z, Chen H X, et al. AIR: Attentional Intention-Aware Recommender Systems[C]// Proceedings of the 35th International Conference on Data Engineering. IEEE, 2019: 304-315.
[13] Wang Ruixue, Fang Jing, Gui Sisi, et al. Deep Learning-Based Algorithm for Academic Query Intent Classification[J]. Library and Information Service, 2021, 65(3): 93-99. DOI: 10.13266/j.issn.0252-3116.2021.03.012.
[14] Figueroa A, Atkinson J. Ensembling Classifiers for Detecting User Intentions Behind Web Queries[J]. IEEE Internet Computing, 2016, 20(2): 8-16.
[15] He C G, Chen S B, Huang S L, et al. Using Convolutional Neural Network with BERT for Intent Determination[C]// Proceedings of the 2019 International Conference on Asian Language Processing. IEEE, 2019: 65-70.
[16] Qiu L R, Chen Y D, Jia H R, et al. Query Intent Recognition Based on Multi-Class Features[J]. IEEE Access, 2018, 6: 52195-52204. DOI: 10.1109/ACCESS.2018.2869585.
[17] Suresh S, Guru Rajan T S, Gopinath V. VoC-DL: Revisiting Voice of Customer Using Deep Learning[C]// Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 2018: 7843-7848.
[18] Zhang J H, Ye Y X, Zhang Y, et al. Multi-Point Semantic Representation for Intent Classification[C]// Proceedings of the 2020 AAAI Conference on Artificial Intelligence. 2020: 9531-9538.
[19] Cai R C, Zhu B J, Ji L, et al. An CNN-LSTM Attention Approach to Understanding User Query Intent from Online Health Communities[C]// Proceedings of the 2017 IEEE International Conference on Data Mining Workshops. IEEE, 2017: 430-437.
[20] Devlin J, Chang M W, Lee K, et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019: 4171-4186.
[21] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the 31st Annual Conference on Neural Information Processing Systems. 2017: 5999-6009.
[22] Sun C, Qiu X, Xu Y, et al. How to Fine-Tune BERT for Text Classification?[C]// Proceedings of the 2019 China National Conference on Chinese Computational Linguistics. Cham: Springer, 2019: 194-206.
[23] Peng Y F, Yan S K, Lu Z Y. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets[C]// Proceedings of the 18th SIGBioMed Workshop on Biomedical Natural Language Processing. 2019: 58-65.
[24] Huang K, Altosaar J, Ranganath R. ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission[OL]. arXiv Preprint, arXiv: 1904.05342.
[25] Beltagy I, Lo K, Cohan A. SciBERT: A Pretrained Language Model for Scientific Text[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 2019: 3615-3620.
[26] Lee J, Yoon W, Kim S, et al. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining[J]. Bioinformatics, 2019, 36(4): 1234-1240. DOI: 10.1093/bioinformatics/btz682.
[27] Gu Y, Tinn R, Cheng H, et al. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing[J]. ACM Transactions on Computing for Healthcare, 2022, 3(1): 1-23.
[28] He Y, Zhu Z, Zhang Y, et al. Infusing Disease Knowledge into BERT for Health Question Answering, Medical Inference and Disease Name Recognition[C]// Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. 2020: 4604-4614.
[29] Wang Dongbo, Liu Chang, Zhu Zihe, et al. SikuBERT and SikuRoBERTa: Construction and Application of Pre-Trained Models of Siku Quanshu in Orientation to Digital Humanities[J]. Library Tribune, 2022, 42(6): 31-43.
[30] Zhang Wei, Wang Hao, Chen Yuetong, et al. Identifying Metaphors and Association of Chinese Idioms with Transfer Learning and Text Augmentation[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 167-183.
[31] Wei J, Zou K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 2019: 6382-6388.
[32] Wu X, Lv S, Zang L, et al. Conditional BERT Contextual Augmentation[C]// Proceedings of the 19th International Conference on Computational Science. Cham: Springer, 2019: 84-95.
[33] Kobayashi S. Contextual Augmentation: Data Augmentation by Words with Paradigmatic Relations[C]// Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018: 452-457.
[34] Wang Y F, Xu C, Sun Q F, et al. PromDA: Prompt-Based Data Augmentation for Low-Resource NLU Tasks[C]// Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022: 4242-4255.
[35] Shi Guoliang, Chen Yuqi. A Comparative Study on the Integration of Text Enhancement and Pre-Trained Language Models in the Classification of Internet Political Messages[J]. Library and Information Service, 2021, 65(13): 96-107. DOI: 10.13266/j.issn.0252-3116.2021.13.010.
[36] Su Jianlin. Having the Best of Both Worlds: SimBERT, a Model Fusing Retrieval and Generation[EB/OL]. [2022-05-18]. https://spaces.ac.cn/archives/7427.
[37] Conneau A, Lample G. Cross-Lingual Language Model Pretraining[C]// Proceedings of the 33rd Annual Conference on Neural Information Processing Systems. 2019: 7057-7067.
[38] Joshi M, Chen D Q, Liu Y H, et al. SpanBERT: Improving Pre-Training by Representing and Predicting Spans[J]. Transactions of the Association for Computational Linguistics, 2020, 8: 64-77. DOI: 10.1162/tacl_a_00300.
[39] Liu Y, Ott M, Goyal N, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach[OL]. arXiv Preprint, arXiv: 1907.11692.
[40] Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF Models for Sequence Tagging[OL]. arXiv Preprint, arXiv: 1508.01991.
[41] Jiao Z, Sun S, Sun K. Chinese Lexical Analysis with Deep Bi-GRU-CRF Network[OL]. arXiv Preprint, arXiv: 1807.01882.
[42] Li X, Wang Y Y, Acero A. Learning Query Intent from Regularized Click Graphs[C]// Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 2008: 339-346.