Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (11): 13-24     https://doi.org/10.11925/infotech.2096-3467.2022.0145
Research Paper
Biomedical Text Classification Method Based on Hypergraph Attention Network*
Bai Simeng1, Niu Zhendong1, He Hui2, Shi Kaize1,3, Yi Kun1, Ma Yuanchi1
1School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081, China
2School of Medical Technology, Beijing Institute of Technology, Beijing 100081, China
3Australian Artificial Intelligence Institute, University of Technology Sydney, Sydney 2007, Australia
Abstract

[Objective] This paper integrates label semantics and uses a text-level hypergraph together with a cross-attention mechanism to capture the organizational structure and the semantic and syntactic information of literature texts, aiming to improve text classification in the biomedical field. [Methods] First, we use a fine-tuned BioBERT model to obtain vector features from biomedical texts. Then, we construct a text-level hypergraph to capture the word order, semantics, and syntax of the texts. Finally, the proposed cross-attention network fuses the features of the text-level hypergraph with the label semantics to perform text classification. [Results] Experiments on the PM-Sentence dataset show that the proposed model scores 2.34 percentage points higher in F1 than the best-performing baseline model. [Limitations] The constructed dataset needs to be expanded, and the applicability of the proposed model to other tasks in this field requires further study. [Conclusions] The proposed model improves the classification of biomedical texts and provides effective support for knowledge services such as knowledge retrieval and knowledge mining.

Key words: Text Classification; Text-Level Hypergraph; Cross Attention Mechanism; Biomedical Field; Label Information Fusion
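The fusion step described in the abstract can be illustrated with a minimal sketch of cross attention, in which each text node attends over the label embeddings. This is not the authors' network: the function names `softmax` and `cross_attention`, the toy vectors, the absence of learned projection matrices, and the residual sum used for fusion are all assumptions made for illustration. Only the five label names come from the PM-Sentence label set.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(node_feats, label_embs):
    """Let every text node (query) attend over all label embeddings (keys = values).

    Assumption: no learned projection matrices; the fused vector is the node
    feature plus its label context (a residual sum).
    """
    dim = len(label_embs[0])
    fused, weights = [], []
    for query in node_feats:
        # Scaled dot-product score between this node and each label.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in label_embs]
        attn = softmax(scores)
        # Attention-weighted average of the label embeddings.
        context = [sum(a * emb[i] for a, emb in zip(attn, label_embs))
                   for i in range(dim)]
        fused.append([q + c for q, c in zip(query, context)])
        weights.append(attn)
    return fused, weights

# Toy inputs (made up): 3 word nodes and 5 labels, all of dimension 4.
labels = ["Background", "Objective", "Methods", "Results", "Conclusions"]
label_embs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.0, 0.0],
]
node_feats = [
    [0.2, 0.1, 2.0, 0.3],
    [0.0, 0.0, 0.1, 1.5],
    [1.2, 0.9, 0.0, 0.0],
]

fused, weights = cross_attention(node_feats, label_embs)
for row in weights:
    # Print the label each node attends to most, with its attention weights.
    print(labels[row.index(max(row))], [round(w, 3) for w in row])
```

In a trained model the queries, keys, and values would normally pass through learned linear maps, and the node features would come from the hypergraph layers rather than from hand-written vectors.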
Received: 2022-02-23      Published: 2023-01-13
CLC number (ZTFLH):  TP391
Funding: * National Key Research and Development Program of China (2019YFB1406303)
Corresponding author: Niu Zhendong     E-mail: zniu@bit.edu.cn
Cite this article:
Bai Simeng, Niu Zhendong, He Hui, Shi Kaize, Yi Kun, Ma Yuanchi. Biomedical Text Classification Method Based on Hypergraph Attention Network*[J]. Data Analysis and Knowledge Discovery, 2022, 6(11): 13-24.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.0145      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I11/13
Fig.1  Biomedical hypergraph attention network with label information fusion
Fig.2  Dependency structure graph
| MeSH heading | Count | Number sampled |
|---|---|---|
| Anatomy | 6 466 | 14 |
| Organisms | 2 801 | 6 |
| Diseases | 8 421 | 18 |
| Chemicals and Drugs | 1 862 | 4 |
| Analytical, Diagnostic and Therapeutic Techniques and Equipment | 513 | 1 |
| Psychiatry and Psychology | 1 537 | 3 |
| Phenomena and Processes | 511 | 1 |
| Disciplines and Occupations | 459 | 1 |
| Anthropology, Education, Sociology and Social Phenomena | 747 | 2 |
| Technology, Industry, Agriculture | 2 305 | 5 |
| Humanities | 1 743 | 4 |
| Information Science | 4 263 | 9 |
| Named Groups | 264 | 1 |
| Health Care | 3 483 | 8 |
| Publication Characteristics | 409 | 1 |
| Geographicals | 744 | 2 |
| Total | 36 528 | 80 |

Table 1  Statistical analysis of MeSH headings in PubMed scientific literature
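The source does not state how the per-heading sample sizes in Table 1 were chosen. As a hedged observation, simple proportional allocation (each heading's share of the 36 528 documents, scaled to 80 and rounded to the nearest integer) gives the same numbers as the sampled-count column. The sketch below shows that arithmetic; the function name `proportional_allocation` and the floor of one item per heading are assumptions, not the authors' documented procedure.

```python
def proportional_allocation(counts, total_samples):
    """Allocate samples to categories in proportion to their size.

    Assumption: each share is rounded to the nearest integer and every
    category receives at least one item. For the Table 1 counts the floor
    of one never changes a value.
    """
    population = sum(counts.values())
    return {
        name: max(1, round(n * total_samples / population))
        for name, n in counts.items()
    }

# Counts copied from Table 1.
mesh_counts = {
    "Anatomy": 6466,
    "Organisms": 2801,
    "Diseases": 8421,
    "Chemicals and Drugs": 1862,
    "Analytical, Diagnostic and Therapeutic Techniques and Equipment": 513,
    "Psychiatry and Psychology": 1537,
    "Phenomena and Processes": 511,
    "Disciplines and Occupations": 459,
    "Anthropology, Education, Sociology and Social Phenomena": 747,
    "Technology, Industry, Agriculture": 2305,
    "Humanities": 1743,
    "Information Science": 4263,
    "Named Groups": 264,
    "Health Care": 3483,
    "Publication Characteristics": 409,
    "Geographicals": 744,
}

allocation = proportional_allocation(mesh_counts, 80)
for name, k in allocation.items():
    print(f"{k:3d}  {name}")
print("total:", sum(allocation.values()))
```

Rounding each share independently does not guarantee that the parts sum to the target in general; it happens to do so for these counts.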
| Class label | Count |
|---|---|
| Background | 1 638 |
| Objective | 1 400 |
| Methods | 5 815 |
| Results | 6 095 |
| Conclusions | 2 670 |
| Total | 17 618 |

Table 2  Label statistics of the PM-Sentence dataset
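| Model | P/% | R/% | F1/% | Acc/% |
|---|---|---|---|---|
| SVM+Text Features | 59.97 | 61.58 | 60.76 | 59.86 |
| LSTM | 60.55 | 62.87 | 61.69 | 60.54 |
| CNN | 62.25 | 63.74 | 62.99 | 59.04 |
| HAN | 67.94 | 67.98 | 67.96 | 63.69 |
| TextGCN | 69.90 | 70.35 | 70.12 | 69.97 |
| Text-Level GNN | 70.12 | 71.05 | 70.58 | 70.51 |
| HyperGAT | 72.52 | 71.58 | 70.86 | 71.85 |
| LBGAT (this paper) | 73.62 | 73.04 | 73.20 | 73.04 |

Table 3  Experimental results of different models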
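The F1 margin reported in the abstract can be checked directly against Table 3. The short sketch below copies the F1 column and computes the gain of LBGAT over each baseline; the variable names are illustrative only.

```python
# F1 scores (%) copied from Table 3.
baseline_f1 = {
    "SVM+Text Features": 60.76,
    "LSTM": 61.69,
    "CNN": 62.99,
    "HAN": 67.96,
    "TextGCN": 70.12,
    "Text-Level GNN": 70.58,
    "HyperGAT": 70.86,
}
lbgat_f1 = 73.20

# Gain of LBGAT over each baseline, in percentage points.
gains = {model: round(lbgat_f1 - score, 2) for model, score in baseline_f1.items()}

# The strongest baseline is the one with the highest F1.
best_baseline = max(baseline_f1, key=baseline_f1.get)

for model, gain in gains.items():
    print(f"{model:18s} +{gain:.2f}")
print("strongest baseline:", best_baseline)
```

The 2.34-point figure corresponds to the gap between LBGAT and the strongest baseline, HyperGAT; the gaps to the weaker baselines are larger.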
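| Model | P/% | R/% | F1/% | Acc/% |
|---|---|---|---|---|
| w/o sequential | 66.51 | 65.98 | 66.24 | 66.93 |
| w/o semantic | 69.34 | 68.29 | 68.81 | 69.57 |
| w/o syntactic | 68.66 | 67.95 | 68.30 | 68.36 |
| w/o BDE | 65.21 | 64.57 | 64.89 | 65.57 |
| w/o TLCA | 63.24 | 61.29 | 62.25 | 62.77 |
| LBGAT (this paper) | 73.62 | 73.04 | 73.20 | 73.04 |

Table 4  Experimental results of model variants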
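To read the ablation results in Table 4 at a glance, one can rank the variants by how much F1 is lost when each component is removed. The sketch below does only that arithmetic; the variant names, including the abbreviations BDE and TLCA, are kept exactly as they appear in the table because this page does not expand them.

```python
# F1 scores (%) copied from Table 4.
full_model_f1 = 73.20
variant_f1 = {
    "w/o sequential": 66.24,
    "w/o semantic": 68.81,
    "w/o syntactic": 68.30,
    "w/o BDE": 64.89,
    "w/o TLCA": 62.25,
}

# F1 drop (percentage points) for each removed component, largest first.
drops = sorted(
    ((round(full_model_f1 - score, 2), name) for name, score in variant_f1.items()),
    reverse=True,
)

for drop, name in drops:
    print(f"{name:15s} -{drop:.2f}")
```

On these numbers, removing TLCA costs the most F1 and removing the semantic hyperedges costs the least, which suggests (without proving) that the components contribute unequally.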
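Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing  Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn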