Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (8): 41-49    DOI: 10.11925/infotech.2096-3467.2019.1238
Classification of Chinese Medical Literature with BERT Model
Zhao Yang1,2,3, Zhang Zhixiong1,2,3,4, Liu Huan1,2,3, Ding Liangping1,2,3
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, School of Economics and Management,University of Chinese Academy of Sciences, Beijing 100190, China
3Hubei Key Laboratory of Big Data in Science and Technology, Wuhan 430071, China
4Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China
Abstract  

[Objective] This paper explores the classification of Chinese medical literature with the BERT-Base-Chinese model and a BERT model re-pretrained on Chinese medical text (BERT-Re-Pretraining-Med-Chi), aiming to analyze the differences between the two. [Methods] We built a medical pre-training corpus from 340,000 abstracts of Chinese medical literature. We then constructed two training sets, with 16,000 and 32,000 abstracts respectively, and a test set with another 3,200 abstracts. Finally, we compared the performance of the two models, using an SVM classifier as the benchmark. [Results] Both BERT models yielded better results than the SVM model, with average F1-scores about 5% higher. The BERT-Re-Pretraining-Med-Chi model performed best of the three, reaching F1-scores of 0.8390 with the 16,000-sample training set and 0.8607 with the 32,000-sample set. [Limitations] This study only examined research papers from 16 of the medicine and health categories in the Chinese Library Classification; the remaining four categories were excluded from the classification system due to insufficient data. [Conclusions] The BERT-Re-Pretraining-Med-Chi model improves the classification of medical literature, and the BERT-based deep learning methods yield better results with larger training sets.
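The tables below report per-class precision (查准率), recall (查全率), and F1-score, macro-averaged over the 16 categories. A minimal, self-contained sketch of how these per-class metrics are computed for a single-label task; the toy labels are illustrative, not drawn from the paper's corpus:

```python
from collections import Counter

def per_class_prf(true_labels, pred_labels):
    """Per-class precision, recall and F1 for a single-label task."""
    tp = Counter()                  # correct predictions per class
    pred_n = Counter(pred_labels)   # times each class was predicted
    true_n = Counter(true_labels)   # true members of each class
    for t, p in zip(true_labels, pred_labels):
        if t == p:
            tp[t] += 1
    metrics = {}
    for c in sorted(true_n):
        precision = tp[c] / pred_n[c] if pred_n[c] else 0.0
        recall = tp[c] / true_n[c]
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = (precision, recall, f1)
    return metrics

# Illustrative toy labels (category codes follow the R* scheme below).
true = ["R1", "R1", "R2", "R2", "R3"]
pred = ["R1", "R2", "R2", "R2", "R3"]
m = per_class_prf(true, pred)
# e.g. R2 is predicted 3 times with 2 correct: precision 2/3, recall 1.0
```

Macro-averaging then takes the unweighted mean of a column over all categories, which is how the 平均值 (average) rows in the tables are obtained.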

Key words: Deep Learning; BERT; Literature Classification; Pre-training Model
Received: 13 November 2019      Published: 25 May 2020
ZTFLH:  G202  
Corresponding Authors: Zhang Zhixiong     E-mail: zhangzhx@mail.las.ac.cn

Cite this article:

Zhao Yang, Zhang Zhixiong, Liu Huan, Ding Liangping. Classification of Chinese Medical Literature with BERT Model. Data Analysis and Knowledge Discovery, 2020, 4(8): 41-49.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.1238     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I8/41

Data Set Sample
Category Precision Recall F1-score
R1 0.62 0.85 0.72
R2 0.80 0.49 0.61
R3 0.83 0.92 0.87
R4 0.92 0.90 0.91
R5 0.96 0.96 0.96
R6 0.91 0.89 0.90
R71 0.82 0.47 0.60
R72 0.65 0.80 0.71
R73 0.56 0.75 0.64
R74 0.91 0.59 0.72
R75 0.70 0.58 0.64
R76 0.72 0.72 0.72
R77 0.82 0.91 0.86
R78 0.78 0.81 0.80
R8 0.78 0.85 0.81
R9 0.78 0.85 0.82
Average 0.7850 0.7713 0.7681
Results of SVM with 16 000 Samples
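The average row of each table appears to be the unweighted (macro) mean of the per-class scores. A quick check, using the F1 column of the SVM results with 16,000 samples above:

```python
# Per-class F1-scores for SVM with 16,000 training samples,
# copied from the table above (categories R1 through R9).
f1_scores = [0.72, 0.61, 0.87, 0.91, 0.96, 0.90, 0.60, 0.71,
             0.64, 0.72, 0.64, 0.72, 0.86, 0.80, 0.81, 0.82]

# Unweighted (macro) average over the 16 categories.
macro_f1 = sum(f1_scores) / len(f1_scores)
print(f"{macro_f1:.4f}")  # → 0.7681, matching the table's average row
```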
Category Precision Recall F1-score
R1 0.68 0.83 0.75
R2 0.84 0.51 0.63
R3 0.86 0.93 0.89
R4 0.93 0.92 0.92
R5 0.96 0.96 0.96
R6 0.93 0.91 0.92
R71 0.84 0.57 0.68
R72 0.66 0.82 0.73
R73 0.58 0.76 0.66
R74 0.92 0.62 0.74
R75 0.74 0.62 0.68
R76 0.73 0.74 0.73
R77 0.84 0.93 0.88
R78 0.80 0.84 0.82
R8 0.77 0.85 0.81
R9 0.80 0.86 0.83
Average 0.8050 0.7919 0.7894
Results of SVM with 32 000 Samples
Category Precision Recall F1-score
R1 0.75 0.83 0.79
R2 0.88 0.67 0.76
R3 0.61 0.82 0.70
R4 0.93 0.63 0.75
R5 0.83 0.64 0.72
R6 0.77 0.76 0.76
R71 0.89 0.91 0.90
R72 0.88 0.91 0.89
R73 0.87 0.91 0.89
R74 0.83 0.89 0.86
R75 0.86 0.94 0.90
R76 0.91 0.94 0.92
R77 0.93 0.99 0.96
R78 0.94 0.96 0.95
R8 0.88 0.71 0.79
R9 0.75 0.85 0.79
Average 0.8433 0.8353 0.8337
Experimental Results of BERT-Base-Chinese Model with 16 000 Samples
Category Precision Recall F1-score
R1 0.76 0.84 0.80
R2 0.86 0.65 0.74
R3 0.66 0.79 0.72
R4 0.95 0.69 0.80
R5 0.81 0.70 0.75
R6 0.80 0.78 0.79
R71 0.88 0.96 0.92
R72 0.91 0.92 0.91
R73 0.89 0.89 0.89
R74 0.84 0.90 0.86
R75 0.87 0.96 0.91
R76 0.90 0.95 0.93
R77 0.95 1.00 0.97
R78 0.94 0.94 0.94
R8 0.89 0.79 0.83
R9 0.77 0.86 0.81
Average 0.8546 0.8503 0.8489
Experimental Results of BERT-Base-Chinese Model with 32 000 Samples
Category Precision Recall F1-score
R1 0.73 0.86 0.79
R2 0.91 0.62 0.74
R3 0.65 0.82 0.72
R4 0.94 0.66 0.78
R5 0.83 0.71 0.76
R6 0.79 0.80 0.79
R71 0.87 0.94 0.91
R72 0.91 0.86 0.88
R73 0.87 0.91 0.89
R74 0.81 0.86 0.83
R75 0.88 0.95 0.91
R76 0.87 0.95 0.91
R77 0.94 1.00 0.97
R78 0.95 0.95 0.95
R8 0.87 0.73 0.79
R9 0.76 0.89 0.82
Average 0.8487 0.8406 0.8390
Experimental Results of BERT-Re-Pretraining-Med-Chi Model with 16 000 Samples
Category Precision Recall F1-score
R1 0.78 0.87 0.82
R2 0.88 0.71 0.78
R3 0.67 0.82 0.74
R4 0.98 0.70 0.82
R5 0.83 0.71 0.77
R6 0.80 0.88 0.79
R71 0.90 0.95 0.92
R72 0.91 0.91 0.91
R73 0.91 0.93 0.92
R74 0.87 0.91 0.89
R75 0.87 0.97 0.92
R76 0.91 0.95 0.93
R77 0.96 1.00 0.98
R78 0.95 0.94 0.94
R8 0.88 0.78 0.82
R9 0.76 0.88 0.82
Average 0.8671 0.8616 0.8607
Experimental Results of BERT-Re-Pretraining-Med-Chi Model with 32 000 Samples
Sample Size  Metric     SVM     BERT-Base-Chinese  BERT-Re-Pretraining-Med-Chi
16 000       Precision  0.7850  0.8433             0.8487
             Recall     0.7713  0.8353             0.8406
             F1-score   0.7681  0.8337             0.8390
32 000       Precision  0.8050  0.8546             0.8671
             Recall     0.7919  0.8503             0.8616
             F1-score   0.7894  0.8489             0.8607
Evaluation Value of Classification Results
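The summary above can also be read as absolute F1 gains over the SVM baseline. A small sketch computing those gains from the macro-averaged F1-scores reported in the table:

```python
# Macro-averaged F1-scores from the evaluation summary above,
# keyed by training-set size and model name.
f1 = {
    16000: {"SVM": 0.7681, "BERT-Base-Chinese": 0.8337,
            "BERT-Re-Pretraining-Med-Chi": 0.8390},
    32000: {"SVM": 0.7894, "BERT-Base-Chinese": 0.8489,
            "BERT-Re-Pretraining-Med-Chi": 0.8607},
}

# Absolute F1 improvement of each BERT model over the SVM baseline.
for n, scores in f1.items():
    baseline = scores["SVM"]
    for model, score in scores.items():
        if model != "SVM":
            print(f"{n} samples: {model} +{score - baseline:.4f} F1 over SVM")
```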
[1] Chen Jie,Ma Jing,Li Xiaofeng. Short-Text Classification Method with Text Features from Pre-trained Models[J]. 数据分析与知识发现, 2021, 5(9): 21-30.
[2] Zhou Zeyu,Wang Hao,Zhao Zibo,Li Yueyan,Zhang Xiaoqin. Construction and Application of GCN Model for Text Classification with Associated Information[J]. 数据分析与知识发现, 2021, 5(9): 31-41.
[3] Ma Jiangwei, Lv Xueqiang, You Xindong, Xiao Gang, Han Junmei. Extracting Relationship Among Military Domains with BERT and Relation Position Features[J]. 数据分析与知识发现, 2021, 5(8): 1-12.
[4] Li Wenna, Zhang Zhixiong. Entity Alignment Method for Different Knowledge Repositories with Joint Semantic Representation[J]. 数据分析与知识发现, 2021, 5(7): 1-9.
[5] Wang Hao, Lin Kerou, Meng Zhen, Li Xinlei. Identifying Multi-Type Entities in Legal Judgments with Text Representation and Feature Generation[J]. 数据分析与知识发现, 2021, 5(7): 10-25.
[6] Yu Xuehan, He Lin, Xu Jian. Extracting Events from Ancient Books Based on RoBERTa-CRF[J]. 数据分析与知识发现, 2021, 5(7): 26-35.
[7] Zhao Danning,Mu Dongmei,Bai Sen. Automatically Extracting Structural Elements of Sci-Tech Literature Abstracts Based on Deep Learning[J]. 数据分析与知识发现, 2021, 5(7): 70-80.
[8] Xu Yuemei, Wang Zihou, Wu Zixin. Predicting Stock Trends with CNN-BiLSTM Based Multi-Feature Integration Model[J]. 数据分析与知识发现, 2021, 5(7): 126-138.
[9] Liu Wenbin, He Yanqing, Wu Zhenfeng, Dong Cheng. Sentence Alignment Method Based on BERT and Multi-similarity Fusion[J]. 数据分析与知识发现, 2021, 5(7): 48-58.
[10] Lu Quan, He Chao, Chen Jing, Tian Min, Liu Ting. A Multi-Label Classification Model with Two-Stage Transfer Learning[J]. 数据分析与知识发现, 2021, 5(7): 91-100.
[11] Zhong Jiawa,Liu Wei,Wang Sili,Yang Heng. Review of Methods and Applications of Text Sentiment Analysis[J]. 数据分析与知识发现, 2021, 5(6): 1-13.
[12] Huang Mingxuan,Jiang Caoqing,Lu Shoudong. Expanding Queries Based on Word Embedding and Expansion Terms[J]. 数据分析与知识发现, 2021, 5(6): 115-125.
[13] Yin Pengbo,Pan Weimin,Zhang Haijun,Chen Degang. Identifying Clickbait with BERT-BiGA Model[J]. 数据分析与知识发现, 2021, 5(6): 126-134.
[14] Song Ruoxuan,Qian Li,Du Yu. Identifying Academic Creative Concept Topics Based on Future Work of Scientific Papers[J]. 数据分析与知识发现, 2021, 5(5): 10-20.
[15] Zhang Guobiao,Li Jie. Detecting Social Media Fake News with Semantic Consistency Between Multi-model Contents[J]. 数据分析与知识发现, 2021, 5(5): 21-29.