Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (8): 41-49    DOI: 10.11925/infotech.2096-3467.2019.1238
Classification of Chinese Medical Literature with BERT Model
Zhao Yang1,2,3, Zhang Zhixiong1,2,3,4, Liu Huan1,2,3, Ding Liangping1,2,3
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, School of Economics and Management,University of Chinese Academy of Sciences, Beijing 100190, China
3Hubei Key Laboratory of Big Data in Science and Technology, Wuhan 430071, China
4Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China
Abstract  

[Objective] This paper explores the classification of Chinese medical literature with the BERT-Base-Chinese model and a BERT model re-pretrained on Chinese medical text (BERT-Re-Pretraining-Med-Chi), and analyzes the differences between them. [Methods] We built a medical pre-training corpus from 340,000 abstracts of Chinese medical literature, constructed two training sets of 16,000 and 32,000 abstracts, and built a test set with another 3,200 abstracts. We then compared the performance of the two BERT models, using the SVM method as a benchmark. [Results] Both BERT models yielded better results than the SVM one, with average F1-scores about 5% higher than the SVM model's. The BERT-Re-Pretraining-Med-Chi model achieved F1-scores of 0.8390 and 0.8607 on the 16,000- and 32,000-sample training sets respectively, the best among the three models. [Limitations] This study only examined research papers from 16 medical and health categories of the Chinese Library Classification; the remaining four categories were excluded because of insufficient data. [Conclusions] The BERT-Re-Pretraining-Med-Chi model improves the performance of medical literature classification, and the BERT-based deep learning methods yield better results with larger training sets.
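The per-class scores in the tables below follow the standard definition F1 = 2PR/(P+R), and the reported averages are macro-averages over the 16 categories. A minimal sketch of that computation (the precision/recall pairs are three illustrative values taken from the SVM results below; this is not the paper's code):

```python
# Per-class F1 and macro-average, as used to evaluate the classifiers.
# (precision, recall) pairs for categories R1-R3 from the SVM results
# with 16,000 samples, used here purely for illustration.
per_class = {"R1": (0.62, 0.85), "R2": (0.80, 0.49), "R3": (0.83, 0.92)}

def f1(p, r):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

scores = {c: f1(p, r) for c, (p, r) in per_class.items()}
macro_f1 = sum(scores.values()) / len(scores)

for c, s in scores.items():
    print(f"{c}: F1 = {s:.2f}")
print(f"macro-F1 over these classes: {macro_f1:.4f}")
```

Applied to all 16 categories, the same macro-averaging yields the 平均值 (average) rows reported in each table.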

Key words: Deep Learning; BERT; Literature Classification; Pre-training Model
Received: 13 November 2019      Published: 25 May 2020
ZTFLH:  G202  
Corresponding Author: Zhang Zhixiong     E-mail: zhangzhx@mail.las.ac.cn

Cite this article:

Zhao Yang, Zhang Zhixiong, Liu Huan, Ding Liangping. Classification of Chinese Medical Literature with BERT Model. Data Analysis and Knowledge Discovery, 2020, 4(8): 41-49.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.1238     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I8/41

Data Set Sample (table content not preserved in this extract)
Category Precision Recall F1
R1 0.62 0.85 0.72
R2 0.80 0.49 0.61
R3 0.83 0.92 0.87
R4 0.92 0.90 0.91
R5 0.96 0.96 0.96
R6 0.91 0.89 0.90
R71 0.82 0.47 0.60
R72 0.65 0.80 0.71
R73 0.56 0.75 0.64
R74 0.91 0.59 0.72
R75 0.70 0.58 0.64
R76 0.72 0.72 0.72
R77 0.82 0.91 0.86
R78 0.78 0.81 0.80
R8 0.78 0.85 0.81
R9 0.78 0.85 0.82
Average 0.7850 0.7713 0.7681
Results of SVM with 16 000 Samples
Category Precision Recall F1
R1 0.68 0.83 0.75
R2 0.84 0.51 0.63
R3 0.86 0.93 0.89
R4 0.93 0.92 0.92
R5 0.96 0.96 0.96
R6 0.93 0.91 0.92
R71 0.84 0.57 0.68
R72 0.66 0.82 0.73
R73 0.58 0.76 0.66
R74 0.92 0.62 0.74
R75 0.74 0.62 0.68
R76 0.73 0.74 0.73
R77 0.84 0.93 0.88
R78 0.80 0.84 0.82
R8 0.77 0.85 0.81
R9 0.80 0.86 0.83
Average 0.8050 0.7919 0.7894
Results of SVM with 32 000 Samples
Category Precision Recall F1
R1 0.75 0.83 0.79
R2 0.88 0.67 0.76
R3 0.61 0.82 0.70
R4 0.93 0.63 0.75
R5 0.83 0.64 0.72
R6 0.77 0.76 0.76
R71 0.89 0.91 0.90
R72 0.88 0.91 0.89
R73 0.87 0.91 0.89
R74 0.83 0.89 0.86
R75 0.86 0.94 0.90
R76 0.91 0.94 0.92
R77 0.93 0.99 0.96
R78 0.94 0.96 0.95
R8 0.88 0.71 0.79
R9 0.75 0.85 0.79
Average 0.8433 0.8353 0.8337
Experimental Results of BERT-Base-Chinese Model with 16 000 Samples
Category Precision Recall F1
R1 0.76 0.84 0.80
R2 0.86 0.65 0.74
R3 0.66 0.79 0.72
R4 0.95 0.69 0.80
R5 0.81 0.70 0.75
R6 0.80 0.78 0.79
R71 0.88 0.96 0.92
R72 0.91 0.92 0.91
R73 0.89 0.89 0.89
R74 0.84 0.90 0.86
R75 0.87 0.96 0.91
R76 0.90 0.95 0.93
R77 0.95 1.00 0.97
R78 0.94 0.94 0.94
R8 0.89 0.79 0.83
R9 0.77 0.86 0.81
Average 0.8546 0.8503 0.8489
Experimental Results of BERT-Base-Chinese Model with 32 000 Samples
Category Precision Recall F1
R1 0.73 0.86 0.79
R2 0.91 0.62 0.74
R3 0.65 0.82 0.72
R4 0.94 0.66 0.78
R5 0.83 0.71 0.76
R6 0.79 0.80 0.79
R71 0.87 0.94 0.91
R72 0.91 0.86 0.88
R73 0.87 0.91 0.89
R74 0.81 0.86 0.83
R75 0.88 0.95 0.91
R76 0.87 0.95 0.91
R77 0.94 1.00 0.97
R78 0.95 0.95 0.95
R8 0.87 0.73 0.79
R9 0.76 0.89 0.82
Average 0.8487 0.8406 0.8390
Experimental Results of BERT-Re-Pretraining-Med-Chi Model with 16 000 Samples
Category Precision Recall F1
R1 0.78 0.87 0.82
R2 0.88 0.71 0.78
R3 0.67 0.82 0.74
R4 0.98 0.70 0.82
R5 0.83 0.71 0.77
R6 0.80 0.78 0.79
R71 0.90 0.95 0.92
R72 0.91 0.91 0.91
R73 0.91 0.93 0.92
R74 0.87 0.91 0.89
R75 0.87 0.97 0.92
R76 0.91 0.95 0.93
R77 0.96 1.00 0.98
R78 0.95 0.94 0.94
R8 0.88 0.78 0.82
R9 0.76 0.88 0.82
Average 0.8671 0.8616 0.8607
Experimental Results of BERT-Re-Pretraining-Med-Chi Model with 32 000 Samples
Sample size Metric SVM BERT-Base-Chinese BERT-Re-Pretraining-Med-Chi
16 000 Precision 0.7850 0.8433 0.8487
16 000 Recall 0.7713 0.8353 0.8406
16 000 F1 0.7681 0.8337 0.8390
32 000 Precision 0.8050 0.8546 0.8671
32 000 Recall 0.7919 0.8503 0.8616
32 000 F1 0.7894 0.8489 0.8607
Evaluation Value of Classification Results
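The summary table makes the headline comparison easy to check: each BERT model's gain over the SVM baseline is just a difference of macro scores. A quick sketch using the macro-F1 values reported above (an illustrative check, not the paper's code):

```python
# Macro-F1 values copied from the evaluation table above.
macro_f1 = {
    16000: {"SVM": 0.7681, "BERT-Base-Chinese": 0.8337,
            "BERT-Re-Pretraining-Med-Chi": 0.8390},
    32000: {"SVM": 0.7894, "BERT-Base-Chinese": 0.8489,
            "BERT-Re-Pretraining-Med-Chi": 0.8607},
}

# Absolute macro-F1 improvement of each BERT model over the SVM baseline.
gains = {
    n: {model: round(score - scores["SVM"], 4)
        for model, score in scores.items() if model != "SVM"}
    for n, scores in macro_f1.items()
}
print(gains)
```

The gains range from roughly 0.06 to 0.07 F1, and the re-pretrained model's advantage over BERT-Base-Chinese is larger at the 32,000-sample scale, consistent with the conclusion that domain re-pretraining and larger training sets both help.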
[1] Xu Chenfei, Ye Haiying, Bao Ping. Automatic Recognition of Produce Entities from Local Chronicles with Deep Learning[J]. Data Analysis and Knowledge Discovery, 2020, 4(8): 86-97.
[2] Yu Chuanming, Wang Manyi, Lin Hongjun, Zhu Xingyu, Huang Tingting, An Lu. A Comparative Study of Word Representation Models Based on Deep Learning[J]. Data Analysis and Knowledge Discovery, 2020, 4(8): 28-40.
[3] Wang Xinyun, Wang Hao, Deng Sanhong, Zhang Baolong. Classification of Academic Papers for Periodical Selection[J]. Data Analysis and Knowledge Discovery, 2020, 4(7): 96-109.
[4] Jiao Qihang, Le Xiaoqiu. Generating Sentences of Contrast Relationship[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 43-50.
[5] Wang Mo, Cui Yunpeng, Chen Li, Li Huan. A Deep Learning-based Method of Argumentative Zoning for Research Articles[J]. Data Analysis and Knowledge Discovery, 2020, 4(6): 60-68.
[6] Zhao Ping, Sun Lianying, Tu Shuai, Bian Jianling, Wan Ying. Identifying Scenic Spot Entities Based on Improved Knowledge Transfer[J]. Data Analysis and Knowledge Discovery, 2020, 4(5): 118-126.
[7] Deng Siyi, Le Xiaoqiu. Coreference Resolution Based on Dynamic Semantic Attention[J]. Data Analysis and Knowledge Discovery, 2020, 4(5): 46-53.
[8] Yu Chuanming, Yuan Sai, Zhu Xingyu, Lin Hongjun, Zhang Puliang, An Lu. Research on Deep Learning Based Topic Representation of Hot Events[J]. Data Analysis and Knowledge Discovery, 2020, 4(4): 1-14.
[9] Zhang Dongyu, Cui Zijuan, Li Yingxia, Zhang Wei, Lin Hongfei. Identifying Noun Metaphors with Transformer and BERT[J]. Data Analysis and Knowledge Discovery, 2020, 4(4): 100-108.
[10] Su Chuandong, Huang Xiaoxi, Wang Rongbo, Chen Zhiqun, Mao Junyu, Zhu Jiaying, Pan Yuhao. Identifying Chinese / English Metaphors with Word Embedding and Recurrent Neural Network[J]. Data Analysis and Knowledge Discovery, 2020, 4(4): 91-99.
[11] Liu Tong, Ni Weijian, Sun Yujian, Zeng Qingtian. Predicting Remaining Business Time with Deep Transfer Learning[J]. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 134-142.
[12] Yu Chuanming, Li Haonan, Wang Manyi, Huang Tingting, An Lu. Knowledge Representation Based on Deep Learning: Network Perspective[J]. Data Analysis and Knowledge Discovery, 2020, 4(1): 63-75.
[13] Zhang Mengji, Du Wanyu, Zheng Nan. Predicting Stock Trends Based on News Events[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 11-18.
[14] Pei Jingjing, Le Xiaoqiu. Identifying Coordinate Text Blocks in Discourses[J]. Data Analysis and Knowledge Discovery, 2019, 3(5): 51-56.
[15] Zhang Zhixiong, Liu Huan, Ding Liangping, Wu Pengmin, Yu Gaihong. Identifying Moves of Research Abstracts with Deep Learning Methods[J]. Data Analysis and Knowledge Discovery, 2019, 3(12): 1-9.