Classification of Chinese Medical Literature with BERT Model
Zhao Yang1,2,3, Zhang Zhixiong1,2,3,4, Liu Huan1,2,3, Ding Liangping1,2,3
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 Hubei Key Laboratory of Big Data in Science and Technology, Wuhan 430071, China
4 Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China

Abstract [Objective] This paper compares the classification results for Chinese medical literature obtained with the BERT-Base-Chinese model and with a BERT model re-pretrained on Chinese medical text (BERT-Re-Pretraining-Med-Chi), and analyzes the differences between them. [Methods] We built a medical text pre-training corpus from 340,000 abstracts of Chinese medical literature. We then constructed two training sets, with 16,000 and 32,000 abstracts, and a test set with another 3,200 abstracts. Finally, we compared the performance of the two models, using the SVM method as a baseline. [Results] Both BERT models outperformed the SVM model, with average F1-scores about 5% higher. The F1-scores of the BERT-Re-Pretraining-Med-Chi model reach 0.8390 and 0.8607, the best among the three models. [Limitations] This study only examined research papers from 16 medical and health categories in the Chinese Library Classification; the remaining four categories were excluded from the classification system because they contained too little data. [Conclusions] The BERT-Re-Pretraining-Med-Chi model improves the performance of medical literature classification, and the BERT-based deep learning method yields better results with a larger training set.
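
The abstract reports average F1-scores across categories but does not say how the average is taken. As a minimal sketch, assuming an unweighted (macro) average over classes, the metric could be computed as follows. The function names and the toy labels are illustrative only and are not taken from the paper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the true class but was missed
    scores = {}
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (an assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Hypothetical category labels for five test abstracts.
    y_true = ["R1", "R1", "R2", "R2", "R5"]
    y_pred = ["R1", "R2", "R2", "R2", "R5"]
    print(per_class_f1(y_true, y_pred))
    print(macro_f1(y_true, y_pred))
```

Macro averaging gives every category equal weight regardless of size, which matters when category sizes are uneven; a micro or support-weighted average would be an equally plausible reading of "average F1-score".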
Received: 13 November 2019
Published: 25 May 2020
Corresponding Author: Zhang Zhixiong
E-mail: zhangzhx@mail.las.ac.cn