Data Analysis and Knowledge Discovery  2021, Vol. 5 Issue (1): 3-15    DOI: 10.11925/infotech.2096-3467.2020.0965
A Review on Main Optimization Methods of BERT
Liu Huan1,2,3, Zhang Zhixiong1,2,3,4, Wang Yufei1,2,3
1National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3Hubei Key Laboratory of Big Data in Science and Technology, Wuhan 430071, China
4Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China
Abstract  

[Objective] This paper analyzes and summarizes the main optimization methods for the BERT language representation model released by Google, providing a reference for future BERT-based studies. [Coverage] A total of 41 key publications and models related to BERT optimization were reviewed and analyzed. [Methods] The optimization routes are explained from four aspects: pre-training target optimization, external knowledge base fusion, Transformer structure improvement, and pre-training model compression. [Results] Pre-training target optimization and Transformer structure improvement attracted researchers' attention earliest and became the main routes for optimizing BERT; pre-training model compression and the integration of external knowledge bases have since become new research directions. [Limitations] Research on BERT is developing extremely rapidly, so some related work may not yet be covered. [Conclusions] Researchers can focus on pre-training target optimization and Transformer structure improvement, and choose among the optimization routes according to the application scenario.
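Among the pre-training-target optimizations the review covers, masked-language-model (MLM) variants such as whole-word masking[4] and RoBERTa's dynamic masking[11] change which tokens are hidden and when the mask is sampled. The following is a minimal illustrative sketch, not code from any surveyed paper; the function name and token list are invented for the example:

```python
import random

def dynamic_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly select positions for an MLM objective.

    Re-sampling the mask on every pass (dynamic masking, as RoBERTa
    does) instead of fixing it once at preprocessing time is one of
    the pre-training-target optimizations surveyed here.
    """
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)  # position the model must predict
            labels.append(tok)         # gold token kept as the label
        else:
            masked.append(tok)
            labels.append(None)        # position excluded from the loss
    return masked, labels

tokens = ["language", "models", "learn", "from", "context"]
masked, labels = dynamic_mask(tokens, mask_prob=0.3)
```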

Key words: BERT; Pre-Training; Knowledge Integration; Model Compression
Received: 29 September 2020      Published: 05 February 2021
ZTFLH:  TP391  
Fund: This work is supported by the Project of Literature and Information Capacity Building, Chinese Academy of Sciences (Grant No. E0290906).
Corresponding Authors: Zhang Zhixiong     E-mail: zhangzhx@mail.las.ac.cn

Cite this article:

Liu Huan,Zhang Zhixiong,Wang Yufei. A Review on Main Optimization Methods of BERT. Data Analysis and Knowledge Discovery, 2021, 5(1): 3-15.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2020.0965     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2021/V5/I1/3

Figure: ELECTRA Model Structure[10]

Figure: Knowledge Distillation Flow Chart[42]
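The distillation flow chart corresponds to Hinton et al.'s soft-target formulation[42]: the student is trained to match the teacher's temperature-softened output distribution. A minimal numpy sketch under that formulation (function names and logits are illustrative, not taken from any surveyed implementation):

```python
import numpy as np

def softmax(z, T=1.0):
    """Softmax with temperature T; higher T softens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy of the student against the teacher's softened
    targets; the T**2 factor keeps gradient magnitudes comparable
    across temperatures, as in Hinton et al.[42]."""
    p_teacher = softmax(teacher_logits, T)
    log_q_student = np.log(softmax(student_logits, T))
    return -(p_teacher * log_q_student).sum() * T * T

teacher = [4.0, 1.0, 0.2]   # logits from the large teacher model
student = [3.5, 1.2, 0.3]   # logits from the small student model
loss = distillation_loss(teacher, student)
```

A student whose logits exactly match the teacher's attains the minimum of this loss (the teacher's own entropy), which is why minimizing it pulls the student's distribution toward the teacher's.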
Model | Date | Pre-Training Target Optimization (Optimize MLM / Optimize NSP) | External Knowledge Fusion (Input-Layer Embedding / Vector Concatenation / Training-Target Fusion) | Transformer Structure Improvement | Model Compression (Distillation / Quantization / Pruning / Matrix Factorization)
XLM[32] 2019.01 -
XNLG[38] 2019.02
Tang等[44] 2019.03
ERNIE (Baidu)[5] 2019.04
ERNIE (Tsinghua)[9] 2019.05
MASS[36] 2019.05 -
UniLM[33] 2019.05
BERT-Chinese-WWM[4] 2019.06
XLNet[12] 2019.06 - -
RoBERTa[11] 2019.07 -
ERNIE-2.0-Baidu[15] 2019.07
SpanBERT[6] 2019.07 -
KT-NET[17] 2019.07
StructBERT[14] 2019.08
SenseBERT[30] 2019.08
BERT-PKD[48] 2019.08
Zhao等[49] 2019.09
TinyBERT[50] 2019.09
Q-BERT[52] 2019.09
Fan等[62] 2019.09
Guo等[59] 2019.09
LIBERT[29] 2019.09
ALBERT[13] 2019.09
K-BERT[16] 2019.09 -
SemBERT[20] 2019.09
KnowBERT[27] 2019.09
DistilBERT[42] 2019.10
Q8BERT[51] 2019.10
McCarley[64] 2019.10
BART[7] 2019.10
T5[37] 2019.10
KEPLER[26] 2019.11 -
Syntax-Infused BERT[22] 2019.11
SentiLR[31] 2019.11
Michel等[63] 2019.11
WKLM[25] 2019.12 -
K-Adapter[28] 2020.02 -
Compressing BERT[58] 2020.02
ELECTRA[10] 2020.03 -
BERT Model Improvement Routes
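Of the compression techniques in the table, unstructured pruning is the simplest to sketch: magnitude pruning, as studied for BERT transfer learning in [58], zeroes the smallest-magnitude weights. An illustrative numpy sketch (the weight matrix and function name are invented for the example):

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    """Unstructured magnitude pruning: zero out the `sparsity`
    fraction of weights with the smallest absolute value.
    (Ties at the threshold may prune slightly more than requested.)"""
    w = np.asarray(weights, dtype=float)
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    # The k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)

w = np.array([[0.9, -0.05], [0.02, -0.7]])
pruned = magnitude_prune(w, sparsity=0.5)  # zeroes the two smallest weights
```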
[1] Devlin J, Chang M W, Lee K , et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv: 1810. 04805.
[2] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[OL]. arXiv Preprint, arXiv: 1706.03762.
[3] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and Their Compositionality[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013: 3111-3119.
[4] Cui Y M, Che W X, Liu T , et al. Pre-Training with Whole Word Masking for Chinese BERT[OL]. arXiv Preprint, arXiv: 1906. 08101.
[5] Sun Y, Wang S H, Li Y K, et al. ERNIE: Enhanced Representation Through Knowledge Integration[OL]. arXiv Preprint, arXiv: 1904.09223.
[6] Joshi M, Chen D Q, Liu Y H, et al. SpanBERT: Improving Pre-Training by Representing and Predicting Spans[J]. Transactions of the Association for Computational Linguistics, 2020,8:64-77.
[7] Lewis M, Liu Y H, Goyal N , et al. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension [OL]. arXiv Preprint, arXiv: 1910. 13461.
[8] Vincent P, Larochelle H, Lajoie I , et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion[J]. Journal of Machine Learning Research, 2010,11(12):3371-3408.
[9] Zhang Z Y, Han X, Liu Z Y, et al. ERNIE: Enhanced Language Representation with Informative Entities[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 1441-1451.
[10] Clark K, Luong M T, Le Q V , et al. ELECTRA: Pre-Training Text Encoders as Discriminators Rather than Generators[OL]. arXiv Preprint, arXiv: 2003. 10555.
[11] Liu Y H, Ott M, Goyal N , et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach[OL]. arXiv Preprint, arXiv: 1907. 11692.
[12] Yang Z L, Dai Z H, Yang Y M , et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding[OL]. arXiv Preprint, arXiv: 1906. 08237.
[13] Lan Z Z, Chen M D, Goodman S , et al. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations [OL]. arXiv Preprint, arXiv: 1909. 11942.
[14] Wang W, Bi B, Yan M , et al. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding [OL]. arXiv Preprint, arXiv: 1908. 04577.
[15] Sun Y, Wang S H, Li Y K , et al. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding[OL]. arXiv Preprint, arXiv: 1907. 12412.
[16] Liu W J, Zhou P, Zhao Z , et al. K-BERT: Enabling Language Representation with Knowledge Graph[OL]. arXiv Preprint, arXiv: 1909. 07606.
[17] Yang A, Wang Q, Liu J, et al. Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. DOI: 10.18653/v1/P19-1226.
[18] Kilgarriff A . WordNet: An Electronic Lexical Database[J]. Language, 2000,76(3):706-708.
[19] Mitchell T, Kisiel B, Krishnamurthy J , et al. Never-Ending Learning[J]. Communications of the ACM, 2018,61(5):103-115.
[20] Zhang Z S, Wu Y W, Zhao H , et al. Semantics-Aware BERT for Language Understanding [OL]. arXiv Preprint, arXiv: 1909. 02209.
[21] LeCun Y, Bottou L, Bengio Y , et al. Gradient-based Learning Applied to Document Recognition[J]. Proceedings of the IEEE, 1998,86(11):2278-2324.
[22] Sundararaman D, Subramanian V, Wang G Y , et al. Syntax-Infused Transformer and BERT Models for Machine Translation and Natural Language Understanding [OL]. arXiv Preprint, arXiv: 1911. 06156.
[23] He L H, Lee K, Lewis M, et al. Deep Semantic Role Labeling: What Works and What’s Next[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017: 473-483.
[24] Bordes A, Usunier N, Garcia-Duran A, et al. Translating Embeddings for Modeling Multi-Relational Data[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013: 2787-2795.
[25] Xiong W H, Du J F, Wang W Y , et al. Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model [OL]. arXiv Preprint, arXiv: 1912. 09637.
[26] Wang X Z, Gao T Y, Zhu Z C , et al. KEPLER: A Unified Model for Knowledge Embedding and Pre-Trained Language Representation [OL]. arXiv Preprint, arXiv: 1911. 06136.
[27] Peters M E, Neumann M, Logan R, et al. Knowledge Enhanced Contextual Word Representations[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 2019: 43-54.
[28] Wang R Z, Tang D Y, Duan N , et al. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters[OL]. arXiv Preprint, arXiv: 2002. 01808.
[29] Lauscher A, Vulić I, Ponti E M , et al. Informing Unsupervised Pretraining with External Linguistic Knowledge[OL]. arXiv Preprint, arXiv: 1909. 02339.
[30] Levine Y, Lenz B, Dagan O , et al. SenseBERT: Driving Some Sense into BERT[OL]. arXiv Preprint, arXiv: 1908. 05646.
[31] Ke P, Ji H Z, Liu S Y , et al. SentiLR: Linguistic Knowledge Enhanced Language Representation for Sentiment Analysis [OL]. arXiv Preprint, arXiv: 1911. 02493.
[32] Ruder S, Søgaard A, Vulic I. Unsupervised Cross-Lingual Representation Learning[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 31-38.
[33] Dong L, Yang N, Wang W H , et al. Unified Language Model Pre-Training for Natural Language Understanding and Generation [OL]. arXiv Preprint, arXiv: 1905. 03197.
[34] Radford A, Narasimhan K, Salimans T, et al. Improving Language Understanding by Generative Pre-Training[OL]. [2020-08-17]. http://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
[35] Radford A, Wu J, Child R , et al. Language Models are Unsupervised Multitask Learners[OL]. [2020-08-17].https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf .
[36] Song K T, Tan X, Qin T, et al. MASS: Masked Sequence to Sequence Pre-Training for Language Generation[C]// Proceedings of the 36th International Conference on Machine Learning. 2019: 10384-10394.
[37] Raffel C, Shazeer N, Roberts A , et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [OL]. arXiv Preprint, arXiv: 1910. 10683.
[38] Chronopoulou A, Baziotis C, Potamianos A. An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models[C]// Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics. 2019: 2089-2095.
[39] Mnih A, Kavukcuoglu K. Learning Word Embeddings Efficiently with Noise-Contrastive Estimation[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013: 2265-2273.
[40] Dai Z H, Yang Z L, Yang Y M, et al. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 2978-2988.
[41] Peters M E, Neumann M, Iyyer M, et al. Deep Contextualized Word Representations[C]// Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics. 2018: 2227-2237.
[42] Hinton G, Vinyals O, Dean J . Distilling the Knowledge in a Neural Network[OL]. arXiv Preprint, arXiv: 1503. 02531.
[43] Liu W Y, Wen Y D, Yu Z D , et al. Large-Margin Softmax Loss for Convolutional Neural Networks [OL]. arXiv Preprint, arXiv: 1612. 02295.
[44] Tang R, Lu Y, Liu L P , et al. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks [OL]. arXiv Preprint, arXiv: 1903. 12136.
[45] Irsoy O, Cardie C. Opinion Mining with Deep Recurrent Neural Networks[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 720-728.
[46] Sanh V, Debut L, Chaumond J , et al. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter [OL]. arXiv Preprint, arXiv: 1910. 01108.
[47] Wang A, Singh A, Michael J, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.[C]// Proceedings of the 7th International Conference on Learning Representations. 2019: 1-20.
[48] Sun S Q, Cheng Y, Gan Z, et al. Patient Knowledge Distillation for BERT Model Compression[C]// Proceeding of Conference on Empirical Methods in Natural Language Processing. 2019: 4322-4331.
[49] Zhao S, Gupta R, Song Y , et al. Extreme Language Model Compression with Optimal Subwords and Shared Projections[OL]. arXiv Preprint, arXiv: 1909. 11687.
[50] Jiao X Q, Yin Y C, Shang L F , et al. TinyBERT: Distilling BERT for Natural Language Understanding [OL]. arXiv Preprint, arXiv: 1909. 10351.
[51] Zafrir O, Boudoukh G, Izsak P, et al. Q8BERT: Quantized 8Bit BERT[C]// Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing. 2019: 1-5.
[52] Shen S, Zhen D, Ye J Y , et al. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT [OL]. arXiv Preprint, arXiv: 1909. 05840.
[53] Socher R, Perelygin A, Wu J Y, et al. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank[C]// Proceedings of 2013 Conference on Empirical Methods in Natural Language Processing. 2013: 1631-1642.
[54] Bentivogli L, Clark P, Dagan I, et al. The Fifth PASCAL Recognizing Textual Entailment Challenge[C]// Proceedings of the 2009 Text Analysis Conference. 2009.
[55] Sang E F, De Meulder F . Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition[OL]. arXiv Preprint, arXiv: cs/0306050.
[56] Rajpurkar P, Zhang J, Lopyrev K, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text[C]// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016: 2383-2392.
[57] Rajpurkar P, Jia R, Liang P. Know What You don’t Know: Unanswerable Questions for SQuAD[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2018: 784-789.
[58] Gordon M A, Duh K, Andrews N . Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning[OL]. arXiv Preprint, arXiv: 2002. 08307.
[59] Guo F M, Liu S J, Mungall F S , et al. Reweighted Proximal Pruning for Large-Scale Language Representation [OL]. arXiv Preprint, arXiv: 1909. 12486.
[60] Candès E J, Wakin M B, Boyd S P . Enhancing Sparsity by Reweighted L1 Minimization[J]. Journal of Fourier Analysis and Applications, 2008,14(5-6):877-905.
[61] Parikh N, Boyd S . Proximal Algorithms[J]. Foundations and Trends in Optimization, 2014,1(3):127-239.
[62] Fan A, Grave E, Joulin A . Reducing Transformer Depth on Demand with Structured Dropout [OL]. arXiv Preprint, arXiv: 1909. 11556.
[63] Michel P, Levy O, Neubig G . Are Sixteen Heads Really Better than One? [OL]. arXiv Preprint, arXiv: 1905. 10650.
[64] McCarley J S . Pruning a BERT-based Question Answering Model [OL]. arXiv Preprint, arXiv: 1910. 06360.
[65] Liu X D, He P C, Chen W Z , et al. Multi-Task Deep Neural Networks for Natural Language Understanding[OL]. arXiv Preprint, arXiv: 1901. 11504.
[66] Kitaev N, Kaiser Ł, Levskaya A . Reformer: The Efficient Transformer [OL]. arXiv Preprint, arXiv: 2001. 04451.
[67] Weng R X, Wei H R, Huang S J , et al. GRET: Global Representation Enhanced Transformer[OL]. arXiv Preprint, arXiv: 2002. 10101.
[68] Beltagy I, Lo K, Cohan A. SciBERT: A Pretrained Language Model for Scientific Text[C]// Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing. 2020: 3615-3620.
[69] Lee J, Yoon W, Kim S , et al. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining[J]. Bioinformatics, 2020,36(4):1234-1240.
[70] Huang K X, Altosaar J, Ranganath R . ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission [OL]. arXiv Preprint, arXiv: 1904. 05342.
[71] Alsentzer E, Murphy J R, Boag W , et al. Publicly Available Clinical BERT Embeddings [OL]. arXiv Preprint, arXiv: 1904. 03323.
[72] Mulyar A, McInnes B T. MT-Clinical BERT: Scaling Clinical Information Extraction with Multitask Learning [OL]. arXiv Preprint, arXiv: 2004. 10220.
[73] Jin Q, Dhingra B, Liu Z P, et al. PubMedQA: A Dataset for Biomedical Research Question Answering[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 2019: 2567-2577.
[74] Yue X, Gutierrez B J, Sun H. Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset[OL]. arXiv Preprint, arXiv: 2005.00574.
[75] Lee J S, Hsiang J . PatentBERT: Patent Classification with Fine-Tuning a Pre-Trained BERT Model [OL]. arXiv Preprint, arXiv: 1906. 02124.
[76] Nguyen D Q, Vu T, Nguyen A T . BERTweet: A pre-trained language model for English Tweets [OL]. arXiv Preprint, arXiv: 2005. 10200.