A Review on Main Optimization Methods of BERT
Liu Huan1,2,3, Zhang Zhixiong1,2,3,4, Wang Yufei1,2,3
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 Hubei Key Laboratory of Big Data in Science and Technology, Wuhan 430071, China
4 Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China
Abstract [Objective] This paper analyzes and summarizes the main optimization methods for the BERT language representation model released by Google, providing a reference for future studies based on BERT. [Coverage] A total of 41 key papers and models related to the optimization of BERT are reviewed and analyzed. [Methods] The optimization routes are explained from four aspects: optimization of pre-training targets, fusion of external knowledge bases, evolution of the Transformer structure, and compression of pre-trained models. [Results] Optimization of pre-training targets and improvement of the Transformer structure attracted researchers' attention earliest and became the main routes for optimizing BERT; pre-trained model compression and the integration of external knowledge bases have since emerged as new research directions. [Limitations] Research on BERT is developing extremely rapidly, and some related work may not yet be covered. [Conclusions] Researchers can focus on optimizing pre-training targets and improving the Transformer structure, and should choose an optimization route according to the application scenario.
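To make the notion of a "pre-training target" concrete, the following minimal PyTorch sketch (our own illustration, not taken from any of the reviewed models, and simplified to omit BERT's 80%/10%/10% token-replacement rule) shows the masked-language-model loss that masking-based optimizations such as whole-word masking, ERNIE's entity masking, and SpanBERT's span masking redesign. The vocabulary size and [MASK] token id are assumed values.

import torch
import torch.nn.functional as F

VOCAB_SIZE = 30522      # assumed WordPiece vocabulary size
MASK_ID = 103           # assumed id of the [MASK] token
IGNORE_INDEX = -100     # label value excluded from the loss

def mask_tokens(input_ids, mask_prob=0.15):
    # Randomly choose ~15% of positions; keep their original ids as labels
    # and replace them with [MASK] in the encoder input.
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob
    mask[..., 0] = True                 # ensure at least one masked position in this toy example
    labels[~mask] = IGNORE_INDEX        # loss is computed only on masked positions
    masked_inputs = input_ids.clone()
    masked_inputs[mask] = MASK_ID
    return masked_inputs, labels

# Toy batch of token ids; a real model would feed masked_inputs to a
# Transformer encoder and predict logits over the vocabulary.
input_ids = torch.randint(1000, 2000, (2, 16))
masked_inputs, labels = mask_tokens(input_ids)
logits = torch.randn(2, 16, VOCAB_SIZE, requires_grad=True)  # stand-in for encoder output

# Cross-entropy over masked positions only: the pre-training target that the
# surveyed methods modify, e.g. by masking whole words or contiguous spans
# instead of individual WordPiece tokens.
loss = F.cross_entropy(logits.view(-1, VOCAB_SIZE), labels.view(-1),
                       ignore_index=IGNORE_INDEX)
loss.backward()

Other optimization routes covered in the review change this objective rather than the masking scheme, for example ELECTRA's replaced-token detection or the sentence-level objectives of ALBERT and StructBERT, but the pattern of computing a loss only on selected positions is the common starting point.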
Received: 29 September 2020
Published: 05 February 2021
Fund: This work is supported by the Project of Literature and Information Capacity Building, Chinese Academy of Sciences (Grant No. E0290906)
Corresponding Author: Zhang Zhixiong
E-mail: zhangzhx@mail.las.ac.cn