Data Analysis and Knowledge Discovery, 2021, Vol. 5, Issue 1: 3-15     https://doi.org/10.11925/infotech.2096-3467.2020.0965
A Review on Main Optimization Methods of BERT
Liu Huan 1,2,3, Zhang Zhixiong 1,2,3,4 (corresponding author), Wang Yufei 1,2,3
1 National Science Library, Chinese Academy of Sciences, Beijing 100190, China
2 Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China
3 Hubei Key Laboratory of Big Data in Science and Technology, Wuhan 430071, China
4 Wuhan Library, Chinese Academy of Sciences, Wuhan 430071, China
Abstract

[Objective] This paper reviews the main optimization and improvement methods for the BERT language representation model released by Google, providing a reference for future research and development based on BERT. [Coverage] A total of 41 main papers and related models on the optimization of BERT, published since its release, are reviewed and analyzed. [Methods] Following the technical routes of model optimization, the improvements and their effects are described from four aspects: optimization of pre-training targets, fusion of external knowledge bases, improvement of the Transformer structure, and compression of pre-trained models. [Results] Optimization of pre-training targets and improvement of the Transformer structure attracted researchers' attention earliest and became the main routes for optimizing BERT; pre-trained model compression and the fusion of external knowledge bases subsequently emerged as new directions. [Limitations] Research on BERT is developing rapidly, so some related work may not be covered. [Conclusions] Researchers may focus on the optimization of pre-training targets and the improvement of the Transformer structure, and choose optimization routes according to their application scenarios.

Keywords: BERT; Pre-Training; Knowledge Integration; Model Compression
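As context for the first route named in the abstract, the sketch below illustrates the masked language modeling (MLM) pre-training objective that variants such as whole-word masking[4] and span masking[6] build on. It is a minimal, self-contained illustration only: the tiny encoder, vocabulary size, and mask id are assumptions made for the example, not code from any surveyed model.

```python
# Minimal sketch of BERT-style masked language modeling (MLM), assuming a toy
# Transformer encoder and vocabulary; not taken from any surveyed implementation.
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN, MASK_ID, PAD_ID = 1000, 64, 1, 0

class TinyEncoder(nn.Module):
    """Stand-in for a BERT-like Transformer encoder with a language-model head."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN, padding_idx=PAD_ID)
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, ids):
        return self.lm_head(self.encoder(self.embed(ids)))

def mlm_loss(model, ids, mask_prob=0.15):
    """Mask ~15% of tokens and predict the originals at the masked positions."""
    labels = ids.clone()
    masked = torch.rand(ids.shape) < mask_prob
    labels[~masked] = -100                       # only masked positions are scored
    corrupted = ids.masked_fill(masked, MASK_ID)
    logits = model(corrupted)
    return nn.functional.cross_entropy(
        logits.view(-1, VOCAB_SIZE), labels.view(-1), ignore_index=-100)

model = TinyEncoder()
batch = torch.randint(2, VOCAB_SIZE, (8, 32))    # fake token ids
print(mlm_loss(model, batch))
```

Whole-word masking and span masking change mainly the choice of which positions to corrupt, leaving the overall loss structure above intact.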
Received: 2020-09-29      Published: 2021-02-05
CLC Number: TP391
Fund: This work is supported by the Special Project on Literature and Information Capacity Building of the Chinese Academy of Sciences (Grant No. E0290906).
Corresponding author: Zhang Zhixiong     E-mail: zhangzhx@mail.las.ac.cn
Cite this article:
Liu Huan, Zhang Zhixiong, Wang Yufei. A Review on Main Optimization Methods of BERT[J]. Data Analysis and Knowledge Discovery, 2021, 5(1): 3-15.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0965      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2021/V5/I1/3
Fig. 1  Structure of the ELECTRA model[10]
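Fig. 1 shows ELECTRA's generator-discriminator setup[10]. The toy sketch below illustrates its replaced-token-detection objective; the single embedding layer and linear heads stand in for the real Transformer encoders and are assumptions made for the example, not the authors' code.

```python
# Toy sketch of ELECTRA-style replaced-token detection: a small generator fills
# masked positions, and the discriminator labels every token as original/replaced.
# Real ELECTRA uses full Transformer encoders and also trains the generator with
# an MLM loss; both are omitted here for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, MASK_ID = 1000, 64, 1

embed = nn.Embedding(VOCAB, HIDDEN)
generator_head = nn.Linear(HIDDEN, VOCAB)   # proposes tokens for masked positions
discriminator_head = nn.Linear(HIDDEN, 1)   # per-token original-vs-replaced score

def replaced_token_detection_loss(ids, mask_prob=0.15):
    masked = torch.rand(ids.shape) < mask_prob
    corrupted = ids.masked_fill(masked, MASK_ID)

    # Generator samples plausible replacements at the masked positions.
    gen_logits = generator_head(embed(corrupted))
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    replaced_ids = torch.where(masked, sampled, ids)

    # Discriminator predicts, for every token, whether it was replaced.
    is_replaced = (replaced_ids != ids).float()
    disc_logits = discriminator_head(embed(replaced_ids)).squeeze(-1)
    return F.binary_cross_entropy_with_logits(disc_logits, is_replaced)

ids = torch.randint(2, VOCAB, (4, 16))       # fake token ids
print(replaced_token_detection_loss(ids))
```

Because every input token, not only the masked ~15%, receives a training signal, the objective is more sample-efficient than plain MLM, which is the motivation given in the ELECTRA paper[10].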
Table 1  Summary of optimization and improvement routes of the BERT model
Route categories (original column groups): optimized pre-training targets (MLM, NSP); fusion of external knowledge (input-layer embedding, vector concatenation, training-objective fusion); improved Transformer structure; model compression (distillation, quantization, pruning, matrix factorization)
Model  Date
XLM[32]  2019.01
XNLG[38]  2019.02
Tang et al.[44]  2019.03
ERNIE (Baidu)[5]  2019.04
ERNIE (Tsinghua)[9]  2019.05
MASS[36]  2019.05
UniLM[33]  2019.05
BERT-Chinese-WWM[4]  2019.06
XLNet[12]  2019.06
RoBERTa[11]  2019.07
ERNIE-2.0-Baidu[15]  2019.07
SpanBERT[6]  2019.07
KT-NET[17]  2019.07
StructBERT[14]  2019.08
SenseBERT[30]  2019.08
BERT-PKD[48]  2019.08
Zhao et al.[49]  2019.09
TinyBERT[50]  2019.09
Q-BERT[52]  2019.09
Fan et al.[62]  2019.09
Guo et al.[59]  2019.09
LIBERT[29]  2019.09
ALBERT[13]  2019.09
K-BERT[16]  2019.09
SemBERT[20]  2019.09
KnowBERT[27]  2019.09
DistilBERT[42]  2019.10
Q8BERT[51]  2019.10
McCarley[64]  2019.10
BART[7]  2019.10
T5[37]  2019.10
KEPLER[26]  2019.11
Syntax-Infused BERT[22]  2019.11
SentiLR[31]  2019.11
Michel et al.[63]  2019.11
WKLM[25]  2019.12
K-Adapter[28]  2020.02
Compressing BERT[58]  2020.02
ELECTRA[10]  2020.03
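The distillation column of Table 1 covers approaches such as DistilBERT[46], BERT-PKD[48], and TinyBERT[50], which train a small student model to imitate a large BERT teacher. The sketch below shows only the generic soft-target distillation loss of Hinton et al.[42] on toy classifiers; it is an illustrative example under stated assumptions, not the implementation of any surveyed model (BERT-PKD and TinyBERT additionally match intermediate layers).

```python
# Generic knowledge distillation loss (Hinton et al.[42]) sketched on toy models;
# the teacher and student below are placeholders, not BERT architectures.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """alpha * softened-teacher KL term + (1 - alpha) * hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean") * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: distill a 2-layer "teacher" classifier into a 1-layer "student".
teacher = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 3))
student = nn.Linear(32, 3)
x, y = torch.randn(16, 32), torch.randint(0, 3, (16,))
with torch.no_grad():
    teacher_logits = teacher(x)
loss = distillation_loss(student(x), teacher_logits, y)
loss.backward()
print(loss.item())
```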
[1] Devlin J, Chang M W, Lee K , et al. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding[OL]. arXiv Preprint, arXiv: 1810. 04805.
[2] Vaswani A, Shazeer N, Parmar N , et al. Attention is All You Need[OL]. arXiv Preprint, arXiv: 1706. 03762.
[3] Mikolov T, Sutskever I, Chen K, et al. Distributed Representations of Words and Phrases and Their Compositionality[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013: 3111-3119.
[4] Cui Y M, Che W X, Liu T , et al. Pre-Training with Whole Word Masking for Chinese BERT[OL]. arXiv Preprint, arXiv: 1906. 08101.
[5] Sun Y, Wang S H, Li Y K , et al. ERNIE: Enhanced Representation Through Knowledge Integration[OL]. arXiv Preprint, arXiv: 1904. 09223.
[6] Joshi M, Chen D Q, Liu Y H , et al. SpanBERT: Improving Pre-Training by Representing and Predicting Spans[J]. Transactions of the Association for Computational Linguistics, 2020,8:64-77.
[7] Lewis M, Liu Y H, Goyal N , et al. BART: Denoising Sequence-to-Sequence Pre-Training for Natural Language Generation, Translation, and Comprehension [OL]. arXiv Preprint, arXiv: 1910. 13461.
[8] Vincent P, Larochelle H, Lajoie I , et al. Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion[J]. Journal of Machine Learning Research, 2010,11(12):3371-3408.
[9] Zhang Z Y, Han X, Liu Z Y, et al. ERNIE: Enhanced Language Representation with Informative Entities[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 1441-1451.
[10] Clark K, Luong M T, Le Q V , et al. ELECTRA: Pre-Training Text Encoders as Discriminators Rather than Generators[OL]. arXiv Preprint, arXiv: 2003. 10555.
[11] Liu Y H, Ott M, Goyal N , et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach[OL]. arXiv Preprint, arXiv: 1907. 11692.
[12] Yang Z L, Dai Z H, Yang Y M , et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding[OL]. arXiv Preprint, arXiv: 1906. 08237.
[13] Lan Z Z, Chen M D, Goodman S , et al. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations [OL]. arXiv Preprint, arXiv: 1909. 11942.
[14] Wang W, Bi B, Yan M , et al. StructBERT: Incorporating Language Structures into Pre-training for Deep Language Understanding [OL]. arXiv Preprint, arXiv: 1908. 04577.
[15] Sun Y, Wang S H, Li Y K , et al. ERNIE 2.0: A Continual Pre-Training Framework for Language Understanding[OL]. arXiv Preprint, arXiv: 1907. 12412.
[16] Liu W J, Zhou P, Zhao Z , et al. K-BERT: Enabling Language Representation with Knowledge Graph[OL]. arXiv Preprint, arXiv: 1909. 07606.
[17] Yang A, Wang Q, Liu J , et al. Enhancing Pre-Trained Language Representations with Rich Knowledge for Machine Reading Comprehension[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019. DOI: 10.18653/v1/P19-1226.
[18] Kilgarriff A . WordNet: An Electronic Lexical Database[J]. Language, 2000,76(3):706-708.
[19] Mitchell T, Kisiel B, Krishnamurthy J , et al. Never-Ending Learning[J]. Communications of the ACM, 2018,61(5):103-115.
[20] Zhang Z S, Wu Y W, Zhao H , et al. Semantics-Aware BERT for Language Understanding [OL]. arXiv Preprint, arXiv: 1909. 02209.
[21] LeCun Y, Bottou L, Bengio Y , et al. Gradient-based Learning Applied to Document Recognition[J]. Proceedings of the IEEE, 1998,86(11):2278-2324.
[22] Sundararaman D, Subramanian V, Wang G Y , et al. Syntax-Infused Transformer and BERT Models for Machine Translation and Natural Language Understanding [OL]. arXiv Preprint, arXiv: 1911. 06156.
[23] He L H, Lee K, Lewis M, et al. Deep Semantic Role Labeling: What Works and What’s Next[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017: 473-483.
[24] Bordes A, Usunier N, Garcia-Duran A, et al. Translating Embeddings for Modeling Multi-Relational Data[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013: 2787-2795.
[25] Xiong W H, Du J F, Wang W Y , et al. Pretrained Encyclopedia: Weakly Supervised Knowledge-Pretrained Language Model [OL]. arXiv Preprint, arXiv: 1912. 09637.
[26] Wang X Z, Gao T Y, Zhu Z C , et al. KEPLER: A Unified Model for Knowledge Embedding and Pre-Trained Language Representation [OL]. arXiv Preprint, arXiv: 1911. 06136.
[27] Peters M E, Neumann M, Logan R, et al. Knowledge Enhanced Contextual Word Representations[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 2019: 43-54.
[28] Wang R Z, Tang D Y, Duan N , et al. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters[OL]. arXiv Preprint, arXiv: 2002. 01808.
[29] Lauscher A, Vulić I, Ponti E M , et al. Informing Unsupervised Pretraining with External Linguistic Knowledge[OL]. arXiv Preprint, arXiv: 1909. 02339.
[30] Levine Y, Lenz B, Dagan O , et al. SenseBERT: Driving Some Sense into BERT[OL]. arXiv Preprint, arXiv: 1908. 05646.
[31] Ke P, Ji H Z, Liu S Y , et al. SentiLR: Linguistic Knowledge Enhanced Language Representation for Sentiment Analysis [OL]. arXiv Preprint, arXiv: 1911. 02493.
[32] Ruder S, Søgaard A, Vulic I. Unsupervised Cross-Lingual Representation Learning[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 31-38.
[33] Dong L, Yang N, Wang W H , et al. Unified Language Model Pre-Training for Natural Language Understanding and Generation [OL]. arXiv Preprint, arXiv: 1905. 03197.
[34] Radford A, Narasimhan K, Salimans T, et al. Improving Language Understanding by Generative Pre-Training[OL]. [2020-08-17]. http://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf.
[35] Radford A, Wu J, Child R , et al. Language Models are Unsupervised Multitask Learners[OL]. [2020-08-17]. https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf.
[36] Song K T, Tan X, Qin T, et al. MASS: Masked Sequence to Sequence Pre-Training for Language Generation[C]// Proceedings of the 36th International Conference on Machine Learning. 2019: 10384-10394.
[37] Raffel C, Shazeer N, Roberts A , et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer [OL]. arXiv Preprint, arXiv: 1910. 10683.
[38] Chronopoulou A, Baziotis C, Potamianos A. An Embarrassingly Simple Approach for Transfer Learning from Pretrained Language Models[C]// Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics. 2019: 2089-2095.
[39] Mnih A, Kavukcuoglu K. Learning Word Embeddings Efficiently with Noise-Contrastive Estimation[C]// Proceedings of the 26th International Conference on Neural Information Processing Systems. 2013: 2265-2273.
[40] Dai Z H, Yang Z L, Yang Y M, et al. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019: 2978-2988.
[41] Peters M E, Neumann M, Iyyer M, et al. Deep Contextualized Word Representations[C]// Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics. 2018: 2227-2237.
[42] Hinton G, Vinyals O, Dean J . Distilling the Knowledge in a Neural Network[OL]. arXiv Preprint, arXiv: 1503. 02531.
[43] Liu W Y, Wen Y D, Yu Z D , et al. Large-Margin Softmax Loss for Convolutional Neural Networks [OL]. arXiv Preprint, arXiv: 1612. 02295.
[44] Tang R, Lu Y, Liu L P , et al. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks [OL]. arXiv Preprint, arXiv: 1903. 12136.
[45] Irsoy O, Cardie C. Opinion Mining with Deep Recurrent Neural Networks[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2014: 720-728.
[46] Sanh V, Debut L, Chaumond J , et al. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter [OL]. arXiv Preprint, arXiv: 1910. 01108.
[47] Wang A, Singh A, Michael J, et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.[C]// Proceedings of the 7th International Conference on Learning Representations. 2019: 1-20.
[48] Sun S Q, Cheng Y, Gan Z, et al. Patient Knowledge Distillation for BERT Model Compression[C]// Proceeding of Conference on Empirical Methods in Natural Language Processing. 2019: 4322-4331.
[49] Zhao S, Gupta R, Song Y , et al. Extreme Language Model Compression with Optimal Subwords and Shared Projections[OL]. arXiv Preprint, arXiv: 1909. 11687.
[50] Jiao X Q, Yin Y C, Shang L F , et al. TinyBERT: Distilling BERT for Natural Language Understanding [OL]. arXiv Preprint, arXiv: 1909. 10351.
[51] Zafrir O, Boudoukh G, Izsak P, et al. Q8BERT: Quantized 8Bit BERT[C]// Proceedings of the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing. 2019: 1-5.
[52] Shen S, Zhen D, Ye J Y , et al. Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT [OL]. arXiv Preprint, arXiv: 1909. 05840.
[53] Socher R, Perelygin A, Wu J Y, et al. Recursive Deep Models for Semantic Compositionality over a Sentiment Treebank[C]// Proceedings of 2013 Conference on Empirical Methods in Natural Language Processing. 2013: 1631-1642.
[54] Bentivogli L, Clark P, Dagan I, et al. The Fifth PASCAL Recognizing Textual Entailment Challenge[C]// Proceedings of the 2009 Text Analysis Conference. 2009.
[55] Sang E F, De Meulder F . Introduction to the CoNLL-2003 Shared Task: Language-independent Named Entity Recognition[OL]. arXiv Preprint, arXiv: cs/0306050.
[56] Rajpurkar P, Zhang J, Lopyrev K, et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text[C]// Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016: 2383-2392.
[57] Rajpurkar P, Jia R, Liang P. Know What You don’t Know: Unanswerable Questions for SQuAD[C]// Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2018: 784-789.
[58] Gordon M A, Duh K, Andrews N . Compressing BERT: Studying the Effects of Weight Pruning on Transfer Learning[OL]. arXiv Preprint, arXiv: 2002. 08307.
[59] Guo F M, Liu S J, Mungall F S , et al. Reweighted Proximal Pruning for Large-Scale Language Representation [OL]. arXiv Preprint, arXiv: 1909. 12486.
[60] Candès E J, Wakin M B, Boyd S P . Enhancing Sparsity by Reweighted L1 Minimization[J]. Journal of Fourier Analysis and Applications, 2008,14(5-6):877-905.
[61] Parikh N, Boyd S . Proximal Algorithms[J]. Foundations and Trends in Optimization, 2014,1(3):127-239.
[62] Fan A, Grave E, Joulin A . Reducing Transformer Depth on Demand with Structured Dropout [OL]. arXiv Preprint, arXiv: 1909. 11556.
[63] Michel P, Levy O, Neubig G . Are Sixteen Heads Really Better than One? [OL]. arXiv Preprint, arXiv: 1905. 10650.
[64] McCarley J S . Pruning a BERT-based Question Answering Model [OL]. arXiv Preprint, arXiv: 1910. 06360.
[65] Liu X D, He P C, Chen W Z , et al. Multi-Task Deep Neural Networks for Natural Language Understanding[OL]. arXiv Preprint, arXiv: 1901. 11504.
[66] Kitaev N, Kaiser Ł, Levskaya A . Reformer: The Efficient Transformer [OL]. arXiv Preprint, arXiv: 2001. 04451.
[67] Weng R X, Wei H R, Huang S J , et al. GRET: Global Representation Enhanced Transformer[OL]. arXiv Preprint, arXiv: 2002. 10101.
[68] Beltagy I, Lo K, Cohan A. SciBERT: A Pretrained Language Model for Scientific Text[C]// Proceedings of 2019 Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing. 2020: 3615-3620.
[69] Lee J, Yoon W, Kim S , et al. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining[J]. Bioinformatics, 2020,36(4):1234-1240.
[70] Huang K X, Altosaar J, Ranganath R . ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission [OL]. arXiv Preprint, arXiv: 1904. 05342.
[71] Alsentzer E, Murphy J R, Boag W , et al. Publicly Available Clinical BERT Embeddings [OL]. arXiv Preprint, arXiv: 1904. 03323.
[72] Mulyar A, McInnes B T. MT-Clinical BERT: Scaling Clinical Information Extraction with Multitask Learning [OL]. arXiv Preprint, arXiv: 2004. 10220.
[73] Jin Q, Dhingra B, Liu Z P, et al. PubMedQA: A Dataset for Biomedical Research Question Answering[C]// Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 2019: 2567-2577.
[74] Yue X, Gutierrez B J, Sun H . Clinical Reading Comprehension: A Thorough Analysis of the emrQA Datase [OL]. arXiv Preprint, arXiv: 2005. 00574.
[75] Lee J S, Hsiang J . PatentBERT: Patent Classification with Fine-Tuning a Pre-Trained BERT Model [OL]. arXiv Preprint, arXiv: 1906. 02124.
[76] Nguyen D Q, Vu T, Nguyen A T . BERTweet: A pre-trained language model for English Tweets [OL]. arXiv Preprint, arXiv: 2005. 10200.