Data Analysis and Knowledge Discovery  2024, Vol. 8 Issue (1): 104-113    DOI: 10.11925/infotech.2096-3467.2022.1155
Knowledge Distillation with Few Labeled Samples
Liu Tong,Ren Xinru,Yin Jinhui,Ni Weijian()
College of Computer Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, China
Abstract  

[Objective] This paper uses knowledge distillation to improve the performance of a small-parameter model under the guidance of a high-performance large-parameter model when labeled samples are insufficient. It aims to address sample scarcity and reduce the cost of deploying high-performance large-parameter models in natural language processing. [Methods] First, we used noise purification to select valuable data from an unlabeled corpus and assigned pseudo labels to these samples, thereby enlarging the labeled training set. Then, we added a knowledge review mechanism and a teaching assistant model to the traditional distillation framework to achieve comprehensive knowledge transfer from the large-parameter model to the small-parameter model. [Results] We evaluated the proposed model on text classification and sentiment analysis tasks with the IMDB, AG_NEWS, and Yahoo!Answers datasets. With only 5% of the original data labeled, its accuracy was only 1.45%, 2.75%, and 7.28% lower, respectively, than that of the traditional distillation model trained on the full original data. [Limitations] We only examined the model on text classification and sentiment analysis tasks; other natural language processing tasks remain to be explored. [Conclusions] The proposed method achieves a better distillation effect and improves the performance of the small-parameter model.
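The [Methods] description above combines two steps: noise purification plus pseudo-labeling of an unlabeled corpus, followed by staged distillation through a teaching assistant with a knowledge review mechanism. The paper's page does not include code, so the following is a minimal PyTorch sketch of the pseudo-labeling step only, under the assumption that purification keeps samples on which the large model is confident; the softmax-confidence criterion and the `confidence_threshold` value are illustrative assumptions, not the authors' exact procedure.

```python
# Hypothetical sketch: keep only unlabeled samples whose teacher prediction is
# confident enough ("noise purification"), and attach the predicted class as a
# pseudo label so they can join the small labeled training set.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(teacher, unlabeled_loader, confidence_threshold=0.9, device="cpu"):
    """Return (inputs, pseudo_labels) for high-confidence unlabeled samples."""
    teacher.eval()
    kept_inputs, kept_labels = [], []
    for batch in unlabeled_loader:                  # batch: tensor of encoded texts
        batch = batch.to(device)
        probs = F.softmax(teacher(batch), dim=-1)   # assumes teacher returns logits
        confidence, prediction = probs.max(dim=-1)
        mask = confidence >= confidence_threshold   # confidence-based purification
        kept_inputs.append(batch[mask].cpu())
        kept_labels.append(prediction[mask].cpu())
    return torch.cat(kept_inputs), torch.cat(kept_labels)
```

The retained pairs would then be merged with the few genuinely labeled samples before distillation.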

Key words: Knowledge Distillation; Semi-Supervised Learning; Few Labeled Samples; Text Classification
Received: 04 November 2022      Published: 08 January 2024
ZTFLH: G250; TP393
Fund: Natural Science Foundation of Shandong Province (ZR2022MF319); Young Teachers Teaching Top Talent Training Project of Shandong University of Science and Technology (BJ20211110); Graduate Students' Teaching Case Library Construction Project of Shandong University of Science and Technology
Corresponding Author: Ni Weijian, ORCID: 0000-0002-7924-7350, E-mail: niweijian@sdust.edu.cn.

Cite this article:

Liu Tong, Ren Xinru, Yin Jinhui, Ni Weijian. Knowledge Distillation with Few Labeled Samples. Data Analysis and Knowledge Discovery, 2024, 8(1): 104-113.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2022.1155     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2024/V8/I1/104

Figure: The Architecture of HoliKD
Figure: Data Preprocessing
Parameter                   Value
Teacher model               BERT
Teaching assistant model    DistilBERT
Student model               DPCNN[14]
Word embeddings             GloVe 300d
BERT learning rate          2e-5
DPCNN learning rate         1e-3
Number of epochs            100
Maximum sentence length     128
Optimizer                   AdamW
Dropout                     0.5
Experimental Parameter Settings
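The settings above (BERT teacher, DistilBERT teaching assistant, DPCNN student, AdamW, the listed learning rates) imply a two-stage distillation chain: teacher to assistant, then assistant to student. A minimal sketch of a single soft-label distillation stage under those assumptions is given below; the temperature and loss weighting are illustrative choices, not values reported in the table, and the review-distillation term described in the paper is omitted for brevity.

```python
# Hypothetical sketch of one stage in the teacher -> assistant -> student chain
# implied by the settings above (soft-target KL term + hard-label cross-entropy).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of the softened KL term and the hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def train_stage(student, teacher, loader, lr, epochs, device="cpu"):
    """Run one distillation stage (teacher->assistant or assistant->student)."""
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)  # AdamW, as in the table
    teacher.eval()
    for _ in range(epochs):
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            with torch.no_grad():
                teacher_logits = teacher(inputs)                # assumes models return logits
            loss = distillation_loss(student(inputs), teacher_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

The same `train_stage` routine would be called twice: first with BERT as teacher and DistilBERT as student (lr=2e-5 for fine-tuning scale), then with the trained DistilBERT as teacher and DPCNN as student (lr=1e-3).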
Dataset          Total samples   Number of classes   Labeled samples
AG_NEWS          120,000         4                   6,000
Yahoo!Answers    1,400,000       10                  70,000
IMDB             50,000          2                   2,500
Information of Experimental Datasets
Data (K)   AG_NEWS   Yahoo!Answers   IMDB
30%        82.76%    69.81%          81.40%
40%        89.45%    76.20%          85.51%
50%        94.51%    80.79%          93.90%
60%        93.28%    79.32%          91.76%
70%        91.26%    77.91%          88.39%
Experimental Effects under Different K Values
Model                              Data                             AG_NEWS   Yahoo!Answers   IMDB
Teacher model (BERT)               Original data                    91.56%    83.15%          96.72%
Student model (DPCNN)              Original data                    74.35%    64.17%          85.81%
Distillation model (BERT+DPCNN)    Original data                    83.27%    74.61%          95.35%
MixText                            Few labeled + unlabeled data     67.13%    52.19%          84.30%
UDA                                Few labeled + unlabeled data     73.25%    55.11%          85.82%
HoliKD (ours)                      Few labeled + unlabeled data     80.52%    67.33%          93.90%
Performance of Different Models on Three Datasets
Removed component          1,000 samples   5,000 samples
Noise purification         74.32%          83.23%
Teaching assistant model   81.95%          87.75%
Review distillation        65.19%          73.22%
Full framework             86.43%          92.97%
Results of Ablation Experiment
[1] Liu Tong, Liu Chen, Ni Weijian. A Semi-Supervised Sentiment Analysis Method for Chinese Based on Multi-Level Data Augmentation[J]. Data Analysis and Knowledge Discovery, 2021, 5(5): 51-58.
[2] Tzelepi M, Passalis N, Tefas A. Online Subclass Knowledge Distillation[J]. Expert Systems with Applications, 2021, 181: 115132. DOI: 10.1016/j.eswa.2021.115132.
[3] Romero A, Ballas N, Kahou S E, et al. FitNets: Hints for Thin Deep Nets[C]// Proceedings of the 3rd International Conference on Learning Representations. 2015.
[4] Chen P G, Liu S, Zhao H S, et al. Distilling Knowledge via Knowledge Review[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2021: 5006-5015.
[5] Mirzadeh S I, Farajtabar M, Li A, et al. Improved Knowledge Distillation via Teacher Assistant[C]// Proceedings of the AAAI Conference on Artificial Intelligence. 2020: 5191-5198.
[6] Li D, Liu Y, Song L. Adaptive Weighted Losses with Distribution Approximation for Efficient Consistency-Based Semi-Supervised Learning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(11): 7832-7842. DOI: 10.1109/TCSVT.2022.3186041.
[7] Lee D H. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks[C]// Proceedings of the ICML 2013 Workshop on Challenges in Representation Learning. 2013.
[8] Chen J A, Yang Z C, Yang D Y. MixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification[C]// Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020: 2147-2157.
[9] Xie Q Z, Dai Z H, Hovy E, et al. Unsupervised Data Augmentation for Consistency Training[C]// Proceedings of the Annual Conference on Neural Information Processing Systems. 2020.
[10] Chen H T, Guo T Y, Xu C, et al. Learning Student Networks in the Wild[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2021: 6428-6437.
[11] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]// Proceedings of the Annual Conference on Neural Information Processing Systems. 2017: 5998-6008.
[12] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding[C]// Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019: 4171-4186.
[13] Sanh V, Debut L, Chaumond J, et al. DistilBERT, a Distilled Version of BERT: Smaller, Faster, Cheaper and Lighter[C]// Proceedings of the Annual Conference on Neural Information Processing Systems. 2019.
[14] Johnson R, Zhang T. Deep Pyramid Convolutional Neural Networks for Text Categorization[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 2017: 562-570.
[15] Fursov I, Zaytsev A, Burnyshev P, et al. A Differentiable Language Model Adversarial Attack on Text Classifiers[J]. IEEE Access, 2022, 10: 17966-17976. DOI: 10.1109/ACCESS.2022.3148413.
[16] Zhao X, Huang J X. BERT-QAnet: BERT-Encoded Hierarchical Question-Answer Cross-Attention Network for Duplicate Question Detection[J]. Neurocomputing, 2022, 509: 68-74. DOI: 10.1016/j.neucom.2022.08.044.
[17] Bataineh A A, Kaur D. Immunocomputing-Based Approach for Optimizing the Topologies of LSTM Networks[J]. IEEE Access, 2021, 9: 78993-79004. DOI: 10.1109/ACCESS.2021.3084131.
Related articles in this journal:
[1] Cheng Quan, Dong Jia. Hierarchical Multi-label Classification of Children's Literature for Graded Reading[J]. Data Analysis and Knowledge Discovery, 2023, 7(7): 156-169.
[2] Xu Guixian, Zhang Zixin, Yu Shaona, Dong Yushuang, Tian Yuan. Tibetan News Text Classification Based on Graph Convolutional Networks[J]. Data Analysis and Knowledge Discovery, 2023, 7(6): 73-85.
[3] Ye Guanghui, Li Songye, Song Xiaoying. Text Classification Method for Urban Portrait Based on Multi-Label Annotation Learning[J]. Data Analysis and Knowledge Discovery, 2023, 7(5): 60-70.
[4] Gao Haoxin, Sun Lijuan, Wu Jingchen, Gao Yutong, Wu Xu. Online Sensitive Text Classification Model Based on Heterogeneous Graph Convolutional Network[J]. Data Analysis and Knowledge Discovery, 2023, 7(11): 26-36.
[5] Wang Weijun, Ning Zhiyuan, Du Yi, Zhou Yuanchun. Identifying Interdisciplinary Sci-Tech Literature Based on Multi-Label Classification[J]. Data Analysis and Knowledge Discovery, 2023, 7(1): 102-112.
[6] Wang Jinzheng, Yang Ying, Yu Bengong. Classifying Customer Complaints Based on Multi-head Co-attention Mechanism[J]. Data Analysis and Knowledge Discovery, 2023, 7(1): 128-137.
[7] Ye Han, Sun Haichun, Li Xin, Jiao Kainan. Classification Model for Long Texts with Attention Mechanism and Sentence Vector Compression[J]. Data Analysis and Knowledge Discovery, 2022, 6(6): 84-94.
[8] Tu Zhenchao, Ma Jing. Item Categorization Algorithm Based on Improved Text Representation[J]. Data Analysis and Knowledge Discovery, 2022, 6(5): 34-43.
[9] Chen Guo, Ye Chao. News Classification with Semi-Supervised and Active Learning[J]. Data Analysis and Knowledge Discovery, 2022, 6(4): 28-38.
[10] Xiao Yuejun, Li Honglian, Zhang Le, Lv Xueqiang, You Xindong. Classifying Chinese Patent Texts with Feature Fusion[J]. Data Analysis and Knowledge Discovery, 2022, 6(4): 49-59.
[11] Yang Lin, Huang Xiaoshuo, Wang Jiayang, Ding Lingling, Li Zixiao, Li Jiao. Identifying Subtypes of Clinical Trial Diseases with BERT-TextCNN[J]. Data Analysis and Knowledge Discovery, 2022, 6(4): 69-81.
[12] Xu Yuemei, Fan Zuwei, Cao Han. A Multi-Task Text Classification Model Based on Label Embedding of Attention Mechanism[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 105-116.
[13] Bai Simeng, Niu Zhendong, He Hui, Shi Kaize, Yi Kun, Ma Yuanchi. Biomedical Text Classification Method Based on Hypergraph Attention Network[J]. Data Analysis and Knowledge Discovery, 2022, 6(11): 13-24.
[14] Huang Xuejian, Liu Yuyang, Ma Tinghuai. Classification Model for Scholarly Articles Based on Improved Graph Neural Network[J]. Data Analysis and Knowledge Discovery, 2022, 6(10): 93-102.
[15] Xie Xingyu, Yu Bengong. Automatic Classification of E-commerce Comments with Multi-Feature Fusion Model[J]. Data Analysis and Knowledge Discovery, 2022, 6(1): 101-112.