A Comprehensive Review of Knowledge Distillation- Methods, Applications, and Future Directions


  • Elly Yijun Zhu, San Francisco Bay University, USA
  • Chao Zhao, Georgia Institute of Technology, USA
  • Haoyu Yang, Georgia Institute of Technology, USA
  • Jing Li, Independent Researcher, USA
  • Yue Wu, Independent Researcher, USA
  • Rui Ding, San Francisco Bay University, USA


Keywords: Knowledge Distillation, Model Compression, Neural Networks, Soft Labels


Knowledge distillation is a model compression technique that improves the performance and efficiency of a smaller model (the student) by transferring knowledge from a larger model (the teacher). The teacher's outputs, such as soft labels, intermediate features, or attention weights, serve as additional supervisory signals that guide the student's training. The resulting student requires less computation and storage while retaining most of the teacher's accuracy and, in some cases, matching or surpassing it. Research on knowledge distillation has evolved significantly since its inception in the 1980s, especially after Hinton and colleagues introduced soft labels in 2015. Advances include methods for extracting richer knowledge, knowledge sharing among models, integration with other compression techniques, and applications in domains such as natural language processing and reinforcement learning. This article provides a comprehensive review of knowledge distillation, covering its concepts, methods, applications, challenges, and future directions.
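To make the role of soft labels concrete, the following is a minimal sketch of a soft-label distillation loss in the style of Hinton and colleagues. It is an illustration under stated assumptions, not the formulation of any specific method reviewed here: the function names, the temperature of 4.0, and the weighting factor alpha of 0.5 are hypothetical choices, and only the Python standard library is used.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-label term and a hard-label term.

    temperature and alpha are illustrative defaults (assumptions),
    not values taken from the article.
    """
    # Soft labels: teacher probabilities computed at the raised temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)

    # KL(teacher || student) between the softened distributions.
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student_soft) if pt > 0)

    # Standard cross-entropy against the ground-truth class at temperature 1.
    p_student = softmax(student_logits)
    ce = -math.log(p_student[true_label])

    # Scaling by T^2 keeps the soft-term gradients comparable across temperatures.
    return alpha * (temperature ** 2) * kl + (1 - alpha) * ce
```

When the student's logits equal the teacher's, the soft-label term vanishes and only the hard-label cross-entropy remains. Raising the temperature flattens the teacher distribution, which exposes more of the relative similarity between non-target classes to the student.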



Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

Wang W, Wei F, Dong L, Bao H, Yang N, Zhou M. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers//Proceedings of the Advances in Neural Information Processing Systems. 2020: 1-15

Wang Z, Deng Z, Wang S. Accelerating convolutional neural networks with dominant convolutional kernel and knowledge pre-regression//Proceedings of the European Conference on Computer Vision. Amsterdam, The Netherlands, 2016: 533-548

Li T, Li J, Liu Z, Zhang C. Few sample knowledge distillation for efficient network compression//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA, 2020: 14639-14647

Polino A, Pascanu R, Alistarh D. Model compression via distillation and quantization//Proceedings of the 6th International Conference on Learning Representations. Vancouver, Canada, 2018: 1-21

Tang Z, Wang D, Zhang Z. Recurrent neural network training with dark knowledge transfer//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Shanghai, China, 2016: 5900-5904

Yuan L, Tay F E H, Li G, Wang T, Feng J. Revisiting knowledge distillation via label smoothing regularization//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA, 2020: 3903-3911

Chen G, Choi W, Yu X, Han T, Chandraker M. Learning efficient object detection models with knowledge distillation//Proceedings of the 31st International Conference on Neural Information Processing Systems. Long Beach, USA, 2017: 742-751

Wang T, Yuan L, Zhang X, Feng J. Distilling object detectors with fine-grained feature imitation//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, USA, 2019: 4933-4942

Hou Y, Ma Z, Liu C, Hui T-W, Loy C C. Inter-Region affinity distillation for road marking segmentation//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA, 2020: 12486-12495

Liu Y, Chen K, Liu C, Qin Z, Luo Z, Wang J. Structured knowledge distillation for semantic segmentation//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, USA, 2019: 2604-2613

Takashima R, Sheng L, Kawai H. Investigation of sequence-level knowledge distillation methods for CTC acoustic models//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Brighton, UK, 2019: 6156-6160

Huang M, You Y, Chen Z, Qian Y, Yu K. Knowledge distillation for sequence model//Proceedings of the 19th Annual Conference of the International Speech Communication Association. Hyderabad, India, 2018: 3703-3707

Gotmare A, Keskar N S, Xiong C, Socher R. A closer look at deep learning heuristics: learning rate restarts, warmup and distillation//Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA, 2019: 1-16

Romero A, Ballas N, Kahou S E, Chassang A, Gatta C, Bengio Y. FitNets: hints for thin deep nets//Proceedings of the 3rd International Conference on Learning Representations. San Diego, USA, 2015: 1-13

Zagoruyko S, Komodakis N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer//Proceedings of the 5th International Conference on Learning Representations. Toulon, France, 2017: 1-13

Li X, Xiong H, Wang H, Rao Y, Liu L, Huan J. DELTA: deep learning transfer using feature map with attention for convolutional networks//Proceedings of the 7th International Conference on Learning Representations. New Orleans, USA, 2019: 1-13

Passalis N, Tefas A. Learning deep representations with probabilistic knowledge transfer//Proceedings of the European Conference on Computer Vision (ECCV). Munich, Germany, 2018: 268-284

Yim J, Joo D, Bae J, Kim J. A gift from knowledge distillation: fast optimization, network minimization and transfer learning//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Honolulu, USA, 2017: 4133-4141

Park W, Kim D, Lu Y, Cho M. Relational knowledge distillation//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Long Beach, USA, 2019: 3967-3976

Srinivas S, Fleuret F. Knowledge transfer with Jacobian Matching//Proceedings of the 35th International Conference on Machine Learning. Stockholm, Sweden, 2018: 4723-4731

Lee S H, Kim D H, Song B C. Self-supervised knowledge distillation using singular value decomposition//Proceedings of the 15th European Conference on Computer Vision (ECCV). Munich, Germany, 2018: 339-354

Chen Y, Wang N, Zhang Z. DarkRank: accelerating deep metric learning via cross sample similarities transfer//Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans, USA, 2018: 2852-2859

Peng B, Jin X, Liu J, Zhou S, Wu Y, Liu J, Zhang Z, Liu Y. Correlation congruence for knowledge distillation//Proceedings of the IEEE International Conference on Computer Vision. Seoul, Korea, 2019: 5006-5015

Lee S, Song B C. Graph-based knowledge distillation by multihead attention network//Proceedings of the 30th British Machine Vision Conference. Cardiff, UK, 2019: 141

Bajestani M F, Yang Y. TKD: Temporal knowledge distillation for active perception//Proceedings of the IEEE Winter Conference on Applications of Computer Vision. Snowmass Village, USA, 2020: 953-962

Liu Y, Shu C, Wang J, Shen C. Structured knowledge distillation for dense prediction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, (42): 1-15

Xu X, Zou Q, Lin X, Huang Y, Tian Y. Integral knowledge distillation for multi-person pose estimation. IEEE Signal Processing Letters, 2020, (27): 436-440

Goodfellow I J, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets//Proceedings of the Advances in Neural Information Processing Systems. Montreal, Canada, 2014: 2672-2680

You, S., Xu, C., Xu, C., & Tao, D. Learning from multiple teacher networks. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 2017: 1285-1294

Liu, I. J., Peng, J., & Schwing, A. G. (2019). Knowledge flow: Improve upon your teachers. arXiv preprint arXiv:1904.05878.

Mirzadeh, S. I., Farajtabar, M., Li, A., Levine, N., Matsukawa, A., & Ghasemzadeh, H. (2020, April). Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence (Vol. 34, No. 04, pp. 5191-5198).

Gou, J., Yu, B., Maybank, S. J., & Tao, D. (2021). Knowledge distillation: A survey. International Journal of Computer Vision, 129(6), 1789-1819.

Yang, G., Tang, Y., Wu, Z., Li, J., Xu, J., & Wan, X. (2024, April). DMKD: Improving Feature-Based Knowledge Distillation for Object Detection Via Dual Masking Augmentation. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 3330-3334). IEEE.

Kim, S., Kim, G., Shin, S., & Lee, S. (2021, June). Two-stage textual knowledge distillation for end-to-end spoken language understanding. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 7463-7467). IEEE.

Zhang Y, Xiang T, Hospedales T M, Lu H. Deep mutual learning//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Salt Lake City, USA, 2018: 4320-4328

Chen D, Mei J P, Wang C, Feng Y, Chen C. Online knowledge distillation with diverse peers//Proceedings of the AAAI Conference on Artificial Intelligence. New York, USA, 2020: 3430-3437

Li Z, Hoiem D. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 40(12): 2935-2947

Hou S, Pan X, Change Loy C, Wang Z, Lin D. Lifelong learning via progressive distillation and retrospection//Proceedings of the European Conference on Computer Vision (ECCV). Munich, Germany, 2018: 437-452

Yun S, Park J, Lee K, Shin J. Regularizing class-wise predictions via self-knowledge distillation//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, USA, 2020: 13876-13885

Xu T B, Liu C L. Data-distortion guided self-distillation for deep neural networks//Proceedings of the AAAI Conference on Artificial Intelligence. Honolulu, USA, 2019, 33: 5565-5572

Nie X, Li Y, Luo L, Zhang N, Feng J. Dynamic kernel distillation for efficient pose estimation in videos//Proceedings of the IEEE International Conference on Computer Vision. Seoul, Korea, 2019: 6942-6950

Targ, S., Almeida, D., & Lyman, K. (2016). ResNet in ResNet: Generalizing residual architectures. arXiv preprint arXiv:1603.08029.

Sengupta, A., Ye, Y., Wang, R., Liu, C., & Roy, K. (2019). Going deeper in spiking neural networks: VGG and residual architectures. Frontiers in neuroscience, 13, 95.

Sinha, D., & El-Sharkawy, M. (2019, October). Thin MobileNet: An enhanced MobileNet architecture. In 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON) (pp. 0280-0285). IEEE.

Zhang, X., Zhou, X., Lin, M., & Sun, J. (2018). ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 6848-6856).

He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).

Jiang, P., Ergu, D., Liu, F., Cai, Y., & Ma, B. (2022). A review of YOLO algorithm developments. Procedia Computer Science, 199, 1066-1073.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., & Berg, A. C. (2016). SSD: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14 (pp. 21-37). Springer International Publishing.

Fang, W., Wang, L., & Ren, P. (2019). Tinier-YOLO: A real-time object detection method for constrained environments. IEEE Access, 8, 1935-1944.

Chen, L. C., Papandreou, G., Kokkinos, I., Murphy, K., & Yuille, A. L. (2017). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4), 834-848.

Zhao, H., Shi, J., Qi, X., Wang, X., & Jia, J. (2017). Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2881-2890).

Badrinarayanan, V., Kendall, A., & Cipolla, R. (2017). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12), 2481-2495.

Paszke, A., Chaurasia, A., Kim, S., & Culurciello, E. (2016). ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147.

Koroteev, M. V. (2021). BERT: a review of applications in natural language processing and understanding. arXiv preprint arXiv:2103.11943.

Floridi, L., & Chiriatti, M. (2020). GPT-3: Its nature, scope, limits, and consequences. Minds and Machines, 30, 681-694.

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., & Zhou, D. (2020). MobileBERT: a compact task-agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984.




How to Cite

E. Y. Zhu, C. Zhao, H. Yang, J. Li, Y. Wu, and R. Ding, “A Comprehensive Review of Knowledge Distillation- Methods, Applications, and Future Directions”, IJIRCST, vol. 12, no. 3, pp. 106–112, May 2024.



