GPT is an auto-regressive, Transformer-based, pre-trained language model that has attracted considerable attention in the natural language processing (NLP) domain due to its state-of-the-art performance on several downstream tasks. The success of GPT is mostly attributed to its pre-training on a huge amount of data and its large number of parameters (from 100M to billions). Despite the superior performance of GPT, especially in few-shot or zero-shot settings, this overparameterization makes the model prohibitively expensive to deploy on devices with limited computational power or memory. The problem can be mitigated with model compression techniques; however, compressing GPT models has received little attention in the literature. In this work, we use Kronecker decomposition to compress the linear mappings of the GPT-2 model. Our Kronecker GPT-2 model (KnGPT2) is initialized from the Kronecker-decomposed version of GPT-2 and then undergoes a very light pre-training on only a small portion of the training data with intermediate-layer knowledge distillation (ILKD). Finally, KnGPT2 is fine-tuned on downstream tasks, again using ILKD. We evaluate our model on both language modeling and the General Language Understanding Evaluation (GLUE) benchmark and show that, with more efficient pre-training and a similar number of parameters, KnGPT2 significantly outperforms the existing DistilGPT2 model.
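To illustrate the core idea, the sketch below shows a Kronecker-factored linear layer in PyTorch: the dense weight matrix W of a linear mapping is replaced by two much smaller factors A and B with W ≈ A ⊗ B, and the forward pass uses the Kronecker-product identity so the full weight is never materialized. This is a minimal sketch under stated assumptions, not the authors' implementation; the class name, factor shapes, split sizes, and initialization are illustrative choices.

```python
# Minimal sketch (not the paper's code) of a Kronecker-factored linear layer.
# The dense weight W (out_f x in_f) is replaced by W ≈ kron(A, B); only the
# small factors A and B are stored and trained.
import torch
import torch.nn as nn


class KroneckerLinear(nn.Module):
    def __init__(self, in_f, out_f, in_split, out_split):
        super().__init__()
        assert in_f % in_split == 0 and out_f % out_split == 0
        # W (out_f x in_f) ≈ kron(A, B) with
        #   A: (out_split, in_split)
        #   B: (out_f // out_split, in_f // in_split)
        self.A = nn.Parameter(0.02 * torch.randn(out_split, in_split))
        self.B = nn.Parameter(0.02 * torch.randn(out_f // out_split, in_f // in_split))
        self.bias = nn.Parameter(torch.zeros(out_f))

    def forward(self, x):
        # Equivalent to x @ kron(A, B).T + bias, but without materializing the
        # full weight: with row-major reshapes, (A ⊗ B) vec(X) = vec(A X B^T).
        batch_shape = x.shape[:-1]
        X = x.reshape(-1, self.A.shape[1], self.B.shape[1])    # (N, in_split, in_f/in_split)
        Y = torch.einsum("ij,njl,kl->nik", self.A, X, self.B)  # (N, out_split, out_f/out_split)
        return Y.reshape(*batch_shape, -1) + self.bias


# Example: a GPT-2-small MLP projection (768 -> 3072). The dense layer has
# ~2.36M weights; this factorization stores 4,672 factor weights plus the bias.
# (The split sizes here are arbitrary, not the paper's configuration.)
layer = KroneckerLinear(768, 3072, in_split=64, out_split=64)
y = layer(torch.randn(2, 10, 768))  # -> shape (2, 10, 3072)
```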
Author Information
Ali Edalati (Huawei Technologies)
Marzieh Tahaei (Huawei Noah's Ark Lab)
Ahmad Rashid (Huawei Technologies)
Vahid Partovi Nia (Huawei Noah's Ark Lab)
James J. Clark (McGill University)
Mehdi Rezagholizadeh (Huawei Technologies)
More from the Same Authors
- 2021: NATURE: Natural Auxiliary Text Utterances for Realistic Spoken Language Evaluation »
  David Alfonso-Hermelo · Ahmad Rashid · Abbas Ghaddar · Philippe Langlais · Mehdi Rezagholizadeh
- 2021: A Short Study on Compressing Decoder-Based Language Models »
  Tianda Li · Yassir El Mesbahi · Ivan Kobyzev · Ahmad Rashid · Atif Mahmud · Nithin Anchuri · Habib Hajimolahoseini · Yang Liu · Mehdi Rezagholizadeh
- 2021: Compressing Pre-trained Language Models using Progressive Low Rank Decomposition »
  Habib Hajimolahoseini · Mehdi Rezagholizadeh · Vahid Partovi Nia · Marzieh Tahaei · Omar Mohamed Awad · Yang Liu
- 2022: Strategies for Applying Low Rank Decomposition to Transformer-Based Models »
  Habib Hajimolahoseini · Walid Ahmed · Mehdi Rezagholizadeh · Vahid Partovi Nia · Yang Liu
- 2022: DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low Rank Adaptation »
  Mojtaba Valipour · Mehdi Rezagholizadeh · Ivan Kobyzev · Ali Ghodsi
- 2022: Improved Knowledge Distillation by Utilizing Backward Pass Knowledge in Neural Networks »
  Aref Jafari · Mehdi Rezagholizadeh · Ali Ghodsi
- 2022: Improving the Robustness of DistilHuBERT to Unseen Noisy Conditions via Data Augmentation, Curriculum Learning, and Multi-Task Enhancement »
  Heitor Guimarães · Arthur Pimentel · Anderson R. Avila · Mehdi Rezagholizadeh · Tiago H Falk
- 2022: Attribute Controlled Dialogue Prompting »
  Runcheng Liu · Ahmad Rashid · Ivan Kobyzev · Mehdi Rezagholizadeh · Pascal Poupart
- 2023 Poster: Understanding Neural Network Binarization with Forward and Backward Proximal Quantizers »
  Yiwei Lu · Yaoliang Yu · Xinlin Li · Vahid Partovi Nia
- 2022 Poster: Is Integer Arithmetic Enough for Deep Learning Training? »
  Alireza Ghaffari · Marzieh S. Tahaei · Mohammadreza Tayaranian · Masoud Asgharian · Vahid Partovi Nia
- 2021 Workshop: Efficient Natural Language and Speech Processing (Models, Training, and Inference) »
  Mehdi Rezagholizadeh · Lili Mou · Yue Dong · Pascal Poupart · Ali Ghodsi · Qun Liu
- 2021 Poster: Demystifying and Generalizing BinaryConnect »
  Tim Dockhorn · Yaoliang Yu · Eyyüb Sari · Mahdi Zolnouri · Vahid Partovi Nia
- 2021 Poster: S$^3$: Sign-Sparse-Shift Reparametrization for Effective Training of Low-bit Shift Networks »
  Xinlin Li · Bang Liu · Yaoliang Yu · Wulong Liu · Chunjing XU · Vahid Partovi Nia