NeurIPS Dynamic-TinyBERT: Further Enhance the Inference Efficiency of TinyBERT by Dynamic Sequence Length

Spotlight in Workshop: Efficient Natural Language and Speech Processing (Models, Training, and Inference)

Dynamic-TinyBERT: Further Enhance the Inference Efficiency of TinyBERT by Dynamic Sequence Length

Shira Guskin · Ke Ding · Gyuwan Kim

Abstract:

Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized. TinyBERT addresses computational efficiency by self-distilling BERT into a smaller transformer representation with fewer layers and a smaller internal embedding. However, TinyBERT's performance drops when we reduce the number of layers by 50%, and drops even more abruptly when we reduce the number of layers by 75%, for advanced NLP tasks such as span question answering. Additionally, a separate model must be trained for each inference scenario with its distinct computational budget. In this work we present Dynamic-TinyBERT, a TinyBERT model that utilizes sequence-length reduction and hyperparameter optimization for enhanced inference efficiency under any computational budget. Dynamic-TinyBERT is trained only once, performing on par with BERT and achieving an accuracy-speedup trade-off superior to that of other efficient approaches (up to 3.3x with <1% loss-drop). Upon publication, the code to reproduce our work will be open-sourced.
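
As a rough illustration of what sequence-length reduction means, the sketch below keeps only the highest-scoring token vectors at a layer and estimates how a per-layer length schedule changes the self-attention cost. The abstract does not say how tokens are scored or how lengths are chosen, so the function names (`reduce_sequence`, `relative_attention_cost`), the importance scores, and the quadratic cost model are all assumptions made for this example, not the authors' implementation.

```python
from typing import List, Sequence


def reduce_sequence(hidden: List[List[float]],
                    scores: Sequence[float],
                    keep: int) -> List[List[float]]:
    """Keep the `keep` highest-scoring token vectors, preserving token order.

    `scores` is a hypothetical per-token importance value; how it is
    computed is not specified by the abstract.
    """
    if keep >= len(hidden):
        return list(hidden)
    # Rank positions by score (highest first), then restore the original order.
    top = sorted(range(len(hidden)), key=lambda i: scores[i], reverse=True)[:keep]
    return [hidden[i] for i in sorted(top)]


def relative_attention_cost(lengths: Sequence[int], full_length: int) -> float:
    """Self-attention cost relative to using `full_length` at every layer.

    Assumes the per-layer cost grows with the square of the sequence length
    and ignores the feed-forward sublayers, so it is only a rough estimate.
    """
    return sum(n * n for n in lengths) / (len(lengths) * full_length ** 2)


if __name__ == "__main__":
    tokens = [[0.0], [1.0], [2.0], [3.0]]   # four toy token vectors
    importance = [0.1, 0.9, 0.2, 0.8]       # made-up importance scores
    print(reduce_sequence(tokens, importance, keep=2))
    print(relative_attention_cost([4, 2], full_length=4))
```

Under these assumptions, a different length schedule can be picked for each computational budget at inference time without retraining, which is the kind of trade-off a hyperparameter search over schedules would explore.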
