Workshop: Second Workshop on Efficient Natural Language and Speech Processing (ENLSP-II)

INT8 Transformers for Inference Acceleration

Andy Rock · Omar Khalil · Ofer Shai · Paul Grouchy

Keywords: [ ENLSP-Main ]


Given the trend towards ever larger models in deep learning, particularly Transformers, much work has gone into reducing the cost of inference. In this work, we reach a new low, quantizing all weights and activations of BERT to 8-bit integers. GELU and exp are implemented with integer lookup tables, achieving optimal INT8 quantization error. We introduce a generalized technique for computing operations that are frequently missing on integer-only hardware (e.g., divisions, roots) via an efficient instantiation of binary search. By applying it to the intermediate computations in Softmax and LayerNorm, we obtain accurate implementations of these layers as well. We evaluate our approach on several GLUE tasks and demonstrate minimal accuracy degradation.
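
The abstract does not spell out how the binary search is instantiated, so the following is only a generic sketch of the idea, not the authors' implementation. It assumes non-negative inputs and a result that fits in `bits` bits; the function names `int_sqrt` and `int_div` and the `bits` parameter are hypothetical. Each function builds its answer one bit at a time, from the most significant bit down, using only integer shifts, multiplications, and comparisons, which is one common way to run a binary search on hardware without a divider or a root unit.

```python
def int_sqrt(n: int, bits: int = 16) -> int:
    """Return floor(sqrt(n)) using only integer shift, multiply, and compare.

    Assumption: 0 <= n < 2 ** (2 * bits), so the root fits in `bits` bits.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    root = 0
    # Binary search over the bits of the answer, most significant first.
    for shift in range(bits - 1, -1, -1):
        candidate = root | (1 << shift)
        # Keep the bit only if the candidate does not overshoot the target.
        if candidate * candidate <= n:
            root = candidate
    return root


def int_div(numerator: int, denominator: int, bits: int = 16) -> int:
    """Return floor(numerator / denominator) without a division instruction.

    Assumption: numerator >= 0, denominator > 0, and the quotient fits in
    `bits` bits.
    """
    if numerator < 0 or denominator <= 0:
        raise ValueError("expected numerator >= 0 and denominator > 0")
    quotient = 0
    # Same bitwise binary search: test each bit of the quotient in turn.
    for shift in range(bits - 1, -1, -1):
        candidate = quotient | (1 << shift)
        if candidate * denominator <= numerator:
            quotient = candidate
    return quotient
```

In a setting like the one described, a routine such as `int_div` could stand in for the normalizing division in Softmax, and `int_sqrt` for the standard-deviation step in LayerNorm. Both loops run a fixed `bits` iterations, so the cost is predictable regardless of the input value.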
