When quantizing neural networks for efficient inference, low-bit integers are the go-to format. However, low-bit floating-point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper investigates this benefit of the floating-point format for neural network inference in depth. We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent, and show analytically in which settings these choices give better performance. We then show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and present a new algorithm that enables the learning of both the scale parameters and the number of exponent bits in the FP8 format. Our chief conclusion is that, when doing post-training quantization for a wide range of networks, the FP8 format is better than INT8 in terms of accuracy, and that the choice of the number of exponent bits is driven by the severity of outliers in the network. We also conduct experiments with quantization-aware training, where the difference between the formats disappears as the network is trained to reduce the effect of outliers.
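As a rough illustration of the kind of simulated ("fake") FP8 quantization the abstract refers to, the NumPy sketch below rounds values onto an FP8 grid with a configurable exponent/mantissa split. The fp8_quantize helper, its default IEEE-style bias, and the omission of special NaN/Inf encodings are assumptions made for illustration; this is not the authors' implementation.

import numpy as np

def fp8_quantize(x, n_exp=4, n_man=3, bias=None):
    """Round x to the nearest value on a signed FP8 grid with n_exp exponent
    bits and n_man mantissa bits (1 sign bit + n_exp + n_man = 8 bits).
    Illustrative sketch only; special NaN/Inf encodings are ignored."""
    if bias is None:
        bias = 2 ** (n_exp - 1) - 1                # IEEE-style exponent bias
    x = np.asarray(x, dtype=np.float64)
    max_exp = 2 ** n_exp - 1 - bias                # largest usable exponent
    max_val = (2 - 2.0 ** -n_man) * 2.0 ** max_exp # largest representable magnitude
    sign = np.sign(x)
    mag = np.minimum(np.abs(x), max_val)           # clip overflow to the max value
    # Per-element exponent, clamped to the subnormal range [1 - bias, max_exp].
    exp = np.clip(np.floor(np.log2(np.maximum(mag, np.finfo(np.float64).tiny))),
                  1 - bias, max_exp)
    step = 2.0 ** (exp - n_man)                    # grid spacing at this exponent
    return sign * np.round(mag / step) * step

# Example: more exponent bits widen the range but coarsen the mantissa grid.
x = np.array([0.07, 1.3, 57.0, 900.0])
print(fp8_quantize(x, n_exp=4, n_man=3))           # E4M3-style grid
print(fp8_quantize(x, n_exp=5, n_man=2))           # E5M2-style grid

Trading mantissa bits for exponent bits (e.g. an E5M2-style grid versus E4M3) widens the dynamic range at the cost of precision, which is why, as the abstract notes, the severity of outliers drives the choice of the number of exponent bits.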
Author Information
Andrey Kuzmin (Qualcomm)
Mart van Baalen (Qualcomm)
Yuwei Ren (Qualcomm)
Markus Nagel (Qualcomm AI Research)
Jorn Peters (University of Amsterdam)
Tijmen Blankevoort (Qualcomm)
More from the Same Authors
- 2021 : Spatial-Temporal Gated Transformers for Efficient Video Processing »
  Yawei Li · Babak Ehteshami Bejnordi · Bert Moons · Tijmen Blankevoort · Amirhossein Habibian · Radu Timofte · Luc V Gool
- 2022 Expo Demonstration: Conditional Compute for On-device Video Understanding »
  Tijmen Blankevoort
- 2021 : Real-Time and Accurate Self-Supervised Monocular Depth Estimation on Mobile Device »
  Hong Cai · Yinhao Zhu · Janarbek Matai · Fatih Porikli · Fei Yin · Tushar Singhal · Bharath Ramaswamy · Frank Mayer · Chirag Patel · Parham Noorzad · Andrii Skliar · Tijmen Blankevoort · Joseph Soriaga · Ron Tindall · Pat Lawlor
- 2021 Social: Shine in Your Technical Presentation »
  Armina Stepan · Tijmen Blankevoort
- 2020 Poster: Bayesian Bits: Unifying Quantization and Pruning »
  Mart van Baalen · Christos Louizos · Markus Nagel · Rana Ali Amjad · Ying Wang · Tijmen Blankevoort · Max Welling
- 2019 Poster: Integer Discrete Flows and Lossless Compression »
  Emiel Hoogeboom · Jorn Peters · Rianne van den Berg · Max Welling