INT8 quantization is one of the key features in PyTorch* for speeding up deep learning inference. By reducing the precision of weights and activations in neural networks from the standard 32-bit floating point format to 8-bit integer format, INT8 quantization can significantly reduce the memory bandwidth and computational resources required for inference, allowing for faster and more energy-efficient execution.

In this post, we are introducing the "X86" quantization backend, which is newly added in the PyTorch 2.0 release and replaces FBGEMM as the default quantization backend for x86 platforms. Before PyTorch 2.0, the default quantization backend on x86 CPUs was named "FBGEMM" and leveraged the FBGEMM performance library to achieve its speedup.

The "X86" quantization backend offers improved INT8 inference performance compared to the original FBGEMM backend by leveraging the strengths of both the FBGEMM and oneAPI Deep Neural Network Library (oneDNN) kernel libraries. It brought about a 2.97X geomean INT8 inference performance speedup over FP32 (measured on a broad scope of 69 popular deep learning models) by taking advantage of HW-accelerated INT8 convolution and matmul with Intel® DL Boost and Intel® Advanced Matrix Extensions technologies on 4th Generation Intel® Xeon® Scalable Processors.
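To make the workflow concrete, here is a minimal post-training static quantization sketch using PyTorch's eager-mode API, showing how the "x86" backend is selected via the qconfig and quantized engine. The toy model, input shapes, and random calibration data are placeholders for illustration, not part of the original post.

```python
import torch
from torch.ao.quantization import (
    DeQuantStub,
    QuantStub,
    convert,
    get_default_qconfig,
    prepare,
)

class ToyModel(torch.nn.Module):
    """Hypothetical conv + ReLU model used only to demo the flow."""
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # marks the FP32 -> INT8 boundary
        self.conv = torch.nn.Conv2d(3, 16, kernel_size=3)
        self.relu = torch.nn.ReLU()
        self.dequant = DeQuantStub()  # marks the INT8 -> FP32 boundary

    def forward(self, x):
        x = self.quant(x)
        x = self.conv(x)
        x = self.relu(x)
        return self.dequant(x)

model = ToyModel().eval()

# Select the "x86" engine (the default on x86 CPUs since PyTorch 2.0)
torch.backends.quantized.engine = "x86"
model.qconfig = get_default_qconfig("x86")

# Insert observers, then calibrate with representative data
# (random tensors stand in for a real calibration set here)
prepared = prepare(model)
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(1, 3, 32, 32))

# Convert to an INT8 model and run inference
quantized = convert(prepared)
output = quantized(torch.randn(1, 3, 32, 32))
```

Because "x86" is the default backend in PyTorch 2.0, existing FBGEMM-based code typically picks up the new kernels without changes; the explicit engine assignment above simply makes the choice visible.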