Neural Networks Integer Computation: Quantizing Convolutional Neural Networks of Inference and Training for Object Detection in Embedded Systems
By: Penghao Xiao, Chunjie Zhang, Qian Guo, Xiayang Xiao, Haipeng Wang
| Format: | Article |
| --- | --- |
| Published: | IEEE, 2024-01-01 |
Description
Deep neural networks (DNNs) have become the state-of-the-art technique for machine learning tasks. However, their high computational demands make them difficult to deploy on embedded devices with limited hardware resources and strict power budgets. Most embedded systems perform better with 8-bit data processing, prompting extensive research into 8-bit network quantization to enable faster inference. This article proposes a unified 8-bit inference and training framework for object detection tasks, aiming to balance accuracy and speed for conventional convolutional neural networks (CNNs). First, the article establishes a unified full-int8 post-training quantization (PTQ) method that uses KL divergence to evaluate the range of parameter distributions and thresholds before and after quantization; this method effectively addresses the quantization issues commonly found in networks with linear activations. For networks with nonlinear activations, the article introduces a hybrid-precision post-training quantization (H-PTQ) method that performs forward inference in mixed precision, thereby mitigating quantization errors caused by nonlinear activation functions. Furthermore, quantization-aware training (QAT) typically relies on the straight-through estimator (STE) to propagate gradients backward through the quantization function; because STE is only an approximation, the article proposes an alternative called α-quantization-aware training (α-QAT). This method replaces the quantized weights in the loss function with affine combinations of the quantized and full-precision weights, enabling more precise forward and backward propagation to fine-tune the errors introduced by quantization. Finally, the quantized networks were evaluated on ARM platforms across multiple datasets. The results indicate that the proposed PTQ, H-PTQ, and α-QAT methods achieve maximum accelerations of 4×, 2.3×, and 3.9×, respectively. In addition, they reduce memory overhead by up to 57.11%, 43.16%, and 91.94%, and achieve model compression rates of up to 51.52%, 48.48%, and 49.70%.
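To make the two core mechanisms described above concrete, the following is a minimal Python sketch, not the authors' implementation: a KL-divergence-based threshold search of the kind commonly used for int8 PTQ calibration, and the affine mixing of quantized and full-precision weights that the α-QAT description suggests. Function names, bin counts, and the symmetric int8 scheme are illustrative assumptions.

```python
import numpy as np

def kl_calibrate_threshold(activations, num_bins=2048, num_levels=128):
    """Choose a clipping threshold for symmetric int8 quantization by minimizing
    the KL divergence between the original activation histogram and its
    quantized (128-level) approximation. Illustrative sketch only."""
    hist, edges = np.histogram(np.abs(activations), bins=num_bins)
    hist = hist.astype(np.float64)
    best_kl, best_t = np.inf, edges[-1]
    for i in range(num_levels, num_bins + 1):
        # Reference distribution p: keep the first i bins and fold the clipped
        # tail mass into the last kept bin.
        p = hist[:i].copy()
        p[-1] += hist[i:].sum()
        # Candidate distribution q: merge the i bins into num_levels groups
        # (the int8 levels), then spread each group's mass uniformly over its
        # originally non-empty bins so p and q share the same support size.
        q = np.concatenate([
            np.where(c > 0, c.sum() / max((c > 0).sum(), 1), 0.0)
            for c in np.array_split(hist[:i], num_levels)
        ])
        if p.sum() == 0 or q.sum() == 0:
            continue
        p, q = p / p.sum(), q / q.sum()
        mask = (p > 0) & (q > 0)
        kl = np.sum(p[mask] * np.log(p[mask] / q[mask]))
        if kl < best_kl:
            best_kl, best_t = kl, edges[i]
    return best_t  # activations beyond this magnitude are clipped before quantization

def alpha_qat_weights(w, alpha, scale):
    """Affine combination of fake-quantized and full-precision weights for the
    forward pass, as the alpha-QAT description in the abstract suggests."""
    w_q = np.clip(np.round(w / scale), -128, 127) * scale  # symmetric int8 fake quantization
    return alpha * w_q + (1.0 - alpha) * w

if __name__ == "__main__":
    acts = np.random.randn(10000) * 0.5
    t = kl_calibrate_threshold(acts)
    w = np.random.randn(64, 64).astype(np.float32)
    w_mix = alpha_qat_weights(w, alpha=0.7, scale=t / 127.0)
    print(f"calibrated threshold: {t:.4f}, "
          f"mixed-weight mean abs error: {np.abs(w_mix - w).mean():.6f}")
```

In this reading, α interpolates between full-precision training (α = 0) and fully quantized forward passes (α = 1), so gradients can flow through the smooth full-precision term rather than relying solely on the STE approximation; the exact schedule and placement of α in the loss are specified in the article itself.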