TY - JOUR
T1 - A Dual-Precision and Low-Power CNN Inference Engine Using a Heterogeneous Processing-in-Memory Architecture
AU - Jung, Sangwoo
AU - Lee, Jaehyun
AU - Park, Dahoon
AU - Lee, Youngjoo
AU - Yoon, Jong Hyeok
AU - Kung, Jaeha
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
AB - In this article, we present an energy-scalable CNN model that can adapt to different hardware resource constraints. Specifically, we propose a dual-precision network, named DualNet, that leverages two independent bit-precision paths (INT4 and ternary-binary). DualNet achieves both high accuracy and low complexity by balancing the ratio between the two paths. We also present an evolutionary algorithm that automatically searches for the optimal ratios. In addition to the novel CNN architecture design, we develop heterogeneous processing-in-memory (PIM) hardware that integrates SRAM- and eDRAM-based PIMs to efficiently compute the two precision paths in parallel. To verify the energy efficiency of DualNet running on the heterogeneous PIM, we prototype a test chip in 28-nm CMOS technology. To maximize hardware efficiency, we utilize an improved data mapping scheme that achieves the most effective deployment of DualNets on multiple PIM arrays. With the proposed SW-HW co-optimization, we obtain the most energy-efficient DualNet model operating on the actual PIM hardware. Compared to other quantized networks with a single bit precision, DualNet reduces energy consumption, memory footprint, and latency by 29.0%, 49.5%, and 47.3% on average, respectively, on the CIFAR-10/100 and ImageNet datasets.
KW - Convolutional neural networks
KW - SW-HW co-optimization
KW - deep learning
KW - mixed-precision quantization
KW - processing-in-memory
UR - http://www.scopus.com/inward/record.url?scp=85193248367&partnerID=8YFLogxK
U2 - 10.1109/TCSI.2024.3395842
DO - 10.1109/TCSI.2024.3395842
M3 - Article
AN - SCOPUS:85193248367
SN - 1549-8328
VL - 71
SP - 5546
EP - 5559
JO - IEEE Transactions on Circuits and Systems I: Regular Papers
JF - IEEE Transactions on Circuits and Systems I: Regular Papers
IS - 12
ER -
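
Editor's note: the abstract describes a layer with two parallel bit-precision paths (INT4 and ternary) whose channel ratio is a tunable knob. Below is a minimal, hypothetical PyTorch sketch of that idea, added for illustration only; it is not the authors' implementation. The names DualPathConv and alpha, and both quantizer heuristics, are assumptions introduced here.

    # Minimal sketch, NOT the authors' code: one dual-precision layer whose
    # output channels are split by ratio alpha between an INT4 path and a
    # ternary path, evaluated in parallel and concatenated, mirroring the two
    # bit-precision paths described in the abstract.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def quantize_int4(w: torch.Tensor) -> torch.Tensor:
        # Fake-quantize weights to signed INT4: scale to [-7, 7], round, rescale.
        scale = w.abs().max() / 7.0 + 1e-8
        return torch.round(w / scale).clamp(-8, 7) * scale

    def quantize_ternary(w: torch.Tensor) -> torch.Tensor:
        # Fake-quantize weights to {-s, 0, +s} with a common threshold heuristic.
        delta = 0.7 * w.abs().mean()
        t = torch.sign(w) * (w.abs() > delta).float()
        scale = w.abs()[t != 0].mean() if (t != 0).any() else w.new_tensor(1.0)
        return t * scale

    class DualPathConv(nn.Module):
        """Split out_ch between an INT4 path and a ternary path by ratio alpha."""

        def __init__(self, in_ch: int, out_ch: int, alpha: float = 0.5, k: int = 3):
            super().__init__()
            hi = min(max(int(round(alpha * out_ch)), 1), out_ch - 1)
            self.conv_hi = nn.Conv2d(in_ch, hi, k, padding=k // 2, bias=False)
            self.conv_lo = nn.Conv2d(in_ch, out_ch - hi, k, padding=k // 2, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The two paths are independent (in the paper's hardware they map to
            # SRAM- and eDRAM-based PIM arrays) and are merged channel-wise.
            y_hi = F.conv2d(x, quantize_int4(self.conv_hi.weight), padding=self.conv_hi.padding)
            y_lo = F.conv2d(x, quantize_ternary(self.conv_lo.weight), padding=self.conv_lo.padding)
            return torch.cat([y_hi, y_lo], dim=1)

    if __name__ == "__main__":
        layer = DualPathConv(16, 32, alpha=0.75)      # 75% of channels on the INT4 path
        print(layer(torch.randn(1, 16, 8, 8)).shape)  # torch.Size([1, 32, 8, 8])

In this sketch alpha is the per-layer ratio the abstract's evolutionary search would tune: larger alpha favors the higher-precision INT4 path (accuracy), smaller alpha favors the ternary path (energy and footprint).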