Abstract
In this letter, we present deep partitioned training to accelerate the computations involved in training DNN models. This is the first work that partitions a DNN model across storage devices, an NPU, and a host CPU, forming a unified compute node for training workloads. To validate the benefit of the proposed system during DNN training, a trace-based simulator or an FPGA prototype is used to estimate the overall performance and to determine the partition layer index that yields the minimum latency. As a case study, we select two benchmarks, namely vision-related tasks and a recommendation system. As a result, the training time is reduced by 12.2∼31.0 percent with four near-storage computing devices on the vision-related tasks with a mini-batch size of 512, and by 40.6∼44.7 percent with one near-storage computing device on the selected recommendation system with a mini-batch size of 64.
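The partition-point search the abstract describes can be illustrated with a minimal sketch: assuming a simple additive latency model (the paper itself obtains these estimates from a trace-based simulator or an FPGA prototype), every candidate split layer is evaluated and the index with the lowest estimated end-to-end latency is kept. All function names and cost values below are hypothetical placeholders, not the paper's actual cost model.

```python
# Hypothetical sketch of a partition-point search: evaluate every candidate
# split layer and keep the one with the minimum estimated latency.
# The additive cost model below is an illustrative assumption only.

def estimated_latency(split_idx, costs_nsc, costs_host, transfer_cost):
    """Latency if layers [0, split_idx) run on the near-storage device and
    layers [split_idx, n) run on the NPU/host, plus one activation transfer
    at the boundary (all costs in arbitrary time units)."""
    near_storage = sum(costs_nsc[:split_idx])
    host_side = sum(costs_host[split_idx:])
    return near_storage + transfer_cost[split_idx] + host_side

def best_partition(costs_nsc, costs_host, transfer_cost):
    n = len(costs_nsc)
    # Try every split point, including 0 (all on host) and n (all near storage).
    return min(range(n + 1),
               key=lambda i: estimated_latency(i, costs_nsc,
                                               costs_host, transfer_cost))

# Example with made-up per-layer costs for a five-layer model.
nsc  = [4, 4, 3, 8, 9]      # per-layer cost on the near-storage device
host = [2, 2, 2, 3, 3]      # per-layer cost on the NPU/host CPU
xfer = [9, 7, 5, 2, 1, 0]   # activation-transfer cost at each boundary

print(f"partition at layer index {best_partition(nsc, host, xfer)}")
```

In this toy example, early layers are cheap to keep near storage (avoiding a large activation transfer), so the search tends to place the split after the point where transfer cost stops dominating; the paper's simulator/prototype plays the role of the cost functions here.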
| Original language | English |
|---|---|
| Article number | 9436036 |
| Pages (from-to) | 70-73 |
| Number of pages | 4 |
| Journal | IEEE Computer Architecture Letters |
| Volume | 20 |
| Issue number | 1 |
| DOIs | |
| State | Published - 1 Jan 2021 |
Bibliographical note
Publisher Copyright: © 2002-2011 IEEE.
Keywords
- DNN accelerators
- near-storage computing
- training deep neural networks
- workload partitioning