Citation: 
Cheng Luo, Man-Kit Sit, Hongxiang Fan, Shuanglong Liu, Wayne Luk, Ce Guo. Towards efficient deep neural network training by FPGA-based batch-level parallelism[J]. Journal of Semiconductors, 2020, 41(2): 022403. doi: 10.1088/1674-4926/41/2/022403
C Luo, M K Sit, H X Fan, S L Liu, W Luk, C Guo. Towards efficient deep neural network training by FPGA-based batch-level parallelism[J]. J. Semicond., 2020, 41(2): 022403. doi: 10.1088/1674-4926/41/2/022403.

Towards efficient deep neural network training by FPGA-based batch-level parallelism
doi: 10.1088/1674-4926/41/2/022403
Abstract
Training deep neural networks (DNNs) requires a significant amount of time and resources to obtain acceptable results, which severely limits its deployment on resource-limited platforms. This paper proposes DarkFPGA, a novel customizable framework to efficiently accelerate the entire DNN training on a single FPGA platform. First, we explore batch-level parallelism to enable efficient FPGA-based DNN training. Second, we devise a novel hardware architecture optimised by a batch-oriented data pattern and tiling techniques to effectively exploit parallelism. Moreover, an analytical model is developed to determine the optimal design parameters for the DarkFPGA accelerator with respect to a specific network specification and FPGA resource constraints. Our results show that the accelerator performs about 10 times faster than CPU training while consuming about a third of the energy of GPU training, using 8-bit integers to train VGG-like networks on the CIFAR dataset for the Maxeler MAX5 platform.
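To make the core idea concrete, the following minimal NumPy sketch illustrates batch-level parallelism combined with 8-bit integer arithmetic. It is not the authors' implementation: the function names, the tile width P, and the power-of-two quantization scale are illustrative assumptions. The point it shows is that one weight tile is reused across P samples of the batch, so each weight fetch is amortised over P multiply-accumulates.

```python
import numpy as np

def quantize_int8(x, frac_bits):
    """Map a float tensor to int8 with a power-of-two scale (illustrative scheme)."""
    q = np.round(x * (1 << frac_bits))
    return np.clip(q, -128, 127).astype(np.int8)

def batch_parallel_matmul(w_q, a_q, P=8):
    """Multiply one int8 weight matrix against a batch of int8 activations,
    processing the batch in tiles of P samples (the batch-level parallel
    dimension); accumulation is in int32, as is usual for int8 MACs."""
    B = a_q.shape[0]
    out = np.empty((B, w_q.shape[0]), dtype=np.int32)
    for b0 in range(0, B, P):                   # one batch tile per pass
        tile = a_q[b0:b0 + P].astype(np.int32)  # P samples share the same w_q
        out[b0:b0 + P] = tile @ w_q.T.astype(np.int32)
    return out

# Example: 128 flattened CIFAR-sized activations through one fully-connected layer.
rng = np.random.default_rng(0)
w = quantize_int8(rng.standard_normal((10, 3072)) * 0.05, frac_bits=8)
a = quantize_int8(rng.standard_normal((128, 3072)), frac_bits=5)
y = batch_parallel_matmul(w, a, P=8)            # int32 pre-activations
```

On an FPGA, the inner batch tile would map to P parallel MAC lanes fed from a single on-chip weight buffer; the software loop above only models the data reuse pattern, not the hardware schedule.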
Keywords:
 deep neural network,
 training,
 FPGA,
 batch-level parallelism
