Annual: 2018

PR065 »
A getting started tutorial on FPGA implementation of CNN using OpenCL
📁Machine Learning
👤Zhuxi Li
 (Harbin Institute of Technology)
📅Apr 30, 2018
Regional Final


👀 6330   💬 27

PR065 » A getting started tutorial on FPGA implementation of CNN using OpenCL

Description

More and more applications such as object detection, object tracking, and facial recognition are appearing on embedded systems and mobile devices. The convolutional neural network (CNN) is the key algorithm behind these applications. However, CNNs are computationally and memory intensive, which leads to long runtimes and high power consumption, making them nearly impossible to deploy on such platforms. We are working on a CNN software architecture suited to these applications, based on the Cyclone V SoC-FPGA hardware platform and designed to contain 3 parts:
(1) Compressed and pruned CNN model with less computation and power consumption
(2) CNN accelerator on FPGA based on OpenCL
(3) Matching software and APIs

Demo Video

  • URL: http://v.youku.com/v_show/id_XMzU3NTA1NTQyMA==.html?spm=a2h3j.8428770.3416059.1

  • Project Proposal

    1. High-level Project Description

    Purpose of the design

    In recent years, more and more applications such as object detection, object tracking, and facial recognition have appeared on embedded systems and mobile devices. Typically, these applications require low latency and are limited by power consumption. The convolutional neural network (CNN) is the key algorithm behind these applications. However, CNNs are computationally and memory intensive, which leads to long runtimes and high power consumption, making them difficult to deploy on systems with limited hardware resources and power budgets. We are working on a solution to deploy deep CNNs under such conditions, using an optimized software architecture on a Cyclone V SoC-FPGA hardware platform. The software architecture is designed to contain 3 parts, as listed below.

    1. Compressed and pruned CNN model with less computation and power consumption

    State-of-the-art deep CNN models have hundreds of millions of connections, making them both computationally and memory intensive.

    The runtime of the forward propagation of a CNN model is dominated by the convolutional layers. With the goal of speeding up inference, Molchanov et al. (2016)[1] propose an approach to prune entire feature maps. This approach interleaves greedy criteria-based pruning with fine-tuning by backpropagation. The procedure is computationally efficient and maintains good generalization in the pruned network; most pruned networks achieve at least a 2x speed-up.
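    As a rough illustration of the idea, here is a minimal PyTorch sketch that prunes feature maps by a simple L1-magnitude criterion (standing in for Molchanov et al.'s Taylor-expansion criterion; the function and its parameters are our own, and the fine-tuning between pruning steps is omitted):

    import torch

    def prune_feature_maps(conv, keep_ratio=0.5):
        """Zero out the conv filters with the smallest L1 norms, removing
        entire output feature maps. Illustrative only, not the paper's
        exact criterion."""
        with torch.no_grad():
            scores = conv.weight.abs().sum(dim=(1, 2, 3))  # one score per filter
            n_drop = conv.out_channels - int(conv.out_channels * keep_ratio)
            drop = scores.argsort()[:n_drop]               # lowest-scoring filters
            conv.weight[drop] = 0.0
            if conv.bias is not None:
                conv.bias[drop] = 0.0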

    To address the memory intensity, Han et al. (2015)[2] introduce an approach named “Deep Compression” for intra-kernel pruning. “Deep Compression” is a three-stage pipeline of pruning, trained quantization, and Huffman coding. On the ImageNet dataset, “Deep Compression” successfully compresses AlexNet by 35x, from 240MB to 6.9MB, and VGG-16 by 49x, from 552MB to 11.3MB, with no loss of accuracy, so that the model can fit into on-chip SRAM cache rather than off-chip DRAM for better energy efficiency.
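    Likewise, here is a toy sketch of the trained-quantization stage alone (k-means weight sharing; the pruning, codebook fine-tuning, and Huffman-coding stages of the full pipeline are omitted, and the function is our own):

    import numpy as np
    from sklearn.cluster import KMeans

    def quantize_weights(weights, bits=4):
        """Cluster weights into 2**bits shared values, so each weight is
        stored as a small codebook index instead of a 32-bit float.
        Toy version of Deep Compression's quantization stage."""
        km = KMeans(n_clusters=2 ** bits, n_init=4).fit(weights.reshape(-1, 1))
        codebook = km.cluster_centers_.ravel()             # shared float values
        indices = km.labels_.reshape(weights.shape).astype(np.uint8)
        return codebook, indices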

    We plan to try both approaches to get higher performance and energy efficiency, so that the CNN model can be deployed on embedded systems and mobile applications.

    2.  CNN accelerator on FPGA based on OpenCL

    Conventional embedded processors consume a large amount of energy and are not suited to parallel computing. On the other hand, embedded GPUs are good at computationally intensive tasks but are not energy efficient. The structure of FPGAs allows them to offer a trade-off between these platforms, with both high performance and energy efficiency. A CNN accelerator can improve parallel computing performance by offloading computationally intensive tasks from the CPU to an FPGA. SoC-FPGA platforms integrate both the processor and the FPGA on a single chip.

    In contrast to conventional CPU-FPGA platforms, SoC-FPGA platforms have higher communication bandwidth between the processor and the FPGA, and can therefore offer better performance and lower power consumption.

    There are already some open-source, OpenCL-based CNN accelerators, and we plan to design our CNN accelerator based on them, then optimize it for our compressed and pruned CNN model by implementing techniques such as sparse matrix multiplication.

    3.  Matching software and APIs

    For the portability of our software architecture, the matching software and APIs consist of 3 parts:
    (1) Code to prune, compress, and fine-tune custom CNN models
    (2) A camera driver for the DE10-Nano
    (3) Code to run forward propagation of the CNN model on the DE10-Nano, using the camera as input

     

    The procedures to prune, compress, and fine-tune custom CNN models are so complex that they may take weeks to complete, even on high-performance GPUs. So this part of the code will be implemented on a PC using a popular deep learning library. The code on the DE10-Nano will perform forward propagation on the pruned and compressed model, so that users can deploy a high-performance, energy-efficient CNN on the DE10-Nano.

    Application scope and targeted users

    Embedded systems and mobile applications that need to deploy CNN algorithms, such as self-driving cars or drones.

    Reference

    [1] Molchanov, P., Tyree, S., Karras, T., Aila, T., & Kautz, J. (2016). Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440.

    [2] Han, S., Mao, H., & Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149.

    2. Block Diagram

    (a) FPGA SoC and HPS System Block

    (b) Pruning, quantization, and Huffman coding

    (c) The architecture of the Leading Non-zero Detection Node

    (d) The architecture of the Processing Element

    3. Intel FPGA Virtues in Your Project

    The final work of this project will contain several parts, as listed below:

    1.  Compressed and pruned CNN model with less computation and power consumption

    2.  CNN accelerator on FPGA based on OpenCL

    3.  Matching software and APIs
        (1) Code to prune, compress, and fine-tune custom CNN models
        (2) A camera driver for the DE10-Nano
        (3) Code to run forward propagation of the CNN model on the DE10-Nano, using the camera as input

    Since this is not an easy task and time is limited, we may not finish all of them on time, but we’ll try our best.

    Also, we’ll compare our results with state-of-the-art studies. First, we’ll compare our compressed and pruned CNN model with its original model to reach a desired balance between accuracy loss and compression rate. Second, we’ll test our CNN model on different hardware platforms to show the performance and energy efficiency of our FPGA accelerator. For example, we’ll test the M x V throughput (frames/s) and energy efficiency (frames/J) on the DE10-Nano with and without the FPGA accelerator. According to Han et al. (2016)[3], their ASIC DNN hardware platform EIE achieves 81967 frames/s in M x V throughput and 138927 frames/J in energy efficiency, while the embedded GPU Tegra K1 achieves 173 frames/s and 33.9 frames/J. We hope our result lands somewhere between EIE and the Tegra K1.

    Reference

    [3] Han, S., Liu, X., Mao, H., Pu, J., Pedram, A., Horowitz, M. A., et al. (2016). EIE: Efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News, 44(3), 243-254.

    4. Design Introduction

    We had hoped to apply pruning and model-compression techniques to CNN models so that they could fit on resource-limited FPGAs. However, we are just beginners in this area; we barely found any tutorials and suffered a lot getting started. So far, we have only accomplished a relatively naive implementation with some low-level optimization techniques. Also, there are many brilliant teams working on the same topic, trying to build optimized FPGA implementations of CNNs. So, we decided to present our work as a getting started tutorial on FPGA implementation of CNN using OpenCL, hoping it can help those who want to set foot in this topic and are having a hard time getting started, as we did. We’ll open-source all our code later.

    1. Introduction

    1.1 Convolutional neural network (CNN)

    The CNN has been one of the most popular algorithms in deep learning in recent years. It represents the state-of-the-art in several computer vision tasks, such as object detection, image classification, and image segmentation. CNNs have already reached human level on image classification and do even better at some specific tasks.

    Figure from the syllabus of CS231n, a Stanford open course.

    1.2 Why use FPGAs

    CNNs are extremely computationally expensive. Recent deep CNN models require more than 20 GFLOPs per image, which CPUs can’t easily handle. The common solution is to accelerate the process with a powerful GPU, given its great capacity for parallel computing. The bottleneck of GPU accelerators is their power consumption, which can be a crucial factor for cloud servers and embedded systems.

    On the other hand, due to its parallel architecture, the FPGA is also good at parallel computing, which means it is capable of traditional data-parallel and task-parallel computing. An FPGA can achieve pipeline parallelism by generating a customized circuit and data path that outputs a result every clock cycle. Another significant benefit of FPGAs is their energy consumption: an FPGA can run at the same speed as a GPU while consuming less than 10 percent of the power.

    So, we believe the FPGA can be a very good substitute for the GPU when accelerating CNNs or other computationally expensive tasks where power consumption is a crucial factor to consider.

    1.3 Why use OpenCL

    The OpenCL standard is the first open, royalty-free, unified programming model for accelerating algorithms on heterogeneous systems. OpenCL allows the use of a C-based programming language for developing code across different platforms, such as CPUs, GPUs, and FPGAs; its portable, open, royalty-free nature is a key differentiator versus proprietary programming models. With the Intel FPGA SDK for OpenCL, we can fully leverage the unique capabilities of FPGAs to deliver acceleration performance with power efficiency and low latency.

    2. Selecting a CNN model

    Several CNN models have been commonly used in recent years, such as AlexNet, VGG, GoogLeNet, and ResNet. However, most models are too computationally expensive to deploy on embedded systems. Also, applications on embedded systems often require very low latency, which means deep networks can’t fit on resource-limited FPGAs like the Cyclone V FPGA on the DE10-Nano board. So our task is to find those “efficient” models.

    Canziani et al. (2016) make a very impressive comparison of common CNN models. The computational capability of the DE10-Nano is around 10 GFLOPs, so it can only afford AlexNet-level models. Canziani et al. clearly demonstrate that ENet requires about the same number of operations as AlexNet but achieves higher accuracy. And according to Paszke et al. (2016), ENet can reach 21 frames per second on the Nvidia embedded processor TX1.

    Besides, Iandola et al. (2017) propose a small CNN architecture called SqueezeNet. SqueezeNet achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters, which makes it more feasible to deploy on FPGAs and other hardware with limited memory. SqueezeNet v1.1 from this repo requires 2.4x less computation than the original version, without sacrificing accuracy.

    We believe both ENet and SqueezeNet are great choices to deploy on the DE10-Nano. Our choice is SqueezeNet v1.1, because its OpenCL implementation is easier and its pre-trained model is supported by several deep learning frameworks.

    5. Function Description

    1. Designing and debugging OpenCL kernels

    1.1 CNN model

    The CNN model we are using is a pre-trained SqueezeNet v1.1 model built with PyTorch. So for the first part, all we need to do is extract the parameters from the pre-trained model. We simply store all the parameters in a ‘.h’ file for later use.
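    Here is a minimal sketch of that extraction step, assuming torchvision’s pre-trained model; the header layout and file names are our own choices, not part of any API:

    import torch
    from torchvision import models

    # Dump every parameter tensor into a C header as a flat float array,
    # named after the layer it came from (e.g. features_0_weight).
    model = models.squeezenet1_1(pretrained=True)

    with open("squeezenet_params.h", "w") as f:
        for name, tensor in model.state_dict().items():
            cname = name.replace(".", "_")
            values = tensor.reshape(-1).tolist()
            f.write(f"const float {cname}[{len(values)}] = {{")
            f.write(", ".join(f"{v:.8f}f" for v in values))
            f.write("};\n")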

    1.2 Designing kernels

    The architecture of SqueezeNet v1.1 is shown in the figure above. SqueezeNet v1.1 has only 4 types of layers: 3x3 convolutional layers, 1x1 convolutional layers, 3x3 max-pool layers, and a 13x13 average-pool layer. So our implementation designs one OpenCL kernel for each kind of layer.

    1.2.1 3x3 convolution OpenCL kernel

    Every element in the output feature map of the 3x3 convolutional layer is the dot product of an Nx3x3 block taken from the corresponding area of the input feature maps and an Nx3x3 convolution filter weight matrix, where N is the total number of input feature maps. The output size can be calculated as output_size = (input_size - 3 + 2 x pad) / stride + 1. For each convolution filter, there will be one output_size x output_size feature map.
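    As a quick sanity check of this formula in Python (the layer parameters below assume SqueezeNet v1.1’s first convolution: 3x3 kernel, stride 2, no padding, on a standard 224x224 ImageNet input):

    # Output-size formula from above, with integer division as in the kernel.
    def conv_output_size(input_size, pad, stride, kernel=3):
        return (input_size - kernel + 2 * pad) // stride + 1

    print(conv_output_size(224, pad=0, stride=2))  # 111, i.e. a 111 x 111 map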

    The idea behind our 3x3 convolution OpenCL kernel is relatively simple: each work-item computes exactly one output feature map.

    //3x3 convolution layer
    //validation passed with pyopencl
    __kernel void conv2d3x3(
        const int input_channel, const int input_size,
        const int pad, const int stride,
        const int start_channel,
        const int output_channels,
        __global float *input_im,
        __global const float *filter_weights,
        __global const float *filter_bias,
        __global float *restrict output_im
        )
    {
        int filter_index = get_global_id(0); // 0 - (output_channels - 1)

        filter_weights += filter_index * input_channel * 9;

        float bias = filter_bias[filter_index];

        int output_size = (input_size - 3 + 2 * pad) / stride + 1;

        //start_channel offsets past the 1x1 expand feature maps in a fire layer
        output_im += (start_channel + filter_index) * output_size * output_size;

        for(int i = 0; i < output_size; i++)
        {
            for(int j = 0; j < output_size; j++)
            {
                float tmp = bias;
                for(int k = 0; k < input_channel; k++)
                {
                    for(int l = 0; l < 3; l++)
                    {
                        int h = i * stride + l - pad;
                        for(int m = 0; m < 3; m++)
                        {
                            int w = j * stride + m - pad;
                            if((h >= 0) && (h < input_size) && (w >= 0) && (w < input_size))
                            {
                                tmp += input_im[k * input_size * input_size + h * input_size + w]
                                     * filter_weights[9 * k + 3 * l + m];
                            }
                        }
                    }
                }
                //add ReLU after conv
                *output_im = (tmp > 0.0f) ? tmp : 0.0f;
                output_im++;
            }
        }
    }
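    To show how this kernel is driven from the host side, here is a minimal pyopencl launch sketch; the kernel file name, shapes, and variable names are our own example, not the project’s actual host code. The global work size equals the number of filters, so each work-item produces one output map:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    program = cl.Program(ctx, open("conv.cl").read()).build()  # hypothetical file

    C_IN, SIZE, C_OUT, PAD, STRIDE = 64, 55, 16, 1, 1
    input_im = np.random.rand(C_IN, SIZE, SIZE).astype(np.float32)
    weights = np.random.rand(C_OUT, C_IN, 3, 3).astype(np.float32)
    bias = np.random.rand(C_OUT).astype(np.float32)
    out_size = (SIZE - 3 + 2 * PAD) // STRIDE + 1
    output_im = np.empty((C_OUT, out_size, out_size), dtype=np.float32)

    mf = cl.mem_flags
    d_in = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=input_im)
    d_w = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=weights)
    d_b = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=bias)
    d_out = cl.Buffer(ctx, mf.WRITE_ONLY, output_im.nbytes)

    # one work-item per output channel: global size = (C_OUT,)
    program.conv2d3x3(queue, (C_OUT,), None,
                      np.int32(C_IN), np.int32(SIZE), np.int32(PAD),
                      np.int32(STRIDE), np.int32(0), np.int32(C_OUT),
                      d_in, d_w, d_b, d_out)
    cl.enqueue_copy(queue, output_im, d_out)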

    1.2.2 1x1 convolution OpenCL kernel

    The 1x1 convolution OpenCL kernel is almost the same as the 3x3 convolution kernel; it just replaces the Nx3x3 corresponding area with an Nx1x1 one. Since there is no padding and the stride is 1, the output size is the same as the input size.

    //1x1 convolution layer
    //validation passed with pyopencl
    __kernel void conv2d1x1(
        const int input_channel, const int input_size,
        const int output_channels,
        __global float *input_im,
        __global const float *filter_weights,
        __global const float *filter_bias,
        __global float *restrict output_im)
    {
        int filter_index = get_global_id(0); // 0 - (output_channels - 1)

        filter_weights += filter_index * input_channel;

        float bias = filter_bias[filter_index];

        //each filter writes one complete output feature map
        output_im += filter_index * input_size * input_size;

        for(int i = 0; i < input_size; i++)
        {
            for(int j = 0; j < input_size; j++)
            {
                float tmp = bias;
                for(int k = 0; k < input_channel; k++)
                {
                    tmp += input_im[k * input_size * input_size + i * input_size + j] * filter_weights[k];
                }
                //add ReLU after conv
                *output_im = (tmp > 0.0f) ? tmp : 0.0f;
                output_im++;
            }
        }
    }

    1.2.3 3x3 maxpool OpenCL kernel

    The goal of the max-pool layers is to down-sample the feature maps to reduce computation. For each input feature map, the kernel simply picks the largest activation in every 3x3 window and passes it to the output feature map. Each work-item of the 3x3 maxpool OpenCL kernel computes only one output feature map.

    //maxPool2d kernel_size=3 stride=2 per channel
    //validation passed with pyopencl
    __kernel void maxpool2d(
        const int input_size,
        __global float *input_im,
        __global float *restrict output_im)
    {
        int channels = get_global_id(0);
        int output_size = (input_size - 3) / 2 + 1; //output feature map height & width

        input_im += channels * input_size * input_size;
        output_im += channels * output_size * output_size;

        //find the max value in each 3x3 window
        //(starting from 0.0f is safe: the inputs are post-ReLU, hence non-negative)
        for(int i = 0; i < output_size; i++)//row
        {
            for(int j = 0; j < output_size; j++)//col
            {
                float tmp = 0.0f;
                for(int k = 0; k < 3; k++)//row
                {
                    for(int l = 0; l < 3; l++)//col
                    {
                        float value = input_im[(i * 2 + k) * input_size + j * 2 + l];
                        if(value > tmp)
                            tmp = value;
                    }
                }
                *output_im = tmp;
                output_im++;
            }
        }
    }

    1.2.4 13x13 average pool OpenCL kernel

    SqueezeNet v1.1 uses an average-pool layer as the classifier. The input of this layer is a 1000 x 13 x 13 tensor. Since there are 1000 classes in the ImageNet dataset, each class score can be computed as the mean of one 13 x 13 feature map (13 x 13 = 169 values). Each work-item of our 13x13 average-pool OpenCL kernel computes a single class score.

    //the last layer uses a 13 x 13 avgPool layer as the classifier
    __kernel void avgpool2d(
        __global float *input_im,
        __global float *restrict output_im)
    {
        int class_index = get_global_id(0);

        input_im += 169 * class_index; //169 = 13 x 13, one feature map per class

        float tmp = 0.0f;

        for(int i = 0; i < 169; i++)
        {
            tmp += input_im[i];
        }

        output_im[class_index] = tmp / 169.0f;
    }

    1.3 Debugging kernels

    We highly recommend using the Python OpenCL host API pyopencl to finish the host program first, and then translating it to C/C++ to compile with the Intel SoC EDS. Here are our reasons:

    1. Python host programs are much easier to write than C/C++ ones.
    2. Most deep learning frameworks are written in Python or have a Python API, so we can easily extract the CNN model parameters and the output of each layer to check whether the kernel implementations are correct (see the validation sketch after the next paragraph).

    Once we are 100 percent sure that the kernels are correct, we can compile them with the Intel FPGA SDK for OpenCL to see whether there are enough resources, and then optimize the kernels’ performance.
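    Here is a sketch of that layer-by-layer check; run_kernel is a caller-supplied helper (e.g. wrapping the pyopencl launch shown earlier), and its name and signature are our own convention rather than a fixed API:

    import numpy as np
    import torch

    def validate_layer(pt_layer, run_kernel, in_shape=(64, 55, 55), atol=1e-4):
        """Compare a PyTorch layer's output against the OpenCL kernel's."""
        x = torch.randn(1, *in_shape)
        # our conv kernels fuse ReLU, so apply it to the reference as well
        reference = torch.relu(pt_layer(x)).detach().numpy()[0]
        result = run_kernel(np.ascontiguousarray(x.numpy()[0]))
        return np.allclose(reference, result, atol=atol)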

    2. Designing the host program

    The host program is based on Terasic’s OpenCL vector-add example in the DE10-Nano OpenCL BSP, and is basically a translation of the pyopencl version described in the last chapter. So, if you are familiar with OpenCL, this part shouldn’t be too hard.

     

    6. Performance Parameters

    1. Optimizing kernels

    The Intel FPGA SDK for OpenCL provides many useful tools to help optimize kernels. There are also advanced techniques, such as the Intel channel extension, which can really speed up kernels; the well-designed PipeCNN from another team achieves around a 10x speed-up over our implementation by using the Intel channel extension. Due to the time limit, we were only able to explore a few low-level optimization techniques.

    The naive implementation described in ch. 3 uses 66% of the FPGA resources; classification takes around 4.5 seconds per image.

    We then tried a SIMD implementation of the avgpool kernel instead of an NDRange kernel. FPGA resource consumption dropped slightly, but the runtime increased.

    After several attempts, the final version of our implementation uses #pragma unroll to parallelize the inner loop of the conv1x1 kernel, since it is the most frequently used. Additionally, we use the --fp-relaxed flag when compiling. According to Intel, this flag enables floating-point additions to be computed more efficiently in hardware, using a balanced tree structure instead of a vine. This version gets a 2x speed-up over the naive version. It consumes 96% of the FPGA resources and takes about 2.2 seconds per image for classification.

    7. Design Architecture

    1. System diagram

    There has been a lot of research on using large-scale FPGAs like the Arria 10 to completely replace GPUs in PCs or workstations and to accelerate both the backward pass (training) and the forward pass (inference) of CNNs.

    The FPGA provided by the contest organizers is a Cyclone V on the DE10-Nano board, sponsored by Terasic and Intel, and it is not powerful enough to accelerate the whole training process. Nowadays, we see more and more CNN applications on embedded systems, such as face recognition on cell phones and object detection on drones or robots. So we focus on accelerating only the forward pass of CNNs on embedded systems, where resources and power consumption are limited. Here the Cyclone V SoC-FPGA can be a perfect solution: its ARM processor serves as a traditional controller and the FPGA as a low-power accelerator.

    So the whole picture of an FPGA implementation of CNN using OpenCL looks like the figure shown below.

    Here is the software flow of the FPGA implementation of CNN using OpenCL.

    Here is our network structure.



    27 Comments

    Wang zhenwu
    Amazing, Zhuxi!!
    🕒 Jan 31, 2018 05:41 PM
    Weiran Zheng
    A ZA A ZA!
    🕒 Jan 31, 2018 04:30 PM
    jenny
    Best wishes!
    🕒 Jan 31, 2018 03:43 PM
    zhongjie
    Very impressive!
    🕒 Jan 31, 2018 03:36 PM
    Liu Hexin
    Good job!
    🕒 Jan 31, 2018 03:29 PM
    谭雪迎
    good job!
    🕒 Jan 31, 2018 03:16 PM
    hyguan
    nice work!
    🕒 Jan 31, 2018 02:52 PM
    Jiang Xu
    I don't understand it, but it seems impressive!
    🕒 Jan 31, 2018 02:43 PM
    Liguoyu
    Keep it up!
    🕒 Jan 31, 2018 02:37 PM
    Lily Hu
    Best wishes!
    🕒 Jan 31, 2018 02:12 PM
    liushiwei
    Keep it up, keep it up!
    🕒 Jan 31, 2018 02:07 PM
    cgx
    Way to go, Zhuxi!
    🕒 Jan 31, 2018 01:40 PM
    gxt
    Way to go, Zhuxi!
    🕒 Jan 31, 2018 01:25 PM
    Xiaorong Zhou
    You can be even better!
    🕒 Jan 31, 2018 11:58 AM
    钱光照
    Wow! It's a fantastic masterpiece, although I don't know what it is... Fighting~
    🕒 Jan 31, 2018 11:11 AM
    Chen
    It's a good idea
    🕒 Jan 31, 2018 10:54 AM
    sujiejie
    Our amazing class monitor!
    🕒 Jan 31, 2018 10:36 AM
    Yuechene Tao
    Keep it up, baby~
    🕒 Jan 31, 2018 10:34 AM
    冯帅
    well done
    🕒 Jan 31, 2018 10:06 AM
    yaqiangzhang
    Best wishes! Keep it up, Zhuxi~
    🕒 Jan 29, 2018 10:58 PM
    Wang QQ
    Fighting!
    🕒 Jan 29, 2018 06:57 PM
    Liu Hexin
    Wang QQ hhhhhhhhhhh
    🕒 Jan 31, 2018 03:30 PM
    Jun Fu
    First of all, what matters most is participation and practice. No new design can ever be guaranteed to succeed! I believe you will benefit a great deal from it.
    🕒 Jan 26, 2018 10:27 AM
    Doreen Liu
    The proposal is described in detail; looking forward to seeing the finished work.
    🕒 Jan 17, 2018 05:56 PM
    PR065🗸
    Thank you! We will prepare carefully~
    🕒 Jan 22, 2018 04:22 PM
    berkay egerci
    appreciate what you did
    🕒 Jan 13, 2018 04:19 AM
    PR065🗸
    Thank you
    🕒 Jan 22, 2018 04:22 PM