AP085 »
Embedded Neural Coprocessor
📁Machine Learning
👤Pradeep Kathirgamaraja
 (ParaQum Technologies)
📅Jul 07, 2018
Regional Final




Description

The Embedded Neural Coprocessor is a next-generation embedded processor with the capability to execute machine learning functions efficiently. Our coprocessor natively supports small-scale convolutional neural network (CNN) computation using re-configurable layers implemented in the programmable logic of the Cyclone V. First, the user designs the neural network using the provided API and trains it offshore (on servers), then feeds the trained model to our coprocessor. The coprocessor then executes the feed-forward computation and generates the output used by the target application, e.g. real-time object recognition, face recognition, or voice command recognition. We have separate dedicated hardware logic to emulate each type of computationally intensive layer, such as the convolution/Fire layer and the max-pool layer. For example, if a network has a convolution layer, dedicated hardware performs that convolution. Layer parameters (dimensions, inputs, etc.) are configured dynamically according to the requirement; similarly, we have a configurable max-pool hardware module. We support the most commonly used layer operations, and if the user requires a different operation, it can be supported by performing the computation in the ARM processor. Layers are connected through memory: each layer's output is stored in memory and read by the next layer.
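
As a rough illustration only (this is not our actual API), a user-side description of a small network could look like the C sketch below; the struct fields and the toy layer values are made-up placeholders that merely show the kind of per-layer parameters (type, dimensions, kernel size, weight location) the coprocessor configures dynamically.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical layer types handled natively by the coprocessor. */
    typedef enum { LAYER_CONV, LAYER_FIRE, LAYER_MAXPOOL } layer_type_t;

    /* Hypothetical per-layer descriptor: the dynamically configurable
     * parameters (dimensions, kernel size, weight location) named above. */
    typedef struct {
        layer_type_t type;
        uint16_t in_w, in_h, in_ch;   /* input feature-map dimensions        */
        uint16_t out_ch;              /* number of kernels / output channels */
        uint8_t  kernel, stride;      /* kernel size and stride              */
        uint32_t weight_addr;         /* DDR3 offset of the trained weights  */
    } layer_desc_t;

    /* A toy three-layer network the host would hand to the coprocessor. */
    static const layer_desc_t toy_net[] = {
        { LAYER_CONV,    224, 224,  3,  64, 3, 2, 0x00000000u },
        { LAYER_MAXPOOL, 111, 111, 64,  64, 3, 2, 0x00000000u },
        { LAYER_FIRE,     55,  55, 64, 128, 1, 1, 0x00080000u },
    };

    int main(void)
    {
        for (unsigned i = 0; i < sizeof(toy_net) / sizeof(toy_net[0]); i++)
            printf("layer %u: type %d, %dx%dx%d -> %d channels\n", i,
                   (int)toy_net[i].type, toy_net[i].in_w, toy_net[i].in_h,
                   toy_net[i].in_ch, toy_net[i].out_ch);
        return 0;
    }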

Our coprocessor can achieve great speed compared to a typical sequential embedded processor since we use hardware-level parallelization. It can be used in many real-time applications, and it will be cost effective compared to a GPU or other dedicated hardware-level acceleration. We use separate devices for training (servers) and for running the application (the coprocessor), because training is a very time-consuming process that needs heavy computing power, while inference can run on the edge device itself. Ultimately, we are able to run a small-scale neural network on an embedded device with far greater speed.

Demo Video

  • URL: https://www.youtube.com/watch?v=DHRNuBW7b_Q

  • Project Proposal

    1. High-level Project Description

    Introduction

    Machine learning is the game changer in computer vision and other similar applications. Without any doubt, next-generation embedded processors should support machine learning functions efficiently. We propose an embedded processor architecture that natively supports small convolutional neural networks (CNNs) using re-configurable layers implemented in the programmable logic of the Cyclone V. Our coprocessor can achieve great speed compared to a typical sequential embedded processor since we use hardware-level parallelization. It will be cost effective and power efficient compared to a GPU or other dedicated hardware-level acceleration.

    Applications

    This coprocessor gives embedded engineers an easy way to apply machine learning algorithms to signal processing (audio, image, video and sensor data). It can be used in any embedded system which requires small-scale neural networks. It can serve as a vision processor for cameras to perform high-level operations such as real-time object recognition, facial recognition, intelligent remote monitoring, surveillance, activity monitoring, and robotic applications that use cameras. It can also be used in audio processing applications such as voice command recognition. Ultimately this opens several paths for new ideas which use the power of neural networks in embedded systems.

    Advantages

    The primary advantage of this architecture is speed: it will be faster than other embedded processors. GPUs are not feasible for small embedded applications, and our design will be more power efficient since we can switch off the blocks which are not in use; low power usage is a key requirement in embedded applications. In our approach, we load the network model and run it on the coprocessor, so the computationally intensive training process can be done on servers. This approach does not require a cloud service, which enforces privacy and security and does not demand high network bandwidth.

    Methodology

    • Creating APIs
      • Identify complete API requirement for designing neural network
      • Implement APIs
      • Or provide interfaces to popular machine learning frameworks
    • Designing and implementing the interface between the computer and the ARM processor, which is used to load the model onto the board
    • Designing the detailed architecture
      • For each layer types
      • Interface between ARM and logic
      • Memory access
      • Input and output
      • Interrupts
    • Implementing modules and interfaces in C and testing.
    • Implementing in HDL with reference to C and unit testing.
    • Integration and System Testing
    • Testing a sample application (SqueezeNet)

    Hardware requirement

    We are going to implement our coprocessor on the DE10-Nano board. The primary reason for selecting this device is that it has a powerful ARM processor along with programmable logic and DDR3 memory. In our architecture, there must be a processor which schedules the feed-forward computations, communicates with the external computer, and also performs the other regular tasks of the embedded system; the ARM Cortex-A9 on this board is the best match for this purpose. The board has 110K LEs, which is well enough to implement each type of layer. During operation, each layer reads its input from and stores its output in memory, so DDR3 memory is the suitable option. Beyond these, the DE10-Nano board has numerous input and output interfaces (e.g. camera, audio source, video in/out, HDMI in/out, network, GPIO, SD card, USB) which accommodate different application requirements.

    2. Block Diagram

    Block diagram

    There are three communication bridges between the FPGA and the HPS.

    • HPS-to-FPGA AXI Bridge: The ARM processor sends the configuration parameters and a start signal to the currently executing layer. The instruction decoder in the FPGA receives the configuration and forwards it to the particular layer.
    • FPGA-to-HPS Interrupt Bridge: The FPGA raises an interrupt when a particular layer has completed its task. The interrupt controller in the FPGA controls the interrupt and updates the status in the interrupt register space.
    • FPGA-to-HPS AXI Bridge: This AXI bridge is used by every layer (Conv, Maxpool, Fire) to access the DDR3 memory. The layers read their input data from memory and write the processed output data back to memory.

    Each layer gets the addresses of its input and output data and its configuration parameters from the instruction decoder, and each layer signals the interrupt controller to inform the processor that its task has been completed.

    Functional diagram

    There are four main functions in the complete system:

    1. Training is done offshore, on a server or computer, using our software APIs. If required, it can be extended to use available machine learning frameworks such as TensorFlow, Azure ML, etc.
    2. Exporting the model and loading it onto the DE10-Nano board.
    3. Loading the model from DDR3. The ARM processor initializes the layer parameters (layer dimensions, kernel dimensions, address of the weight matrix, etc.), schedules the feed-forward computation, and initiates the layer computation. According to these parameters, the layer is configured dynamically. The input for the layer is fed from DDR3, the layer performs its operations in parallel, and the output is written back to DDR3. The layer then sends a done signal to the processor. The processor schedules the next layer in the network, and this process repeats until the data reaches the output layer.
    4. The final output is sent to the processor, where it is used by the application running on the processor (e.g. object recognition).

    Pseudo code

    ARM processor program

    Main thread

    • Reset layers
    • For each layer in network
      • calculate address parameters
      • calculate input and output addresses
    • Initialize scheduler’s state machine
    • Initialize first layer and start first layer
    • The processor can perform other operations until an interrupt is received from a layer.

    Interrupt routine

    • Upon receiving an interrupt, check the current state of the scheduler's state machine.
    • Identify the next state and initialize the next layer, or, if it is the final layer, pass the final output and exit.
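
    A minimal C sketch of this control flow is shown below; it is only an illustration of the scheduling idea. The register writes are replaced by a print stub, the addresses are made-up placeholders, and the hardware interrupt is simulated by calling the routine directly, so this is not the actual HPS code of the project.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        #define NUM_LAYERS 10          /* e.g. a SqueezeNet-like network        */

        /* Hypothetical per-layer address parameters computed by the main thread. */
        typedef struct {
            uint32_t cfg_word;         /* packed layer/kernel dimensions         */
            uint32_t in_addr;          /* DDR3 address of the layer input        */
            uint32_t out_addr;         /* DDR3 address where the output goes     */
        } layer_params_t;

        static layer_params_t layers[NUM_LAYERS];
        static volatile int  current_layer;   /* scheduler state machine state   */
        static volatile bool inference_done;

        /* Stub for configuring and starting a layer block over the HPS-to-FPGA
         * bridge; in the real design this is a set of register writes.          */
        static void fpga_configure_and_start(const layer_params_t *p)
        {
            printf("start layer: in=0x%08x out=0x%08x\n", p->in_addr, p->out_addr);
        }

        /* Interrupt routine: runs when the active layer raises its done signal. */
        static void layer_done_isr(void)
        {
            if (++current_layer < NUM_LAYERS)
                fpga_configure_and_start(&layers[current_layer]); /* next state  */
            else
                inference_done = true;    /* final layer: output ready in DDR3   */
        }

        int main(void)
        {
            /* Main thread: compute address parameters for every layer (ping-pong
             * buffers in DDR3), start the first layer, then stay free for other
             * work until the interrupt routine reports completion.              */
            for (int i = 0; i < NUM_LAYERS; i++) {
                layers[i].cfg_word = 0;                            /* placeholder */
                layers[i].in_addr  = (i % 2) ? 0x20000000u : 0x20300000u;
                layers[i].out_addr = (i % 2) ? 0x20300000u : 0x20000000u;
            }
            current_layer = 0;
            fpga_configure_and_start(&layers[0]);

            while (!inference_done)
                layer_done_isr();       /* stands in for the hardware interrupt  */

            printf("final output at 0x%08x\n", layers[NUM_LAYERS - 1].out_addr);
            return 0;
        }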

    3. Intel FPGA Virtues in Your Project

    Feasibility of using DE10-nano Board

    • The board provides DDR3 bandwidth of around 25 Gbps, which can handle the input and output memory accesses of each layer. The memory size of 1 GB is well enough to store the input, output and intermediate data.

    • MAC (multiply-accumulate) operations are used heavily in the computations. The Cyclone V has 224 multipliers which can run at 50 MHz or more, so it can handle 224 × 50 = 11,200 million MAC operations per second. By contrast, an embedded ARM processor (two cores, 800 MHz, dispatch interval of 2 for the MAC instruction) can achieve at most 2 × 800 × ½ = 800 million MAC operations per second. Note that SqueezeNet requires about 860 million MAC operations for one run (a small sketch of this comparison follows this list).

    • The 110K LEs can be utilised for parallel comparators in the max-pool layer and for ReLU operations.

    • The 112 variable-precision DSP blocks provide floating-point operations; their hardened accumulators are useful in the convolutional layer.

    • The 5 Mbit of block RAM can be used as buffers, which allows data to be copied from DDR3 in burst mode.

    • The dual-core ARM Cortex-A9 processor on the board can be used as the host processor, which schedules and controls the overall application.

    • Rich input and output interfaces, so the board can serve different applications, e.g. video/images, audio, sensor data, etc.
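
    The back-of-the-envelope comparison above can be checked with a few lines of C. The constants (224 multipliers at 50 MHz, a dual-core 800 MHz ARM with a MAC dispatch interval of 2, roughly 860 million MACs per SqueezeNet pass) come straight from the bullets above; the resulting frame rates are rough upper bounds that ignore memory stalls and control overhead.

        #include <stdio.h>

        int main(void)
        {
            /* Figures taken from the feasibility bullets above. */
            const double fpga_multipliers = 224;    /* Cyclone V 18x18 multipliers      */
            const double fpga_clock_hz    = 50e6;   /* conservative fabric clock        */
            const double arm_cores        = 2;      /* Cortex-A9 cores                  */
            const double arm_clock_hz     = 800e6;
            const double arm_macs_per_cyc = 0.5;    /* dispatch interval of 2 per MAC   */
            const double squeezenet_macs  = 860e6;  /* MACs per SqueezeNet forward pass */

            double fpga_macs = fpga_multipliers * fpga_clock_hz;
            double arm_macs  = arm_cores * arm_clock_hz * arm_macs_per_cyc;

            printf("FPGA peak: %.0f million MAC/s -> %.1f SqueezeNet passes/s\n",
                   fpga_macs / 1e6, fpga_macs / squeezenet_macs);
            printf("ARM  peak: %.0f million MAC/s -> %.1f SqueezeNet passes/s\n",
                   arm_macs / 1e6, arm_macs / squeezenet_macs);
            return 0;
        }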

    Case study : SqueezeNet on our coprocessor

    SqueezeNet is a compressed version (0.5 MB) of AlexNet. It has 50 times fewer parameters while achieving the same accuracy, so SqueezeNet is the best candidate to test on our coprocessor.

    • Input layer - a 224x224x3-byte image
    • Intermediate layers - 2 convolution layers, 8 Fire modules and max-pool filtering layers. Intermediate outputs are 1-3 MB in size and are stored in DDR3, so the bandwidth required to process 60 images/second is about 5 Gbps, while the theoretical maximum bandwidth is 25 Gbps.
    • The deep-compression version uses 8- or 6-bit data types, which can be implemented efficiently on the Cyclone V.

    Reference

    • SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size - Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, Kurt Keutzer

     

    4. Design Introduction

    Purpose of Design

    Convolutional neural networks (CNNs) have become mainstream in computer vision, video analysis, and natural language processing research. Accuracy in vision-related tasks such as object recognition, object detection and segmentation has increased to near human-level performance, at the cost of increased computation. The resource and power requirements of inference are very high, which cannot be afforded in an embedded environment. Another option is cloud-based CNN inference, but it is also unsuitable due to network delay and privacy issues. In most embedded use cases, resources, power, and latency are the main decision factors rather than accuracy, yet there are no embedded processors or microcontrollers which natively support CNN inference. So we propose a configurable architecture which can run CNN inference in an FPGA within limited resources (8-bit float computations) and timing constraints. We have chosen SqueezeNet to showcase our architecture, but the architecture can be configured to implement any SqueezeNet-like network without any modification to the design; with a little modification, it can implement any small-scale CNN architecture. The key features of our design are low resource usage, low power usage, portability, low cost, independence from cloud-based systems, and data privacy.

    Application scope

    Our design has a very broad application scope, from computer vision to natural language processing. Any application that uses a SqueezeNet-like model can use our architecture by supplying the appropriate configuration; other types of CNN architectures can be supported with minimal modification. Our design primarily focuses on computer vision related tasks, especially object recognition. Object recognition is a vital task in many fields; here are a few of them. In self-driving cars or automotive systems, our design can be used for pedestrian detection, lane detection, and traffic signal detection. In robots or drones, it can be used to detect objects and avoid obstacles. It can be used in portable medical scanning devices to analyze medical images. It can be used with portable surveillance cameras to detect anomalies, so the cameras become more intelligent and need not depend on the cloud, which ensures privacy. Face attribute detection, handwritten text detection, pose estimation, and artistic style transfer are a few more computer vision applications of our design. As mentioned earlier, other CNN-related applications can be built on our system with additional effort.

    Targeted users

    Our design is a general-purpose one, so it is not aimed at end users; it can be used by embedded engineers, hardware engineers, and electronics enthusiasts. Embedded application engineers can use our system as a vision processing unit to perform object detection, segmentation, etc., and can directly run inference for SqueezeNet-like models. Hardware engineers can take our design and extend it to support other CNN architectures according to their requirements. Ultimately, our design helps offload the processor so it can perform other tasks.

    Intel FPGA virtues in our design

    CNN inference needs huge computational resources and can be executed in parallel. Our design exploits both inter-layer and intra-layer parallelism of the CNN architecture. We use the FPGA fabric of the Cyclone V to implement parallel computations with efficient pipelines. In our design, the HPS initializes the architecture configuration, the model parameters, and the input data; after that the FPGA performs all computations and, at the end, sends an interrupt to the HPS, while in the meantime the HPS can perform other tasks. This kind of CPU offloading is mandatory in an embedded system, since the CPU has to control the overall system and cannot be tied up by a single inference task.

    In our design, we use a custom 8-bit float operation to reduce resource usage. Using the FPGA we are able to implement this with very low resource usage, whereas on a CPU or GPU it cannot be implemented efficiently.
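
    To illustrate the idea, the sketch below packs a standard 32-bit float into a hypothetical 1-4-3 minifloat (1 sign, 4 exponent, 3 mantissa bits, bias 7) and back. The exact bit allocation, bias and rounding of our 8-bit format are not given here, so this layout is only an assumed example of how such a conversion works.

        #include <math.h>
        #include <stdint.h>
        #include <stdio.h>

        /* Hypothetical 1-4-3 minifloat: 1 sign, 4 exponent (bias 7), 3 mantissa bits.
         * The real 8-bit format may differ; this only demonstrates the principle.   */
        #define MAN_BITS 3
        #define EXP_BIAS 7

        static uint8_t f32_to_f8(float x)
        {
            int sign = (x < 0.0f);
            float a = fabsf(x);
            if (a == 0.0f) return (uint8_t)(sign << 7);
            int e;
            float m = frexpf(a, &e);                    /* a = m * 2^e, m in [0.5, 1)   */
            int exp = e - 1 + EXP_BIAS;                 /* biased exponent of 1.f * 2^E */
            int man = (int)lroundf((2.0f * m - 1.0f) * (1 << MAN_BITS));
            if (man == (1 << MAN_BITS)) { man = 0; exp++; }   /* rounding carried over  */
            if (exp <= 0)  return (uint8_t)(sign << 7);       /* underflow -> zero      */
            if (exp > 15) { exp = 15; man = (1 << MAN_BITS) - 1; }  /* clamp to maximum */
            return (uint8_t)((sign << 7) | (exp << MAN_BITS) | man);
        }

        static float f8_to_f32(uint8_t v)
        {
            int sign = (v >> 7) & 1;
            int exp  = (v >> MAN_BITS) & 0xF;
            int man  = v & ((1 << MAN_BITS) - 1);
            if (exp == 0) return 0.0f;                  /* treat as zero (no subnormals) */
            float a = ldexpf(1.0f + (float)man / (1 << MAN_BITS), exp - EXP_BIAS);
            return sign ? -a : a;
        }

        int main(void)
        {
            const float tests[] = { 0.0f, 0.15f, -1.0f, 3.7f, 12.5f };
            for (int i = 0; i < 5; i++) {
                uint8_t v = f32_to_f8(tests[i]);
                printf("%8.3f -> 0x%02x -> %8.3f\n", tests[i], (unsigned)v, f8_to_f32(v));
            }
            return 0;
        }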

    Our design is scalable: the user can change the size of the model by configuring the number of layers, the size of the input, and the locations of the pooling layers. The FPGA provides the reconfigurability needed to achieve this, which other hardware-level accelerators cannot offer. Our design is also expandable: on a large FPGA we can increase the number of blocks executing concurrently, while on a small FPGA we can perform the task with a single instance of the computation block. This modularity is another key advantage achieved using an FPGA. As described earlier, with little modification our design can support other CNN architectures; this rapid hardware-level modification is only possible in an FPGA system, by simply reprogramming it with the necessary changes.

    Our design has a broad application scope, so there can be a variety of input and output requirements. The FPGA helps to expand the I/O easily as required. For example, our design can be used in a portable scanning machine, where the scan sensor and LCD display can be attached directly to the FPGA.

    5. Function Description

    Functional Flow diagram

    Preparing model

    First, we choose the model required for the dataset: the number of layers, number of kernels, input image size, and pooling layers need to be decided. Then we model the network in TensorFlow, additionally using our methods to support the custom 8-bit float numbers. If pre-trained weights are available for the model, we can use them to speed up training. After training, we store the weights using our Python script in the particular format required by our design.

    Initializing model

    Transfer the model parameters to the Linux system running on the HPS through the network port. Run the C executable, which loads the model weights into DDR and passes the address to the FPGA subsystem. The HPS then sends the model configuration parameters, such as input size, number of layers, and number of kernels. All communication happens through the AXI interconnect between the HPS and FPGA systems. Model initialization is now complete and we can run inference.
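
    A minimal sketch of what this initialization executable could look like on the HPS Linux side is shown below. The lightweight HPS-to-FPGA bridge of the Cyclone V SoC appears at physical address 0xFF200000; the register offsets, configuration word layout, weight address and layer count used here are assumed example values, not our actual register map.

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define LWBRIDGE_BASE 0xFF200000u  /* Cyclone V lightweight HPS-to-FPGA bridge */
        #define LWBRIDGE_SPAN 0x00001000u

        /* Assumed register offsets of the configuration controller (illustrative). */
        #define REG_WEIGHT_ADDR 0x00       /* DDR3 address of the loaded weights   */
        #define REG_NUM_LAYERS  0x04       /* number of layers in the model        */
        #define REG_INPUT_SIZE  0x08       /* input image width/height, e.g. 224   */
        #define REG_START       0x0C       /* writing 1 latches config and starts  */

        int main(void)
        {
            int fd = open("/dev/mem", O_RDWR | O_SYNC);
            if (fd < 0) { perror("open /dev/mem"); return 1; }

            volatile uint32_t *lw = mmap(NULL, LWBRIDGE_SPAN, PROT_READ | PROT_WRITE,
                                         MAP_SHARED, fd, LWBRIDGE_BASE);
            if (lw == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

            /* Weights are assumed to have been copied into DDR3 beforehand at this
             * physical address; the FPGA masters then read them from there.        */
            lw[REG_WEIGHT_ADDR / 4] = 0x21000000u;
            lw[REG_NUM_LAYERS  / 4] = 18;     /* assumed layer count for the model  */
            lw[REG_INPUT_SIZE  / 4] = 224;
            lw[REG_START       / 4] = 1;      /* latch configuration and start      */

            printf("model configuration written to the FPGA\n");
            munmap((void *)lw, LWBRIDGE_SPAN);
            close(fd);
            return 0;
        }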

    Running Inference

    Another C executable loads the input image into DDR from the SD card (a network, camera or any other image source could also be used) in raw pixel format, sends the address to the FPGA, and then becomes idle or performs other tasks. A state machine in the FPGA controls the overall inference computation. It configures the computation blocks and handles fetching of the weights, biases and input values from memory required for that state. When a computation block finishes, the intermediate output is written to DDR. In the next state, the block is configured with other parameters and the intermediate data from the previous layer is fed in as the input. This iteration continues until the final layer of the inference is reached. Upon finishing the last layer, the output is written to DDR and the state machine fires an interrupt to the HPS and passes the address. The HPS can read the result and display it.
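
    For the inference run, the completion interrupt can be waited on rather than polled. The sketch below assumes the FPGA interrupt is exposed to Linux as a UIO device (/dev/uio0) and that the 1000 class scores are left as non-negative 8-bit values at a known DDR3 address; the device name, buffer address and score format are illustrative assumptions, not the project's actual interface.

        #include <fcntl.h>
        #include <stdint.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define RESULT_PHYS 0x22000000u    /* assumed DDR3 address of the scores */
        #define NUM_CLASSES 1000           /* SqueezeNet / ImageNet classes      */

        int main(void)
        {
            /* Block until the FPGA state machine raises the "inference done" IRQ;
             * with UIO, read() returns the interrupt count seen so far.           */
            int uio = open("/dev/uio0", O_RDONLY);
            if (uio < 0) { perror("open /dev/uio0"); return 1; }
            uint32_t irq_count = 0;
            if (read(uio, &irq_count, sizeof(irq_count)) != sizeof(irq_count)) {
                perror("read uio"); return 1;
            }

            /* Map the result buffer and pick the class with the highest score.
             * With a non-negative sign-exponent-mantissa code, a larger raw byte
             * means a larger value, so a plain byte compare is enough for argmax. */
            int mem = open("/dev/mem", O_RDONLY | O_SYNC);
            if (mem < 0) { perror("open /dev/mem"); return 1; }
            const uint8_t *scores = mmap(NULL, NUM_CLASSES, PROT_READ, MAP_SHARED,
                                         mem, RESULT_PHYS);
            if (scores == MAP_FAILED) { perror("mmap"); return 1; }

            int best = 0;
            for (int i = 1; i < NUM_CLASSES; i++)
                if (scores[i] > scores[best]) best = i;

            printf("interrupt #%u: predicted class %d (raw score 0x%02x)\n",
                   irq_count, best, (unsigned)scores[best]);
            return 0;
        }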

    6. Performance Parameters

    In the layered architecture, the expand (3x3 & 1x1) layer has 4 parallel convolution blocks and the squeeze layer has 16 (8 + 8) parallel convolution blocks, so the performance of the expand and squeeze parts is in a 1:2 ratio. The computation time of the layered architecture for a single layer is described in the following example cases.

    Based on the above table, the expand layer dominates the time in case I and the squeeze layer in case III, while both layers consume equal time in case II. As the input layer data and kernel weight values are buffered internally, the processing modules do not go idle due to DDR3 latency. This was verified on the board, and we measured almost the same processing time. The time taken by SqueezeNet v1.1 (mapped to our processing modules) for 1000 classes is shown in the following table.

    Memory usage for the SqueezeNet v1.1 configuration is listed in the following table.

    Notes :- () - repeat

    Performance comparison for the SqueezeNet v1.1 model

    7. Design Architecture

    HPS sub system

    The overall architecture of our design is shown in Figure 1. Initially the HPS loads the network parameters (weights and biases) and the raw images into DDR3 through the SDRAM controller. The HPS then sends the address locations of the loaded data and the configuration information for the computation logic through the lightweight HPS-to-FPGA bridge. This configuration information contains details about the whole model, such as the number of layers, layer sizes, number of kernels, and whether the pool layer is enabled.

    FPGA sub system

    The FPGA contains the computation block and the controller submodules, which are used to configure and feed data to the computation block.

    • CONFIGURATION CONTROLLER: This block acts as the slave to the HPS, collects the configuration information of the whole model, and stores that information in a block RAM. It then extracts the configuration required for the current execution and sends it to all other blocks in the FPGA together with a START signal. This START signal acts as a reset for all the logic in the FPGA and starts the computations.
    • AXI SDRAM MASTER CONTROLLER: There are two masters in this block. Master 1 loads the appropriate parameters from DDR3 into the KERNEL CONTROLLER and BIAS CONTROLLER blocks (yellow blocks in Fig 1). Master 2 loads data from DDR3 for the INPUT LAYER CONTROLLER and writes the output data of the OUTPUT LAYER CONTROLLER (brown blocks in Fig 1) back to DDR3.

    • INPUT LAYER CONTROLLER: This block loads the required portion of the input layer into block RAM and feeds this data to the LAYER CONVOLUTION (EXPAND PART-1) block in a specific order. Fig 2 shows the data feeding arrangement for the 3x3 expand.
    • EXPAND 3X3 KERNEL CONTROLLER / EXPAND 1X1 KERNEL CONTROLLER: These blocks load the kernels into block RAM and send 4 kernels at a time to the LAYER CONVOLUTION (EXPAND PART-1) block in a specific order corresponding to the INPUT LAYER CONTROLLER, as shown in Fig 2. Fig 3 shows the flowchart of the kernel feeding order; the expand kernels move from kernel 1 to m for each depth (layer).
    • LAYER CONVOLUTION (EXPAND PART-1): This block performs the convolution operation. It sends request signals to the INPUT LAYER CONTROLLER, EXPAND 3X3 KERNEL CONTROLLER and EXPAND 1X1 KERNEL CONTROLLER blocks.

    It then collects the required data from these blocks and performs the convolution. This block executes 4 convolutions simultaneously for the expand 3x3 layers and 4 for the expand 1x1 layers. Finally, it sends the outputs to the FIFOs. In each clock cycle, this block generates 4+4 outputs (3x3 & 1x1) with a 25-clock-cycle latency. The output data flow is described in Fig 3.

    • EXPAND BIAS CONTROLLER: This block initially loads the expand 3x3 and 1x1 biases into block RAM and sends 8 biases (4 each for expand 3x3 and expand 1x1) to the LAYER CONVOLUTION (EXPAND PART-1) block for each request.
    • LAYER ADDITION (EXPAND PART-2) + MAX POOL: This block performs the addition for the expand layer and, if max pool is enabled, performs the max-pool operation as well.

    • SQUEEZE KERNEL CONTROLLER: This block loads the kernels into block RAM and sends 16 kernels (8 from the 3x3 portion and 8 from the 1x1 portion) to the SQUEEZE CONVOLUTION + AVERAGE POOL block in a specific order. The kernel flow is shown in Fig 4: for each squeeze kernel (p kernels) and for each depth in that kernel (m depths), the kernel slice is sent for squeeze convolution. When the squeeze input layer reaches its last depth, the next kernel is loaded; when the last kernel is finished, the controller moves to a new row of the input layer (a loop-nest sketch of this ordering follows this list). The flow of the squeeze input layer is also shown in Fig 4.
    • SQUEEZE BIAS CONTROLLER: This block loads the squeeze biases into block RAM and sends one bias to the SQUEEZE CONVOLUTION + AVERAGE POOL block for every request.
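
    The squeeze kernel feeding order described above is effectively a three-level loop nest; the small C sketch below restates it purely for clarity, with send_kernel_slice() standing in for streaming one depth slice to the convolution block (the names and toy bounds are illustrative, not the actual HDL).

        #include <stdio.h>

        /* Stub standing in for streaming one kernel depth-slice to the conv block. */
        static void send_kernel_slice(int kernel, int depth)
        {
            printf("kernel %d, depth %d\n", kernel, depth);
        }

        int main(void)
        {
            const int rows = 2;   /* rows of the squeeze input layer (toy value) */
            const int p    = 3;   /* number of squeeze kernels                   */
            const int m    = 4;   /* depths per kernel                           */

            for (int row = 0; row < rows; row++)  /* new input row after the last kernel */
                for (int k = 0; k < p; k++)       /* next kernel after the last depth    */
                    for (int d = 0; d < m; d++)   /* each depth of the current kernel    */
                        send_kernel_slice(k, d);
            return 0;
        }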

    8. Results

    A few examples of object detection using the pre-trained weights of SqueezeNet v1.1 are shown below. Our approach gave the same results as the reference for most of the images, with higher confidence. In the figure below, the output values from our design appear quantized; this is because the final output is 8-bit.

    Another Example

     

    9. References

    10. Source Codes

    link :- https://github.com/Nathees/Neural_processor

     



    26 Comments

    Aleksandr Amerikanov
    These words: "It can be used in many real-time applications. It will be cost effective compare to GPU or other dedicated hardware level acceleration” sound very optimistic, but they need to be confirmed by facts. Is it really possible to compare the built-in processor unit with the graphics adapter? Completely different tasks are performed; so, the cost of video card will surely be higher, and it is obvious.
    Also, there are no tests of your product's usage in conjunction with real equipment. Does everything work? Is the performance really that high? And so on, and so forth.
    Good luck with it!
    🕒 Jul 02, 2018 09:19 PM
    AP085🗸
    For embedded applications which require inference, it is not cost effective to keep a GPU system on the edge device. Currently, most people use cloud-based inference (GPU servers or dedicated hardware accelerators), but that is still costly once we take the cost of the whole system into account, i.e. server cost, network cost, etc.
    Certainly, we can't compare our architecture with a GPU directly; a GPU will be super fast, and more costly. What we try to compare are different approaches, i.e. our approach of using an FPGA SoC solution versus a centralized GPU-based system. (We do compare against a mobile GPU, because it is comparable to embedded systems.)
    One example use case: suppose we need to monitor (detect the presence of) animals in a forest environment using a camera. If we go for a cloud-based/centralized solution, we may end up with a higher cost for the cloud service and network (and if we want to consider privacy, it will cost even more for dedicated servers). But with a solution similar to our design, we can detect the object on the edge device itself with privacy ensured.

    "Also, there are no tests of your product's usage in conjunction with real equipment. Does everything work?"
    - Yes. We are currently working on interrupt processing and on getting the input image from a webcam. We have tested SqueezeNet v1.1 (9 iterations of the computation blocks); the flow works fine, and we are getting a few mismatches after the 8th iteration, which we are working on as well. Since we are using a custom floating-point format, verification becomes harder.

    "Is the performance really that high?"
    - Yes, this architecture's performance is comparable to mobile phone performance. Here we used architectural optimizations such as running four expand kernels in parallel, a low memory footprint by cascading the squeeze layer without storing the expand layer output to memory, 12-bit operations, etc. Most of the time is consumed by the last classifying conv10 layer of SqueezeNet because of the 1000 classes. Mobile GPUs perform equally well because they have optimized ML frameworks which use algorithmic optimization techniques like sparse matrix multiplication, compression, etc.; we haven't exploited these techniques yet. Also, the mobile processor runs at roughly 10x our clock.

    Thank you for all of your comments; they really help us a lot.
    🕒 Jul 03, 2018 02:58 PM
    AP085🗸
    SqueezeNet v1.1 is now fully working! Please check the latest demo. Performance may seem slow in the video; that is due to printing tons of intermediate output to the terminal.
    🕒 Jul 05, 2018 03:11 PM
    AP085 🗸
    Video Update
    https://www.youtube.com/watch?v=8BVWXefzr4w
    🕒 Jul 02, 2018 04:25 PM
    Sam Gilligan
    This project could be quite effective. It depends how well it will perform compared to a GPU or other co-processor based solution.
    🕒 Jan 31, 2018 09:49 AM
    Denis S. Loubach · Judge ★
    Relevant project.
    The project mentions: "It will be cost effective and power efficient compared to GPU or other dedicated hardware level acceleration", therefore, this comparison results are highly expected.
    🕒 Jan 30, 2018 12:12 AM
    Emanuel Ortiz Ortiz
    Excellent Idea.
    🕒 Jan 19, 2018 05:30 AM
    Zhou Wenyan · Judge ★
    It is a good project, looking forward to see it!
    🕒 Jan 16, 2018 05:46 PM
    AP085🗸
    Thank you very much for appreciation
    🕒 Jan 16, 2018 07:23 PM
    hefe
    Please consider viewing my project
    http://www.innovatefpga.com/cgi-bin/innovate/teams.pl?Id=EM105
    🕒 Jan 29, 2018 06:19 PM
    MOHAMED
    Good work. keep working on it.
    🕒 Jan 15, 2018 06:29 PM
    AP085🗸
    Thank you.
    🕒 Jan 16, 2018 07:24 PM
    Bill Yuan · Judge ★
    Very Good Work, looking forward to see it!
    🕒 Jan 14, 2018 09:56 PM
    AP085🗸
    Thank you very much.
    🕒 Jan 16, 2018 07:26 PM
    Rajeebhavan
    Its a prominent work . Looking forward to see more .
    keep going
    🕒 Jan 11, 2018 06:01 PM
    AP085🗸
    Thank you. definitely we will give our best to make it successful.
    🕒 Jan 16, 2018 07:24 PM
    Jamesbager
    Good ideas.. best of luck..
    🕒 Jan 09, 2018 12:43 AM
    AP085🗸
    Thanks a lot.
    🕒 Jan 16, 2018 07:27 PM
    Thanu Theva
    Impressive ideas, good luck
    🕒 Jan 07, 2018 07:57 AM
    AP085🗸
    Thank you.
    🕒 Jan 16, 2018 07:27 PM
    Sivaram
    Wow

    Impressed
    🕒 Jan 07, 2018 02:16 AM
    AP085🗸
    Thank You
    🕒 Jan 16, 2018 07:21 PM
    Carl Adrian Patco
    Cool.
    🕒 Jan 04, 2018 03:59 PM
    AP085🗸
    Thank you.
    🕒 Jan 04, 2018 10:47 PM
    himayande
    Nice work. good luck with your project.
    🕒 Jan 04, 2018 11:53 AM
    AP085🗸
    Thank you
    🕒 Jan 04, 2018 10:48 PM
