Annual: 2019

AP011 »
Accurate, Fast, and Energy-efficient Object Detector with a Thermal Camera
📁Machine Learning
👤Masayuki Shimoda
 (Tokyo Institute of Technology)
📅Oct 08, 2019
Regional Final





Description

This project presents an accurate, fast, and energy-efficient object detector with a thermal camera on an FPGA for surveillance systems. A thermal camera outputs pixel values that represent heat (temperature), producing gray-scale images. Unlike visible-range cameras, thermal cameras do not depend on ambient lighting, so object detection with a thermal camera is reliable regardless of the surroundings. Additionally, visible images are not suitable for a surveillance system since they potentially violate user privacy. Thus, this topic is of broad interest in object surveillance and action recognition.

However, it is challenging to extract informative features from thermal images, so implementing an object detector with high accuracy remains difficult. In recent work, convolutional neural networks (CNNs) outperform conventional techniques, and a variety of CNN-based object detectors have been proposed. The representative networks are single-shot detectors, which consist of one CNN and infer locations and classes simultaneously (e.g., SSD and YOLOv2). Although the primary advantage of this type is that detection and classification can be trained jointly, the resulting computation time and area requirements can make an FPGA implementation problematic. Also, for the networks proposed for RGB three-channel images, false positives remain a problem; a more reliable object detector is required.

This project demonstrates an FPGA implementation of a reliable YOLOv2-based object detector that meets high-accuracy and real-time processing requirements with high energy efficiency. We explore the best preprocessing among conventional methods so that the YOLOv2 can extract more informative features. Also, two well-known model compression techniques, quantization and weight pruning, are applied to our model without significant accuracy degradation, so the reliable model can be implemented on an FPGA.
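For reference, the following is a minimal NumPy sketch of the two standard compression steps mentioned above, magnitude-based weight pruning and uniform quantization; the sparsity ratio and bit width shown are illustrative placeholders, not the values used in this project.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity=0.7):
    """Zero out the smallest-magnitude weights (hypothetical 70% sparsity)."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

def quantize_uniform(weights, bits=8):
    """Symmetric uniform quantization to a hypothetical 8-bit fixed-point grid."""
    scale = np.max(np.abs(weights)) / (2 ** (bits - 1) - 1)
    q = np.round(weights / scale).astype(np.int8)   # dequantize with q * scale
    return q, scale

w = np.random.randn(3, 3, 16, 32).astype(np.float32)  # toy CONV kernel
w_pruned, mask = prune_by_magnitude(w)
w_q, scale = quantize_uniform(w_pruned)
```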

Demo Video

  • URL: https://youtu.be/AHkX9FUnicc

  • Project Proposal

    1. High-level Project Description

    Our system is shown in the following section. The camera captures images and sends them to the PC. The images are resized, and preprocessing is then applied. The processed images are fed into the YOLOv2 part on the FPGA. The YOLOv2 part runs and outputs feature maps in which each pixel represents both the confidence and the location of the target objects. The PC receives the output and calculates the arguments of the maxima per pixel. After that, non-maximum suppression prevents multiple boxes from being drawn for a single object. The resulting images are displayed on a monitor.
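    As a rough host-side illustration of this flow, the sketch below uses a random stub in place of the FPGA YOLOv2 part; the grid size, class count, output layout, and confidence threshold are hypothetical placeholders, not values from our design.

```python
import numpy as np

# Hypothetical grid size and class count; the real model's values are not listed here.
GRID, NUM_CLASSES = 7, 1

def run_yolov2_stub(x):
    """Stand-in for the FPGA YOLOv2 part: per-cell [confidence, cx, cy, w, h, class scores...]."""
    return np.random.rand(GRID, GRID, 5 + NUM_CLASSES).astype(np.float32)

def detect(thermal_frame, conf_threshold=0.5):
    """Host-side flow: preprocess -> FPGA inference -> per-cell argmax -> thresholding.
    Non-maximum suppression (see the post-processing part) would follow on the kept boxes."""
    x = thermal_frame[None, ..., None]           # placeholder for the real preprocessing
    fmap = run_yolov2_stub(x)                    # offloaded to the FPGA in the actual system
    classes = np.argmax(fmap[..., 5:], axis=-1)  # arguments of the maxima per cell
    keep = fmap[..., 0] > conf_threshold         # hypothetical confidence threshold
    return fmap[keep][:, 1:5], classes[keep]

boxes, labels = detect(np.zeros((240, 320), dtype=np.float32))
print(boxes.shape, labels.shape)
```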

    The architecture dedicated to YOLOv2 exploits the sparsity of the network. Sparsity-based approaches have attracted increasing interest from two viewpoints, reduction in data footprint and elimination of computation, since they achieve a good balance between compression ratio and accuracy. There are three types of architectures, categorized by which values are skipped: weights, activations, or both. This project employs a kernel-sparsity-aware architecture that exploits the sparseness of pruned kernels, since this architecture is the best choice for obtaining a good balance among performance, circuit complexity, and thus energy efficiency [1].

    [1] J. Li et al., "CCR: A concise convolution rule for sparse neural network accelerators," 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, 2018, pp. 189-194.

    2. Block Diagram

    (Figure: block diagram of the overall system and the microarchitecture of the YOLOv2 part)

    The above figure also illustrates the microarchitecture of our YOLOv2 part. The architecture consists of input/output feature map (F.Map) buffers, a non-zero weight buffer, a multiplier unit, an accumulator unit, and a batch normalization (BN) & ReLU unit. Incoming images are sent to the upper F.Map buffer. At each convolution, the F.Map buffer sends a vector of M related values along the width and height axes. The valid values are determined from the non-zero weight indices and the convolution count. A non-zero weight value fetched from the weight buffer is broadcast to the M multipliers. After multiplication, each result is fed into its accumulator and the BN & ReLU unit. The resulting vector is stored in the other F.Map buffer. When the next convolutional layer is processed, the roles of the F.Map buffers are logically swapped, and the lower F.Map buffer sends the activation vectors. Since the number of layers in our YOLOv2 is even, the upper F.Map buffer accommodates the network outputs. The outputs are sent to the host PC, which calculates the arguments of the maxima per pixel.
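    A minimal software model of this ping-pong dataflow is sketched below. The dense conv_bn_relu stand-in and the layer shapes are assumptions for illustration only; the FPGA uses the sparse datapath described above.

```python
import numpy as np

def conv_bn_relu(x, w):
    """Dense stand-in for one CONV + BN + ReLU layer (BN folded into weights for simplicity).
    x: (C_in, H, W), w: (C_out, C_in, 3, 3), stride 1, padding 1."""
    c_out = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, x.shape[1], x.shape[2]), dtype=np.float32)
    for o in range(c_out):
        for c in range(x.shape[0]):
            for ky in range(3):
                for kx in range(3):
                    out[o] += w[o, c, ky, kx] * xp[c, ky:ky + x.shape[1], kx:kx + x.shape[2]]
    return np.maximum(out, 0.0)

def run_network(image, weights):
    """Two F.Map buffers swap roles after every layer; with an even layer count,
    the upper buffer (buf[0]) holds the final outputs."""
    buf = [image.astype(np.float32), None]   # buf[0] = upper buffer, buf[1] = lower buffer
    src = 0
    for w in weights:
        buf[1 - src] = conv_bn_relu(buf[src], w)
        src = 1 - src                        # logical swap of input/output buffers
    return buf[src]

weights = [np.random.randn(8, 2, 3, 3).astype(np.float32) * 0.1,
           np.random.randn(4, 8, 3, 3).astype(np.float32) * 0.1]   # even number of layers
print(run_network(np.random.rand(2, 6, 6), weights).shape)
```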

    3. Intel FPGA Virtues in Your Project

    Performance parameters

    Our system needs to achieve 30 frames per second and accuracy comparable to that of the dense model. Discussing the FLOPS of a sparse model is not meaningful, since the pruned (zero-weight) operations are skipped entirely.

    Appraise the function of Altera FPGA devices

    Among PCIe-based FPGA boards in the same price range, the OpenVINO Starter Kit has more block RAM. Since block RAM is the dominant resource in our kernel-sparsity-aware architecture, this is one of the important reasons for selecting the device (see also Section 4).

    4. Design Introduction

    Purpose of the design

    This design aims to meet the requirements of surveillance systems: real-time processing (30 frames per second), accuracy as high as possible, and energy efficiency as high as possible, since the system runs continuously 24/7.

    Application scope

    A surveillance system and related areas, such as security systems.

    Target users

    Security service and surveillance corporations.

    Why we use an Intel FPGA device

    The board is a reasonably priced PCIe-based one. Preprocessing is necessary in our target scenarios, and a powerful Intel CPU on the host is well suited to it. Additionally, among PCIe-based FPGA boards in the same price range, the OpenVINO Starter Kit has more block RAM; since block RAM is the dominant resource in our kernel-sparsity-aware architecture, this is one of the important reasons. That is why we selected the OpenVINO Starter Kit.

     

    5. Function Description

    Pre-processing part (CPU part)

    To increase the amount of input information that YOLOv2 obtains from a thermal image, we combine a background-subtraction image with the thermal image and use the pair as input. We use BackgroundSubtractorMOG2 from OpenCV as the background-subtraction algorithm, and "combine" means concatenation along the channel axis.
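    A minimal sketch of this preprocessing with OpenCV is shown below; the MOG2 parameters and input size are illustrative defaults, not the tuned values used in the project.

```python
import cv2
import numpy as np

# MOG2 background subtractor from OpenCV (parameters here are illustrative only).
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16, detectShadows=False)

def preprocess(thermal_frame, input_size=(224, 224)):
    """Resize a gray-scale thermal frame, compute its background-subtraction mask,
    and concatenate the two along the channel axis as the YOLOv2 input."""
    thermal = cv2.resize(thermal_frame, input_size)
    fg_mask = bg_subtractor.apply(thermal)                 # 0/255 foreground mask
    x = np.stack([thermal, fg_mask], axis=-1).astype(np.float32) / 255.0
    return x  # shape: (H, W, 2) -> thermal + background-subtraction channels
```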

    YOLOv2 part (FPGA part)

    ● Functionality

    An illustration of the sparse convolution operation (the in/out F.Map size is 3x3, the kernel is 3x3, stride = 1, padding = 1) is shown in the above figure. Several successive feature-map values along the width and height axes are fetched simultaneously, and the corresponding weight is broadcast to all multipliers to achieve a high degree of parallelism.
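    The NumPy sketch below mimics this operation for a single input channel: only the non-zero kernel taps are stored (in COO form), and each non-zero weight is broadcast over a vector of successive feature-map values before accumulation. The sizes follow the 3x3 / stride-1 / padding-1 example above.

```python
import numpy as np

def sparse_conv_single_channel(fmap, nz_weights):
    """Kernel-sparsity-aware 3x3 convolution, stride 1, padding 1, on one input channel.

    fmap       : (H, W) input feature map
    nz_weights : list of ((ky, kx), value) entries, i.e. only the non-zero kernel taps
    """
    h, w = fmap.shape
    padded = np.pad(fmap, 1)
    out = np.zeros((h, w), dtype=np.float32)
    for (ky, kx), val in nz_weights:
        # One non-zero weight is broadcast over a whole vector of feature-map
        # values (the M multipliers in hardware); zero weights are skipped.
        out += val * padded[ky:ky + h, kx:kx + w]
    return out

fmap = np.arange(9, dtype=np.float32).reshape(3, 3)       # 3x3 in/out example from the text
nz = [((0, 0), 0.5), ((1, 1), 1.0), ((2, 2), -0.25)]      # 3 of 9 taps are non-zero
print(sparse_conv_single_channel(fmap, nz))
```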

    ● How to implement?

    1. Microarchitecture of YOLOv2

    The microarchitecture of the YOLOv2 part is the one shown in the block diagram of Section 2: input/output feature map (F.Map) buffers, a non-zero weight buffer, a multiplier unit, an accumulator unit, and a BN & ReLU unit. As described there, each non-zero weight is broadcast to the M multipliers, and the two F.Map buffers logically swap their input/output roles layer by layer, so the upper buffer holds the network outputs that are sent to the host PC.

    2. Memory Layout for Feature Map Memory

    (Figure: memory layout of the F.Map buffer for a 3x3 CONV on a 5x5 F.Map)

    The above figure shows an example of the memory layout of the F.Map buffer for a 3x3 CONV on a 5x5 F.Map. The memory addresses range from 0 to the maximum of channels x F.Map height x kernel width among all CONV layers (in our model, 128 x 6 x 3 = 2,304), and the memory width is the number of processing elements (PEs). When fetching a feature-map value, the memory address is calculated from both the non-zero weight indices and the CONV counter. The weight index buffer holds the non-zero weight indices in COO format, (channel, column, row), and the CONV counter manages the coordinate (y, x) at which the convolution is performed. Assuming that the convolution is located at (y, x) and the non-zero weight index is (channel, column, row), the absolute location of the corresponding feature-map value is (channel, y+column, x+row). In this design, we set the number of PEs to 16, since the output F.Map width of the first CONV layer is 16.
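    The coordinate calculation above can be sketched as follows; fetch_coordinate follows the (channel, y+column, x+row) rule in the text, while the linear packing in buffer_address is an assumption made only to illustrate the channel x height x kernel-width address range and the 16-wide memory words.

```python
NUM_PE = 16          # memory width (one feature-map value per processing element)
FMAP_HEIGHT = 6      # per-layer buffer height used in the address-range example
KERNEL_WIDTH = 3

def fetch_coordinate(conv_counter, nz_index):
    """Return the absolute feature-map coordinate read for one non-zero weight.

    conv_counter : (y, x) position at which the convolution is being evaluated
    nz_index     : (channel, column, row) COO index of the non-zero weight
    """
    y, x = conv_counter
    channel, column, row = nz_index
    return (channel, y + column, x + row)

def buffer_address(coordinate):
    """Hypothetical linearization into the F.Map buffer: each address holds a
    16-wide vector (one value per PE), so the width axis selects the lane."""
    channel, fy, fx = coordinate
    word = channel * FMAP_HEIGHT * KERNEL_WIDTH + fy  # an assumption about the packing
    lane = fx % NUM_PE
    return word, lane

coord = fetch_coordinate(conv_counter=(2, 4), nz_index=(5, 1, 0))
print(coord, buffer_address(coord))
```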

    3. Vectorizing Unit

    We develop the vectorizing unit to maintain a high degree of parallelism by packing successive valid values into the same memory address. The unit also converts the valid values taking the next layer's padding into account, which simplifies fetching the valid values in the next CONV operation. The above figure shows an illustration of the vectorizing unit. It assumes that the next CONV operation is performed on a 5x5 F.Map (including padding) with 3x3 kernels after a max-pooling layer. The unit generates as many vectors as the next kernel width. Under this assumption, it generates three vectors from a valid one with padding, and these vectors are fed into the corresponding shift registers of #PE size. When the preset number of vectors has been stored, they are sent to the F.Map buffer. Converting the input valid vector with the padding in mind makes it possible to maintain a high degree of parallelism in every CONV layer's operation.
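    A rough software model of the vectorizing unit is given below; it follows the 5x5-padded / 3x3-kernel example and a 16-lane PE array, and the exact packing order into the shift registers is an assumption.

```python
import numpy as np

NUM_PE = 16  # shift-register width (number of processing elements)

def vectorize_row(valid_row, kernel_width=3, pad=1):
    """Pad one row of valid feature-map values and emit kernel_width shifted
    copies of it, so the next CONV layer can fetch an aligned vector per
    kernel column directly from the F.Map buffer."""
    padded = np.pad(valid_row.astype(np.float32), pad)             # e.g. 3 values -> 5 with padding
    out_width = padded.size - kernel_width + 1                     # positions of the next 3x3 CONV
    vectors = [padded[kx:kx + out_width] for kx in range(kernel_width)]
    # Each vector is placed into a PE-wide shift register (unused lanes zero-filled).
    regs = np.zeros((kernel_width, NUM_PE), dtype=np.float32)
    for kx, vec in enumerate(vectors):
        regs[kx, :vec.size] = vec
    return regs

row_after_pooling = np.array([7.0, 3.0, 5.0])   # a row of valid values after max pooling
print(vectorize_row(row_after_pooling))
```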

    Post-processing part (CPU part)

    Typically, an object detection method creates multiple bounding boxes per object. Since it is desirable to reduce these to a single bounding box, whenever bounding boxes overlap with an intersection-over-union (IoU) above a certain threshold, we use a non-maximum suppression algorithm to remove all bounding boxes except the one with the highest classification probability. The IoU is formulated as

    IoU = [Area of Overlap] / [Area of Union].

    In this design, we conduct this as post-processing on the CPU.
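    A minimal NumPy implementation of this IoU-based non-maximum suppression is sketched below; the 0.5 IoU threshold is illustrative.

```python
import numpy as np

def iou(box, boxes):
    """IoU = area of overlap / area of union; boxes are (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring box and drop any remaining box whose IoU with it
    exceeds the threshold; repeat until no candidates remain."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_threshold]
    return keep
```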

    6. Performance Parameters

    Experimental Results

    • Accuracy (F-score): 52%
    • Frames per second (FPS): 52
    • Resource utilization (Freq. 115.92 MHz)
      • #Registers: 186,770
      • #ALMs: 84,245 (74%)
      • #DSPs: 28 (8%)
      • #RAMs: 1,220 (100%)

     

    7. Design Architecture




    2 Comments

    Doreen Liu
    By the way, we have released openvino 2019 R1 package for OSK, you can get the package from this link:
    https://www.terasic.com.tw/cgi-bin/page/archive.pl?Language=English&CategoryNo=167&No=1159&PartNo=4
    🕒 Jun 26, 2019 11:51 AM
    Doreen Liu
    Excellent Topic. Please complete the rest part of the proposal. The deadline of the first stage is 2019-06-30.
    🕒 Jun 26, 2019 11:50 AM
