Annual: 2019

EM025 »
Real-time video depth estimation
📁Digital Design
👤Vladislav Sharshin
 (Independent SP-Team)
📅Oct 08, 2019
Regional Final




EM025 » Real-time video depth estimation

Description

The aim of our project is to develop a device that reconstructs the depth of video frames in real time using an FPGA.
There are many classical algorithms for depth map reconstruction, but even the best of them are slow: processing a single frame takes several seconds.
These approaches do not work in real time.
Within the project, we intend to develop a device that accelerates depth map reconstruction and handles the task in real time without degrading the quality of the result.
To construct the depth map in real time, we use a deep neural network with a dedicated architecture, implemented in the FPGA, which processes the two images of a stereo pair simultaneously.
The FPGA makes this process more efficient than a CPU thanks to its parallel architecture and pipelining, so we achieve a large speed-up through parallel data processing and a pipelined data flow.

Demo Video

  • URL: https://youtu.be/kbNtjw4z8zk

  • Project Proposal

    1. High-level Project Description

    Cameras are used in many areas of daily life, but a single video frame does not provide information about the depth of objects in the scene. This is a problem in many applications, for example for the sensors of driverless cars.

    The main disadvantages of classical algorithms for accurate depth map reconstruction are their computational complexity, resource intensity and long processing time.

    An FPGA is well suited to this challenge.

    The aim of our project is to develop a device that evaluates and constructs a depth map for each frame of a high-resolution video stream in real time.

    Once implemented, our project can be applied in many areas:

    • driverless cars;
    • devices for sight-impaired people;
    • traffic cameras;
    • visual effects and real-time video processing;
    • real-time 3D modelling of objects;
    • etc.

    The main idea is to use two images from two stereo cameras that have a similar field of view but are placed some distance apart. Images taken by such cameras are shifted with respect to each other and see the objects from slightly different angles. From several captures of the same scene taken from different viewpoints it is possible to find the distance to the objects, i.e. the depth of the scene.
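
    The geometry behind this is the standard stereo triangulation relation; a minimal Python illustration (the focal length and baseline below are assumed example values, not our camera parameters):

        # Depth from disparity for a rectified stereo pair: Z = f * B / d, where
        # f is the focal length in pixels, B the camera baseline in metres and
        # d the horizontal shift (disparity) of a point between the two images.
        def depth_from_disparity(disparity_px, focal_px=700.0, baseline_m=0.06):
            """Assumed example values: 700 px focal length, 6 cm baseline."""
            if disparity_px <= 0:
                return float("inf")  # zero disparity corresponds to a point at infinity
            return focal_px * baseline_m / disparity_px

        print(depth_from_disparity(20.0))  # ~2.1 m for the assumed parameters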

    There are many algorithms that reconstruct a depth map of good quality from a stereo pair without distortions or visual artefacts. Their results are acceptable, even pleasing, but they are computationally expensive.

    The neural network architecture used in the project is an hourglass network based on U-Net, which has shown good results in image segmentation for medical applications.

    The neural network architecture was modelled, tested and tuned on an NVIDIA graphics card using CUDA and the Python-based frameworks TensorFlow and Keras.

    Figure 1 shows the neural network prototype. Input: stereo images from the left and right cameras. Output: a depth map. An encoder and a decoder are used, with additional skip connections as in U-Net.

    Figure 1. The neural network architecture Prototype

    To improve the upsampling process, our architecture contains several refinements: interconnections between distant layers and combining the feature maps of the two frames (left and right) after several layers. We also use techniques such as batch normalization to improve the performance of the network.

    Figure 2 shows the results achieved by the neural network so far.

    The figure below shows the result of the video stream processing implemented on the FPGA in our project.

    From left to right: left stereo image, right stereo image, depth map reconstructed by a classical algorithm, depth map reconstructed by the fully convolutional neural network (FCN).

    We believe these are good results even at this stage of the project.

    Figure 2. The neural network working result 

    The main aim of our project is real-time depth reconstruction of video frames using a fully convolutional neural network implemented on an FPGA, but the same architecture can also be used for image segmentation and classification, which we also did within the project.

    Figure 2 shows the result of processing a static image with the neural network on a CPU.

    Using an FPGA enables us to process the video stream and obtain the results in real time.

     

    2. Block Diagram

    We are going to implement our project on the OpenVINO Starter Kit evaluation board (Terasic Inc.).

    The project is to include:

    • designing and training a neural network with 8-bit weights in Python;
    • implementing the neural network on the FPGA in Verilog;
    • designing PC software that transfers the video stream from the cameras to the FPGA, controls the FPGA design and visualizes the result.

    The system block diagram is shown in the figure below.

    Figure 3. The system Block Diagram

    The structure of the project is described below:

    • the stereo USB camera transmits the video stream to the PC;
    • the PC software transfers the video stream to the FPGA over PCIe;
    • the video stream is processed in the FPGA;
    • the processed stream is transferred back to the PC for depth map visualization.

    3. Intel FPGA Virtues in Your Project

    Performance boost

    The main idea of our project is to process the two video streams from the two cameras in separate branches of the neural network in its first layers. Each layer of the network requires a large number of simultaneous convolution operations with dozens or hundreds of windows (kernels).

    The FPGA allows us to perform the data processing and calculations for each layer in parallel, without significant delay and without using external memory. Processing the layers one after another, we obtain a pipeline.

    Pipelining and parallel data processing are FPGA advantages that are not available on a CPU, where all calculations and data processing happen sequentially.

    Thanks to this, we can process the video stream in real time, performing hundreds of numerical operations simultaneously without overusing external memory.

    In effect, our device is an accelerator that performs all fully convolutional neural network (FCN) operations internally, which offloads the CPU.

     

    Adaptation to change

    Another FPGA advantage is the possibility of changing the design to meet new requirements.

    While a neural network architecture is being developed, there are many situations in which the data processing structure has to be changed to obtain better results.

    This is why the reconfigurability of the FPGA is one of the biggest advantages for our project.

     

    Expand I/O

    Another advantage of the FPGA is the possibility of using different communication interfaces.

    In our project we use PCIe for data transfer between the FPGA and the host PC, but if necessary we could also use Ethernet or USB 2.0 (3.0). For this reason, our design can be applied in other devices with an interface of comparable speed.

    4. Design Introduction

    We live in a period of intensive development of technology, particularly artificial intelligence (AI). Driverless cars, traffic regulation, crime detection and AI-based decision assistants in medicine are becoming common practice. All of these areas use streaming video analysis to predict what is happening, and they commonly rely on an explicit or implicit depth map produced from two or more video frames. This is why our project is relevant to contemporary problems.

    The aim of our project is to develop a device that evaluates and constructs a depth map for each frame of a high-resolution video stream in real time.

    Once implemented, our project can be applied in many areas:

    • driverless cars;
    • devices for sight-impaired people;
    • traffic cameras;
    • visual effects and real-time video processing;
    • real-time 3D modelling of objects;
    • etc.
    The following equipment and tools are used in the project:
    • OpenVINO Starter Kit (Terasic Inc.);
    • dual-lens "Noname" USB web camera, configured at a resolution of 640x480;
    • host PC with GUI software to control the system, connected to the board via PCIe;
    • monitor connected to the PC;
    • Verilog/SystemVerilog hardware description languages for the FPGA design;
    • C/C++ for the PC software;
    • Quartus 18.1 development environment;
    • Qt Creator IDE;
    • Anaconda Navigator;
    • TensorFlow;
    • Keras;
    • TF Lite for conversion to fixed point;
    • Scikit-learn.

    We use an FPGA because the task requires parallel calculations and pipelining, which is where an FPGA brings the greatest benefit.

    So, our project is based on the following:
    • The FPGA design has no feedback paths and processes each pixel immediately after it arrives from the camera, without waiting for a full-size image.
    • There is no need to store the camera data, apart from the frame buffer.
    • Parallel flows from different parts of the neural network are multiplexed onto one high-speed bus. As a result, the same FPGA computational resources are used to process different parts of the network. Data processing runs at a frequency close to the maximum for the FPGA.
    • If the size and structure of the neural network change during optimization and fine-tuning, data processing can be split across several high-speed buses. We plan to use this to improve the resolution of the resulting depth map in the future.

    5. Function Description

    For this project, a 64-bit Windows desktop application was developed in Qt Creator 5.13 with the Visual Studio 2017 compiler (MSVC2017). The application is intended for:

    1. capturing frames from a stereo camera connected via USB;
    2. scaling the frames to the required size;
    3. exchanging data between the host and the FPGA;
    4. displaying in the GUI the video streams from the right and left sensors of the stereo camera, as well as the resulting image with the depth map received from the FPGA over the PCIe bus.

    Data is exchanged between the host and the FPGA through API calls to the TERASIC_PCIE_MSGDMA.DLL library. Frames are written and read in DMA mode using the PCIE_DmaWrite/PCIE_DmaRead functions, while the control and status registers are accessed with the PCIE_Write32 and PCIE_Read32 functions.

     

    Convolutional operation

    The convolution operation is the basic element of a convolutional neural network. When processing on a PC, the data are stored in memory as N matrices (e.g. colour channels or the outputs of the previous layer). These matrices can be represented as a rectangular volume that is convolved with a cubic window. An example of convolution with a cubic window is shown in Figure 4.

    Figure 4. The convolution operation with cubic window example
     

    Visualization of the convolution operation: a window of the larger array is multiplied element-wise with the kernel, and the products are summed into one element of the next feature map (violet).

    The number of windows used determines the number of output matrices, which are processed in the same way in the next layer. Note that every convolution operation is performed sequentially and uses the entire image stored in memory.
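
    As a plain software reference for the operation of Figure 4 (not the FPGA implementation), a minimal NumPy sketch assuming 'valid' padding and stride 1:

        import numpy as np

        def conv_cubic_window(feature_maps, kernels):
            """Reference (non-pipelined) convolution with 'cubic' windows.

            feature_maps: (H, W, C_in) array; kernels: (K, K, C_in, C_out) array.
            Each output pixel is the sum of the element-wise product between a
            K x K x C_in window and one kernel ('valid' padding, stride 1).
            """
            H, W, C_in = feature_maps.shape
            K, _, _, C_out = kernels.shape
            out = np.zeros((H - K + 1, W - K + 1, C_out))
            for y in range(H - K + 1):
                for x in range(W - K + 1):
                    window = feature_maps[y:y + K, x:x + K, :]           # K x K x C_in
                    for c in range(C_out):
                        out[y, x, c] = np.sum(window * kernels[..., c])  # multiply, then sum
            return out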

    This approach is not well suited to pipelined processing in an FPGA. Our idea is to process the data on a rolling basis.

    Since convolution with a cubic window is very resource-intensive, we decided to use depthwise separable convolutions (https://machinethink.net/blog/googles-mobile-net-architecture-on-iphone/), which give a comparable result while requiring far fewer computational resources.

    The main idea is to perform the convolution in two stages. In the first stage, each channel is convolved independently with its own 2D kernel. This first step of the depthwise separable convolution is shown in Figure 5.

    Figure 5. The convolution operation first stage
     
    During the second stage, convolutions are performed with column kernels; the number of columns determines the number of output channels. Figure 6 shows this second stage of the convolution.

    Figure 6. Deepwise convolution
     
    In this stage, each column is multiplied element-wise with a 1D kernel vector, and the result is the sum of the elements of the product vector.
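
    In Keras, which we use for prototyping, these two stages correspond to a DepthwiseConv2D layer followed by a 1x1 Conv2D (or to the combined SeparableConv2D layer); a minimal sketch with purely illustrative channel counts:

        import tensorflow as tf
        from tensorflow.keras import layers

        inp = layers.Input(shape=(224, 224, 32))       # illustrative input: 32 channels

        # Stage 1: one 3x3 kernel per input channel (Figure 5).
        x = layers.DepthwiseConv2D(kernel_size=3, padding="same")(inp)

        # Stage 2: 1x1 ("column") convolution that mixes the channels; the number
        # of filters sets the number of output channels (Figure 6).
        x = layers.Conv2D(filters=64, kernel_size=1, padding="same")(x)

        # Equivalent fused layer provided by Keras:
        y = layers.SeparableConv2D(filters=64, kernel_size=3, padding="same")(inp)

        model = tf.keras.Model(inp, [x, y])
        model.summary()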
     
    The schemes of the convolution and deconvolution layers are shown in Figure 7.
    Figure 7. Convolution and deconvolution layers

     

    Implemented modules are described below.

     

    5.1. String2Matrix Module

    The FPGA receives data line by line. In this module, the incoming rows are converted into a matrix of the proper size for convolution. Figure 8 shows the functional scheme of the String2Matrix module.

    Figure 8. String2Matrix Module functional scheme
     

    There are three serially connected FIFOs at the module input, each one input row deep. When the first FIFO is full, data from its output is transferred to the input of the second FIFO, and from its output to the input of the third. Each FIFO output also feeds a three-element shift register. Thus a 3x3 matrix is formed in the shift registers.

    Because several channels are processed on one bus to reduce resource consumption, the intermediate shift results are written to RAM. Switching the FIFO outputs between the shift registers is performed under FSM control. There is also a special module at the FIFO outputs that introduces a delay between data reads; this delay is required by the downstream modules.
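
    A behavioural Python model of the line-buffer idea for one channel might look as follows (the RAM-based channel multiplexing, the FSM and the read-delay module are omitted; the three chained FIFOs are modelled here by two row delays plus the direct input, which is behaviourally equivalent):

        from collections import deque

        def string2matrix_windows(pixel_stream, width):
            """Behavioural model of String2Matrix for one channel.

            Two row-deep line buffers delay the incoming stream by one and two
            rows; a 3-tap shift register on the input and on each buffer output
            forms the 3x3 window, produced for every new pixel once three rows
            have been seen.  Windows straddling a row boundary are not masked
            in this simplified model.
            """
            line1 = deque(maxlen=width)                   # most recent full row
            line2 = deque(maxlen=width)                   # row above it
            taps = [deque(maxlen=3) for _ in range(3)]    # shift registers

            for pixel in pixel_stream:
                d1 = line1[0] if len(line1) == width else None   # delayed by one row
                line1.append(pixel)
                d2 = None
                if d1 is not None:
                    d2 = line2[0] if len(line2) == width else None  # delayed by two rows
                    line2.append(d1)
                for tap, value in zip(taps, (pixel, d1, d2)):
                    if value is not None:
                        tap.append(value)
                if all(len(t) == 3 for t in taps):
                    yield [list(taps[2]), list(taps[1]), list(taps[0])]  # top row first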

     

    5.2. Conv2_3x3 Module

    This module implements the first stage of the depthwise separable convolution: convolution of the input matrix with a 3x3 window. Figure 9 shows the functional scheme of the Conv2_3x3 module.

    Figure 9. Conv2_3x3 Module functional scheme
     
    The kernels for all channels are stored in ROM. The input data arrive on 9 parallel buses from the String2Matrix module output and are multiplied in parallel with the kernels read from the ROM. The multiplication results are accumulated in a pipeline, and a single bus carrying the channels serially is formed at the module output.
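
    A behavioural sketch of this first stage for a single pixel position, with the per-channel kernels held in a small 'ROM' array (shapes and values are illustrative only):

        import numpy as np

        def conv2_3x3_serial(windows, kernel_rom):
            """Behavioural model of Conv2_3x3 for one pixel position.

            windows: one 3x3 window per channel; kernel_rom[ch]: the 3x3 kernel
            stored in ROM for that channel.  In hardware the nine products are
            formed in parallel and summed in a pipeline; here the per-channel
            results are simply returned serially, as on the single output bus.
            """
            return [int(np.sum(np.asarray(w) * kernel_rom[ch]))
                    for ch, w in enumerate(windows)]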
     
     

    5.3. Conv_vect_ser Module

    This module implements the second stage of the depthwise separable convolution: convolution of the input channels with a column window. The quality of the convolutional network is largely determined by the number of depthwise convolutions and by the number of columns in this second stage.

    We developed and applied the following technique to save FPGA resources: the kernel weights are stored in ROM and read out on every cycle while the input data are held unchanged, and the multiplications are performed with a single DSP block.

    This approach is illustrated in the figure below.

    Figure 10. Conv_vect_ser Module
     

    Next, the multiplication results have to be accumulated per channel. A result register array is implemented in the FPGA for this purpose: the sequentially arriving products are multiplexed into the result registers, and once all channels have been accumulated, the register outputs are demultiplexed onto the output bus. This approach is illustrated in the figures below.

    Figure 11. Multiply operation timing waveform
     
    Figure 12. Module output
     

    If the number of channels required for the second stage of the depthwise separable convolution exceeds the maximum possible hold time, two or more conv_vect_ser modules can be used in parallel.
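
    As a behavioural cross-check of this time-multiplexed scheme, a Python sketch (channel counts and weights are illustrative; one shared multiplier, one result register per output channel):

        def conv_vect_ser(channel_samples, weight_rom):
            """Behavioural model of Conv_vect_ser.

            channel_samples: the serially arriving sample of every input channel
            at one pixel position.  weight_rom[out_ch][in_ch]: the column weights
            stored in ROM.  While each input sample is held, the weights for all
            output channels are read cycle by cycle and multiplied on a single
            shared DSP; products are accumulated in one result register per
            output channel, and the registers are then read out to the bus.
            """
            n_out = len(weight_rom)
            results = [0] * n_out                              # result register array
            for in_ch, sample in enumerate(channel_samples):   # sample held for n_out cycles
                for out_ch in range(n_out):
                    results[out_ch] += sample * weight_rom[out_ch][in_ch]  # one reused DSP
            return results                                     # demultiplexed output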

     

    5.4. ReLu Module

    This module implements the non-linear activation function. The diagram of the function is shown below.

    Figure 13. ReLu activation function diagram
     

    The module receives data and compares them in comparators. If an input sample exceeds the threshold, the threshold value is placed on the output; if the sample is negative, the output is 0; in all other cases the output equals the input.

    Figure 14. ReLu module
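
    In software terms, this comparator logic reduces to a clipped ReLU; a one-line Python reference (the threshold is a design parameter whose value is not given here):

        def clipped_relu(sample, threshold):
            """ReLu module reference: negative inputs give 0, inputs above the
            threshold are clipped to the threshold, everything else passes through."""
            return max(0, min(sample, threshold))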
     
     

    5.5. Max_pool Module

    In this module, 2x2 pixel groups are compressed into one pixel. The transformation operates on non-overlapping rectangles or squares, each of which is reduced to a single pixel by keeping the pixel with the maximum value.

    Figure 15. 2x2 pixel group compressing operation
     

    The module input receives samples that are compared in pairs, and the maximum of each pair is written to a FIFO. To do this, the odd sample is stored in a register and compared with the even sample as soon as it appears at the input. In this way we find the maximum values for the pairs of the odd row.

    The result obtained for the even row is then compared with the value for the odd row stored in the FIFO.

    Since many channels arrive sequentially at the input on one bus, the number of registers storing intermediate maximum values equals the number of channels. Switching between them is carried out by a multiplexer controlled by the sample and row counters. This is shown in the figure below:

    Figure 16. Max_pool module 
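
    As a software cross-check of this behaviour, a streaming reference for a single channel with even image dimensions (the per-channel register multiplexing described above is omitted):

        from collections import deque

        def max_pool_2x2_stream(frame):
            """Reference model of Max_pool: within each row, adjacent samples are
            reduced to their pairwise maximum; maxima of the first row of a 2x2
            block are kept in a FIFO and compared with the maxima of the second
            row, producing one output pixel per block."""
            fifo, output = deque(), []
            for r, row in enumerate(frame):
                pair_max = [max(row[i], row[i + 1]) for i in range(0, len(row) - 1, 2)]
                if r % 2 == 0:
                    fifo.extend(pair_max)                       # first row of the block
                else:
                    output.append([max(fifo.popleft(), m) for m in pair_max])
            return output

        # Example: a 4x4 frame is reduced to 2x2.
        print(max_pool_2x2_stream([[1, 2, 3, 4],
                                   [5, 6, 7, 8],
                                   [9, 1, 2, 3],
                                   [4, 5, 6, 7]]))   # [[6, 8], [9, 7]]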
     
     

    5.6. Up-sampling Module

    In this layer the matrix is doubled in size by duplicating its rows and columns. The operation is the opposite of max pooling.
     
    Figure 17. Up-sampling operation
     

    The input data is a sequence of input channels. At the first stage the data are written to the first FIFO (short FIFO) and forwarded to the module output. When all samples of all channels have been received, the data from the short FIFO are sent to the module output, thereby duplicating the matrix columns. At the same time, data from the input and from the short FIFO are fed into the long FIFO. After a whole row has been accepted and issued to the output, its copy stored in the long FIFO is also issued to the output, thereby duplicating the rows.

    The timing chart of the module operation and its functional diagram are shown in the figure below:

    Figure 18. Up-sampling module 
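
    Functionally, this FIFO scheme implements nearest-neighbour 2x upsampling; a short NumPy reference of the end result (not of the FIFO timing):

        import numpy as np

        def upsample_2x(feature_map):
            """Up-sampling module reference: duplicate every column (short FIFO)
            and every row (long FIFO), doubling both matrix dimensions."""
            doubled_cols = np.repeat(feature_map, 2, axis=1)
            return np.repeat(doubled_cols, 2, axis=0)

        print(upsample_2x(np.array([[1, 2],
                                    [3, 4]])))
        # [[1 1 2 2]
        #  [1 1 2 2]
        #  [3 3 4 4]
        #  [3 3 4 4]]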
     
     
     

    6. Performance Parameters

    To implement our idea, we have employed almost all of the element types available in the FPGA:
    • memory blocks;
    • DSP blocks;
    • PLL;
    • DDR controllers;
    • PCI-e controller.

    In principle, we could describe these blocks manually in Verilog, but because they are available as hard blocks in the FPGA, using them gives a significant performance advantage and saves resources for implementing other features.

    We process a video stream at a resolution of 224x224 pixels in this project. This resolution was chosen as the baseline because at the start of the project we were not sure how many resources the design would need. It is now clear that we can increase the resolution at least threefold.

    The operating frequency of the design is 200 MHz, and according to the TimeQuest reports the maximum frequency is 206 MHz, which is close to the limit for this FPGA family.

    Figure 19 shows the timing waveform.

    Figure 19. Camera interface data transmitting timing diagram

     

    7. Design Architecture

    Only a few projects that reconstruct depth maps from stereo images with neural networks have been implemented so far. The datasets used in these projects for training were generated artificially with Unity or other tools, and their main disadvantage is the poor robustness of the resulting algorithms to real data.

    To collect the dataset for our project, we filmed several video tracks in a shopping mall and extracted frames from them. We then processed those frames in the StereoPhoto Maker program (http://stereo.jpn.org/eng/stphmkr/) with the best settings we could find. Processing just one frame with this program takes 15-20 seconds. With the neural network this time can be reduced to 8-10 ms on a GPU (NVIDIA GTX 1060 Ti), and even less with an FPGA.

    Because most of the mathematical operations in neural network processing work on arrays (matrix/tensor multiplication, convolution, max pooling, etc.), and because reducing the dynamic range of the calculations from float32 to int8 barely changes the quality, machine learning and neural networks are becoming much more portable to the edge. FPGAs can therefore be very useful in this area, especially in video and audio processing, driverless cars and artificial intelligence.

    This is why the results of our project may be in demand in the future, especially the generalized, parameterized Verilog functions that implement the different layers of a neural network originally designed in Python/Keras. These functions were collected into a library that was used in the project and can be reused in any similar project without significant modifications.

     

    The uniqueness of the implementation is the following: at present, few devices reconstruct a depth map in real time with a neural network.

    In our project, real-time depth map reconstruction is performed by the neural network implemented in the FPGA.

    The FPGA design consists of parameterized convolutional neural network layers written by us in Verilog, with the basic layer components collected into a library. This library can be used to extend this project or reused in other projects without significant modification or time overhead.

     

    Keras was used because of its simplicity and the abundance of articles and blogs about neural network design. It is also possible to convert all weights of a pretrained neural network to fixed point with TensorFlow Lite, or to apply quantization during the training process (https://www.tensorflow.org/lite/guide/get_started).
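
    A minimal sketch of such a post-training conversion with the TensorFlow Lite API (TF 2.x); the representative dataset and the options shown are illustrative assumptions, not our exact flow:

        import numpy as np
        import tensorflow as tf

        def quantize_to_int8(keras_model, sample_inputs):
            """Convert a trained Keras model to a fully int8-quantised TFLite
            flatbuffer.  `sample_inputs` is a small set of representative inputs
            used to calibrate the activation ranges."""
            def representative_data():
                for sample in sample_inputs:
                    # For a two-input (stereo) model, yield one array per input.
                    yield [sample.astype(np.float32)]

            converter = tf.lite.TFLiteConverter.from_keras_model(keras_model)
            converter.optimizations = [tf.lite.Optimize.DEFAULT]
            converter.representative_dataset = representative_data
            converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
            return converter.convert()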

    It was found that an autoencoder may be the most useful architecture for photo and video processing tasks (image segmentation, agent manipulation and others), in particular U-Net, which was chosen as the base for our architecture (https://lmb.informatik.uni-freiburg.de/people/ronneber/u-net/).

    The task of obtaining a depth map from stereo images can be described as follows (Figure 20):

    • finding feature maps relevant to the depth map using the encoder;
    • revealing and preserving the positions of these features using the links between layers;
    • reconstructing the depth map from the feature presence and position information using the decoder.
    Figure 20. Short version of our neural network architecture
     

    The encoder includes (Figure 21):

    1. Two consecutive convolutions with batch normalization (used to speed up training), performed independently for the left and right images.
    2. Concatenation of the feature maps at the outputs of these convolutions.
    3. Five consecutive convolution and max-pooling operations. The convolution kernel size is (3,3); the max-pooling kernel size is (2,2) for the first four layers and (7,7) for the fifth.

    The decoder consists of (Figure 21):

    1. Six consecutive operations of bilinear interpolation with a (2,2) kernel followed by convolution with a (3,3) kernel. Batch normalization is used to improve the learning process.
    2. Concatenation layers that mix the encoder feature maps (taken before compression) with the upsampled arrays of the decoder.
    Figure 21. Full architecture of neural network (generated by Keras)
     
    It can be seen that the architecture of our neural network (image above) consists of similar, consecutively connected layers with different parameters. A generalized version of such a layer was implemented in Verilog for the FPGA; we can then simply customize its parameters and load the appropriate weights to transfer the whole architecture from the GPU to the FPGA.
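
    For reference, a simplified Keras sketch of this encoder-decoder structure (filter counts, depth and the final 7x7 pooling stage are illustrative; the real network in Figure 21 is larger):

        import tensorflow as tf
        from tensorflow.keras import layers

        def conv_bn(x, filters):
            x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
            return layers.BatchNormalization()(x)

        def build_depth_net(input_shape=(224, 224, 3), base_filters=16, depth=4):
            """Simplified sketch of the stereo encoder-decoder (Figures 20/21)."""
            left = layers.Input(shape=input_shape, name="left")
            right = layers.Input(shape=input_shape, name="right")

            # Independent branches for the two stereo views, then concatenation.
            l = conv_bn(conv_bn(left, base_filters), base_filters)
            r = conv_bn(conv_bn(right, base_filters), base_filters)
            x = layers.Concatenate()([l, r])

            # Encoder: convolution + max pooling, keeping skip connections.
            skips = []
            for i in range(depth):
                x = conv_bn(x, base_filters * 2 ** i)
                skips.append(x)
                x = layers.MaxPooling2D(2)(x)

            x = conv_bn(x, base_filters * 2 ** depth)

            # Decoder: bilinear upsampling + convolution, mixing in encoder features.
            for i in reversed(range(depth)):
                x = layers.UpSampling2D(2, interpolation="bilinear")(x)
                x = layers.Concatenate()([x, skips[i]])
                x = conv_bn(x, base_filters * 2 ** i)

            depth_map = layers.Conv2D(1, 3, padding="same", name="depth")(x)
            return tf.keras.Model([left, right], depth_map)

        model = build_depth_net()
        model.summary()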

    One of the main advantages of the FPGA is the ability to start processing from the very first pixel of a frame, without saving the whole image to memory. This is achieved by pipelining the convolutions and the other layers. The architecture designed and run on the GPU with Python is cumbersome: it originally used millions of multiplication operations per frame and large memory resources to store weights and intermediate results, so it could not be transferred as it was. We therefore multiplexed all the pipelines onto one high-speed bus so that the same FPGA resources (multipliers, accumulators, logic) can be used by different parts of the neural network. The operating frequency is close to the maximum for the chosen FPGA, which saves resources and significantly improves the overall processing performance.

    The software is available at the following link:
    https://github.com/sh-vlad/FPGA_real_time_depth_estimation_based_on_neural_network

     

    8. Conclusion 

    The main aim of the project was to design real-time video depth estimation based on a neural network on the OpenVINO Starter Kit (Terasic Inc.), and this aim has been achieved.

    The flexibility of the FPGA architecture encourages us to keep working on the project and improving the algorithms and the system, which we plan to do in the near future.

    When starting the project, our aim was to show a practical implementation of our idea on an Intel FPGA. During the work it became clear that the concept works: despite using cheap cameras, we obtained a really good result.

    The neural network was developed in Python (Keras, TensorFlow, TFLite) at a high level of abstraction, which allowed fast and flexible work. After implementing the parameterized Verilog modules, we were able to approach the pace of working in PC development environments and could quickly change the architecture when necessary.

    In traditional approaches to building depth maps and estimating distances from the frames of two cameras, we need not only the disparity calculation algorithm itself but also camera calibration. Calibration is necessary because the cameras cannot be placed perfectly coaxially: an error will appear in any case, especially with a cheap stereo pair. The displacement and rotation of the cameras with respect to each other must be known, and every frame of the stream must be shifted and rotated accordingly. This is a complex process that requires significant computational resources. One of the advantages of using a neural network to create the depth map is that there is no need to calibrate the cameras, which saves computing resources and increases system performance.

     

    9. Future plans

    The FPGA still has free resources, which we can use to improve our project.

    We plan to do the following:

    1. Improve the resolution of the output depth map to 480x640 pixels or better.
    2. Improve the detail and contrast of the resulting depth map by changing the neural network architecture (adding new layers to the decoder). The unused FPGA resources can be employed for this.
    3. Improve robustness to light sources in different conditions by adding a special training dataset. This will help to reconstruct a correct depth map even if a bulb flickers or the lighting on the road changes suddenly.
    4. Increase the number of paired USB cameras that can be processed in parallel on a single FPGA. This may be useful for 360-degree cameras or for driverless cars.

    Follow our project on GitHub; the latest features are there.

     

    10. References

    Similar projects:
    Classical algorithms:
    1. J. Park, C. Kim, "Extracting Focused Object from Low Depth-of-Field Image Sequences", SPIE VCIP, 2006.
    2. C.-C. Cheng et al., "A Novel 2D-to-3D Conversion System Using Edge Information", IEEE Transactions on Consumer Electronics, 2010.
    Neural networks with a structure similar to ours:
    Structure of layers and mathematical operations:


    6 Comments

    Aleksandr Amerikanov
    By the way, I recommend paying attention to the project http://www.innovatefpga.com/cgi-bin/innovate/teams.pl?Id=EM031. There, the team faced a similar task and the same problems, but it has already been partially implemented in a prototype. So you can exchange experience.
    🕒 Jul 06, 2019 05:43 PM
    Zavyalov Andrey Olegovich
    Good project. Did you estimate how many LEs will be used and which resolution is achievable? As I can see from Fig. 3, this is a huge neural network and it may not fit on the FPGA. What about trying to use the HPS and making a full system on the OpenVINO board without a PC? Good luck with the future work!
    🕒 Jul 03, 2019 08:48 PM
    EM025🗸
    Dear Mr. Zavyalov!
    Thanks for your warm feedback. You are right, the net is really huge, and it is a challenge for us. We have made a preliminary resource estimation and we hope to fit the net into the FPGA. I want to use some design features, for example running the processing at a high frequency (faster than the pixel clock) and using each multiplier several times. When I was thinking about the project a few months ago I wanted to use a 640x480 resolution, but we may have to reduce it. Our maths specialist is still enhancing the net architecture.
    We use the FPGA without the HPS because the OpenVINO board has more resources than the DE10-Nano. If we succeed, we will be able to implement the design for any FPGA.
    🕒 Jul 05, 2019 07:40 AM
    Aleksandr Amerikanov
    I agree with Andrey, and I think that the project will not fit on the FPGA anyway, even on the OpenVINO board. Therefore, you will most likely have to use external DDR memory, since it is present on the boards. But access to it is only possible through the HPS, which is likely to slow down your system.
    In any case, I wish you success.
    In any case, I wish you success.
    🕒 Jul 06, 2019 05:39 PM
    Doreen Liu
    A clear block diagram will help the judges understand your design. Please upload the rest of the proposal as soon as possible, as the first-stage deadline is 2019-06-30.
    🕒 Jun 27, 2019 02:12 PM
    Doreen Liu
    An excellent proposal! However, you haven't finished these parts:
    1. High-level Project Description
    2. Block Diagram
    3. Intel FPGA Virtues in Your Project
    🕒 Jun 27, 2019 02:12 PM
