EM038 » CNN hardware accelerator for detection of distracted drivers
Our project is about hardware accelerators for deep neural networks. The power consumption of current GPU implementations restricts the use of neural networks in small devices such as cell phones. We aim to develop a digital design specific to CNNs; such a design can speed up inference and reduce area and power, providing a small but efficient deep learning processor for mobile devices and critical applications.
We will use the proposed design to detect distracted car drivers (more details below).
"AI is the new electricity." - Andrew Ng
The rise of deep learning has affected every field in the world, and digital design is no exception.
The limitations of CPUs pushed scientists to use GPUs for training and inference of computationally expensive neural networks. However, GPUs have limitations of their own. This encouraged digital designers to start building processors specific to deep learning and its computational needs.
Low power consumption:
GPUs have high power consumption and a relatively large area, making them unable to fit in mobile devices.
We propose building a CNN architecture that can serve as the core of classification and object-detection applications. This design will have much lower power consumption than GPUs, as shown in the following figure.
FPGA implementations of CNN architectures achieve higher speed than GPUs and CPUs thanks to the parallel nature of FPGAs, making them a better choice for critical applications such as autonomous cars.
While the common approach to building such systems is HLS, we are going to follow an RTL approach in order to have more control and make better use of the FPGA resources.
CNNs have the remarkable ability to change application through transfer learning: the application of the device can easily be changed just by updating the parameters (weights) of the system while maintaining the same hardware architecture.
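A toy sketch of this idea (all shapes and weights below are invented for illustration; the function is a NumPy stand-in for the fixed hardware datapath):

```python
import numpy as np

def cnn_inference(image, weights):
    """Toy stand-in for the fixed hardware datapath: the compute
    structure never changes, only the weight memory contents do."""
    x = image
    for w in weights:               # each "layer" is a weight matrix
        x = np.maximum(x @ w, 0.0)  # multiply-accumulate + ReLU
    return x

rng = np.random.default_rng(0)
image = rng.standard_normal(8)

# Two applications, two weight sets, one architecture:
weights_task_a = [rng.standard_normal((8, 8)), rng.standard_normal((8, 4))]
weights_task_b = [rng.standard_normal((8, 8)), rng.standard_normal((8, 4))]

out_a = cnn_inference(image, weights_task_a)
out_b = cnn_inference(image, weights_task_b)
assert out_a.shape == out_b.shape == (4,)  # same datapath, different task
```

Re-targeting the accelerator is then just a write to the weight memory, with no resynthesis of the design.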
There are two approaches to implementing CNNs on an FPGA:
1- Pipelined architecture: higher speed, but higher power and area
2- Non-pipelined architecture: lower speed, but lower power and area
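The trade-off can be made concrete with a back-of-envelope calculation (the per-stage cycle counts below are made up for illustration):

```python
# Illustrative (made-up) per-stage latencies in clock cycles for a 4-stage CNN.
stage_cycles = [500, 800, 600, 300]

# Non-pipelined: one image at a time; the next image starts only after
# the whole network finishes.
nonpipelined_per_image = sum(stage_cycles)   # 2200 cycles/image

# Pipelined: once the pipeline fills, one image completes every
# max(stage) cycles, at the cost of keeping hardware for every stage busy.
pipelined_per_image = max(stage_cycles)      # 800 cycles/image

speedup = nonpipelined_per_image / pipelined_per_image
print(speedup)  # 2.75x throughput, paid for with extra area and power
```

The slowest stage sets the pipelined throughput, which is why balancing stage latencies matters in such designs.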
Table: Comparison of the first layer of MobileNet between pipelined and non-pipelined architectures
- The World Health Organization reported 1.25 million deaths yearly due to road accidents, with a fifth of these accidents caused by distracted drivers. We will implement a device that detects driver distraction using the output of a camera placed on the dashboard. Once the device detects a distraction, it will send a signal to alert the driver.
This task requires a high-speed device, so we will implement a pipelined CNN architecture.
However, the pipelined architecture has its drawbacks: each stage produces several feature maps, so the FPGA must have enough on-chip memory to store the results of every stage.
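A rough sizing sketch shows why this matters (the stage output shapes below are hypothetical, and 8-bit activations are assumed):

```python
# Hypothetical stage output shapes (height, width, channels).
stage_outputs = [(111, 111, 96), (55, 55, 128), (27, 27, 256)]
bytes_per_value = 1  # assuming 8-bit quantized activations

# In a pipelined design every stage's output buffer is live at the same
# time, so the SUM (not the max) of these bounds the on-chip memory needed.
per_stage = [h * w * c * bytes_per_value for h, w, c in stage_outputs]
total_bytes = sum(per_stage)
print([f"{b / 1024:.0f} KiB" for b in per_stage],
      f"total {total_bytes / 1024:.0f} KiB")
```

Even three modest stages already demand well over a megabyte of buffering under these assumptions, which motivates the area-reduction techniques below.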
Area-reduction techniques we will follow:
1- We will implement the SqueezeNet architecture: SqueezeNet is a tiny CNN architecture with about 50x fewer parameters than the famous AlexNet. For example, a 224x224 RGB image requires only about 421,000 multiplications, saving a great deal of FPGA power and supporting on-device inference.
2- Quantization: the authors of SqueezeNet showed that quantizing the network weights from 32 bits down to 8 or 6 bits does not affect accuracy but greatly reduces area.
3- Pruning: some feature maps do not contribute much to the final output; removing these maps reduces the data that must be stored and the number of operations needed, thus reducing power too.
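The three techniques above can be sketched numerically (a minimal sketch: all tensor shapes and weights are invented, the quantizer is a generic uniform symmetric scheme, and the pruning criterion shown is a simple L1-norm ranking, not necessarily the exact method we will use):

```python
import numpy as np

# 1- Layer cost: one k*k*in_c dot product per output value.
def conv_mults(out_h, out_w, out_c, k, in_c):
    return out_h * out_w * out_c * k * k * in_c

# SqueezeNet's trick: a 1x1 "squeeze" convolution costs 9x less than a 3x3
# over the same (illustrative) 55x55x96 input with 16 output channels.
assert conv_mults(55, 55, 16, 3, 96) == 9 * conv_mults(55, 55, 16, 1, 96)

# 2- Quantization: uniform symmetric mapping of float weights to n-bit ints.
def quantize(w, bits):
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale).clip(-qmax - 1, qmax).astype(np.int32)
    return q, scale                     # narrow ints + one scale per tensor

# 3- Pruning: drop the output channels (feature maps) with smallest L1 norm.
def prune_channels(w, keep_ratio):
    norms = np.abs(w).sum(axis=(1, 2, 3))        # one score per filter
    keep = np.argsort(norms)[-int(len(norms) * keep_ratio):]
    return w[np.sort(keep)]

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 3, 3, 16)).astype(np.float32)

q, s = quantize(w, 8)
assert np.abs(w - q * s).max() <= s          # reconstruction error < 1 LSB

w_pruned = prune_channels(w, 0.5)
assert w_pruned.shape == (32, 3, 3, 16)      # half the maps, half the MACs
```

In hardware terms: fewer multiplications shrink the MAC array, narrower weights shrink each multiplier and the weight memory, and pruned channels shrink both the buffers and the operation count.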
Table: Comparison between different architectures
The features of the Altera DE10-Nano kit appear suitable for implementing our classification architecture:
- 110K logic elements
- 112 DSP blocks
- 224 18x18 multipliers
- 112 27x27 multipliers
- 336 9x9 multipliers
The DE10-Nano kit was sufficient for an HLS implementation of the SqueezeNet architecture (see reference below).
Table: Resource utilization of the accelerator
Our design is expected to have lower utilization thanks to the optimization techniques mentioned above and the RTL approach.