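PR061 » Face Detection and Recognition System Based on Deep Learning
This design uses deep learning to develop a face detection and recognition system, implemented on an FPGA together with a hardware accelerator. Face detection is based on the fully convolutional SSD (Single Shot MultiBox Detector) architecture, with separable convolution layers used to reduce the parameter count and the amount of computation; the detector is trained on the WIDERFACE dataset with the TensorFlow Object Detection API provided by Google. Face recognition uses the Facenet method, trained on the VGGFace2 dataset, to obtain a feature extractor that automatically extracts the most important features of a face. Compared with methods that must be trained directly on photos of the people in a custom database, this method builds the neural network from public datasets and greatly reduces the number of samples needed when the enrolled people are updated: as few as one face photo per person is enough. The hardware accelerator follows a separable-convolution design flow, which lowers the parameter count and the amount of computation and reduces memory accesses, so that the complete system fits on the FPGA.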
With the promotion of smart cities and smart homes, people pay more and more attention to quality of life and hope that technology can change how we live. In recent years, with the use of GPUs and big data, deep learning has brought revolutionary progress to various fields, especially computer vision. However, GPU acceleration is constrained by power consumption and cost, which makes GPU-based products extremely expensive.
There are many types of access control systems around us, from keys and ID cards to biometric identification. Among biometric methods, face recognition is the most convenient for users because it requires neither physical contact nor any extra action. Face recognition also has wide practical application in various fields, such as attendance recording and identifying members or citizens.
Humans have natural abilities to recognize each other through their perceptive and cognitive system, but this task is too complex to simplify into a simple equation for the computer to execute. The automation of the face recognition task needs multiple complex systems and a sufficiently large database.
This project implements an unsupervised face recognition system with deep learning techniques and accelerates the algorithm with a combination of FPGA and ARM in order to realize an access control system.
The goal of our project is to build a face recognition system whose face database can be modified easily without retraining the model, and which needs only one picture of each person to be recognized. An unsupervised learning-based method is adopted to meet this requirement.
The machine learning method we choose is a deep neural network, whose computation reduces to a large number of multiply-add operations. This allows the resources of the FPGA module to be reused across layers and accelerates the system.
To implement a full face recognition system, we need a camera to capture a video stream that contains a person's face, and a monitor to display the video and show the result. We use the LXDE Linux desktop image with OpenCV support, which is provided on the official website of the DE10-Nano development board. By letting the prebuilt operating system handle the camera device, we can focus on implementing the hardware accelerator. The deep learning algorithm we use needs around 5 MB of memory to store the parameters and intermediate data, which is too large for on-chip memory, so we store the data in DDR3 SDRAM and communicate with the FPGA via the AXI interface, as shown in Fig. 1 and Fig. 2. The FPGA part implements each convolution layer type; the HPS prepares the memory regions and schedules the tasks, then sends the information to the FPGA to execute the convolution network.
The HPS and the FPGA communicate via the lightweight AXI bus, which transfers the information required for FPGA execution: the input memory address, output memory address, parameter memory address, and every necessary parameter for each layer. This way each module can be invoked simply, and the FPGA stays idle when no computation is needed. The hardware part of this face recognition model is divided into two main parts: pointwise convolution and depthwise convolution. Each 32-bit floating-point value is converted on the FPGA to a fixed-point value with a 16-bit fractional part and a 16-bit integer part. In other experiments we tried formats from 32-bit fixed point down to 8-bit fixed point, but 8-bit fixed point cannot retain enough accuracy, so we chose 16-bit fixed point instead.
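The fixed-point conversion described above can be illustrated with a minimal sketch. It assumes a signed 32-bit word with 16 integer bits and 16 fractional bits and saturating conversion; the function names and the rounding mode are illustrative assumptions, not taken from the design.

```python
FRAC_BITS = 16             # fractional bits (assumed 16.16 layout)
WORD_BITS = 32             # total word width
SCALE = 1 << FRAC_BITS     # 65536

def to_fixed(x: float) -> int:
    """Quantize a float to a signed 32-bit fixed-point integer, saturating on overflow."""
    raw = int(round(x * SCALE))
    lowest = -(1 << (WORD_BITS - 1))
    highest = (1 << (WORD_BITS - 1)) - 1
    return max(lowest, min(highest, raw))

def to_float(q: int) -> float:
    """Convert a fixed-point integer back to a float."""
    return q / SCALE

def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values.

    The raw product carries 2 * FRAC_BITS fractional bits, so it is shifted
    right by FRAC_BITS to return to the original format (the shift floors).
    """
    return (a * b) >> FRAC_BITS
```

Converting a weight and an activation with `to_fixed`, multiplying them with `fixed_mul`, and converting back with `to_float` shows the quantization error that a given number of fractional bits introduces.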
The average pooling layer has low time complexity on both the HPS and the FPGA. To keep the memory access schedule organized, we execute the convolutions on the FPGA and the scheduling, average pooling, and padding on the HPS.
The feature map in our design is channel-first, which means its dimensions are (channel, height, width). Compared with the channel-last design, the channel-first design is more flexible when the computation is split into several parts. In the fire module, we want the output data to be contiguous in memory. With the channel-first layout, the outputs of the 1x1 and 3x3 layers can be concatenated simply by assigning the output address, whereas the channel-last layout would need a transpose to merge the two outputs into a single contiguous block.
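The address-based concatenation can be sketched in a few lines. This is an illustrative model only: memory is a flat Python list, and the helper names `chw_index`, `make_branch`, and `concat_by_address` are hypothetical.

```python
def chw_index(c, h, w, height, width):
    """Flat offset of element (c, h, w) in a channel-first tensor."""
    return (c * height + h) * width + w

def make_branch(tag, channels, height, width):
    """Stand-in for a layer output: a flat channel-first buffer of labelled elements."""
    return [(tag, c, h, w)
            for c in range(channels)
            for h in range(height)
            for w in range(width)]

def concat_by_address(branch_a, branch_b):
    """Concatenate along the channel axis by choosing output addresses only.

    The second branch is written right after the first one; no element of
    either branch has to be moved or transposed.
    """
    memory = [None] * (len(branch_a) + len(branch_b))
    memory[0:len(branch_a)] = branch_a
    memory[len(branch_a):] = branch_b
    return memory
```

If the 1x1 branch has `C1` channels, then channel `k` of the 3x3 branch is found at channel `C1 + k` of the combined buffer, which is exactly what a channel-wise concatenation requires.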
Table 1 shows the lightweight shared memory that transfers the basic information for each layer execution.
The definition of each parameter is shown below:
SIZE: Width and height of the input tensor, max 160
CHANNEL: Number of channels of the input tensor, max 1024
TYPE: Type of operation
RELU: Whether this operation uses ReLU.
BN: Whether this operation uses batch normalization.
STRIDE: A value of 0 represents stride 1; a value of 2 represents stride 2.
HPS READY SIGNAL: When the HPS sets this signal to 1, the FPGA reads all the layer information and starts computing the result.
INPUT ADDRESS: The input data address where FPGA should read from DDR3 SDRAM
OUTPUT ADDRESS: The output data address where FPGA should write to DDR3 SDRAM
KERNEL ADDRESS: The kernel weight address where FPGA should read from DDR3 SDRAM
FPGA READY SIGNAL: When this signal is set 1 by FPGA, HPS will send the next layer’s information to FPGA.
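The parameter list and the two ready signals can be modelled in software as a small sketch. The field names follow the list above, but the operation type codes, the `SimulatedFpga` class, and the rule for who clears each signal are assumptions made for illustration; the actual register layout is the one given in Table 1.

```python
from dataclasses import dataclass

MAX_SIZE = 160        # from the SIZE definition
MAX_CHANNEL = 1024    # from the CHANNEL definition
OP_POINTWISE = 0      # hypothetical type code
OP_DEPTHWISE = 1      # hypothetical type code

@dataclass
class LayerDescriptor:
    size: int
    channel: int
    op_type: int
    relu: bool
    bn: bool
    stride_code: int      # 0 encodes stride 1, 2 encodes stride 2
    input_address: int
    output_address: int
    kernel_address: int

    def validate(self):
        if not 1 <= self.size <= MAX_SIZE:
            raise ValueError("SIZE out of range")
        if not 1 <= self.channel <= MAX_CHANNEL:
            raise ValueError("CHANNEL out of range")
        if self.stride_code not in (0, 2):
            raise ValueError("STRIDE must be encoded as 0 or 2")

    def stride(self):
        return 1 if self.stride_code == 0 else 2

class SimulatedFpga:
    """Software stand-in for the accelerator side of the handshake."""
    def __init__(self):
        self.hps_ready = 0
        self.fpga_ready = 0
        self.descriptor = None
        self.executed = []

    def step(self):
        # When the HPS raises its ready signal, "execute" the layer and answer.
        if self.hps_ready == 1:
            self.executed.append(self.descriptor)
            self.hps_ready = 0
            self.fpga_ready = 1

def run_layers(fpga, layers):
    """HPS-side loop: publish one descriptor, raise HPS READY, wait for FPGA READY."""
    for layer in layers:
        layer.validate()
        fpga.descriptor = layer
        fpga.fpga_ready = 0
        fpga.hps_ready = 1
        while fpga.fpga_ready != 1:
            fpga.step()
    return fpga.executed
```

Validating SIZE and CHANNEL on the HPS before raising the ready signal keeps out-of-range layers from ever reaching the hardware.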
After the controller receives the start signal, it starts the pointwise module if the operation type is pointwise. This is a convolution layer with only a 1x1 kernel, and we apply a weight-sharing method: the module first stores the results for the first channel of the feature map in the buffer and then reads the remaining parameters. On the DE10-Nano development board, however, the on-chip memory is limited, so we split the 80x80 feature map into 64 feature maps of 10x10 and run the pointwise convolution on each one separately. The pointwise convolution structure is shown in Fig. 3. After the computation is finished, the FPGA sends a ready signal to the HPS and waits for the next start signal. The timing diagram is shown in Fig. 4.
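Because a 1x1 convolution treats every pixel independently, splitting the feature map into tiles does not change the result. The sketch below checks this in plain Python; it models only the tiling, not the buffering or the weight-sharing scheme, and the data layout (`fmap[c][y][x]`, `weights[o][c]`) is an assumption.

```python
def pointwise_conv(fmap, weights):
    """1x1 convolution: out[o][y][x] = sum over c of weights[o][c] * fmap[c][y][x]."""
    in_ch = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    return [[[sum(weights[o][c] * fmap[c][y][x] for c in range(in_ch))
              for x in range(width)]
             for y in range(height)]
            for o in range(len(weights))]

def pointwise_conv_tiled(fmap, weights, tile=10):
    """Same computation, processed one tile x tile patch at a time."""
    height, width = len(fmap[0]), len(fmap[0][0])
    out = [[[0] * width for _ in range(height)] for _ in weights]
    tiles = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Cut the same spatial patch out of every input channel.
            patch = [[row[tx:tx + tile] for row in plane[ty:ty + tile]]
                     for plane in fmap]
            result = pointwise_conv(patch, weights)
            # Write the patch result back to its position in the full output.
            for o, plane in enumerate(result):
                for dy, row in enumerate(plane):
                    out[o][ty + dy][tx:tx + len(row)] = row
            tiles += 1
    return out, tiles
```

For an 80x80 input and `tile=10` the loop visits 8 x 8 = 64 patches, matching the split described above.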
After the controller receives the start signal, it starts the depthwise module if the operation type is depthwise. Because every depthwise convolution in our design has a 3x3 kernel, we design 9 multiply-add units to compute the convolution. The structure of the depthwise convolution is shown in Fig. 5. The PE communicates with the SDRAM through the FPGA SDRAM controller and loads the necessary inputs and parameters into the buffer; each data word is 32 bits wide, with a 16-bit fractional part. Reading 9 values takes at least 9 cycles, so we use a FIFO to hold the output buffer until the SDRAM is ready to be written. The timing diagram is shown in Fig. 6. After all computation is finished, the FPGA sends the ready signal to the HPS and waits for the next start signal.
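A software reference for the depthwise computation can help when checking the hardware output. This is a minimal sketch, assuming the input has already been padded (padding runs on the HPS in this design) and using plain nested lists; it does not model the FIFO or the SDRAM timing.

```python
def depthwise_conv3x3(fmap, kernels, stride=1):
    """Depthwise 3x3 convolution on an already padded channel-first input.

    fmap[c][y][x] is the input, kernels[c][ky][kx] holds one 3x3 kernel per
    channel, and each channel is filtered independently of the others.
    """
    out = []
    for plane, kernel in zip(fmap, kernels):
        height, width = len(plane), len(plane[0])
        rows = []
        for y in range(0, height - 2, stride):
            row = []
            for x in range(0, width - 2, stride):
                acc = 0
                # The 9 multiply-adds that the hardware performs per output pixel.
                for ky in range(3):
                    for kx in range(3):
                        acc += plane[y + ky][x + kx] * kernel[ky][kx]
                row.append(acc)
            rows.append(row)
        out.append(rows)
    return out
```

Unlike the pointwise layer, no sum over channels appears here, which is why the parameter count per channel is only 9.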
The machine learning method we choose is the deep neural network (DNN), which has shown its strength in pattern recognition since 2012 and has since been applied to many complex problems such as object detection, person re-identification, and image inpainting. We use the DNN model called "Facenet", proposed by Google in 2015, which uses the triplet loss to train an unsupervised model that extracts human face features.
The Facenet method extracts features from a person's face and gives each person a unique high-dimensional identification feature. By comparing the distance between two features, the similarity of two faces can be computed simply.
To minimize memory usage, we use two convolution structures to optimize our network: the separable convolution layer used in MobileNet and the fire module used in SqueezeNet. The model structure is shown in Fig. 7, and the details of each layer are shown in Fig. 8.
The model design method is as follows:
The parameters of the significant layers are shown in Table 2.
The input normalization method we choose is whitening; the equation is defined below:
This method normalizes the input feature map by its own mean and standard deviation, which can improve the performance of the deep neural network.
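A minimal sketch of per-image whitening follows, assuming the common form y = (x - mean) / std computed over all pixels of one image. The lower bound on the standard deviation is an assumption added here to avoid division by zero on a constant image; it is not stated in the text.

```python
import math

def whiten(pixels):
    """Whiten one image given as a flat list of pixel values.

    Subtracts the image mean and divides by the image standard deviation,
    so the result has zero mean and, for non-constant images, unit variance.
    """
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    # Assumed safeguard: never divide by less than 1 / sqrt(n).
    std = max(math.sqrt(variance), 1.0 / math.sqrt(n))
    return [(p - mean) / std for p in pixels]
```

Because the statistics are computed per image, the same function is applied unchanged to enrollment pictures and to camera frames.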
The similarity of two faces is computed by outputting a 512-dimensional feature for each face image and computing the cosine similarity of the two features. The equation is defined below:
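Cosine similarity has the standard form cos(a, b) = (a · b) / (|a| |b|). The sketch below implements it for plain Python lists; the function name and the error handling are illustrative choices.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("feature vectors must be non-zero")
    return dot / (norm_a * norm_b)
```

The value lies between -1 and 1, and a higher value means the two 512-dimensional features, and therefore the two faces, are more alike.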
The model is trained with the structure shown in Fig. 7 and Fig. 8. The learning rate is initially set to 0.05 and is multiplied by 0.1 every 10^6 iterations. The batch size is 90 and the optimizer is Adam. We add a dropout layer after each of the last two layers, with a keep probability of 0.4. The parameters are initialized with Xavier initialization. The training dataset is VGGFace2. After around 3x10^6 iterations, the model reaches 94.3% accuracy on the VGGFace2 dataset and 99.2% on the LFW dataset.
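The step decay of the learning rate can be written as a one-line function. This sketch assumes the decay is applied as a staircase at exact multiples of 10^6 iterations; the function name and default arguments are illustrative.

```python
def learning_rate(iteration, base_lr=0.05, decay=0.1, step=10**6):
    """Staircase schedule: multiply the base rate by `decay` once per `step` iterations."""
    return base_lr * decay ** (iteration // step)
```

Under this assumption the rate stays at 0.05 for the first 10^6 iterations and has been reduced twice by the time training reaches about 3x10^6 iterations.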
Because of the limited resources on the development board, we do not use a neural network for face detection. Instead, we use dual cameras and a simple parallax method to detect whether an object is located within a specified range in front of the cameras. We use the OpenCV function goodFeaturesToTrack to extract corner points with Shi-Tomasi corner detection, and the calcOpticalFlowPyrLK function to compare the frames from the two cameras. This yields a simple parallax result, which prevents people who are just passing by the camera from triggering the face recognition module.
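Once corner points have been matched between the two camera frames, the range check reduces to a test on their horizontal displacement. The sketch below covers only that last step and assumes the cameras are mounted side by side; the function name, the use of the median, and the pixel thresholds are hypothetical.

```python
from statistics import median

def is_object_in_range(matches, min_disparity, max_disparity):
    """Decide whether the matched points lie within the target distance range.

    `matches` is a list of ((x_left, y_left), (x_right, y_right)) pixel pairs,
    for example corners from one camera and their tracked positions in the
    other. Nearer objects produce larger horizontal disparity.
    """
    if not matches:
        return False
    disparity = median(abs(left[0] - right[0]) for left, right in matches)
    return min_disparity <= disparity <= max_disparity
```

Using the median rather than the mean makes the decision less sensitive to a few badly tracked points.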
Updating the dataset is very simple: we take one face picture and run a forward pass of the network to obtain its feature. After this computation, any future unknown input can be compared with the pre-generated feature without running the computation for the enrolled picture again. No model retraining is needed.
We can use the recognition result to open the door and to record the passing time of each recognized person. We have designed an Ethernet interface that sends this information to a server, which displays the record on a web page or in an application, as Fig. 9 shows.
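The enrollment flow can be sketched as a small in-memory gallery that stores one feature per person. The class name, the method names, and the acceptance threshold are hypothetical, and the feature vectors are assumed to come from the network's forward pass.

```python
import math

def _normalize(vector):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("feature vector must be non-zero")
    return [x / norm for x in vector]

class FaceGallery:
    """Stores one pre-computed feature per person; no retraining involved."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold   # hypothetical acceptance threshold
        self.features = {}

    def enroll(self, name, feature):
        """Add or replace a person using the feature of a single picture."""
        self.features[name] = _normalize(feature)

    def remove(self, name):
        self.features.pop(name, None)

    def identify(self, feature):
        """Return (name, score) of the best match, or (None, score) if below threshold."""
        query = _normalize(feature)
        best_name, best_score = None, -1.0
        for name, stored in self.features.items():
            # Dot product of unit vectors equals their cosine similarity.
            score = sum(q * s for q, s in zip(query, stored))
            if score > best_score:
                best_name, best_score = name, score
        if best_score >= self.threshold:
            return best_name, best_score
        return None, best_score
```

Adding or removing a person changes only the dictionary, so the stored features of everyone else stay valid.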
Fig 9. Access control system product
In this project we have implemented a face recognition system on the DE10-Nano development board. It is combined with a dual-camera object distance detection system, so the recognition system runs only when there is an object near the camera and computes only the detected face. Each face takes around 4 seconds to compute its feature on the DE10-Nano; this could be accelerated considerably with more on-chip memory. Our design retains 94% accuracy on the VGGFace2 dataset and 99.2% on the LFW dataset.
We have also tested different input resolutions for the Facenet model. Although the computation becomes around 4 times faster when the input is set to 80x80, the accuracy on VGGFace2 drops below 90%, which is not the precision we want.