AS038 » Sign Language Reader
Real-time translation of American Sign Language into voice. Our approach is unique in that, instead of translating the fingerspelling alphabet, we focus on a small set of common words so that simple conversations can be translated. A CNN-LSTM network captures temporal information, and a language model bridges the gap in grammar and style between English and ASL, helping to reduce errors.
Project Description:
"Our goal is to translate hand gestures in American Sign Language into the English lexicon with a single RGB camera on the OpenVINO Starter Kit platform. Our system integrates deep learning neural networks with statistical learning models, allowing users to incorporate it as a reliable, low-cost subsystem in their own machine learning applications. The inference model achieves high accuracy with a small amount of training data, so it is scalable and customizable to user needs."
American Sign Language (ASL) is a means of communication used by hundreds of thousands of Americans in the Deaf community. However, machine translation from sign language to properly structured text remains an unsolved, challenging task. Sign language is more than the "signs" made by the two hands: to truly understand it, one must attend to facial expression, eyebrow raising, and hand gesture at the same time.
This project does not attempt to solve the entire challenge of sign language translation. Its aim is to design a system that recognizes hand gestures with the following characteristics:
Our algorithm flow is implemented on the OpenVINO Starter Kit, which offers the high performance and low power consumption of an FPGA. The OpenVINO toolkit also supports OpenCV and popular machine learning frameworks; this flexible software context speeds up application development.
The system uses a single USB camera as the external sensor. A YOLOv3-tiny network is trained to detect the hand, and the detected hand patch is fed to the OpenPose CPM network, which outputs a probability map for each of 21 hand keypoints. The maps are then processed by threshold suppression so that only confident keypoints are kept. Both networks are converted into FPGA-compatible formats by the OpenVINO toolkit and run on the OpenVINO Starter Kit.
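The threshold-suppression step can be sketched in a few lines of numpy; the map shape, the function name, and the 0.2 confidence threshold below are illustrative assumptions, not values from our implementation:

```python
import numpy as np

def extract_keypoints(prob_maps, threshold=0.2):
    """Pick the peak of each keypoint probability map, discarding
    peaks whose confidence falls below the threshold.

    prob_maps: array of shape (21, H, W), one map per hand keypoint.
    Returns a list of (x, y) coordinates, or None for suppressed points.
    """
    keypoints = []
    for pmap in prob_maps:
        y, x = np.unravel_index(np.argmax(pmap), pmap.shape)
        if pmap[y, x] >= threshold:
            keypoints.append((int(x), int(y)))
        else:
            keypoints.append(None)  # not confident enough
    return keypoints
```

Only the keypoints that survive suppression are passed downstream as features.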
The 21 keypoint coordinates describe the pattern of a token, the fundamental element we use to infer the lexicon. A Support Vector Machine (SVM) with a histogram intersection kernel is trained on a small dataset collected by our team. The output token stream is encoded, and its pattern is used to infer the most probable lexicon.
The FPGA is parallel by nature: in a single clock cycle it can execute multiple operations at different pipeline stages, which provides sufficient speed to run the inference stage of the machine learning pipeline.
The Jetson TX2, a GPU device for embedded machine learning applications, has a typical power consumption of 10 W, while the FPGA on the OpenVINO Starter Kit, a Cyclone V GX with 301K programmable logic elements, has a static power consumption of 0.72 W. This low power consumption fits our design requirement for a portable device.
The OpenVINO toolkit can be used to develop applications on Intel FPGAs. It includes optimized OpenCV calls and the Model Optimizer, which converts models from machine learning frameworks into a binary format usable by Intel FPGAs, mitigating the difficulty of application development on Intel FPGA devices.
Our application scenario targets two groups of users. The first group is end users: the system should run on portable devices such as the users' smartphones or smart glasses, and it requires millisecond-level inference speed so that it can keep up with the signer. Both requirements are met by choosing Intel FPGA devices.
A GPU consumes too much power, and its microsecond-level speed is faster than our application requires. For the inference stage, an Intel FPGA is the better choice.
The second group is researchers and developers. While there are thousands of ASL lexicons, no ASL dataset has been created and published specifically for machine learning applications. Our design reduces the workload of creating and maintaining a training dataset: most ASL lexicons are composed of a few commonly used signs in sequence, which inspired us to design a system that recognizes individual signs and uses the sign pattern to find the most probable lexicon.
Other researchers and developers can integrate our model as a subsystem into their own applications and define gestures of interest by hard-coding the token transition sequence; no further training is required.
(1) Hand & Keypoint Detection
The YOLOv3 network, with 53 successive 3x3 and 1x1 convolutional layers, is retrained on the Oxford hand dataset and outputs a bounding box containing the hand. In this project, only the right hand is considered.
The bounding box width and height are scaled by 2.2x so that the patch can be processed by the OpenPose hand keypoint detector, which outputs a probability map for each of the 21 hand keypoints. The coordinate of the largest value in each probability map is chosen as the keypoint.
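The 2.2x box expansion before cropping the hand patch can be sketched as follows; the function name and the clamping to image bounds are our additions:

```python
def scale_box(x, y, w, h, img_w, img_h, factor=2.2):
    """Scale a bounding box (top-left x/y, width, height) about its
    center by `factor`, clamped to the image bounds.

    Returns the expanded box as (x0, y0, x1, y1) corner coordinates,
    ready for cropping the hand patch.
    """
    cx, cy = x + w / 2.0, y + h / 2.0          # box center
    new_w, new_h = w * factor, h * factor      # expanded size
    x0 = max(0, int(cx - new_w / 2.0))
    y0 = max(0, int(cy - new_h / 2.0))
    x1 = min(img_w, int(cx + new_w / 2.0))
    y1 = min(img_h, int(cy + new_h / 2.0))
    return x0, y0, x1, y1
```

The expansion gives the keypoint detector context around the detected hand, since a tight detection box often clips the fingertips.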
(2) Sign Recognition
The coordinates of the 21 keypoints are the essential features of a sign. The sign set is a set of N sign elements: S = {s1, s2, ..., sN}.
An SVM with a Radial Basis Function (RBF) kernel is trained to recognize the sign from the 21 keypoints.
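The core kernel computation can be sketched in numpy; the gamma value and the flattening of the 21 (x, y) coordinates into a 42-dimensional vector are illustrative assumptions:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.05):
    """RBF kernel between two flattened keypoint vectors.

    a, b: 42-dimensional vectors (21 keypoints x 2 coordinates).
    Returns exp(-gamma * ||a - b||^2), in (0, 1]; identical hand
    shapes score 1, dissimilar ones decay toward 0.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.exp(-gamma * np.sum((a - b) ** 2))
```

In practice this kernel is supplied to an off-the-shelf SVM trainer rather than evaluated by hand; the sketch shows only how keypoint similarity enters the classifier.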
Sign Recognition demonstration
(3) Direction Encoding
There are nine direction codes: 1 to 8 encode the moving direction in 45-degree sectors counted counter-clockwise from the positive x-axis, and the code is 0 when there is no motion.
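The encoding can be sketched with atan2; the motion threshold eps and the assumption that y grows upward (flip dy for image coordinates) are ours:

```python
import math

def encode_direction(dx, dy, eps=1.0):
    """Encode a motion vector into one of nine codes:
    0 = no motion (magnitude below eps), 1..8 = 45-degree sectors
    counted counter-clockwise from the positive x-axis.
    """
    if math.hypot(dx, dy) < eps:
        return 0
    angle = math.atan2(dy, dx) % (2 * math.pi)  # normalize to [0, 2*pi)
    return int(angle // (math.pi / 4)) + 1      # 45-degree sector index
```

The motion vector itself would come from the displacement of a reference keypoint (e.g. the wrist) between consecutive frames.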
Direction encoding demonstration
(4) Lexicon Pattern
Let C, Cs, and Cd be cost functions measuring the difference between tokens, signs, and directions, respectively.
The lexicon L* of minimum cost over the recent window is then chosen:
L* = argmin_L sum over tau in (t-w, t] of C(sample(tau), L)
where sample(t) is the input sequence varying with time and w is the window size.
Hello & Goodbye demonstration
* Due to the time limit, we did not deploy and test the model on the Jetson kit; the comparison data will be added in future work.
Sign Recognition Accuracy: 98.65%
Model Interpretability Check: remove the training data of class 'V', retrain, and then ask the model to classify signs of 'V'.
The model misclassifies the four 'V' test cases as one 'L' and three 'U's. 'V', 'U', and 'L' do have very similar spatial alignments of keypoints, so without training data for 'V', the system falls back to the most similar sign.
The keypoint coordinates of 'V', 'U', 'L'
The system is composed of two parts: host and device. Video data flows from the host to the OpenVINO Starter Kit through PCI Express, is processed on the FPGA device, and the result is sent back to the host for visualization.