
AP047 » Realtime sign language translation to speech using DNN
📁 Machine Learning
👤 Ramith Hettiarachchi (University of Moratuwa)
📅 Oct 11, 2019
Regional Final





Description

A considerable number of people worldwide reportedly suffer from speech disorders such as muteness, apraxia (childhood/acquired) and aphasia. These may occur due to brain damage, stroke, head injury, tumors, or other illnesses that affect the brain, vocal cords, mouth, or tongue. Our device mainly focuses on the community whose conditions cannot be cured by speech-language pathologists.

Existing solutions include image-processing techniques, where video frames are processed and decoded to text, and text-to-speech converters, which turn typed text into electronic vocalizations. However, devices that use cameras must adapt to varying and sudden lighting conditions while maintaining the same high quality, which can reduce accuracy in such conditions. The text-to-speech method also obstructs eye-to-eye contact during conversation, so it does not give the user a natural conversation experience.

To address these problems and give users a real-time communication experience, we propose a system designed to recognize gestures (sign language / fingerspelling) using electromyography (EMG) and inertial measurement unit (IMU) sensors. The DE10-Nano kit will receive these sensor readings, and a pre-trained deep neural network (DNN) will be used for inference. Since inference is done on the FPGA board itself, the need for a separate computational device is eliminated, making our device a portable, real-time processing device. Ultimately this will enable people in need to communicate in an efficient way.

The output is given in both voice and text formats, and it will support up to five spoken languages: English, Chinese, French, Hindi, and Arabic. An Arduino Nano is used to interface the output devices: a speaker and an HC-05 Bluetooth module. The transcript of the translated sign language can be viewed on a mobile device connected via Bluetooth.

Possible future extensions of this work include adding other sign languages as input and supporting additional spoken languages as output beyond the five built-in languages. Community support will be very useful in scaling this to multiple sign languages and spoken outputs.

Demo Video

  • URL: https://www.youtube.com/watch?v=fX6Gb8bn6kc

  • Project Proposal

    1. High-level Project Description

    A considerable number of people worldwide reportedly suffer from speech disorders such as muteness, apraxia (childhood/acquired) and aphasia. These may occur due to brain damage, stroke, head injury, tumors, or other illnesses that affect the brain, vocal cords, mouth, or tongue. Exclusion from communication can have a significant impact, especially in day-to-day interactions with hearing and speaking people. Additional social issues such as feelings of loneliness, isolation, and frustration may also arise. Our device mainly focuses on the community whose conditions cannot be cured by speech-language pathologists.

    Existing solutions include image-processing techniques, where video frames are processed and decoded to text, and text-to-speech converters, which turn typed text into electronic vocalizations. However, devices that use cameras must adapt to varying and sudden lighting conditions while maintaining the same high quality, which can reduce accuracy in such conditions. The text-to-speech method also obstructs eye-to-eye contact during conversation, so it does not give the user a natural conversation experience.

    To address these problems and give users a real-time communication experience, we propose a system designed to recognize gestures (sign language / fingerspelling) using electromyography (EMG) and inertial measurement unit (IMU) sensors. The DE10-Nano kit will receive these sensor readings, and a pre-trained deep neural network (DNN) will be used for inference. Since inference is done on the FPGA board itself, the need for a separate computational device is eliminated, making our device a portable, real-time processing device. Ultimately this will enable people in need to communicate in an efficient way.

    The output is given in both voice and text formats, and it can be expanded to support up to five spoken languages in addition to English (e.g. Chinese, French, Hindi, and Arabic). An Arduino Nano is used to interface the output devices: a speaker and an HC-05 Bluetooth module. The transcript of the translated sign language can be viewed on a mobile device connected via Bluetooth.

    Possible future extensions of this work include adding other sign languages as input and supporting additional spoken languages as output beyond the five built-in languages. Community support will be very useful in scaling this to multiple sign languages and spoken outputs.

    2. Block Diagram

    3. Intel FPGA Virtues in Your Project

    I/O Expansion

    Our proposed system incorporates an HC-05 low-power Bluetooth module in order to communicate with the 2x MYO armbands. The DE10-Nano will be configured to communicate through the UART protocol. Furthermore, the Arduino expansion header on the DE10-Nano will be used to display the decoded text on an OLED display, and the speaker will be driven directly by the DE10-Nano board.

    Scalability

    Even though we propose to deploy a deep neural network trained on data collected from many users, per-user modifications and fine-tuning can be done over time. For example, the initially trained model may not carry the optimal parameters to describe a particular user's gestures. Therefore, based on the user's input, we can perform a calibration stage so that future readings are more accurate for that user.

    Boosts Performance

    There has been previous work on gesture recognition using mobile phones. However, due to limited processing power, those methodologies have been restricted to traditional machine learning approaches such as SVM, Naive Bayes, Random Forest, and MLP.

    With the DE10-Nano FPGA, we will have accelerated hardware to perform real-time inference using our deep learning model. Ultimately this will enable much better precision in decoding sign language gestures.

    Furthermore, since we will be receiving data from 2x MYO armbands, the data streams need to be fed to the inference engine without exceeding the latency threshold. This is effectively addressed through parallel processing on the FPGA.

    (MYO armband: courtesy of developerblog.myo.com)

     

    4. Design Introduction

    Our device mainly focuses on the community whose conditions cannot be cured by speech-language pathologists. The target group includes people suffering from speech disorders such as muteness, apraxia (childhood/acquired), and aphasia. They often find it very challenging to convey their messages to people who do not understand sign language.

    The purpose of this design is to build a mobile system powerful enough to translate sign language gestures to text and speech. Existing systems that use a trained deep neural network rely on a separate computer to perform the computations, which becomes an issue in practical applications. Therefore, the DE10-Nano plays a major role in delivering accelerated performance while keeping the device in a small form factor.

    Why Intel FPGA?

    There has been previous work on gesture recognition using mobile phones. However, due to limited processing power, those methodologies have been restricted to traditional machine learning approaches such as SVM, Naive Bayes, Random Forest, and MLP.

    However, with the DE10-Nano FPGA, we will have accelerated hardware to perform real-time inference using our deep learning model. The calculations required for neural network inference and data preprocessing can be done in parallel and can be pipelined, and we can create dedicated hardware blocks for these calculations on the FPGA. Ultimately this will enable much better precision in decoding sign language gestures.

    5. Function Description

    The IMU and EMG data will be extracted from 2 MYO armbands, each of which has the following specifications:

    • Eight EMG electrodes
    • A nine-axis IMU composed of a three-axis accelerometer, a three-axis gyroscope, and a three-axis magnetometer
    • A vibration motor used to alert the user

    These floating-point values will be fed to the inference engine (a trained DNN), which will then classify the gesture based on its probability scores.

    The selected gesture will be transmitted as an integer to the Arduino Nano via UART, which will then display characters on the OLED screen and/or convert the text to speech.
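
    A minimal sketch of this last step on the host side is shown below, assuming pyserial; the class list order, serial port, and baud rate are placeholders rather than the exact configuration of our device.

        import numpy as np
        import serial  # pyserial

        GESTURES = ["THANKYOU", "WATER", "NO", "YES", "YELLOW"]

        def send_prediction(probs, port="/dev/ttyS1", baud=9600):
            # pick the most probable gesture and send its class index as a single byte
            idx = int(np.argmax(probs))
            with serial.Serial(port, baud, timeout=1) as link:
                link.write(bytes([idx]))
            return GESTURES[idx]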

    Intended Features from DE-10 Nano as the Main Computational Device

    • Sensor fusion of IMU and EMG data from 2x MYO armbands

    • Ability to convert gestures to text (OLED).

    • Convert sign language gestures to speech.

    • Text to speech language can be customized.

    • Sending a stream of text to a remote computer.

    • Sending Urgent/Emergency commands.

    • Filter out general hand movements from sign commands

    Implementation

    Data Collection and Preparation

    Using an online dataset raises questions about how the data was collected and prepared. Therefore, we decided to collect our own data and prepare our own dataset. We collected data for 5 American Sign Language gestures using the MYO armband; the gestures are:

    • 'THANKYOU'
    • 'WATER'
    • 'NO'
    • 'YES'
    • 'YELLOW'

    Each class had 25-28 examples. Data was collected from one person aged 22. We then decided to use the envelopes of these signals for training instead of the raw sensor readings because, compared to the sampling rate of the EMG and IMU signals, sign language gestures are performed at a much slower pace, so the raw signals contain a considerable amount of noise. We therefore rectified and low-pass filtered all signals, and to obtain the envelope of the filtered signals we applied the Hilbert transform. The top subplot in the following figure shows the recorded rectified signal in blue, the low-pass filtered signal in orange, and the final output signal after the Hilbert transform in black. The final output signal and the low-pass filtered signal are plotted in black and red in the second subplot. (The two humps appear because two gestures are contained in this signal.)
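
    The following is a minimal sketch of this preprocessing for a single channel using the SciPy stack; the sampling rate, filter order, and cutoff frequency are placeholder values rather than the exact settings we used.

        import numpy as np
        from scipy.signal import butter, filtfilt, hilbert

        def envelope(raw, fs=200.0, cutoff=5.0):
            """Rectify, low-pass filter and Hilbert-transform one sensor channel."""
            rectified = np.abs(raw)                           # full-wave rectification
            b, a = butter(4, cutoff / (fs / 2), btype="low")  # 4th-order Butterworth low-pass
            smoothed = filtfilt(b, a, rectified)              # zero-phase filtering
            return np.abs(hilbert(smoothed))                  # magnitude of the analytic signal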

    Our dataset is available in our GitHub repository (the link to our repository is included at the end of this section).

    Feature Extraction

    After getting the envelopes of the signals, we plotted the correlation matrices of the EMG signals for three different gestures.

    Two correlation matrices for the 'YES' gesture

    Two correlation matrices for the 'WATER' gesture

    Two correlation matrices for the 'NO' gesture

    After observing these plots, we concluded that using all 8 EMG signals to train a model would be unnecessary. Therefore, by analyzing these correlation matrices, we decided to use only 5 EMG signals: EMG signals 1, 2, 4, 6, and 7.
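
    A minimal sketch of this analysis is shown below; the array of channel envelopes is a random stand-in for one recorded gesture.

        import numpy as np

        env = np.random.rand(8, 1000)   # stand-in for the envelopes of the 8 EMG channels of one gesture
        corr = np.corrcoef(env)         # 8x8 correlation matrix between channels
        print(np.round(corr, 2))
        # Channel pairs with |r| close to 1 carry largely redundant information,
        # which is the reasoning behind keeping only channels 1, 2, 4, 6 and 7.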

    Next we experimented with 6 features to capture the characteristics of the signal envelopes (i.e. the information contained in the patterns of the signals); a sketch of how these features can be computed is given after the list. The features are:

    • Energy
    • Slope Sign Change
    • Maximum Value
    • Skewness 
    • Kurtosis
    • Auto-Regressive Coefficients (order = 4)
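
    Below is a minimal sketch of these feature computations for one envelope signal; the auto-regressive coefficients are fitted here with a simple least-squares formulation, which may differ from the exact estimator we used.

        import numpy as np
        from scipy.stats import skew, kurtosis

        def slope_sign_changes(x):
            # number of times the slope of the signal changes sign
            return int(np.sum(np.diff(np.sign(np.diff(x))) != 0))

        def ar_coefficients(x, order=4):
            # least-squares fit of x[t] ~ sum_k a[k] * x[t-k]
            X = np.column_stack([x[order - k - 1:len(x) - k - 1] for k in range(order)])
            y = x[order:]
            coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
            return coeffs

        def extract_features(x):
            return {
                "energy": float(np.sum(x ** 2)),
                "slope_sign_change": slope_sign_changes(x),
                "max": float(np.max(x)),
                "skewness": float(skew(x)),
                "kurtosis": float(kurtosis(x)),
                "ar_coeffs": ar_coefficients(x, order=4),
            }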

    We built 3 Support Vector Machine (SVM) classifiers (a training and scoring sketch follows the list), where:

    • SVM1 - used only EMG signals for training
    • SVM2 - used only signals from gyroscope for training
    • SVM3 - used only signals from accelerometer for training
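
    A minimal scikit-learn sketch of how each of these classifiers could be trained and scored is shown below; the variable names (X_emg, X_gyro, X_acc, y), the kernel, and the fold count are illustrative assumptions rather than our exact configuration.

        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC

        def evaluate_svm(X, y, k=5):
            # scale the features, then fit an RBF-kernel SVM and report k-fold accuracy
            clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
            return cross_val_score(clf, X, y, cv=k).mean()

        # X_emg, X_gyro, X_acc: feature matrices built from each signal group; y: gesture labels
        # for name, X in [("SVM1 (EMG)", X_emg), ("SVM2 (gyro)", X_gyro), ("SVM3 (acc)", X_acc)]:
        #     print(name, evaluate_svm(X, y))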

    After experimenting with the features on SVM1, we chose the following features to represent the EMG signals, as they resulted in the highest classification accuracy:

    • Energy
    • Slope Sign Change
    • Skewness
    • Kurtosis

    This SVM resulted in an accuracy of 86%. The PCA plot of this combination of features and the performance metrics of the SVM are shown below:

    We followed the same procedure in training SVM2 and ended up with the following features for the gyroscope signals:

    • Energy
    • Maximum Value
    • Slope Sign Change

    This resulted in the following PCA plot and performance metrics on SVM2:

    A similar procedure was used in training SVM3, which used only accelerometer signals, but the maximum accuracy it was able to obtain did not reach the level of the previous two SVMs. The PCA plot obtained from the best-performing features also showed that the accelerometer data alone was unable to cluster the classes separately, at least not to the level of the previous two feature sets. The PCA plot and the classification metrics are as follows:

    After this analysis we decided to use only the EMG and gyroscope signals to train a model, and the following set of features was used:

    • EMG signals - Energy, Slope Sign Change, Skewness, Kurtosis
    • Gyroscope signals - Energy, Slope Sign Change, Maximum Value

    An SVM that used the above set of features produced a K-fold validation accuracy of 96%. However, when we observed the above plots, we decided that a simple SVM would have difficulty building correct classification boundaries while capturing the complex patterns in the data. Also, as the number of gestures increases, we identified the need for a model that can be expanded using the existing combination of features. Therefore, we decided to build a neural network for the classification process.

    Model Development

    Using the above features we built the following neural network:

    The hidden layer uses the 'relu' activation and the output layer uses the 'softmax' activation to produce the probabilities for each output node. We obtained a training accuracy of 98.84%, a validation accuracy of 95.45%, and a test accuracy of 92.86%. Even though this neural network was sufficient to classify the five gestures, we will be able to expand it as we include more American Sign Language gestures.
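
    For reference, a Keras sketch of a network with this shape is given below. The input width (29 = 5 EMG channels x 4 features + 3 gyroscope axes x 3 features), the hidden-layer size, the optimizer, and the training settings are assumptions for illustration rather than the exact values of our final model.

        from tensorflow import keras

        n_features, n_classes = 29, 5    # assumed input width; five gesture classes

        model = keras.Sequential([
            keras.layers.Input(shape=(n_features,)),
            keras.layers.Dense(32, activation="relu"),            # hidden layer, ReLU activation
            keras.layers.Dense(n_classes, activation="softmax"),  # per-class probabilities
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",     # integer gesture labels
                      metrics=["accuracy"])

        # X_train, y_train: feature vectors and integer gesture labels
        # model.fit(X_train, y_train, epochs=200, validation_split=0.15)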

    The neural network inference was run on the FPGA board itself, and it was coded using OpenCL.

    The exact output control logic could be implemented through the DE10-Nano board itself; however, because outputting the decoded text is not time-critical, we make use of an intermediate device (the Arduino Nano).

    Neural Network Implementation Using OpenCL C++

    Neural network inference and data preprocessing involve many large calculations, such as multiplications of matrices with a large number of elements. Since our design requires real-time inference, these calculations must be performed with high throughput. We chose OpenCL as the tool to implement these calculations on the FPGA; OpenCL supports techniques such as loop unrolling and pipelining, which allow the calculations to be done in parallel.

    In the neural network each layer is fed with an input vector A (input features or the outputs of the previous layer) and it produces an output vector Z. The output vector of each layer is calculated as,

    Z = WA + b

    where W is the weights matrix and b is the bias vector. The outputs of these layers are then passed through an activation function before being forwarded to the next layer. We used two activation functions in our design, namely the 'ReLU' activation and the 'Softmax' activation. The ReLU activation function is given by,

    Ai = max(0, Zi)

    where Ai is the ith element of the output vector and Zi is the ith element of the input vector.

    The Softmax activation function is given by,

    Ai = exp(Zi) / (exp(Z1) + exp(Z2) + ... + exp(Zn))

    where Ai is the ith element of the output vector, Zi is the ith element of the input vector, and n is the number of elements in the input vector.

    In our design we have three separate kernels for the above three calculations. The kernel for calculating the layer output involves multiplying a matrix by a vector and adding another vector. In this kernel, each element of the resulting vector is calculated by a dedicated work-item, as shown in the figure below.

    As shown in the above figure, each work-item in the device accesses one row of the weights matrix and the input vector and computes their dot product. It then accesses the corresponding element of the bias vector and adds it to the previous result to produce a single element of the output vector.
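
    For reference, here is a minimal host-side sketch of this kernel structure, written with pyopencl for brevity; the kernel shown is a simplified stand-in for our actual implementation (which is built with the Intel FPGA SDK for OpenCL), and the layer dimensions are illustrative.

        import numpy as np
        import pyopencl as cl

        KERNEL = """
        __kernel void layer_forward(__global const float *W,  // weights, flattened row-major (rows x cols)
                                    __global const float *A,  // input vector (cols)
                                    __global const float *b,  // bias vector (rows)
                                    __global float *Z,        // output vector (rows)
                                    const int cols)
        {
            int row = get_global_id(0);         // one work-item per output element
            float acc = 0.0f;
            for (int k = 0; k < cols; k++)      // dot product of one weight row with the input
                acc += W[row * cols + k] * A[k];
            Z[row] = acc + b[row];              // add the corresponding bias element
        }
        """

        ctx = cl.create_some_context()
        queue = cl.CommandQueue(ctx)
        prg = cl.Program(ctx, KERNEL).build()
        mf = cl.mem_flags

        rows, cols = 16, 29                     # illustrative layer dimensions
        W = np.random.rand(rows, cols).astype(np.float32)
        A = np.random.rand(cols).astype(np.float32)
        b = np.random.rand(rows).astype(np.float32)
        Z = np.empty(rows, dtype=np.float32)

        W_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=W)
        A_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=A)
        b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
        Z_buf = cl.Buffer(ctx, mf.WRITE_ONLY, Z.nbytes)

        prg.layer_forward(queue, (rows,), None, W_buf, A_buf, b_buf, Z_buf, np.int32(cols))
        cl.enqueue_copy(queue, Z, Z_buf)
        assert np.allclose(Z, W @ A + b, atol=1e-4)  # check against the NumPy reference

    The loop unrolling and row pre-loading optimizations applied to this dot-product loop are evaluated under Performance Parameters below.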

    Bluetooth Communication with DE10-Nano

    In our design, data from the MYO armband is transmitted to the DE10-Nano board via a Bluetooth module. A UART module was created using Quartus Prime, mainly for data receiving. The module was implemented according to the UART protocol and receives 8 bits at a time. The receiving part of the module consists of a simple state machine with two states, 'IDLE' and 'READ'.

    In the 'IDLE' state, the receiver line is kept at a high level (3.3 V). An incoming byte starts with a low start bit; when the receiver detects this start bit, it changes its state to 'READ', and from then onwards 8 bits are read. Apart from the clock, another pulse stream called 'tick' is used to sample these bits. The 'tick' stream has a frequency 16 times the baud rate of the transmission, which allows each bit to be sampled at the middle of its pulse. When the start bit is detected (a negative edge from the 'IDLE' state),

    1. The counter counts 8 ticks to read the start bit.
    2. Then the counter counts 16 ticks to read each of the 8 data bits.
    3. Finally the counter counts 16 ticks to read the stop bit (which should be high).

    After reading these 8 bits, the receiver changes its state back to 'IDLE'.
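
    The following Python model is a behavioural sketch of the receive sequence described above (it is not the HDL module itself); it assumes a list of line samples taken once per 'tick' and containing only complete frames.

        def uart_rx(samples):
            """Decode bytes from line samples taken at 16x the baud rate (idle level = 1)."""
            out, i = [], 0
            while i < len(samples):
                if samples[i] == 1:                 # 'IDLE' state: wait for the start-bit edge
                    i += 1
                    continue
                i += 8                              # 8 ticks -> middle of the start bit
                if samples[i] == 0:                 # confirmed start bit -> 'READ' state
                    bits = []
                    for _ in range(8):              # 16 ticks per data bit, sampled mid-bit
                        i += 16
                        bits.append(samples[i])
                    i += 16                         # 16 ticks -> middle of the stop bit
                    if samples[i] == 1:             # stop bit must be high
                        out.append(sum(bit << k for k, bit in enumerate(bits)))  # LSB first
                i += 1                              # return to 'IDLE' and keep scanning
            return out

        # Example: idle line, one frame carrying 0xA5 (LSB first), then idle again
        frame = [1] * 8 + [0] * 16 + sum(([b] * 16 for b in [1, 0, 1, 0, 0, 1, 0, 1]), []) + [1] * 24
        assert uart_rx(frame) == [0xA5]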

     

    6. Performance Parameters

    The performance of our device can be evaluated on two metrics: first, the overall accuracy of the translation from sign language to speech/text, and second, the efficiency of the system in terms of speed, energy consumption, and memory usage.

    A study [1] shows that traditional machine learning approaches using SVM, Naive Bayes, Random Forest, and MLP achieved average accuracies of 81.38%, 90.66%, 80.47%, and 90.45%, respectively. Our aim is to move away from a traditional machine learning approach and harness the power of a DNN to achieve higher accuracy.

    [1] P. Paudyal, A. Banerjee, and S. K. S. Gupta, "SCEPTRE: A pervasive, non-invasive, and programmable gesture recognition technology," in Proceedings of the 21st International Conference on Intelligent User Interfaces, ACM, 2016, pp. 282-293.

    Inferencing Performance Comparison

    Device                   Language     Average Inferencing Time (ms)
    Intel Core i7-7500U      Python       725.200
    Intel UHD Graphics 620   OpenCL C++   52.143
    DE10-Nano FPGA           OpenCL C++   **(1)

    **(1) - Due to an error persisting in our DE10-Nano board, we could not evaluate the performance yet. We are currently corresponding with Terasic to resolve this issue.

    Kernel Performance Evaluation

    In our design, the kernel used to calculate the output of a layer uses a single for loop for the dot product. However, this loop could not be fully unrolled since the loop bounds are unknown at compile time. We performed loop unrolling with different unrolling factors and evaluated the latency of the dot-product loop along with the area utilization.

    Unrolling Factor    Latency of the pipelined loop    ALUTs    Flip-Flops    RAMs    DSPs
    1 (no unrolling)    171                              25%      15%           21%     5%
    2                   202                              32%      19%           22%     9%
    4                   272                              44%      24%           24%     16%
    8                   417                              70%      37%           36%     30%

    Then we tried pre-loading a row of the weights matrix before going through the dot-product loop, instead of accessing global memory at each iteration of the loop.
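
    A sketch of how the inner loop of the earlier layer_forward kernel changes with this optimization is shown below; MAX_COLS and the unroll factor are assumptions for illustration.

        # OpenCL C fragment (shown as a string, in the same style as the earlier host sketch)
        PRELOADED_LOOP = """
            float w_row[MAX_COLS];             // private copy of one weight row
            for (int k = 0; k < cols; k++)
                w_row[k] = W[row * cols + k];  // single pass over global memory
            float acc = 0.0f;
            #pragma unroll 4
            for (int k = 0; k < cols; k++)     // unrolled loop now reads private memory
                acc += w_row[k] * A[k];
            Z[row] = acc + b[row];
        """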

    Unrolling Factor    Latency of the pipelined loop    ALUTs    Flip-Flops    RAMs    DSPs
    1 (no unrolling)    196                              21%      13%           20%     3%
    4                   199                              33%      17%           23%     5%

     

    7. Design Architecture

    Hardware Design Block Diagram

    The following figure shows the RTL view of the matrix multiplication system. This was generated using the Intel FPGA SDK for OpenCL.

     

    The following diagram shows the system connections of the matrix multiplication system.

    Software Flow

    The following diagram shows the top level software flow of our design.

    The next diagram shows the flow inside a Layer Calculation block.

     



    8 Comments

    Pravilasha Ramakrishnan
    Good concept,all the best
    🕒 Jul 06, 2019 03:16 AM
    AP047🗸
    Thank You! :)
    🕒 Jul 06, 2019 11:07 AM
    Aba Gn
    Good concept! Best of luck!
    🕒 Jul 05, 2019 02:13 AM
    AP047🗸
    Thank You! :)
    🕒 Jul 05, 2019 05:36 AM
    Amaya Dharmasiri
    Great stuff!! good luck
    🕒 Jul 05, 2019 12:48 AM
    AP047🗸
    Thank You! :)
    🕒 Jul 05, 2019 12:58 AM
    AP047 🗸
    Hi,
    We are using the MYO's build in Bluetooth, and for the FPGA's end only, intermediate modules - HC05 is used.
    🕒 Jul 03, 2019 05:20 PM
    Vihanga
    Why not use MYO's inbuilt BLE?
    🕒 Jul 03, 2019 05:05 PM
