Industrial

SoC design for enhanced keyword spotting applications with eliminated resources

EM042

Paris Kitsos (University of the Peloponnese)

Oct 30, 2021



Project Proposal


1. High-level project introduction and performance expectation

SoC design for enhanced keyword spotting applications with eliminated resources


 

Georgios Flamis, Stavros Kalapothas, Paris Kitsos

Electrical Circuits, Systems and Applications Laboratory, Electrical and Computer Engineering Department, University of Peloponnese, Patras, Greece

https://sites.google.com/view/ecsalab

E-mails: {g.flamis, s.kalapothas}@go.uop.gr, kitsos@uop.gr

 

Date: October 2021

Introduction:

This project aims to deploy a combined Machine Learning (ML) system based on both the FPGA and HPS units of the DE10-Nano platform, enhancing the capabilities of a keyword spotting (KWS) application. The KWS will execute on the HPS as a low-power application with limited resource requirements; the HPS will also handle the required peripherals, such as the microphone input. The design in the FPGA will incorporate neural network (NN) models responsible for enhancing the quality of the speech signal by removing environmental noise. Hence, feature extraction will be more accurate, and the resource demands of the KWS application can be reduced to reach a low-power, small-footprint implementation without compromising detection performance.

Description:

Speech recognition is the process of recording, converting, and interpreting spoken words and sentences. It requires large vocabularies, implemented with millions of parameters and computations. Such requirements make these networks unsuitable for deployment on edge devices, such as microcontrollers (MCUs), small CPUs, or FPGAs. This is addressed with keyword spotting networks, which specialize in recognizing a small subset of words. If the network detects one of the words in question, another application may proceed with additional functionality.
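As a sketch of this trigger-style operation, the short Python example below polls a keyword classifier and hands control to further functionality only when a target word is detected. The classifier, the keyword set, and the handler are hypothetical placeholders, not part of this proposal's implementation.

    # Minimal trigger loop for keyword spotting (classifier is a placeholder).
    KEYWORDS = {"yes", "no", "stop", "go"}  # illustrative subset of words

    def classify_frame(audio_frame):
        """Stand-in for the KWS network; returns a label string."""
        raise NotImplementedError

    def handle_keyword(label):
        """Stand-in for the application triggered by a detection."""
        print(f"keyword detected: {label}")

    def kws_loop(frames):
        for frame in frames:
            label = classify_frame(frame)
            if label in KEYWORDS:          # one of the words in question
                handle_keyword(label)      # proceed with additional functionality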

This work focuses on the design of an enhanced operation for keyword spotting, since KWS networks still require relatively large amounts of computational power when exposed to real-world scenarios in which environmental noise can dominate the microphone recordings. To address this, neural networks can be optimized and accelerated on FPGAs, exploiting the parallel computation capabilities of the device along with its low power consumption. This process results in energy-efficient and accurate neural networks with relatively short classification times, which enhance the quality of the features extracted to feed the keyword spotting application.
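One common optimization of this kind is converting trained weights to fixed-point arithmetic, which maps naturally onto FPGA logic. The NumPy snippet below is a generic post-training quantization sketch offered only as an illustration; the symmetric 8-bit scheme and the layer shape are assumptions, not this project's actual optimization flow.

    import numpy as np

    def quantize_int8(w):
        """Symmetric 8-bit fixed-point quantization (illustrative only)."""
        scale = np.max(np.abs(w)) / 127.0                     # map extremes into int8 range
        q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
        return q, scale                                       # dequantize with q * scale

    w = np.random.randn(64, 64).astype(np.float32)            # stand-in layer weights
    q, scale = quantize_int8(w)
    max_err = np.abs(w - q.astype(np.float32) * scale).max()  # worst-case rounding error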

The enhanced operation will be achieved with a unit for single-channel speech enhancement that aims to separate clean speech when noisy speech is given as input. When using audio signals with deep learning models, it is common practice to transform the time-domain waveform into a time-frequency representation, such as a spectrogram, via the short-time Fourier transform (STFT). Spectrograms are complex-valued matrices, and hence they can be used as inputs for convolutional neural networks.
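A minimal host-side sketch of this transform in Python, assuming SciPy and the 16 kHz sampling rate used later in the topology description; the FFT size and hop length are illustrative choices, not fixed design parameters.

    import numpy as np
    from scipy.signal import stft, istft

    FS = 16_000  # 16 kHz sampling rate

    def to_mag_phase(x, n_fft=512, hop=128):
        """Waveform -> magnitude and phase spectrograms via the STFT."""
        _, _, Z = stft(x, fs=FS, nperseg=n_fft, noverlap=n_fft - hop)
        return np.abs(Z), np.angle(Z)  # complex matrix decomposed

    def from_mag_phase(mag, phase, n_fft=512, hop=128):
        """(Enhanced) magnitude plus phase -> time-domain waveform."""
        Z = mag * np.exp(1j * phase)   # reassemble the complex spectrogram
        _, x = istft(Z, fs=FS, nperseg=n_fft, noverlap=n_fft - hop)
        return x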

 

2. Block Diagram

 

Topology:

The details of the implementation are illustrated in Fig. 1. Starting from the far left, the input signal, mixed with environmental noise, is recorded by a regular microphone and fed to the DE10-Nano through USB or another interface from the plug-in boards offered by Analog Devices. The option of feeding prerecorded sound files is also considered, for increased test coverage. The input data are then driven to the FPGA part, which performs the transform to the time-frequency domain at a 16 kHz sampling rate. The produced spectrograms are complex matrices, which are decomposed into magnitude and phase components. The magnitude and phase analysis that follows performs the speech enhancement by removing the noise from the speech signal. Before leaving the FPGA part, the enhanced signal is reconstructed into a form suitable for the next stage; the aim is to avoid reconstructing the time-domain representation before entering the HPS part of the application. The KWS application will run on the HPS part as an implementation on the ARM processor, and the classification results will be printed through the UART.
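The dataflow described above can be summarized by the host-side Python sketch below. The FPGA and HPS stages are stubs marking where the hardware does the work; the function names are hypothetical, introduced only for this illustration.

    def fpga_enhance(noisy_waveform):
        """FPGA stage: STFT, magnitude/phase enhancement, reconstruction (stub)."""
        raise NotImplementedError  # performed in programmable logic

    def hps_kws(enhanced_signal):
        """HPS stage: keyword spotting on the ARM processor (stub)."""
        raise NotImplementedError

    def pipeline(noisy_waveform):
        enhanced = fpga_enhance(noisy_waveform)  # environmental noise removed on the FPGA
        label = hps_kws(enhanced)                # classification on the HPS
        print(label)                             # results reported through the UART
        return label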

3. Expected sustainability results, projected resource savings

Expected results:

The improvement to the true detection/rejection rates and the false rejection rate of the KWS will be measured as a function of the reduced size and complexity of the implementation when speech enhancement is invoked. The goal is to capture the smallest possible resource footprint for the complete application without performance degradation.

The evaluation will be based on a portion of the TIMIT corpus and the DEMAND database, which are widely used by deep learning models for speech enhancement. Real-time examples will also be used to confirm the correctness of the achieved performance.

The metrics are calculated as follows (a short computation sketch follows the list):

  • True Detection Rate (TDR) = Total True Acceptances / Total Number of Keywords

  • True Rejection Rate (TRR) = Total True Rejections / Total Number of Non-keywords

  • False Rejection Rate (FRR) = Total False Rejections / Total Number of Keywords
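
For concreteness, the three rates can be computed from binary keyword/non-keyword labels as in the short Python sketch below; this is an assumed evaluation harness, not project code.

    import numpy as np

    def kws_rates(y_true, y_pred):
        """TDR, TRR, FRR from binary labels (1 = keyword, 0 = non-keyword)."""
        t = np.asarray(y_true, dtype=bool)     # ground-truth keyword flags
        p = np.asarray(y_pred, dtype=bool)     # detector decisions
        tdr = (t & p).sum() / t.sum()          # true acceptances / total keywords
        trr = (~t & ~p).sum() / (~t).sum()     # true rejections / total non-keywords
        frr = (t & ~p).sum() / t.sum()         # false rejections / total keywords
        return tdr, trr, frr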

 

 

 

Indicative references:

  1. Zhiheng Ouyang et al., "A Fully Convolutional Neural Network for Complex Spectrogram Processing in Speech Enhancement," in ICASSP 2019.

  2. Hongjiang Yu et al., "Deep Neural Network Based Complex Spectrogram Reconstruction for Speech Bandwidth Expansion," in NEWCAS 2020.

  3. Szu-Wei Fu et al., "Complex Spectrogram Enhancement by Convolutional Neural Network with Multi-Metrics Learning," in IEEE Workshop on Machine Learning for Signal Processing (MLSP) 2017.

  4. Shima Tabibian, "A Survey on Structured Discriminative Spoken Keyword Spotting," in Artificial Intelligence Review 2020.

  5. Hyeong-Seok Choi et al., "Phase-Aware Speech Enhancement with Deep Complex U-Net," in ICLR 2019.

  6. A. Karthik et al., "Efficient Speech Enhancement Using Recurrent Convolution Encoder and Decoder," in Wireless Personal Communications 2021.

 

4. Design Introduction

5. Functional description and implementation

6. Performance metrics, performance to expectation

7. Sustainability results, resource savings achieved

8. Conclusion
