AS015 » Keyword Spotting
This project implements an FPGA-based acoustic keyword spotting (KWS) system for the Portuguese language. The system performs real-time processing, using MFCC extraction as the pre-processing stage and a convolutional neural network (CNN) as the classifier.
Figure 1 gives a high-level description of the proposed work. First, the acoustic KWS topology is defined, consisting of an MFCC extractor and a CNN. Next, a speech database in Brazilian Portuguese is built. Then, Python code is developed to train the CNN model and to generate the parameters of the MFCC extractor. Finally, the FPGA implementation is carried out, comprising the blocks Framing, MFCCs Extractor, CNN, and Decisor, as shown in Figure 1.
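As an illustration of the parameter-generation step, the mel filterbank coefficients that an MFCC extractor needs could be computed offline in Python roughly as below. This is a minimal sketch, not the project's actual code; the 8 kHz sample rate and 256-point FFT match the figures stated later in this document, while the 26-filter count is an assumption.

```python
import numpy as np

def hz_to_mel(f):
    # Standard HTK mel-scale conversion
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=26, n_fft=256, sr=8000):
    """Triangular mel filters mapping an FFT energy spectrum
    (n_fft//2 + 1 bins) onto n_mels mel bands."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fbank = mel_filterbank()
print(fbank.shape)  # (26, 129)
```

The resulting coefficient matrix would then be stored in the FPGA's internal memory for use by the MFCCs Extractor block.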
Figure 1 - Acoustic KWS High-level Description.
Figure 2 presents the block diagram of the FPGA-based acoustic keyword spotting implementation.
Figure 2 - FPGA-based Acoustic KWS implementation.
As shown in Figure 2, the acoustic KWS implementation is composed of the blocks Framing, MFCCs Extractor, CNN, and Decisor. Figure 3 presents the Framing block diagram.
Figure 3 - Framing block diagram.
Figures 4 and 5 show the block diagrams of the MFCCs Extractor and the CNN, respectively.
Figure 4 - MFCCs Extractor Block Diagram.
Figure 5 - CNN Block Diagram.
The main Intel FPGA resources considered in the development of this project are:
- Parallelism techniques;
- Pipeline techniques;
- Number of MLABs (FPGA internal memory blocks) used;
- Number of 9x9 DSP blocks used;
- Number of logic blocks used.
These features are available in Intel FPGAs (Cyclone V family) and allow complex digital signal processing algorithms (MFCC and CNN) to be implemented in a single device. Furthermore, the KWS system is intended to run with real-time processing on a DE10-Nano kit.
The purpose of this design is to create a high-performance keyword classifier using neural networks on an FPGA. It may therefore facilitate future FPGA implementations of neural networks and audio-processing classifiers.
This project could be used by R&D companies that want to run neural networks on FPGAs in order to process them quickly. The parallel processing and pipelining techniques offered by FPGA devices provide a fast path for deep neural networks, as needed, for example, in autonomous cars.
This project could also support internal research in Brazil, providing accessibility for people with physical disabilities through a speech-command-based system that performs actions in their daily routine.
This project has certainly benefited from the Intel FPGA device, which was chosen for its logic resources, high speed grades, and its internal DSP and M10K memory blocks. Without these features, it would be impossible to implement this high-performance keyword spotting system using neural networks.
This project receives as input digital audio at a 48 kHz sample rate, which is decimated to 8 kHz. The decimated audio is then sent to an MFCCs extractor, which extracts 15 MFCCs per 256 audio samples. These MFCCs are fed into a convolutional neural network, which classifies the keyword spoken in the audio. The output is used as a trigger to activate one of three memories, which respond to the keyword spoken at the audio input.
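The front-end steps just described, 6x decimation from 48 kHz to 8 kHz followed by framing into 256-sample blocks, can be sketched behaviorally in Python. The windowed-sinc anti-aliasing filter below is an illustrative assumption, not the filter actually designed for the project:

```python
import numpy as np

FS_IN, FS_OUT = 48_000, 8_000
M = FS_IN // FS_OUT          # decimation factor: 6
FRAME_LEN = 256              # samples per frame at 8 kHz

def decimate(x, m=M, taps=63):
    """Low-pass filter (windowed-sinc FIR) then keep every m-th sample."""
    n = np.arange(taps) - (taps - 1) / 2
    h = np.sinc(n / m) / m * np.hamming(taps)   # cutoff near fs_in / (2*m)
    y = np.convolve(x, h, mode="same")
    return y[::m]

def frame(x, frame_len=FRAME_LEN):
    """Split the 8 kHz stream into non-overlapping 256-sample frames."""
    n_frames = len(x) // frame_len
    return x[: n_frames * frame_len].reshape(n_frames, frame_len)

audio_48k = np.random.randn(48_000)   # one second of dummy audio
audio_8k = decimate(audio_48k)        # 8000 samples
frames = frame(audio_8k)              # (31, 256)
print(frames.shape)
```

In the hardware, the equivalent of `frame` is the Framing block of Figure 3, and the decimation would be performed before the MFCCs Extractor.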
To implement this project, the neural network must first be trained in order to obtain its weights and depth. It is also necessary to define the MFCC hyper-parameters: FFT length, hop length, number of mel filters, and number of MFCCs.
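To make these hyper-parameters concrete, a minimal numpy sketch of the MFCC computation for one 256-sample frame is shown below. The 256-point FFT and the 15 retained MFCCs match the figures stated above; the 26 mel filters and the crude evenly spaced filterbank are placeholders (in the project, a mel-spaced filterbank would be generated offline together with the CNN weights):

```python
import numpy as np

N_FFT = 256      # FFT length (one frame, per the text above)
N_MELS = 26      # number of mel filters (assumed; not stated in the text)
N_MFCC = 15      # MFCCs kept per frame, as stated above
SR = 8_000       # sample rate after decimation

# Crude evenly spaced triangular filterbank used only as a placeholder.
bins = np.linspace(0, N_FFT // 2, N_MELS + 2).astype(int)
fbank = np.zeros((N_MELS, N_FFT // 2 + 1))
for m in range(N_MELS):
    left, center, right = bins[m], bins[m + 1], bins[m + 2]
    fbank[m, left:center] = np.linspace(0, 1, center - left, endpoint=False)
    fbank[m, center:right] = np.linspace(1, 0, right - center, endpoint=False)

def mfcc_frame(frame, fbank):
    """frame: (N_FFT,) samples; fbank: (N_MELS, N_FFT//2+1) filterbank."""
    windowed = frame * np.hamming(N_FFT)
    spectrum = np.abs(np.fft.rfft(windowed, N_FFT)) ** 2   # energy spectrum
    mel_energies = fbank @ spectrum
    log_mel = np.log(mel_energies + 1e-10)
    # DCT-II of the log mel energies; keep the first N_MFCC coefficients.
    n = np.arange(N_MELS)
    dct = np.cos(np.pi * np.outer(np.arange(N_MFCC), 2 * n + 1) / (2 * N_MELS))
    return dct @ log_mel                                    # (N_MFCC,)

coeffs = mfcc_frame(np.random.randn(N_FFT), fbank)
print(coeffs.shape)  # (15,)
```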
An audio codec is needed to receive the input audio and send the output audio. In this work, the AD1836 codec was used to receive the digital audio input, which requires SPI and I2S communication. To output the audio, a low-pass sigma-delta modulator was used, which is easily implemented in an FPGA.
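The low-pass sigma-delta modulator mentioned above can be illustrated with a first-order behavioral model in a few lines of Python. This is only a sketch of the principle (a 1-bit stream whose local average tracks the input), not the RTL used in the project:

```python
import numpy as np

def sigma_delta(x):
    """First-order sigma-delta modulator: input in [-1, 1],
    output a 1-bit stream whose local average tracks the input."""
    out = np.empty(len(x), dtype=np.int8)
    integrator = 0.0
    feedback = 0.0
    for i, sample in enumerate(x):
        integrator += sample - feedback          # accumulate the error
        bit = 1 if integrator >= 0 else 0        # 1-bit quantizer
        out[i] = bit
        feedback = 1.0 if bit else -1.0          # 1-bit DAC feedback
    return out

bits = sigma_delta(np.full(10_000, 0.25))
print(bits.mean())   # ~0.625, i.e. (0.25 + 1) / 2
```

On the FPGA, the 1-bit output would drive a pin followed by an external analog low-pass (RC) filter to recover the audio.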
After these steps, the four blocks must be implemented in the FPGA, along with the memories that store the responses.
To use this project in real time, the following parameters are needed.
The hardware design is split into two parts: the MFCCs Extractor and the Convolutional Neural Network.
The implementation of the MFCCs extractor uses the following blocks:
The Energy Spectrum Calculator and DCT Calculator blocks use an FFT calculator to perform their computations.
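Sharing a single FFT core between these two blocks is possible because a DCT-II can be computed from one complex FFT of a reordered input (even-index samples first, then the odd-index samples reversed). The numpy sketch below demonstrates the identity against a direct DCT-II summation; it is an illustration of the technique, not the project's RTL:

```python
import numpy as np

def dct2_via_fft(x):
    """Unnormalized DCT-II computed with a single complex FFT."""
    N = len(x)
    v = np.concatenate([x[::2], x[1::2][::-1]])   # reorder the input
    V = np.fft.fft(v)
    k = np.arange(N)
    return np.real(np.exp(-1j * np.pi * k / (2 * N)) * V)

def dct2_direct(x):
    """Reference: unnormalized DCT-II by direct summation."""
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * k * (2 * n + 1) / (2 * N)))
                     for k in range(N)])

x = np.random.randn(26)
print(np.allclose(dct2_via_fft(x), dct2_direct(x)))  # True
```

In hardware, this means the DCT Calculator only needs the reordering logic and a complex twiddle multiplication around the shared FFT core.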
The CNN layers are:
The implementation of each layer is described as follows: