Improving Keywords Spotting Performance in Noise with Augmented Dataset from Vocoded Speech and Speech Denoising

LI, RUOHAO

dc.contributor.advisor	Nie, Kaibao
dc.contributor.author	LI, RUOHAO
dc.date.accessioned	2021-07-07T20:01:46Z
dc.date.available	2021-07-07T20:01:46Z
dc.date.issued	2021-07-07
dc.date.submitted	2021
dc.description	Thesis (Master's)--University of Washington, 2021
dc.description.abstract	As more electronic devices include an on-device Keywords Spotting (KWS) system, producing and deploying trained models for keyword detection is becoming more demanding. Dataset preparation is one of the most challenging and tedious tasks in KWS, because obtaining raw or segmented speech audio takes a significant amount of time. In this thesis, we first propose a data augmentation strategy that uses a speech vocoder to artificially generate vocoded speech with different numbers of channels. This strategy can increase the dataset size at least two-fold, depending on the use case. With the new features introduced by the different numbers of vocoder channels, a convolutional neural network (CNN) KWS system trained on the dataset augmented with vocoded speech showed a promising improvement when evaluated in noise at +10 dB SNR. The same results were confirmed in an implementation on a microcontroller, showing that vocoded speech in data augmentation has the potential to improve KWS on microcontrollers. We further propose a neural-network-based speech denoising system that uses the Weighted Overlap-Add (WOLA) algorithm for feature extraction, for more efficient processing. The proposed speech denoising system learns a regression from noisy speech to clean speech and converts noisy speech (as input) into clean speech (as output). Thus, the input to the proposed KWS system is relatively clean speech. Furthermore, by changing the training target to vocoded speech, the same speech denoising system can convert noisy speech (as input) into vocoded speech (as output). The combination of speech denoising and vocoded speech in data augmentation achieved relatively high accuracy when evaluated in noise at +10 dB SNR.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	LI_washington_0250O_22685.pdf
dc.identifier.uri	http://hdl.handle.net/1773/47058
dc.language.iso	en_US
dc.rights	CC BY
dc.subject	Convolutional Neural Network
dc.subject	Data Augmentation
dc.subject	Keywords Spotting
dc.subject	Speech Denoising
dc.subject	Electrical and computer engineering
dc.subject.other	Electrical engineering
dc.title	Improving Keywords Spotting Performance in Noise with Augmented Dataset from Vocoded Speech and Speech Denoising
dc.type	Thesis

Files

Original bundle

Name: LI_washington_0250O_22685.pdf
Size: 2.86 MB
Format: Adobe Portable Document Format

Collections

Electrical and computer engineering