Improving Keywords Spotting Performance in Noise with Augmented Dataset from Vocoded Speech and Speech Denoising

dc.contributor.advisor: Nie, Kaibao
dc.contributor.author: LI, RUOHAO
dc.date.accessioned: 2021-07-07T20:01:46Z
dc.date.available: 2021-07-07T20:01:46Z
dc.date.issued: 2021-07-07
dc.date.submitted: 2021
dc.description: Thesis (Master's)--University of Washington, 2021
dc.description.abstract: As more electronic devices ship with an on-device Keywords Spotting (KWS) system, producing and deploying trained models for keyword detection is becoming more demanding. Dataset preparation is one of the most challenging and tedious tasks in KWS, because obtaining raw or segmented speech audio takes a significant amount of time. In this thesis, we first propose a data augmentation strategy that uses a speech vocoder to artificially generate vocoded speech with different numbers of channels. This strategy can increase the dataset size at least two-fold, depending on the use case. With the new features introduced by the different channel counts of the vocoded speech, a convolutional neural network (CNN) KWS system trained on the augmented dataset showed promising improvement when evaluated in noise at +10 dB SNR. The same results were confirmed in an implementation on a microcontroller, indicating that data augmentation with vocoded speech has the potential to improve KWS on microcontrollers. We further propose a neural-network-based speech denoising system that uses the Weighted Overlap-Add (WOLA) algorithm for feature extraction to make processing more efficient. The proposed denoising system learns a regression from noisy speech to clean speech, converting noisy speech (input) into clean speech (output). Thus, the input to the proposed KWS system is relatively clean speech. Furthermore, by changing the training target to vocoded speech, the same denoising system can convert noisy speech (input) into vocoded speech (output). The combination of speech denoising and vocoded speech in data augmentation achieved relatively high accuracy when evaluated in noise at +10 dB SNR.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: LI_washington_0250O_22685.pdf
dc.identifier.uri: http://hdl.handle.net/1773/47058
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: Convolutional Neural Network
dc.subject: Data Augmentation
dc.subject: Keywords Spotting
dc.subject: Speech Denoising
dc.subject: Electrical and computer engineering
dc.subject.other: Electrical engineering
dc.title: Improving Keywords Spotting Performance in Noise with Augmented Dataset from Vocoded Speech and Speech Denoising
dc.type: Thesis
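
Note on the +10 dB SNR condition cited in the abstract: the record does not say how the noisy test material was produced. As a rough illustration only, the sketch below shows one common way to mix noise into speech at a target SNR, by scaling the noise so that the speech-to-noise power ratio matches the target. The function names (mix_at_snr, measured_snr_db) and the synthetic signals are hypothetical and are not taken from the thesis.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Return (noisy, gain): speech plus noise scaled to the target SNR in dB.

    Hypothetical helper for illustration; not the thesis's actual procedure.
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("signals must have non-zero power")
    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)); solve for gain.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    noisy = [s + gain * n for s, n in zip(speech, noise)]
    return noisy, gain


def measured_snr_db(speech, scaled_noise):
    """Compute the SNR in dB between a speech signal and the noise added to it."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in scaled_noise) / len(scaled_noise)
    return 10 * math.log10(speech_power / noise_power)


# Synthetic stand-ins (assumptions): a sine tone for speech, Gaussian noise.
rng = random.Random(0)
demo_speech = [math.sin(2 * math.pi * 440 * i / 16000) for i in range(1600)]
demo_noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
demo_noisy, demo_gain = mix_at_snr(demo_speech, demo_noise, snr_db=10.0)
```

At +10 dB the speech power is ten times the power of the scaled noise, so the noise remains clearly audible but does not dominate the keyword.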

Files

Original bundle

Name: LI_washington_0250O_22685.pdf
Size: 2.86 MB
Format: Adobe Portable Document Format