Transfer Learning Using L2 Speech to Improve Automatic Speech Recognition of Dysarthric Speech

Steinmetz, Hillel Aryeh

Files

Steinmetz_washington_0250O_25460.pdf (595.7 KB)

Date

2023-08-14

Author

Steinmetz, Hillel Aryeh

Abstract

Dysarthria is a class of speech disorders associated with impairments to a person’s motor system. Dysarthric speech is diverse but is broadly characterized by reduced prosodic, phonatory, and articulatory precision (Rowe et al., 2022). Non-native English speech, or L2 English speech, shares acoustic and phonetic features with the speech of several dysarthria subtypes, such as a slower and more variable speech rate than native, non-dysarthric English speech (Baese-Berk and Bradlow, 2021; Hertrich et al., 2021). L2 English speech also has different phonetic correlates than native English speech, with phonetic variation that more closely resembles that of the speaker’s first language (Flege, 1981). Because L2 speech both shares acoustic features with dysarthric speech and has more diverse phonetic correlates of phonological segments, it should facilitate knowledge transfer when training an ASR model on dysarthric speech recognition tasks. This study finetunes Wav2vec2 models on two English dysarthric speech datasets, UA-Speech and TORGO, and one English L2 speech dataset, L2-Arctic, using standard finetuning and multitask learning paradigms. It examines whether including L2 speech in the training data improves dysarthric speech recognition in speaker-dependent, speaker-independent, and zero-shot settings. The results suggest that including L2 speech in the training data improves dysarthric speech recognition in speaker-dependent and speaker-independent settings, with models trained using multitask learning performing better than those trained using standard finetuning.