ASR and Human Recognition Errors: Predictability and Lexical Factors

Mansfield, Courtney S

dc.contributor.advisor	Levow, Gina-Anne
dc.contributor.author	Mansfield, Courtney S
dc.date.accessioned	2021-07-07T20:03:02Z
dc.date.issued	2021-07-07
dc.date.submitted	2021
dc.description	Thesis (Ph.D.)--University of Washington, 2021
dc.description.abstract	Considering the complexity of speech communication, it is unsurprising that a listener occasionally misrecognizes an utterance. However, by examining patterns across many recognition errors, researchers can better understand the mechanisms of speech perception. Through a systematic study of automatic speech recognition (ASR) errors, it is likewise possible to better understand the state of speech processing, including its strengths and limitations. This dissertation considers lexical factors involved in speech misperception and focuses on the role of predictability, which has been less studied in previous work. It uses a set of alignments from the corrected Switchboard corpus produced by Zayats et al. (2019). By taking statistics over many instances of transcription errors, this work considers the role of predictability in speech perception. The findings indicate that hallucinations, where a listener identifies a word not present in the utterance (i.e., transcription insertions), and misses, where a listener fails to hear a word (i.e., transcription deletions), tend to be relatively high in predictability. To measure this, a metric called the surprisal difference is introduced, based on linguistic surprisal (Hale, 2001; Levy, 2008) between the hypothesis and reference text. In a sentence choice task, it is found that predictability affects the sentence a listener chooses, regardless of whether the sentence accurately matches the audio. The second part of this dissertation considers differences between human transcription errors and ASR errors from a state-of-the-art ASR system (Xiong et al., 2016, 2017). Although the total number of errors made by humans and ASR on the same evaluation set is similar, there are differences in the constitution of these errors. The distributions of token frequency, predictability, and the surprisal difference are found to vary significantly between human and ASR errors.
These distributional differences are accounted for in part by the failure of ASR to accurately recognize very low-frequency words and by differences in human and ASR recognition of fillers such as filled pauses and backchannels. This work also supports the effectiveness of speech transcription for studying speech misperception, relying on a corpus of 32K errors, the largest of its kind. It features a crowd-sourcing study which demonstrates the replicability of the transcription errors under different task conditions. By considering errors in conversational transcription, this work provides a better understanding of the relationship of predictability and other lexical features to misrecognition by humans and machines.
dc.embargo.lift	2022-07-07T20:03:02Z
dc.embargo.terms	Restrict to UW for 1 year -- then make Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Mansfield_washington_0250E_22534.pdf
dc.identifier.uri	http://hdl.handle.net/1773/47086
dc.language.iso	en_US
dc.rights	CC BY-SA
dc.subject	automatic speech recognition
dc.subject	natural language processing
dc.subject	speech perception
dc.subject	Linguistics
dc.subject	Computer science
dc.subject.other	Linguistics
dc.title	ASR and Human Recognition Errors: Predictability and Lexical Factors
dc.type	Thesis

Files

Original bundle

Name: Mansfield_washington_0250E_22534.pdf
Size: 6.61 MB
Format: Adobe Portable Document Format

Collections

Linguistics