ASR and Human Recognition Errors: Predictability and Lexical Factors

dc.contributor.advisor: Levow, Gina-Anne
dc.contributor.author: Mansfield, Courtney S
dc.date.accessioned: 2021-07-07T20:03:02Z
dc.date.issued: 2021-07-07
dc.date.submitted: 2021
dc.description: Thesis (Ph.D.)--University of Washington, 2021
dc.description.abstract: Considering the complexity of speech communication, it is unsurprising that a listener occasionally misrecognizes an utterance. However, by examining patterns across many recognition errors, researchers can better understand the mechanisms of speech perception. Through a systematic study of automatic speech recognition (ASR) errors, it is likewise possible to better understand the state of speech processing, including its strengths and limitations. This dissertation considers lexical factors involved in speech misperception and focuses on the role of predictability, which has been less studied in previous work. It uses a set of alignments from the corrected Switchboard corpus produced by Zayats et al. (2019). By taking statistics over many instances of transcription errors, this work considers the role of predictability in speech perception. The findings indicate that hallucinations, where a speaker identifies a word not present in the utterance (i.e., transcription insertions), and misses, where a speaker fails to hear a word (i.e., transcription deletions), tend to be relatively high in predictability. To measure this, a metric called the surprisal difference is introduced, based on linguistic surprisal (Hale, 2001; Levy, 2008) between the hypothesis and reference text. In a sentence choice task, it is found that predictability affects the sentence a listener chooses, regardless of whether the sentence accurately matches the audio. The second part of this dissertation considers differences between human transcription errors and ASR errors from a state-of-the-art ASR system (Xiong et al., 2016, 2017). Although the total number of errors made by humans and ASR on the same evaluation set is similar, there are differences in the constitution of these errors. The distribution of token frequency, predictability, and the surprisal difference is found to vary significantly.
These distributional differences are accounted for in part by the failure of ASR to accurately recognize very low-frequency words and by differences in human and ASR recognition of fillers such as filled pauses and backchannels. This work also supports the effectiveness of speech transcription as a means to study speech misperception, relying on a corpus of 32K errors, the largest of its kind. It features a crowd-sourcing study which demonstrates the replicability of the transcription errors under different task conditions. By considering errors in conversational transcription, this work provides a better understanding of the relationship of predictability and other lexical features to misrecognition by humans and machines.
dc.embargo.lift: 2022-07-07T20:03:02Z
dc.embargo.terms: Restrict to UW for 1 year -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Mansfield_washington_0250E_22534.pdf
dc.identifier.uri: http://hdl.handle.net/1773/47086
dc.language.iso: en_US
dc.rights: CC BY-SA
dc.subject: automatic speech recognition
dc.subject: natural language processing
dc.subject: speech perception
dc.subject: Linguistics
dc.subject: Computer science
dc.subject.other: Linguistics
dc.title: ASR and Human Recognition Errors: Predictability and Lexical Factors
dc.type: Thesis

Files

Original bundle

Name: Mansfield_washington_0250E_22534.pdf
Size: 6.61 MB
Format: Adobe Portable Document Format
