Investigating the Corpus Phonetics Pipeline Applied to Diverse Speech Data

dc.contributor.advisor: Levow, Gina-Anne
dc.contributor.author: Proch Ahn, Emily
dc.date.accessioned: 2025-08-01T22:26:16Z
dc.date.available: 2025-08-01T22:26:16Z
dc.date.issued: 2025-08-01
dc.date.submitted: 2025
dc.description: Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract: Corpus phonetics research has become increasingly large-scale as both data and automated tools have become more plentiful and available. Now that there are resources to study more kinds of data, what are some best practices in using these resources, especially when the data is diverse? This dissertation addresses the following research questions: How do we process diverse speech data, and how much can we rely on automated tools to conduct corpus phonetics research? The types of diversity covered in this work include multilingual and fieldwork corpora covering styles including read, spontaneous, and code-switched speech. Across four studies, we show that automated systems in the corpus phonetics pipeline are viable on multilingual and low-resource datasets. We first propose a pipeline that utilizes automated systems that convert orthography to phonemes, model the acoustics and align audio to those phonemes, and extract features for phonetic analysis. We apply this pipeline to a large, multilingual corpus and show both the utility and limitations of this derivative corpus in a careful study of outlying phonetic features. Then, we apply novel techniques to improve the phonetic forced alignment of low-resource field data, a challenging yet important process in language documentation. We encourage the research community to continue developing tools to aid in language documentation and cross-linguistic research. In doing so, it is important to include manual audits and to examine whether or not the tools are genuinely modeling the data.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: ProchAhn_washington_0250E_28335.pdf
dc.identifier.uri: https://hdl.handle.net/1773/53680
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: corpus phonetics
dc.subject: forced alignment
dc.subject: language documentation
dc.subject: speech technology
dc.subject: Linguistics
dc.subject: Computer science
dc.subject.other: Linguistics
dc.title: Investigating the Corpus Phonetics Pipeline Applied to Diverse Speech Data
dc.type: Thesis

Files

Original bundle

Name: ProchAhn_washington_0250E_28335.pdf
Size: 4.24 MB
Format: Adobe Portable Document Format
