Investigating the Corpus Phonetics Pipeline Applied to Diverse Speech Data

Proch Ahn, Emily

dc.contributor.advisor	Levow, Gina-Anne
dc.contributor.author	Proch Ahn, Emily
dc.date.accessioned	2025-08-01T22:26:16Z
dc.date.available	2025-08-01T22:26:16Z
dc.date.issued	2025-08-01
dc.date.submitted	2025
dc.description	Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract	Corpus phonetics research has become increasingly large-scale as both data and automated tools have become more plentiful and available. Now that there are resources to study more kinds of data, what are some best practices in using these resources, especially when the data is diverse? This dissertation addresses the following research questions: How do we process diverse speech data, and how much can we rely on automated tools to conduct corpus phonetics research? The types of diversity covered in this work include multilingual and fieldwork corpora spanning read, spontaneous, and code-switched speech. Across four studies, we show that automated systems in the corpus phonetics pipeline are viable on multilingual and low-resource datasets. We first propose a pipeline that utilizes automated systems that convert orthography to phonemes, model the acoustics and align audio to those phonemes, and extract features for phonetic analysis. We apply this pipeline to a large, multilingual corpus and show both the utility and limitations of this derivative corpus in a careful study of outlying phonetic features. Then, we apply novel techniques to improve the phonetic forced alignment of low-resource field data, a challenging yet important process in language documentation. We encourage the research community to continue developing tools to aid in language documentation and cross-linguistic research. In doing so, it is important to include manual audits and to examine whether the tools are genuinely modeling the data.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	ProchAhn_washington_0250E_28335.pdf
dc.identifier.uri	https://hdl.handle.net/1773/53680
dc.language.iso	en_US
dc.rights	CC BY
dc.subject	corpus phonetics
dc.subject	forced alignment
dc.subject	language documentation
dc.subject	speech technology
dc.subject	Linguistics
dc.subject	Computer science
dc.subject.other	Linguistics
dc.title	Investigating the Corpus Phonetics Pipeline Applied to Diverse Speech Data
dc.type	Thesis

Files

Original bundle

Name: ProchAhn_washington_0250E_28335.pdf
Size: 4.24 MB
Format: Adobe Portable Document Format

Collections

Linguistics