Investigating the Corpus Phonetics Pipeline Applied to Diverse Speech Data

Proch Ahn, Emily

Files

ProchAhn_washington_0250E_28335.pdf (4.24 MB)

Date

2025-08-01

Abstract

Corpus phonetics research has grown in scale as both data and automated tools have become more plentiful and accessible. Now that resources exist to study more kinds of data, what are the best practices for using them, especially when the data is diverse? This dissertation addresses the following research questions: How do we process diverse speech data, and how much can we rely on automated tools to conduct corpus phonetics research? The diversity covered in this work includes multilingual and fieldwork corpora spanning read, spontaneous, and code-switched speech styles. Across four studies, we show that automated systems in the corpus phonetics pipeline are viable for multilingual and low-resource datasets. We first propose a pipeline of automated systems that convert orthography to phonemes, model the acoustics and align audio to those phonemes, and extract features for phonetic analysis. We apply this pipeline to a large, multilingual corpus and show both the utility and limitations of this derivative corpus in a careful study of outlying phonetic features. Then, we apply novel techniques to improve the phonetic forced alignment of low-resource field data, a challenging yet important process in language documentation. We encourage the research community to continue developing tools to aid in language documentation and cross-linguistic research. In doing so, it is important to include manual audits and to examine whether the tools are genuinely modeling the data.