Linguistics
Permanent URI for this collection: https://digital.lib.washington.edu/handle/1773/4936
Recent Submissions
On the Ethics and Linguistic Impacts of Using the Bible as Training Data for Yucatec Maya-to-Spanish Machine Translation (2026-04-20)
Phoreman, Jade A.; Bender, Emily M.

Religious texts, primarily the Christian Bible, are commonly used as training data for low-resource machine translation (MT) systems because they constitute some of the most extensive and systematically digitized parallel corpora available for many languages. However, this practice raises both linguistic and ethical concerns, particularly for Indigenous language communities for whom Bible translation has historically been intertwined with colonialism and cultural erasure. This thesis investigates the trade-offs associated with using Bible-derived parallel data to fine-tune machine translation models for the Yucatec Maya-to-Spanish translation task. I fine-tuned two models -- TowerInstruct-7B-v.02 and T5S -- across seven experimental conditions varying the proportion and quantity of Bible training data, ranging from 0% to 100% Bible data. Translation quality was evaluated using BLEU, chrF, METEOR, and COMET. Bible-related content drift in model outputs was assessed through two complementary methods: a semantic similarity analysis using BETO sentence embeddings, and a Bible n-gram contamination analysis using log-likelihood ratio statistics. Results show that increasing the proportion of Bible training data consistently degraded translation quality across both models. For TowerInstruct-7B-v.02, this degradation was strictly monotonic; for T5S, the relationship was broadly similar but not strictly monotonic. Neither model benefited from increased quantities of Bible-dominated training data. Semantic drift toward biblical Spanish was negligible across all conditions for both models, with the single exception of T5S trained only on a subset of Bible data.
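For illustration, an n-gram contamination test of the kind described above can be sketched with Dunning-style log-likelihood ratio statistics. The function names, toy corpora, and ranking scheme below are generic assumptions for exposition, not the thesis's actual implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n=2):
    """All contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def log_likelihood_ratio(k1, n1, k2, n2):
    """Dunning's G^2 for an n-gram observed k1 times in corpus 1 (size n1)
    and k2 times in corpus 2 (size n2); larger values indicate a stronger
    association with one corpus."""
    def ll(k, n, p):
        # Binomial log-likelihood of k successes in n trials at rate p.
        # When p is 0 (only possible with k == 0) the contribution is 0.
        return k * math.log(p) + (n - k) * math.log(1 - p) if 0 < p < 1 else 0.0
    p = (k1 + k2) / (n1 + n2)
    p1, p2 = k1 / n1, k2 / n2
    return 2 * (ll(k1, n1, p1) + ll(k2, n2, p2) - ll(k1, n1, p) - ll(k2, n2, p))

def overrepresented_ngrams(target_tokens, reference_tokens, n=2, top=10):
    """Rank n-grams by how strongly they are associated with the target
    corpus (e.g. a Bible corpus) relative to a reference corpus."""
    t = Counter(ngrams(target_tokens, n))
    r = Counter(ngrams(reference_tokens, n))
    nt, nr = sum(t.values()), sum(r.values())
    scored = {g: log_likelihood_ratio(t[g], nt, r.get(g, 0), nr)
              for g in t if t[g] / nt > r.get(g, 0) / nr}
    return sorted(scored, key=scored.get, reverse=True)[:top]
```

Scoring model outputs against such a ranked list is one way to quantify how much Bible-specific phrasing leaks into translations of non-biblical text.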
These findings are contextualized by a community survey of 84 Yucatec Maya speakers, who broadly supported machine translation development while expressing concern about data sovereignty, colonial training data, and the risk of epistemic extractivism. Together, the computational and community findings argue that domain-matched, community-generated data should be prioritized over Bible corpora in low-resource MT development, even when data scarcity creates pressure to use all available resources.

Next, Incorporate the Flour: A Recipe for Modeling Noun Incorporation in the LinGO Grammar Matrix (2026-04-20)
Luedke, Emily; Bender, Emily M.

This thesis describes the creation of a new library for modeling noun incorporation (NI) in the LinGO Grammar Matrix customization system. NI is the morphological attestation of a noun on a verb stem where the noun bears some semantic relation to the verb. Languages vary with respect to how NI functions: in some languages, incorporating verbs pattern like intransitives, whereas in others they pattern like transitives. Some languages allow other arguments to be promoted to the argument position vacated by the incorporated noun (IN); some allow external elements to modify the IN. I present an analysis of this variation within the syntactic framework of HPSG and the semantic framework of MRS. I also describe how I implemented this analysis in the Grammar Matrix customization system for use by user-linguists hoping to model NI in an implemented grammar.
I evaluate my system on five illustrative languages (Chukchi [ckt], Southern Tiwa [tix], Mapudungun [arn], Mohawk [moh], and Tongan [ton]), a set of constructed pseudo-languages, and five held-out languages (Apurinã [apu], Bribri [bzd], Inuktitut [iku], Moloko [mlw], and Yaqui [yaq]). The results show that my library effectively generalizes to unseen data, with an average of 95.6% coverage, 15.2% overgeneration, and an ambiguity of 1.7 parses per grammatical sentence averaged across each grammar.

Learnability of Autoregressive Transformers (2026-02-05)
Hong, Jeongyeob; Steinert-Threlkeld, Shane

This paper explores the learning mechanism of a decoder-only transformer through the lens of human concept learning. We investigate whether decoder-only transformers exhibit a simplicity bias, the human tendency to favor simpler representations. To do so, we create a pipeline that generates every task a decoder-only transformer can learn and express with a given set of input symbols, length, and depth. Our initial results show no sufficient evidence of a simplicity bias in autoregressive models. We end with a discussion of other factors that may explain the learnability of transformers, such as the computational cost of each operation.

TeleRAG: Optimizing Retrieval for Retrieval-Augmented Generation (2026-02-05)
Kashyap, Madhav; Levow, Gina-Anne

Retrieval-augmented generation (RAG) has become essential for grounding large language models with external datastores to enhance factual correctness and domain coverage. However, deployment presents a critical challenge: large language models and vector datastores compete for limited GPU memory, often forcing datastores onto the CPU and leading to slow CPU-based retrieval.
This thesis introduces TeleRAG, a system that resolves this bottleneck through lookahead retrieval, a technique that predicts and prefetches likely-needed vector search data concurrently with large language model inference. We discover that queries at different RAG pipeline stages exhibit semantic overlap, enabling effective predictive prefetching. TeleRAG combines lookahead retrieval with profile-guided prefetching optimization and GPU-CPU cooperative search. Evaluation across six RAG pipelines demonstrates a 1.53× average latency reduction on consumer GPUs and a 1.83× throughput improvement in batched-query scenarios. Crucially, TeleRAG remains framework- and algorithm-agnostic, enabling immediate deployment in existing production systems. By bridging CPU and GPU retrieval, TeleRAG enables efficient RAG deployment for both latency-sensitive and high-throughput applications, advancing retrieval-augmented generation across diverse environments.

DRAG: Diversity in Retrieval Augmented Generation through the Application of Submodular Functions (2026-02-05)
Cortes-Lemos, Maria Paula; Bilmes, Jeffrey

This thesis applies a submodular approach to the reranking stage of Retrieval Augmented Generation to balance the relevance and diversity of the retrieved documents. After initial retrieval using Contriever, we experiment with submodular functions and a baseline of Maximal Marginal Relevance (MMR), a standard function for balancing relevance and diversity. We apply convex combinations of three approaches: (1) a submodular feature-based function using LOG1P concavity and Facility Location, (2) One-Hot Quantization (a quantized modular function with a one-hot feature-based function) with manual weights and Facility Location, and (3) One-Hot Quantization with exponential weight decay and Facility Location. We perform hyperparameter selection for the submodular functions and for MMR.
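As a point of reference, the MMR baseline mentioned above can be sketched as a greedy reranker. The cosine-similarity setup, vector inputs, and parameter names here are illustrative assumptions rather than the thesis's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr_rerank(query, docs, k, lam=0.5):
    """Greedy Maximal Marginal Relevance: at each step select the document
    maximizing lam * relevance(query, doc) - (1 - lam) * (max similarity to
    the documents already selected). Returns indices into docs."""
    candidates = list(range(len(docs)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query, docs[i])
            redundancy = max((cosine(docs[i], docs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lam = 1 this reduces to ranking by relevance alone; lowering lam penalizes near-duplicate documents, which is the relevance-diversity trade-off the submodular objectives above generalize.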
We evaluate these approaches on five datasets designed for diversity-focused tasks (news, politics, analogies, etc.). We show that submodular functions outperform or match MMR's performance in nearly all cases, with recall improvements exceeding 20% (relative difference) in the best case. These results suggest that a submodular approach can be effective for improving RAG systems, particularly in diversity-sensitive tasks.

Bridging the Gap: Adaptation Approaches for Under-Resourced Language Families (2025-08-01)
Parikh, Dwija; Steinert-Threlkeld, Shane

Multilingual large language models have demonstrated remarkable success across a variety of natural language processing (NLP) tasks. However, their performance on low- and under-resourced languages remains significantly limited, primarily due to disparities in data availability. This thesis investigates adaptation strategies to improve multilingual model performance on low-resource languages. Focusing on the Turkic language family, we investigate the effectiveness of adapting a pre-trained model using data from related languages. We examine language-family-specific adaptation techniques, including language-adaptive pre-training (LAPT) and vocabulary specialization, and evaluate their impact on both zero-shot and few-shot scenarios. Our results highlight the potential of targeted multilingual adaptation to bridge performance gaps in low-resource settings and reinforce best practices for multilingual model adaptation.

Knowledge-driven Natural Language Understanding (2025-08-01)
Tian, Yuanhe; Xia, Fei

Recent advances in natural language processing (NLP) have relied mainly on pre-trained language models (PLMs) trained on vast amounts of data. Although these PLMs achieve remarkable success in improving NLP performance over conventional approaches, they still struggle to accurately understand the semantics of text.
Therefore, additional knowledge, especially dynamically extracted knowledge, is expected to improve the models' understanding of text. This thesis proposes a knowledge-driven approach to NLP that improves PLMs. By dynamically integrating external knowledge from multiple sources, the proposed approach enhances model generalization across different scenarios. Specifically, the thesis leverages three types of knowledge: lexicon knowledge (e.g., n-grams) extracted directly from raw data, syntax knowledge (e.g., dependency parse trees) obtained through existing toolkits, and pattern knowledge (e.g., vectors) captured during the training process. Several novel architectures are proposed to leverage this knowledge, such as key-value memory networks for incorporating wordhood information, span attention mechanisms with categorical grouping for improved syntactic parsing, and graph convolutional networks to further enrich contextual representations. Extensive experiments on NLP tasks at various levels demonstrate the effectiveness of the proposed approaches, which outperform strong baselines and existing studies. Overall, this dissertation not only broadens the definition and utilization of knowledge in natural language processing but also lays a solid foundation for future research in multi-modal, cross-domain, and low-resource environments.

Modeling Light Verb Constructions in the LinGO Grammar Matrix (2025-08-01)
Wueger, Tara; Bender, Emily M.

This thesis describes the development of a library for modeling light verb constructions (LVCs) in the LinGO Grammar Matrix. LVCs are constructions involving the combination of a light verb and a coverb, where the coverb contributes most of the meaning to the whole construction. Light verbs can range in meaning from semantically “light” to semantically bleached, and coverbs can come from a variety of word classes (e.g. noun, verb, adjective).
The syntactic and semantic representations of LVCs that form the foundation of my analysis are presented within the HPSG and MRS formalisms. I then implemented this analysis in the Grammar Matrix customization system, using illustrative and pseudo-languages to do so (including Bardi, English, Japanese, and Persian). I also evaluated the library using held-out languages (Ch’ol, Daasanach, and Korafe-Yegha) in order to test its generalizability across different languages. The results of this evaluation include an average of ∼90% coverage, ∼25% overgeneration, and ∼2.0 ambiguity across the testsuites for all three held-out languages.

Investigating the Corpus Phonetics Pipeline Applied to Diverse Speech Data (2025-08-01)
Proch Ahn, Emily; Levow, Gina-Anne

Corpus phonetics research has become increasingly large-scale as both data and automated tools have become more plentiful and available. Now that there are resources to study more kinds of data, what are some best practices for using these resources, especially when the data is diverse? This dissertation addresses the following research questions: How do we process diverse speech data, and how much can we rely on automated tools to conduct corpus phonetics research? The types of diversity covered in this work include multilingual and fieldwork corpora covering read, spontaneous, and code-switched speech. Across four studies, we show that automated systems in the corpus phonetics pipeline are viable on multilingual and low-resource datasets. We first propose a pipeline of automated systems that convert orthography to phonemes, model the acoustics and align audio to those phonemes, and extract features for phonetic analysis. We apply this pipeline to a large, multilingual corpus and show both the utility and the limitations of this derivative corpus in a careful study of outlying phonetic features.
Then, we apply novel techniques to improve the phonetic forced alignment of low-resource field data, a challenging yet important process in language documentation. We encourage the research community to continue developing tools to aid language documentation and cross-linguistic research. In doing so, it is important to include manual audits and to examine whether the tools are genuinely modeling the data.

Topics in the Grammar of Gyegu Tibetan (2025-08-01)
Ukasick, Trent Jordan; Hargus, Sharon

This dissertation describes the variety of Tibetan spoken in and around Yushu City in Qinghai Province, People's Republic of China. Chapter 1 provides an overview of the area where Gyegu Tibetan is spoken, previous research on closely related varieties, and the language's classification and vitality. Chapter 2 describes the language's phonology, including its large phoneme inventory and four-way laryngeal contrast between voiced, tense, breathy, and aspirated obstruents. Chapter 3 presents lexical and phrasal categories and details the morphological, syntactic, and semantic behavior of each category. Finally, Chapter 4 focuses on clausal morphosyntax, describing a wide variety of syntactic phenomena, including Gyegu Tibetan's system of evidentiality and egophoricity. Together, these chapters offer a detailed account of the language, serving as a foundation for further research on variation within the Tibetic language family and its notable phonological and syntactic features.

The Interplay of Dataset Characteristics in Automated Grammar Generation: A Study with the AGGREGATION System (2025-08-01)
Liu, Tongxi; Bender, Emily M.

In this thesis, I investigate how linguists can effectively prepare Interlinear Glossed Text (IGT) data for use with the AGGREGATION grammar inference system, particularly under constraints such as limited time, sparse annotations, and variable corpus quality.
AGGREGATION aims to automate the creation of precision Head-driven Phrase Structure Grammar (HPSG) grammars from IGT, but its output quality depends heavily on input structure and annotation consistency. To explore this, I develop a modeling framework to evaluate how structural and annotation-based features (such as affix ambiguity, type-stems ratio, and POS tag source) affect grammar quality across 75,000 grammar runs on 25 datasets. I use both linear mixed-effects models and XGBoost to identify predictors of four key metrics: coverage, ambiguity, morphological complexity, and inference time. Results show that smaller, structurally coherent datasets often outperform larger, noisier ones. Manual POS tags improve coverage and generalization but increase ambiguity, while automatic tags result in cleaner grammars with lower parse success. A case study on Meitei highlights how annotation quality interacts with language-specific features. This work offers practical guidance for preparing IGT data for grammar generation and proposes future improvements to AGGREGATION, including support for structure-aware sampling and multi-version grammar comparison.

Beyond Memorization: Evaluating Length-Generalization in Transformer-based Language Models (2025-08-01)
Cheng, Yao-Fei; Steinert-Threlkeld, Shane

Transformer-based large language models have made substantial progress in the NLP community. However, transformers have trouble with length generalization (i.e., extrapolating to lengths different from those seen during training). Recently, Zhou et al. (2024a) proposed the RASP-Generalization Conjecture to predict which tasks are length-generalizable, based on a few carefully handcrafted tasks in the mathematical domain. This work examines the conjecture by generating hundreds of synthetic tasks written as the shortest RASP-L programs. Our investigation does not support the conjecture: tasks written as the shortest RASP-L programs are not length-generalizable.
Furthermore, our analysis reveals that some failures of length generalization arise because models do not stop generating; this can easily be fixed by using oracle lengths during evaluation, as suggested in previous literature. Additionally, the analysis rejects induction heads as the key factor in length-generalization failure, contrary to previous claims. While our work does not provide a precise explanation of transformers’ length-generalization capability, we show that previous claims do not extend beyond carefully handcrafted tasks.

Complexity of In-Context Concept Learning in Language Models (2025-08-01)
Wang, Leroy; Steinert-Threlkeld, Shane

This thesis studies the factors that contribute to the success and shortcomings of in-context learning in large language models (LLMs), the ability of some language models to perform a new task during inference using only a few labeled examples. Drawing on insights from the literature on human concept learning, we test LLMs on carefully designed concept learning tasks and show that task performance correlates highly with the logical complexity of the concept. This suggests that in-context learning exhibits a learning bias for simplicity similar to that of humans.

Linguistic and Computational Approaches to Investigating Variations in Early Language Input at Home and in the Classroom (2025-08-01)
Sheth, Kaveri; Ferjan Ramírez, Naja

Sociocultural frameworks have emphasized the role that social interactions with caregivers play in scaffolding a child’s cognitive and linguistic development. Given that children may interact with a myriad of people, each participating in social interactions in a variety of ways, this dissertation focuses on how variations in early language input, across different caregivers and contexts, shape opportunities for language development.
I utilize naturalistic daylong audio recordings to investigate how these variations occur naturally, and I rely on manual annotations for a fine-grained analysis of the input. Because manual annotations are time-consuming and labor-intensive, I also ask how computational tools may help us measure language outcomes. Chapter 2 observes how caregivers may vary in their language input to young children in the home environment: I examine differences in maternal and paternal language input, shedding light on differences that may occur in the quantity, syntactic, and lexical aspects of parentese as the child develops. Chapter 3 investigates how a foreign language intervention in a preschool setting may shape children’s language opportunities in a second language. Chapter 4 turns to the methodological question: I investigate how current technologies perform language identification on speech from very young children who are learning another language. My results demonstrate that language input varies meaningfully between caregivers and across contexts, creating unique learning opportunities for children. However, current technologies cannot yet capture this variation, and researchers in the field must still rely on manual annotations. These findings call for more developmentally informed tools to capture the rich variety of early input and output across caregivers and contexts.

mSCAN - a Multilingual Dataset for Compositional Generalization Evaluation (2025-08-01)
Reymond, Amélie Thu Tâm; Steinert-Threlkeld, Shane

Language models achieve remarkable results on a variety of tasks, yet still struggle on compositional generalization benchmarks. The majority of these benchmarks evaluate performance in English only, leaving open the question of whether these results generalize to other languages.
As an initial step toward answering this question, we introduce mSCAN, a multilingual adaptation of the SCAN dataset covering Mandarin Chinese, French, Hindi, and Russian. It was produced by rule-based translation developed in cooperation with native speakers. We then showcase this dataset in in-context learning experiments on multiple open-source multilingual models.

Convexity is a Fundamental Feature of Efficient Semantic Compression in Probability Spaces (2025-05-12)
Skinner, Lindsay Paige; Steinert-Threlkeld, Shane N.

This thesis investigates the relationship between convexity and efficient communication using a probabilistic communication model applied to color space. It builds on previous work investigating the plausibility and potential source(s) of Gärdenfors's proposed semantic universal: that all subsets of color space affiliated with a particular color term are convex sets. The analysis undertaken in this project makes two major contributions to the existing literature. First, it establishes a new metric that provides a quantitative measure of convexity applicable to probabilistic communication models. Second, it demonstrates that convexity is an essential feature of efficient color-naming systems, where efficiency is determined with respect to a trade-off between accuracy and complexity. Furthermore, this project demonstrates that convexity is a more significant predictor of communication efficiency than either accuracy or complexity.

MARS: MedicAl thRead Summarization Dataset based on IIYI with Comparative Analysis of Large Language Models (2025-01-23)
Zhang, Ruiru; Xia, Fei

This thesis presents MARS (MedicAl thRead Summarization Dataset based on IIYI), a pioneering dataset designed for medical-domain thread summarization. MARS features a structure that captures the complexities and nuances of medical dialogues.
The dataset integrates information extraction and summarization tasks, enabling a comprehensive evaluation of large language models (LLMs) through extracting relevant information and generating coherent summaries. It also introduces unique challenges that necessitate advanced reasoning from LLMs, reflecting the complexities of healthcare discussions, where misunderstandings can impact patient care. Furthermore, MARS serves as a critical benchmark for assessing LLM performance in a medical context, addressing a significant gap in the existing literature. In addition to constructing the dataset, we tested the performance of various large language models on MARS, emphasizing the advantages of the GLM-4-Plus model when utilizing dynamic few-shot learning strategies. The experimental results further indicate that an extraction-then-summarization approach significantly enhances summarization performance compared to direct summarization methods. By providing diverse examples pertinent to real-world medical inquiries, MARS aims to promote robust research and the development of LLMs tailored to the intricacies of medical discourse, ultimately enhancing healthcare applications.

Action Nominals in the Grammar Matrix (2025-01-23)
Ruditsky, Keren; Bender, Emily M.

This thesis describes the addition of a library for action nominal constructions (ANCs) to the LinGO Grammar Matrix customization system. Action nominals are nominalized verbs that refer to an action or process and are often used cross-linguistically to mark clausal complements and adverbial clauses. They occupy an intermediate status between nouns and verbs, having the external distribution of a noun phrase but often still retaining certain verbal properties.
In this thesis, I build on the existing analysis of nominalized clauses in the Grammar Matrix, but shift away from an approach in which the dual nominal and verbal characteristics of action nominals are explained by the level in the tree at which nominalization occurs, toward one that relies primarily on lexical rules. This change is motivated by a desire to expand the typological range of nominalization patterns the Matrix can handle while also more closely reflecting the hybrid syntactic nature of action nominals. I present an HPSG analysis of action nominals and the implementation of that analysis within the Grammar Matrix. I develop the library using a combination of pseudo- and illustrative languages (English [eng], Hixkaryana [hix], Russian [rus], Korean [kor]) and then test on small testsuites from five held-out languages (Wayana [way], Maltese [mlt], Dutch [nld], Lango [laj], Finnish [fin]). The system achieved on average 95.2% coverage and 7.0% overgeneration on the held-out data.

Generating Under the Influence: An Adversarial Approach to Modeling Stylistic Influences in Literary Text (2025-01-23)
Dong, Edward; Levow, Gina-Anne

This study proposes, implements, and evaluates a method for simulating literary stylistic influence in a text generation system. Specifically, it probes the effects of augmenting next-token prediction training with an additional objective, borrowed from the generative adversarial paradigm, for the purpose of administering stylistic pressures. Two crucial adaptations are applied to the typical adversarial objective. First, to nudge the model toward preferred styles, the objective is expanded from binary to multi-label. Second, to cushion the model against the inherent volatility of the adversarial training signal, losses related to non-preferred styles are zeroed out and ignored.
All model components are warm-started from the historically pre-trained MacBERTh and fine-tuned on a bespoke corpus of 1950s Anglophone prose fiction. The study additionally devises evaluation metrics grounded in relevant critical and pedagogical literature. The implementation of this socially adaptive text generation system not only demonstrates a viable approach to modeling peer stylistic influence but may also serve as a building block for future research on cultural evolution systems in the literary domain.

Fine-tuning ASR Models for Very Low-Resource Languages: A Study on Mvskoke (2024-10-16)
Mainzinger, Julia; Levow, Gina-Anne

Recent advancements in multilingual models for automatic speech recognition (ASR) have significantly improved accuracy for languages with extremely limited resources. This study focuses on ASR modeling for the Mvskoke language, an Indigenous language of North America, by fine-tuning three multilingual wav2vec2.0 models: XLSR-53, MMS-300M, and MMS-1B-l1107. Training data is prepared using language documentation resources, and two evaluation sets are designed, one clean and one noisy, to evaluate performance in different settings. The parameter efficiency of adapter training is compared with training entire models, and the impact of the number of languages used during pre-training is examined. The study also investigates how performance varies with different amounts of training data by testing models trained with 10, 60, 120, and 243 minutes of data. A trigram language model is trained using cultural documents and transcripts of interviews, and the ASR models are evaluated with and without language model decoding. The findings show that both MMS models outperform XLSR-53 with higher amounts of training data. Notably, training an adapter for MMS-1B-l1107 proves to be both parameter-efficient and capable of achieving high accuracy with a relatively small amount of data.
ASR accuracy begins to converge around 2-4 hours of training data. While using a language model generally improves metrics such as word error rate, it can sometimes degrade the output. The study introduces the first ASR models successfully developed for the Mvskoke language.
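Word error rate, the metric referenced in the abstract above, is word-level Levenshtein distance normalized by reference length. The sketch below is a generic implementation for illustration, not the evaluation code used in the thesis.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via Levenshtein distance over whitespace-split words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] holds the edit distance between ref[:i] and hyp[:j].
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

Because insertions count as errors, WER can exceed 1.0, which is one reason language-model decoding can occasionally degrade this metric even while improving overall output quality.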
