On the Ethics and Linguistic Impacts of Using the Bible as Training Data for Yucatec Maya-to-Spanish Machine Translation

Phoreman, Jade A.

dc.contributor.advisor	Bender, Emily M
dc.contributor.author	Phoreman, Jade A.
dc.date.accessioned	2026-04-20T15:30:30Z
dc.date.available	2026-04-20T15:30:30Z
dc.date.issued	2026-04-20
dc.date.submitted	2026
dc.description	Thesis (Master's)--University of Washington, 2026
dc.description.abstract	Religious texts, primarily the Christian Bible, are commonly used as training data for low-resource machine translation (MT) systems because they constitute some of the most extensive and systematically digitized parallel corpora available for many languages. However, this practice raises both linguistic and ethical concerns, particularly for Indigenous language communities for whom Bible translation has historically been intertwined with colonialism and cultural erasure. This thesis investigates the trade-offs associated with using Bible-derived parallel data to fine-tune machine translation models for the Yucatec Maya-to-Spanish translation task. I fine-tuned two models -- TowerInstruct-7B-v.02 and T5S -- across seven experimental conditions varying the proportion and quantity of Bible training data, ranging from 0% to 100% Bible data. Translation quality was evaluated using BLEU, chrF, METEOR, and COMET. Bible-related content drift in model outputs was assessed through two complementary methods: a semantic similarity analysis using BETO sentence embeddings, and a Bible n-gram contamination analysis using log-likelihood ratio statistics. Results show that increasing the proportion of Bible training data consistently degraded translation quality across both models. For TowerInstruct-7B-v.02, this degradation was strictly monotonic. For T5S, the relationship was broadly similar but not strictly monotonic. Neither model benefited from increased quantities of Bible-dominated training data. Semantic drift toward biblical Spanish was negligible across all conditions for both models, with the single exception of T5S trained only on a subset of Bible data. These findings are contextualized by a community survey of 84 Yucatec Maya speakers, who broadly supported machine translation development while expressing concern about data sovereignty, colonial training data, and the risk of epistemic extractivism.
Together, the computational and community findings argue that domain-matched, community-generated data should be prioritized over Bible corpora in low-resource MT development, even when data scarcity creates pressure to use all available resources.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Phoreman_washington_0250O_29337.pdf
dc.identifier.uri	https://hdl.handle.net/1773/55519
dc.language.iso	en_US
dc.relation.haspart	bible_dataset_verification_counting.py; code/script.
dc.relation.haspart	bibleScraper-es.py; code/script.
dc.relation.haspart	bibleScraperV3.py; code/script.
dc.relation.haspart	concat_bible.py; code/script.
dc.relation.haspart	concat_data.py; code/script.
dc.relation.haspart	eval_quality.py; code/script.
dc.relation.haspart	evaluate_all.py; code/script.
dc.relation.haspart	experiments.py; code/script.
dc.relation.haspart	finetune_t5s.py; code/script.
dc.relation.haspart	generate_fr_en_baseline.py; code/script.
dc.relation.haspart	generate_fr_es_baseline.py; code/script.
dc.relation.haspart	generate_translations_prompt_regex.py; code/script.
dc.relation.haspart	infer_t5s.py; code/script.
dc.relation.haspart	n-grams.py; code/script.
dc.relation.haspart	preprocess.py; code/script.
dc.relation.haspart	train_tower_lora.py; code/script.
dc.rights	none
dc.subject	Bible data
dc.subject	Indigenous language
dc.subject	low-resource language
dc.subject	machine translation
dc.subject	Yucatec Maya
dc.subject	Linguistics
dc.subject	Computer science
dc.subject	Language
dc.subject.other	Linguistics
dc.title	On the Ethics and Linguistic Impacts of Using the Bible as Training Data for Yucatec Maya-to-Spanish Machine Translation
dc.type	Thesis