On the Ethics and Linguistic Impacts of Using the Bible as Training Data for Yucatec Maya-to-Spanish Machine Translation

dc.contributor.advisor: Bender, Emily M
dc.contributor.author: Phoreman, Jade A.
dc.date.accessioned: 2026-04-20T15:30:30Z
dc.date.available: 2026-04-20T15:30:30Z
dc.date.issued: 2026-04-20
dc.date.submitted: 2026
dc.description: Thesis (Master's)--University of Washington, 2026
dc.description.abstract: Religious texts, primarily the Christian Bible, are commonly used as training data for low-resource machine translation (MT) systems because they constitute some of the most extensive and systematically digitized parallel corpora available for many languages. However, this practice raises both linguistic and ethical concerns, particularly for Indigenous language communities for whom Bible translation has historically been intertwined with colonialism and cultural erasure. This thesis investigates the trade-offs associated with using Bible-derived parallel data to fine-tune machine translation models for the Yucatec Maya-to-Spanish translation task. I fine-tuned two models, TowerInstruct-7B-v.02 and T5S, across seven experimental conditions varying the proportion and quantity of Bible training data, with the Bible share ranging from 0% to 100%. Translation quality was evaluated using BLEU, chrF, METEOR, and COMET. Bible-related content drift in model outputs was assessed through two complementary methods: a semantic similarity analysis using BETO sentence embeddings, and a Bible n-gram contamination analysis using log-likelihood ratio statistics. Results show that increasing the proportion of Bible training data consistently degraded translation quality across both models. For TowerInstruct-7B-v.02, this degradation was strictly monotonic. For T5S, the relationship was broadly similar but not strictly monotonic. Neither model benefited from increased quantities of Bible-dominated training data. Semantic drift toward biblical Spanish was negligible across all conditions for both models, with the single exception of T5S trained only on a subset of Bible data. These findings are contextualized by a community survey of 84 Yucatec Maya speakers, who broadly supported machine translation development while expressing concern about data sovereignty, colonial training data, and the risk of epistemic extractivism.
Together, the computational and community findings argue that domain-matched, community-generated data should be prioritized over Bible corpora in low-resource MT development, even when data scarcity creates pressure to use all available resources.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Phoreman_washington_0250O_29337.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55519
dc.language.iso: en_US
dc.relation.haspart: bible_dataset_verification_counting.py; code/script.
dc.relation.haspart: bibleScraper-es.py; code/script.
dc.relation.haspart: bibleScraperV3.py; code/script.
dc.relation.haspart: concat_bible.py; code/script.
dc.relation.haspart: concat_data.py; code/script.
dc.relation.haspart: eval_quality.py; code/script.
dc.relation.haspart: evaluate_all.py; code/script.
dc.relation.haspart: experiments.py; code/script.
dc.relation.haspart: finetune_t5s.py; code/script.
dc.relation.haspart: generate_fr_en_baseline.py; code/script.
dc.relation.haspart: generate_fr_es_baseline.py; code/script.
dc.relation.haspart: generate_translations_prompt_regex.py; code/script.
dc.relation.haspart: infer_t5s.py; code/script.
dc.relation.haspart: n-grams.py; code/script.
dc.relation.haspart: preprocess.py; code/script.
dc.relation.haspart: train_tower_lora.py; code/script.
dc.rights: none
dc.subject: Bible data
dc.subject: Indigenous language
dc.subject: low-resource language
dc.subject: machine translation
dc.subject: Yucatec Maya
dc.subject: Linguistics
dc.subject: Computer science
dc.subject: Language
dc.subject.other: Linguistics
dc.title: On the Ethics and Linguistic Impacts of Using the Bible as Training Data for Yucatec Maya-to-Spanish Machine Translation
dc.type: Thesis

Files

Original bundle

Now showing 1 - 5 of 17
Name: Phoreman_washington_0250O_29337.pdf; Size: 19.2 MB; Format: Adobe Portable Document Format
Name: bible_dataset_verification_counting.py; Size: 2.67 KB; Format: Unknown data format
Name: bibleScraper-es.py; Size: 2.39 KB; Format: Unknown data format
Name: bibleScraperV3.py; Size: 2.66 KB; Format: Unknown data format
Name: concat_bible.py; Size: 2.69 KB; Format: Unknown data format

Collections