On the Ethics and Linguistic Impacts of Using the Bible as Training Data for Yucatec Maya-to-Spanish Machine Translation
| dc.contributor.advisor | Bender, Emily M | |
| dc.contributor.author | Phoreman, Jade A. | |
| dc.date.accessioned | 2026-04-20T15:30:30Z | |
| dc.date.available | 2026-04-20T15:30:30Z | |
| dc.date.issued | 2026-04-20 | |
| dc.date.submitted | 2026 | |
| dc.description | Thesis (Master's)--University of Washington, 2026 | |
| dc.description.abstract | Religious texts, primarily the Christian Bible, are commonly used as training data for low-resource machine translation (MT) systems because they constitute some of the most extensive and systematically digitized parallel corpora available for many languages. However, this practice raises both linguistic and ethical concerns, particularly for Indigenous language communities for whom Bible translation has historically been intertwined with colonialism and cultural erasure. This thesis investigates the trade-offs associated with using Bible-derived parallel data to fine-tune machine translation models for the Yucatec Maya-to-Spanish translation task. I fine-tuned two models -- TowerInstruct-7B-v.02 and T5S -- across seven experimental conditions varying the proportion and quantity of Bible training data, ranging from 0% to 100% Bible data. Translation quality was evaluated using BLEU, chrF, METEOR, and COMET. Bible-related content drift in model outputs was assessed through two complementary methods: a semantic similarity analysis using BETO sentence embeddings, and a Bible n-gram contamination analysis using log-likelihood ratio statistics. Results show that increasing the proportion of Bible training data consistently degraded translation quality across both models. For TowerInstruct-7B-v.02, this degradation was strictly monotonic. For T5S, the relationship was broadly similar but not strictly monotonic. Neither model benefited from increased quantities of Bible-dominated training data. Semantic drift toward biblical Spanish was negligible across all conditions for both models, with the single exception of T5S trained only on a subset of Bible data. These findings are contextualized by a community survey of 84 Yucatec Maya speakers, who broadly supported machine translation development while expressing concern about data sovereignty, colonial training data, and the risk of epistemic extractivism. 
Together, the computational and community findings argue that domain-matched, community-generated data should be prioritized over Bible corpora in low-resource MT development, even when data scarcity creates pressure to use all available resources. | |
| dc.embargo.terms | Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Phoreman_washington_0250O_29337.pdf | |
| dc.identifier.uri | https://hdl.handle.net/1773/55519 | |
| dc.language.iso | en_US | |
| dc.relation.haspart | bible_dataset_verification_counting.py; code/script. | |
| dc.relation.haspart | bibleScraper-es.py; code/script. | |
| dc.relation.haspart | bibleScraperV3.py; code/script. | |
| dc.relation.haspart | concat_bible.py; code/script. | |
| dc.relation.haspart | concat_data.py; code/script. | |
| dc.relation.haspart | eval_quality.py; code/script. | |
| dc.relation.haspart | evaluate_all.py; code/script. | |
| dc.relation.haspart | experiments.py; code/script. | |
| dc.relation.haspart | finetune_t5s.py; code/script. | |
| dc.relation.haspart | generate_fr_en_baseline.py; code/script. | |
| dc.relation.haspart | generate_fr_es_baseline.py; code/script. | |
| dc.relation.haspart | generate_translations_prompt_regex.py; code/script. | |
| dc.relation.haspart | infer_t5s.py; code/script. | |
| dc.relation.haspart | n-grams.py; code/script. | |
| dc.relation.haspart | preprocess.py; code/script. | |
| dc.relation.haspart | train_tower_lora.py; code/script. | |
| dc.rights | none | |
| dc.subject | Bible data | |
| dc.subject | Indigenous language | |
| dc.subject | low-resource language | |
| dc.subject | machine translation | |
| dc.subject | Yucatec Maya | |
| dc.subject | Linguistics | |
| dc.subject | Computer science | |
| dc.subject | Language | |
| dc.subject.other | Linguistics | |
| dc.title | On the Ethics and Linguistic Impacts of Using the Bible as Training Data for Yucatec Maya-to-Spanish Machine Translation | |
| dc.type | Thesis |
Files
Original bundle
- Name:
- Phoreman_washington_0250O_29337.pdf
- Size:
- 19.2 MB
- Format:
- Adobe Portable Document Format
- Name:
- bible_dataset_verification_counting.py
- Size:
- 2.67 KB
- Format:
- Unknown data format
