The Interplay of Dataset Characteristics in Automated Grammar Generation: A Study with the AGGREGATION System

dc.contributor.advisor: Bender, Emily M.
dc.contributor.author: Liu, Tongxi
dc.date.accessioned: 2025-08-01T22:26:09Z
dc.date.available: 2025-08-01T22:26:09Z
dc.date.issued: 2025-08-01
dc.date.submitted: 2025
dc.description: Thesis (Master's)--University of Washington, 2025
dc.description.abstract: In this thesis, I investigate how linguists can effectively prepare Interlinear Glossed Text (IGT) data for use with the AGGREGATION grammar inference system, particularly under constraints such as limited time, sparse annotations, and variable corpus quality. AGGREGATION aims to automate the creation of precision Head-driven Phrase Structure Grammar (HPSG) grammars from IGT, but its output quality depends heavily on input structure and annotation consistency. To explore this, I develop a modeling framework to evaluate how structural and annotation-based features (such as affix ambiguity, type-stems ratio, and POS tag source) affect grammar quality across 75,000 grammar runs on 25 datasets. I use both linear mixed-effects models and XGBoost to identify predictors of four key metrics: coverage, ambiguity, morphological complexity, and inference time. Results show that smaller, structurally coherent datasets often outperform larger, noisier ones. Manual POS tags improve coverage and generalization but increase ambiguity, while automatic tags result in cleaner grammars with lower parse success. A case study on Meitei highlights how annotation quality interacts with language-specific features. This work offers practical guidance for preparing IGT data for grammar generation and proposes future improvements to AGGREGATION, including support for structure-aware sampling and multi-version grammar comparison.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Liu_washington_0250O_28606.pdf
dc.identifier.uri: https://hdl.handle.net/1773/53678
dc.language.iso: en_US
dc.rights: none
dc.subject: Automated Grammar Generation
dc.subject: Computational Linguistics
dc.subject: Grammar Engineering
dc.subject: Grammar Inference
dc.subject: Head-driven Phrase Structure Grammar
dc.subject: Interlinear Glossed Text
dc.subject: Linguistics
dc.subject: Computer science
dc.subject: Information science
dc.subject.other: Linguistics
dc.title: The Interplay of Dataset Characteristics in Automated Grammar Generation: A Study with the AGGREGATION System
dc.type: Thesis

Files

Original bundle

Name: Liu_washington_0250O_28606.pdf
Size: 15.72 MB
Format: Adobe Portable Document Format