From Aari to Zulu: Massively Multilingual Creation of Language Tools using Interlinear Glossed Text
Georgi, Ryan Alden
This dissertation examines the suitability of Interlinear Glossed Text (IGT) as a computational, semi-structured resource for creating NLP tools for resource-poor languages, focusing on three tasks: word alignment, part-of-speech (POS) tagging, and dependency parsing. The creation of the Online Database of INterlinear text (ODIN), a massively multilingual database of IGT instances, made it possible to build tools that harness this data format on a large scale. Xia and Lewis (2007) and Lewis and Xia (2008) demonstrated the potential of using IGT instances from ODIN to answer typological questions, such as basic word order, for a large number of languages by using the language–gloss–translation line structure of IGT instances to bootstrap word alignment and, consequently, syntactic projection. This dissertation thoroughly investigates the potential for creating these NLP tools for endangered or otherwise resource-poor languages with nothing more than the IGT instances found in ODIN. After introducing the IGT data type and the particulars of the resources used (Sections 3.1 to 4.4), the dissertation presents each task in detail: word alignment in Chapter 5, POS tagging in Chapter 6, and dependency parsing in Chapter 7. Chapter 8 introduces the INterlinear Text ENrichment Toolkit (Intent), the software created to enrich the IGT data and extract NLP tools from it, and briefly summarizes where to find the software and how to use it. Lastly, Chapter 9 discusses the aims of the experiments and the overall viability of IGT for the tasks attempted.
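To illustrate how the language–gloss–translation structure can bootstrap word alignment, the following is a minimal sketch and not the method used in the dissertation or in Intent. It assumes that the language and gloss lines align token for token, and it links a gloss token to a translation word by simple string matching. The function name `align_igt`, the matching heuristic, and the German example are all hypothetical, chosen only for illustration; the example is not drawn from ODIN.

```python
import re


def align_igt(lang_line, gloss_line, trans_line):
    """Heuristically align language-line words to translation-line words.

    Assumption: the language and gloss lines have one token per word, in
    the same order, so the gloss can serve as a bridge to the translation.
    """
    lang_tokens = lang_line.split()
    gloss_tokens = gloss_line.split()
    if len(lang_tokens) != len(gloss_tokens):
        raise ValueError("language and gloss lines must have the same number of tokens")

    # Lowercase the translation and strip punctuation.
    trans_tokens = re.findall(r"[\w']+", trans_line.lower())

    alignments = []
    for lang_word, gloss in zip(lang_tokens, gloss_tokens):
        # Split the gloss on morpheme boundaries and drop grammatical
        # labels written in capitals (e.g. PL, 3PL).
        morphemes = [
            m.lower()
            for m in re.split(r"[-.=]", gloss)
            if m and not m.isupper()
        ]
        for word in trans_tokens:
            # Exact match, or a crude prefix match to catch inflected forms.
            if any(word == m or (len(m) >= 3 and word.startswith(m)) for m in morphemes):
                alignments.append((lang_word, word))
                break  # keep only the first match for this sketch
    return alignments


if __name__ == "__main__":
    # Illustrative IGT instance (hypothetical, not taken from ODIN).
    pairs = align_igt(
        "Die Hunde bellen",
        "the dog-PL bark-3PL",
        "The dogs bark.",
    )
    print(pairs)
```

A real system would need to handle repeated words, multi-word glosses, and gloss words that never appear verbatim in the translation; this sketch ignores all of those cases.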
- Linguistics