Improving Turkish Spelling Correction with Wikipedia Edit History Data
Date
2021-08-26
Author
Guinard, Theresa
Abstract
Spelling correction is a well-established NLP application, but the quality for English spelling correction tends to be significantly higher than for other languages. One significant issue for minority languages in NLP is the availability of specialized corpora for various tasks. In this thesis, I develop and make available a corpus of about 780,000 Turkish spelling errors and their corrections. I present a fairly language-independent system for identifying small edits from Wikipedia's edit history and a decision forest classifier with Turkish-specific features for distinguishing spelling corrections from other types of small edits. When analyzing the corpus, I find the major categories of error types to be changes in diacritics, changes in capitalization, changes in spacing, and character insertions/deletions/substitutions/swaps. I present a baseline cascaded system, where each error type category is handled by separate modules. When trained with the Wikipedia data, this system handles cross-domain spelling errors more effectively than existing systems for Turkish normalization and spelling correction. Additionally, I investigate the possibility of using a machine translation model in a spelling correction system, and I find that an SMT model trained on the error model for character insertions/deletions/substitutions/swaps is a viable option for handling that category of errors.
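The four error categories named in the abstract can be illustrated with a small rule-based sketch. This is not the thesis's decision forest classifier or its cascaded system; the function names, the order of the checks, and the character mapping below are assumptions made only for illustration.

```python
import unicodedata

# Turkish-specific letters mapped to their ASCII look-alikes.
# Hypothetical mapping chosen for this sketch, not taken from the thesis.
TURKISH_TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def strip_diacritics(word: str) -> str:
    """Replace Turkish-specific letters with their ASCII counterparts."""
    return word.translate(TURKISH_TO_ASCII)


def turkish_lower(word: str) -> str:
    """Lowercase with Turkish casing rules: I -> ı and İ -> i."""
    return word.replace("I", "ı").replace("İ", "i").lower()


def categorize(error: str, correction: str) -> str:
    """Assign an (error, correction) pair to one coarse category."""
    # Normalize to composed form so each letter is a single code point.
    error = unicodedata.normalize("NFC", error)
    correction = unicodedata.normalize("NFC", correction)
    if error == correction:
        return "none"
    if error.replace(" ", "") == correction.replace(" ", ""):
        return "spacing"
    if turkish_lower(error) == turkish_lower(correction):
        return "capitalization"
    if strip_diacritics(error) == strip_diacritics(correction):
        return "diacritics"
    # Everything else: insertions, deletions, substitutions, swaps.
    return "character edit"


if __name__ == "__main__":
    pairs = [
        ("calisma", "çalışma"),
        ("istanbul", "İstanbul"),
        ("birşey", "bir şey"),
        ("gelcek", "gelecek"),
    ]
    for error, correction in pairs:
        print(error, "->", correction, ":", categorize(error, correction))
```

A pair that mixes several error types falls into whichever check matches first, or into the final catch-all, so a real system would need a finer-grained treatment than this sketch provides.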
Description
Thesis (Master's)--University of Washington, 2021
Keywords
Linguistics, Computer science
