Improving Turkish Spelling Correction with Wikipedia Edit History Data

Guinard, Theresa

Files

Guinard_washington_0250O_23147.pdf (565.07 KB)

Date

2021-08-26

Abstract

Spelling correction is a well-established NLP application, but the quality of English spelling correction tends to be significantly higher than that for other languages. One significant issue for minority languages in NLP is the limited availability of specialized corpora for various tasks. In this thesis, I develop and make available a corpus of about 780,000 Turkish spelling errors and their corrections. I present a fairly language-independent system for identifying small edits in Wikipedia's edit history, along with a decision forest classifier that uses Turkish-specific features to distinguish spelling corrections from other types of small edits. Analyzing the corpus, I find the major categories of error types to be changes in diacritics, changes in capitalization, changes in spacing, and character insertions/deletions/substitutions/swaps. I present a baseline cascaded system in which each error type category is handled by a separate module. When trained with the Wikipedia data, this system handles cross-domain spelling errors more effectively than existing systems for Turkish normalization and spelling correction. Additionally, I investigate the possibility of using a machine translation model in a spelling correction system, and I find that a statistical machine translation (SMT) model trained on the error model for character insertions/deletions/substitutions/swaps is a viable option for handling that category of errors.
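
To make the four error categories named in the abstract concrete, the following is a minimal Python sketch that assigns an (error, correction) pair to one of them. It is not the thesis's decision forest classifier or its cascaded system: the function names, the order of the checks, the diacritic mapping, and the example words are all assumptions made for illustration.

```python
# Hypothetical mapping from Turkish letters to ASCII look-alikes (an assumption,
# used only to test whether two words differ in diacritics alone).
_DIACRITIC_MAP = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")


def strip_diacritics(word: str) -> str:
    """Replace Turkish-specific letters with their ASCII look-alikes."""
    return word.translate(_DIACRITIC_MAP)


def turkish_lower(word: str) -> str:
    """Lowercase with Turkish casing: I -> ı and İ -> i."""
    return word.replace("I", "ı").replace("İ", "i").lower()


def is_single_edit(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion,
    substitution, or swap of two adjacent characters."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    # Length of the common prefix.
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        if a[i + 1:] == b[i + 1:]:
            return True  # one substitution
        return (
            i + 1 < len(a)
            and a[i] == b[i + 1]
            and a[i + 1] == b[i]
            and a[i + 2:] == b[i + 2:]
        )  # one adjacent swap
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    return longer[i + 1:] == shorter[i:]  # one insertion or deletion


def classify_edit(error: str, correction: str) -> str:
    """Assign a pair to one coarse category; the check order is an assumption."""
    if error == correction:
        return "none"
    if error.replace(" ", "") == correction.replace(" ", ""):
        return "spacing"
    if turkish_lower(error) == turkish_lower(correction):
        return "capitalization"
    if strip_diacritics(error) == strip_diacritics(correction):
        return "diacritics"
    if is_single_edit(error, correction):
        return "character_edit"
    return "other"


if __name__ == "__main__":
    pairs = [
        ("turkce", "türkçe"),
        ("istanbul", "İstanbul"),
        ("her kes", "herkes"),
        ("gleiyor", "geliyor"),
    ]
    for error, correction in pairs:
        print(error, "->", correction, ":", classify_edit(error, correction))
```

A rule-based check like this only separates pairs that fall cleanly into a single category. Pairs that mix categories, or that are not spelling corrections at all, land in "other", which is the kind of case a learned classifier would be better suited to handle.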