Unsupervised Morphological Word Clustering

dc.contributor.advisor: Levow, Gina-Anne (en_US)
dc.contributor.author: Lushtak, Sergei A. (en_US)
dc.date.accessioned: 2013-04-17T17:57:50Z
dc.date.available: 2013-04-17T17:57:50Z
dc.date.issued: 2013-04-17
dc.date.submitted: 2012 (en_US)
dc.description: Thesis (Master's)--University of Washington, 2012 (en_US)
dc.description.abstract: This thesis describes a system that clusters the words of a given lexicon into conflation sets (sets of morphologically related words). The word clustering is based on a clustering of suffixes, which, in turn, is based on stem-suffix co-occurrence frequencies. The suffix clustering is performed as a clique clustering of a weighted undirected graph with the suffixes as vertices; each edge weight is calculated as a similarity measure between the suffix signatures of the two vertices it connects, according to the proposed metric. The clustering that yields the lowest lexicon compression ratio is considered optimal. In addition, the hypothesis that the suffix clustering with the lowest compression ratio yields the best word clustering is tested. The system is tested on the CELEX English, German, and Dutch lexicons, and its performance is evaluated against the set of conflation classes extracted from the CELEX morphological database (Baayen et al., 1993). The system's performance is compared to that of other systems: Morfessor (Creutz and Lagus, 2005), Linguistica (Goldsmith, 2000), and the χ^2 significance-test-based word clustering approach (Moon et al., 2009). (en_US)
dc.embargo.terms: No embargo (en_US)
dc.format.mimetype: application/pdf (en_US)
dc.identifier.other: Lushtak_washington_0250O_11149.pdf (en_US)
dc.identifier.uri: http://hdl.handle.net/1773/22453
dc.language.iso: en_US (en_US)
dc.rights: Copyright is held by the individual authors. (en_US)
dc.subject: acquisition; clustering; computational; conflation; morphology; unsupervised (en_US)
dc.subject.other: Linguistics (en_US)
dc.subject.other: Computer science (en_US)
dc.subject.other: linguistics (en_US)
dc.title: Unsupervised Morphological Word Clustering (en_US)
dc.type: Thesis (en_US)
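
The abstract's first steps (stem-suffix co-occurrence, suffix signatures, and a weighted undirected graph over suffixes) can be illustrated with a minimal sketch. It is not the thesis's implementation: the record does not define the proposed similarity metric, so plain Jaccard similarity is swapped in as a stand-in, and a suffix's signature is assumed here to be the set of stems it co-occurs with. The function names and the `max_suffix_len` parameter are hypothetical. The clique clustering and the compression-ratio selection are not shown.

```python
from collections import defaultdict
from itertools import combinations


def suffix_signatures(words, max_suffix_len=3):
    """Map each candidate suffix to the set of stems it co-occurs with.

    Assumption: every split of a word into a non-empty stem and a suffix of
    0..max_suffix_len characters counts as one stem-suffix co-occurrence.
    """
    stems_by_suffix = defaultdict(set)
    for word in words:
        for k in range(max_suffix_len + 1):
            if k >= len(word):  # keep the stem non-empty
                break
            stem, suffix = word[:len(word) - k], word[len(word) - k:]
            stems_by_suffix[suffix].add(stem)
    return stems_by_suffix


def jaccard(a, b):
    """Jaccard similarity of two sets (a stand-in, not the thesis's metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suffix_graph(words, max_suffix_len=3):
    """Return edge weights {(suffix_a, suffix_b): weight} with suffix_a < suffix_b.

    Vertices are suffixes; pairs whose signatures share no stem get no edge.
    """
    sigs = suffix_signatures(words, max_suffix_len)
    edges = {}
    for s, t in combinations(sorted(sigs), 2):
        weight = jaccard(sigs[s], sigs[t])
        if weight > 0:
            edges[(s, t)] = weight
    return edges


lexicon = ["walk", "walks", "walked", "talk", "talks", "talked"]
edges = suffix_graph(lexicon, max_suffix_len=2)
```

In this toy lexicon the suffixes "s" and "ed" attach to exactly the same stems, so they receive the heaviest possible edge under the stand-in metric, while "d" and "ed" share no stem and get no edge.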

Files

Original bundle

Name: Lushtak_washington_0250O_11149.pdf
Size: 1.01 MB
Format: Adobe Portable Document Format
