Unsupervised Morphological Word Clustering
Lushtak, Sergei A.
This thesis describes a system that clusters the words of a given lexicon into conflation sets (sets of morphologically related words). The word clustering is based on a clustering of suffixes, which, in turn, is based on stem-suffix co-occurrence frequencies. The suffix clustering is performed as a clique clustering of a weighted undirected graph with the suffixes as vertices; each edge weight is calculated as a similarity measure between the suffix signatures of the two vertices it connects, according to the proposed metric. The clustering that yields the lowest lexicon compression ratio is considered the optimum. In addition, the hypothesis that the suffix clustering with the lowest compression ratio yields the best word clustering is tested. The system is tested on the CELEX English, German, and Dutch lexicons, and its performance is evaluated against the set of conflation classes extracted from the CELEX morphological database (Baayen et al., 1993). The system's performance is compared to that of other systems: Morfessor (Creutz and Lagus, 2005), Linguistica (Goldsmith, 2000), and the χ²-significance-test-based word clustering approach (Moon et al., 2009).
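To make the graph construction described in the abstract more concrete, here is a minimal Python sketch. It is not the thesis's implementation: the abstract does not define the suffix signature or the proposed similarity metric, so the sketch assumes that a suffix's signature is simply the set of stems it co-occurs with, and it substitutes Jaccard similarity for the proposed metric. The function names `suffix_signatures` and `edge_weights` and the parameters `max_suffix_len` and `min_shared` are hypothetical. The clique clustering and the compression-ratio criterion are not covered.

```python
from collections import defaultdict
from itertools import combinations


def suffix_signatures(words, max_suffix_len=2):
    """Map each candidate suffix to the set of stems it attaches to.

    Assumption: every split of a word into a non-empty stem and a suffix of
    at most max_suffix_len characters is a candidate; "" is the empty suffix.
    """
    signatures = defaultdict(set)
    for word in set(words):
        first_cut = max(1, len(word) - max_suffix_len)
        for cut in range(first_cut, len(word) + 1):
            stem, suffix = word[:cut], word[cut:]
            signatures[suffix].add(stem)
    return signatures


def edge_weights(signatures, min_shared=2):
    """Weight each suffix pair by the Jaccard similarity of their stem sets.

    Jaccard similarity is a stand-in for the thesis's metric. Pairs sharing
    fewer than min_shared stems get no edge.
    """
    weights = {}
    for a, b in combinations(sorted(signatures), 2):
        shared = signatures[a] & signatures[b]
        if len(shared) >= min_shared:
            union = signatures[a] | signatures[b]
            weights[(a, b)] = len(shared) / len(union)
    return weights


if __name__ == "__main__":
    lexicon = ["walk", "walks", "walked", "talk", "talks", "talked"]
    graph = edge_weights(suffix_signatures(lexicon))
    for (a, b), weight in sorted(graph.items()):
        print(f"{a!r} -- {b!r}: {weight:.2f}")
```

On this toy lexicon the suffixes "ed" and "s" share all their stems and receive the maximum weight, but so does the spurious pair "k" and "ks". Pairwise similarity alone cannot tell the two cases apart, which suggests why a global criterion such as the compression ratio mentioned in the abstract is useful for selecting among candidate clusterings.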
- Linguistics