Extracting Topically Related Synonyms from Twitter using Syntactic and Paraphrase Data
Antoniak, Maria Alexandra
The goal of synonym extraction is to automatically gather synsets (groups of synonyms) from a corpus. This task is related to the tasks of normalization and paraphrase detection. We present a series of approaches for synonym extraction on Twitter, which contains unique synonyms (e.g. slang, acronyms, and colloquialisms) for which no traditional resources exist. Because Twitter contains so much variation, we focus our extraction on certain topics. We show that this focus on topics yields significantly higher coverage on a corpus of paraphrases than previous, topic-insensitive work. We also demonstrate improvement on the task of paraphrase detection when we substitute our extracted synonyms into the paraphrase training set. The synonyms are learned by using chunks from a shallow parse to create candidate synonyms and their context windows. They are then incorporated into a paraphrase detection system that uses machine translation metrics as features for a classifier. When we train and test on the paraphrase training set and use synonyms extracted from that same training set, we find a 2.29% improvement in F1 and better coverage than previous systems. This shows the potential of synonyms that are representative of a specific topic. When we instead train on the paraphrase training set, test on the test set, and use synonyms extracted with an unsupervised method from a corpus whose topics match those of the paraphrase test set, we find an improvement in F1 score of 0.81 points. Finally, we demonstrate an approach that uses distant supervision to create a silver-standard training and test set, which we use both to evaluate our synonyms and to demonstrate a supervised approach to synonym extraction.
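The idea of pairing shallow-parse chunks with their context windows can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the tweets have already been split into chunks by some shallow parser, it assumes a window of one chunk on each side, and the function name `candidate_synsets`, the boundary markers, and the toy sentences are all hypothetical.

```python
from collections import defaultdict


def candidate_synsets(chunked_sentences, min_size=2):
    """Group chunks that occur in an identical (left, right) context window.

    chunked_sentences: iterable of lists of chunk strings (assumed to come
    from a shallow parser; the chunking step itself is not shown here).
    """
    by_context = defaultdict(set)
    for chunks in chunked_sentences:
        # Pad with boundary markers so first and last chunks have a context.
        padded = ["<s>"] + list(chunks) + ["</s>"]
        for i in range(1, len(padded) - 1):
            context = (padded[i - 1], padded[i + 1])
            by_context[context].add(padded[i])
    # Keep only contexts shared by at least `min_size` distinct chunks.
    return {
        ctx: sorted(members)
        for ctx, members in by_context.items()
        if len(members) >= min_size
    }


# Toy, hand-chunked input (hypothetical data, not from the thesis corpus).
sentences = [
    ["the new iphone", "is", "awesome"],
    ["the new iphone", "is", "amazing"],
    ["the new iphone", "is", "gr8"],
    ["my phone", "died"],
]
result = candidate_synsets(sentences)
print(result)
```

On this toy input, the only context shared by more than one chunk is ("is", end of sentence), so "amazing", "awesome", and "gr8" are grouped as candidates. A real system would still need to score or filter such candidates, since sharing a context does not by itself make two chunks synonyms.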
- Linguistics