Show simple item record

dc.contributor.advisor  Steinert-Threlkeld, Shane
dc.contributor.author  Sweeney, Jessica
dc.date.accessioned  2022-01-26T23:25:27Z
dc.date.available  2022-01-26T23:25:27Z
dc.date.submitted  2021
dc.identifier.other  Sweeney_washington_0250O_23708.pdf
dc.identifier.uri  http://hdl.handle.net/1773/48284
dc.description  Thesis (Master's)--University of Washington, 2021
dc.description.abstract  This thesis compares three methods for identifying mislabeled examples in datasets: Dataset Cartography (Swayamdipta et al. [2020]), Cleanlab (Northcutt et al. [2021b]), and Ensembling (Brodley and Friedl [1999], Reiss et al. [2020]). Mislabeled examples in a dataset's training split degrade the learning signal models can use for the task, and mislabeled data in the test split prevent accurate assessment of a model's performance, so methods that identify and correct those labels are useful. To compare the methods as directly as possible, we use the Multi-Genre Natural Language Inference corpus (MNLI) as the dataset that all methods inspect for mislabeled examples (Williams et al. [2018]). We choose RoBERTa-large (Liu et al. [2019]) as the model that generates the information about MNLI used as input to each method, and we compare the lists of mislabeled examples predicted by the three methods. Manual inspection of a subset of the data reveals that Dataset Cartography has the highest accuracy in identifying truly mislabeled examples, followed by Cleanlab, then Ensembling. The methods share about half of their flagged examples in common, and all flag roughly the same number of examples (approximately 20k). All three flag examples labeled "neutral" at about twice the frequency of the other MNLI labels. When data is removed from the original training set according to the lists the methods produce, without manual inspection, performance drops on a challenge test set (HANS), though Cartography's and Ensembling's performance drops slightly less than Cleanlab's. Overall, Dataset Cartography appears to be the most accurate method in this particular context and would most reduce the amount of manual relabeling needed when cleaning a dataset's labels. (An illustrative sketch of the three methods follows the record below.)
dc.format.mimetype  application/pdf
dc.language.iso  en_US
dc.rights  none
dc.subject  Computational Linguistics
dc.subject  Data cleaning
dc.subject  Datasets
dc.subject  Natural language understanding
dc.subject  Linguistics
dc.subject  Computer science
dc.subject.other  Linguistics
dc.title  Comparing Methods for Automatic Identification of Mislabeled Data
dc.type  Thesis
dc.embargo.terms  Open Access
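
The abstract above describes three flagging methods; the sketch below is a minimal, hypothetical illustration of the shape of each computation, not the thesis's code. It assumes made-up arrays epoch_probs (per-epoch softmax outputs such as RoBERTa-large would produce over the training set) and labels (gold label indices), filled with random data so the snippet runs standalone; the cutoffs are illustrative, and reusing training-epoch probabilities for Cleanlab and treating epochs as ensemble members are simplifications of what the cited papers prescribe.

import numpy as np
from cleanlab.filter import find_label_issues  # pip install cleanlab

rng = np.random.default_rng(0)
n_epochs, n_examples, n_classes = 5, 1000, 3   # MNLI has 3 labels
# Hypothetical stand-ins: random per-epoch softmax outputs and gold labels.
epoch_probs = rng.dirichlet(np.ones(n_classes), size=(n_epochs, n_examples))
labels = rng.integers(0, n_classes, size=n_examples)

# Dataset Cartography (Swayamdipta et al. [2020]): track the probability of
# the gold label across epochs; low-confidence, low-variability examples are
# "hard-to-learn", the region where mislabeled examples concentrate.
gold_probs = epoch_probs[:, np.arange(n_examples), labels]   # (n_epochs, n_examples)
confidence = gold_probs.mean(axis=0)
variability = gold_probs.std(axis=0)
cartography_flags = (confidence < 0.2) & (variability < 0.1)  # illustrative cutoffs

# Cleanlab (Northcutt et al. [2021b]): confident learning over predicted
# probabilities. It properly expects out-of-sample probabilities; the final
# epoch's outputs are reused here only to keep the sketch self-contained.
cleanlab_flags = find_label_issues(labels=labels, pred_probs=epoch_probs[-1])

# Ensembling / majority filtering (Brodley and Friedl [1999]): flag examples
# whose gold label a majority of ensemble members contradicts. Per-epoch
# argmax predictions stand in for the thesis's separately trained classifiers.
member_preds = epoch_probs.argmax(axis=-1)            # (n_epochs, n_examples)
ensemble_flags = (member_preds != labels).sum(axis=0) > n_epochs / 2

print(f"Cartography: {cartography_flags.sum()}  Cleanlab: {cleanlab_flags.sum()}  "
      f"Ensembling: {ensemble_flags.sum()}  "
      f"all three: {(cartography_flags & cleanlab_flags & ensemble_flags).sum()}")

On real data, the overlap of the three flag sets is what the thesis measures; with the random inputs here the counts are meaningless and only demonstrate that the pipeline runs.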

