Comparing Methods for Automatic Identification of Mislabeled Data
This thesis compares three methods for identifying mislabeled examples in datasets: Dataset Cartography (Swayamdipta et al.), Cleanlab (Northcutt et al. [2021b]), and Ensembling (Brodley and Friedl; Reiss et al.). Mislabeled examples in the training split of a dataset degrade the learning signal that models can use for the task, and mislabeled data in the test split prevents accurate assessment of a model's performance, so methods that identify and correct those labels are useful. To compare the methods as directly as possible, we use the Multi-Genre Natural Language Inference corpus (MNLI) as the dataset that all methods inspect for mislabeled examples (Williams et al.). We choose RoBERTa-large (Liu et al.) as the model that generates the information about MNLI used as input for each method, and we compare the lists of mislabeled examples predicted by the three methods. Manual inspection of a subset of the data reveals that Dataset Cartography has the highest accuracy in identifying truly mislabeled examples, followed by Cleanlab, then Ensembling. The methods share about half of their flagged examples, and all flag roughly the same number (approximately 20k). All three flag examples labeled “neutral” at about twice the frequency of the other MNLI labels. When data is removed from the original training set according to the lists the methods produce, without manual inspection, performance drops on a challenge test set (HANS), though Cartography’s and Ensembling’s performance drops slightly less than Cleanlab’s. Overall, Dataset Cartography appears to be the most accurate method in this setting and would be most likely to reduce the amount of manual relabeling needed when cleaning a dataset’s labels.
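As an illustration of the training-dynamics signal that Dataset Cartography relies on, the following is a minimal sketch (not the authors' implementation, and the threshold value is an assumption for this toy example): for each training example, record the model's probability of the gold label at the end of each epoch; the mean across epochs (confidence) and standard deviation (variability) place the example on the data map, and persistently low-confidence, "hard-to-learn" examples are where mislabeled data tends to concentrate.

```python
from statistics import mean, pstdev

def cartography_stats(gold_probs_per_epoch):
    """For one example, given the model's probability of the gold label
    at each epoch, return (confidence, variability) as in Data Maps."""
    confidence = mean(gold_probs_per_epoch)
    variability = pstdev(gold_probs_per_epoch)
    return confidence, variability

def flag_hard_to_learn(dynamics, conf_threshold=0.2):
    """dynamics: dict mapping example id -> list of per-epoch gold-label
    probabilities. Returns ids whose confidence falls below the threshold,
    i.e. the hard-to-learn region where mislabeling concentrates.
    The 0.2 cutoff is illustrative, not a value from the thesis."""
    flagged = []
    for ex_id, probs in dynamics.items():
        confidence, _ = cartography_stats(probs)
        if confidence < conf_threshold:
            flagged.append(ex_id)
    return flagged

# Toy usage: example "b" is never learned, so it is flagged.
dynamics = {
    "a": [0.4, 0.7, 0.9, 0.95],  # learned quickly: high confidence
    "b": [0.1, 0.1, 0.15, 0.1],  # stays low: candidate mislabeled example
}
print(flag_hard_to_learn(dynamics))  # -> ['b']
```

In the thesis's setting, the per-epoch probabilities would come from RoBERTa-large trained on MNLI rather than from a toy dictionary.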