ResearchWorks Archive

    Comparing Methods for Automatic Identification of Mislabeled Data

    View/Open
    Sweeney_washington_0250O_23708.pdf (1000 KB)
    Author
    Sweeney, Jessica
    Abstract
    This thesis compares three methods for identifying mislabeled examples in datasets: Dataset Cartography (Swayamdipta et al. [2020]), Cleanlab (Northcutt et al. [2021b]), and Ensembling (Brodley and Friedl [1999], Reiss et al. [2020]). Mislabeled examples in the training split degrade the learning signal available to models, and mislabeled data in the test split prevent accurate assessment of a model’s performance, so methods that identify and correct those labels are useful. To compare the methods as directly as possible, we use the Multi-Genre Natural Language Inference corpus (MNLI; Williams et al. [2018]) as the dataset that all three methods inspect for mislabeled examples. We choose RoBERTa-large (Liu et al. [2019]) as the model that generates the per-example information used as input to each method, and we compare the lists of mislabeled examples predicted by the three methods. Manual inspection of a subset of the data reveals that Dataset Cartography has the highest accuracy in identifying truly mislabeled examples, followed by Cleanlab, then Ensembling. The methods share about half of their flagged examples, and all flag roughly the same number (approximately 20k). All three flag examples labeled “neutral” at about twice the frequency of the other MNLI labels. When data is removed from the original training set according to each method’s list, without manual inspection, performance on a challenge test set (HANS) is reduced, though the drop is slightly smaller for Cartography and Ensembling than for Cleanlab. Overall, Dataset Cartography appears to be the most accurate method in this setting and would be most likely to reduce the amount of manual relabeling needed when cleaning a dataset’s labels.
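
    The comparison described in the abstract can be sketched in a few lines of Python. This is not the thesis code; it is a minimal illustration under stated assumptions. The inputs epoch_probs (per-epoch softmax outputs of the trained model, used for Cartography), oof_probs (held-out cross-validated softmax outputs, used for Cleanlab and Ensembling), and labels (gold label ids) are hypothetical names; only cleanlab.filter.find_label_issues is a real library call, and the thresholds and ensemble rule are illustrative simplifications of the cited methods.

    import numpy as np
    from cleanlab.filter import find_label_issues  # real cleanlab API (Northcutt et al. [2021b])

    def cartography_flags(epoch_probs, labels, conf_thresh=0.5, var_thresh=0.2):
        # Dataset Cartography (Swayamdipta et al. [2020]): examples whose gold-label
        # probability stays low across training epochs ("hard-to-learn") are
        # candidate label errors. epoch_probs: (n_epochs, n_examples, n_classes).
        gold = epoch_probs[:, np.arange(len(labels)), labels]  # (n_epochs, n_examples)
        confidence, variability = gold.mean(axis=0), gold.std(axis=0)
        return (confidence < conf_thresh) & (variability < var_thresh)

    def cleanlab_flags(oof_probs, labels):
        # Confident learning: returns a boolean mask of likely label issues.
        return find_label_issues(labels=labels, pred_probs=oof_probs)

    def ensembling_flags(oof_probs, labels):
        # Ensemble filter in the spirit of Brodley and Friedl [1999]: flag examples
        # whose held-out ensemble prediction disagrees with the gold label.
        return oof_probs.argmax(axis=1) != labels

    def jaccard(a, b):
        # Overlap between two boolean flag vectors, for the pairwise comparison.
        a, b = set(np.flatnonzero(a)), set(np.flatnonzero(b))
        return len(a & b) / max(len(a | b), 1)

    # Synthetic stand-in data so the sketch runs end to end.
    rng = np.random.default_rng(0)
    n, k, epochs = 1000, 3, 5
    labels = rng.integers(0, k, size=n)
    epoch_probs = rng.dirichlet(np.ones(k), size=(epochs, n))
    oof_probs = rng.dirichlet(np.ones(k), size=n)

    carto = cartography_flags(epoch_probs, labels)
    clab = cleanlab_flags(oof_probs, labels)
    ens = ensembling_flags(oof_probs, labels)
    print(jaccard(carto, clab), jaccard(carto, ens), jaccard(clab, ens))

    In the thesis setting, the per-epoch and held-out probabilities would come from RoBERTa-large fine-tuned on MNLI rather than from random draws, and the pairwise overlaps correspond to the roughly half-shared flagged sets reported above.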
    URI
    http://hdl.handle.net/1773/48284
    Collections
    • Linguistics [136]
