Continual Learning of Object Classification in the Real World
Abstract
Technological advances in deep learning have brought remarkable performance on the object classification task, but only when the training data for all classes to be learned are available at the same time. Real-world data, however, evolve continually, resulting in ever-changing learning configurations, e.g., new classes are added over time. When a deep learning model loses access to the data of previously trained classes (e.g., due to privacy issues, storage limitations, or data transfer difficulties) and can only be fine-tuned on data of new classes, it may overfit to the new classes and catastrophically forget the previously trained ones because of the end-to-end training strategy. A class incremental learning (CIL) model should therefore be able to learn more and more new classes over time from a stream of continuously arriving data; i.e., only the training data for a small number of classes need to be present at the beginning of training, and new classes can be added progressively.

Traditional deep-learning-based object classification models have a fixed number of output classes in the final softmax layer, which requires knowing the total number of training classes at the beginning of training. This is impractical in the real-world CIL scenario because we cannot know how many classes will be added in the future. Another limitation of traditional deep-learning-based image recognition models is the flat classification output of the standard softmax layer: all training classes are on the same level, and the confidence scores over all classes sum to one. Such models are therefore unable to perform the hierarchical classification task. For example, a `Dog' class and a `Golden Retriever' class could be in the dataset at the same time, and both confidence scores should be one for a `Golden Retriever' input image. A hierarchical classifier should thus be able to predict confidence scores at all levels of the hierarchical data structure. One obvious advantage is that if the confidence score of a sample is too low at the fine level but very high at the coarse level, we can use the coarse-level prediction as the final prediction (a minimal sketch of this fallback inference is given below); flat classifiers, in contrast, have no alternative when the confidence score of the final prediction is too low. To this end, we construct a hierarchical dataset and propose a CNN-based hierarchical classification architecture, which enforces the hierarchical data structure and introduces an efficient training and inference strategy. Furthermore, taking advantage of both class incremental learning and hierarchical classification, we first propose a Hierarchical Class Incremental Learning (HCIL) model for continual learning of object classification in the real world.
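As a concrete illustration of the coarse-level fallback described above, the following is a minimal sketch, not the thesis's actual implementation: the two-level hierarchy, the fine-to-coarse mapping, and the confidence threshold tau are all illustrative assumptions, and summing fine scores is only one simple way to obtain coarse scores (in the proposed architecture each level has its own predictions).

import numpy as np

# Hypothetical two-level hierarchy: fine labels map to coarse labels.
FINE_CLASSES = ["golden_retriever", "poodle", "tabby", "siamese"]
COARSE_CLASSES = ["dog", "cat"]
FINE_TO_COARSE = {0: 0, 1: 0, 2: 1, 3: 1}  # fine index -> coarse index

def hierarchical_predict(fine_probs, tau=0.5):
    """Return (level, class_name) using coarse-level fallback.

    fine_probs: softmax scores over the fine classes.
    tau: illustrative confidence threshold (an assumption, not from the thesis).
    """
    fine_idx = int(np.argmax(fine_probs))
    if fine_probs[fine_idx] >= tau:
        return "fine", FINE_CLASSES[fine_idx]
    # Fine prediction is uncertain: aggregate fine scores into coarse scores,
    # so that e.g. p(dog) = p(golden_retriever) + p(poodle).
    coarse_probs = np.zeros(len(COARSE_CLASSES))
    for f, c in FINE_TO_COARSE.items():
        coarse_probs[c] += fine_probs[f]
    return "coarse", COARSE_CLASSES[int(np.argmax(coarse_probs))]

# Example: fine scores are spread across dog breeds, so fall back to `dog'.
print(hierarchical_predict(np.array([0.40, 0.35, 0.15, 0.10])))  # ("coarse", "dog")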
Most works in class incremental learning assume disjoint sets of classes as tasks. Although a few works deal with overlapping sets of classes, they assume either a balanced data distribution or a mildly imbalanced one. We instead explore one of the understudied real-world CIL settings in which (1) different tasks can share some classes but come with new data samples, and (2) the training data of each task follow a long-tail distribution. We call this setting CIL-LT. We hypothesize that previously trained classification heads possess prototype knowledge of the seen classes and can thus help in learning the new model. We therefore propose Expert-and-Samples-Aware (ESA) incremental learning under long-tail distribution, which combines a multi-expert design with a dynamic weighting technique to deal with the exacerbated forgetting introduced by the long-tail distribution. Experiments show that the proposed method effectively improves accuracy in the CIL-LT setup on MNIST, CIFAR10, and CIFAR100.

From the multimodal perspective, text-prompt-based approaches to continual learning leverage pre-trained text encoders and learnable prompts to encode textual features for classes that arrive sequentially over time. A common challenge in existing works is how to learn fine-grained text prompts, which implicitly carry semantic information of new classes, so that the textual features of newly arrived classes do not overlap with those of previously trained classes, thereby mitigating the catastrophic forgetting problem. To address this challenge, we propose Prototype-guided Text Prompt Selection (ProTPS), a novel approach that intentionally increases training flexibility and thus enforces the learning of fine-grained text prompts. Specifically, ProTPS combines the image and text encoders to learn class-specific vision prototypes and text prompts; the vision prototypes guide the selection and learning of text prompts that encode exclusive fine-grained textual features for each class (a minimal sketch follows this paragraph). We evaluate ProTPS in both the class incremental (CI) setting and the cross-datasets continual (CDC) learning setting. Since ProTPS achieves performance close to the upper bounds, we further collect a real-world marine species dataset, named Marine112, to bring new challenges to the community. Marine112 is a fine-grained dataset with a long-tail distribution and is naturally suited to the class and domain incremental (CDI) learning setting. The results under the three continual learning settings show that our approach performs favorably against recent state-of-the-art methods.
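The prototype-guided selection can be pictured as follows. This is a minimal sketch under stated assumptions, not the ProTPS implementation: the random stand-in features, the prompt-pool size, the cosine-similarity scoring, and the top-k selection rule are all illustrative; the abstract states only that vision prototypes guide which text prompts are selected and learned.

import numpy as np

rng = np.random.default_rng(0)
D, P = 512, 20            # feature dim and text-prompt pool size (assumptions)

# Stand-ins for learned quantities; in ProTPS these come from the image and
# text encoders, here they are random for illustration only.
vision_prototypes = {c: rng.normal(size=D) for c in ["clownfish", "sea_turtle"]}
text_prompt_keys = rng.normal(size=(P, D))   # one key vector per learnable prompt

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def select_prompts(class_name, k=3):
    """Pick the k text prompts whose keys best match the class's vision prototype."""
    proto = l2_normalize(vision_prototypes[class_name])
    keys = l2_normalize(text_prompt_keys)
    scores = keys @ proto                 # cosine similarity, shape (P,)
    return np.argsort(scores)[-k:][::-1]  # indices of the k best-matching prompts

print(select_prompts("clownfish"))  # e.g. indices of 3 prompts to train for this class

Under this reading, only the selected prompts would be updated for a given class, which keeps the learned textual features of different classes from overlapping.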
The expected contributions of this proposal for continual learning of object classification in the real world can be summarized as follows. (1) We propose the HCIL framework, which jointly performs the hierarchical classification task and class incremental learning without catastrophically forgetting previously trained classes. (2) We propose the ESA framework, with a multi-expert design and a dynamic weighting technique, for one of the understudied real-world CIL settings in which different tasks can share some classes but come with new data samples, and the training data of each task follow a long-tail distribution. (3) We propose a multimodal transformer-based framework, ProTPS, which intentionally increases training flexibility and thus enforces the learning of fine-grained text prompts, so that the textual features of newly arrived classes do not overlap with those of previously trained classes; ProTPS is shown to be effective under three real-world continual learning settings, i.e., the class incremental (CI), cross-datasets continual (CDC), and class and domain incremental (CDI) learning settings. (4) From the dataset perspective, we collect a new hierarchical dataset for continual learning, three understudied CIL-LT datasets, and a real-world marine species dataset, Marine112, which is fine-grained, long-tailed, naturally suited to the CDI setting, and brings new challenges to the continual learning community.

The proposed and collected datasets present novel real-world challenges to the continual learning community, and the proposed frameworks represent significant advances toward continual learning of object classification in real-world scenarios.
Description
Thesis (Ph.D.)--University of Washington, 2024
