Submodular Data Selection and Augmentation for Resource-Efficient Learning

Kumari, Lilly

dc.contributor.advisor	Bilmes, Jeffrey
dc.contributor.author	Kumari, Lilly
dc.date.accessioned	2025-05-12T22:47:43Z
dc.date.available	2025-05-12T22:47:43Z
dc.date.issued	2025-05-12
dc.date.submitted	2025
dc.description	Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract	The increasing complexity of modern machine learning (ML) systems, particularly large-scale transformer-based models, presents significant challenges with respect to computational, data, and memory efficiency. Models with billions of parameters, such as Large Language Models (LLMs), Vision Transformers (ViTs), and Multimodal Large Language Models (MLLMs), require vast datasets, substantial memory, and extensive compute resources, making them difficult to scale and deploy, especially in resource-constrained and data-scarce environments. These challenges are further compounded by the redundancy present in both real and synthetic data, which leads to inefficiencies in the data annotation, training, and inference stages by increasing resource overhead and making it difficult to extract meaningful insights from the data. Additionally, input context selection in LLMs and MLLMs is critical for improving their test-time efficiency, as these models often process long sequences of text or multimodal information. Processing redundant and irrelevant context strains compute and memory resources and can degrade output quality, highlighting the importance of effective context selection in optimizing resource usage. To address these challenges and improve the efficiency and accessibility of ML systems, we require strategies that optimize resource utilization through high-quality data selection, augmentation, and efficient input context selection, particularly in LLMs and MLLMs. We explore two complementary approaches, submodular data selection and data augmentation, to enhance the efficiency of ML systems without compromising model performance. The first approach leverages submodular optimization to model diversity, representativeness, and relevance in selecting data subsets and input contexts for training and inference.
The second approach focuses on data augmentation to enhance data utility, improve model robustness, and mitigate catastrophic forgetting in resource-constrained settings such as continual learning. To enable efficient query-focused data selection, we propose Submodular Span Summarization (S3), a framework that selects diverse and query-relevant data subsets. We demonstrate its effectiveness across multiple data modalities for query-focused summarization tasks. The S3 framework provides an effective solution for optimizing annotation, training, and inference costs by selecting query-relevant and representative subsets, facilitating tasks such as active learning, targeted data selection, and efficient input context selection for LLMs and MLLMs. Extending this, we propose Div-S3, an end-to-end submodular optimization approach for in-context learning (ICL) using LLMs to enable efficient exemplar selection and retrieval while improving downstream task performance under data annotation constraints. Building on the principles of submodularity, we further optimize LLM inference efficiency with BumbleBee, a novel key-value (KV) cache summarization algorithm. As LLMs scale, maintaining large KV caches for autoregressive inference becomes increasingly resource-intensive. BumbleBee reduces computational overhead and memory footprint, allowing LLMs to maintain an effectively infinite context without any architectural modifications or additional fine-tuning. We extend BumbleBee to multimodal tasks and propose VisionBee, a submodular optimization framework that significantly reduces the number of visual tokens processed by MLLMs while maintaining performance on several image and video understanding tasks. Beyond data selection, we explore data augmentation for continual learning, where an ML model learns from a stream of data coming from different tasks without revisiting previous tasks' data.
We introduce Retrospective Adversarial Replay (RAR), which generates informative replay instances that capture the forgetting frontier, using adversarial augmentations and MixUp to enhance data diversity. Overall, our experiments across diverse tasks and data modalities show that our proposed data selection and augmentation approaches significantly improve resource efficiency during the annotation, training, and inference stages while maintaining model performance.
dc.embargo.terms	Open Access
dc.format.mimetype	application/pdf
dc.identifier.other	Kumari_washington_0250E_27913.pdf
dc.identifier.uri	https://hdl.handle.net/1773/52976
dc.language.iso	en_US
dc.rights	CC BY-NC-SA
dc.subject	Continual Learning
dc.subject	Data Augmentation
dc.subject	Data Selection
dc.subject	Large Language Models
dc.subject	Resource-Efficient Machine Learning
dc.subject	Submodular Optimization
dc.subject	Artificial intelligence
dc.subject.other	Electrical and computer engineering
dc.title	Submodular Data Selection and Augmentation for Resource-Efficient Learning
dc.type	Thesis

Files

Original bundle

Name: Kumari_washington_0250E_27913.pdf
Size: 46.04 MB
Format: Adobe Portable Document Format

Collections

Electrical and computer engineering