Submodular Data Selection and Augmentation for Resource-Efficient Learning
| dc.contributor.advisor | Bilmes, Jeffrey | |
| dc.contributor.author | Kumari, Lilly | |
| dc.date.accessioned | 2025-05-12T22:47:43Z | |
| dc.date.available | 2025-05-12T22:47:43Z | |
| dc.date.issued | 2025-05-12 | |
| dc.date.submitted | 2025 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2025 | |
| dc.description.abstract | The increasing complexity of modern machine learning (ML) systems, particularly large-scale transformer-based models, presents significant challenges with respect to computational, data, and memory efficiency. Models with billions of parameters, such as Large Language Models (LLMs), Vision Transformers (ViTs), and Multimodal Large Language Models (MLLMs), require vast datasets, substantial memory, and extensive compute resources, making them difficult to scale and deploy, especially in resource-constrained and data-scarce environments. These challenges are further compounded by the redundancy present in both real and synthetic data, which leads to inefficiencies in data annotation, training, and inference stages by increasing resource overhead and making it difficult to extract meaningful insights from the data. Additionally, input context selection in LLMs and MLLMs is critical for improving their test-time efficiency, as these models often process long sequences of text or multimodal information. Processing redundant and irrelevant context strains compute and memory resources and can degrade output quality, highlighting the importance of effective context selection in optimizing resource usage. To address these challenges and improve the efficiency and accessibility of ML systems, we require strategies that optimize resource utilization through high-quality data selection, augmentation, and efficient input context selection, particularly in LLMs and MLLMs. We explore two complementary approaches, submodular data selection and data augmentation, to enhance the efficiency of ML systems without compromising model performance. The first approach leverages submodular optimization to model diversity, representativeness, and relevance in selecting data subsets and input contexts for training and inference. 
The second approach focuses on data augmentation to enhance data utility, improve model robustness, and mitigate catastrophic forgetting in resource-constrained settings such as continual learning. To enable efficient query-focused data selection, we propose Submodular Span Summarization (S3), a framework that selects diverse and query-relevant data subsets. We demonstrate its effectiveness across multiple data modalities for query-focused summarization tasks. The S3 framework provides an effective solution for optimizing annotation, training, and inference costs by selecting query-relevant and representative subsets, facilitating tasks such as active learning, targeted data selection, and efficient input context selection for LLMs and MLLMs. Extending this, we propose Div-S3, an end-to-end submodular optimization approach for in-context learning (ICL) using LLMs to enable efficient exemplar selection and retrieval while improving downstream task performance under data annotation constraints. Building on the principles of submodularity, we further optimize LLM inference efficiency with BumbleBee, a novel key-value (KV) cache summarization algorithm. As LLMs scale, maintaining large KV caches for autoregressive inference becomes increasingly resource-intensive. BumbleBee reduces computational overhead and memory footprint, allowing LLMs to maintain an effectively infinite context without any architectural modifications or additional fine-tuning. We extend BumbleBee to multimodal tasks and propose VisionBee, a submodular optimization framework that significantly reduces the number of visual tokens processed by MLLMs while maintaining performance on several image and video understanding tasks. Beyond data selection, we explore data augmentation for continual learning, where an ML model learns from a stream of data coming from different tasks without revisiting previous tasks' data. 
We introduce Retrospective Adversarial Replay (RAR), which generates informative replay instances that capture the forgetting frontier, using adversarial augmentations and MixUp to enhance data diversity. Overall, our experiments across diverse tasks and data modalities show that our proposed data selection and augmentation approaches significantly improve resource efficiency during the annotation, training, and inference stages while maintaining model performance. | |
| dc.embargo.terms | Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Kumari_washington_0250E_27913.pdf | |
| dc.identifier.uri | https://hdl.handle.net/1773/52976 | |
| dc.language.iso | en_US | |
| dc.rights | CC BY-NC-SA | |
| dc.subject | Continual Learning | |
| dc.subject | Data Augmentation | |
| dc.subject | Data Selection | |
| dc.subject | Large Language Models | |
| dc.subject | Resource-Efficient Machine Learning | |
| dc.subject | Submodular Optimization | |
| dc.subject | Artificial intelligence | |
| dc.subject.other | Electrical and computer engineering | |
| dc.title | Submodular Data Selection and Augmentation for Resource-Efficient Learning | |
| dc.type | Thesis |
Files
Original bundle
- Name: Kumari_washington_0250E_27913.pdf
- Size: 46.04 MB
- Format: Adobe Portable Document Format
