Efficient Scaling of Language Models
Abstract
Large language models (LLMs) are progressively reshaping how humans interact with information, offering increasingly sophisticated access to knowledge through natural language interfaces and advancing reasoning capabilities across diverse domains. Yet their impressive gains have hinged on a simple recipe: exponentially increasing resources. Over recent years, computational requirements have increased tenfold annually, training costs now reach billions of dollars, and we are rapidly exhausting the internet's high-quality text. This dissertation addresses a critical question: how can we unlock further capabilities from scale while curbing the exponential growth in compute, data, and energy?

I present three complementary innovations across the LLM development pipeline that fundamentally improve scaling efficiency.

Traditional LLMs depend on tokenization, a preprocessing step that introduces biases and inefficiencies. The Byte Latent Transformer (BLT) eliminates this bottleneck by learning directly from raw bytes, dynamically grouping them into larger entropy-adaptive patches via lightweight encoder-decoder modules. This tokenizer-free architecture not only improves robustness to noisy and multilingual inputs but also achieves up to a 50% reduction in inference FLOPs while matching the performance of tokenization-based models. Controlled scaling experiments demonstrate that BLT opens a new scaling dimension, with improved scaling trends over current approaches.

Beyond model architecture, data quality has a significant impact on model performance. Socratic Pretraining transforms unlabeled documents into a richer training signal by masking salient sentences, synthetically generating questions about the missing content, and training models to both pose questions and draft answers. Applied to BART-large, this approach achieves state-of-the-art performance on QMSum and SQuALITY (+1.0 and +0.5 ROUGE-1 over strong baselines), halves labeled-data requirements, and improves faithfulness across multiple control interfaces, all with minimal computational overhead.

QLoRA combines novel 4-bit quantization with parameter-efficient finetuning, reducing the memory requirements of supervised finetuning by 15× without performance degradation. This efficiency enabled comprehensive instruction-tuning studies revealing that data quality, not quantity, drives downstream performance in post-training. Efficient finetuning of the quantized base model also alleviates quantization errors, reducing inference memory requirements while preserving full-precision quality.

These methods collectively demonstrate that sustainable scaling requires rethinking fundamental assumptions at each pipeline stage. BLT's entropy-adaptive computation, Socratic Pretraining's synthetic supervision, and QLoRA's quantized adaptation each extract more capability per unit of resource, whether FLOPs, tokens, or memory. Together, they chart a practical roadmap for continued LLM progress within more sustainable computational bounds, proving that smarter algorithms can bend the scaling laws themselves.
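
To make the entropy-adaptive patching idea concrete, the following minimal Python sketch groups a byte stream into patches at points where a small next-byte model is uncertain. The probs_fn helper and the threshold value are illustrative placeholders, not components of the BLT implementation described in the dissertation.

import math

def next_byte_entropy(prefix: bytes, probs_fn) -> float:
    # probs_fn(prefix) is assumed to return a 256-way probability
    # distribution over the next byte; it stands in for the small
    # byte-level language model used to score patch boundaries.
    p = probs_fn(prefix)
    return -sum(q * math.log2(q) for q in p if q > 0)

def entropy_patches(data: bytes, probs_fn, threshold: float = 4.0):
    # Group raw bytes into patches, starting a new patch whenever the
    # next-byte entropy exceeds the (hypothetical) threshold, so more
    # compute is spent where the byte stream is hard to predict.
    patches, start = [], 0
    for i in range(1, len(data)):
        if next_byte_entropy(data[:i], probs_fn) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

# Toy usage with a uniform next-byte model (entropy is 8 bits everywhere):
uniform = lambda prefix: [1.0 / 256] * 256
print(entropy_patches(b"hello world", uniform, threshold=8.5))  # -> [b'hello world']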
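The Socratic Pretraining recipe of masking salient content and pairing generated questions with drafted answers can be sketched roughly as follows. The salience_fn and question_fn helpers are hypothetical stand-ins for a salience scorer and a question-generation model; this is an illustration of the data-construction idea, not the dissertation's code.

def build_socratic_example(document: str, salience_fn, question_fn, mask_token="<mask>"):
    # Turn an unlabeled document into an (input, target) pretraining pair:
    # mask the most salient sentence, then train the model to pose a question
    # about the missing content and draft that sentence as the answer.
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    scores = salience_fn(sentences)          # assumed: one score per sentence
    idx = max(range(len(sentences)), key=lambda i: scores[i])
    source = ". ".join(
        mask_token if i == idx else s for i, s in enumerate(sentences)
    ) + "."
    target = f"{question_fn(sentences[idx])} {sentences[idx]}."
    return source, target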
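The QLoRA recipe of freezing a 4-bit NF4-quantized base model and training low-rank adapters maps onto widely used open-source tooling (transformers, bitsandbytes, peft) roughly as shown below. The model name, adapter rank, and target modules are illustrative assumptions rather than the dissertation's exact configuration.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, as in the QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# "meta-llama/Llama-2-7b-hf" is a placeholder; any causal LM would do.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; rank and targets are illustrative.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

Only the small adapter matrices receive gradients while the quantized base model stays frozen, which is what keeps finetuning memory far below full-precision training.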
Description
Thesis (Ph.D.)--University of Washington, 2025
