Towards reliability and interactive debugging for large language models

Paranjape, Bhargavi

Files

Paranjape_washington_0250E_26573.pdf (1.37 MB)

Date

2024-04-26

Author

Paranjape, Bhargavi

Abstract

Large language models (LLMs) have permeated our everyday lives and are used in critical decision-making scenarios that can affect millions of people. Despite their impressive progress, model deficiencies can exacerbate harmful biases or lead to catastrophic failures. In this thesis, we present and advance a series of important considerations for reliable model deployment. Beyond improved accuracy on new and complex tasks, users want more transparent models that explain their predictions and are robust to data biases and distributional shifts. They also want to be equipped to interact with these models to better understand and debug them. We present a variety of training and inference techniques for building these aspects of reliability into models. We focus in particular on techniques that address the challenges of scale and lack of human supervision, for models ranging from classifiers with limited interaction potential to massive LLMs that can communicate with humans and external tools. In the first part of this thesis, on advancing explainability for LLMs, we introduce a novel information-theoretic objective for training models to generate explanations that are concise, comprehensible, and faithful to model predictions. We also introduce a contrastive prompt-based approach for explaining model predictions on common-sense reasoning tasks, which users can also leverage to probe model behavior. In the second part, we focus on distributional robustness for LLMs. We develop a novel optimization technique that discovers error-prone data slices for users to examine and trains a robust classifier to improve performance on rare data slices. We also develop an open-source framework for fine-grained attribution of hallucinations in model-generated text to the underlying pre-training data. In the third part, we present a framework for automatically decomposing unseen composite tasks that require multi-step reasoning and external-system interaction, and we examine how the framework supports user debugging. Overall, this thesis presents a range of optimization, inference, and evaluation methods that make progress toward better explainability, robustness, and interactive debugging of large language models.