Steps Towards the Pluralistic Alignment of Language Models

Abstract

AI alignment is concerned with ensuring that AI systems understand and adhere to human values and preferences. However, most prior alignment work makes the simplifying assumption that preferences are monolithic. In reality, human values and preferences vary between and within individuals, groups, and societies. In this dissertation, I formalize and advance the study of \textit{pluralistic alignment}: aligning AI systems with diverse human values, perspectives, and preferences. Specifically, I use large language models (LLMs) as a test-bed for pluralistic alignment.

I first motivate the need for pluralism in alignment, outlining the failure modes and risks of assuming that value variation does not exist, or of ignoring such variation. I propose a concrete framework for pluralistic alignment, including three definitions of how models and benchmarks can each be pluralistic. Based on this framework, I propose a roadmap with recommendations and directions for further empirical and methodological work in the area. This framework has been widely adopted by the community and serves as an agenda for the remainder of the dissertation. Next, I focus on improving LLMs' ability to properly model and steer to varied human values. I introduce \textsc{ValuePrism}, a large-scale dataset for value pluralism, and conduct a human study to understand whose values it represents. With this dataset, I train \textsc{Value Kaleidoscope}, a model that assesses the relevance of values to a particular situation and gives contextual judgments based on a value description. I find that the model is sensitive to situational changes and that it helps to explain human variation.

I then propose an autoencoder-based approach for inferring the values that could have led to a particular individual's judgments, called \textit{value profiles} (sketched below). I find that value profiles preserve $>$70\% of the predictive information in the rater demonstrations on which they are based, and offer benefits in interpretability and steerability. Based on value profiles, I propose a novel rater clustering method for assigning individuals to a fixed number of clusters (also sketched below). I find that these clusters are far more predictive than demographic groupings of the same size, and that they enable dataset-specific analysis of the dimensionality of rater variation. Generalizing beyond textual value descriptions, I focus on language model post-training for general tasks and abilities. I find that current instruction-tuning techniques reduce pluralism in many ways: they harm LLMs' ability to steer to subjective judgments and diverse generation distributions, lead to mode collapse on queries with many valid answers, and reduce distributional alignment. Pretrained models are better at steering and at matching distributions, but are less usable because they are poor at following instructions.

To improve instruction-following while also improving pluralism, I compile \textsc{Spectrum Suite}, a large-scale resource in a unified format drawn from $>$40 datasets that require inferring and steering to diverse generation functions in-context. With this data, I introduce \textsc{Spectrum Tuning}, a simple and scalable post-training method that improves instruction-following concurrently with several modes of pluralism, leading to more steerable models that also avoid mode collapse.
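A toy illustration of the value-profile idea above, as a hedged sketch: a rater's demonstration embeddings are compressed into a compact profile vector, which is then decoded together with a new item into a predicted judgment. All names, dimensions, and the mean-pooling encoder here are illustrative assumptions, not the dissertation's implementation.

```python
# Hedged sketch: compress a rater's judgment demonstrations into a
# low-dimensional "value profile" vector, then decode that profile (plus a
# new item embedding) into a predicted judgment. Names and dimensions are
# assumptions for illustration.
import torch
import torch.nn as nn


class ValueProfileAutoencoder(nn.Module):  # hypothetical name
    def __init__(self, demo_dim: int = 768, profile_dim: int = 32, n_labels: int = 5):
        super().__init__()
        # Encoder: pooled demonstration embeddings -> compact value profile.
        self.encoder = nn.Sequential(
            nn.Linear(demo_dim, 256), nn.ReLU(), nn.Linear(256, profile_dim)
        )
        # Decoder: (value profile, new item embedding) -> judgment logits.
        self.decoder = nn.Sequential(
            nn.Linear(profile_dim + demo_dim, 256), nn.ReLU(), nn.Linear(256, n_labels)
        )

    def encode(self, demos: torch.Tensor) -> torch.Tensor:
        # demos: (n_demos, demo_dim); mean-pool the rater's demonstrations.
        return self.encoder(demos.mean(dim=0))

    def forward(self, demos: torch.Tensor, item: torch.Tensor) -> torch.Tensor:
        profile = self.encode(demos)
        return self.decoder(torch.cat([profile, item], dim=-1))


# Toy usage: 8 demonstration embeddings from one rater, one held-out item.
model = ValueProfileAutoencoder()
demos = torch.randn(8, 768)
item = torch.randn(768)
logits = model(demos, item)  # predicted judgment distribution over 5 labels
print(logits.softmax(dim=-1))
```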
Based on \textsc{Spectrum Tuning}, I further design a system for steering to individuals, which achieves state-of-the-art performance at modeling individuals' subjective judgments. To conclude, I survey related work in the community that builds on the pluralistic alignment framework and methodologies introduced here, and outline directions for future work.
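In the same spirit, a hedged sketch of the profile-based rater clustering mentioned above: group raters by their inferred value-profile vectors and ask how well cluster membership predicts individual judgments. The choice of k-means, the toy data, and the majority-vote evaluation are assumptions for illustration only.

```python
# Hedged sketch: cluster raters by their inferred value-profile vectors, then
# check how often a cluster's majority judgment matches each member's own
# judgment (a toy stand-in for comparing against demographic groupings of the
# same size). All data here is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_raters, profile_dim, n_clusters = 200, 32, 8

profiles = rng.normal(size=(n_raters, profile_dim))  # inferred value profiles
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(profiles)

# A cluster's "prediction" for an item is the majority judgment of its members;
# predictiveness is how often that majority matches each rater (toy version,
# without holding the rater out of their own cluster's vote).
judgments = rng.integers(0, 2, size=n_raters)  # toy binary judgments on one item
accuracy = np.mean([
    judgments[cluster_ids == c].mean().round() == judgments[i]
    for i, c in enumerate(cluster_ids)
])
print(f"majority-vote agreement within clusters: {accuracy:.2f}")
```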

Description

Thesis (Ph.D.)--University of Washington, 2025
