Scaling Human Supervision for Robotic Manipulation


Abstract

Robots are increasingly deployed in unstructured, human-centric environments—such as homes, warehouses, and hospitals—where they must adapt to diverse tasks, novel objects, and evolving user preferences. While pretraining on large-scale datasets or in simulation provides a useful foundation for general-purpose manipulation, the domain gap and scarcity of task-relevant real-world data remain major obstacles to robust deployment. Learning from real-world experience is critical for improving generalization and reliability, but the real world provides no automatic supervision. Human supervision, through demonstrations or interventions, remains the most effective and grounded signal for guiding robot learning, yet it is difficult and expensive to scale. This dissertation explores two complementary approaches to address this challenge. First, it investigates methods for distilling human supervision into reward models that enable reinforcement learning beyond the original data. These learned rewards allow robots to refine their behavior autonomously, increasing sample efficiency while reducing dependence on constant human input. Second, it explores how vision and language foundation models pretrained on internet data can simplify and enhance human supervision. By leveraging these models to extract task-relevant structure from multimodal demonstrations, robots can acquire skills from a single example and generalize to new objects, tasks, and environments. These approaches are validated across a range of real-world robotic manipulation tasks. By making human supervision both scalable and intuitive, this work aims to enable robots that require less supervision, learn more efficiently, and succeed more reliably in the open-ended, human-centric environments of the real world.

Description

Thesis (Ph.D.)--University of Washington, 2025