Towards Generalizable Open-World Robot Manipulation by Training with Off-Domain Data

Li, Yi

Files

Li_washington_0250E_28463.pdf (24.01 MB)

Date

2025-08-01

Authors

Li, Yi

Abstract

Achieving generalization in open-world robotic manipulation is a critical step toward deploying autonomous agents in dynamic, unstructured environments. However, learning manipulation skills that generalize to unseen objects, layouts, and natural language instructions remains challenging, particularly due to the scarcity and narrow coverage of real-world robot data. This dissertation explores how off-domain supervision—ranging from synthetic data to large-scale vision-language pretraining—can be harnessed to build scalable, generalist robot systems. We present three systems that progressively tackle generalization across different levels of the perception-to-action pipeline. First, we introduce DeepIM, a pose refinement framework based on iterative render-and-compare, which enables accurate 6D pose estimation using only RGB input. DeepIM demonstrates robust generalization to unseen objects and views, providing a foundation for geometry-aware manipulation. Next, we propose STOW, a discrete-frame segmentation and tracking method trained entirely on synthetic data. STOW exhibits strong sim-to-real transfer in cluttered warehouse environments by learning object-centric, temporally consistent representations, enabling robust multi-object scene understanding. Finally, we develop HAMSTER, a hierarchical vision-language-action model that integrates pretrained vision-language models with robot control via an intermediate abstraction of spatial sketch trajectories. HAMSTER enables the interpretation of diverse natural language instructions and their execution across varied semantic, geometric, and visual contexts. Together, these systems chart a path from model-based to model-free design in robotics, demonstrating that careful use of intermediate representations, modularity, and off-domain learning can bridge the gap between narrow robot deployments and open-world capability. 
By leveraging world knowledge and pretraining from large-scale datasets, this work contributes toward the long-term vision of scalable, generalist manipulation systems that adapt flexibly to real-world complexity.