Fox, Dieter
Yuan, Wentao
2024-10-16
2024
Yuan_washington_0250E_27563.pdf
https://hdl.handle.net/1773/52471
Thesis (Ph.D.)--University of Washington, 2024

Advancing robotics involves enabling systems to generalize across diverse and unseen environments, known as "the open world." Traditional approaches rely on state estimators, while modern learning-based methods develop implicit representations to approximate states. Both approaches require well-designed states or representations for effective generalization. This dissertation investigates learning representations that enhance generalization in robotic systems, focusing on objects and actions. First, I introduce SORNet (Spatial Object-Centric Representation Network), a framework for learning object-centric representations from RGB images using canonical object views. SORNet generalizes to unseen objects with different shapes and textures, outperforming existing techniques in tasks like spatial relation classification and task planning for sequential manipulation. Next, I present M2T2, a transformer model that predicts low-level actions for manipulating objects in cluttered scenes. M2T2 reasons about contact points and gripper poses from raw point clouds. Trained on a large-scale synthetic dataset, M2T2 achieves zero-shot sim2real transfer on real robots, surpassing state-of-the-art models in both overall performance and in challenging tasks requiring object re-orientation. Finally, I introduce RoboPoint, a vision-language model that predicts keypoint affordances from language instructions. Using a synthetic data generation pipeline, RoboPoint trains without real-world data collection or human demonstration. It supports applications such as robot navigation, manipulation, and augmented reality, and outperforms existing models in spatial affordance prediction and task success rates.
The dissertation concludes with a discussion on challenges and future directions for developing foundational models in robotics, aiming to create versatile systems capable of operating in open-world environments.

application/pdf
en-US
CC BY-SA
Artificial Intelligence; Computer Vision; Foundation Models; Machine Learning; Representation Learning; Robotics
Computer science; Computer engineering
Computer science and engineering
Objects and Actions: Learning Representations for Open-World Robotics
Thesis