Objects and Actions: Learning Representations for Open-World Robotics

dc.contributor.advisor: Fox, Dieter
dc.contributor.author: Yuan, Wentao
dc.date.accessioned: 2024-10-16T03:12:02Z
dc.date.available: 2024-10-16T03:12:02Z
dc.date.issued: 2024-10-16
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Advancing robotics involves enabling systems to generalize across diverse and unseen environments, known as "the open world." Traditional approaches rely on state estimators, while modern learning-based methods develop implicit representations to approximate states. Both approaches require well-designed states or representations for effective generalization. This dissertation investigates learning representations that enhance generalization in robotic systems, focusing on objects and actions. First, I introduce SORNet (Spatial Object-Centric Representation Network), a framework for learning object-centric representations from RGB images using canonical object views. SORNet generalizes to unseen objects with different shapes and textures, outperforming existing techniques in tasks like spatial relation classification and task planning for sequential manipulation. Next, I present M2T2, a transformer model that predicts low-level actions for manipulating objects in cluttered scenes. M2T2 reasons about contact points and gripper poses from raw point clouds. Trained on a large-scale synthetic dataset, M2T2 achieves zero-shot sim2real transfer on real robots, surpassing state-of-the-art models in both overall performance and in challenging tasks requiring object re-orientation. Finally, I introduce RoboPoint, a vision-language model that predicts keypoint affordances from language instructions. Using a synthetic data generation pipeline, RoboPoint trains without real-world data collection or human demonstration. It supports applications such as robot navigation, manipulation, and augmented reality, and outperforms existing models in spatial affordance prediction and task success rates. The dissertation concludes with a discussion on challenges and future directions for developing foundational models in robotics, aiming to create versatile systems capable of operating in open-world environments.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Yuan_washington_0250E_27563.pdf
dc.identifier.uri: https://hdl.handle.net/1773/52471
dc.language.iso: en_US
dc.rights: CC BY-SA
dc.subject: Artificial Intelligence
dc.subject: Computer Vision
dc.subject: Foundation Models
dc.subject: Machine Learning
dc.subject: Representation Learning
dc.subject: Robotics
dc.subject: Computer science
dc.subject: Computer engineering
dc.subject.other: Computer science and engineering
dc.title: Objects and Actions: Learning Representations for Open-World Robotics
dc.type: Thesis

Files

Original bundle

Name: Yuan_washington_0250E_27563.pdf
Size: 27.18 MB
Format: Adobe Portable Document Format