Acting with Language


Authors

Shridhar, Mohit

Abstract

How can we imbue robots with the ability to achieve arbitrary goals in novel environments? Language provides a natural interface for guiding robots and abstracting away the complexities of the physical world. Previous attempts to guide robots with language often rely on human-designed intermediate representations, such as object detections, categories, poses, and symbolic states. These representations struggle to capture everyday objects such as deformable shirts, coffee beans, ropes, and cherry stems. One alternative that does not require human-designed representations is end-to-end deep learning, which directly maps camera observations to robot actions. While learning approaches are vastly more expressive than traditional methods, they are severely bottlenecked by the scarcity of training data in robotics: collecting enough data to train even a simple policy can take months and does not scale. However, robot data contains spatial symmetries and other structural priors that can be exploited to efficiently learn policies for a wide range of tasks. In this thesis, we present methods for using language to guide robot actions through end-to-end learning. First, we present ALFRED, a large-scale dataset and benchmark for evaluating agents that follow language instructions in partially observable household environments. Next, we introduce CLIPort and PerAct, two language-conditioned manipulation frameworks that aim to replicate in robotics the success of large pre-trained vision and language models. These frameworks use spatial priors to efficiently learn action representations from limited data. Lastly, we discuss ALFWorld, a framework for learning “textual policies” in interactive text games, thereby sidestepping the visual and physical complexities of embodied environments. We conclude with a discussion of counterpoints, limitations, and potential future directions for scaling up robot learning and butler robots.

Description

Thesis (Ph.D.)--University of Washington, 2023
