Learning Structured Representations of the Visual World

Wallingford, Matthew

Files

Wallingford_washington_0250E_28577.pdf (28.36 MB)

Date

2025-10-02

Authors

Wallingford, Matthew

Abstract

Humans develop complex internal models of the world that allow us to generalize remarkably well to new scenarios and tasks. While deep learning has steadily improved in performance through data and scale, it conspicuously lags behind biological intelligence in generalizing to changing data distributions and transferring across tasks. We argue that one key element absent from current deep learning systems is such an internal model of the world, which enables efficient transfer of knowledge to new settings and data. In this work, we investigate how aspects of world models such as compositionality and 3D spatial understanding can be learned from visual data and used to improve the efficiency and robustness of current machine learning systems. We develop new methods and loss objectives for learning structured representations. We demonstrate that learning from more complex visual data such as video, embodied exploration, and 360° video yields more structured world models, which in turn improve sample efficiency and spatial understanding. In addition, we explore further directions, developing methods to improve the transfer of knowledge between tasks and robustness to shifting data distributions.