Grounding Perception and Reasoning in Multimodal Models

Abstract

Current large multimodal models achieve impressive recognition capabilities that approach human-level performance. Their success is driven largely by massive web-scale pretraining, with continued improvements from scaling model and data size. However, this reliance on web data has not yet produced fully reliable multimodal systems: these models remain prone to hallucinating content and to failing to ground their reasoning in images and videos. The disconnect arises because web data rarely captures the implicit, grounded reasoning inherent to human cognition, which is seldom verbalized explicitly online. The result is a data scarcity in which purely scaling up the current pipeline may no longer yield proportional improvements. This thesis investigates methods to bridge this gap by grounding perception and reasoning through targeted data distillation from structured representations. First, to address the scarcity of grounded reasoning data, I introduce VisualComet, a large-scale dataset of human annotations designed to train models that predict dynamic events, past contexts, and human intents from images. Recognizing the scalability bottlenecks of manual annotation, I then present Localized Symbolic Knowledge Distillation (LSKD), which leverages large language models to generate synthetic reasoning chains and uses a trained critic model to filter these outputs for quality and alignment with human judgment. Next, I examine the reliability of the underlying perception systems through a diagnostic framework that constructs automatic contrast sets for video-language models. By systematically manipulating entities and verbs in video descriptions, this work reveals that state-of-the-art models frequently ignore visual signals in favor of language priors.
Finally, to directly enhance these perception capabilities, I introduce Synthetic Visual Genome, a method for scaling up structured scene graphs defined by objects and their relationships. By distilling this structural knowledge into the training process, I show that this approach significantly improves grounded relationship understanding and reasoning.

Description

Thesis (Ph.D.)--University of Washington, 2026
