Grounding Perception and Reasoning in Multimodal Models
| dc.contributor.advisor | Farhadi, Ali | |
| dc.contributor.advisor | Choi, Yejin | |
| dc.contributor.author | Park, Jae Sung | |
| dc.date.accessioned | 2026-04-20T15:27:03Z | |
| dc.date.issued | 2026-04-20 | |
| dc.date.submitted | 2026 | |
| dc.description | Thesis (Ph.D.)--University of Washington, 2026 | |
| dc.description.abstract | Current large multimodal models achieve impressive recognition capabilities that approach human-level performance. Their success is driven largely by massive web-scale pretraining, with continued improvements from scaling model and data size. However, this reliance on web data has not yet translated into fully reliable multimodal systems; these models remain prone to hallucinating content and to failing to correctly ground their reasoning in images and videos. This disconnect arises because web data often fails to capture the implicit, grounded reasoning inherent to human cognition, which is rarely explicitly verbalized online. The result is a data scarcity in which purely scaling the current pipeline may no longer yield proportional improvements. This thesis investigates methods to bridge this gap by grounding perception and reasoning through targeted data distillation from structured representations. First, to address the scarcity of grounded reasoning data, I introduce VisualComet, a large-scale dataset of human annotations designed to train models that predict dynamic events, past contexts, and human intents from images. Recognizing the scalability bottlenecks of manual annotation, I subsequently present Localized Symbolic Knowledge Distillation (LSKD), which leverages large language models to generate synthetic reasoning chains and uses a trained critic model to filter these outputs for quality and alignment with human judgment. Next, I investigate the reliability of the underlying perception systems by presenting a diagnostic framework based on automatic contrast sets for video-language models. By systematically manipulating entities and verbs in video descriptions, this work reveals that state-of-the-art models frequently ignore visual signals in favor of language priors. Finally, to directly enhance these perception capabilities, I introduce Synthetic Visual Genome, a method for scaling up structured scene graphs defined by objects and their relationships. By distilling this structural knowledge into the training process, I show that this approach significantly improves grounded relationship understanding and reasoning. | |
| dc.embargo.lift | 2028-04-09T15:27:03Z | |
| dc.embargo.terms | Restrict to UW for 2 years -- then make Open Access | |
| dc.format.mimetype | application/pdf | |
| dc.identifier.other | Park_washington_0250E_29200.pdf | |
| dc.identifier.uri | https://hdl.handle.net/1773/55471 | |
| dc.language.iso | en_US | |
| dc.rights | CC BY-ND | |
| dc.subject | Artificial intelligence | |
| dc.subject.other | Computer science and engineering | |
| dc.title | Grounding Perception and Reasoning in Multimodal Models | |
| dc.type | Thesis |
Files

Original bundle
- Park_washington_0250E_29200.pdf (21.02 MB, Adobe Portable Document Format)