Grounding Perception and Reasoning in Multimodal Models

dc.contributor.advisor: Farhadi, Ali
dc.contributor.advisor: Choi, Yejin
dc.contributor.author: Park, Jae Sung
dc.date.accessioned: 2026-04-20T15:27:03Z
dc.date.issued: 2026-04-20
dc.date.submitted: 2026
dc.description: Thesis (Ph.D.)--University of Washington, 2026
dc.description.abstract: Current large multimodal models achieve impressive recognition capabilities that approach human-level performance. Their success is driven largely by massive web-scale pretraining, with continued improvements from scaling model and data size. However, this reliance on web data has not yet translated into fully reliable multimodal systems: the models remain prone to hallucinating content and to failing to ground their reasoning in images and videos. This disconnect arises because web data rarely captures the implicit, grounded reasoning inherent to human cognition, which is seldom verbalized explicitly online. The result is a data-scarcity problem: scaling up the current pipeline alone may no longer yield proportional improvements. This thesis investigates methods to bridge this gap by grounding perception and reasoning through targeted data distillation from structured representations. First, to address the scarcity of grounded reasoning data, I introduce VisualComet, a large-scale dataset of human annotations designed to train models that predict dynamic events, past contexts, and human intents from images. Recognizing the scalability bottlenecks of manual annotation, I then present Localized Symbolic Knowledge Distillation (LSKD), which leverages large language models to generate synthetic reasoning chains and uses a trained critic model to filter these outputs for quality and alignment with human judgment. Next, I investigate the reliability of the underlying perception systems through a diagnostic framework built on automatic contrast sets for video-language models. By systematically manipulating entities and verbs in video descriptions, this work reveals that state-of-the-art models frequently ignore visual signals in favor of language priors. Finally, to directly enhance these perception capabilities, I introduce Synthetic Visual Genome, a method for scaling up structured scene graphs defined by objects and their relationships. By distilling this structural knowledge into the training process, I show that this approach significantly improves grounded relationship understanding and reasoning.
dc.embargo.lift: 2028-04-09T15:27:03Z
dc.embargo.terms: Restrict to UW for 2 years -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Park_washington_0250E_29200.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55471
dc.language.iso: en_US
dc.rights: CC BY-ND
dc.subject: Artificial intelligence
dc.subject.other: Computer science and engineering
dc.title: Grounding Perception and Reasoning in Multimodal Models
dc.type: Thesis

Files

Original bundle

Park_washington_0250E_29200.pdf (21.02 MB, Adobe Portable Document Format)