Grounding Semantics and Instructions with Vision

dc.contributor.advisor: Choi, Yejin
dc.contributor.author: Li, Xiujun
dc.date.accessioned: 2024-09-09T23:06:24Z
dc.date.issued: 2024-09-09
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Learning cross-modal representations is fundamental to a wide range of vision-and-language (V+L) tasks, such as visual question answering, image-text retrieval, and image captioning. Previous studies on vision-language pretraining (VLP) have shown that it can effectively learn generic representations from massive image-text pairs. However, the existing approaches have several issues. First, these methods simply concatenate image region features and text features as input and resort to the self-attention mechanism to learn semantic alignments between image regions and text; how to effectively fuse the two modalities and learn the alignment between vision and text remains an open question. Second, these methods rely heavily on pretrained object region features. This thesis aims to ground semantics with vision from two perspectives: 1) Oscar employs object tags to help align image objects with the text sequence; 2) VinVL pretrains a large, semantically rich object detection backbone with a unified object-attribute taxonomy on public corpora, and shows that a rich, semantic visual representation matters in vision-language pretraining. Inspired by instruction tuning in Large Language Models (LLMs), the latest Multimodal Large Language Models (MLLMs) show superb visual instruction-following capability on multimodal tasks; however, existing MLLMs rely heavily on the pretrained LLMs and inherit their language priors for instruction understanding. The third part of this thesis aims to ground instructions with vision. It first proposes a new setting for probing the instruction-following capabilities of MLLMs, then introduces a v-MLLM model for robust visual instruction following: Vim (Visual Modality Instruction) challenges MLLMs by embedding instructions into the visual pixel space, which demands strong visual interpretation skills for instruction following. Vim probes existing MLLMs from a vision perspective and reveals a significant disparity between the original Text Modality Instruction (TEM) and Vim settings for existing open-source MLLMs on eight benchmarks; furthermore, v-MLLM provides a solution for robust visual instruction following, filling this gap for the open-source MLLMs.
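To make the Vim setting described in the abstract concrete, the following is a minimal, hypothetical sketch (Python with Pillow) of what embedding an instruction into the visual pixel space can look like: the instruction is rendered onto a strip of the image canvas, so the model must read it from pixels rather than receive it as text. The function name, strip layout, and parameters are illustrative assumptions, not the thesis implementation.

from PIL import Image, ImageDraw, ImageFont

def embed_instruction_in_pixels(image: Image.Image, instruction: str,
                                strip_height: int = 60) -> Image.Image:
    # Render the instruction onto a white strip; the instruction now
    # exists only in pixel space, not in the text input.
    strip = Image.new("RGB", (image.width, strip_height), "white")
    draw = ImageDraw.Draw(strip)
    draw.text((10, strip_height // 3), instruction,
              fill="black", font=ImageFont.load_default())

    # Stack the instruction strip on top of the original image.
    combined = Image.new("RGB", (image.width, image.height + strip_height))
    combined.paste(strip, (0, 0))
    combined.paste(image, (0, strip_height))
    return combined

# Usage (hypothetical): the MLLM receives only the combined image plus a
# generic text prompt, so following the instruction requires reading it
# from the pixels.
# vim_input = embed_instruction_in_pixels(Image.open("photo.jpg"),
#                                         "Describe this image.")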
dc.embargo.lift: 2026-08-30T23:06:24Z
dc.embargo.terms: Restrict to UW for 2 years -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Li_washington_0250E_26666.pdf
dc.identifier.uri: https://hdl.handle.net/1773/51869
dc.language.iso: en_US
dc.rights: none
dc.subject: Multimodality
dc.subject: Pretraining
dc.subject: Vision and Language
dc.subject: Artificial intelligence
dc.subject.other: Computer science and engineering
dc.title: Grounding Semantics and Instructions with Vision
dc.type: Thesis