Grounding Semantics and Instructions with Vision

dc.contributor.advisor: Choi, Yejin
dc.contributor.author: Li, Xiujun
dc.date.accessioned: 2024-09-09T23:06:24Z
dc.date.issued: 2024-09-09
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Learning cross-modal representations is fundamental to a wide range of vision-and-language (V+L) tasks, such as visual question answering, image-text retrieval, and image captioning. Previous studies on vision-language pretraining (VLP) have shown that it can effectively learn generic representations from massive image-text pairs. However, the existing approaches have several issues. First, these methods simply concatenate image region features and text features as input and resort to the self-attention mechanism to learn semantic alignments between image regions and text; how to effectively fuse the two modalities and learn the alignment between vision and text remains an open question. Second, these methods rely heavily on pretrained object region features. This thesis aims to ground semantics with vision from two perspectives: 1) Oscar employs object tags to help align image objects with the text sequence; 2) VinVL pretrains a large, semantically rich object detection backbone with a unified object-attribute taxonomy on public corpora, and shows that a rich, semantic visual representation matters in vision-language pretraining. Inspired by instruction tuning in Large Language Models (LLMs), the latest Multimodal Large Language Models (MLLMs) show superb visual instruction-following capability on multimodal tasks; however, existing MLLMs rely heavily on the pretrained LLMs and inherit their language priors for instruction understanding. The third part of this thesis aims to ground instructions with vision. It first proposes a new setting for probing the instruction-following capabilities of MLLMs, then introduces a v-MLLM model for robust visual instruction following: Vim (Visual Modality Instruction) challenges MLLMs by embedding instructions into the visual pixel space, which demands strong visual interpretation skills for instruction following. Vim probes existing MLLMs from a vision perspective and reveals a significant disparity between the original Text Modality Instruction (TEM) and Vim settings for existing open-source MLLMs on eight benchmarks; furthermore, v-MLLM provides a solution for robust visual instruction following, filling this gap for the open-source MLLMs.
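To make the Vim setting described in the abstract concrete, the following is a minimal, hypothetical sketch (Python with Pillow) of what embedding an instruction into the visual pixel space can look like: the instruction is rendered onto a strip of the image canvas, so the model must read it from pixels rather than receive it as text. The function name, strip layout, and parameters are illustrative assumptions, not the thesis implementation.

from PIL import Image, ImageDraw, ImageFont

def embed_instruction_in_pixels(image: Image.Image, instruction: str,
                                strip_height: int = 60) -> Image.Image:
    # Render the instruction onto a white strip; the instruction now
    # exists only in pixel space, not in the text input.
    strip = Image.new("RGB", (image.width, strip_height), "white")
    draw = ImageDraw.Draw(strip)
    draw.text((10, strip_height // 3), instruction,
              fill="black", font=ImageFont.load_default())

    # Stack the instruction strip on top of the original image.
    combined = Image.new("RGB", (image.width, image.height + strip_height))
    combined.paste(strip, (0, 0))
    combined.paste(image, (0, strip_height))
    return combined

# Usage (hypothetical): the MLLM receives only the combined image plus a
# generic text prompt, so following the instruction requires reading it
# from the pixels.
# vim_input = embed_instruction_in_pixels(Image.open("photo.jpg"),
#                                         "Describe this image.")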
dc.embargo.lift: 2026-08-30T23:06:24Z
dc.embargo.terms: Restrict to UW for 2 years -- then make Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Li_washington_0250E_26666.pdf
dc.identifier.uri: https://hdl.handle.net/1773/51869
dc.language.iso: en_US
dc.rights: none
dc.subject: Multimodality
dc.subject: Pretraining
dc.subject: Vision and Language
dc.subject: Artificial intelligence
dc.subject.other: Computer science and engineering
dc.title: Grounding Semantics and Instructions with Vision
dc.type: Thesis