Towards Multimodal Interactive Intelligence

Hu, Yushi
My University
2025
Interactive; LLM; Multimodal
Electrical engineering; Electrical and computer engineering
Ostendorf, Mari; Smith, Noah A
2025-08-01
en-US
Thesis
Hu_washington_0250E_28551.pdf
https://hdl.handle.net/1773/53559
application/pdf
CC BY
Thesis (Ph.D.)--University of Washington, 2025

Multimodal generative AI models have made great progress, but they still have limitations. In particular, they struggle with multimodal data and multi-turn interactions. One reason is the lack of training data for these problems: the research community has no multimodal data on the scale of single-modality data, and no lengthy multi-turn interaction data. In addition, there are few well-defined tasks for these problems, which prevents researchers from understanding models' performance and leaves them without meaningful optimization goals. In this thesis, we work toward building better multimodal intelligence. We focus on three types of abilities: multimodal understanding, multimodal generation, and grounded multi-turn interactions. For each, we explore the limitations of current models and propose new tasks and evaluation methods for capabilities that remain beyond the reach of existing models. Having identified these weaknesses, we introduce novel methods and evaluations for multimodal interactive intelligence to address them. This approach enhances existing AI models through AI-AI and human-AI interactions, enabling collaboration across modalities. For multimodal understanding models, we propose BLINK, a benchmark that focuses on core visual perception abilities not covered by other evaluations. Humans can solve most BLINK tasks in the "blink" of an eye, but the tasks pose significant challenges for the latest multimodal language models (LMs).
To address this weakness, we propose Visual Sketchpad, a framework that allows models to think step by step across modalities. The framework lets LMs engage in more diverse interactions, for example with vision expert models. Such interactions compensate for what existing models miss and greatly enhance their multimodal understanding abilities. For image generation models, we tackle the long-standing problem that these models do not effectively follow text instructions. We propose TIFA, which uses multimodal LMs to evaluate generated images, providing an efficient evaluation metric that aligns well with human judgment. Moreover, we show that TIFA can serve as an effective training signal for improving text-image alignment in image generation. Finally, we focus on grounded dialogue systems. We provide a framework that allows AI agents to be evaluated with either simulated or real users, using end-to-end dialogue-level objectives. To demonstrate the framework, we introduce NavigationBench, a novel task that simulates dialogues between a user and a virtual navigation assistant in a car. It also features a simulated user trained with the latest LM technologies, allowing researchers to simulate multi-turn dialogues for automatic dialogue-level comparisons of AI assistants. Using this framework, we study the performance and verbosity of different agent LMs.