A Unified Token and Parameter Compression Pipeline for High-Resolution Vision–Language Models


Abstract

High-resolution vision–language models achieve strong performance on fine-grained visual reasoning tasks, but their deployment remains costly due to large visual token counts and heavy language backbones. This work investigates how to build small, efficient multimodal models while preserving high-resolution reasoning ability. We propose a training-free unified compression pipeline that reduces inefficiency at both the token and parameter levels. At the token level, we introduce HiRED–Merge, which combines attention-guided token budgeting with neighbor-aware, norm-proportional token merging. The method merges only spatially adjacent tokens that survive attention-based selection, preserving local structure and reducing the information loss caused by aggressive token dropping. At the parameter level, we apply GLU-aware structured MLP pruning to the language backbone, removing coupled neuron pairs while maintaining dense computation and the original model structure. Pruning 20% of the backbone's MLP neurons reduces a 7B model to approximately 6B parameters. Experiments on ScienceQA, TextVQA, DocVQA, ChartQA, and MME show that our pipeline improves throughput, memory efficiency, and scalability while maintaining competitive accuracy. These results enable practical deployment of high-resolution vision–language models under limited computational resources.
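The two stages described above can be illustrated with a minimal NumPy sketch. This is not the thesis's implementation: the function names, the pairing interface for adjacent tokens, and the product-of-norms importance heuristic for ranking GLU neurons are all illustrative assumptions. It shows only the two core ideas the abstract states: merging adjacent surviving tokens with norm-proportional weights, and removing each MLP hidden neuron's coupled gate row, up row, and down column together so the pruned model stays dense.

```python
import numpy as np

def merge_adjacent_tokens(tokens, pairs):
    """Norm-proportional merging of spatially adjacent tokens (sketch).

    tokens: (n, d) array of visual tokens that survived attention-based
            selection; pairs: list of (i, j) indices of adjacent tokens
            to merge. Each pair is replaced by a norm-weighted average,
            so higher-norm tokens dominate the merged representation.
    """
    partner = {i: j for i, j in pairs}          # i absorbs j
    absorbed = {j for _, j in pairs}
    out = []
    for i in range(len(tokens)):
        if i in absorbed:
            continue                             # j was folded into its partner
        if i in partner:
            a, b = tokens[i], tokens[partner[i]]
            wa, wb = np.linalg.norm(a), np.linalg.norm(b)
            out.append((wa * a + wb * b) / (wa + wb + 1e-8))
        else:
            out.append(tokens[i])                # unpaired tokens pass through
    return np.stack(out)

def prune_glu_mlp(W_gate, W_up, W_down, ratio=0.2):
    """GLU-aware structured pruning (sketch). In a gated MLP, hidden
    neuron i couples row i of W_gate and W_up with column i of W_down;
    all three are removed together, keeping the computation dense.

    W_gate, W_up: (hidden, d_model); W_down: (d_model, hidden).
    Importance is a product-of-norms heuristic (an assumption here).
    """
    imp = (np.linalg.norm(W_gate, axis=1)
           * np.linalg.norm(W_up, axis=1)
           * np.linalg.norm(W_down, axis=0))
    n_keep = int(round(len(imp) * (1 - ratio)))
    keep = np.sort(np.argsort(imp)[-n_keep:])    # keep top neurons, in order
    return W_gate[keep], W_up[keep], W_down[:, keep]
```

Because pruning removes whole coupled triples rather than scattering zeros, the result is a smaller dense model that runs on standard kernels with no sparse-computation support, which is what makes the 7B-to-6B reduction deployable as-is.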

Description

Thesis (Master's)--University of Washington, 2026
