Reconstructing Visual Appearance and Process by Repurposing Pretrained Diffusion Models

Abstract

Visual generation problems often arise in regimes where the available observations are incomplete or indirect. Inputs may capture only fragments of visual appearance, or omit the intermediate processes that produced the final result, yet models are expected to synthesize outputs that are visually complete, coherent, and plausible. This setting places strong demands on generative models, requiring them to infer missing information, integrate fragmented evidence, and reconstruct underlying structure or dynamics consistent with the observed outcome.

This thesis investigates how pretrained diffusion models can be repurposed to address visual generation tasks arising under partial or indirect observation. I study a set of representative applications in which this tension manifests in different forms, spanning both appearance reconstruction and process reconstruction. These include synthesizing complete human appearance and motion from a small number of casually captured selfies, selectively editing portraits while preserving fine-grained identity features, and reconstructing plausible painting processes from a single finished artwork. Across these settings, the desired outputs are visually complete, while the inputs provide only sparse, incomplete, or indirect constraints.

Although recent diffusion models learn powerful visual priors through large-scale pretraining, they are primarily designed for generic generation tasks such as text-to-image or text-to-video synthesis, and are not directly suited to these partial-observation scenarios. Mismatches in input structure, supervision, and data availability make naive fine-tuning or prompt-based adaptation ineffective. This thesis addresses the central question: how can pretrained diffusion models be repurposed to support novel visual generation tasks under partial observation?

I explore repurposing strategies across different supervision regimes. In settings where paired training data is unavailable, I develop methods that exploit weakly aligned observations or synthetically constructed supervision to enable appearance reconstruction from fragmented inputs. In settings with limited but well-aligned paired data, I show that composing multiple pretrained models into cascaded pipelines can amplify scarce supervision and enable complex process reconstruction, such as inferring plausible sequences of painting actions consistent with a final artwork.

Beyond pipeline-level design, the effectiveness of reconstructing visual appearance and process also depends critically on the quality of visual priors learned during diffusion pretraining. Stronger priors lead to more robust adaptation and higher generation quality across tasks. This observation motivates the final part of the thesis, which explores repurposing at a more fundamental level. I propose AlignTok, a framework that repurposes pretrained visual encoders as tokenizers for diffusion models, enabling diffusion to operate in a semantically rich latent space. This design simplifies training and improves efficiency, scalability, and generation quality, resulting in stronger visual priors that better support downstream reconstruction tasks.

Description

Thesis (Ph.D.)--University of Washington, 2026
