Photorealistic Virtual Try-on with Generative Models

dc.contributor.advisor: Kemelmacher-Shlizerman, Ira
dc.contributor.author: Zhu, Luyang
dc.date.accessioned: 2024-09-09T23:06:37Z
dc.date.available: 2024-09-09T23:06:37Z
dc.date.issued: 2024-09-09
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Virtual try-on (VTO) is revolutionizing the online apparel shopping experience, enabling customers to see how a particular fashion item would look on them. Despite significant progress, current VTO methods still struggle to warp garments accurately under large pose gaps and heavy occlusion, and to preserve the body shape and identity of the person under the new garment. Additionally, most research focuses on upper-body VTO, whereas full-body VTO that allows garments to be mixed and matched is more desirable in real-world scenarios. In my thesis, I address the above challenges by developing generative models tailored for the VTO task. First, I propose TryOnDiffusion, the first method capable of try-on synthesis at 1024x1024 resolution across a wide range of body poses and shapes while preserving garment details. Previous methods either focus on garment detail preservation without effective pose and shape variation, or allow try-on with the desired shape and pose but lose garment details. I show that the underlying cause of this trade-off is the widely used two-stage pipeline consisting of an explicit warping model followed by a blending GAN. To solve this issue, I propose a diffusion-based architecture that unifies two UNets (referred to as Parallel-UNet), warping the garment implicitly via cross attention and performing warping and blending in a single network pass. Next, I present M&M VTO, which extends TryOnDiffusion from upper-body to full-body VTO, allowing users to mix and match multiple garments. To preserve the intricate garment details required for full-body VTO, I propose a single-stage diffusion model in pixel space that is trained progressively. To address the identity-loss problem common to current VTO methods, I design a novel architecture named VTO UNet Diffusion Transformer (VTO-UDiT) that disentangles denoising from person-specific features, enabling a highly effective finetuning strategy. Furthermore, M&M VTO supports garment layout editing via text inputs by finetuning multi-modal foundation models. Finally, I show how generative models can be trained on synthetic datasets for 3D clothed human reconstruction, an important component of VTO in the 3D world. I propose a method for reconstructing NBA players that takes as input a single photo of a clothed player in any basketball pose and outputs a high-resolution mesh and 3D pose for that player. Key to my approach are a deep neural skinning method for creating poseable, skinned models of NBA players and a large database of meshes derived from a basketball video game. Although trained only on synthetic data, the proposed pipeline generalizes well to real-world images, even under heavy occlusion.
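
The abstract's central technical claim for TryOnDiffusion is that cross attention between two parallel UNet streams can replace an explicit flow-based garment warp, so warping and blending happen in a single denoising pass. Below is a minimal PyTorch sketch of that idea; the module names, shapes, and the toy single-block architecture are illustrative assumptions of mine, not the thesis implementation.

```python
# Minimal sketch of the Parallel-UNet idea: two streams process the noisy
# person image and the garment image, and cross attention lets person
# features query garment features (implicit warping). All names and shapes
# here are assumptions for illustration, not the thesis code.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Person tokens attend to garment tokens (implicit warping)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, person_feat, garment_feat):
        # person_feat, garment_feat: (B, N_tokens, dim) flattened feature maps
        q = self.norm_q(person_feat)
        kv = self.norm_kv(garment_feat)
        out, _ = self.attn(q, kv, kv)  # each person token gathers garment detail
        return person_feat + out       # residual keeps the person stream intact

class ParallelUNetSketch(nn.Module):
    """Toy two-stream block: a garment encoder and a person denoiser,
    fused by cross attention so warp + blend happen in one network pass."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.person_enc = nn.Conv2d(3, dim, 3, padding=1)
        self.garment_enc = nn.Conv2d(3, dim, 3, padding=1)
        self.fuse = CrossAttentionBlock(dim)
        self.out = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, noisy_person, garment):
        b = noisy_person.shape[0]
        p = self.person_enc(noisy_person)       # (B, C, H, W)
        g = self.garment_enc(garment)
        h, w = p.shape[-2:]
        p_tok = p.flatten(2).transpose(1, 2)    # (B, H*W, C)
        g_tok = g.flatten(2).transpose(1, 2)
        fused = self.fuse(p_tok, g_tok)
        fused = fused.transpose(1, 2).reshape(b, -1, h, w)
        return self.out(fused)                  # predicted denoised person

# Usage: one forward pass on random tensors standing in for real inputs.
model = ParallelUNetSketch()
pred = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(pred.shape)  # torch.Size([1, 3, 64, 64])
```

The point the sketch isolates is the design choice described in the abstract: each person-image location selects which garment locations to read from via attention, so "warping" becomes learned correspondence inside the denoiser rather than a separate geometric stage.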
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Zhu_washington_0250E_26781.pdf
dc.identifier.uri: https://hdl.handle.net/1773/51884
dc.language.iso: en_US
dc.rights: CC BY-NC
dc.subject: computer graphics
dc.subject: computer vision
dc.subject: deep learning
dc.subject: diffusion models
dc.subject: generative models
dc.subject: virtual try-on
dc.subject: computer science
dc.subject.other: Computer science and engineering
dc.title: Photorealistic Virtual Try-on with Generative Models
dc.type: Thesis

Files

Original bundle

Name: Zhu_washington_0250E_26781.pdf
Size: 91.69 MB
Format: Adobe Portable Document Format