Photorealistic Virtual Try-on with Generative Models

dc.contributor.advisor: Kemelmacher-Shlizerman, Ira
dc.contributor.author: Zhu, Luyang
dc.date.accessioned: 2024-09-09T23:06:37Z
dc.date.available: 2024-09-09T23:06:37Z
dc.date.issued: 2024-09-09
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: Virtual try-on (VTO) is revolutionizing the online apparel shopping experience, enabling customers to see how a particular fashion item would look on them. Despite significant progress, current VTO methods still struggle to warp garments accurately under large pose gaps and heavy occlusion, and to preserve the body shape and identity of the person under the new garment. Additionally, most research focuses on upper-body VTO, whereas full-body VTO that allows garments to be mixed and matched is more desirable in real-world scenarios. In my thesis, I address the above challenges by developing generative models tailored for the VTO task. First, I propose TryOnDiffusion, the first method capable of try-on synthesis at 1024x1024 resolution across a wide range of body poses and shapes while preserving garment details. Previous methods either focus on garment detail preservation without effective pose and shape variation, or allow try-on with the desired shape and pose but lose garment details. I show that the underlying cause of this trade-off is the widely used two-stage pipeline consisting of an explicit warping model followed by a blending GAN. To solve this issue, I propose a diffusion-based architecture that unifies two UNets (referred to as Parallel-UNet), warping the garment implicitly via cross attention and performing warping and blending in a single network pass. Next, I present M&M VTO, which extends TryOnDiffusion from upper-body to full-body VTO, allowing users to mix and match multiple garments. To preserve the intricate garment details required for full-body VTO, I propose a single-stage diffusion model in pixel space that is trained progressively. To address the identity-loss problem common to current VTO methods, I design a novel architecture named VTO UNet Diffusion Transformer (VTO-UDiT) that disentangles denoising from person-specific features, enabling a highly effective finetuning strategy. Furthermore, M&M VTO supports garment layout editing via text inputs by finetuning multi-modal foundation models. Finally, I show how generative models can be trained on synthetic datasets for 3D clothed human reconstruction, an important component of VTO in the 3D world. I propose a method for reconstructing NBA players that takes as input a single photo of a clothed player in any basketball pose and outputs a high-resolution mesh and 3D pose for that player. Key to my approach are a deep neural skinning method for creating poseable, skinned models of NBA players and a large database of meshes derived from a basketball video game. Although trained only on synthetic data, the proposed pipeline generalizes well to real-world images, even under heavy occlusion.
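
The abstract's central technical claim for TryOnDiffusion is that cross attention between two parallel UNet streams can replace an explicit flow-based garment warp, so warping and blending happen in a single denoising pass. Below is a minimal PyTorch sketch of that idea; the module names, shapes, and the toy single-block architecture are illustrative assumptions of mine, not the thesis implementation.

```python
# Minimal sketch of the Parallel-UNet idea: two streams process the noisy
# person image and the garment image, and cross attention lets person
# features query garment features (implicit warping). All names and shapes
# here are assumptions for illustration, not the thesis code.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Person tokens attend to garment tokens (implicit warping)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, person_feat, garment_feat):
        # person_feat, garment_feat: (B, N_tokens, dim) flattened feature maps
        q = self.norm_q(person_feat)
        kv = self.norm_kv(garment_feat)
        out, _ = self.attn(q, kv, kv)  # each person token gathers garment detail
        return person_feat + out       # residual keeps the person stream intact

class ParallelUNetSketch(nn.Module):
    """Toy two-stream block: a garment encoder and a person denoiser,
    fused by cross attention so warp + blend happen in one network pass."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.person_enc = nn.Conv2d(3, dim, 3, padding=1)
        self.garment_enc = nn.Conv2d(3, dim, 3, padding=1)
        self.fuse = CrossAttentionBlock(dim)
        self.out = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, noisy_person, garment):
        b = noisy_person.shape[0]
        p = self.person_enc(noisy_person)       # (B, C, H, W)
        g = self.garment_enc(garment)
        h, w = p.shape[-2:]
        p_tok = p.flatten(2).transpose(1, 2)    # (B, H*W, C)
        g_tok = g.flatten(2).transpose(1, 2)
        fused = self.fuse(p_tok, g_tok)
        fused = fused.transpose(1, 2).reshape(b, -1, h, w)
        return self.out(fused)                  # predicted denoised person

# Usage: one forward pass on random tensors standing in for real inputs.
model = ParallelUNetSketch()
pred = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 64, 64))
print(pred.shape)  # torch.Size([1, 3, 64, 64])
```

The point the sketch isolates is the design choice described in the abstract: each person-image location selects which garment locations to read from via attention, so "warping" becomes learned correspondence inside the denoiser rather than a separate geometric stage.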
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Zhu_washington_0250E_26781.pdf
dc.identifier.uri: https://hdl.handle.net/1773/51884
dc.language.iso: en_US
dc.rights: CC BY-NC
dc.subject: computer graphics
dc.subject: computer vision
dc.subject: deep learning
dc.subject: diffusion models
dc.subject: generative models
dc.subject: virtual try-on
dc.subject: computer science
dc.subject.other: Computer science and engineering
dc.title: Photorealistic Virtual Try-on with Generative Models
dc.type: Thesis

Files

Original bundle

Name: Zhu_washington_0250E_26781.pdf
Size: 91.69 MB
Format: Adobe Portable Document Format