Reconstructing and Rendering People from Photos and Videos in the Wild

Weng, Chung-Yi

Files

Weng_washington_0250E_25361.pdf (48.34 MB)

Date

2023-08-14

Authors

Weng, Chung-Yi

Abstract

Reconstructing and producing photorealistic renderings of dynamic humans from RGB images has long been considered a holy grail in the fields of computer vision and graphics. Such a capability would open up a wide range of applications in areas such as virtual and augmented reality, teleconferencing, and the entertainment industry. Despite more than 25 years of research and development, the problem remains challenging, primarily because of the inherent 3D-to-2D ambiguity, highly dynamic motions, appearance variance, and non-rigid deformation. Moreover, the high cost of the technology has been a major barrier to widespread adoption, because reconstruction pipelines often rely on calibrated multi-camera systems that are typically found only in professional studios. In this thesis, I address the challenge of reconstructing and rendering high-quality dynamic humans using unstructured data in the wild, such as photos from the internet or YouTube videos. The goal is to make this expensive technology accessible to amateur artists and even the general public, democratizing its use beyond movie studios. To begin, I review the literature on this long-established problem, starting with the seminal work of Kanade et al. in 1997 and tracing the evolution of the technology through advances in image-based rendering, surface reconstruction, and, more recently, modern deep neural networks. I then present three novel approaches for tackling this problem, each designed for a different type of source material: monocular videos, personal photo collections, and single photographs. Through these approaches, my research enables a range of new applications. The first approach, Photo Wake-Up, creates 3D human animations viewable on AR devices such as HoloLens from a single image. The second, HumanNeRF, enables free-viewpoint rendering of a moving person from a YouTube video. Finally, I present PersonNeRF, an approach capable of reconstructing a person, such as tennis superstar Roger Federer, from a photo collection, enabling rendering with arbitrary combinations of viewpoint, appearance, and body pose. In the final section, I discuss the open problems that remain in this field, as well as how this technology may shape the future.