Generative Keyframing

Abstract

Keyframing is a fundamental element of animation creation and video editing. It involves defining specific frames, i.e., keyframes, that mark important moments of change and guide how the intermediate frames are filled in or interpolated. In early hand-drawn animation, a keyframe was a drawing created by animators, with assistants manually drawing the in-between frames. With the advent of digital animation and video editing software, a keyframe became a set of parameters that defines the state of the rendered character or object at a specific time, with in-between transitions produced by interpolating these parameters. However, such parametric approaches rely heavily on manually designed controls and artist-crafted heuristics, making it difficult for them to capture complex, nuanced, and realistic motion. Furthermore, they do not naturally generalize to real image and video domains. The rapid progress of visual generative models, which are trained on large collections of visual data and learn rich appearance and motion patterns, has made it possible to generate high-fidelity imagery and realistic motion. Building on these advances, this thesis investigates generative keyframing, a data-driven, non-parametric, image-based approach to the keyframing process. To this end, I present a series of works that collectively develop and explore this idea. I begin with the most basic aspect: using generative models to synthesize transitions directly from images, and even to fully generate in-between motions. I first present a GAN-based technique for smoothing jump cuts in talking-head videos, synthesizing seamless transitions between the cuts even in challenging cases involving large head movement. I then introduce a method for generating in-between videos with dynamic motion between more distant keyframes by adapting a pretrained large-scale image-to-video diffusion model with minimal fine-tuning effort. Beyond automatically generating transitions between keyframes, I further explore multi-scale keyframing for achieving very deep zooms. Specifically, I introduce a multi-scale joint sampling diffusion approach for generating consistent images (keyframes) across different spatial scales while adhering to their respective input text prompts. This enables deep semantic zoom, and a continuous zoom video can be rendered from these images. When working with multiple keyframes, one important question is how they should be ordered in the final video. I address this in the context of dance video generation, specifically music-synchronized, choreography-aware animal dance videos, where unordered keyframes representing distinct animal poses are arranged via graph optimization to satisfy a specified choreography pattern of beats that defines the long-range structure of a dance. Finally, I conclude with a discussion and directions for future work.

Description

Thesis (Ph.D.)--University of Washington, 2025
