Generative Keyframing

dc.contributor.advisor: Seitz, Steven M.
dc.contributor.advisor: Curless, Brian
dc.contributor.author: Wang, Xiaojuan
dc.date.accessioned: 2026-02-05T19:34:18Z
dc.date.available: 2026-02-05T19:34:18Z
dc.date.issued: 2026-02-05
dc.date.submitted: 2025
dc.description: Thesis (Ph.D.)--University of Washington, 2025
dc.description.abstract: Keyframing is a fundamental element of animation creation and video editing. It involves defining specific frames, i.e., keyframes, that mark important moments of change and guide how the intermediate frames are filled in or interpolated. In early hand-drawn animation, a keyframe was a drawing created by a lead animator, with assistants manually drawing the in-between frames. With the advent of digital animation and video editing software, a keyframe became a set of parameters that define the state of the rendered character or object at a specific time, with in-between transitions produced by interpolating these parameters. However, such parametric approaches rely heavily on manually designed controls and artist-crafted heuristics, making it difficult for them to capture complex, nuanced, and realistic motions. Furthermore, they do not naturally generalize to real image and video domains. The rapid progress of visual generative models, which are trained on large collections of visual data and learn rich appearance and motion patterns, has made it possible to generate high-fidelity imagery and realistic motion. Building on these advances, this thesis investigates generative keyframing, a data-driven, non-parametric, image-based approach to the keyframing process. To this end, I present a series of works that collectively develop and explore this idea. I begin with the most basic aspect: using generative models to synthesize transitions directly from images, and even to fully generate in-between motions. I first present a GAN-based technique for smoothing jump cuts in talking-head videos, synthesizing seamless transitions between the cuts even in challenging cases involving large head movement. I then introduce a method for generating in-between videos with dynamic motion between more distant keyframes by adapting a pretrained large-scale image-to-video diffusion model with minimal fine-tuning effort.
Beyond automatically generating transitions between keyframes, I further explore multi-scale keyframing for achieving very deep zoom. Specifically, I introduce a multi-scale joint sampling diffusion approach for generating consistent images (keyframes) across different spatial scales while adhering to their respective input text prompts. This enables deep semantic zoom, and a continuous zoom video can be rendered from these images. When working with multiple keyframes, one important question is how they should be ordered in the final video. I address this in the context of dance video generation---specifically, music-synchronized, choreography-aware animal dance videos---where unordered keyframes representing distinct animal poses are arranged via graph optimization to satisfy a specified choreography pattern of beats that defines the long-range structure of a dance. Finally, I conclude with discussions and directions for future work.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Wang_washington_0250E_29032.pdf
dc.identifier.uri: https://hdl.handle.net/1773/55188
dc.language.iso: en_US
dc.rights: CC BY-NC-SA
dc.subject: Generative AI
dc.subject: Keyframing
dc.subject: Vision and Graphics
dc.subject: Computer science
dc.subject.other: Computer science and engineering
dc.title: Generative Keyframing
dc.type: Thesis

Files

Original bundle

Name: Wang_washington_0250E_29032.pdf
Size: 99.22 MB
Format: Adobe Portable Document Format