Towards Integrated Audio-Visual Learning: From Vision-to-Audio Generation to a Unified Audio-Visual Framework

dc.contributor.advisor: Shlizerman, Eli
dc.contributor.author: Su, Kun
dc.date.accessioned: 2024-04-26T23:20:19Z
dc.date.available: 2024-04-26T23:20:19Z
dc.date.issued: 2024-04-26
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: The interplay between audio and visual signals, rich in correlations across various scales, significantly shapes human perception and drives a consistent demand for audio-visual applications in fields such as video production, animation, and virtual reality. Historically, the creation and adaptation of audio content have been predominantly manual processes reliant on the expertise of Foley artists. Automated systems capable of assisting with such tasks at comparable proficiency are therefore an intriguing prospect. In recent years, deep learning-based methods have shown considerable promise in handling image, video, and text data. Nonetheless, integrating audio with visual inputs introduces distinct challenges stemming from the fundamental differences and complexities inherent to each modality. In particular, techniques for generating non-speech audio, such as music, object material impact sounds, natural sounds, and spatial sound effects, remain underexplored. Audio and visual signals can be represented in various ways, and their connections vary across scenarios. The research domain of learning to connect and relate audio and visual signals, termed audio-visual learning, has traditionally focused either on audio-visual representation learning or on generative modeling of one modality conditioned on the other. My research traverses audio-visual learning and connects audio-visual representation and generation from three distinct perspectives.

In the first category, I investigate vision-to-audio generation through an intermediate representation that serves as a bridge between the visual and audio domains. For instance, musical notes can translate piano keystrokes into their corresponding sounds, and the rhythm of movement can connect dance videos to their accompanying music. Intermediate representations can also be derived from pre-trained deep learning models, acting as semantic bridges in more general settings, such as generating background music for arbitrary videos, where the audio-visual relationship is largely subjective.

In the second category, my research explores learning implicit representations through the vision-to-audio generative process, seeking not only to achieve vision-to-audio conversion but also to construct meaningful representations along the way. Innovations here include unsupervised models that infer instrumental sounds from musicians' body movements, and diffusion models that synthesize impact sounds from visual cues of physical object interactions; this exploration reveals the association between visual inputs and various timbre characteristics. Additionally, by mapping indoor scene geometry to room impulse responses at discrete locations, we can infer a continuous acoustic field that enables the rendering of high-fidelity audio at arbitrary emitter-listener locations, enhancing the realism and immersion of auditory experiences.

Finally, in the third part of my work, I propose a unified audio-visual framework that seamlessly merges representation learning and generative modeling. This general approach enables the efficient generation of high-fidelity audio from visual stimuli while constructing robust semantic audio-visual representations. Its applications are broad, ranging from audio-visual retrieval and event classification to audio-only classification, paving the way for more immersive and contextually rich audio-visual experiences.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Su_washington_0250E_26618.pdf
dc.identifier.uri: http://hdl.handle.net/1773/51356
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: audio-visual learning
dc.subject: audio-visual representation
dc.subject: video-to-audio generation
dc.subject: Electrical engineering
dc.subject: Artificial intelligence
dc.subject: Computer science
dc.subject.other: Electrical and computer engineering
dc.title: Towards Integrated Audio-Visual Learning: From Vision-to-Audio Generation to a Unified Audio-Visual Framework
dc.type: Thesis
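
The abstract above notes that mapping scene geometry to room impulse responses (RIRs) at discrete locations allows a continuous acoustic field to be inferred and queried at arbitrary emitter-listener positions. Below is a minimal, hypothetical PyTorch sketch of that idea, not the architecture from the thesis: a plain MLP regresses an RIR waveform from an emitter-listener position pair (conditioning on scene geometry is omitted for brevity), and rendering convolves a dry source signal with the predicted RIR. The names `AcousticField` and `render`, and all dimensions, are illustrative assumptions.

```python
# Hypothetical sketch of a continuous acoustic field (not the thesis's model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticField(nn.Module):
    """MLP mapping an (emitter, listener) 3-D position pair to an RIR waveform."""
    def __init__(self, rir_len: int = 4096, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, rir_len),
        )

    def forward(self, emitter: torch.Tensor, listener: torch.Tensor) -> torch.Tensor:
        # Concatenate the two 3-D positions and regress the RIR samples.
        return self.net(torch.cat([emitter, listener], dim=-1))

def render(dry: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Render reverberant audio as the full convolution of a dry signal with the RIR."""
    # conv1d computes cross-correlation, so flip the kernel for true convolution.
    return F.conv1d(
        dry.view(1, 1, -1),
        rir.flip(-1).view(1, 1, -1),
        padding=rir.shape[-1] - 1,
    ).view(-1)

field = AcousticField()
rir = field(torch.rand(3), torch.rand(3))   # query unseen continuous positions
wet = render(torch.randn(16000), rir)       # spatialized, reverberant output
```

In practice, a field like this would be fit to RIRs measured or simulated at discrete emitter-listener pairs and then queried at unseen positions; the convolution step is what turns the predicted RIR into reverberant, spatialized audio.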

Files

Original bundle

Name: Su_washington_0250E_26618.pdf
Size: 20.4 MB
Format: Adobe Portable Document Format