Towards Integrated Audio-Visual Learning: From Vision-to-Audio Generation to a Unified Audio-Visual Framework

dc.contributor.advisor: Shlizerman, Eli
dc.contributor.author: Su, Kun
dc.date.accessioned: 2024-04-26T23:20:19Z
dc.date.available: 2024-04-26T23:20:19Z
dc.date.issued: 2024-04-26
dc.date.submitted: 2024
dc.description: Thesis (Ph.D.)--University of Washington, 2024
dc.description.abstract: The interplay between audio and visual signals, rich in correlations across various scales, significantly shapes human perception and drives a consistent demand for audio-visual applications in fields such as video production, animation, and virtual reality. Historically, the creation and adaptation of audio content have been predominantly manual processes reliant on the expertise of Foley artists. Automated systems capable of assisting with such tasks at comparable proficiency are therefore an intriguing prospect. In recent years, deep learning-based methods have shown considerable promise in handling image, video, and text data. Nonetheless, integrating audio with visual inputs introduces distinct challenges stemming from the fundamental differences and complexities inherent to each modality. In particular, techniques for generating non-speech audio, such as music, object material impact sounds, natural sounds, and spatial sound effects, remain underexplored. Audio and visual signals can be represented in various ways, and their connections vary across scenarios. The research domain of learning to connect and relate audio and visual signals, termed audio-visual learning, has traditionally focused either on audio-visual representation learning or on generative modeling of one modality conditioned on the other. My research traverses audio-visual learning and connects audio-visual representation and generation from three distinct perspectives.

In the first category, I investigate vision-to-audio generation through an intermediate representation that serves as a bridge between the visual and audio domains. For instance, musical notes can translate piano keystrokes into their corresponding sounds, and the rhythm of movement can connect dance videos to their accompanying music. Intermediate representations can also be derived from pre-trained deep learning models, acting as semantic bridges in more general settings, such as generating background music for arbitrary videos, where the audio-visual relationship is largely subjective.

In the second category, my research explores learning implicit representations through the vision-to-audio generative process, seeking not only to achieve vision-to-audio conversion but also to construct meaningful representations along the way. Innovations here include unsupervised models that infer instrumental sounds from musicians' body movements, and diffusion models that synthesize impact sounds from visual cues of physical object interactions; this exploration reveals the association between visual inputs and various timbre characteristics. Additionally, by mapping indoor scene geometry to room impulse responses at discrete locations, we can infer a continuous acoustic field that enables the rendering of high-fidelity audio at arbitrary emitter-listener locations, enhancing the realism and immersion of auditory experiences.

Finally, in the third part of my work, I propose a unified audio-visual framework that seamlessly merges representation learning and generative modeling. This general approach enables the efficient generation of high-fidelity audio from visual stimuli while constructing robust semantic audio-visual representations. Its applications are broad, ranging from audio-visual retrieval and event classification to audio-only classification, paving the way for more immersive and contextually rich audio-visual experiences.
dc.embargo.terms: Open Access
dc.format.mimetype: application/pdf
dc.identifier.other: Su_washington_0250E_26618.pdf
dc.identifier.uri: http://hdl.handle.net/1773/51356
dc.language.iso: en_US
dc.rights: CC BY
dc.subject: audio-visual learning
dc.subject: audio-visual representation
dc.subject: video-to-audio generation
dc.subject: Electrical engineering
dc.subject: Artificial intelligence
dc.subject: Computer science
dc.subject.other: Electrical and computer engineering
dc.title: Towards Integrated Audio-Visual Learning: From Vision-to-Audio Generation to a Unified Audio-Visual Framework
dc.type: Thesis
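
The abstract above notes that mapping scene geometry to room impulse responses (RIRs) at discrete locations allows a continuous acoustic field to be inferred and queried at arbitrary emitter-listener positions. Below is a minimal, hypothetical PyTorch sketch of that idea, not the architecture from the thesis: a plain MLP regresses an RIR waveform from an emitter-listener position pair (conditioning on scene geometry is omitted for brevity), and rendering convolves a dry source signal with the predicted RIR. The names `AcousticField` and `render`, and all dimensions, are illustrative assumptions.

```python
# Hypothetical sketch of a continuous acoustic field (not the thesis's model).
import torch
import torch.nn as nn
import torch.nn.functional as F

class AcousticField(nn.Module):
    """MLP mapping an (emitter, listener) 3-D position pair to an RIR waveform."""
    def __init__(self, rir_len: int = 4096, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, rir_len),
        )

    def forward(self, emitter: torch.Tensor, listener: torch.Tensor) -> torch.Tensor:
        # Concatenate the two 3-D positions and regress the RIR samples.
        return self.net(torch.cat([emitter, listener], dim=-1))

def render(dry: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Render reverberant audio as the full convolution of a dry signal with the RIR."""
    # conv1d computes cross-correlation, so flip the kernel for true convolution.
    return F.conv1d(
        dry.view(1, 1, -1),
        rir.flip(-1).view(1, 1, -1),
        padding=rir.shape[-1] - 1,
    ).view(-1)

field = AcousticField()
rir = field(torch.rand(3), torch.rand(3))   # query unseen continuous positions
wet = render(torch.randn(16000), rir)       # spatialized, reverberant output
```

In practice, a field like this would be fit to RIRs measured or simulated at discrete emitter-listener pairs and then queried at unseen positions; the convolution step is what turns the predicted RIR into reverberant, spatialized audio.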

Files

Original bundle

Name: Su_washington_0250E_26618.pdf
Size: 20.4 MB
Format: Adobe Portable Document Format