Deep Learning Methods for Real-Time Speech & Audio
Abstract
Deep learning has revolutionized speech and audio processing, bringing a step change in the state of the art for tasks such as speech recognition, speech separation, and audio event detection. However, these developments have mainly focused on offline processing of speech and audio, with limited emphasis on applications that require very low-lookahead waveform-to-waveform transformations. My research tackles two classes of such problems at opposite ends of the spectrum of latency requirements and problem complexity. At the lower end of latency tolerance, we propose semantic hearing, a set of methods for semantically customizing one's perceived acoustic environment, which requires signal-level understanding with a latency as low as 20 milliseconds between an output audio chunk and the corresponding input chunk. At the other end, we develop spoken language models that enable full-duplex voice interaction, which requires human-level understanding of real-time spoken dialogue. Furthermore, we investigate modeling with raw acoustic representations of input speech, in contrast to the prevalent speech representations referred to as semantic units. We show that understanding acoustic representations improves the robustness of spoken dialogue models in noisy scenarios where interfering speech is present.
Description
Thesis (Ph.D.)--University of Washington, 2025
